Tuesday, 15 November 2016

Feedforward Artificial Neural Network pt1

Architecture of an ANN

The history of ANNs can be traced back to Warren McCulloch and Walter Pitts, who proposed a mechanism for how neurons might work by modelling a simple neural network with electrical circuits. This line of thinking was reinforced by Donald Hebb, who proposed that neural pathways between neurons are strengthened through repeated use. Understanding the motivation behind ANNs will help us understand why they are structured the way they are, particularly the role of the action potential of a neuron.

An ANN consists of three major components:
  • Input Layer
    • This layer takes the inputs as is - it does not modify the data at all; it simply pushes them into the hidden layers.
  • Hidden Layer(s)
    • In line with the aforementioned action potential of a neuron in the brain, the hidden layer neurons take an input, apply an activation function and push the result to the next layer in the network.
    • The activation function $\phi :  \mathbb{R} \rightarrow \mathbb{R}$ takes $\sum_{i=1}^{n} x_i w_i + b$ as an argument, where $n$ is the number of inputs flowing into the particular hidden layer neuron, $w_{i}$ is the weight associated with input $x_{i}$ and $b$ is called the bias. Our task is to find the $w_{i}$ and $b$ for each neuron in the hidden layer which minimize our error function (more on this later). A minimal code sketch of this computation is given below, after the architecture description.
    • The choice of activation function is entirely up to you, though there are some standard choices for certain tasks. See below for a (by no means exhaustive) list of activation functions.
       
    • The activation functions are the key to capturing non-linearities in our decision boundary.
  • Output Layer
    • The output of the neural network. In our case (classification) it will be a label classifying our input data into one of two classes.
These three components together with the topology of the network will define the entire structure of our ANN. Below is a stock image of a typical ANN.
On the left there are 3 inputs which flow into the hidden layers (more on these below), which in turn flow into the outputs. There are two hidden layers in this setup, each with 4 neurons - the number of layers and the neurons per layer are hyperparameters we can tune to increase performance and reduce overfitting.

Note that the flow of the network is strictly left to right; none of the neurons in the hidden layer are connected to one another and none of the neurons in the network loop back into themselves. This single direction of information flow and the fact that there are no loops or cycles in the network define a Feedforward Artificial Neural Network. Also note that the number of inputs and outputs is fixed once we define the network under this architecture. Other types of neural networks exist that are able to handle variable-length inputs, but our focus for now will be Feedforward ANNs.
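To make the hidden-layer computation concrete, here is a minimal forward-pass sketch in plain NumPy (nothing to do with any particular framework). The layer sizes mirror the stock image above (3 inputs, two hidden layers of 4 neurons), with a single output chosen for simplicity; the random weights, zero biases and the particular activation functions are illustrative assumptions only:

import numpy as np

# A few common activation functions phi: R -> R (applied elementwise).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):                              # another standard choice, not used below
    return np.maximum(0.0, z)

# Toy parameters: 3 inputs -> 4 hidden -> 4 hidden -> 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    # Each neuron computes phi(sum_i w_i * x_i + b) and passes the result on.
    h1 = np.tanh(W1 @ x + b1)             # first hidden layer
    h2 = np.tanh(W2 @ h1 + b2)            # second hidden layer
    return sigmoid(W3 @ h2 + b3)          # output layer: probability of one class

print(forward(np.array([0.5, -1.0, 2.0])))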

But why this structure?

The above structure of a neural network may seem like an arbitrary construction - how can we guarantee that it can be trained in such a way to produce our desired output?

Well... luckily for us there exists a theorem called the Universal Approximation Theorem, which states that under reasonable circumstances, a feedforward network containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^{n}$.

Statement

Let $\phi \left( \cdot \right)$ be a nonconstant, bounded and monotonically-increasing continuous function. Let $I_{m}$ denote the $m$-dimensional unit hypercube $\left[ 0, 1 \right]^{m}$, and let $C \left( I_{m} \right)$ be the space of continuous functions on $I_{m}$. Then $\forall f \in C \left( I_{m} \right)$ and $\epsilon > 0$ there exist an integer $N$, constants $v_{i}, b_{i} \in \mathbb{R}$ and vectors $w_i \in \mathbb{R}^{m}$, $i = 1, \dots, N$, such that the function

$$F(x) \equiv \sum_{i=1}^{N} v_{i} \phi \left( w_{i}^{T}x + b_{i} \right)$$
satisfies $$\left| F(x) - f(x) \right| < \epsilon \quad \forall x \in I_{m}$$

This is quite fascinating really: we can approximate essentially any continuous function to arbitrary accuracy using a linear combination of our activation functions. Note, however, that the theorem tells us nothing about how to actually construct the approximation - what the parameters $N, v_{i}, b_{i}$ or $w_{i}$ are - only that they exist.
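Purely as an illustration of this functional form (and not how we will actually train anything), the sketch below fixes random $w_i$ and $b_i$, then solves for the $v_i$ by least squares so that $F(x)=\sum_i v_i \phi(w_i x + b_i)$ approximates $f(x)=\sin(2\pi x)$ on $[0,1]$. The target function, the number of neurons and the random scales are all assumptions made just for this example:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N = 50                                  # number of hidden units (arbitrary)
w = rng.normal(scale=10.0, size=N)      # fixed random weights
b = rng.uniform(-10.0, 10.0, size=N)    # fixed random biases

x = np.linspace(0.0, 1.0, 200)
f = np.sin(2 * np.pi * x)               # the continuous function to approximate

Phi = sigmoid(np.outer(x, w) + b)       # Phi[j, i] = phi(w_i * x_j + b_i)
v, *_ = np.linalg.lstsq(Phi, f, rcond=None)   # choose the v_i by least squares
F = Phi @ v

print("max |F(x) - f(x)| =", np.max(np.abs(F - f)))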

Summary

So there it is, a whirlwind introduction to Feedforward Artificial Neural Networks. By no means is this supposed to be a comprehensive introduction, but rather to set the scene for the next few blog posts. In my next blog I'll define the error functions and consequently the Backpropagation algorithm, which we will use to train our Feedforward ANN to find the weights and biases of our hidden layer neurons.

Monday, 14 November 2016

Back.. again! Artificial neural networks have tickled my fancy

I'm really impressed with my foresight when naming this blog all those years ago; it truly does live on the edge of abandon, as I haven't touched it in 3 years! I've decided to fire it back up again and will (at least try to) actively post as I delve into the world of statistical / machine learning! I just want a place to document my thoughts/learnings/projects as I explore this fascinating area. I've decided to use Python as my language of choice as I've always wanted to learn it but never got around to it or had the chance. Now I have no excuses!

So, without further ado...

Artificial Neural Networks

I'm going to assume some familiarity with ANNs - what they are, their general structure/philosophy. See here for a comprehensive introduction.
As a first pass, I'll introduce a motivating example, go through the math and then implement (in Python) an ANN from scratch.

Definition of the problem

We'll look at a classification problem in 2 dimensions (to make visualisation easier, read: possible) where our old friend Logistic Regression struggles. It will highlight a key point that applies to any machine learning / statistical problem: knowing which tools to use, and when, to attack the problem.

The data:

We'll use a toy dataset from sklearn called make_moons, with the following call:

import pandas as pd
import sklearn.datasets
import seaborn as sb
import sklearn.linear_model

# Generate 200 points in two interleaving half-moons, with Gaussian noise added.
X, y = sklearn.datasets.make_moons(200, noise=0.2)

# Put the data into a DataFrame for plotting.
df = pd.DataFrame()
df['x1'] = X[:, 0]
df['x2'] = X[:, 1]
df['z'] = y

# Scatter plot of the two classes (no regression fit).
sb.lmplot(x='x1', y='x2', hue='z', data=df, fit_reg=False)



The task:

We want to calculate the decision boundary between the two (blue and green) classes. At a glance it's clearly non-linear, so let's see how a Logistic Regression classification goes!
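For reference, here is a minimal sketch of how the Logistic Regression fit and the decision-region plot in the Results section below could be produced. This is an assumed illustration reusing X, y and the imports from the code above, not necessarily the exact code behind the figure:

import numpy as np
import matplotlib.pyplot as plt

# Fit a plain logistic regression to the moons data generated above.
clf = sklearn.linear_model.LogisticRegression()
clf.fit(X, y)

# Evaluate the classifier on a grid covering the data and shade the predicted regions.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)          # predicted class regions
plt.scatter(X[:, 0], X[:, 1], c=y, s=20)    # the two moons
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()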

Results:

We can see that the Logistic Regression can only fit a linear decision boundary (see below) - even though the actual decision boundary is decidedly non-linear.


We can see why the Logistic Regression decision boundary is linear from the following:
$$ P_{boundary} \equiv \frac{1}{1+e^{- \theta \cdot x}} = 0.5 $$
$$ \implies 1 = e^{- \theta \cdot x}$$
$$ \implies \theta \cdot x = 0 $$ which defines a hyperplane (a line in 2 dimensions).
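Continuing the sketch above: with the intercept $b$ written separately, the boundary $\theta \cdot x + b = 0$ can be read straight off the fitted model and plotted as a line. Again, this is an assumed illustration reusing the clf fitted earlier:

import numpy as np
import matplotlib.pyplot as plt

# theta_1*x1 + theta_2*x2 + b = 0  =>  x2 = -(theta_1*x1 + b) / theta_2
theta = clf.coef_[0]
b = clf.intercept_[0]
x1_line = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2_line = -(theta[0] * x1_line + b) / theta[1]

plt.scatter(X[:, 0], X[:, 1], c=y, s=20)
plt.plot(x1_line, x2_line, 'k--')   # the linear decision boundary
plt.show()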

In part 2 I will define an ANN and work through exactly how it works and how it is trained. We will then construct and implement it in Python to get an idea of the decision boundaries it can produce.

Sunday, 25 August 2013

(Weak) Law of Large Numbers and the Central Limit Theorem

Following the definitions of convergence in probability and convergence in distribution in the previous post - we now state two fundamental theorems which incorporate these concepts.

(Weak) Law of Large Numbers:
Let $\left\{X_i \right\}$ be a sequence of iid (independent and identically distributed) random variables with $\mathbb{E}\left[X_i\right] = \mu \lt \infty$ and finite variance $\text{Var}\left(X_i \right) = \sigma^2 \lt \infty$. Now define $$S_n:=\sum_{i=1}^{n} X_i$$ then $$\frac{S_n}{n}\rightarrow \mu$$ in probability as $n\rightarrow \infty$.

Proof:
Since the random variables are all iid, $$\mathbb{E}\left[\frac{S_n}{n} \right] = \frac{1}{n} \mathbb{E}\left[S_n\right] = \frac{n \mu}{n} = \mu$$ Similarly $$\text{Var}\left( \frac{S_n}{n} \right) = \frac{1}{n^2} \left(\mathbb{E}[S_n^2]- \mathbb{E}[S_n]^2 \right)=\frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}$$ We now call upon Chebyshev's Inequality, which states that if $X$ is a random variable with finite expectation $\mu_X$ and non-zero variance $\sigma_X^2$ then $$P\left( \left|X-\mu_X \right| \ge k \sigma_X \right) \le \frac{1}{k^2}$$ It gives a bound on the distribution's tail probabilities in terms of the variance and is completely distribution agnostic.

We can now apply Chebyshev's Inequality with $X=\frac{S_n}{n}$ (which has mean $\mu$ and variance $\frac{\sigma^2}{n}$) and $k \sigma_X = \epsilon$:
$$P\left( \left|\frac{S_n}{n}-\mu \right| \ge \epsilon \right) \le \frac{\sigma^2}{n \epsilon^2}$$ Hence
$$ P\left( \left|\frac{S_n}{n}-\mu \right| \ge \epsilon \right) \rightarrow 0$$ as $n \rightarrow \infty$, which is precisely the definition of convergence in probability.
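As a quick numerical illustration of my own (the Exponential(1) distribution and the sample sizes are arbitrary choices), the running mean of iid samples settles near $\mu$ as $n$ grows - here $\mu = 1$:

import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100000)           # iid samples with mu = 1
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
print(running_mean[[9, 99, 999, 9999, 99999]])        # S_n / n at n = 10, 100, ..., 100000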

Central Limit Theorem (CLT):
We shall state a slightly restricted version of the CLT, which is fine for illustrative purposes. Taking our $S_n$ from above, the CLT states:
Let $\left\{X_1, X_2, \dots \right\}$ be a sequence of i.i.d. random variables with finite expectation $\mathbb{E}[X_i]  = \mu < \infty$ and finite non-zero variance $\text{Var}\left(X_i\right) = \sigma^2 < \infty$. Then

$$ P \left( \frac{S_n-n \mu}{\sigma \sqrt{n}} \leq x \right) \rightarrow \Phi(x)$$
as $n \rightarrow \infty$, where the convergence is in distribution and $\Phi(x)$ is the cumulative distribution function of a standard Normal variable. We shall not prove the CLT, as only an understanding of what it says and how it can be applied is required.

This explains why Normal distributions are so common in modelling (and even in nature): under these mild conditions, regardless of the underlying distribution, sums standardised in this way tend to the Normal distribution for large enough $n$.
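Again as an illustration only (the distribution, $n$ and the number of trials are arbitrary choices), we can check the CLT numerically by standardising sums of Uniform(0,1) variables - for which $\mu = \frac{1}{2}$ and $\sigma^2 = \frac{1}{12}$ - and comparing against $\Phi$:

import numpy as np

rng = np.random.default_rng(2)
n, trials = 1000, 10000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)

S = rng.uniform(size=(trials, n)).sum(axis=1)   # many independent copies of S_n
Z = (S - n * mu) / (sigma * np.sqrt(n))         # standardised as in the CLT

print((Z <= 1.0).mean())   # should be close to Phi(1) ~ 0.8413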

Thursday, 22 August 2013

Convergence of Random Variables




Before moving on to scaling a random walk into a Wiener process, we must understand the notion of convergence of random variables.

We won't cover the notions of uniform and pointwise convergence of a sequence of functions here - these are covered in any undergraduate analysis class. We will delve right into Convergence of random variables.

Convergence in probability:
Let $\left\{X_n\right\}$ be a sequence of random variables and $X$ be a random variable. $\left\{X_n\right\}$ is said to converge in probability to $X$ if $\forall \epsilon \gt 0$
$$P\left(\left| X_n-X \right| \gt \epsilon \right) \rightarrow 0$$ as $n \rightarrow \infty$.

Another notion of convergence of random variables is Convergence in distribution.

Convergence in distribution:
Given random variables $\{X_1, X_2, ...\}$ with cumulative distribution functions $F_n(x)$, and a random variable $X$ with cumulative distribution function $F(x)$, we say that $X_n$ converges in distribution to $X$ if
$$\lim_{n \rightarrow \infty} F_n(x) = F(x)$$ for every $x \in \mathbb{R}$ at which $F(x)$ is continuous.

Note that only the distributions of the variables matter - so the random variables themselves could in fact be defined on different probability spaces.

These may seem like fairly elementary definitions, but they hold great importance in applications of probability, namely through the (Weak) Law of Large numbers and the Central Limit Theorem - which will be covered in the next post.

Monday, 19 August 2013

I've returned with a random walk

So after two years absent from blogging, I am now back with a rejuvenated interest in mathematics, more specifically financial mathematics. From now on, this blog will serve as my workbook as I endeavour to learn more about portfolio valuation, solving stochastic differential equations and other such fun. Such posts will not be rigorous in the mathematical sense, but will serve as a heuristic survey of the different tools available and how they are used in a financial context. The posts may also seem scattered content-wise as I really haven't planned a syllabus - I'm just learning things as I come across them.

With that in mind, I will start with the very basics: for my first post in over two years, I give a quick overview of random walks on $\mathbb{Z}$.

Let $X_0 := 0$ be our starting position and for each $i \in \mathbb{N}$ let $$X_i:=X_{i-1}+dX_i$$ where $dX_i = \pm 1$ with equal probability - i.e. a Bernoulli-type random variable taking the values $\pm 1$ with $p=q=\frac{1}{2}$. Therefore our position at step $N$ is just the sum of the independent increments $dX_i$;
$$X_N= \sum_{i=1}^{N} dX_i$$
We can calculate the expectation: $$\mathbb{E}[X_N]= \mathbb{E}\left[ \sum_{i=1}^{N} dX_i\right]= \sum_{i=1}^{N} \mathbb{E} [ dX_i]=0$$ by linearity of expectation and the fact that $\mathbb{E}[dX_i]=0$. Now
$$Var(X_N)=\mathbb{E}[X_N^2]- \mathbb{E}[X_N]^2$$ and, since $dX_i$ and $dX_j$ are independent for $i \neq j$ (so the cross terms have zero expectation), we can calculate $$\mathbb{E}[X_N^2] = \sum_{i=1}^N \mathbb{E}[dX_i^2] = \sum_{i=1}^N 1 = N$$ That is, the variance of $X_N$ is the squared jump size times the total number of steps $N$.
Below I have simulated the aforementioned random walk for $N=100$ and $N=100000$ respectively.
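The original plots are not reproduced here, but a sketch of how such a simulation could be generated and plotted (the plotting details are my own choices) is:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
for N in (100, 100000):
    steps = rng.choice([-1, 1], size=N)              # the dX_i
    walk = np.concatenate(([0], np.cumsum(steps)))   # X_0 = 0, X_i = X_{i-1} + dX_i
    plt.figure()
    plt.plot(walk)
    plt.title(f"Simple random walk, N = {N}")
    plt.xlabel("step i")
    plt.ylabel("$X_i$")
plt.show()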

Next we'll look at a specific scaling of these random walks to obtain a Wiener Process and eventually talk about Stochastic Integrals and Ito's lemma.

Thursday, 24 March 2011

I've been missing.

Ok, so since my university work has ramped up, I've had little time for blogging, let alone myself; thus the absence of interesting updates.

Hopefully, when my assignments are done I shall devote my time to making some more thought provoking posts.

But for now I'll leave you to ponder Russell's Paradox.



Pre-reading: The following assumes that you are familiar with the idea of a set; if not, then put simply:
A set is a collection of objects.
Generally there will be a rule discerning whether an object is included in a set or not. For example:
The set of all shoe brands does not have Colgate as a member (since Colgate is a toothpaste brand).

The members of a set can also be sets themselves, for example:
The set of all forks is a member of the set of all cutlery.

Now let's make a proposal:
Let F be the set of all forks. F itself is not a fork, therefore F does not contain itself as a member.
We call F a normal set.


Now suppose we have a set NF - the set of all things that are not forks. NF itself is not a fork, therefore it contains itself as a member. We call NF an abnormal set.


Now suppose we have a new set G - the set of all normal sets. If G is normal then, since G contains every normal set, it must contain itself as a member - but the instant it contains itself, G becomes abnormal. On the other hand, if G is abnormal then it contains itself as a member, yet every member of G is a normal set - so G cannot be abnormal either. Either way we reach a contradiction.


This is known as Russell's Paradox.

Wednesday, 9 March 2011

Integral of a Gaussian (continued)

Ok, so last post we derived (not so rigorously) the integral of a Gaussian:
$$\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}$$

and using the substitution

$$x=\sqrt{a}y$$ and $$dx=\sqrt{a}dy$$
in this result (and relabelling the integration variable), we obtain $$\int_{-\infty}^{\infty} e^{-ax^2} dx = \sqrt{\frac{\pi}{a}}$$ for $a > 0$.

Now let's extend that to the more general form
$$I=\int_{-\infty}^{\infty} x^2 e^{-ax^2} dx $$ where $a > 0$ is a constant.

We can recognise (or at least be told to recognise) that the integrand $$x^2 e^{-ax^2}$$ is equal to
$$-\frac{\partial}{\partial a} e^{-ax^2}$$

Putting this together, we get:
$$I=\int_{-\infty}^{\infty}-\frac{\partial}{\partial a} e^{-ax^2} dx$$
We can swap the partial derivative and the integration sign (provided the limits of integration do not depend on $a$ and the integrand is sufficiently well behaved, as it is here).
We get: $$I= -\frac{\partial}{\partial a} \int_{-\infty}^{\infty} e^{-ax^2} dx = -\frac{\partial}{\partial a}\sqrt{\frac{\pi}{a}}$$
$$=\frac{1}{2a} \sqrt{\frac{\pi}{a}}$$
Therefore: $$\int_{-\infty}^{\infty} x^2 e^{-ax^2} dx =\frac{1}{2a} \sqrt{\frac{\pi}{a}}$$

This technique can be extended for different powers of x - provided they are even.
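For instance, differentiating under the integral sign twice (the same steps as above, applied again) handles the next even power:
$$\int_{-\infty}^{\infty} x^4 e^{-ax^2} dx = \frac{\partial^2}{\partial a^2} \int_{-\infty}^{\infty} e^{-ax^2} dx = \frac{\partial^2}{\partial a^2} \sqrt{\frac{\pi}{a}} = \frac{3}{4a^2} \sqrt{\frac{\pi}{a}}$$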