In the perceptron we have input vector \(\mathbf{x}\), and output:
\(a=f(\mathbf{w}\cdot\mathbf{x})=f(\sum_{i=1}^{n} w_i x_i)\)
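A minimal NumPy sketch of this computation (the sign activation and the offset folded into \(\mathbf{w}\) and \(\mathbf{x}\) as \(x_0=1\) are illustrative assumptions, not fixed by the notes):

```python
import numpy as np

def perceptron(x, w, f=np.sign):
    """Weighted sum of the inputs passed through the activation function f."""
    return f(np.dot(w, x))

# 3 features plus an offset folded in as x_0 = 1
x = np.array([1.0, 0.5, -1.2, 2.0])
w = np.array([0.1, 0.4, -0.3, 0.2])
print(perceptron(x, w))   # f(0.1 + 0.2 + 0.36 + 0.4) = f(1.06) = 1.0
```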
We can augment the perceptron by adding a hidden layer.
Now the output of the activation function is an input to a second layer. By using a different weight vector for each hidden unit, we create a new vector of values that forms the input to the second layer.
\(\Theta^{j}\) is a matrix of weights for mapping layer \(j\) to \(j+1\).
In a \(2\)-layer perceptron we have \(\Theta^0\) and \(\Theta^1\).
If we have \(s\) units in the hidden layer, \(n\) features and \(k\) classes:
The dimension of \(\Theta^0\) is \((n+1) \times s\)
The dimension of \(\Theta^1\) is \((s+1) \times k\)
These dimensions include a row for the offset term of each layer.
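As a quick check of these shapes, a small sketch (the values of \(n\), \(s\), \(k\) and the random initialisation are assumptions for illustration):

```python
import numpy as np

n, s, k = 4, 5, 3   # features, hidden units, classes (illustrative values)

rng = np.random.default_rng(0)
Theta0 = rng.normal(size=(n + 1, s))   # maps the (n+1)-vector [1, x] to the s hidden units
Theta1 = rng.normal(size=(s + 1, k))   # maps the (s+1)-vector [1, a^1] to the k outputs

print(Theta0.shape, Theta1.shape)      # (5, 5) (6, 3)
```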
For a perceptron we had:
\(a=f(\mathbf{w}\cdot\mathbf{x})=f(\sum_{i=1}^{n} w_i x_i)\).
Now we have:
\(a_i^{1}=f(\mathbf{x}\,\Theta^{0})_i=f(\sum_{m=0}^{n} x_m\,\Theta_{mi}^{0})\)
\(a_i^{2}=f(\mathbf{a}^{1}\Theta^{1})_i=f(\sum_{m=0}^{s} a_m^{1}\,\Theta_{mi}^{1})\)
Here \(x_0=1\) and \(a_0^{1}=1\) are the offset inputs.
For additional layers this is:
\(a_i^{j}=f(\mathbf{a}^{j-1}\Theta^{j-1})_i=f(\sum_{m=0}^{s} a_m^{j-1}\,\Theta_{mi}^{j-1})\), where \(s\) is the number of units in layer \(j-1\).
We refer to the value of a node as \(a_i^{j}\), the activation of unit \(i\) in layer \(j\).
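A hedged sketch of this layer-by-layer computation in NumPy (the sigmoid activation and random weights are assumptions; the offset is handled by prepending a constant 1 to each activation vector, matching the \((n+1)\) and \((s+1)\) row counts above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas, f=sigmoid):
    """Propagate a single input vector x through the layers defined by Thetas.

    Each Theta has one more row than its input layer has units; the extra
    row is the offset, handled by prepending a 1 to the activation vector.
    """
    a = x
    for Theta in Thetas:
        a = f(np.concatenate(([1.0], a)) @ Theta)   # a^j = f(a^{j-1} Theta^{j-1})
    return a

rng = np.random.default_rng(0)
n, s, k = 4, 5, 3
Theta0 = rng.normal(size=(n + 1, s))
Theta1 = rng.normal(size=(s + 1, k))
x = rng.normal(size=n)
print(forward(x, [Theta0, Theta1]))   # k activations, each in (0, 1)
```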
We cannot have a sigmoid function at the last step if the output needs to be unbounded.
Alternatively, we can apply a sigmoid function to the unbounded output to make it bounded.
With additional hidden layers we can map more complex functions.
These effectively allow combinations of logic gates.
With only two hidden layers we can map any function for classification, including discontinuous functions.
Topology of layers: increasing the number of units in subsequent layers is like increasing the dimension of the data.
We are trying to make the data linearly separable, and it may be that we need additional dimensions to do this, rather than a series of transformations within the existing number of dimensions.
E.g. for a circle of data within a circle of data, there is no separating line, so no amount of depth without increasing the dimension will split the data; see the sketch below.
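A small NumPy illustration of that circle-within-a-circle case (the radii and the hand-picked extra feature \(x_1^2+x_2^2\) are illustrative assumptions, standing in for what a hidden unit might learn): no line separates the two classes in the original two dimensions, but one extra dimension makes a single plane suffice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner disc (class 0) surrounded by a ring (class 1): not linearly separable in 2D
angles = rng.uniform(0, 2 * np.pi, size=200)
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
X = np.column_stack([r * np.cos(angles), r * np.sin(angles)])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Adding a third dimension x1^2 + x2^2 makes the classes separable
# by the single plane z = 2.25
z = X[:, 0] ** 2 + X[:, 1] ** 2
print(np.all(z[y == 0] < 2.25), np.all(z[y == 1] > 2.25))   # True True
```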