In the perceptron we have input vector \(\mathbf{x}\), and output:
\(a=f(\mathbf{w}\cdot\mathbf{x})=f(\sum_{i=1}^{n} w_i x_i)\)
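A minimal NumPy sketch of this computation (the sign activation and the offset folded into \(\mathbf{w}\) and \(\mathbf{x}\) as \(x_0=1\) are illustrative assumptions, not fixed by the notes):

```python
import numpy as np

def perceptron(x, w, f=np.sign):
    """Weighted sum of the inputs passed through the activation function f."""
    return f(np.dot(w, x))

# 3 features plus an offset folded in as x_0 = 1
x = np.array([1.0, 0.5, -1.2, 2.0])
w = np.array([0.1, 0.4, -0.3, 0.2])
print(perceptron(x, w))   # f(0.1 + 0.2 + 0.36 + 0.4) = f(1.06) = 1.0
```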
We can augment the perceptron by adding a hidden layer.
Now the output of the activation function is an input to a second layer. By using a different weight vector for each hidden unit, we create a new vector of values that forms the input to the second layer.
\(\Theta^{j}\) is a matrix of weights for mapping layer \(j\) to \(j+1\).
In a \(2\)-layer perceptron we have \(\Theta^0\) and \(\Theta^1\).
If we have \(s\) units in the hidden layer, \(n\) features and \(k\) classes:
The dimension of \(\Theta^0\) is \((n+1) \times s\)
The dimension of \(\Theta^1\) is \((s+1) \times k\)
These dimensions include a row for the offset term of each layer.
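As a quick check of these shapes, a small sketch (the values of \(n\), \(s\), \(k\) and the random initialisation are assumptions for illustration):

```python
import numpy as np

n, s, k = 4, 5, 3   # features, hidden units, classes (illustrative values)

rng = np.random.default_rng(0)
Theta0 = rng.normal(size=(n + 1, s))   # maps the (n+1)-vector [1, x] to the s hidden units
Theta1 = rng.normal(size=(s + 1, k))   # maps the (s+1)-vector [1, a^1] to the k outputs

print(Theta0.shape, Theta1.shape)      # (5, 5) (6, 3)
```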
For a perceptron we had:
\(a=f(\mathbf{w}\cdot\mathbf{x})=f(\sum_{i=1}^{n} w_i x_i)\).
Now we have:
\(a_i^{1}=f(\mathbf{x}\,\Theta^{0})_i=f(\sum_{m=0}^{n} x_m\,\Theta_{mi}^{0})\)
\(a_i^{2}=f(\mathbf{a}^{1}\Theta^{1})_i=f(\sum_{m=0}^{s} a_m^{1}\,\Theta_{mi}^{1})\)
Here \(x_0=1\) and \(a_0^{1}=1\) are the offset inputs.
For additional layers this is:
\(a_i^{j}=f(\mathbf{a}^{j-1}\Theta^{j-1})_i=f(\sum_{m=0}^{s} a_m^{j-1}\,\Theta_{mi}^{j-1})\), where \(s\) is the number of units in layer \(j-1\).
We refer to the value of a node as \(a_i^{j}\), the activation of unit \(i\) in layer \(j\).
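A hedged sketch of this layer-by-layer computation in NumPy (the sigmoid activation and random weights are assumptions; the offset is handled by prepending a constant 1 to each activation vector, matching the \((n+1)\) and \((s+1)\) row counts above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas, f=sigmoid):
    """Propagate a single input vector x through the layers defined by Thetas.

    Each Theta has one more row than its input layer has units; the extra
    row is the offset, handled by prepending a 1 to the activation vector.
    """
    a = x
    for Theta in Thetas:
        a = f(np.concatenate(([1.0], a)) @ Theta)   # a^j = f(a^{j-1} Theta^{j-1})
    return a

rng = np.random.default_rng(0)
n, s, k = 4, 5, 3
Theta0 = rng.normal(size=(n + 1, s))
Theta1 = rng.normal(size=(s + 1, k))
x = rng.normal(size=n)
print(forward(x, [Theta0, Theta1]))   # k activations, each in (0, 1)
```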
We cannot have a sigmoid function at the last step if the output needs to be unbounded.
Alternatively, we can apply a sigmoid function to the unbounded output to make it bounded.
With additional hidden layers we can map more complex functions.
These effectively allow combinations of logic gates.
With only two hidden layers we can map any function for classification, including discontinuous functions.
Topology of layers: increasing the number of units in subsequent layers is like increasing the dimension of the data.
We are trying to make the data linearly separable, and it may be that we need additional dimensions to do this, rather than a series of transformations within the existing number of dimensions.
E.g. for a circle of data within a circle of data, there is no separating line, so no amount of depth without increasing the dimension will split the data; see the sketch below.
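A small NumPy illustration of that circle-within-a-circle case (the radii and the hand-picked extra feature \(x_1^2+x_2^2\) are illustrative assumptions, standing in for what a hidden unit might learn): no line separates the two classes in the original two dimensions, but one extra dimension makes a single plane suffice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner disc (class 0) surrounded by a ring (class 1): not linearly separable in 2D
angles = rng.uniform(0, 2 * np.pi, size=200)
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
X = np.column_stack([r * np.cos(angles), r * np.sin(angles)])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Adding a third dimension x1^2 + x2^2 makes the classes separable
# by the single plane z = 2.25
z = X[:, 0] ** 2 + X[:, 1] ** 2
print(np.all(z[y == 0] < 2.25), np.all(z[y == 1] > 2.25))   # True True
```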