We want to train the parameters \(\boldsymbol{\theta }\).
We can do this with gradient descent, by working out how much the loss function falls as we change each parameter.
The delta rule tells us how to do this.
If we have \(n\) features and \(m\) samples, the error of the network is:
\(E=\sum_j^m\dfrac{1}{2}(y_j-a_j)^2\)
We know that \(a_j=f(\boldsymbol{\theta x_j})=f(\sum_i^n\theta^i x_j^i)\) and so:
\(E=\sum_j\dfrac{1}{2}(y_j-f(\boldsymbol{\theta x_j}))^2\)
\(E=\sum_j\dfrac{1}{2}(y_j-f(\sum_i\theta^i x_j^i))^2\)
We can see the change in error as we change the parameter:
\(\dfrac{\delta E}{\delta \theta^i }=\dfrac{\delta }{\delta \theta^i}\sum_j\dfrac{1}{2}(y_j-f(\boldsymbol{\theta x_j}))^2\)
\(\dfrac{\delta E}{\delta \theta^i }=\dfrac{\delta }{\delta \theta^i}\sum_j\dfrac{1}{2}(y_j-f(\sum_i\theta^i x_j^i))^2\)
\(\dfrac{\delta E}{\delta \theta^i }=\sum_j(y_j-f(\boldsymbol{\theta x_j}))\dfrac{\delta }{\delta \theta^i}(y_j-f(\boldsymbol{\theta x_j}))\)
\(\dfrac{\delta E}{\delta \theta^i }=\sum_j(y_j-f(\sum_i\theta^i x_j^i))\dfrac{\delta }{\delta \theta^i}(y_j-f(\sum_i\theta^i x_j^i))\)
\(\dfrac{\delta E}{\delta \theta^i }=\sum_j(y_j-f(\boldsymbol{\theta x_j}))\dfrac{\delta }{\delta \theta^i}(-f(\boldsymbol{\theta x_j}))\)
\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-f(\sum_i\theta^i x_j^i))\dfrac{\delta }{\delta \theta^i}f(\sum_i\theta^i x_j^i)\)
\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-f(\sum_i\theta^i x_j^i))\dfrac{\delta f(\sum_i\theta^ix_j^i)}{\delta \theta^i}\)
\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-f(\sum_i\theta^i x_j^i))\dfrac{\delta f(\sum_i\theta^ix_j^i)}{\delta \sum_i\theta^ix_j^i}\dfrac{\delta \sum_i\theta^ix_j^i}{\delta \theta^i}\)
\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-f(\sum_i\theta^i x_j^i))f'(\sum_i\theta^ix_j^i)x_j^i\)
By defining \(z_j=\sum_i\theta^ix_j^i\) and \(a=f\) we have:
\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-a(z_j))a'(z_j)x_j^i\)
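As a quick check on this gradient formula, here is a minimal NumPy sketch (mine, not from the text) that compares it against central finite differences, taking \(f\) to be a sigmoid and using made-up data; all variable names and values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: m samples and n features (purely illustrative).
m, n = 5, 3
X = rng.normal(size=(m, n))                   # row j is x_j
y = rng.integers(0, 2, size=m).astype(float)
theta = rng.normal(size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(theta):
    # E = sum_j 1/2 (y_j - f(sum_i theta^i x_j^i))^2
    a = sigmoid(X @ theta)
    return 0.5 * np.sum((y - a) ** 2)

# Analytic gradient: dE/dtheta^i = -sum_j (y_j - a(z_j)) a'(z_j) x_j^i
z = X @ theta
a = sigmoid(z)
grad_analytic = -X.T @ ((y - a) * a * (1 - a))

# Central finite differences should match the analytic gradient.
eps = 1e-6
grad_numeric = np.array([
    (error(theta + eps * np.eye(n)[i]) - error(theta - eps * np.eye(n)[i])) / (2 * eps)
    for i in range(n)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True
```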
Alternatively, we can reach the same result by applying the chain rule directly:
\(\dfrac{\delta E}{\delta \theta^i }=\dfrac{\delta }{\delta \theta^i}\sum_j\dfrac{1}{2}(y_j-f(\boldsymbol{\theta x_j}))^2\)
With \(z_j=\sum_i\theta^ix_j^i\) and \(a=f\) as before:
\(\dfrac{\delta E}{\delta \theta^i }=\dfrac{\delta }{\delta \theta^i}\sum_j\dfrac{1}{2}(y_j-a(z_j))^2\)
\(\dfrac{\delta E}{\delta \theta^i }=\sum_j\dfrac{\delta E}{\delta a_j}\dfrac{\delta a_j}{\delta z_j}\dfrac{\delta z_j}{\delta \theta^i}\)
Since \(\dfrac{\delta E}{\delta a_j}=-(y_j-a_j)\), \(\dfrac{\delta a_j}{\delta z_j}=a'(z_j)\) and \(\dfrac{\delta z_j}{\delta \theta^i}=x_j^i\):
\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j(y_j-a(z_j))a'(z_j)x_j^i\)
We define a delta for each sample as:
\(\delta_j=-\dfrac{\delta E}{\delta z_j}=(y_j-a_j)a'(z_j)\)
So:
\(\dfrac{\delta E}{\delta \theta^i }=-\sum_j\delta_j x_j^i\)
We update the parameters using gradient descent with learning rate \(\alpha\):
\(\Delta \theta^i=\alpha\sum_j\delta_j x_j^i\)
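As a concrete illustration of the update rule, here is a hedged NumPy sketch that applies \(\Delta\theta^i=\alpha\sum_j\delta_j x_j^i\) repeatedly with a sigmoid activation on synthetic data; the data, learning rate, and iteration count are arbitrary choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic binary targets generated from an arbitrary "true" parameter vector.
X = rng.normal(size=(100, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(theta):
    a = sigmoid(X @ theta)
    return 0.5 * np.sum((y - a) ** 2)

theta = np.zeros(2)
alpha = 0.01                               # learning rate (arbitrary)
print(error(theta))                        # error before training

for _ in range(5000):
    a = sigmoid(X @ theta)
    delta = (y - a) * a * (1 - a)          # delta_j = (y_j - a_j) a'(z_j)
    theta += alpha * X.T @ delta           # Delta theta^i = alpha * sum_j delta_j x_j^i
print(error(theta))                        # error after training (much lower)
```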
If the activation function is the identity:
\(a(z)=z\)
\(a'(z)=1\)
This is the same as ordinary linear regression.
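To illustrate this claim, here is a small sketch (my own, not the author's) that runs the delta rule with the identity activation on made-up data and compares the result to the ordinary least-squares solution; the learning rate and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.5]) + 0.1 * rng.normal(size=200)  # made-up linear data

# Delta rule with identity activation: a(z) = z, a'(z) = 1.
theta = np.zeros(2)
alpha = 0.001                             # small enough for the summed update to converge
for _ in range(2000):
    a = X @ theta
    theta += alpha * X.T @ (y - a)        # delta_j = (y_j - a_j) * 1

# Ordinary least squares for comparison.
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta, theta_ols)                   # the two estimates should agree closely
```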
For linear regression our data generating process is:
\(y=\alpha + \beta x +\epsilon\)
For linear classification our data generating process is:
\(z=\alpha + \beta x +\epsilon\)
We set \(y=1\) if \(z>0\), and \(y=0\) otherwise.
Or:
\(y=\mathbf I[\alpha+\beta x+\epsilon >0]\)
The probability that an individual with characteristics \(x\) is classified as \(1\) is:
\(P_1=P(y=1|x)\)
\(P_1=P(\alpha + \beta x+\epsilon >0)\)
\(P_1=\int \mathbf I [\alpha + \beta x+\epsilon >0]f(\epsilon )d\epsilon\)
\(P_1=\int \mathbf I [\epsilon >-\alpha-\beta x ]f(\epsilon )d\epsilon\)
\(P_1=\int_{\epsilon=-\alpha-\beta x}^\infty f(\epsilon )d\epsilon\)
\(P_1=1-F(-\alpha-\beta x)\)
Depending on the probability distribution of \(\epsilon\) we have different classifiers.
For the logistic case we then have:
\(P(y=1|x)=\dfrac{e^{\alpha + \beta x}}{1+e^{\alpha + \beta x}}\)
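As a sanity check on this result, the following sketch simulates the latent-variable model with logistic noise and compares the simulated probability to the closed form; the values of \(\alpha\), \(\beta\), and \(x\) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b, x = 0.5, 1.2, 0.8                 # arbitrary alpha, beta and x

# Simulate y = I[alpha + beta*x + eps > 0] with logistic-distributed eps.
eps = rng.logistic(size=1_000_000)
p_simulated = np.mean(a + b * x + eps > 0)

# Closed form: P(y=1|x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))
z = a + b * x
p_closed_form = np.exp(z) / (1 + np.exp(z))

print(p_simulated, p_closed_form)       # should agree to about three decimals
```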
Another possible activation is the binary step: if the sum is above \(0\), \(a(z)=1\); otherwise, \(a(z)=0\).
Its derivative is \(0\) at all points except \(0\), where it is undefined.
This function is not smooth.
This is the activation function used in the perceptron.
The perceptron only converges during training if the data is linearly separable.
Even when the data is linearly separable, it does not necessarily find the best separating boundary.
A perceptron is a one-node neural network: its output is one or zero depending on whether the weighted inputs are large enough, so it acts as a classifier.
If a prediction is wrong, the weights are updated.
Training only works if the data is linearly separable, i.e. a linear boundary can completely separate the classes; a sketch of this training rule follows below.
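Here is a minimal perceptron training sketch under these assumptions: a step activation, an update applied only when a prediction is wrong, and toy data that happens to be linearly separable through the origin; all names and values are made up.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data that is linearly separable through the origin.
X = rng.normal(size=(50, 2))
y = (X @ np.array([1.0, -2.0]) > 0).astype(int)   # labels 0/1

w = np.zeros(2)
for _ in range(1000):                    # passes over the data
    mistakes = 0
    for xj, yj in zip(X, y):
        pred = int(xj @ w > 0)           # step activation
        if pred != yj:                   # if the prediction is wrong...
            w += (yj - pred) * xj        # ...update the weights
            mistakes += 1
    if mistakes == 0:                    # a full pass with no errors: converged
        break

print(np.mean((X @ w > 0).astype(int) == y))  # training accuracy (1.0 once separated)
```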
A neural network has more layers.
A single layer only works if the data is linear.
Node inputs can be treated in several ways: as raw values, passed through a sigmoid, or thresholded to \(0\)/\(1\).
For all of these we want the cost function to have a single minimum, as least squares does; this is not guaranteed in general.
For the logistic activation we can make the loss convex by using \(-\log(f(x))\) when the correct \(y\) is \(1\) and \(-\log(1-f(x))\) when it is \(0\). This loss is convex.
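Below is a brief sketch of that loss, with a numerical spot-check of convexity along a random line in parameter space; the data, starting point, and direction are random and purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(theta, X, y):
    # -log f(x) when y = 1, -log(1 - f(x)) when y = 0
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Spot-check convexity: second differences along a random line are non-negative.
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50)
theta0, direction = rng.normal(size=3), rng.normal(size=3)
ts = np.linspace(-2, 2, 101)
vals = np.array([logistic_loss(theta0 + t * direction, X, y) for t in ts])
print(np.all(np.diff(vals, 2) >= -1e-8))  # True
```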
Two common ways to produce a node's output are a sigmoid or a binary cutoff. The sigmoid is:
\(\sigma (z)=\dfrac{1}{1+e^{-z}}\)
The range of this activation is between \(0\) and \(1\).
\(\sigma '(z)=\dfrac{e^{-z}}{(1+e^{-z})^2}\)
\(\sigma '(z)=\sigma (z)\dfrac{1+e^{-z}-1}{1+e^{-z}}\)
\(\sigma '(z)=\sigma (z)[1-\sigma (z)]\)
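A quick numerical check (not from the text) that this expression matches a finite-difference derivative of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite difference
analytic = sigmoid(z) * (1 - sigmoid(z))                     # sigma(z)(1 - sigma(z))
print(np.allclose(numeric, analytic))                        # True
```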
The cumulative distribution function of the normal distribution can also be used as the activation:
\(\Phi (z)\)
Its derivative is the density of the normal distribution:
\(\Phi'(z)=\phi (z)\)
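The same kind of check works for the normal CDF; this sketch uses scipy.stats.norm for \(\Phi\) and \(\phi\), which is my tooling assumption rather than anything the text specifies.

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-3, 3, 13)
eps = 1e-6
numeric = (norm.cdf(z + eps) - norm.cdf(z - eps)) / (2 * eps)  # finite difference of Phi
analytic = norm.pdf(z)                                          # phi(z)
print(np.allclose(numeric, analytic))                           # True
```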
Radial basis functions can also be used as activations:
\(a(x)=\sum_i a_i f(||x-c_i||)\)
For example, with a Gaussian basis function:
\(a(x)=\sum_i a_i e^{-||x-c_i||^2}\)
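A minimal sketch of such a network's output, assuming Gaussian basis functions with unit width; the centres \(c_i\) and weights \(a_i\) here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(6)
centres = rng.normal(size=(5, 2))        # the c_i
weights = rng.normal(size=5)             # the a_i

def rbf_output(x):
    # a(x) = sum_i a_i * exp(-||x - c_i||^2)
    dist_sq = np.sum((centres - x) ** 2, axis=1)
    return np.sum(weights * np.exp(-dist_sq))

print(rbf_output(np.array([0.5, -0.5])))
```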
In a linear probability model the predicted probability is \(p=x\beta\), which can fall outside \([0,1]\).
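A short sketch showing this problem on random data; the coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 2))
beta = np.array([1.5, -2.0])             # arbitrary coefficients

p = X @ beta                             # "probabilities" from the linear model
print(np.sum(p < 0), np.sum(p > 1))      # many predictions fall outside [0, 1]
```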