Its derivative is
$$f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$$
The derivative is undefined at $x = 0$; in practice a conventional value (usually 0) is used at that point.
The ReLU activation function induces sparsity.
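A minimal NumPy sketch of the ReLU function and this derivative (the names `relu` and `relu_derivative`, and the choice of 0 at $x = 0$, are just illustrative conventions):

```python
import numpy as np

def relu(x):
    # ReLU: elementwise max(0, x)
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 where x > 0, 0 where x < 0; the value at exactly x == 0 is a
    # convention (0 here), since the true derivative is undefined there
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))             # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```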
The softplus function, $\ln(1 + e^{x})$, is a smooth approximation of the ReLU function. Its derivative is the sigmoid function:
$$\frac{d}{dx} \ln(1 + e^{x}) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}} = \sigma(x)$$
Unlike the ReLU function, softplus does not induce sparsity: its output is strictly positive, so activations are never exactly zero.
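A quick numerical check of these two claims, comparing a central finite difference of softplus against the sigmoid, and confirming that softplus outputs never reach zero (a sketch; the helper names are just for illustration):

```python
import numpy as np

def softplus(x):
    # log(1 + e^x); fine for the moderate inputs used here
    return np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric_grad = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)

print(np.max(np.abs(numeric_grad - sigmoid(x))))  # tiny (~1e-10): the derivative is the sigmoid
print(softplus(np.array([-10.0, 0.0])).min())     # > 0: outputs never become exactly zero
```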
The error function for a neural network is generally not convex, which makes training by gradient descent harder.
Gradients can become very small, so learning propagates back through the network very slowly (the vanishing gradient problem).
Gradients can also become very large, so that training fails to converge (the exploding gradient problem).
Because its derivative is exactly 1 for positive inputs, the ReLU activation helps address this unstable gradient problem.
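A toy, scalar-per-layer illustration of why this happens: a backpropagated gradient behaves roughly like a product of per-layer factors $w \cdot f'(z)$, so repeated small factors shrink it and repeated large factors blow it up, while ReLU's derivative of 1 on active units leaves the product untouched. The weights and pre-activations below are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A gradient backpropagated through L layers is (roughly) a product of
# per-layer factors w * f'(z).  Toy scalar version:
L = 30
z = rng.normal(size=L)                                 # made-up pre-activations, one per layer

sigmoid_factors = 0.5 * sigmoid(z) * (1 - sigmoid(z))  # |w| = 0.5, sigmoid' <= 0.25
big_factors     = 2.0 * np.ones(L)                     # |w| = 2.0, derivative = 1
relu_factors    = (z > 0).astype(float)                # |w| = 1.0, ReLU' is 0 or 1

print(np.prod(sigmoid_factors))  # vanishingly small: vanishing gradient
print(np.prod(big_factors))      # about 1e9: exploding gradient
print(np.prod(relu_factors))     # 0 or 1: active units pass the gradient through unchanged
```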
As the parameter space gets bigger, the amount of training data required gets bigger too.
Sparse models can help here, e.g. networks using ReLU activations. This is where the values in the nodes are often exactly zero.
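A small sketch of that sparsity using a made-up random hidden layer: with zero-mean inputs and weights, roughly half of the ReLU activations come out exactly zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up layer: 1000 random inputs with 64 features, 128 hidden units
X = rng.normal(size=(1000, 64))
W = rng.normal(size=(64, 128))
b = rng.normal(size=128)

h = np.maximum(0.0, X @ W + b)   # ReLU activations

# Fraction of activations that are exactly zero -- the sparsity ReLU induces
print(np.mean(h == 0.0))         # roughly 0.5
```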