Addressing gradient problems with the Rectified Linear Unit (ReLU)

Link/activation functions: Regression

Absolute value rectification

a(x)=|x|

Rectified Linear Unit (ReLU)

The function

a(z)=max(0,z)

The derivative

Its derivative is 1 for values of z above 0, and 0 for values of z below 0.

The derivative is undefined at z=0, but an input of exactly 0 is unlikely to occur in practice.
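A minimal Python sketch of the function and its derivative may make the cases concrete. The names `relu`, `relu_derivative` and the `at_zero` parameter are illustrative assumptions, not part of the notes; the value returned at z=0 is only a convention, since the derivative is undefined there.

```python
def relu(z: float) -> float:
    """ReLU activation: a(z) = max(0, z)."""
    return max(0.0, z)


def relu_derivative(z: float, at_zero: float = 0.0) -> float:
    """Derivative of ReLU: 1 for z > 0, 0 for z < 0.

    The derivative is undefined at z = 0; `at_zero` is an assumed
    convention used only for that single point.
    """
    if z > 0:
        return 1.0
    if z < 0:
        return 0.0
    return at_zero


# Evaluate the function and its derivative on a few sample inputs.
for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(z, relu(z), relu_derivative(z))
```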

Notes

The ReLU activation function induces sparsity: every unit with a negative input outputs exactly 0.

Noisy ReLU

Leaky ReLU

Parametric ReLU

Softplus

The function

a(z)=\ln(1+e^{z})

The derivative

Its derivative is the sigmoid function:

a'(z)=\frac{1}{1+e^{-z}}

Notes

The softplus function is a smooth approximation of the ReLU function.

Unlike the ReLU function, softplus does not induce sparsity: its output is strictly positive for every input.
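The notes above can be checked numerically with a short sketch. It assumes standard-normal inputs and a fixed random seed purely for illustration, and the helper names (`softplus`, `sigmoid`, `numeric_slope`) are hypothetical. Softplus is written in a rearranged but equivalent form to avoid overflow for large z.

```python
import math
import random


def relu(z):
    """ReLU: max(0, z)."""
    return max(0.0, z)


def softplus(z):
    """Softplus: ln(1 + e^z), rearranged so exp() never overflows."""
    return max(0.0, z) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    """Sigmoid: 1 / (1 + e^(-z)), the derivative of softplus."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def numeric_slope(f, z, h=1e-5):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


# The slope of softplus should match the sigmoid at each point.
for z in (-3.0, 0.0, 3.0):
    print(z, numeric_slope(softplus, z), sigmoid(z))

# Sparsity: count exact zeros in the outputs for random inputs (assumed N(0, 1)).
random.seed(0)
zs = [random.gauss(0.0, 1.0) for _ in range(10000)]
relu_zero_fraction = sum(relu(z) == 0.0 for z in zs) / len(zs)
softplus_zero_fraction = sum(softplus(z) == 0.0 for z in zs) / len(zs)
print(relu_zero_fraction, softplus_zero_fraction)
```

Under these assumptions roughly half of the ReLU outputs are exactly 0, while none of the softplus outputs are.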

Exponential Linear Unit (ELU)

Convexity

The error function for neural networks is in general not convex, so gradient-based training is not guaranteed to reach the global minimum.

Unstable gradient problem

Vanishing gradient problem

Gradients can become very small as they are propagated back through many layers, so learning in the early layers can be very slow.
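As a worked illustration, consider a toy chain of n single-unit layers with sigmoid activations. This setup and its symbols (E for the error, w_k and b_k for the weights and biases, \sigma for the sigmoid) are assumptions made for the example, not something the notes specify. The chain rule gives a product with one factor per layer:

```latex
z_{k+1} = w_{k+1} a_k + b_{k+1}, \qquad a_k = \sigma(z_k)

\frac{\partial E}{\partial z_1}
  = \frac{\partial E}{\partial a_n}\,\sigma'(z_n)
    \prod_{k=1}^{n-1} w_{k+1}\,\sigma'(z_k)

\sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr) \le \tfrac{1}{4}
\quad\Longrightarrow\quad
\left|\frac{\partial E}{\partial z_1}\right|
  \le \left|\frac{\partial E}{\partial a_n}\right| \left(\tfrac{1}{4}\right)^{n}
  \quad \text{if } |w_k| \le 1 \text{ for all } k
```

With n=10 the bound is already below 10^{-6}, so the first layer receives almost no learning signal.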

Exploding gradient problem

Gradients can become too large, so that weight updates overshoot and training does not converge.
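A rough sketch of how this growth arises: if each layer multiplies the backpropagated gradient by a factor larger than 1, the product grows exponentially with depth. The function `backprop_factor` and the per-layer values used here are hypothetical, chosen only to illustrate the effect.

```python
def backprop_factor(weight, activation_slope, n_layers):
    """Product of the per-layer factor (weight * activation slope) over n_layers."""
    factor = 1.0
    for _ in range(n_layers):
        factor *= weight * activation_slope
    return factor


# Assumed per-layer values: weight 1.5 and activation slope 1.0.
for n in (1, 10, 50):
    print(n, backprop_factor(1.5, 1.0, n))
```

The same product shrinks towards 0 when the per-layer factor is below 1, which is why the two problems are grouped together as unstable gradients.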

ReLU

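This helps address the unstable gradient problem, in particular vanishing gradients: the derivative of ReLU is exactly 1 for every positive input, so the activation does not shrink gradients as they propagate back through active units.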
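The effect can be sketched by comparing a sigmoid chain with a ReLU chain. The setup is an assumption for illustration only: 20 single-unit layers, all weights equal to 1, no biases, and a hypothetical helper `input_gradient`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu(z):
    return max(0.0, z)


def relu_derivative(z):
    # Convention assumed here: derivative 0 at z = 0.
    return 1.0 if z > 0 else 0.0


def input_gradient(x, n_layers, activation, derivative, weight=1.0):
    """Gradient of the final activation with respect to the input x, for a
    chain of n_layers single-unit layers a_k = activation(weight * a_{k-1})."""
    a = x
    grad = 1.0
    for _ in range(n_layers):
        z = weight * a
        grad *= derivative(z) * weight  # chain rule: d a_k / d a_{k-1}
        a = activation(z)
    return grad


sigmoid_grad = input_gradient(1.0, 20, sigmoid, sigmoid_derivative)
relu_grad = input_gradient(1.0, 20, relu, relu_derivative)
print(sigmoid_grad, relu_grad)
```

In this toy chain the sigmoid gradient is vanishingly small while the ReLU gradient stays at 1. A unit whose input is negative passes back a gradient of 0 instead, and large weights can still make the product grow, so ReLU alone does not rule out exploding gradients.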

Curse of dimensionality

As parameter space gets bigger, data requirements get bigger.

Sparse models can help, e.g. networks that use ReLU activations.

Representational sparsity

This is where the values in nodes (the activations) are often 0, as opposed to just the parameters (the weights) being 0.
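The distinction can be illustrated with a single hidden layer whose weights are all non-zero but whose ReLU activations are often zero. The layer sizes, Gaussian weights and inputs, and the random seed below are assumptions chosen only for the sketch.

```python
import random

random.seed(0)
n_inputs, n_hidden = 8, 1000

# Dense parameters: every weight is drawn from a Gaussian, so none is exactly 0.
weights = [[random.gauss(0.0, 1.0) for _ in range(n_inputs)]
           for _ in range(n_hidden)]
x = [random.gauss(0.0, 1.0) for _ in range(n_inputs)]

# Hidden activations with ReLU: a_j = max(0, sum_i w_ji * x_i)
activations = [max(0.0, sum(w * xi for w, xi in zip(row, x)))
               for row in weights]

zero_weights = sum(w == 0.0 for row in weights for w in row)
zero_activations = sum(a == 0.0 for a in activations)
print(zero_weights, zero_activations / n_hidden)
```

Here the parameters are dense but about half of the node values are exactly 0, which is representational sparsity rather than parameter sparsity.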