Activation
An activation function adds nonlinearity to the (usually linear) output of a layer so the network can model more complex functions.
Notations:
$z$: input (element-wise)
Binary-like
Sigmoid
$$ \sigma(z)=\frac{1}{1+e^{-z}} $$
Idea:
- squash each input into $(0,1)$.
Pros:
- imitates the firing rate of a neuron: close to 0 if the input is very negative, close to 1 if very positive.
- smooth gradient.
Cons:
- vanishing gradient: gradients rapidly shrink toward 0 during backpropagation whenever an input is very positive or very negative (the curve saturates).
- not zero-centered → non-zero mean activations act as a bias for the next layer.
- computationally expensive.
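A minimal NumPy sketch of the definition and its derivative (function names here are just illustrative); note how the gradient collapses for large $|z|$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # peaks at 0.25 when z = 0

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))                  # squashed into (0, 1)
print(sigmoid_grad(z))             # ~4.5e-5 at |z| = 10 -> vanishing gradient
```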
Tanh
$$ \tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}} $$
Idea:
- squash each input into $(-1,1)$.
Pros:
- zero-centered
- imitates the firing rate of a neuron: close to -1 if the input is very negative, close to 1 if very positive.
- smooth gradient.
Cons:
- vanishing gradient.
- computationally expensive.
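A small sketch using NumPy's built-in `np.tanh`; it also checks the identity $\tanh(z)=2\sigma(2z)-1$, i.e., tanh is a rescaled, zero-centered sigmoid:

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 11)
out = np.tanh(z)
print(out.min(), out.max())                                     # squashed into (-1, 1), zero-centered
print(np.allclose(out, 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0))   # tanh(z) = 2 * sigmoid(2z) - 1
```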
Linear Units (Rectified)
ReLU
$$ \mathrm{ReLU}(z)=\max{(0,z)} $$
Name: Rectified Linear Unit
Idea:
- convert negative linear outputs to 0.
Pros:
- no vanishing gradient for positive inputs.
- sparse activation: fewer neurons fire at once.
- much less computationally expensive compared to sigmoid and tanh.
Cons:
- dying ReLU: if most inputs to a neuron are negative, it outputs 0 most of the time → it dies. (NOTE: A SOLVABLE DISADVANTAGE)
    - Cause 1: high learning rate → input for the neuron becomes too negative.
    - Cause 2: bias too negative → input for the neuron becomes too negative.
- activation explosion as $z\to\infty$. (NOTE: NOT A SEVERE DISADVANTAGE SO FAR)
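A quick NumPy sketch; the gradient is exactly 0 for every negative input, which is the mechanism behind dying ReLU:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)   # exactly 0 on the negative side

z = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
print(relu(z))
print(relu_grad(z))                # a neuron whose inputs stay negative gets no gradient -> it "dies"
```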
LReLU
$$ \mathrm{LReLU}(z)=\max{(\alpha z,z)} $$
Name: Leaky Rectified Linear Unit
Params:
$\alpha$: hyperparam (negative slope), default 0.01.
Idea:
- scale negative linear outputs by $\alpha$.
Pros:
- no dying ReLU.
Cons:
- slightly more computationally expensive than ReLU.
- activation explosion as $z\to\infty$.
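A NumPy sketch with the default slope of 0.01; negative inputs keep a small but non-zero gradient, so neurons cannot die:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)

z = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
print(leaky_relu(z))               # negative side is scaled by alpha instead of clamped to 0
```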
PReLU
$$ \mathrm{PReLU}(z)=\max{(\alpha z,z)} $$
Name: Parametric Rectified Linear Unit
Params:
$\alpha$: learnable parameter (negative slope), default 0.25.
Idea:
- scale negative linear outputs by a learnable $\alpha$.
Pros:
- the negative slope is adaptive, learned from data instead of hand-tuned.
Cons:
- slightly more computationally expensive than LReLU.
- activation explosion as $z\to\infty$.
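Since $\alpha$ is learned, a framework is the natural place to show it; a minimal PyTorch sketch (assuming PyTorch is available) where the slope shows up as a trainable parameter:

```python
import torch
import torch.nn as nn

prelu = nn.PReLU(num_parameters=1, init=0.25)   # alpha initialized to 0.25
z = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
print(prelu(z))                                 # negative inputs scaled by the current alpha
print(list(prelu.parameters()))                 # alpha is a parameter, updated by the optimizer
```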
RReLU
$$ \mathrm{RReLU}(z)=\max{(\alpha z,z)} $$
Name: Randomized Rectified Linear Unit
Params:
$\alpha$: a random number sampled from the uniform distribution $U(l,u)$.
$l,u$: hyperparams (lower bound, upper bound).
Idea:
- scale negative linear outputs by a random $\alpha$.
Pros:
- reduce overfitting by randomization.
Cons:
- slightly more computationally expensive than LReLU.
- activation explosion as $z\to\infty$.
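A PyTorch sketch (assuming PyTorch is available): during training $\alpha$ is resampled from $U(l,u)$, while at evaluation time it is fixed to the mean $(l+u)/2$:

```python
import torch
import torch.nn as nn

rrelu = nn.RReLU(lower=1/8, upper=1/3)
z = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

rrelu.train()
print(rrelu(z))    # alpha sampled from U(lower, upper): negative outputs vary run to run
rrelu.eval()
print(rrelu(z))    # alpha fixed to (lower + upper) / 2: deterministic at inference
```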
Linear Units (Exponential)
ELU
$$ \mathrm{ELU}(z)=\begin{cases} z & \mathrm{if}\ z\geq0 \\ \alpha(e^z-1) & \mathrm{if}\ z<0 \end{cases} $$
Name: Exponential Linear Unit
Params:
$\alpha$: hyperparam, default 1.
Idea:
- convert negative linear outputs to the non-linear exponential function above.
Pros:
- mean unit activation is closer to 0 → reduces bias shift (i.e., non-zero mean activations are intrinsically a bias for the next layer).
- lower computational complexity compared to batch normalization.
- saturates smoothly to $-\alpha$ for very negative inputs, with small derivatives that decrease the forward-propagated variation.
- faster learning and higher accuracy for image classification in practice.
Cons:
- slightly more computationally expensive than ReLU.
- activation explosion as $z\to\infty$.
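A NumPy sketch; the negative branch saturates smoothly towards $-\alpha$ instead of being clipped to 0:

```python
import numpy as np

def elu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-50.0, -5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(z))      # very negative inputs approach -alpha, positive inputs pass through
```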
SELU
$$ \mathrm{SELU}(z)=\lambda\begin{cases} z & \mathrm{if}\ z\geq0 \\ \alpha(e^z-1) & \mathrm{if}\ z<0 \end{cases} $$
Name: Scaled Exponential Linear Unit
Params:
$\alpha$: hyperparam, default 1.67326.
$\lambda$: hyperparam (scale), default 1.05070.
Idea:
- scale ELU.
Pros:
- self-normalization → activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance.
Cons:
- more computationally expensive than ReLU.
- activation explosion as $z\to\infty$.
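A NumPy sketch with the published constants, plus a rough check of the self-normalizing property on standard-normal inputs (mean and variance should stay near 0 and 1):

```python
import numpy as np

ALPHA, LAMBDA = 1.67326, 1.05070          # constants from the SELU paper

def selu(z):
    return LAMBDA * np.where(z >= 0, z, ALPHA * (np.exp(z) - 1.0))

z = np.random.default_rng(0).standard_normal(1_000_000)
out = selu(z)
print(out.mean(), out.var())              # both land close to 0 and 1 respectively
```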
CELU
$$ \mathrm{CELU}(z)=\begin{cases} z & \mathrm{if}\ z\geq0 \\ \alpha(e^{\frac{z}{\alpha}}-1) & \mathrm{if}\ z<0 \end{cases} $$
Name: Continuously Differentiable Exponential Linear Unit
Params:
$\alpha$: hyperparam, default 1.
Idea:
- scale the input of ELU's exponential by $\frac{1}{\alpha}$ (i.e., use $e^{z/\alpha}$) so the function stays continuously differentiable for any $\alpha$.
Pros:
- smooth gradient due to continuous differentiability (i.e., the derivative is continuous at $z=0$ for any $\alpha$).
Cons:
- slightly more computationally expensive than ELU.
- activation explosion as $z\to\infty$.
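A NumPy sketch; note that the negative branch's slope at 0 is $e^{0/\alpha}=1$ for any $\alpha$, matching the positive branch, and that CELU with $\alpha=1$ reduces to ELU:

```python
import numpy as np

def celu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z / alpha) - 1.0))

def elu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1.0))

z = np.linspace(-5.0, 5.0, 101)
print(np.allclose(celu(z, alpha=1.0), elu(z, alpha=1.0)))   # CELU(alpha=1) == ELU(alpha=1)
```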
Linear Units (Others)
GELU
$$ \mathrm{GELU}(z)=z*\Phi(z)\approx0.5z(1+\tanh{[\sqrt{\frac{2}{\pi}}(z+0.044715z^3)]}) $$
Name: Gaussian Error Linear Unit
Idea:
- weight each value $z$ by its Gaussian CDF $\Phi(z)$.
Pros:
- replaces hard 0/1 gating with a probabilistic weighting of neuron outputs.
- seemingly better performance than the ReLU and ELU families; SOTA in Transformers.
Cons:
- slightly more computationally expensive than ReLU.
- lack of practical testing at the moment.
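A NumPy sketch comparing the exact form $z\Phi(z)$ (via the error function; SciPy is assumed here) against the tanh approximation from the formula above:

```python
import numpy as np
from scipy.special import erf            # assumption: SciPy is available for the exact Gaussian CDF

def gelu_exact(z):
    return z * 0.5 * (1.0 + erf(z / np.sqrt(2.0)))

def gelu_tanh(z):
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.linspace(-5.0, 5.0, 101)
print(np.max(np.abs(gelu_exact(z) - gelu_tanh(z))))   # small gap: the tanh form is only an approximation
```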
SiLU
$$ \mathrm{SiLU}(z)=z*\sigma(z) $$
Name: Sigmoid Linear Unit
Idea:
- weight each value $z$ by its sigmoid value $\sigma(z)$.
Pros:
- replaces hard gating with a sigmoid weighting of neuron outputs.
- seemingly better performance than the ReLU and ELU families.
Cons:
- generally slightly worse than GELU.
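A two-line NumPy sketch; SiLU (also known as swish) is simply the input weighted by its own sigmoid:

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))        # z * sigmoid(z)

print(silu(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])))
```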
Softplus
$$ \mathrm{softplus}(z)=\frac{1}{\beta}\log{(1+e^{\beta z})} $$
Params:
$\beta$: hyperparam (sharpness), default 1.
Idea:
- smooth approximation of ReLU (approaches ReLU as $\beta\to\infty$).
Pros:
- everywhere differentiable and thus theoretically nicer than ReLU.
Cons:
- empirically far worse than ReLU in terms of computation and performance.
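A NumPy sketch in a numerically stable form; larger $\beta$ sharpens the curve towards ReLU (the remaining gap at $z=0$ is $\ln 2/\beta$):

```python
import numpy as np

def softplus(z, beta=1.0):
    # stable rewrite of (1 / beta) * log(1 + exp(beta * z))
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(beta * z))) / beta

z = np.linspace(-5.0, 5.0, 11)
print(np.max(np.abs(softplus(z, beta=1.0) - np.maximum(0.0, z))))    # ~0.69: visibly smoother than ReLU
print(np.max(np.abs(softplus(z, beta=20.0) - np.maximum(0.0, z))))   # ~0.035: close to ReLU
```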
Multiclass
Softmax
$$ \mathrm{softmax}(z_i)=\frac{\exp{(z_i)}}{\sum_j{\exp{(z_j)}}} $$
Idea:
- convert each value $z_i$ into a probability in $(0,1)$ so that the outputs sum to 1.
Pros:
- your single best choice for multiclass classification.
Cons:
- mutually exclusive classes (i.e., one input can only be classified into one class.)
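A NumPy sketch; subtracting the max before exponentiating is the standard trick to avoid overflow and does not change the result:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability; the output is unchanged
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())                        # a probability distribution over classes; sums to 1
```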
Softmin
$$ \mathrm{softmin}(z_i)=\mathrm{softmax}(-z_i)=\frac{\exp{(-z_i)}}{\sum_j{\exp{(-z_j)}}} $$
Idea:
- softmax on negated inputs: smaller values get higher probability.
Pros:
- suitable for multiclass classification.
Cons:
- why not just use softmax?
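For completeness, a NumPy sketch showing softmin is literally softmax applied to the negated inputs:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def softmin(z):
    return softmax(-z)                   # smaller inputs now get the larger probabilities

print(softmin(np.array([2.0, 1.0, 0.1])))
```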