Activation

An activation function adds nonlinearity to the output of a layer (which is linear in most cases) so the network can model functions more complex than a linear map.

ReLU (hidden layers) and Softmax (multiclass output layers) are the de facto standard choices.

Notations:

  • $z$ : input (element-wise)

Binary-like

Sigmoid

$$ \sigma(z)=\frac{1}{1+e^{-z}} $$

Idea:

  • squash $z$ into $(0,1)$.

Pros:

  • imitation of the firing rate of a neuron, 0 if too negative and 1 if too positive.
  • smooth gradient.

Cons:

  • vanishing gradient: gradients rapidly shrink to 0 during backprop whenever the input is too positive or too negative.
  • non-zero-centered output $\Rightarrow$ non-zero mean activations.
  • computationally expensive.
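
A minimal numpy sketch of sigmoid and its gradient (helper names are illustrative); the gradient peaks at 0.25 and collapses once $|z|$ is large, which is the vanishing-gradient problem above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)); maximal (0.25) at z = 0,
    # near zero in both saturation regions
    s = sigmoid(z)
    return s * (1.0 - s)
```

For $|z|\approx 10$ the gradient is already on the order of $10^{-5}$, so a chain of saturated sigmoids kills backprop.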

Tanh

$$ \tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}} $$

Idea:

  • squash $z$ into $(-1,1)$.

Pros:

  • zero-centered.
  • imitation of the firing rate of a neuron, -1 if too negative and 1 if too positive.
  • smooth gradient.

Cons:

  • vanishing gradient.
  • computationally expensive.
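
Tanh is just a rescaled, zero-centered sigmoid, via the identity $\tanh(z)=2\sigma(2z)-1$ — which is why the two share the same pros and cons apart from zero-centering. A small numpy check (helper name is illustrative):

```python
import numpy as np

def tanh_via_sigmoid(z):
    # tanh(z) = 2*sigmoid(2z) - 1: the same S-shape as sigmoid,
    # rescaled from (0, 1) to (-1, 1)
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0
```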

Linear Units (Rectified)

ReLU

$$ \mathrm{ReLU}(z)=\max{(0,z)} $$

Name: Rectified Linear Unit

Idea:

  • convert negative linear outputs to 0.

Pros:

  • no vanishing gradient for positive inputs.
  • activates fewer neurons (sparse activation).
  • much less computationally expensive compared to sigmoid and tanh.

Cons:

  • dying ReLU: if most inputs are negative, then most neurons output 0 $\Rightarrow$ they die. (NOTE: A SOLVABLE DISADVANTAGE)

    • Cause 1: learning rate too high $\Rightarrow$ input for neuron too negative.
    • Cause 2: bias too negative $\Rightarrow$ input for neuron too negative.
  • activation explosion as $z\to\infty$. (NOTE: NOT A SEVERE DISADVANTAGE SO FAR)
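
A numpy sketch (names illustrative) that makes the dying-ReLU mechanism concrete: the gradient is exactly zero for negative pre-activations, so such a neuron receives no updates:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # gradient is exactly 0 for z <= 0: a neuron whose pre-activation
    # stays negative gets no weight updates and "dies"
    return (z > 0).astype(float)
```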

LReLU

$$ \mathrm{LReLU}(z)=\max{(\alpha z,z)} $$

Name: Leaky Rectified Linear Unit

Params:

  • $\alpha$ : hyperparam (negative slope), default 0.01.

Idea:

  • scale negative linear outputs by $\alpha$.

Pros:

  • no dying ReLU.

Cons:

  • slightly more computationally expensive than ReLU.
  • activation explosion as $z\to\infty$.
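
A numpy sketch (names illustrative); the leaky slope keeps a small nonzero gradient on the negative side, which is exactly what prevents dying ReLU:

```python
import numpy as np

def lrelu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)

def lrelu_grad(z, alpha=0.01):
    # gradient is alpha (not 0) for z < 0, so negative-regime neurons keep learning
    return np.where(z >= 0, 1.0, alpha)
```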

PReLU

$$ \mathrm{PReLU}(z)=\max{(\alpha z,z)} $$

Name: Parametric Rectified Linear Unit

Params:

  • $\alpha$ : learnable parameter (negative slope), default 0.25.

Idea:

  • scale negative linear outputs by a learnable $\alpha$.

Pros:

  • a variable, adaptive parameter learned from data.

Cons:

  • slightly more computationally expensive than LReLU.
  • activation explosion as $z\to\infty$.
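
A sketch of how $\alpha$ is trained like any other weight: $\partial\,\mathrm{PReLU}/\partial\alpha = z$ for $z<0$ and 0 otherwise. The single SGD step below is hypothetical (the upstream gradient values are made up for illustration):

```python
import numpy as np

def prelu(z, alpha):
    return np.where(z >= 0, z, alpha * z)

def prelu_grad_alpha(z):
    # dPReLU/dalpha = z for z < 0, else 0
    return np.where(z < 0, z, 0.0)

# one hypothetical SGD step on alpha
alpha, lr = 0.25, 0.1
upstream = np.array([1.0, 1.0])   # assumed gradient flowing in from the next layer
z = np.array([-2.0, 3.0])
alpha -= lr * np.sum(upstream * prelu_grad_alpha(z))
```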

RReLU

$$ \mathrm{RReLU}(z)=\max{(\alpha z,z)} $$

Name: Randomized Rectified Linear Unit

Params:

  • $\alpha\sim U(l,u)$ : a random number sampled from a uniform distribution.
  • $l,u$ : hyperparams (lower bound, upper bound).

Idea:

  • scale negative linear outputs by a random $\alpha$.

Pros:

  • reduce overfitting by randomization.

Cons:

  • slightly more computationally expensive than LReLU.
  • activation explosion as $z\to\infty$.
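
A numpy sketch assuming PyTorch-style defaults $l=1/8$, $u=1/3$, with a fresh slope sampled per element at train time and the fixed expected slope $(l+u)/2$ at inference:

```python
import numpy as np

def rrelu(z, lower=1/8, upper=1/3, training=True, rng=None):
    if training:
        rng = rng or np.random.default_rng()
        # a fresh random slope per negative element, per forward pass
        alpha = rng.uniform(lower, upper, size=np.shape(z))
    else:
        alpha = (lower + upper) / 2  # fixed expected slope at inference
    return np.where(z >= 0, z, alpha * z)
```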

Linear Units (Exponential)

ELU

$$ \mathrm{ELU}(z)=\begin{cases} z & \mathrm{if}\ z\geq0 \\ \alpha(e^z-1) & \mathrm{if}\ z<0 \end{cases} $$

Name: Exponential Linear Unit

Params:

  • $\alpha$ : hyperparam, default 1.

Idea:

  • convert negative linear outputs to the non-linear exponential function above.

Pros:

  • mean unit activation is closer to 0 $\Rightarrow$ reduced bias shift (i.e., non-zero mean activation is intrinsically a bias for the next layer.)
  • lower computational complexity compared to batch normalization.
  • saturates smoothly to $-\alpha$ with small derivatives that decrease forward-propagated variation.
  • faster learning and higher accuracy for image classification in practice.

Cons:

  • slightly more computationally expensive than ReLU.
  • activation explosion as $z\to\infty$.
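
A numpy sketch (name illustrative); the clamp before `exp` matters because `np.where` evaluates both branches, and `exp` of a large positive `z` would overflow:

```python
import numpy as np

def elu(z, alpha=1.0):
    # clamp the exp argument at 0: the positive branch never uses it,
    # and this avoids overflow warnings for large positive z
    return np.where(z >= 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))
```

Very negative inputs saturate near $-\alpha$ rather than growing without bound.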

SELU

$$ \mathrm{SELU}(z)=\lambda\begin{cases} z & \mathrm{if}\ z\geq0 \\ \alpha(e^z-1) & \mathrm{if}\ z<0 \end{cases} $$

Name: Scaled Exponential Linear Unit

Params:

  • $\alpha$ : hyperparam, default 1.67326.
  • $\lambda$ : hyperparam (scale), default 1.05070.

Idea:

  • scale ELU.

Pros:

  • self-normalization $\Rightarrow$ activations close to zero mean and unit variance, propagated through many network layers, converge towards zero mean and unit variance.

Cons:

  • more computationally expensive than ReLU.
  • activation explosion as $z\to\infty$.
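
A quick empirical check of the self-normalizing fixed point, using the published constants: standard-normal inputs come out of SELU with mean and variance still near $(0, 1)$:

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805  # published SELU constants

def selu(z):
    return LAMBDA * np.where(z >= 0, z, ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0))

# feed zero-mean, unit-variance inputs through SELU:
# the output statistics stay close to (0, 1), the fixed point
rng = np.random.default_rng(0)
out = selu(rng.standard_normal(1_000_000))
```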

CELU

$$ \mathrm{CELU}(z)=\begin{cases} z & \mathrm{if}\ z\geq0 \\ \alpha(e^{\frac{z}{\alpha}}-1) & \mathrm{if}\ z<0 \end{cases} $$

Name: Continuously Differentiable Exponential Linear Unit

Params:

  • $\alpha$ : hyperparam, default 1.

Idea:

  • scale the input of ELU's exponential part by $\frac{1}{\alpha}$ to make the function continuously differentiable.

Pros:

  • smooth gradient due to continuous differentiability (i.e., the derivative at $z=0$ equals 1 from both sides for any $\alpha$.)

Cons:

  • slightly more computationally expensive than ELU.
  • activation explosion as $z\to\infty$.
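
A numeric check of the continuous derivative at 0, using $\alpha=2$ (where plain ELU's left slope at 0 would be $\alpha=2$, not 1):

```python
import numpy as np

def celu(z, alpha=2.0):
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0, z, alpha * (np.exp(np.minimum(z, 0.0) / alpha) - 1.0))

# one-sided finite-difference slopes at 0: both approach 1 for any alpha
h = 1e-6
left = (celu(0.0) - celu(-h)) / h
right = (celu(h) - celu(0.0)) / h
```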

Linear Units (Others)

GELU

$$ \mathrm{GELU}(z)=z\,\Phi(z)\approx0.5z(1+\tanh{[\sqrt{\frac{2}{\pi}}(z+0.044715z^3)]}) $$

Name: Gaussian Error Linear Unit

Idea:

  • weigh each output value by its Gaussian cdf.

Pros:

  • throw away the hard gate structure and weigh neuron outputs probabilistically.
  • seemingly better performance than the ReLU and ELU families, SOTA in transformers.

Cons:

  • slightly more computationally expensive than ReLU.
  • lack of practical testing at the moment.
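
A sketch comparing the exact $z\,\Phi(z)$ (via the error function) with the tanh approximation from the formula above (helper names are illustrative); the two agree to within a few $10^{-3}$:

```python
import numpy as np
from math import erf, sqrt

def gelu_exact(z):
    # z * Phi(z) with the exact Gaussian cdf, Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return z * 0.5 * (1.0 + erf(z / sqrt(2.0)))

def gelu_tanh(z):
    # the tanh approximation used in many transformer implementations
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))
```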

SiLU

$$ \mathrm{SiLU}(z)=z*\sigma(z) $$

Name: Sigmoid Linear Unit

Idea:

  • weigh each output value by its sigmoid value.

Pros:

  • throw away gate structure.
  • seemingly better performance than the ReLU and ELU families.

Cons:

  • worse than GELU.
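
A one-function numpy sketch; the similarity to GELU is unsurprising, since GELU is reportedly well approximated by $z\,\sigma(1.702z)$:

```python
import numpy as np

def silu(z):
    # z * sigmoid(z): smooth, non-monotone near z ~ -1.28, ~identity for large z
    return z / (1.0 + np.exp(-z))
```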

Softplus

$$ \mathrm{softplus}(z)=\frac{1}{\beta}\log{(1+e^{\beta z})} $$

Idea:

  • smooth approximation of ReLU.

Pros:

  • differentiable and thus theoretically better than ReLU.

Cons:

  • empirically far worse than ReLU in terms of computation and performance.
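
A numerically stable numpy sketch of the formula above (the naive form overflows for large $\beta z$); with a large $\beta$, softplus visibly approaches ReLU:

```python
import numpy as np

def softplus(z, beta=1.0):
    # stable form of (1/beta) * log(1 + exp(beta*z)), using
    # log(1 + e^x) = max(x, 0) + log1p(e^(-|x|))
    bz = beta * z
    return np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(bz))) / beta
```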

Multiclass

Softmax

$$ \mathrm{softmax}(z_i)=\frac{\exp{(z_i)}}{\sum_j{\exp{(z_j)}}} $$

Idea:

  • convert each value $z_i$ into a probability in $(0,1)$, with all outputs summing to 1.

Pros:

  • your single best choice for multiclass classification.

Cons:

  • mutually exclusive classes (i.e., one input can only be classified into one class.)
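
In practice softmax is computed with a max-shift for numerical stability: subtracting a constant from all logits leaves the output unchanged but prevents `exp` overflow. A numpy sketch:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift by the max: result is mathematically unchanged...
    e = np.exp(z)      # ...but exp can no longer overflow for large logits
    return e / e.sum()
```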

Softmin

$$ \mathrm{softmin}(z_i)=\mathrm{softmax}(-z_i)=\frac{\exp{(-z_i)}}{\sum_j{\exp{(-z_j)}}} $$

Idea:

  • reverse softmax.

Pros:

  • suitable for multiclass classification.

Cons:

  • no reason to prefer it over softmax: $\mathrm{softmin}(z)=\mathrm{softmax}(-z)$.
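
A numpy sketch making the equivalence explicit: softmin is just softmax on negated inputs, so smaller values get larger probabilities:

```python
import numpy as np

def softmin(z):
    neg = -np.asarray(z, dtype=float)
    neg -= np.max(neg)     # same max-shift stability trick as softmax
    e = np.exp(neg)
    return e / e.sum()     # softmin(z) == softmax(-z)
```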