Deep Learning - Activation

Sigmoid, ReLU, GELU, Tanh

Posted by Rico's Nerd Cluster on January 8, 2022

Activation Functions

Early work observed that the Rectified Linear Unit (ReLU) often trains faster than sigmoid-like activations because it avoids saturation for positive inputs and has a simple gradient. Modern techniques such as batch normalization reduce some of the original differences, but ReLU and its variants remain popular.

  • ReLU ($\mathrm{ReLU}(x)=\max(0,x)$):
    • Simple and computationally cheap, with gradient 1 for positive inputs, which helps gradient flow.
    • Advantages:
      • Avoids saturation on positive side; faster convergence in many networks.
      • Sparse activations (many zeros) can act as a regularizer and reduce computation.
    • Disadvantages:
      • “Dead” neurons: if a unit's pre-activation is negative for every input, its output and its gradient are both zero, so its weights stop updating and the unit stops learning.
      • Unbounded outputs on the positive side.
    • Rule of thumb: do not apply ReLU to the logits that feed a softmax. Keep the output layer linear so that negative logits, and the relative differences between them, are preserved.
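
To make the points above concrete, here is a minimal plain-Python sketch of ReLU, its gradient, and the softmax rule of thumb. The helper names (`relu`, `relu_grad`, `softmax`) and the example logits are illustrative assumptions, not part of any particular library.

```python
import math

def relu(x: float) -> float:
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Gradient of ReLU; using 0 at x == 0 is a convention assumed here."""
    return 1.0 if x > 0 else 0.0

def softmax(logits):
    """Softmax with the max subtracted for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits: two negative values and one positive value.
logits = [-2.0, -1.0, 0.5]

# Linear output: the difference between the first two classes is kept.
p_linear = softmax(logits)

# ReLU before softmax: both negative logits collapse to 0,
# so the first two classes become indistinguishable.
p_relu = softmax([relu(z) for z in logits])

# A unit whose pre-activation is negative gets zero gradient ("dead" if this
# holds for every input).
dead_grad = relu_grad(-3.0)
```

With these example logits, `p_linear` assigns the second class a higher probability than the first, while `p_relu` assigns them the same probability.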

  • Leaky ReLU: allows a small gradient for negative inputs, reducing dead neurons. For $\alpha>0$:
\[\mathrm{LeakyReLU}(x)=\begin{cases} x & x>0,\\ \alpha x & x\le 0. \end{cases}\]
Typical choice: $\alpha\approx 0.01$.
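
A minimal sketch of Leaky ReLU and its gradient in plain Python, assuming the typical $\alpha=0.01$; the function names are illustrative.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """x for x > 0, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient is 1 on the positive side and alpha (not 0) on the negative side."""
    return 1.0 if x > 0 else alpha

# Unlike plain ReLU, a negative input still yields a small nonzero gradient,
# so the unit can keep learning.
negative_side_grad = leaky_relu_grad(-5.0)
```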

<p align="center">
    <figure>
            <img src="https://github.com/user-attachments/assets/0fc12aeb-8daf-4140-b09a-19d6e9b1fd5a" height="300" alt="Leaky ReLU"/>
    </figure>
</p>
  • ReLU6: a clipped ReLU that bounds the activation to $[0,6]$:
\[\mathrm{ReLU6}(x)=\min(\max(0,x),6).\]
This is useful for quantized or mobile networks where a fixed activation range improves robustness to reduced numerical precision.
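
The sketch below shows ReLU6 in plain Python together with a hypothetical 8-bit uniform quantizer, `quantize_uint8`, which is an assumption made for illustration and not a real library function. Because the activation range is known in advance, the quantizer can use a fixed step of $6/255$.

```python
def relu6(x: float) -> float:
    """ReLU6(x) = min(max(0, x), 6)."""
    return min(max(0.0, x), 6.0)

def quantize_uint8(x: float) -> int:
    """Hypothetical quantizer: map ReLU6(x) in [0, 6] onto integers 0..255."""
    return round(relu6(x) / 6.0 * 255)

# Any input, however large, lands inside the fixed 0..255 range.
codes = [quantize_uint8(v) for v in (-1.0, 0.0, 6.0, 100.0)]
```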

<p align="center">
    <figure>
            <img src="https://github.com/user-attachments/assets/fd7ff666-de91-411b-82dd-b037b991370c" height="300" alt="ReLU6"/>
    </figure>
</p>
  • GELU (Gaussian Error Linear Unit): smoother alternative used in Transformers. Defined using the Gaussian CDF $\Phi(x)$; a common approximation is:
\[\mathrm{GELU}(x)=x\,\Phi(x)\approx 0.5x\left[1+\tanh\left(\sqrt{\tfrac{2}{\pi}}\,(x+0.044715x^3)\right)\right].\]
Advantages: smooth, avoids hard zeroing of negative inputs, and is non-monotonic (it dips slightly below zero for small negative inputs before returning toward zero). Disadvantage: slightly slower to compute than ReLU.
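
As a rough check of the formula above, the following plain-Python sketch compares the exact GELU, written with the error function via $\Phi(x)=\tfrac{1}{2}\left[1+\mathrm{erf}(x/\sqrt{2})\right]$, against the tanh approximation. The function names and the evaluation grid are illustrative assumptions.

```python
import math

def gelu_exact(x: float) -> float:
    """x * Phi(x), with the Gaussian CDF Phi expressed through math.erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU, using the constants from the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest gap between the two forms on a grid over [-4, 4].
grid = [i / 10.0 for i in range(-40, 41)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)

# Non-monotonic behaviour: the value at -0.75 is lower than at -3.0.
dip, tail = gelu_exact(-0.75), gelu_exact(-3.0)
```

On this grid the two forms stay close to each other, and `dip < tail < 0` shows the small negative bump that makes GELU non-monotonic.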

<p align="center">
    <figure>
            <img src="https://github.com/user-attachments/assets/8df49272-30cc-4335-9ac1-3cd02c9d37dd" height="300" alt="GELU"/>
    </figure>
</p>
  • tanh: a zero-centered sigmoid-like activation. Equivalent forms:
\[\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}=\frac{2}{1+e^{-2x}}-1.\]
Range: $(-1,1)$. Compared with sigmoid, tanh is zero-centered, which can help optimization, but it still saturates for large $|x|$.
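
The two equivalent forms can be checked numerically against Python's built-in `math.tanh`. This is a small sketch with illustrative function names; the gradient helper uses the standard identity $\tanh'(x)=1-\tanh^2(x)$ to show saturation.

```python
import math

def tanh_exp(x: float) -> float:
    """First form: (e^x - e^-x) / (e^x + e^-x)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_shifted(x: float) -> float:
    """Second form: 2 / (1 + e^(-2x)) - 1, a rescaled and shifted logistic."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def tanh_grad(x: float) -> float:
    """Derivative 1 - tanh(x)^2; it approaches 0 as |x| grows (saturation)."""
    t = math.tanh(x)
    return 1.0 - t * t

# Largest deviation of either form from math.tanh on a few sample points.
points = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_diff = max(
    max(abs(tanh_exp(x) - math.tanh(x)), abs(tanh_shifted(x) - math.tanh(x)))
    for x in points
)
```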

<p align="center">
    <figure>
            <img src="https://github.com/RicoJia/The-Dream-Robot/assets/39393023/22e4e9f7-8a9e-4e3c-9601-4f778281975c" height="300" alt="tanh"/>
    </figure>
</p>
  • Sigmoid (logistic):
\[\sigma(x)=\frac{1}{1+e^{-x}}.\]
Range: $(0,1)$. Advantages: the output can be read as a probability. Disadvantages: it saturates for large $|x|$, which leads to vanishing gradients; even at its peak the derivative is only $\sigma'(0)=0.25$.
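
A minimal plain-Python sketch of the sigmoid and its derivative $\sigma'(x)=\sigma(x)\,(1-\sigma(x))$ follows. The branch on the sign of `x` is one common way to avoid overflow in `math.exp`; the function names and the 10-layer example are illustrative assumptions.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function, written so that math.exp never sees a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x: float) -> float:
    """Derivative sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)         # the largest value the derivative can take
saturated = sigmoid_grad(10.0)   # far from zero the gradient is tiny

# Upper bound on the product of sigmoid derivatives across 10 stacked layers
# (weights ignored): each factor is at most 0.25.
bound_10_layers = 0.25 ** 10
```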