" MicromOne: Focus on Activation functions

Pagine

Focus on Activation functions

Activation functions are a fascinating area of research in machine learning. They are a special class of mathematical functions characterized by several important properties. First of all, activation functions are usually non-linear. If every activation function were linear, the entire neural network would collapse into a single linear model, because a composition of linear maps is itself linear, and the network would lose its expressive power.
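
To make the collapse concrete, here is a minimal NumPy sketch (the weights, biases, and shapes are arbitrary illustrative choices): two stacked layers with no activation between them compute exactly the same thing as a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer: y = W @ x + b
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: the stack collapsed into one linear map
```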

Another key requirement is differentiability. Ideally, an activation function should be continuously differentiable, meaning it has a derivative everywhere. There is, however, one very popular activation function that is differentiable almost everywhere rather than everywhere.

Activation functions should also be monotonic, meaning that as the input increases, the output never decreases. In other words, when moving from left to right along the x-axis, the value of the function does not go down. Many commonly used activation functions satisfy this property.

A further important characteristic is that the activation function should approximate the identity function near the origin, i.e. f(0) ≈ 0 and f'(0) ≈ 1. This ensures that, around zero, small inputs pass through almost unchanged, which is helpful early in training when the weights are initialized to small values.
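
As a rough illustrative check in NumPy, using two functions discussed below: tanh is close to the identity for small inputs, whereas sigmoid is not, since it outputs 0.5 at the origin.

```python
import numpy as np

x = np.array([-0.1, -0.01, 0.01, 0.1])

print(np.tanh(x))            # ~[-0.0997, -0.0100, 0.0100, 0.0997] -> close to x itself
print(1 / (1 + np.exp(-x)))  # ~[ 0.475,   0.4975, 0.5025, 0.525 ] -> offset by 0.5, not identity-like
```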

Let us now look at some of the most common activation functions.

The most widely used activation function is the ReLU (Rectified Linear Unit). In many neural networks, ReLU is the default choice for hidden layers because it performs very well in practice and makes gradient computation efficient. However, ReLU lacks one of the desirable properties mentioned earlier: it is not differentiable at zero.
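
As a minimal sketch (NumPy is used here purely for illustration), ReLU simply clamps negative inputs to zero and passes positive inputs through unchanged:

```python
import numpy as np

def relu(x):
    # max(0, x): positive inputs pass through unchanged, negatives become 0
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # [0.  0.  0.  0.5 2. ]
```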

To address this limitation, the Leaky ReLU was introduced. Instead of outputting zero for negative inputs, Leaky ReLU outputs a small fraction of the input, allowing gradients to flow even for negative values.
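
A minimal sketch of Leaky ReLU, where the slope used for negative inputs (alpha, here 0.01) is an assumed illustrative value:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # For x >= 0 it behaves like ReLU; for x < 0 it returns alpha * x instead of 0,
    # so the gradient on the negative side is alpha rather than zero
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))  # [-0.02  -0.005  0.     0.5    2.   ]
```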

In older research papers, the sigmoid activation function is frequently encountered. Sigmoid has some appealing properties: it smoothly maps inputs to values between 0 and 1 and has an easily computable derivative. However, in practice, neural networks trained with sigmoid tend to perform worse than those using ReLU, largely because its gradient saturates toward zero for inputs of large magnitude (the vanishing-gradient problem), which is why sigmoid is rarely used in hidden layers today.
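
A small NumPy sketch of sigmoid and its derivative (the helper names are illustrative); the printed values also hint at why the gradient shrinks for inputs far from zero:

```python
import numpy as np

def sigmoid(x):
    # Maps any real input into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # The derivative has the convenient closed form sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))       # [~0.0067, 0.5,  ~0.9933]
print(sigmoid_grad(x))  # [~0.0066, 0.25, ~0.0066] -> gradients vanish for large |x|
```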

Another commonly used activation function is the hyperbolic tangent (tanh). Like sigmoid, tanh is bounded, but its outputs range from −1 to 1. One of its advantages is that it is centered at zero and has larger derivatives than sigmoid (its maximum derivative is 1, compared with 0.25 for sigmoid), making it a better choice in many cases.
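
A short NumPy sketch comparing tanh's outputs and gradients (the helper name tanh_grad is just illustrative):

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2, with a maximum of 1 at x = 0
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
print(np.tanh(x))    # [-0.964, 0.0, 0.964] -> zero-centered outputs in (-1, 1)
print(tanh_grad(x))  # [ 0.071, 1.0, 0.071] -> peak gradient 1, versus 0.25 for sigmoid
```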

There is also the step function, which is encountered in the perceptron model. While simple, it is not suitable for gradient-based learning, since its derivative is zero everywhere except at the jump, where it is undefined.
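
For completeness, a minimal sketch of the step function (defined here to output 1 for non-negative inputs, one of several common conventions):

```python
import numpy as np

def step(x):
    # Outputs 1 for non-negative inputs and 0 otherwise, as in the classic perceptron.
    # Its derivative is 0 wherever it is defined, so it carries no gradient signal.
    return np.where(x >= 0, 1.0, 0.0)

print(step(np.array([-1.0, 0.0, 2.0])))  # [0. 1. 1.]
```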

Finally, in more advanced research, more esoteric activation functions may be encountered, such as the Gaussian Error Linear Unit (GELU). GELU has been adopted in several transformer-based models, including OpenAI’s GPT-3.
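
A short sketch of GELU using the widely cited tanh-based approximation (the exact definition is x times the standard normal CDF); the function name and test values here are illustrative:

```python
import numpy as np

def gelu(x):
    # Exact form: GELU(x) = x * Phi(x), with Phi the standard normal CDF.
    # Below is the common tanh approximation used in many implementations.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))  # a smooth curve that dips slightly below zero for small negative inputs
```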