Exploring Activation Functions: The Building Blocks of Neural Networks

Activation functions in neural networks are mapping mechanisms that transform the weighted sum of a neuron’s inputs into the neuron’s output, with the exact mapping determined by the specific function used. These functions can be categorised as either linear or non-linear.
Linear activation functions, which map input values directly to outputs, are generally less favoured in neural network training because they preserve linearity across layers: a stack of linear layers collapses into a single linear transformation, so the network can perform no better than a simple linear regression model.
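
To illustrate why, the short NumPy sketch below (with arbitrary, made-up layer sizes) shows that two stacked linear layers with no non-linear activation collapse into a single equivalent linear transformation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with purely linear (identity) activations, arbitrary sizes.
W1 = rng.standard_normal((4, 3))   # first layer weights
W2 = rng.standard_normal((2, 4))   # second layer weights
x = rng.standard_normal(3)         # an input vector

# Forward pass through both linear layers.
two_layer_output = W2 @ (W1 @ x)

# The same mapping expressed as a single linear layer.
single_layer_output = (W2 @ W1) @ x

# The outputs match, so the extra layer adds no expressive power.
print(np.allclose(two_layer_output, single_layer_output))  # True
```
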
Non-linear activation functions, on the other hand, are crucial for enabling the network to learn complex patterns in the data. They introduce non-linearity into the model, making it capable of handling intricate tasks such as image recognition and language processing, which linear models struggle with. Among the most widely used non-linear activation functions are the sigmoid, tanh, ReLU (Rectified Linear Unit), Leaky ReLU, ReLU6 and Swish. Each of these functions has distinct characteristics and is explained in detail below:
1. The sigmoid activation function is one of the most widely used activation functions in neural networks, particularly in logistic regression and binary classification problems. It is a squashing function that maps input values to an output range between 0 and 1, which makes it particularly useful for models that need to output probabilities. The sigmoid function has a smooth gradient, which helps during backpropagation by ensuring that small changes in the input lead to small changes in the output. Although the sigmoid function is non-linear, allowing the model to capture complex relationships, it can suffer from the vanishing gradient problem for very high or very low input values.
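
As a minimal sketch in plain NumPy (not tied to any particular framework), the snippet below implements the sigmoid and its derivative and shows how the gradient shrinks towards zero for large-magnitude inputs, which is the vanishing gradient problem mentioned above:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, used during backpropagation."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # outputs lie strictly between 0 and 1
print(sigmoid_grad(x))  # gradients near -10 and 10 are almost zero (vanishing gradients)
```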

2. The tanh (hyperbolic tangent) activation function, commonly used in the hidden layers of neural networks, maps input values to outputs between -1 and 1. This zero-centred range helps centre the data, which can accelerate the learning process. Tanh also provides stronger gradients than the sigmoid function for inputs near zero (its derivative peaks at 1, compared with 0.25 for sigmoid), facilitating more effective weight learning. However, like sigmoid, tanh can experience the vanishing gradient problem for very high or very low input values.
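
A similarly minimal NumPy sketch for tanh is shown below; it relies on NumPy's built-in np.tanh and illustrates both the zero-centred output range and the saturation of the gradient for large inputs:

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2; it peaks at 1 for x = 0 (vs 0.25 for sigmoid)."""
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))    # outputs are centred around 0, in the range (-1, 1)
print(tanh_grad(x))  # strong gradient near 0, but nearly zero for large |x|
```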

3. ReLU (Rectified Linear Unit), known for its computational efficiency, activates only positive inputs: it sets any negative input value to 0 and keeps positive input values unchanged. ReLU is defined as:

f(x) = max(0, x)

Although ReLU appears linear because of its piecewise form, it introduces non-linearity into the model, allowing it to learn complex patterns in the data. ReLU is computationally efficient because it involves only a simple threshold, which speeds up the training of neural networks compared to sigmoid or tanh. It also leads to sparse activations, since it outputs zero for any negative input, which can make the model more efficient. Moreover, for positive inputs the gradient of ReLU is exactly 1, so gradients do not vanish during backpropagation, facilitating faster learning.

ReLU does have drawbacks. Its output is not bounded, which can lead to very large activations and potentially unstable models; proper initialisation and regularisation techniques are necessary to handle this. In addition, neurons that only ever receive negative inputs output zero, effectively becoming inactive and ceasing to learn. This is known as the “dying ReLU” problem, and variants such as Leaky ReLU have been proposed to overcome it.
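
The sketch below is a bare-bones NumPy illustration of ReLU and its (sub)gradient, showing the sparsity of the activations and the constant gradient of 1 for positive inputs:

```python
import numpy as np

def relu(x):
    """ReLU activation: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """(Sub)gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # negative inputs are zeroed out -> sparse activations
print(relu_grad(x))  # gradient is exactly 1 for positive inputs, 0 elsewhere
```
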
4. Leaky ReLU is a variant of ReLU that allows a small, non-zero gradient when the unit is not active, which helps keep neurons alive and learning during training. Leaky ReLU is defined as:

f(x) = x if x > 0, and f(x) = αx if x ≤ 0

Here, α is a small constant (typically 0.01) that defines the slope for negative input values. This non-zero slope keeps gradients flowing during backpropagation for negative inputs, which can enhance the learning process. Leaky ReLU thus retains the computational simplicity and efficiency of ReLU while addressing its main limitation, the dying ReLU problem. On the other hand, the non-zero slope means the output can be negative, which might not be desirable in certain applications.
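
A minimal NumPy version of Leaky ReLU, assuming the common default of α = 0.01, might look like the following; note how negative inputs still receive a small gradient instead of zero:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x for negative inputs."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient: 1 for positive inputs, alpha (rather than 0) for negative inputs."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))       # negative inputs are scaled by alpha instead of being zeroed
print(leaky_relu_grad(x))  # a small gradient flows for negative inputs, keeping neurons "alive"
```
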
5. ReLU6 is a modified version of the Rectified Linear Unit (ReLU) activation function used in the MobileNet architectures. It is designed to improve the performance of neural networks on mobile and embedded devices by capping the activation at a maximum value of 6, which helps maintain numerical stability and reduces the risk of overflow when using lower-precision arithmetic. ReLU6 is defined as:

f(x) = min(max(0, x), 6)
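
A one-line NumPy sketch of ReLU6 is given below, simply clipping the standard ReLU output at 6:

```python
import numpy as np

def relu6(x):
    """ReLU6: standard ReLU capped at 6, i.e. min(max(0, x), 6)."""
    return np.minimum(np.maximum(0.0, x), 6.0)

x = np.array([-2.0, 0.0, 3.0, 6.0, 10.0])
print(relu6(x))  # [0. 0. 3. 6. 6.] -- activations never exceed 6
```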

6. The Swish activation function is a smooth, non-monotonic function that has been shown to outperform ReLU on deep networks and is used in EfficientNet. It is defined as:

f(x) = x · σ(x)

where σ(x) is the sigmoid function. Swish allows small negative outputs when x < 0, which helps gradients flow during backpropagation and can improve training, particularly in deeper networks. Unlike ReLU, Swish is smooth and differentiable across the entire input range, which can aid optimisation, and its non-monotonic shape allows for better information flow and potentially richer representation learning.
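
As with the other functions, a minimal NumPy sketch of Swish (also known as SiLU when the sigmoid is applied directly to x) is shown below:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    """Swish activation: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))  # small negative outputs for negative inputs, smooth everywhere
```
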
Happy Coding ..!!