Activation - Functions

Source code in EpyNN/epynn/commons/maths.py.

See Appendix - Notations for mathematical conventions.

In the context of a given trainable layer within a Neural Network, an activation function can be characterized as follows:

  • Any differentiable - non-linear - function that takes a weighted sum of features with respect to samples and returns an output with respect to samples.

  • The same activation function is assigned to each unit within a given layer.

  • The activation function of the output layer should be a logistic function, such as the sigmoid, or the softmax.

Note that:

  • For a given layer, several activations of distinct variables can take place. For instance, the Gated Recurrent Unit (GRU) has three distinct activation functions for three distinct variables. Given the activation of one variable, the same activation function is assigned for all units within the layer. Given the activation of another variable, the same logic applies.

  • The activation function for the output layer should be chosen consistently with the Loss - Functions. For instance, the tanh activation outputs belong to [-1, 1]. Since the Binary Cross-Entropy loss contains natural logarithm terms, it cannot be fed with negative values (see the sketch after this list). While this could easily be handled with a specific procedure, it is not implemented in EpyNN because we think it is important to be aware of such relationships.

  • Activation functions can be modified or implemented from epynn.commons.maths.
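
Regarding the note above on tanh and Binary Cross-Entropy, here is a minimal sketch, assuming NumPy is imported as np and A stands for hypothetical tanh outputs: the natural logarithm is undefined for negative inputs, which is why tanh outputs cannot be passed directly to a log term.

import numpy as np

# Hypothetical tanh outputs, which belong to [-1, 1]
A = np.array([-0.5, 0.2, 0.9])

# Natural logarithm term, as found in Binary Cross-Entropy
print(np.log(A))    # ≈ [nan -1.6094 -0.1054] - nan for the negative input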

Also, in the mathematical definitions below, derivatives may contain terms referring to their primitive (e.g., sigmoid). While such notation is widespread, it may be confusing for some readers. We use it because it mirrors the Python code, in which this writing style is convenient.

Identity

The identity function is not appropriate for backpropagation - it is implemented for testing purposes.

def identity(x, deriv=False):
    """Compute identity activation or derivative.
    """
    if not deriv:
        pass

    elif deriv:
        x = np.ones_like(x)

    return x

Given M and U the number of training examples and units in one layer, the identity function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(\mathbb{R}) & \to \mathcal{M}_{M,U}(\mathbb{R}) \\ X & \to X \\ \end{alignat*}\end{split}\]

The derivative of the identity function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(\mathbb{R}) & \to \mathcal{M}_{M,U}(\{1\}) \\ X & \to 1 \\ \end{alignat*}\end{split}\]
_images/Identity.svg

Although the identity function is differentiable - the output of its derivative is always 1 - it is linear and therefore introduces no non-linearity. This is why it is not appropriate for practical use in the backpropagation process and is only implemented for testing purposes.
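
A quick usage sketch, assuming NumPy is imported as np and identity is defined as above:

import numpy as np

x = np.array([[-2.0, 0.0, 3.0]])

print(identity(x))               # [[-2.  0.  3.]] - input returned unchanged
print(identity(x, deriv=True))   # [[1. 1. 1.]] - derivative is always 1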

Rectifier Linear Unit

def relu(x, deriv=False):
    """Compute ReLU activation or derivative.
    """
    if not deriv:
        x = np.maximum(0, x)

    elif deriv:
        x = np.greater(x, 0).astype(int)

    return x

Given M and U the number of training examples and units in one layer, the ReLU function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R_{+}}) \\ X &\to && \begin{cases} X, & X > 0 \\ 0, & X \le 0 \end{cases} \\ \end{alignat*}\end{split}\]

The derivative of the ReLU function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\{0,1\}) \\ X &\to && \begin{cases} 1, & X > 0 \\ 0, & X < 0 \end{cases} \\ \end{alignat*}\end{split}\]
_images/ReLU.svg

The ReLU function is a popular activation function, mostly because it is cheaper to compute than other activation functions.

The ReLU function behaves like the identity function for positive values, while it returns zero for negative values. Accordingly, the ReLU derivative outputs 1 for positive values and 0 for negative values. Note that the ReLU function is differentiable at all points except 0.
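
A quick usage sketch, assuming NumPy is imported as np and relu is defined as above:

import numpy as np

x = np.array([[-2.0, 0.0, 3.0]])

print(relu(x))               # [[0. 0. 3.]] - negative values are zeroed
print(relu(x, deriv=True))   # [[0 0 1]] - 1 for strictly positive inputs, 0 otherwise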

Leaky Rectifier Linear Unit

def lrelu(x, deriv=False):
    """Compute LReLU activation or derivative.
    """
    # Retrieve alpha from layers hyperparameters (temporary globals)
    a = layer_hPars['LRELU_alpha']

    if not deriv:
        x = np.maximum(a * x, x)

    elif deriv:
        x = np.where(x > 0, 1, a)

    return x

Given M and U the number of training examples and units in one layer, the LReLU function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R}) \\ X &\to && \begin{cases} X, & X > 0 \\ a * X, & X \le 0 \end{cases} \\ with~a \in \mathbb{R}^* \end{alignat*}\end{split}\]

The derivative of the LReLU function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\{a, 1\}) \\ X &\to && \begin{cases} 1, & X > 0 \\ a, & X < 0 \end{cases} \\ with~a \in \mathbb{R}^* \end{alignat*}\end{split}\]
_images/LReLU.svg

The Leaky ReLU is a variant of the ReLU function. Any negative input value passed through the ReLU function yields 0. By contrast, the Leaky ReLU applies a coefficient a to negative input values. Therefore, the codomain of the Leaky ReLU function is the set of real numbers, instead of the set of positive real numbers for its ReLU counterpart.

The derivative of the Leaky ReLU function returns a for negative values and 1 for positive values. Note that the Leaky ReLU function is differentiable at all points except 0.
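
Because the EpyNN implementation retrieves a from the layer hyperparameters, the quick sketch below uses a hypothetical standalone variant that takes the slope a as an explicit argument (assuming NumPy is imported as np; the value a=0.3 is arbitrary):

import numpy as np

def lrelu_standalone(x, a=0.3, deriv=False):
    """Illustrative LReLU with an explicit slope argument."""
    if not deriv:
        x = np.maximum(a * x, x)

    elif deriv:
        x = np.where(x > 0, 1, a)

    return x

x = np.array([[-2.0, 0.0, 3.0]])

print(lrelu_standalone(x))               # ≈ [[-0.6  0.   3. ]]
print(lrelu_standalone(x, deriv=True))   # ≈ [[0.3 0.3 1. ]]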

Exponential Linear Unit

def elu(x, deriv=False):
    """Compute ELU activation or derivative.
    """
    # Retrieve alpha from layers hyperparameters (temporary globals)
    a = layer_hPars['ELU_alpha']

    if not deriv:
        x = np.where(x > 0, x, a * (np.exp(x, where=x<=0)-1))

    elif deriv:
        x = np.where(x > 0, 1, elu(x) + a)

    return x

Given M and U the number of training examples and units in one layer, the ELU function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([-a, \infty)) \\ X &\to && \begin{cases} X, & X > 0 \\ a * (e^{X} - 1), & X \le 0 \end{cases} \\ with~a \in \mathbb{R}^* \end{alignat*}\end{split}\]

Given f as defined above, the derivative of the ELU function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([0, 1]) \\ X &\to && \begin{cases} 1, & X > 0 \\ f(X) + a, & X \le 0 \end{cases} \\ with~a \in \mathbb{R}^* \end{alignat*}\end{split}\]
_images/ELU.svg

The ELU function is identical to the ReLU function for positive inputs. For negative inputs, the ELU function applies a coefficient a to the difference between the exponential of the input and 1. Because this difference always belongs to [-1, 0], the output of the ELU function for negative values belongs to [-a, 0]. For negative values, the ELU curve is not linear, in contrast to its Leaky ReLU and ReLU counterparts: it smoothly saturates toward -a as the input decreases.

Another difference is that the ELU function is differentiable at all points, including 0, provided that a = 1.
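
As for the LReLU above, the EpyNN implementation retrieves a from the layer hyperparameters, so the quick sketch below uses a hypothetical standalone variant with an explicit coefficient (assuming NumPy is imported as np):

import numpy as np

def elu_standalone(x, a=1.0, deriv=False):
    """Illustrative ELU with an explicit coefficient argument."""
    if not deriv:
        x = np.where(x > 0, x, a * (np.exp(np.minimum(x, 0)) - 1))

    elif deriv:
        x = np.where(x > 0, 1, a * np.exp(np.minimum(x, 0)))

    return x

x = np.array([[-10.0, -1.0, 0.0, 3.0]])

# Negative inputs saturate toward -a = -1, positive inputs are unchanged
print(elu_standalone(x))   # ≈ [[-0.99995 -0.63212  0.  3.]]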

Sigmoid

def sigmoid(x, deriv=False):
    """Compute Sigmoid activation or derivative.
    """
    if not deriv:
        # Numerically stable version of sigmoid function
        x = np.where(
            x >= 0, # condition
            1 / (1+np.exp(-x)), # For positive values
            np.exp(x) / (1+np.exp(x)) # For negative values
        )

    elif deriv:
        x = sigmoid(x) * (1-sigmoid(x))

    return x

Given M and U the number of training examples and units in one layer, the Sigmoid function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([0, 1]) \\ X &\to && \frac{1}{1+e^{-X}} \end{alignat*}\end{split}\]

Given f as defined above, the derivative of the Sigmoid function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([0, 0.25]) \\ X &\to && f(X) * (1-f(X)) \end{alignat*}\end{split}\]
_images/Sigmoid.svg

The logistic sigmoid activation is often used in the output layer of a Neural Network because the output of the sigmoid function belongs to [0, 1]. Therefore, it is naturally well suited when the output to predict is a probability.

The sigmoid function is differentiable at all points. The derivative of the sigmoid function is equal to the product of the sigmoid output and the difference between 1 and the sigmoid output.

Note that, to avoid computational overflow problems, the sigmoid implementation in EpyNN is a numerically stable version. It is a different expression of the same function, which avoids feeding the exponential with a large positive value.
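
A quick usage sketch, assuming NumPy is imported as np and sigmoid is defined as above; note that the derivative peaks at 0.25 for an input of 0, consistent with the codomain given above.

import numpy as np

x = np.array([[-5.0, 0.0, 5.0]])

print(sigmoid(x))                # ≈ [[0.0067 0.5    0.9933]]
print(sigmoid(x, deriv=True))    # ≈ [[0.0066 0.25   0.0066]]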

Hyperbolic tangent

def tanh(x, deriv=False):
    """Compute tanh activation or derivative.
    """
    if not deriv:
        x = (np.exp(x)-np.exp(-x)) / (np.exp(x)+np.exp(-x))

    elif deriv:
        x = 1 - tanh(x)**2

    return x

Given M and U the number of training examples and units in one layer, the tanh function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([-1, 1]) \\ X &\to && \frac{e^{X}-e^{-X}}{e^{X}+e^{-X}} \end{alignat*}\end{split}\]

Given f as defined above, the derivative of the tanh function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([0, 1]) \\ X &\to && 1 - f(X)^2 \end{alignat*}\end{split}\]
_images/tanh.svg

The logistic tanh activation function is similar to the sigmoid function except that its outputs belong to [-1, 1]. This function is often used in the output layer of a Neural Network and is also the standard for the output activation in recurrent layers, such as the Recurrent Neural Network (RNN).

The tanh function is differentiable at all points and the outputs of its derivative belong to [0, 1].
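
A quick sanity check, assuming NumPy is imported as np and tanh is defined as above: the implementation matches numpy's built-in np.tanh, and the derivative equals 1 at the origin.

import numpy as np

x = np.array([[-2.0, 0.0, 2.0]])

print(np.allclose(tanh(x), np.tanh(x)))   # True
print(tanh(x, deriv=True))                # ≈ [[0.0707 1.     0.0707]]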

Softmax

def softmax(x, deriv=False):
    """Compute softmax activation or derivative.

    For Dense layer only.

    For other layers, you can change element-wise matrix multiplication
    operator '*' by :func:`epynn.maths.hadamard` which handles
    the softmax derivative jacobian matrix.

    :param deriv: To compute derivative, defaults to False.
    :type deriv: bool, optional

    :return: Output array passed in function.
    :rtype: :class:`numpy.ndarray`
    """
    # Retrieve temperature from layers hyperparameters (temporary globals)
    T = layer_hPars['softmax_temperature']

    if not deriv:
        # Numerically stable version of softmax function
        x_safe = x - np.max(x, axis=1, keepdims=True)

        x_exp = np.exp(x_safe / T)
        x_sum = np.sum(x_exp, axis=1, keepdims=True)

        x = x_exp / x_sum

    elif deriv:

        x = np.array([np.diag(x) - np.outer(x, x) for x in softmax(x)])

    return x

Given M and U the number of training examples and units in one layer, the Softmax function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([0, 1]) \\ X = \mathop{(x_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && \frac{e^{x_{mu} / T}}{\sum\limits_{u' = 1}^U e^{x_{mu'} / T}} \end{alignat*}\end{split}\]

Given f as defined above, the derivative of the Softmax function can be defined as follows:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U,U}([-1, 1]) \\ X = \mathop{(x_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && Y = \mathop{(y_{muu'})}_{\substack{1 \le m \le M \\ 1 \le u \le U \\ 1 \le u'\le U}} \\ \end{alignat*}\end{split}\]
\[\begin{split}\begin{alignat*}{2} & with~\forall m && \in \{1,..,M\} \\ & && \forall u, u' \in \{1,..,U\} \\ & && y_{muu'} = \begin{cases} f(x_{m})_{u} * (1-f(x_{m})_{u'}), & u = u' \\ -f(x_{m})_{u} * f(x_{m})_{u'}, & u \ne u' \end{cases} \\ \end{alignat*}\end{split}\]

The softmax activation function is a generalization of the logistic function to multiple dimensions. It is very frequently used and mostly appropriate for the output layer of a Neural Network.

By contrast with the logistic sigmoid and tanh, which act element-wise, the softmax outputs a probability distribution over the units of the layer.

Given one sample m and a number of output units U, the sum of all output values is always equal to 1 when using the softmax activation in the output layer.

This is because the softmax normalizes the exponentiated outputs for each sample (x_exp) by their sum over the units of the same sample (x_sum). Therefore, the output of the softmax always belongs to [0, 1], while the sum of the outputs for one sample is always equal to 1.

Note that the softmax version implemented in EpyNN is numerically stable. It is a different expression of the same function. For each sample, the maximum input value is subtracted from all input values before passing through the exponential function.
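
A quick sketch of both properties, assuming NumPy is imported as np and a temperature T = 1; the stabilized computation below mirrors the implementation above without relying on the layer hyperparameters.

import numpy as np

x = np.array([[1000.0, 1001.0, 1002.0]])

# Subtract the per-sample maximum before exponentiation (numerical stability)
x_safe = x - np.max(x, axis=1, keepdims=True)    # [[-2. -1.  0.]]

x_exp = np.exp(x_safe)
p = x_exp / np.sum(x_exp, axis=1, keepdims=True)

print(p)                   # ≈ [[0.090 0.245 0.665]]
print(np.sum(p, axis=1))   # [1.] - sums to 1 for each sample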

The softmax function is differentiable at all points. Its derivative is not as simple as those of the other logistic functions, because each output value for one sample depends on the other output values for the same sample.

While the input shape is equal to the output shape for all derivatives seen above, this is not true for the softmax: for an input of shape (M, U), its derivative has shape (M, U, U).

The output of the softmax derivative is called a jacobian matrix. This is implemented efficiently as a one-liner in EpyNN and is illustrated using a pure iterative approach below.

import numpy as np

# Uses softmax() as defined above

# One sample, 10 output units
M = 1
U = 10

# Input array
x = np.random.standard_normal((M, U))

# Softmax output - shape (1, 10)
s = softmax(x)

# Initialize softmax derivative jacobian - shape (1, 10, 10)
ds = np.zeros((M, U, U))

for m in range(M):
    for u1 in range(U):
        for u2 in range(U):

            if u1 == u2:
                # On-diagonal terms of the jacobian matrix
                ds[m, u1, u2] = s[m, u1] * (1 - s[m, u1])

            elif u1 != u2:
                # Off-diagonal terms
                ds[m, u1, u2] = -s[m, u1] * s[m, u2]

In words, the derivative of the softmax function given a sample m and the number of units U is:

  • A jacobian matrix of shape (1, U, U).

  • On-diagonal points are equal to softmax(x) * (1-softmax(x)) with respect to u.

  • Off-diagonal points are equal to -softmax(x) * softmax(x) with respect to u1 and u2.
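
For reference, the loop above can be checked against the vectorized one-liner used in the EpyNN softmax derivative; this is a quick sketch reusing s and ds from the snippet above.

# Same jacobian computed with the one-liner from the softmax() implementation
ds_vec = np.array([np.diag(s_m) - np.outer(s_m, s_m) for s_m in s])

print(ds_vec.shape)              # (1, 10, 10)
print(np.allclose(ds, ds_vec))   # True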