# Activation - Functions

Source code in `EpyNN/epynn/commons/maths.py`.

See Appendix - Notations for mathematical conventions.

In the context of a given *trainable* layer within a Neural Network, an activation function can be defined as:

- Any **differentiable**, non-linear function that takes a weighted sum of features with respect to samples and returns an output with respect to samples.
- The same activation function is assigned to each unit within a given layer.
- The activation function of the **output** layer should be a **logistic** function, including softmax.

Note that:

- For a given layer, several activations of distinct variables can take place. For instance, the Gated Recurrent Unit (GRU) has three distinct activation functions for three distinct variables. For each of these variables, the same activation function is assigned to all units within the layer.
- The activation function for the output layer should be chosen consistently with the Loss - Functions. For instance, *tanh* outputs belong to [-1, 1]. Since the Binary Cross-Entropy loss function contains *natural logarithm* terms, it cannot be fed with negative values. While this can easily be handled with a specific procedure, it is not implemented in EpyNN because we think it is important to be aware of such relationships.
- Activation functions can be modified or implemented from `epynn.commons.maths`.
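The incompatibility between *tanh* outputs and the Binary Cross-Entropy can be checked numerically. The snippet below is a minimal sketch: `binary_cross_entropy` is a hypothetical helper written for this illustration, not the EpyNN loss implementation, and the output values are made up.

```python
import numpy as np

def binary_cross_entropy(y, a):
    """Element-wise BCE without any clipping (illustrative helper only)."""
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1.0, 0.0])

# Sigmoid-like outputs lie in (0, 1): both logarithms are defined
a_sigmoid = np.array([0.9, 0.2])
loss_ok = binary_cross_entropy(y, a_sigmoid)

# tanh-like outputs may be negative: the logarithm of a negative number is NaN
a_tanh = np.array([-0.5, 0.2])
with np.errstate(invalid='ignore'):
    loss_bad = binary_cross_entropy(y, a_tanh)

print(loss_ok)   # finite values
print(loss_bad)  # first entry is nan
```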

Also, in the mathematical definitions below, **derivatives may contain terms referring to their primitive** (e.g., sigmoid). While this notation is widespread, it may be confusing for some readers. We use it because it mirrors the Python code, in which this writing style is convenient.

## Identity

**The identity function is not appropriate for backpropagation - This is implemented for testing purposes.**

```
def identity(x, deriv=False):
    """Compute identity activation or derivative.
    """
    if not deriv:
        pass

    elif deriv:
        x = np.ones_like(x)

    return x
```

Given *M* and *U* the number of training examples and units in one layer, the identity function can be defined, for each element *x* of an input of shape *(M, U)*, as:

$$
f(x) = x
$$

The derivative of the identity function can be defined as:

$$
f'(x) = 1
$$

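Because the *identity* is a linear function, it adds no non-linearity: stacked layers that use it collapse into a single linear transformation, which is why it is not a useful activation for training. The output of the *identity* function derivative is always *1*.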
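To see why a linear activation is of little use in a multi-layer network, the sketch below composes two layers with the *identity* activation and compares the result with a single layer. The weights are random, biases are omitted for brevity, and the array shapes are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.standard_normal((4, 3))   # 4 samples, 3 features
W1 = rng.standard_normal((3, 5))  # weights of a first layer
W2 = rng.standard_normal((5, 2))  # weights of a second layer

def identity(x):
    return x

# Two layers with the identity activation...
two_layers = identity(identity(X @ W1) @ W2)

# ...compute the same thing as one layer with weights W1 @ W2
one_layer = X @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))
```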

## Rectifier Linear Unit

```
def relu(x, deriv=False):
    """Compute ReLU activation or derivative.
    """
    if not deriv:
        x = np.maximum(0, x)

    elif deriv:
        x = np.greater(x, 0).astype(int)

    return x
```

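Given *M* and *U* the number of training examples and units in one layer, the ReLU function can be defined, for each element *x* of an input of shape *(M, U)*, as:

$$
f(x) = \max(0, x)
$$

The derivative of the ReLU function can be defined as:

$$
f'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
0 & \text{otherwise}
\end{cases}
$$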

The *ReLU* function is a popular activation function, mostly because it is cheaper to compute than other functions.

The *ReLU* function is essentially an *identity* function for positive values, while it returns zero for negative values. Accordingly, the *ReLU* derivative outputs *1* for positive values and *0* for negative values. Note that the *ReLU* function is differentiable at all points except *0*, where the implementation above returns *0* for the derivative.
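As a quick check of this behavior, the sketch below applies a standalone copy of the function to a small array containing a negative value, zero and a positive value. The input values are arbitrary.

```python
import numpy as np

def relu(x, deriv=False):
    if not deriv:
        return np.maximum(0, x)
    return np.greater(x, 0).astype(int)

x = np.array([[-2.0, 0.0, 3.0]])

out = relu(x)               # [[0. 0. 3.]]
dout = relu(x, deriv=True)  # [[0 0 1]]

print(out)
print(dout)
```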

## Leaky Rectifier Linear Unit

```
def lrelu(x, deriv=False):
    """Compute LReLU activation or derivative.
    """
    # Retrieve alpha from layers hyperparameters (temporary globals)
    a = layer_hPars['LRELU_alpha']

    if not deriv:
        x = np.maximum(a * x, x)

    elif deriv:
        x = np.where(x > 0, 1, a)

    return x
```

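Given *M* and *U* the number of training examples and units in one layer, the LReLU function can be defined, for each element *x* of an input of shape *(M, U)* and a coefficient *a*, as:

$$
f(x) =
\begin{cases}
x & \text{if } x > 0 \\
a x & \text{otherwise}
\end{cases}
$$

The derivative of the LReLU function can be defined as:

$$
f'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
a & \text{otherwise}
\end{cases}
$$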

The *Leaky ReLU* is a variant of the *ReLU* function. Any negative input value passed through a *ReLU* function yields *0*. By contrast, the *Leaky ReLU* has a coefficient *a* which is applied to negative input values. Therefore, for a positive *a*, the range of the *Leaky ReLU* function is the set of real numbers, whereas the *ReLU* counterpart only outputs non-negative real numbers.

The derivative of the *Leaky ReLU* function returns *a* for negative values and *1* for positive values. Note that the *Leaky ReLU* function is differentiable at all points except *0*.
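The sketch below illustrates these values on a small array. Unlike the EpyNN function, which reads the coefficient from the layer hyperparameters, this standalone copy takes `a` as an explicit argument; that signature is an assumption made for this example only. Note that `np.maximum(a * x, x)` matches the piecewise behavior only when *a* lies between *0* and *1*.

```python
import numpy as np

def lrelu(x, a=0.01, deriv=False):
    # 'a' is passed explicitly here (hypothetical signature for illustration)
    if not deriv:
        return np.maximum(a * x, x)
    return np.where(x > 0, 1, a)

x = np.array([-2.0, 0.0, 3.0])

out = lrelu(x, a=0.1)               # [-0.2  0.   3. ]
dout = lrelu(x, a=0.1, deriv=True)  # [0.1 0.1 1. ]

print(out)
print(dout)
```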

## Exponential Linear Unit

```
def elu(x, deriv=False):
    """Compute ELU activation or derivative.
    """
    # Retrieve alpha from layers hyperparameters (temporary globals)
    a = layer_hPars['ELU_alpha']

    if not deriv:
        x = np.where(x > 0, x, a * (np.exp(x, where=x<=0)-1))

    elif deriv:
        x = np.where(x > 0, 1, elu(x) + a)

    return x
```

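Given *M* and *U* the number of training examples and units in one layer, the ELU function can be defined, for each element *x* of an input of shape *(M, U)* and a coefficient *a*, as:

$$
f(x) =
\begin{cases}
x & \text{if } x > 0 \\
a \, (e^{x} - 1) & \text{otherwise}
\end{cases}
$$

Given *f* the same as above, the derivative of the ELU function can be defined as:

$$
f'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
f(x) + a & \text{otherwise}
\end{cases}
$$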

The *ELU* function is identical to the *ReLU* function for positive inputs. For negative inputs, the *ELU* function has a coefficient *a* which is applied to the difference between the exponential of the input and *1*. Because this difference belongs to *(-1, 0)* for negative inputs, the output of the *ELU* function for negative values belongs to *(-a, 0)*. For negative values, the *ELU* curve is not linear, in contrast to the *Leaky ReLU* and *ReLU* counterparts: it saturates smoothly toward *-a* as the input becomes more negative.

Another difference is that the *ELU* function is differentiable at all points including *0* when *a* is equal to *1*. For other values of *a*, the slope jumps from *a* to *1* at *0*.
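The slope of the *ELU* on either side of *0* can be probed numerically. The sketch below is a standalone variant, not the EpyNN code: it takes `a` as an explicit argument (an assumption for this example) and clips the input with `np.minimum` before the exponential instead of using the `where=` argument. It evaluates the derivative just left and just right of *0* for *a = 1* and *a = 0.5*.

```python
import numpy as np

def elu(x, a=1.0, deriv=False):
    # exp is evaluated on min(x, 0), so large positive inputs cannot overflow
    neg = a * (np.exp(np.minimum(x, 0)) - 1)
    if not deriv:
        return np.where(x > 0, x, neg)
    # For x <= 0 the derivative is a * exp(x), that is elu(x) + a
    return np.where(x > 0, 1.0, neg + a)

eps = 1e-6
slopes = {}
for a in (1.0, 0.5):
    left = float(elu(np.array(-eps), a=a, deriv=True))
    right = float(elu(np.array(eps), a=a, deriv=True))
    slopes[a] = (left, right)

# With a = 1 both sides are close to 1; with a = 0.5 the left slope is close to 0.5
print(slopes)
```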

## Sigmoid

```
def sigmoid(x, deriv=False):
    """Compute Sigmoid activation or derivative.
    """
    if not deriv:
        # Numerically stable version of sigmoid function
        x = np.where(
            x >= 0,                     # condition
            1 / (1+np.exp(-x)),         # For positive values
            np.exp(x) / (1+np.exp(x))   # For negative values
        )

    elif deriv:
        x = sigmoid(x) * (1-sigmoid(x))

    return x
```

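Given *M* and *U* the number of training examples and units in one layer, the Sigmoid function can be defined, for each element *x* of an input of shape *(M, U)*, as:

$$
f(x) = \frac{1}{1 + e^{-x}}
$$

Given *f* the same as above, the derivative of the Sigmoid function can be defined as:

$$
f'(x) = f(x) \, (1 - f(x))
$$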

The logistic *sigmoid* activation is often used in the output layer of a Neural Network because the output of the *sigmoid* function belongs to *[0, 1]*. Therefore, it is naturally well suited when the output to predict is a probability.

The *sigmoid* function is differentiable at all points. The derivative of the *sigmoid* function is equal to the product of the *sigmoid* function output and the difference between *1* and the *sigmoid* function output.

Note that to avoid computational *overflow* problems, the *sigmoid* implementation in EpyNN is a *numerically stable* version. It is a different expression of the same function, chosen so that the exponential is not fed a large positive value.
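The overflow issue can be reproduced with a naive implementation. The sketch below is one possible stable formulation, not the EpyNN code: because `np.where` evaluates both of its branches, this variant calls the exponential on `-np.abs(x)` only, so that no branch ever receives a large positive argument. The function names are chosen for this example.

```python
import numpy as np

def sigmoid_naive(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_stable(x):
    # exp is only ever called on non-positive values
    e = np.exp(-np.abs(x))
    return np.where(x >= 0, 1 / (1 + e), e / (1 + e))

x = np.array([-1000.0, 0.0, 1000.0])

# Turn floating-point overflow into an exception to make it visible
with np.errstate(over='raise'):
    try:
        sigmoid_naive(x)
        naive_overflowed = False
    except FloatingPointError:
        naive_overflowed = True

    stable = sigmoid_stable(x)  # does not raise

print(naive_overflowed)  # True
print(stable)            # [0.  0.5 1. ]
```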

## Hyperbolic tangent

```
def tanh(x, deriv=False):
    """Compute tanh activation or derivative.
    """
    if not deriv:
        x = (np.exp(x)-np.exp(-x)) / (np.exp(x)+np.exp(-x))

    elif deriv:
        x = 1 - tanh(x)**2

    return x
```

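Given *M* and *U* the number of training examples and units in one layer, the tanh function can be defined, for each element *x* of an input of shape *(M, U)*, as:

$$
f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

Given *f* the same as above, the derivative of the tanh function can be defined as:

$$
f'(x) = 1 - f(x)^{2}
$$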

The logistic *tanh* activation function is similar to the *sigmoid* function except that its outputs belong to *[-1, 1]*. This function is often used in the output layer of a Neural Network and is also the standard for the output activation in recurrent layers, such as the Recurrent Neural Network (RNN).

The *tanh* function is differentiable at all points and the outputs of its derivative belong to *[0, 1]*.
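The derivative expression can be checked against a central finite difference. In this sketch the forward pass uses NumPy's built-in `np.tanh` rather than the explicit exponentials, which is a substitution made for this example; the step size `h` and the sample points are arbitrary.

```python
import numpy as np

def tanh(x, deriv=False):
    if not deriv:
        return np.tanh(x)  # built-in, used here instead of explicit exponentials
    return 1 - np.tanh(x) ** 2

x = np.linspace(-3, 3, 7)
h = 1e-5

# Central finite difference approximation of the derivative
numeric = (tanh(x + h) - tanh(x - h)) / (2 * h)
analytic = tanh(x, deriv=True)

print(np.allclose(numeric, analytic, atol=1e-8))
print(analytic.min(), analytic.max())  # the maximum, 1, is reached at x = 0
```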

## Softmax

```
def softmax(x, deriv=False):
    """Compute softmax activation or derivative.

    For Dense layer only.

    For other layers, you can change element-wise matrix multiplication
    operator '*' by :func:`epynn.maths.hadamard` which handles
    the softmax derivative jacobian matrix.

    :param deriv: To compute derivative, defaults to False.
    :type deriv: bool, optional

    :return: Output array passed in function.
    :rtype: :class:`numpy.ndarray`
    """
    # Retrieve temperature from layers hyperparameters (temporary globals)
    T = layer_hPars['softmax_temperature']

    if not deriv:
        # Numerically stable version of softmax function
        x_safe = x - np.max(x, axis=1, keepdims=True)
        x_exp = np.exp(x_safe / T)
        x_sum = np.sum(x_exp, axis=1, keepdims=True)
        x = x_exp / x_sum

    elif deriv:
        x = np.array([np.diag(x) - np.outer(x, x) for x in softmax(x)])

    return x
```

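Given *M* and *U* the number of training examples and units in one layer, and *T* the temperature, the Softmax function can be defined, for a sample *m* and a unit *u* of an input *x* of shape *(M, U)*, as:

$$
f(x_{mu}) = \frac{e^{x_{mu} / T}}{\sum_{u'=1}^{U} e^{x_{mu'} / T}}
$$

Given *f* the same as above and *T = 1*, the derivative of the Softmax function can be defined, for a sample *m* and a pair of units *u* and *u'*, as:

$$
\frac{\partial f(x_{mu})}{\partial x_{mu'}} =
\begin{cases}
f(x_{mu}) \, (1 - f(x_{mu})) & \text{if } u = u' \\
-f(x_{mu}) \, f(x_{mu'}) & \text{otherwise}
\end{cases}
$$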

The *softmax* activation function generalizes the logistic function to several output units. It is very frequently used and mostly appropriate for the output layer of a Neural Network.

By contrast with the logistic *sigmoid* and *tanh*, which are applied to each unit independently, the *softmax* outputs a probability *distribution* over the units of a layer.

Given one sample *m* and a number of output units *U*, the sum of all output values is always equal to *1* when using the *softmax* activation in the output layer.

This is because the *softmax* normalizes outputs for each sample *(x_exp)* with respect to the sum of outputs *(x_sum)* for the same sample. Therefore, the output of the *softmax* always belongs to *[0, 1]* while the sum of the output for one sample is always equal to *1*.

Note that the *softmax* version implemented in EpyNN is *numerically stable*. It is a different expression of the same function. For each sample, the maximum input value is subtracted from all input values before passing through the exponential function.
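The effect of subtracting the per-sample maximum can be illustrated with two samples that differ only by a constant shift. The sketch below is a standalone copy of the forward pass in which the temperature `T` is an explicit argument (an assumption for this example); the input values are arbitrary.

```python
import numpy as np

def softmax(x, T=1.0):
    # Subtract the per-sample maximum so the largest exponent is 0
    x_safe = x - np.max(x, axis=1, keepdims=True)
    x_exp = np.exp(x_safe / T)
    return x_exp / np.sum(x_exp, axis=1, keepdims=True)

# The second sample is the first one shifted by 999
x = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1001.0, 1002.0]])

s = softmax(x)

print(s.sum(axis=1))            # each row sums to 1
print(np.allclose(s[0], s[1]))  # the shift does not change the output
```

Without the subtraction, `np.exp(1000.0)` would overflow to infinity and the second row would contain NaN values.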

The *softmax* function is differentiable at all points. The derivative of the *softmax* is not as simple as for the other logistic functions. This is because a given output value with respect to one sample depends on the other output values with respect to this same sample.

While the input shape is equal to the output shape for all derivatives seen above, this is not true for the *softmax*, for the reason given in the previous paragraph.

The output of the *softmax* derivative is called a *Jacobian* matrix. This is implemented efficiently as a one-liner in EpyNN and is illustrated using a pure iterative approach below.

```
# Assumes numpy imported as np and softmax() defined as above

# One sample, 10 output units
M = 1
U = 10

# Input array
x = np.random.standard_normal((M, U))

# Softmax output - shape (1, 10)
s = softmax(x)

# Initialize softmax derivative jacobian
ds = np.zeros((M, U, U))

for m in range(M):
    for u1 in range(U):
        for u2 in range(U):
            if u1 == u2:
                # This is the diagonal of the jacobian matrix
                u = u1
                ds[m, u, u] = s[m, u] * (1 - s[m, u])
            else:
                # Off-diagonal
                ds[m, u1, u2] = -s[m, u1] * s[m, u2]
```

In words, the derivative of the *softmax* function given a sample *m* and the number of units *U* is:

- A Jacobian matrix of shape *(1, U, U)*.
- On-diagonal points are equal to *softmax(x) * (1-softmax(x))* with respect to *u*.
- Off-diagonal points are equal to *-softmax(x) * softmax(x)* with respect to *u1* and *u2*.
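The description above can be cross-checked with a vectorized computation. The sketch below builds the Jacobian for all samples at once with `np.einsum` and compares it with explicit loops. The helper name `softmax_jacobian` is hypothetical, the temperature is taken as *1*, and this is not the one-liner used in EpyNN.

```python
import numpy as np

def softmax(x):
    x_exp = np.exp(x - np.max(x, axis=1, keepdims=True))
    return x_exp / np.sum(x_exp, axis=1, keepdims=True)

def softmax_jacobian(s):
    """Jacobian of shape (M, U, U) from softmax outputs s of shape (M, U)."""
    diag = np.einsum('mu,uv->muv', s, np.eye(s.shape[1]))  # s on the diagonal
    outer = np.einsum('mu,mv->muv', s, s)                  # per-sample outer product
    return diag - outer

M, U = 2, 4
rng = np.random.default_rng(0)
s = softmax(rng.standard_normal((M, U)))
ds = softmax_jacobian(s)

# Reference values computed with explicit loops
ref = np.zeros((M, U, U))
for m in range(M):
    for u1 in range(U):
        for u2 in range(U):
            if u1 == u2:
                ref[m, u1, u2] = s[m, u1] * (1 - s[m, u1])
            else:
                ref[m, u1, u2] = -s[m, u1] * s[m, u2]

print(np.allclose(ds, ref))
print(np.allclose(ds.sum(axis=2), 0))  # each row of a Jacobian sums to 0
```

Each row sums to zero because the outputs for one sample always sum to *1*, so any increase in one output is compensated by the others.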