.. EpyNN documentation master file, created by
   sphinx-quickstart on Tue Jul 6 18:46:11 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. toctree::

Activation - Functions
===============================

Source code in ``EpyNN/epynn/commons/maths.py``.

See `Appendix - Notations`_ for mathematical conventions.

.. _Appendix - Notations: glossary.html#notations

In the context of a given *trainable* layer within a Neural Network, an activation function can be defined as:

* Any **differentiable** - non-linear - function that takes a weighted sum of features with respect to samples and returns an output with respect to samples.
* The same activation function is assigned to each unit within a given layer.
* The activation function of the **output** layer should be a **logistic** function, including softmax.

Note that:

* For a given layer, several activations of distinct variables can take place. For instance, the `Gated Recurrent Unit (GRU)`_ has three distinct activation functions for three distinct variables. Given the activation of one variable, the same activation function is assigned to all units within the layer. Given the activation of another variable, the same logic applies.
* The activation function for the output layer should be chosen consistently with the `Loss - Functions`_. For instance, the *tanh* activation outputs belong to [-1, 1]. Since the `Binary Cross-Entropy`_ loss function contains *natural logarithm* terms, it cannot be fed with negative values. While this can easily be handled with a specific procedure, it is not implemented in EpyNN because we think it is important to be aware of such relationships.
* Activation functions can be modified or implemented from :py:mod:`epynn.commons.maths`. Also, in the mathematical definitions below, **derivatives may contain terms referring to their primitive** (e.g., sigmoid). While such notation is widespread, it may be confusing for some people. We use it because it mirrors the Python code, in which this writing style is convenient.

.. _Gated Recurrent Unit (GRU): GRU.html
.. _Loss - Functions: loss.html
.. _Binary Cross-Entropy: loss.html#binary-cross-entropy


Identity
-------------------------------

**The identity function is not appropriate for backpropagation - This is implemented for testing purposes.**

.. literalinclude:: ../epynn/commons/maths.py
    :language: python
    :pyobject: identity
    :lines: 1-2,14-25

Given *M* and *U* the number of training examples and units in one layer, the identity function can be defined as:

.. math::

    \begin{alignat*}{2}
    f:\mathcal{M}_{M,U}(\mathbb{R}) & \to \mathcal{M}_{M,U}(\mathbb{R}) \\
    X & \to X \\
    \end{alignat*}

The derivative of the identity function can be defined as:

.. math::

    \begin{alignat*}{2}
    f':\mathcal{M}_{M,U}(\mathbb{R}) & \to \mathcal{M}_{M,U}(\{1\}) \\
    X & \to 1 \\
    \end{alignat*}

.. image:: _static/activation/Identity.svg

Because the *identity* function is linear - not non-linear - it is not appropriate for the backpropagation process: the output of the *identity* function derivative is always *1*, regardless of the input.
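As a minimal illustration of the definitions above, the sketch below applies the identity activation and its derivative element-wise to an array of shape *(M, U)*. It mirrors the mathematical definitions rather than the EpyNN implementation, and the ``deriv`` flag is only a convention chosen for this sketch.

.. code-block:: python

    import numpy as np

    def identity_activation(x, deriv=False):
        """Identity activation - returns x, or an array of ones for the derivative."""
        if deriv:
            return np.ones_like(x)  # f'(X) = 1 for every element
        return x                    # f(X) = X

    # M training examples, U units in the layer
    M, U = 4, 3

    # Weighted sum of features with respect to samples - shape (M, U)
    Z = np.random.standard_normal((M, U))

    A = identity_activation(Z)                # forward pass - A equals Z
    dA = identity_activation(Z, deriv=True)   # backward pass - array of ones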
Rectifier Linear Unit
-------------------------------

.. literalinclude:: ../epynn/commons/maths.py
    :language: python
    :pyobject: relu
    :lines: 1-2,12-25

Given *M* and *U* the number of training examples and units in one layer, the ReLU function can be defined as:

.. math::

    \begin{alignat*}{2}
    f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R_{+}}) \\
    X &\to &&
    \begin{cases}
    X, & X > 0 \\
    0, & X \le 0
    \end{cases} \\
    \end{alignat*}

The derivative of the ReLU function can be defined as:

.. math::

    \begin{alignat*}{2}
    f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\{0,1\}) \\
    X &\to &&
    \begin{cases}
    1, & X > 0 \\
    0, & X < 0
    \end{cases} \\
    \end{alignat*}

.. image:: _static/activation/ReLU.svg

The *ReLU* function is a popular activation function, mostly because its computation time is lower compared to other functions. The *ReLU* function is essentially an *identity* function for positive values, while it returns zero for negative values. Its derivative is equally simple: it outputs *1* for positive values and *0* for negative values. Note that the *ReLU* function is differentiable at all points except *0*.


Leaky Rectifier Linear Unit
-------------------------------

.. literalinclude:: ../epynn/commons/maths.py
    :language: python
    :pyobject: lrelu
    :lines: 1-2,12-25

Given *M* and *U* the number of training examples and units in one layer, the LReLU function can be defined as:

.. math::

    \begin{alignat*}{2}
    f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R}) \\
    X &\to &&
    \begin{cases}
    X, & X > 0 \\
    a * X, & X \le 0
    \end{cases} \\
    with~a \in \mathbb{R}^*
    \end{alignat*}

The derivative of the LReLU function can be defined as:

.. math::

    \begin{alignat*}{2}
    f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\{a, 1\}) \\
    X &\to &&
    \begin{cases}
    1, & X > 0 \\
    a, & X < 0
    \end{cases} \\
    with~a \in \mathbb{R}^*
    \end{alignat*}

.. image:: _static/activation/LReLU.svg

The *Leaky ReLU* is a variant of the *ReLU* function. Any negative input value passed through a *ReLU* function yields *0*. By contrast, the *Leaky ReLU* applies a coefficient *a* to negative input values. Therefore, the co-domain of the *Leaky ReLU* function is the set of real numbers, instead of the set of positive real numbers for the *ReLU* counterpart. The derivative of the *Leaky ReLU* function returns *a* for negative values and *1* for positive values. Note that the *Leaky ReLU* function is differentiable at all points except *0*.
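The sketch below mirrors the ReLU and Leaky ReLU definitions above using plain NumPy. It illustrates the mathematics rather than the EpyNN implementation; the coefficient value ``a = 0.3`` is an arbitrary choice for the example, and the values returned at exactly *0* by the derivative helpers are a convention of this sketch, since both derivatives are undefined there.

.. code-block:: python

    import numpy as np

    def relu(x):
        """ReLU - identity for positive inputs, zero otherwise."""
        return np.maximum(0, x)

    def drelu(x):
        """ReLU derivative - 1 for positive inputs, 0 otherwise."""
        return np.where(x > 0, 1.0, 0.0)

    def lrelu(x, a=0.3):
        """Leaky ReLU - identity for positive inputs, a * x otherwise."""
        return np.where(x > 0, x, a * x)

    def dlrelu(x, a=0.3):
        """Leaky ReLU derivative - 1 for positive inputs, a otherwise."""
        return np.where(x > 0, 1.0, a)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

    print(relu(x))    # negative inputs map to 0, positive inputs are unchanged
    print(lrelu(x))   # negative inputs are scaled by a, positive inputs are unchanged
    print(drelu(x))   # 0 for negative inputs, 1 for positive inputs
    print(dlrelu(x))  # a for negative inputs, 1 for positive inputs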
Exponential Linear Unit
-------------------------------

.. literalinclude:: ../epynn/commons/maths.py
    :language: python
    :pyobject: elu
    :lines: 1-2,12-25

Given *M* and *U* the number of training examples and units in one layer, the ELU function can be defined as:

.. math::

    \begin{alignat*}{2}
    f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([-a, \infty)) \\
    X &\to &&
    \begin{cases}
    X, & X > 0 \\
    a * (e^{X} - 1), & X \le 0
    \end{cases} \\
    with~a \in \mathbb{R}^*
    \end{alignat*}

Given *f* the same as above, the derivative of the ELU function can be defined as:

.. math::

    \begin{alignat*}{2}
    f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([0, 1]) \\
    X &\to &&
    \begin{cases}
    1, & X > 0 \\
    f(X) + a, & X \le 0
    \end{cases} \\
    with~a \in \mathbb{R}^*
    \end{alignat*}

.. image:: _static/activation/ELU.svg

The *ELU* function is identical to the *ReLU* function for positive inputs. For negative inputs, the *ELU* function applies a coefficient *a* to the difference between the exponential of the input and *1*. Because this difference always belongs to *[-1, 0]*, the output of the *ELU* function for negative values belongs to *[-a, 0]*. For negative values, the *ELU* curve is not linear, in contrast to the *Leaky ReLU* and *ReLU* counterparts: it saturates smoothly toward *-a* as inputs become more negative. Another difference is that the *ELU* function is differentiable at all points, including *0* when *a* equals *1*.


Sigmoid
-------------------------------

.. literalinclude:: ../epynn/commons/maths.py
    :language: python
    :pyobject: sigmoid
    :lines: 1-2,12-25

Given *M* and *U* the number of training examples and units in one layer, the Sigmoid function can be defined as:

.. math::

    \begin{alignat*}{2}
    f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([0, 1]) \\
    X &\to && \frac{1}{1+e^{-X}}
    \end{alignat*}

Given *f* the same as above, the derivative of the Sigmoid function can be defined as:

.. math::

    \begin{alignat*}{2}
    f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([0, 0.25]) \\
    X &\to && f(X) * (1-f(X))
    \end{alignat*}

.. image:: _static/activation/Sigmoid.svg

The logistic *sigmoid* activation is often used in the output layer of a Neural Network because the output of the *sigmoid* function belongs to *[0, 1]*. Therefore, it is naturally well suited when the output to predict is a probability. The *sigmoid* function is differentiable at all points. The derivative of the *sigmoid* function is equal to the product of the *sigmoid* function output by the difference between *1* and the *sigmoid* function output.

Note that, to avoid computational *overflow* problems, the *sigmoid* implementation in EpyNN is a *numerically stable* version. It is a different expression of the same function which avoids feeding the exponential with a large positive value.


Hyperbolic tangent
-------------------------------

.. literalinclude:: ../epynn/commons/maths.py
    :language: python
    :pyobject: tanh
    :lines: 1-2,12-25

Given *M* and *U* the number of training examples and units in one layer, the tanh function can be defined as:

.. math::

    \begin{alignat*}{2}
    f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([-1, 1]) \\
    X &\to && \frac{e^{X}-e^{-X}}{e^{X}+e^{-X}}
    \end{alignat*}

Given *f* the same as above, the derivative of the tanh function can be defined as:

.. math::

    \begin{alignat*}{2}
    f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([0, 1]) \\
    X &\to && 1 - f(X)^2
    \end{alignat*}

.. image:: _static/activation/tanh.svg

The logistic *tanh* activation function is similar to the *sigmoid* function, except that its outputs belong to *[-1, 1]*. This function is often used in the output layer of a Neural Network and is also the standard for the output activation in recurrent layers, such as the `Recurrent Neural Network (RNN)`_. The *tanh* function is differentiable at all points and the outputs of its derivative belong to *[0, 1]*.

.. _Recurrent Neural Network (RNN): RNN.html
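As an illustration of the numerical stability note above, the sketch below shows one common way of writing a stable *sigmoid*: both branches are algebraically equal to *1 / (1 + e^(-x))*, but neither ever feeds a large positive value to the exponential. It illustrates the idea rather than reproducing the EpyNN implementation.

.. code-block:: python

    import numpy as np

    def sigmoid_stable(x):
        """Numerically stable sigmoid - two equivalent expressions of the same function."""
        e = np.exp(-np.abs(x))          # exponent is always <= 0, so it never overflows
        return np.where(x >= 0,
                        1 / (1 + e),    # x >= 0: 1 / (1 + e^(-x))
                        e / (1 + e))    # x <  0: e^(x) / (1 + e^(x))

    x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])

    s = sigmoid_stable(x)    # no overflow, even for x = 1000
    ds = s * (1 - s)         # derivative from the primitive - belongs to [0, 0.25]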
Softmax
-------------------------------

.. literalinclude:: ../epynn/commons/maths.py
    :language: python
    :pyobject: softmax
    :lines: 1-8,12-35

Given *M* and *U* the number of training examples and units in one layer, the Softmax function can be defined as:

.. math::

    \begin{alignat*}{2}
    f:\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}([0, 1]) \\
    X = \mathop{(x_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && \frac{e^{x_{mu} / T}}{\sum\limits_{u' = 1}^U e^{x_{mu'} / T}}
    \end{alignat*}

Given *f* the same as above, the derivative of the Softmax function can be defined as:

.. math::

    \begin{alignat*}{2}
    f':\mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U,U}([-1, 1]) \\
    X = \mathop{(x_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && Y = \mathop{(y_{muu'})}_{\substack{1 \le m \le M \\ 1 \le u \le U \\ 1 \le u'\le U}} \\
    \end{alignat*}

.. math::

    \begin{alignat*}{2}
    & with~\forall m && \in \{1,..,M\} \\
    & && \forall u, u' \in \{1,..,U\} \\
    & && y_{muu'} =
    \begin{cases}
    f(x_{m})_{u} * (1-f(x_{m})_{u'}), & u = u' \\
    -f(x_{m})_{u} * f(x_{m})_{u'}, & u \ne u'
    \end{cases} \\
    \end{alignat*}

The *softmax* activation function is a special case of logistic function. It is very frequently used and mostly appropriate for the output layer of a Neural Network. By contrast with the logistic *sigmoid* and *tanh*, the *softmax* does not output a probability *mass* but a probability *distribution*: given one sample *m* and a number of output units *U*, the sum of all output values is always equal to *1* when using the *softmax* activation in the output layer. This is because the *softmax* normalizes the exponential of the outputs for each sample *(x_exp)* by the sum of these exponentials *(x_sum)* for the same sample. Therefore, the output of the *softmax* always belongs to *[0, 1]*, while the sum of the outputs for one sample is always equal to *1*.

Note that the *softmax* version implemented in EpyNN is *numerically stable*. It is a different expression of the same function: for each sample, the maximum input value is subtracted from all input values before passing through the exponential function.

The *softmax* function is differentiable at all points. The derivative of the *softmax* is not as simple as for the other logistic functions. This is because a given output value with respect to one sample depends on the other output values with respect to this same sample. While the input shape is equal to the output shape for all derivatives seen above, this is not true for the *softmax*, because of this dependence. The output of the *softmax* derivative is called a *Jacobian* matrix. This is implemented efficiently as a one-liner in EpyNN and is illustrated using a pure iterative approach below.

.. code-block:: python

    import numpy as np

    from epynn.commons.maths import softmax

    # One sample, 10 output units
    M = 1
    U = 10

    # Input array
    x = np.random.standard_normal((M, U))

    # Softmax output - shape (1, 10)
    s = softmax(x)

    # Initialize softmax derivative Jacobian
    ds = np.zeros((M, U, U))

    for m in range(M):

        for u1 in range(U):

            for u2 in range(U):

                if u1 == u2:
                    # This is the diagonal of the Jacobian matrix - u1 equals u2
                    ds[m, u1, u1] = s[m, u1] * (1 - s[m, u1])

                elif u1 != u2:
                    # Off-diagonal
                    ds[m, u1, u2] = -s[m, u1] * s[m, u2]

In words, the derivative of the *softmax* function given a sample *m* and the number of units *U* is:

* A Jacobian matrix of shape *(1, U, U)*.
* On-diagonal points are equal to *softmax(x) \* (1-softmax(x))* with respect to *u*.
* Off-diagonal points are equal to *-softmax(x) \* softmax(x)* with respect to *u1* and *u2*.
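As a cross-check of the iterative construction above, the Jacobian can also be built in a vectorized way: for each sample, it is equal to *diag(s) - s s^T*, where *s* is the softmax output for that sample. The snippet below continues from the previous code block and reuses its variables ``s``, ``ds`` and ``U``; it is a sketch of the idea, not the one-liner used in EpyNN.

.. code-block:: python

    # Vectorized Jacobian: ds_vec[m, u, u'] = s[m, u] * (delta(u, u') - s[m, u'])
    ds_vec = (np.einsum('mu,uv->muv', s, np.eye(U))
              - np.einsum('mu,mv->muv', s, s))

    # Matches the iterative construction above
    assert np.allclose(ds, ds_vec)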