.. EpyNN documentation master file, created by
   sphinx-quickstart on Tue Jul 6 18:46:11 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. toctree::

Recurrent Neural Network (RNN)
===============================

Source files in ``EpyNN/epynn/rnn/``.

See `Appendix - Notations`_ for mathematical conventions.

.. _Appendix - Notations: glossary.html#notations

Layer architecture
------------------------------

.. image:: _static/RNN/rnn0-01.svg
    :alt: RNN

A Recurrent Neural Network or *RNN* layer is an object containing a number of *units* - sometimes referred to as cells - and provided with functions for parameters *initialization* and non-linear *activation* of the so-called hidden state *h*.

.. autoclass:: epynn.rnn.models.RNN
    :show-inheritance:

Shapes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.rnn.models.RNN.compute_shapes

.. literalinclude:: ./../epynn/rnn/parameters.py
    :pyobject: rnn_compute_shapes
    :language: python

Within a *RNN* layer, shapes of interest include:

* Input *X* of shape *(m, s, e)* with *m* the number of samples, *s* the number of steps in the sequence and *e* the number of elements within each step of the sequence.
* Weights *U* and *V* of shape *(e, u)* and *(u, u)*, respectively, with *e* the number of elements within each step of the sequence and *u* the number of units in the layer.
* Bias *b* of shape *(1, u)* with *u* the number of units in the layer.
* Hidden state *h* of shape *(m, 1, u)* or *(m, u)* with *m* the number of samples and *u* the number of units in the layer. Because one hidden state is computed for each step in the sequence, the array containing all hidden states with respect to sequence steps has shape *(m, s, u)* with *s* the number of steps in the sequence.

Note that:

* The shapes of parameters *U*, *V* and *b* are independent of the number of samples *m* and of the number of steps in the sequence *s*.
* Recurrent layers, including the *RNN* layer, are considered appropriate to handle inputs of variable length because the definition of parameters is independent of the input length *s*.

.. image:: _static/RNN/rnn1-01.svg

Forward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.rnn.models.RNN.forward

.. literalinclude:: ./../epynn/rnn/forward.py
    :pyobject: rnn_forward

.. image:: _static/RNN/rnn2-01.svg

The forward propagation function in a *RNN* layer *k* includes:

* (1): Input *X* in current layer *k* is equal to the output *A* of previous layer *k-1*. The initial hidden state *h* is a zero array.
* (2s): For each step, the input *X* of the current iteration is retrieved by indexing the layer input of shape *(m, s, e)* to obtain the input for the step, of shape *(m, e)*.
* (3s): The previous hidden state *hp* is retrieved at the beginning of each iteration in the sequence from the hidden state *h* computed at the end of the previous iteration (4.2s).
* (4.1s): The hidden state linear activation product *h\_* is equal to the sum of the dot products between (*X* of shape *(m, e)*, *U* of shape *(e, u)*) and (*hp* of shape *(m, u)*, *V* of shape *(u, u)*), to which the bias *b* is added.
* (4.2s): The hidden state non-linear activation product *h* is computed by applying a non-linear *activation* function on *h\_*.

Note that:

* The non-linear activation function for *h* is generally the *tanh* function. While it can technically be any function, caution is advised when using a function other than *tanh*.
* The hidden state *h* for one step has shape *(m, u)*, while it has shape *(m, 1)* for one step and one unit. The hidden state *h* is fed back with respect to each unit for each step in the sequence; this is the basis for the internal *memory* of recurrent layers.
* The concatenated array of hidden states *h* has shape *(m, s, u)*. By default, the *RNN* layer returns the hidden state corresponding to the last step in the input sequence, with shape *(m, u)*. If the *sequences* argument is set to *True* when instantiating the *RNN* layer, then it returns the whole array of hidden states, with shape *(m, s, u)*.
* For the sake of code homogeneity, the output of the *RNN* layer is *A*, which is equal to *h*.

Then:

.. math::

    \begin{alignat*}{2}
    & x^{k}_{mse} &&= a^{\km}_{mse} \tag{1} \\
    \\
    & x^{k~<s>}_{me} &&= x^{k}_{mse}[:, s] \tag{2s} \\
    \\
    & h^{k~<\sm>}_{mu} &&= hp^{k}_{msu}[:, s] \tag{3s} \\
    \\
    & h\_^{k~<s>}_{mu} &&= x^{k~<s>}_{me} \cdot U^{k}_{eu} \\
    & &&+ h^{k~<\sm>}_{mu} \cdot V^{k}_{uu} \\
    & &&+ b^{k}_{u} \tag{4.1s} \\
    & h^{k~<s>}_{mu} &&= h_{act}(h\_^{k~<s>}_{mu}) \tag{4.2s} \\
    \end{alignat*}
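For readers who prefer a standalone illustration, below is a minimal NumPy sketch of the forward recurrence described above. It assumes the *tanh* activation and arbitrary dimensions; variable names mirror the notations on this page, but the code is a simplified sketch, not the EpyNN implementation.

.. code-block:: python

    import numpy as np

    # Minimal sketch of the RNN forward pass - not the EpyNN implementation.
    # Arbitrary dimensions: m samples, s steps, e elements per step, u units.
    m, s, e, u = 2, 5, 3, 4

    X = np.random.standard_normal((m, s, e))   # Layer input of shape (m, s, e)

    U = np.random.standard_normal((e, u))      # Input-to-hidden weight of shape (e, u)
    V = np.random.standard_normal((u, u))      # Hidden-to-hidden weight of shape (u, u)
    b = np.zeros((1, u))                       # Bias of shape (1, u)

    h = np.zeros((m, u))                       # (1) Initial hidden state is a zero array
    hs = []                                    # Hidden states for every step

    for t in range(s):
        Xt = X[:, t]                           # (2s) Input for step, shape (m, e)
        hp = h                                 # (3s) Previous hidden state, shape (m, u)
        h_ = Xt @ U + hp @ V + b               # (4.1s) Linear activation product, shape (m, u)
        h = np.tanh(h_)                        # (4.2s) Non-linear activation (tanh assumed)
        hs.append(h)

    A = np.stack(hs, axis=1)                   # All hidden states, shape (m, s, u)

    print(A.shape)                             # (2, 5, 4) - returned when sequences=True
    print(A[:, -1].shape)                      # (2, 4)    - last step, returned by default

The last two lines reflect the behaviour described above: the layer output is the last hidden state by default, or the whole array of hidden states when *sequences* is set to *True*.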
Backward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.rnn.models.RNN.backward

.. literalinclude:: ./../epynn/rnn/backward.py
    :pyobject: rnn_backward

.. image:: _static/RNN/rnn3-01.svg

The backward propagation function in a *RNN* layer *k* includes:

* (1): *dA* is the gradient of the loss with respect to the output of forward propagation *A* for current layer *k*. It is equal to the gradient of the loss with respect to the input of forward propagation for next layer *k+1*. The initial gradient for the hidden state *dh* is a zero array.
* (2s): For each step in the reversed sequence, the input *dA* of the current iteration is retrieved by indexing the input of shape *(m, s, u)* to obtain the input for the step, of shape *(m, u)*.
* (3s): The next gradient of the loss with respect to the hidden state *dhn* is retrieved at the beginning of each iteration from the counterpart *dh* computed at the end of the previous iteration (5s).
* (4s): *dh\_* is the gradient of the loss with respect to *h\_*. It is computed by element-wise multiplication between (the sum of *dA* and *dhn*) and the derivative of the non-linear *activation* function applied on *h\_*.
* (5s): *dh* is the gradient of the loss with respect to *h* for the current step. It is computed by a dot product operation between *dh\_* and the transpose of *V*.
* (6s): The gradient of the loss *dX* with respect to the input of forward propagation *X* for the current step and current layer *k* is computed by a dot product operation between *dh\_* and the transpose of *U*.

Note that:

* By contrast with the forward propagation, we proceed by iterating over the *reversed sequence*.
* In the default *RNN* configuration with *sequences* set to *False*, the output of forward propagation has shape *(m, u)* and so does the input of backward propagation. In the function :py:func:`epynn.rnn.backward.initialize_backward` this is converted into a zero array of shape *(m, s, u)* whose coordinates *[:, -1, :]* are set equal to *dA* of shape *(m, u)*.

Then:

.. math::

    \begin{alignat*}{2}
    & \delta^{\kp}_{msu} &&= \frac{\partial \mathcal{L}}{\partial a^{k}_{msu}} = \frac{\partial \mathcal{L}}{\partial x^{\kp}_{msu}} \tag{1} \\
    \\
    & \delta^{\kp{}}_{mu} &&= \delta^{\kp}_{msu}[:, s] \tag{2s} \\
    \\
    & \frac{\partial \mathcal{L}}{\partial h^{k~<\sp>}_{mu}} &&= \frac{\partial \mathcal{L}}{\partial hn^{k}_{msu}}[:, s] \tag{3s} \\
    \\
    & \frac{\partial \mathcal{L}}{\partial h\_^{k~<s>}_{mu}} &&= (\delta^{\kp{}}_{mu} + \frac{\partial \mathcal{L}}{\partial h^{k~<\sp>}_{mu}}) \\
    & &&* h_{act}'(h\_^{k~<s>}_{mu}) \tag{4s} \\
    & \frac{\partial \mathcal{L}}{\partial h^{k~<s>}_{mu}} &&= \frac{\partial \mathcal{L}}{\partial h\_^{k~<s>}_{mu}} \cdot V^{k~{\intercal}}_{uu} \tag{5s} \\
    \\
    & \delta^{k~<s>}_{me} &&= \frac{\partial \mathcal{L}}{\partial x^{k~<s>}_{me}} = \frac{\partial \mathcal{L}}{\partial a^{\km~<s>}_{me}} = \frac{\partial \mathcal{L}}{\partial h\_^{k~<s>}_{mu}} \cdot U^{k~{\intercal}}_{eu} \tag{6s} \\
    \end{alignat*}

Gradients
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.rnn.models.RNN.compute_gradients

.. literalinclude:: ./../epynn/rnn/parameters.py
    :pyobject: rnn_compute_gradients

The function to compute parameters gradients in a *RNN* layer *k* includes:

* (1.1): *dU* is the gradient of the loss with respect to *U*. It is computed as a sum with respect to step *s* of the dot products between the transpose of *X* and *dh\_*.
* (1.2): *dV* is the gradient of the loss with respect to *V*. It is computed as a sum with respect to step *s* of the dot products between the transpose of *hp* and *dh\_*.
* (1.3): *db* is the gradient of the loss with respect to *b*. It is computed as a sum with respect to step *s* of the sum of *dh\_* along the axis corresponding to the number of samples *m*.

Note that:

* What may make things more difficult here is the extra dimension corresponding to the sequence length. For a sequence of length one, the *"sum with respect to step s"* would be unnecessary; it is required in the general situation where the sequence contains more than one step.
* Also note that using recurrent layers with one-step sequences is generally not appropriate, because the internal memory of recurrent units across the sequence steps is then left unused.

.. math::

    \begin{alignat*}{2}
    & \dLp{U}{eu} &&= \sumS \vT{x}{s}{me} \cdot \dL{h\_}{s}{mu} \tag{1.1} \\
    & \dLp{V}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{h\_}{s}{mu} \tag{1.2} \\
    & \dLp{b}{u} &&= \sumS \sumM \dL{h\_}{s}{mu} \tag{1.3} \\
    \end{alignat*}
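To make the backward pass and the parameter gradients concrete, here is a self-contained NumPy sketch in the spirit of the forward sketch above. It assumes the *tanh* activation (whose derivative is *1 - h²*) and the default configuration where only the last hidden state is returned; it follows the notations on this page but is not the EpyNN implementation.

.. code-block:: python

    import numpy as np

    # Minimal sketch of the RNN backward pass and parameter gradients -
    # not the EpyNN implementation. tanh activation assumed.
    m, s, e, u = 2, 5, 3, 4

    # Forward pass (as in the sketch above), caching hp and h for every step.
    X = np.random.standard_normal((m, s, e))
    U = np.random.standard_normal((e, u))
    V = np.random.standard_normal((u, u))
    b = np.zeros((1, u))

    h = np.zeros((m, u))
    hp_cache, h_cache = [], []

    for t in range(s):
        hp_cache.append(h)                       # Previous hidden state hp
        h = np.tanh(X[:, t] @ U + h @ V + b)     # Hidden state h
        h_cache.append(h)

    # Default configuration (sequences=False): the incoming gradient dA has
    # shape (m, u) and applies to the last step of the sequence only.
    dA_last = np.random.standard_normal((m, u))

    dX = np.zeros_like(X)                        # Gradient w.r.t. the layer input X
    dU = np.zeros_like(U)                        # (1.1) Accumulated over steps
    dV = np.zeros_like(V)                        # (1.2) Accumulated over steps
    db = np.zeros_like(b)                        # (1.3) Accumulated over steps

    dh = np.zeros((m, u))                        # (1) Initial gradient for h is a zero array
    for t in reversed(range(s)):                 # Iterate over the reversed sequence
        dA = dA_last if t == s - 1 else np.zeros((m, u))  # (2s) Per-step incoming gradient
        dhn = dh                                 # (3s) Gradient from the next step
        dh_ = (dA + dhn) * (1 - h_cache[t] ** 2) # (4s) Through tanh: h_act'(h_) = 1 - h**2
        dh = dh_ @ V.T                           # (5s)
        dX[:, t] = dh_ @ U.T                     # (6s)

        dU += X[:, t].T @ dh_                    # (1.1) Sum over steps of X^T . dh_
        dV += hp_cache[t].T @ dh_                # (1.2) Sum over steps of hp^T . dh_
        db += dh_.sum(axis=0, keepdims=True)     # (1.3) Sum over steps and samples

    print(dX.shape, dU.shape, dV.shape, db.shape)   # (2, 5, 3) (3, 4) (4, 4) (1, 4)

The loop combines the two functions shown above: ``rnn_backward`` produces the per-step *dh\_* and *dX*, while ``rnn_compute_gradients`` accumulates *dU*, *dV* and *db* over the sequence steps.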
Live examples
------------------------------

* `Dummy string - RNN-Dense`_
* `Dummy time - RNN-Dense`_
* `Dummy time - RNN-Dense with SGD`_
* `Author music - RNN(sequences=True)-(Dense)n with Dropout`_

You may also like to browse all `Network training examples`_ provided with EpyNN.

.. _Network training examples: run_examples.html
.. _Dummy string - RNN-Dense: epynnlive/dummy_string/train.html#RNN-Dense
.. _Dummy time - RNN-Dense: epynnlive/dummy_time/train.html#RNN-Dense
.. _Dummy time - RNN-Dense with SGD: epynnlive/dummy_time/train.html#RNN-Dense-with-SGD
.. _Author music - RNN(sequences\=True)-(Dense)n with Dropout: epynnlive/author_music/train.html#RNN(sequences=True)-Flatten-(Dense)n-with-Dropout