.. EpyNN documentation master file, created by sphinx-quickstart on Tue Jul 6 18:46:11 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. toctree::

Gated Recurrent Unit (GRU)
===============================

Source files in ``EpyNN/epynn/gru/``.

See `Appendix - Notations`_ for mathematical conventions.

.. _Appendix - Notations: glossary.html#notations

Layer architecture
------------------------------

.. image:: _static/GRU/gru0-01.svg
   :alt: GRU

A Gated Recurrent Unit or *GRU* layer is an object containing a number of *units* - sometimes referred to as cells - and provided with functions for parameters *initialization* and non-linear *activation* of the so-called hidden hat *hh*. The hidden hat *hh* is an intermediate variable used to compute the hidden state *h*, which is also computed from *gate* products, namely the *reset* gate product *r* and the *update* gate product *z*. Each of these products requires a non-linear activation function to be computed.

.. autoclass:: epynn.gru.models.GRU
   :show-inheritance:

Shapes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.gru.models.GRU.compute_shapes

.. literalinclude:: ./../epynn/gru/parameters.py
   :pyobject: gru_compute_shapes
   :language: python

Within a *GRU* layer, shapes of interest include:

* Input *X* of shape *(m, s, e)* with *m* equal to the number of samples, *s* the number of steps in the sequence and *e* the number of elements within each step of the sequence.
* Weights *U* and *V* of shape *(e, u)* and *(u, u)*, respectively, with *e* the number of elements within each step of the sequence and *u* the number of units in the layer.
* Bias *b* of shape *(1, u)* with *u* the number of units in the layer.
* Hidden state *h* of shape *(m, 1, u)* or *(m, u)* with *m* equal to the number of samples and *u* the number of units in the layer. Because one hidden state *h* is computed for each step in the sequence, the array containing all hidden states with respect to sequence steps has shape *(m, s, u)* with *s* the number of steps in the sequence.

Note that:

* The shape of parameters *V*, *U* and *b* is independent of the number of samples *m* and of the number of steps *s* in the sequence.
* There are three sets of parameters *{V, U, b}*, one for each activation: *{Vr, Ur, br}* for the reset gate, *{Vz, Uz, bz}* for the update gate and *{Vhh, Uhh, bhh}* for the activation of the hidden hat *hh*.
* Recurrent layers, including the *GRU* layer, are well suited to inputs of variable length because the definition of parameters is independent of the sequence length *s*.

.. image:: _static/GRU/gru1-01.svg
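The shapes above can be illustrated with a few lines of plain NumPy. This is a minimal sketch with arbitrary, hypothetical dimensions *m*, *s*, *e* and *u*; it is independent of the EpyNN implementation and only shows the arrays discussed in this section.

.. code-block:: python

    import numpy as np

    # Hypothetical dimensions: m samples, s steps, e elements per step, u units.
    m, s, e, u = 32, 10, 8, 16

    X = np.zeros((m, s, e))                              # layer input
    Ur, Uz, Uhh = (np.zeros((e, u)) for _ in range(3))   # input-to-unit weights
    Vr, Vz, Vhh = (np.zeros((u, u)) for _ in range(3))   # hidden-to-unit weights
    br, bz, bhh = (np.zeros((1, u)) for _ in range(3))   # biases

    h = np.zeros((m, s, u))                              # hidden states for all steps

    print(X.shape, Ur.shape, Vr.shape, br.shape, h.shape)
    # (32, 10, 8) (8, 16) (16, 16) (1, 16) (32, 10, 16)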
Forward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.gru.models.GRU.forward

.. literalinclude:: ./../epynn/gru/forward.py
   :pyobject: gru_forward
   :language: python

The forward propagation function in a *GRU* layer *k* includes:

* (1): Input *X* in current layer *k* is equal to the output *A* of previous layer *k-1*. The initial hidden state *h* is a zero array.
* (2s): For each step, the input *X* of the current iteration is retrieved by indexing the layer input of shape *(m, s, e)* to obtain the input for the step, of shape *(m, e)*.
* (3s): The previous hidden state *hp* is retrieved at the beginning of each iteration in the sequence from the hidden state *h* computed at the end of the previous iteration (7s).
* (4s): The reset gate linear product *r\_* is computed as the sum of the dot products between *X* and *Ur* and between *hp* and *Vr*, to which *br* is added. The non-linear product *r* is computed by applying the *activate_reset* function on *r\_*.
* (5s): The update gate linear product *z\_* is computed as the sum of the dot products between *X* and *Uz* and between *hp* and *Vz*, to which *bz* is added. The non-linear product *z* is computed by applying the *activate_update* function on *z\_*.
* (6s): The hidden hat linear product *hh\_* is computed as the sum of the dot products between *X* and *Uhh* and between *(r \* hp)* and *Vhh*, to which *bhh* is added. The non-linear product *hh* is computed by applying the *activate* function on *hh\_*.
* (7s): The hidden state *h* is the sum of the products between *z* and *hp* and between *(1-z)* and *hh* (see the NumPy sketch after the equations below).

Note that:

* The non-linear activation function for *hh* is generally the *tanh* function. While it can technically be any function, care should be taken when departing from *tanh*.
* The non-linear activation function for *r* and *z* is generally the *sigmoid* function. While it can technically be any function, care should be taken when departing from *sigmoid*.
* The concatenated array of hidden states *h* has shape *(m, s, u)*. By default, the *GRU* layer returns the hidden state corresponding to the last step in the input sequence, of shape *(m, u)*. If the *sequences* argument is set to *True* when instantiating the *GRU* layer, then it returns the whole array of hidden states, of shape *(m, s, u)*.
* For the sake of code homogeneity, the output of the *GRU* layer is *A*, which is equal to *h*.

.. image:: _static/GRU/gru2-01.svg

.. math::

    \begin{alignat*}{2}
    & x^{k}_{mse} &&= a^{\km}_{mse} \tag{1} \\ \\
    & x^{k~}_{me} &&= x^{k}_{mse}[:, s] \tag{2s} \\ \\
    & h^{k~<\sm>}_{mu} &&= hp^{k}_{msu}[:,s] \tag{3s} \\ \\
    & r\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Ur^{k}_{vu} \\
    & &&+ h^{k~<\sm>}_{mu} \cdot Vr^{k}_{uu} \\
    & &&+ br^{k}_{u} \tag{4.1s} \\
    & r^{k~}_{mu} &&= r_{act}(r\_^{k~}_{mu}) \tag{4.2s} \\ \\
    & z\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Uz^{k}_{vu} \\
    & &&+ h^{k~<\sm>}_{mu} \cdot Vz^{k}_{uu} \\
    & &&+ bz^{k}_{u} \tag{5.1s} \\
    & z^{k~}_{mu} &&= z_{act}(z\_^{k~}_{mu}) \tag{5.2s} \\ \\
    & \hat{h}\_^{k~}_{mu} &&= x^{k~}_{me} \cdot U\hat{h}^{k}_{vu} \\
    & &&+ (r^{k~}_{mu} * h^{k~<\sm>}_{mu}) \cdot V\hat{h}^{k}_{uu} \\
    & &&+ b\hat{h}^{k}_{u} \tag{6.1s} \\
    & \hat{h}^{k~}_{mu} &&= \hat{h}_{act}(\hat{h}\_^{k~}_{mu}) \tag{6.2s} \\ \\
    & h^{k~}_{mu} &&= z^{k~}_{mu} * h^{k~<\sm>}_{mu} + (1-z^{k~}_{mu}) * \hat{h}^{k~}_{mu} \tag{7s} \\
    \end{alignat*}
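To make the loop above concrete, here is a minimal plain-NumPy sketch of the forward pass. It assumes the hypothetical dimensions of the Shapes sketch, randomly initialized weights, *sigmoid* for the gates and *tanh* for the hidden hat. It mirrors steps (1-7s) but is only an illustration, not a substitute for :py:func:`epynn.gru.forward.gru_forward`.

.. code-block:: python

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    # Hypothetical dimensions and randomly initialized parameters (see Shapes).
    m, s, e, u = 32, 10, 8, 16
    X = np.random.randn(m, s, e)                                 # (1) input of layer k
    Ur, Uz, Uhh = (np.random.randn(e, u) * 0.1 for _ in range(3))
    Vr, Vz, Vhh = (np.random.randn(u, u) * 0.1 for _ in range(3))
    br, bz, bhh = (np.zeros((1, u)) for _ in range(3))

    h = np.zeros((m, u))                                         # (1) initial hidden state
    H = []                                                       # hidden states, one per step

    for t in range(s):
        Xt = X[:, t]                                             # (2s) input for step, shape (m, e)
        hp = h                                                   # (3s) previous hidden state
        r = sigmoid(np.dot(Xt, Ur) + np.dot(hp, Vr) + br)        # (4s) reset gate
        z = sigmoid(np.dot(Xt, Uz) + np.dot(hp, Vz) + bz)        # (5s) update gate
        hh = np.tanh(np.dot(Xt, Uhh) + np.dot(r * hp, Vhh) + bhh)  # (6s) hidden hat
        h = z * hp + (1 - z) * hh                                # (7s) hidden state
        H.append(h)

    A = np.stack(H, axis=1)   # shape (m, s, u) - returned when sequences=True
    A_last = A[:, -1]         # shape (m, u)    - returned by default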
Backward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.gru.models.GRU.backward

.. literalinclude:: ./../epynn/gru/backward.py
   :pyobject: gru_backward
   :language: python

The backward propagation function in a *GRU* layer *k* includes:

* (1): *dA* is the gradient of the loss with respect to the output of forward propagation *A* for current layer *k*. It is equal to the gradient of the loss with respect to the input of forward propagation for next layer *k+1*. The initial gradient for the hidden state *dh* is a zero array.
* (2s): For each step in the reversed sequence, the input *dA* of the current iteration is retrieved by indexing the input of shape *(m, s, u)* to obtain the input for the step, of shape *(m, u)*.
* (3s): The next gradient of the loss with respect to the hidden state *dhn* is retrieved at the beginning of each iteration from its counterpart *dh* computed at the end of the previous iteration (8s).
* (4s): *dh\_* is the sum of *dA* and *dhn*.
* (5s): *dhh\_* is the gradient of the loss with respect to *hh\_* for the current step. It is the product of *dh\_*, *(1-z)* and the derivative of the *activate* function applied on *hh\_*.
* (6s): *dz\_* is the gradient of the loss with respect to *z\_* for the current step. It is the product of *dh\_*, *(hp-hh)* and the derivative of the *activate_update* function applied on *z\_*.
* (7s): *dr\_* is the gradient of the loss with respect to *r\_* for the current step. It is the product of the dot product between *dhh\_* and the transpose of *Vhh*, the previous hidden state *hp*, and the derivative of the *activate_reset* function applied on *r\_*.
* (8s): *dh* is the gradient of the loss with respect to the hidden state *h* for the current step. It is the sum of three terms: the dot product between *dhh\_* and the transpose of *Vhh*, multiplied by the reset gate output *r*; the dot product between *dz\_* and the transpose of *Vz*, to which the element-wise product between *dh\_* and *z* is added; and the dot product between *dr\_* and the transpose of *Vr*.
* (9s): *dX* is the gradient of the loss with respect to the input of forward propagation *X* for current step and current layer *k*. It is the sum of the dot products between: *dhh\_* and the transpose of *Uhh*; *dz\_* and the transpose of *Uz*; *dr\_* and the transpose of *Ur*.

Note that:

* In contrast to the forward propagation, we proceed by iterating over the *reversed sequence*.
* In the default *GRU* configuration with *sequences* set to *False*, the output of forward propagation has shape *(m, u)* and the input of backward propagation has shape *(m, u)*. In the function :py:func:`epynn.gru.backward.initialize_backward` this is converted into a zero array of shape *(m, s, u)* whose coordinates *[:, -1, :]* are set equal to *dA* of shape *(m, u)*.

.. image:: _static/GRU/gru3-01.svg

.. math::

    \begin{alignat*}{2}
    & \delta^{\kp}_{msu} &&= \frac{\partial \mathcal{L}}{\partial a^{k}_{msu}} = \frac{\partial \mathcal{L}}{\partial x^{\kp}_{msu}} \tag{1} \\ \\
    & \delta^{\kp{}}_{mu} &&= \delta^{\kp}_{msu}[:, s] \tag{2s} \\ \\
    & \dL{h}{\sp}{mu} &&= \dL{hn}{s}{mu}[:,s] \tag{3s} \\ \\
    & \dL{h\_}{s}{mu} &&= \delta^{\kp{}}_{mu} + \dL{h}{\sp}{mu} \tag{4s} \\ \\
    & \dL{\hat{h}\_}{s}{mu} &&= \gl{dh\_} \\
    & &&* (1 - z^{k~}_{mu}) \\
    & &&* \hat{h}_{act}'(\hat{h}\_^{k~}_{mu}) \tag{5s} \\ \\
    & \dL{z\_}{s}{mu} &&= \gl{dh\_} \\
    & &&* (h^{k~<\sm>}_{mu} - \hat{h}^{k~}_{mu}) \\
    & &&* z_{act}'(z\_^{k~}_{mu}) \tag{6s} \\ \\
    & \dL{r\_}{s}{mu} &&= \gl{d\hat{h}\_} \cdot \vTp{V\hat{h}}{uu} \\
    & &&* h^{k~<\sm>}_{mu} \\
    & &&* r_{act}'(r\_^{k~}_{mu}) \tag{7s} \\ \\
    & \dL{h}{s}{mu} &&= \dL{\hat{h}\_}{s}{mu} \cdot \vTp{V\hat{h}}{uu} * \gl{r} \\
    & &&+ \dL{z\_}{s}{mu} \cdot \vTp{Vz}{uu} + \gl{dh\_} * z^{k~}_{mu} \\
    & &&+ \dL{r\_}{s}{mu} \cdot \vTp{Vr}{uu} \tag{8s} \\ \\
    & \delta^{k~}_{me} &&= \dL{x}{s}{me} = \frac{\partial \mathcal{L}}{\partial a^{\km~}_{me}} \\
    & &&= \dL{\hat{h}\_}{s}{mu} \cdot \vTp{U\hat{h}}{vu} \\
    & &&+ \dL{z\_}{s}{mu} \cdot \vTp{Uz}{vu} \\
    & &&+ \dL{r\_}{s}{mu} \cdot \vTp{Ur}{vu} \tag{9s} \\
    \end{alignat*}

Gradients
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.gru.models.GRU.compute_gradients

.. literalinclude:: ./../epynn/gru/parameters.py
   :pyobject: gru_compute_gradients
   :language: python

The function to compute parameter gradients in a *GRU* layer *k* includes:

* (1.1): *dUhh* is the gradient of the loss with respect to *Uhh*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *X* and *dhh\_*.
* (1.2): *dVhh* is the gradient of the loss with respect to *Vhh*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *(r \* hp)* and *dhh\_*.
* (1.3): *dbhh* is the gradient of the loss with respect to *bhh*. It is computed as a sum over sequence steps *s* of the sum of *dhh\_* along the axis corresponding to the number of samples *m*.
* (2.1): *dUz* is the gradient of the loss with respect to *Uz*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *X* and *dz\_*.
* (2.2): *dVz* is the gradient of the loss with respect to *Vz*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *hp* and *dz\_*.
* (2.3): *dbz* is the gradient of the loss with respect to *bz*. It is computed as a sum over sequence steps *s* of the sum of *dz\_* along the axis corresponding to the number of samples *m*.
* (3.1): *dUr* is the gradient of the loss with respect to *Ur*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *X* and *dr\_*.
* (3.2): *dVr* is the gradient of the loss with respect to *Vr*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *hp* and *dr\_*.
* (3.3): *dbr* is the gradient of the loss with respect to *br*. It is computed as a sum over sequence steps *s* of the sum of *dr\_* along the axis corresponding to the number of samples *m*.

Note that:

* The same logical operation is applied for the sets *{dUr, dVr, dbr}* and *{dUz, dVz, dbz}*, but not for the set *{dUhh, dVhh, dbhh}*.
* For the first two sets, this logical operation is identical to the one described for the *RNN* layer.
* For *{dUhh, dVhh, dbhh}* the logical operation is different because of how *hh\_* is computed during the forward propagation: *Vhh* multiplies *(r \* hp)* rather than *hp* alone. Both the backward pass and this gradient accumulation are illustrated in the NumPy sketch after the equations below.

.. math::

    \begin{align}
    & \dLp{U\hat{h}}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{\hat{h}\_}{s}{mu} \tag{1.1} \\
    & \dLp{V\hat{h}}{uu} &&= \sumS (r^{k~}_{mu} * h^{k~<\sm>}_{mu})^{\intercal} \cdot \dL{\hat{h}\_}{s}{mu} \tag{1.2} \\
    & \dLp{b\hat{h}}{u} &&= \sumS \sumM \dL{\hat{h}\_}{s}{mu} \tag{1.3} \\ \\
    & \dLp{Uz}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{z\_}{s}{mu} \tag{2.1} \\
    & \dLp{Vz}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{z\_}{s}{mu} \tag{2.2} \\
    & \dLp{bz}{u} &&= \sumS \sumM \dL{z\_}{s}{mu} \tag{2.3} \\ \\
    & \dLp{Ur}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{r\_}{s}{mu} \tag{3.1} \\
    & \dLp{Vr}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{r\_}{s}{mu} \tag{3.2} \\
    & \dLp{br}{u} &&= \sumS \sumM \dL{r\_}{s}{mu} \tag{3.3} \\
    \end{align}
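Continuing the plain-NumPy sketch from the Forward section (same hypothetical dimensions, randomly initialized weights, *sigmoid* gates and *tanh* hidden hat), the fragment below runs the forward pass while caching intermediate products, then iterates over the reversed sequence to compute the per-step gradients (1-9s) and to accumulate the parameter gradients (1.1-3.3). For simplicity it assumes the *sequences=True* case, so *dA* has shape *(m, s, u)*. In EpyNN this work is split between :py:func:`epynn.gru.backward.gru_backward` and :py:func:`epynn.gru.parameters.gru_compute_gradients`; this sketch merges both for brevity.

.. code-block:: python

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def dsigmoid(x):
        return sigmoid(x) * (1 - sigmoid(x))

    def dtanh(x):
        return 1 - np.tanh(x) ** 2

    # Hypothetical dimensions and parameters, as in the forward sketch.
    m, s, e, u = 32, 10, 8, 16
    X = np.random.randn(m, s, e)
    Ur, Uz, Uhh = (np.random.randn(e, u) * 0.1 for _ in range(3))
    Vr, Vz, Vhh = (np.random.randn(u, u) * 0.1 for _ in range(3))
    br, bz, bhh = (np.zeros((1, u)) for _ in range(3))

    # Forward pass, caching the per-step quantities needed for backward.
    cache, h = [], np.zeros((m, u))
    for t in range(s):
        Xt, hp = X[:, t], h
        r_ = np.dot(Xt, Ur) + np.dot(hp, Vr) + br
        z_ = np.dot(Xt, Uz) + np.dot(hp, Vz) + bz
        r, z = sigmoid(r_), sigmoid(z_)
        hh_ = np.dot(Xt, Uhh) + np.dot(r * hp, Vhh) + bhh
        hh = np.tanh(hh_)
        h = z * hp + (1 - z) * hh
        cache.append((Xt, hp, r_, r, z_, z, hh_, hh))

    dA = np.random.randn(m, s, u)        # (1) stands in for the gradient from layer k+1
    dX = np.zeros_like(X)
    dUr, dUz, dUhh = (np.zeros((e, u)) for _ in range(3))
    dVr, dVz, dVhh = (np.zeros((u, u)) for _ in range(3))
    dbr, dbz, dbhh = (np.zeros((1, u)) for _ in range(3))

    dh = np.zeros((m, u))                # (1) initial gradient for the hidden state
    for t in reversed(range(s)):         # iterate over the reversed sequence
        Xt, hp, r_, r, z_, z, hh_, hh = cache[t]
        dhn = dh                                               # (3s)
        dh_ = dA[:, t] + dhn                                   # (2s, 4s)
        dhh_ = dh_ * (1 - z) * dtanh(hh_)                      # (5s)
        dz_ = dh_ * (hp - hh) * dsigmoid(z_)                   # (6s)
        dr_ = np.dot(dhh_, Vhh.T) * hp * dsigmoid(r_)          # (7s)
        dh = (np.dot(dhh_, Vhh.T) * r                          # (8s)
              + np.dot(dz_, Vz.T) + dh_ * z
              + np.dot(dr_, Vr.T))
        dX[:, t] = (np.dot(dhh_, Uhh.T)                        # (9s)
                    + np.dot(dz_, Uz.T)
                    + np.dot(dr_, Ur.T))

        # Parameter gradients, accumulated over the reversed sequence.
        dUhh += np.dot(Xt.T, dhh_)                             # (1.1)
        dVhh += np.dot((r * hp).T, dhh_)                       # (1.2)
        dbhh += dhh_.sum(axis=0)                               # (1.3)
        dUz += np.dot(Xt.T, dz_)                               # (2.1)
        dVz += np.dot(hp.T, dz_)                               # (2.2)
        dbz += dz_.sum(axis=0)                                 # (2.3)
        dUr += np.dot(Xt.T, dr_)                               # (3.1)
        dVr += np.dot(hp.T, dr_)                               # (3.2)
        dbr += dr_.sum(axis=0)                                 # (3.3)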
Live examples
------------------------------

* `Dummy string - GRU-Dense`_
* `Author and music - GRU(sequences=True)-Flatten-(Dense)n-with-Dropout`_

You may also like to browse all `Network training examples`_ provided with EpyNN.

.. _Network training examples: run_examples.html

.. _Dummy string - GRU-Dense: epynnlive/dummy_string/train.html#GRU-Dense

.. _Author and music - GRU(sequences\=True)-Flatten-(Dense)n-with-Dropout: epynnlive/author_music/train.html#GRU(sequences=True)-Flatten-(Dense)n-with-Dropout