Gated Recurrent Unit (GRU)
===============================
Source files in ``EpyNN/epynn/gru/``.
See `Appendix - Notations`_ for mathematical conventions.
.. _Appendix - Notations: glossary.html#notations
Layer architecture
------------------------------
.. image:: _static/GRU/gru0-01.svg
:alt: GRU
A Gated Recurrent Unit or *GRU* layer is an object containing a number of *units* - sometimes referred to as cells - and provided with functions for parameter *initialization* and non-linear *activation* of the so-called hidden hat *hh*. The hidden hat *hh* is an intermediate variable used to compute the hidden state *h*. The hidden state is also computed from *gate* products, namely the *reset* gate product *r* and the *update* gate product *z*. Each of these products requires its own non-linear activation function.
.. autoclass:: epynn.gru.models.GRU
:show-inheritance:
Shapes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automethod:: epynn.gru.models.GRU.compute_shapes
.. literalinclude:: ./../epynn/gru/parameters.py
:pyobject: gru_compute_shapes
:language: python
Within a *GRU* layer, shapes of interest include:
* Input *X* of shape *(m, s, e)* with *m* equal to the number of samples, *s* the number of steps in the sequence and *e* the number of elements within each step of the sequence.
* Weights *U* and *V* of shape *(e, u)* and *(u, u)*, respectively, with *e* the number of elements within each step of the sequence and *u* the number of units in the layer.
* Bias *b* of shape *(1, u)* with *u* the number of units in the layer.
* Hidden state *h* of shape *(m, 1, u)* or *(m, u)* with *m* equal to the number of samples and *u* the number of units in the layer. Because one hidden state *h* is computed for each step in the sequence, the array containing all hidden states with respect to sequence steps has shape *(m, s, u)* with *s* the number of steps in the sequence.
Note that:
* The shapes of parameters *V*, *U* and *b* are independent of the number of samples *m* and the number of steps in the sequence *s*.
* There are three sets of parameters *{V, U, b}*, one for each activation: *{Vr, Ur, br}* for the reset gate, *{Vz, Uz, bz}* for the update gate and *{Vhh, Uhh, bhh}* for the activation of the hidden hat *hh*.
* Recurrent layers, including the *GRU* layer, are considered appropriate for inputs of variable length because the parameter definitions are independent of the input length *s*.
.. image:: _static/GRU/gru1-01.svg
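These shapes can be summarized in a minimal NumPy sketch. Note that this is an illustration with arbitrary dimensions and variable names, not an excerpt from the EpyNN source.

.. code-block:: python

    import numpy as np

    m, s, e, u = 32, 10, 4, 8    # samples, steps, elements per step, units

    X = np.random.standard_normal((m, s, e))    # layer input of shape (m, s, e)

    # One {U, V, b} set per activation: reset gate, update gate and hidden hat
    U_shape = (e, u)     # input-to-unit weights
    V_shape = (u, u)     # unit-to-unit (recurrent) weights
    b_shape = (1, u)     # bias

    h = np.zeros((m, u))            # hidden state for a single step
    h_seq = np.zeros((m, s, u))     # hidden states for every step in the sequence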
Forward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automethod:: epynn.gru.models.GRU.forward
.. literalinclude:: ./../epynn/gru/forward.py
:pyobject: gru_forward
:language: python
The forward propagation function in a *GRU* layer *k* includes:
* (1): Input *X* in current layer *k* is equal to the output *A* of previous layer *k-1*. The initial hidden state *h* is a zero array.
* (2s): For each step, the input *X* of the current iteration is retrieved by indexing the layer input of shape *(m, s, e)* to obtain the input for the current step with shape *(m, e)*.
* (3s): The previous hidden state *hp* is retrieved at the beginning of each iteration in the sequence from the hidden state *h* computed at the end of the previous iteration (7s).
* (4s): The reset gate linear product *r\_* is computed from the sum of the dot products between *X*, *Ur* and *hp*, *Vr* to which *br* is added. The non-linear product *r* is computed by applying the *activate_reset* function on *r\_*.
* (5s): The update gate linear product *z\_* is computed from the sum of the dot products between *X*, *Uz* and *hp*, *Vz* to which *bz* is added. The non-linear product *z* is computed by applying the *activate_update* function on *z\_*.
* (6s): The hidden hat linear product *hh\_* is computed from the sum of the dot products between *X*, *Uhh* and *(r \* hp)*, *Vhh* to which *bhh* is added. The non-linear product *hh* is computed by applying the *activate* function on *hh\_*.
* (7s): The hidden state *h* is the sum of the products between *z*, *hp* and *(1-z)*, *hh*.
Note that:
* The non-linear activation function for *hh* is generally the *tanh* function. While it can technically be any function, caution is advised when substituting another one.
* The non-linear activation function for *r* and *z* is generally the *sigmoid* function. While it can technically be any function, caution is advised when substituting another one.
* The concatenated array of hidden states *h* has shape *(m, s, u)*. By default, the *GRU* layer returns the hidden state corresponding to the last step in the input sequence with shape *(m, u)*. If the *sequences* argument is set to *True* when instantiating the *GRU* layer, then it will return the whole array of hidden states with shape *(m, s, u)* (see the usage sketch after this list).
* For the sake of code homogeneity, the output of the *GRU* layer is *A* which is equal to *h*.
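For a brief illustration of the *sequences* argument, here is a usage sketch. It assumes, as in the live examples linked at the bottom of this page, that the first positional argument of the constructor sets the number of units; the surrounding network setup is omitted.

.. code-block:: python

    from epynn.gru.models import GRU

    # Default: return only the hidden state of the last step, shape (m, u)
    gru_last = GRU(8)

    # Return the hidden states of all steps, shape (m, s, u)
    gru_all = GRU(8, sequences=True)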
.. image:: _static/GRU/gru2-01.svg
.. math::
\begin{alignat*}{2}
& x^{k}_{mse} &&= a^{\km}_{mse} \tag{1} \\
\\
& x^{k~~~}_{me} &&= x^{k}_{mse}[:, s] \tag{2s} \\
\\
& h^{k~<\sm>}_{mu} &&= hp^{k}_{msu}[:,s] \tag{3s} \\
\\
& r\_^{k~~~~~}_{mu} &&= x^{k~~~~~}_{me} \cdot Ur^{k}_{vu} \\
& &&+ h^{k~<\sm>}_{mu} \cdot Vr^{k}_{uu} \\
& &&+ br^{k}_{u} \tag{4.1s} \\
& r^{k~~~~~}_{mu} &&= r_{act}(r\_^{k~~~~~}_{mu}) \tag{4.2s} \\
\\
& z\_^{k~~~~~}_{mu} &&= x^{k~~~~~}_{me} \cdot Uz^{k}_{vu} \\
& &&+ h^{k~<\sm>}_{mu} \cdot Vz^{k}_{uu} \\
& &&+ bz^{k}_{u} \tag{5.1s} \\
& z^{k~~~~~}_{mu} &&= z_{act}(z\_^{k~~~~~}_{mu}) \tag{5.2s} \\
\\
& \hat{h}\_^{k~~~~~}_{mu} &&= x^{k~~~~~}_{me} \cdot U\hat{h}^{k}_{vu} \\
& &&+ (r^{k~~~~~}_{mu} * h^{k~<\sm>}_{mu}) \cdot V\hat{h}^{k}_{uu} \\
& &&+ b\hat{h}^{k}_{u} \tag{6.1s} \\
& \hat{h}^{k~~~~~}_{mu} &&= \hat{h}_{act}(\hat{h}\_^{k~~~~~}_{mu}) \tag{6.2s} \\
\\
& h^{k~~~~~}_{mu} &&= z^{k~~~~~}_{mu} * h^{k~<\sm>}_{mu} + (1-z^{k~~~~~}_{mu}) * \hat{h}^{k~~~~~}_{mu} \tag{7s} \\
\end{alignat*}
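The forward steps above can also be summarized in a minimal NumPy sketch. This is a simplified, standalone re-implementation for illustration only: it assumes *sigmoid* and *tanh* activations and does not use the EpyNN layer object or its caching of intermediate values.

.. code-block:: python

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def gru_forward_sketch(X, p):
        """X: input of shape (m, s, e). p: dict with Ur, Vr, br, Uz, Vz, bz, Uhh, Vhh, bhh."""
        m, s, e = X.shape
        u = p['Ur'].shape[1]

        h = np.zeros((m, u))                 # (1) initial hidden state is a zero array
        h_seq = np.zeros((m, s, u))

        for t in range(s):
            Xt = X[:, t]                     # (2s) input for the current step, shape (m, e)
            hp = h                           # (3s) previous hidden state
            r = sigmoid(Xt @ p['Ur'] + hp @ p['Vr'] + p['br'])              # (4s) reset gate
            z = sigmoid(Xt @ p['Uz'] + hp @ p['Vz'] + p['bz'])              # (5s) update gate
            hh = np.tanh(Xt @ p['Uhh'] + (r * hp) @ p['Vhh'] + p['bhh'])    # (6s) hidden hat
            h = z * hp + (1 - z) * hh        # (7s) hidden state
            h_seq[:, t] = h

        return h_seq                         # A = h_seq if sequences=True, else h_seq[:, -1]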
Backward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automethod:: epynn.gru.models.GRU.backward
.. literalinclude:: ./../epynn/gru/backward.py
:pyobject: gru_backward
:language: python
The backward propagation function in a *GRU* layer *k* includes:
* (1): *dA* is the gradient of the loss with respect to the output of forward propagation *A* for the current layer *k*. It is equal to the gradient of the loss with respect to the input of forward propagation for the next layer *k+1*. The initial gradient for the hidden state *dh* is a zero array.
* (2s): For each step in the reversed sequence, the input *dA* of the current iteration is retrieved by indexing the input of shape *(m, s, u)* to obtain the input for the current step with shape *(m, u)*.
* (3s): The next gradient of the loss with respect to the hidden state *dhn* is retrieved at the beginning of each iteration from the counterpart *dh* computed at the end of the previous iteration (8s).
* (4s): *dh\_* is the sum of *dA* and *dhn*.
* (5s): *dhh\_* is the gradient of the loss with respect to *hh\_* for the current step. It is the product of *dh\_*, *(1-z)* and the derivative of the *activate* function applied on *hh\_*.
* (6s): *dz\_* is the gradient of the loss with respect to *z\_* for the current step. It is the product of *dh\_*, *(hp-hh)* and the derivative of the *activate_update* function applied on *z\_*.
* (7s): *dr\_* is the gradient of the loss with respect to *r\_* for the current step. It is the product of (the dot product between *dhh\_* and the transpose of *Vhh*), previous hidden state *hp* and the derivative of the *activate_reset* function applied on *r\_*.
* (8s): *dh* is the gradient of the loss with respect to the hidden state *h* for the current step. It is the sum of three terms: the dot product of *dhh\_* and the transpose of *Vhh*, multiplied by the output of the reset gate *r*; the dot product between *dz\_* and the transpose of *Vz*, to which the element-wise product between *dh\_* and *z* is added; and the dot product between *dr\_* and the transpose of *Vr*.
* (9s): *dX* is the gradient of the loss with respect to the input of forward propagation *X* for current step and current layer *k*. It is the sum of the dot products between: *dhh\_* and the transpose of *Uhh*; *dz\_* and the transpose of *Uz*; *dr\_* and the transpose of *Ur*.
Note that:
* In contrast to the forward propagation, we proceed by iterating over the *reversed sequence*.
* In the default *GRU* configuration with *sequences* set to *False*, the output of forward propagation has shape *(m, u)* and the input of backward propagation has shape *(m, u)*. In the function :py:func:`epynn.gru.backward.initialize_backward` this is converted to a zero array of shape *(m, s, u)* in which only the coordinates *[:, -1, :]* are set equal to *dA* of shape *(m, u)*.
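For instance, a toy illustration of that conversion (not the exact body of :py:func:`epynn.gru.backward.initialize_backward`):

.. code-block:: python

    import numpy as np

    m, s, u = 32, 10, 8
    dA = np.random.standard_normal((m, u))    # gradient w.r.t. the last hidden state

    dA_seq = np.zeros((m, s, u))              # zero array of shape (m, s, u)
    dA_seq[:, -1] = dA                        # only the last step receives dA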
.. image:: _static/GRU/gru3-01.svg
.. math::
\begin{alignat*}{2}
& \delta^{\kp}_{msu} &&= \frac{\partial \mathcal{L}}{\partial a^{k}_{msu}} = \frac{\partial \mathcal{L}}{\partial x^{\kp}_{msu}} \tag{1} \\
\\
& \delta^{\kp{~~~~}}_{mu} &&= \delta^{\kp}_{msu}[:, s] \tag{2s} \\
\\
& \dL{h}{\sp}{mu} &&= \dL{hn}{s}{mu}[:,s] \tag{3s} \\
\\
& \dL{h\_}{s}{mu} &&= \delta^{\kp{~~~~}}_{mu} + \dL{h}{\sp}{mu} \tag{4s} \\
\\
& \dL{\hat{h}\_}{s}{mu} &&= \gl{dh\_} \\
& &&* (1 - z^{k~~~~~}_{mu}) \\
& &&* \hat{h}_{act}'(\hat{h}\_^{k~~~~~}_{mu}) \tag{5s} \\
\\
& \dL{z\_}{s}{mu} &&= \gl{dh\_} \\
& &&* (h^{k~<\sm>}_{mu} - \hat{h}^{k~~~~~}_{mu}) \\
& &&* z_{act}'(z\_^{k~~~~~}_{mu}) \tag{6s} \\
\\
& \dL{r\_}{s}{mu} &&= \gl{d\hat{h}\_} \cdot \vTp{V\hat{h}}{uu} \\
& &&* h^{k~<\sm>}_{mu} \\
& &&* r_{act}'(r\_^{k~~~~~}_{mu}) \tag{7s} \\
\\
& \dL{h}{s}{mu} &&= \dL{\hat{h}\_}{s}{mu} \cdot \vTp{V\hat{h}}{uu} * \gl{r} \\
& &&+ \dL{z\_}{s}{mu} \cdot \vTp{Vz}{uu} \\
& &&+ \dL{r\_}{s}{mu} \cdot \vTp{Vr}{uu} \tag{8s} \\
\\
& \delta^{k~~~~~}_{me} &&= \dL{x}{s}{me} = \frac{\partial \mathcal{L}}{\partial a^{\km~~~~~}_{me}} \\
& &&= \dL{\hat{h}\_}{s}{mu} \cdot \vTp{U\hat{h}}{vu} \\
& &&+ \dL{z\_}{s}{mu} \cdot \vTp{Uz}{vu} \\
& &&+ \dL{r\_}{s}{mu} \cdot \vTp{Ur}{vu} \tag{9s} \\
\end{alignat*}
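Again for illustration, here is a matching standalone NumPy sketch of the backward pass. It assumes *sigmoid* and *tanh* activations and that the per-step values *r*, *z*, *hh* and *hp* were cached during the forward pass; it does not reproduce the EpyNN layer object.

.. code-block:: python

    import numpy as np

    def gru_backward_sketch(dA, cache, p):
        """dA: gradients w.r.t. hidden states, shape (m, s, u). cache: per-step forward values."""
        m, s, u = dA.shape
        e = p['Ur'].shape[0]

        dX = np.zeros((m, s, e))
        dh = np.zeros((m, u))                                  # (1) initial dh is a zero array

        for t in reversed(range(s)):                           # iterate over the reversed sequence
            r, z = cache['r'][t], cache['z'][t]
            hh, hp = cache['hh'][t], cache['hp'][t]

            dh_ = dA[:, t] + dh                                # (4s) dA + dhn
            dhh_ = dh_ * (1 - z) * (1 - hh**2)                 # (5s) tanh derivative
            dz_ = dh_ * (hp - hh) * z * (1 - z)                # (6s) sigmoid derivative
            dr_ = (dhh_ @ p['Vhh'].T) * hp * r * (1 - r)       # (7s) sigmoid derivative
            dh = ((dhh_ @ p['Vhh'].T) * r                      # (8s)
                  + dz_ @ p['Vz'].T + dh_ * z
                  + dr_ @ p['Vr'].T)
            dX[:, t] = (dhh_ @ p['Uhh'].T                      # (9s)
                        + dz_ @ p['Uz'].T
                        + dr_ @ p['Ur'].T)

        return dX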
Gradients
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. automethod:: epynn.gru.models.GRU.compute_gradients
.. literalinclude:: ./../epynn/gru/parameters.py
:pyobject: gru_compute_gradients
:language: python
The function to compute parameter gradients in a *GRU* layer *k* includes:
* (1.1): *dUhh* is the gradient of the loss with respect to *Uhh*. It is computed as a sum with respect to step *s* of the dot products between the transpose of *X* and *dhh\_*.
* (1.2): *dVhh* is the gradient of the loss with respect to *Vhh*. It is computed as a sum with respect to step *s* of the dot products between the transpose of *(r \* hp)* and *dhh\_*.
* (1.3): *dbhh* is the gradient of the loss with respect to *bhh*. It is computed as a sum with respect to step *s* of the sum of *dhh\_* along the axis corresponding to the number of samples *m*.
* (2.1): *dUz* is the gradient of the loss with respect to *Uz*. It is computed as a sum with respect to step *s* of the dot products between the transpose of *X* and *dz\_*.
* (2.2): *dVz* is the gradient of the loss with respect to *Vz*. It is computed as a sum with respect to step *s* of the dot products between the transpose of *hp* and *dz\_*.
* (2.3): *dbz* is the gradient of the loss with respect to *bz*. It is computed as a sum with respect to step *s* of the sum of *dz\_* along the axis corresponding to the number of samples *m*.
* (3.1): *dUr* is the gradient of the loss with respect to *Ur*. It is computed as a sum with respect to step *s* of the dot products between the transpose of *X* and *dr\_*.
* (3.2): *dVr* is the gradient of the loss with respect to *Vr*. It is computed as a sum with respect to step *s* of the dot products between the transpose of *hp* and *dr\_*.
* (3.3): *dbr* is the gradient of the loss with respect to *br*. It is computed as a sum with respect to step *s* of the sum of *dr\_* along the axis corresponding to the number of samples *m*.
Note that:
* The same logical operation is applied for sets *{dUr, dVr, dbr}* and *{dUz, dVz, dbz}* but not for set *{dUhh, dVhh, dbhh}*.
* For the first two sets, this logical operation is identical to the one described for the *RNN* layer.
* For *{dUhh, dVhh, dbhh}* the logical operation is different because of how *hh\_* is computed during the forward propagation.
.. math::
\begin{align}
& \dLp{U\hat{h}}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{\hat{h}\_}{s}{mu} \tag{1.1} \\
& \dLp{V\hat{h}}{uu} &&= \sumS (r^{k~~~~~}_{mu} * h^{k~<\sm>}_{mu})^{\intercal} \cdot \dL{\hat{h}\_}{s}{mu} \tag{1.2} \\
& \dLp{b\hat{h}}{u} &&= \sumS \sumM \dL{\hat{h}\_}{s}{mu} \tag{1.3} \\
\\
& \dLp{Uz}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{z\_}{s}{mu} \tag{2.1} \\
& \dLp{Vz}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{z\_}{s}{mu} \tag{2.2} \\
& \dLp{bz}{u} &&= \sumS \sumM \dL{z\_}{s}{mu} \tag{2.3} \\
\\
& \dLp{Ur}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{r\_}{s}{mu} \tag{3.1} \\
& \dLp{Vr}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{r\_}{s}{mu} \tag{3.2} \\
& \dLp{br}{u} &&= \sumS \sumM \dL{r\_}{s}{mu} \tag{3.3} \\
\end{align}
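Under the same assumptions as the sketches above, the accumulation of parameter gradients over steps and samples can be illustrated as follows (the per-step gradients *dhh\_*, *dz\_* and *dr\_* are assumed to have been stored during the backward pass):

.. code-block:: python

    def gru_gradients_sketch(X, cache, grads):
        """X: input of shape (m, s, e). cache/grads: per-step forward values and d*_ gradients."""
        s = X.shape[1]

        # Hidden hat parameters
        dUhh = sum(X[:, t].T @ grads['dhh_'][t] for t in range(s))                           # (1.1)
        dVhh = sum((cache['r'][t] * cache['hp'][t]).T @ grads['dhh_'][t] for t in range(s))  # (1.2)
        dbhh = sum(grads['dhh_'][t].sum(axis=0) for t in range(s))                           # (1.3)

        # Update gate parameters
        dUz = sum(X[:, t].T @ grads['dz_'][t] for t in range(s))                             # (2.1)
        dVz = sum(cache['hp'][t].T @ grads['dz_'][t] for t in range(s))                      # (2.2)
        dbz = sum(grads['dz_'][t].sum(axis=0) for t in range(s))                             # (2.3)

        # Reset gate parameters
        dUr = sum(X[:, t].T @ grads['dr_'][t] for t in range(s))                             # (3.1)
        dVr = sum(cache['hp'][t].T @ grads['dr_'][t] for t in range(s))                      # (3.2)
        dbr = sum(grads['dr_'][t].sum(axis=0) for t in range(s))                             # (3.3)

        return {'dUhh': dUhh, 'dVhh': dVhh, 'dbhh': dbhh,
                'dUz': dUz, 'dVz': dVz, 'dbz': dbz,
                'dUr': dUr, 'dVr': dVr, 'dbr': dbr}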
Live examples
------------------------------
* `Dummy string - GRU-Dense`_
* `Author and music - GRU(sequences=True)-Flatten-(Dense)n-with-Dropout`_
You may also like to browse all `Network training examples`_ provided with EpyNN.
.. _Network training examples: run_examples.html
.. _Dummy string - GRU-Dense: epynnlive/dummy_string/train.html#GRU-Dense
.. _Author and music - GRU(sequences\=True)-Flatten-(Dense)n-with-Dropout: epynnlive/author_music/train.html#GRU(sequences=True)-Flatten-(Dense)n-with-Dropout