.. EpyNN documentation master file, created by sphinx-quickstart on Tue Jul 6 18:46:11 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. toctree::

Gated Recurrent Unit (GRU)
===============================

Source files in ``EpyNN/epynn/gru/``.

See `Appendix - Notations`_ for mathematical conventions.

.. _Appendix - Notations: glossary.html#notations

Layer architecture
------------------------------

.. image:: _static/GRU/gru0-01.svg
   :alt: GRU

A Gated Recurrent Unit or *GRU* layer is an object containing a number of *units* - sometimes referred to as cells - and provided with functions for parameters *initialization* and non-linear *activation* of the so-called hidden hat *hh*. The hidden hat *hh* is an intermediate variable used to compute the hidden state *h*, which is also computed from *gate* products, namely the *reset* gate product *r* and the *update* gate product *z*. Each of these products requires a non-linear activation function to be computed.

.. autoclass:: epynn.gru.models.GRU
   :show-inheritance:

Shapes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.gru.models.GRU.compute_shapes

.. literalinclude:: ./../epynn/gru/parameters.py
   :pyobject: gru_compute_shapes
   :language: python

Within a *GRU* layer, shapes of interest include:

* Input *X* of shape *(m, s, e)* with *m* equal to the number of samples, *s* the number of steps in the sequence and *e* the number of elements within each step of the sequence.
* Weights *U* and *V* of shape *(e, u)* and *(u, u)*, respectively, with *e* the number of elements within each step of the sequence and *u* the number of units in the layer.
* Bias *b* of shape *(1, u)* with *u* the number of units in the layer.
* Hidden state *h* of shape *(m, 1, u)* or *(m, u)* with *m* equal to the number of samples and *u* the number of units in the layer. Because one hidden state *h* is computed for each step in the sequence, the array containing all hidden states with respect to sequence steps has shape *(m, s, u)* with *s* the number of steps in the sequence.

Note that:

* The shape of parameters *V*, *U* and *b* is independent of the number of samples *m* and of the number of steps *s* in the sequence.
* There are three sets of parameters *{V, U, b}*, one for each activation: *{Vr, Ur, br}* for the reset gate, *{Vz, Uz, bz}* for the update gate and *{Vhh, Uhh, bhh}* for the activation of the hidden hat *hh*.
* Recurrent layers, including the *GRU* layer, are well suited to inputs of variable length because the definition of parameters is independent of the sequence length *s*.

.. image:: _static/GRU/gru1-01.svg
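The shapes above can be illustrated with a few lines of plain NumPy. This is a minimal sketch with arbitrary, hypothetical dimensions *m*, *s*, *e* and *u*; it is independent of the EpyNN implementation and only shows the arrays discussed in this section.

.. code-block:: python

    import numpy as np

    # Hypothetical dimensions: m samples, s steps, e elements per step, u units.
    m, s, e, u = 32, 10, 8, 16

    X = np.zeros((m, s, e))                              # layer input
    Ur, Uz, Uhh = (np.zeros((e, u)) for _ in range(3))   # input-to-unit weights
    Vr, Vz, Vhh = (np.zeros((u, u)) for _ in range(3))   # hidden-to-unit weights
    br, bz, bhh = (np.zeros((1, u)) for _ in range(3))   # biases

    h = np.zeros((m, s, u))                              # hidden states for all steps

    print(X.shape, Ur.shape, Vr.shape, br.shape, h.shape)
    # (32, 10, 8) (8, 16) (16, 16) (1, 16) (32, 10, 16)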
Forward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.gru.models.GRU.forward

.. literalinclude:: ./../epynn/gru/forward.py
   :pyobject: gru_forward
   :language: python

The forward propagation function in a *GRU* layer *k* includes:

* (1): Input *X* in current layer *k* is equal to the output *A* of previous layer *k-1*. The initial hidden state *h* is a zero array.
* (2s): For each step, the input *X* of the current iteration is retrieved by indexing the layer input of shape *(m, s, e)* to obtain the input for the step, of shape *(m, e)*.
* (3s): The previous hidden state *hp* is retrieved at the beginning of each iteration in the sequence from the hidden state *h* computed at the end of the previous iteration (7s).
* (4s): The reset gate linear product *r\_* is computed as the sum of the dot products between *X* and *Ur* and between *hp* and *Vr*, to which *br* is added. The non-linear product *r* is computed by applying the *activate_reset* function on *r\_*.
* (5s): The update gate linear product *z\_* is computed as the sum of the dot products between *X* and *Uz* and between *hp* and *Vz*, to which *bz* is added. The non-linear product *z* is computed by applying the *activate_update* function on *z\_*.
* (6s): The hidden hat linear product *hh\_* is computed as the sum of the dot products between *X* and *Uhh* and between *(r \* hp)* and *Vhh*, to which *bhh* is added. The non-linear product *hh* is computed by applying the *activate* function on *hh\_*.
* (7s): The hidden state *h* is the sum of the products between *z* and *hp* and between *(1-z)* and *hh* (see the NumPy sketch after the equations below).

Note that:

* The non-linear activation function for *hh* is generally the *tanh* function. While it can technically be any function, care should be taken when departing from *tanh*.
* The non-linear activation function for *r* and *z* is generally the *sigmoid* function. While it can technically be any function, care should be taken when departing from *sigmoid*.
* The concatenated array of hidden states *h* has shape *(m, s, u)*. By default, the *GRU* layer returns the hidden state corresponding to the last step in the input sequence, of shape *(m, u)*. If the *sequences* argument is set to *True* when instantiating the *GRU* layer, then it returns the whole array of hidden states, of shape *(m, s, u)*.
* For the sake of code homogeneity, the output of the *GRU* layer is *A*, which is equal to *h*.

.. image:: _static/GRU/gru2-01.svg

.. math::

    \begin{alignat*}{2}
    & x^{k}_{mse} &&= a^{\km}_{mse} \tag{1} \\ \\
    & x^{k~}_{me} &&= x^{k}_{mse}[:, s] \tag{2s} \\ \\
    & h^{k~<\sm>}_{mu} &&= hp^{k}_{msu}[:,s] \tag{3s} \\ \\
    & r\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Ur^{k}_{vu} \\
    & &&+ h^{k~<\sm>}_{mu} \cdot Vr^{k}_{uu} \\
    & &&+ br^{k}_{u} \tag{4.1s} \\
    & r^{k~}_{mu} &&= r_{act}(r\_^{k~}_{mu}) \tag{4.2s} \\ \\
    & z\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Uz^{k}_{vu} \\
    & &&+ h^{k~<\sm>}_{mu} \cdot Vz^{k}_{uu} \\
    & &&+ bz^{k}_{u} \tag{5.1s} \\
    & z^{k~}_{mu} &&= z_{act}(z\_^{k~}_{mu}) \tag{5.2s} \\ \\
    & \hat{h}\_^{k~}_{mu} &&= x^{k~}_{me} \cdot U\hat{h}^{k}_{vu} \\
    & &&+ (r^{k~}_{mu} * h^{k~<\sm>}_{mu}) \cdot V\hat{h}^{k}_{uu} \\
    & &&+ b\hat{h}^{k}_{u} \tag{6.1s} \\
    & \hat{h}^{k~}_{mu} &&= \hat{h}_{act}(\hat{h}\_^{k~}_{mu}) \tag{6.2s} \\ \\
    & h^{k~}_{mu} &&= z^{k~}_{mu} * h^{k~<\sm>}_{mu} + (1-z^{k~}_{mu}) * \hat{h}^{k~}_{mu} \tag{7s} \\
    \end{alignat*}
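To make the loop above concrete, here is a minimal plain-NumPy sketch of the forward pass. It assumes the hypothetical dimensions of the Shapes sketch, randomly initialized weights, *sigmoid* for the gates and *tanh* for the hidden hat. It mirrors steps (1-7s) but is only an illustration, not a substitute for :py:func:`epynn.gru.forward.gru_forward`.

.. code-block:: python

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    # Hypothetical dimensions and randomly initialized parameters (see Shapes).
    m, s, e, u = 32, 10, 8, 16
    X = np.random.randn(m, s, e)                                 # (1) input of layer k
    Ur, Uz, Uhh = (np.random.randn(e, u) * 0.1 for _ in range(3))
    Vr, Vz, Vhh = (np.random.randn(u, u) * 0.1 for _ in range(3))
    br, bz, bhh = (np.zeros((1, u)) for _ in range(3))

    h = np.zeros((m, u))                                         # (1) initial hidden state
    H = []                                                       # hidden states, one per step

    for t in range(s):
        Xt = X[:, t]                                             # (2s) input for step, shape (m, e)
        hp = h                                                   # (3s) previous hidden state
        r = sigmoid(np.dot(Xt, Ur) + np.dot(hp, Vr) + br)        # (4s) reset gate
        z = sigmoid(np.dot(Xt, Uz) + np.dot(hp, Vz) + bz)        # (5s) update gate
        hh = np.tanh(np.dot(Xt, Uhh) + np.dot(r * hp, Vhh) + bhh)  # (6s) hidden hat
        h = z * hp + (1 - z) * hh                                # (7s) hidden state
        H.append(h)

    A = np.stack(H, axis=1)   # shape (m, s, u) - returned when sequences=True
    A_last = A[:, -1]         # shape (m, u)    - returned by default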
Backward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.gru.models.GRU.backward

.. literalinclude:: ./../epynn/gru/backward.py
   :pyobject: gru_backward
   :language: python

The backward propagation function in a *GRU* layer *k* includes:

* (1): *dA* is the gradient of the loss with respect to the output of forward propagation *A* for current layer *k*. It is equal to the gradient of the loss with respect to the input of forward propagation for next layer *k+1*. The initial gradient for the hidden state *dh* is a zero array.
* (2s): For each step in the reversed sequence, the input *dA* of the current iteration is retrieved by indexing the input of shape *(m, s, u)* to obtain the input for the step, of shape *(m, u)*.
* (3s): The next gradient of the loss with respect to the hidden state *dhn* is retrieved at the beginning of each iteration from its counterpart *dh* computed at the end of the previous iteration (8s).
* (4s): *dh\_* is the sum of *dA* and *dhn*.
* (5s): *dhh\_* is the gradient of the loss with respect to *hh\_* for the current step. It is the product of *dh\_*, *(1-z)* and the derivative of the *activate* function applied on *hh\_*.
* (6s): *dz\_* is the gradient of the loss with respect to *z\_* for the current step. It is the product of *dh\_*, *(hp-hh)* and the derivative of the *activate_update* function applied on *z\_*.
* (7s): *dr\_* is the gradient of the loss with respect to *r\_* for the current step. It is the product of the dot product between *dhh\_* and the transpose of *Vhh*, the previous hidden state *hp*, and the derivative of the *activate_reset* function applied on *r\_*.
* (8s): *dh* is the gradient of the loss with respect to the hidden state *h* for the current step. It is the sum of three terms: the dot product between *dhh\_* and the transpose of *Vhh*, multiplied by the reset gate output *r*; the dot product between *dz\_* and the transpose of *Vz*, to which the element-wise product between *dh\_* and *z* is added; and the dot product between *dr\_* and the transpose of *Vr*.
* (9s): *dX* is the gradient of the loss with respect to the input of forward propagation *X* for current step and current layer *k*. It is the sum of the dot products between: *dhh\_* and the transpose of *Uhh*; *dz\_* and the transpose of *Uz*; *dr\_* and the transpose of *Ur*.

Note that:

* In contrast to the forward propagation, we proceed by iterating over the *reversed sequence*.
* In the default *GRU* configuration with *sequences* set to *False*, the output of forward propagation has shape *(m, u)* and the input of backward propagation has shape *(m, u)*. In the function :py:func:`epynn.gru.backward.initialize_backward` this is converted into a zero array of shape *(m, s, u)* whose coordinates *[:, -1, :]* are set equal to *dA* of shape *(m, u)*.

.. image:: _static/GRU/gru3-01.svg

.. math::

    \begin{alignat*}{2}
    & \delta^{\kp}_{msu} &&= \frac{\partial \mathcal{L}}{\partial a^{k}_{msu}} = \frac{\partial \mathcal{L}}{\partial x^{\kp}_{msu}} \tag{1} \\ \\
    & \delta^{\kp{}}_{mu} &&= \delta^{\kp}_{msu}[:, s] \tag{2s} \\ \\
    & \dL{h}{\sp}{mu} &&= \dL{hn}{s}{mu}[:,s] \tag{3s} \\ \\
    & \dL{h\_}{s}{mu} &&= \delta^{\kp{}}_{mu} + \dL{h}{\sp}{mu} \tag{4s} \\ \\
    & \dL{\hat{h}\_}{s}{mu} &&= \gl{dh\_} \\
    & &&* (1 - z^{k~}_{mu}) \\
    & &&* \hat{h}_{act}'(\hat{h}\_^{k~}_{mu}) \tag{5s} \\ \\
    & \dL{z\_}{s}{mu} &&= \gl{dh\_} \\
    & &&* (h^{k~<\sm>}_{mu} - \hat{h}^{k~}_{mu}) \\
    & &&* z_{act}'(z\_^{k~}_{mu}) \tag{6s} \\ \\
    & \dL{r\_}{s}{mu} &&= \gl{d\hat{h}\_} \cdot \vTp{V\hat{h}}{uu} \\
    & &&* h^{k~<\sm>}_{mu} \\
    & &&* r_{act}'(r\_^{k~}_{mu}) \tag{7s} \\ \\
    & \dL{h}{s}{mu} &&= \dL{\hat{h}\_}{s}{mu} \cdot \vTp{V\hat{h}}{uu} * \gl{r} \\
    & &&+ \dL{z\_}{s}{mu} \cdot \vTp{Vz}{uu} + \gl{dh\_} * z^{k~}_{mu} \\
    & &&+ \dL{r\_}{s}{mu} \cdot \vTp{Vr}{uu} \tag{8s} \\ \\
    & \delta^{k~}_{me} &&= \dL{x}{s}{me} = \frac{\partial \mathcal{L}}{\partial a^{\km~}_{me}} \\
    & &&= \dL{\hat{h}\_}{s}{mu} \cdot \vTp{U\hat{h}}{vu} \\
    & &&+ \dL{z\_}{s}{mu} \cdot \vTp{Uz}{vu} \\
    & &&+ \dL{r\_}{s}{mu} \cdot \vTp{Ur}{vu} \tag{9s} \\
    \end{alignat*}

Gradients
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.gru.models.GRU.compute_gradients

.. literalinclude:: ./../epynn/gru/parameters.py
   :pyobject: gru_compute_gradients
   :language: python

The function to compute parameter gradients in a *GRU* layer *k* includes:

* (1.1): *dUhh* is the gradient of the loss with respect to *Uhh*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *X* and *dhh\_*.
* (1.2): *dVhh* is the gradient of the loss with respect to *Vhh*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *(r \* hp)* and *dhh\_*.
* (1.3): *dbhh* is the gradient of the loss with respect to *bhh*. It is computed as a sum over sequence steps *s* of the sum of *dhh\_* along the axis corresponding to the number of samples *m*.
* (2.1): *dUz* is the gradient of the loss with respect to *Uz*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *X* and *dz\_*.
* (2.2): *dVz* is the gradient of the loss with respect to *Vz*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *hp* and *dz\_*.
* (2.3): *dbz* is the gradient of the loss with respect to *bz*. It is computed as a sum over sequence steps *s* of the sum of *dz\_* along the axis corresponding to the number of samples *m*.
* (3.1): *dUr* is the gradient of the loss with respect to *Ur*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *X* and *dr\_*.
* (3.2): *dVr* is the gradient of the loss with respect to *Vr*. It is computed as a sum over sequence steps *s* of the dot products between the transpose of *hp* and *dr\_*.
* (3.3): *dbr* is the gradient of the loss with respect to *br*. It is computed as a sum over sequence steps *s* of the sum of *dr\_* along the axis corresponding to the number of samples *m*.

Note that:

* The same logical operation is applied for the sets *{dUr, dVr, dbr}* and *{dUz, dVz, dbz}*, but not for the set *{dUhh, dVhh, dbhh}*.
* For the first two sets, this logical operation is identical to the one described for the *RNN* layer.
* For *{dUhh, dVhh, dbhh}* the logical operation is different because of how *hh\_* is computed during the forward propagation: *Vhh* multiplies *(r \* hp)* rather than *hp* alone. Both the backward pass and this gradient accumulation are illustrated in the NumPy sketch after the equations below.

.. math::

    \begin{align}
    & \dLp{U\hat{h}}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{\hat{h}\_}{s}{mu} \tag{1.1} \\
    & \dLp{V\hat{h}}{uu} &&= \sumS (r^{k~}_{mu} * h^{k~<\sm>}_{mu})^{\intercal} \cdot \dL{\hat{h}\_}{s}{mu} \tag{1.2} \\
    & \dLp{b\hat{h}}{u} &&= \sumS \sumM \dL{\hat{h}\_}{s}{mu} \tag{1.3} \\ \\
    & \dLp{Uz}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{z\_}{s}{mu} \tag{2.1} \\
    & \dLp{Vz}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{z\_}{s}{mu} \tag{2.2} \\
    & \dLp{bz}{u} &&= \sumS \sumM \dL{z\_}{s}{mu} \tag{2.3} \\ \\
    & \dLp{Ur}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{r\_}{s}{mu} \tag{3.1} \\
    & \dLp{Vr}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{r\_}{s}{mu} \tag{3.2} \\
    & \dLp{br}{u} &&= \sumS \sumM \dL{r\_}{s}{mu} \tag{3.3} \\
    \end{align}
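Continuing the plain-NumPy sketch from the Forward section (same hypothetical dimensions, randomly initialized weights, *sigmoid* gates and *tanh* hidden hat), the fragment below runs the forward pass while caching intermediate products, then iterates over the reversed sequence to compute the per-step gradients (1-9s) and to accumulate the parameter gradients (1.1-3.3). For simplicity it assumes the *sequences=True* case, so *dA* has shape *(m, s, u)*. In EpyNN this work is split between :py:func:`epynn.gru.backward.gru_backward` and :py:func:`epynn.gru.parameters.gru_compute_gradients`; this sketch merges both for brevity.

.. code-block:: python

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def dsigmoid(x):
        return sigmoid(x) * (1 - sigmoid(x))

    def dtanh(x):
        return 1 - np.tanh(x) ** 2

    # Hypothetical dimensions and parameters, as in the forward sketch.
    m, s, e, u = 32, 10, 8, 16
    X = np.random.randn(m, s, e)
    Ur, Uz, Uhh = (np.random.randn(e, u) * 0.1 for _ in range(3))
    Vr, Vz, Vhh = (np.random.randn(u, u) * 0.1 for _ in range(3))
    br, bz, bhh = (np.zeros((1, u)) for _ in range(3))

    # Forward pass, caching the per-step quantities needed for backward.
    cache, h = [], np.zeros((m, u))
    for t in range(s):
        Xt, hp = X[:, t], h
        r_ = np.dot(Xt, Ur) + np.dot(hp, Vr) + br
        z_ = np.dot(Xt, Uz) + np.dot(hp, Vz) + bz
        r, z = sigmoid(r_), sigmoid(z_)
        hh_ = np.dot(Xt, Uhh) + np.dot(r * hp, Vhh) + bhh
        hh = np.tanh(hh_)
        h = z * hp + (1 - z) * hh
        cache.append((Xt, hp, r_, r, z_, z, hh_, hh))

    dA = np.random.randn(m, s, u)        # (1) stands in for the gradient from layer k+1
    dX = np.zeros_like(X)
    dUr, dUz, dUhh = (np.zeros((e, u)) for _ in range(3))
    dVr, dVz, dVhh = (np.zeros((u, u)) for _ in range(3))
    dbr, dbz, dbhh = (np.zeros((1, u)) for _ in range(3))

    dh = np.zeros((m, u))                # (1) initial gradient for the hidden state
    for t in reversed(range(s)):         # iterate over the reversed sequence
        Xt, hp, r_, r, z_, z, hh_, hh = cache[t]
        dhn = dh                                               # (3s)
        dh_ = dA[:, t] + dhn                                   # (2s, 4s)
        dhh_ = dh_ * (1 - z) * dtanh(hh_)                      # (5s)
        dz_ = dh_ * (hp - hh) * dsigmoid(z_)                   # (6s)
        dr_ = np.dot(dhh_, Vhh.T) * hp * dsigmoid(r_)          # (7s)
        dh = (np.dot(dhh_, Vhh.T) * r                          # (8s)
              + np.dot(dz_, Vz.T) + dh_ * z
              + np.dot(dr_, Vr.T))
        dX[:, t] = (np.dot(dhh_, Uhh.T)                        # (9s)
                    + np.dot(dz_, Uz.T)
                    + np.dot(dr_, Ur.T))

        # Parameter gradients, accumulated over the reversed sequence.
        dUhh += np.dot(Xt.T, dhh_)                             # (1.1)
        dVhh += np.dot((r * hp).T, dhh_)                       # (1.2)
        dbhh += dhh_.sum(axis=0)                               # (1.3)
        dUz += np.dot(Xt.T, dz_)                               # (2.1)
        dVz += np.dot(hp.T, dz_)                               # (2.2)
        dbz += dz_.sum(axis=0)                                 # (2.3)
        dUr += np.dot(Xt.T, dr_)                               # (3.1)
        dVr += np.dot(hp.T, dr_)                               # (3.2)
        dbr += dr_.sum(axis=0)                                 # (3.3)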
Live examples
------------------------------

* `Dummy string - GRU-Dense`_
* `Author and music - GRU(sequences=True)-Flatten-(Dense)n-with-Dropout`_

You may also like to browse all `Network training examples`_ provided with EpyNN.

.. _Network training examples: run_examples.html

.. _Dummy string - GRU-Dense: epynnlive/dummy_string/train.html#GRU-Dense

.. _Author and music - GRU(sequences\=True)-Flatten-(Dense)n-with-Dropout: epynnlive/author_music/train.html#GRU(sequences=True)-Flatten-(Dense)n-with-Dropout