Long Short-Term Memory (LSTM)
===============================

Source files in ``EpyNN/epynn/lstm/``.

See `Appendix - Notations`_ for mathematical conventions.

.. _Appendix - Notations: glossary.html#notations

Layer architecture
------------------------------

.. image:: _static/LSTM/lstm0-01.svg
   :alt: LSTM

A Long Short-Term Memory or *LSTM* layer is an object containing a number of *units* - sometimes referred to as cells - and provided with functions for parameters *initialization* and non-linear *activation* of the so-called memory state *C*. The latter is a variable used to compute the hidden state *h*.

Both the hidden state *h* and the memory state *C* are computed from *gate* products, namely the *forget* gate product *f*, the *input* gate products *i* and *g*, and the *output* gate product *o*. Each of these products requires a non-linear activation function to be computed.

.. autoclass:: epynn.lstm.models.LSTM
   :show-inheritance:

Shapes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.lstm.models.LSTM.compute_shapes

.. literalinclude:: ./../epynn/lstm/parameters.py
   :pyobject: lstm_compute_shapes
   :language: python

Within a *LSTM* layer, shapes of interest include:

* Input *X* of shape *(m, s, e)* with *m* the number of samples, *s* the number of steps in the sequence and *e* the number of elements within each step of the sequence.
* Weights *U* and *V* of shape *(e, u)* and *(u, u)*, respectively, with *e* the number of elements within each step of the sequence and *u* the number of units in the layer.
* Bias *b* of shape *(1, u)* with *u* the number of units in the layer.
* Hidden state *h* and memory state *C*, each of shape *(m, 1, u)* or *(m, u)* with *m* the number of samples and *u* the number of units in the layer. Because one hidden state *h* and one memory state *C* are computed for each step in the sequence, the arrays containing all hidden or memory states across sequence steps have shape *(m, s, u)* with *s* the number of steps in the sequence.

Note that:

* The shapes of parameters *V*, *U* and *b* are independent of the number of samples *m* and of the number of steps in the sequence *s*.
* There are four sets of parameters *{V, U, b}*, one for each gate activation: *{Vf, Uf, bf}* for the forget gate, *{Vi, Ui, bi}* and *{Vg, Ug, bg}* for the input gate products *i* and *g*, respectively, and *{Vo, Uo, bo}* for the output gate.
* Recurrent layers, including the *LSTM* layer, are considered appropriate for inputs of variable length because the definition of parameters is independent of the input length *s*.

These shapes are also summarized in the code sketch below.

.. image:: _static/LSTM/lstm1-01.svg
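For reference, the shapes listed above can be summarized in a minimal NumPy sketch. The dimension sizes and variable names below are illustrative placeholders, not attributes of the actual layer.

.. code-block:: python

    import numpy as np

    m, s, e, u = 32, 10, 20, 16    # samples, steps, elements per step, units

    X = np.zeros((m, s, e))        # layer input

    U = np.zeros((e, u))           # input-to-unit weights (one such array per gate)
    V = np.zeros((u, u))           # hidden-to-unit weights (one such array per gate)
    b = np.zeros((1, u))           # bias (one such array per gate)

    h = np.zeros((m, s, u))        # hidden states for every step in the sequence
    C = np.zeros((m, s, u))        # memory states for every step in the sequence

    h_last = h[:, -1]              # shape (m, u): default layer output (sequences=False)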
Forward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.lstm.models.LSTM.forward

.. literalinclude:: ./../epynn/lstm/forward.py
   :pyobject: lstm_forward
   :language: python

.. image:: _static/LSTM/lstm2-01.svg

The forward propagation function in a *LSTM* layer *k* includes:

* (1): Input *X* in the current layer *k* is equal to the output *A* of the previous layer *k-1*. The initial hidden state *h* is a zero array.
* (2s): For each step, the input *X* of the current iteration is retrieved by indexing the layer input of shape *(m, s, e)* to obtain the input for the step, of shape *(m, e)*.
* (3s): The previous hidden state *hp* and memory state *Cp\_* are retrieved at the beginning of each iteration in the sequence from the hidden state *h* and memory state *C\_* computed at the end of the previous iteration (9s, 8.1s). Note that *hp* went through non-linear activation while *Cp\_* is a linear product.
* (4s): The forget gate linear product *f\_* is computed from the sum of the dot products between *X*, *Uf* and *hp*, *Vf*, to which *bf* is added. The non-linear product *f* is computed by applying the *activate_forget* function to *f\_*.
* (5s): The input gate linear product *i\_* is computed from the sum of the dot products between *X*, *Ui* and *hp*, *Vi*, to which *bi* is added. The non-linear product *i* is computed by applying the *activate_input* function to *i\_*.
* (6s): The input gate linear product *g\_* is computed from the sum of the dot products between *X*, *Ug* and *hp*, *Vg*, to which *bg* is added. The non-linear product *g* is computed by applying the *activate_candidate* function to *g\_*.
* (7s): The output gate linear product *o\_* is computed from the sum of the dot products between *X*, *Uo* and *hp*, *Vo*, to which *bo* is added. The non-linear product *o* is computed by applying the *activate_output* function to *o\_*.
* (8s): The memory state linear product *C\_* is the sum of the products between *Cp\_*, *f* and *i*, *g*. The non-linear activation *C* is obtained by applying the *activate* function to *C\_*.
* (9s): The hidden state *h* is the product between the output gate product *o* and the memory state *C* activated through non-linearity.

Note that:

* The non-linear activation function for *C* and *g* is generally the *tanh* function. While it can technically be any function, caution is advised when using a function other than *tanh*.
* The non-linear activation function for *f*, *i* and *o* is generally the *sigmoid* function. While it can technically be any function, caution is advised when using a function other than *sigmoid*.
* The concatenated array of hidden states *h* has shape *(m, s, u)*. By default, the *LSTM* layer returns the hidden state corresponding to the last step in the input sequence, of shape *(m, u)*. If the *sequences* argument is set to *True* when instantiating the *LSTM* layer, then it returns the whole array of hidden states, of shape *(m, s, u)*.
* For the sake of code homogeneity, the output of the *LSTM* layer is *A*, which is equal to *h*.

A code sketch of one forward step is given after the equations below.

.. math::

    \begin{alignat*}{2}
        & x^{k}_{mse} &&= a^{\km}_{mse} \tag{1} \\ \\
        & x^{k~}_{me} &&= x^{k}_{mse}[:,s] \tag{2s} \\ \\
        & h^{k~<\sm>}_{mu} &&= hp^{k}_{msu}[:,s] \tag{3.1s} \\
        & C\_^{k~<\sm>}_{mu} &&= Cp\_^{k}_{msu}[:,s] \tag{3.2s} \\ \\
        & f\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Uf^{k}_{vu} \\
        & &&+ h^{k~<\sm>}_{mu} \cdot Vf^{k}_{uu} \\
        & &&+ bf^{k}_{u} \tag{4.1s} \\
        & f^{k~}_{mu} &&= f_{act}(f\_^{k~}_{mu}) \tag{4.2s} \\ \\
        & i\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Ui^{k}_{vu} \\
        & &&+ h^{k~<\sm>}_{mu} \cdot Vi^{k}_{uu} \\
        & &&+ bi^{k}_{u} \tag{5.1s} \\
        & i^{k~}_{mu} &&= i_{act}(i\_^{k~}_{mu}) \tag{5.2s} \\ \\
        & g\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Ug^{k}_{vu} \\
        & &&+ h^{k~<\sm>}_{mu} \cdot Vg^{k}_{uu} \\
        & &&+ bg^{k}_{u} \tag{6.1s} \\
        & g^{k~}_{mu} &&= g_{act}(g\_^{k~}_{mu}) \tag{6.2s} \\ \\
        & o\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Uo^{k}_{vu} \\
        & &&+ h^{k~<\sm>}_{mu} \cdot Vo^{k}_{uu} \\
        & &&+ bo^{k}_{u} \tag{7.1s} \\
        & o^{k~}_{mu} &&= o_{act}(o\_^{k~}_{mu}) \tag{7.2s} \\ \\
        & \gl{C\_} &&= \glm{C\_} * \gl{f} \\
        & &&+ \gl{i} * \gl{g} \tag{8.1s} \\
        & \gl{C} &&= C_{act}(C\_^{k~}_{mu}) \tag{8.2s} \\ \\
        & \gl{h} &&= \gl{o} * \gl{C} \tag{9s} \\
    \end{alignat*}
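As a complement to the equations above, one forward step can be sketched with plain NumPy. This is a minimal illustration of steps (4s) to (9s), assuming the typical *sigmoid* and *tanh* activations; the function and parameter names (``lstm_step``, the dictionary ``P``) are hypothetical and do not belong to the EpyNN implementation shown in the listing above.

.. code-block:: python

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def lstm_step(X, hp, Cp_, P):
        """One LSTM forward step on input X of shape (m, e).

        hp, Cp_ : previous hidden state and previous linear memory state, (m, u).
        P : dict of parameters {Uf, Vf, bf, Ui, Vi, bi, Ug, Vg, bg, Uo, Vo, bo}.
        """
        f = sigmoid(np.dot(X, P['Uf']) + np.dot(hp, P['Vf']) + P['bf'])   # (4s) forget gate
        i = sigmoid(np.dot(X, P['Ui']) + np.dot(hp, P['Vi']) + P['bi'])   # (5s) input gate
        g = np.tanh(np.dot(X, P['Ug']) + np.dot(hp, P['Vg']) + P['bg'])   # (6s) candidate
        o = sigmoid(np.dot(X, P['Uo']) + np.dot(hp, P['Vo']) + P['bo'])   # (7s) output gate

        C_ = Cp_ * f + i * g    # (8.1s) linear memory state
        C = np.tanh(C_)         # (8.2s) activated memory state
        h = o * C               # (9s)   hidden state

        return h, C_, C

Iterating this function over the *s* steps of the sequence, carrying *h* and *C\_* from one step to the next, mirrors the loop described above.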
Backward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.lstm.models.LSTM.backward

.. literalinclude:: ./../epynn/lstm/backward.py
   :pyobject: lstm_backward
   :language: python

.. image:: _static/LSTM/lstm3-01.svg

The backward propagation function in a *LSTM* layer *k* includes:

* (1): *dA* is the gradient of the loss with respect to the output of forward propagation *A* for the current layer *k*. It is equal to the gradient of the loss with respect to the input of forward propagation for the next layer *k+1*. The initial gradient for the hidden state *dh* is a zero array.
* (2s): For each step in the reversed sequence, the input *dA* of the current iteration is retrieved by indexing the input of shape *(m, s, u)* to obtain the input for the step, of shape *(m, u)*.
* (3s): The next gradients of the loss with respect to the memory state *dCn* and the hidden state *dhn* are retrieved at the beginning of each iteration from the counterparts *dC* and *dh* computed at the end of the previous iteration (10s, 11s).
* (4s): *dh\_* is the sum of *dA* and *dhn*.
* (5s): *do\_* is the gradient of the loss with respect to *o\_* for the current step. It is the product between *dh\_*, the memory state *C* and the derivative of the *activate_output* function applied to *o\_*.
* (6s): *dC\_* is the gradient of the loss with respect to *C\_* for the current step. It is the product between *dh\_*, the output gate product *o* and the derivative of the *activate* function applied to *C\_*. Finally, *dCn* is added to the product.
* (7s): *dg\_* is the gradient of the loss with respect to *g\_* for the current step. It is the product between *dC\_*, the input gate product *i* and the derivative of the *activate_candidate* function applied to *g\_*.
* (8s): *di\_* is the gradient of the loss with respect to *i\_* for the current step. It is the product between *dC\_*, the input gate product *g* and the derivative of the *activate_input* function applied to *i\_*.
* (9s): *df\_* is the gradient of the loss with respect to *f\_* for the current step. It is the product between *dC\_*, the previous linear memory state *Cp\_* and the derivative of the *activate_forget* function applied to *f\_*.
* (10s): *dC* is the gradient of the loss with respect to the memory state *C* for the current step. It is the product between *dC\_* and the forget gate product *f*.
* (11s): *dh* is the gradient of the loss with respect to the hidden state *h* for the current step. It is the sum of the dot products between: *do\_* and the transpose of *Vo*; *di\_* and the transpose of *Vi*; *dg\_* and the transpose of *Vg*; *df\_* and the transpose of *Vf*.
* (12s): *dX* is the gradient of the loss with respect to the input of forward propagation *X* for the current step and current layer *k*. It is the sum of the dot products between: *do\_* and the transpose of *Uo*; *di\_* and the transpose of *Ui*; *dg\_* and the transpose of *Ug*; *df\_* and the transpose of *Uf*.

Note that:

* In contrast to the forward propagation, the iteration proceeds over the *reversed sequence*.
* In the default *LSTM* configuration, with *sequences* set to *False*, the output of forward propagation has shape *(m, u)* and so does the input of backward propagation. In the function :py:func:`epynn.lstm.backward.initialize_backward` this input is converted into a zero array of shape *(m, s, u)* whose coordinates *[:, -1, :]* are set equal to *dA* of shape *(m, u)*.

A code sketch of one backward step is given after the equations below.

.. math::

    \begin{alignat*}{2}
        & \delta^{\kp}_{msu} &&= \frac{\partial \mathcal{L}}{\partial a^{k}_{msu}} = \frac{\partial \mathcal{L}}{\partial x^{\kp}_{msu}} \tag{1} \\ \\
        & \delta^{\kp{}}_{mu} &&= \delta^{\kp}_{msu}[:, s] \tag{2s} \\ \\
        & \dL{h}{\sp}{mu} &&= \dL{hn}{s}{mu}[:,s] \tag{3.1s} \\
        & \dL{C}{\sp}{mu} &&= \dL{Cn}{s}{mu}[:,s] \tag{3.2s} \\ \\
        & \dL{h\_}{s}{mu} &&= \delta^{\kp{}}_{mu} + \dL{h}{\sp}{mu} \tag{4s} \\ \\
        & \dL{o\_}{s}{mu} &&= \dL{h\_}{s}{mu} \\
        & &&* \gl{C\_} \\
        & &&* o_{act}'(o\_^{k~}_{mu}) \tag{5s} \\ \\
        & \dL{C\_}{s}{mu} &&= \dL{h\_}{s}{mu} \\
        & &&* \gl{o\_} \\
        & &&* C_{act}'(C\_^{k~}_{mu}) \\
        & &&+ \dL{C}{\sp}{mu} \tag{6s} \\ \\
        & \dL{g\_}{s}{mu} &&= \dL{C\_}{s}{mu} \\
        & &&* \gl{i} \\
        & &&* g_{act}'(g\_^{k~}_{mu}) \tag{7s} \\ \\
        & \dL{i\_}{s}{mu} &&= \dL{C\_}{s}{mu} \\
        & &&* \gl{g} \\
        & &&* i_{act}'(i\_^{k~}_{mu}) \tag{8s} \\ \\
        & \dL{f\_}{s}{mu} &&= \dL{C\_}{s}{mu} \\
        & &&* \gl{Cp\_} \\
        & &&* f_{act}'(f\_^{k~}_{mu}) \tag{9s} \\ \\
        & \dL{C}{s}{mu} &&= \dL{C\_}{s}{mu} \\
        & &&* \gl{f} \tag{10s} \\ \\
        & \dL{h}{s}{mu} &&= \dL{o\_}{s}{mu} \cdot \vTp{Vo}{uu} \\
        & &&+ \dL{g\_}{s}{mu} \cdot \vTp{Vg}{uu} \\
        & &&+ \dL{i\_}{s}{mu} \cdot \vTp{Vi}{uu} \\
        & &&+ \dL{f\_}{s}{mu} \cdot \vTp{Vf}{uu} \tag{11s} \\ \\
        & \delta^{k~}_{me} &&= \dL{x}{s}{me} = \frac{\partial \mathcal{L}}{\partial a^{\km~}_{me}} \\
        & &&= \dL{o\_}{s}{mu} \cdot \vTp{Uo}{vu} \\
        & &&+ \dL{g\_}{s}{mu} \cdot \vTp{Ug}{vu} \\
        & &&+ \dL{i\_}{s}{mu} \cdot \vTp{Ui}{vu} \\
        & &&+ \dL{f\_}{s}{mu} \cdot \vTp{Uf}{vu} \tag{12s}
    \end{alignat*}
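One backward step can be sketched in the same spirit, following the step descriptions above with explicit *sigmoid* and *tanh* derivatives. The function name and the cache layout are hypothetical; the actual implementation is the ``lstm_backward`` listing above.

.. code-block:: python

    import numpy as np

    def lstm_step_backward(dA, dhn, dCn, cache, P):
        """Gradients for one LSTM step, iterating over the reversed sequence.

        dA    : gradient w.r.t. the hidden state output of this step, (m, u).
        dhn   : gradient w.r.t. h coming from the next step, (m, u).
        dCn   : gradient w.r.t. C coming from the next step, (m, u).
        cache : forward quantities for this step (Cp_, f, i, g, o, C_, C).
        P     : parameters {Uf, Vf, Ui, Vi, Ug, Vg, Uo, Vo}.
        """
        Cp_, f, i, g, o, C_, C = cache

        dh_ = dA + dhn                                   # (4s)
        do_ = dh_ * C * o * (1 - o)                      # (5s) sigmoid'(o_) = o * (1 - o)
        dC_ = dh_ * o * (1 - np.tanh(C_) ** 2) + dCn     # (6s) tanh'(C_)
        dg_ = dC_ * i * (1 - g ** 2)                     # (7s) tanh'(g_) = 1 - g**2
        di_ = dC_ * g * i * (1 - i)                      # (8s)
        df_ = dC_ * Cp_ * f * (1 - f)                    # (9s)

        dC = dC_ * f                                     # (10s) passed to the previous step
        dh = (np.dot(do_, P['Vo'].T) + np.dot(dg_, P['Vg'].T)
              + np.dot(di_, P['Vi'].T) + np.dot(df_, P['Vf'].T))    # (11s)
        dX = (np.dot(do_, P['Uo'].T) + np.dot(dg_, P['Ug'].T)
              + np.dot(di_, P['Ui'].T) + np.dot(df_, P['Uf'].T))    # (12s)

        return dX, dh, dC, (do_, dg_, di_, df_)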
Gradients
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.lstm.models.LSTM.compute_gradients

.. literalinclude:: ./../epynn/lstm/parameters.py
   :pyobject: lstm_compute_gradients
   :language: python

The function used to compute parameter gradients in a *LSTM* layer *k* includes:

* (1.1): *dUo* is the gradient of the loss with respect to *Uo*. It is computed as a sum over steps *s* of the dot products between the transpose of *X* and *do\_*.
* (1.2): *dVo* is the gradient of the loss with respect to *Vo*. It is computed as a sum over steps *s* of the dot products between the transpose of *hp* and *do\_*.
* (1.3): *dbo* is the gradient of the loss with respect to *bo*. It is computed as a sum over steps *s* of the sum of *do\_* along the axis corresponding to the number of samples *m*.
* (2.1): *dUi* is the gradient of the loss with respect to *Ui*. It is computed as a sum over steps *s* of the dot products between the transpose of *X* and *di\_*.
* (2.2): *dVi* is the gradient of the loss with respect to *Vi*. It is computed as a sum over steps *s* of the dot products between the transpose of *hp* and *di\_*.
* (2.3): *dbi* is the gradient of the loss with respect to *bi*. It is computed as a sum over steps *s* of the sum of *di\_* along the axis corresponding to the number of samples *m*.
* (3.1): *dUg* is the gradient of the loss with respect to *Ug*. It is computed as a sum over steps *s* of the dot products between the transpose of *X* and *dg\_*.
* (3.2): *dVg* is the gradient of the loss with respect to *Vg*. It is computed as a sum over steps *s* of the dot products between the transpose of *hp* and *dg\_*.
* (3.3): *dbg* is the gradient of the loss with respect to *bg*. It is computed as a sum over steps *s* of the sum of *dg\_* along the axis corresponding to the number of samples *m*.
* (4.1): *dUf* is the gradient of the loss with respect to *Uf*. It is computed as a sum over steps *s* of the dot products between the transpose of *X* and *df\_*.
* (4.2): *dVf* is the gradient of the loss with respect to *Vf*. It is computed as a sum over steps *s* of the dot products between the transpose of *hp* and *df\_*.
* (4.3): *dbf* is the gradient of the loss with respect to *bf*. It is computed as a sum over steps *s* of the sum of *df\_* along the axis corresponding to the number of samples *m*.

Note that:

* The same logical operation is applied to each set *{dU, dV, db}*.
* This logical operation is identical to the one described for the *RNN* layer.

A code sketch of this accumulation is given after the equations below.

.. math::

    \begin{align}
        & \dLp{Uo}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{o\_}{s}{mu} \tag{1.1} \\
        & \dLp{Vo}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{o\_}{s}{mu} \tag{1.2} \\
        & \dLp{bo}{u} &&= \sumS \sumM \dL{o\_}{s}{mu} \tag{1.3} \\ \\
        & \dLp{Ui}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{i\_}{s}{mu} \tag{2.1} \\
        & \dLp{Vi}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{i\_}{s}{mu} \tag{2.2} \\
        & \dLp{bi}{u} &&= \sumS \sumM \dL{i\_}{s}{mu} \tag{2.3} \\ \\
        & \dLp{Ug}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{g\_}{s}{mu} \tag{3.1} \\
        & \dLp{Vg}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{g\_}{s}{mu} \tag{3.2} \\
        & \dLp{bg}{u} &&= \sumS \sumM \dL{g\_}{s}{mu} \tag{3.3} \\ \\
        & \dLp{Uf}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{f\_}{s}{mu} \tag{4.1} \\
        & \dLp{Vf}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{f\_}{s}{mu} \tag{4.2} \\
        & \dLp{bf}{u} &&= \sumS \sumM \dL{f\_}{s}{mu} \tag{4.3} \\
    \end{align}
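The accumulation of *{dU, dV, db}* for one gate can be sketched as follows, assuming the per-step inputs, previous hidden states and gate gradients were stored during propagation. Names are illustrative, not the layer's actual attributes.

.. code-block:: python

    import numpy as np

    def accumulate_gate_gradients(X_steps, hp_steps, d_steps, u):
        """Accumulate {dU, dV, db} for one gate over all steps.

        X_steps  : per-step inputs, each of shape (m, e).
        hp_steps : per-step previous hidden states, each of shape (m, u).
        d_steps  : per-step gate gradients (e.g. do_), each of shape (m, u).
        """
        e = X_steps[0].shape[1]

        dU = np.zeros((e, u))
        dV = np.zeros((u, u))
        db = np.zeros((1, u))

        for X, hp, d in zip(X_steps, hp_steps, d_steps):
            dU += np.dot(X.T, d)       # (x.1) sum over steps of X^T . d
            dV += np.dot(hp.T, d)      # (x.2) sum over steps of hp^T . d
            db += np.sum(d, axis=0)    # (x.3) sum over steps and samples

        return dU, dV, db

The same accumulation is repeated for each of the four gates, using *do\_*, *di\_*, *dg\_* and *df\_* respectively.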
Live examples
------------------------------

* `Dummy string - LSTM-Dense`_
* `Protein Modification - LSTM-Dense`_
* `Protein Modification - LSTM(sequence=True)-Flatten-Dense`_
* `Protein Modification - LSTM(sequence=True)-Flatten-(Dense)n with Dropout`_

You may also like to browse all `Network training examples`_ provided with EpyNN.

.. _Network training examples: run_examples.html
.. _Dummy string - LSTM-Dense: epynnlive/dummy_string/train.html#LSTM-Dense
.. _Protein Modification - LSTM-Dense: epynnlive/ptm_protein/train.html#LSTM-Dense
.. _Protein Modification - LSTM(sequence\=True)-Flatten-Dense: epynnlive/ptm_protein/train.html#LSTM(sequence=True)-Flatten-Dense
.. _Protein Modification - LSTM(sequence\=True)-Flatten-(Dense)n with Dropout: epynnlive/ptm_protein/train.html#LSTM(sequence=True)-Flatten-(Dense)n-with-Dropout