Long Short-Term Memory (LSTM)
===============================

Source files in ``EpyNN/epynn/lstm/``.

See `Appendix - Notations`_ for mathematical conventions.

.. _Appendix - Notations: glossary.html#notations

Layer architecture
------------------------------

.. image:: _static/LSTM/lstm0-01.svg
   :alt: LSTM

A Long Short-Term Memory or *LSTM* layer is an object containing a number of *units* - sometimes referred to as cells - and provided with functions for parameters *initialization* and non-linear *activation* of the so-called memory state *C*. The latter is a variable used to compute the hidden state *h*.

Both the hidden state *h* and the memory state *C* are computed from *gate* products, namely the *forget* gate product *f*, the *input* gate products *i* and *g*, and the *output* gate product *o*. Each of these products requires a non-linear activation function to be computed.

.. autoclass:: epynn.lstm.models.LSTM
   :show-inheritance:

Shapes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.lstm.models.LSTM.compute_shapes

.. literalinclude:: ./../epynn/lstm/parameters.py
   :pyobject: lstm_compute_shapes
   :language: python

Within a *LSTM* layer, shapes of interest include:

* Input *X* of shape *(m, s, e)* with *m* the number of samples, *s* the number of steps in the sequence and *e* the number of elements within each step of the sequence.
* Weights *U* and *V* of shape *(e, u)* and *(u, u)*, respectively, with *e* the number of elements within each step of the sequence and *u* the number of units in the layer.
* Bias *b* of shape *(1, u)* with *u* the number of units in the layer.
* Hidden state *h* and memory state *C*, each of shape *(m, 1, u)* or *(m, u)* with *m* the number of samples and *u* the number of units in the layer. Because one hidden state *h* and one memory state *C* are computed for each step in the sequence, the arrays containing all hidden or memory states across sequence steps have shape *(m, s, u)* with *s* the number of steps in the sequence.

Note that:

* The shapes of parameters *V*, *U* and *b* are independent of the number of samples *m* and of the number of steps in the sequence *s*.
* There are four sets of parameters *{V, U, b}*, one for each gate activation: *{Vf, Uf, bf}* for the forget gate, *{Vi, Ui, bi}* and *{Vg, Ug, bg}* for the input gate products *i* and *g*, respectively, and *{Vo, Uo, bo}* for the output gate.
* Recurrent layers, including the *LSTM* layer, are considered appropriate for inputs of variable length because the definition of parameters is independent of the input length *s*.

These shapes are also summarized in the code sketch below.

.. image:: _static/LSTM/lstm1-01.svg
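For reference, the shapes listed above can be summarized in a minimal NumPy sketch. The dimension sizes and variable names below are illustrative placeholders, not attributes of the actual layer.

.. code-block:: python

    import numpy as np

    m, s, e, u = 32, 10, 20, 16    # samples, steps, elements per step, units

    X = np.zeros((m, s, e))        # layer input

    U = np.zeros((e, u))           # input-to-unit weights (one such array per gate)
    V = np.zeros((u, u))           # hidden-to-unit weights (one such array per gate)
    b = np.zeros((1, u))           # bias (one such array per gate)

    h = np.zeros((m, s, u))        # hidden states for every step in the sequence
    C = np.zeros((m, s, u))        # memory states for every step in the sequence

    h_last = h[:, -1]              # shape (m, u): default layer output (sequences=False)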
Forward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.lstm.models.LSTM.forward

.. literalinclude:: ./../epynn/lstm/forward.py
   :pyobject: lstm_forward
   :language: python

.. image:: _static/LSTM/lstm2-01.svg

The forward propagation function in a *LSTM* layer *k* includes:

* (1): Input *X* in the current layer *k* is equal to the output *A* of the previous layer *k-1*. The initial hidden state *h* is a zero array.
* (2s): For each step, the input *X* of the current iteration is retrieved by indexing the layer input of shape *(m, s, e)* to obtain the input for the step, of shape *(m, e)*.
* (3s): The previous hidden state *hp* and memory state *Cp\_* are retrieved at the beginning of each iteration in the sequence from the hidden state *h* and memory state *C\_* computed at the end of the previous iteration (9s, 8.1s). Note that *hp* went through non-linear activation while *Cp\_* is a linear product.
* (4s): The forget gate linear product *f\_* is computed from the sum of the dot products between *X*, *Uf* and *hp*, *Vf*, to which *bf* is added. The non-linear product *f* is computed by applying the *activate_forget* function to *f\_*.
* (5s): The input gate linear product *i\_* is computed from the sum of the dot products between *X*, *Ui* and *hp*, *Vi*, to which *bi* is added. The non-linear product *i* is computed by applying the *activate_input* function to *i\_*.
* (6s): The input gate linear product *g\_* is computed from the sum of the dot products between *X*, *Ug* and *hp*, *Vg*, to which *bg* is added. The non-linear product *g* is computed by applying the *activate_candidate* function to *g\_*.
* (7s): The output gate linear product *o\_* is computed from the sum of the dot products between *X*, *Uo* and *hp*, *Vo*, to which *bo* is added. The non-linear product *o* is computed by applying the *activate_output* function to *o\_*.
* (8s): The memory state linear product *C\_* is the sum of the products between *Cp\_*, *f* and *i*, *g*. The non-linear activation *C* is obtained by applying the *activate* function to *C\_*.
* (9s): The hidden state *h* is the product between the output gate product *o* and the memory state *C* activated through non-linearity.

Note that:

* The non-linear activation function for *C* and *g* is generally the *tanh* function. While it can technically be any function, caution is advised when using a function other than *tanh*.
* The non-linear activation function for *f*, *i* and *o* is generally the *sigmoid* function. While it can technically be any function, caution is advised when using a function other than *sigmoid*.
* The concatenated array of hidden states *h* has shape *(m, s, u)*. By default, the *LSTM* layer returns the hidden state corresponding to the last step in the input sequence, of shape *(m, u)*. If the *sequences* argument is set to *True* when instantiating the *LSTM* layer, then it returns the whole array of hidden states, of shape *(m, s, u)*.
* For the sake of code homogeneity, the output of the *LSTM* layer is *A*, which is equal to *h*.

A code sketch of one forward step is given after the equations below.

.. math::

    \begin{alignat*}{2}
        & x^{k}_{mse} &&= a^{\km}_{mse} \tag{1} \\ \\
        & x^{k~}_{me} &&= x^{k}_{mse}[:,s] \tag{2s} \\ \\
        & h^{k~<\sm>}_{mu} &&= hp^{k}_{msu}[:,s] \tag{3.1s} \\
        & C\_^{k~<\sm>}_{mu} &&= Cp\_^{k}_{msu}[:,s] \tag{3.2s} \\ \\
        & f\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Uf^{k}_{vu} \\
        & &&+ h^{k~<\sm>}_{mu} \cdot Vf^{k}_{uu} \\
        & &&+ bf^{k}_{u} \tag{4.1s} \\
        & f^{k~}_{mu} &&= f_{act}(f\_^{k~}_{mu}) \tag{4.2s} \\ \\
        & i\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Ui^{k}_{vu} \\
        & &&+ h^{k~<\sm>}_{mu} \cdot Vi^{k}_{uu} \\
        & &&+ bi^{k}_{u} \tag{5.1s} \\
        & i^{k~}_{mu} &&= i_{act}(i\_^{k~}_{mu}) \tag{5.2s} \\ \\
        & g\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Ug^{k}_{vu} \\
        & &&+ h^{k~<\sm>}_{mu} \cdot Vg^{k}_{uu} \\
        & &&+ bg^{k}_{u} \tag{6.1s} \\
        & g^{k~}_{mu} &&= g_{act}(g\_^{k~}_{mu}) \tag{6.2s} \\ \\
        & o\_^{k~}_{mu} &&= x^{k~}_{me} \cdot Uo^{k}_{vu} \\
        & &&+ h^{k~<\sm>}_{mu} \cdot Vo^{k}_{uu} \\
        & &&+ bo^{k}_{u} \tag{7.1s} \\
        & o^{k~}_{mu} &&= o_{act}(o\_^{k~}_{mu}) \tag{7.2s} \\ \\
        & \gl{C\_} &&= \glm{C\_} * \gl{f} \\
        & &&+ \gl{i} * \gl{g} \tag{8.1s} \\
        & \gl{C} &&= C_{act}(C\_^{k~}_{mu}) \tag{8.2s} \\ \\
        & \gl{h} &&= \gl{o} * \gl{C} \tag{9s} \\
    \end{alignat*}
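As a complement to the equations above, one forward step can be sketched with plain NumPy. This is a minimal illustration of steps (4s) to (9s), assuming the typical *sigmoid* and *tanh* activations; the function and parameter names (``lstm_step``, the dictionary ``P``) are hypothetical and do not belong to the EpyNN implementation shown in the listing above.

.. code-block:: python

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def lstm_step(X, hp, Cp_, P):
        """One LSTM forward step on input X of shape (m, e).

        hp, Cp_ : previous hidden state and previous linear memory state, (m, u).
        P : dict of parameters {Uf, Vf, bf, Ui, Vi, bi, Ug, Vg, bg, Uo, Vo, bo}.
        """
        f = sigmoid(np.dot(X, P['Uf']) + np.dot(hp, P['Vf']) + P['bf'])   # (4s) forget gate
        i = sigmoid(np.dot(X, P['Ui']) + np.dot(hp, P['Vi']) + P['bi'])   # (5s) input gate
        g = np.tanh(np.dot(X, P['Ug']) + np.dot(hp, P['Vg']) + P['bg'])   # (6s) candidate
        o = sigmoid(np.dot(X, P['Uo']) + np.dot(hp, P['Vo']) + P['bo'])   # (7s) output gate

        C_ = Cp_ * f + i * g    # (8.1s) linear memory state
        C = np.tanh(C_)         # (8.2s) activated memory state
        h = o * C               # (9s)   hidden state

        return h, C_, C

Iterating this function over the *s* steps of the sequence, carrying *h* and *C\_* from one step to the next, mirrors the loop described above.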
Backward
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.lstm.models.LSTM.backward

.. literalinclude:: ./../epynn/lstm/backward.py
   :pyobject: lstm_backward
   :language: python

.. image:: _static/LSTM/lstm3-01.svg

The backward propagation function in a *LSTM* layer *k* includes:

* (1): *dA* is the gradient of the loss with respect to the output of forward propagation *A* for the current layer *k*. It is equal to the gradient of the loss with respect to the input of forward propagation for the next layer *k+1*. The initial gradient for the hidden state *dh* is a zero array.
* (2s): For each step in the reversed sequence, the input *dA* of the current iteration is retrieved by indexing the input of shape *(m, s, u)* to obtain the input for the step, of shape *(m, u)*.
* (3s): The next gradients of the loss with respect to the memory state *dCn* and the hidden state *dhn* are retrieved at the beginning of each iteration from the counterparts *dC* and *dh* computed at the end of the previous iteration (10s, 11s).
* (4s): *dh\_* is the sum of *dA* and *dhn*.
* (5s): *do\_* is the gradient of the loss with respect to *o\_* for the current step. It is the product between *dh\_*, the memory state *C* and the derivative of the *activate_output* function applied to *o\_*.
* (6s): *dC\_* is the gradient of the loss with respect to *C\_* for the current step. It is the product between *dh\_*, the output gate product *o* and the derivative of the *activate* function applied to *C\_*. Finally, *dCn* is added to the product.
* (7s): *dg\_* is the gradient of the loss with respect to *g\_* for the current step. It is the product between *dC\_*, the input gate product *i* and the derivative of the *activate_candidate* function applied to *g\_*.
* (8s): *di\_* is the gradient of the loss with respect to *i\_* for the current step. It is the product between *dC\_*, the input gate product *g* and the derivative of the *activate_input* function applied to *i\_*.
* (9s): *df\_* is the gradient of the loss with respect to *f\_* for the current step. It is the product between *dC\_*, the previous linear memory state *Cp\_* and the derivative of the *activate_forget* function applied to *f\_*.
* (10s): *dC* is the gradient of the loss with respect to the memory state *C* for the current step. It is the product between *dC\_* and the forget gate product *f*.
* (11s): *dh* is the gradient of the loss with respect to the hidden state *h* for the current step. It is the sum of the dot products between: *do\_* and the transpose of *Vo*; *di\_* and the transpose of *Vi*; *dg\_* and the transpose of *Vg*; *df\_* and the transpose of *Vf*.
* (12s): *dX* is the gradient of the loss with respect to the input of forward propagation *X* for the current step and current layer *k*. It is the sum of the dot products between: *do\_* and the transpose of *Uo*; *di\_* and the transpose of *Ui*; *dg\_* and the transpose of *Ug*; *df\_* and the transpose of *Uf*.

Note that:

* In contrast to the forward propagation, the iteration proceeds over the *reversed sequence*.
* In the default *LSTM* configuration, with *sequences* set to *False*, the output of forward propagation has shape *(m, u)* and so does the input of backward propagation. In the function :py:func:`epynn.lstm.backward.initialize_backward` this input is converted into a zero array of shape *(m, s, u)* whose coordinates *[:, -1, :]* are set equal to *dA* of shape *(m, u)*.

A code sketch of one backward step is given after the equations below.

.. math::

    \begin{alignat*}{2}
        & \delta^{\kp}_{msu} &&= \frac{\partial \mathcal{L}}{\partial a^{k}_{msu}} = \frac{\partial \mathcal{L}}{\partial x^{\kp}_{msu}} \tag{1} \\ \\
        & \delta^{\kp{}}_{mu} &&= \delta^{\kp}_{msu}[:, s] \tag{2s} \\ \\
        & \dL{h}{\sp}{mu} &&= \dL{hn}{s}{mu}[:,s] \tag{3.1s} \\
        & \dL{C}{\sp}{mu} &&= \dL{Cn}{s}{mu}[:,s] \tag{3.2s} \\ \\
        & \dL{h\_}{s}{mu} &&= \delta^{\kp{}}_{mu} + \dL{h}{\sp}{mu} \tag{4s} \\ \\
        & \dL{o\_}{s}{mu} &&= \dL{h\_}{s}{mu} \\
        & &&* \gl{C\_} \\
        & &&* o_{act}'(o\_^{k~}_{mu}) \tag{5s} \\ \\
        & \dL{C\_}{s}{mu} &&= \dL{h\_}{s}{mu} \\
        & &&* \gl{o\_} \\
        & &&* C_{act}'(C\_^{k~}_{mu}) \\
        & &&+ \dL{C}{\sp}{mu} \tag{6s} \\ \\
        & \dL{g\_}{s}{mu} &&= \dL{C\_}{s}{mu} \\
        & &&* \gl{i} \\
        & &&* g_{act}'(g\_^{k~}_{mu}) \tag{7s} \\ \\
        & \dL{i\_}{s}{mu} &&= \dL{C\_}{s}{mu} \\
        & &&* \gl{g} \\
        & &&* i_{act}'(i\_^{k~}_{mu}) \tag{8s} \\ \\
        & \dL{f\_}{s}{mu} &&= \dL{C\_}{s}{mu} \\
        & &&* \gl{Cp\_} \\
        & &&* f_{act}'(f\_^{k~}_{mu}) \tag{9s} \\ \\
        & \dL{C}{s}{mu} &&= \dL{C\_}{s}{mu} \\
        & &&* \gl{f} \tag{10s} \\ \\
        & \dL{h}{s}{mu} &&= \dL{o\_}{s}{mu} \cdot \vTp{Vo}{uu} \\
        & &&+ \dL{g\_}{s}{mu} \cdot \vTp{Vg}{uu} \\
        & &&+ \dL{i\_}{s}{mu} \cdot \vTp{Vi}{uu} \\
        & &&+ \dL{f\_}{s}{mu} \cdot \vTp{Vf}{uu} \tag{11s} \\ \\
        & \delta^{k~}_{me} &&= \dL{x}{s}{me} = \frac{\partial \mathcal{L}}{\partial a^{\km~}_{me}} \\
        & &&= \dL{o\_}{s}{mu} \cdot \vTp{Uo}{vu} \\
        & &&+ \dL{g\_}{s}{mu} \cdot \vTp{Ug}{vu} \\
        & &&+ \dL{i\_}{s}{mu} \cdot \vTp{Ui}{vu} \\
        & &&+ \dL{f\_}{s}{mu} \cdot \vTp{Uf}{vu} \tag{12s}
    \end{alignat*}
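One backward step can be sketched in the same spirit, following the step descriptions above with explicit *sigmoid* and *tanh* derivatives. The function name and the cache layout are hypothetical; the actual implementation is the ``lstm_backward`` listing above.

.. code-block:: python

    import numpy as np

    def lstm_step_backward(dA, dhn, dCn, cache, P):
        """Gradients for one LSTM step, iterating over the reversed sequence.

        dA    : gradient w.r.t. the hidden state output of this step, (m, u).
        dhn   : gradient w.r.t. h coming from the next step, (m, u).
        dCn   : gradient w.r.t. C coming from the next step, (m, u).
        cache : forward quantities for this step (Cp_, f, i, g, o, C_, C).
        P     : parameters {Uf, Vf, Ui, Vi, Ug, Vg, Uo, Vo}.
        """
        Cp_, f, i, g, o, C_, C = cache

        dh_ = dA + dhn                                   # (4s)
        do_ = dh_ * C * o * (1 - o)                      # (5s) sigmoid'(o_) = o * (1 - o)
        dC_ = dh_ * o * (1 - np.tanh(C_) ** 2) + dCn     # (6s) tanh'(C_)
        dg_ = dC_ * i * (1 - g ** 2)                     # (7s) tanh'(g_) = 1 - g**2
        di_ = dC_ * g * i * (1 - i)                      # (8s)
        df_ = dC_ * Cp_ * f * (1 - f)                    # (9s)

        dC = dC_ * f                                     # (10s) passed to the previous step
        dh = (np.dot(do_, P['Vo'].T) + np.dot(dg_, P['Vg'].T)
              + np.dot(di_, P['Vi'].T) + np.dot(df_, P['Vf'].T))    # (11s)
        dX = (np.dot(do_, P['Uo'].T) + np.dot(dg_, P['Ug'].T)
              + np.dot(di_, P['Ui'].T) + np.dot(df_, P['Uf'].T))    # (12s)

        return dX, dh, dC, (do_, dg_, di_, df_)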
Gradients
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automethod:: epynn.lstm.models.LSTM.compute_gradients

.. literalinclude:: ./../epynn/lstm/parameters.py
   :pyobject: lstm_compute_gradients
   :language: python

The function used to compute parameter gradients in a *LSTM* layer *k* includes:

* (1.1): *dUo* is the gradient of the loss with respect to *Uo*. It is computed as a sum over steps *s* of the dot products between the transpose of *X* and *do\_*.
* (1.2): *dVo* is the gradient of the loss with respect to *Vo*. It is computed as a sum over steps *s* of the dot products between the transpose of *hp* and *do\_*.
* (1.3): *dbo* is the gradient of the loss with respect to *bo*. It is computed as a sum over steps *s* of the sum of *do\_* along the axis corresponding to the number of samples *m*.
* (2.1): *dUi* is the gradient of the loss with respect to *Ui*. It is computed as a sum over steps *s* of the dot products between the transpose of *X* and *di\_*.
* (2.2): *dVi* is the gradient of the loss with respect to *Vi*. It is computed as a sum over steps *s* of the dot products between the transpose of *hp* and *di\_*.
* (2.3): *dbi* is the gradient of the loss with respect to *bi*. It is computed as a sum over steps *s* of the sum of *di\_* along the axis corresponding to the number of samples *m*.
* (3.1): *dUg* is the gradient of the loss with respect to *Ug*. It is computed as a sum over steps *s* of the dot products between the transpose of *X* and *dg\_*.
* (3.2): *dVg* is the gradient of the loss with respect to *Vg*. It is computed as a sum over steps *s* of the dot products between the transpose of *hp* and *dg\_*.
* (3.3): *dbg* is the gradient of the loss with respect to *bg*. It is computed as a sum over steps *s* of the sum of *dg\_* along the axis corresponding to the number of samples *m*.
* (4.1): *dUf* is the gradient of the loss with respect to *Uf*. It is computed as a sum over steps *s* of the dot products between the transpose of *X* and *df\_*.
* (4.2): *dVf* is the gradient of the loss with respect to *Vf*. It is computed as a sum over steps *s* of the dot products between the transpose of *hp* and *df\_*.
* (4.3): *dbf* is the gradient of the loss with respect to *bf*. It is computed as a sum over steps *s* of the sum of *df\_* along the axis corresponding to the number of samples *m*.

Note that:

* The same logical operation is applied to each set *{dU, dV, db}*.
* This logical operation is identical to the one described for the *RNN* layer.

A code sketch of this accumulation is given after the equations below.

.. math::

    \begin{align}
        & \dLp{Uo}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{o\_}{s}{mu} \tag{1.1} \\
        & \dLp{Vo}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{o\_}{s}{mu} \tag{1.2} \\
        & \dLp{bo}{u} &&= \sumS \sumM \dL{o\_}{s}{mu} \tag{1.3} \\ \\
        & \dLp{Ui}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{i\_}{s}{mu} \tag{2.1} \\
        & \dLp{Vi}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{i\_}{s}{mu} \tag{2.2} \\
        & \dLp{bi}{u} &&= \sumS \sumM \dL{i\_}{s}{mu} \tag{2.3} \\ \\
        & \dLp{Ug}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{g\_}{s}{mu} \tag{3.1} \\
        & \dLp{Vg}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{g\_}{s}{mu} \tag{3.2} \\
        & \dLp{bg}{u} &&= \sumS \sumM \dL{g\_}{s}{mu} \tag{3.3} \\ \\
        & \dLp{Uf}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{f\_}{s}{mu} \tag{4.1} \\
        & \dLp{Vf}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{f\_}{s}{mu} \tag{4.2} \\
        & \dLp{bf}{u} &&= \sumS \sumM \dL{f\_}{s}{mu} \tag{4.3} \\
    \end{align}
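The accumulation of *{dU, dV, db}* for one gate can be sketched as follows, assuming the per-step inputs, previous hidden states and gate gradients were stored during propagation. Names are illustrative, not the layer's actual attributes.

.. code-block:: python

    import numpy as np

    def accumulate_gate_gradients(X_steps, hp_steps, d_steps, u):
        """Accumulate {dU, dV, db} for one gate over all steps.

        X_steps  : per-step inputs, each of shape (m, e).
        hp_steps : per-step previous hidden states, each of shape (m, u).
        d_steps  : per-step gate gradients (e.g. do_), each of shape (m, u).
        """
        e = X_steps[0].shape[1]

        dU = np.zeros((e, u))
        dV = np.zeros((u, u))
        db = np.zeros((1, u))

        for X, hp, d in zip(X_steps, hp_steps, d_steps):
            dU += np.dot(X.T, d)       # (x.1) sum over steps of X^T . d
            dV += np.dot(hp.T, d)      # (x.2) sum over steps of hp^T . d
            db += np.sum(d, axis=0)    # (x.3) sum over steps and samples

        return dU, dV, db

The same accumulation is repeated for each of the four gates, using *do\_*, *di\_*, *dg\_* and *df\_* respectively.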
Live examples
------------------------------

* `Dummy string - LSTM-Dense`_
* `Protein Modification - LSTM-Dense`_
* `Protein Modification - LSTM(sequence=True)-Flatten-Dense`_
* `Protein Modification - LSTM(sequence=True)-Flatten-(Dense)n with Dropout`_

You may also like to browse all `Network training examples`_ provided with EpyNN.

.. _Network training examples: run_examples.html
.. _Dummy string - LSTM-Dense: epynnlive/dummy_string/train.html#LSTM-Dense
.. _Protein Modification - LSTM-Dense: epynnlive/ptm_protein/train.html#LSTM-Dense
.. _Protein Modification - LSTM(sequence\=True)-Flatten-Dense: epynnlive/ptm_protein/train.html#LSTM(sequence=True)-Flatten-Dense
.. _Protein Modification - LSTM(sequence\=True)-Flatten-(Dense)n with Dropout: epynnlive/ptm_protein/train.html#LSTM(sequence=True)-Flatten-(Dense)n-with-Dropout