Long Short-Term Memory (LSTM)

Source files in EpyNN/epynn/lstm/.

See Appendix - Notations for mathematical conventions.

Layer architecture

LSTM

A Long Short-Term Memory (LSTM) layer is an object containing a number of units, sometimes referred to as cells, and provided with functions for parameter initialization and for non-linear activation of the so-called memory state C. The memory state is used to compute the hidden state h. Both the hidden state h and the memory state C are computed from gate products, namely the forget gate product f, the input gate products i and g, and the output gate product o. Each of these products is computed with its own non-linear activation function.

class epynn.lstm.models.LSTM(unit_cells=1, activate=<function tanh>, activate_output=<function sigmoid>, activate_candidate=<function tanh>, activate_input=<function sigmoid>, activate_forget=<function sigmoid>, initialization=<function orthogonal>, clip_gradients=False, sequences=False, se_hPars=None)[source]

Bases: epynn.commons.models.Layer

Definition of a LSTM layer prototype.

Parameters
  • unit_cells (int, optional) – Number of unit cells in LSTM layer, defaults to 1.

  • activate (function, optional) – Non-linear activation of hidden and memory states, defaults to tanh.

  • activate_output (function, optional) – Non-linear activation of output gate, defaults to sigmoid.

  • activate_candidate (function, optional) – Non-linear activation of candidate, defaults to tanh.

  • activate_input (function, optional) – Non-linear activation of input gate, defaults to sigmoid.

  • activate_forget (function, optional) – Non-linear activation of forget gate, defaults to sigmoid.

  • initialization (function, optional) – Weight initialization function for LSTM layer, defaults to orthogonal.

  • clip_gradients (bool, optional) – Whether to clip gradients, which may mitigate exploding gradients, defaults to False.

  • sequences (bool, optional) – If True, return the full sequence of hidden states; if False, return only the last hidden state. Defaults to False.

  • se_hPars (dict[str, str or float] or NoneType, optional) – Layer hyper-parameters, defaults to None and inherits from model.

Shapes

LSTM.compute_shapes(A)[source]

Is a wrapper for epynn.lstm.parameters.lstm_compute_shapes().

Parameters

A (numpy.ndarray) – Output of forward propagation from previous layer.

def lstm_compute_shapes(layer, A):
    """Compute forward shapes and dimensions from input for layer.
    """
    X = A    # Input of current layer

    layer.fs['X'] = X.shape    # (m, s, e)

    layer.d['m'] = layer.fs['X'][0]    # Number of samples (m)
    layer.d['s'] = layer.fs['X'][1]    # Steps in sequence (s)
    layer.d['e'] = layer.fs['X'][2]    # Elements per step (e)

    # Parameter Shapes             Unit cells (u)
    eu = (layer.d['e'], layer.d['u'])    # (e, u)
    uu = (layer.d['u'], layer.d['u'])    # (u, u)
    u1 = (1, layer.d['u'])               # (1, u)
    # Forget gate    Input gate       Candidate        Output gate
    layer.fs['Uf'] = layer.fs['Ui'] = layer.fs['Ug'] = layer.fs['Uo'] = eu
    layer.fs['Vf'] = layer.fs['Vi'] = layer.fs['Vg'] = layer.fs['Vo'] = uu
    layer.fs['bf'] = layer.fs['bi'] = layer.fs['bg'] = layer.fs['bo'] = u1

    # Shape of hidden (h) and memory (C) state with respect to steps (s)
    layer.fs['h'] = layer.fs['C'] = (layer.d['m'], layer.d['s'], layer.d['u'])

    return None

Within a LSTM layer, shapes of interest include:

  • Input X of shape (m, s, e) with m equal to the number of samples, s the number of steps in sequence and e the number of elements within each step of the sequence.

  • Weight U and V of shape (e, u) and (u, u), respectively, with e the number of elements within each step of the sequence and u the number of units in the layer.

  • Bias b of shape (1, u) with u the number of units in the layer.

  • Hidden h and memory C states, each of shape (m, 1, u) or (m, u) for a single step, with m the number of samples and u the number of units in the layer. Because one hidden state h and one memory state C are computed for each step in the sequence, the array containing all hidden or memory states has shape (m, s, u) with s the number of steps in the sequence.

Note that:

  • The shapes of parameters V, U and b are independent of the number of samples m and of the number of steps in the sequence s.

  • There are four sets of parameters {V, U, b}, one for each activation within gates: {Vf, Uf, bf} for the forget gate, {Vi, Ui, bi} and {Vg, Ug, bg} for the input gate products i and g, respectively, and {Vo, Uo, bo} for the output gate.

  • Recurrent layers, including the LSTM layer, are considered appropriate for inputs of variable length because the parameter definitions are independent of the input length s.
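The independence of parameter shapes from m and s can be illustrated with a minimal NumPy sketch. The helper name lstm_parameter_shapes and the numeric values below are hypothetical and are not part of EpyNN; only the shapes (e, u), (u, u) and (1, u) come from the text above.

```python
import numpy as np


def lstm_parameter_shapes(e, u):
    """Return the shapes of one {U, V, b} set (hypothetical helper)."""
    return {'U': (e, u), 'V': (u, u), 'b': (1, u)}


u = 4    # Number of units, arbitrary value for illustration

# Two inputs with different numbers of samples (m) and steps (s)
X_short = np.zeros((2, 3, 5))     # (m=2, s=3, e=5)
X_long = np.zeros((8, 10, 5))     # (m=8, s=10, e=5)

shapes_short = lstm_parameter_shapes(X_short.shape[2], u)
shapes_long = lstm_parameter_shapes(X_long.shape[2], u)

# Parameter shapes depend only on e and u
print(shapes_short == shapes_long)    # True
```

The same parameter set can therefore be applied to both inputs, since only e must match.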

_images/lstm1-01.svg

Forward

LSTM.forward(A)[source]

Is a wrapper for epynn.lstm.forward.lstm_forward().

Parameters

A (numpy.ndarray) – Output of forward propagation from previous layer.

Returns

Output of forward propagation for current layer.

Return type

numpy.ndarray

def lstm_forward(layer, A):
    """Forward propagate signal to next layer.
    """
    # (1) Initialize cache, hidden and memory states
    X, h, C_ = initialize_forward(layer, A)

    # Iterate over sequence steps
    for s in range(layer.d['s']):

        # (2s) Slice sequence (m, s, e) w.r.t. step
        X = layer.fc['X'][:, s]

        # (3s) Retrieve previous states
        hp = layer.fc['hp'][:, s] = h       # (3.1s) Hidden
        Cp_ = layer.fc['Cp_'][:, s] = C_    # (3.2s) Memory

        # (4s) Activate forget gate
        f_ = layer.fc['f_'][:, s] = (
            np.dot(X, layer.p['Uf'])
            + np.dot(hp, layer.p['Vf'])
            + layer.p['bf']
        )   # (4.1s)

        f = layer.fc['f'][:, s] = layer.activate_forget(f_)      # (4.2s)

        # (5s) Activate input gate
        i_ = layer.fc['i_'][:, s] = (
            np.dot(X, layer.p['Ui'])
            + np.dot(hp, layer.p['Vi'])
            + layer.p['bi']
        )   # (5.1s)

        i = layer.fc['i'][:, s] = layer.activate_input(i_)       # (5.2s)

        # (6s) Activate candidate
        g_ = layer.fc['g_'][:, s] = (
            np.dot(X, layer.p['Ug'])
            + np.dot(hp, layer.p['Vg'])
            + layer.p['bg']
        )   # (6.1s)

        g = layer.fc['g'][:, s] = layer.activate_candidate(g_)   # (6.2s)

        # (7s) Activate output gate
        o_ = layer.fc['o_'][:, s] = (
            np.dot(X, layer.p['Uo'])
            + np.dot(hp, layer.p['Vo'])
            + layer.p['bo']
        )   # (7.1s)

        o = layer.fc['o'][:, s] = layer.activate_output(o_)      # (7.2s)

        # (8s) Compute current memory state
        C_ = layer.fc['C_'][:, s] = (
            Cp_ * f
            + i * g
        )   # (8.1s)

        C = layer.fc['C'][:, s] = layer.activate(C_)             # (8.2s)

        # (9s) Compute current hidden state
        h = layer.fc['h'][:, s] = o * C

    # Return the last hidden state or the full sequence
    A = layer.fc['h'] if layer.sequences else layer.fc['h'][:, -1]

    return A    # To next layer
_images/lstm2-01.svg

The forward propagation function in a LSTM layer k includes:

  • (1): Input X in current layer k is equal to the output A of previous layer k-1. The initial hidden state h and memory state C_ are zero arrays.

  • (2s): For each step, input X of the current iteration is retrieved by indexing the layer input with shape (m, s, e) to obtain the input for step with shape (m, e).

  • (3s): The previous hidden hp and memory Cp_ states are retrieved at the beginning of each iteration in sequence from the hidden h and memory C_ states computed at the end of the previous iteration (9s, 8.1s). Note that hp went through non-linear activation while Cp_ is a linear product.

  • (4s): The forget gate linear product f_ is computed from the sum of the dot products between X, Uf and hp, Vf to which bf is added. The non-linear product f is computed by applying the activate_forget function on f_.

  • (5s): The input gate linear product i_ is computed from the sum of the dot products between X, Ui and hp, Vi to which bi is added. The non-linear product i is computed by applying the activate_input function on i_.

  • (6s): The candidate linear product g_ is computed from the sum of the dot products between X, Ug and hp, Vg to which bg is added. The non-linear product g is computed by applying the activate_candidate function on g_.

  • (7s): The output gate linear product o_ is computed from the sum of the dot products between X, Uo and hp, Vo to which bo is added. The non-linear product o is computed by applying the activate_output function on o_.

  • (8s): The linear memory state C_ is the sum of the element-wise products between Cp_, f and i, g. Non-linear activation yielding C is achieved by applying the activate function on C_.

  • (9s): The hidden state h is the element-wise product between the output gate product o and the memory state C activated through non-linearity.
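Steps (1) to (9s) can be sketched as a standalone NumPy function. This is an illustration under stated assumptions, not the EpyNN implementation: the names sigmoid and lstm_forward_sketch are hypothetical, the default activations (sigmoid for f, i and o; tanh for g and C) are hard-coded, parameters are drawn at random, and no cache is kept.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here for the f, i and o gates."""
    return 1.0 / (1.0 + np.exp(-x))


def lstm_forward_sketch(X, p):
    """Run steps (1) to (9s) over X of shape (m, s, e); return h of shape (m, s, u)."""
    m, s, e = X.shape
    u = p['Vf'].shape[0]

    h = np.zeros((m, u))      # (1) Initial hidden state
    C_ = np.zeros((m, u))     # (1) Initial linear memory state
    hs = np.zeros((m, s, u))  # Hidden states for all steps

    for t in range(s):
        x = X[:, t]                                          # (2s)
        hp, Cp_ = h, C_                                      # (3s)
        f = sigmoid(x @ p['Uf'] + hp @ p['Vf'] + p['bf'])    # (4s)
        i = sigmoid(x @ p['Ui'] + hp @ p['Vi'] + p['bi'])    # (5s)
        g = np.tanh(x @ p['Ug'] + hp @ p['Vg'] + p['bg'])    # (6s)
        o = sigmoid(x @ p['Uo'] + hp @ p['Vo'] + p['bo'])    # (7s)
        C_ = Cp_ * f + i * g                                 # (8.1s)
        C = np.tanh(C_)                                      # (8.2s)
        h = hs[:, t] = o * C                                 # (9s)

    return hs


# Random parameters and input, arbitrary dimensions for illustration
rng = np.random.default_rng(0)
m, s, e, u = 2, 3, 5, 4
p = {}
for gate in 'figo':
    p['U' + gate] = 0.1 * rng.standard_normal((e, u))
    p['V' + gate] = 0.1 * rng.standard_normal((u, u))
    p['b' + gate] = np.zeros((1, u))

X = rng.standard_normal((m, s, e))
hs = lstm_forward_sketch(X, p)
print(hs.shape)    # (2, 3, 4)
```

Because h is the product of a sigmoid output and a tanh output, every hidden state value lies strictly between -1 and 1 in this configuration.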

Note that:

  • The non-linear activation function for C and g is generally the tanh function. While any function can technically be used, departing from tanh should be a deliberate choice.

  • The non-linear activation function for f, i and o is generally the sigmoid function. While any function can technically be used, departing from sigmoid should be a deliberate choice.

  • The concatenated array of hidden states h has shape (m, s, u). By default, the LSTM layer returns the hidden state corresponding to the last step in the input sequence with shape (m, u). If the sequences argument is set to True when instantiating the LSTM layer, then it will return the whole array of hidden states with shape (m, s, u).

  • For the sake of code homogeneity, the output of the LSTM layer is A which is equal to h.
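The effect of the sequences argument on the output shape can be shown with a short sketch. The function layer_output is a hypothetical stand-in that mirrors the last statement of lstm_forward(); the array h is filled with placeholder values.

```python
import numpy as np

# Placeholder array of hidden states with shape (m, s, u)
h = np.arange(24, dtype=float).reshape(2, 3, 4)


def layer_output(h, sequences=False):
    """Return all hidden states or only those of the last step."""
    return h if sequences else h[:, -1]


print(layer_output(h).shape)                    # (2, 4)
print(layer_output(h, sequences=True).shape)    # (2, 3, 4)
```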

\[\begin{split}\begin{alignat*}{2} & x^{k}_{mse} &&= a^{\km}_{mse} \tag{1} \\ \\ & x^{k~<s>}_{me} &&= x^{k}_{mse}[:,s] \tag{2s} \\ \\ & h^{k~<\sm>}_{mu} &&= hp^{k}_{msu}[:,s] \tag{3.1s} \\ & C\_^{k~<\sm>}_{mu} &&= Cp\_^{k}_{msu}[:,s] \tag{3.2s} \\ \\ & f\_^{k~<s>}_{mu} &&= x^{k~<s>}_{me} \cdot Uf^{k}_{eu} \\ & &&+ h^{k~<\sm>}_{mu} \cdot Vf^{k}_{uu} \\ & &&+ bf^{k}_{u} \tag{4.1s} \\ & f^{k~<s>}_{mu} &&= f_{act}(f\_^{k~<s>}_{mu}) \tag{4.2s} \\ \\ & i\_^{k~<s>}_{mu} &&= x^{k~<s>}_{me} \cdot Ui^{k}_{eu} \\ & &&+ h^{k~<\sm>}_{mu} \cdot Vi^{k}_{uu} \\ & &&+ bi^{k}_{u} \tag{5.1s} \\ & i^{k~<s>}_{mu} &&= i_{act}(i\_^{k~<s>}_{mu}) \tag{5.2s} \\ \\ & g\_^{k~<s>}_{mu} &&= x^{k~<s>}_{me} \cdot Ug^{k}_{eu} \\ & &&+ h^{k~<\sm>}_{mu} \cdot Vg^{k}_{uu} \\ & &&+ bg^{k}_{u} \tag{6.1s} \\ & g^{k~<s>}_{mu} &&= g_{act}(g\_^{k~<s>}_{mu}) \tag{6.2s} \\ \\ & o\_^{k~<s>}_{mu} &&= x^{k~<s>}_{me} \cdot Uo^{k}_{eu} \\ & &&+ h^{k~<\sm>}_{mu} \cdot Vo^{k}_{uu} \\ & &&+ bo^{k}_{u} \tag{7.1s} \\ & o^{k~<s>}_{mu} &&= o_{act}(o\_^{k~<s>}_{mu}) \tag{7.2s} \\ \\ & \gl{C\_} &&= C\_^{k~<\sm>}_{mu} * \gl{f} \\ & &&+ \gl{i} * \gl{g} \tag{8.1s} \\ & \gl{C} &&= C_{act}(C\_^{k~<s>}_{mu}) \tag{8.2s} \\ \\ & \gl{h} &&= \gl{o} * \gl{C} \tag{9s} \\ \end{alignat*}\end{split}\]

Backward

LSTM.backward(dX)[source]

Is a wrapper for epynn.lstm.backward.lstm_backward().

Parameters

dX (numpy.ndarray) – Output of backward propagation from next layer.

Returns

Output of backward propagation for current layer.

Return type

numpy.ndarray

def lstm_backward(layer, dX):
    """Backward propagate error gradients to previous layer.
    """
    # (1) Initialize cache, hidden and memory state gradients
    dA, dh, dC = initialize_backward(layer, dX)

    # Reverse iteration over sequence steps
    for s in reversed(range(layer.d['s'])):

        # (2s) Slice sequence (m, s, u) w.r.t step
        dA = layer.bc['dA'][:, s]          # dL/dA

        # (3s) Gradient of the loss w.r.t. next states
        dhn = layer.bc['dhn'][:, s] = dh   # (3.1) dL/dhn
        dCn = layer.bc['dCn'][:, s] = dC   # (3.2) dL/dCn

        # (4s) Gradient of the loss w.r.t hidden state h_
        dh_ = layer.bc['dh_'][:, s] = (
            (dA + dhn)
        )   # dL/dh_

        # (5s) Gradient of the loss w.r.t memory state C_
        dC_ = layer.bc['dC_'][:, s] = (
            dh_
            * layer.fc['o'][:, s]
            * layer.activate(layer.fc['C_'][:, s], deriv=True)
            + dCn
        )   # dL/dC_

        # (6s) Gradient of the loss w.r.t output gate o_
        do_ = layer.bc['do_'][:, s] = (
            dh_
            * layer.fc['C'][:, s]
            * layer.activate_output(layer.fc['o_'][:, s], deriv=True)
        )   # dL/do_

        # (7s) Gradient of the loss w.r.t candidate g_
        dg_ = layer.bc['dg_'][:, s] = (
            dC_
            * layer.fc['i'][:, s]
            * layer.activate_candidate(layer.fc['g_'][:, s], deriv=True)
        )   # dL/dg_

        # (8s) Gradient of the loss w.r.t input gate i_
        di_ = layer.bc['di_'][:, s] = (
            dC_
            * layer.fc['g'][:, s]
            * layer.activate_input(layer.fc['i_'][:, s], deriv=True)
        )   # dL/di_

        # (9s) Gradient of the loss w.r.t forget gate f_
        df_ = layer.bc['df_'][:, s] = (
            dC_
            * layer.fc['Cp_'][:, s]
            * layer.activate_forget(layer.fc['f_'][:, s], deriv=True)
        )   # dL/df_

        # (10s) Gradient of the loss w.r.t memory state C
        dC = layer.bc['dC'][:, s] = (
            dC_
            * layer.fc['f'][:, s]
        )   # dL/dC

        # (11s) Gradient of the loss w.r.t hidden state h
        dh = layer.bc['dh'][:, s] = (
            np.dot(do_, layer.p['Vo'].T)
            + np.dot(dg_, layer.p['Vg'].T)
            + np.dot(di_, layer.p['Vi'].T)
            + np.dot(df_, layer.p['Vf'].T)
        )   # dL/dh

        # (12s) Gradient of the loss w.r.t. input X
        dX = layer.bc['dX'][:, s] = (
            np.dot(dg_, layer.p['Ug'].T)
            + np.dot(do_, layer.p['Uo'].T)
            + np.dot(di_, layer.p['Ui'].T)
            + np.dot(df_, layer.p['Uf'].T)
        )   # dL/dX

    dX = layer.bc['dX']

    return dX    # To previous layer
_images/lstm3-01.svg

The backward propagation function in a LSTM layer k includes:

  • (1): dA is the gradient of the loss with respect to the output of forward propagation A for current layer k. It is equal to the gradient of the loss with respect to the input of forward propagation for next layer k+1. The initial gradients dh and dC for the hidden and memory states are zero arrays.

  • (2s): For each step in the reversed sequence, input dA of the current iteration is retrieved by indexing the input with shape (m, s, u) to obtain the input for step with shape (m, u).

  • (3s): The next gradients of the loss with respect to memory dCn and hidden dhn states are retrieved at the beginning of each iteration from the counterpart dC and dh computed at the end of the previous iteration (10s, 11s).

  • (4s): dh_ is the sum of dA and dhn.

  • (5s): dC_ is the gradient of the loss with respect to C_ for the current step. It is the product between dh_, output gate product o and the derivative of the activate function applied on C_. Finally, dCn is added to the product.

  • (6s): do_ is the gradient of the loss with respect to o_ for the current step. It is the product between dh_, memory state C and the derivative of the activate_output function applied on o_.

  • (7s): dg_ is the gradient of the loss with respect to g_ for the current step. It is the product between dC_, input gate product i and the derivative of the activate_candidate function applied on g_.

  • (8s): di_ is the gradient of the loss with respect to i_ for the current step. It is the product between dC_, input gate product g and the derivative of the activate_input function applied on i_.

  • (9s): df_ is the gradient of the loss with respect to f_ for the current step. It is the product between dC_, previous and linear memory state Cp_ and the derivative of the activate_forget function applied on f_.

  • (10s): dC is the gradient of the loss with respect to memory state C for the current step. It is the product between dC_ and the forget gate product f.

  • (11s): dh is the gradient of the loss with respect to hidden state h for the current step. It is the sum of the dot products between: do_ and the transpose of Vo; di_ and the transpose of Vi; dg_ and the transpose of Vg; df_ and the transpose of Vf.

  • (12s): dX is the gradient of the loss with respect to the input of forward propagation X for current step and current layer k. It is the sum of the dot products between: do_ and the transpose of Uo; di_ and the transpose of Ui; dg_ and the transpose of Ug; df_ and the transpose of Uf.
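One way to convince oneself that the steps above chain together correctly is to compare the resulting dX against finite differences. The sketch below is a self-contained illustration and not the EpyNN code: all function names are hypothetical, the default sigmoid and tanh activations are hard-coded with their analytical derivatives, the full sequence of hidden states is treated as the output, and the loss is a toy scalar whose gradient with respect to h is a random array R.

```python
import numpy as np

GATES = 'figo'


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forward(X, p):
    """Forward pass that caches the arrays needed by the backward pass."""
    m, s, e = X.shape
    u = p['Vf'].shape[0]
    keys = ('Cp_', 'f', 'i', 'g', 'o', 'C', 'h')
    fc = {k: np.zeros((m, s, u)) for k in keys}
    h, C_ = np.zeros((m, u)), np.zeros((m, u))
    for t in range(s):
        x, Cp_ = X[:, t], C_
        z = {k: x @ p['U' + k] + h @ p['V' + k] + p['b' + k] for k in GATES}
        f, i, o = sigmoid(z['f']), sigmoid(z['i']), sigmoid(z['o'])
        g = np.tanh(z['g'])
        C_ = Cp_ * f + i * g
        C = np.tanh(C_)
        h = o * C
        for k, v in zip(keys, (Cp_, f, i, g, o, C, h)):
            fc[k][:, t] = v
    return fc


def backward(dA, fc, p):
    """Steps (2s) to (12s), with derivatives written for sigmoid and tanh."""
    m, s, u = dA.shape
    dX = np.zeros((m, s, p['Uf'].shape[0]))
    dh, dC = np.zeros((m, u)), np.zeros((m, u))
    for t in reversed(range(s)):
        Cp_, f, i, g, o, C = (fc[k][:, t] for k in ('Cp_', 'f', 'i', 'g', 'o', 'C'))
        dh_ = dA[:, t] + dh                    # (4s)
        dC_ = dh_ * o * (1 - C ** 2) + dC      # (5s)
        dz = {
            'o': dh_ * C * o * (1 - o),        # (6s)
            'g': dC_ * i * (1 - g ** 2),       # (7s)
            'i': dC_ * g * i * (1 - i),        # (8s)
            'f': dC_ * Cp_ * f * (1 - f),      # (9s)
        }
        dC = dC_ * f                           # (10s)
        dh = sum(dz[k] @ p['V' + k].T for k in GATES)          # (11s)
        dX[:, t] = sum(dz[k] @ p['U' + k].T for k in GATES)    # (12s)
    return dX


def loss(X, p, R):
    """Toy scalar loss whose gradient with respect to h is R."""
    return np.sum(forward(X, p)['h'] * R)


rng = np.random.default_rng(0)
m, s, e, u = 2, 3, 4, 3
p = {}
for k in GATES:
    p['U' + k] = 0.5 * rng.standard_normal((e, u))
    p['V' + k] = 0.5 * rng.standard_normal((u, u))
    p['b' + k] = 0.5 * rng.standard_normal((1, u))
X = rng.standard_normal((m, s, e))
R = rng.standard_normal((m, s, u))

dX = backward(R, forward(X, p), p)

# Central finite differences on every entry of X
eps = 1e-6
dX_num = np.zeros_like(X)
for idx in np.ndindex(*X.shape):
    X_plus, X_minus = X.copy(), X.copy()
    X_plus[idx] += eps
    X_minus[idx] -= eps
    dX_num[idx] = (loss(X_plus, p, R) - loss(X_minus, p, R)) / (2 * eps)

max_error = np.max(np.abs(dX - dX_num))
print(max_error < 1e-6)
```

Agreement between the analytical and numerical gradients indicates that the carried terms dh and dC are propagated consistently across steps.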

Note that:

  • In contrast to the forward propagation, we proceed by iterating over the reversed sequence.

  • In the default LSTM configuration with sequences set to False, the output of forward propagation has shape (m, u) and the input of backward propagation has shape (m, u). In the function epynn.lstm.backward.initialize_backward(), this input is converted into an array of shape (m, s, u) that is zero everywhere except at coordinates [:, -1, :], which are set equal to dA of shape (m, u).
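The conversion described for sequences set to False can be sketched as follows. The function pad_last_step is a hypothetical illustration of the idea and is not the body of initialize_backward().

```python
import numpy as np


def pad_last_step(dA_last, s):
    """Place dA of shape (m, u) at the last step of a zero array of shape (m, s, u)."""
    m, u = dA_last.shape
    dA = np.zeros((m, s, u))
    dA[:, -1] = dA_last
    return dA


dA_last = np.ones((2, 4))           # Gradient received from the next layer
dA = pad_last_step(dA_last, s=3)

print(dA.shape)            # (2, 3, 4)
print(dA[:, :-1].sum())    # 0.0
```

With this layout, the reversed loop over steps receives a non-zero dA only at its first iteration, and earlier steps are reached through dhn and dCn alone.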

\[\begin{split}\begin{alignat*}{2} & \delta^{\kp}_{msu} &&= \frac{\partial \mathcal{L}}{\partial a^{k}_{msu}} = \frac{\partial \mathcal{L}}{\partial x^{\kp}_{msu}} \tag{1} \\ \\ & \delta^{\kp{<s>}}_{mu} &&= \delta^{\kp}_{msu}[:, s] \tag{2s} \\ \\ & \dL{h}{\sp}{mu} &&= \dL{hn}{s}{mu}[:,s] \tag{3.1s} \\ & \dL{C}{\sp}{mu} &&= \dL{Cn}{s}{mu}[:,s] \tag{3.2s} \\ \\ & \dL{h\_}{s}{mu} &&= \delta^{\kp{<s>}}_{mu} + \dL{h}{\sp}{mu} \tag{4s} \\ \\ & \dL{C\_}{s}{mu} &&= \dL{h\_}{s}{mu} \\ & &&* \gl{o} \\ & &&* C_{act}'(C\_^{k~<s>}_{mu}) \\ & &&+ \dL{C}{\sp}{mu} \tag{5s} \\ \\ & \dL{o\_}{s}{mu} &&= \dL{h\_}{s}{mu} \\ & &&* \gl{C} \\ & &&* o_{act}'(o\_^{k~<s>}_{mu}) \tag{6s} \\ \\ & \dL{g\_}{s}{mu} &&= \dL{C\_}{s}{mu} \\ & &&* \gl{i} \\ & &&* g_{act}'(g\_^{k~<s>}_{mu}) \tag{7s} \\ \\ & \dL{i\_}{s}{mu} &&= \dL{C\_}{s}{mu} \\ & &&* \gl{g} \\ & &&* i_{act}'(i\_^{k~<s>}_{mu}) \tag{8s} \\ \\ & \dL{f\_}{s}{mu} &&= \dL{C\_}{s}{mu} \\ & &&* \gl{Cp\_} \\ & &&* f_{act}'(f\_^{k~<s>}_{mu}) \tag{9s} \\ \\ & \dL{C}{s}{mu} &&= \dL{C\_}{s}{mu} \\ & &&* \gl{f} \tag{10s} \\ \\ & \dL{h}{s}{mu} &&= \dL{o\_}{s}{mu} \cdot \vTp{Vo}{uu} \\ & &&+ \dL{g\_}{s}{mu} \cdot \vTp{Vg}{uu} \\ & &&+ \dL{i\_}{s}{mu} \cdot \vTp{Vi}{uu} \\ & &&+ \dL{f\_}{s}{mu} \cdot \vTp{Vf}{uu} \tag{11s} \\ \\ & \delta^{k~<s>}_{me} &&= \dL{x}{s}{me} = \frac{\partial \mathcal{L}}{\partial a^{\km~<s>}_{me}} \\ & &&= \dL{o\_}{s}{mu} \cdot \vTp{Uo}{eu} \\ & &&+ \dL{g\_}{s}{mu} \cdot \vTp{Ug}{eu} \\ & &&+ \dL{i\_}{s}{mu} \cdot \vTp{Ui}{eu} \\ & &&+ \dL{f\_}{s}{mu} \cdot \vTp{Uf}{eu} \tag{12s} \end{alignat*}\end{split}\]

Gradients

LSTM.compute_gradients()[source]

Is a wrapper for epynn.lstm.parameters.lstm_compute_gradients().

def lstm_compute_gradients(layer):
    """Compute gradients with respect to weight and bias for layer.
    """
    # Gradients initialization with respect to parameters
    for parameter in layer.p.keys():
        gradient = 'd' + parameter
        layer.g[gradient] = np.zeros_like(layer.p[parameter])

    # Reverse iteration over sequence steps
    for s in reversed(range(layer.d['s'])):

        X = layer.fc['X'][:, s]      # Input for current step
        hp = layer.fc['hp'][:, s]    # Previous hidden state

        # (1) Gradients of the loss with respect to U, V, b
        do_ = layer.bc['do_'][:, s]            # Gradient w.r.t output gate o_
        layer.g['dUo'] += np.dot(X.T, do_)     # (1.1) dL/dUo
        layer.g['dVo'] += np.dot(hp.T, do_)    # (1.2) dL/dVo
        layer.g['dbo'] += np.sum(do_, axis=0)  # (1.3) dL/dbo

        # (2) Gradients of the loss with respect to U, V, b
        dg_ = layer.bc['dg_'][:, s]            # Gradient w.r.t candidate g_
        layer.g['dUg'] += np.dot(X.T, dg_)     # (2.1) dL/dUg
        layer.g['dVg'] += np.dot(hp.T, dg_)    # (2.2) dL/dVg
        layer.g['dbg'] += np.sum(dg_, axis=0)  # (2.3) dL/dbg

        # (3) Gradients of the loss with respect to U, V, b
        di_ = layer.bc['di_'][:, s]            # Gradient w.r.t input gate i_
        layer.g['dUi'] += np.dot(X.T, di_)     # (3.1) dL/dUi
        layer.g['dVi'] += np.dot(hp.T, di_)    # (3.2) dL/dVi
        layer.g['dbi'] += np.sum(di_, axis=0)  # (3.3) dL/dbi

        # (4) Gradients of the loss with respect to U, V, b
        df_ = layer.bc['df_'][:, s]            # Gradient w.r.t forget gate f_
        layer.g['dUf'] += np.dot(X.T, df_)     # (4.1) dL/dUf
        layer.g['dVf'] += np.dot(hp.T, df_)    # (4.2) dL/dVf
        layer.g['dbf'] += np.sum(df_, axis=0)  # (4.3) dL/dbf

    return None

The function to compute parameters gradients in a LSTM layer k includes:

  • (1.1): dUo is the gradient of the loss with respect to Uo. It is computed as a sum with respect to step s of the dot products between the transpose of X and do_.

  • (1.2): dVo is the gradient of the loss with respect to Vo. It is computed as a sum with respect to step s of the dot products between the transpose of hp and do_.

  • (1.3): dbo is the gradient of the loss with respect to bo. It is computed as a sum with respect to step s of the sum of do_ along the axis corresponding to the number of samples m.

  • (2.1): dUg is the gradient of the loss with respect to Ug. It is computed as a sum with respect to step s of the dot products between the transpose of X and dg_.

  • (2.2): dVg is the gradient of the loss with respect to Vg. It is computed as a sum with respect to step s of the dot products between the transpose of hp and dg_.

  • (2.3): dbg is the gradient of the loss with respect to bg. It is computed as a sum with respect to step s of the sum of dg_ along the axis corresponding to the number of samples m.

  • (3.1): dUi is the gradient of the loss with respect to Ui. It is computed as a sum with respect to step s of the dot products between the transpose of X and di_.

  • (3.2): dVi is the gradient of the loss with respect to Vi. It is computed as a sum with respect to step s of the dot products between the transpose of hp and di_.

  • (3.3): dbi is the gradient of the loss with respect to bi. It is computed as a sum with respect to step s of the sum of di_ along the axis corresponding to the number of samples m.

  • (4.1): dUf is the gradient of the loss with respect to Uf. It is computed as a sum with respect to step s of the dot products between the transpose of X and df_.

  • (4.2): dVf is the gradient of the loss with respect to Vf. It is computed as a sum with respect to step s of the dot products between the transpose of hp and df_.

  • (4.3): dbf is the gradient of the loss with respect to bf. It is computed as a sum with respect to step s of the sum of df_ along the axis corresponding to the number of samples m.

Note that:

  • The same logical operation is applied to each set {dU, dV, db}.

  • This logical operation is identical to the one described for the RNN layer.
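The accumulation over steps for one set {dU, dV, db} can be sketched with NumPy and compared with a single contraction over samples and steps. The arrays below are random stand-ins for the cached values X, hp and do_, and the dimensions are arbitrary; this is an illustration, not the EpyNN code.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s, e, u = 2, 3, 5, 4

X = rng.standard_normal((m, s, e))      # Stand-in for cached inputs
hp = rng.standard_normal((m, s, u))     # Stand-in for previous hidden states
do_ = rng.standard_normal((m, s, u))    # Stand-in for gradient w.r.t. o_

dUo = np.zeros((e, u))
dVo = np.zeros((u, u))
dbo = np.zeros((1, u))

# Step-wise accumulation, as in (1.1) to (1.3)
for t in reversed(range(s)):
    dUo += X[:, t].T @ do_[:, t]
    dVo += hp[:, t].T @ do_[:, t]
    dbo += np.sum(do_[:, t], axis=0)

# The same sums written as single contractions over samples and steps
dUo_all = np.einsum('mse,msu->eu', X, do_)
dVo_all = np.einsum('msu,msv->uv', hp, do_)
dbo_all = do_.sum(axis=(0, 1))

print(np.allclose(dUo, dUo_all))    # True
print(np.allclose(dVo, dVo_all))    # True
print(np.allclose(dbo, dbo_all))    # True
```

The order of iteration over steps does not affect the result, because each gradient is a plain sum over s.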

\[\begin{split}\begin{align} & \dLp{Uo}{eu} &&= \sumS \vT{x}{s}{me} \cdot \dL{o\_}{s}{mu} \tag{1.1} \\ & \dLp{Vo}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{o\_}{s}{mu} \tag{1.2} \\ & \dLp{bo}{u} &&= \sumS \sumM \dL{o\_}{s}{mu} \tag{1.3} \\ \\ & \dLp{Ug}{eu} &&= \sumS \vT{x}{s}{me} \cdot \dL{g\_}{s}{mu} \tag{2.1} \\ & \dLp{Vg}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{g\_}{s}{mu} \tag{2.2} \\ & \dLp{bg}{u} &&= \sumS \sumM \dL{g\_}{s}{mu} \tag{2.3} \\ \\ & \dLp{Ui}{eu} &&= \sumS \vT{x}{s}{me} \cdot \dL{i\_}{s}{mu} \tag{3.1} \\ & \dLp{Vi}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{i\_}{s}{mu} \tag{3.2} \\ & \dLp{bi}{u} &&= \sumS \sumM \dL{i\_}{s}{mu} \tag{3.3} \\ \\ & \dLp{Uf}{eu} &&= \sumS \vT{x}{s}{me} \cdot \dL{f\_}{s}{mu} \tag{4.1} \\ & \dLp{Vf}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{f\_}{s}{mu} \tag{4.2} \\ & \dLp{bf}{u} &&= \sumS \sumM \dL{f\_}{s}{mu} \tag{4.3} \\ \end{align}\end{split}\]