Recurrent Neural Network (RNN)

Source files in EpyNN/epynn/rnn/.

See Appendix - Notations for mathematical conventions.

Layer architecture

RNN

A Recurrent Neural Network or RNN layer is an object containing a number of units - sometimes referred to as cells - and provided with functions for parameters initialization and non-linear activation of the so-called hidden state h.

class epynn.rnn.models.RNN(unit_cells=1, activate=<function tanh>, initialization=<function xavier>, clip_gradients=True, sequences=False, se_hPars=None)[source]

Bases: epynn.commons.models.Layer

Definition of a RNN layer prototype.

Parameters
  • unit_cells (int, optional) – Number of unit cells in RNN layer, defaults to 1.

  • activate (function, optional) – Non-linear activation of hidden state, defaults to tanh.

  • initialization (function, optional) – Weight initialization function for RNN layer, defaults to xavier.

  • clip_gradients (bool, optional) – May prevent exploding/vanishing gradients, defaults to True.

  • sequences (bool, optional) – Whether to return only the last hidden state or the full sequence, defaults to False.

  • se_hPars (dict[str, str or float] or NoneType, optional) – Layer hyper-parameters, defaults to None and inherits from model.

Shapes

RNN.compute_shapes(A)[source]

Wrapper for epynn.rnn.parameters.rnn_compute_shapes().

Parameters

A (numpy.ndarray) – Output of forward propagation from previous layer.

def rnn_compute_shapes(layer, A):
    """Compute forward shapes and dimensions from input for layer.
    """
    X = A    # Input of current layer

    layer.fs['X'] = X.shape    # (m, s, e)

    layer.d['m'] = layer.fs['X'][0]    # Number of samples (m)
    layer.d['s'] = layer.fs['X'][1]    # Steps in sequence (s)
    layer.d['e'] = layer.fs['X'][2]    # Elements per step (e)

    # Shapes for trainable parameters         Unit cells (u)
    layer.fs['U'] = (layer.d['e'], layer.d['u'])    # (e, u)
    layer.fs['V'] = (layer.d['u'], layer.d['u'])    # (u, u)
    layer.fs['b'] = (1, layer.d['u'])               # (1, u)

    # Shape of hidden state (h) with respect to steps (s)
    layer.fs['h'] = (layer.d['m'], layer.d['s'], layer.d['u'])

    return None

Within a RNN layer, shapes of interest include:

  • Input X of shape (m, s, e) with m equal to the number of samples, s the number of steps in sequence and e the number of elements within each step of the sequence.

  • Weight U and V of shape (e, u) and (u, u), respectively, with e the number of elements within each step of the sequence and u the number of units in the layer.

  • Bias b of shape (1, u) with u the number of units in the layer.

  • Hidden state h of shape (m, 1, u) or (m, u) with m equal to the number of samples and u the number of units in the layer. Because one hidden state is computed for each step in the sequence, the array containing all hidden states has shape (m, s, u) with s the number of steps in the sequence.

Note that:

  • The shapes of parameters V, U and b are independent of the number of samples m and the number of steps in the sequence s.

  • Recurrent layers, including the RNN layer, are considered appropriate to handle inputs of variable length because the parameter shapes do not depend on the input length s.
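
The independence of parameter shapes from m and s can be checked with a minimal sketch. The helper rnn_param_shapes() below is hypothetical and is not part of EpyNN; it only mirrors the shape arithmetic described above.

```python
import numpy as np

def rnn_param_shapes(X, u):
    """Hypothetical helper: shapes of U, V, b and of the hidden-state array.

    X has shape (m, s, e) and u is the number of units in the layer.
    """
    m, s, e = X.shape
    return {'U': (e, u), 'V': (u, u), 'b': (1, u), 'h': (m, s, u)}

# Two inputs with different numbers of samples (m) and steps (s), same e
short = rnn_param_shapes(np.zeros((4, 5, 3)), u=2)     # m=4, s=5, e=3
long = rnn_param_shapes(np.zeros((10, 50, 3)), u=2)    # m=10, s=50, e=3

# Trainable parameter shapes are identical; only h depends on m and s
print(short['U'], short['V'], short['b'])
print(short['h'], long['h'])
```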

_images/rnn1-01.svg

Forward

RNN.forward(A)[source]

Wrapper for epynn.rnn.forward.rnn_forward().

Parameters

A (numpy.ndarray) – Output of forward propagation from previous layer.

Returns

Output of forward propagation for current layer.

Return type

numpy.ndarray

def rnn_forward(layer, A):
    """Forward propagate signal to next layer.
    """
    # (1) Initialize cache and hidden state
    X, h = initialize_forward(layer, A)

    # Iterate over sequence steps
    for s in range(layer.d['s']):

        # (2s) Slice sequence (m, s, e) with respect to step
        X = layer.fc['X'][:, s]

        # (3s) Retrieve previous hidden state
        hp = layer.fc['hp'][:, s] = h

        # (4s) Activate current hidden state
        h_ = layer.fc['h_'][:, s] = (
            np.dot(X, layer.p['U'])
            + np.dot(hp, layer.p['V'])
            + layer.p['b']
        )   # (4.1s) Linear

        h = layer.fc['h'][:, s] = layer.activate(h_)   # (4.2s) Non-linear

    # Return the last hidden state or the full sequence
    A = layer.fc['h'] if layer.sequences else layer.fc['h'][:, -1]

    return A   # To next layer
_images/rnn2-01.svg

The forward propagation function in a RNN layer k includes:

  • (1): Input X in current layer k is equal to the output A of previous layer k-1. The initial hidden state h is a zero array.

  • (2s): For each step, input X of the current iteration is retrieved by indexing the layer input with shape (m, s, e) to obtain the input for step with shape (m, e).

  • (3s): The previous hidden state hp is retrieved at the beginning of each iteration in sequence from the hidden state h computed at the end of the previous iteration (4.2s).

  • (4.1s): The hidden state linear activation product h_ is equal to the sum of the dot products between (X of shape (m, e), U of shape (e, u)) and (hp of shape (m, u), V of shape (u, u)) to which is added the bias b.

  • (4.2s): The hidden state non-linear activation product h is computed by applying a non-linear activation function on h_.

Note that:

  • The non-linear activation function for h is generally the tanh function. While it can technically be any function, choosing one other than tanh should be done with caution.

  • The hidden state h for one step has shape (m, u) while it has shape (m, 1) for one step and one unit. The hidden state h is fed back with respect to each unit for each step in the sequence. This is the basis for the internal memory of recurrent layers.

  • The concatenated array of hidden states h has shape (m, s, u). By default, the RNN layer returns the hidden state corresponding to the last step in the input sequence with shape (m, u). If the sequences argument is set to True when instantiating the RNN layer, then it will return the whole array of hidden states with shape (m, s, u).

  • For the sake of code homogeneity, the output of the RNN layer is A which is equal to h.

Then:

\[\begin{split}\begin{alignat*}{2} & x^{k}_{mse} &&= a^{\km}_{mse} \tag{1} \\ \\ & x^{k~<s>}_{me} &&= x^{k}_{mse}[:, s] \tag{2s} \\ \\ & h^{k~<\sm>}_{mu} &&= hp^{k}_{msu}[:, s] \tag{3s} \\ \\ & h\_^{k~<s>}_{mu} &&= x^{k~<s>}_{me} \cdot U^{k}_{eu} \\ & &&+ h^{k~<\sm>}_{mu} \cdot V^{k}_{uu} \\ & &&+ b^{k}_{u} \tag{4.1s} \\ & h^{k~<s>}_{mu} &&= h_{act}(h\_^{k~<s>}_{mu}) \tag{4.2s} \\ \end{alignat*}\end{split}\]
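
The forward steps can also be reproduced outside the layer object. The following is a minimal sketch in plain NumPy, assuming tanh activation and a zero initial hidden state. The function rnn_forward_sketch() and the random test data are illustrative assumptions, not the EpyNN implementation.

```python
import numpy as np

def rnn_forward_sketch(X, U, V, b, sequences=False):
    """Hypothetical stand-alone forward pass for input X of shape (m, s, e)."""
    m, s, e = X.shape
    u = U.shape[1]
    h = np.zeros((m, u))         # (1) Initial hidden state is a zero array
    hs = np.zeros((m, s, u))     # All hidden states with respect to steps
    for t in range(s):
        x_t = X[:, t]                   # (2s) Input for step, shape (m, e)
        hp = h                          # (3s) Previous hidden state
        h_ = x_t @ U + hp @ V + b       # (4.1s) Linear
        h = np.tanh(h_)                 # (4.2s) Non-linear
        hs[:, t] = h
    # Return the full sequence or only the last hidden state
    return hs if sequences else hs[:, -1]

rng = np.random.default_rng(0)
m, s, e, u = 4, 5, 3, 2
X = rng.standard_normal((m, s, e))
U = rng.standard_normal((e, u))
V = rng.standard_normal((u, u))
b = np.zeros((1, u))

A_last = rnn_forward_sketch(X, U, V, b)                   # shape (m, u)
A_full = rnn_forward_sketch(X, U, V, b, sequences=True)   # shape (m, s, u)
print(A_last.shape, A_full.shape)
```

With sequences left to False, the returned array is the last slice of the full array of hidden states along the step axis.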

Backward

RNN.backward(dX)[source]

Wrapper for epynn.rnn.backward.rnn_backward().

Parameters

dX (numpy.ndarray) – Output of backward propagation from next layer.

Returns

Output of backward propagation for current layer.

Return type

numpy.ndarray

def rnn_backward(layer, dX):
    """Backward propagate error gradients to previous layer.
    """
    # (1) Initialize cache and hidden state gradient
    dA, dh = initialize_backward(layer, dX)

    # Reverse iteration over sequence steps
    for s in reversed(range(layer.d['s'])):

        # (2s) Slice sequence (m, s, u) w.r.t step
        dA = layer.bc['dA'][:, s]          # dL/dA

        # (3s) Gradient of the loss w.r.t. next hidden state
        dhn = layer.bc['dhn'][:, s] = dh   # dL/dhn

        # (4s) Gradient of the loss w.r.t hidden state h_
        dh_ = layer.bc['dh_'][:, s] = (
            (dA + dhn)
            * layer.activate(layer.fc['h_'][:, s], deriv=True)
        )   # dL/dh_ - To parameters gradients

        # (5s) Gradient of the loss w.r.t hidden state h
        dh = layer.bc['dh'][:, s] = (
            np.dot(dh_, layer.p['V'].T)
        )   # dL/dh - To previous step

        # (6s) Gradient of the loss w.r.t X
        dX = layer.bc['dX'][:, s] = (
            np.dot(dh_, layer.p['U'].T)
        )   # dL/dX - To previous layer

    dX = layer.bc['dX']

    return dX    # To previous layer
_images/rnn3-01.svg

The backward propagation function in a RNN layer k includes:

  • (1): dA is the gradient of the loss with respect to the output of forward propagation A for current layer k. It is equal to the gradient of the loss with respect to input of forward propagation for next layer k+1. The initial gradient for hidden state dh is a zero array.

  • (2s): For each step in the reversed sequence, input dA of the current iteration is retrieved by indexing the input with shape (m, s, u) to obtain the input for step with shape (m, u).

  • (3s): The next gradient of the loss with respect to hidden state dhn is retrieved at the beginning of each iteration from the counterpart dh computed at the end of the previous iteration (5s).

  • (4s): dh_ is the gradient of the loss with respect to h_. It is computed by applying element-wise multiplication between (the sum of dA and dhn) and the derivative of the non-linear activation function applied on h_.

  • (5s): dh is the gradient of the loss with respect to h for the current step. It is computed by applying a dot product operation between dh_ and the transpose of V.

  • (6s): The gradient of the loss dX with respect to the input of forward propagation X for current step and current layer k is computed by applying a dot product operation between dh_ and the transpose of U.

Note that:

  • By contrast with the forward propagation, we proceed by iterating over the reversed sequence.

  • In the default RNN configuration with sequences set to False, the output of forward propagation has shape (m, u) and the input of backward propagation has shape (m, u). In the function epynn.rnn.backward.initialize_backward() this is converted to yield an array of shape (m, s, u) which is zero everywhere except at coordinates [:, -1, :], which are set equal to dA of shape (m, u).
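
The conversion described in the last note can be sketched as follows. The helper pad_last_step() is hypothetical and only illustrates the idea; it is not the code of epynn.rnn.backward.initialize_backward().

```python
import numpy as np

def pad_last_step(dA_last, s):
    """Hypothetical helper: place dA of shape (m, u) at the last of s steps."""
    m, u = dA_last.shape
    dA = np.zeros((m, s, u))    # Zero gradient for all steps
    dA[:, -1] = dA_last         # Only the last step receives a gradient
    return dA

dA_last = np.ones((4, 2))             # Gradient from next layer, shape (m, u)
dA = pad_last_step(dA_last, s=5)      # Shape (m, s, u)
print(dA.shape, dA[:, :-1].sum(), dA[:, -1].sum())
```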

Then:

\[\begin{split}\begin{alignat*}{2} & \delta^{\kp}_{msu} &&= \frac{\partial \mathcal{L}}{\partial a^{k}_{msu}} = \frac{\partial \mathcal{L}}{\partial x^{\kp}_{msu}} \tag{1} \\ \\ & \delta^{\kp{<s>}}_{mu} &&= \delta^{\kp}_{msu}[:, s] \tag{2s} \\ \\ & \frac{\partial \mathcal{L}}{\partial h^{k~<\sp>}_{mu}} &&= \frac{\partial \mathcal{L}}{\partial hn^{k}_{msu}}[:,s] \tag{3s} \\ \\ & \frac{\partial \mathcal{L}}{\partial h\_^{k~<s>}_{mu}} &&= (\delta^{\kp{<s>}}_{mu} + \frac{\partial \mathcal{L}}{\partial h^{k~<\sp>}_{mu}}) \\ & &&* h_{act}'(h\_^{k~<s>}_{mu}) \tag{4s} \\ & \frac{\partial \mathcal{L}}{\partial h^{k~<s>}_{mu}} &&= \frac{\partial \mathcal{L}}{\partial h\_^{k~<s>}_{mu}} \cdot V^{k~{\intercal}}_{uu} \tag{5s} \\ \\ & \delta^{k~<s>}_{me} &&= \frac{\partial \mathcal{L}}{\partial x^{k~<s>}_{me}} = \frac{\partial \mathcal{L}}{\partial a^{\km~<s>}_{me}} = \frac{\partial \mathcal{L}}{\partial h\_^{k~<s>}_{mu}} \cdot U^{k~{\intercal}}_{eu} \tag{6s} \\ \end{alignat*}\end{split}\]
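
One way to gain confidence in the backward steps is to compare dX against a finite-difference estimate. The sketch below assumes tanh activation and a hypothetical loss equal to the sum of all hidden states, so that dA is an array of ones. The functions forward(), backward() and loss() are illustrative and are not the EpyNN implementation.

```python
import numpy as np

def forward(X, U, V, b):
    """Return linear products h_ and hidden states h, both of shape (m, s, u)."""
    m, s, _ = X.shape
    u = U.shape[1]
    h_ = np.zeros((m, s, u))
    h = np.zeros((m, s, u))
    state = np.zeros((m, u))                # Initial hidden state
    for t in range(s):
        h_[:, t] = X[:, t] @ U + state @ V + b
        state = h[:, t] = np.tanh(h_[:, t])
    return h_, h

def backward(dA, h_, U, V):
    """Return dX of shape (m, s, e) from dA of shape (m, s, u)."""
    m, s, u = dA.shape
    dh = np.zeros((m, u))                   # (1) Initial hidden state gradient
    dX = np.zeros((m, s, U.shape[0]))
    for t in reversed(range(s)):
        dhn = dh                            # (3s) From the following step
        # (4s) Derivative of tanh is 1 - tanh^2
        dh_ = (dA[:, t] + dhn) * (1.0 - np.tanh(h_[:, t]) ** 2)
        dh = dh_ @ V.T                      # (5s) To previous step
        dX[:, t] = dh_ @ U.T                # (6s) To previous layer
    return dX

def loss(X, U, V, b):
    """Hypothetical loss: sum of all hidden states."""
    return forward(X, U, V, b)[1].sum()

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4, 2))          # (m, s, e)
U = 0.5 * rng.standard_normal((2, 3))       # (e, u)
V = 0.5 * rng.standard_normal((3, 3))       # (u, u)
b = np.zeros((1, 3))

h_, h = forward(X, U, V, b)
dX = backward(np.ones_like(h), h_, U, V)    # dL/dA is one everywhere

# Central finite difference on a single input coordinate
eps = 1e-6
Xp, Xm = X.copy(), X.copy()
Xp[0, 1, 2 - 1] += eps
Xm[0, 1, 2 - 1] -= eps
numeric = (loss(Xp, U, V, b) - loss(Xm, U, V, b)) / (2 * eps)
print(abs(numeric - dX[0, 1, 1]))           # Expected to be very small
```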

Gradients

RNN.compute_gradients()[source]

Wrapper for epynn.rnn.parameters.rnn_compute_gradients().

def rnn_compute_gradients(layer):
    """Compute gradients with respect to weight and bias for layer.
    """
    # Gradients initialization with respect to parameters
    for parameter in layer.p.keys():
        gradient = 'd' + parameter
        layer.g[gradient] = np.zeros_like(layer.p[parameter])

    # Reverse iteration over sequence steps
    for s in reversed(range(layer.d['s'])):

        dh_ = layer.bc['dh_'][:, s]  # Gradient w.r.t hidden state h_
        X = layer.fc['X'][:, s]      # Input for current step
        hp = layer.fc['hp'][:, s]    # Previous hidden state

        # (1) Gradients of the loss with respect to U, V, b
        layer.g['dU'] += np.dot(X.T, dh_)     # (1.1) dL/dU
        layer.g['dV'] += np.dot(hp.T, dh_)    # (1.2) dL/dV
        layer.g['db'] += np.sum(dh_, axis=0)  # (1.3) dL/db

    return None

The function to compute parameters gradients in a RNN layer k includes:

  • (1.1): dU is the gradient of the loss with respect to U. It is computed as a sum with respect to step s of the dot products between the transpose of X and dh_.

  • (1.2): dV is the gradient of the loss with respect to V. It is computed as a sum with respect to step s of the dot products between the transpose of hp and dh_.

  • (1.3): db is the gradient of the loss with respect to b. It is computed as a sum with respect to step s of the sum of dh_ along the axis corresponding to the number of samples m.

Note that:

  • What may make things more difficult here is the extra dimension corresponding to sequence length. While for a sequence of length one the term “sum with respect to step s” would be unnecessary, this is not the case in the general situation where there is more than one step in the sequence.

  • Conversely, using recurrent layers with one-step sequences is generally not appropriate, because the internal memory of recurrent units is never used along the sequence steps.

\[\begin{split}\begin{alignat*}{2} & \dLp{U}{eu} &&= \sumS \vT{x}{s}{me} \cdot \dL{h\_}{s}{mu} \tag{1.1} \\ & \dLp{V}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{h\_}{s}{mu} \tag{1.2} \\ & \dLp{b}{u} &&= \sumS \sumM \dL{h\_}{s}{mu} \tag{1.3} \\ \end{alignat*}\end{split}\]
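
The accumulation over steps can be verified numerically. The sketch below assumes tanh activation and a hypothetical loss equal to the sum of the last hidden state, which corresponds to sequences set to False. The function loss_and_grads() is illustrative and is not the EpyNN implementation.

```python
import numpy as np

def loss_and_grads(X, U, V, b):
    """Return a hypothetical loss and the gradients dU, dV, db."""
    m, s, e = X.shape
    u = U.shape[1]
    hp = np.zeros((m, s, u))                # Previous hidden states
    h = np.zeros((m, s, u))                 # Hidden states
    state = np.zeros((m, u))
    for t in range(s):
        hp[:, t] = state
        state = h[:, t] = np.tanh(X[:, t] @ U + state @ V + b)
    loss = h[:, -1].sum()                   # Loss on last hidden state only

    dU, dV, db = np.zeros_like(U), np.zeros_like(V), np.zeros_like(b)
    dh = np.zeros((m, u))
    for t in reversed(range(s)):
        # dA is one at the last step and zero elsewhere for this loss
        dA = np.ones((m, u)) if t == s - 1 else np.zeros((m, u))
        dh_ = (dA + dh) * (1.0 - h[:, t] ** 2)    # tanh' = 1 - tanh^2
        dh = dh_ @ V.T
        dU += X[:, t].T @ dh_               # (1.1) Sum over steps
        dV += hp[:, t].T @ dh_              # (1.2) Sum over steps
        db += dh_.sum(axis=0)               # (1.3) Sum over steps and samples
    return loss, dU, dV, db

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 4, 2))          # (m, s, e)
U = 0.5 * rng.standard_normal((2, 3))       # (e, u)
V = 0.5 * rng.standard_normal((3, 3))       # (u, u)
b = np.zeros((1, 3))

_, dU, dV, db = loss_and_grads(X, U, V, b)

# Central finite difference on one coordinate of V
eps = 1e-6
Vp, Vm = V.copy(), V.copy()
Vp[0, 1] += eps
Vm[0, 1] -= eps
numeric_dV = (loss_and_grads(X, U, Vp, b)[0]
              - loss_and_grads(X, U, Vm, b)[0]) / (2 * eps)
print(abs(numeric_dV - dV[0, 1]))           # Expected to be very small
```

Dropping the sum over steps, for instance by keeping only the contribution of the last step, would make the analytical and numerical values disagree whenever the sequence has more than one step.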