Gated Recurrent Unit (GRU)
Source files in EpyNN/epynn/gru/.
See Appendix - Notations for mathematical conventions.
Layer architecture
A Gated Recurrent Unit or GRU layer is an object containing a number of units - sometimes referred to as cells - and provided with functions for parameter initialization and for the non-linear activations it requires. The hidden hat hh is an intermediate variable used to compute the hidden state h. The hidden state is also computed from two gate products, namely the reset gate product r and the update gate product z. Each of these products requires its own non-linear activation function.
- class epynn.gru.models.GRU(unit_cells=1, activate=<function tanh>, activate_update=<function sigmoid>, activate_reset=<function sigmoid>, initialization=<function orthogonal>, clip_gradients=False, sequences=False, se_hPars=None)[source]
Bases: epynn.commons.models.Layer
Definition of a GRU layer prototype.
- Parameters
unit_cells (int, optional) – Number of unit cells in GRU layer, defaults to 1.
activate (function, optional) – Non-linear activation of hidden hat (hh) state, defaults to tanh.
activate_update (function, optional) – Non-linear activation of update gate, defaults to sigmoid.
activate_reset (function, optional) – Non-linear activation of reset gate, defaults to sigmoid.
initialization (function, optional) – Weight initialization function for GRU layer, defaults to orthogonal.
clip_gradients (bool, optional) – May prevent exploding/vanishing gradients, defaults to False.
sequences (bool, optional) – Whether to return the full sequence of hidden states (True) or only the last hidden state (False), defaults to False.
se_hPars (dict[str, str or float] or NoneType, optional) – Layer hyper-parameters, defaults to None and inherits from model.
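For illustration, a GRU layer may be instantiated as follows. This is a minimal sketch based on the signature above; the number of unit cells is arbitrary.

from epynn.gru.models import GRU

# Default configuration: one unit cell, only the last hidden state is returned
gru = GRU()

# Eight unit cells, full sequence of hidden states returned with shape (m, s, u)
gru_sequences = GRU(unit_cells=8, sequences=True)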
Shapes
- GRU.compute_shapes(A)[source]
Wrapper for epynn.gru.parameters.gru_compute_shapes().
- Parameters
A (numpy.ndarray) – Output of forward propagation from previous layer.

def gru_compute_shapes(layer, A):
    """Compute forward shapes and dimensions from input for layer.
    """
    X = A    # Input of current layer

    layer.fs['X'] = X.shape    # (m, s, e)

    layer.d['m'] = layer.fs['X'][0]    # Number of samples (m)
    layer.d['s'] = layer.fs['X'][1]    # Steps in sequence (s)
    layer.d['e'] = layer.fs['X'][2]    # Elements per step (e)

    # Parameter Shapes                 Unit cells (u)
    eu = (layer.d['e'], layer.d['u'])    # (e, u)
    uu = (layer.d['u'], layer.d['u'])    # (u, u)
    u1 = (1, layer.d['u'])               # (1, u)

    # Update gate - Reset gate - Hidden hat
    layer.fs['Uz'] = layer.fs['Ur'] = layer.fs['Uhh'] = eu
    layer.fs['Vz'] = layer.fs['Vr'] = layer.fs['Vhh'] = uu
    layer.fs['bz'] = layer.fs['br'] = layer.fs['bhh'] = u1

    # Shape of hidden state (h) with respect to steps (s)
    layer.fs['h'] = (layer.d['m'], layer.d['s'], layer.d['u'])

    return None

Within a GRU layer, shapes of interest include:
Input X of shape (m, s, e) with m equal to the number of samples, s the number of steps in sequence and e the number of elements within each step of the sequence.
Weight U and V of shape (e, u) and (u, u), respectively, with e the number of elements within each step of the sequence and u the number of units in the layer.
Bias b of shape (1, u) with u the number of units in the layer.
Hidden state h of shape (m, 1, u) or (m, u) with m equal to the number of samples and u the number of units in the layer. Because one hidden state h is computed for each step in the sequence, the array containing all hidden states with respect to sequence steps has shape (m, s, u) with s the number of steps in the sequence.
Note that:
The shapes of parameters V, U and b are independent of the number of samples m and of the number of steps in the sequence s.
There are three sets of parameters {V, U, b} for each activation: {Vr, Ur, br} for the reset gate, {Vz, Uz, bz} for the update gate and {Vhh, Uhh, bhh} for the activation of hidden hat hh.
Recurrent layers, including the GRU layer, can handle inputs of variable length because the definition of their parameters is independent of the sequence length s.
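Below is a minimal, shape-level sketch of these relationships in plain NumPy, assuming arbitrary dimensions m=2, s=5, e=4 and u=3; it only illustrates array shapes, not the layer's computations.

import numpy as np

m, s, e, u = 2, 5, 4, 3    # Samples, steps, elements per step, unit cells

X = np.zeros((m, s, e))    # Layer input of shape (m, s, e)

# One {U, V, b} set per activation (reset gate, update gate, hidden hat)
U = np.zeros((e, u))       # Input-to-unit weight of shape (e, u)
V = np.zeros((u, u))       # Hidden-to-unit weight of shape (u, u)
b = np.zeros((1, u))       # Bias of shape (1, u)

h = np.zeros((m, u))       # Hidden state for a single step
hs = np.zeros((m, s, u))   # Hidden states for all steps in the sequence

print(X.shape, U.shape, V.shape, b.shape, h.shape, hs.shape)
# (2, 5, 4) (4, 3) (3, 3) (1, 3) (2, 3) (2, 5, 3)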
Forward
- GRU.forward(A)[source]
Wrapper for epynn.gru.forward.gru_forward().
- Parameters
A (numpy.ndarray) – Output of forward propagation from previous layer.
- Returns
Output of forward propagation for current layer.
- Return type
numpy.ndarray
def gru_forward(layer, A):
    """Forward propagate signal to next layer.
    """
    # (1) Initialize cache and hidden state
    X, h = initialize_forward(layer, A)

    # Iterate over sequence steps
    for s in range(layer.d['s']):

        # (2s) Slice sequence (m, s, e) with respect to step
        X = layer.fc['X'][:, s]

        # (3s) Retrieve previous hidden state
        hp = layer.fc['hp'][:, s] = h

        # (4s) Activate reset gate
        r_ = layer.fc['r_'][:, s] = (
            np.dot(X, layer.p['Ur'])
            + np.dot(hp, layer.p['Vr'])
            + layer.p['br']
        )    # (4.1s)

        r = layer.fc['r'][:, s] = layer.activate_reset(r_)    # (4.2s)

        # (5s) Activate update gate
        z_ = layer.fc['z_'][:, s] = (
            np.dot(X, layer.p['Uz'])
            + np.dot(hp, layer.p['Vz'])
            + layer.p['bz']
        )    # (5.1s)

        z = layer.fc['z'][:, s] = layer.activate_update(z_)    # (5.2s)

        # (6s) Activate hidden hat
        hh_ = layer.fc['hh_'][:, s] = (
            np.dot(X, layer.p['Uhh'])
            + np.dot(r * hp, layer.p['Vhh'])
            + layer.p['bhh']
        )    # (6.1s)

        hh = layer.fc['hh'][:, s] = layer.activate(hh_)    # (6.2s)

        # (7s) Compute current hidden state
        h = layer.fc['h'][:, s] = (
            z * hp
            + (1-z) * hh
        )

    # Return the last hidden state or the full sequence
    A = layer.fc['h'] if layer.sequences else layer.fc['h'][:, -1]

    return A    # To next layer

The forward propagation function in a GRU layer k includes:
(1): Input X in current layer k is equal to the output A of previous layer k-1. The initial hidden state h is a zero array.
(2s): For each step, the input X of the current iteration is retrieved by indexing the layer input of shape (m, s, e) to obtain the input for the current step, of shape (m, e).
(3s): The previous hidden state hp is retrieved at the beginning of each iteration in sequence from the hidden state h computed at the end of the previous iteration (7s).
(4s): The reset gate linear product r_ is the sum of the dot product of X with Ur and the dot product of hp with Vr, plus the bias br. The non-linear product r is obtained by applying the activate_reset function to r_.
(5s): The update gate linear product z_ is the sum of the dot product of X with Uz and the dot product of hp with Vz, plus the bias bz. The non-linear product z is obtained by applying the activate_update function to z_.
(6s): The hidden hat linear product hh_ is the sum of the dot product of X with Uhh and the dot product of (r * hp) with Vhh, plus the bias bhh. The non-linear product hh is obtained by applying the activate function to hh_.
(7s): The hidden state h is the sum of the element-wise products z * hp and (1-z) * hh.
Note that:
The non-linear activation function for hh is generally the tanh function. While it can technically be any function, caution is advised when using something other than tanh.
The non-linear activation function for r and z is generally the sigmoid function. While it can technically be any function, caution is advised when using something other than sigmoid.
The concatenated array of hidden states h has shape (m, s, u). By default, the GRU layer returns the hidden state corresponding to the last step in the input sequence with shape (m, u). If the sequences argument is set to True when instantiating the GRU layer, then it will return the whole array of hidden states with shape (m, s, u).
For the sake of code homogeneity, the output of the GRU layer is A which is equal to h.
\[\begin{split}\begin{alignat*}{2} & x^{k}_{mse} &&= a^{\km}_{mse} \tag{1} \\ \\ & x^{k~<s>}_{me} &&= x^{k}_{mse}[:, s] \tag{2s} \\ \\ & h^{k~<\sm>}_{mu} &&= hp^{k}_{msu}[:,s] \tag{3s} \\ \\ & r\_^{k~<s>}_{mu} &&= x^{k~<s>}_{me} \cdot Ur^{k}_{vu} \\ & &&+ h^{k~<\sm>}_{mu} \cdot Vr^{k}_{uu} \\ & &&+ br^{k}_{u} \tag{4.1s} \\ & r^{k~<s>}_{mu} &&= r_{act}(r\_^{k~<s>}_{mu}) \tag{4.2s} \\ \\ & z\_^{k~<s>}_{mu} &&= x^{k~<s>}_{me} \cdot Uz^{k}_{vu} \\ & &&+ h^{k~<\sm>}_{mu} \cdot Vz^{k}_{uu} \\ & &&+ bz^{k}_{u} \tag{5.1s} \\ & z^{k~<s>}_{mu} &&= z_{act}(z\_^{k~<s>}_{mu}) \tag{5.2s} \\ \\ & \hat{h}\_^{k~<s>}_{mu} &&= x^{k~<s>}_{me} \cdot U\hat{h}^{k}_{vu} \\ & &&+ (r^{k~<s>}_{mu} * h^{k~<\sm>}_{mu}) \cdot V\hat{h}^{k}_{uu} \\ & &&+ b\hat{h}^{k}_{u} \tag{6.1s} \\ & \hat{h}^{k~<s>}_{mu} &&= \hat{h}_{act}(\hat{h}\_^{k~<s>}_{mu}) \tag{6.2s} \\ \\ & h^{k~<s>}_{mu} &&= z^{k~<s>}_{mu} * h^{k~<\sm>}_{mu} + (1-z^{k~<s>}_{mu}) * \hat{h}^{k~<s>}_{mu} \tag{7s} \\ \end{alignat*}\end{split}\]
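To make the recurrence concrete, here is a standalone NumPy sketch of steps (1) to (7s), assuming tanh and sigmoid activations and arbitrary dimensions. It mirrors the equations above but is not the EpyNN implementation; parameter values are random placeholders.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

m, s, e, u = 2, 5, 4, 3                      # Arbitrary dimensions
rng = np.random.default_rng(1)

X = rng.normal(size=(m, s, e))               # Input sequence (m, s, e)

# One {U, V, b} parameter set per activation (reset gate, update gate, hidden hat)
p = {name: rng.normal(size=shape) for name, shape in [
    ('Ur', (e, u)), ('Uz', (e, u)), ('Uhh', (e, u)),
    ('Vr', (u, u)), ('Vz', (u, u)), ('Vhh', (u, u)),
    ('br', (1, u)), ('bz', (1, u)), ('bhh', (1, u)),
]}

h = np.zeros((m, u))                         # (1) Initial hidden state
hs = np.zeros((m, s, u))                     # Hidden states for all steps

for t in range(s):
    x = X[:, t]                              # (2s) Input for current step (m, e)
    hp = h                                   # (3s) Previous hidden state (m, u)
    r = sigmoid(x @ p['Ur'] + hp @ p['Vr'] + p['br'])              # (4s) Reset gate
    z = sigmoid(x @ p['Uz'] + hp @ p['Vz'] + p['bz'])              # (5s) Update gate
    hh = np.tanh(x @ p['Uhh'] + (r * hp) @ p['Vhh'] + p['bhh'])    # (6s) Hidden hat
    h = z * hp + (1 - z) * hh                # (7s) Current hidden state
    hs[:, t] = h

A_last = hs[:, -1]    # Default output (sequences=False), shape (m, u)
A_full = hs           # Output with sequences=True, shape (m, s, u)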
Backward
- GRU.backward(dX)[source]
Wrapper for epynn.gru.backward.gru_backward().
- Parameters
dX (numpy.ndarray) – Output of backward propagation from next layer.
- Returns
Output of backward propagation for current layer.
- Return type
numpy.ndarray
def gru_backward(layer, dX):
    """Backward propagate error gradients to previous layer.
    """
    # (1) Initialize cache and hidden state gradients
    dA, dh = initialize_backward(layer, dX)

    # Reverse iteration over sequence steps
    for s in reversed(range(layer.d['s'])):

        # (2s) Slice sequence (m, s, u) w.r.t step
        dA = layer.bc['dA'][:, s]           # dL/dA

        # (3s) Retrieve next hidden state gradient
        dhn = layer.bc['dhn'][:, s] = dh    # dL/dhn

        # (4s) Gradient of the loss w.r.t hidden state h_
        dh_ = layer.bc['dh_'][:, s] = (
            (dA + dhn)
        )    # dL/dh_

        # (5s) Gradient of the loss w.r.t hidden hat hh_
        dhh_ = layer.bc['dhh_'][:, s] = (
            dh_
            * (1-layer.fc['z'][:, s])
            * layer.activate(layer.fc['hh_'][:, s], deriv=True)
        )    # dL/dhh_

        # (6s) Gradient of the loss w.r.t update gate z_
        dz_ = layer.bc['dz_'][:, s] = (
            dh_
            * (layer.fc['hp'][:, s]-layer.fc['hh'][:, s])
            * layer.activate_update(layer.fc['z_'][:, s], deriv=True)
        )    # dL/dz_

        # (7s) Gradient of the loss w.r.t reset gate r_
        dr_ = layer.bc['dr_'][:, s] = (
            np.dot(dhh_, layer.p['Vhh'].T)
            * layer.fc['hp'][:, s]
            * layer.activate_reset(layer.fc['r_'][:, s], deriv=True)
        )    # dL/dr_

        # (8s) Gradient of the loss w.r.t previous hidden state
        dh = layer.bc['dh'][:, s] = (
            np.dot(dhh_, layer.p['Vhh'].T)
            * layer.fc['r'][:, s]
            + np.dot(dz_, layer.p['Vz'].T)
            + dh_ * layer.fc['z'][:, s]
            + np.dot(dr_, layer.p['Vr'].T)
        )    # dL/dh

        # (9s) Gradient of the loss w.r.t X
        dX = layer.bc['dX'][:, s] = (
            np.dot(dhh_, layer.p['Uhh'].T)
            + np.dot(dz_, layer.p['Uz'].T)
            + np.dot(dr_, layer.p['Ur'].T)
        )    # dL/dX

    dX = layer.bc['dX']

    return dX    # To previous layer

The backward propagation function in a GRU layer k includes:
(1): dA is the gradient of the loss with respect to the output of forward propagation A for the current layer k. It is equal to the gradient of the loss with respect to the input of forward propagation for the next layer k+1. The initial gradient for the hidden state, dh, is a zero array.
(2s): For each step in the reversed sequence, the input dA of the current iteration is retrieved by indexing the input of shape (m, s, u) to obtain the input for the current step, of shape (m, u).
(3s): The gradient of the loss with respect to the next hidden state, dhn, is retrieved at the beginning of each iteration from its counterpart dh computed at the end of the previous iteration (8s).
(4s): dh_ is the sum of dA and dhn.
(5s): dhh_ is the gradient of the loss with respect to hh_ for the current step. It is the product of dh_, (1-z) and the derivative of the activate function applied to hh_.
(6s): dz_ is the gradient of the loss with respect to z_ for the current step. It is the product of dh_, (hp-hh) and the derivative of the activate_update function applied to z_.
(7s): dr_ is the gradient of the loss with respect to r_ for the current step. It is the product of the dot product of dhh_ with the transpose of Vhh, the previous hidden state hp, and the derivative of the activate_reset function applied to r_.
(8s): dh is the gradient of the loss with respect to the hidden state h for the current step. It is the sum of four terms: the dot product of dhh_ with the transpose of Vhh, multiplied element-wise by the reset gate output r; the dot product of dz_ with the transpose of Vz; the element-wise product of dh_ and z; and the dot product of dr_ with the transpose of Vr.
(9s): dX is the gradient of the loss with respect to the input of forward propagation X for current step and current layer k. It is the sum of the dot products between: dhh_ and the transpose of Uhh; dz_ and the transpose of Uz; dr_ and the transpose of Ur.
Note that:
In contrast to the forward propagation, we proceed by iterating over the reversed sequence.
In the default GRU configuration with sequences set to False, the output of forward propagation has shape (m, u) and the input of backward propagation has shape (m, u). In the function epynn.gru.backward.initialize_backward(), this is converted into a zero array of shape (m, s, u) whose coordinates [:, -1, :] are set equal to dA of shape (m, u).

\[\begin{split}\begin{alignat*}{2} & \delta^{\kp}_{msu} &&= \frac{\partial \mathcal{L}}{\partial a^{k}_{msu}} = \frac{\partial \mathcal{L}}{\partial x^{\kp}_{msu}} \tag{1} \\ \\ & \delta^{\kp{<s>}}_{mu} &&= \delta^{\kp}_{msu}[:, s] \tag{2s} \\ \\ & \dL{h}{\sp}{mu} &&= \dL{hn}{s}{mu}[:,s] \tag{3s} \\ \\ & \dL{h\_}{s}{mu} &&= \delta^{\kp{<s>}}_{mu} + \dL{h}{\sp}{mu} \tag{4s} \\ \\ & \dL{\hat{h}\_}{s}{mu} &&= \gl{dh\_} \\ & &&* (1 - z^{k~<s>}_{mu}) \\ & &&* \hat{h}_{act}'(\hat{h}\_^{k~<s>}_{mu}) \tag{5s} \\ \\ & \dL{z\_}{s}{mu} &&= \gl{dh\_} \\ & &&* (h^{k~<\sm>}_{mu} - \hat{h}^{k~<s>}_{mu}) \\ & &&* z_{act}'(z\_^{k~<s>}_{mu}) \tag{6s} \\ \\ & \dL{r\_}{s}{mu} &&= \gl{d\hat{h}\_} \cdot \vTp{V\hat{h}}{uu} \\ & &&* h^{k~<\sm>}_{mu} \\ & &&* r_{act}'(r\_^{k~<s>}_{mu}) \tag{7s} \\ \\ & \dL{h}{s}{mu} &&= \dL{\hat{h}\_}{s}{mu} \cdot \vTp{V\hat{h}}{uu} * \gl{r} \\ & &&+ \dL{z\_}{s}{mu} \cdot \vTp{Vz}{uu} \\ & &&+ \gl{dh\_} * z^{k~<s>}_{mu} \\ & &&+ \dL{r\_}{s}{mu} \cdot \vTp{Vr}{uu} \tag{8s} \\ \\ & \delta^{k~<s>}_{me} &&= \dL{x}{s}{me} = \frac{\partial \mathcal{L}}{\partial a^{\km~<s>}_{me}} \\ & &&= \dL{\hat{h}\_}{s}{mu} \cdot \vTp{U\hat{h}}{vu} \\ & &&+ \dL{z\_}{s}{mu} \cdot \vTp{Uz}{vu} \\ & &&+ \dL{r\_}{s}{mu} \cdot \vTp{Ur}{vu} \tag{9s} \\ \end{alignat*}\end{split}\]
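For illustration, the sketch below applies the per-step operations (4s)-(9s) to a single step in isolation, assuming tanh and sigmoid activations; the cached forward quantities are replaced by random placeholder arrays, so this is a sketch of the arithmetic rather than the library code.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def dsigmoid(x):                   # Derivative of sigmoid w.r.t its linear input
    return sigmoid(x) * (1 - sigmoid(x))

def dtanh(x):                      # Derivative of tanh w.r.t its linear input
    return 1 - np.tanh(x) ** 2

m, e, u = 2, 4, 3                  # Arbitrary dimensions
rng = np.random.default_rng(1)

# Random placeholders for quantities cached during the forward pass at one step
X = rng.normal(size=(m, e))                                 # Input for the step
hp = rng.normal(size=(m, u))                                # Previous hidden state
r_, z_, hh_ = (rng.normal(size=(m, u)) for _ in range(3))   # Linear products
r, z, hh = sigmoid(r_), sigmoid(z_), np.tanh(hh_)           # Activated products
Ur, Uz, Uhh = (rng.normal(size=(e, u)) for _ in range(3))
Vr, Vz, Vhh = (rng.normal(size=(u, u)) for _ in range(3))

dA = rng.normal(size=(m, u))       # Gradient from the next layer for this step
dhn = rng.normal(size=(m, u))      # Hidden state gradient from the following step

dh_ = dA + dhn                                                  # (4s)
dhh_ = dh_ * (1 - z) * dtanh(hh_)                               # (5s)
dz_ = dh_ * (hp - hh) * dsigmoid(z_)                            # (6s)
dr_ = (dhh_ @ Vhh.T) * hp * dsigmoid(r_)                        # (7s)
dh = (dhh_ @ Vhh.T) * r + dz_ @ Vz.T + dh_ * z + dr_ @ Vr.T     # (8s)
dX = dhh_ @ Uhh.T + dz_ @ Uz.T + dr_ @ Ur.T                     # (9s)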
Gradients
- GRU.compute_gradients()[source]
Wrapper for epynn.gru.parameters.gru_compute_gradients().
def gru_compute_gradients(layer):
    """Compute gradients with respect to weight and bias for layer.
    """
    # Gradients initialization with respect to parameters
    for parameter in layer.p.keys():
        gradient = 'd' + parameter
        layer.g[gradient] = np.zeros_like(layer.p[parameter])

    # Reverse iteration over sequence steps
    for s in reversed(range(layer.d['s'])):

        X = layer.fc['X'][:, s]      # Input for current step
        hp = layer.fc['hp'][:, s]    # Previous hidden state

        # (1) Gradients of the loss with respect to Uhh, Vhh, bhh
        dhh_ = layer.bc['dhh_'][:, s]    # Gradient w.r.t hidden hat hh_
        layer.g['dUhh'] += np.dot(X.T, dhh_)       # (1.1) dL/dUhh
        layer.g['dVhh'] += np.dot((layer.fc['r'][:, s] * hp).T, dhh_)    # (1.2) dL/dVhh
        layer.g['dbhh'] += np.sum(dhh_, axis=0)    # (1.3) dL/dbhh

        # (2) Gradients of the loss with respect to Uz, Vz, bz
        dz_ = layer.bc['dz_'][:, s]      # Gradient w.r.t update gate z_
        layer.g['dUz'] += np.dot(X.T, dz_)         # (2.1) dL/dUz
        layer.g['dVz'] += np.dot(hp.T, dz_)        # (2.2) dL/dVz
        layer.g['dbz'] += np.sum(dz_, axis=0)      # (2.3) dL/dbz

        # (3) Gradients of the loss with respect to Ur, Vr, br
        dr_ = layer.bc['dr_'][:, s]      # Gradient w.r.t reset gate r_
        layer.g['dUr'] += np.dot(X.T, dr_)         # (3.1) dL/dUr
        layer.g['dVr'] += np.dot(hp.T, dr_)        # (3.2) dL/dVr
        layer.g['dbr'] += np.sum(dr_, axis=0)      # (3.3) dL/dbr

    return None

The function to compute parameter gradients in a GRU layer k includes:
(1.1): dUhh is the gradient of the loss with respect to Uhh. It is computed as a sum with respect to step s of the dot products between the transpose of X and dhh_.
(1.2): dVhh is the gradient of the loss with respect to Vhh. It is computed as a sum with respect to step s of the dot products between the transpose of (r * hp) and dhh_.
(1.3): dbhh is the gradient of the loss with respect to bhh. It is computed as a sum with respect to step s of the sum of dhh_ along the axis corresponding to the number of samples m.
(2.1): dUz is the gradient of the loss with respect to Uz. It is computed as a sum with respect to step s of the dot products between the transpose of X and dz_.
(2.2): dVz is the gradient of the loss with respect to Vz. It is computed as a sum with respect to step s of the dot products between the transpose of hp and dz_.
(2.3): dbz is the gradient of the loss with respect to bz. It is computed as a sum with respect to step s of the sum of dz_ along the axis corresponding to the number of samples m.
(3.1): dUr is the gradient of the loss with respect to Ur. It is computed as a sum with respect to step s of the dot products between the transpose of X and dr_.
(3.2): dVr is the gradient of the loss with respect to Vr. It is computed as a sum with respect to step s of the dot products between the transpose of hp and dr_.
(3.3): dbr is the gradient of the loss with respect to br. It is computed as a sum with respect to step s of the sum of dr_ along the axis corresponding to the number of samples m.
Note that:
The same logical operation is applied for sets {dUr, dVr, dbr} and {dUz, dVz, dbz} but not for set {dUhh, dVhh, dbhh}.
For the first two sets, this logical operation is identical to the one described for the RNN layer.
For {dUhh, dVhh, dbhh} the logical operation is different because of how hh_ is computed during the forward propagation.
\[\begin{split}\begin{align} & \dLp{U\hat{h}}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{\hat{h}\_}{s}{mu} \tag{1.1} \\ & \dLp{V\hat{h}}{uu} &&= \sumS (r^{k~<s>}_{mu} * h^{k~<\sm>}_{mu})^{\intercal} \cdot \dL{\hat{h}\_}{s}{mu} \tag{1.2} \\ & \dLp{b\hat{h}}{u} &&= \sumS \sumM \dL{\hat{h}\_}{s}{mu} \tag{1.3} \\ \\ & \dLp{Uz}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{z\_}{s}{mu} \tag{2.1} \\ & \dLp{Vz}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{z\_}{s}{mu} \tag{2.2} \\ & \dLp{bz}{u} &&= \sumS \sumM \dL{z\_}{s}{mu} \tag{2.3} \\ \\ & \dLp{Ur}{vu} &&= \sumS \vT{x}{s}{me} \cdot \dL{r\_}{s}{mu} \tag{3.1} \\ & \dLp{Vr}{uu} &&= \sumS \vT{h}{\sm}{mu} \cdot \dL{r\_}{s}{mu} \tag{3.2} \\ & \dLp{br}{u} &&= \sumS \sumM \dL{r\_}{s}{mu} \tag{3.3} \\ \end{align}\end{split}\]
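As a shape-level sketch of this accumulation pattern, here is the update gate set {dUz, dVz, dbz} written in plain NumPy with random placeholder caches and arbitrary dimensions; the reset gate set follows the same pattern, and the hidden hat set uses (r * hp) in place of hp.

import numpy as np

m, s, e, u = 2, 5, 4, 3                 # Arbitrary dimensions
rng = np.random.default_rng(1)

X = rng.normal(size=(m, s, e))          # Cached layer input
hp = rng.normal(size=(m, s, u))         # Cached previous hidden states
dz_ = rng.normal(size=(m, s, u))        # Cached gradients w.r.t update gate z_

dUz, dVz, dbz = np.zeros((e, u)), np.zeros((u, u)), np.zeros((1, u))

for t in reversed(range(s)):            # Accumulate over sequence steps
    dUz += X[:, t].T @ dz_[:, t]        # (2.1) X^T . dz_ summed over steps
    dVz += hp[:, t].T @ dz_[:, t]       # (2.2) hp^T . dz_ summed over steps
    dbz += np.sum(dz_[:, t], axis=0)    # (2.3) dz_ summed over samples and steps

print(dUz.shape, dVz.shape, dbz.shape)  # (4, 3) (3, 3) (1, 3)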
Live examples
You may also like to browse all Network training examples provided with EpyNN.