Convolution 2D (CNN)
Source files in EpyNN/epynn/convolution/.
See Appendix - Notations for mathematical conventions.
Layer architecture
A Convolution layer is an object containing a number of units - often referred to as filters - and provided with functions for parameter initialization and for the non-linear activation of inputs. Importantly, the Convolution layer takes filter_size and strides arguments upon instantiation. They define, respectively, the dimensions of the window used for the convolution operation and the steps by which the window is moved between each operation.
- class epynn.convolution.models.Convolution(unit_filters=1, filter_size=(3, 3), strides=None, padding=0, activate=<function relu>, initialization=<function xavier>, use_bias=True, se_hPars=None)[source]
Bases: epynn.commons.models.Layer
Definition of a convolution layer prototype.
- Parameters
unit_filters (int, optional) – Number of unit filters in convolution layer, defaults to 1.
filter_size (int or tuple[int], optional) – Height and width for convolution window, defaults to (3, 3).
strides (int or tuple[int], optional) – Height and width to shift the convolution window by, defaults to None which equals filter_size.
padding (int, optional) – Number of zeros to pad each feature plane with, defaults to 0.
activate (function, optional) – Non-linear activation of unit filters, defaults to relu.
initialization (function, optional) – Weight initialization function for convolution layer, defaults to xavier.
use_bias (bool, optional) – Whether the layer uses bias, defaults to True.
se_hPars (dict[str, str or float] or NoneType, optional) – Layer hyper-parameters, defaults to None and inherits from model.
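A minimal instantiation sketch is shown below; the argument values are arbitrary and only the names from the signature above are assumed.

from epynn.convolution.models import Convolution

# Hypothetical example: 16 unit filters, a 3 x 3 window moved by 2 x 2 strides,
# keeping the relu activation and xavier initialization defaults.
convolution = Convolution(
    unit_filters=16,
    filter_size=(3, 3),
    strides=(2, 2),
    padding=0,
    use_bias=True,
)

The resulting object can then be used like any other layer prototype; see the Live examples at the bottom of this page for complete networks.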
Shapes
- Convolution.compute_shapes(A)[source]
Wrapper for epynn.convolution.parameters.convolution_compute_shapes().
- Parameters
A (numpy.ndarray) – Output of forward propagation from previous layer.

def convolution_compute_shapes(layer, A):
    """Compute forward shapes and dimensions for layer.
    """
    X = A    # Input of current layer

    layer.fs['X'] = X.shape    # (m, h, w, d)

    layer.d['m'] = layer.fs['X'][0]    # Number of samples (m)
    layer.d['h'] = layer.fs['X'][1]    # Height of features (h)
    layer.d['w'] = layer.fs['X'][2]    # Width of features (w)
    layer.d['d'] = layer.fs['X'][3]    # Depth of features (d)

    # Output height (oh) and width (ow)
    layer.d['oh'] = math.floor((layer.d['h']-layer.d['fh']) / layer.d['sh']) + 1
    layer.d['ow'] = math.floor((layer.d['w']-layer.d['fw']) / layer.d['sw']) + 1

    # Shapes for trainable parameters
    # filter_height (fh), filter_width (fw), features_depth (d), unit_filters (u)
    layer.fs['W'] = (layer.d['fh'], layer.d['fw'], layer.d['d'], layer.d['u'])
    layer.fs['b'] = (layer.d['u'], )

    return None

Within a Convolution layer, shapes of interest include:
Input X of shape (m, h, w, d) with m equal to the number of samples, h the height of features, w the width of features and d the depth of features.
Output height oh, equal to the difference between h and the filter height fh, divided by the stride height sh, rounded down to the nearest integer and incremented by one.
Output width ow, equal to the difference between w and the filter width fw, divided by the stride width sw, rounded down to the nearest integer and incremented by one.
Weight W of shape (fh, fw, d, u) with fh the filter height, fw the filter width, d the depth of features and u the number of units - or filters - in the current layer k.
Bias b of shape (u,) with u the number of units - or filters - in the layer.
Note that:
The input of a Convolution layer is often an image or a feature map. Therefore, we would more generally say that h is the image height, w the image width and d the image depth.
The shapes of the parameters W and b are independent of the number of samples m.
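The shape arithmetic can be checked with a short worked example. The dimensions below are hypothetical and chosen only for illustration.

import math

# Hypothetical input: 28 x 28 image with 3 channels, convolved by a 3 x 3 filter
# moved by strides of 2, with u = 8 unit filters.
h, w, d = 28, 28, 3
fh, fw = 3, 3
sh, sw = 2, 2
u = 8

oh = math.floor((h - fh) / sh) + 1    # 13
ow = math.floor((w - fw) / sw) + 1    # 13

print((oh, ow))          # (13, 13) - output height and width
print((fh, fw, d, u))    # (3, 3, 3, 8) - shape of W
print((u,))              # (8,) - shape of b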
Forward
- Convolution.forward(A)[source]
Wrapper for epynn.convolution.forward.convolution_forward().
- Parameters
A (numpy.ndarray) – Output of forward propagation from previous layer.
- Returns
Output of forward propagation for current layer.
- Return type
numpy.ndarray
def convolution_forward(layer, A):
    """Forward propagate signal to next layer.
    """
    # (1) Initialize cache and pad image
    X = initialize_forward(layer, A)    # (m, h, w, d)

    # (2) Slice input w.r.t. filter size (fh, fw) and strides (sh, sw)
    Xb = np.array([[X[:, h:h + layer.d['fh'], w:w + layer.d['fw'], :]
                    # Inner loop
                    # (m, h, w, d) ->
                    # (ow, m, h, fw, d)
                    for w in range(layer.d['w'] - layer.d['fw'] + 1)
                    if w % layer.d['sw'] == 0]
                   # Outer loop
                   # (ow, m, h, fw, d) ->
                   # (oh, ow, m, fh, fw, d)
                   for h in range(layer.d['h'] - layer.d['fh'] + 1)
                   if h % layer.d['sh'] == 0])

    # (3) Bring back m along axis 0
    Xb = np.moveaxis(Xb, 2, 0)
    # (oh, ow, m, fh, fw, d) ->
    # (m, oh, ow, fh, fw, d)

    # (4) Add dimension for filter units (u) on axis 6
    Xb = layer.fc['Xb'] = np.expand_dims(Xb, axis=6)
    # (m, oh, ow, fh, fw, d)    ->
    # (m, oh, ow, fh, fw, d, 1)

    # (5.1) Linear activation Xb -> Zb
    Zb = Xb * layer.p['W']
    # (m, oh, ow, fh, fw, d, 1) - Xb
    # (fh, fw, d, u)            - W

    # (5.2) Sum block products
    Z = np.sum(Zb, axis=(5, 4, 3))
    # (m, oh, ow, fh, fw, d, u) - Zb
    # (m, oh, ow, fh, fw, u)    - np.sum(Zb, axis=(5))
    # (m, oh, ow, fh, u)        - np.sum(Zb, axis=(5, 4))
    # (m, oh, ow, u)            - np.sum(Zb, axis=(5, 4, 3))

    # (5.3) Add bias to linear activation product
    Z = layer.fc['Z'] = Z + layer.p['b']

    # (6) Non-linear activation
    A = layer.fc['A'] = layer.activate(Z)

    return A    # To next layer

The forward propagation function in a Convolution layer k includes:
(1): Input X in current layer k with shape (m, h, w, d) is equal to the output A of previous layer k-1.
(2): Xb is an array of blocks with shape (oh, ow, m, fh, fw, d) made by iterative slicing of X with respect to fh, fw and sh, sw.
(3): Given Xb with shape (oh, ow, m, fh, fw, d), the operation moves axis 2 to position 0, yielding Xb with shape (m, oh, ow, fh, fw, d).
(4): The operation adds a new axis in position 6 of the array Xb, yielding a shape of (m, oh, ow, fh, fw, d, 1) so that Xb can be broadcast against the unit filters dimension u of W.
(5.1): The linear activation block product Zb with shape (m, oh, ow, fh, fw, d, u) is computed by broadcast multiplication of the input blocks Xb with shape (m, oh, ow, fh, fw, d, 1) by the weight W with shape (fh, fw, d, u).
(5.2): The linear activation product Z with shape (m, oh, ow, u) is computed by summing the block product Zb with shape (m, oh, ow, fh, fw, d, u) over the block dimensions on axes 5, 4 and 3.
(5.3): Computation of Z is completed by the addition of the bias b.
(6): Output A is computed by applying a non-linear activation function on Z.
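The block slicing and summation above can be reproduced in a few lines of standalone NumPy. The dimensions below are hypothetical and the code is only a sketch of steps (2) to (5.3).

import numpy as np

# Hypothetical dimensions: m=2 samples, 8 x 8 input, d=3 channels,
# u=4 unit filters, 3 x 3 filter, strides of 2.
m, h, w, d, u = 2, 8, 8, 3, 4
fh, fw, sh, sw = 3, 3, 2, 2
oh, ow = (h - fh) // sh + 1, (w - fw) // sw + 1              # 3, 3

X = np.random.standard_normal((m, h, w, d))
W = np.random.standard_normal((fh, fw, d, u))
b = np.zeros((u,))

# (2) Slice input into blocks, keeping stride-aligned positions only
Xb = np.array([[X[:, i:i + fh, j:j + fw, :]
                for j in range(w - fw + 1) if j % sw == 0]
               for i in range(h - fh + 1) if i % sh == 0])   # (oh, ow, m, fh, fw, d)

# (3) Bring m back to axis 0 and (4) add the unit filters axis
Xb = np.expand_dims(np.moveaxis(Xb, 2, 0), axis=6)           # (m, oh, ow, fh, fw, d, 1)

# (5.1-5.3) Block product, sum over block axes, add bias
Z = np.sum(Xb * W, axis=(5, 4, 3)) + b                       # (m, oh, ow, u)
print(Z.shape)                                               # (2, 3, 3, 4)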
Note that:
This implementation is not the naive implementation of a Convolution layer, in which the process is fully iterative and relies on a four-level nested loop, one level per input dimension. Because the naive implementation is prohibitively slow, the version depicted here is a stride-groups optimized implementation that takes advantage of NumPy arrays. A naive sketch is given below for comparison.
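For comparison, one possible naive formulation is sketched below; it loops over samples, output positions and unit filters. The function name and argument order are illustrative and not part of EpyNN.

import numpy as np

def naive_convolution_forward(X, W, b, sh=1, sw=1):
    """Naive convolution with explicit loops, for comparison only."""
    m, h, w, d = X.shape
    fh, fw, _, u = W.shape
    oh = (h - fh) // sh + 1
    ow = (w - fw) // sw + 1
    Z = np.zeros((m, oh, ow, u))
    for i in range(m):                        # samples
        for p in range(oh):                   # output rows
            for q in range(ow):               # output columns
                for k in range(u):            # unit filters
                    block = X[i, p*sh:p*sh+fh, q*sw:q*sw+fw, :]
                    Z[i, p, q, k] = np.sum(block * W[:, :, :, k]) + b[k]
    return Z

Both formulations compute the same Z; the stride-groups version simply replaces the explicit loops with array slicing and broadcasting.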
\[\begin{split}\begin{alignat*}{2} & x^{k}_{mhwd} &&= a^{\km}_{mhwd} \tag{1} \\ & \_\_Xb^{k} &&= blocks(X^{k}) \tag{2} \\ & \_Xb^{k} &&= moveaxis(\_\_Xb^{k}) \tag{3} \\ & Xb^{k} &&= expand\_dims(\_Xb^{k}) \tag{4} \\ & Zb^{k} &&= Xb^{k} * W^{k}_{f_{h}f_{w}du} \tag{5.1} \\ & \_z^{k}_{mo_{h}o_{w}u} &&= \sum_{f_{h} = 1}^{Fh} \sum_{f_{w} = 1}^{Fw} \sum_{d = 1}^{D} Zb^{k}_{mo_{h}o_{w}f_{h}f_{w}du} \tag{5.2} \\ & z^{k}_{mo_{h}o_{w}u} &&= \_z^{k}_{mo_{h}o_{w}u} + b^{k}_{111u} \tag{5.3} \\ & a^{k}_{mo_{h}o_{w}u} &&= a_{act}(z^{k}_{mo_{h}o_{w}u}) \tag{6} \end{alignat*}\end{split}\]\[\begin{split}\begin{alignat*}{2} & where~blocks~is~defined~as: \\ &~~~~~~~~~~~~~~~~~~~~~~~~blocks:\mathcal{M}_{M,H,W,D}(\mathbb{R}) &&\to \mathcal{M}_{Oh,Ow,M,Fh,Fw,D}(\mathbb{R}) \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~X = \mathop{(x_{mhwd})}_{\substack{1 \le m \le M \\ 1 \le h \le H \\ 1 \le w \le W \\ 1 \le d \le D}} &&\to Y = \mathop{(y_{o_{h}o_{w}mf_hf_wd})}_{\substack{1 \le o_h \le Oh \\ 1 \le o_w \le Ow \\ 1 \le m \le M \\ 1 \le f_h \le Fh \\ 1 \le f_w \le Fw \\ 1 \le d \le D}} \\ &~~~~~~~~~~~~~~~~~~~~~~~with~Fh, Fw, Sh, Sw \in &&~\mathbb{N^*_+} \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\forall{h} \in &&~\{1,..,H-Fh~|~h \pmod{Sh} = 0\} \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\forall{w} \in &&~\{1,..,W-Fw~|~w \pmod{Sw} = 0\} \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Y = && X[:, h:h+Fh, w:w+Fw, :] \\ \\ & where~moveaxis~is~defined~as: \\ &~~~~~~~~moveaxis:\mathcal{M}_{Oh,Ow,M,Fh,Fw,D}(\mathbb{R}) &&\to \mathcal{M}_{M,Oh,Ow,Fh,Fw,D}(\mathbb{R}) \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~X &&\to Y \\ \\ & where~expand\_dims~is~defined~as: \\ &~~~~expand\_dims:\mathcal{M}_{M,Oh,Ow,Fh,Fw,D}(\mathbb{R}) &&\to \mathcal{M}_{M,Oh,Ow,Fh,Fw,D,1}(\mathbb{R}) \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~X &&\to Y \end{alignat*}\end{split}\]
Backward
- Convolution.backward(dX)[source]
Wrapper for epynn.convolution.backward.convolution_backward().
- Parameters
dX (numpy.ndarray) – Output of backward propagation from next layer.
- Returns
Output of backward propagation for current layer.
- Return type
numpy.ndarray
def convolution_backward(layer, dX):
    """Backward propagate error gradients to previous layer.
    """
    # (1) Initialize cache
    dA = initialize_backward(layer, dX)    # (m, oh, ow, u)

    # (2) Gradient of the loss w.r.t. Z
    dZ = layer.bc['dZ'] = (
        dA
        * layer.activate(layer.fc['Z'], deriv=True)
    )    # dL/dZ

    # (3) Restore kernel dimensions
    dZb = dZ
    dZb = np.expand_dims(dZb, axis=3)
    dZb = np.expand_dims(dZb, axis=3)
    dZb = np.expand_dims(dZb, axis=3)
    # (m, oh, ow, u)          ->
    # (m, oh, ow, 1, u)       ->
    # (m, oh, ow, 1, 1, u)    ->
    # (m, oh, ow, 1, 1, 1, u)

    # (4) Initialize backward output dL/dX
    dX = np.zeros_like(layer.fc['X'])      # (m, h, w, d)

    # Iterate over forward output height
    for oh in range(layer.d['oh']):

        hs = oh * layer.d['sh']
        he = hs + layer.d['fh']

        # Iterate over forward output width
        for ow in range(layer.d['ow']):

            ws = ow * layer.d['sw']
            we = ws + layer.d['fw']

            # (5hw) Gradient of the loss w.r.t Xb
            dXb = dZb[:, oh, ow, :] * layer.p['W']
            # (m, oh, ow, 1, 1, 1, u) - dZb
            # (m, 1, 1, 1, u)         - dZb[:, oh, ow, :]
            # (fh, fw, d, u)          - W

            # (6hw) Sum over units axis
            dX[:, hs:he, ws:we, :] += np.sum(dXb, axis=4)
            # (m, fh, fw, d, u) - dXb
            # (m, fh, fw, d)    - np.sum(dXb, axis=4)

    layer.bc['dX'] = dX

    return dX

The backward propagation function in a Convolution layer k includes:
(1): dA with shape (m, oh, ow, u) is the gradient of the loss with respect to the output of forward propagation A for the current layer k. It is equal to the gradient of the loss with respect to the input of forward propagation for the next layer k+1.
(2): dZ is the gradient of the loss with respect to Z. It is computed by applying element-wise multiplication between dA and the derivative of the non-linear activation function applied on Z.
(3): Three new axes are added before the last axis of dZ with shape (m, oh, ow, u), yielding dZb with shape (m, oh, ow, 1, 1, 1, u), which can be broadcast against the block dimensions fh, fw and d of Xb.
(4): The gradient of the loss dX with respect to the input of forward propagation X for current layer k is initialized as a zeros array of the same shape as X.
(5hw): For each row oh and each column ow of the forward output, the gradient of the loss dXb with respect to one block of the input of forward propagation is equal to the product between the corresponding block in dZb and the weight W.
(6hw): For each row oh and each column ow of the forward output, the corresponding window coordinates in X are computed: hs, he for height start and height end; ws, we for width start and width end. The gradient of the loss dX with respect to the current window in X is incremented by the sum of dXb over axis 4, the unit filters dimension.
Note that:
One window represents a set of coordinates. A value at given coordinates may belong to more than one window when consecutive windows overlap. This is why the operation dX[:, hs:he, ws:we, :] += np.sum(dXb, axis=4) is used instead of dX[:, hs:he, ws:we, :] = np.sum(dXb, axis=4): gradients from all windows containing a given coordinate must accumulate, as illustrated below.
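A minimal sketch of this accumulation, with hypothetical values: a 4 x 4 single-channel input, a 3 x 3 filter and strides of 1 give a 2 x 2 output whose windows overlap.

import numpy as np

fh = fw = 3
sh = sw = 1
dXb_blocks = np.ones((2, 2, 1, fh, fw, 1))   # stand-in for per-window gradients

dX = np.zeros((1, 4, 4, 1))
for oh in range(2):
    hs, he = oh * sh, oh * sh + fh
    for ow in range(2):
        ws, we = ow * sw, ow * sw + fw
        dX[:, hs:he, ws:we, :] += dXb_blocks[oh, ow]

# Central coordinates belong to all four windows and accumulate a value of 4,
# corner coordinates belong to a single window and keep a value of 1.
print(dX[0, :, :, 0])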
\[\begin{split}\begin{alignat*}{2} & \delta^{\kp}_{mo_{h}o_{w}u} &&= \frac{\partial \mathcal{L}}{\partial a^{k}_{mo_{h}o_{w}u}} = \frac{\partial \mathcal{L}}{\partial x^{\kp}_{mo_{h}o_{w}u}} \tag{1} \\ & \frac{\partial \mathcal{L}}{\partial z^{k}_{mo_{h}o_{w}u}} &&= \delta^{\kp}_{mo_{h}o_{w}u} * a_{act}'(z^{k}_{mo_{h}o_{w}u}) \tag{2} \\ & \frac{\partial \mathcal{L}}{\partial zb^{k}_{mo_{h}o_{w}111u}} &&= expand\_dims(\frac{\partial \mathcal{L}}{\partial z^{k}_{mo_{h}o_{w}u}}) \tag{3} \\ & \frac{\partial \mathcal{L}}{\partial X^{k}} &&= [\frac{\partial \mathcal{L}}{\partial x^{k}_{mhwd}}] \in \{0\}^{M \times H \times W \times D} \tag{4} \\ \\ & \frac{\partial \mathcal{L}}{\partial Xb^{k<o_{h}o_{w}>}} &&= \frac{\partial \mathcal{L}}{\partial Zb^{k}_{mo_{h}o_{w}111u}} * W^{k}_{f_{h}f_{w}du} \tag{5hw} \\ \\ & \Delta\frac{\partial \mathcal{L}}{\partial X^{k<h_s:h_e,w_s:w_e>}_{mf_hf_wd}} &&= \sum_{u = 1}^{U} \frac{\partial \mathcal{L}}{\partial Xb^{k<o_{h}o_{w}>}_{mf_hf_wdu}} \tag{6hw} \end{alignat*}\end{split}\]\[\begin{split}\begin{alignat*}{2} & where~expand\_dims~is~defined~as: \\ &~~~~expand\_dims:\mathcal{M}_{M,Oh,Ow,U}(\mathbb{R}) &&\to \mathcal{M}_{M,Oh,Ow,1,1,1,U}(\mathbb{R}) \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~X &&\to Y \end{alignat*}\end{split}\]
Gradients
- Convolution.compute_gradients()[source]
Wrapper for epynn.convolution.parameters.convolution_compute_gradients().

def convolution_compute_gradients(layer):
    """Compute gradients with respect to weight and bias for layer.
    """
    # Gradients initialization with respect to parameters
    for parameter in layer.p.keys():
        gradient = 'd' + parameter
        layer.g[gradient] = np.zeros_like(layer.p[parameter])

    Xb = layer.fc['Xb']    # Input blocks of forward propagation
    dZ = layer.bc['dZ']    # Gradient of the loss with respect to Z

    # Expand dZ dimensions with respect to Xb
    dZb = dZ
    dZb = np.expand_dims(dZb, axis=3)    # (m, oh, ow, 1, u)
    dZb = np.expand_dims(dZb, axis=3)    # (m, oh, ow, 1, 1, u)
    dZb = np.expand_dims(dZb, axis=3)    # (m, oh, ow, 1, 1, 1, u)

    # (1) Gradient of the loss with respect to W, b
    dW = layer.g['dW'] = np.sum(dZb * Xb, axis=(2, 1, 0))    # (1.1) dL/dW
    db = layer.g['db'] = np.sum(dZb, axis=(2, 1, 0))         # (1.2) dL/db

    layer.g['db'] = db.squeeze() if layer.use_bias else 0.

    return None

The function to compute parameter gradients in a Convolution layer k includes:
(1.1): dW is the gradient of the loss with respect to W. It is computed by element-wise multiplication between Xb and dZb followed by summing over axes 2, 1 and 0. Axes correspond to the output width ow, output height oh and number of samples m dimensions, respectively.
(1.2): db is the gradient of the loss with respect to b. It is computed by summing dZb along axes 2, 1 and 0 followed by a squeeze which removes axes of length one.
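The summation structure of (1.1) and (1.2) can be cross-checked against an equivalent einsum formulation; the shapes below are hypothetical and the snippet is only a verification sketch.

import numpy as np

# Hypothetical shapes: m=2, oh=ow=3, fh=fw=3, d=3, u=4
Xb = np.random.standard_normal((2, 3, 3, 3, 3, 3, 1))   # (m, oh, ow, fh, fw, d, 1)
dZ = np.random.standard_normal((2, 3, 3, 4))            # (m, oh, ow, u)

dZb = dZ[:, :, :, None, None, None, :]                  # (m, oh, ow, 1, 1, 1, u)

# Sum of block products over ow, oh and m, as in (1.1) and (1.2)
dW = np.sum(dZb * Xb, axis=(2, 1, 0))                   # (fh, fw, d, u)
db = np.sum(dZb, axis=(2, 1, 0)).squeeze()              # (u,)

# Equivalent einsum formulation for dW
dW_einsum = np.einsum('mijabcz,miju->abcu', Xb, dZ)
print(np.allclose(dW, dW_einsum))                       # True
print(dW.shape, db.shape)                               # (3, 3, 3, 4) (4,)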
\[\begin{split}\begin{alignat*}{2} & \frac{\partial \mathcal{L}}{\partial W^{k}_{f_{h}f_{w}du}} &&= \sum_{o_{w} = 1}^{Ow} \sum_{o_{h} = 1}^{Oh} \sum_{m = 1}^{M} xb^{k}_{mo_{h}o_{w}f_{h}f_{w}du} * \frac{\partial \mathcal{L}}{\partial zb^{k}_{mo_{h}o_{w}111u}} \tag{1.1} \\ & \frac{\partial \mathcal{L}}{\partial b^{k}_{u}} &&= \sum_{o_{w} = 1}^{Ow} \sum_{o_{h} = 1}^{Oh} \sum_{m = 1}^{M} \frac{\partial \mathcal{L}}{\partial zb^{k}_{mo_{h}o_{w}111u}} \tag{1.2} \end{alignat*}\end{split}\]
Live examples
You may also like to browse all Network training examples provided with EpyNN.