Convolution 2D (CNN)

Source files in EpyNN/epynn/convolution/.

See Appendix - Notations for mathematical conventions.

Layer architecture

Convolution

A Convolution layer is an object containing a number of units - often referred to as filters - and provided with functions for parameters initialization, non-linear activation of inputs. Importantly, the Convolution layer is provided with a filter_size and strides argument upon instantiation. They define, respectively, the dimension of the window used for the convolution operation and the steps by which the window is moved between each operation.

class epynn.convolution.models.Convolution(unit_filters=1, filter_size=(3, 3), strides=None, padding=0, activate=<function relu>, initialization=<function xavier>, use_bias=True, se_hPars=None)[source]

Bases: epynn.commons.models.Layer

Definition of a convolution layer prototype.

Parameters
  • unit_filters (int, optional) – Number of unit filters in convolution layer, defaults to 1.

  • filter_size (int or tuple[int], optional) – Height and width for convolution window, defaults to (3, 3).

  • strides (int or tuple[int], optional) – Height and width to shift the convolution window by, defaults to None which equals filter_size.

  • padding (int, optional) – Number of zeros to pad each features plane with, defaults to 0.

  • activate (function, optional) – Non-linear activation of unit filters, defaults to relu.

  • initialization (function, optional) – Weight initialization function for convolution layer, defaults to xavier.

  • use_bias (bool, optional) – Whether the layer uses bias, defaults to True.

  • se_hPars (dict[str, str or float] or NoneType, optional) – Layer hyper-parameters, defaults to None and inherits from model.

Shapes

Convolution.compute_shapes(A)[source]

Wrapper for epynn.convolution.parameters.convolution_compute_shapes().

Parameters

A (numpy.ndarray) – Output of forward propagation from previous layer.

def convolution_compute_shapes(layer, A):
    """Compute forward shapes and dimensions for layer.
    """
    X = A    # Input of current layer

    layer.fs['X'] = X.shape    # (m, h, w, d)

    layer.d['m'] = layer.fs['X'][0]     # Number of samples  (m)
    layer.d['h'] = layer.fs['X'][1]     # Height of features (h)
    layer.d['w'] = layer.fs['X'][2]     # Width of features  (w)
    layer.d['d'] = layer.fs['X'][3]     # Depth of features  (d)

    # Output height (oh) and width (ow)
    layer.d['oh'] = math.floor((layer.d['h']-layer.d['fh']) / layer.d['sh']) + 1
    layer.d['ow'] = math.floor((layer.d['w']-layer.d['fw']) / layer.d['sw']) + 1

    # Shapes for trainable parameters
    # filter_height (fh), filter_width (fw), features_depth (d), unit_filters (u)
    layer.fs['W'] = (layer.d['fh'], layer.d['fw'], layer.d['d'], layer.d['u'])
    layer.fs['b'] = (layer.d['u'], )

    return None

Within a Convolution layer, shapes of interest include:

  • Input X of shape (m, h, w, d) with m equal to the number of samples, h the height of features, w the width of features and d the depth of features.

  • Output height oh defined by the ratio of the difference between h and filter height fh over the stride height sh. The float ratio is diminished to the nearest lower integer on which the value one is added.

  • Output width ow defined by the ratio of the difference between w and filter width fw over the stride width sw. The float ratio is diminished to the nearest lower integer on which the value one is added.

  • Weight W of shape (fh, fw, d, u) with fh the filter height, fw the filter width, d the depth of features and u the number of units - or filters - in the current layer k.

  • Bias b of shape (1, u) with u the number of units - or filters - in the layer.

Note that:

  • The input of a Convolution layer is often an image of features map. Therefore, we would more generally say that h is the image height, w the image width and d the image depth.

  • Parameters shape for W and b is independent from the number of samples m.

_images/convolution1-01.svg

Forward

Convolution.forward(A)[source]

Wrapper for epynn.convolution.forward.convolution_forward().

Parameters

A (numpy.ndarray) – Output of forward propagation from previous layer.

Returns

Output of forward propagation for current layer.

Return type

numpy.ndarray

def convolution_forward(layer, A):
    """Forward propagate signal to next layer.
    """
    # (1) Initialize cache and pad image
    X = initialize_forward(layer, A)    # (m, h, w, d)

    # (2) Slice input w.r.t. filter size (fh, fw) and strides (sh, sw)
    Xb = np.array([[X[ :, h:h + layer.d['fh'], w:w + layer.d['fw'], :]
                    # Inner loop
                    # (m, h, w, d) ->
                    # (ow, m, h, fw, d)
                    for w in range(layer.d['w'] - layer.d['fw'] + 1)
                    if w % layer.d['sw'] == 0]
                # Outer loop
                # (ow, m, h, fw, d) ->
                # (oh, ow, m, fh, fw, d)
                for h in range(layer.d['h'] - layer.d['fh'] + 1)
                if h % layer.d['sh'] == 0])

    # (3) Bring back m along axis 0
    Xb = np.moveaxis(Xb, 2, 0)
    # (oh, ow, m, fh, fw, d) ->
    # (m, oh, ow, fh, fw, d)

    # (4) Add dimension for filter units (u) on axis 6
    Xb = layer.fc['Xb'] = np.expand_dims(Xb, axis=6)
    # (m, oh, ow, fh, fw, d) ->
    # (m, oh, ow, fh, fw, d, 1)

    # (5.1) Linear activation Xb -> Zb
    Zb = Xb * layer.p['W']
    # (m, oh, ow, fh, fw, d, 1) - Xb
    #            (fh, fw, d, u) - W

    # (5.2) Sum block products
    Z = np.sum(Zb, axis=(5, 4, 3))
    # (m, oh, ow, fh, fw, d, u) - Zb
    # (m, oh, ow, fh, fw, u)    - np.sum(Zb, axis=(5))
    # (m, oh, mw, fh, u)        - np.sum(Zb, axis=(5, 4))
    # (m, oh, ow, u)            - np.sum(Zb, axis=(5, 4, 3))

    # (5.3) Add bias to linear activation product
    Z = layer.fc['Z'] = Z + layer.p['b']

    # (6) Non-linear activation
    A = layer.fc['A'] = layer.activate(Z)

    return A    # To next layer
_images/convolution2-01.svg

The forward propagation function in a Convolution layer k includes:

  • (1): Input X in current layer k with shape (m, h, w, d) is equal to the output A of previous layer k-1.

  • (2): Xb is an array of blocks with shape (oh, ow, m, fh, fw, d) made by iterative slicing of X with respect to fh, fw and sh, sw.

  • (3): Given Xb with shape (oh, ow, m, fh, fw, d) the operation moves the axis 2 to the position 0, yielding Xb with shape (m, oh, ow, fh, fw, d).

  • (4): The operation adds a new axis in position 6 of the array Xb, yielding a shape of (m, oh, ow, fh, fw, d, u).

  • (5): The linear activation block product Zb is computed by multiplying the input blocks Xb with shape (m, oh, ow, fh, fw, d, u) by the weight W with shape (fh, fw, d, u).

  • (6): The linear activation product Z with shape (m, oh, ow, u) is computed by summing the block product Zb with shape (m, oh, ow, fh, fw, d, u) over the blocks dimension on axis 5, 4 and 3.

  • (7): Computation of Z is completed by the addition of the bias b.

  • (8): Output A is computed by applying a non-linear activation function on Z.

Note that:

  • This implementation is not the naive implementation of a Convolution layer. In the latter, the process is fully iterative with a nested loop that includes four levels for each input dimension. Because this naive implementation is prohibitively slow, here is depicted a stride groups optimized version that takes advantage of NumPy arrays.

\[\begin{split}\begin{alignat*}{2} & x^{k}_{mhwd} &&= a^{\km}_{mhwd} \tag{1} \\ & \_\_Xb^{k} &&= blocks(X^{k}) \tag{2} \\ & \_Xb^{k} &&= moveaxis(\_\_Xb^{k}) \tag{3} \\ & Xb^{k} &&= expand\_dims(\_Xb^{k}) \tag{4} \\ & Zb^{k} &&= Xb^{k} * W^{k}_{f_{h}f_{w}du} \tag{5.1} \\ & \_z^{k}_{mo_{h}o_{w}u} &&= \sum_{f_{h} = 1}^{Fh} \sum_{f_{w} = 1}^{Fw} \sum_{d = 1}^{D} Zb^{k}_{mo_{h}o_{w}f_{h}f_{w}du} \tag{5.2} \\ & z^{k}_{mo_{h}o_{w}u} &&= \_z^{k}_{mo_{h}o_{w}u} + b^{k}_{111u} \tag{5.3} \\ & a^{k}_{mo_{h}o_{w}u} &&= a_{act}(z^{k}_{mo_{h}o_{w}u}) \tag{6} \end{alignat*}\end{split}\]
\[\begin{split}\begin{alignat*}{2} & where~blocks~is~defined~as: \\ &~~~~~~~~~~~~~~~~~~~~~~~~blocks:\mathcal{M}_{M,H,W,D}(\mathbb{R}) &&\to \mathcal{M}_{Oh,Ow,M,Fh,Fw,D}(\mathbb{R}) \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~X = \mathop{(x_{mhwd})}_{\substack{1 \le m \le M \\ 1 \le h \le H \\ 1 \le w \le W \\ 1 \le d \le D}} &&\to Y = \mathop{(y_{o_{h}o_{w}mf_hf_wd})}_{\substack{1 \le o_h \le Oh \\ 1 \le o_w \le Ow \\ 1 \le m \le M \\ 1 \le f_h \le Fh \\ 1 \le f_w \le Fw \\ 1 \le d \le D}} \\ &~~~~~~~~~~~~~~~~~~~~~~~with~Fh, Fw, Sh, Sw \in &&~\mathbb{N^*_+} \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\forall{h} \in &&~\{1,..,H-Fh~|~h \pmod{Sh} = 0\} \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\forall{w} \in &&~\{1,..,W-Fw~|~h \pmod{Sw} = 0\} \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Y = && X[:, h:h+Fh, w:w+Fw:, :] \\ \\ & where~moveaxis~is~defined~as: \\ &~~~~~~~~moveaxis:\mathcal{M}_{Oh,Ow,M,Fh,Fw,D}(\mathbb{R}) &&\to \mathcal{M}_{M,Oh,Ow,Fh,Fw,D}(\mathbb{R}) \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~X &&\to Y \\ \\ & where~expand\_dims~is~defined~as: \\ &~~~~expand\_dims:\mathcal{M}_{M,Oh,Ow,Fh,Fw,D}(\mathbb{R}) &&\to \mathcal{M}_{M,Oh,Ow,Fh,Fw,D,1}(\mathbb{R}) \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~X &&\to Y \end{alignat*}\end{split}\]

Backward

Convolution.backward(dX)[source]

Wrapper for epynn.convolution.backward.convolution_backward().

Parameters

dX (numpy.ndarray) – Output of backward propagation from next layer.

Returns

Output of backward propagation for current layer.

Return type

numpy.ndarray

def convolution_backward(layer, dX):
    """Backward propagate error gradients to previous layer.
    """
    # (1) Initialize cache
    dA = initialize_backward(layer, dX)    # (m, oh, ow, u)

    # (2) Gradient of the loss w.r.t. Z
    dZ = layer.bc['dZ'] = (
        dA
        * layer.activate(layer.fc['Z'], deriv=True)
    )   # dL/dZ

    # (3) Restore kernel dimensions
    dZb = dZ
    dZb = np.expand_dims(dZb, axis=3)
    dZb = np.expand_dims(dZb, axis=3)
    dZb = np.expand_dims(dZb, axis=3)
    # (m, oh, ow, 1, u)         ->
    # (m, oh, ow, 1, 1, u)     ->
    # (m, oh, ow, 1, 1, 1, u)

    # (4) Initialize backward output dL/dX
    dX = np.zeros_like(layer.fc['X'])      # (m, h, w, d)

    # Iterate over forward output height
    for oh in range(layer.d['oh']):

        hs = oh * layer.d['sh']
        he = hs + layer.d['fh']

        # Iterate over forward output width
        for ow in range(layer.d['ow']):

            ws = ow * layer.d['sw']
            we = ws + layer.d['fw']

            # (5hw) Gradient of the loss w.r.t Xb
            dXb = dZb[:, oh, ow, :] * layer.p['W']
            # (m, oh, ow,  1,  1, 1, u) - dZb
            #         (m,  1,  1, 1, u) - dZb[:, h, w, :]
            #            (fh, fw, d, u) - W

            # (6hw) Sum over units axis
            dX[:, hs:he, ws:we, :] += np.sum(dXb, axis=4)
            # (m, fh, fw, d, u) - dXb
            # (m, fh, fw, d)    - np.sum(dXb, axis=4)

    layer.bc['dX'] = dX

    return dX
_images/convolution3-01.svg

The backward propagation function in a Convolution layer k includes:

  • (1): dA with shape (m, oh, ow, u) is the gradient of the loss with respect to the output of forward propagation A for current layer k. It is equal to the gradient of the loss with respect to input of forward propagation for next layer k+1.

  • (2): dZ is the gradient of the loss with respect to Z. It is computed by applying element-wise multiplication between dA and the derivative of the non-linear activation function applied on Z.

  • (3): Three new axes are added before the last axis of dZ with shape (m, oh, ow, u) to reintroduce the dimensions corresponding to Xb with shape (m, oh, ow, fh, fw, d, u).

  • (4): The gradient of the loss dX with respect to the input of forward propagation X for current layer k is initialized as a zeros array of the same shape as X.

  • (5hw): For each row h with respect to oh and for each column w with respect to ow, the gradient of the loss for one block dXb with respect to the one block input of forward propagation is equal to the product between the corresponding block in dZb and the weight W.

  • (6hw): For each row h with respect to oh and for each column w with respect to ow, the corresponding window coordinates in X are computed: hs, he for height start and height end; ws, we for width start and width end. The gradient of the loss dX with respect to the current window in X is the sum of dXb over the last axis 4.

Note that:

  • One window represents an ensemble of coordinates. One value with given coordinates may be part of more than one window. This is why the operation dX[:, hs:he, ws:we, :] += np.sum(dXb, axis=4) is used instead of dX[:, hs:he, ws:we, :] = np.sum(dXb, axis=4).

\[\begin{split}\begin{alignat*}{2} & \delta^{\kp}_{mo_{h}o_{w}u} &&= \frac{\partial \mathcal{L}}{\partial a^{k}_{mo_{h}o_{w}u}} = \frac{\partial \mathcal{L}}{\partial x^{\kp}_{mo_{h}o_{w}u}} \tag{1} \\ & \frac{\partial \mathcal{L}}{\partial zb^{k}_{mo_{h}o_{w}u}} &&= \delta^{\kp}_{mo_{h}o_{w}u} * a_{act}'(z^{k}_{mo_{h}o_{w}u}) \tag{2} \\ & \frac{\partial \mathcal{L}}{\partial zb^{k}_{mo_{h}o_{w}111u}} &&= expand\_dims(\frac{\partial \mathcal{L}}{\partial a^{k}_{mo_{h}o_{w}d}}) \tag{3} \\ & \frac{\partial \mathcal{L}}{\partial X^{k}} &&= [\frac{\partial \mathcal{L}}{\partial x^{k}_{mhwd}}] \in \{0\}^{M \times H \times W \times D} \tag{4} \\ \\ & \frac{\partial \mathcal{L}}{\partial Xb^{k<o_{h}o_{w}>}} &&= \frac{\partial \mathcal{L}}{\partial Zb^{k}_{mo_{h}o_{w}111u}} * W^{k}_{f_{h}f_{w}du} \tag{5hw} \\ \\ & \Delta\frac{\partial \mathcal{L}}{\partial X^{k<h_s:h_e,w_s:w_e>}_{mf_hf_wd}} &&= \sum_{u = 1}^{U} \frac{\partial \mathcal{L}}{\partial Xb^{k<o_{h}o_{w}>}_{mf_hf_wdu}} \tag{6hw} \end{alignat*}\end{split}\]
\[\begin{split}\begin{alignat*}{2} & where~expand\_dims~is~defined~as: \\ &~~~~expand\_dims:\mathcal{M}_{M,Oh,Ow,U}(\mathbb{R}) &&\to \mathcal{M}_{M,Oh,Ow,1,1,1,U}(\mathbb{R}) \\ &~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~X &&\to Y \end{alignat*}\end{split}\]

Gradients

Convolution.compute_gradients()[source]

Wrapper for epynn.convolution.parameters.convolution_compute_gradients().

def convolution_compute_gradients(layer):
    """Compute gradients with respect to weight and bias for layer.
    """
    # Gradients initialization with respect to parameters
    for parameter in layer.p.keys():
        gradient = 'd' + parameter
        layer.g[gradient] = np.zeros_like(layer.p[parameter])

    Xb = layer.fc['Xb']     # Input blocks of forward propagation
    dZ = layer.bc['dZ']     # Gradient of the loss with respect to Z

    # Expand dZ dimensions with respect to Xb
    dZb = dZ
    dZb = np.expand_dims(dZb, axis=3)    # (m, oh, ow, 1, u)
    dZb = np.expand_dims(dZb, axis=3)    # (m, oh, ow, 1, 1, u)
    dZb = np.expand_dims(dZb, axis=3)    # (m, oh, ow, 1, 1, 1, u)

    # (1) Gradient of the loss with respect to W, b
    dW = layer.g['dW'] = np.sum(dZb * Xb, axis=(2, 1, 0))   # (1.1) dL/dW
    db = layer.g['db'] = np.sum(dZb, axis=(2, 1, 0))        # (1.2) dL/db

    layer.g['db'] = db.squeeze() if layer.use_bias else 0.

    return None

The function to compute parameters gradients in a Convolution layer k includes:

  • (1.1): dW is the gradient of the loss with respect to W. It is computed by element-wise multiplication between Xb and dZb followed by summing over axes 2, 1 and 0. Axes correspond to the output width ow, output height oh and number of samples m dimensions, respectively.

  • (1.2): db is the gradient of the loss with respect to b. It is computed by summing dZb along axes 2, 1 and 0 followed by a squeeze which removes axes of length one.

\[\begin{split}\begin{alignat*}{2} & \frac{\partial \mathcal{L}}{\partial W^{k}_{f_{h}f_{w}du}} &&= \sum_{o_{w} = 1}^{Ow} \sum_{o_{h} = 1}^{Oh} \sum_{m = 1}^{M} xb^{k}_{mo_{h}o_{w}f_{h}f_{w}du} * \frac{\partial \mathcal{L}}{\partial zb^{k}_{mo_{h}o_{w}111u}} \tag{1.1} \\ & \frac{\partial \mathcal{L}}{\partial b^{k}_{u}} &&= \sum_{o_{w} = 1}^{Ow} \sum_{o_{h} = 1}^{Oh} \sum_{m = 1}^{M} \frac{\partial \mathcal{L}}{\partial zb^{k}_{mo_{h}o_{w}111u}} \tag{1.2} \end{alignat*}\end{split}\]