Loss - Functions

Source code in EpyNN/epynn/commons/loss.py.

See Appendix - Notations for mathematical conventions.

In the context of error backpropagation, a loss function represents:

Any differentiable function used to evaluate differences between true values Y and predicted values A.
The loss function is used to compute the cost.
The derivative of the loss function is used to compute losses for each sample and for each output probability from the output layer.

Note that:

The training of a Neural Network is driven by the derivative of the loss function. The target is to minimize losses for each sample and for each output probability.
The cost is computed from the loss function, not from its derivative. It is evaluated for each sample and for each output probability, then most frequently averaged over output probabilities to yield one value per sample. A single scalar can be obtained by averaging the per-sample costs.
The cost quantifies the magnitude of the difference between true values Y and predicted values A.
The loss, computed from the derivative, qualifies the direction of the difference between true values Y and predicted values A.
Loss functions can be modified or implemented from epynn.commons.loss.
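
The distinction between cost and loss made in the notes above can be illustrated with a minimal sketch. It restates the MSE function documented on this page so that the block is self-contained; the arrays Y and A are hypothetical values chosen for illustration, not EpyNN data.

```python
import numpy as np


def MSE(Y, A, deriv=False):
    """Mean Squared Error, restated from this page for self-containment."""
    U = A.shape[1]    # Number of output nodes
    if not deriv:
        return 1. / U * np.sum((Y - A)**2, axis=1)
    return -2. / U * (Y - A)


# Hypothetical data: M = 2 samples, U = 2 output nodes
Y = np.array([[1., 0.], [0., 1.]])
A = np.array([[0.8, 0.2], [0.4, 0.6]])

per_sample_cost = MSE(Y, A)             # One value per sample, shape (M,)
scalar_cost = per_sample_cost.mean()    # Single scalar over all samples
dA = MSE(Y, A, deriv=True)              # One loss per sample and output, shape (M, U)
```

Here the per-sample costs are approximately 0.04 and 0.16 and the scalar cost is approximately 0.1, all non-negative, whereas dA holds both negative and positive entries depending on whether A lies below or above Y.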

Mean Squared Error

def MSE(Y, A, deriv=False):
    """Mean Squared Error.
    """
    U = A.shape[1]    # Number of output nodes

    if not deriv:
        loss =  1. / U * np.sum((Y - A)**2, axis=1)

    elif deriv:
        loss = -2. / U * (Y-A)

    return loss

Given M training examples and U output units, the MSE function can be defined as:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(\mathbb{R}), \mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R}_{+}) \\ A = \mathop{(a_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}}, Y = \mathop{(y_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && \frac{1}{U} * \sum\limits_{u = 1}^U (y_{mu}-a_{mu})^2 \end{alignat*}\end{split}\]

The derivative of the MSE function with respect to A can be defined as:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(\mathbb{R}), \mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R}) \\ A = \mathop{(a_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}}, Y = \mathop{(y_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && - \frac{2}{U} * (y_{mu}-a_{mu}) \end{alignat*}\end{split}\]

Note that the output of the MSE function is never negative and increases along with the magnitude of the difference between true values Y and predicted values A.

By contrast, the output of the derivative of the MSE function is positive or negative and therefore contains information on the direction of the difference between true values Y and predicted values A.
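
One way to convince oneself that the derivative matches the function is a finite-difference check. The sketch below is an illustration only: `numerical_derivative` is a hypothetical helper that is not part of EpyNN, and the random arrays are arbitrary test data.

```python
import numpy as np


def MSE(Y, A, deriv=False):
    """Mean Squared Error, restated from this page for self-containment."""
    U = A.shape[1]    # Number of output nodes
    if not deriv:
        return 1. / U * np.sum((Y - A)**2, axis=1)
    return -2. / U * (Y - A)


def numerical_derivative(loss, Y, A, eps=1e-6):
    """Central finite differences of the per-sample cost w.r.t. each a_mu.

    Hypothetical helper for illustration, not part of EpyNN.
    """
    dA = np.zeros_like(A)
    for m in range(A.shape[0]):
        for u in range(A.shape[1]):
            A_plus, A_minus = A.copy(), A.copy()
            A_plus[m, u] += eps
            A_minus[m, u] -= eps
            # Only the cost of sample m depends on a_mu
            dA[m, u] = (loss(Y, A_plus)[m] - loss(Y, A_minus)[m]) / (2 * eps)
    return dA


rng = np.random.default_rng(0)    # Arbitrary seed, arbitrary data
Y = rng.random((4, 3))
A = rng.random((4, 3))

analytic = MSE(Y, A, deriv=True)
numeric = numerical_derivative(MSE, Y, A)
max_gap = np.max(np.abs(analytic - numeric))    # Expected to be close to zero
```

The same helper can be pointed at any of the other loss functions on this page, provided Y and A stay inside the domain of the function being checked.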

Mean Absolute Error

def MAE(Y, A, deriv=False):
    """Mean Absolute Error.
    """
    U = A.shape[1]    # Number of output nodes

    if not deriv:
        loss =  1. / U * np.sum(np.abs(Y-A), axis=1)

    elif deriv:
        loss = -1. / U * (Y-A) / (np.abs(Y-A)+E_SAFE)

    return loss

Given M training examples and U output units, the MAE function can be defined as:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(\mathbb{R}), \mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R}_{+}) \\ A = \mathop{(a_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}}, Y = \mathop{(y_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && \frac{1}{U} * \sum\limits_{u = 1}^U |y_{mu}-a_{mu}| \end{alignat*}\end{split}\]

The derivative of the MAE function with respect to A can be defined as:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(\mathbb{R}), \mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R}) \\ A = \mathop{(a_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}}, Y = \mathop{(y_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && - \frac{1}{U} * \frac{y_{mu}-a_{mu}}{|y_{mu}-a_{mu}|} \end{alignat*}\end{split}\]

Note that the output of the MAE function is never negative and increases along with the magnitude of the difference between true values Y and predicted values A.

By contrast, wherever A differs from Y, the output of the derivative of the MAE function is either positive (1/U) or negative (-1/U). It therefore contains information on the direction, but not on the magnitude, of the difference between true values Y and predicted values A.
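
The constant magnitude of the MAE derivative can be seen in a minimal sketch. The function is restated from this page; the value given to E_SAFE is an assumption made so the block runs on its own (the constant defined in EpyNN may differ), and the arrays are hypothetical.

```python
import numpy as np

E_SAFE = 1e-16    # Assumed value for illustration; EpyNN's constant may differ


def MAE(Y, A, deriv=False):
    """Mean Absolute Error, restated from this page for self-containment."""
    U = A.shape[1]    # Number of output nodes
    if not deriv:
        return 1. / U * np.sum(np.abs(Y - A), axis=1)
    return -1. / U * (Y - A) / (np.abs(Y - A) + E_SAFE)


# One sample, U = 4 outputs: A below Y, above Y, equal to Y, far above Y
Y = np.array([[1., 0., 0.5, 2.]])
A = np.array([[0.7, 0.3, 0.5, 5.]])

cost = MAE(Y, A)               # Mean of |Y - A| over the 4 outputs
dA = MAE(Y, A, deriv=True)     # Sign information scaled by 1/U
```

With U = 4, the derivative is approximately -0.25 where A is below Y and 0.25 where A is above Y, whether the gap is 0.3 or 3. Where A equals Y, the E_SAFE term keeps the division defined and the result is zero.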

Mean Squared Logarithmic Error

def MSLE(Y, A, deriv=False):
    """Mean Squared Logarithmic Error.
    """
    U = A.shape[1]    # Number of output nodes

    if not deriv:
        loss = 1. / U * np.sum(np.square(np.log1p(Y) - np.log1p(A)), axis=1)

    elif deriv:
        loss = -2. / U * (np.log1p(Y) - np.log1p(A)) / (A + 1.)

    return loss

Given M training examples and U output units, the MSLE function can be defined as:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(]-1, +\infty[), \mathcal{M}_{M,U}(]-1, +\infty[) &\to && \mathcal{M}_{M,U}(\mathbb{R}_{+}) \\ A = \mathop{(a_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}}, Y = \mathop{(y_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && \frac{1}{U} * \sum\limits_{u = 1}^U (\ln(y_{mu}+1) - \ln(a_{mu}+1))^2 \end{alignat*}\end{split}\]

The derivative of the MSLE function with respect to A can be defined as:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(]-1, +\infty[), \mathcal{M}_{M,U}(]-1, +\infty[) &\to && \mathcal{M}_{M,U}(\mathbb{R}) \\ A = \mathop{(a_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}}, Y = \mathop{(y_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && - \frac{2}{U} * \frac{\ln(y_{mu}+1)-\ln(a_{mu}+1)}{a_{mu}+1} \end{alignat*}\end{split}\]

Note that the output of the MSLE function is never negative and increases along with the magnitude of the difference between true values Y and predicted values A.

By contrast, the output of the derivative of the MSLE function is positive or negative and therefore contains information on the direction of the difference between true values Y and predicted values A.
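
Because the MSLE compares logarithms, its output depends on the ratio (y + 1) / (a + 1) rather than on the raw difference y - a. The sketch below illustrates this consequence of the formula with hypothetical values; the function is restated from this page so the block is self-contained.

```python
import numpy as np


def MSLE(Y, A, deriv=False):
    """Mean Squared Logarithmic Error, restated from this page."""
    U = A.shape[1]    # Number of output nodes
    if not deriv:
        return 1. / U * np.sum(np.square(np.log1p(Y) - np.log1p(A)), axis=1)
    return -2. / U * (np.log1p(Y) - np.log1p(A)) / (A + 1.)


# Three samples, U = 1 output. In each case (y + 1) and (a + 1) differ by a factor 2.
Y = np.array([[9.], [9.], [999.]])
A = np.array([[4.], [19.], [1999.]])    # A below Y, above Y, above Y at a larger scale

cost = MSLE(Y, A)               # Same value, ln(2)**2, for all three samples
dA = MSLE(Y, A, deriv=True)     # Negative where A < Y, positive where A > Y
```

All three samples yield the same cost although the raw differences are 5, 10 and 1000, while the sign of the derivative still tells whether A lies below or above Y.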

Binary Cross-Entropy

def BCE(Y, A, deriv=False):
    """Binary Cross-Entropy.
    """
    U = A.shape[1]    # Number of output nodes

    if not deriv:
        loss = -1. / U * np.sum(Y*np.log(A+E_SAFE) + (1-Y)*np.log((1-A)+E_SAFE), axis=1)

    elif deriv:
        loss = 1. / U * (A-Y) / (A - A*A + E_SAFE)

    return loss

Given M training examples and U output units, the BCE function can be defined as:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(]0, 1[), \mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R}_{+}) \\ A = \mathop{(a_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}}, Y = \mathop{(y_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && - \frac{1}{U} * \sum\limits_{u = 1}^U \left( y_{mu} * \ln(a_{mu}) + (1-y_{mu}) * \ln(1-a_{mu}) \right) \end{alignat*}\end{split}\]

The derivative of the BCE function with respect to A can be defined as:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(]0, 1[), \mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R}) \\ A = \mathop{(a_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}}, Y = \mathop{(y_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && \frac{1}{U} * \frac{a_{mu}-y_{mu}}{a_{mu} - a_{mu}^2} \end{alignat*}\end{split}\]
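
For reference, the derivative can be recovered by differentiating a single term of the BCE sum with respect to the corresponding element of A. The steps below are a sketch for one pair of indices m and u, using the notations of this page:

\[\begin{aligned} \frac{\partial}{\partial a_{mu}} \left[ - \frac{1}{U} * \left( y_{mu} * \ln(a_{mu}) + (1-y_{mu}) * \ln(1-a_{mu}) \right) \right] &= - \frac{1}{U} * \left( \frac{y_{mu}}{a_{mu}} - \frac{1-y_{mu}}{1-a_{mu}} \right) \\ &= - \frac{1}{U} * \frac{y_{mu} * (1-a_{mu}) - a_{mu} * (1-y_{mu})}{a_{mu} * (1-a_{mu})} \\ &= \frac{1}{U} * \frac{a_{mu}-y_{mu}}{a_{mu} - a_{mu}^2} \end{aligned}\]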

The BCE function belongs to the categorical loss functions, meaning that it is relevant for classification problems. Accordingly, the true values Y are expected to belong to the set {0, 1} and the predicted values A to lie strictly between 0 and 1. With other values of Y, the properties described below are not guaranteed to hold.

If this requirement is satisfied, then the output of the BCE function is never negative and increases along with the magnitude of the difference between true values Y and predicted values A.

By contrast, the output of the derivative of the BCE function is positive or negative and therefore contains information on the direction of the difference between true values Y and predicted values A.
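
A minimal sketch with binary labels illustrates both statements. The function is restated from this page; the value given to E_SAFE is an assumption made so the block runs on its own (the constant defined in EpyNN may differ), and the arrays are hypothetical.

```python
import numpy as np

E_SAFE = 1e-16    # Assumed value for illustration; EpyNN's constant may differ


def BCE(Y, A, deriv=False):
    """Binary Cross-Entropy, restated from this page for self-containment."""
    U = A.shape[1]    # Number of output nodes
    if not deriv:
        return -1. / U * np.sum(
            Y * np.log(A + E_SAFE) + (1 - Y) * np.log((1 - A) + E_SAFE), axis=1)
    return 1. / U * (A - Y) / (A - A * A + E_SAFE)


# Three samples, U = 1 output: good prediction, then two poor predictions
Y = np.array([[1.], [1.], [0.]])
A = np.array([[0.9], [0.1], [0.9]])

cost = BCE(Y, A)               # Small for the first sample, large for the others
dA = BCE(Y, A, deriv=True)     # Negative where A < Y, positive where A > Y
```

The first sample costs -ln(0.9) while the two poor predictions each cost -ln(0.1). The derivative is negative for the first two samples, where A should increase towards Y = 1, and positive for the third, where A should decrease towards Y = 0.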

Categorical Cross-Entropy

def CCE(Y, A, deriv=False):
    """Categorical Cross-Entropy.
    """
    U = A.shape[1]    # Number of output nodes

    if not deriv:
        loss = -1. * np.sum(Y * np.log(A+E_SAFE), axis=1)

    elif deriv:
        loss = -1. * (Y / A)

    return loss

Given M training examples and U output units, the CCE function can be defined as:

\[\begin{split}\begin{alignat*}{2} f:\mathcal{M}_{M,U}(\mathbb{R}_{+}^{*}), \mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R}_{+}) \\ A = \mathop{(a_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}}, Y = \mathop{(y_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && - \sum\limits_{u = 1}^U y_{mu} * \ln(a_{mu}) \end{alignat*}\end{split}\]

The derivative of the CCE function with respect to A can be defined as:

\[\begin{split}\begin{alignat*}{2} f':\mathcal{M}_{M,U}(\mathbb{R}_{+}^{*}), \mathcal{M}_{M,U}(\mathbb{R}) &\to && \mathcal{M}_{M,U}(\mathbb{R}) \\ A = \mathop{(a_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}}, Y = \mathop{(y_{mu})}_{\substack{1 \le m \le M \\ 1 \le u \le U}} &\to && - \frac{y_{mu}}{a_{mu}} \end{alignat*}\end{split}\]

The CCE function belongs to the categorical loss functions, meaning that it is relevant for classification problems. Accordingly, the true values Y are expected to belong to the set {0, 1} and the predicted values A to be probabilities greater than zero. With other values of Y, the properties described below are not guaranteed to hold.

If this requirement is satisfied, then the output of the CCE function is never negative and increases along with the magnitude of the difference between true values Y and predicted values A.

By contrast, the output of the derivative of the CCE function is zero where Y is zero and negative where Y is one. Its magnitude grows as the predicted value A for the true class decreases, and it therefore contains information on how far predicted values A are from true values Y.
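
A minimal sketch with one-hot encoded labels illustrates the behavior of the CCE function and of its derivative. The function is restated from this page; the value given to E_SAFE is an assumption made so the block runs on its own (the constant defined in EpyNN may differ), and the arrays are hypothetical, with each row of A summing to one as a softmax output would.

```python
import numpy as np

E_SAFE = 1e-16    # Assumed value for illustration; EpyNN's constant may differ


def CCE(Y, A, deriv=False):
    """Categorical Cross-Entropy, restated from this page."""
    if not deriv:
        return -1. * np.sum(Y * np.log(A + E_SAFE), axis=1)
    return -1. * (Y / A)


# Two samples, U = 3 classes, one-hot encoded true labels
Y = np.array([[0., 1., 0.], [1., 0., 0.]])
A = np.array([[0.1, 0.8, 0.1], [0.2, 0.5, 0.3]])

cost = CCE(Y, A)               # Minus the log of the probability of the true class
dA = CCE(Y, A, deriv=True)     # Non-zero only at the position of the true class
```

With one-hot labels, the cost of each sample reduces to -ln(a) for the true class, here -ln(0.8) and -ln(0.2). The derivative is -1.25 and -5 at those positions and zero elsewhere, so the poorer prediction receives the larger correction.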