Distinguish author-specific patterns in music

  • Find this notebook at EpyNN/epynnlive/author_music/train.ipynb.

  • Regular python code at EpyNN/epynnlive/author_music/train.py.

Run the notebook online with Google Colab.

Level: Advanced

In this notebook we will review:

  • Handling univariate time series that represent a large number of data points.

  • Taking advantage of recurrent architectures (RNN, GRU) over Feed-Forward architectures.

  • Introducing recall and precision along with accuracy when dealing with unbalanced datasets.

It is assumed that all basic notebooks have already been reviewed:

This notebook does not enhance, extend or replace EpyNN’s documentation.

Relevant documentation pages for the current notebook:

Environment and data

Follow this link for details about data preparation.

Briefly, the raw data are instrumental guitar music from the True author and the False author. These are raw .wav files that were normalized, digitized using a 4-bit encoder and clipped.

Commonly, music .wav files have a sampling rate of 44100 Hz. This means that each second of music represents a numerical time series of length 44100.
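As a minimal illustration (a sketch assuming SciPy is installed; 'sample.wav' is a placeholder file name, not part of this project), the sampling rate and duration of a .wav file can be checked like this:

[ ]:
# Hypothetical example: inspect the sampling rate and duration of a .wav file
from scipy.io import wavfile

rate, data = wavfile.read('sample.wav')   # placeholder path
print(rate)                               # e.g. 44100 (samples per second)
print(data.shape[0] / rate)               # duration in seconds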

[1]:
# EpyNN/epynnlive/author_music/train.ipynb
# Install dependencies
!pip3 install --upgrade-strategy only-if-needed epynn

# Standard library imports
import random

# Related third party imports
import numpy as np

# Local application/library specific imports
import epynn.initialize
from epynn.commons.maths import relu, softmax
from epynn.commons.library import (
    configure_directory,
    read_model,
)
from epynn.network.models import EpyNN
from epynn.embedding.models import Embedding
from epynn.rnn.models import RNN
from epynn.gru.models import GRU
from epynn.flatten.models import Flatten
from epynn.dropout.models import Dropout
from epynn.dense.models import Dense
from epynnlive.author_music.prepare_dataset import (
    prepare_dataset,
    download_music,
)
from epynnlive.author_music.settings import se_hPars


########################## CONFIGURE ##########################
random.seed(1)

np.set_printoptions(threshold=10)

np.seterr(all='warn')
np.seterr(under='ignore')

configure_directory()


############################ DATASET ##########################
download_music()

X_features, Y_label = prepare_dataset(N_SAMPLES=256)

Let’s inspect.

[2]:
print(len(X_features))
print(X_features[0].shape)
print(X_features[0])
print(np.min(X_features[0]), np.max(X_features[0]))
256
(10000,)
[10  7  7 ...  9  9  9]
1 15

We clipped the original .wav files into 1-second clips and could thus retrieve 256 samples. We did this because the raw data are limited: since we want more training examples, we need to split the data into shorter clips.
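As a minimal sketch (not the exact code from prepare_dataset()), splitting a long signal into non-overlapping 1-second clips could look like this:

[ ]:
# Hypothetical example: split a signal into non-overlapping 1-second clips
import numpy as np

rate = 10000                                        # samples per second (after resampling)
signal = np.random.randint(0, 16, size=rate * 30)   # fake 30-second signal

n_clips = signal.shape[0] // rate
clips = signal[:n_clips * rate].reshape(n_clips, rate)

print(clips.shape)   # (30, 10000)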

Several other issues are discussed below:

  • Array size in memory: One second represents 44100 data points for each clip and thus 44100 * 256 = 11,289,600 data points in total. More than ten million data points is likely to overload the RAM or to raise a memory allocation error on most laptops. This is why we resampled the original .wav files to 10000 Hz. When doing that, we lose the patterns associated with frequencies greater than 5000 Hz. Alternatively, we could have made clips of shorter duration, but then we would miss patterns associated with lower frequencies. Because the guitar emission spectrum lies essentially below 5000 Hz, we preferred the resampling approach.

  • Signal normalization: The original signals were sequences of 16-bit integers ranging from -32768 to 32767. Feeding a neural network with such large values will most likely result in floating point errors. This is why we normalized the data from each .wav file within the range [0, 1].

  • Signal digitization: Although the original signal was already digital, encoded over 16-bit integers, normalization within the range [0, 1] leaves a difference of only about 3e-5 between adjacent levels. Such small differences may be difficult for the network to evaluate, and convergence during training could become prohibitively slow. In the context of this notebook, we therefore digitized from 16-bit to 4-bit integers ranging from 0 to 15, for a total of 16 bins instead of 65536.

  • One-hot encoding: To simplify the problem and focus on patterns, we eliminate explicit amplitudes by one-hot encoding the univariate, 4-bit encoded time series (see the sketch below).
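The sketch below illustrates these preprocessing steps with plain NumPy (normalization, 4-bit digitization, one-hot encoding). It is an illustrative approximation, not the exact code used in prepare_dataset():

[ ]:
# Hypothetical example of the preprocessing described above
import numpy as np

signal = np.random.randint(-32768, 32767, size=10000)   # fake 16-bit signal

norm = (signal - signal.min()) / (signal.max() - signal.min())   # scale to [0, 1]
edges = np.linspace(0, 1, num=16)                                # 16 levels (4 bits)
digitized = np.digitize(norm, edges) - 1                         # integers in [0, 15]
onehot = np.eye(16, dtype=int)[digitized]                        # shape (10000, 16)

print(onehot.shape)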

All things being said, we can go ahead.

Feed-Forward (FF)

We start with our reference, a Feed-Forward network with dropout regularization.

Embedding

We already scaled the input data for each .wav file, so we do not need to pass the X_scale argument to the constructor of the embedding layer. Note that X_scale=True would apply a global scaling over the whole training set. Here we work with independent .wav files, which should be normalized separately.

For the embedding, we will one-hot encode the time series. See One-hot encoding of string features for details about the process. Note that while one-hot encoding is mandatory when dealing with string input data, it can also be applied to digitized numerical data, as is the case here.

[3]:
embedding = Embedding(X_data=X_features,
                      Y_data=Y_label,
                      X_encode=True,
                      Y_encode=True,
                      batch_size=16,
                      relative_size=(2, 1, 0))

Let’s inspect the shape of the data.

[4]:
print(embedding.dtrain.X.shape)
print(embedding.dtrain.b)
(171, 10000, 16)
{0: 71, 1: 100}

We note that we have an unbalanced dataset, with negative samples accounting for a majority of the training set (100 of 171).

Flatten-(Dense)n with Dropout

Let’s proceed with the network design and training.

[5]:
name = 'Flatten_Dense-64-relu_Dropout05_Dense-2-softmax'

se_hPars['learning_rate'] = 0.01
se_hPars['softmax_temperature'] = 1

layers = [
    embedding,
    Flatten(),
    Dense(64, relu),
    Dropout(0.5),
    Dense(2, softmax),
]

model = EpyNN(layers=layers, name=name)

We can initialize the model.

[6]:
model.initialize(loss='MSE', seed=1, metrics=['accuracy', 'recall', 'precision'], se_hPars=se_hPars.copy(), end='\r')
--- EpyNN Check OK! ---                                                                             

Train it for 10 epochs.

[7]:
model.train(epochs=10, init_logs=False)
Epoch 1 - Batch 0/9 - Accuracy: 0.5 Cost: 0.49991 - TIME: 1.97s RATE: 1.01e+00e/s TTC: 10s          
/media/synthase/beta/EpyNN/epynn/commons/metrics.py:100: RuntimeWarning: invalid value encountered in long_scalars
  precision = (tp / (tp+fp))
Epoch 9 - Batch 9/9 - Accuracy: 0.625 Cost: 0.37297 - TIME: 13.26s RATE: 7.54e-01e/s TTC: 3s        

+-------+----------+----------+----------+-------+--------+-------+-----------+-------+--------+-------+------------------------------------------------------------+
| epoch |  lrate   |  lrate   | accuracy |       | recall |       | precision |       |  MSE   |       |                         Experiment                         |
|       |  Dense   |  Dense   |  dtrain  | dval  | dtrain | dval  |  dtrain   | dval  | dtrain | dval  |                                                            |
+-------+----------+----------+----------+-------+--------+-------+-----------+-------+--------+-------+------------------------------------------------------------+
|   0   | 1.00e-02 | 1.00e-02 |  0.591   | 0.541 | 0.014  | 0.000 |   1.000   |  nan  | 0.409  | 0.458 | 1635107656_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
|   1   | 1.00e-02 | 1.00e-02 |  0.544   | 0.588 | 0.704  | 0.615 |   0.467   | 0.545 | 0.434  | 0.381 | 1635107656_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
|   2   | 1.00e-02 | 1.00e-02 |  0.643   | 0.565 | 0.268  | 0.256 |   0.679   | 0.556 | 0.339  | 0.428 | 1635107656_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
|   3   | 1.00e-02 | 1.00e-02 |  0.637   | 0.541 | 0.141  | 0.128 |   0.909   | 0.500 | 0.358  | 0.447 | 1635107656_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
|   4   | 1.00e-02 | 1.00e-02 |  0.673   | 0.600 | 0.225  | 0.154 |   0.941   | 0.857 | 0.317  | 0.390 | 1635107656_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
|   5   | 1.00e-02 | 1.00e-02 |  0.690   | 0.624 | 0.408  | 0.359 |   0.725   | 0.667 | 0.303  | 0.354 | 1635107656_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
|   6   | 1.00e-02 | 1.00e-02 |  0.585   | 0.459 | 0.915  | 0.795 |   0.500   | 0.449 | 0.400  | 0.495 | 1635107656_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
|   7   | 1.00e-02 | 1.00e-02 |  0.550   | 0.494 | 0.986  | 0.974 |   0.479   | 0.475 | 0.445  | 0.493 | 1635107656_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
|   8   | 1.00e-02 | 1.00e-02 |  0.702   | 0.635 | 0.761  | 0.615 |   0.614   | 0.600 | 0.289  | 0.366 | 1635107656_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
|   9   | 1.00e-02 | 1.00e-02 |  0.772   | 0.694 | 0.549  | 0.462 |   0.848   | 0.783 | 0.219  | 0.280 | 1635107656_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
+-------+----------+----------+----------+-------+--------+-------+-----------+-------+--------+-------+------------------------------------------------------------+

The model does not reproduce the training data very well, and it is hardly representative of the validation data at all.

We can still comment on the recall and precision metrics:

  • Recall: This represents the fraction of positive instances retrieved by the model.

  • Precision: This represents the fraction of positive instances within the labels predicted as positive.

Said differently:

  • Given tp the true positive samples.

  • Given tn the true negative samples.

  • Given fp the false positive samples.

  • Given fn the false negative samples.

  • Then recall = tp / (tp+fn) and precision = tp / (tp+fp).
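As a minimal, self-contained illustration of these formulas (plain NumPy, not EpyNN's own metrics code):

[ ]:
# Hypothetical example: recall and precision from binary labels
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

print(tp / (tp + fn))   # recall: 0.666...
print(tp / (tp + fp))   # precision: 0.666...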

For code, maths and pictures behind the Dense layer, follow this link:

Recurrent Architectures

Recurrent architectures can make a difference here because they process time series one measurement at a time.

Importantly, the number of time steps does not define the size of the parameter (weight/bias) arrays, whereas in the Feed-Forward network it does.

For the Dense layer, the shape of W is (n, u), with n the number of nodes in the previous layer and u the number of units in the current layer. So when a Dense layer follows the embedding layer, the number of input nodes is equal to the number of features, here the number of time steps (10000).

By contrast, the parameter shapes of the RNN layer depend on the number of units and on the uni/multivariate nature of each measurement, but not on the number of time steps. In the previous situation there are likely too many parameters and the computation does not converge well.

Of note, this is because recurrent layer parameters are not defined with respect to sequence length, which is also why such layers can handle data of variable length. A rough parameter-count comparison is sketched below.
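The following back-of-the-envelope count assumes a vanilla RNN cell with input-to-hidden weights, hidden-to-hidden weights and a bias. It illustrates the argument and is not EpyNN's exact internals:

[ ]:
# Hypothetical parameter counts for the two designs discussed above
steps, depth, units, dense_units = 10000, 16, 1, 64

# Dense after Flatten: one weight per flattened input feature and per unit, plus biases
dense_params = (steps * depth) * dense_units + dense_units

# Vanilla RNN cell: Wx (depth, units), Wh (units, units) and bias (units,)
rnn_params = depth * units + units * units + units

print(dense_params)   # 10240064
print(rnn_params)     # 18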

Embedding

For the embedding, we will one-hot encode the time series. See One-hot encoding of string features for details about the process, which follows the same logic and requirements regardless of the data type.

[8]:
embedding = Embedding(X_data=X_features,
                      Y_data=Y_label,
                      X_encode=True,
                      Y_encode=True,
                      batch_size=16,
                      relative_size=(2, 1, 0))

Let’s inspect the data shape.

[9]:
print(embedding.dtrain.X.shape)
(171, 10000, 16)

RNN(sequences=True)-Flatten-(Dense)n with Dropout

Time to clarify a point:

  • We have a multivariate-like time series (a one-hot encoded univariate series) with 10000 time steps.

  • The sequence length of 10000 is unrelated to the number of units in the RNN layer. The number of units may be anything; the whole sequence will be processed in its entirety regardless.

  • In recurrent layers, parameter shapes relate to the number of units and the vocabulary size, not to the length of the sequence. That is why such architectures can handle input sequences of variable length (see the sketch below).
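To make that last point concrete, here is a minimal plain-NumPy sketch (not EpyNN's implementation) showing that the same recurrent weights can process sequences of any length:

[ ]:
# Hypothetical example: a vanilla RNN forward pass over sequences of different lengths
import numpy as np

depth, units = 16, 1                   # one-hot depth and number of RNN units
Wx = np.random.randn(depth, units)     # input-to-hidden weights
Wh = np.random.randn(units, units)     # hidden-to-hidden weights
b = np.zeros(units)                    # bias

def rnn_forward(X):                    # X has shape (steps, depth)
    h = np.zeros(units)
    hidden_states = []
    for x_t in X:                      # one measurement at a time
        h = np.tanh(x_t @ Wx + h @ Wh + b)
        hidden_states.append(h)
    return np.array(hidden_states)     # shape (steps, units)

print(rnn_forward(np.random.randn(10000, depth)).shape)   # (10000, 1)
print(rnn_forward(np.random.randn(500, depth)).shape)     # (500, 1)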

[10]:
name = 'RNN-1-Seq_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax'

se_hPars['learning_rate'] = 0.01
se_hPars['softmax_temperature'] = 1

layers = [
    embedding,
    RNN(1, sequences=True),
    Flatten(),
    Dense(64, relu),
    Dropout(0.5),
    Dense(2, softmax),
]

model = EpyNN(layers=layers, name=name)

We initialize the model.

[11]:
model.initialize(loss='MSE', seed=1, metrics=['accuracy', 'recall', 'precision'], se_hPars=se_hPars.copy(), end='\r')
--- EpyNN Check OK! ---                                                                             

We will only train for 3 epochs.

[12]:
model.train(epochs=3, init_logs=False)
Epoch 2 - Batch 9/9 - Accuracy: 1.0 Cost: 0.01575 - TIME: 20.68s RATE: 1.45e-01e/s TTC: 14s         

+-------+----------+----------+----------+----------+-------+--------+-------+-----------+-------+--------+-------+----------------------------------------------------------------------+
| epoch |  lrate   |  lrate   |  lrate   | accuracy |       | recall |       | precision |       |  MSE   |       |                              Experiment                              |
|       |   RNN    |  Dense   |  Dense   |  dtrain  | dval  | dtrain | dval  |  dtrain   | dval  | dtrain | dval  |                                                                      |
+-------+----------+----------+----------+----------+-------+--------+-------+-----------+-------+--------+-------+----------------------------------------------------------------------+
|   0   | 1.00e-02 | 1.00e-02 | 1.00e-02 |  0.906   | 0.753 | 0.831  | 0.744 |   0.937   | 0.725 | 0.079  | 0.167 | 1635107675_RNN-1-Seq_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
|   1   | 1.00e-02 | 1.00e-02 | 1.00e-02 |  0.953   | 0.729 | 0.944  | 0.641 |   0.944   | 0.735 | 0.037  | 0.168 | 1635107675_RNN-1-Seq_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
|   2   | 1.00e-02 | 1.00e-02 | 1.00e-02 |  0.971   | 0.741 | 0.972  | 0.744 |   0.958   | 0.707 | 0.020  | 0.189 | 1635107675_RNN-1-Seq_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax |
+-------+----------+----------+----------+----------+-------+--------+-------+-----------+-------+--------+-------+----------------------------------------------------------------------+

While we still observe overfitting, it is reduced compared to the Feed-Forward network, and the accuracy on the validation set is higher. This model seems, so far, more appropriate for the problem.

For code, maths and pictures behind the RNN layer, follow this link:

GRU(sequences=True)-Flatten-(Dense)n with Dropout

Let’s now try a more elaborate recurrent architecture.

[ ]:
name = 'GRU-1-Seq_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax'

se_hPars['learning_rate'] = 0.01
se_hPars['softmax_temperature'] = 1

layers = [
    embedding,
    GRU(1, sequences=True),
    Flatten(),
    Dense(64, relu),
    Dropout(0.5),
    Dense(2, softmax),
]

model = EpyNN(layers=layers, name=name)

model.initialize(loss='MSE', seed=1, metrics=['accuracy', 'recall', 'precision'], se_hPars=se_hPars.copy(), end='\r')

model.train(epochs=3, init_logs=False)
Epoch 0 - Batch 2/9 - Accuracy: 0.438 Cost: 0.24173 - TIME: 9.35s RATE: 1.07e-01e/s TTC: 37s        

In the context of this example, the GRU-based network performs poorly compared to both the Feed-Forward and RNN-based networks. While we could attempt to optimize it with more samples, a different batch size, a decaying learning rate and many other things, this teaches us that the more complex architecture is not necessarily the more appropriate one. It always depends on the context and, most importantly, on the computational and human time required with respect to the anticipated improvement.

For code, maths and pictures behind the GRU layer, follow this link:

Write, read & Predict

A trained model can be written to disk as follows:

[ ]:
model.write()

# model.write(path='/your/custom/path')

A model can be read from disk as follows:

[ ]:
model = read_model()

# model = read_model(path='/your/custom/path')

We can retrieve new features and predict on them.

[ ]:
X_features, _ = prepare_dataset(N_SAMPLES=10)

dset = model.predict(X_features, X_encode=True)

Results can be extracted as follows:

[ ]:
for n, pred, probs in zip(dset.ids, dset.P, dset.A):
    print(n, pred, probs)

Note that we wrote to disk the last model we computed, which was the poorest in terms of performance. Therefore, the predictions obtained here are not expected to be reliable. The RNN-based network should be saved instead, as it would provide more accurate results.