
Dummy dataset

Find this notebook at EpyNN/epynnlive/dummy_boolean/prepare_dataset.ipynb.
Regular python code at EpyNN/epynnlive/dummy_boolean/prepare_dataset.py.

Run the notebook online with Google Colab.

Level: Beginner

This notebook is part of the series on preparing data for Neural Network regression with EpyNN.

In addition to the topic-specific content, it contains several explanations about basics or general concepts in programming that are important in the context.

What is a Boolean data-type?

A Boolean data type has only two possible values, named True and False in most programming languages. In Python, bool is a subclass of int, so these values behave as 1 and 0 in arithmetic. Calculations on Boolean data are fast, and Boolean inputs and outputs are simpler to process than most other data types.
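As a quick illustration of the statement above, the sketch below relies only on built-in Python behavior. The variable names are illustrative and are not part of EpyNN.

```python
# bool is a subclass of int, so True and False behave as 1 and 0 in arithmetic
flags = [True, False, True, True]

is_int_subclass = issubclass(bool, int)
n_true = sum(flags)              # summing Booleans counts the True values
n_false = len(flags) - n_true

print(is_int_subclass, n_true, n_false)  # True 3 1
```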

Examples of real world topics well suited for the Boolean data type may include: molecular interactions, gene regulation, disease prediction and diagnosis, among many others.

Why prepare a dummy dataset with Boolean features?

A dummy dataset is a collection of data with no real-world interest. However, a dummy dataset can be prepared so that the results of Neural Network regression are predictable. When we know a priori the law we want to model, it is easier to evaluate whether the Neural Network is working as intended. When dealing with samples described by Boolean features, it is therefore good, time-saving practice to first test the code and settings on simple problems built from dummy data, which should work if no mistake was introduced in the procedure.

Prepare a set of Boolean sample features and related label

In a Python script, the first step is to import libraries. To draw an analogy with the everyday meaning of the word, a library is “a building or room [a Python file] containing collections of books, periodicals, and sometimes films and recorded music [objects] for people [developers] to read, borrow, or refer to”.

We import libraries for the usable content they provide.

Imports

[1]:

# EpyNN/epynnlive/dummy_boolean/prepare_dataset.ipynb
# Standard library imports
import random

According to PEP 8 – Style Guide for Python Code, imports should be grouped. Here we import random, which is part of the Python standard library.

Seeding

The random library provides functions to generate pseudo-random numbers. They are called pseudo-random because the numbers are generated by an algorithm rather than by physical processes expected to be truly random. See “True” vs. pseudo-random numbers on Wikipedia for a brief introduction.

Still, pseudo-random generators are convenient because they offer the possibility to reproduce sequences of numbers. This is important because when modifying code in a program which includes such generators, we want to evaluate the impact of the code modification, not the noise from the difference in the input sequence of pseudo-random numbers.

To introduce reproducibility in pseudo-random numbers generated by the random library, we need to seed the generator.

[2]:

random.seed(1)

print(random.randint(0, 10))

Each time you run this code, the same number will be generated.
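The same idea can be checked programmatically. The sketch below uses a hypothetical helper, draw(), which is not part of the notebook: it reseeds the global generator and then draws a short list of integers.

```python
import random


def draw(n, seed):
    """Hypothetical helper: seed the global generator, then draw n integers in [0, 10]."""
    random.seed(seed)
    return [random.randint(0, 10) for _ in range(n)]


first = draw(5, seed=1)
second = draw(5, seed=1)   # same seed, so the same sequence is produced

print(first == second)     # True
```

Note that random.seed() acts on the module-level generator, so any code drawing numbers between the seeding call and your own draws will shift the sequence.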

Generate features

We need to define a function which will generate pseudo-random Boolean features.

[3]:

def features_boolean(N_FEATURES=11):
    """Generate dummy Boolean features.

    :param N_FEATURES: Number of features, defaults to 11.
    :type N_FEATURES: int

    :return: random Boolean features of length N_FEATURES.
    :rtype: list[bool]
    """
    # Random choice True or False for N_FEATURES iterations
    features = [random.choice([True, False]) for j in range(N_FEATURES)]

    return features

The code is commented and largely self-explanatory. Let’s proceed with a call.

[4]:

features = features_boolean()
print(features, len(features))

[True, False, True, False, False, False, False, True, True, False, True] 11

When no argument is provided, features_boolean() uses the default value of its N_FEATURES parameter.

To change this behavior, features_boolean() can also be called as follows:

[5]:

features = features_boolean(10)
print(features, len(features))

features = features_boolean(12)
print(features, len(features))

[False, False, True, False, False, True, True, False, True, True] 10
[True, True, False, True, False, True, True, False, False, True, False, True] 12

This generates feature lists of length 10 and 12. You may have noticed that the Boolean values differ from one list to the next.

Let’s have a bit of fun and illustrate the previous comments about pseudo-random numbers and generator seeding.

[6]:

random.seed(1)
features = features_boolean(N_FEATURES=10)
print(features, len(features))

random.seed(1)
features = features_boolean(10)
print(features, len(features))

random.seed(1)
features = features_boolean(12)
print(features, len(features))

random.seed(1)
features = features_boolean(5)
print(features, len(features))

[True, True, False, True, False, False, False, False, True, True] 10
[True, True, False, True, False, False, False, False, True, True] 10
[True, True, False, True, False, False, False, False, True, True, False, True] 12
[True, True, False, True, False] 5

The author of these lines is pleased to let you draw your own conclusion.
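One possible reading of the output above is that, after reseeding with the same value, a shorter list is simply the beginning of a longer one. The sketch below checks this under the assumption that nothing else consumes random numbers between the seeding call and the draw. The function body is a condensed copy of features_boolean() from this notebook.

```python
import random


def features_boolean(N_FEATURES=11):
    """Condensed copy of the notebook function, repeated to keep the example self-contained."""
    return [random.choice([True, False]) for _ in range(N_FEATURES)]


random.seed(1)
long_list = features_boolean(12)

random.seed(1)
short_list = features_boolean(5)

# Same seed means the same sequence of draws, so the short list is a prefix of the long one
is_prefix = short_list == long_list[:5]
print(is_prefix)  # True
```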

Now that we have a function to generate dummy Boolean features to describe samples, we need another function to generate the associated sample label.

Generate label

In the context of Supervised Machine Learning (SML), there are three things: sample features, sample label, and the law which makes the link between the two. Among those three, two are known a priori and before the actual training procedure: sample features and label.

What is unknown is the law, or mathematical function, which takes sample features as input and returns the correct sample label.

As already mentioned, in this dummy example we have the a priori knowledge of the law we want to model.

Here, the law which associates sample features and label should be simple, because this acts as a positive control: if the subsequent regression using our Neural Network cannot model this simple law, we have most likely made a serious mistake.

[7]:

def label_features(features):
    """Prepare label associated with features.

    The dummy law is:

    More True = positive.
    More False = negative.

    :param features: random Boolean features of length N_FEATURES.
    :type features: list[bool]

    :return: Single-digit label with respect to features.
    :rtype: int
    """
    # Single-digit positive and negative labels
    p_label = 0
    n_label = 1

    # Test if features contains more True (0)
    if features.count(True) > features.count(False):
        label = p_label

    # Test if features contains more False (1)
    elif features.count(True) < features.count(False):
        label = n_label

    return label

If a features list contains more True than False values, we assign the positive label p_label; if it contains more False values, we assign the negative label n_label. Note that the default value of N_FEATURES in features_boolean() is 11 because it is odd: the two counts can never be equal, so every sample falls under one condition or the other, with no expected imbalance between positive and negative labels.
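With an even number of features, a tie between True and False is possible. In that case neither branch of label_features() assigns label, and the return statement raises an UnboundLocalError. The sketch below is a hypothetical variant, named label_features_checked() for illustration only, which makes the tie explicit instead. It is not the notebook's implementation.

```python
def label_features_checked(features):
    """Hypothetical variant of label_features() that rejects ties explicitly."""
    p_label = 0   # more True, as in the notebook
    n_label = 1   # more False, as in the notebook

    n_true = features.count(True)
    n_false = features.count(False)

    if n_true == n_false:
        # Assumption: a tie is treated as a usage error rather than given a label
        raise ValueError("Tie between True and False: use an odd number of features.")

    return p_label if n_true > n_false else n_label


print(label_features_checked([True, True, False]))   # 0
print(label_features_checked([False, False, True]))  # 1
```

Raising an error is only one option. Keeping N_FEATURES odd, as the notebook does, avoids the situation altogether.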

Let’s check the function we made for a few iterations.

[8]:

for i in range(5):
    features = features_boolean()
    label = label_features(features)

    print(label, (features.count(True) > features.count(False)), features)

1 False [False, False, False, True, True, False, True, False, False, True, False]
0 True [False, True, True, False, True, True, True, True, False, True, False]
0 True [True, True, False, False, True, False, True, True, False, False, True]
1 False [False, True, True, False, True, False, False, True, False, False, False]
1 False [False, True, False, True, False, False, True, False, False, True, False]

This seems to be fully consistent with what we expect. No red flag.

Prepare dataset

Let’s go on and write the function which will iterate to generate a set of sample Boolean features and label.

[9]:

def prepare_dataset(N_SAMPLES=100):
    """Prepare a set of dummy Boolean sample features and label.

    :param N_SAMPLES: Number of samples to generate, defaults to 100.
    :type N_SAMPLES: int

    :return: Set of sample features.
    :rtype: tuple[list[bool]]

    :return: Set of single-digit sample label.
    :rtype: tuple[int]
    """
    # Initialize X and Y datasets
    X_features = []
    Y_label = []

    # Iterate over N_SAMPLES
    for i in range(N_SAMPLES):

        # Compute random Boolean features
        features = features_boolean()

        # Retrieve label associated with features
        label = label_features(features)

        # Append sample features to X_features
        X_features.append(features)

        # Append sample label to Y_label
        Y_label.append(label)

    # Prepare X-Y pairwise dataset
    dataset = list(zip(X_features, Y_label))

    # Shuffle dataset
    random.shuffle(dataset)

    # Separate X-Y pairs
    X_features, Y_label = zip(*dataset)

    return X_features, Y_label

Note that features is assigned to X within X_features while label is assigned to Y within Y_label. The dummy law is the function f such that f(X) = Y.
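The zip, shuffle, unzip sequence used at the end of prepare_dataset() keeps each features list attached to its label while the order of samples changes. A minimal sketch on three hand-written samples, with illustrative variable names, shows the idea:

```python
import random

X = [[True, True, False], [False, False, True], [True, False, True]]
Y = [0, 1, 0]   # 0 = more True, 1 = more False, as in the notebook

pairs = list(zip(X, Y))                 # bind each features list to its label
random.seed(1)
random.shuffle(pairs)                   # reorder the pairs in place
X_shuffled, Y_shuffled = zip(*pairs)    # unzip into two tuples

# Every label must still match the dummy law for its own features
still_aligned = all(
    (0 if x.count(True) > x.count(False) else 1) == y
    for x, y in zip(X_shuffled, Y_shuffled)
)
print(still_aligned)  # True
```

Shuffling X and Y separately would break this pairing, which is why the two lists are zipped first. Also note that zip(*pairs) returns tuples, not lists.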

Let’s check the function.

[10]:

X_features, Y_label = prepare_dataset(N_SAMPLES=10)

for sample in zip(X_features, Y_label):
    features, label = sample
    print(label, features)

1 [False, False, True, True, True, False, True, False, False, False, False]
1 [False, True, True, False, True, False, False, True, False, False, False]
0 [False, True, False, True, True, True, False, True, True, True, False]
0 [False, False, True, False, False, True, True, True, True, True, False]
1 [False, False, True, True, False, False, False, False, True, False, True]
0 [True, True, True, True, True, False, False, False, False, False, True]
0 [True, True, True, True, False, True, False, True, False, True, True]
0 [True, True, False, True, True, False, False, True, False, False, True]
1 [False, True, False, True, True, False, False, False, True, False, True]
1 [True, True, False, False, False, True, False, True, False, False, True]

We are done with the code required to generate a dummy dataset of sample Boolean features and associated label.
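As an optional sanity check, one might count how many samples receive each label. The sketch below uses condensed restatements of features_boolean() and label_features(), which are equivalent for the default 11 features where ties cannot occur. The sample size of 1000 is an arbitrary choice.

```python
import random
from collections import Counter


def features_boolean(N_FEATURES=11):
    """Condensed copy of the notebook function."""
    return [random.choice([True, False]) for _ in range(N_FEATURES)]


def label_features(features):
    """Condensed restatement: 0 = more True, 1 = more False (no ties with 11 features)."""
    return 0 if features.count(True) > features.count(False) else 1


random.seed(1)
N_SAMPLES = 1000   # arbitrary size chosen for this check
labels = [label_features(features_boolean()) for _ in range(N_SAMPLES)]

counts = Counter(labels)
positive_fraction = counts[0] / N_SAMPLES

print(counts, positive_fraction)
```

Since True and False are equally likely and the number of features is odd, the fraction of positive labels is expected to be close to 0.5.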

Live examples

The function prepare_dataset() presented herein is used in the following live examples:

Notebook at EpyNN/epynnlive/dummy_boolean/train.ipynb or following this link.
Regular python code at EpyNN/epynnlive/dummy_boolean/train.py.
