Dummy dataset

  • Find this notebook at EpyNN/epynnlive/dummy_string/prepare_dataset.ipynb.

  • Regular python code at EpyNN/epynnlive/dummy_string/prepare_dataset.py.

Run the notebook online with Google Colab.

Level: Beginner

This notebook is part of the series on preparing data for Neural Network regression with EpyNN.

In addition to the topic-specific content, it contains several explanations about basics or general concepts in programming that are important in the context.

Note that elements already developed in other notebooks of the series may not be reviewed again herein.

What is a string data-type?

The string data type represents sequences of characters.

Before preparing a dummy dataset of sample string features, let’s explore some properties of Python built-in data-types.

[1]:
my_str = '1'
my_int = int(my_str)
my_float = float(my_str)

print(type(my_str), my_str)
print(type(my_int), my_int)
print(type(my_float), my_float)
<class 'str'> 1
<class 'int'> 1
<class 'float'> 1.0

Note that data objects inherit from a class.

Note that my_str and my_int look the same when printed. Does that mean they really are the same, from a practical perspective?

[2]:
print(my_str + my_str, '- This is called string concatenation')

print(my_int + my_int, '- This is the arithmetical addition')

try:
    print(my_str + my_int)
except TypeError as error:
    print(error, '- Concatenation does not exist for int')

try:
    print(my_int + my_str)
except TypeError as error:
    print(error, '- Addition does not exist for str')
11 - This is called string concatenation
2 - This is the arithmetical addition
can only concatenate str (not "int") to str - Concatenation does not exist for int
unsupported operand type(s) for +: 'int' and 'str' - Addition does not exist for str

The conclusion is: we cannot do maths with the string data type.

Although this problem does not need to be solved at the stage of preparing a dummy dataset with string sample features, we will extend the notebook to show the principle behind the solution.

Why prepare a dummy dataset with string features?

In addition to the general interest of a dummy dataset, explained in dummy dataset with boolean sample features, we will take the opportunity to review the principle that translates string data into something which can be mathematically processed in the context of a Neural Network. This principle is called encoding, and one common method is one-hot encoding.

One-hot encoding of string features

As previously stated in the introductory comments of this notebook, there are things you can do and others you cannot, with respect to the type of variable you are handling.

[18]:
print(1 + 1)        # Arithmetics
print('1' + '1')    # String concatenation
print()
print(str(1) + str(1))       # String concatenation
print(int('1') + int('1'))   # Arithmetics
2
11

11
2

There are things you can do with either integers or strings, and ways you can convert one data type into the other.

But obviously, you cannot convert a non-digit string to the integer data type.

[19]:
print('A')

try:
    print(int('A'))
except ValueError as error:
    print(error)
A
invalid literal for int() with base 10: 'A'

Given our features, which are string features in a list, how can we proceed with arithmetic on them?

In EpyNN this is done automatically at a later stage. See Embedding (Input) for full documentation.
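The features list used below comes from earlier in the notebook. If you are running these cells standalone, an equivalent list could be generated as follows (a hypothetical sketch, not the notebook's original prepare_dataset() logic):

```python
import random

random.seed(1)  # For reproducibility

# N_FEATURES single-character words drawn from the DNA alphabet
N_FEATURES = 12
features = random.choices('ACGT', k=N_FEATURES)

print(features)
```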

For now, we will do it by hand. We have:

[20]:
print(features)
print(list(set(features)))
['C', 'T', 'T', 'C', 'A', 'T', 'C', 'T', 'G', 'A', 'A', 'T']
['T', 'C', 'G', 'A']

A list of string features, composed of four possible characters - or single-character words - for a length of N_FEATURES.

Since our sample of features contains all four characters, it is fine to proceed with encoding.
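If a given sample happened to miss one of the characters, deriving the vocabulary from set(features) would yield an incomplete encoding. A safer alternative, sketched here under the assumption that the DNA alphabet is known in advance, is to fix the vocabulary explicitly:

```python
# Fixed vocabulary, independent of which characters a given sample contains
words = sorted('ATGC')  # ['A', 'C', 'G', 'T']

words_to_idx = {word: idx for idx, word in enumerate(words)}
print(words_to_idx)  # {'A': 0, 'C': 1, 'G': 2, 'T': 3}
```

Sorting also makes the word-to-index mapping deterministic across runs, which set() does not guarantee.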

[21]:
words = list(set(features))

words_to_idx = {word: idx for idx, word in enumerate(words)}
# The line just above is equivalent to:
# words_to_idx = {words[idx]: idx for idx in range(len(words))}

idx_to_words = {idx: word for word, idx in words_to_idx.items()}

Check it.

[22]:
print(words)
print(words_to_idx)
print(idx_to_words)
['T', 'C', 'G', 'A']
{'T': 0, 'C': 1, 'G': 2, 'A': 3}
{0: 'T', 1: 'C', 2: 'G', 3: 'A'}

You may have guessed where we are heading now.

By the way, we instantiated dictionary objects here: they are written with braces {} and store data as {key: value} pairs.
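As a quick illustration of dictionary usage (a minimal example, not part of the dataset code):

```python
d = {'T': 0, 'C': 1}

print(d['T'])           # Access a value by key
print(list(d.items()))  # Pairs as (key, value) tuples
```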

Let’s proceed.

[23]:
for word in words:
    print(word, words_to_idx[word])
T 0
C 1
G 2
A 3

Could this be the conversion table? Not really, because numbers are quantities. Every character should be represented by the same quantity: one.

[24]:
for word in words:
    print(word, words_to_idx[word], np.zeros(len(words)))
T 0 [0. 0. 0. 0.]
C 1 [0. 0. 0. 0.]
G 2 [0. 0. 0. 0.]
A 3 [0. 0. 0. 0.]

Almost there.

[25]:
for word in words:
    encoded_word = np.zeros(len(words))
    encoded_word[words_to_idx[word]] = 1

    print(word, words_to_idx[word], encoded_word)
T 0 [1. 0. 0. 0.]
C 1 [0. 1. 0. 0.]
G 2 [0. 0. 1. 0.]
A 3 [0. 0. 0. 1.]

This is One-hot encoding.

With string features.

[26]:
for feature in features:
    encoded_feature = np.zeros(len(words))
    encoded_feature[words_to_idx[feature]] = 1

    print(feature, words_to_idx[feature], encoded_feature)
C 1 [0. 1. 0. 0.]
T 0 [1. 0. 0. 0.]
T 0 [1. 0. 0. 0.]
C 1 [0. 1. 0. 0.]
A 3 [0. 0. 0. 1.]
T 0 [1. 0. 0. 0.]
C 1 [0. 1. 0. 0.]
T 0 [1. 0. 0. 0.]
G 2 [0. 0. 1. 0.]
A 3 [0. 0. 0. 1.]
A 3 [0. 0. 0. 1.]
T 0 [1. 0. 0. 0.]

Let’s make a proper NumPy array from this.

[27]:
# Just to show the principle of nested list comprehension. Not necessarily the best idea, but still :)
encoded_features = np.array([[1 if idx == words_to_idx[feature] else 0 for idx in range(len(words))] for feature in features])

print(encoded_features)
print(encoded_features.shape)
[[0 1 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [0 1 0 0]
 [0 0 0 1]
 [1 0 0 0]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [0 0 0 1]
 [1 0 0 0]]
(12, 4)
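As a side note, the same array can be built with a common NumPy idiom: indexing an identity matrix with the word indices, since row i of the identity matrix is exactly the one-hot vector for index i. This is just one possible alternative to the comprehension above, shown here self-contained:

```python
import numpy as np

words = ['T', 'C', 'G', 'A']
words_to_idx = {word: idx for idx, word in enumerate(words)}
features = ['C', 'T', 'T', 'C', 'A', 'T', 'C', 'T', 'G', 'A', 'A', 'T']

# Row i of the identity matrix is the one-hot vector for index i
encoded_features = np.eye(len(words), dtype=int)[[words_to_idx[f] for f in features]]

print(encoded_features.shape)  # (12, 4)
```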

Decoding one-hot encoded data.

[28]:
print(features)
print([idx_to_words[np.argmax(encoded_feature)] for encoded_feature in encoded_features])
['C', 'T', 'T', 'C', 'A', 'T', 'C', 'T', 'G', 'A', 'A', 'T']
['C', 'T', 'T', 'C', 'A', 'T', 'C', 'T', 'G', 'A', 'A', 'T']

And we will finish with the purpose of the np.argmax() function.

[29]:
test = np.array([10, 0, 0])
print(np.argmax(test))

test = np.array([0, 0, 10])
print(np.argmax(test), end='\n\n')

test = np.array(
    [
        [0, 0, 10],
        [10, 0, 0]
    ]
)

print(test.shape[0], 'with respect to', test.shape[1])
print(np.argmax(test, axis=0), end='\n\n')

print(test.shape[1], 'with respect to', test.shape[0])
print(np.argmax(test, axis=1))
0
2

2 with respect to 3
[1 0 0]

3 with respect to 2
[2 0]

In words, np.argmax() returns the index at which the maximal value is located: the argument which, when used as an index of the array, returns the maximum.
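One detail worth knowing: when the maximum appears several times, np.argmax() returns the first occurrence.

```python
import numpy as np

test = np.array([5, 5, 1])
print(np.argmax(test))  # 0 - the first of the two maxima
```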

The Python built-in equivalent on a list.

[30]:
lst = [0, 0, 10]

print(lst.index(max(lst)))
2

We are now done with the code required to generate a dummy dataset of sample string features and associated labels.

And we extensively reviewed the practice of one-hot encoding, as well as some insights into Python data types and NumPy array manipulation!
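To recapitulate, the whole encode/decode round trip can be condensed into two small helper functions (a sketch; the function names are ours, not part of EpyNN):

```python
import numpy as np

def one_hot_encode(features, words_to_idx):
    """Return a (N, vocabulary size) one-hot array for a list of words."""
    encoded = np.zeros((len(features), len(words_to_idx)))
    for i, feature in enumerate(features):
        encoded[i, words_to_idx[feature]] = 1
    return encoded

def one_hot_decode(encoded_features, idx_to_words):
    """Recover the original list of words from a one-hot array."""
    return [idx_to_words[np.argmax(row)] for row in encoded_features]

features = ['C', 'T', 'G', 'A']
words = list(set(features))
words_to_idx = {word: idx for idx, word in enumerate(words)}
idx_to_words = {idx: word for word, idx in words_to_idx.items()}

encoded = one_hot_encode(features, words_to_idx)
assert one_hot_decode(encoded, idx_to_words) == features
```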

Live examples

The function prepare_dataset() presented herein is used in the following live examples:

  • Notebook at EpyNN/epynnlive/dummy_string/train.ipynb.

  • Regular python code at EpyNN/epynnlive/dummy_string/train.py.