# Dummy dataset

* Find this notebook at `EpyNN/epynnlive/dummy_string/prepare_dataset.ipynb`. 
* Regular python code at `EpyNN/epynnlive/dummy_string/prepare_dataset.py`.

Run the notebook online with [Google Colab](https://colab.research.google.com/github/Synthaze/EpyNN/blob/main/epynnlive/dummy_string/prepare_dataset.ipynb).

**Level: Beginner**

This notebook is part of the series on preparing data for Neural Network regression with EpyNN.

In addition to the topic-specific content, it explains several basic programming concepts that matter in this context.

Note that elements developed in the following notebooks may not be reviewed herein:

* [Boolean dataset](../dummy_boolean/prepare_dataset.ipynb)

## What is a string data-type?

A string is a data type which represents a sequence of characters.

Before preparing a dummy dataset of sample string features, let's explore some properties of Python built-in data-types.

In [1]:
my_str = '1'
my_int = int(my_str)
my_float = float(my_str)

print(type(my_str), my_str)
print(type(my_int), my_int)
print(type(my_float), my_float)

<class 'str'> 1
<class 'int'> 1
<class 'float'> 1.0


Note that every data object is an instance of a class, which is what ``type()`` reports.

Note that ``my_str`` and ``my_int`` look the same when printed. Does it mean they really are the same, from a practical perspective?

In [2]:
print(my_str + my_str, '- This is called string concatenation')

print(my_int + my_int, '- This is the arithmetical addition')

try:
 print(my_str + my_int)
except TypeError as error:
 print(error, '- Concatenation does not exist for int')
 
try:
 print(my_int + my_str)
except TypeError as error:
 print(error, '- Addition does not exist for str')

11 - This is called string concatenation
2 - This is the arithmetical addition
can only concatenate str (not "int") to str - Concatenation does not exist for int
unsupported operand type(s) for +: 'int' and 'str' - Addition does not exist for str


The conclusion is: **we cannot do maths with the string data type**

Although this problem is not solved at the stage of preparing a dummy dataset with string sample features, we will still extend the notebook to show the principle of the solution.

## Why preparing a dummy dataset with string features?

In addition to the general interest of dummy dataset explained in [dummy dataset with boolean sample features](../dummy_boolean/prepare_dataset.ipynb#Why-preparing-a-dummy-dataset-with-Boolean-features), we will take the opportunity to review the principle which translates string data into something which can be mathematically processed in the context of a Neural Network. The principle is called *encoding*, and one of the methods is called *one-hot encoding*.

## Prepare a set of string sample features and related label

### Imports

In [3]:
# EpyNN/epynnlive/dummy_string/prepare_dataset.ipynb
# Standard library imports
import random

# Related third party imports
import numpy as np

### Seeding

In [4]:
random.seed(1)

Seeding the pseudo-random number generator makes the results reproducible from one run to the next.
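As a minimal sketch of what seeding provides, the example below uses two local generators from the standard `random` module, both seeded with the same value. The names `rng_a`, `rng_b`, `first_draw` and `second_draw` are illustrative and not part of the notebook. Local generators are used so that the global random state, and therefore the outputs shown in this notebook, are left untouched.

```python
import random

WORDS = ['A', 'T', 'G', 'C']

# Two independent generators seeded identically (hypothetical names)
rng_a = random.Random(1)
rng_b = random.Random(1)

# Same seed, same sequence of pseudo-random choices
first_draw = [rng_a.choice(WORDS) for _ in range(12)]
second_draw = [rng_b.choice(WORDS) for _ in range(12)]

print(first_draw == second_draw)  # True
```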

### Generate features

We need to define a function which will generate pseudo-random string features.

In [5]:
def features_string(N_FEATURES=12):
 """Generate dummy string features.

 :param N_FEATURES: Number of features, defaults to 12.
 :type N_FEATURES: int

 :return: random string features of length N_FEATURES.
 :rtype: list[str]
 """
 # List of words
 WORDS = ['A', 'T', 'G', 'C']

 # Random choice of words for N_FEATURES iterations
 features = [random.choice(WORDS) for j in range(N_FEATURES)]

 return features

The code is commented and fairly self-explanatory. Let's proceed with a call.

In [6]:
features = features_string()
print(features, len(features))

['T', 'A', 'G', 'A', 'C', 'C', 'C', 'C', 'T', 'A', 'C', 'A'] 12


Did we get a string, strictly speaking?

In [7]:
print(type(features))

<class 'list'>


The answer is no: we got a list. But did we expect to get a string, after all?

Each sample feature is a string, but one sample holds a **list of features**, so what we got is a list.

Does this list really contain strings? This matters: if you lose track of your data types, you will mess up your day of work.

In [8]:
for feature in features:
 print(feature, len(feature), type(feature))

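T 1 <class 'str'>
A 1 <class 'str'>
G 1 <class 'str'>
A 1 <class 'str'>
C 1 <class 'str'>
C 1 <class 'str'>
C 1 <class 'str'>
C 1 <class 'str'>
T 1 <class 'str'>
A 1 <class 'str'>
C 1 <class 'str'>
A 1 <class 'str'>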


Looks good! The list contains elements of type ``str``, or more rigorously of ``class 'str'``.
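The same check can be done programmatically instead of by reading the printed output. This is a small sketch using the built-in `isinstance()` and `all()`. The list is copied from the output above, and the names `all_strings` and `mixed` are illustrative.

```python
features = ['T', 'A', 'G', 'A', 'C', 'C', 'C', 'C', 'T', 'A', 'C', 'A']

# True only if every element is an instance of str
all_strings = all(isinstance(feature, str) for feature in features)
print(all_strings)  # True

# A single non-string element is enough to make the check fail
mixed = features + [1]
print(all(isinstance(feature, str) for feature in mixed))  # False
```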

Let's play around to see how the sequential data types ``str`` and ``list`` relate to each other.

In [9]:
print('Original list of string features of length N_FEATURES')
print(type(features), features, end='\n\n')

print('Join the original list and get a single string feature of length N_FEATURES')
print(type(''.join(features)), ''.join(features), end='\n\n')

Original list of string features of length N_FEATURES
<class 'list'> ['T', 'A', 'G', 'A', 'C', 'C', 'C', 'C', 'T', 'A', 'C', 'A']

Join the original list and get a single string feature of length N_FEATURES
<class 'str'> TAGACCCCTACA



Now let's see the classical mistake:

In [10]:
# Because each string feature has length 1, the mistake is not seen
str_feature = ''.join(features)

print(list(str_feature))
print((list(str_feature) == features))

['T', 'A', 'G', 'A', 'C', 'C', 'C', 'C', 'T', 'A', 'C', 'A']
True


That's cool, it seems we can reverse ``join()`` with ``list()``.

Really?

In [11]:
# At least one feature has length greater than one
test_features = ['My', ' mistake']    # List of two string features

str_test_features = ''.join(test_features)

print(list(str_test_features))
print((list(str_test_features) == test_features))

['M', 'y', ' ', 'm', 'i', 's', 't', 'a', 'k', 'e']
False


**When ``list()`` is applied to a string, it splits the string into its individual characters.**
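If the join really needs to be reversible, one possible approach is to join with a separator and split on it afterwards. This is a sketch rather than something the notebook relies on: the separator `SEP` and the variable names are hypothetical, and the approach assumes the separator never occurs inside a feature.

```python
demo_features = ['My', ' mistake']

# Hypothetical separator: must not appear inside any feature
SEP = '|'

joined = SEP.join(demo_features)
restored = joined.split(SEP)

print(joined)                     # My| mistake
print(restored == demo_features)  # True
```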

Now that we have seen some traps about the string data type and we have built our function to generate dummy sample features of string data type, we will go ahead with label assignment.

But best practices first: let's clear the test content from the namespace.

In [12]:
del test_features
del str_test_features

If you want to increase the level of complexity in your life, don't follow best practices. Otherwise, do follow them.

### Generate label

As previously mentioned, in this dummy example we have the *a priori* knowledge of the law we want to model.

Herein, the law which associates sample features and label should be simple, because this is a sort of positive control: if the further regression using our Neural Network cannot model this simple law, it possibly means we made a big mistake.

In [13]:
def label_features(features):
    """Prepare label associated with features.

    The dummy law is:

    First and last elements are equal = positive.
    First and last elements are NOT equal = negative.

    :param features: random string features of length N_FEATURES.
    :type features: list[str]

    :return: Single-digit label with respect to features.
    :rtype: int
    """
    # Single-digit positive and negative labels
    p_label = 0
    n_label = 1

    # Pattern associated with positive label (0)
    if features[0] == features[-1]:
        label = p_label

    # Other pattern associated with negative label (1)
    elif features[0] != features[-1]:
        label = n_label

    return label

The code above is commented and self-explanatory.

Let's check the function we made for a few iterations.

In [14]:
for i in range(5):
 features = features_string()
 label = label_features(features)

 print(label, features)

1 ['C', 'C', 'A', 'C', 'G', 'T', 'A', 'G', 'A', 'A', 'A', 'A']
0 ['C', 'T', 'C', 'A', 'T', 'C', 'C', 'T', 'G', 'T', 'T', 'C']
0 ['G', 'A', 'C', 'A', 'T', 'G', 'A', 'G', 'C', 'T', 'G', 'G']
0 ['C', 'C', 'A', 'C', 'T', 'C', 'C', 'T', 'G', 'G', 'A', 'C']
1 ['A', 'T', 'C', 'G', 'C', 'A', 'C', 'A', 'G', 'C', 'T', 'T']


This seems to be fully consistent with what we expect. No red flag.

Note that, to keep things simple, we accept that the dummy dataset is imbalanced in terms of label representation. This question is important, and we will have the opportunity to discuss it when feeding the Network with the data.

Here is what the imbalance looks like in practice.

In [15]:
labels = [label_features(features_string()) for i in range(100)]

print(0, labels.count(0))
print(1, labels.count(1))

0 22
1 78
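The imbalance follows from the dummy law itself. Assuming the four words are drawn independently and with equal probability, the first and last elements match with probability 4 × (1/4)² = 1/4, so roughly a quarter of the samples should receive label 0. The sketch below estimates that fraction empirically. It uses a local generator with an arbitrary seed so that the global random state is not disturbed, and all names in it are illustrative.

```python
import random

WORDS = ['A', 'T', 'G', 'C']
N_FEATURES = 12
N_DRAWS = 20_000

# Local generator (arbitrary seed), independent of the notebook's global state
rng = random.Random(0)

matches = 0
for _ in range(N_DRAWS):
    sample = [rng.choice(WORDS) for _ in range(N_FEATURES)]
    # Same rule as the dummy law: first and last elements are equal
    if sample[0] == sample[-1]:
        matches += 1

fraction = matches / N_DRAWS
print(fraction)  # Expected to be close to 0.25
```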


By the way, we have just used a *list comprehension*, which calls ``label_features()`` on the features returned by ``features_string()`` for each of 100 iterations.

This is a *Pythonic* way to do things.
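For readers new to the syntax, here is a generic sketch showing that a list comprehension is equivalent to an explicit loop that appends to a list. The variable names are illustrative.

```python
# Explicit loop: create an empty list, then append at each iteration
squares_loop = []
for i in range(5):
    squares_loop.append(i * i)

# List comprehension: same result in a single expression
squares_comprehension = [i * i for i in range(5)]

print(squares_loop)                           # [0, 1, 4, 9, 16]
print(squares_comprehension == squares_loop)  # True
```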

### Prepare dataset

Let's go on and write the function which will iterate to generate a set of sample string features and label.

In [16]:
def prepare_dataset(N_SAMPLES=100):
    """Prepare a set of dummy string sample features and label.

    :param N_SAMPLES: Number of samples to generate, defaults to 100.
    :type N_SAMPLES: int

    :return: Set of sample features.
    :rtype: tuple[list[str]]

    :return: Set of single-digit sample label.
    :rtype: tuple[int]
    """
    # Initialize X and Y datasets
    X_features = []
    Y_label = []

    # Iterate over N_SAMPLES
    for i in range(N_SAMPLES):

        # Compute random string features
        features = features_string()

        # Retrieve label associated with features
        label = label_features(features)

        # Append sample features to X_features
        X_features.append(features)

        # Append sample label to Y_label
        Y_label.append(label)

    # Prepare X-Y pairwise dataset
    dataset = list(zip(X_features, Y_label))

    # Shuffle dataset
    random.shuffle(dataset)

    # Separate X-Y pairs
    X_features, Y_label = zip(*dataset)

    return X_features, Y_label

Note that this function is identical to the one discussed in [dummy dataset with boolean sample features](../dummy_boolean/prepare_dataset.ipynb#Why-preparing-a-dummy-dataset-with-Boolean-features) because the variable content with respect to data type is generated by the functions ``features_string()`` and ``label_features()``.
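The zip, shuffle and unzip steps at the end of the function can be sketched on a tiny hand-made dataset. The point is that features and labels are shuffled as pairs, so each label stays attached to its own features, and that `zip(*pairs)` returns tuples. The data, the local generator and the variable names below are illustrative.

```python
import random

X = [['A', 'T'], ['G', 'C'], ['T', 'T']]
Y = [1, 1, 0]  # 0 when first and last elements are equal, 1 otherwise

# Pair each sample with its label, then shuffle the pairs together
pairs = list(zip(X, Y))
rng = random.Random(0)  # Local generator with an arbitrary seed
rng.shuffle(pairs)

# Unzip the shuffled pairs back into two tuples
X_shuffled, Y_shuffled = zip(*pairs)

# Each label still follows the dummy law for its own features
aligned = all(
    (0 if x[0] == x[-1] else 1) == y
    for x, y in zip(X_shuffled, Y_shuffled)
)
print(aligned)                                       # True
print(type(X_shuffled).__name__, type(Y_shuffled).__name__)  # tuple tuple
```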

Let's check the function.

In [17]:
X_features, Y_label = prepare_dataset(N_SAMPLES=10)

for sample in zip(X_features, Y_label):
 features, label = sample
 print(label, features)

1 ['G', 'A', 'G', 'C', 'G', 'A', 'A', 'G', 'T', 'T', 'A', 'A']
1 ['C', 'C', 'T', 'T', 'A', 'T', 'C', 'T', 'G', 'T', 'C', 'G']
0 ['C', 'T', 'T', 'C', 'C', 'T', 'A', 'A', 'T', 'A', 'A', 'C']
1 ['A', 'A', 'A', 'C', 'C', 'T', 'T', 'T', 'C', 'G', 'T', 'T']
1 ['G', 'A', 'G', 'A', 'C', 'A', 'T', 'T', 'G', 'C', 'A', 'A']
1 ['C', 'A', 'T', 'T', 'A', 'C', 'A', 'C', 'T', 'C', 'G', 'T']
0 ['C', 'A', 'C', 'G', 'C', 'G', 'G', 'C', 'A', 'C', 'A', 'C']
1 ['T', 'T', 'G', 'T', 'G', 'C', 'G', 'A', 'G', 'T', 'C', 'A']
1 ['C', 'C', 'T', 'G', 'A', 'T', 'T', 'C', 'G', 'C', 'C', 'G']
1 ['C', 'T', 'T', 'C', 'A', 'T', 'C', 'T', 'G', 'A', 'A', 'T']


We are now done with the code required to generate a dummy dataset of sample string features and associated labels.

## One-hot encoding of string features

As previously stated in the introductory comments of this notebook, there are things you can do and others you cannot, with respect to the type of variable you are handling.

In [18]:
print(1 + 1) # Arithmetics
print('1' + '1') # String concatenation
print()
print(str(1) + str(1)) # String concatenation
print(int('1') + int('1')) # Arithmetics

2
11

11
2


There are things you can do with either integers or strings, and ways you can convert one data type into the other.

But obviously, you cannot convert a non-digit string to integer data type.

In [19]:
print('A')

try:
 print(int('A'))
except ValueError as error:
 print(error)

A
invalid literal for int() with base 10: 'A'


Given our features, which are string features in a list, how can we proceed with arithmetic on them?

In EpyNN this is done automatically at a later stage. See [Embedding (Input)](https://epynn.net/Embedding.html) for full documentation.

For now, we will do it by hand. We have:

In [20]:
print(features)
print(list(set(features)))

['C', 'T', 'T', 'C', 'A', 'T', 'C', 'T', 'G', 'A', 'A', 'T']
['T', 'C', 'G', 'A']


This is a list of string features of length N_FEATURES, built from four possible characters (or single-character words).

Since our sample of features contains all four characters, it is fine to proceed with encoding.

In [21]:
words = list(set(features))

words_to_idx = {word: idx for idx, word in enumerate(words)}
# The line just above is equivalent to
# words_to_idx = {words[idx]: idx for idx in range(len(words))}

idx_to_words = {idx: word for word, idx in words_to_idx.items()}

Check it.

In [22]:
print(words)
print(words_to_idx)
print(idx_to_words)

['T', 'C', 'G', 'A']
{'T': 0, 'C': 1, 'G': 2, 'A': 3}
{0: 'T', 1: 'C', 2: 'G', 3: 'A'}


You may have guessed where we are heading now.

By the way, we instantiated dictionary objects here, written with ``{}``. This data type maps keys to values, as in ``{key: value}``.
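One caveat about building the vocabulary from a ``set``: the iteration order of a set of strings is arbitrary and may differ between interpreter sessions, so the index assigned to each word may differ too. If a stable mapping is wanted, one option is to sort the words first. This is a sketch with hypothetical names (`sorted_words`, `sorted_words_to_idx`), and the rest of this notebook keeps using the unsorted `words` list.

```python
features = ['C', 'T', 'T', 'C', 'A', 'T', 'C', 'T', 'G', 'A', 'A', 'T']

# Sorting fixes the order regardless of how the set happens to iterate
sorted_words = sorted(set(features))
sorted_words_to_idx = {word: idx for idx, word in enumerate(sorted_words)}

print(sorted_words)         # ['A', 'C', 'G', 'T']
print(sorted_words_to_idx)  # {'A': 0, 'C': 1, 'G': 2, 'T': 3}
```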

Let's proceed.

In [23]:
for word in words:
 print(word, words_to_idx[word])

T 0
C 1
G 2
A 3


Could this be the conversion table? Not really, because numbers are quantities: encoding 'A' as 3 and 'T' as 0 would suggest that 'A' is somehow greater than 'T'. Every character should be represented by the same quantity: one.

In [24]:
for word in words:
 print(word, words_to_idx[word], np.zeros(len(words)))

T 0 [0. 0. 0. 0.]
C 1 [0. 0. 0. 0.]
G 2 [0. 0. 0. 0.]
A 3 [0. 0. 0. 0.]


Almost there.

In [25]:
for word in words:
 encoded_word = np.zeros(len(words))
 encoded_word[words_to_idx[word]] = 1
 
 print(word, words_to_idx[word], encoded_word)

T 0 [1. 0. 0. 0.]
C 1 [0. 1. 0. 0.]
G 2 [0. 0. 1. 0.]
A 3 [0. 0. 0. 1.]


This is **One-hot encoding**.
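To make the point about quantities concrete, the sketch below compares gaps between integer indices with Euclidean distances between one-hot vectors. The mapping is copied from the output above, while the helper `one_hot()` and the variable names are hypothetical. With indices, the gap between two words depends on the arbitrary order of the table. With one-hot vectors, every pair of distinct words is equally far apart.

```python
import numpy as np

words_to_idx = {'T': 0, 'C': 1, 'G': 2, 'A': 3}


def one_hot(word):
    """Return the one-hot vector of a word (hypothetical helper)."""
    vector = np.zeros(len(words_to_idx))
    vector[words_to_idx[word]] = 1
    return vector


# Gaps between integer indices depend on the arbitrary order of the table
index_gap_TC = abs(words_to_idx['T'] - words_to_idx['C'])
index_gap_TA = abs(words_to_idx['T'] - words_to_idx['A'])
print(index_gap_TC, index_gap_TA)  # 1 3

# Euclidean distances between one-hot vectors are the same for any pair
onehot_gap_TC = np.linalg.norm(one_hot('T') - one_hot('C'))
onehot_gap_TA = np.linalg.norm(one_hot('T') - one_hot('A'))
print(np.isclose(onehot_gap_TC, onehot_gap_TA))  # True
```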

Now the same procedure, applied to our string features.

In [26]:
for feature in features:
 encoded_feature = np.zeros(len(words))
 encoded_feature[words_to_idx[feature]] = 1
 
 print(feature, words_to_idx[feature], encoded_feature)

C 1 [0. 1. 0. 0.]
T 0 [1. 0. 0. 0.]
T 0 [1. 0. 0. 0.]
C 1 [0. 1. 0. 0.]
A 3 [0. 0. 0. 1.]
T 0 [1. 0. 0. 0.]
C 1 [0. 1. 0. 0.]
T 0 [1. 0. 0. 0.]
G 2 [0. 0. 1. 0.]
A 3 [0. 0. 0. 1.]
A 3 [0. 0. 0. 1.]
T 0 [1. 0. 0. 0.]


Let's make a proper NumPy array from this.

In [27]:
# Just to show the principle of nested list comprehension. Not necessarily the best idea, but still :)
encoded_features = np.array([[1 if idx == words_to_idx[feature] else 0 for idx in range(len(words))] for feature in features])

print(encoded_features)
print(encoded_features.shape)

[[0 1 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [0 1 0 0]
 [0 0 0 1]
 [1 0 0 0]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]
 [0 0 0 1]
 [0 0 0 1]
 [1 0 0 0]]
(12, 4)
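As an alternative to the nested list comprehension, the same array can be obtained by indexing the rows of an identity matrix. This is a sketch, not the method used by EpyNN: the mapping and features are copied from the outputs above, and the names `indices` and `encoded` are illustrative.

```python
import numpy as np

features = ['C', 'T', 'T', 'C', 'A', 'T', 'C', 'T', 'G', 'A', 'A', 'T']
words_to_idx = {'T': 0, 'C': 1, 'G': 2, 'A': 3}

# Convert each feature to its integer index
indices = [words_to_idx[feature] for feature in features]

# Row i of the identity matrix is the one-hot vector for index i,
# so indexing with a list of integers picks one row per feature
encoded = np.eye(len(words_to_idx), dtype=int)[indices]

print(encoded.shape)  # (12, 4)
print(encoded[0])     # [0 1 0 0]
```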


One-hot encoded data can be decoded back to the original words.

In [28]:
print(features)
print([idx_to_words[np.argmax(encoded_feature)] for encoded_feature in encoded_features])

['C', 'T', 'T', 'C', 'A', 'T', 'C', 'T', 'G', 'A', 'A', 'T']
['C', 'T', 'T', 'C', 'A', 'T', 'C', 'T', 'G', 'A', 'A', 'T']


We will finish with the purpose of the ``np.argmax()`` function.

In [29]:
test = np.array([10, 0, 0])
print(np.argmax(test))

test = np.array([0, 0, 10])
print(np.argmax(test), end='\n\n')

test = np.array(
    [
        [0, 0, 10],
        [10, 0, 0]
    ]
)

print(test.shape[0], 'with respect to', test.shape[1])
print(np.argmax(test, axis=0), end='\n\n')

print(test.shape[1], 'with respect to', test.shape[0])
print(np.argmax(test, axis=1))

0
2

2 with respect to 3
[1 0 0]

3 with respect to 2
[2 0]


In words, ``np.argmax()`` returns the index at which the maximal value is located. Put differently, it returns the argument which yields the maximum when used as an index into the array.
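Applied to a one-hot encoded array with `axis=1`, `np.argmax()` recovers one index per row in a single call, which can then be mapped back to words. The sketch below uses a small hand-written array and the `idx_to_words` mapping from the outputs above. It also shows that when the maximum occurs more than once, the index of the first occurrence is returned.

```python
import numpy as np

idx_to_words = {0: 'T', 1: 'C', 2: 'G', 3: 'A'}
encoded = np.array([[0, 1, 0, 0],
                    [1, 0, 0, 0],
                    [0, 0, 0, 1]])

# One index per row: the position of the 1 in each one-hot vector
indices = np.argmax(encoded, axis=1)
decoded = [idx_to_words[int(idx)] for idx in indices]
print(indices, decoded)  # [1 0 3] ['C', 'T', 'A']

# When the maximum occurs several times, the first position is returned
first_max = np.argmax(np.array([5, 5, 1]))
print(first_max)  # 0
```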

The equivalent with built-in Python on a **list**:

In [30]:
lst = [0, 0, 10]

print(lst.index(max(lst)))

2


We are now done with the code required to generate a dummy dataset of sample string features and associated labels.

And we extensively reviewed the practice of one-hot encoding, as well as some insights into Python data types and NumPy array manipulation!

## Live examples

The function ``prepare_dataset()`` presented herein is used in the following live examples:

* Notebook at `EpyNN/epynnlive/dummy_string/train.ipynb` or following [this link](train.ipynb).
* Regular python code at `EpyNN/epynnlive/dummy_string/train.py`.