{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dummy dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Find this notebook at `EpyNN/epynnlive/dummy_boolean/prepare_dataset.ipynb`. \n", "* Regular python code at `EpyNN/epynnlive/dummy_boolean/prepare_dataset.py`.\n", "\n", "Run the notebook online with [Google Colab](https://colab.research.google.com/github/Synthaze/EpyNN/blob/main/epynnlive/dummy_boolean/prepare_dataset.ipynb).\n", "\n", "**Level: Beginner**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is part of the series on preparing data for Neural Network regression with EpyNN.\n", "\n", "In addition to the topic-specific content, it explains several basic or general programming concepts that are important in this context." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is a Boolean data-type?\n", "\n", "The Boolean data type is a form of data with only two possible values, namely ``True`` and ``False`` in most programming languages. In Python, these values evaluate to ``1`` and ``0`` behind the scenes. Calculations using Boolean data are very quick, and a further performance gain arises from easier data and output processing compared to other data types.\n", "\n", "Examples of real-world topics well suited to the Boolean data type include molecular interactions, gene regulation, disease prediction and diagnosis, among many others." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why prepare a dummy dataset with Boolean features?\n", "\n", "A dummy dataset is a collection of data with no real-world relevance. However, dummy datasets can be prepared in such a way that the results of Neural Network regression are **predictable**. With *a priori* knowledge of the law we want to model, it is easier to evaluate if the Neural Network is working in an optimal way. 
When dealing with samples described by Boolean features, it is therefore a good and time-saving practice to first test the code and settings on a simple problem built from dummy data, which should work if no mistake was introduced in the procedure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare a set of Boolean sample features and related label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a Python script, the first step is to import libraries. To draw an analogy with the everyday meaning of the word, a library is: *\"a building or room **[a python file]** containing collections of **[objects]** books, periodicals, and sometimes films and recorded music for people **[for developers]** to read, borrow, or refer to\"*.\n", "\n", "We import libraries for the usable content they provide." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# EpyNN/epynnlive/dummy_boolean/prepare_dataset.ipynb\n", "# Standard library imports\n", "import random" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to [PEP 8 -- Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/#imports), imports should be grouped. Here, we imported `random`, which is part of the Python *standard library*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seeding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `random` library provides functions to generate *pseudo-random* numbers. They are called *pseudo-random* because the numbers are generated by an algorithm rather than by physical processes expected to be *truly random*. See [\"True\" vs. 
pseudo-random numbers](https://en.wikipedia.org/wiki/Random_number_generation#%22True%22_vs._pseudo-random_numbers) on Wikipedia for a brief introduction.\n", "\n", "Still, pseudo-random generators are convenient because they make it possible to reproduce sequences of numbers. This is important because, when modifying code in a program which includes such generators, we want to evaluate the impact of the code modification, not the noise from a different input sequence of pseudo-random numbers.\n", "\n", "To make the pseudo-random numbers generated by the `random` library reproducible, we need to **seed** the generator." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n" ] } ], "source": [ "random.seed(1)\n", "\n", "print(random.randint(0, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each time you run this code, the same number will be generated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to define a function which will generate pseudo-random Boolean features." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def features_boolean(N_FEATURES=11):\n", "    \"\"\"Generate dummy Boolean features.\n", "\n", "    :param N_FEATURES: Number of features, defaults to 11.\n", "    :type N_FEATURES: int\n", "\n", "    :return: random Boolean features of length N_FEATURES.\n", "    :rtype: list[bool]\n", "    \"\"\"\n", "    # Random choice True or False for N_FEATURES iterations\n", "    features = [random.choice([True, False]) for j in range(N_FEATURES)]\n", "\n", "    return features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code is commented and quite self-explanatory. Let's proceed with a call."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[True, False, True, False, False, False, False, True, True, False, True] 11\n" ] } ], "source": [ "features = features_boolean()\n", "print(features, len(features))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When no *argument* is provided, the ``features_boolean()`` function, which takes ``N_FEATURES`` as a parameter, uses the *default argument* value.\n", "\n", "To change this behavior, the function ``features_boolean()`` can also be called as follows:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[False, False, True, False, False, True, True, False, True, True] 10\n", "[True, True, False, True, False, True, True, False, False, True, False, True] 12\n" ] } ], "source": [ "features = features_boolean(10)\n", "print(features, len(features))\n", "\n", "features = features_boolean(12)\n", "print(features, len(features))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This generates feature lists of length 10 and 12. You may have observed that **Boolean values are different** from one list to the next.\n", "\n", "Let's have a bit of fun to illustrate the previous comments about pseudo-random numbers and generator seeding."
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[True, True, False, True, False, False, False, False, True, True] 10\n", "[True, True, False, True, False, False, False, False, True, True] 10\n", "[True, True, False, True, False, False, False, False, True, True, False, True] 12\n", "[True, True, False, True, False] 5\n" ] } ], "source": [ "random.seed(1)\n", "features = features_boolean(N_FEATURES=10)\n", "print(features, len(features))\n", "\n", "random.seed(1)\n", "features = features_boolean(10)\n", "print(features, len(features))\n", "\n", "random.seed(1)\n", "features = features_boolean(12)\n", "print(features, len(features))\n", "\n", "random.seed(1)\n", "features = features_boolean(5)\n", "print(features, len(features))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The author of these lines is pleased to let you draw your own conclusion.\n", "\n", "Now that we have a function to generate dummy Boolean features to describe samples, we need another function to generate the associated sample label." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the context of Supervised Machine Learning (SML), there are three things: sample features, sample label, and the law which makes the link between the two. 
Among those three, two are known *a priori* and before the actual training procedure: sample features and label.\n", "\n", "What is unknown is the law, or mathematical function, which takes sample features as input and returns the correct sample label.\n", "\n", "As already mentioned, in this dummy example we have the *a priori* knowledge of the law we want to model.\n", "\n", "Here, the law which associates sample features and label should be simple, because this is a sort of positive control: if the subsequent regression using our Neural Network cannot model this simple law, it likely means we made a big mistake." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def label_features(features):\n", "    \"\"\"Prepare label associated with features.\n", "\n", "    The dummy law is:\n", "\n", "    More True = positive.\n", "    More False = negative.\n", "\n", "    :param features: random Boolean features of length N_FEATURES.\n", "    :type features: list[bool]\n", "\n", "    :return: Single-digit label with respect to features.\n", "    :rtype: int\n", "    \"\"\"\n", "    # Single-digit positive and negative labels\n", "    p_label = 0\n", "    n_label = 1\n", "\n", "    # Test if features contains more True (0)\n", "    if features.count(True) > features.count(False):\n", "        label = p_label\n", "\n", "    # Test if features contains more False (1)\n", "    elif features.count(True) < features.count(False):\n", "        label = n_label\n", "\n", "    return label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If one ``features`` object contains more ``True`` than ``False`` then we associate a positive label ``p_label``, and vice versa. Note that the default argument of ``features_boolean()`` is set to ``11`` because this is an odd number, so the counts of ``True`` and ``False`` can never be equal. Features therefore always fall within one or the other condition, with no predictable imbalance between positive and negative labels. With an even number of features a tie is possible, in which case neither branch runs and ``label`` is never assigned.\n", "\n", "Let's check the function we made for a few iterations."
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 False [False, False, False, True, True, False, True, False, False, True, False]\n", "0 True [False, True, True, False, True, True, True, True, False, True, False]\n", "0 True [True, True, False, False, True, False, True, True, False, False, True]\n", "1 False [False, True, True, False, True, False, False, True, False, False, False]\n", "1 False [False, True, False, True, False, False, True, False, False, True, False]\n" ] } ], "source": [ "for i in range(5):\n", "    features = features_boolean()\n", "    label = label_features(features)\n", "\n", "    print(label, (features.count(True) > features.count(False)), features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This seems to be fully consistent with what we expect. No red flag." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's go on and write the function which will iterate to generate a set of sample Boolean features and label."
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def prepare_dataset(N_SAMPLES=100):\n", "    \"\"\"Prepare a set of dummy Boolean sample features and label.\n", "\n", "    :param N_SAMPLES: Number of samples to generate, defaults to 100.\n", "    :type N_SAMPLES: int\n", "\n", "    :return: Set of sample features.\n", "    :rtype: tuple[list[bool]]\n", "\n", "    :return: Set of single-digit sample label.\n", "    :rtype: tuple[int]\n", "    \"\"\"\n", "    # Initialize X and Y datasets\n", "    X_features = []\n", "    Y_label = []\n", "\n", "    # Iterate over N_SAMPLES\n", "    for i in range(N_SAMPLES):\n", "\n", "        # Compute random Boolean features\n", "        features = features_boolean()\n", "\n", "        # Retrieve label associated with features\n", "        label = label_features(features)\n", "\n", "        # Append sample features to X_features\n", "        X_features.append(features)\n", "\n", "        # Append sample label to Y_label\n", "        Y_label.append(label)\n", "\n", "    # Prepare X-Y pairwise dataset\n", "    dataset = list(zip(X_features, Y_label))\n", "\n", "    # Shuffle dataset\n", "    random.shuffle(dataset)\n", "\n", "    # Separate X-Y pairs\n", "    X_features, Y_label = zip(*dataset)\n", "\n", "    return X_features, Y_label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that *features* is assigned to X within ``X_features`` while *label* is assigned to Y within ``Y_label``. The dummy law is the function such that ``f(X) = Y``.\n", "\n", "Let's check the function."
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 [False, False, True, True, True, False, True, False, False, False, False]\n", "1 [False, True, True, False, True, False, False, True, False, False, False]\n", "0 [False, True, False, True, True, True, False, True, True, True, False]\n", "0 [False, False, True, False, False, True, True, True, True, True, False]\n", "1 [False, False, True, True, False, False, False, False, True, False, True]\n", "0 [True, True, True, True, True, False, False, False, False, False, True]\n", "0 [True, True, True, True, False, True, False, True, False, True, True]\n", "0 [True, True, False, True, True, False, False, True, False, False, True]\n", "1 [False, True, False, True, True, False, False, False, True, False, True]\n", "1 [True, True, False, False, False, True, False, True, False, False, True]\n" ] } ], "source": [ "X_features, Y_label = prepare_dataset(N_SAMPLES=10)\n", "\n", "for sample in zip(X_features, Y_label):\n", "    features, label = sample\n", "    print(label, features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are done with the code required to generate a dummy dataset of sample Boolean features and associated label." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Live examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function ``prepare_dataset()`` presented herein is used in the following live examples:\n", "\n", "* Notebook at `EpyNN/epynnlive/dummy_boolean/train.ipynb` or following [this link](train.ipynb). \n", "* Regular python code at `EpyNN/epynnlive/dummy_boolean/train.py`."
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 4 }