{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Protein Modification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Find this notebook at `EpyNN/epynnlive/ptm_protein/prepare_dataset.ipynb`. \n",
    "* Regular python code at `EpyNN/epynnlive/ptm_protein/prepare_dataset.py`.\n",
    "\n",
    "Run the notebook online with [Google Colab](https://colab.research.google.com/github/Synthaze/EpyNN/blob/main/epynnlive/ptm_protein/prepare_dataset.ipynb).\n",
    "\n",
    "**Level: Intermediate**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is part of the series on preparing data for Neural Network regression with EpyNN. \n",
    "\n",
    "It deals with a real world problem and therefore will focus on the problem itself, rather than basics that were reviewed along with the preparation of the following dummy dataset: \n",
    "\n",
    "* [Boolean dataset](../dummy_boolean/prepare_dataset.ipynb)\n",
    "* [String dataset](../dummy_string/prepare_dataset.ipynb)\n",
    "* [Time-series (numerical)](../dummy_time/prepare_dataset.ipynb)\n",
    "* [Image (numerical)](../dummy_image/prepare_dataset.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Post Translational Modification (PTM) of Proteins"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Post-Translational Modification (PTM) of proteins is an ensemble of mechanisms by which the primary sequence of a protein can be chemically modified after - and in some circumstances during - biosynthesis by the ribosomes.\n",
    "\n",
    "When talking about *one* PTM, it generally refers to a given chemical group that may be covalently linked with given amino acid residues in proteins.\n",
    "\n",
    "For instance, the formation of a phosphoester between a phosphate group and side-chain hydroxyl of serine, threonine and tyrosine is known as phosphorylation. While proteins overall may contain a given number of such residues, phosphorylation may occur particularly on a given subset, generally with respect to specific cellular conditions.\n",
    "\n",
    "From a given number of chemically unmodified proteins (proteome), below is a list of some characteristics with respect to PTM:\n",
    "\n",
    "* PTM increase chemical diversity: for a given *proteome*, there is a corresponding *phosphoproteome* or *oglcnacome* if talking about *O*-GlcNAcylation. Said explicitely, a chemically uniform protein may give rise to an ensemble of chemically distinct proteins upon modification.\n",
    "* PTM may enrich gene's function: as for other mechanisms, the fact that a given gene product - the chemically unmodified protein - may be modified to yield distinct chemical entities is equivalent to multiplying the number of end-products from a single gene. As such, the number of functions for this gene is expected to increase, because distinct functions are achieved by distinct molecules, and this is actually what PTM do: create chemically distinct proteins from the same gene product.\n",
    "* Chemical groups defining one PTM are numerous: among the most studied, one may cite phosphorylation, ubiquitinylation, *O*-GlcNActylation, methylation, succinylation, among dozens of others.\n",
    "\n",
    "PTMs are major regulators of cell signaling and play a role in virtually every biological process.\n",
    "\n",
    "As such, this is a big challenge to predict whether or not one protein may be modified with respect to one PTM.\n",
    "\n",
    "Let's draw something to illustrate one aspect of the deal."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![2hr9](tctp.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a protein called [TCTP](https://www.rcsb.org/structure/2hr9) and above is shown a type of 3D model commonly used to represent proteins. The red sticks represent serine residues along the protein primary sequence. Those with label SER-46 and SER-64 where shown to undergo phosphorylation in cells.\n",
    "\n",
    "But in theory, phosphorylation could occur on all serines within this structure. The reality is that such modifications only occur on *some* serines.\n",
    "\n",
    "This is what we are going to challenge here, with a PTM called *O*-GlcNAcylation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare a set of peptides"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s prepare a set of O-GlcNAcylated and presumably not *O*-GlcNAcylated peptide sequences."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# EpyNN/epynnlive/ptm_protein/prepare_dataset.ipynb\n",
    "# Install dependencies\n",
    "!pip3 install --upgrade-strategy only-if-needed epynn\n",
    "\n",
    "# Standard library imports\n",
    "import tarfile\n",
    "import random\n",
    "import os\n",
    "\n",
    "# Related third party imports\n",
    "import wget\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Local application/library specific imports\n",
    "from epynn.commons.library import read_file\n",
    "from epynn.commons.logs import process_logs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note the `tarfile` which is a Python built-in *standard* library and the first choice to deal with `.tar` archives and related."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Seeding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "random.seed(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reproducibility."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Simple function to download data from the cloud as `.tar` archive. Once uncompressed, it yields a `data/` directory containing `.dat` text files for positive and negative sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def download_sequences():\n",
    "    \"\"\"Download a set of peptide sequences.\n",
    "    \"\"\"\n",
    "    data_path = os.path.join('.', 'data')\n",
    "\n",
    "    if not os.path.exists(data_path):\n",
    "\n",
    "        # Download @url with wget\n",
    "        url = 'https://synthase.s3.us-west-2.amazonaws.com/ptm_prediction_data.tar'\n",
    "        fname = wget.download(url)\n",
    "\n",
    "        # Extract archive\n",
    "        tar = tarfile.open(fname).extractall('.')\n",
    "        process_logs('Make: ' + fname, level=1)\n",
    "\n",
    "        # Clean-up\n",
    "        os.remove(fname)\n",
    "\n",
    "    return None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Retrieve the data as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "download_sequences()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('data', [], ['21_positive.dat', '21_negative.dat'])\n"
     ]
    }
   ],
   "source": [
    "for path in os.walk('data'):\n",
    "    print(path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s have a quick look to what one file's content."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['SQDVSNAFSPSISKAQPGAPP', 'GPRIPDHQRTSVPENHAQSRI', 'QFSCKCLTGFTGQKCETDVNE', 'KLIKRLYVDKSLNLSTEFISS', 'QQKEGEQNQQTQQQQILIQPQ']\n"
     ]
    }
   ],
   "source": [
    "with open(os.path.join('data', '21_positive.dat'), 'r') as f:\n",
    "    print(f.read().splitlines()[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are 21 amino-acids long peptide sequences.\n",
    "\n",
    "Note that positive sequences are *Homo sapiens* *O*-GlcNAcylated peptides sourced from [The *O*-GlcNAc Database](https://www.oglcnac.mcw.edu).\n",
    "\n",
    "Negative sequences are *Homo sapiens* peptide sequence not reported in the above-mentioned source. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a function we use to prepare the labeled dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def prepare_dataset(N_SAMPLES=100):\n",
    "    \"\"\"Prepare a set of labeled peptides.\n",
    "\n",
    "    :param N_SAMPLES: Number of peptide samples to retrieve, defaults to 100.\n",
    "    :type N_SAMPLES: int\n",
    "\n",
    "    :return: Set of peptides.\n",
    "    :rtype: tuple[list[str]]\n",
    "\n",
    "    :return: Set of single-digit peptides label.\n",
    "    :rtype: tuple[int]\n",
    "    \"\"\"\n",
    "    # Single-digit positive and negative labels\n",
    "    p_label = 0\n",
    "    n_label = 1\n",
    "\n",
    "    # Positive data are Homo sapiens O-GlcNAcylated peptide sequences from oglcnac.mcw.edu\n",
    "    path_positive = os.path.join('data', '21_positive.dat')\n",
    "\n",
    "    # Negative data are peptide sequences presumably not O-GlcNAcylated\n",
    "    path_negative = os.path.join('data', '21_negative.dat')\n",
    "\n",
    "    # Read text files, each containing one sequence per line\n",
    "    positive = [[list(x), p_label] for x in read_file(path_positive).splitlines()]\n",
    "    negative = [[list(x), n_label] for x in read_file(path_negative).splitlines()]\n",
    "\n",
    "    # Shuffle data to prevent from any sorting previously applied\n",
    "    random.shuffle(positive)\n",
    "    random.shuffle(negative)\n",
    "\n",
    "    # Truncate to prepare a balanced dataset\n",
    "    negative = negative[:len(positive)]\n",
    "\n",
    "    # Prepare a balanced dataset\n",
    "    dataset = positive + negative\n",
    "\n",
    "    # Shuffle dataset\n",
    "    random.shuffle(dataset)\n",
    "\n",
    "    # Truncate dataset to N_SAMPLES\n",
    "    dataset = dataset[:N_SAMPLES]\n",
    "\n",
    "    # Separate X-Y pairs\n",
    "    X_features, Y_label = zip(*dataset)\n",
    "\n",
    "    return X_features, Y_label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check the function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 ['T', 'A', 'A', 'M', 'R', 'N', 'T', 'K', 'R', 'G', 'S', 'W', 'Y', 'I', 'E', 'A', 'L', 'A', 'Q', 'V', 'F']\n",
      "0 ['N', 'K', 'K', 'L', 'A', 'P', 'S', 'S', 'T', 'P', 'S', 'N', 'I', 'A', 'P', 'S', 'D', 'V', 'V', 'S', 'N']\n",
      "0 ['R', 'G', 'A', 'G', 'S', 'S', 'A', 'F', 'S', 'Q', 'S', 'S', 'G', 'T', 'L', 'A', 'S', 'N', 'P', 'A', 'T']\n",
      "1 ['T', 'D', 'N', 'D', 'W', 'P', 'I', 'Y', 'V', 'E', 'S', 'G', 'E', 'E', 'N', 'D', 'P', 'A', 'G', 'D', 'D']\n",
      "1 ['G', 'Q', 'E', 'R', 'F', 'R', 'S', 'I', 'T', 'Q', 'S', 'Y', 'Y', 'R', 'S', 'A', 'N', 'A', 'L', 'I', 'L']\n",
      "1 ['S', 'I', 'N', 'T', 'G', 'C', 'L', 'N', 'A', 'C', 'T', 'Y', 'C', 'K', 'T', 'K', 'H', 'A', 'R', 'G', 'N']\n",
      "0 ['N', 'K', 'A', 'S', 'L', 'P', 'P', 'K', 'P', 'G', 'T', 'M', 'A', 'A', 'G', 'G', 'G', 'G', 'P', 'A', 'P']\n",
      "0 ['A', 'S', 'V', 'Q', 'D', 'Q', 'T', 'T', 'V', 'R', 'T', 'V', 'A', 'S', 'A', 'T', 'T', 'A', 'I', 'E', 'I']\n",
      "0 ['A', 'S', 'L', 'E', 'G', 'K', 'K', 'I', 'K', 'D', 'S', 'T', 'A', 'A', 'S', 'R', 'A', 'T', 'T', 'L', 'S']\n",
      "0 ['R', 'R', 'Q', 'P', 'V', 'G', 'G', 'L', 'G', 'L', 'S', 'I', 'K', 'G', 'G', 'S', 'E', 'H', 'N', 'V', 'P']\n"
     ]
    }
   ],
   "source": [
    "X_features, Y_label = prepare_dataset(N_SAMPLES=10)\n",
    "\n",
    "for peptide, label in zip(X_features, Y_label):\n",
    "    print(label, peptide)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These sequences are centered with respect to the modified or presumably unmodified residue, which may be a serine or a threonine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 ['S']\n",
      "0 ['S']\n",
      "0 ['S']\n",
      "1 ['S']\n",
      "1 ['S']\n",
      "1 ['T']\n",
      "0 ['T']\n",
      "0 ['T']\n",
      "0 ['S']\n",
      "0 ['S']\n"
     ]
    }
   ],
   "source": [
    "for peptide, label in zip(X_features, Y_label):\n",
    "    print(label, peptide[len(peptide) // 2:len(peptide) // 2 + 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because *O*-GlcNAcylation may impact Serine or Threonine, note that negative sequences with label ``0`` were prepared to also contain such residues at the same position."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have already seen in [String dataset](../dummy_string/prepare_dataset.ipynb) how to perform [*one-hot encoding*](../dummy_string/prepare_dataset.ipynb#One-hot-encoding-of-string-features) of string features.\n",
    "\n",
    "Just for fun, and also because you may like to use such data in convolutional networks, let's convert a peptide sequence into an image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "20\n",
      "{'I': 0, 'S': 1, 'K': 2, 'V': 3, 'W': 4, 'Q': 5, 'N': 6, 'C': 7, 'H': 8, 'G': 9, 'A': 10, 'R': 11, 'D': 12, 'F': 13, 'L': 14, 'Y': 15, 'E': 16, 'T': 17, 'P': 18, 'M': 19}\n",
      "['G', 'R', 'I', 'S', 'A', 'L', 'Q', 'G', 'K', 'L', 'S', 'K', 'L', 'D', 'Y', 'R', 'D', 'I', 'T', 'K', 'Q']\n",
      "[9, 11, 0, 1, 10, 14, 5, 9, 2, 14, 1, 2, 14, 12, 15, 11, 12, 0, 17, 2, 5]\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAD4AAAD4CAYAAAC0cFXtAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAALP0lEQVR4nO2de4wV1R3HP18Xl25Z1kWhosgWLRuSrSICoTZuDT6LxKh9ZkljobXVGkk0adPQmmhj04S2UduGBuuDaBsrND5S0hKV+qjdpFIowQcCBQ2mIC91QXa1kNVf/5izOFzm7p2dubt32HM+yeTOPefcOed753Xm/Ob3OzIzfOSEWjegVgThvhGE+8aIWjcgiVGjRtmYMWMqluvq6qKnp0dZ6iik8DFjxrBw4cKK5ZYsWZK5jlyHuqQ5krZI2iZpUUL+SEkrXP4aSZPy1FdNMguXVAf8FrgCaAPmSWorKXYd0GVmk4G7gZ9nra/a5Nnjs4BtZvaGmR0GlgNXl5S5GnjIrT8KXCIp0zlZbfIInwD8N/Z9h0tLLGNmvcAB4JQcdVaNwtzOJF0vaZ2kdT09PYNeXx7hO4GJse9nuLTEMpJGACcB7yRtzMzuNbOZZjZz1KhROZqVjjzC1wKtks6UVA90ACtLyqwE5rv1rwLPWkEeBzPfx82sV9JC4CmgDlhmZhsl3QGsM7OVwAPAHyRtA94l+nMKQa4OjJmtAlaVpN0WW/8f8LWBbre+vp6WlpZU5bJSmIvbUBOE+0YQ7htBuG8E4b7hrXAV5JnhKCSlbpSZZRrY8HaPB+G+EYT7RhA+UCRNlPScpNckbZR0c0KZ2ZIOSNrgltuStlUTzCzTApwGTHfro4H/AG0lZWYDf8mwbUu7ZG1/5j1uZrvMbL1bPwhs4liDQmGpirXUGQPPA9YkZH9e0kvAW8APzGxjmW1cD1zv1mlqaqpYb3d3d9Ym5++ySmoE/g78zMweL8lrAj4ys25Jc4Ffm1lrpW2OGDHCGhsbK9bd3d1Nb2/v0HdZJZ0IPAY8XCoawMzeM7Nut74KOFHS2Dx1Vos8V3URGQw2mdldZcqM77OOSprl6ks0IQ01ec7xC4BrgVckbXBpPwZaAMzsHiKz0Y2SeoEPgI6imJAK+Vha+HP8eCYI940g3DcK+YLfhAkTWLTomNfmjmHx4sWZ6/B2jwfhvhGE+0YQ7htBuG8UsufW2NhIe3t7qnJZ8XaP5xYuabukV5ylZF1CviT9xvmlvCxpet46q0G1DvWLzOztMnlXAK1u+Ryw1H3WlKE41K8Gfm8RLwLNkk4bgnr7pRrCDXha0r+dNaSUNL4rR7lmdHV1VaFZ/VMN4e1mNp3okL5J0oVZNhJ3zUjjZZiX3MLNbKf73As8QeSWFSeN78qQk9eENErS6L514HLg1ZJiK4Fvuqv7+cABM9uVp95qkPeqfirwhLMSjQD+aGZPSvoeHLGmrALmAtuA94Fv5ayzKhTSktLc3Gxpem6dnZ3s378/WFIGQhDuG0G4bwThvhGE+4a3wo/rwcYNGzZkrsPbPR6E+0YQ7htB+ECRNCXma7JB0nuSbikpU1iflDxxYLYA0+BIpK+dRKOspfzDzK7MWs9gUa1D/RLgdTN7s0rbG3Sq1XPrAB4pkzdgn5S6ujqWLl1asdJ9+/Zlay3V8UmpJxL1WTPbU5KXySdl5MiRNn78+Ip17969m0OHDtVslPUKYH2paBimPikx5lHmMB+uPil9ZqPLgBtiaXErSvBJGQjHyzl+XBKE+4a3wgs55jZ58mSWL19esVxHR/YQkN7u8SDcN4Jw3wjCfSMI9w1vhRfyeTzEiBhEUgmXtEzSXkmvxtJOlrRa0lb3mfiSuaT5rsxWSfOTytSCtHv8QWBOSdoi4Bk3XPyM+34Ukk4GbifyQZkF3F7uDxpqUgk3sxeIpgKIE58D5SHgmoSffhFYbWbvmlkXsJpj/8CakOd5/NTYC/e7id5dLyWVPwocG91rsKnKxc0NGee6PcR9Uk44YfCvuXlq2NPnRuU+9yaUKaQ/CuQTHp8DZT7w54QyTwGXSxrjLmqXu7Sak/Z29gjwT2CKpB2SrgMWA5dJ2gpc6r4jaaak+wHM7F3gp0STyawF7nBpNaeQPbdgSRlEgnDfCMJ9Iwj3jSDcN7wVHuzjvhGE+0YQ7htBeDnKWFF+KWmzi9b1hKTmMr/tN/JXLUmzxx/kWCPAauBsM5tKND/Kj/r5/UVmNs3MZmZr4uBQUXiSFcXMnnYTqQO8SDRsfFxRjZ7bt4EVZfL6In8Z8Dszu7fcRuKWlHHjxvHmm5X9eg4fPjzw1jryvqh/K9ALPFymSLuZ7ZT0KWC1pM3uCDoG96fcC9Da2jroQ795HO4WAFcC3yjndZAi8lfNyCRc0hzgh8BVZvZ+mTJpIn/VjDS3syQryhKiGa9Wu1vVPa7s6ZL6JmI/Feh0Pmf/Av5qZk8OiooMVDzHzWxeQvIDZcq+RRTCDDN7Azg3V+sGkdBz840g3DcKOebW0NDA1KlTU5XLird7PAj3jSDcN4Jw3wjCfcNb4YXssnZ3d9PZ2ZmqXFa83eNZLSk/kbQzFrVrbpnfzpG0xc2RUnmSwiEkqyUF4G5nIZnmovochYv49VuiyEBtwDxJbXkaW00yWVJSMgvYZmZvmNlhYDmRH0shyHOOL3RGw2VlPItS+6PA0fOkHDx4MEez0pFV+FLgM0SB7HYBd+ZtSNwnZfTo0Xk3V5FMws1sj5l9aGYfAfeRbCEprD8KZLekxOcy+hLJFpK1QKukM13Mtw4iP5ZCULED4ywps4GxknYQeQ7OljSNyBq6HRfdS9LpwP1mNtfMeiUtJHK+qQOWlYveVwsGzZLivq8imiBmQNTV1dHU1JSqXFZCz803gnDfCMJ9Iwj3DW+FF9Kb2L0QmAozC97EAyEI940g3DeCcN9IM/S0jOj17L1mdrZLWwFMcUWagf1mNi3ht9uBg8CHQG+h3DPMrN8FuBCYDrxaJv9O4LYyeduBsZXqSPidpV0Guu2+Jc2Y2wuSJiXlufkRvg5cnPWPrxV5z/EvAHvMbGuZ/Eqz0R8hbklpaWlJtddmzJiRueF5hZedMcORejb6uCVl3LhxOZtVmTw+KSOAL1PeA2n4+aQ4LgU2m9mOpMzh6pMCCfMf+eCTgpktSEgLPilFJwj3DW+FF3KwMYQuHESCcN8Iwn0jCPeNINw3vBVeSJ+UhoYGzjnnnIrlDhw4kLmONCMwEyU9J+k1SRsl3ezSh/10Ib3A982sDTifaLS0jeE+XYiZ7TKz9W79ILCJyNPguJ4uZEAXN2dROQ9YQ5WnCxlqUguX1Ag8BtxiZu/F81ysp1wP9nFLSp6oXWlJO3nEiUSiHzazx11yVacLiVtS6uvr07Y/M2mu6iJ6MX+Tmd0Vyxr204VcAFwLXFziWRimC6k2zc3N1t7eXrFcZ2cn+/fvzzTmVkjhkg4CW0qSxwJvl6RNMbNMTmqF7LICW0pfG5G0LiktawXePqQE4QUjKYxp2rRUFPLiNhQUdY8POkF4LSkZ1FjvPo/ElYjFmjBJuyR1SeqRtEbSJEkLJO2L9Sy/U7HSrG8GVnMBfkE0kFEHvAPcA9QDLwFnA68DZwHdRI+2K9zvOojeuloALBlInYXY43w8qDELeJkoNHlfXImbcLEmXNn3+ThmxaPAJVkqLIrwvkGNCUR7t29QYwcwiY8HMz7hvn9F0jUWhTo/ADS6tJclPSop/iicyJB1WSX9DUiy9t+akFbuHvtponjua4BfSXrFpT8F3GdmhyTdQHT09Pt+7ZAJN7NLy+VJ6hvU2EkUdKNvUOMMojegz3Lb2CnpA2Af8DwwAziJ6FTo+7PuJ7pm9EtRDvW+QY21wFTg+VhciaVEsSbOdcNfnwROIxonmAg8y9FH0lVEA6L9Uoiem6RTgD8BLURX7tGAiA7heiKxlwFNQBfQQHQHeIto1Pa7RIJ7iS58N5rZ5n7rLILwWlCUQ33ICcJ9Iwj3jSDcN/4PoD0U6+FqaNYAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "X_features, _ = prepare_dataset(N_SAMPLES=10)\n",
    "\n",
    "# Flatten the list of lists (list of peptides) and make a set()\n",
    "aas = list(set([feature for features in X_features for feature in features]))\n",
    "\n",
    "# set() contains unique elements\n",
    "print(len(aas))  # 20 amino-acids\n",
    "\n",
    "e2i = {k: i for i, k in enumerate(aas)}  # element_to_idx encoder 0-19\n",
    "\n",
    "features = X_features[0]\n",
    "\n",
    "print(e2i)       # Encoder\n",
    "print(features)  # Peptide before encoding\n",
    "print([e2i[feature] for feature in features])  # After encoding\n",
    "\n",
    "# NumPy array to plot as image\n",
    "img_features = np.array([e2i[feature] for feature in features])\n",
    "img_features = np.expand_dims(img_features, axis=1)\n",
    "\n",
    "plt.imshow(img_features, cmap='gray')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, let’s reshape. The number 21 is divisible by 7 and 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAHgAAAD4CAYAAAAn1CIKAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAAIhElEQVR4nO3dX4iU1x3G8e9TzdJ2V81CbQlq/ANBkGprEKFEA01psak0N70wpIGWgFcWA4WSXvbKu9BelIIkaQvaSMgfCCVNGqghCKmNpqZx1RSrLSoRLYnGiFRMf73Y2bBZTfase86+s78+H1gyM47vOcs37zg7M2ePIgLL6zNdT8DacuDkHDg5B07OgZOb2+Kgg4ODMTw83OLQkxoYGOhk3DEnT57sbOyI0MTbmgQeHh5m27ZtLQ49qdtvv72Tccc88MADnY4/kR+ik3Pg5Bw4OQdOzoGTc+DkHDg5B07OgZNz4OQcOLmiwJI2SXpb0nFJj7SelNUzaWBJc4BfAt8GVgH3S1rVemJWR8kZvB44HhEnIuIqsAe4r+20rJaSwIuAU+Oun+7d9jGStko6IOnA5cuXa83Ppqnak6yI2BkR6yJi3eDgYK3D2jSVBD4DLBl3fXHvNpsFSgK/DtwhabmkAWAL8HzbaVktk35kJyKuSdoGvATMAZ6IiJHmM7Mqij6TFREvAC80nos14FeyknPg5Bw4OQdOzoGTc+DkHDg5B07OgZNTi1/CMnfu3BgaGqp+3BI7duzoZNwxGzZs6GTcLVu2MDIyct3yUZ/ByTlwcg6cnAMn58DJOXByDpycAyfnwMk5cHIOnJwDJ1eyuvAJSeckHZ6JCVldJWfwb4BNjedhjUwaOCJeBd6dgblYA9V+26ykrcDW3uVah7VpqhY4InYCO2H0Df9ax7Xp8bPo5Bw4uZIfk54EXgNWSjot6aH207JaStYH3z8TE7E2/BCdnAMn58DJOXByDpycAyfnwMk5cHIOnFyTzSmHhoY6W0Z58eLFTsYds3nz5k7GPXv27A1v9xmcnAMn58DJOXByDpycAyfnwMk5cHIOnJwDJ+fAyTlwciWfi14iaa+kI5JGJG2fiYlZHSXvJl0DfhwRb0iaBxyU9HJEHGk8N6ugZPnoOxHxRu/yJeAoN9ic0vrTlP4NlrQMWAvsv8GffbT76NWrVytNz6arOLCkIeAZ4OGIeH/in4/ffXRgYKDmHG0aSrd4v4XRuLsj4tm2U7KaSp5FC3gcOBoRj7afktVUcgbfBTwI3CPpUO/r3sbzskpKlo/uA/xLN2Ypv5KVnAMn58DJOXByDpycAyfnwMk5cHIOnFyT5aOLFi3qbBfQjRs3djLumAULFnQ6/kQ+g5Nz4OQcODkHTs6Bk3Pg5Bw4OQdOzoGTc+DkHDg5B06u5IPvn5X0F0lv9paP/mwmJmZ1lLyb9B/gnoj4oLeEZZ+kP0TEnxvPzSoo+eB7AB/0rt7S+/LehLNE6eKzOZIOAeeAlyPiU5ePvvfee5WnaTerKHBEfBgRXwUWA+slffkG9/lo+ejw8HDladrNmtKz6Ii4AOzFO4LPGiXPohdKurV3+XPAN4FjjedllZQ8i74N+K2kOYz+D/FURPy+7bSslpJn0X9j9Pdy2CzkV7KSc+DkHDg5B07OgZNz4OQcODkHTs6Bk3Pg5JqsD+7Srl27Oh1/zZo1nYz7Sdva+gxOzoGTc+DkHDg5B07OgZNz4OQcODkHTs6Bk3Pg5Kay89kcSX+V5M9EzyJTOYO3M7oxpc0ipasLFwPfAR5rOx2rrfQM/jnwE+C/n3QHLx/tTyWLzzYD5yLi4Kfdz8tH+1Pp3oXflfRPYA+jexh2+666FSvZAfynEbE4IpYBW4A/RcT3m8/MqvDPwclN6TNZEfEK8EqTmVgTPoOTc+DkHDg5B07OgZNz4OQcODkHTs6Bk3Pg5JosH71y5QpvvfVWi0NPav78+Z2MO2bp0qWdjj+Rz+DkHDg5B07OgZNz4OQcODkHTs6Bk3Pg5Bw4OQdOrui16N6qhkvAh8C1iFjXclJWz1TebPh6RPy72UysCT9EJ1caOIA/SjooaeuN7jB++eilS5fqzdCmpfQhekNEnJH0ReBlScci4tXxd4iIncBOgBUrVnh/4T5Rur3smd5/zwHPAetbTsrqKVkAPihp3thl4FvA4dYTszpKHqK/BDwnaez+v4uIF5vOyqop2X30BPCVGZiLNeAfk5Jz4OQcODkHTs6Bk3Pg5Bw4OQdOzoGTc+DkFFH/nb1169bFgQMHqh+3RNfLN1evXt3JuPv27ePChQuaeLvP4OQcODkHTs6Bk3Pg5Bw4OQdOzoGTc+DkHDg5B06udO/CWyU9LemYpKOSvtZ6YlZH6dqkXwAvRsT3JA0An284J6to0sCSFgB3Az8AiIirwNW207JaSh6ilwPngV/3Noh+rLdG6WPGLx89f/589YnazSkJPBe4E/hVRKwFLgOPTLzT+N1HFy5cWHmadrNKAp8GTkfE/t71pxkNbrNAye6jZ4FTklb2bvoGcKTprKya0mfRPwJ2955BnwB+2G5KVlNR4Ig4BPhXJ81CfiUrOQdOzoGTc+DkHDg5B07OgZNz4OQcODkHTq7J8lFJ54F/3eRf/wLQ1S8en81jL42I696nbRJ4OiQd6GrLgIxj+yE6OQdOrh8D7/TY9fTdv8FWVz+ewVaRAyfXV4ElbZL0tqTjkq77aG7DcZ+QdE7SjO9FIWmJpL2SjkgakbS96gAR0RdfwBzgH8AKYAB4E1g1Q2PfzehHgQ938H3fBtzZuzwP+HvN77ufzuD1wPGIONFbHrMHuG8mBo7RPaDenYmxbjD2OxHxRu/yJeAosKjW8fsp8CLg1Ljrp6n4jc4GkpYBa4H9k9y1WD8F/r8maQh4Bng4It6vddx+CnwGWDLu+uLebelJuoXRuLsj4tmax+6nwK8Dd0ha3ltBsQV4vuM5NafRHcceB45GxKO1j983gSPiGrANeInRJxpPRcTITIwt6UngNWClpNOSHpqJcXvuAh4E7pF0qPd1b62D+6XK5PrmDLY2HDg5B07OgZNz4OQcODkHTu5/F8pHdqMeSG8AAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.imshow(img_features.reshape(7, 3), cmap='gray')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems to be working! We are done."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Live examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function ``prepare_dataset()`` presented herein is used in the following live examples:\n",
    "\n",
    "* Notebook at`EpyNN/epynnlive/dummy_string/train.ipynb` or following [this link](train.ipynb). \n",
    "* Regular python code at `EpyNN/epynnlive/dummy_string/train.py`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}