{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Protein Modification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Find this notebook at `EpyNN/epynnlive/ptm_protein/prepare_dataset.ipynb`. \n", "* Regular Python code at `EpyNN/epynnlive/ptm_protein/prepare_dataset.py`.\n", "\n", "Run the notebook online with [Google Colab](https://colab.research.google.com/github/Synthaze/EpyNN/blob/main/epynnlive/ptm_protein/prepare_dataset.ipynb).\n", "\n", "**Level: Intermediate**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is part of the series on preparing data for Neural Network regression with EpyNN. \n", "\n", "It deals with a real-world problem and therefore focuses on the problem itself, rather than on the basics that were reviewed along with the preparation of the following dummy datasets: \n", "\n", "* [Boolean dataset](../dummy_boolean/prepare_dataset.ipynb)\n", "* [String dataset](../dummy_string/prepare_dataset.ipynb)\n", "* [Time-series (numerical)](../dummy_time/prepare_dataset.ipynb)\n", "* [Image (numerical)](../dummy_image/prepare_dataset.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Post Translational Modification (PTM) of Proteins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Post-Translational Modification (PTM) of proteins is an ensemble of mechanisms by which the primary sequence of a protein can be chemically modified after - and in some circumstances during - biosynthesis by the ribosomes.\n", "\n", "The term *one* PTM generally refers to a given chemical group that may be covalently linked to given amino acid residues in proteins.\n", "\n", "For instance, the formation of a phosphoester between a phosphate group and the side-chain hydroxyl of serine, threonine and tyrosine is known as phosphorylation. 
While proteins overall may contain a given number of such residues, phosphorylation may occur only on a given subset, generally depending on specific cellular conditions.\n", "\n", "Starting from a given set of chemically unmodified proteins (the proteome), below are some characteristics of PTMs:\n", "\n", "* PTMs increase chemical diversity: for a given *proteome*, there is a corresponding *phosphoproteome* or *oglcnacome* if talking about *O*-GlcNAcylation. Said explicitly, a chemically uniform protein may give rise to an ensemble of chemically distinct proteins upon modification.\n", "* PTMs may enrich a gene's function: as for other mechanisms, the fact that a given gene product - the chemically unmodified protein - may be modified to yield distinct chemical entities is equivalent to multiplying the number of end-products from a single gene. As such, the number of functions for this gene is expected to increase, because distinct functions are achieved by distinct molecules, and this is actually what PTMs do: create chemically distinct proteins from the same gene product.\n", "* The chemical groups that define PTMs are numerous: the most studied include phosphorylation, ubiquitinylation, *O*-GlcNAcylation, methylation and succinylation, among dozens of others.\n", "\n", "PTMs are major regulators of cell signaling and play a role in virtually every biological process.\n", "\n", "As such, it is a major challenge to predict whether or not a protein may be modified with respect to a given PTM.\n", "\n", "Let's look at a picture that illustrates one aspect of the problem." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![2hr9](tctp.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a protein called [TCTP](https://www.rcsb.org/structure/2hr9) and above is shown a type of 3D model commonly used to represent proteins. The red sticks represent serine residues along the protein primary sequence. 
Those labeled SER-46 and SER-64 were shown to undergo phosphorylation in cells.\n", "\n", "But in theory, phosphorylation could occur on all serines within this structure. The reality is that such modifications only occur on *some* serines.\n", "\n", "This is the problem we are going to tackle here, with a PTM called *O*-GlcNAcylation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare a set of peptides" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s prepare a set of *O*-GlcNAcylated and presumably not *O*-GlcNAcylated peptide sequences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# EpyNN/epynnlive/ptm_protein/prepare_dataset.ipynb\n", "# Install dependencies\n", "!pip3 install --upgrade-strategy only-if-needed epynn\n", "\n", "# Standard library imports\n", "import tarfile\n", "import random\n", "import os\n", "\n", "# Related third party imports\n", "import wget\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# Local application/library specific imports\n", "from epynn.commons.library import read_file\n", "from epynn.commons.logs import process_logs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the `tarfile` module, which is part of the Python *standard* library and the first choice to deal with `.tar` archives and related formats." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seeding" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "random.seed(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reproducibility." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download sequences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A simple function to download the data from the cloud as a `.tar` archive. 
Once extracted, it yields a `data/` directory containing `.dat` text files for positive and negative sequences." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def download_sequences():\n", "    \"\"\"Download a set of peptide sequences.\n", "    \"\"\"\n", "    data_path = os.path.join('.', 'data')\n", "\n", "    if not os.path.exists(data_path):\n", "\n", "        # Download @url with wget\n", "        url = 'https://synthase.s3.us-west-2.amazonaws.com/ptm_prediction_data.tar'\n", "        fname = wget.download(url)\n", "\n", "        # Extract archive; the context manager closes the file handle\n", "        with tarfile.open(fname) as tar:\n", "            tar.extractall('.')\n", "        process_logs('Make: ' + fname, level=1)\n", "\n", "        # Clean-up\n", "        os.remove(fname)\n", "\n", "    return None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieve the data as follows." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "download_sequences()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the directory." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('data', [], ['21_positive.dat', '21_negative.dat'])\n" ] } ], "source": [ "for path in os.walk('data'):\n", "    print(path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let’s have a quick look at the content of one file." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['SQDVSNAFSPSISKAQPGAPP', 'GPRIPDHQRTSVPENHAQSRI', 'QFSCKCLTGFTGQKCETDVNE', 'KLIKRLYVDKSLNLSTEFISS', 'QQKEGEQNQQTQQQQILIQPQ']\n" ] } ], "source": [ "with open(os.path.join('data', '21_positive.dat'), 'r') as f:\n", "    print(f.read().splitlines()[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are peptide sequences, each 21 amino acids long.\n", "\n", "Note that positive sequences are *Homo sapiens* *O*-GlcNAcylated peptides sourced from [The *O*-GlcNAc Database](https://www.oglcnac.mcw.edu).\n", "\n", "Negative sequences are *Homo sapiens* peptide sequences not reported in the above-mentioned source. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a function we use to prepare the labeled dataset." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def prepare_dataset(N_SAMPLES=100):\n", "    \"\"\"Prepare a set of labeled peptides.\n", "\n", "    :param N_SAMPLES: Number of peptide samples to retrieve, defaults to 100.\n", "    :type N_SAMPLES: int\n", "\n", "    :return: Set of peptides.\n", "    :rtype: tuple[list[str]]\n", "\n", "    :return: Set of single-digit peptide labels.\n", "    :rtype: tuple[int]\n", "    \"\"\"\n", "    # Single-digit positive and negative labels\n", "    p_label = 0\n", "    n_label = 1\n", "\n", "    # Positive data are Homo sapiens O-GlcNAcylated peptide sequences from oglcnac.mcw.edu\n", "    path_positive = os.path.join('data', '21_positive.dat')\n", "\n", "    # Negative data are peptide sequences presumably not O-GlcNAcylated\n", "    path_negative = os.path.join('data', '21_negative.dat')\n", "\n", "    # Read text files, each containing one sequence per line\n", "    positive = [[list(x), p_label] for x in read_file(path_positive).splitlines()]\n", "    negative = [[list(x), n_label] for x in read_file(path_negative).splitlines()]\n", "\n", "    # Shuffle data to remove any ordering previously applied\n", "    random.shuffle(positive)\n", "    random.shuffle(negative)\n", "\n", "    # Truncate negative set to the size of the positive set\n", "    negative = negative[:len(positive)]\n", "\n", "    # Prepare a balanced dataset\n", "    dataset = positive + negative\n", "\n", "    # Shuffle dataset\n", "    random.shuffle(dataset)\n", "\n", "    # Truncate dataset to N_SAMPLES\n", "    dataset = dataset[:N_SAMPLES]\n", "\n", "    # Separate X-Y pairs\n", "    X_features, Y_label = zip(*dataset)\n", "\n", "    return X_features, Y_label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the function." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 ['T', 'A', 'A', 'M', 'R', 'N', 'T', 'K', 'R', 'G', 'S', 'W', 'Y', 'I', 'E', 'A', 'L', 'A', 'Q', 'V', 'F']\n", "0 ['N', 'K', 'K', 'L', 'A', 'P', 'S', 'S', 'T', 'P', 'S', 'N', 'I', 'A', 'P', 'S', 'D', 'V', 'V', 'S', 'N']\n", "0 ['R', 'G', 'A', 'G', 'S', 'S', 'A', 'F', 'S', 'Q', 'S', 'S', 'G', 'T', 'L', 'A', 'S', 'N', 'P', 'A', 'T']\n", "1 ['T', 'D', 'N', 'D', 'W', 'P', 'I', 'Y', 'V', 'E', 'S', 'G', 'E', 'E', 'N', 'D', 'P', 'A', 'G', 'D', 'D']\n", "1 ['G', 'Q', 'E', 'R', 'F', 'R', 'S', 'I', 'T', 'Q', 'S', 'Y', 'Y', 'R', 'S', 'A', 'N', 'A', 'L', 'I', 'L']\n", "1 ['S', 'I', 'N', 'T', 'G', 'C', 'L', 'N', 'A', 'C', 'T', 'Y', 'C', 'K', 'T', 'K', 'H', 'A', 'R', 'G', 'N']\n", "0 ['N', 'K', 'A', 'S', 'L', 'P', 'P', 'K', 'P', 'G', 'T', 'M', 'A', 'A', 'G', 'G', 'G', 'G', 'P', 'A', 'P']\n", "0 ['A', 'S', 'V', 'Q', 'D', 'Q', 'T', 'T', 'V', 'R', 'T', 'V', 'A', 'S', 'A', 'T', 'T', 'A', 'I', 'E', 'I']\n", "0 ['A', 'S', 'L', 'E', 'G', 'K', 'K', 'I', 'K', 'D', 'S', 'T', 'A', 'A', 'S', 'R', 'A', 'T', 'T', 'L', 'S']\n", "0 ['R', 'R', 'Q', 'P', 'V', 'G', 'G', 'L', 'G', 'L', 'S', 'I', 'K', 'G', 'G', 'S', 'E', 'H', 'N', 'V', 'P']\n" ] } ], "source": [ "X_features, 
Y_label = prepare_dataset(N_SAMPLES=10)\n", "\n", "for peptide, label in zip(X_features, Y_label):\n", "    print(label, peptide)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These sequences are centered with respect to the modified or presumably unmodified residue, which may be a serine or a threonine." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 ['S']\n", "0 ['S']\n", "0 ['S']\n", "1 ['S']\n", "1 ['S']\n", "1 ['T']\n", "0 ['T']\n", "0 ['T']\n", "0 ['S']\n", "0 ['S']\n" ] } ], "source": [ "for peptide, label in zip(X_features, Y_label):\n", "    print(label, peptide[len(peptide) // 2:len(peptide) // 2 + 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because *O*-GlcNAcylation may target serine or threonine, note that negative sequences with label ``1`` were prepared to also contain such residues at the same position." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have already seen in [String dataset](../dummy_string/prepare_dataset.ipynb) how to perform [*one-hot encoding*](../dummy_string/prepare_dataset.ipynb#One-hot-encoding-of-string-features) of string features.\n", "\n", "Just for fun, and also because you may like to use such data in convolutional networks, let's convert a peptide sequence into an image." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20\n", "{'I': 0, 'S': 1, 'K': 2, 'V': 3, 'W': 4, 'Q': 5, 'N': 6, 'C': 7, 'H': 8, 'G': 9, 'A': 10, 'R': 11, 'D': 12, 'F': 13, 'L': 14, 'Y': 15, 'E': 16, 'T': 17, 'P': 18, 'M': 19}\n", "['G', 'R', 'I', 'S', 'A', 'L', 'Q', 'G', 'K', 'L', 'S', 'K', 'L', 'D', 'Y', 'R', 'D', 'I', 'T', 'K', 'Q']\n", "[9, 11, 0, 1, 10, 14, 5, 9, 2, 14, 1, 2, 14, 12, 15, 11, 12, 0, 17, 2, 5]\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAD4AAAD4CAYAAAC0cFXtAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAALP0lEQVR4nO2de4wV1R3HP18Xl25Z1kWhosgWLRuSrSICoTZuDT6LxKh9ZkljobXVGkk0adPQmmhj04S2UduGBuuDaBsrND5S0hKV+qjdpFIowQcCBQ2mIC91QXa1kNVf/5izOFzm7p2dubt32HM+yeTOPefcOed753Xm/Ob3OzIzfOSEWjegVgThvhGE+8aIWjcgiVGjRtmYMWMqluvq6qKnp0dZ6iik8DFjxrBw4cKK5ZYsWZK5jlyHuqQ5krZI2iZpUUL+SEkrXP4aSZPy1FdNMguXVAf8FrgCaAPmSWorKXYd0GVmk4G7gZ9nra/a5Nnjs4BtZvaGmR0GlgNXl5S5GnjIrT8KXCIp0zlZbfIInwD8N/Z9h0tLLGNmvcAB4JQcdVaNwtzOJF0vaZ2kdT09PYNeXx7hO4GJse9nuLTEMpJGACcB7yRtzMzuNbOZZjZz1KhROZqVjjzC1wKtks6UVA90ACtLyqwE5rv1rwLPWkEeBzPfx82sV9JC4CmgDlhmZhsl3QGsM7OVwAPAHyRtA94l+nMKQa4OjJmtAlaVpN0WW/8f8LWBbre+vp6WlpZU5bJSmIvbUBOE+0YQ7htBuG8E4b7hrXAV5JnhKCSlbpSZZRrY8HaPB+G+EYT7RhA+UCRNlPScpNckbZR0c0KZ2ZIOSNrgltuStlUTzCzTApwGTHfro4H/AG0lZWYDf8mwbUu7ZG1/5j1uZrvMbL1bPwhs4liDQmGpirXUGQPPA9YkZH9e0kvAW8APzGxjmW1cD1zv1mlqaqpYb3d3d9Ym5++ySmoE/g78zMweL8lrAj4ys25Jc4Ffm1lrpW2OGDHCGhsbK9bd3d1Nb2/v0HdZJZ0IPAY8XCoawMzeM7Nut74KOFHS2Dx1Vos8V3URGQw2mdldZcqM77OOSprl6ks0IQ01ec7xC4BrgVckbXBpPwZaAMzsHiKz0Y2SeoEPgI6imJAK+Vha+HP8eCYI940g3DcK+YLfhAkTWLTomNfmjmHx4sWZ6/B2jwfhvhGE+0YQ7htBuG8UsufW2NhIe3t7qnJZ8XaP5xYuabukV5ylZF1CviT9xvmlvCxpet46q0G1DvWLzOztMnlXAK1u+Ryw1H3WlKE41K8Gfm8RLwLNkk4bgnr7pRrCDXha0r+dNaSUNL4rR7lmdHV1VaFZ/VMN4e1mNp3okL5J0oVZNhJ3zUjjZZiX3MLNbKf73As8QeSWFSeN78qQk9eENErS6L514HLg1ZJiK4Fvuqv7+cABM9uVp95qkPeqfirwhLMSjQD+aGZPSvoeH
LGmrALmAtuA94Fv5ayzKhTSktLc3Gxpem6dnZ3s378/WFIGQhDuG0G4bwThvhGE+4a3wo/rwcYNGzZkrsPbPR6E+0YQ7htB+ECRNCXma7JB0nuSbikpU1iflDxxYLYA0+BIpK+dRKOspfzDzK7MWs9gUa1D/RLgdTN7s0rbG3Sq1XPrAB4pkzdgn5S6ujqWLl1asdJ9+/Zlay3V8UmpJxL1WTPbU5KXySdl5MiRNn78+Ip17969m0OHDtVslPUKYH2paBimPikx5lHmMB+uPil9ZqPLgBtiaXErSvBJGQjHyzl+XBKE+4a3wgs55jZ58mSWL19esVxHR/YQkN7u8SDcN4Jw3wjCfSMI9w1vhRfyeTzEiBhEUgmXtEzSXkmvxtJOlrRa0lb3mfiSuaT5rsxWSfOTytSCtHv8QWBOSdoi4Bk3XPyM+34Ukk4GbifyQZkF3F7uDxpqUgk3sxeIpgKIE58D5SHgmoSffhFYbWbvmlkXsJpj/8CakOd5/NTYC/e7id5dLyWVPwocG91rsKnKxc0NGee6PcR9Uk44YfCvuXlq2NPnRuU+9yaUKaQ/CuQTHp8DZT7w54QyTwGXSxrjLmqXu7Sak/Z29gjwT2CKpB2SrgMWA5dJ2gpc6r4jaaak+wHM7F3gp0STyawF7nBpNaeQPbdgSRlEgnDfCMJ9Iwj3jSDcN7wVHuzjvhGE+0YQ7htBeDnKWFF+KWmzi9b1hKTmMr/tN/JXLUmzxx/kWCPAauBsM5tKND/Kj/r5/UVmNs3MZmZr4uBQUXiSFcXMnnYTqQO8SDRsfFxRjZ7bt4EVZfL6In8Z8Dszu7fcRuKWlHHjxvHmm5X9eg4fPjzw1jryvqh/K9ALPFymSLuZ7ZT0KWC1pM3uCDoG96fcC9Da2jroQ795HO4WAFcC3yjndZAi8lfNyCRc0hzgh8BVZvZ+mTJpIn/VjDS3syQryhKiGa9Wu1vVPa7s6ZL6JmI/Feh0Pmf/Av5qZk8OiooMVDzHzWxeQvIDZcq+RRTCDDN7Azg3V+sGkdBz840g3DcKOebW0NDA1KlTU5XLird7PAj3jSDcN4Jw3wjCfcNb4YXssnZ3d9PZ2ZmqXFa83eNZLSk/kbQzFrVrbpnfzpG0xc2RUnmSwiEkqyUF4G5nIZnmovochYv49VuiyEBtwDxJbXkaW00yWVJSMgvYZmZvmNlhYDmRH0shyHOOL3RGw2VlPItS+6PA0fOkHDx4MEez0pFV+FLgM0SB7HYBd+ZtSNwnZfTo0Xk3V5FMws1sj5l9aGYfAfeRbCEprD8KZLekxOcy+hLJFpK1QKukM13Mtw4iP5ZCULED4ywps4GxknYQeQ7OljSNyBq6HRfdS9LpwP1mNtfMeiUtJHK+qQOWlYveVwsGzZLivq8imiBmQNTV1dHU1JSqXFZCz803gnDfCMJ9Iwj3DW+FF9Kb2L0QmAozC97EAyEI940g3DeCcN9IM/S0jOj17L1mdrZLWwFMcUWagf1mNi3ht9uBg8CHQG+h3DPMrN8FuBCYDrxaJv9O4LYyeduBsZXqSPidpV0Guu2+Jc2Y2wuSJiXlufkRvg5cnPWPrxV5z/EvAHvMbGuZ/Eqz0R8hbklpaWlJtddmzJiRueF5hZedMcORejb6uCVl3LhxOZtVmTw+KSOAL1PeA2n4+aQ4LgU2m9mOpMzh6pMCCfMf+eCTgpktSEgLPilFJwj3DW+FF3KwMYQuHESCcN8Iwn0jCPeNINw3vBVeSJ+UhoYGzjnnnIrlDhw4kLmONCMwEyU9J+k1SRsl3ezSh/10Ib3A982sDTifaLS0jeE+XYiZ7TKz9W79ILCJyNPguJ4uZEAXN2dROQ9YQ5WnCxlqUguX1Ag8BtxiZu/F81ysp1wP9nFLSp6oXWlJO3nEiUSiHzazx11yVacLiVtS6uvr07Y/M2mu6iJ6MX+Tmd0Vyxr204VcAFwLXFziWRimC6k2zc3N1t7eXrFcZ2cn+/fvzzTmVkjhk
g4CW0qSxwJvl6RNMbNMTmqF7LICW0pfG5G0LiktawXePqQE4QUjKYxp2rRUFPLiNhQUdY8POkF4LSkZ1FjvPo/ElYjFmjBJuyR1SeqRtEbSJEkLJO2L9Sy/U7HSrG8GVnMBfkE0kFEHvAPcA9QDLwFnA68DZwHdRI+2K9zvOojeuloALBlInYXY43w8qDELeJkoNHlfXImbcLEmXNn3+ThmxaPAJVkqLIrwvkGNCUR7t29QYwcwiY8HMz7hvn9F0jUWhTo/ADS6tJclPSop/iicyJB1WSX9DUiy9t+akFbuHvtponjua4BfSXrFpT8F3GdmhyTdQHT09Pt+7ZAJN7NLy+VJ6hvU2EkUdKNvUOMMojegz3Lb2CnpA2Af8DwwAziJ6FTo+7PuJ7pm9EtRDvW+QY21wFTg+VhciaVEsSbOdcNfnwROIxonmAg8y9FH0lVEA6L9Uoiem6RTgD8BLURX7tGAiA7heiKxlwFNQBfQQHQHeIto1Pa7RIJ7iS58N5rZ5n7rLILwWlCUQ33ICcJ9Iwj3jSDcN/4PoD0U6+FqaNYAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "X_features, _ = prepare_dataset(N_SAMPLES=10)\n", "\n", "# Flatten the list of lists (list of peptides) and make a set()\n", "aas = list(set([feature for features in X_features for feature in features]))\n", "\n", "# set() contains unique elements\n", "print(len(aas)) # 20 amino-acids\n", "\n", "e2i = {k: i for i, k in enumerate(aas)} # element_to_idx encoder 0-19\n", "\n", "features = X_features[0]\n", "\n", "print(e2i) # Encoder\n", "print(features) # Peptide before encoding\n", "print([e2i[feature] for feature in features]) # After encoding\n", "\n", "# NumPy array to plot as image\n", "img_features = np.array([e2i[feature] for feature in features])\n", "img_features = np.expand_dims(img_features, axis=1)\n", "\n", "plt.imshow(img_features, cmap='gray')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Well, let’s reshape. The number 21 is divisible by 7 and 3." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAHgAAAD4CAYAAAAn1CIKAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/MnkTPAAAACXBIWXMAAAsTAAALEwEAmpwYAAAIhElEQVR4nO3dX4iU1x3G8e9TzdJ2V81CbQlq/ANBkGprEKFEA01psak0N70wpIGWgFcWA4WSXvbKu9BelIIkaQvaSMgfCCVNGqghCKmNpqZx1RSrLSoRLYnGiFRMf73Y2bBZTfase86+s78+H1gyM47vOcs37zg7M2ePIgLL6zNdT8DacuDkHDg5B07OgZOb2+Kgg4ODMTw83OLQkxoYGOhk3DEnT57sbOyI0MTbmgQeHh5m27ZtLQ49qdtvv72Tccc88MADnY4/kR+ik3Pg5Bw4OQdOzoGTc+DkHDg5B07OgZNz4OQcOLmiwJI2SXpb0nFJj7SelNUzaWBJc4BfAt8GVgH3S1rVemJWR8kZvB44HhEnIuIqsAe4r+20rJaSwIuAU+Oun+7d9jGStko6IOnA5cuXa83Ppqnak6yI2BkR6yJi3eDgYK3D2jSVBD4DLBl3fXHvNpsFSgK/DtwhabmkAWAL8HzbaVktk35kJyKuSdoGvATMAZ6IiJHmM7Mqij6TFREvAC80nos14FeyknPg5Bw4OQdOzoGTc+DkHDg5B07OgZNTi1/CMnfu3BgaGqp+3BI7duzoZNwxGzZs6GTcLVu2MDIyct3yUZ/ByTlwcg6cnAMn58DJOXByDpycAyfnwMk5cHIOnJwDJ1eyuvAJSeckHZ6JCVldJWfwb4BNjedhjUwaOCJeBd6dgblYA9V+26ykrcDW3uVah7VpqhY4InYCO2H0Df9ax7Xp8bPo5Bw4uZIfk54EXgNWSjot6aH207JaStYH3z8TE7E2/BCdnAMn58DJOXByDpycAyfnwMk5cHIOnFyTzSmHhoY6W0Z58eLFTsYds3nz5k7GPXv27A1v9xmcnAMn58DJOXByDpycAyfnwMk5cHIOnJwDJ+fAyTlwciWfi14iaa+kI5JGJG2fiYlZHSXvJl0DfhwRb0iaBxyU9HJEHGk8N6ugZPnoOxHxRu/yJeAoN9ic0vrTlP4NlrQMWAvsv8GffbT76NWrVytNz6arOLCkIeAZ4OGIeH/in4/ffXRgYKDmHG0aSrd4v4XRuLsj4tm2U7KaSp5FC3gcOBoRj7afktVUcgbfBTwI3CPpUO/r3sbzskpKlo/uA/xLN2Ypv5KVnAMn58DJOXByDpycAyfnwMk5cHIOnFyT5aOLFi3qbBfQjRs3djLumAULFnQ6/kQ+g5Nz4OQcODkHTs6Bk3Pg5Bw4OQdOzoGTc+DkHDg5B06u5IPvn5X0F0lv9paP/mwmJmZ1lLyb9B/gnoj4oLeEZZ+kP0TEnxvPzSoo+eB7AB/0rt7S+/LehLNE6eKzOZIOAeeAlyPiU5ePvvfee5WnaTerKHBEfBgRXwUWA+slffkG9/lo+ejw8HDladrNmtKz6Ii4AOzFO4LPGiXPohdKurV3+XPAN4FjjedllZQ8i74N+K2kOYz+D/FURPy+7bSslpJn0X9j9Pdy2CzkV7KSc+DkHDg5B07OgZNz4OQcODkHTs6Bk3Pg5JqsD+7Srl27Oh1/zZo1nYz7Sdva+gxOzoGTc+DkHDg5B07OgZNz4OQcODkHTs6Bk3Pg5Kay89kcSX+V5M9EzyJTOYO3M7oxpc0ipasLFwPfAR5rOx2rrfQM/jnwE+C/n3QHLx/tTyWLzzYD5yLi4Kfdz8tH+1Pp3oXflfRPYA+jexh2+666FSvZAfynEbE4IpYBW4A/RcT3m8/MqvDPwclN6TNZEfEK8E
qTmVgTPoOTc+DkHDg5B07OgZNz4OQcODkHTs6Bk3Pg5JosH71y5QpvvfVWi0NPav78+Z2MO2bp0qWdjj+Rz+DkHDg5B07OgZNz4OQcODkHTs6Bk3Pg5Bw4OQdOrui16N6qhkvAh8C1iFjXclJWz1TebPh6RPy72UysCT9EJ1caOIA/SjooaeuN7jB++eilS5fqzdCmpfQhekNEnJH0ReBlScci4tXxd4iIncBOgBUrVnh/4T5Rur3smd5/zwHPAetbTsrqKVkAPihp3thl4FvA4dYTszpKHqK/BDwnaez+v4uIF5vOyqop2X30BPCVGZiLNeAfk5Jz4OQcODkHTs6Bk3Pg5Bw4OQdOzoGTc+DkFFH/nb1169bFgQMHqh+3RNfLN1evXt3JuPv27ePChQuaeLvP4OQcODkHTs6Bk3Pg5Bw4OQdOzoGTc+DkHDg5B06udO/CWyU9LemYpKOSvtZ6YlZH6dqkXwAvRsT3JA0An284J6to0sCSFgB3Az8AiIirwNW207JaSh6ilwPngV/3Noh+rLdG6WPGLx89f/589YnazSkJPBe4E/hVRKwFLgOPTLzT+N1HFy5cWHmadrNKAp8GTkfE/t71pxkNbrNAye6jZ4FTklb2bvoGcKTprKya0mfRPwJ2955BnwB+2G5KVlNR4Ig4BPhXJ81CfiUrOQdOzoGTc+DkHDg5B07OgZNz4OQcODkHTq7J8lFJ54F/3eRf/wLQ1S8en81jL42I696nbRJ4OiQd6GrLgIxj+yE6OQdOrh8D7/TY9fTdv8FWVz+ewVaRAyfXV4ElbZL0tqTjkq77aG7DcZ+QdE7SjO9FIWmJpL2SjkgakbS96gAR0RdfwBzgH8AKYAB4E1g1Q2PfzehHgQ938H3fBtzZuzwP+HvN77ufzuD1wPGIONFbHrMHuG8mBo7RPaDenYmxbjD2OxHxRu/yJeAosKjW8fsp8CLg1Ljrp6n4jc4GkpYBa4H9k9y1WD8F/r8maQh4Bng4It6vddx+CnwGWDLu+uLebelJuoXRuLsj4tmax+6nwK8Dd0ha3ltBsQV4vuM5NafRHcceB45GxKO1j983gSPiGrANeInRJxpPRcTITIwt6UngNWClpNOSHpqJcXvuAh4E7pF0qPd1b62D+6XK5PrmDLY2HDg5B07OgZNz4OQcODkHTu5/F8pHdqMeSG8AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.imshow(img_features.reshape(7, 3), cmap='gray')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems to be working! We are done." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Live examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function ``prepare_dataset()`` presented herein is used in the following live examples:\n", "\n", "* Notebook at `EpyNN/epynnlive/ptm_protein/train.ipynb` or following [this link](train.ipynb). \n", "* Regular Python code at `EpyNN/epynnlive/ptm_protein/train.py`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 4 }