
Building the Local Database of proteins

Dreycey Albin edited this page Jun 2, 2024 · 24 revisions

Overview

This pipeline downloads all proteins of interest from one or more databases. PhageScanner currently supports UniProt and Entrez downloads, and more databases can be added later. The goal of PhageScanner is to create models that predict whether a given genome (or metagenome) contains a general class of protein (for example, toxins or phage virion proteins); this pipeline downloads the proteins used to train those models.

Basic Usage

python phagescanner.py database -c Path/To/Config.yaml -o path/to/output/directory/ -n name_for_files_<classname>

Configuration File

This pipeline requires a configuration file for specifying the following:

  1. The database query for each protein class. Both UniProt and Entrez accept declarative queries that retrieve all proteins matching a set of criteria, for example all proteins annotated with a particular Gene Ontology (GO) term.
  2. The clustering percentage. Proteins are clustered at this sequence-identity percentage so that near-identical proteins do not end up in both the training and testing sets; in effect, it removes near-duplicates.
  3. The number of partitions for the training/testing split. k-fold cross validation is used downstream, so this database pipeline splits the proteins into k partitions for the training pipeline.
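
To make the partitioning in item 3 concrete, here is a minimal sketch of one way to split a list of protein identifiers into k partitions. It is an illustration only and not PhageScanner's actual implementation: the function name `split_into_partitions`, the round-robin strategy, and the placeholder identifiers are all assumptions.

```python
import random
from typing import Dict, List


def split_into_partitions(protein_ids: List[str], k: int, seed: int = 0) -> Dict[int, List[str]]:
    """Shuffle the ids, then deal them round-robin into k partitions.

    Hypothetical helper: PhageScanner may partition differently.
    """
    if k < 2:
        raise ValueError("k must be at least 2 for cross validation")
    shuffled = list(protein_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    partitions: Dict[int, List[str]] = {i: [] for i in range(k)}
    for index, protein_id in enumerate(shuffled):
        partitions[index % k].append(protein_id)
    return partitions


# Placeholder ids standing in for the proteins left after clustering.
ids = [f"protein_{i}" for i in range(12)]
parts = split_into_partitions(ids, k=5)
sizes = sorted(len(members) for members in parts.values())
print(sizes)
```

Each protein lands in exactly one partition, and partition sizes differ by at most one. In k-fold cross validation, each partition then serves once as the test set while the remaining k-1 partitions are used for training.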

Each configuration file is a YAML file that contains this information. The names of the protein classes can be changed as desired, as can the queries used. Below is an example of this configuration file:

clustering:
  name: cdhit # CD-HIT is the only clustering tool supported at the moment.
  deduplication-threshold: 100
  clustering-percentage: 95
  k_partitions: 5 # number of partitions in k-fold cross validation
classes:
  - name: KnownToxin
    uniprot: '(bacteriophage AND reviewed: true AND ((go:0090729)) OR ("Cholera toxin" AND CTX) OR ("exotoxin C" AND (go:0090729)) OR ("exotoxin A" AND "(SpeA)" AND (go:0090729)) OR ("Verotoxin" OR "shiga-like toxin" AND (go:0090729)) OR ("Botulinum toxin" AND (go:0090729)) OR ("Diphtheria toxin" AND (go:0090729)) OR ("Toxic shock" AND (go:0090729)) OR ("Ctx" AND (go:0090729)) OR ("Shiga toxin Stx" AND (go:0090729)) OR (bacteriophage AND (go:0090729)))'
  - name: Non-Toxin
    uniprot: 'bacteriophage NOT (go:0090729) AND reviewed: true'

Generic template:

clustering:
  name: cdhit # CD-HIT is the only clustering tool supported at the moment.
  deduplication-threshold: 100
  clustering-percentage: 95
  k_partitions: <add number of k-fold partitions>
classes:
  - name: <add name for the positive class>
    uniprot: <add query for uniprot>
    entrez: <add query for entrez>
  - name: <add name for the negative class>
    uniprot: <add query for uniprot>
    entrez: <add query for entrez>
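
When filling in the template, it can help to sanity-check the configuration before starting a long download. The sketch below checks an already-parsed configuration (for example, the dictionary returned by PyYAML's `yaml.safe_load`). The helper `validate_config` is hypothetical, and the rules it enforces (all four clustering keys present, at least one of `uniprot` or `entrez` per class) are assumptions drawn from the template above, not checks that PhageScanner itself is known to perform.

```python
from typing import Any, Dict, List

# Query sources that appear in the template above.
SUPPORTED_SOURCES = ("uniprot", "entrez")
# Keys shown under `clustering` in the template; treating all as required is an assumption.
CLUSTERING_KEYS = ("name", "deduplication-threshold", "clustering-percentage", "k_partitions")


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems: List[str] = []
    clustering = config.get("clustering") or {}
    for key in CLUSTERING_KEYS:
        if key not in clustering:
            problems.append(f"clustering.{key} is missing")
    classes = config.get("classes") or []
    if not classes:
        problems.append("no protein classes defined")
    for entry in classes:
        name = entry.get("name", "<unnamed>")
        if not any(entry.get(source) for source in SUPPORTED_SOURCES):
            problems.append(f"class {name} has no uniprot or entrez query")
    return problems


good = {
    "clustering": {"name": "cdhit", "deduplication-threshold": 100,
                   "clustering-percentage": 95, "k_partitions": 5},
    "classes": [
        {"name": "KnownToxin", "uniprot": "bacteriophage AND (go:0090729)"},
        {"name": "Non-Toxin", "uniprot": "bacteriophage NOT (go:0090729)"},
    ],
}
bad = {"clustering": {"name": "cdhit"}, "classes": [{"name": "Empty"}]}

print(validate_config(good))
print(validate_config(bad))
```

Returning a list of problems rather than raising on the first one lets all mistakes in a config be reported in a single pass.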

Examples

  • Multiclass PVPs. This configuration splits the phage virion proteins (PVPs) into multiple classes, giving finer granularity than the binary approach.
python phagescanner.py database -c configs/multiclass_pvps/database_multiclass.yaml -o ./benchmarking_database/ -n benchmarking -v info
  • Binary PVPs
python phagescanner.py database -c configs/binary_pvps/database_binary.yaml -o ./binary_database/ -n benchmarking -v info
  • Toxin proteins
python phagescanner.py database -c configs/phage_toxins/database_toxins.yaml -o ./toxin_database/ -n benchmarking -v info
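
The three example commands above differ only in their config path and output directory, so they can be generated from a small table when building all databases in one go. The following is a sketch of that idea; the helper `build_database_command` is hypothetical, and it only uses the flags shown in the commands above (`-c`, `-o`, `-n`, `-v`).

```python
import shlex
from typing import List


def build_database_command(config: str, output_dir: str, name: str,
                           verbosity: str = "info") -> List[str]:
    """Assemble the argument list for one `phagescanner.py database` run."""
    return ["python", "phagescanner.py", "database",
            "-c", config, "-o", output_dir, "-n", name, "-v", verbosity]


# (config path, output directory, file name prefix), copied from the examples above.
runs = [
    ("configs/multiclass_pvps/database_multiclass.yaml", "./benchmarking_database/", "benchmarking"),
    ("configs/binary_pvps/database_binary.yaml", "./binary_database/", "benchmarking"),
    ("configs/phage_toxins/database_toxins.yaml", "./toxin_database/", "benchmarking"),
]

commands = [build_database_command(*run) for run in runs]
for command in commands:
    # Print the shell-quoted command line instead of executing it.
    print(shlex.join(command))
```

To actually run each command from the repository root, pass the argument list to `subprocess.run(command, check=True)` in place of the `print` call.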