Building the Local Database of proteins

Dreycey Albin edited this page Jun 4, 2024 · 24 revisions

Overview

This pipeline downloads all proteins of interest from a given database (or databases). PhageScanner supports downloads from either UniProt or Entrez, and more databases can be added later. The goal of PhageScanner is to create models that predict whether a given genome (or metagenome) contains proteins of a general class (for example, toxins or phage virion proteins); this pipeline downloads the proteins used to train those models.

Basic Usage

For Help:

python phagescanner.py database -h

Required Arguments:

python phagescanner.py database -c Path/To/Config.yaml \
                                -o path/to/output/directory/

Optional Arguments:

python phagescanner.py database -c Path/To/Config.yaml \
                                -o path/to/output/directory/ \
                                --cdhit_path <path to cd-hit executable> [Defaults to `cd-hit`] \
                                -v debug
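
Because `--cdhit_path` defaults to `cd-hit`, the executable has to be discoverable on `PATH` unless an explicit location is passed. A minimal sketch of checking this before launching the pipeline, using only the Python standard library; the helper name `resolve_cdhit` is hypothetical and is not part of PhageScanner:

```python
import shutil
from typing import Optional


def resolve_cdhit(cdhit_path: str = "cd-hit") -> Optional[str]:
    """Return the resolved path of the cd-hit executable, or None if not found.

    Hypothetical helper (not part of PhageScanner). shutil.which searches
    PATH for bare command names and checks explicit paths for executability.
    """
    return shutil.which(cdhit_path)


if __name__ == "__main__":
    found = resolve_cdhit()
    if found is None:
        print("cd-hit not found on PATH; pass its location with --cdhit_path")
    else:
        print(f"cd-hit found at {found}")
```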

Minimal Configuration File Requirements

While the same configuration file can be used for all three pipelines, this pipeline requires the following configuration file specifications:

  1. A database query per protein class. Both UniProt and Entrez allow for declarative queries that retrieve all proteins matching a set of criteria, for example all proteins that correspond to a particular gene ontology (GO) class.
  2. The clustering percentage. This is needed to prevent the same proteins from appearing in both the training and testing sets, so it essentially removes duplicates.
  3. The number of partitions for the training/testing split. k-fold cross-validation is used for the testing pipeline, so this database pipeline needs to split the proteins into different partitions for the downstream training pipeline.
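
The partitioning in item 3 can be illustrated with a short sketch. How PhageScanner assigns proteins to partitions is not stated here, so the shuffle-and-deal approach, the function names, and the `seed` parameter below are all assumptions made for illustration; the input is assumed to be the protein IDs that remain after clustering.

```python
import random
from typing import Dict, Iterator, List, Tuple


def split_into_partitions(protein_ids: List[str], k: int, seed: int = 0) -> Dict[int, List[str]]:
    """Shuffle the IDs and deal them round-robin into k partitions.

    Illustrative only: not PhageScanner's actual partitioning code.
    """
    if k < 2:
        raise ValueError("k must be at least 2 for k-fold cross-validation")
    if len(protein_ids) < k:
        raise ValueError("need at least k proteins to fill k partitions")
    shuffled = list(protein_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    # Slice i::k gives every k-th item starting at i, so sizes differ by at most 1.
    return {i: shuffled[i::k] for i in range(k)}


def folds(partitions: Dict[int, List[str]]) -> Iterator[Tuple[int, List[str], List[str]]]:
    """Yield (test_index, train_ids, test_ids), using each partition as the test set once."""
    for test_idx, test_ids in partitions.items():
        train_ids = [pid for idx, ids in partitions.items() if idx != test_idx for pid in ids]
        yield test_idx, train_ids, test_ids


if __name__ == "__main__":
    ids = [f"protein_{i}" for i in range(12)]
    for test_idx, train_ids, test_ids in folds(split_into_partitions(ids, k=5)):
        print(f"fold {test_idx}: {len(train_ids)} training, {len(test_ids)} testing")
```

Each protein lands in exactly one partition, so no protein is used for both training and testing within a fold.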

Each configuration file is a YAML file that contains this information. The names of the protein classes can be changed as desired, as can the queries used. Below is an example of this configuration file:

Example for Binary PVPs:

clustering:
  deduplication-threshold: 100
  clustering-percentage: 90
  k_partitions: 5 # number of partitions in k-fold cross validation
classes:
  - name: PVP
    uniprot: "capsid AND cc_subcellular_location: virion AND reviewed: true"
    entrez: "bacteriophage[Organism] AND ((shaft[Title] OR sheath[Title]) AND tail[Title]) OR head-tail[Title] OR tail fiber[Title] OR portal[Title] OR minor tail[Title] OR major tail[Title] OR baseplate[Title] OR minor capsid[Title] OR major capsid[Title])"
  - name: non-PVP
    uniprot: "capsid NOT cc_subcellular_location: virion AND reviewed: true"
    entrez: "bacteriophage[Organism] NOT (bacteriophage[Organism] AND ((shaft[Title] OR sheath[Title]) AND tail[Title]) OR head-tail[Title] OR tail fiber[Title] OR portal[Title] OR minor tail[Title] OR major tail[Title] OR baseplate[Title] OR minor capsid[Title] OR major capsid[Title]))"

Generic template:

clustering:
  deduplication-threshold: 100
  clustering-percentage: 95
  k_partitions: <add number of k-fold partitions>
classes:
  - name: <add name for the positive class>
    uniprot: <add query for uniprot>
    entrez: <add query for entrez>
  - name: <add name for the negative class>
    uniprot: <add query for uniprot>
    entrez: <add query for entrez>
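
A filled-in template can be sanity-checked before running the pipeline. The sketch below assumes the YAML has already been parsed into a Python dictionary (for example with a YAML library) and only checks for the keys shown in the template above. The function name is hypothetical, and the rule that a class needs at least one of the two queries is an assumption, not documented PhageScanner behavior.

```python
from typing import Any, Dict, List

# Keys taken from the template above.
REQUIRED_CLUSTERING_KEYS = ("deduplication-threshold", "clustering-percentage", "k_partitions")


def find_config_problems(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a parsed config (empty list if none)."""
    problems: List[str] = []
    clustering = config.get("clustering") or {}
    for key in REQUIRED_CLUSTERING_KEYS:
        if key not in clustering:
            problems.append(f"clustering.{key} is missing")
    classes = config.get("classes") or []
    if not classes:
        problems.append("classes must list at least one protein class")
    for i, entry in enumerate(classes):
        if not entry.get("name"):
            problems.append(f"classes[{i}] has no name")
        # Assumption: one database query per class is enough.
        if not (entry.get("uniprot") or entry.get("entrez")):
            problems.append(f"classes[{i}] has neither a uniprot nor an entrez query")
    return problems


if __name__ == "__main__":
    example_config = {
        "clustering": {"deduplication-threshold": 100, "clustering-percentage": 90},
        "classes": [{"name": "PVP", "uniprot": "capsid AND reviewed: true"}],
    }
    for problem in find_config_problems(example_config):
        print(problem)
```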

Examples

  • Binary PVPs. This is the common example of classifying proteins as either Phage Virion Proteins (PVPs) or not.
python phagescanner.py database -c configs/binary_pvps_config.yaml -o ./binary_database/ -v info
  • Multiclass PVPs. This example splits the Phage Virion Proteins (PVPs) into multiple classes for finer granularity than the binary approach.
python phagescanner.py database -c configs/feature_testing_config.yaml -o ./multiclass_database/ -v info
  • Toxin proteins. This example shows the flexibility of the pipeline, which extends well beyond PVPs. For example, one may want to know whether a phage genome contains any toxins in order to develop a safe phage therapy cocktail.
python phagescanner.py database -c configs/phagetoxins_config.yaml -o ./toxin_database/ -v info