Building the Local Database of Proteins
This pipeline downloads all proteins of interest from one or more databases. PhageScanner currently supports UniProt and Entrez downloads, and more databases may be added later. The goal of PhageScanner is to create models that predict whether a given genome (or metagenome) contains proteins of a general class (for example, toxins or phage virion proteins); this pipeline downloads the proteins used to train those models.
python phagescanner.py database -c Path/To/Config.yaml -o path/to/output/directory/ -n name_for_files_<classname>
This pipeline requires a configuration file for specifying the following:
- A database query per protein class. Both UniProt and Entrez accept declarative queries that retrieve all proteins matching a set of criteria, for example all proteins annotated with a particular Gene Ontology (GO) term.
- The clustering percentage. Proteins are clustered at this percentage so that the same (or nearly the same) protein does not end up in both the training and testing sets; in effect, it removes duplicates.
- The number of partitions for the training/testing split. k-fold cross validation is used for testing, so this database pipeline splits the proteins into k partitions for the downstream training pipeline.
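The interaction between clustering and partitioning can be sketched in a few lines of Python. This is a minimal illustration, not PhageScanner's implementation: the function name `assign_partitions` and the `cluster_of` mapping are hypothetical, and the real pipeline may instead keep only one representative per cluster. The sketch keeps every member of a cluster in the same partition, so similar proteins never straddle a train/test boundary.

```python
from collections import defaultdict


def assign_partitions(cluster_of, k):
    """Assign proteins to k partitions while keeping each cluster intact.

    cluster_of: dict mapping protein ID -> cluster ID (hypothetical input,
                e.g. parsed from a clustering tool's output).
    k: number of partitions used for k-fold cross validation.
    """
    # Group protein IDs by the cluster they belong to.
    clusters = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(protein_id)

    # Place the largest clusters first, always into the currently smallest
    # partition, so partition sizes stay roughly balanced.
    partitions = [[] for _ in range(k)]
    for cluster_id in sorted(clusters, key=lambda c: (-len(clusters[c]), c)):
        smallest = min(range(k), key=lambda i: len(partitions[i]))
        partitions[smallest].extend(sorted(clusters[cluster_id]))
    return partitions


# Made-up protein and cluster IDs, for illustration only.
cluster_of = {
    "P1": "c1", "P2": "c1",
    "P3": "c2",
    "P4": "c3", "P5": "c3",
    "P6": "c4",
}
partitions = assign_partitions(cluster_of, k=2)
for index, members in enumerate(partitions):
    print(index, members)
```

In each fold, one partition would serve as the test set and the remaining k-1 partitions as the training set.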
Each configuration file is a YAML file that contains this information. The names of the protein classes can be changed as desired, as can the queries used. Below is an example of this configuration file:
clustering:
  name: cdhit # CDHIT is the only clustering tool allowed at the moment.
  deduplication-threshold: 100
  clustering-percentage: 95
  k_partitions: 5 # number of partitions in k-fold cross validation
classes:
  - name: KnownToxin
    uniprot: '(bacteriophage AND reviewed: true AND ((go:0090729)) OR ("Cholera toxin" AND CTX) OR ("exotoxin C" AND (go:0090729)) OR ("exotoxin A" AND "(SpeA)" AND (go:0090729)) OR ("Verotoxin" OR "shiga-like toxin" AND (go:0090729)) OR ("Botulinum toxin" AND (go:0090729)) OR ("Diphtheria toxin" AND (go:0090729)) OR ("Toxic shock" AND (go:0090729)) OR ("Ctx" AND (go:0090729)) OR ("Shiga toxin Stx" AND (go:0090729)) OR (bacteriophage AND (go:0090729)))'
  - name: Non-Toxin
    uniprot: 'bacteriophage NOT (go:0090729) AND reviewed: true'
Generic template:
clustering:
  name: cdhit # CDHIT is the only clustering tool allowed at the moment.
  deduplication-threshold: 100
  clustering-percentage: 95
  k_partitions: <add number of k-fold partitions>
classes:
  - name: <add name for the positive class>
    uniprot: <add query for uniprot>
    entrez: <add query for entrez>
  - name: <add name for the negative class>
    uniprot: <add query for uniprot>
    entrez: <add query for entrez>
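Before running the pipeline on a filled-in template, it can help to check that no placeholder was left behind. The sketch below is a hypothetical helper, not part of PhageScanner: it assumes the YAML has already been parsed into a Python dict (for example with a YAML parser such as PyYAML's `yaml.safe_load`), that `k_partitions` sits under `clustering`, and that each class needs a name plus at least one of the `uniprot` or `entrez` queries. The function names and the exact rules are assumptions for illustration.

```python
# Keys taken from the template above; their nesting is an assumption.
REQUIRED_CLUSTERING_KEYS = (
    "name",
    "deduplication-threshold",
    "clustering-percentage",
    "k_partitions",
)
QUERY_KEYS = ("uniprot", "entrez")


def is_placeholder(value):
    """True for unfilled template values such as '<add query for uniprot>'."""
    return isinstance(value, str) and value.strip().startswith("<add")


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    clustering = config.get("clustering") or {}
    for key in REQUIRED_CLUSTERING_KEYS:
        if key not in clustering or is_placeholder(clustering[key]):
            problems.append(f"clustering.{key} is missing or unfilled")

    classes = config.get("classes") or []
    if len(classes) < 2:
        problems.append("at least two classes are needed")
    for index, entry in enumerate(classes):
        name = entry.get("name")
        if not name or is_placeholder(name):
            problems.append(f"classes[{index}] has no name")
        queries = [entry.get(key) for key in QUERY_KEYS]
        if not any(q and not is_placeholder(q) for q in queries):
            problems.append(f"classes[{index}] has no uniprot or entrez query")
    return problems


# A partially filled template, written as the dict a YAML parser would return.
example = {
    "clustering": {
        "name": "cdhit",
        "deduplication-threshold": 100,
        "clustering-percentage": 95,
        "k_partitions": "<add number of k-fold partitions>",
    },
    "classes": [
        {
            "name": "Non-Toxin",
            "uniprot": "bacteriophage NOT (go:0090729) AND reviewed: true",
        },
        {
            "name": "<add name for the positive class>",
            "uniprot": "<add query for uniprot>",
            "entrez": "<add query for entrez>",
        },
    ],
}

for problem in validate_config(example):
    print(problem)
```

Here the check reports the unfilled `k_partitions` value and the second class, whose name and queries are still template placeholders.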
- Multiclass PVPs. This example splits the phage virion proteins (PVPs) into multiple classes, giving finer granularity than the binary approach.
python phagescanner.py database -c configs/multiclass_pvps/database_multiclass.yaml -o ./benchmarking_database/ -n benchmarking -v info
- Binary PVPs
python phagescanner.py database -c configs/binary_pvps/database_binary.yaml -o ./binary_database/ -n benchmarking -v info
- Toxin proteins
python phagescanner.py database -c configs/phage_toxins/database_toxins.yaml -o ./toxin_database/ -n benchmarking -v info
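To build all three example databases in one go, the commands above could be generated from a small script. This is a convenience sketch only: `build_command` and `RUNS` are hypothetical names, and the script prints the commands rather than executing them. The config paths, output directories, and flags are copied from the examples above.

```python
import shlex

# (config file, output directory) pairs taken from the examples above.
RUNS = [
    ("configs/multiclass_pvps/database_multiclass.yaml", "./benchmarking_database/"),
    ("configs/binary_pvps/database_binary.yaml", "./binary_database/"),
    ("configs/phage_toxins/database_toxins.yaml", "./toxin_database/"),
]


def build_command(config_path, output_dir, name="benchmarking", verbosity="info"):
    """Return the argument list for one run of the database pipeline."""
    return [
        "python", "phagescanner.py", "database",
        "-c", config_path,
        "-o", output_dir,
        "-n", name,
        "-v", verbosity,
    ]


commands = [build_command(config, output) for config, output in RUNS]
for argv in commands:
    # shlex.join renders the list as a copy-pasteable shell command.
    print(shlex.join(argv))
```

To actually run each command, the loop body could call `subprocess.run(argv, check=True)` from the repository root instead of printing.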