PhageScanner Configuration Files
Each pipeline in PhageScanner uses a YAML configuration file, which keeps the system modular and extensible. Users can tailor it to detect different classes of proteins as needed. For instance, to predict toxic proteins and also locate Phage Virion Proteins (PVPs), set up one configuration file per category; during prediction, each configuration file points to its corresponding model files. Focusing on a single class, such as PVPs or toxic proteins, requires just one configuration file.
This guide provides an overview of the configuration files and their practical applications. Example configurations can be found in the repository here: PhageScanner/configs
Below is a basic configuration example for predicting whether a protein is a Phage Virion Protein (PVP). It's important to note that a single configuration file can serve all three pipelines.
clustering:
  deduplication-threshold: 100
  clustering-percentage: 90
  k_partitions: 5 # number of partitions in k-fold cross validation
classes:
  - name: PVP
    uniprot: "capsid AND cc_subcellular_location: virion AND reviewed: true"
    entrez: "bacteriophage[Organism] AND ((shaft[Title] OR sheath[Title]) AND tail[Title]) OR head-tail[Title] OR tail fiber[Title] OR portal[Title] OR minor tail[Title] OR major tail[Title] OR baseplate[Title] OR minor capsid[Title] OR major capsid[Title])"
  - name: non-PVP
    uniprot: "capsid NOT cc_subcellular_location: virion AND reviewed: true"
    entrez: "bacteriophage[Organism] NOT (bacteriophage[Organism] AND ((shaft[Title] OR sheath[Title]) AND tail[Title]) OR head-tail[Title] OR tail fiber[Title] OR portal[Title] OR minor tail[Title] OR major tail[Title] OR baseplate[Title] OR minor capsid[Title] OR major capsid[Title]))"
models:
  - name: "PVP-SVM (SVM)"
    model_info:
      model_name: "SVM"
      sequential: false
    features: # Options: "AAC", "DPC", "ISO", "PSEUDOAAC", "ATC", "CTD"
      - name: "DPC"
        parameters: # DPC must have 'gap_size' parameter. 0 for regular DPC
          gap_size: 0
      - name: "AAC"
      - name: "ATC"
      - name: "CTD"
      - name: "PCP"
This YAML file has three main sections: (1) clustering, (2) classes, and (3) models. The clustering section details how proteins are grouped after being downloaded. The classes section defines the target classes for the model and specifies which proteins fit into each category using queries to Uniprot and/or Entrez. The models section outlines which models to train and which feature extractors to use. The example here is an SVM model, named "PVP-SVM (SVM)", that uses the five feature types listed in its features section.
The clustering section explains how proteins are grouped together after being downloaded to avoid duplicates in the training and testing datasets. Here is what needs to be specified:
- deduplication-threshold: Sets how strictly duplicates are removed. For example, a setting of 100 means only exact duplicates are removed. A setting of 90 means proteins with more than 90% similarity are considered duplicates. The recommended setting is 100.
- clustering-percentage: Used to cluster proteins into groups after duplicates are removed. This should be set lower than the deduplication threshold to ensure variation between the training and testing datasets. The recommended setting is 90.
- k_partitions: Determines how many partitions are created for k-fold cross validation, which impacts the amount of training data and training time. Recommended values are 5 or 10.
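The constraints described above can be checked before a pipeline run. The following is a minimal sketch, not PhageScanner's own code: `check_clustering` and `assign_clusters_to_partitions` are hypothetical helper names, the configuration is assumed to be already parsed into a plain dictionary, and the greedy largest-first placement is only one plausible way to keep whole clusters inside a single partition.

```python
def check_clustering(section):
    """Return a list of problems found in a parsed `clustering` section."""
    problems = []
    dedup = section.get("deduplication-threshold")
    cluster = section.get("clustering-percentage")
    k = section.get("k_partitions")
    for key, value in (("deduplication-threshold", dedup),
                       ("clustering-percentage", cluster)):
        if not isinstance(value, (int, float)) or not 0 < value <= 100:
            problems.append(f"{key} must be a percentage in (0, 100]")
    # Only compare the two thresholds when both are valid percentages.
    if not problems and cluster >= dedup:
        problems.append(
            "clustering-percentage should be lower than deduplication-threshold"
        )
    if not isinstance(k, int) or k < 2:
        problems.append("k_partitions must be an integer of at least 2")
    return problems


def assign_clusters_to_partitions(cluster_sizes, k):
    """Greedily place whole clusters into k partitions, largest cluster first.

    Assumption: keeping a cluster in one partition prevents similar proteins
    from appearing in both the training and the testing data.
    """
    partitions = [[] for _ in range(k)]
    totals = [0] * k
    for cluster_id, size in sorted(cluster_sizes.items(), key=lambda item: -item[1]):
        target = totals.index(min(totals))  # currently smallest partition
        partitions[target].append(cluster_id)
        totals[target] += size
    return partitions


config = {"deduplication-threshold": 100, "clustering-percentage": 90, "k_partitions": 5}
print(check_clustering(config))  # []
```

With the recommended values the check returns an empty list; setting both thresholds to 90 would report that the clustering percentage is not lower than the deduplication threshold.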
The classes section defines which protein classes to predict and, through queries, which proteins belong to each class:
- Class Names: Each class must have a user-specified name like PVPs, Toxic, or DNAInvolved. This name is used during predictions to label the coding regions.
- Protein Queries: Include a uniprot and/or entrez query for each class to specify which proteins belong to that class. This is a common practice in modeling to ensure reproducibility.
- Flexibility: You can define anywhere from 2 to N classes, providing significant flexibility in how many types of proteins you can classify.
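The rules in this list (a name for every class, at least one query per class, and at least two classes) can be expressed as a short check. This is a sketch under stated assumptions: `check_classes` is a hypothetical helper, not part of PhageScanner, and the `classes` section is assumed to be already parsed into a list of dictionaries.

```python
def check_classes(classes):
    """Return a list of problems found in a parsed `classes` section."""
    problems = []
    if len(classes) < 2:
        problems.append("at least two classes are required")
    for index, entry in enumerate(classes):
        name = entry.get("name")
        if not name:
            problems.append(f"class #{index} has no name")
        # A class needs a uniprot query, an entrez query, or both.
        if not (entry.get("uniprot") or entry.get("entrez")):
            problems.append(
                f"class {name or index} needs a uniprot and/or entrez query"
            )
    return problems


classes = [
    {"name": "PVP", "uniprot": "capsid AND cc_subcellular_location: virion AND reviewed: true"},
    {"name": "non-PVP", "uniprot": "capsid NOT cc_subcellular_location: virion AND reviewed: true"},
]
print(check_classes(classes))  # []
```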
The models section details the models that will be trained for predicting each class:
- Model Name: User-specified; it should match one of the models available in PhageScanner.
- Model Info:
  - model_name: Specifies the model to use. It must be one supported by PhageScanner.
  - sequential: Indicates whether the model uses a 1D or 2D input vector. Set this to true for models like LSTM and CNN that require 2D inputs.
- Features: Defines which feature extractors to use for preparing the model's input data. For example, Dipeptide Composition (DPC) takes a gap_size parameter that sets the gap between the two amino acids of each dipeptide.
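To make the gap_size parameter concrete, here is a minimal sketch of a gapped dipeptide composition. It is an illustration, not PhageScanner's implementation: the function name `dipeptide_composition` is hypothetical, and the sketch assumes that gap_size counts the residues skipped between the two amino acids of a pair, which is consistent with 0 meaning regular DPC.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def dipeptide_composition(sequence, gap_size=0):
    """Frequency of residue pairs separated by `gap_size` skipped positions."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    step = gap_size + 1  # distance between the two residues of a pair
    total = 0
    for i in range(len(sequence) - step):
        pair = sequence[i] + sequence[i + step]
        if pair in counts:  # ignore pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[pair] / total if total else 0.0 for pair in DIPEPTIDES]


regular = dipeptide_composition("ACAC")             # pairs: AC, CA, AC
gapped = dipeptide_composition("ACAC", gap_size=1)  # pairs: AA, CC
print(len(regular))                      # 400
print(gapped[DIPEPTIDES.index("AA")])    # 0.5
```

Either way the output is a flat vector of 400 values, so a model configured with sequential set to false can consume it directly.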
The PhageScanner pipeline offers a variety of feature extractors, which makes it possible to test different features with the available models and to reproduce models from the scientific literature. The table below lists the available feature extractors. Feature extractors can be combined as shown in the example configuration. The vectors returned by each feature extractor are concatenated to create the final input vector for each model, so the effect of adding multiple features on the input size should be considered beforehand.
Name | Config Specifier | Vector Size | Description |
---|---|---|---|
Amino Acid Composition | AAC | 20 | Calculates the frequency of each amino acid in the protein. |
Dipeptide Composition | DPC | 400 (20x20) (optional parameter: gap_size) | Analyzes the frequency of two adjacent amino acids, optionally separated by a specified gap. |
Tripeptide Composition | TPC | 8000 (20x20x20) | Measures the frequency of tripeptides, i.e., sequences of three consecutive amino acids. |
Isoelectric Point | ISO | 1 | Determines the isoelectric point of a protein, indicating at which pH it carries no net charge. |
Pseudo-Amino Acid Composition | PSEUDOAAC | Variable (20+ custom features) | Calculates a modified amino acid composition that includes additional biological information. |
Atomic Composition | ATC | 5 (C, H, N, O, S) | Computes the frequency of various atoms present within the protein's amino acids. |
Composition, Transition, Distribution | CTD | Variable | Analyzes the composition, transition, and distribution of amino acid properties in a protein. |
Protein Chemical Properties | PCP | 10+ based on specific properties | Extracts a profile of chemical properties like hydrophobicity and charge from the protein. |
Chemical Features | CHEMFEATURES | 16+ based on complex attributes | Obtains a wide range of chemical and physical properties of the protein. |
Sequential One-Hot | SEQUENTIALONEHOT | 2000x20 (or sequence length x 20) | Generates a one-hot encoded matrix representing the protein sequence for deep learning models. |
Hashed Sequence | HASH_SEQ | Configurable (e.g., 50) | Creates a hashed feature vector of the protein sequence to manage large data efficiently. |
Protein Sequence | PROTEINSEQ | Sequence length | Returns the raw sequence of the protein without any transformation. |
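Because the vectors are concatenated, the input size grows with every feature that is added. The sketch below estimates that size for the fixed-size extractors in the table above and shows a simple amino acid composition as one concrete 20-value vector. The names `FIXED_VECTOR_SIZES`, `concatenated_size`, and `amino_acid_composition` are hypothetical and written for illustration; extractors with variable sizes are left out because the table gives no exact size for them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Fixed sizes copied from the table; variable-size extractors are omitted.
FIXED_VECTOR_SIZES = {"AAC": 20, "DPC": 400, "TPC": 8000, "ISO": 1, "ATC": 5}


def concatenated_size(feature_names):
    """Length of the final vector when the named features are concatenated."""
    return sum(FIXED_VECTOR_SIZES[name] for name in feature_names)


def amino_acid_composition(sequence):
    """Fraction of each standard amino acid in the sequence (20 values)."""
    counted = [sequence.count(residue) for residue in AMINO_ACIDS]
    total = sum(counted)
    return [count / total if total else 0.0 for count in counted]


print(concatenated_size(["AAC", "DPC", "ATC"]))  # 425
print(concatenated_size(["AAC", "TPC"]))         # 8020
```

Adding TPC alone raises the input from a few hundred values to several thousand, which is the kind of effect worth weighing against the number of training proteins available.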
PhageScanner supports several models, with potential for extension through community contributions or specific requests. Each model listed below includes its configuration file specifier along with a description and the library used.
Name | Model Specifier | Architecture Description |
---|---|---|
Support Vector Machine | SVM | Library: Scikit-Learn. Standardizes data using StandardScaler followed by SVC with probability estimates. |
Feedforward Neural Network | FFNN | Library: Keras. Consists of an input layer, multiple dense layers with ReLU activation and dropout, ending in a softmax output layer. |
Multinomial Naive Bayes | MULTINAIVEBAYES | Library: Scikit-Learn. Utilizes MultinomialNB suitable for classification with discrete features. |
Gradient Boosting | GRADBOOST | Library: Scikit-Learn. Utilizes GradientBoostingClassifier with specific settings for robust modeling. |
Random Forest | RANDOMFOREST | Library: Scikit-Learn. Uses RandomForestClassifier with controlled tree depth and random state. |
BLAST | BLAST | Library: Custom BLASTWrapper. Compares sequences against a database for classification. |
Logistic Regression | LOGREG | Library: Scikit-Learn. Implements LogisticRegression for binary classification with 'ovr' setting. |
Convolutional Neural Network | CNN | Library: Keras. Features convolutional layers, pooling, batch normalization, and dense layers with a softmax output. |
Recurrent Neural Network | RNN | Library: Keras. Includes an `LSTM |