PhageScanner Configuration Files

Overview

Each pipeline in PhageScanner reads a YAML configuration file, which keeps the system modular and extensible. Users can tailor it to detect different classes of proteins as needed. For instance, to predict toxic proteins and also locate Phage Virion Proteins (PVPs), set up one configuration file for each category; during prediction, each configuration file points to its corresponding model files. Focusing on a single class, such as PVPs or toxic proteins, takes just one configuration file.

This guide provides an overview of the configuration files and their practical applications. Example configurations can be found in the repository here: PhageScanner/configs

Basic Example

Below is a basic configuration example for predicting whether a protein is a Phage Virion Protein (PVP). Note that a single configuration file can serve all three pipelines.

clustering:
  deduplication-threshold: 100
  clustering-percentage: 90
  k_partitions: 5 # number of partitions in k-fold cross validation
classes:
  - name: PVP
    uniprot: "capsid AND cc_subcellular_location: virion AND reviewed: true"
    entrez: "bacteriophage[Organism] AND (((shaft[Title] OR sheath[Title]) AND tail[Title]) OR head-tail[Title] OR tail fiber[Title] OR portal[Title] OR minor tail[Title] OR major tail[Title] OR baseplate[Title] OR minor capsid[Title] OR major capsid[Title])"
  - name: non-PVP
    uniprot: "capsid NOT cc_subcellular_location: virion AND reviewed: true"
    entrez: "bacteriophage[Organism] NOT (bacteriophage[Organism] AND (((shaft[Title] OR sheath[Title]) AND tail[Title]) OR head-tail[Title] OR tail fiber[Title] OR portal[Title] OR minor tail[Title] OR major tail[Title] OR baseplate[Title] OR minor capsid[Title] OR major capsid[Title]))"
models:
  - name: "PVP-SVM (SVM)"
    model_info:
      model_name: "SVM"
      sequential: false
    features: # Options: "AAC", "DPC", "ISO", "PSEUDOAAC", "ATC", "CTD"
      - name: "DPC"
        parameters: # DPC must have 'gap_size' parameter. 0 for regular DPC
          gap_size: 0
      - name: "AAC"
      - name: "ATC"
      - name: "CTD"
      - name: "PCP"

This YAML file has three main sections: (1) clustering, (2) classes, and (3) models. The clustering section controls how proteins are deduplicated and grouped after being downloaded. The classes section defines the target classes for the model and specifies which proteins belong to each class using UniProt and/or Entrez queries. The models section lists which models to train and which feature extractors each one uses. The example here trains one SVM model, named "PVP-SVM (SVM)", on the five feature types listed under features.
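
The three-section layout described above can be checked mechanically before a pipeline is run. The following is a minimal sketch, not part of PhageScanner: `check_config` and `REQUIRED_SECTIONS` are hypothetical names, and the configuration is written as a Python dictionary on the assumption that this is what a YAML parser would return for the basic example.

```python
REQUIRED_SECTIONS = ("clustering", "classes", "models")


def check_config(config):
    """Return a list of layout problems; an empty list means none were found."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in config]
    for cls in config.get("classes", []):
        name = cls.get("name", "<unnamed>")
        if "name" not in cls:
            problems.append("class without a name")
        if "uniprot" not in cls and "entrez" not in cls:
            problems.append(f"class {name}: needs a uniprot and/or entrez query")
    for model in config.get("models", []):
        name = model.get("name", "<unnamed>")
        if "model_name" not in model.get("model_info", {}):
            problems.append(f"model {name}: model_info.model_name is missing")
        if not model.get("features"):
            problems.append(f"model {name}: no features listed")
    return problems


# Dictionary equivalent of the basic example (entrez queries omitted for brevity).
config = {
    "clustering": {
        "deduplication-threshold": 100,
        "clustering-percentage": 90,
        "k_partitions": 5,
    },
    "classes": [
        {"name": "PVP",
         "uniprot": "capsid AND cc_subcellular_location: virion AND reviewed: true"},
        {"name": "non-PVP",
         "uniprot": "capsid NOT cc_subcellular_location: virion AND reviewed: true"},
    ],
    "models": [
        {"name": "PVP-SVM (SVM)",
         "model_info": {"model_name": "SVM", "sequential": False},
         "features": [{"name": "DPC", "parameters": {"gap_size": 0}},
                      {"name": "AAC"}]},
    ],
}

print(check_config(config))  # []
```

A check like this only covers the layout; it does not confirm that a model or feature specifier is one PhageScanner actually supports.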

Clustering Section

This section explains how proteins are grouped together after being downloaded to avoid duplicates in the training and testing datasets. Here’s what needs to be specified:

deduplication-threshold: Sets how strictly duplicates are removed. For example, a setting of 100 means only exact duplicates are removed. A setting of 90 means proteins with more than 90% similarity are considered duplicates. The recommended setting is 100.
clustering-percentage: Used to cluster proteins into groups after duplicates are removed. This should be set lower than the deduplication threshold to ensure variations between training and testing datasets. The recommended setting is 90.
k_partitions: Determines how many partitions are created for k-fold cross validation, which impacts the amount of training data and training time. Recommended values are 5 or 10.
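
To illustrate why clustering matters for cross validation, here is a small sketch of one common approach: whole clusters are assigned to partitions, so proteins that are similar enough to share a cluster never end up on both sides of a train/test split. This is an assumption about how the clusters could be used, not a description of PhageScanner's internals, and `partition_clusters` is a hypothetical helper.

```python
def partition_clusters(clusters, k_partitions):
    """Spread whole clusters over k partitions, largest cluster first.

    Each cluster goes into the partition that currently holds the fewest
    proteins, which keeps partition sizes roughly balanced.
    """
    partitions = [[] for _ in range(k_partitions)]
    for cluster in sorted(clusters, key=len, reverse=True):
        smallest = min(partitions, key=len)
        smallest.extend(cluster)
    return partitions


# Hypothetical clusters of protein identifiers after deduplication.
clusters = [["p1", "p2", "p3"], ["p4", "p5"], ["p6"], ["p7"], ["p8"]]
partitions = partition_clusters(clusters, k_partitions=3)

# In k-fold cross validation each partition serves once as the test set.
for i, test_set in enumerate(partitions):
    train_set = [p for j, part in enumerate(partitions) if j != i for p in part]
    print(f"fold {i}: {len(train_set)} train, {len(test_set)} test")
```

With a larger `k_partitions`, each fold trains on a larger share of the data but more models must be trained, which is the trade-off behind the recommended values of 5 or 10.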

Classes Section

This section defines which protein classes to predict and, through database queries, which proteins belong to each class:

Class Names: Each class must have a user-specified name like PVPs, Toxic, or DNAInvolved. This name is used during predictions to label the coding regions.
Protein Queries: Include a uniprot and/or entrez query for each class to specify which proteins belong to that class. This is a common practice in modeling to ensure reproducibility.
Flexibility: You can define anywhere from 2 to N classes, providing significant flexibility in how many types of proteins you can classify.

Models Section

This section details the models that will be trained for predicting each class:

Model Name: A user-specified label for the trained model, such as "PVP-SVM (SVM)" in the basic example. The underlying model type is selected separately by model_name under Model Info.
Model Info:
- model_name: Specifies the model to use. It must be one supported by PhageScanner.
- sequential: Indicates whether the model uses a 1D or 2D input vector. Set this to true for models like LSTM and CNN that require 2D inputs.
Features: Defines which feature extractors prepare the model's input data. For example, Dipeptide Composition (DPC) takes a gap_size parameter that sets how many residues lie between the two amino acids of each counted pair.
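
The two settings mentioned above, gap_size and sequential, can be illustrated with a short sketch. It assumes that DPC counts are normalized by the number of counted pairs and that sequential inputs are padded or truncated to a fixed length; neither detail is confirmed here, and `gapped_dpc` and `one_hot` are hypothetical helpers rather than PhageScanner functions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 entries


def gapped_dpc(sequence, gap_size=0):
    """Flat (1D) vector: frequency of residue pairs separated by gap_size residues."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    step = gap_size + 1
    total = 0
    for i in range(len(sequence) - step):
        pair = sequence[i] + sequence[i + step]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def one_hot(sequence, max_length):
    """2D matrix (max_length x 20) of the kind a sequential model would consume."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_length)]
    for row, aa in zip(matrix, sequence):  # zip stops at max_length
        if aa in index:
            row[index[aa]] = 1
    return matrix


flat = gapped_dpc("ACAC", gap_size=0)   # counts AC, CA, AC
gapped = gapped_dpc("ACAC", gap_size=1)  # counts A.A and C.C
matrix = one_hot("AC", max_length=4)
print(len(flat), len(matrix), len(matrix[0]))  # 400 4 20
```

With gap_size 0 the pairs are adjacent residues, which is regular DPC. A model with sequential set to false would receive a flat vector like `flat`, while one with sequential set to true would receive a matrix like `matrix`.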

Feature Extractors in PhageScanner

The PhageScanner pipeline offers a variety of feature extractors, providing flexibility to test different features with the available models and facilitating the reproduction of models used in the scientific literature. Below is a table of available feature extractors in PhageScanner. Each feature extractor can be combined as shown in the examples. It is important to note that vectors returned from each feature extractor are concatenated to create the final vector used for each model. Thus, the impact of adding multiple features should be carefully considered beforehand.

Name	Config Specifier	Vector Size	Description
Amino Acid Composition	`AAC`	20	Calculates the frequency of each amino acid in the protein.
Dipeptide Composition	`DPC`	400 (20x20) (optional parameter: `gap_size`)	Analyzes the frequency of two adjacent amino acids, optionally separated by a specified gap.
Tripeptide Composition	`TPC`	8000 (20x20x20)	Measures the frequency of tripeptides, i.e., sequences of three consecutive amino acids.
Isoelectric Point	`ISO`	1	Determines the isoelectric point of a protein, indicating at which pH it carries no net charge.
Pseudo-Amino Acid Composition	`PSEUDOAAC`	Variable (20+ custom features)	Calculates a modified amino acid composition that includes additional biological information.
Atomic Composition	`ATC`	5 (C, H, N, O, S)	Computes the frequency of various atoms present within the protein's amino acids.
Composition, Transition, Distribution	`CTD`	Variable	Analyzes the composition, transition, and distribution of amino acid properties in a protein.
Protein Chemical Properties	`PCP`	10+ based on specific properties	Extracts a profile of chemical properties like hydrophobicity and charge from the protein.
Chemical Features	`CHEMFEATURES`	16+ based on complex attributes	Obtains a wide range of chemical and physical properties of the protein.
Sequential One-Hot	`SEQUENTIALONEHOT`	2000x20 (or sequence length x 20)	Generates a one-hot encoded matrix representing the protein sequence for deep learning models.
Hashed Sequence	`HASH_SEQ`	Configurable (e.g., 50)	Creates a hashed feature vector of the protein sequence to manage large data efficiently.
Protein Sequence	`PROTEINSEQ`	Sequence length	Returns the raw sequence of the protein without any transformation.
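
Because the vectors from each extractor are concatenated, the final input size is the sum of the sizes in the table. The sketch below shows this for AAC and TPC using a generic k-mer frequency function. It is an illustration under assumptions, not PhageScanner's code: `kmer_composition`, `EXTRACTORS`, and `build_vector` are hypothetical names, and the `features` list mirrors the shape of the features entries in the configuration file.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k):
    """Frequency of every possible k-mer over the 20 standard amino acids."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = max(len(sequence) - k + 1, 0)
    for i in range(windows):
        kmer = sequence[i:i + k]
        if kmer in counts:  # ignore windows with non-standard residues
            counts[kmer] += 1
    return [counts[m] / windows if windows else 0.0 for m in kmers]


# Hypothetical mapping from config specifier to extractor function.
EXTRACTORS = {
    "AAC": lambda seq: kmer_composition(seq, 1),  # 20 values
    "TPC": lambda seq: kmer_composition(seq, 3),  # 8000 values
}


def build_vector(sequence, features):
    """Concatenate the output of each configured extractor, in order."""
    vector = []
    for feature in features:
        vector.extend(EXTRACTORS[feature["name"]](sequence))
    return vector


features = [{"name": "AAC"}, {"name": "TPC"}]
vector = build_vector("MKTAYIAKQR", features)
print(len(vector))  # 8020
```

Adding TPC to AAC grows the input from 20 to 8020 values, almost all of them zero for a short protein, which is why the choice of features deserves some thought before training.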

Model Options in PhageScanner

PhageScanner supports several models, with potential for extension through community contributions or specific requests. Each model listed below includes its configuration file specifier along with a description and the library used.

Name	Model Specifier	Architecture Description
Support Vector Machine	`SVM`	Library: Scikit-Learn. Standardizes data using `StandardScaler` followed by `SVC` with probability estimates.
Feedforward Neural Network	`FFNN`	Library: Keras. Consists of an input layer, multiple dense layers with ReLU activation and dropout, ending in a softmax output layer.
Multinomial Naive Bayes	`MULTINAIVEBAYES`	Library: Scikit-Learn. Utilizes `MultinomialNB` suitable for classification with discrete features.
Gradient Boosting	`GRADBOOST`	Library: Scikit-Learn. Utilizes `GradientBoostingClassifier` with specific settings for robust modeling.
Random Forest	`RANDOMFOREST`	Library: Scikit-Learn. Uses `RandomForestClassifier` with controlled tree depth and random state.
BLAST	`BLAST`	Library: Custom BLASTWrapper. Compares sequences against a database for classification.
Logistic Regression	`LOGREG`	Library: Scikit-Learn. Implements `LogisticRegression` for binary classification with 'ovr' setting.
Convolutional Neural Network	`CNN`	Library: Keras. Features convolutional layers, pooling, batch normalization, and dense layers with a softmax output.
Recurrent Neural Network	`RNN`	Library: Keras. Includes an `LSTM
