PhageScanner Configuration Files

Dreycey Albin edited this page Jun 7, 2024 · 21 revisions

Overview

Each pipeline in PhageScanner uses a YAML configuration file, which keeps the system modular and extensible. Users can tailor it to detect whichever classes of proteins they need. For instance, to predict toxic proteins and also locate Phage Virion Proteins (PVPs), set up one configuration file for each category; during prediction, each configuration file points to its corresponding model files. To focus on a single class, such as PVP or toxic proteins, one configuration file is enough.

This guide provides an overview of the configuration files and their practical applications. Example configurations can be found in the repository here: PhageScanner/configs

Basic Example

Below is a basic configuration example for predicting whether a protein is a Phage Virion Protein (PVP). It's important to note that a single configuration file can serve all three pipelines.

clustering:
  deduplication-threshold: 100
  clustering-percentage: 90
  k_partitions: 5 # number of partitions in k-fold cross validation
classes:
  - name: PVP
    uniprot: "capsid AND cc_subcellular_location: virion AND reviewed: true"
    entrez: "bacteriophage[Organism] AND (((shaft[Title] OR sheath[Title]) AND tail[Title]) OR head-tail[Title] OR tail fiber[Title] OR portal[Title] OR minor tail[Title] OR major tail[Title] OR baseplate[Title] OR minor capsid[Title] OR major capsid[Title])"
  - name: non-PVP
    uniprot: "capsid NOT cc_subcellular_location: virion AND reviewed: true"
    entrez: "bacteriophage[Organism] NOT (bacteriophage[Organism] AND (((shaft[Title] OR sheath[Title]) AND tail[Title]) OR head-tail[Title] OR tail fiber[Title] OR portal[Title] OR minor tail[Title] OR major tail[Title] OR baseplate[Title] OR minor capsid[Title] OR major capsid[Title]))"
models:
  - name: "PVP-SVM (SVM)"
    model_info:
      model_name: "SVM"
      sequential: false
    features: # Options: "AAC", "DPC", "ISO", "PSEUDOAAC", "ATC", "CTD"
      - name: "DPC"
        parameters: # DPC must have 'gap_size' parameter. 0 for regular DPC
          gap_size: 0
      - name: "AAC"
      - name: "ATC"
      - name: "CTD"
      - name: "PCP"

In this YAML file, there are three main sections: (1) clustering, (2) classes, and (3) models. The clustering section controls how proteins are deduplicated and grouped after being downloaded. The classes section defines the target classes for the model and specifies which proteins belong to each class using UniProt and/or Entrez queries. The models section lists which models to train and which feature extractors to use; the example here is an SVM model, named "PVP-SVM (SVM)", that uses the five feature types listed under features.

Clustering Section

This section explains how proteins are grouped together after being downloaded to avoid duplicates in the training and testing datasets. Here’s what needs to be specified:

  • deduplication-threshold: Sets how strictly duplicates are removed. For example, a setting of 100 means only exact duplicates are removed. A setting of 90 means proteins with more than 90% similarity are considered duplicates. The recommended setting is 100.
  • clustering-percentage: The similarity threshold used to cluster proteins into groups after duplicates are removed. Set it lower than the deduplication threshold so that the training and testing datasets do not share near-identical sequences. The recommended setting is 90.
  • k_partitions: Determines how many partitions are created for k-fold cross validation, which impacts the amount of training data and training time. Recommended values are 5 or 10.
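
To make the role of k_partitions concrete, the sketch below distributes whole clusters across k partitions so that sequences from the same cluster never appear in both the training and testing data. It is a minimal illustration under stated assumptions, not PhageScanner's implementation: the helper names `assign_partitions` and `fold_split`, the greedy size balancing, and the protein IDs are all hypothetical.

```python
from typing import Dict, List, Tuple


def assign_partitions(clusters: List[List[str]], k: int) -> Dict[int, List[str]]:
    """Assign whole clusters to k partitions, largest clusters first (hypothetical helper)."""
    partitions: Dict[int, List[str]] = {i: [] for i in range(k)}
    for cluster in sorted(clusters, key=len, reverse=True):
        # Place each cluster in the currently smallest partition to keep sizes balanced.
        smallest = min(partitions, key=lambda i: len(partitions[i]))
        partitions[smallest].extend(cluster)
    return partitions


def fold_split(partitions: Dict[int, List[str]], test_fold: int) -> Tuple[List[str], List[str]]:
    """Use one partition for testing and the remaining k-1 partitions for training."""
    test = list(partitions[test_fold])
    train = [p for i, members in partitions.items() if i != test_fold for p in members]
    return train, test


# Toy clusters of protein IDs (made-up identifiers).
clusters = [["p1", "p2", "p3"], ["p4"], ["p5", "p6"], ["p7"], ["p8"], ["p9", "p10"]]
partitions = assign_partitions(clusters, k=5)
train, test = fold_split(partitions, test_fold=0)
```

With k set to 5, each fold trains on roughly four fifths of the data; with 10, on roughly nine tenths, at the cost of twice as many training runs.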

Classes Section

This section defines which protein classes to predict and, through database queries, which proteins belong to each class:

  • Class Names: Each class must have a user-specified name like PVPs, Toxic, or DNAInvolved. This name is used during predictions to label the coding regions.
  • Protein Queries: Include a uniprot and/or entrez query for each class to specify which proteins belong to that class. Recording the queries in the configuration file makes the training dataset reproducible.
  • Flexibility: You can define anywhere from 2 to N classes, providing significant flexibility in how many types of proteins you can classify.
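
The rules above can be sketched as a small checker that runs on the classes section once it has been loaded into Python lists and dictionaries. This is an illustrative sketch only: `validate_classes` is a hypothetical helper rather than a PhageScanner function, and the uniqueness check on class names is an assumption.

```python
from typing import Any, Dict, List


def validate_classes(classes: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in a 'classes' section (hypothetical checker)."""
    problems: List[str] = []
    if len(classes) < 2:
        problems.append("at least two classes are required")
    names = [c.get("name") for c in classes]
    if len(set(names)) != len(names):
        # Assumption: names label predictions, so they should not collide.
        problems.append("class names must be unique")
    for c in classes:
        if not c.get("name"):
            problems.append("a class is missing its name")
        if not (c.get("uniprot") or c.get("entrez")):
            problems.append(f"class {c.get('name')!r} needs a uniprot and/or entrez query")
    return problems


classes = [
    {"name": "PVP", "uniprot": "capsid AND cc_subcellular_location: virion AND reviewed: true"},
    {"name": "non-PVP", "uniprot": "capsid NOT cc_subcellular_location: virion AND reviewed: true"},
]
print(validate_classes(classes))  # prints []
```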

Models Section

This section details the models that will be trained for predicting each class:

  • Model Name: A user-specified label for the model, such as "PVP-SVM (SVM)" in the example above. The underlying model type is selected separately through model_name under Model Info.
  • Model Info:
    • model_name: Specifies the model to use. It must be one supported by PhageScanner.
    • sequential: Indicates whether the model uses a 1D or 2D input vector. Set this to true for models like LSTM and CNN that require 2D inputs.
  • Features: Defines which feature extractors prepare the model's input data. Some extractors take parameters; for example, Dipeptide Composition (DPC) takes a gap_size parameter that sets how many residues separate the two amino acids of each pair (0 for adjacent residues).
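
As a rough illustration of what gap_size means for DPC, the sketch below counts residue pairs separated by a configurable gap and returns a 400-value frequency vector. It is not PhageScanner's extractor: the function name `dipeptide_composition`, the normalization by the number of counted pairs, and the skipping of non-standard residues are assumptions made for this example.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def dipeptide_composition(sequence: str, gap_size: int = 0) -> List[float]:
    """Frequency of each residue pair separated by `gap_size` positions (400 values)."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    step = gap_size + 1  # gap_size=0 pairs each residue with its direct neighbor
    total = 0
    for i in range(len(sequence) - step):
        pair = sequence[i] + sequence[i + step]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in pairs]


regular = dipeptide_composition("ACDA", gap_size=0)  # counts AC, CD, DA
gapped = dipeptide_composition("ACDA", gap_size=1)   # counts A_D, C_A
```

The vector length stays at 400 regardless of gap_size; only the pairs being counted change.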

Feature Extractors in PhageScanner

The PhageScanner pipeline offers a variety of feature extractors, providing flexibility to test different features with the available models and facilitating the reproduction of models used in the scientific literature. Below is a table of available feature extractors in PhageScanner. Each feature extractor can be combined as shown in the examples. It is important to note that vectors returned from each feature extractor are concatenated to create the final vector used for each model. Thus, the impact of adding multiple features should be carefully considered beforehand.
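
The concatenation described above can be sketched as follows. The extractor functions here (`aac`, `dpc`) and the `build_vector` helper are simplified stand-ins written for this example, not PhageScanner's own classes, and the test sequence is arbitrary.

```python
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def aac(sequence: str) -> List[float]:
    """Amino acid composition: 20 frequencies."""
    n = len(sequence) or 1
    return [sequence.count(a) / n for a in AMINO_ACIDS]


def dpc(sequence: str) -> List[float]:
    """Dipeptide composition of adjacent residues: 400 frequencies."""
    n = max(len(sequence) - 1, 1)
    return [
        sum(sequence[i : i + 2] == a + b for i in range(len(sequence) - 1)) / n
        for a in AMINO_ACIDS
        for b in AMINO_ACIDS
    ]


def build_vector(sequence: str, extractors: List[Callable[[str], List[float]]]) -> List[float]:
    """Concatenate the output of each extractor, in the order listed."""
    vector: List[float] = []
    for extract in extractors:
        vector.extend(extract(sequence))
    return vector


vector = build_vector("MKTAYIAKQR", [aac, dpc])
print(len(vector))  # prints 420
```

Each added extractor widens the input: AAC plus DPC gives 420 columns, and adding TPC would contribute a further 8000.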

| Name | Config Specifier | Vector Size | Description |
| --- | --- | --- | --- |
| Amino Acid Composition | AAC | 20 | Calculates the frequency of each amino acid in the protein. |
| Dipeptide Composition | DPC | 400 (20x20) (optional parameter: gap_size) | Analyzes the frequency of two adjacent amino acids, optionally separated by a specified gap. |
| Tripeptide Composition | TPC | 8000 (20x20x20) | Measures the frequency of tripeptides, i.e., sequences of three consecutive amino acids. |
| Isoelectric Point | ISO | 1 | Determines the isoelectric point of a protein, indicating at which pH it carries no net charge. |
| Pseudo-Amino Acid Composition | PSEUDOAAC | Variable (20+ custom features) | Calculates a modified amino acid composition that includes additional biological information. |
| Atomic Composition | ATC | 5 (C, H, N, O, S) | Computes the frequency of various atoms present within the protein's amino acids. |
| Composition, Transition, Distribution | CTD | Variable | Analyzes the composition, transition, and distribution of amino acid properties in a protein. |
| Protein Chemical Properties | PCP | 10+ based on specific properties | Extracts a profile of chemical properties like hydrophobicity and charge from the protein. |
| Chemical Features | CHEMFEATURES | 16+ based on complex attributes | Obtains a wide range of chemical and physical properties of the protein. |
| Sequential One-Hot | SEQUENTIALONEHOT | 2000x20 (or sequence length x 20) | Generates a one-hot encoded matrix representing the protein sequence for deep learning models. |
| Hashed Sequence | HASH_SEQ | Configurable (e.g., 50) | Creates a hashed feature vector of the protein sequence to manage large data efficiently. |
| Protein Sequence | PROTEINSEQ | Sequence length | Returns the raw sequence of the protein without any transformation. |
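
To show what a 2D input for sequential models looks like, here is a minimal sketch of one-hot encoding a protein into a fixed-size matrix. The `one_hot` function is hypothetical, and the zero-padding of short sequences, truncation of long ones, and all-zero rows for unknown residues are assumptions; PhageScanner's SEQUENTIALONEHOT extractor may handle these cases differently.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(sequence: str, max_len: int = 2000) -> List[List[int]]:
    """Encode a protein as a max_len x 20 matrix of 0s and 1s."""
    index = {a: i for i, a in enumerate(AMINO_ACIDS)}
    # Start from an all-zero matrix, so short sequences are zero-padded.
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence[:max_len]):  # truncate long sequences
        if residue in index:  # unknown residues stay as all-zero rows
            matrix[pos][index[residue]] = 1
    return matrix


matrix = one_hot("ACD", max_len=5)  # 5 rows, 20 columns
```

A matrix like this is the kind of input that requires sequential: true, whereas extractors such as AAC or DPC return flat 1D vectors.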

Model Options in PhageScanner

PhageScanner supports several models, with potential for extension through community contributions or specific requests. Each model listed below includes its configuration file specifier along with a description and the library used.

| Name | Model Specifier | Architecture Description |
| --- | --- | --- |
| Support Vector Machine | SVM | Library: Scikit-Learn. Standardizes data using StandardScaler followed by SVC with probability estimates. |
| Feedforward Neural Network | FFNN | Library: Keras. Consists of an input layer, multiple dense layers with ReLU activation and dropout, ending in a softmax output layer. |
| Multinomial Naive Bayes | MULTINAIVEBAYES | Library: Scikit-Learn. Utilizes MultinomialNB suitable for classification with discrete features. |
| Gradient Boosting | GRADBOOST | Library: Scikit-Learn. Utilizes GradientBoostingClassifier with specific settings for robust modeling. |
| Random Forest | RANDOMFOREST | Library: Scikit-Learn. Uses RandomForestClassifier with controlled tree depth and random state. |
| BLAST | BLAST | Library: Custom BLASTWrapper. Compares sequences against a database for classification. |
| Logistic Regression | LOGREG | Library: Scikit-Learn. Implements LogisticRegression for binary classification with 'ovr' setting. |
| Convolutional Neural Network | CNN | Library: Keras. Features convolutional layers, pooling, batch normalization, and dense layers with a softmax output. |
| Recurrent Neural Network | RNN | Library: Keras. Includes an `LSTM |