Training and Testing ML models

Dreycey Albin edited this page Jun 4, 2024 · 14 revisions

Overview

The training pipeline trains the models after the database pipeline has downloaded and partitioned the protein classes. Its goal is to make it easy to test many different models, and different features for each model, through the configuration file. The input for PhageScanner is a genome, a set of genomes, or a metagenomic dataset, so the only inputs available to the models are the primary protein sequences derived from the DNA sequences. To expand what the models can learn, different features are extracted from each protein's primary sequence. PhageScanner has many built-in feature extractors, and these can be swapped and tested for any of the models specified.
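As a rough illustration of what a feature extractor does, the sketch below turns a primary protein sequence into a fixed-length numeric vector using amino acid composition (the "AAC" option named in the configuration). The function names `aac_features` and `extract_features` are hypothetical and are not taken from the PhageScanner code base; the sketch only assumes that each extractor maps a sequence to a list of numbers and that several extractors can be concatenated in a fixed order.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every
# feature vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_features(sequence: str) -> list[float]:
    """Amino acid composition: fraction of each standard residue.

    Hypothetical helper for illustration; non-standard residues are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def extract_features(sequence: str, extractors) -> list[float]:
    """Concatenate the outputs of several extractors, in the order given."""
    vector: list[float] = []
    for extractor in extractors:
        vector.extend(extractor(sequence))
    return vector


if __name__ == "__main__":
    features = extract_features("MKTAYIAKQR", [aac_features])
    print(len(features))  # one value per standard amino acid
```

Because the vector is built by concatenation, swapping or reordering extractors changes the layout of the model input, which is one reason the same extractor order has to be used consistently.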

Basic Usage

python phagescanner.py train -c Path/To/Config.yaml \
                             -o path/to/output/directory/ \
                             --database_csv_path <path to database CSV files> \
                             -v debug

Minimal Configuration File Requirements

While the same configuration file can be used for all three pipelines, this pipeline requires the following configuration file specifications:

  1. The models and feature extractors to be trained. The models trained here are later used by the prediction pipeline to annotate genomic data (i.e., assembled genomes and/or sequencing data), and that pipeline needs to know the location of the trained models (saved locally), the mapping from index to protein class, and the feature extractors being used (order matters).
  2. The protein classes. The protein classes are used to create a mapping between the output vector and the class name.
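The mapping between the output vector and the class name can be sketched as follows. This is a minimal illustration, not PhageScanner's implementation: it assumes that a class's index is its position in the `classes` list of the configuration, and the helper names `build_index_to_class` and `predicted_class` are hypothetical.

```python
def build_index_to_class(config: dict) -> dict[int, str]:
    """Map each output index to a class name.

    Assumption: the index is the position of the class in the config list.
    """
    return {i: entry["name"] for i, entry in enumerate(config["classes"])}


def predicted_class(output_vector, index_to_class: dict[int, str]) -> str:
    """Return the class name whose output score is highest."""
    best_index = max(range(len(output_vector)), key=output_vector.__getitem__)
    return index_to_class[best_index]


if __name__ == "__main__":
    # Same structure as the "classes" section of the configuration file.
    config = {"classes": [{"name": "PVP"}, {"name": "non-PVP"}]}
    index_to_class = build_index_to_class(config)
    print(predicted_class([0.2, 0.8], index_to_class))
```

Under this assumption, reordering the classes in the configuration would change which name each output position refers to.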

Example for Binary PVPs

classes:
  - name: PVP
  - name: non-PVP
models:
  - name: "PhageScanner (RNN)"
    model_info:
      model_name: "RNN"
      sequential: 3
    features: # Options: "AAC", "DPC", "ISO", "PSEUDOAAC", "ATC", "CTD"
      - name: "DPC"
        parameters: # DPC must have 'gap_size' parameter. 0 for regular DPC
          gap_size: 0
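To make the `gap_size` parameter concrete, here is a small sketch of gapped dipeptide composition as it is commonly defined: the frequency of each ordered pair of amino acids separated by `gap_size` residues, with `gap_size: 0` giving regular (adjacent-pair) DPC. The function `dpc_features` is a hypothetical stand-in and may differ from PhageScanner's own extractor, for example in how it normalizes counts or handles non-standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered amino acid pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dpc_features(sequence: str, gap_size: int = 0) -> list[float]:
    """Gapped dipeptide composition (illustrative sketch).

    Counts pairs (sequence[i], sequence[i + gap_size + 1]) and divides by
    the number of counted pairs. Pairs with non-standard residues are skipped.
    """
    sequence = sequence.upper()
    step = gap_size + 1
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - step):
        pair = sequence[i] + sequence[i + step]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    regular = dpc_features("ACAC", gap_size=0)  # pairs: AC, CA, AC
    gapped = dpc_features("ACAC", gap_size=1)   # pairs: AA, CC
    print(regular[DIPEPTIDES.index("AC")], gapped[DIPEPTIDES.index("AA")])
```

Each call returns a 400-element vector regardless of sequence length, which is what lets proteins of different lengths share one model input shape.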

Examples

  • Binary PVPs. This is the common case of classifying proteins as either Phage Virion Proteins (PVPs) or not.
python phagescanner.py train -c configs/binary_pvps_config.yaml -o binary_training_output -db ./binary_database/ -v debug
  • Multiclass PVPs. This example splits the Phage Virion Proteins (PVPs) into multiple classes for finer granularity than the binary approach.
python phagescanner.py train -c configs/multiclass_config.yaml -o training_output -db ./multiclass_database/ -v debug
  • Feature testing. This example tests different features with a baseline logistic regression model on the multiclass PVPs.
python phagescanner.py train -c configs/feature_testing_config.yaml -o feature_testing -db ./multiclass_database/ -v debug
  • Toxin proteins. This example shows the flexibility of the pipeline, which extends far beyond only PVPs. For example, one may want to know if a phage genome contains any toxins in order to develop a safe phage therapy cocktail.
python phagescanner.py train -c configs/phagetoxins_config.yaml -o phagetoxin_training -db ./toxin_database/ -v debug