LABEL’s purpose is to quickly (relative to building an MSA and tree), automatically, and correctly assign clades or lineages to nucleotide sequences. Automated lineage assignment has applications in surveillance, research, and high-throughput database annotation. Additional information is on the LABEL website or you can read the manuscript.
LABEL v0.7.0, updated 2025
Samuel S. Shepard ([email protected]), Centers for Disease Control & Prevention
Usage:
LABEL [-E C_OPT] [-W WRK_PATH|-O OUT_PATH] [-TRD|-S] [-L LIN_PATH] <nts.fasta> <project> <Module:H5,H9,etc.>
-T Do TRAINING again instead of using classifier files.
-E SGE clustering option. Use 1 or 2 for SGE with array jobs, else local.
-R No RECURSIVE prediction. Limits scope, useful with -L option.
-D No DELETION of extra intermediary files.
-S Show available protein modules.
-W Web-server mode: requires ABSOLUTE path to WRITABLE working directory.
-O Output directory path, do not use with web mode.
Example: ./LABEL -C gisaid_H5N1.fa Bird_Flu H5
-
LABEL takes FASTA formatted nucleotide sequences. The FASTA may be single or multi-line and may contain any number of sequences. Extra sequences with redundant headers are removed (first-read, first kept)! Commas and apostrophes are removed from headers while internal spaces are underlined.
-
LABEL generates re-annotated FASTA sequences, scoring data, tab-delimited files, and miscellaneous text files. LABEL's output is limited to text. LABEL's output is limited to a specified output directory (or to a default working directory within the package) and to the current working directory of the calling user.
File | Type | Description |
---|---|---|
PROJ_final.tab | Standard. | Tab-delimited headers & predicted clades. |
PROJ_final.txt | Standard. | A prettier output of the above. |
LEVEL_trace.tab | Standard. | Table of HMM scores at each level, suitable for visualization in R. |
LEVEL_result.tab | Standard. | For the current prediction level, tab-delimited headers & predicted clades. |
LEVEL_result.txt | Standard. | For the current prediction level, A prettier output of the above. |
FASTA/ | Standard. | Folder containing fasta files and newick trees. |
FASTA/PROJ_predictions.fas | Standard. | Query sequence file with predictions added like: _{PRED:CLAD} |
FASTA/PROJ_reannotated.fas | Default. | Query file with annotations replaced with predicted ones, ordered by clade. |
FASTA/PROJ_clade_CLAD.fas | Standard. | The re-annotated file partitioned into separate clade files. |
c-*/ | Standard. | Clade/lineage subfolder for the hierarchical predictions. |
The project name is denoted "PROJ", the lineage or clade is called "CLAD", and the module of interest as “MOD”.
LABEL modules are merely directories within the LABEL_RES/training_data folder and contain all associated pHMMs as well as SVM training data. Extensions such as x-filter.txt control against inappropriate data input.
Most of the these modules were trained by Sam Shepard and/or Ujwal Bagal.
Module | Description |
---|---|
B_HAv2019, B_HAv2017 | Influenza B hemagglutinin clade modules, trained in 2019 and 2017 |
B_NAv2016 | Influenza B neuraminidase clade module, trained in 2016 |
B_PB2v2016, B_PB1v2016, B_PAv2016, B_NPv2016, B_MPv2016, B_NSv2016 | Influenza B internal gene segment lineage modules, trained in 2016 |
†H5v2023 | A provisional module for a proposed update to the H5 nomenclature, trained in 2023 |
H5v2015, H5v2013, H5v2011 | Influenza A hemagglutinin modules for H5N1, for nomenclatures from 2015, c. 2013, & c. 2011 |
†H7v2013 | Influenza A hemagglutinin module for H7 subtype, trained in 2013 |
H9v2011 | Influenza A hemagglutinin module for H9N2 described in the LABEL manuscript |
H1pdm09v2019, H1pdm09v2018 | Influenza A H1N1pdm09 classification modules, trained in 2019 and 2018 |
H3v2019, H3v2016b, H3v2016a, H3v2016 | Influenza A hemagglutinin modules for H3 subtype classification, trained in 2016 and 2019 |
irma-FLU, irma-FLU-v2 | IRMA modules for influenza virus classification |
irma-FLU-HA, irma-FLU-NA, irma-FLU-OG, irma-FLU-OG-v2 | IRMA modules for influenza hemagglutinin, neuraminadase, and other influenza genes. Note: HA, NA and OG are part of IRMA's secondary two-stage LABEL modules. |
irma-FLU-HE | IRMA module for hemagglutinin-esterase (flu C,D). |
Modules may contain a release.txt
file with addition information. For up-to-date module availability, use: ./LABEL -S
† Provisional or experimental
We recommend a single multi-core machine with no fewer than 2 cores (8 or more threads work best) and at least 2 GB of RAM. LABEL runtime is impacted by the number of cores available on a machine. In addition software requirements include:
- Linux (RHEL8 or later GLIBC), MacOS 10.14 (intel) or MacOS 11 (arm64)
- BASH version 3+
- Standard utilities: sleep, cut, paste, jobs, zip, env, cat, cp, getopts.
- Perl version 5.16 or later
- Standard includes: Getopt::Long, File::Basename
Download the latest archive via our releases page. Use of wget
or curl
for downloads is recommended for MacOS to preserve functionality.
-
Unzip the archive containing LABEL.
-
Move the package to your desired location and add the folder to your
PATH
- Note: LABEL_RES and LABEL must be in the same folder.
-
LABEL is now installed. To test it from the package folder, execute:
./LABEL LABEL_RES/training_data/H9v2011/H9v2011_downsample.fa test_project H9v2011
Simply run:
docker run --rm -itv $(pwd):/data ghcr.io/cdcgov/label:latest LABEL # label args
We aggregate and provide builds of 3rd party software for execution at runtime with LABEL. You may install or obtain your own copies and LABEL will detect them, but the user will be required to test for compatibility.
- GNU Parallel
- Artifacts:
parallel
- Requires: system Perl
- Purpose: parallelization
- License: GPL v3
- Artifacts:
- SHOGUN version 1.1.0 (2.1+ is not compatible)
- Artifacts:
shogun
(cmdline_static) - Provided architectures: linux/x86_64, linux/aarch64, apple/x86_64 (arm64 via Rosetta2)
- Purpose: executes the SVM decision phase.
- License: GPL v3
- Artifacts:
- SAM version 3.5
- Artifacts:
align2model
,hmmscore
,modelfromalign
- Provided architectures: linux/x86_64, linux/aarch64, apple/universal (arm64 + intel)
- Purpose: build HMM profiles, score sequences for evaluation
- License: Custom academic/government, not-for-profit, redistributed with permission
- Artifacts:
Warning
Note that SAM is redistributed with permission for LABEL but its terms exclude commerical use without a license. If you are a commercial entity, you might need to reach out to the authors to obtain their custom license.
Lineage Assignment By Extended Learning (LABEL) uses hidden Markov model (HMM) profiles of clade alignments--or groups of clades--to analyze query sequences and then classify them via machine learning techniques. The HMM scoring step is performed via SAM. Prediction is performed hierarchically--usually starting out at a more general level (e.g., a groups of clades) and going to a very specific terminal level (a particular clade). This roughly corresponds to the hierarchical structure of phylogenetic trees and the H5N1 nomenclature system. The prediction phase of LABEL is done via support vector machines (SVM) using the free SHOGUN Machine Learning Toolbox v1.1.0 (multi-class GMNP SVM with polynomial kernel of degree 20, <www.shogun-toolbox.org>).
Training is performed using a combination of support scripts and by manually applied expert knowledge. Generally, a curated and annotated multiple sequence alignment is used along with the createLABELlevel.sh script.
For direct correspondence on the project, feel free to contact: Samuel S. Shepard, Centers for Disease Control and Prevention or reach out to other contributors.
This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.
The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later. This source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version. This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details. You should have received a copy of the Apache Software License along with this program. If not, see: http://www.apache.org/licenses/LICENSE-2.0.html. The source code forked from other open source projects will inherit its license.
This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer. For more information about CDC's privacy policy, please visit http://www.cdc.gov/other/privacy.html.
Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.
All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.
This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.
Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.