Commit 699fb18

first draft of anglerfish run
1 parent 796b726 commit 699fb18

3 files changed: 54 additions, 3 deletions

docs/anglerfish-run.md

Lines changed: 54 additions & 3 deletions
@@ -4,6 +4,22 @@ This describes the the `anglerfish run` mode of running this tool, which expects
 
 ## Use cases
 
+The primary use of Anglerfish is to detect issues in Illumina sequencing pools using a method (ONT sequencing) that is independent of Illumina. Specific use cases:
+
+- When samples are pooled evenly, detect outliers
+- Detect barcoding issues or potential sample mix-ups
+- Downstream of Anglerfish: identify samples by mapping
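
As an illustration of the first use case, here is a minimal Python sketch that flags samples whose read count deviates strongly from the pool median. It is not part of Anglerfish: the function name `flag_pooling_outliers`, the `fold` threshold and the read counts are all hypothetical and only show the idea.

```python
from statistics import median


def flag_pooling_outliers(read_counts, fold=2.0):
    """Return samples whose read count is more than `fold`-fold away from the pool median.

    `read_counts` maps sample name -> number of demultiplexed reads.
    The `fold` threshold is an arbitrary assumption, not an Anglerfish setting.
    """
    med = median(read_counts.values())
    if med == 0:
        return {}
    outliers = {}
    for sample, count in read_counts.items():
        ratio = count / med
        if ratio > fold or ratio < 1 / fold:
            outliers[sample] = round(ratio, 2)
    return outliers


# Hypothetical per-sample read counts from an evenly pooled library
counts = {"sample1": 1000, "sample2": 950, "sample3": 1100, "sample4": 200}
print(flag_pooling_outliers(counts))
```

A median-based ratio is used here because a single badly pooled sample shifts the median less than it shifts the mean.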
+
+```{figure} figure1.png
+:alt: A figure with six sub-plots, three per row
+:label: figure1
+:width: 70%
+:align: center
+
+Figure 1. From [poster2020](./AGBT_poster_20200214.pdf). Top row: We compared a pool of 40 Illumina barcodes sequenced on a MinION device and demultiplexed using Anglerfish with the output from an Illumina sequencer. The abundances are comparable.
+Bottom row: Downstream use cases of Anglerfish: detecting library insert sizes, detecting pooling errors, and mapping demultiplexed data to reference genomes.
+```
+
 ## Output
 
 Example of file output from `anglerfish run` with a single setup pool (as opposed to a [complex](#mixed-setup-pools) one) and without specifying an `--out_fastq` option, thus generating a default name for the output folder.
@@ -14,12 +30,23 @@ anglerfish_run_YYYY_MM_DD_HHMMSS
 ├── anglerfish_stats.json
 ├── anglerfish_stats.txt
 ├── index_len(indexlength).fasta
-└── index_len(indexlength).paf
+├── index_len(indexlength).paf
+├── sample1.fastq.gz
+└── sample2.fastq.gz
 ```
 
 The basic operation of Anglerfish is to map the input reads to a template of the adaptors, `index_len(indexlength).fasta`, using
 minimap2. The alignments are written to `index_len(indexlength).paf`.
 
+```{figure} figure2.png
+:alt: A figure showing a simple schematic of how Anglerfish uses read mapping to identify barcodes
+:label: figure2
+:width: 70%
+:align: center
+
+Figure 2. An example of how read mapping in Anglerfish works. Adapter templates for I7 and I5 map to the ends of the read "5b42"; the Illumina barcodes are read from the "N" gap in the templates.
+```
+
 Anglerfish reports the stats of the run to a report called `anglerfish_stats.txt`, with the same numbers available in a machine-readable JSON format (`anglerfish_stats.json`). Let's look at a few fields from this report and number the lines:
 
 ```
@@ -37,10 +64,11 @@ Anglerfish reports the stats of the run to a report called `anglerfish_stats.txt
 - 03: Each adapter type has its own section in the header
 - 04: Any alignment reported by minimap2 under the [parameters](https://github.com/NationalGenomicsInfrastructure/anglerfish/blob/34ff1667d65281694e664bd48f53fa780f2075ce/anglerfish/demux/demux.py#L59) it is given
 - 06: Reads matching the adapter1-insert-adapter2 template (even partially)
-- 08-09: Any [other matches](#uncategorized-alignments)
+- 08-09: Any reads falling outside the adapter1-insert-adapter2 expectation. One reason for this could be incomplete [splitting](https://web.archive.org/web/20250207143034/https://nanoporetech.com/document/kit-14-device-and-informatics#introduction-to-read-splitting) of chimeric reads by the sequencing software. Anglerfish will not resolve such reads, and these cases have not been studied by the Anglerfish authors.
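
The alignment counts described in the list above come from the PAF file written by minimap2. The following is a minimal sketch of reading the 12 mandatory PAF columns and counting adapter alignments per read. It assumes nothing about how Anglerfish itself categorizes reads; the read and adapter names in the example lines are made up.

```python
from collections import Counter
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory, tab-separated columns of a PAF line."""
    query_name: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_len: int
    target_start: int
    target_end: int
    matches: int
    block_len: int
    mapq: int


def parse_paf_line(line):
    f = line.rstrip("\n").split("\t")
    # Columns 5 and 6 (strand, target name) are strings, the rest are integers
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                     int(f[6]), int(f[7]), int(f[8]), int(f[9]), int(f[10]), int(f[11]))


def alignments_per_read(lines):
    """Count how many adapter alignments each read (query) has."""
    return Counter(parse_paf_line(line).query_name for line in lines if line.strip())


# Hypothetical PAF lines: read1 has an adapter hit at both ends, read2 only at one end
example_paf = [
    "read1\t1000\t0\t60\t+\tadapter_i7\t70\t0\t60\t58\t60\t60",
    "read1\t1000\t940\t1000\t-\tadapter_i5\t70\t0\t60\t57\t60\t60",
    "read2\t800\t0\t60\t+\tadapter_i7\t70\t0\t60\t58\t60\t60",
]
print(alignments_per_read(example_paf))
```

In this toy example, a read with exactly two alignments is a candidate for the adapter1-insert-adapter2 layout, while reads with one or more than two alignments would need a closer look.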
 
 `anglerfish_dataframe.csv` is an attempt to summarize all index-level stats (samplesheet samples and unknown indexes) into a
 single "flat" table.
+Finally, the DNA inserts of each demultiplexed read are written to fastq files named after the samplesheet samples: `sample1.fastq.gz`, `sample2.fastq.gz`, etc.
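
A quick way to sanity-check the per-sample fastq output is to count the reads in each file. This sketch assumes standard four-line FASTQ records and gzip-compressed files directly inside the run folder; the helper names are hypothetical and not part of Anglerfish.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def reads_per_sample(run_dir):
    """Map each *.fastq.gz file name in `run_dir` to its read count."""
    return {
        p.name: count_fastq_reads(p)
        for p in sorted(Path(run_dir).glob("*.fastq.gz"))
    }
```

For example, `reads_per_sample("anglerfish_run_YYYY_MM_DD_HHMMSS")` would return a dictionary keyed by `sample1.fastq.gz`, `sample2.fastq.gz`, and so on, which can be compared against the numbers in `anglerfish_stats.txt`.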
 
 ## Mixed setup pools
 
@@ -63,6 +91,25 @@ The path the fastq files supports glob'ing, e.g. you can specify multiple files
 
 ## Multiple ONT barcodes
 
+Anglerfish supports ONT barcoding through the option `-n, --ont_barcodes`. It does, however, assume the directory structure set by the sequencing software MinKNOW, where the fastq files of the demultiplexed ONT barcodes are arranged into folders named `barcode01`, `barcode02`, ..., e.g.:
+
+```
+20250207_1125_1F_NNN12345_fa78ca0f/
+├── barcode01
+├── barcode02
+├── barcode03
+└── barcode04
+```
+
+Let's say `barcode01` and `barcode02` contain Illumina pools you are interested in demultiplexing using Anglerfish. The samplesheet you give might look something like this:
+
+```
+dual1,truseq_dual,TAATGCGC-CAGGACGT,/path/to/20250207_1125_1F_NNN12345_fa78ca0f/barcode01/*.fastq.gz
+dual2,truseq_dual,TAATGCGC-GTACTGAC,/path/to/20250207_1125_1F_NNN12345_fa78ca0f/barcode01/*.fastq.gz
+single1,truseq,GAAACCCT,/path/to/20250207_1125_1F_NNN12345_fa78ca0f/barcode02/*.fastq.gz
+single2,truseq,CTGACTGA,/path/to/20250207_1125_1F_NNN12345_fa78ca0f/barcode02/*.fastq.gz
+```
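
To make the relationship between samplesheet rows and ONT barcode folders concrete, here is a small sketch that parses the four-column samplesheet shown above and groups samples by the `barcodeNN` folder in their fastq path. It is only an illustration, not Anglerfish's own parser; the field names `index1` and `index2` are neutral labels chosen here, since the text does not state which half of a dual index is i7 or i5.

```python
import csv
import io
from collections import defaultdict
from pathlib import PurePosixPath

SAMPLESHEET = """\
dual1,truseq_dual,TAATGCGC-CAGGACGT,/path/to/20250207_1125_1F_NNN12345_fa78ca0f/barcode01/*.fastq.gz
dual2,truseq_dual,TAATGCGC-GTACTGAC,/path/to/20250207_1125_1F_NNN12345_fa78ca0f/barcode01/*.fastq.gz
single1,truseq,GAAACCCT,/path/to/20250207_1125_1F_NNN12345_fa78ca0f/barcode02/*.fastq.gz
single2,truseq,CTGACTGA,/path/to/20250207_1125_1F_NNN12345_fa78ca0f/barcode02/*.fastq.gz
"""


def parse_samplesheet(text):
    """Parse rows of: sample name, adaptor type, index (single or dual), fastq glob."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        name, adaptor, index, fastq_glob = row
        index1, _, index2 = index.partition("-")
        rows.append({
            "sample": name,
            "adaptor": adaptor,
            "index1": index1,
            "index2": index2 or None,  # None for single-index samples
            "fastq": fastq_glob,
        })
    return rows


def group_by_ont_barcode(rows):
    """Group sample names by the folder that holds their fastq files."""
    groups = defaultdict(list)
    for row in rows:
        groups[PurePosixPath(row["fastq"]).parent.name].append(row["sample"])
    return dict(groups)


print(group_by_ont_barcode(parse_samplesheet(SAMPLESHEET)))
```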
+
 ## Unknown indexes
 
 A list of indexes that do not match (within a set edit distance) any of the indexes in the samplesheet will be listed in descending order at the bottom of the [report](#output).
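
The idea of matching "within a set edit distance" can be sketched as follows. This uses a plain Levenshtein distance and a hypothetical `max_distance` default of 1; Anglerfish's actual distance metric and threshold may differ, so treat it as an illustration only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def classify_index(observed, known, max_distance=1):
    """Return the closest sample name, or None if the index counts as unknown.

    `known` maps sample name -> index sequence from the samplesheet.
    `max_distance` is an assumed threshold for this sketch.
    """
    best = min(known, key=lambda name: edit_distance(observed, known[name]))
    if edit_distance(observed, known[best]) <= max_distance:
        return best
    return None


known_indexes = {"single1": "GAAACCCT", "single2": "CTGACTGA"}
print(classify_index("GAAACCCA", known_indexes))  # one substitution away from single1
print(classify_index("TTTTTTTT", known_indexes))  # far from both, reported as unknown
```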
@@ -72,4 +119,8 @@ The results might be distorted when the input fastq file(s) contain a [mixed ada
 
 ## Lenient mode
 
-## Uncategorized alignments?
+A common category of human error when entering samplesheets for sequencing instruments is mixing up the orientations of the barcodes, e.g. erroneously reverse complementing the entire column of i5 indices. This is not helped by Illumina changing standards between [versions](https://web.archive.org/web/20230602174828/https://knowledge.illumina.com/software/general/software-general-reference_material-list/000001800) of its samplesheets.
+
+To correct for this error in Anglerfish and still be able to evaluate other pooling issues at the same time, an extra mode called lenient is available (option `-l, --lenient`). Essentially, this mode demultiplexes the samplesheet four times using all possible orientations of the indexes, i.e. all samples normal-I5 + normal-I7, all samples reverse comp-I5 + normal-I7, ....
+If one of the four runs yields 4 times more (adjustable using the option `-x, --lenient_factor`) demultiplexed reads than the second most abundant run, it will be the one reported in the `anglerfish_stats.txt` report.
+Additionally, you can force Anglerfish to run in one of the three alternative index orientations using `-p, --force_rc [i7|i5|i7+i5|original]`; this will, however, disable lenient mode.
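
The selection logic of lenient mode can be sketched in a few lines of Python. This is an illustration of the description above, not Anglerfish's code: the orientation labels are borrowed from the `--force_rc` choices, the read counts are invented, "4 times more" is interpreted as "at least 4 times as many", and returning `None` when no run wins is an assumption, since the text does not say what happens in that case.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (upper-case A/C/G/T/N)."""
    return seq.translate(COMPLEMENT)[::-1]


def index_orientations(i7, i5):
    """The four index orientations tried in lenient mode, keyed by which index is flipped."""
    return {
        "original": (i7, i5),
        "i7": (reverse_complement(i7), i5),
        "i5": (i7, reverse_complement(i5)),
        "i7+i5": (reverse_complement(i7), reverse_complement(i5)),
    }


def pick_lenient_winner(read_counts, lenient_factor=4.0):
    """Return the orientation whose read count beats the runner-up by `lenient_factor`.

    `read_counts` maps orientation label -> number of demultiplexed reads.
    Returns None when no orientation clearly wins (assumed behaviour for this sketch).
    """
    ranked = sorted(read_counts.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_n), (_, second_n) = ranked[0], ranked[1]
    if best_n > 0 and best_n >= lenient_factor * second_n:
        return best
    return None


# Hypothetical read counts from the four demultiplexing runs
runs = {"original": 100, "i7": 90, "i5": 5000, "i7+i5": 80}
print(pick_lenient_winner(runs))
```

With these invented counts, the run with the i5 column reverse complemented yields far more reads than any other, which is the signature of an i5 orientation mix-up in the samplesheet.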

docs/figure1.png

496 KB

docs/figure2.png

175 KB
