Commit e557a43

Merge branch 'main' of github.com:davetang/learning_vcf_file
2 parents 01696e1 + 2b1d002 commit e557a43

8 files changed, +404 -50 lines changed

README.md

+266-46
Large diffs are not rendered by default.

csq/README.md

+80
@@ -0,0 +1,80 @@
1+
## README
2+
3+
Problem:
4+
5+
> However, current predictors analyse variants as isolated events, which can
6+
lead to incorrect predictions when adjacent variants alter the same codon, or
7+
when a frame-shifting indel is followed by a frame-restoring indel.
8+
9+
BCFtools/csq is a fast program for haplotype-aware consequence calling which
10+
can take into account known phase.
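
The same-codon problem quoted above can be illustrated with a toy example. This is only a sketch, not how `bcftools csq` is implemented: the reference codon, the two SNVs, and the helper functions `apply_snv` and `translate` are hypothetical and chosen purely for illustration.

```bash
#!/usr/bin/env bash
# Toy illustration only; names and variants here are hypothetical.

# translate CODON: minimal lookup covering just the codons used below.
translate() {
    case $1 in
        CAG) echo Gln ;;
        TAG) echo Stop ;;
        CGG) echo Arg ;;
        TGG) echo Trp ;;
        *)   echo Unknown ;;
    esac
}

# apply_snv CODON POS ALT: replace the 1-based base POS of CODON with ALT.
apply_snv() {
    local codon=$1 pos=$2 alt=$3
    printf '%s%s%s\n' "${codon:0:pos-1}" "$alt" "${codon:pos}"
}

ref=CAG
# Each variant considered in isolation.
v1=$(apply_snv "$ref" 1 T)    # C>T at codon position 1
v2=$(apply_snv "$ref" 2 G)    # A>G at codon position 2
# Both variants applied to the same haplotype.
both=$(apply_snv "$v1" 2 G)

echo "variant 1 alone: $ref>$v1 ($(translate "$ref")>$(translate "$v1"))"
echo "variant 2 alone: $ref>$v2 ($(translate "$ref")>$(translate "$v2"))"
echo "both in phase:   $ref>$both ($(translate "$ref")>$(translate "$both"))"
```

Considered in isolation, the first variant looks like a stop gain (Gln to Stop) and the second like a missense change (Gln to Arg). When both sit on the same haplotype, the codon becomes TGG (Trp), a single missense change.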
11+
12+
There are several popular existing programs for variant annotation including:
13+
14+
1. Ensembl Variant Effect Predictor (VEP)
15+
2. SnpEff
16+
3. ANNOVAR
17+
18+
but they do not take phasing into account.
19+
20+
## BCFtools/csq
21+
22+
`bcftools csq` requires a phased VCF, a GFF3 file with gene predictions, and a
23+
reference FASTA file.
24+
25+
```
26+
About: Haplotype-aware consequence caller.
27+
Usage: bcftools csq [OPTIONS] in.vcf
28+
29+
Required options:
30+
-f, --fasta-ref FILE Reference file in fasta format
31+
-g, --gff-annot FILE GFF3 annotation file
32+
33+
CSQ options:
34+
-B, --trim-protein-seq INT Abbreviate protein-changing predictions to max INT aminoacids
35+
-c, --custom-tag STRING Use this tag instead of the default BCSQ
36+
-l, --local-csq Localized predictions, consider only one VCF record at a time
37+
-n, --ncsq INT Maximum number of per-haplotype consequences to consider for each site [15]
38+
-p, --phase a|m|r|R|s How to handle unphased heterozygous genotypes: [r]
39+
a: take GTs as is, create haplotypes regardless of phase (0/1 -> 0|1)
40+
m: merge *all* GTs into a single haplotype (0/1 -> 1, 1/2 -> 1)
41+
r: require phased GTs, throw an error on unphased het GTs
42+
R: create non-reference haplotypes if possible (0/1 -> 1|1, 1/2 -> 1|2)
43+
s: skip unphased hets
44+
Options:
45+
-e, --exclude EXPR Exclude sites for which the expression is true
46+
--force Run even if some sanity checks fail
47+
-i, --include EXPR Select sites for which the expression is true
48+
--no-version Do not append version and command line to the header
49+
-o, --output FILE Write output to a file [standard output]
50+
-O, --output-type b|u|z|v|t[0-9] b: compressed BCF, u: uncompressed BCF, z: compressed VCF
51+
v: uncompressed VCF, t: plain tab-delimited text output, 0-9: compression level [v]
52+
-r, --regions REGION Restrict to comma-separated list of regions
53+
-R, --regions-file FILE Restrict to regions listed in a file
54+
--regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1]
55+
-s, --samples -|LIST Samples to include or "-" to apply all variants and ignore samples
56+
-S, --samples-file FILE Samples to include
57+
-t, --targets REGION Similar to -r but streams rather than index-jumps
58+
-T, --targets-file FILE Similar to -R but streams rather than index-jumps
59+
--targets-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [0]
60+
--threads INT Use multithreading with <int> worker threads [0]
61+
-v, --verbose INT Verbosity level 0-2 [1]
62+
63+
Example:
64+
bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf
65+
66+
# GFF3 annotation files can be downloaded from Ensembl. e.g. for human:
67+
ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/
68+
ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/
69+
```
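
A minimal sketch of an invocation, using only options listed in the usage text above. The file names are hypothetical placeholders, and the command runs only if `bcftools` and the input files are present.

```bash
#!/usr/bin/env bash
# Hypothetical input/output names; substitute your own files.
ref=ref.fa                  # reference FASTA (-f)
gff=annotation.gff3.gz      # GFF3 gene predictions (-g)
in_vcf=phased.vcf.gz        # phased VCF to annotate
out_vcf=phased.csq.vcf.gz   # annotated output (-o)

# Assemble the command as an array so paths with spaces stay intact.
# -p r (the default) raises an error on unphased heterozygous genotypes.
cmd=(bcftools csq -f "$ref" -g "$gff" -p r -O z -o "$out_vcf" "$in_vcf")

# Print the command, then run it only if bcftools and the inputs exist.
printf '%s\n' "${cmd[*]}"
if command -v bcftools >/dev/null 2>&1 && [[ -f $ref && -f $gff && -f $in_vcf ]]; then
    "${cmd[@]}"
else
    echo "skipping run: bcftools or input files not found" >&2
fi
```

If the VCF contains unphased heterozygous genotypes, one of the other `-p` modes from the usage text (for example `-p a` or `-p s`) may be more appropriate than the default.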
70+
71+
The program begins by parsing gene predictions in the GFF3 file, then streams
72+
through the VCF file using a fast region lookup at each site to find overlaps
73+
with regions of supported genomic types (exons, CDS, UTRs or general
74+
transcripts). For more details, read the paper (see [Further
75+
reading](#further-reading)).
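
The overlap step described above can be sketched as a naive linear scan. This is an illustration under stated assumptions, not the fast region lookup that `bcftools csq` uses: the feature table and the `overlaps` helper are hypothetical.

```bash
#!/usr/bin/env bash
# Naive linear scan for illustration; it does not reproduce the fast
# region lookup used by bcftools csq.

# Hypothetical features: chrom, start, end (1-based, inclusive), type.
features=$(cat <<'EOF'
1 100 200 exon
1 120 180 CDS
1 300 400 exon
2 100 200 CDS
EOF
)

# overlaps CHROM POS: print the type of every feature that contains POS.
overlaps() {
    awk -v chrom="$1" -v pos="$2" \
        '$1 == chrom && $2 <= pos && pos <= $3 { print $4 }' <<<"$features"
}

overlaps 1 150   # position inside the first exon and the CDS
overlaps 1 250   # position between features, prints nothing
```

A real implementation would index the features once (for example with an interval tree or sorted bins) so that each VCF site does not require scanning every feature.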
76+
77+
## Further reading
78+
79+
* [BCFtools/csq: haplotype-aware variant consequences
80+
](https://academic.oup.com/bioinformatics/article/33/13/2037/3000373)
2.36 MB
Binary file not shown.
Binary file not shown.
7.02 MB
Binary file not shown.
Binary file not shown.

readme.Rmd

+57-2
@@ -41,7 +41,7 @@ It is also relatively straightforward to compile on Linux (if your system has al
4141
```bash
4242
dir=$HOME/tools
4343

44-
ver=1.15
44+
ver=1.18
4545
for tool in htslib bcftools samtools; do
4646
check=${tool}
4747
if [[ ${tool} == htslib ]]; then
@@ -474,7 +474,62 @@ bcftools view eg/aln.bt.vcf.gz | perl -nle 'BEGIN { srand(1984) } if (/^#/){ pri
474474

475475
### Split an annotation field
476476

477-
The [split-vep](https://samtools.github.io/bcftools/howtos/plugin.split-vep.html) plugin can be used to split a structured field. However, `split-vep` was written to work with VCF files created by `bcftools csq` or [VEP](https://github.com/Ensembl/ensembl-vep). It is possible to get it working with [SnpEff](https://github.com/pcingola/SnpEff), another popular variant annotation tool. with some modifications to the VCF header.
477+
The [split-vep](https://samtools.github.io/bcftools/howtos/plugin.split-vep.html) plugin can be used to split a structured field. `split-vep` was written to work with VCF files created by [VEP](https://github.com/Ensembl/ensembl-vep) or `bcftools csq`.
478+
479+
```{bash engine.opts='-l'}
480+
bcftools +split-vep -h || true
481+
```
482+
483+
#### VEP
484+
485+
An [example VCF file](https://github.com/davetang/vcf_example) that was annotated with VEP is available as `eg/S1.haplotypecaller.filtered_VEP.ann.vcf.gz`. To list the annotation fields use `-l`.
486+
487+
```{bash engine.opts='-l'}
488+
bcftools +split-vep -l eg/S1.haplotypecaller.filtered_VEP.ann.vcf.gz | head
489+
```
490+
491+
Use `-f` to print selected fields in a custom format; variants without consequences are excluded.
492+
493+
494+
```{bash engine.opts='-l'}
495+
bcftools +split-vep -f '%CHROM:%POS,%ID,%Consequence\n' eg/S1.haplotypecaller.filtered_VEP.ann.vcf.gz | head
496+
```
497+
498+
Limit output to missense or more severe variants.
499+
500+
```{bash engine.opts='-l'}
501+
bcftools +split-vep -f '%CHROM:%POS,%ID,%Consequence\n' -s worst:missense+ eg/S1.haplotypecaller.filtered_VEP.ann.vcf.gz | head
502+
```
503+
504+
#### BCFtools csq
505+
506+
An [example VCF file](https://github.com/davetang/vcf_example) annotated with BCFtools csq is available as `eg/S1.haplotypecaller.filtered.phased.csq.vcf.gz`. `csq` writes its annotations to the `INFO/BCSQ` tag, so pass `-a BCSQ` to split-vep. To list the annotation fields, use `-l`.
507+
508+
```{bash engine.opts='-l'}
509+
bcftools +split-vep -a BCSQ -l eg/S1.haplotypecaller.filtered.phased.csq.vcf.gz
510+
```
511+
512+
Use `-f` to print selected fields in a custom format; variants without consequences are excluded.
513+
514+
```{bash engine.opts='-l'}
515+
bcftools +split-vep -a BCSQ -f '%CHROM:%POS,%ID,%Consequence\n' eg/S1.haplotypecaller.filtered.phased.csq.vcf.gz | head
516+
```
517+
518+
The `-d` (`--duplicate`) option outputs the annotation for each transcript/allele on a new line. Without `-d`, multiple annotations for a record appear on the same line:
519+
520+
```{bash engine.opts='-l'}
521+
bcftools +split-vep -a BCSQ -f '%transcript,%Consequence\n' eg/S1.haplotypecaller.filtered.phased.csq.vcf.gz | head
522+
```
523+
524+
With `-d`, each transcript/allele annotation is printed on its own line.
525+
526+
```{bash engine.opts='-l'}
527+
bcftools +split-vep -a BCSQ -d -f '%transcript,%Consequence\n' eg/S1.haplotypecaller.filtered.phased.csq.vcf.gz | head
528+
```
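
To make the effect of `-d` concrete, the following sketch splits a BCSQ-style value by hand. It is not the plugin's implementation: the `bcsq` value is made up, the `split_bcsq` helper is hypothetical, and the subfield order (consequence, gene, transcript, biotype, ...) is an assumption that should be checked against the `BCSQ` header line of your own VCF.

```bash
#!/usr/bin/env bash
# Hypothetical BCSQ value holding two comma-separated consequences.
# Assumed subfield order: consequence|gene|transcript|biotype|...
bcsq='missense|GENE1|TX1|protein_coding|+|10A>10T|100G>A,synonymous|GENE1|TX2|protein_coding|+|5L|100G>A'

# split_bcsq VALUE: print "transcript,Consequence", one line per entry.
split_bcsq() {
    local entry
    local -a entries fields
    # Entries for different transcripts/alleles are comma-separated.
    IFS=',' read -r -a entries <<<"$1"
    for entry in "${entries[@]}"; do
        # Subfields within one entry are pipe-separated.
        IFS='|' read -r -a fields <<<"$entry"
        printf '%s,%s\n' "${fields[2]}" "${fields[0]}"
    done
}

split_bcsq "$bcsq"
```

Each comma-separated entry becomes its own output line, which mirrors what `-d` does for the `%transcript,%Consequence` format string used above.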
529+
530+
#### SnpEff
531+
532+
It is possible to use the split-vep plugin with [SnpEff](https://github.com/pcingola/SnpEff), another popular variant annotation tool, after some modifications to the VCF header.
478533

479534
SnpEff provides annotations with the `ANN` tag.
480535

script/tools_for_readme.sh

+1-2
@@ -22,7 +22,7 @@ if [[ ! -d ${install_path} ]]; then
2222
fi
2323

2424
for tool in htslib bcftools; do
25-
ver=1.16
25+
ver=1.18
2626
check=${tool}
2727
if [[ ${tool} == htslib ]]; then
2828
check=bgzip
@@ -60,4 +60,3 @@ fi
6060

6161
>&2 echo Done
6262
exit 0
63-
