Skip to content

Negative GQ values after vcf_to_hdf5 #341

Open
@grlewis333

Description

@grlewis333

Hi, I was comparing some different VCF extraction packages and have been really impressed by the extraction speed in the h5 output of vcf_to_hdf5. However, I was doing some basic tests just to check outputs were the same from each method and have found some strange behaviour in the scikit-allel output.

Specifically I found that my GQ values >128 appear to have a negative offset...

Here is an example plotting GQ against sample DP and you can see the (orange) scikit-allel result has this strange offset:
image

For example, row 265 of the VCF is:
chr17 1005289 . A G 431133 PASS AC=6;AF=0.823;AN=6;DP=109787;FS=0;MQ=249.96;MQRankSum=5.615;QD=3.94;ReadPosRankSum=3.074;SOR=0.7 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP 1/1:0,53:1:53:156:PASS:.:.:244,159,0:206,156,0 1/1:0,40:1:40:117:PASS:.:.:205,120,0:167.3,117.3,0 1/1:0,50:1:50:147:PASS:.:.:235,150,0:197.4,147.4,0

You can see that the correct GQ for sample 1 is 156, but the value I get through scikit-allel is -100:

allel.vcf_to_hdf5('fpath.vcf.gz', 'fpath.h5', fields='*', overwrite=True)
callset = h5py.File(fpath.h5,mode='r')
GQ = callset['calldata/GQ']
GQ_0 = GQs[:,0]
print(GQ_0[265])

>>> -100

I found 271 cases where the GQ is offset and in each cases it is offset by either 28 (256) or 211 (2048) which can't be a coincidence.

Any idea what might be causing this issue?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions