Description
Hi, I was comparing some different VCF extraction packages and have been really impressed by the extraction speed in the h5 output of vcf_to_hdf5. However, I was doing some basic tests just to check outputs were the same from each method and have found some strange behaviour in the scikit-allel output.
Specifically I found that my GQ values >128 appear to have a negative offset...
Here is an example plotting GQ against sample DP and you can see the (orange) scikit-allel result has this strange offset:
For example, row 265 of the VCF is:
chr17 1005289 . A G 431133 PASS AC=6;AF=0.823;AN=6;DP=109787;FS=0;MQ=249.96;MQRankSum=5.615;QD=3.94;ReadPosRankSum=3.074;SOR=0.7 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP 1/1:0,53:1:53:156:PASS:.:.:244,159,0:206,156,0 1/1:0,40:1:40:117:PASS:.:.:205,120,0:167.3,117.3,0 1/1:0,50:1:50:147:PASS:.:.:235,150,0:197.4,147.4,0
You can see that the correct GQ for sample 1 is 156, but the value I get through scikit-allel is -100:
allel.vcf_to_hdf5('fpath.vcf.gz', 'fpath.h5', fields='*', overwrite=True)
callset = h5py.File(fpath.h5,mode='r')
GQ = callset['calldata/GQ']
GQ_0 = GQs[:,0]
print(GQ_0[265])
>>> -100
I found 271 cases where the GQ is offset and in each cases it is offset by either 28 (256) or 211 (2048) which can't be a coincidence.
Any idea what might be causing this issue?