Open
Description
The function sgkit.window_by_position
fails if the input dataset has contigs that lack variant sites, throwing an IndexError
.
Traceback
IndexError Traceback (most recent call last)
Cell In[36], line 1
----> 1 ds = sgkit.window_by_position(ds, size=10)
File ~/dev/sgkit-dev/sgkit/sgkit/window.py:218, in window_by_position(ds, size, step, offset, variant_contig, variant_position, window_start_position, merge)
214 positions = ds[variant_position].values
215 window_start_positions = (
216 ds[window_start_position].values if window_start_position is not None else None
217 )
--> 218 return _window_per_contig(
219 ds,
220 variant_contig,
221 merge,
222 _get_windows_by_position,
223 size,
224 step,
225 offset,
226 positions,
227 window_start_positions,
228 )
File ~/dev/sgkit-dev/sgkit/sgkit/window.py:320, in _window_per_contig(ds, variant_contig, merge, windowing_fn, *args, **kwargs)
318 contig_window_stops = []
319 for i, contig in enumerate(get_contigs(ds)):
--> 320 starts, stops = windowing_fn(
321 contig, contig_bounds[i], contig_bounds[i + 1], *args, **kwargs
322 )
323 contig_window_starts.append(starts)
324 contig_window_stops.append(stops)
File ~/dev/sgkit-dev/sgkit/sgkit/window.py:429, in _get_windows_by_position(contig, start, stop, size, step, offset, positions, window_start_positions)
426 contig_pos = positions[start:stop]
427 if window_start_positions is None:
428 # TODO: position is 1-based (or use offset?)
--> 429 pos_starts = np.arange(offset, contig_pos[-1] + offset, step=step)
430 window_starts = np.searchsorted(contig_pos, pos_starts) + start
431 window_stops = np.searchsorted(contig_pos, pos_starts + size) + start
IndexError: index -1 is out of bounds for axis 0 with size 0
This will often be the case for fragmented references with many contigs. I provide a minimum working example for reference.
Minimum working example
The VCF below is a modified version of the file sgkit/tests/data/sample.vcf
where I added the contig
key to the header:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##contig=<ID=19,length=1000>
##contig=<ID=20,length=40000>
##contig=<ID=21,length=10000>
##contig=<ID=X,length=1000>
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
19 111 . A C 9.6 . . GT:HQ 0|0:10,15 0|0:10,10 0/1:3,3
19 112 . A G 10 . . GT:HQ 0|0:10,10 0|0:10,10 0/1:3,3
20 4370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 7330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3:.,.
20 10696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4:.,.
20 30237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:.:56,60 0|0:48:4:51,51 0/0:61:2:.,.
20 34567 microsat1 G GA,GAC 50 PASS NS=3;DP=9;AA=G;AN=6;AC=3,1 GT:GQ:DP 0/1:.:4 0/2:17:2 ./.:40:3
20 35237 . T . . . . GT 0/0 0|0 ./.
X 10 rsTest AC A,ATG,C 10 PASS . GT 0 0/1 0|2
Convert to Zarr
bgzip sample.extrachrom.vcf
tabix sample.extrachrom.vcf.gz
vcf2zarr convert sample.extrachrom.vcf.gz sample.extrachrom.vcz
and run code below to reproduce
import sgkit
import numpy as np
ds = sgkit.load_dataset("sample.extrachrom.vcz")
ds["sample_cohort"] = xr.DataArray(np.full(ds.dims['samples'], 0), dims="samples")
ds = sgkit.window_by_position(ds, size=10)