Skip to content

Windowing fails for datasets where contigs lack variant sites #1268

Open
@percyfal

Description

@percyfal

The function sgkit.window_by_position fails if the input dataset has contigs that lack variant sites, throwing an IndexError.

Traceback
IndexError                                Traceback (most recent call last)
Cell In[36], line 1
----> 1 ds = sgkit.window_by_position(ds, size=10)

File ~/dev/sgkit-dev/sgkit/sgkit/window.py:218, in window_by_position(ds, size, step, offset, variant_contig, variant_position, window_start_position, merge)
    214 positions = ds[variant_position].values
    215 window_start_positions = (
    216     ds[window_start_position].values if window_start_position is not None else None
    217 )
--> 218 return _window_per_contig(
    219     ds,
    220     variant_contig,
    221     merge,
    222     _get_windows_by_position,
    223     size,
    224     step,
    225     offset,
    226     positions,
    227     window_start_positions,
    228 )

File ~/dev/sgkit-dev/sgkit/sgkit/window.py:320, in _window_per_contig(ds, variant_contig, merge, windowing_fn, *args, **kwargs)
    318 contig_window_stops = []
    319 for i, contig in enumerate(get_contigs(ds)):
--> 320     starts, stops = windowing_fn(
    321         contig, contig_bounds[i], contig_bounds[i + 1], *args, **kwargs
    322     )
    323     contig_window_starts.append(starts)
    324     contig_window_stops.append(stops)

File ~/dev/sgkit-dev/sgkit/sgkit/window.py:429, in _get_windows_by_position(contig, start, stop, size, step, offset, positions, window_start_positions)
    426 contig_pos = positions[start:stop]
    427 if window_start_positions is None:
    428     # TODO: position is 1-based (or use offset?)
--> 429     pos_starts = np.arange(offset, contig_pos[-1] + offset, step=step)
    430     window_starts = np.searchsorted(contig_pos, pos_starts) + start
    431     window_stops = np.searchsorted(contig_pos, pos_starts + size) + start

IndexError: index -1 is out of bounds for axis 0 with size 0

This will often be the case for fragmented references with many contigs. I provide a minimum working example for reference.

Minimum working example

The VCF below is a modified version of the file sgkit/tests/data/sample.vcf where I added the contig key to the header:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##contig=<ID=19,length=1000>
##contig=<ID=20,length=40000>
##contig=<ID=21,length=10000>
##contig=<ID=X,length=1000>
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
19	111	.	A	C	9.6	.	.	GT:HQ	0|0:10,15	0|0:10,10	0/1:3,3
19	112	.	A	G	10	.	.	GT:HQ	0|0:10,10	0|0:10,10	0/1:3,3
20	4370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	7330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3:.,.
20	10696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4:.,.
20	30237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:.:56,60	0|0:48:4:51,51	0/0:61:2:.,.
20	34567	microsat1	G	GA,GAC	50	PASS	NS=3;DP=9;AA=G;AN=6;AC=3,1	GT:GQ:DP	0/1:.:4	0/2:17:2	./.:40:3
20	35237	.	T	.	.	.	.	GT	0/0	0|0	./.
X	10	rsTest	AC	A,ATG,C	10	PASS	.	GT	0	0/1	0|2

Convert to Zarr

bgzip sample.extrachrom.vcf 
tabix sample.extrachrom.vcf.gz
vcf2zarr convert sample.extrachrom.vcf.gz sample.extrachrom.vcz

and run code below to reproduce

import sgkit
import numpy as np

ds = sgkit.load_dataset("sample.extrachrom.vcz")
ds["sample_cohort"] = xr.DataArray(np.full(ds.dims['samples'], 0), dims="samples")
ds = sgkit.window_by_position(ds, size=10)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions