Description
Hello,
I found this behavior to be a bit unintuitive... I would be happy to submit a PR to address it, but it's not necessarily clear what the right behavior is. Here is an MRE; the issue is that calculating diversity on a range between variant raises an exception, rather than just returning zero pairwise diversity:
import allel as al
g = al.GenotypeArray([[[0, 0], [0, 0]],
[[0, 0], [0, 1]],
[[0, 0], [1, 1]],
[[0, 1], [1, 1]],
[[1, 1], [1, 1]],
[[0, 0], [1, 2]],
[[0, 1], [1, 2]],
[[0, 1], [-1, -1]],
[[-1, -1], [-1, -1]]])
ac = g.count_alleles()
pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
al.sequence_diversity(pos, ac, start=21, stop=24)
Because the selected range resides between two variants, pos.locate_range()
errors out with a KeyError
. Here is the full traceback:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-2-0cac1b30194e> in <module>
12 pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
13
---> 14 al.sequence_diversity(pos, ac, start=21, stop=24)
~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/stats/diversity.py in sequence_diversity(pos, ac, start, stop, is_accessible)
266 # deal with subregion
267 if start is not None or stop is not None:
--> 268 loc = pos.locate_range(start, stop)
269 pos = pos[loc]
270 ac = ac[loc]
~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/model/ndarray.py in locate_range(self, start, stop)
3609
3610 if stop_index - start_index == 0:
-> 3611 raise KeyError(start, stop)
3612
3613 loc = slice(start_index, stop_index)
KeyError: (21, 24)
Generally, I think addressing this by modifying the behavior through pos.locate_range()
is probably the wrong approach. Instead, we could wrap this in a try/except
block for sequence_diversity()
and like functions, and immediately return 0 if there are no variants within the region of interest (though, perhaps if no bases are accessible there, it should be np.nan
instead? With the usual caveats this is a bit of a hack to deal with missingness values....).
Additionally, it's worth pointing out that this behavior is similar if no bases are accessible, but the range does include variants:
import allel as al
import numpy as np
g = al.GenotypeArray([[[0, 0], [0, 0]],
[[0, 0], [0, 1]],
[[0, 0], [1, 1]],
[[0, 1], [1, 1]],
[[1, 1], [1, 1]],
[[0, 0], [1, 2]],
[[0, 1], [1, 2]],
[[0, 1], [-1, -1]],
[[-1, -1], [-1, -1]]])
ac = g.count_alleles()
pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
al.sequence_diversity(pos, ac, start=12, stop=24, is_accessible=np.repeat(False, 30))
The issue here is that mask_inaccessible() is removing all positions that are inaccessible. In this case, I think a np.nan
might be a more desired return value too?
I am curious what you think.
thanks,
Vince