Calculating the sequence diversity between two variants returns a key error

Hello,

I found this behavior to be a bit unintuitive... I would be happy to submit a PR to address it, but it's not necessarily clear what the right behavior is. Here is an MRE; the issue is that calculating diversity on a range between variant raises an exception, rather than just returning zero pairwise diversity: 

```python
import allel as al
g = al.GenotypeArray([[[0, 0], [0, 0]],
                          [[0, 0], [0, 1]],
                          [[0, 0], [1, 1]],
                          [[0, 1], [1, 1]],
                          [[1, 1], [1, 1]],
                          [[0, 0], [1, 2]],
                          [[0, 1], [1, 2]],
                          [[0, 1], [-1, -1]],
                          [[-1, -1], [-1, -1]]])

ac = g.count_alleles()
pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]

al.sequence_diversity(pos, ac, start=21, stop=24)
```

Because the selected range resides between two variants, `pos.locate_range()` errors out with a `KeyError`.  Here is the full traceback:

```python

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-0cac1b30194e> in <module>
     12 pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
     13 
---> 14 al.sequence_diversity(pos, ac, start=21, stop=24)

~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/stats/diversity.py in sequence_diversity(pos, ac, start, stop, is_accessible)
    266     # deal with subregion
    267     if start is not None or stop is not None:
--> 268         loc = pos.locate_range(start, stop)
    269         pos = pos[loc]
    270         ac = ac[loc]

~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/model/ndarray.py in locate_range(self, start, stop)
   3609 
   3610         if stop_index - start_index == 0:
-> 3611             raise KeyError(start, stop)
   3612 
   3613         loc = slice(start_index, stop_index)

KeyError: (21, 24)
```

Generally, I think addressing this by modifying the behavior through `pos.locate_range()` is probably the wrong approach. Instead, we could wrap this in a `try/except` block for `sequence_diversity()` and like functions, and immediately return 0 if there are no variants within the region of interest (though, perhaps if no bases are accessible there, it should be `np.nan` instead? With the usual caveats this is a bit of a hack to deal with missingness values....).

Additionally, it's worth pointing out that this behavior is similar *if no bases are accessible*, but the range does include variants:

```python
import allel as al
import numpy as np
g = al.GenotypeArray([[[0, 0], [0, 0]],
                          [[0, 0], [0, 1]],
                          [[0, 0], [1, 1]],
                          [[0, 1], [1, 1]],
                          [[1, 1], [1, 1]],
                          [[0, 0], [1, 2]],
                          [[0, 1], [1, 2]],
                          [[0, 1], [-1, -1]],
                          [[-1, -1], [-1, -1]]])

ac = g.count_alleles()
pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]

al.sequence_diversity(pos, ac, start=12, stop=24, is_accessible=np.repeat(False, 30))
```

The issue here is that [mask_inaccessible()](https://github.com/cggh/scikit-allel/blob/master/allel/stats/diversity.py#L264) is removing all positions that are inaccessible. In this case, I think a `np.nan` might be a more desired return value too? 

I am curious what you think.

thanks, 
Vince



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Calculating the sequence diversity between two variants returns a key error #356

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Calculating the sequence diversity between two variants returns a key error #356

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions