Skip to content

Calculating the sequence diversity between two variants returns a key error #356

Open
@vsbuffalo

Description

@vsbuffalo

Hello,

I found this behavior to be a bit unintuitive... I would be happy to submit a PR to address it, but it's not necessarily clear what the right behavior is. Here is an MRE; the issue is that calculating diversity on a range between variant raises an exception, rather than just returning zero pairwise diversity:

import allel as al
g = al.GenotypeArray([[[0, 0], [0, 0]],
                          [[0, 0], [0, 1]],
                          [[0, 0], [1, 1]],
                          [[0, 1], [1, 1]],
                          [[1, 1], [1, 1]],
                          [[0, 0], [1, 2]],
                          [[0, 1], [1, 2]],
                          [[0, 1], [-1, -1]],
                          [[-1, -1], [-1, -1]]])

ac = g.count_alleles()
pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]

al.sequence_diversity(pos, ac, start=21, stop=24)

Because the selected range resides between two variants, pos.locate_range() errors out with a KeyError. Here is the full traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-0cac1b30194e> in <module>
     12 pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
     13 
---> 14 al.sequence_diversity(pos, ac, start=21, stop=24)

~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/stats/diversity.py in sequence_diversity(pos, ac, start, stop, is_accessible)
    266     # deal with subregion
    267     if start is not None or stop is not None:
--> 268         loc = pos.locate_range(start, stop)
    269         pos = pos[loc]
    270         ac = ac[loc]

~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/model/ndarray.py in locate_range(self, start, stop)
   3609 
   3610         if stop_index - start_index == 0:
-> 3611             raise KeyError(start, stop)
   3612 
   3613         loc = slice(start_index, stop_index)

KeyError: (21, 24)

Generally, I think addressing this by modifying the behavior through pos.locate_range() is probably the wrong approach. Instead, we could wrap this in a try/except block for sequence_diversity() and like functions, and immediately return 0 if there are no variants within the region of interest (though, perhaps if no bases are accessible there, it should be np.nan instead? With the usual caveats this is a bit of a hack to deal with missingness values....).

Additionally, it's worth pointing out that this behavior is similar if no bases are accessible, but the range does include variants:

import allel as al
import numpy as np
g = al.GenotypeArray([[[0, 0], [0, 0]],
                          [[0, 0], [0, 1]],
                          [[0, 0], [1, 1]],
                          [[0, 1], [1, 1]],
                          [[1, 1], [1, 1]],
                          [[0, 0], [1, 2]],
                          [[0, 1], [1, 2]],
                          [[0, 1], [-1, -1]],
                          [[-1, -1], [-1, -1]]])

ac = g.count_alleles()
pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]

al.sequence_diversity(pos, ac, start=12, stop=24, is_accessible=np.repeat(False, 30))

The issue here is that mask_inaccessible() is removing all positions that are inaccessible. In this case, I think a np.nan might be a more desired return value too?

I am curious what you think.

thanks,
Vince

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions