Example popgen notebook #602

tomwhite · 2021-06-08T11:02:14Z

It would be good to have something that demonstrates a popgen workflow.

I converted one of @alimanfoo's MalariaGEN analyses to use sgkit a while back (#232). The problem with converting that to JupyterBook (#500) and including it in the doc build is that the dataset is ~15 GB in size, and the whole notebook takes a while to run. So something smaller would be preferable, in the same way that #463 uses a cut-down dataset, for example.

hammer · 2021-09-23T14:15:30Z

More notebooks to bring over at https://github.com/pystatgen/sgkit/discussions/674#discussioncomment-1343388

hammer · 2021-09-28T15:37:16Z

I found ag.allsites.nonN.vcf.gz which is 718 MB. Perhaps we could grab a subset of the samples in this file and drive the MalariaGEN analyses from that? (Update: nope, turns out there are no samples in this file).

alimanfoo · 2021-09-28T20:29:13Z

> I found ag.allsites.nonN.vcf.gz which is 718 MB. Perhaps we could grab a subset of the samples in this file and drive the MalariaGEN analyses from that? (Update: nope, turns out there are no samples in this file).

That file has just the sites, no sample genotypes I'm afraid.

Did you want to start the examples from VCFs? The MalariaGEN Ag3 data release provides one VCF per sample, so you could build a dataset of any number of samples if you wanted, although there is a step required to merge the per-sample VCFs before zarr conversion. A bit more info about VCF downloads here.

Alternatively you could get an sgkit-style xarray dataset directly from the malariagen_data API, example here. By changing the sample_sets parameter you can select a smaller number of samples if you wanted. Data is all in GCS though, so even with a smaller number of samples it's probably not ideal if you want to run the notebook as part of the doc build.

Worth considering running the notebook manually, outside of the doc build?

hammer · 2021-09-28T21:28:34Z

Hey @alimanfoo! Indeed I discovered the lack of samples in that VCF file after a quick inspection.

I saw that you provide the data in Xarray format and serialized as Zarr, but I noticed some small discrepancies between your data model and ours, and thought working from VCF might be less confusing for sgkit users.

My current plan is to use your Xarray interface to select a subset of the data and massage it into our format, then save as Zarr and make that file available someplace. Is there a logical subset of the data that you think might make sense to use as a starting point for a demonstration notebook?

alimanfoo · 2021-09-29T14:46:25Z

> I saw that you provide the data in Xarray format and serialized as Zarr, but I noticed some small discrepancies between your data model and ours, and thought working from VCF might be less confusing for sgkit users.

No problem. Out of interest, were there any differences between the xarray dataset returned from our snp_calls() API method and the sgkit model? If so, it would be good to know, since that method is intended to be a bridge to using sgkit on our data.

> My current plan is to use your Xarray interface to select a subset of the data and massage it into our format, then save as Zarr and make that file available someplace. Is there a logical subset of the data that you think might make sense to use as a starting point for a demonstration notebook?

You could start with a single contig (3L is probably a good choice, as the smallest autosomal contig) and you could also choose a single sample set, e.g., AG1000G-BF-B might be a good one: it has 102 samples, with representation from all three mosquito species we sampled. There are also smaller sample sets if you want something even slimmer, see here for what's in each of the sample sets.
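As a rough illustration of the subsetting suggested above (one contig, one sample set), here is a minimal pure-Python sketch. All data values are made up, the sample IDs and the second sample set entry are placeholders, and the `subset` helper is hypothetical. The variable names only mirror sgkit-style naming (`call_genotype`, `variant_position`); in xarray the same idea would be boolean indexing along the `variants` and `samples` dimensions.

```python
# Toy stand-in for a genotype dataset. Every value here is invented for
# illustration; nothing is read from the real MalariaGEN data.
variant_contig = ["2R", "3L", "3L", "3L"]
variant_position = [101, 15, 42, 77]
sample_id = ["S1", "S2", "S3"]  # placeholder sample IDs
sample_set = ["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-B"]

# call_genotype[variant][sample] -> [allele, allele] (diploid calls)
call_genotype = [
    [[0, 0], [0, 1], [1, 1]],
    [[0, 1], [0, 0], [0, 1]],
    [[1, 1], [0, 1], [0, 0]],
    [[0, 0], [1, 1], [0, 1]],
]


def subset(contig, wanted_set):
    """Keep only variants on `contig` and samples in `wanted_set`."""
    v_idx = [i for i, c in enumerate(variant_contig) if c == contig]
    s_idx = [j for j, s in enumerate(sample_set) if s == wanted_set]
    return {
        "variant_position": [variant_position[i] for i in v_idx],
        "sample_id": [sample_id[j] for j in s_idx],
        "call_genotype": [
            [call_genotype[i][j] for j in s_idx] for i in v_idx
        ],
    }


small = subset("3L", "AG1000G-BF-B")
print(small["variant_position"], small["sample_id"])
```

The resulting `small` dictionary is the kind of cut-down object that could then be written out once and reused by a demonstration notebook.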

AlksIDo · 2021-12-28T13:51:00Z

> you could build a dataset of any number of samples if you wanted, although there is a step required to merge the per-sample VCFs before Zarr conversion.

@alimanfoo May I ask you to clarify this point? Is it an sgkit function, or does a standard tool like BCFtools serve the purpose better?
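To make the merge step concrete, here is a minimal sketch of what merging per-sample VCFs amounts to conceptually: joining the call sets on site and filling in missing calls. It assumes biallelic records with identical REF/ALT representation across files and reads only the GT field. The function names are hypothetical and not part of sgkit; for real data a standard tool such as `bcftools merge` is commonly used for this.

```python
def parse_single_sample_vcf(text):
    """Return (sample_name, {(chrom, pos, ref, alt): GT string})."""
    sample, calls = None, {}
    for line in text.strip().splitlines():
        if line.startswith("##"):
            continue  # skip meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample = fields[9]  # single-sample file: one sample column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        fmt_keys = fields[8].split(":")
        values = fields[9].split(":")
        calls[(chrom, int(pos), ref, alt)] = values[fmt_keys.index("GT")]
    return sample, calls


def merge_per_sample(vcf_texts):
    """Outer-join per-sample calls on site; missing calls become './.'."""
    parsed = [parse_single_sample_vcf(t) for t in vcf_texts]
    samples = [name for name, _ in parsed]
    # Plain tuple sort; a real tool would follow the header's contig order.
    sites = sorted({site for _, calls in parsed for site in calls})
    matrix = [[calls.get(site, "./.") for _, calls in parsed] for site in sites]
    return samples, sites, matrix


HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t"
# Two invented single-sample VCFs (toy records, not real data).
vcf_a = "\n".join([
    "##fileformat=VCFv4.2",
    HEADER + "sampleA",
    "3L\t15\t.\tA\tC\t.\t.\t.\tGT\t0/1",
    "3L\t42\t.\tG\tT\t.\t.\t.\tGT\t1/1",
])
vcf_b = "\n".join([
    "##fileformat=VCFv4.2",
    HEADER + "sampleB",
    "3L\t42\t.\tG\tT\t.\t.\t.\tGT\t0/1",
    "3L\t77\t.\tC\tA\t.\t.\t.\tGT\t0/0",
])

samples, sites, matrix = merge_per_sample([vcf_a, vcf_b])
for site, row in zip(sites, matrix):
    print(site, dict(zip(samples, row)))
```

The output is a sites-by-samples genotype table, which is the shape a multi-sample VCF (and the subsequent Zarr conversion) expects. Multiallelic sites, differing REF/ALT encodings, and other FORMAT fields are deliberately left out of this sketch.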

tomwhite · 2022-12-02T14:52:32Z

Link to notebook: https://github.com/tomwhite/shiny-train/blob/sgkit/notebooks/gwss/sgkit_h12.ipynb

tomwhite added the documentation label (Improvements or additions to documentation) Jun 8, 2021

hammer assigned tomwhite Nov 21, 2022

hammer mentioned this issue Nov 21, 2022

Use cases sgkit-dev/sgkit-publication#6 (Open, 7 tasks)
