Skip to content

Example popgen notebook #602

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Tracked by #6
tomwhite opened this issue Jun 8, 2021 · 7 comments
Open
Tracked by #6

Example popgen notebook #602

tomwhite opened this issue Jun 8, 2021 · 7 comments
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@tomwhite
Copy link
Collaborator

tomwhite commented Jun 8, 2021

It would be good to have something that demonstrates a popgen workflow.

I converted one of @alimanfoo's MalariaGEN analyses to use sgkit a while back (#232). The problem with converting that to JupyterBook (#500) and including it as a part of the doc build is that the dataset is ~15GB in size, and the whole thing takes a while to run. So something smaller would be preferable - in the same way that #463 uses a cutdown dataset, for example.

@tomwhite tomwhite added the documentation Improvements or additions to documentation label Jun 8, 2021
@hammer
Copy link
Contributor

hammer commented Sep 23, 2021

@hammer
Copy link
Contributor

hammer commented Sep 28, 2021

I found ag.allsites.nonN.vcf.gz which is 718 MB. Perhaps we could grab a subset of the samples in this file and drive the MalariaGEN analyses from that? (Update: nope, turns out there are no samples in this file).

@alimanfoo
Copy link
Collaborator

I found ag.allsites.nonN.vcf.gz which is 718 MB. Perhaps we could grab a subset of the samples in this file and drive the MalariaGEN analyses from that? (Update: nope, turns out there are no samples in this file).

That file has just the sites, no sample genotypes I'm afraid.

Did you want to start the examples from VCFs? The MalariaGEN Ag3 data release provides one VCF per sample, so you could build a dataset of any number of samples if you wanted, although there is a step required to merge the per-sample VCFs before zarr conversion. A bit more info about VCF downloads here.

Alternatively you could get an sgkit-style xarray dataset directly from the malariagen_data API, example here. By changing the sample_sets parameter you can select a smaller number of samples if you wanted. Data is all in GCS though, so even with a smaller number of samples it's probably not ideal if you want to run the notebook as part of the doc build.

Worth considering running the notebook manually, outside of the doc build?

@hammer
Copy link
Contributor

hammer commented Sep 28, 2021

Hey @alimanfoo! Indeed I discovered the lack of samples in that VCF file after a quick inspection.

I saw that you provide the data in Xarray format and serialized as Zarr, but I noticed some small discrepancies between your data model and ours, and thought working from VCF might be less confusing for sgkit users.

My current plan is to use your Xarray interface to select a subset of the data and massage it into our format, then save as Zarr and make that file available someplace. Is there a logical subset of the data that you think might make sense to use as a starting point for a demonstration notebook?

@alimanfoo
Copy link
Collaborator

I saw that you provide the data in Xarray format and serialized as Zarr, but I noticed some small discrepancies between your data model and ours, and thought working from VCF might be less confusing for sgkit users.

No problem. Out of interest, were there any differences between the xarray dataset returned from our snp_calls() API method and the sgkit model? If so would be good to know, that method is intended to be a bridge to using sgkit on our data.

My current plan is to use your Xarray interface to select a subset of the data and massage it into our format, then save as Zarr and make that file available someplace. Is there a logical subset of the data that you think might make sense to use as a starting point for a demonstration notebook?

You could start with a single contig - 3L is probably good as smallest autosomal contig - and you could also choose a single sample set - e.g., AG1000G-BF-B might be a good one, it has 102 samples, with representation from all three mosquito species we sampled. There are also smaller sample sets if you want even slimmer, see here for what's in each of the sample sets.

@AlksIDo
Copy link

AlksIDo commented Dec 28, 2021

you could build a dataset of any number of samples if you wanted, although there is a step required to merge the per-sample VCFs before Zarr conversion.

@alimanfoo May I ask you to clarify this point? Is it a sgkit function, or does a standard tool like BCFtools serve the purpose better?

@tomwhite
Copy link
Collaborator Author

tomwhite commented Dec 2, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

4 participants