Example popgen notebook #602
Comments
More notebooks to bring over at https://github.com/pystatgen/sgkit/discussions/674#discussioncomment-1343388
I found ag.allsites.nonN.vcf.gz, which is 718 MB. Perhaps we could grab a subset of the samples in this file and drive the MalariaGEN analyses from that? (Update: nope, turns out there are no samples in this file.)
That file has just the sites, no sample genotypes I'm afraid. Did you want to start the examples from VCFs? The MalariaGEN Ag3 data release provides one VCF per sample, so you could build a dataset of any number of samples if you wanted, although there is a step required to merge the per-sample VCFs before Zarr conversion. A bit more info about VCF downloads here. Alternatively you could get an sgkit-style xarray dataset directly from the Worth considering running the notebook manually, outside of the doc build?
Hey @alimanfoo! Indeed I discovered the lack of samples in that VCF file after a quick inspection. I saw that you provide the data in Xarray format and serialized as Zarr, but I noticed some small discrepancies between your data model and ours, and thought working from VCF might be less confusing for My current plan is to use your Xarray interface to select a subset of the data and massage it into our format, then save as Zarr and make that file available someplace. Is there a logical subset of the data that you think might make sense to use as a starting point for a demonstration notebook?
No problem. Out of interest, were there any differences between the xarray dataset returned from our
You could start with a single contig - 3L is probably good as the smallest autosomal contig - and you could also choose a single sample set - e.g., AG1000G-BF-B might be a good one, it has 102 samples, with representation from all three mosquito species we sampled. There are also smaller sample sets if you want something even slimmer, see here for what's in each of the sample sets.
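As a rough illustration of the subsetting suggested above (one contig, one sample set), here is a minimal sketch on a tiny synthetic xarray dataset. The variable names `call_genotype`, `variant_contig`, `variant_position` and `sample_id` follow sgkit conventions, but the `sample_set` variable, the `contigs` attribute layout and the `subset_dataset` helper are assumptions made for this example only, not part of the MalariaGEN or sgkit APIs.

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for a genotype dataset (NOT real Ag3 data).
rng = np.random.default_rng(0)
n_variants, n_samples, ploidy = 6, 4, 2
ds = xr.Dataset(
    data_vars={
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            rng.integers(0, 2, size=(n_variants, n_samples, ploidy), dtype=np.int8),
        ),
        # Index of each variant's contig into ds.attrs["contigs"].
        "variant_contig": ("variants", np.array([0, 0, 1, 1, 1, 2])),
        "variant_position": ("variants", np.array([10, 20, 5, 15, 25, 7])),
        "sample_id": ("samples", np.array(["S1", "S2", "S3", "S4"])),
        # Hypothetical per-sample label; assumed for this sketch.
        "sample_set": (
            "samples",
            np.array(["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-B", "AG1000G-AO"]),
        ),
    },
    attrs={"contigs": ["2R", "3L", "3R"]},
)


def subset_dataset(ds, contig, sample_set):
    """Keep only variants on `contig` and samples belonging to `sample_set`."""
    contig_index = ds.attrs["contigs"].index(contig)
    variant_mask = (ds["variant_contig"] == contig_index).values
    sample_mask = (ds["sample_set"] == sample_set).values
    # Boolean masks select along the named dimensions.
    return ds.isel(variants=variant_mask, samples=sample_mask)


small = subset_dataset(ds, contig="3L", sample_set="AG1000G-BF-B")
print(dict(small.sizes))
```

The resulting dataset could then be written out with `small.to_zarr(...)` (this needs the `zarr` package installed) to produce a small file suitable for a documentation build.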
@alimanfoo May I ask you to clarify this point? Is it an sgkit function, or does a standard tool like BCFtools serve the purpose better?
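The thread does not record which tool was intended for the merge step. As one possible approach, the sketch below assembles a `bcftools merge` command for a set of per-sample VCFs. It assumes the inputs are bgzip-compressed and indexed, as `bcftools merge` requires; the file names and the `build_merge_command` helper are hypothetical.

```python
import shlex


def build_merge_command(sample_vcfs, merged_vcf):
    """Build a `bcftools merge` command line for per-sample VCFs.

    Assumes every input is bgzip-compressed and indexed.
    """
    sample_vcfs = [str(path) for path in sample_vcfs]
    if len(sample_vcfs) < 2:
        raise ValueError("merging needs at least two input VCFs")
    return [
        "bcftools",
        "merge",
        "--output-type", "z",  # write bgzip-compressed VCF
        "--output", str(merged_vcf),
        *sample_vcfs,
    ]


# Hypothetical file names, for illustration only.
cmd = build_merge_command(
    ["sample_A.vcf.gz", "sample_B.vcf.gz"], "merged.vcf.gz"
)
print(shlex.join(cmd))
# To actually run it (requires bcftools on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

The merged multi-sample VCF would then be the input to whatever VCF-to-Zarr conversion step the notebook uses.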
It would be good to have something that demonstrates a popgen workflow.
I converted one of @alimanfoo's MalariaGEN analyses to use sgkit a while back (#232). The problem with converting that to JupyterBook (#500) and including it as part of the doc build is that the dataset is ~15 GB in size, and the whole thing takes a while to run. So something smaller would be preferable - in the same way that #463 uses a cut-down dataset, for example.