TCR generation probabilities #11

Open
RachelKarchin opened this issue Mar 20, 2025 · 7 comments

@RachelKarchin
Contributor

No description provided.

@dltamayo
Collaborator

TCR generation probability: VDJ recombination makes some CDR3 sequences more likely to be generated than others. This probability can be computed.

Implement a module to calculate the TCR generation probability for each CDR3 sequence in a sample. @yuvalel will advise on which software package to use.

Related: https://www.pnas.org/doi/pdf/10.1073/pnas.1409572111

@yuvalel
Collaborator

yuvalel commented Apr 8, 2025

For TCR generation probability we should use the OLGA package: https://github.com/statbiophys/OLGA
It quickly computes the generation probability of clones, based on previously inferred recombination models.
It supports different clone definitions; we should just use the CDR3 amino acid chain as the clone.

@dltamayo
Collaborator

dltamayo commented Apr 8, 2025

Input to OLGA: a vector of CDR3 amino acid chains for the clones in one sample
Output: a vector of generation probabilities, one per clone in that sample
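
A minimal sketch of that input/output contract, for discussion only. `sample_pgens` and `load_olga_trb_pgen_fn` are hypothetical names, not part of this repository. The OLGA calls inside `load_olga_trb_pgen_fn` follow the Python usage shown in the OLGA README as I recall it, and the model file names assume OLGA's bundled `default_models/human_T_beta` directory; both should be verified against the installed OLGA version.

```python
from typing import Callable, Dict, Iterable, List


def sample_pgens(cdr3_aa: Iterable[str], pgen_fn: Callable[[str], float]) -> List[float]:
    """Return one generation probability per CDR3, in input order.

    pgen_fn is any callable mapping a CDR3 amino acid string to a probability.
    Repeated sequences are computed once and reused.
    """
    cache: Dict[str, float] = {}
    out: List[float] = []
    for seq in cdr3_aa:
        if seq not in cache:
            cache[seq] = float(pgen_fn(seq))
        out.append(cache[seq])
    return out


def load_olga_trb_pgen_fn(model_dir: str) -> Callable[[str], float]:
    """Build a pgen_fn from an OLGA VDJ model directory (assumed layout).

    OLGA is imported lazily so the rest of this sketch runs without it.
    """
    import os
    import olga.load_model as load_model
    import olga.generation_probability as generation_probability

    genomic_data = load_model.GenomicDataVDJ()
    genomic_data.load_igor_genomic_data(
        os.path.join(model_dir, "model_params.txt"),
        os.path.join(model_dir, "V_gene_CDR3_anchors.csv"),
        os.path.join(model_dir, "J_gene_CDR3_anchors.csv"),
    )
    generative_model = load_model.GenerativeModelVDJ()
    generative_model.load_and_process_igor_model(
        os.path.join(model_dir, "model_marginals.txt")
    )
    model = generation_probability.GenerationProbabilityVDJ(generative_model, genomic_data)
    # Called with only the amino acid sequence, i.e. no V/J gene restriction.
    return model.compute_aa_CDR3_pgen
```

Passing the probability function in as an argument keeps the per-sample logic testable with a stub and leaves the choice of recombination model to the caller.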

@RachelKarchin
Contributor Author

Next steps:
histogram of generation probabilities for a single sample

Identify a baseline repertoire.
Emerson dataset! (use all individuals)
prepare histograms for each Emerson individual

is the test sample "healthy"?
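
As a rough illustration of the histogram step, here is a standard-library sketch that bins generation probabilities on a log10 scale, since these values typically span many orders of magnitude. The function name `log10_pgen_histogram`, the default bin width, and the choice to skip zero probabilities are all assumptions made for the example, not decisions taken in this thread.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def log10_pgen_histogram(pgens: Iterable[float], bin_width: float = 1.0) -> Dict[float, int]:
    """Count clones per log10(Pgen) bin; keys are the lower bin edges."""
    counts: Counter = Counter()
    for p in pgens:
        if p <= 0:
            # A probability of 0 has no logarithm; skipped here (assumption).
            continue
        edge = math.floor(math.log10(p) / bin_width) * bin_width
        counts[edge] += 1
    return dict(sorted(counts.items()))


# Example: two clones fall in the [1e-9, 1e-8) bin, one in [1e-12, 1e-11).
hist = log10_pgen_histogram([3e-9, 5e-9, 2e-12, 0.0])
```

The same function applied to each Emerson individual would give one histogram per individual on a shared set of bin edges, which makes them directly comparable.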

Image

KS test?
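
If a KS test is the route taken, the two-sample statistic is the largest gap between the two empirical CDFs. The sketch below computes only that statistic with the standard library; `ks_statistic` is a hypothetical helper name, and the inputs are assumed to be per-clone values such as log10 generation probabilities. For p-values, `scipy.stats.ks_2samp` would be the usual choice.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all observed x."""
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    # Both empirical CDFs are step functions, so the maximum gap
    # is attained at one of the observed values.
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d
```

With repertoires of many thousands of clones, even small differences can become statistically significant, so the statistic itself may be more informative than the p-value.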

@favorov
Collaborator

favorov commented Apr 24, 2025

There is a very reliable measure of distribution-to-distribution distance: OT (Optimal Transport), aka the Wasserstein distance. Once all the pairwise Emerson-to-Emerson sample distances are calculated, we can position any new sample relative to the Emerson samples by looking at its distances to each of them.
For 1-D distributions the OT distance has a simple form, $\int_{-\infty}^{\infty} |F_1(x)-F_2(x)|\,dx$, where $F_1$ and $F_2$ are the cumulative distribution functions (equivalently, $\int_0^1 |F_1^{-1}(u)-F_2^{-1}(u)|\,du$ in terms of the quantile functions).
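
To make the 1-D case concrete, here is a standard-library sketch that integrates the absolute difference between two empirical CDFs, then measures a new sample against a set of reference samples. `wasserstein_1d` and `distances_to_reference` are hypothetical names, and the inputs are assumed to be per-clone values such as log10 generation probabilities. `scipy.stats.wasserstein_distance` computes the same 1-D quantity and would likely be preferable in practice.

```python
from typing import Dict, Sequence


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """1-D Wasserstein-1 distance: integral of |F_a(x) - F_b(x)| over x."""
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    xs = sorted(set(sa) | set(sb))
    total = 0.0
    i = j = 0
    # Between consecutive observed values both CDFs are constant,
    # so the integral is a sum of rectangle areas.
    for left, right in zip(xs, xs[1:]):
        while i < len(sa) and sa[i] <= left:
            i += 1
        while j < len(sb) and sb[j] <= left:
            j += 1
        total += abs(i / len(sa) - j / len(sb)) * (right - left)
    return total


def distances_to_reference(
    new_sample: Sequence[float], reference: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """Distance from one new sample to each named reference sample."""
    return {name: wasserstein_1d(new_sample, ref) for name, ref in reference.items()}
```

Comparing these distances with the distribution of pairwise distances among the reference samples themselves is one way to judge whether the new sample looks typical.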

KevinMLanderos added a commit that referenced this issue Apr 25, 2025
Initial implementation #10 and #11, update tests, refactoring
@dimalvovs
Collaborator

dimalvovs commented Apr 30, 2025

@KevinMLanderos do you know where to find the Emerson dataset to test the idea?

@dltamayo
Collaborator

dltamayo commented May 7, 2025

> @KevinMLanderos do you know where to find the Emerson dataset to test the idea?

@dimalvovs I think this is the dataset. n=666 in original cohort and n=120 in independent validation cohort:
https://clients.adaptivebiotech.com/pub/emerson-2017-natgen
