Adding TCRDist3 Functionality to Pipeline #1

Open
RachelKarchin opened this issue Jan 22, 2025 · 7 comments

Comments

@RachelKarchin
Contributor

This is the TCRdist3 Docker image we are planning to build on: quay.io/kmayerb/tcrdist3:0.1.9 (documentation: https://tcrdist3.readthedocs.io/en/latest/docker.html).

@yuvalel
Collaborator

yuvalel commented Jan 22, 2025

Basic functionality requested: a user has a few TCR samples and wants to explore the internal distances within each sample. These distances could be a biomarker of either response or potential to respond, and could be correlated with clinical covariates.
Specifically, we would like the user to be able to visualize distances within one or more repertoire samples according to this breakdown:

  1. Input: a repertoire sample (minimal example: bulk beta CDR3s only). Output: a distance matrix. Calculate every pairwise distance in the sample and output the result as a matrix.
  2. Visualize the distance matrix as a heatmap, with or without clustering.
  3. Build a histogram of the distances and plot it as a distribution over distance.
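
A minimal sketch of step 1, for illustration only. It swaps in plain Levenshtein (edit) distance on CDR3 strings instead of the TCRdist metric so that it runs with the standard library alone; the function names and the example sequences are made up and are not part of the pipeline.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def pairwise_matrix(seqs):
    """Symmetric n x n matrix of distances between all pairs of sequences."""
    n = len(seqs)
    mat = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = levenshtein(seqs[i], seqs[j])
        mat[i][j] = mat[j][i] = d
    return mat


# Made-up CDR3 beta strings, for illustration only.
cdr3s = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPGTGGYEQYF", "CSARDRTGNGYTF"]
dist = pairwise_matrix(cdr3s)
```

The matrix is symmetric with a zero diagonal, so only the upper triangle needs to be computed; the cost still grows quadratically with the number of clones.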

@KevinMLanderos
Collaborator

Title: "Adding TCRdist3 Functionality to Cirro"

Introduction: Adding TCRdist3 to TCRtoolkit will support analyses in which a user has TCR samples and wants to explore the internal distances within each sample. The results could reveal a biomarker of either response or potential to respond, and could be correlated with clinical covariates.

Outline of problem to be addressed: Specifically, we would like the user to calculate every pairwise distance in a bulk sample and output the result as a matrix. We would also like to be able to visualize the distances as a heatmap, with or without clustering.

Expected inputs:

  • Bulk TCR repertoire data in tabular format (minimal example: CDR3b); can also include V and J gene information and metadata.
  • Parameters: organism, chains (beta), etc...

Commands to be run/implemented:
(TCRdist3 documentation: https://tcrdist3.readthedocs.io/en/latest/tcrdistances.html)

  • TCRrep: computes the TCR distances
  • Plotting function for the heatmap (e.g. pheatmap in R)
  • Dendrogram function for clustering

Expected outputs:

  • Distance matrix
  • Clustering result: Computed from distances
  • Heatmap of the TCR distance matrix
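
A rough sketch of how a "clustered" matrix could be derived from a raw distance matrix. This is not the method chosen for the pipeline: instead of a full dendrogram it uses a simpler stand-in, single-linkage clusters cut at one threshold, which are the connected components of the graph linking every pair at distance <= threshold. The function names and the toy matrix are hypothetical.

```python
def single_linkage_clusters(dist, threshold):
    """Label each item with its single-linkage cluster at the given cut."""
    n = len(dist)
    parent = list(range(n))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    labels = {}  # root -> cluster id, numbered in order of first appearance
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


def reorder_by_cluster(dist, labels):
    """Permute rows and columns so members of a cluster sit together."""
    order = sorted(range(len(dist)), key=lambda i: labels[i])
    return order, [[dist[i][j] for j in order] for i in order]


# Toy symmetric matrix: items 0 and 2 are close, items 1 and 3 are close.
toy = [
    [0, 9, 1, 8],
    [9, 0, 9, 2],
    [1, 9, 0, 9],
    [8, 2, 9, 0],
]
labels = single_linkage_clusters(toy, threshold=2)
order, clustered = reorder_by_cluster(toy, labels)
```

Reordering rows and columns by cluster label is what makes block structure visible in a heatmap. A real implementation would more likely take the leaf order from a hierarchical clustering routine.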

@dltamayo
Collaborator

Clarifications (meeting with @yuvalel 2025.02.14):

  • use the IMGT reference database (http://imgt.org) for gene names
  • compute the distance matrix and clustering within each sample
  • perform hierarchical clustering based on the distance matrix (potentially include an option to cluster by sequence similarity in the future)
  • additional deliverable: plot the distance distribution as a histogram with a log-scale y-axis
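
As a sketch of the histogram deliverable, the counts can be taken from the upper triangle of the distance matrix so that each pair is counted once and the zero diagonal is excluded. The function name and the toy matrix are illustrative assumptions; the log10 values stand in for what a plotting library would show on a log-scale y-axis.

```python
import math
from collections import Counter


def distance_histogram(dist):
    """Count each distance value over the upper triangle (i < j)."""
    n = len(dist)
    return Counter(dist[i][j] for i in range(n) for j in range(i + 1, n))


# Toy matrix with two tight pairs that are far from each other.
toy = [
    [0, 1, 4, 4],
    [1, 0, 4, 4],
    [4, 4, 0, 1],
    [4, 4, 1, 0],
]
counts = distance_histogram(toy)

# Bar heights as they would appear on a log10 y-axis.
log_counts = {d: math.log10(c) for d, c in sorted(counts.items())}
```

A log scale is useful here because most pairs in a repertoire are expected to be far apart, so the few short-distance pairs would be invisible on a linear axis.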

@dltamayo
Collaborator

dltamayo commented Feb 28, 2025

Deliverables from this current issue:

  • Raw distance matrix file
  • Clustered distance matrix file
  • Heatmaps - clustered/unclustered
  • Histogram

@dltamayo changed the title from "Adding TCRDist3 Functionality to Cirro" to "Adding TCRDist3 Functionality to Pipeline" on Mar 20, 2025
@dltamayo
Collaborator

Moving conversation about dense vs sparse matrix to this issue for easier tracking.
This test run of TCRtoolkit-Bulk using the full Yost 2019 subset completed successfully. However, the full distance matrices generated per sample vary considerably in size, because the number of entries grows with the square of the number of clones in each sample. For example: 725 clones -> 2 MB matrix, 77,551 clones -> 22 GB matrix.
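
A back-of-the-envelope check of the quadratic growth. The 4 bytes per entry is an assumption, not a known property of the output files; it is used here only because it gives estimates close to the sizes reported above (about 2.0 MiB and about 22.4 GiB).

```python
def dense_matrix_bytes(n_clones: int, bytes_per_entry: int = 4) -> int:
    """Size of a full n x n matrix; bytes_per_entry is an assumed value."""
    return n_clones * n_clones * bytes_per_entry


for n in (725, 77_551):
    size = dense_matrix_bytes(n)
    print(f"{n:>6} clones -> {size:,} bytes ({size / 2**30:.2f} GiB)")
```

Roughly 107 times more clones means roughly 11,400 times more entries, which is why the largest samples dominate storage and downstream runtime.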

While generating the matrices is still computationally feasible on the cloud, downstream processes such as clustering may take much longer.

Possible alternatives:

  • create a histogram of distances to aggregate matrix values
  • convert full distance matrix -> similarity matrix -> sparse matrix (convert values past a certain threshold to 0)
  • generate sparse matrix from the start
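
To illustrate the second alternative, here is a minimal sketch that keeps only the pairs within a chosen radius and stores them as a dictionary keyed by index pair. The function name and the toy matrix are hypothetical, and this is not how tcrdist3 stores its sparse output.

```python
def to_sparse_pairs(dist, radius):
    """Keep only upper-triangle pairs (i < j) with distance <= radius."""
    n = len(dist)
    return {
        (i, j): dist[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if dist[i][j] <= radius
    }


# Toy matrix: items 1 and 2 are identical (distance 0), item 0 is far from 2.
toy = [
    [0, 1, 9],
    [1, 0, 0],
    [9, 0, 0],
]
sparse = to_sparse_pairs(toy, radius=6)
```

One design point to watch: in array-based sparse formats an unstored entry reads as 0, which cannot be told apart from a genuine distance of 0 between identical sequences. Storing explicit pairs, as above, or encoding true zeros with a sentinel value avoids that ambiguity.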

@dimalvovs
Collaborator

@yuvalel would you have a preference among the options above? There's a built-in method for that in tcrdist3, called the sparse implementation.

@yuvalel
Collaborator

yuvalel commented Apr 8, 2025

Following our discussion yesterday, I think we should try the sparse implementation of TCRdist.
This requires choosing a radius for the selected distance metric; pairs farther apart than the radius are dropped, because the two TCRs are too distant to be considered similar in any way.
Our suggestion at this stage is to implement two distance metrics and let the user choose which one to use: the TCRdist3 default metric and Levenshtein distance (edit distance). The sparse implementation should be available as an option for both. My suggestion for the radius is 50 for the TCRdist3 metric and 6 for Levenshtein distance. We should check these thresholds against the distributions of all pairwise distances in a few samples, to confirm that they keep enough of the short-distance pairs.
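
A small sketch of the proposed threshold check. The metric keys, the `DEFAULT_RADIUS` table and the function name are hypothetical; the radii are the ones suggested above, and the distance list is made-up data standing in for all pairwise distances of one sample.

```python
# Suggested radii from this thread, keyed by a hypothetical metric name.
DEFAULT_RADIUS = {"tcrdist": 50, "levenshtein": 6}


def fraction_within_radius(distances, radius):
    """Share of pairwise distances that a sparse run would keep."""
    distances = list(distances)
    if not distances:
        return 0.0
    return sum(d <= radius for d in distances) / len(distances)


# Made-up Levenshtein distances for illustration only.
example = [1, 3, 6, 7, 12, 20]
kept = fraction_within_radius(example, DEFAULT_RADIUS["levenshtein"])
```

Running this per sample and per metric gives one number to compare across samples. If the kept fraction is close to zero for some samples, the radius is probably too strict for those repertoires.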


5 participants