Skip to content

Incorrect evaluation workflow for essential genes in estimateEssentialGenes #970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
JHL-452b opened this issue Feb 7, 2025 · 5 comments
Labels

Comments

@JHL-452b
Copy link
Collaborator

JHL-452b commented Feb 7, 2025

Current behavior:

Recently, I have been working on evaluating essential genes. I've found that there are issues with the current evaluation workflow (also in auto-tasks in github) in estimateEssentialGenes.

ihuman = readYAMLmodel('model/Human-GEM.yml');
taskStruct = parseTaskList('data/metabolicTasks/metabolicTasks_Essential.txt');
[eGenes, INIT_output] = estimateEssentialGenes(ihuman, 'Hart2015_RNAseq.txt', taskStruct);
results = evaluateHart2015Essentiality(eGenes);

I found that the output context-specific models were very strange, with only a small amount of content as you can see below.

cell_type DLD1 GBM HCT116 HELA RPE1
genes 475 475 475 475 475
rxns 250 250 250 250 250
mets 339 339 339 339 339

Further investigation revealed that the reason for this result is due to the fourth parameter useGeneSymbol of the estimateEssentialGenes function defaulting as true, which then converts the genes in the template model into geneSymbol format. However, in reality, the genes in the Hart2015_RNAseq.txt data are in the 'ENSG0000' format, leading to no gene matches and thus no gene expression being detected by default.

So, I manually tried changing the fourth parameter to false, and while the content of the resulting model was much more normal.

cell_type DLD1 GBM HCT116 HELA RPE1
genes 1734 1731 1772 1743 1669
rxns 6870 6265 6888 6902 6097
mets 5680 4986 5649 5665 4845

However, the result of essential gene evaluation turned out to be all zeros because the genes in in Hart2015_TableS2.xlsx (Experimental result) are geneSymbol format. So, I believe that after the model is generated, all genes in the model (include template model) need to be converted into GeneSymbol format before performing the essential gene evaluation.

@JHL-452b JHL-452b added the bug label Feb 7, 2025
@JHL-452b
Copy link
Collaborator Author

I've noticed that currently, the gene shortNames in the model and the function replaceGrRules are used to complete the gene ID conversion.

idMapping = [model.genes, model.geneShortNames];
[grRules,genes,rxnGeneMat] = replaceGrRules(model.grRules,idMapping);
model.grRules = grRules;
model.genes = genes;
model.rxnGeneMat = rxnGeneMat;

Currently, one Ensembl ID of a gene in Human-GEM corresponds to one Symbol ID (shortName). However, in reality, one Ensembl ID of a gene might correspond to multiple Symbol IDs. This could lead to a situation where the Symbol ID converted through the aforementioned method does not exist in the Hart2015 experimental dataset. For example:

Ensembl All_Symbols Symbol in model(whether in Hart2015) Symbol (whether in Hart2015)
ENSG00000014257 ['ACP3', 'ACP-3', 'ACPP', 'PAP', 'TM-PAP'] ACP3 (No) ACPP (Yes)
ENSG00000031698 ['SARS1', 'SARS', 'SERS'] SARS1 (No) SARS (Yes)
ENSG00000035687 ['ADSS2', 'ADSS'] ADSS2 (No) ADSS (Yes)
ENSG00000065427 ['KARS1', 'DFNB89', 'KARS', 'KARS2'] KARS1 (No) KARS (Yes)
ENSG00000066379 ['POLR1H', 'A12.2', 'HTEX-6', 'hZR14', 'RPA12', 'tctex-6', 'ZNRD1'] POLR1H (No) ZNRD1 (Yes)
ENSG00000090861 ['AARS1', 'AARS', 'AlaRS', 'CMT2N'] AARS1 (No) AARS (Yes)
ENSG00000104331 ['BPNT2', 'FLJ20421', 'gPAPP', 'IMPA3', 'IMPAD1'] BPNT2 (No) IMPAD1 (Yes)
ENSG00000104522 ['GFUS', 'FX', 'P35B', 'SDR4E1', 'TSTA3'] GFUS (No) TSTA3 (Yes)
ENSG00000106105 ['GARS1', 'CMT2D', 'DSMAV', 'GARS', 'GlyRS', 'SMAD1'] GARS1 (No) GARS (Yes)
ENSG00000110619 ['CARS1', 'CARS'] CARS1 (No) CARS (Yes)
ENSG00000113163 ['CERT1', 'CERT', 'COL4A3BP', 'GPBP', 'STARD11'] CERT1 (No) COL4A3BP (Yes)

These examples could lead to errors in the essentiality assessment of some genes in the model, resulting in inaccurate evaluation of essential genes.

@JHL-452b
Copy link
Collaborator Author

I believe rather than determining which Ensembl ID corresponds to which Symbol ID for each gene, it would be better to convert all Symbol IDs in the Hart2015 experimental dataset directly into Ensembl IDs. This approach not only resolves the issue of incorrect matching mentioned above but also eliminates the need to convert gene IDs within the workflow.

I have organized and prepared the conversion results of the Hart2015 experimental dataset. You can refer to here:
Hart2015_TableS2_add_ID.xlsx

@JHL-452b
Copy link
Collaborator Author

I have already organized the results of the two methods.

First is about changing Ensembl ID to Symbol ID in all tissue-specific models (same as the current workflow)

cellLine TP TN FP FN accuracy sensitivity specificity F1 MCC
DLD1 94 2085 132 224 0.8596 0.2956 0.9405 0.3456 0.2744
GBM 84 2053 147 250 0.8433 0.2515 0.9332 0.2973 0.217
HCT116 110 2113 122 245 0.8583 0.3099 0.9454 0.3748 0.3074
HELA 82 2156 154 200 0.8634 0.2908 0.9333 0.3166 0.2426
RPE1 63 2095 166 210 0.8516 0.2308 0.9266 0.251 0.1702
all 40 2308 165 79 0.9059 0.3361 0.9333 0.2469 0.2089

Second is about keeping the Ensembl ID and using the new mapping data.

I have organized and prepared the conversion results of the Hart2015 experimental dataset. You can refer to here:
Hart2015_TableS2_add_ID.xlsx

cellLine TP TN FP FN accuracy sensitivity specificity F1 MCC
DLD1 120 2160 140 235 0.8588 0.338 0.9391 0.3902 0.3174
GBM 108 2130 156 261 0.8429 0.2927 0.9318 0.3412 0.2595
HCT116 140 2191 127 258 0.8582 0.3518 0.9452 0.4211 0.3527
HELA 108 2234 164 212 0.8617 0.3375 0.9316 0.3649 0.289
RPE1 87 2167 179 222 0.849 0.2816 0.9237 0.3026 0.2192
all 60 2396 181 81 0.9036 0.4255 0.9298 0.3141 0.2772

Obviously, the second result is superior to the first one. This indicates that the gene Symbol IDs not present in the experimental dataset, as previously described, have a negative impact on the assessment of essential genes by the model. Therefore, I recommend using the Ensembl ID of genes as the standard ID for assessing gene essentiality.

@feiranl
Copy link
Collaborator

feiranl commented Feb 18, 2025

@johan-gson Johan, Could you check this? Thanks!

@johan-gson
Copy link
Collaborator

Hi Jiahao,

Nice work, I'm sure you are right, thank you for a thorough investigation! I have been away from this quite some time, and actually never worked with the Hart dataset. @feiranl, is there any way we can place this new file somewhere where it is accessible and update any code/docs to refer to this file instead?

I also recommend looking at the DepMap data, there are many more cell lines there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants