v2: Clarification MultiModel format #622

Open: wants to merge 8 commits into main
Conversation

FFroehlich (Collaborator)

See #609

One could also think about adding the option to specify models for individual conditions, but that definitely sounds like extension territory.

@FFroehlich requested a review from a team as a code owner April 22, 2025 15:55
@dilpath (Member) left a comment

  • There is a small inconsistency, since the "PEtab" way would be to have a new models table, I think.

  • Instead of a modelId column, the user could instead associate one measurement file with one model file in the YAML directly.

But the suggestion is fine for me, especially since I have no use cases currently.

@matthiaskoenig self-requested a review April 22, 2025 16:49
@matthiaskoenig (Contributor) left a comment

I'm not sure that adding the modelId to the measurementTable is the best solution.
I was thinking about adding a modelId column to the conditionTable, parameterTable and observableTable, i.e. wherever an id is referenced, it should be possible to provide an optional modelId to specify which model the id comes from. This would allow validating the table against the model. Otherwise you have to go back from the measurementTable to the other tables to figure out which model was meant for which variable, which makes it very difficult to work with multiple models.

I could imagine a problem with multiple models where parameters from each model are used, as well as observables from each model, and conditions are applied to each model in the time series. One wants to see, in each table, which model each piece of information belongs to.
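
As a rough sketch of this idea (the ids and species names here are made up, and modelId is only the optional column suggested above, not part of the current proposal), an observable table could then look like:

observableId    observableFormula    modelId
glucose_obs     glc_liver            model1
insulin_obs     ins_plasma           model2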

Contributor:

Looks good, but there is a lot of diff in the file, which makes it difficult to see the actual change.

Collaborator Author:

Briefly: I removed the problems list/array and moved all properties to the top level.

@dweindl (Member) commented Apr 23, 2025

  • There is a small inconsistency, since the "PEtab" way would be to have a new models table, I think.

I'd see the yaml file as kind of the "model table" here:

# [...]
  model_files:
    model1:
      language: sbml
      location: model1.xml
    model2:
      language: sbml
      location: model2.xml
# [...]

@dweindl (Member) commented Apr 23, 2025

  • Instead of a modelId column, the user could instead associate one measurement file with one model file in the YAML directly.

That was the original idea behind the problems list in the yaml file:

format_version: 2.0.0
parameter_file: all_parameters.tsv
problems:
- condition_files: [conditions_for_model1.tsv]
  experiment_files: [experiments_for_model1.tsv]
  measurement_files: [measurements_for_model1.tsv]
  model_files:
    model1:
      language: sbml
      location: model1.xml
  observable_files: [observables_for_model1.tsv]
- condition_files: [conditions_for_model2.tsv]
  experiment_files: [experiments_for_model2.tsv]
  measurement_files: [measurements_for_model2.tsv]
  model_files:
    model2:
      language: sbml
      location: model2.xml
  observable_files: [observables_for_model2.tsv]

However, I am not sure what is preferable.

Excerpt of the proposed specification text under review:

… introduced in the observable table may be referenced, but circular definitions must be avoided.

In problems with multiple models, symbols not defined in the currently simulated …

Contributor:

Should this not rather throw an error?
If, for example, there is a species in the observable formula that is not present in the model, it feels like the user has made an error.

Collaborator Author:

Yes, I considered that alternative. This specification allows reuse of observables across models; the user can always introduce model-specific observables. We could provide linting options that are more or less strict.

@sebapersson (Contributor)

I was thinking about adding a modelId column to the conditionTable, parameterTable and observableTable, i.e. wherever an id is referenced, it should be possible to provide an optional modelId to specify which model the id comes from. This would allow validating the table against the model. Otherwise you have to go back from the measurementTable to the other tables to figure out which model was meant for which variable, which makes it very difficult to work with multiple models.

If modelId is added to the parameterTable, would it not then be quite involved to have the same parameter in multiple models (e.g. one would have to add multiple model ids)? I think having the modelId only in the measurements table is doable; however, the drawback, as pointed out here, is that it will be quite a bit of work for the importer to figure out which parameters and species appear in each model.
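
To illustrate the concern (purely hypothetical syntax and ids, not part of any proposal), a parameter shared by two models might then need something like a list of model ids in the parameter table:

parameterId           modelId          estimate
hepatic_blood_flow    model1;model2    1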

@matthiaskoenig (Contributor)

That was the original idea behind the problems list in the yaml file.

As I understand it, one will have a single optimization problem which uses two models, i.e. in the end there is a single cost function with contributions from the different models. Having a list of problems implies that these are separate optimization problems with multiple cost functions which could be optimized independently.

If modelId is added to the parameterTable, would it not then be quite involved to have the same parameter in multiple models (e.g. one would have to add multiple model ids)?

That is a good point. I think we have to clearly write down the use cases we want to support with multiple models and then figure out the best way to encode this.
Personally, I am always more in favour of more fine-grained control (i.e. adding model ids to the different tables), which in the long run allows much more flexibility and supports use cases nobody has thought about yet. Adding the modelId only to the measurements table lets us distribute measurements to different models, but nothing else. Having modelIds in the separate tables would allow a lot of different things in the future without changing the format much, e.g. adding multiple modelIds to indicate that parameters or conditions should be shared between models.

My use cases are:

  1. Smaller models by distributing the problem into submodels (shared parameters). I have multiple models which map to different datasets, and there are some parameters which occur in the different models (shared parameters). I want to optimize some of the parameters which occur in all models as a single parameter (e.g. hepatic blood flow, ...). Splitting the problem makes the ODE integration of the submodels and the code generation much faster. Code generation via LLVM and other C++ code optimization does not scale linearly, i.e. it scales worse than O(n). This becomes the bottleneck for large models (we have to wait >12 hours for some code generation with roadrunner, while the simulation itself takes only a few seconds, if we have >100 000 state variables).
  2. Shared experiments between models. I have different models and want to run the same experiments on them. No shared parameters, but shared experiments. This could be encoded as separate optimization problems, but duplicating all the information is error-prone. Similar to 3.
  3. Encoding multiple optimization problems in a single file. I want to change parameters/settings of the optimization problem/model and run all the variants. Something in the direction of a sensitivity analysis with respect to settings (hyperparameter scans), e.g. testing different rate laws, different priors, ... I could generate a lot of different PEtab problems, but it would be much nicer to encode these in a single PEtab file by providing the different models. This would require different cost functions for the different models/settings and goes in the direction of model selection. It does not require multiple models in a single problem, but multiple problems in a single file. If one has larger models, a ton of unnecessary data is generated, with the model files being the largest assets of the optimization problem.

For me the main reasons are: 1. being able to split large models into submodels for part of the cost function calculation (this is mainly an optimization aspect and could perhaps also be done on a per-tool basis if there is some metadata), and 2. encoding multiple problems in a single PEtab problem (mainly convenience and reuse of information), basically a level on top of a single optimization problem.

@matthiaskoenig (Contributor)

Optimizing a single parameter across multiple models could require a mapping of the parameters, because the parameters do not have to be named identically in the different models (especially if existing models from different sources are used). The more I think about multiple-model problems, the more complicated it gets. We really have to think this through.

@FFroehlich (Collaborator, Author) commented Apr 23, 2025

That was the original idea behind the problems list in the yaml file.

As I understand it, one will have a single optimization problem which uses two models, i.e. in the end there is a single cost function with contributions from the different models. Having a list of problems implies that these are separate optimization problems with multiple cost functions which could be optimized independently.

Yes.

If modelId is added to the parameterTable, would it not then be quite involved to have the same parameter in multiple models (e.g. one would have to add multiple model ids)?

That is a good point. I think we have to clearly write down the use cases we want to support with multiple models and then figure out the best way to encode this. Personally, I am always more in favour of more fine-grained control (i.e. adding model ids to the different tables), which in the long run allows much more flexibility and supports use cases nobody has thought about yet. Adding the modelId only to the measurements table lets us distribute measurements to different models, but nothing else. Having modelIds in the separate tables would allow a lot of different things in the future without changing the format much, e.g. adding multiple modelIds to indicate that parameters or conditions should be shared between models.

All of the tables are now set up in a way that allows adding extra columns/metadata. Any user can easily start with "master tables" containing some metadata and programmatically generate the necessary tables; I don't think we need to formalise that process.

My use cases are:

  1. Smaller models by distributing the problem into submodels (shared parameters). I have multiple models which map to different datasets, and there are some parameters which occur in the different models (shared parameters). I want to optimize some of the parameters which occur in all models as a single parameter (e.g. hepatic blood flow, ...). Splitting the problem makes the ODE integration of the submodels and the code generation much faster. Code generation via LLVM and other C++ code optimization does not scale linearly, i.e. it scales worse than O(n). This becomes the bottleneck for large models (we have to wait >12 hours for some code generation with roadrunner, while the simulation itself takes only a few seconds, if we have >100 000 state variables).

This is similar to the use cases that we have and what I was trying to address.

  2. Shared experiments between models. I have different models and want to run the same experiments on them. No shared parameters, but shared experiments. This could be encoded as separate optimization problems, but duplicating all the information is error-prone. Similar to 3.

I think this could easily be solved by having multiple yaml files and reusing tables across problems.
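
For example (file names here are made up; the top-level keys just follow the yaml structure shown earlier, assuming the problems list has been dropped), two separate problem files could reference the same experiment and observable tables:

# model1_problem.yaml
format_version: 2.0.0
parameter_file: parameters_for_model1.tsv
model_files:
  model1:
    language: sbml
    location: model1.xml
measurement_files: [measurements_for_model1.tsv]
experiment_files: [shared_experiments.tsv]
observable_files: [shared_observables.tsv]

# model2_problem.yaml
format_version: 2.0.0
parameter_file: parameters_for_model2.tsv
model_files:
  model2:
    language: sbml
    location: model2.xml
measurement_files: [measurements_for_model2.tsv]
experiment_files: [shared_experiments.tsv]
observable_files: [shared_observables.tsv]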

  3. Encoding multiple optimization problems in a single file. I want to change parameters/settings of the optimization problem/model and run all the variants. Something in the direction of a sensitivity analysis with respect to settings (hyperparameter scans), e.g. testing different rate laws, different priors, ... I could generate a lot of different PEtab problems, but it would be much nicer to encode these in a single PEtab file by providing the different models. This would require different cost functions for the different models/settings and goes in the direction of model selection. It does not require multiple models in a single problem, but multiple problems in a single file. If one has larger models, a ton of unnecessary data is generated, with the model files being the largest assets of the optimization problem.

I thought this was addressed by PEtab-select, but I am not familiar with the details.

@FFroehlich (Collaborator, Author)

Optimizing a single parameter across multiple models could require a mapping of the parameters, because the parameters do not have to be named identically in the different models (especially if existing models from different sources are used). The more I think about multiple-model problems, the more complicated it gets. We really have to think this through.

This would already be covered by the implementation above with the mapping performed via the condition table. Things get problematic if the same parameter IDs are reused across models but with different meanings, but at the end of the day that's just asking for trouble.
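
As a sketch of that mapping (all ids made up, and shown in v1-style wide condition-table notation purely for illustration; the exact v2 condition table layout may differ): if model1 calls the parameter Q_liver and model2 calls it Q_hep, both condition tables can point to a single estimated parameter hepatic_blood_flow from the shared parameter table:

conditions_for_model1.tsv:
conditionId    Q_liver
cond1          hepatic_blood_flow

conditions_for_model2.tsv:
conditionId    Q_hep
cond1          hepatic_blood_flow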

FFroehlich and others added 2 commits April 23, 2025 10:35
@dweindl added this to the PEtab 2.0.0 milestone May 21, 2025
@dilpath (Member) left a comment

Fine for me, no strong opinion for or against

@dweindl (Member) left a comment

While I am happy to get rid of the problems list in the yaml file for the single-model case, I thought the editor meeting discussion ended with going back to the old style of having different sets of condition, observable, ... files for each model, with only the parameter table being shared. (Which would not preclude referencing the same observable/condition/experiment file from multiple sub-problems.)

I can live with either and would follow those who already have a clear application in mind.

@FFroehlich (Collaborator, Author)

While I am happy to get rid of the problems list in the yaml file for the single-model case, I thought the editor meeting discussion ended with going back to the old style of having different sets of condition, observable, ... files for each model, with only the parameter table being shared. (Which would not preclude referencing the same observable/condition/experiment file from multiple sub-problems.)

I can live with either and would follow those who already have a clear application in mind.

That wasn't my takeaway. I thought we mainly discussed how forgiving conditions/observables are in terms of validation.

Thinking about it, splitting up the problem at the file level is actually simpler to implement and more verbose for the user, so it definitely is a viable alternative that is equivalent in terms of what can be expressed. But yes, it would be great if we don't have a problems list for the single-model case.

We could drop the model id column from the measurements table and introduce an additional sub-problems table, with model ids as columns and files as rows and True/False values to indicate which file applies to which model? The default would be to apply tables to all models, minimising the number of rows that have to be added.
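
A rough sketch of what such a sub-problems table could look like (file and model names made up; the layout is just one possibility):

file                           model1    model2
measurements_for_model1.tsv    True      False
measurements_for_model2.tsv    False     True

With the proposed default, a table that applies to all models (e.g. a shared observables file) would not need a row at all.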

@FFroehlich changed the title from "Clarification MultiModel format" to "v2: Clarification MultiModel format" May 23, 2025