
Behaviour of batch_eval_metrics determines the include_for_metrics behaviour #37683


Open
prabhuteja12 opened this issue Apr 22, 2025 · 5 comments

@prabhuteja12

System Info

Hello!

In the evaluation_loop method, there is an interplay between the batch_eval_metrics and include_for_metrics arguments. When include_for_metrics is set to ["inputs"], what is sent to compute_metrics is the all_inputs object, which contains only the input_ids, whereas when batch_eval_metrics is set to True, compute_metrics is called with inputs, which includes the input_ids as well as any other inputs passed to the model.

This behaviour is inconsistent. Can you please look into this?

@zach-huggingface @SunMarc

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Any trainer example with both

include_for_metrics = ["inputs"]
batch_eval_metrics = True

should cause this.
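
A minimal sketch of such a setup (the tiny checkpoint prajjwal1/bert-tiny, the toy dataset, and the compute_metrics body are only illustrative assumptions; compute_metrics just inspects what it receives):

    # Sketch of a reproducer; model name, dataset contents and metric body are
    # illustrative assumptions, not taken from the original report.
    from datasets import Dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    model_name = "prajjwal1/bert-tiny"  # any small classification checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    dataset = Dataset.from_dict(
        {"text": ["a good example", "a bad example"] * 4, "label": [1, 0] * 4}
    )
    dataset = dataset.map(
        lambda batch: tokenizer(
            batch["text"], padding="max_length", truncation=True, max_length=16
        ),
        batched=True,
    )

    def compute_metrics(eval_pred, compute_result=False):
        # With batch_eval_metrics=True this receives the raw per-batch model
        # inputs (a dict); without it, eval_pred.inputs is only the gathered
        # input_ids.
        print(type(eval_pred.inputs))
        return {"dummy": 0.0}

    args = TrainingArguments(
        output_dir="tmp_out",
        per_device_eval_batch_size=2,
        include_for_metrics=["inputs"],
        batch_eval_metrics=True,  # flip to False to compare what inputs contains
    )
    trainer = Trainer(
        model=model, args=args, eval_dataset=dataset, compute_metrics=compute_metrics
    )
    trainer.evaluate()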

Expected behavior

Ideally, batch_eval_metrics shouldn't dictate how compute_metrics is called. Can this inconsistency be fixed?

@SunMarc
Member

SunMarc commented Apr 23, 2025

Thanks for the report! Can you share a minimal reproducer? This is a bit strange, since we are passing inputs_host, which comes from inputs_decode, and we should only be selecting input_ids:

            main_input_name = getattr(self.model, "main_input_name", "input_ids")
            inputs_decode = (
                self._prepare_input(inputs[main_input_name]) if "inputs" in args.include_for_metrics else None
            )
            if self.args.batch_eval_metrics:
                if self.compute_metrics is not None and preds_host is not None and labels_host is not None:
                    is_last_step = self.accelerator.gradient_state.end_of_dataloader
                    batch_kwargs = {}
                    batch_kwargs["losses"] = losses_host if "loss" in args.include_for_metrics else None
                    batch_kwargs["inputs"] = inputs_host if "inputs" in args.include_for_metrics else None

@prabhuteja12
Author

At L4351, the inputs_decode variable is the one that is attached to all_inputs at L4365. It has only one column, main_input_name, which contains just the input_ids taken from the input.

On the other hand, at L4388, when self.args.batch_eval_metrics is set, the code is

    batch_kwargs["inputs"] = inputs if "inputs" in args.include_for_metrics else None

This inputs is not the same object as the inputs_decode above. As for an MWE, I don't have one at hand right away, but if this explanation doesn't help, I'll try to create one.
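
To make the two objects concrete, here is a standalone toy illustration (not Trainer code; the dict keys and shapes are just assumptions for a typical text batch):

    import torch

    # Accumulating path: only the main input column (input_ids) is stored,
    # gathered and concatenated across batches.
    inputs_decode = torch.randint(0, 100, (8, 16))

    # batch_eval_metrics path: the whole per-batch model-input dict is forwarded.
    inputs = {
        "input_ids": torch.randint(0, 100, (2, 16)),
        "attention_mask": torch.ones(2, 16, dtype=torch.long),
        "labels": torch.tensor([1, 0]),
    }

    # A compute_metrics that wants input_ids therefore has to branch on the type:
    def extract_input_ids(eval_inputs):
        if isinstance(eval_inputs, dict):   # batch_eval_metrics=True
            return eval_inputs["input_ids"]
        return eval_inputs                  # batch_eval_metrics=False

    print(extract_input_ids(inputs_decode).shape, extract_input_ids(inputs).shape)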

@SunMarc
Member

SunMarc commented Apr 25, 2025

Indeed, we should pass inputs_decode there instead. Do you want to open a PR to fix this, @prabhuteja12? Thanks for the explanation!
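
For reference, a sketch of that one-line change against the snippet quoted above (illustrative only, not a tested patch; the exact location in trainer.py may have shifted):

    # sketch: forward the same per-batch main-input tensor that the accumulating
    # path stores, instead of the raw model-input dict
    batch_kwargs["inputs"] = inputs_decode if "inputs" in args.include_for_metrics else None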

@prabhuteja12
Author

@SunMarc Aha! I was hoping you wouldn't say that we should use inputs_decode everywhere. Is there a reason why the entire inputs are not passed, instead of just one column?

@prabhuteja12
Author

Hello,

This discrepancy seems to lead to another issue when using distributed training: at L436, gather is done only on inputs_decode, not on the full inputs. But inputs is what is passed to compute_metrics, as described previously. A simple inputs = self.gather_function(inputs) solves the issue locally. What do you think about this problem?
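
For reference, a sketch of where that local workaround would sit in the batch_eval_metrics branch of evaluation_loop (illustrative only; the surrounding code may differ between versions):

    if self.args.batch_eval_metrics:
        # workaround sketch: gather the full per-batch inputs across processes,
        # mirroring the gather already applied to inputs_decode
        inputs = self.gather_function(inputs)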
