
Behaviour of batch_eval_metrics determines the include_for_metrics behaviour #37683


Open
prabhuteja12 opened this issue Apr 22, 2025 · 5 comments

@prabhuteja12

System Info

Hello!

In the evaluation_loop method, there is an interplay between the batch_eval_metrics and include_for_metrics arguments. When include_for_metrics is set to ["inputs"], what is sent to compute_metrics is the all_inputs object, which contains only the input_ids, whereas when batch_eval_metrics is set to True, compute_metrics is called with inputs, which includes the input_ids as well as any other inputs passed to the model.

This behaviour is inconsistent. Can you please look into this?

@zach-huggingface @SunMarc

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Any trainer example with both

include_for_metrics = ["inputs"]
batch_eval_metrics = True

should cause this.
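
A minimal sketch of such a setup (the tiny checkpoint prajjwal1/bert-tiny, the toy dataset, and the compute_metrics body are only illustrative assumptions; compute_metrics just inspects what it receives):

    # Sketch of a reproducer; model name, dataset contents and metric body are
    # illustrative assumptions, not taken from the original report.
    from datasets import Dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    model_name = "prajjwal1/bert-tiny"  # any small classification checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    dataset = Dataset.from_dict(
        {"text": ["a good example", "a bad example"] * 4, "label": [1, 0] * 4}
    )
    dataset = dataset.map(
        lambda batch: tokenizer(
            batch["text"], padding="max_length", truncation=True, max_length=16
        ),
        batched=True,
    )

    def compute_metrics(eval_pred, compute_result=False):
        # With batch_eval_metrics=True this receives the raw per-batch model
        # inputs (a dict); without it, eval_pred.inputs is only the gathered
        # input_ids.
        print(type(eval_pred.inputs))
        return {"dummy": 0.0}

    args = TrainingArguments(
        output_dir="tmp_out",
        per_device_eval_batch_size=2,
        include_for_metrics=["inputs"],
        batch_eval_metrics=True,  # flip to False to compare what inputs contains
    )
    trainer = Trainer(
        model=model, args=args, eval_dataset=dataset, compute_metrics=compute_metrics
    )
    trainer.evaluate()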

Expected behavior

Ideally, batch_eval_metrics shouldn't dictate how compute_metrics is called. Can this inconsistency be fixed?

@SunMarc
Member

SunMarc commented Apr 23, 2025

Thanks for the report! Can you share a minimal reproducer? This is a bit strange, since we are passing inputs_host, which comes from inputs_decode, and we should only be selecting input_ids:

            main_input_name = getattr(self.model, "main_input_name", "input_ids")
            inputs_decode = (
                self._prepare_input(inputs[main_input_name]) if "inputs" in args.include_for_metrics else None
            )
            if self.args.batch_eval_metrics:
                if self.compute_metrics is not None and preds_host is not None and labels_host is not None:
                    is_last_step = self.accelerator.gradient_state.end_of_dataloader
                    batch_kwargs = {}
                    batch_kwargs["losses"] = losses_host if "loss" in args.include_for_metrics else None
                    batch_kwargs["inputs"] = inputs_host if "inputs" in args.include_for_metrics else None

@prabhuteja12
Author

At L4351, the inputs_decode variable is the one that is attached to all_inputs at L4365. It has only one column, main_input_name, which contains just the input_ids taken from the input.

On the other hand, at L4388, when self.args.batch_eval_metrics is set, the code is

    batch_kwargs["inputs"] = inputs if "inputs" in args.include_for_metrics else None

This inputs is not the same object as the inputs_decode above. As for an MWE, I don't have one at hand right away, but if this explanation doesn't help, I'll try to create one.
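
To make the two objects concrete, here is a standalone toy illustration (not Trainer code; the dict keys and shapes are just assumptions for a typical text batch):

    import torch

    # Accumulating path: only the main input column (input_ids) is stored,
    # gathered and concatenated across batches.
    inputs_decode = torch.randint(0, 100, (8, 16))

    # batch_eval_metrics path: the whole per-batch model-input dict is forwarded.
    inputs = {
        "input_ids": torch.randint(0, 100, (2, 16)),
        "attention_mask": torch.ones(2, 16, dtype=torch.long),
        "labels": torch.tensor([1, 0]),
    }

    # A compute_metrics that wants input_ids therefore has to branch on the type:
    def extract_input_ids(eval_inputs):
        if isinstance(eval_inputs, dict):   # batch_eval_metrics=True
            return eval_inputs["input_ids"]
        return eval_inputs                  # batch_eval_metrics=False

    print(extract_input_ids(inputs_decode).shape, extract_input_ids(inputs).shape)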

@SunMarc
Member

SunMarc commented Apr 25, 2025

Indeed, we should pass inputs_decode there instead. Do you want to open a PR to fix this, @prabhuteja12? Thanks for the explanation!
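
For reference, a sketch of that one-line change against the snippet quoted above (illustrative only, not a tested patch; the exact location in trainer.py may have shifted):

    # sketch: forward the same per-batch main-input tensor that the accumulating
    # path stores, instead of the raw model-input dict
    batch_kwargs["inputs"] = inputs_decode if "inputs" in args.include_for_metrics else None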

@prabhuteja12
Author

@SunMarc Aha! I was hoping you wouldn't say that we should use inputs_decode everywhere. Is there a reason why the entire inputs are not passed, instead of just one column?

@prabhuteja12
Author

Hello,

This discrepancy seems to lead to another issue when using distributed training: at L436, gather is done only on inputs_decode, not on the full inputs. But inputs is what is passed to compute_metrics, as described previously. A simple inputs = self.gather_function(inputs) solves the issue locally. What do you think about this problem?
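
For reference, a sketch of where that local workaround would sit in the batch_eval_metrics branch of evaluation_loop (illustrative only; the surrounding code may differ between versions):

    if self.args.batch_eval_metrics:
        # workaround sketch: gather the full per-batch inputs across processes,
        # mirroring the gather already applied to inputs_decode
        inputs = self.gather_function(inputs)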
