Commit edf3233

shunting314 authored and facebook-github-bot committed
don't return logits for benchmark script (#151075)
Summary:
The PT2 benchmark scripts have a training pattern like:

```
def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
    cloned_inputs = clone_inputs(inputs)
    self.optimizer_zero_grad(mod)
    with self.autocast(**self.autocast_arg):
        pred = mod(**cloned_inputs)
        loss = self.compute_loss(pred)
    self.grad_scaler.scale(loss).backward()
    self.optimizer_step()
    if collect_outputs:
        return collect_results(mod, pred, loss, cloned_inputs)
    return None
```

The `collect_outputs` argument is True only for accuracy testing and False for performance testing. For the HF benchmark suite, a model usually returns the tuple (loss, logits). For performance testing, even though the logits are never used anywhere, dynamo has to keep them because of the control flow. Keeping the logits here has two downsides:
1. Peak memory is higher, since the logits tensor is large and its memory cannot be released earlier.
2. We cannot apply optimizations such as chunking to the logits, because the tensor has to be returned from the pre-grad graph.

I think it is fine not to return the logits at all:
- For training cases, checking the loss and gradients for accuracy is good enough. It is hard to imagine two runs having mismatched logits but matching loss/gradients.
- Discarding the logits as soon as possible also makes the perf benchmarking fairer for us.

On the other hand, it may be interesting to let dynamo support something like dynamo.constexpr (similar to tl.constexpr). A variable annotated as dynamo.constexpr would be specialized at compile time, enabling more compile-time optimization (e.g. DCE). (A small [repro](https://gist.github.com/shunting314/0912a8947028a904c34f361021b8024d).)

Benchmark results are here: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2004%20Apr%202025%2018%3A03%3A26%20GMT&stopTime=Fri%2C%2011%20Apr%202025%2018%3A03%3A26%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/204/head&lCommit=fe25dab3f65e1b0e9db0af03f7664af70fcc9c66&rBranch=main&rCommit=55e62ff74ad5614faf80b060c7bfc551e3b7af5a)
- HF: 15% peak memory improvement (1.51 -> 1.66 compression ratio).
- I also see a 5% (2.74x -> 2.79x) perf win for HF. It could be real: we may generate more efficient kernels since we no longer need to keep the logits and return them from the pre-grad graph. But I'll double check.

X-link: pytorch/pytorch#151075

Approved by: https://github.com/eellison, https://github.com/jansel

Reviewed By: Camyll

Differential Revision: D73068291

fbshipit-source-id: 709218784f3f0673f434cf4fc5094f0fd64dfeee
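Below is a minimal, hedged sketch of the effect described in the summary (it is not the linked gist, and all names are illustrative rather than part of the benchmark scripts): a compiled function that returns its large logits tensor must keep it as a graph output, while one that returns only the loss lets the buffer be released right after the reduction.

```python
import torch

@torch.compile
def step(x, w, collect_outputs: bool):
    logits = x @ w                                  # large intermediate, analogous to HF logits
    loss = logits.float().logsumexp(dim=-1).mean()  # stand-in for compute_loss
    if collect_outputs:
        # Accuracy-style run: the logits become a graph output and stay alive
        # until the caller releases them.
        return loss, logits
    # Perf-style run: the logits can be freed (or fused away) right after the
    # reduction, lowering peak memory.
    return loss, None

x = torch.randn(64, 2048)
w = torch.randn(2048, 50257)
loss, _ = step(x, w, collect_outputs=False)
```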
1 parent 2a7fdca commit edf3233

File tree

3 files changed, +6 -5 lines


userbenchmark/dynamo/dynamobench/huggingface.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -536,7 +536,7 @@ def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
         self.grad_scaler.scale(loss).backward()
         self.optimizer_step()
         if collect_outputs:
-            return collect_results(mod, pred, loss, cloned_inputs)
+            return collect_results(mod, None, loss, cloned_inputs)
         return None
```

userbenchmark/dynamo/dynamobench/timm_models.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -428,7 +428,7 @@ def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
         self.grad_scaler.scale(loss).backward()
         self.optimizer_step()
         if collect_outputs:
-            return collect_results(mod, pred, loss, cloned_inputs)
+            return collect_results(mod, None, loss, cloned_inputs)
         return None
```

userbenchmark/dynamo/dynamobench/torchbench.py

Lines changed: 4 additions & 3 deletions

```diff
@@ -85,8 +85,9 @@ def process_hf_whisper_output(out):
     out_ret = []
     for i, elem in enumerate(out):
         if i == 0:
-            assert isinstance(elem, dict)
-            out_ret.append({k: v for k, v in elem.items() if k != "logits"})
+            if elem is not None:
+                assert isinstance(elem, dict)
+                out_ret.append({k: v for k, v in elem.items() if k != "logits"})
         elif i != 1:
             out_ret.append(elem)

@@ -470,7 +471,7 @@ def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
         self.grad_scaler.scale(loss).backward()
         self.optimizer_step()
         if collect_outputs:
-            return collect_results(mod, pred, loss, cloned_inputs)
+            return collect_results(mod, None, loss, cloned_inputs)
         return None
```
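Across all three files, the accuracy path now passes None in place of pred, so collect_results compares the module state (gradients), the loss, and the inputs rather than the logits. As a rough sketch of the commit's argument that loss plus gradients is enough, assuming an illustrative helper (grads_and_loss_match is not the benchmark's actual collect_results/accuracy machinery):

```python
import torch

def grads_and_loss_match(ref_mod, test_mod, ref_loss, test_loss, rtol=1e-3, atol=1e-3):
    # Two runs are treated as matching if the losses agree and every parameter
    # gradient agrees; the logits are never consulted.
    if not torch.allclose(ref_loss, test_loss, rtol=rtol, atol=atol):
        return False
    for p_ref, p_test in zip(ref_mod.parameters(), test_mod.parameters()):
        if (p_ref.grad is None) != (p_test.grad is None):
            return False
        if p_ref.grad is not None and not torch.allclose(
            p_ref.grad, p_test.grad, rtol=rtol, atol=atol
        ):
            return False
    return True
```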
