
Integrate Automated QDQ benchmark - part 3.1 #837

Open
willg-nv wants to merge 2 commits into NVIDIA:main from willg-nv:dev-willg-integrate-auto-qdq-placement-part3.1

Conversation

willg-nv (Contributor) commented Feb 2, 2026

What does this PR do?

This PR integrates a benchmark module into the QDQ autotuner. The benchmark module is used to evaluate ONNX model performance.
This PR is the first of the three PRs split out from #703. Once all of the smaller PRs are merged, #703 can be closed.

PR 3.1: #837
PR 3.2: #838
PR 3.3: #839

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No, documentation will be added in part 4.
  • Did you update Changelog?: No, the changelog will be updated once all changes are merged.

Additional Information

Summary by CodeRabbit

  • New Features
    • Added ONNX quantization autotuning capabilities with a consolidated module providing streamlined import paths for core components.
    • Introduced unified benchmarking framework supporting TensorRT-based model evaluation with both command-line and Python API implementations.
    • Added support for timing cache persistence, custom plugin libraries, shape validation, and dynamic input shape configuration for flexible model testing and optimization.

✏️ Tip: You can customize this high-level summary in your review settings.

@willg-nv willg-nv requested a review from a team as a code owner February 2, 2026 02:51
@willg-nv willg-nv requested a review from gcunhase February 2, 2026 02:51
copy-pr-bot bot commented Feb 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot (Contributor) commented Feb 2, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR introduces a benchmarking framework for ONNX model evaluation within the quantization autotuning subsystem. It provides an abstract Benchmark interface with two concrete implementations: TrtExecBenchmark for CLI-based benchmarking via trtexec, and TensorRTPyBenchmark for Python API-based benchmarking via TensorRT. A package initializer consolidates public exports from the autotune submodules.

Changes

Public API Exports: modelopt/onnx/quantization/autotune/__init__.py
Centralizes and re-exports 17 public entities from the autotune subsystem, including benchmark classes, region abstractions, configuration, error types, and the search orchestration driver.

Benchmarking Framework: modelopt/onnx/quantization/autotune/benchmark.py
Introduces an abstract Benchmark base class and two concrete implementations: TrtExecBenchmark (CLI-based timing via trtexec) and TensorRTPyBenchmark (Python API-based timing via TensorRT, with CUDA integration, plugin loading, dynamic shape support, and statistics collection).

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant TensorRTPyBenchmark
    participant TensorRT as TensorRT Engine
    participant CUDA as CUDA Runtime
    participant FileSystem

    User->>TensorRTPyBenchmark: run(onnx_model_bytes)
    TensorRTPyBenchmark->>FileSystem: Load/create timing cache
    TensorRTPyBenchmark->>TensorRT: Load plugins if configured
    TensorRTPyBenchmark->>TensorRT: Build engine from ONNX
    TensorRTPyBenchmark->>TensorRT: Create context
    
    TensorRTPyBenchmark->>CUDA: Allocate device memory
    TensorRTPyBenchmark->>TensorRT: Warmup runs (N iterations)
    CUDA-->>TensorRTPyBenchmark: Kernel execution
    
    TensorRTPyBenchmark->>TensorRT: Timing runs (M iterations)
    loop Collect Latencies
        TensorRT->>CUDA: Execute inference
        CUDA-->>TensorRTPyBenchmark: Timing data
    end
    
    TensorRTPyBenchmark->>TensorRTPyBenchmark: Calculate statistics (min/max/mean/median)
    TensorRTPyBenchmark->>FileSystem: Save timing cache & results
    TensorRTPyBenchmark-->>User: Return median latency (float)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 3 passed

Title check: ✅ Passed. The title 'Integrate Automated QDQ benchmark - part 3.1' directly reflects the main change: introducing the benchmark module (TrtExecBenchmark, TensorRTPyBenchmark) into the QDQ autotuner package, as confirmed by the added files and PR objectives.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `modelopt/onnx/quantization/autotune/benchmark.py`:
- Around lines 207-212: The flush_timing_cache parameter on TrtExecBenchmark.run is declared but never used. Either remove it from the signature (and from any callers) so the API matches the actual behavior, or implement it the same way TensorRTPyBenchmark.run does: when flush_timing_cache is True, flush the timing cache at the start of the run (for example via a small helper such as _flush_timing_cache or the appropriate trtexec/engine cache-clear step) before proceeding with the existing benchmarking flow.
- Around lines 262-266: The code raises NameError when log_file is None because it references an undefined variable output. In the branch that handles a failed regex match (match = re.search(self.latency_pattern, result.stdout, ...)), log the already-available result.stdout instead of output (or initialize output before the conditional) so the logger.debug call always uses a defined variable.
- Around lines 676-691: _save_timing_cache has an inverted condition, a typo, incorrect serialization logic, and unsafe cleanup in the finally block (see the sketches after this list). Create the builder config (config = self.builder.create_builder_config()) before the try block, or inside it with the finally block guarded. Only attempt to combine when self._timing_cache is not None (the current check uses is None). Call output_cache.combine(self._timing_cache, ignore_errors=True), fixing the typo in combine, then serialize the resulting cache (or self._timing_cache directly, if the intent is to persist the existing cache) and write it to self.timing_cache_file. In the finally block, only delete or close config if it was actually created (for example, initialize it to None first or check 'config' in locals()) to avoid UnboundLocalError.
- Line 145: The trtexec_args parameter uses a mutable default list (trtexec_args: list = []), which can be shared across calls. Change the signature to default to None (for example trtexec_args: Optional[list] = None, adding the typing import if needed) and initialize it to an empty list in the function body (trtexec_args = trtexec_args or []) before use, so that every call and all later references operate on a fresh local list (see the sketches after this list).
🧹 Nitpick comments (4)
modelopt/onnx/quantization/autotune/benchmark.py (4)

97-98: Unnecessary global statement.

The global logger declaration is not needed here since logger is already available at module scope from the import on line 57. The global keyword is only required when you intend to reassign a module-level variable inside a function/method.

Suggested fix
-        global logger
         self.timing_cache_file = timing_cache_file or "/tmp/trtexec_timing.cache"  # nosec B108

543-544: Potential dtype mismatch for integer input tensors.

np.random.randn(size) generates float64 values, which are then cast via .astype(dtype). For integer dtypes, this truncates floats to integers (all becoming 0 for values in (-1, 1)). Consider using np.random.randint for integer types or ensuring the random data is appropriate for the tensor's dtype.

Suggested approach
                 if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
-                    np.copyto(host_mem, np.random.randn(size).astype(dtype))
+                    if np.issubdtype(dtype, np.integer):
+                        np.copyto(host_mem, np.random.randint(0, 100, size=size, dtype=dtype))
+                    else:
+                        np.copyto(host_mem, np.random.randn(size).astype(dtype))
                     inputs.append({"host": host_mem, "device": device_mem, "name": tensor_name})

104-115: Base class signature mismatch with subclasses.

The abstract run() method has signature run(path_or_bytes, log_file), but both TrtExecBenchmark.run() and TensorRTPyBenchmark.run() add a flush_timing_cache parameter. This breaks Liskov substitution - code using Benchmark type hints won't know about the extra parameter. Consider adding the parameter to the base class signature.

Suggested fix for base class
     @abstractmethod
-    def run(self, path_or_bytes: str | bytes, log_file: str | None = None) -> float:
+    def run(self, path_or_bytes: str | bytes, log_file: str | None = None, flush_timing_cache: bool = False) -> float:
         """Run benchmark on the given ONNX model.

         Args:
             path_or_bytes: Path to the ONNX model (str) or raw model data (bytes)
             log_file: Optional path to save benchmark logs
+            flush_timing_cache: Whether to flush timing cache to disk after build

         Returns:
             Measured latency in milliseconds, or float("inf") on failure
         """

185-191: Inconsistent plugin validation behavior between benchmark implementations.

TrtExecBenchmark logs a warning and continues when a plugin library is not found (line 188-189), while TensorRTPyBenchmark raises FileNotFoundError (line 359). Consider aligning the behavior - failing fast is generally safer for configuration errors.

Signed-off-by: Will Guo <willg@nvidia.com>
willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.1 branch from 1311d20 to d7e16ac on February 2, 2026 13:49
Signed-off-by: Will Guo <willg@nvidia.com>