
Integrate Automated QDQ benchmark - part 3.1 #837

Open
willg-nv wants to merge 2 commits into NVIDIA:main from willg-nv:dev-willg-integrate-auto-qdq-placement-part3.1

Conversation

willg-nv (Contributor) commented Feb 2, 2026

What does this PR do?

This PR integrates a benchmark module into the QDQ autotuner. The benchmark module is used to evaluate ONNX model performance.
This PR is the first of the three PRs split out from #703. Once all of the smaller PRs are merged, #703 can be closed.

PR 3.1: #837
PR 3.2: #838
PR 3.3: #839

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No, documentation will be added in part 4.
  • Did you update Changelog?: No, the changelog will be updated once all changes are merged.

Additional Information

Summary by CodeRabbit

  • New Features
    • Added ONNX quantization autotuning capabilities with a consolidated module providing streamlined import paths for core components.
    • Introduced unified benchmarking framework supporting TensorRT-based model evaluation with both command-line and Python API implementations.
    • Added support for timing cache persistence, custom plugin libraries, shape validation, and dynamic input shape configuration for flexible model testing and optimization.

✏️ Tip: You can customize this high-level summary in your review settings.

@willg-nv willg-nv requested a review from a team as a code owner February 2, 2026 02:51
@willg-nv willg-nv requested a review from gcunhase February 2, 2026 02:51
copy-pr-bot bot commented Feb 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot (Contributor) commented Feb 2, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR introduces a benchmarking framework for ONNX model evaluation within the quantization autotuning subsystem. It provides an abstract Benchmark interface with two concrete implementations: TrtExecBenchmark for CLI-based benchmarking via trtexec, and TensorRTPyBenchmark for Python API-based benchmarking via TensorRT. A package initializer consolidates public exports from the autotune submodules.

Changes

Public API Exports: modelopt/onnx/quantization/autotune/__init__.py
Centralizes and re-exports 17 public entities from the autotune subsystem, including benchmark classes, region abstractions, configuration, error types, and the search orchestration driver.

Benchmarking Framework: modelopt/onnx/quantization/autotune/benchmark.py
Introduces an abstract Benchmark base class and two concrete implementations: TrtExecBenchmark (CLI-based timing via trtexec) and TensorRTPyBenchmark (Python API-based timing via TensorRT, with CUDA integration, plugin loading, dynamic shape support, and statistics collection).

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant TensorRTPyBenchmark
    participant TensorRT as TensorRT Engine
    participant CUDA as CUDA Runtime
    participant FileSystem

    User->>TensorRTPyBenchmark: run(onnx_model_bytes)
    TensorRTPyBenchmark->>FileSystem: Load/create timing cache
    TensorRTPyBenchmark->>TensorRT: Load plugins if configured
    TensorRTPyBenchmark->>TensorRT: Build engine from ONNX
    TensorRTPyBenchmark->>TensorRT: Create context
    
    TensorRTPyBenchmark->>CUDA: Allocate device memory
    TensorRTPyBenchmark->>TensorRT: Warmup runs (N iterations)
    CUDA-->>TensorRTPyBenchmark: Kernel execution
    
    TensorRTPyBenchmark->>TensorRT: Timing runs (M iterations)
    loop Collect Latencies
        TensorRT->>CUDA: Execute inference
        CUDA-->>TensorRTPyBenchmark: Timing data
    end
    
    TensorRTPyBenchmark->>TensorRTPyBenchmark: Calculate statistics (min/max/mean/median)
    TensorRTPyBenchmark->>FileSystem: Save timing cache & results
    TensorRTPyBenchmark-->>User: Return median latency (float)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 3 passed

Title check: ✅ Passed. The title 'Integrate Automated QDQ benchmark - part 3.1' directly reflects the main change: introducing the benchmark module (TrtExecBenchmark, TensorRTPyBenchmark) into the QDQ autotuner package, as confirmed by the added files and PR objectives.
Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `modelopt/onnx/quantization/autotune/benchmark.py`:
- Around lines 207-212: The flush_timing_cache parameter on TrtExecBenchmark.run is declared but never used. Either remove it from the signature (and from any callers) so the API matches the actual behavior, or implement it the same way TensorRTPyBenchmark.run does: when flush_timing_cache is True, flush the timing cache at the start of the run (for example via a small helper such as _flush_timing_cache or the appropriate trtexec/engine cache-clear step) before proceeding with the existing benchmarking flow.
- Around lines 262-266: The code raises NameError when log_file is None because it references an undefined variable output. In the branch that handles a failed regex match (match = re.search(self.latency_pattern, result.stdout, ...)), log the already-available result.stdout instead of output (or initialize output before the conditional) so the logger.debug call always uses a defined variable.
- Around lines 676-691: _save_timing_cache has an inverted condition, a typo, incorrect serialization logic, and unsafe cleanup in the finally block (see the sketches after this list). Create the builder config (config = self.builder.create_builder_config()) before the try block, or inside it with the finally block guarded. Only attempt to combine when self._timing_cache is not None (the current check uses is None). Call output_cache.combine(self._timing_cache, ignore_errors=True), fixing the typo in combine, then serialize the resulting cache (or self._timing_cache directly, if the intent is to persist the existing cache) and write it to self.timing_cache_file. In the finally block, only delete or close config if it was actually created (for example, initialize it to None first or check 'config' in locals()) to avoid UnboundLocalError.
- Line 145: The trtexec_args parameter uses a mutable default list (trtexec_args: list = []), which can be shared across calls. Change the signature to default to None (for example trtexec_args: Optional[list] = None, adding the typing import if needed) and initialize it to an empty list in the function body (trtexec_args = trtexec_args or []) before use, so that every call and all later references operate on a fresh local list (see the sketches after this list).
🧹 Nitpick comments (4)
modelopt/onnx/quantization/autotune/benchmark.py (4)

97-98: Unnecessary global statement.

The global logger declaration is not needed here since logger is already available at module scope from the import on line 57. The global keyword is only required when you intend to reassign a module-level variable inside a function/method.

Suggested fix
-        global logger
         self.timing_cache_file = timing_cache_file or "/tmp/trtexec_timing.cache"  # nosec B108

543-544: Potential dtype mismatch for integer input tensors.

np.random.randn(size) generates float64 values, which are then cast via .astype(dtype). For integer dtypes, this truncates floats to integers (all becoming 0 for values in (-1, 1)). Consider using np.random.randint for integer types or ensuring the random data is appropriate for the tensor's dtype.

Suggested approach
                 if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
-                    np.copyto(host_mem, np.random.randn(size).astype(dtype))
+                    if np.issubdtype(dtype, np.integer):
+                        np.copyto(host_mem, np.random.randint(0, 100, size=size, dtype=dtype))
+                    else:
+                        np.copyto(host_mem, np.random.randn(size).astype(dtype))
                     inputs.append({"host": host_mem, "device": device_mem, "name": tensor_name})

104-115: Base class signature mismatch with subclasses.

The abstract run() method has signature run(path_or_bytes, log_file), but both TrtExecBenchmark.run() and TensorRTPyBenchmark.run() add a flush_timing_cache parameter. This breaks Liskov substitution - code using Benchmark type hints won't know about the extra parameter. Consider adding the parameter to the base class signature.

Suggested fix for base class
     @abstractmethod
-    def run(self, path_or_bytes: str | bytes, log_file: str | None = None) -> float:
+    def run(self, path_or_bytes: str | bytes, log_file: str | None = None, flush_timing_cache: bool = False) -> float:
         """Run benchmark on the given ONNX model.

         Args:
             path_or_bytes: Path to the ONNX model (str) or raw model data (bytes)
             log_file: Optional path to save benchmark logs
+            flush_timing_cache: Whether to flush timing cache to disk after build

         Returns:
             Measured latency in milliseconds, or float("inf") on failure
         """

185-191: Inconsistent plugin validation behavior between benchmark implementations.

TrtExecBenchmark logs a warning and continues when a plugin library is not found (line 188-189), while TensorRTPyBenchmark raises FileNotFoundError (line 359). Consider aligning the behavior - failing fast is generally safer for configuration errors.

Signed-off-by: Will Guo <willg@nvidia.com>
willg-nv force-pushed the dev-willg-integrate-auto-qdq-placement-part3.1 branch from 1311d20 to d7e16ac on February 2, 2026 13:49
Signed-off-by: Will Guo <willg@nvidia.com>