Integrate Automated QDQ benchmark - part 3.1#837
willg-nv wants to merge 2 commits into NVIDIA:main
Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.
📝 Walkthrough

This PR introduces a benchmarking framework for ONNX model evaluation within the quantization autotuning subsystem. It provides an abstract `Benchmark` base class with concrete `TrtExecBenchmark` and `TensorRTPyBenchmark` implementations.

Changes
Sequence Diagram(s)

sequenceDiagram
participant User
participant TensorRTPyBenchmark
participant TensorRT as TensorRT Engine
participant CUDA as CUDA Runtime
participant FileSystem
User->>TensorRTPyBenchmark: run(onnx_model_bytes)
TensorRTPyBenchmark->>FileSystem: Load/create timing cache
TensorRTPyBenchmark->>TensorRT: Load plugins if configured
TensorRTPyBenchmark->>TensorRT: Build engine from ONNX
TensorRTPyBenchmark->>TensorRT: Create context
TensorRTPyBenchmark->>CUDA: Allocate device memory
TensorRTPyBenchmark->>TensorRT: Warmup runs (N iterations)
CUDA-->>TensorRTPyBenchmark: Kernel execution
TensorRTPyBenchmark->>TensorRT: Timing runs (M iterations)
loop Collect Latencies
TensorRT->>CUDA: Execute inference
CUDA-->>TensorRTPyBenchmark: Timing data
end
TensorRTPyBenchmark->>TensorRTPyBenchmark: Calculate statistics (min/max/mean/median)
TensorRTPyBenchmark->>FileSystem: Save timing cache & results
TensorRTPyBenchmark-->>User: Return median latency (float)
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Around line 207-212: The parameter flush_timing_cache on TrtExecBenchmark.run
is declared but unused; either remove it from the signature or implement the
same behavior as TensorRTPyBenchmark.run by flushing the timing cache before
running. To implement, add a small helper (e.g., _flush_timing_cache) or call
the appropriate trtexec/engine cache clear routine at the start of
TrtExecBenchmark.run when flush_timing_cache is True, then proceed with the
existing benchmarking flow; if you prefer removal, delete the flush_timing_cache
parameter from TrtExecBenchmark.run and any callers to keep APIs consistent with
actual behavior.
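If the flag is kept, one possible shape for the trtexec path is sketched below. This is only an illustration of one reading of the comment (clearing the on-disk cache before the run so the build starts cold); `_flush_timing_cache` is the hypothetical helper name suggested above, and `self.timing_cache_file` is the cache path set in `__init__`. Both functions would live as methods on `TrtExecBenchmark`.

```python
import os


def _flush_timing_cache(self) -> None:
    """Drop the on-disk timing cache so the next trtexec build starts cold."""
    if os.path.exists(self.timing_cache_file):
        os.remove(self.timing_cache_file)


def run(self, path_or_bytes, log_file=None, flush_timing_cache=False) -> float:
    # Honor the flag instead of silently ignoring it.
    if flush_timing_cache:
        self._flush_timing_cache()
    # ... existing trtexec benchmarking flow continues here ...
```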
- Around line 262-266: The code raises NameError when log_file is None because
it references the undefined variable output; in the block that handles a failed
regex match (where you have match = re.search(self.latency_pattern,
result.stdout, ...)), replace the use of output with the already-available
result.stdout (or otherwise initialize output before the conditional) so the
logger.debug call uses a defined variable; update the code around the match
check that references log_file/result.stdout and self.logger to log
result.stdout instead of output when log_file is None.
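Under that reading, the failure branch could be written along these lines. This is a sketch only: the surrounding names (`result`, `self.latency_pattern`, `self.logger`) are taken from the comment above, and the `float("inf")` failure value follows the base-class docstring quoted later in this review.

```python
import re  # assumed to already be imported at module level

match = re.search(self.latency_pattern, result.stdout, re.MULTILINE)
if not match:
    if log_file is None:
        # Log the captured stdout directly; the previous code referenced
        # an undefined `output` variable here.
        self.logger.debug("trtexec output:\n%s", result.stdout)
    return float("inf")
```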
- Around line 676-691: The _save_timing_cache method has inverted and wrong
checks, a typo, unsafe finally cleanup, and incorrect serialization logic:
ensure a builder config is created before use (assign config =
self.builder.create_builder_config() before the try or inside try but guard
finally), only attempt to combine when self._timing_cache is not None (change
the condition from if self._timing_cache is None to if self._timing_cache is not
None), call output_cache.combine(self._timing_cache, ignore_errors=True) (fix
typo combine), then serialize the resulting cache (or if the intent is to
persist the existing cache directly, serialize self._timing_cache) and write it
to self.timing_cache_file; finally, in the finally block only delete or close
config if it was successfully created (e.g., check if 'config' in locals() or
use a variable initialized to None) to avoid UnboundLocalError.
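Putting those points together, the corrected method might look roughly like this. It is a sketch only: it assumes the standard TensorRT Python timing-cache API, passes `combine`'s ignore flag positionally to avoid guessing the keyword name, and reuses the `self.builder`, `self._timing_cache`, and `self.timing_cache_file` attributes named above.

```python
def _save_timing_cache(self) -> None:
    """Persist the in-memory timing cache to self.timing_cache_file."""
    if self._timing_cache is None:
        return  # nothing to persist
    config = None
    try:
        config = self.builder.create_builder_config()
        output_cache = config.create_timing_cache(b"")
        # Merge the current cache into the fresh one, ignoring mismatched entries.
        output_cache.combine(self._timing_cache, True)
        with open(self.timing_cache_file, "wb") as f:
            f.write(output_cache.serialize())
    finally:
        # Only clean up the config if it was actually created.
        if config is not None:
            del config
```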
- Line 145: The function parameter trtexec_args currently uses a mutable default
list (trtexec_args: list = []) which can be shared across calls; change the
signature to use None (e.g., trtexec_args: Optional[list] = None) and inside the
function (in the body of the same function that declares trtexec_args)
initialize it to an empty list if None (trtexec_args = trtexec_args or [])
before use; update the type hint to Optional[list] (or List[str]) and add the
necessary typing import if absent, ensuring all references to trtexec_args in
the function use the newly initialized local list.
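The suggested pattern is shown below on a hypothetical helper (the actual enclosing function at line 145 is not reproduced here). Since the module already uses PEP 604 unions such as `str | None`, `list[str] | None` would also fit in place of `Optional[list]`.

```python
def build_trtexec_command(onnx_path: str, trtexec_args: list[str] | None = None) -> list[str]:
    """Hypothetical example of the None-default idiom for trtexec_args."""
    # A fresh list is created per call, so no state leaks between invocations.
    trtexec_args = trtexec_args or []
    return ["trtexec", f"--onnx={onnx_path}", *trtexec_args]
```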
🧹 Nitpick comments (4)
modelopt/onnx/quantization/autotune/benchmark.py (4)
97-98: Unnecessary `global` statement.

The `global logger` declaration is not needed here since `logger` is already available at module scope from the import on line 57. The `global` keyword is only required when you intend to reassign a module-level variable inside a function/method.

Suggested fix:

```diff
-        global logger
         self.timing_cache_file = timing_cache_file or "/tmp/trtexec_timing.cache"  # nosec B108
```
543-544: Potential dtype mismatch for integer input tensors.

`np.random.randn(size)` generates `float64` values, which are then cast via `.astype(dtype)`. For integer dtypes, this truncates floats to integers (all becoming 0 for values in (-1, 1)). Consider using `np.random.randint` for integer types or ensuring the random data is appropriate for the tensor's dtype.

Suggested approach:

```diff
             if engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
-                np.copyto(host_mem, np.random.randn(size).astype(dtype))
+                if np.issubdtype(dtype, np.integer):
+                    np.copyto(host_mem, np.random.randint(0, 100, size=size, dtype=dtype))
+                else:
+                    np.copyto(host_mem, np.random.randn(size).astype(dtype))
             inputs.append({"host": host_mem, "device": device_mem, "name": tensor_name})
```
104-115: Base class signature mismatch with subclasses.

The abstract `run()` method has signature `run(path_or_bytes, log_file)`, but both `TrtExecBenchmark.run()` and `TensorRTPyBenchmark.run()` add a `flush_timing_cache` parameter. This breaks Liskov substitution - code using `Benchmark` type hints won't know about the extra parameter. Consider adding the parameter to the base class signature.

Suggested fix for base class:

```diff
     @abstractmethod
-    def run(self, path_or_bytes: str | bytes, log_file: str | None = None) -> float:
+    def run(self, path_or_bytes: str | bytes, log_file: str | None = None, flush_timing_cache: bool = False) -> float:
         """Run benchmark on the given ONNX model.

         Args:
             path_or_bytes: Path to the ONNX model (str) or raw model data (bytes)
             log_file: Optional path to save benchmark logs
+            flush_timing_cache: Whether to flush timing cache to disk after build

         Returns:
             Measured latency in milliseconds, or float("inf") on failure
         """
```
185-191: Inconsistent plugin validation behavior between benchmark implementations.

`TrtExecBenchmark` logs a warning and continues when a plugin library is not found (lines 188-189), while `TensorRTPyBenchmark` raises `FileNotFoundError` (line 359). Consider aligning the behavior - failing fast is generally safer for configuration errors.
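If the fail-fast option is chosen, the trtexec path could mirror `TensorRTPyBenchmark` with a small check like the hypothetical sketch below (the helper name and parameter are illustrative, not taken from the module).

```python
from pathlib import Path


def _check_plugin_libs(plugin_libs: list[str]) -> None:
    """Raise instead of warn when a configured plugin library is missing."""
    for lib in plugin_libs:
        if not Path(lib).is_file():
            raise FileNotFoundError(f"TensorRT plugin library not found: {lib}")
```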
Signed-off-by: Will Guo <willg@nvidia.com>
willg-nv force-pushed from 1311d20 to d7e16ac
Signed-off-by: Will Guo <willg@nvidia.com>
What does this PR do?
This PR integrates the benchmark module into the QDQ autotuner. The benchmark module is used to evaluate ONNX model performance.
This PR is 1/3 of #703. Once all of the smaller PRs are merged, #703 can be closed.
PR 3.1: #837
PR 3.2: #838
PR 3.3: #839
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit