feat: Baseten contrib third-party-dataset support#851

Open
michaelfeil wants to merge 3 commits into NVIDIA:main from michaelfeil:mf/support-third-party-dataset

Conversation

@michaelfeil
Contributor

@michaelfeil michaelfeil commented Feb 3, 2026

What does this PR do?

Type of change: ?

Overview: ?
Background: the cnn_dailymail outage caused a couple of issues on our platform, with 20+ customers asking about broken support for their own datasets.

  • We still need to support trt-engines, which means e.g. the modelopt version cannot easily be upgraded to the latest release.
  • Applying the chat template to short-context or out-of-distribution material leads to stronger degradation on harder generation tasks. This is visible when running e.g. the bfcl benchmark, where the model generates a lot of tokens; the degradation with modelopt is large (e.g. 5%+).

Going forward, we are looking to get stable support for customer-brought datasets into modelopt, because the existing options do not suffice. Hard validation is not desirable: many Baseten users will see errors like abisee/cnn_dailymail is not in supported datasets, because we use modelopt 0.35.x.

Usage

llm_ptq --dataset baseten/quant_calibration_dataset_v1


Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features

    • Extended dataset loading to support third‑party and custom datasets, including message-style, prompt, and text formats.
    • Tokenizer-aware processing and an option to apply a chat template for message-style datasets.
    • Data-loading pipeline now propagates tokenizer context through sample retrieval.
  • Bug Fixes / Improvements

    • Improved warnings and validation for datasets that lack expected structures or chat-template support.

@michaelfeil michaelfeil requested a review from a team as a code owner February 3, 2026 22:57
@copy-pr-bot

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 3, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Adds support for loading unregistered (third-party) datasets by introducing an internal helper to extract texts/messages, and extends get_dataset_samples() with apply_chat_template and tokenizer parameters; internal callers (including get_dataset_dataloader) were updated to forward the tokenizer, and chat-template handling is validated against the dataset structure.

Changes

Cohort / File(s): Dataset utilities (modelopt/torch/utils/dataset_utils.py)
Summary: Added _third_party_get_dataset_samples(dataset_name, num_samples, tokenizer) to handle third-party datasets with messages, prompt, or text columns. Updated get_dataset_samples(..., *, apply_chat_template: bool = False, tokenizer=None) to delegate to the new helper for unregistered datasets, emit warnings for unsupported structures, and validate chat-template application. Internal calls were updated to pass the tokenizer, and get_dataset_dataloader now propagates it when fetching samples.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant get_dataset_samples
    participant SUPPORTED_CONFIG
    participant ThirdPartyHandler as _third_party_get_dataset_samples
    participant DatasetLoader
    participant Tokenizer

    Caller->>get_dataset_samples: get_dataset_samples(name, num_samples, apply_chat_template, tokenizer)
    get_dataset_samples->>SUPPORTED_CONFIG: is dataset registered?
    alt registered
        get_dataset_samples->>DatasetLoader: load via config
        DatasetLoader-->>get_dataset_samples: texts
    else not registered
        get_dataset_samples->>ThirdPartyHandler: delegate (name, num_samples, tokenizer)
        ThirdPartyHandler->>DatasetLoader: load third-party dataset
        DatasetLoader-->>ThirdPartyHandler: raw records
        alt records contain messages
            ThirdPartyHandler->>Tokenizer: require tokenizer with chat template
            Tokenizer-->>ThirdPartyHandler: formatted texts
        else records contain prompts or text column
            ThirdPartyHandler->>ThirdPartyHandler: extract prompt/text column
        else unsupported structure
            ThirdPartyHandler-->>get_dataset_samples: raise error / warn
        end
        ThirdPartyHandler-->>get_dataset_samples: texts
    end
    get_dataset_samples-->>Caller: return texts

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
--- | --- | ---
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Title check | ✅ Passed | The title accurately reflects the main change: adding third-party dataset support, which aligns with the feature changes in dataset_utils.py.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, above the required threshold of 80.00%.


@michaelfeil michaelfeil force-pushed the mf/support-third-party-dataset branch from 58008a4 to 6a0059b Compare February 3, 2026 22:58
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 166-170: Fix the typo in the NotImplementedError message inside
dataset_utils.py where the error is raised: change "thrid-party" to
"third-party" in the string that references dataset_name and
get_supported_datasets(); update the message attached to the raise in the block
that checks supported datasets (the f"Dataset {dataset_name} is not
supported..." raise) so the wording reads "third-party" correctly.
- Around line 121-123: Fix the typo in the warning message emitted in
dataset_utils.py: update the string in the warn(...) call that currently reads
"Loading third-party datset {dataset_name} ..." to "Loading third-party dataset
{dataset_name} ..." so the variable dataset_name and the call to
get_supported_datasets() remain unchanged and the warning text is correct.
- Around line 152-159: The error messages incorrectly say "Column {i}" while `i`
is the row/sample index; update both ValueError messages to refer to "Row {i}"
or "Sample {i}" (e.g., "Row {i} in dataset {dataset_name} has no messages..."
and "Row {i} in dataset {dataset_name} has empty text...") where `i`,
`dataset_name`, and the tokenizer.apply_chat_template call appear in
dataset_utils.py so logs accurately reflect the offending sample.
- Around line 145-164: The code ignores the num_samples parameter and thus
iterates the entire (possibly streaming) dataset; update the branches that build
texts to respect num_samples: in the "messages" branch (loop over
dataset/enumerate where tokenizer.apply_chat_template is used) stop collecting
once i >= num_samples (or use itertools.islice to limit iteration), and in the
"prompt" and "text" branches replace the full-list comprehensions with a loop
that appends each sample["prompt"] or sample["text"] and breaks when the
collected count reaches num_samples; ensure num_samples=None means no limit and
keep the same validations (empty messages/text) when samples are collected.
🧹 Nitpick comments (2)
modelopt/torch/utils/dataset_utils.py (2)

131-141: Redundant texts = [] initialization.

texts is initialized at line 131, then re-initialized at line 141 inside the messages branch. The first initialization at line 131 can be removed since each branch handles its own initialization.

♻️ Proposed fix
-    texts = []
     if "messages" in dataset.column_names:
         if tokenizer is None:
             raise ValueError(

195-201: Behavior inconsistency between supported and third-party datasets regarding apply_chat_template.

For supported datasets, apply_chat_template is controlled by the caller. For third-party datasets, chat templates are automatically applied when a messages column exists, ignoring the apply_chat_template parameter.

If this is intentional (since third-party datasets with messages columns require chat templates to be useful), consider documenting this behavior in the docstring. Otherwise, consider passing apply_chat_template to _third_party_get_dataset_samples for consistent behavior.

Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
@michaelfeil michaelfeil force-pushed the mf/support-third-party-dataset branch from 6a0059b to a5eb313 Compare February 3, 2026 23:07
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
@michaelfeil
Contributor Author

@CodeRabbit review again.

@coderabbitai
Contributor

coderabbitai bot commented Feb 3, 2026

@michaelfeil, I'll review the changes in this PR now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@michaelfeil
Contributor Author

@CodeRabbit full review

@coderabbitai
Contributor

coderabbitai bot commented Feb 3, 2026

✅ Actions performed

Full review triggered.



def _third_party_get_dataset_samples(
Collaborator

@kevalmorabia97 kevalmorabia97 Feb 4, 2026


Thanks for contributing!

I think there are 2 things being done here which we could simplify further:

  1. Support tools in tokenizer.apply_chat_template, which we can add to the existing get_dataset_samples() with an optional tools_key in the dataset config in SUPPORTED_DATASET_CONFIG
  2. Custom dataset: I think we just need to import and overwrite SUPPORTED_DATASET_CONFIG from this file, then we don't need a separate function for third-party datasets (see the sketch after this list)

Contributor Author

@michaelfeil michaelfeil Feb 4, 2026


Thanks for reviewing my PR again and for the fast response.

We have 30+ datasets, some of them from customers. How can we make sure these are added? Do I need to register each one of them in the config? The goal of this PR is to allow datasets that are not in SUPPORTED_DATASET_CONFIG and just duck-type the datasets.

https://huggingface.co/datasets/baseten/ptq-on-policy-nemotron-KimiK2.5
https://huggingface.co/datasets/baseten/ptq-on-policy-nemotron-GLM-4.7
https://huggingface.co/datasets/baseten/ptq-on-policy-nemotron-Kimi-K2-v2
https://huggingface.co/datasets/baseten/quant_calibration_dataset_v1

I think it would be best supported with a separate function.

Contributor Author


Motivation behind this (see the Baseten-NVIDIA channel): https://basetenlabs.slack.com/archives/C04BUDD86FR/p1770160944160069?thread_ts=1770160618.661969&cid=C04BUDD86FR.

We essentially never want a `cnn_dailymail` not-available scenario to impact customers on Baseten again. SUPPORTED_DATASET_CONFIG has brought us a ton of pain, and monkey-patching a module-level variable at global import time plus TRT MPI magic is not working well.

Contributor Author


After the incident with modelopt I am super biased, but I think SUPPORTED_DATASET_CONFIG is just not a good concept, and the concatenation of the user strings via \n was a quality issue in the PTQ version of the model (it obviously needs to follow the exact chat-template tokens for an outlier-free amax).

Contributor Author


@kevalmorabia97 Is there an update here?
