feat: Baseten contrib third-party-dataset support#851

Open
michaelfeil wants to merge 3 commits into NVIDIA:main from michaelfeil:mf/support-third-party-dataset

Conversation

@michaelfeil
Contributor

@michaelfeil michaelfeil commented Feb 3, 2026

What does this PR do?

Type of change: ?

Overview: ?
Background: the cnn_dailymail outage caused a couple of issues on our platform, with 20+ customers asking about broken support for their own datasets.

  • We still need to support trt-engines, which means e.g. the modelopt version cannot easily be upgraded to the latest release.
  • Applying the chat template to short-context or out-of-distribution material leads to stronger degradation on harder generation tasks. This is visible when running e.g. the bfcl benchmark, where the model generates a lot of tokens; the degradation with modelopt is large (e.g. 5%+).

Going forward, we are looking to get stable support for customer-brought datasets into modelopt, because the existing options do not suffice. Hard validation is not desirable: many Baseten users will see errors like abisee/cnn_dailymail is not in supported datasets, because we use modelopt 0.35.x.

Usage

llm_ptq --dataset baseten/quant_calibration_dataset_v1


Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features

    • Extended dataset loading to support third‑party and custom datasets, including message-style, prompt, and text formats.
    • Tokenizer-aware processing and an option to apply a chat template for message-style datasets.
    • Data-loading pipeline now propagates tokenizer context through sample retrieval.
  • Bug Fixes / Improvements

    • Improved warnings and validation for datasets that lack expected structures or chat-template support.

@michaelfeil michaelfeil requested a review from a team as a code owner February 3, 2026 22:57
@copy-pr-bot

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Feb 3, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Adds support for loading unregistered (third-party) datasets by introducing an internal helper to extract texts/messages, and extends get_dataset_samples() with apply_chat_template and tokenizer parameters; internal callers (including get_dataset_dataloader) were updated to forward the tokenizer, and chat-template handling is validated against the dataset structure.

Changes

Cohort / File(s): Dataset utilities (modelopt/torch/utils/dataset_utils.py)
Summary: Added _third_party_get_dataset_samples(dataset_name, num_samples, tokenizer) to handle third-party datasets with messages, prompt, or text columns. Updated get_dataset_samples(..., *, apply_chat_template: bool = False, tokenizer=None) to delegate to the new helper for unregistered datasets, emit warnings for unsupported structures, and validate chat-template application. Internal calls were updated to pass the tokenizer, and get_dataset_dataloader now propagates it when fetching samples.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant get_dataset_samples
    participant SUPPORTED_CONFIG
    participant ThirdPartyHandler as _third_party_get_dataset_samples
    participant DatasetLoader
    participant Tokenizer

    Caller->>get_dataset_samples: get_dataset_samples(name, num_samples, apply_chat_template, tokenizer)
    get_dataset_samples->>SUPPORTED_CONFIG: is dataset registered?
    alt registered
        get_dataset_samples->>DatasetLoader: load via config
        DatasetLoader-->>get_dataset_samples: texts
    else not registered
        get_dataset_samples->>ThirdPartyHandler: delegate (name, num_samples, tokenizer)
        ThirdPartyHandler->>DatasetLoader: load third-party dataset
        DatasetLoader-->>ThirdPartyHandler: raw records
        alt records contain messages
            ThirdPartyHandler->>Tokenizer: require tokenizer with chat template
            Tokenizer-->>ThirdPartyHandler: formatted texts
        else records contain prompts or text column
            ThirdPartyHandler->>ThirdPartyHandler: extract prompt/text column
        else unsupported structure
            ThirdPartyHandler-->>get_dataset_samples: raise error / warn
        end
        ThirdPartyHandler-->>get_dataset_samples: texts
    end
    get_dataset_samples-->>Caller: return texts

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
--- | --- | ---
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Title check | ✅ Passed | The title accurately reflects the main change: adding third-party dataset support, which aligns with the feature changes in dataset_utils.py.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, above the required threshold of 80.00%.


@michaelfeil michaelfeil force-pushed the mf/support-third-party-dataset branch from 58008a4 to 6a0059b Compare February 3, 2026 22:58
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 166-170: Fix the typo in the NotImplementedError message inside
dataset_utils.py where the error is raised: change "thrid-party" to
"third-party" in the string that references dataset_name and
get_supported_datasets(); update the message attached to the raise in the block
that checks supported datasets (the f"Dataset {dataset_name} is not
supported..." raise) so the wording reads "third-party" correctly.
- Around line 121-123: Fix the typo in the warning message emitted in
dataset_utils.py: update the string in the warn(...) call that currently reads
"Loading third-party datset {dataset_name} ..." to "Loading third-party dataset
{dataset_name} ..." so the variable dataset_name and the call to
get_supported_datasets() remain unchanged and the warning text is correct.
- Around line 152-159: The error messages incorrectly say "Column {i}" while `i`
is the row/sample index; update both ValueError messages to refer to "Row {i}"
or "Sample {i}" (e.g., "Row {i} in dataset {dataset_name} has no messages..."
and "Row {i} in dataset {dataset_name} has empty text...") where `i`,
`dataset_name`, and the tokenizer.apply_chat_template call appear in
dataset_utils.py so logs accurately reflect the offending sample.
- Around line 145-164: The code ignores the num_samples parameter and thus
iterates the entire (possibly streaming) dataset; update the branches that build
texts to respect num_samples: in the "messages" branch (loop over
dataset/enumerate where tokenizer.apply_chat_template is used) stop collecting
once i >= num_samples (or use itertools.islice to limit iteration), and in the
"prompt" and "text" branches replace the full-list comprehensions with a loop
that appends each sample["prompt"] or sample["text"] and breaks when the
collected count reaches num_samples; ensure num_samples=None means no limit and
keep the same validations (empty messages/text) when samples are collected.
🧹 Nitpick comments (2)
modelopt/torch/utils/dataset_utils.py (2)

131-141: Redundant texts = [] initialization.

texts is initialized at line 131, then re-initialized at line 141 inside the messages branch. The first initialization at line 131 can be removed since each branch handles its own initialization.

♻️ Proposed fix
-    texts = []
     if "messages" in dataset.column_names:
         if tokenizer is None:
             raise ValueError(

195-201: Behavior inconsistency between supported and third-party datasets regarding apply_chat_template.

For supported datasets, apply_chat_template is controlled by the caller. For third-party datasets, chat templates are automatically applied when a messages column exists, ignoring the apply_chat_template parameter.

If this is intentional (since third-party datasets with messages columns require chat templates to be useful), consider documenting this behavior in the docstring. Otherwise, consider passing apply_chat_template to _third_party_get_dataset_samples for consistent behavior.

Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
@michaelfeil michaelfeil force-pushed the mf/support-third-party-dataset branch from 6a0059b to a5eb313 Compare February 3, 2026 23:07
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
@michaelfeil
Contributor Author

@CodeRabbit review again.

@coderabbitai
Contributor

coderabbitai bot commented Feb 3, 2026

@michaelfeil, I'll review the changes in this PR now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@michaelfeil
Contributor Author

@CodeRabbit full review

@coderabbitai
Contributor

coderabbitai bot commented Feb 3, 2026

✅ Actions performed

Full review triggered.



def _third_party_get_dataset_samples(
Collaborator

@kevalmorabia97 kevalmorabia97 Feb 4, 2026


Thanks for contributing!

I think there are 2 things being done here which we could simplify further:

  1. Support tools in tokenizer.apply_chat_template, which we can add to the existing get_dataset_samples() with an optional tools_key in the dataset config in SUPPORTED_DATASET_CONFIG
  2. Custom dataset: I think we just need to import and overwrite SUPPORTED_DATASET_CONFIG from this file, then we don't need a separate function for third-party datasets (see the sketch after this list)

Contributor Author

@michaelfeil michaelfeil Feb 4, 2026


Thanks for reviewing my PR again and for the fast response.

We have 30+ datasets, some of them from customers. How can we make sure these are added? Do I need to register each one of them in the config? The goal of this PR is to allow datasets that are not in SUPPORTED_DATASET_CONFIG and just duck-type the datasets.

https://huggingface.co/datasets/baseten/ptq-on-policy-nemotron-KimiK2.5
https://huggingface.co/datasets/baseten/ptq-on-policy-nemotron-GLM-4.7
https://huggingface.co/datasets/baseten/ptq-on-policy-nemotron-Kimi-K2-v2
https://huggingface.co/datasets/baseten/quant_calibration_dataset_v1

I think it would be best supported with a separate function.

Contributor Author


Motivation behind this (see the Baseten-NVIDIA channel): https://basetenlabs.slack.com/archives/C04BUDD86FR/p1770160944160069?thread_ts=1770160618.661969&cid=C04BUDD86FR.

We essentially never want a `cnn_dailymail` not-available scenario to impact customers on Baseten again. SUPPORTED_DATASET_CONFIG has brought us a ton of pain, and monkey-patching a module-level variable at global import time plus TRT MPI magic is not working well.

Contributor Author


After the incident with modelopt I am super biased, but I think SUPPORTED_DATASET_CONFIG is just not a good concept, and the concatenation of the user strings via \n was a quality issue in the PTQ version of the model (it obviously needs to follow the exact chat-template tokens for an outlier-free amax).

Contributor Author


@kevalmorabia97 Is there an update here?
