feat: Baseten contrib third-party-dataset support #851

michaelfeil wants to merge 3 commits into NVIDIA:main from
Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.
📝 Walkthrough: Adds support for loading unregistered (third-party) datasets by introducing an internal helper to extract texts/messages, and extends `get_dataset_samples` to delegate to that helper when a dataset name is not in the supported-dataset config.
Sequence Diagram

```mermaid
sequenceDiagram
participant Caller
participant get_dataset_samples
participant SUPPORTED_CONFIG
participant ThirdPartyHandler as _third_party_get_dataset_samples
participant DatasetLoader
participant Tokenizer
Caller->>get_dataset_samples: get_dataset_samples(name, num_samples, apply_chat_template, tokenizer)
get_dataset_samples->>SUPPORTED_CONFIG: is dataset registered?
alt registered
get_dataset_samples->>DatasetLoader: load via config
DatasetLoader-->>get_dataset_samples: texts
else not registered
get_dataset_samples->>ThirdPartyHandler: delegate (name, num_samples, tokenizer)
ThirdPartyHandler->>DatasetLoader: load third-party dataset
DatasetLoader-->>ThirdPartyHandler: raw records
alt records contain messages
ThirdPartyHandler->>Tokenizer: require tokenizer with chat template
Tokenizer-->>ThirdPartyHandler: formatted texts
else records contain prompts or text column
ThirdPartyHandler->>ThirdPartyHandler: extract prompt/text column
else unsupported structure
ThirdPartyHandler-->>get_dataset_samples: raise error / warn
end
ThirdPartyHandler-->>get_dataset_samples: texts
end
    get_dataset_samples-->>Caller: return texts
```
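In rough Python terms, the fallback path in the diagram could look like the sketch below. This is an illustrative reconstruction, not the PR's actual code: only `_third_party_get_dataset_samples` and the column names come from this review thread; the loading call, error wording, and signature are assumptions.

```python
# Illustrative sketch only: assumes a Hugging Face `datasets` dataset and a
# `transformers` tokenizer with a chat template. Column handling mirrors the
# review comments on this PR ("messages", "prompt", "text").
from warnings import warn

from datasets import load_dataset


def _third_party_get_dataset_samples(dataset_name, num_samples=None, tokenizer=None):
    warn(f"Loading third-party dataset {dataset_name}; it is not a registered dataset.")
    dataset = load_dataset(dataset_name, split="train")

    texts = []
    if "messages" in dataset.column_names:
        if tokenizer is None:
            raise ValueError("A tokenizer with a chat template is required for 'messages' datasets.")
        for i, sample in enumerate(dataset):
            if not sample["messages"]:
                raise ValueError(f"Row {i} in dataset {dataset_name} has no messages.")
            texts.append(tokenizer.apply_chat_template(sample["messages"], tokenize=False))
    elif "prompt" in dataset.column_names:
        texts = [sample["prompt"] for sample in dataset]
    elif "text" in dataset.column_names:
        texts = [sample["text"] for sample in dataset]
    else:
        raise NotImplementedError(f"Dataset {dataset_name} has no supported column layout.")
    return texts[:num_samples] if num_samples is not None else texts
```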
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes. Pre-merge checks: ✅ 3 passed.
58008a4 to 6a0059b
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@modelopt/torch/utils/dataset_utils.py`:
- Around line 166-170: Fix the typo in the NotImplementedError message inside
dataset_utils.py where the error is raised: change "thrid-party" to
"third-party" in the string that references dataset_name and
get_supported_datasets(); update the message attached to the raise in the block
that checks supported datasets (the f"Dataset {dataset_name} is not
supported..." raise) so the wording reads "third-party" correctly.
- Around line 121-123: Fix the typo in the warning message emitted in
dataset_utils.py: update the string in the warn(...) call that currently reads
"Loading third-party datset {dataset_name} ..." to "Loading third-party dataset
{dataset_name} ..." so the variable dataset_name and the call to
get_supported_datasets() remain unchanged and the warning text is correct.
- Around line 152-159: The error messages incorrectly say "Column {i}" while `i`
is the row/sample index; update both ValueError messages to refer to "Row {i}"
or "Sample {i}" (e.g., "Row {i} in dataset {dataset_name} has no messages..."
and "Row {i} in dataset {dataset_name} has empty text...") where `i`,
`dataset_name`, and the tokenizer.apply_chat_template call appear in
dataset_utils.py so logs accurately reflect the offending sample.
- Around line 145-164: The code ignores the num_samples parameter and thus
iterates the entire (possibly streaming) dataset; update the branches that build
texts to respect num_samples: in the "messages" branch (loop over
dataset/enumerate where tokenizer.apply_chat_template is used) stop collecting
once i >= num_samples (or use itertools.islice to limit iteration), and in the
"prompt" and "text" branches replace the full-list comprehensions with a loop
that appends each sample["prompt"] or sample["text"] and breaks when the
collected count reaches num_samples; ensure num_samples=None means no limit and
keep the same validations (empty messages/text) when samples are collected.
🧹 Nitpick comments (2)
modelopt/torch/utils/dataset_utils.py (2)
131-141: Redundant `texts = []` initialization.
`texts` is initialized at line 131, then re-initialized at line 141 inside the `messages` branch. The first initialization at line 131 can be removed since each branch handles its own initialization.
♻️ Proposed fix

```diff
-    texts = []
     if "messages" in dataset.column_names:
         if tokenizer is None:
             raise ValueError(
```
195-201: Behavior inconsistency between supported and third-party datasets regarding `apply_chat_template`.
For supported datasets, `apply_chat_template` is controlled by the caller. For third-party datasets, chat templates are automatically applied when a `messages` column exists, ignoring the `apply_chat_template` parameter. If this is intentional (since third-party datasets with `messages` columns require chat templates to be useful), consider documenting this behavior in the docstring. Otherwise, consider passing `apply_chat_template` to `_third_party_get_dataset_samples` for consistent behavior.
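A minimal sketch of the second option, forwarding the caller's choice; the signatures below follow the sequence diagram earlier on this page and are assumptions, not modelopt's actual API.

```python
# Illustrative only: forward apply_chat_template to the third-party path instead of
# deciding it from the presence of a "messages" column.
SUPPORTED_DATASET_CONFIG = {"cnn_dailymail": {}}  # stand-in for the real registry


def _third_party_get_dataset_samples(name, num_samples=None, tokenizer=None, apply_chat_template=False):
    # ...load the unregistered dataset and honor apply_chat_template here...
    return []


def get_dataset_samples(name, num_samples=None, apply_chat_template=False, tokenizer=None):
    if name not in SUPPORTED_DATASET_CONFIG:
        return _third_party_get_dataset_samples(
            name, num_samples, tokenizer=tokenizer, apply_chat_template=apply_chat_template
        )
    return []  # registered-dataset path elided
```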
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
6a0059b to a5eb313
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
Signed-off-by: michaelfeil <63565275+michaelfeil@users.noreply.github.com>
@CodeRabbit review again.
✅ Actions performed: Review triggered.
@CodeRabbit full review
✅ Actions performed: Full review triggered.
(Review comment on `def _third_party_get_dataset_samples(` in modelopt/torch/utils/dataset_utils.py)
Thanks for contributing!
I think there are 2 things being done here which we could simplify further:
- Support `tools` in `tokenizer.apply_chat_template`, which we can add to the existing `get_dataset_samples()` with an optional `tools_key` in the dataset config in `SUPPORTED_DATASET_CONFIG`
- Custom dataset: I think we just need to import and overwrite `SUPPORTED_DATASET_CONFIG` from this file, then we don't need a separate function for third-party datasets
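As a rough sketch of the second bullet, registering a custom dataset by overwriting the registry from user code might look like the snippet below; the entry layout is a guess, so check the real schema in modelopt/torch/utils/dataset_utils.py before relying on it.

```python
# Hypothetical example: the dict layout of an entry is assumed, not taken from modelopt.
from modelopt.torch.utils import dataset_utils

dataset_utils.SUPPORTED_DATASET_CONFIG["baseten/quant_calibration_dataset_v1"] = {
    "config": {"path": "baseten/quant_calibration_dataset_v1", "split": "train"},  # assumed fields
    "target": "text",  # assumed: which column holds calibration text
}
```

The later comments in this thread explain why this kind of monkey-patching has been fragile in practice.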
Thanks for reviewing my PR again and for the fast response.
We have like 30+ datasets, some are even from customers. How can we make sure this is added? Do I need to register them in the config for each one of them? The goal of this PR is to allow datasets that are not in SUPPORTED_DATASET_CONFIG, and just duck-type the datasets.
https://huggingface.co/datasets/baseten/ptq-on-policy-nemotron-KimiK2.5
https://huggingface.co/datasets/baseten/ptq-on-policy-nemotron-GLM-4.7
https://huggingface.co/datasets/baseten/ptq-on-policy-nemotron-Kimi-K2-v2
https://huggingface.co/datasets/baseten/quant_calibration_dataset_v1
I think it would be best supported with a separate function.
Motivation behind this (See Baseten-nvidia channel) https://basetenlabs.slack.com/archives/C04BUDD86FR/p1770160944160069?thread_ts=1770160618.661969&cid=C04BUDD86FR.
We essentially never want a "`cnn_dailymail` not available" scenario to impact customers on Baseten again. SUPPORTED_DATASET_CONFIG has brought us a ton of pain, and monkey-patching a local variable at global import plus TRT MPI magic is not working well.
After the incident with modelopt, I am super biased, but I think SUPPORTED_DATASET_CONFIG is just not a good concept, and the concatenation of the user strings via \n was a quality issue in the PTQ version of the model (it obviously needs to follow the exact chat template tokens for outlier-free amax).
@kevalmorabia97 Is there an update here?
What does this PR do?
Type of change: ?
Overview: ?
Background: the outage on `cnn_dailymail` caused a couple of issues with our platform, with 20+ customers asking about broken support for their own datasets. Going forward, we are looking to get stable support for customer-brought datasets into modelopt, because the existing options do not suffice. A hard validation is not desirable; many Baseten users will see things like "`abisee/cnn_dailymail` is not in supported datasets" because we use modelopt 0.35.x.

Usage
`llm_ptq --dataset baseten/quant_calibration_dataset_v1`
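At the Python level, the same unregistered dataset could presumably be passed straight to the sampling helper. The call below follows the argument names from the sequence diagram earlier on this page and is only a sketch; the model checkpoint is an arbitrary example.

```python
# Sketch only: argument names follow the CodeRabbit sequence diagram and may differ
# from the merged signature.
from transformers import AutoTokenizer

from modelopt.torch.utils.dataset_utils import get_dataset_samples

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
samples = get_dataset_samples(
    "baseten/quant_calibration_dataset_v1",  # not registered in SUPPORTED_DATASET_CONFIG
    num_samples=512,
    tokenizer=tokenizer,
)
```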
Testing

Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Bug Fixes / Improvements