Improve error message when tokenize_*_data config doesn't match dataset by aireenmei · Pull Request #3856 · AI-Hypercomputer/maxtext

aireenmei · 2026-05-08T21:26:15Z

Description

Currently when the flag tokenize_{train/eval}_data doesn't match with the dataset (pre-tokenized or not), the data pipeline often crash without clear error message, for example: https://buganizer.corp.google.com/issues/509946269#comment10

Tests

Manually tested

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

github-actions · 2026-05-08T21:36:01Z

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

The PR significantly improves error messages when data tokenization configuration doesn't match the dataset format, which is a great UX improvement. It consolidates validation logic into NormalizeFeatures and adds specific validation for TFDS.

🔍 General Feedback

The refactoring of NormalizeFeatures into a universal validator is a good architectural move as it centralizes the data type checks.
One small regression is the loss of the descriptive "missing column" error message when KeepFeatures was removed, which I've suggested adding back.
Overall, the code is clean and the error messages are very helpful and actionable for users.

codecov · 2026-05-08T21:42:15Z

Codecov Report

❌ Patch coverage is 43.75000% with 18 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/input_pipeline/input_pipeline_utils.py	40.00%	10 Missing and 8 partials ⚠️

📢 Thoughts on this report? Let us know!

… dataset

aireenmei force-pushed the aireen/tokenize_error branch 2 times, most recently from c81e688 to c23cd2f Compare May 8, 2026 21:35

aireenmei added the gemini-review label May 8, 2026

github-actions Bot reviewed May 8, 2026

View reviewed changes

Comment thread src/maxtext/input_pipeline/input_pipeline_utils.py

Comment thread src/maxtext/input_pipeline/input_pipeline_utils.py Outdated

aireenmei force-pushed the aireen/tokenize_error branch 2 times, most recently from e884f05 to f5df57b Compare May 9, 2026 01:11

Improve error message when tokenize_train/eval_data doesn't match the…

2b02e64

… dataset

aireenmei force-pushed the aireen/tokenize_error branch from f5df57b to 2b02e64 Compare May 9, 2026 01:12

aireenmei marked this pull request as ready for review May 9, 2026 01:14

aireenmei requested review from NicoGrande, SurbhiJainUSC, richjames0 and shralex as code owners May 9, 2026 01:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve error message when tokenize_*_data config doesn't match dataset#3856

Improve error message when tokenize_*_data config doesn't match dataset#3856
aireenmei wants to merge 1 commit intomainfrom
aireen/tokenize_error

aireenmei commented May 8, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 8, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 8, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aireenmei commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Tests

Checklist

Uh oh!

github-actions Bot commented May 8, 2026

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

🔍 General Feedback

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aireenmei commented May 8, 2026 •

edited

Loading

codecov Bot commented May 8, 2026 •

edited

Loading