Skip to content

Improve error message when tokenize_*_data config doesn't match dataset#3856

Open
aireenmei wants to merge 1 commit intomainfrom
aireen/tokenize_error
Open

Improve error message when tokenize_*_data config doesn't match dataset#3856
aireenmei wants to merge 1 commit intomainfrom
aireen/tokenize_error

Conversation

@aireenmei
Copy link
Copy Markdown
Collaborator

@aireenmei aireenmei commented May 8, 2026

Description

Currently when the flag tokenize_{train/eval}_data doesn't match with the dataset (pre-tokenized or not), the data pipeline often crash without clear error message, for example: https://buganizer.corp.google.com/issues/509946269#comment10

Tests

Manually tested

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@aireenmei aireenmei force-pushed the aireen/tokenize_error branch 2 times, most recently from c81e688 to c23cd2f Compare May 8, 2026 21:35
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

## 📋 Review Summary

The PR significantly improves error messages when data tokenization configuration doesn't match the dataset format, which is a great UX improvement. It consolidates validation logic into NormalizeFeatures and adds specific validation for TFDS.

🔍 General Feedback

  • The refactoring of NormalizeFeatures into a universal validator is a good architectural move as it centralizes the data type checks.
  • One small regression is the loss of the descriptive "missing column" error message when KeepFeatures was removed, which I've suggested adding back.
  • Overall, the code is clean and the error messages are very helpful and actionable for users.

Comment thread src/maxtext/input_pipeline/input_pipeline_utils.py
Comment thread src/maxtext/input_pipeline/input_pipeline_utils.py Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 43.75000% with 18 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/maxtext/input_pipeline/input_pipeline_utils.py 40.00% 10 Missing and 8 partials ⚠️

📢 Thoughts on this report? Let us know!

@aireenmei aireenmei force-pushed the aireen/tokenize_error branch 2 times, most recently from e884f05 to f5df57b Compare May 9, 2026 01:11
@aireenmei aireenmei force-pushed the aireen/tokenize_error branch from f5df57b to 2b02e64 Compare May 9, 2026 01:12
@aireenmei aireenmei marked this pull request as ready for review May 9, 2026 01:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant