Update chardet to 6.0.0 #495

Closed

pyup-bot wants to merge 1 commit into master from pyup-update-chardet-3.0.4-to-6.0.0

Update chardet to 6.0.0#495
pyup-bot wants to merge 1 commit intomasterfrom
pyup-update-chardet-3.0.4-to-6.0.0

Conversation

@pyup-bot (Collaborator)

This PR updates chardet from 3.0.4 to 6.0.0.

Changelog

6.0.0

Features

- **Unified single-byte charset detection**: Instead of only having trained language models for a handful of languages (Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, Turkish) and relying on special-case `Latin1Prober` and `MacRomanProber` heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding *and* the language for all supported single-byte encodings.
- **38 new languages**: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, and Welsh. Existing models for Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, and Turkish were also retrained with the new pipeline.
- **`EncodingEra` filtering**: A new `encoding_era` parameter to `detect()` and `detect_all()` accepts an `EncodingEra` flag enum (`MODERN_WEB`, `LEGACY_ISO`, `LEGACY_MAC`, `LEGACY_REGIONAL`, `DOS`, `MAINFRAME`, `ALL`), letting callers restrict detection to encodings from a specific era. Both functions default to `MODERN_WEB`, which should drastically improve accuracy for users who are not working with legacy data. The tiers are:
- `MODERN_WEB`: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)
- `LEGACY_ISO`: ISO-8859-x, KOI8-R/U (legacy but well-known standards)
- `LEGACY_MAC`: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
- `LEGACY_REGIONAL`: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
- `DOS`: DOS/OEM code pages (CP437, CP850, CP866, etc.)
- `MAINFRAME`: EBCDIC variants (CP037, CP500, etc.)
- **`--encoding-era` CLI flag**: The `chardetect` CLI now accepts `-e`/`--encoding-era` to control which encoding eras are considered during detection.
- **`max_bytes` and `chunk_size` parameters**: `detect()`, `detect_all()`, and `UniversalDetector` now accept `max_bytes` (default 200KB) and `chunk_size` (default 64KB) parameters for controlling how much data is examined. (314, bysiber)
- **Encoding era preference tie-breaking**: When multiple encodings have very close confidence scores, the detector now prefers more modern/Unicode encodings over legacy ones.
- **Charset metadata registry**: New `chardet.metadata.charsets` module provides structured metadata about all supported encodings, including their era classification and language filter.
- **`should_rename_legacy` now defaults intelligently**: When set to `None` (the new default), legacy renaming is automatically enabled when `encoding_era` is `MODERN_WEB`.
- **Direct GB18030 support**: Replaced the redundant GB2312 prober with a proper GB18030 prober.
- **EBCDIC detection**: Added CP037 and CP500 EBCDIC model registrations for mainframe encoding detection.
- **Binary file detection**: Added basic binary file detection to abort analysis earlier on non-text files.
- **Python 3.12, 3.13, and 3.14 support** (283, hugovk; 311)
- **GitHub Codespace support** (312, oxygen-dioxide)
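The `EncodingEra` tiers described above behave like a standard Python flag enum, so tiers can be combined with `|`. Here is a minimal stdlib sketch of the mechanics; the member values, the `CHARSET_ERAS` table, and the `allowed` helper are all illustrative assumptions, not chardet's actual code (per the changelog, the real enum uses `auto()` values and lives inside the chardet package):

```python
from enum import Flag

class EncodingEra(Flag):
    # Explicit values for clarity; chardet's real enum uses auto().
    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = 63

# Hypothetical era classification for a few charsets (illustrative only;
# chardet 6.0.0 keeps the real table in chardet.metadata.charsets):
CHARSET_ERAS = {
    "utf-8": EncodingEra.MODERN_WEB,
    "windows-1251": EncodingEra.MODERN_WEB,
    "iso-8859-5": EncodingEra.LEGACY_ISO,
    "cp437": EncodingEra.DOS,
    "cp037": EncodingEra.MAINFRAME,
}

def allowed(charset: str, era: EncodingEra) -> bool:
    """True if the charset's era falls within the requested era filter."""
    return bool(CHARSET_ERAS[charset] & era)

print(allowed("iso-8859-5", EncodingEra.MODERN_WEB))                           # False
print(allowed("iso-8859-5", EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO))  # True
```

Because the members are flags, callers can widen the filter incrementally (say, `MODERN_WEB | LEGACY_ISO`) instead of jumping straight to `ALL`.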

Fixes

- **Fix CP949 state machine**: Corrected the state machine for Korean CP949 encoding detection. (268, nenw)
- **Fix SJIS distribution analysis**: Fixed `SJISDistributionAnalysis` discarding valid second-byte range >= 0x80. (315, bysiber)
- **Fix UTF-16/32 detection for non-ASCII-heavy text**: Improved detection of UTF-16/32 encoded CJK and other non-ASCII text by adding a `MIN_RATIO` threshold alongside the existing `EXPECTED_RATIO`.
- **Fix `get_charset` crash**: Resolved a crash when looking up unknown charset names.
- **Fix GB18030 `char_len_table`**: Corrected the character length table for GB18030 multi-byte sequences.
- **Fix UTF-8 state machine**: Updated to be more spec-compliant.
- **Fix `detect_all()` returning inactive probers**: Results from probers that determined "definitely not this encoding" are now excluded.
- **Fix early cutoff bug**: Resolved an issue where detection could terminate prematurely.
- **Default UTF-8 fallback**: If UTF-8 has not been ruled out and nothing else is above the minimum threshold, UTF-8 is now returned as the default.
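The UTF-8 fallback rule in the last fix amounts to a small selection policy. A toy sketch of that logic, where the function name, the candidate shape, and the 0.2 threshold are assumptions for illustration rather than chardet's actual internals:

```python
MINIMUM_THRESHOLD = 0.2  # assumed confidence cutoff, not chardet's real value

def pick_encoding(candidates, utf8_ruled_out):
    """Return the best (encoding, confidence) candidate, falling back to
    UTF-8 when nothing clears the threshold and UTF-8 was never excluded."""
    best = max(candidates, key=lambda c: c[1], default=None)
    if best is not None and best[1] >= MINIMUM_THRESHOLD:
        return best[0]
    if not utf8_ruled_out:
        return "utf-8"
    return None

print(pick_encoding([("windows-1252", 0.05)], utf8_ruled_out=False))  # utf-8
print(pick_encoding([("windows-1251", 0.9)], utf8_ruled_out=False))   # windows-1251
```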

Breaking changes

- **Dropped Python 3.7, 3.8, and 3.9 support**: Now requires Python 3.10+. (283, hugovk)
- **Removed `Latin1Prober` and `MacRomanProber`**: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by `SingleByteCharSetProber` with trained language models, giving better accuracy and language identification.
- **Removed EUC-TW support**: EUC-TW encoding detection has been removed as it is extremely rare in practice.
- **`LanguageFilter.NONE` removed**: Use specific language filters or `LanguageFilter.ALL` instead.
- **Enum types changed**: `InputState`, `ProbingState`, `MachineState`, `SequenceLikelihood`, and `CharacterCategory` are now `IntEnum` (previously plain classes or `Enum`). `LanguageFilter` values changed from hardcoded hex to `auto()`.
- **`detect()` default behavior change**: `detect()` now defaults to `encoding_era=EncodingEra.MODERN_WEB` and `should_rename_legacy=None` (auto-enabled for `MODERN_WEB`), whereas previously it defaulted to considering all encodings with no legacy renaming.
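The practical effect of the enum-type change is that code comparing state values against plain ints keeps working with `IntEnum`, while anything that hardcoded the old `LanguageFilter` hex values will break. A stdlib sketch (member names and values here only loosely mirror chardet's and are assumptions):

```python
from enum import Enum, IntEnum

class MachineState(IntEnum):  # hypothetical mirror of chardet's enum
    START = 0
    ERROR = 1
    ITS_ME = 2

# IntEnum members still compare equal to plain ints, so callers that
# used raw integer constants are unaffected:
print(MachineState.START == 0)  # True

# A plain Enum (the old style for some of these types) does not:
class OldStyle(Enum):
    START = 0

print(OldStyle.START == 0)  # False
```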

Misc changes

- **Switched from Poetry/setuptools to uv + hatchling**: Build system modernized with `hatch-vcs` for version management.
- **License text updated**: Updated LGPLv2.1 license text and FSF notices to use URL instead of mailing address. (304, 307, musicinmybrain)
- **CulturaX-based model training**: The `create_language_model.py` training script was rewritten to use the CulturaX multilingual corpus instead of Wikipedia, producing higher quality bigram frequency models.
- **`Language` class converted to frozen dataclass**: The language metadata class now uses `dataclass(frozen=True)` with `num_training_docs` and `num_training_chars` fields replacing `wiki_start_pages`.
- **Test infrastructure**: Added `pytest-timeout` and `pytest-xdist` for faster parallel test execution. Reorganized test data directories.
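The core statistic behind the retrained single-byte models is character-bigram frequency. A toy sketch of the idea; the real `create_language_model.py` pipeline normalizes, filters, and ranks CulturaX text far more carefully than this:

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent character pairs, the raw statistic a bigram
    language model is built from."""
    return Counter(zip(text, text[1:]))

model = bigram_counts("the quick brown fox jumps over the lazy dog")
print(model[("t", "h")])  # "th" appears twice
```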

Contributors

Thank you to everyone who contributed to this release!

- dan-blanchard (Dan Blanchard)
- bysiber (Kadir Can Ozden)
- musicinmybrain (Ben Beasley)
- hugovk (Hugo van Kemenade)
- oxygen-dioxide
- nenw

And a special thanks to helour, whose earlier Latin-1 prober work from an abandoned PR helped inform the approach taken in this release.

5.2.0

Adds support for running the chardet CLI via `python -m chardet` (0e9b7bc20366163efcc221281201baff4100fe19, dan-blanchard)

5.1.0

Features
- Add `should_rename_legacy` argument to most functions, which will rename older encodings to their more modern equivalents (e.g., `GB2312` becomes `GB18030`) (264, dan-blanchard)
- Add capital letter sharp S and ISO-8859-15 support (222, SimonWaldherr)
- Add a prober for MacRoman encoding (5, updated as c292b52a97e57c95429ef559af36845019b88b33; Rob Speer and dan-blanchard)
- Add `--minimal` flag to `chardetect` command (214, dan-blanchard)
- Add type annotations to the project and run mypy on CI (261, jdufresne)
- Add support for Python 3.11 (274, hugovk)
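The `should_rename_legacy` argument maps superseded encoding names to their modern supersets before returning a result. A minimal sketch of that behavior; only the `GB2312` → `GB18030` pair comes from the changelog, and the table and helper names here are illustrative, not chardet's actual code:

```python
# Illustrative legacy-to-modern rename table (not chardet's real mapping).
LEGACY_RENAMES = {"gb2312": "GB18030"}

def rename_legacy(encoding, should_rename_legacy):
    """Apply the legacy rename when requested, otherwise pass through."""
    if should_rename_legacy and encoding and encoding.lower() in LEGACY_RENAMES:
        return LEGACY_RENAMES[encoding.lower()]
    return encoding

print(rename_legacy("GB2312", True))   # GB18030
print(rename_legacy("GB2312", False))  # GB2312
```

Renaming is safe in practice because every GB2312 byte sequence is also valid GB18030, which is why the superset name is preferred for modern consumers.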

Fixes
- Clarify LGPL version in License trove classifier (255, musicinmybrain)
- Remove support for EOL Python 3.6 (260, jdufresne)
- Remove unnecessary guards for non-falsey values (259, jdufresne)

Misc changes
- Switch to Python 3.10 release in GitHub actions (257, jdufresne)
- Remove setup.py in favor of build package (262, jdufresne)
- Run tests on macos, Windows, and 3.11-dev (267, dan-blanchard)

5.0.0

⚠️ This release is the first release of chardet that no longer supports Python < 3.6 ⚠️

In addition to that change, it features the following user-facing changes:

- Added a prober for Johab Korean (207, grizlupo)
- Added a prober for UTF-16/32 BE/LE (109, 206, jpz) 
- Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, and Turkish, which should help prevent future errors with those languages
- Improved XML tag filtering, which should improve accuracy for XML files (208)
- Tweaked `SingleByteCharSetProber` confidence to match latest uchardet (209)
- Made `detect_all` return child prober confidences (210)
- Updated examples in docs (223, domdfcoding)
- Documentation fixes (212, 220, 221, 224, 225, 226, 244, from too many contributors to mention)
- Minor performance improvements (252, deedy5)
- Add support for Python 3.10 when testing (232, jdufresne)
- Lots of little development cycle improvements, mostly thanks to jdufresne

4.0.0

Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42) [Clang 11.0.3 (clang-1103.0.32.62)]

(Per-encoding calls-per-second results omitted.)

@pyup-bot pyup-bot mentioned this pull request Feb 22, 2026
@pyup-bot (Collaborator, Author)

Closing this in favor of #496

@pyup-bot pyup-bot closed this Feb 22, 2026
@Al1rios Al1rios deleted the pyup-update-chardet-3.0.4-to-6.0.0 branch February 22, 2026 15:36
