Update chardet to 6.0.0 #495

Closed

pyup-bot wants to merge 1 commit into master from pyup-update-chardet-3.0.4-to-6.0.0

Update chardet to 6.0.0#495
pyup-bot wants to merge 1 commit intomasterfrom
pyup-update-chardet-3.0.4-to-6.0.0

Conversation

@pyup-bot (Collaborator)

This PR updates chardet from 3.0.4 to 6.0.0.

Changelog

6.0.0

Features

- **Unified single-byte charset detection**: Instead of only having trained language models for a handful of languages (Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, Turkish) and relying on special-case `Latin1Prober` and `MacRomanProber` heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding *and* the language for all supported single-byte encodings.
- **38 new languages**: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, and Welsh. Existing models for Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, and Turkish were also retrained with the new pipeline.
- **`EncodingEra` filtering**: A new `encoding_era` parameter to `detect()` and `detect_all()` accepts an `EncodingEra` flag enum (`MODERN_WEB`, `LEGACY_ISO`, `LEGACY_MAC`, `LEGACY_REGIONAL`, `DOS`, `MAINFRAME`, `ALL`), letting callers restrict detection to encodings from a specific era. Both functions default to `MODERN_WEB`, which should drastically improve accuracy for users who are not working with legacy data. The tiers are:
- `MODERN_WEB`: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)
- `LEGACY_ISO`: ISO-8859-x, KOI8-R/U (legacy but well-known standards)
- `LEGACY_MAC`: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
- `LEGACY_REGIONAL`: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
- `DOS`: DOS/OEM code pages (CP437, CP850, CP866, etc.)
- `MAINFRAME`: EBCDIC variants (CP037, CP500, etc.)
- **`--encoding-era` CLI flag**: The `chardetect` CLI now accepts `-e`/`--encoding-era` to control which encoding eras are considered during detection.
- **`max_bytes` and `chunk_size` parameters**: `detect()`, `detect_all()`, and `UniversalDetector` now accept `max_bytes` (default 200KB) and `chunk_size` (default 64KB) parameters for controlling how much data is examined. (314, bysiber)
- **Encoding era preference tie-breaking**: When multiple encodings have very close confidence scores, the detector now prefers more modern/Unicode encodings over legacy ones.
- **Charset metadata registry**: New `chardet.metadata.charsets` module provides structured metadata about all supported encodings, including their era classification and language filter.
- **`should_rename_legacy` now defaults intelligently**: When set to `None` (the new default), legacy renaming is automatically enabled when `encoding_era` is `MODERN_WEB`.
- **Direct GB18030 support**: Replaced the redundant GB2312 prober with a proper GB18030 prober.
- **EBCDIC detection**: Added CP037 and CP500 EBCDIC model registrations for mainframe encoding detection.
- **Binary file detection**: Added basic binary file detection to abort analysis earlier on non-text files.
- **Python 3.12, 3.13, and 3.14 support** (283, hugovk; 311)
- **GitHub Codespace support** (312, oxygen-dioxide)
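The `EncodingEra` tiers described above behave like a standard Python flag enum, so tiers can be combined with `|`. Here is a minimal stdlib sketch of the mechanics; the member values, the `CHARSET_ERAS` table, and the `allowed` helper are all illustrative assumptions, not chardet's actual code (per the changelog, the real enum uses `auto()` values and lives inside the chardet package):

```python
from enum import Flag

class EncodingEra(Flag):
    # Explicit values for clarity; chardet's real enum uses auto().
    MODERN_WEB = 1
    LEGACY_ISO = 2
    LEGACY_MAC = 4
    LEGACY_REGIONAL = 8
    DOS = 16
    MAINFRAME = 32
    ALL = 63

# Hypothetical era classification for a few charsets (illustrative only;
# chardet 6.0.0 keeps the real table in chardet.metadata.charsets):
CHARSET_ERAS = {
    "utf-8": EncodingEra.MODERN_WEB,
    "windows-1251": EncodingEra.MODERN_WEB,
    "iso-8859-5": EncodingEra.LEGACY_ISO,
    "cp437": EncodingEra.DOS,
    "cp037": EncodingEra.MAINFRAME,
}

def allowed(charset: str, era: EncodingEra) -> bool:
    """True if the charset's era falls within the requested era filter."""
    return bool(CHARSET_ERAS[charset] & era)

print(allowed("iso-8859-5", EncodingEra.MODERN_WEB))                           # False
print(allowed("iso-8859-5", EncodingEra.MODERN_WEB | EncodingEra.LEGACY_ISO))  # True
```

Because the members are flags, callers can widen the filter incrementally (say, `MODERN_WEB | LEGACY_ISO`) instead of jumping straight to `ALL`.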

Fixes

- **Fix CP949 state machine**: Corrected the state machine for Korean CP949 encoding detection. (268, nenw)
- **Fix SJIS distribution analysis**: Fixed `SJISDistributionAnalysis` discarding valid second-byte range >= 0x80. (315, bysiber)
- **Fix UTF-16/32 detection for non-ASCII-heavy text**: Improved detection of UTF-16/32 encoded CJK and other non-ASCII text by adding a `MIN_RATIO` threshold alongside the existing `EXPECTED_RATIO`.
- **Fix `get_charset` crash**: Resolved a crash when looking up unknown charset names.
- **Fix GB18030 `char_len_table`**: Corrected the character length table for GB18030 multi-byte sequences.
- **Fix UTF-8 state machine**: Updated to be more spec-compliant.
- **Fix `detect_all()` returning inactive probers**: Results from probers that determined "definitely not this encoding" are now excluded.
- **Fix early cutoff bug**: Resolved an issue where detection could terminate prematurely.
- **Default UTF-8 fallback**: If UTF-8 has not been ruled out and nothing else is above the minimum threshold, UTF-8 is now returned as the default.
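The UTF-8 fallback rule in the last fix amounts to a small selection policy. A toy sketch of that logic, where the function name, the candidate shape, and the 0.2 threshold are assumptions for illustration rather than chardet's actual internals:

```python
MINIMUM_THRESHOLD = 0.2  # assumed confidence cutoff, not chardet's real value

def pick_encoding(candidates, utf8_ruled_out):
    """Return the best (encoding, confidence) candidate, falling back to
    UTF-8 when nothing clears the threshold and UTF-8 was never excluded."""
    best = max(candidates, key=lambda c: c[1], default=None)
    if best is not None and best[1] >= MINIMUM_THRESHOLD:
        return best[0]
    if not utf8_ruled_out:
        return "utf-8"
    return None

print(pick_encoding([("windows-1252", 0.05)], utf8_ruled_out=False))  # utf-8
print(pick_encoding([("windows-1251", 0.9)], utf8_ruled_out=False))   # windows-1251
```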

Breaking changes

- **Dropped Python 3.7, 3.8, and 3.9 support**: Now requires Python 3.10+. (283, hugovk)
- **Removed `Latin1Prober` and `MacRomanProber`**: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by `SingleByteCharSetProber` with trained language models, giving better accuracy and language identification.
- **Removed EUC-TW support**: EUC-TW encoding detection has been removed as it is extremely rare in practice.
- **`LanguageFilter.NONE` removed**: Use specific language filters or `LanguageFilter.ALL` instead.
- **Enum types changed**: `InputState`, `ProbingState`, `MachineState`, `SequenceLikelihood`, and `CharacterCategory` are now `IntEnum` (previously plain classes or `Enum`). `LanguageFilter` values changed from hardcoded hex to `auto()`.
- **`detect()` default behavior change**: `detect()` now defaults to `encoding_era=EncodingEra.MODERN_WEB` and `should_rename_legacy=None` (auto-enabled for `MODERN_WEB`), whereas previously it defaulted to considering all encodings with no legacy renaming.
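The practical effect of the enum-type change is that code comparing state values against plain ints keeps working with `IntEnum`, while anything that hardcoded the old `LanguageFilter` hex values will break. A stdlib sketch (member names and values here only loosely mirror chardet's and are assumptions):

```python
from enum import Enum, IntEnum

class MachineState(IntEnum):  # hypothetical mirror of chardet's enum
    START = 0
    ERROR = 1
    ITS_ME = 2

# IntEnum members still compare equal to plain ints, so callers that
# used raw integer constants are unaffected:
print(MachineState.START == 0)  # True

# A plain Enum (the old style for some of these types) does not:
class OldStyle(Enum):
    START = 0

print(OldStyle.START == 0)  # False
```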

Misc changes

- **Switched from Poetry/setuptools to uv + hatchling**: Build system modernized with `hatch-vcs` for version management.
- **License text updated**: Updated LGPLv2.1 license text and FSF notices to use URL instead of mailing address. (304, 307, musicinmybrain)
- **CulturaX-based model training**: The `create_language_model.py` training script was rewritten to use the CulturaX multilingual corpus instead of Wikipedia, producing higher quality bigram frequency models.
- **`Language` class converted to frozen dataclass**: The language metadata class now uses `dataclass(frozen=True)` with `num_training_docs` and `num_training_chars` fields replacing `wiki_start_pages`.
- **Test infrastructure**: Added `pytest-timeout` and `pytest-xdist` for faster parallel test execution. Reorganized test data directories.
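The core statistic behind the retrained single-byte models is character-bigram frequency. A toy sketch of the idea; the real `create_language_model.py` pipeline normalizes, filters, and ranks CulturaX text far more carefully than this:

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent character pairs, the raw statistic a bigram
    language model is built from."""
    return Counter(zip(text, text[1:]))

model = bigram_counts("the quick brown fox jumps over the lazy dog")
print(model[("t", "h")])  # "th" appears twice
```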

Contributors

Thank you to everyone who contributed to this release!

- dan-blanchard (Dan Blanchard)
- bysiber (Kadir Can Ozden)
- musicinmybrain (Ben Beasley)
- hugovk (Hugo van Kemenade)
- oxygen-dioxide
- nenw

And a special thanks to helour, whose earlier Latin-1 prober work from an abandoned PR helped inform the approach taken in this release.

5.2.0

Adds support for running the chardet CLI via `python -m chardet` (0e9b7bc20366163efcc221281201baff4100fe19, dan-blanchard)

5.1.0

Features
- Add `should_rename_legacy` argument to most functions, which will rename older encodings to their more modern equivalents (e.g., `GB2312` becomes `GB18030`) (264, dan-blanchard)
- Add capital letter sharp S and ISO-8859-15 support (222, SimonWaldherr)
- Add a prober for MacRoman encoding (5, updated as c292b52a97e57c95429ef559af36845019b88b33; Rob Speer and dan-blanchard)
- Add `--minimal` flag to `chardetect` command (214, dan-blanchard)
- Add type annotations to the project and run mypy on CI (261, jdufresne)
- Add support for Python 3.11 (274, hugovk)
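The `should_rename_legacy` argument maps superseded encoding names to their modern supersets before returning a result. A minimal sketch of that behavior; only the `GB2312` → `GB18030` pair comes from the changelog, and the table and helper names here are illustrative, not chardet's actual code:

```python
# Illustrative legacy-to-modern rename table (not chardet's real mapping).
LEGACY_RENAMES = {"gb2312": "GB18030"}

def rename_legacy(encoding, should_rename_legacy):
    """Apply the legacy rename when requested, otherwise pass through."""
    if should_rename_legacy and encoding and encoding.lower() in LEGACY_RENAMES:
        return LEGACY_RENAMES[encoding.lower()]
    return encoding

print(rename_legacy("GB2312", True))   # GB18030
print(rename_legacy("GB2312", False))  # GB2312
```

Renaming is safe in practice because every GB2312 byte sequence is also valid GB18030, which is why the superset name is preferred for modern consumers.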

Fixes
- Clarify LGPL version in License trove classifier (255, musicinmybrain)
- Remove support for EOL Python 3.6 (260, jdufresne)
- Remove unnecessary guards for non-falsey values (259, jdufresne)

Misc changes
- Switch to Python 3.10 release in GitHub actions (257, jdufresne)
- Remove setup.py in favor of build package (262, jdufresne)
- Run tests on macos, Windows, and 3.11-dev (267, dan-blanchard)

5.0.0

⚠️ This release is the first release of chardet that no longer supports Python < 3.6 ⚠️

In addition to that change, it features the following user-facing changes:

- Added a prober for Johab Korean (207, grizlupo)
- Added a prober for UTF-16/32 BE/LE (109, 206, jpz) 
- Added test data for Croatian, Czech, Hungarian, Polish, Slovak, Slovene, Greek, and Turkish, which should help prevent future errors with those languages
- Improved XML tag filtering, which should improve accuracy for XML files (208)
- Tweaked `SingleByteCharSetProber` confidence to match latest uchardet (209)
- Made `detect_all` return child prober confidences (210)
- Updated examples in docs (223, domdfcoding)
- Documentation fixes (212, 220, 221, 224, 225, 226, 244, from too many contributors to mention)
- Minor performance improvements (252, deedy5)
- Add support for Python 3.10 when testing (232, jdufresne)
- Lots of little development cycle improvements, mostly thanks to jdufresne

4.0.0

Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42) [Clang 11.0.3 (clang-1103.0.32.62)]

(Per-encoding calls-per-second results omitted.)

@pyup-bot pyup-bot mentioned this pull request Feb 22, 2026
@pyup-bot (Collaborator, Author)

Closing this in favor of #496

@pyup-bot pyup-bot closed this Feb 22, 2026
@Al1rios Al1rios deleted the pyup-update-chardet-3.0.4-to-6.0.0 branch February 22, 2026 15:36
