This PR contains the following updates:
~=5.2.0 → ~=6.0.0.post1
Release Notes
chardet/chardet (chardet)
v6.0.0
Features
- Instead of special-case Latin1Prober and MacRomanProber heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding and the language for all supported single-byte encodings.
- EncodingEra filtering: The new encoding_era parameter to detect filters by an EncodingEra flag enum (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME, ALL), letting callers restrict detection to encodings from a specific era. detect() and detect_all() default to MODERN_WEB. The new MODERN_WEB default should drastically improve accuracy for users who are not working with legacy data. The tiers are:
  - MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)
  - LEGACY_ISO: ISO-8859-x, KOI8-R/U (legacy but well-known standards)
  - LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
  - LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
  - DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)
  - MAINFRAME: EBCDIC variants (CP037, CP500, etc.)
- --encoding-era CLI flag: The chardetect CLI now accepts -e/--encoding-era to control which encoding eras are considered during detection.
- max_bytes and chunk_size parameters: detect(), detect_all(), and UniversalDetector now accept max_bytes (default 200KB) and chunk_size (default 64KB) parameters for controlling how much data is examined. (#314, @bysiber)
- The chardet.metadata.charsets module provides structured metadata about all supported encodings, including their era classification and language filter.
- should_rename_legacy now defaults intelligently: When set to None (the new default), legacy renaming is automatically enabled when encoding_era is MODERN_WEB.
Fixes
- SJISDistributionAnalysis discarding valid second-byte range >= 0x80. (#315, @bysiber)
- … MIN_RATIO threshold alongside the existing EXPECTED_RATIO.
- get_charset crash: Resolved a crash when looking up unknown charset names.
- char_len_table: Corrected the character length table for GB18030 multi-byte sequences.
- detect_all() returning inactive probers: Results from probers that determined "definitely not this encoding" are now excluded.
Breaking changes
- Latin1Prober and MacRomanProber: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by SingleByteCharSetProber with trained language models, giving better accuracy and language identification.
- LanguageFilter.NONE removed: Use specific language filters or LanguageFilter.ALL instead.
- InputState, ProbingState, MachineState, SequenceLikelihood, and CharacterCategory are now IntEnum (previously plain classes or Enum). LanguageFilter values changed from hardcoded hex to auto().
- detect() default behavior change: detect() now defaults to encoding_era=EncodingEra.MODERN_WEB and should_rename_legacy=None (auto-enabled for MODERN_WEB), whereas previously it defaulted to considering all encodings with no legacy renaming.
Misc changes
- … hatch-vcs for version management.
- The create_language_model.py training script was rewritten to use the CulturaX multilingual corpus instead of Wikipedia, producing higher-quality bigram frequency models.
- Language class converted to frozen dataclass: The language metadata class now uses @dataclass(frozen=True) with num_training_docs and num_training_chars fields replacing wiki_start_pages.
- … pytest-timeout and pytest-xdist for faster parallel test execution. Reorganized test data directories.
Contributors
Thank you to everyone who contributed to this release!
And a special thanks to @helour, whose earlier Latin-1 prober work from an abandoned PR helped inform the approach taken in this release.
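For anyone reviewing the detect() default change listed under Breaking changes, here is a minimal stdlib-only sketch of how a flag-enum era filter and the should_rename_legacy=None default could behave. It is not chardet's implementation: EncodingEra below is a local stand-in that only mirrors the member names from the notes, and SAMPLE_ERAS, candidates, and resolve_rename_legacy are hypothetical names. Treating "is MODERN_WEB" as an exact match is also an assumption.

```python
from enum import Flag, auto
from typing import Optional


class EncodingEra(Flag):
    """Local stand-in mirroring the member names in the notes (not imported from chardet)."""
    MODERN_WEB = auto()
    LEGACY_ISO = auto()
    LEGACY_MAC = auto()
    LEGACY_REGIONAL = auto()
    DOS = auto()
    MAINFRAME = auto()
    ALL = MODERN_WEB | LEGACY_ISO | LEGACY_MAC | LEGACY_REGIONAL | DOS | MAINFRAME


# One sample encoding per tier, taken from the tier list in the release notes.
SAMPLE_ERAS = {
    "UTF-8": EncodingEra.MODERN_WEB,
    "KOI8-R": EncodingEra.LEGACY_ISO,
    "MacRoman": EncodingEra.LEGACY_MAC,
    "KOI8-T": EncodingEra.LEGACY_REGIONAL,
    "CP437": EncodingEra.DOS,
    "CP037": EncodingEra.MAINFRAME,
}


def candidates(era: EncodingEra = EncodingEra.MODERN_WEB) -> list[str]:
    """Return the sample encodings whose tier is included in `era`."""
    return [name for name, tier in SAMPLE_ERAS.items() if tier & era]


def resolve_rename_legacy(should_rename_legacy: Optional[bool], era: EncodingEra) -> bool:
    """Resolve the None default; assumes renaming turns on only for exactly MODERN_WEB."""
    if should_rename_legacy is None:
        return era == EncodingEra.MODERN_WEB
    return should_rename_legacy


if __name__ == "__main__":
    print(candidates())                                          # default era only
    print(candidates(EncodingEra.MODERN_WEB | EncodingEra.DOS))  # eras combine with |
    print(resolve_rename_legacy(None, EncodingEra.ALL))
```

The practical point for this upgrade: code that relied on 5.2.0 considering every encoding would, per the notes, need to pass the ALL era explicitly to keep that behavior, and may want to set should_rename_legacy explicitly rather than rely on the new None default.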
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.