fix: exclude Arabic diacritics from spam character-density check #16

Open
codectified wants to merge 1 commit into sunnah-com:main from codectified:main

Conversation

@codectified

Collaborator

Unicode nonspacing combining marks (category Mn) include the harakat, the vowel diacritics attached to Arabic letters. They are not punctuation, but c.isalpha() returns False for them, so they were counted as special characters.
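To make the behaviour above concrete, here is a small check using only the Python standard library. The helper name `describe` and the three sample code points are chosen for illustration and are not taken from the patched code.

```python
import unicodedata

BEH = "\u0628"     # ARABIC LETTER BEH, a base letter
FATHA = "\u064E"   # ARABIC FATHA, a short-vowel mark
SHADDA = "\u0651"  # ARABIC SHADDA, the gemination mark


def describe(ch: str) -> tuple[str, bool]:
    """Return the Unicode general category of ch and the result of ch.isalpha()."""
    return unicodedata.category(ch), ch.isalpha()


for ch in (BEH, FATHA, SHADDA):
    category, alpha = describe(ch)
    print(f"U+{ord(ch):04X} category={category} isalpha={alpha}")
```

The base letter is reported as category Lo with isalpha() True, while both diacritics are category Mn with isalpha() False, which is why a check built on isalpha() alone treats them as non-letters.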

A heavily voweled Arabic query like اللَّهُمَّ صَلِّ عَلَى مُحَمَّدٍ has ~52% Mn characters and was wrongly rejected with a 400. The Basmala similarly triggered the filter.

Fix: add a unicodedata.category(c) != 'Mn' guard so that combining marks are not counted as special characters in the density calculation.
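A minimal sketch of how such a guard might sit inside a density check follows. The real function is not shown in this PR, so the names `is_special` and `special_density`, the `skip_marks` switch, the exclusion of digits and whitespace, and the choice of the full string length as the denominator are all assumptions made for illustration.

```python
import unicodedata


def is_special(c: str, skip_marks: bool = True) -> bool:
    """Hypothetical predicate: should c count toward the special-character density?"""
    if c.isalpha() or c.isdigit() or c.isspace():
        return False  # assumption: letters, digits and whitespace never count
    if skip_marks and unicodedata.category(c) == "Mn":
        return False  # the guard from this PR: nonspacing marks are not special
    return True


def special_density(query: str, skip_marks: bool = True) -> float:
    """Fraction of characters in query that is_special() flags (0.0 for empty input)."""
    if not query:
        return 0.0
    return sum(is_special(c, skip_marks) for c in query) / len(query)


# beh + kasra, seen + sukun, meem + kasra: three letters, three Mn marks
voweled = "\u0628\u0650\u0633\u0652\u0645\u0650"

print(special_density(voweled, skip_marks=False))  # marks counted as special
print(special_density(voweled, skip_marks=True))   # marks ignored by the guard
print(special_density("!!!$$$"))                   # genuine symbol spam still flagged
```

Under these assumptions the voweled word drops from a density of 0.5 to 0.0 once the guard is applied, while a string of pure punctuation keeps a density of 1.0, so the spam check itself is not weakened.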

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…check

Unicode nonspacing marks (category Mn) include the harakat, the vowel
diacritics attached to Arabic letters. c.isalpha() returns False for
them, so they were counted as special characters in the density check.

A heavily voweled Arabic query like اللَّهُمَّ صَلِّ عَلَى مُحَمَّدٍ has
~52% Mn characters and was wrongly rejected with 400. Among the top 100
queries by search volume, 12 of 30 Arabic queries were being rejected,
including لَا إِلَهَ إِلَّا اَللَّهُ (sunnah-com#8, 152k searches) and the Basmala.

Fix: add unicodedata.category(c) != 'Mn' guard so combining marks are
not counted as punctuation/symbols in the density ratio.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>