This project implements an Arabic language processing and knowledge graph construction pipeline for the MindRoots system. The pipeline processes various Arabic text corpora and lexical resources to build a comprehensive Neo4j graph database of Arabic linguistic data.
The system processes multiple data sources into a unified Neo4j knowledge graph:
- Corpus Data: Quranic text with morphological annotations
- Lexical Data: Hans Wehr dictionary, Lane's Lexicon
- Root System: Arabic triliteral/quadriliteral root classification
- Morphological Forms: Arabic verb patterns (wazn) and word forms
Core Node Types:
- Root: Arabic roots with normalized properties (arabic, n_root)
- Word: Lexical items linked to roots (arabic_no_diacritics, root_id)
- CorpusItem: Text instances from various corpora (corpus_id, lemma, n_root)
- Form: Morphological patterns and verb forms
Key Relationships:
- (Root)-[:HAS_WORD]->(Word): Root-to-word associations
- (CorpusItem)-[:HAS_WORD]->(Word): Corpus-to-lexicon links
- (Word)-[:HAS_FORM]->(Form): Morphological patterns
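The corpus-to-lexicon relationship listed above could be written as a parameterized Cypher statement along the following lines. This is a minimal sketch, not the project's actual query: it assumes that a CorpusItem's lemma is compared against a Word's arabic_no_diacritics property and that corpus_id selects one corpus. The constant and function names are hypothetical.

```python
from typing import Any, Dict, Tuple

# Hypothetical query. Assumption: CorpusItem.lemma is matched against
# Word.arabic_no_diacritics; the real matching key may differ.
LINK_CORPUS_ITEMS_TO_WORDS = """
MATCH (c:CorpusItem {corpus_id: $corpus_id})
MATCH (w:Word {arabic_no_diacritics: c.lemma})
MERGE (c)-[:HAS_WORD]->(w)
RETURN count(*) AS links
"""


def build_link_request(corpus_id: int) -> Tuple[str, Dict[str, Any]]:
    """Return the Cypher text and its parameters for one corpus."""
    return LINK_CORPUS_ITEMS_TO_WORDS, {"corpus_id": corpus_id}


if __name__ == "__main__":
    query, params = build_link_request(2)
    print(query.strip())
    print(params)
```

Using MERGE rather than CREATE keeps the statement idempotent, so re-running a linking script would not duplicate HAS_WORD relationships. The returned pair can be passed to the Neo4j Python driver's `session.run(query, params)`.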
- linkquranwords.py: Links Quranic corpus items to word nodes via lemma matching
- link99names.py: Links the 99 Names of Allah to corresponding words
- createcorpusnodeandlink.py: Creates corpus entries and establishes links
- importquran.py: Imports Quranic text data
- importqitems.py: Imports corpus items
- addHansWehr.py: Integrates Hans Wehr dictionary data
- batchprocess.py: Batch processing for OpenAI API calls
- openaibatches_*.py: Various batch processing scripts for different data types
- updatewordlabels.py: Updates word classification labels
- Root Normalization: Updated to use the n_root property for consistent matching
- Enhanced Logging: Dual logging to both file and terminal with detailed query tracking
- Batch Processing: Optimized batch sizes and error handling
- Statistics Tracking: Comprehensive progress and success rate monitoring
- Corpus-to-lexicon linking via linkquranwords.py
- Root validation and word node creation
- Morphological pattern extraction and classification
- Historical Issue: Initial orthography inconsistencies in Lane lexicon data
- Solution: Normalized all root representations to the n_root property
- Impact: Eliminated false negatives in root-word matching
- Systematic stripping of Arabic diacritics for matching
- Preservation of original forms in separate properties
- Unicode normalization (NFKD) for consistent processing
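The diacritic handling described above can be sketched with the standard library alone. This is an illustrative version under stated assumptions, not the project's own utility: the function names are hypothetical, and the code removes every nonspacing mark (Unicode category Mn) after NFKD decomposition. A side effect of that choice is that letters such as alif with hamza decompose and lose the hamza, which may or may not match the pipeline's actual behavior.

```python
import unicodedata
from typing import Dict


def strip_diacritics(text: str) -> str:
    """Decompose with NFKD, then drop all nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def word_properties(arabic: str) -> Dict[str, str]:
    """Keep the original form and store the stripped form separately."""
    return {
        "arabic": arabic,
        "arabic_no_diacritics": strip_diacritics(arabic),
    }


if __name__ == "__main__":
    # "kitabun" written with kasra, fatha and dammatan.
    vocalized = "\u0643\u0650\u062a\u064e\u0627\u0628\u064c"
    print(word_properties(vocalized))
```

Storing both forms means matching can run on the stripped value while the fully vocalized original remains available for display.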
- Neo4j: Graph database backend
- python-dotenv: Environment variable management
- OpenAI API: Batch processing for text analysis
- Custom utilities: Arabic text processing and normalization
- Comprehensive logging with timestamps
- Dual output (file + terminal)
- Progress tracking with statistics
- Error handling and recovery
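The logging features above could be set up roughly as follows with Python's standard logging module. This is a sketch only: the logger name, log format, and the helper names setup_logging and log_progress are assumptions, not taken from the project's scripts.

```python
import logging
import sys


def setup_logging(log_path: str, name: str = "mindroots") -> logging.Logger:
    """Create a logger that writes timestamped lines to a file and the terminal."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    handlers = (
        logging.FileHandler(log_path, encoding="utf-8"),
        logging.StreamHandler(sys.stdout),
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def log_progress(logger: logging.Logger, processed: int, linked: int) -> None:
    """Log running totals and the success rate as a percentage."""
    rate = 100.0 * linked / processed if processed else 0.0
    logger.info("processed=%d linked=%d success_rate=%.1f%%", processed, linked, rate)


if __name__ == "__main__":
    log = setup_logging("linking.log")
    log_progress(log, processed=50, linked=47)
```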
- 50-item batches for optimal performance
- Database connection pooling
- Graceful interruption handling
- Automatic retry logic for failed operations
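A minimal sketch of how fixed-size batching, retries, and graceful interruption might fit together is shown below. The helper names, the retry count, and the linear backoff are illustrative assumptions; only the batch size of 50 comes from the list above.

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batches(items: Sequence[T], size: int = 50) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_retry(operation: Callable[[], R], attempts: int = 3,
               delay_seconds: float = 1.0) -> R:
    """Call `operation`, retrying on any exception with a growing delay."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            time.sleep(delay_seconds * attempt)
    raise ValueError("attempts must be at least 1")


def process_all(items: Sequence[T],
                handle_batch: Callable[[Sequence[T]], None],
                size: int = 50, delay_seconds: float = 1.0) -> int:
    """Process every batch; on Ctrl+C stop cleanly and report items completed."""
    done = 0
    try:
        for batch in batches(items, size):
            with_retry(lambda: handle_batch(batch), delay_seconds=delay_seconds)
            done += len(batch)
    except KeyboardInterrupt:
        pass  # stop between batches; the count below reflects finished work
    return done


if __name__ == "__main__":
    total = process_all(list(range(120)), lambda batch: None)
    print(f"completed {total} items")
```

Counting only fully handled batches means an interrupted run reports a figure that is safe to resume from.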
cd mindroots
python linkquranwords.py # Link corpus items to words
python importquran.py # Import Quranic data
python batchprocess.py # Process data via OpenAI API
Required environment variables:
- NEO4J_URI: Database connection string
- NEO4J_USER: Database username
- NEO4J_PASS: Database password
- OPENAI_API_KEY: For batch processing
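A script can check for these variables at startup and fail early with a clear message. The sketch below uses only the standard library; read_settings is a hypothetical helper, not part of the project. With python-dotenv installed, calling its load_dotenv() beforehand would typically populate os.environ from a .env file.

```python
import os
from typing import Dict, Mapping

REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASS", "OPENAI_API_KEY")


def read_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


if __name__ == "__main__":
    try:
        settings = read_settings()
        print("Loaded settings for", settings["NEO4J_URI"])
    except RuntimeError as err:
        print(err)
```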
- Cross-corpus validation and linking
- Enhanced morphological analysis
- Semantic relationship extraction
- Performance optimization for large datasets
- Multi-dialectal Arabic support
- Historical linguistic analysis
- Cross-linguistic root relationships
- Advanced NLP model integration