DataLint is a production-ready tool for validating CSV files before ingestion. Built in Rust for performance, it detects erroneous, malicious, or anomalous data patterns using a pre-trained machine-learning model.
## Features

- AI-Powered Detection: leverages a pre-trained neural network for intelligent anomaly detection, using a TinyBERT tokenizer for efficient data indexing
- High Performance: built with Rust for maximum speed and memory efficiency
- CSV Processing: specialized for CSV file validation and analysis
- Security Focus: identifies potentially dangerous or malicious data patterns
- Production Ready: optimized for server-side deployment in production environments
- JSON Output: generates detailed analysis reports in JSON format
## Use Cases

- Data Quality Assurance: Validate CSV imports before processing
- Security Scanning: Detect potentially malicious data injections
- Data Pipeline Integration: Automated validation in ETL processes
- Compliance Checking: Ensure data meets quality standards
- Anomaly Detection: Identify outliers and unusual patterns
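For pipeline integration, a shell gate can block a batch whenever the report lists anomalies. A minimal sketch, assuming the report format shown later in this README; the hand-written sample report stands in for a real DataLint run, and all file paths are illustrative:

```shell
# Illustrative ETL gate. In a real pipeline DataLint would generate the
# report first, e.g.:
#   ./target/release/DataLint "incoming/batch.csv" "reports/batch.json"
# Here a hand-written sample report stands in for that step.
mkdir -p reports
cat > reports/batch.json <<'EOF'
{"analysed_file": "batch.csv", "anomalies": [{"value": "#ERROR!", "column": "\"Phone\"", "score": 0.96, "line": 71049}]}
EOF

# Each anomaly entry carries a "value" key; count the occurrences.
count=$(grep -o '"value"' reports/batch.json | wc -l)
if [ "$count" -eq 0 ]; then
    echo "batch clean: safe to load"
else
    echo "anomalies found ($count): quarantining batch" >&2
fi
```

A JSON-aware tool such as jq would be more robust than `grep` for real reports; the point is only that the report file makes validation results easy to act on in automation.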
## Requirements

- AI Model: Pre-trained PyTorch model for data anomaly detection
- Tokenizer: JSON-formatted vocabulary file for data indexing and tokenization
- PyTorch Runtime: Required DLLs and libraries for model inference
## Installation

```shell
git clone https://github.com/Maxime-Cllt/DataLint.git
cd DataLint

# Development build
cargo build

# Optimized release build (recommended for production)
cargo build --release
```

## Configuration

Create a `config.json` file in the same directory as the executable:
```json
{
  "model_path": "C:\\Users\\model\\neural\\perfage_ia",
  "vocabulary_path": "C:\\Users\\tokenizer\\tokenizer.json"
}
```

| Option | Description |
|---|---|
| `model_path` | Path to the pre-trained PyTorch model directory |
| `vocabulary_path` | Path to the tokenizer JSON file for data processing |
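The paths above are Windows-style; on Linux or macOS the same configuration might look like the following (the directory names are illustrative, not defaults):

```json
{
  "model_path": "/opt/datalint/model/perfage_ia",
  "vocabulary_path": "/opt/datalint/tokenizer/tokenizer.json"
}
```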
## Usage

```shell
# Using cargo (development)
cargo run --release "input_file.csv" "output_report.json"

# Using compiled executable (production)
./target/release/DataLint "input_file.csv" "output_report.json"

# On Windows
.\target\release\DataLint.exe "input_file.csv" "output_report.json"
```

- Input File: Path to the CSV file to be validated
- Output File: Path where the JSON analysis report will be saved
## Examples

```shell
# Analyze a customer data file
./DataLint "data/customers.csv" "reports/customer_analysis.json"

# Validate uploaded user data
./DataLint "uploads/user_data.csv" "validation/results.json"
```

## Output Format

DataLint generates detailed JSON reports with the following structure:
```json
{
  "analysed_file": "file.csv",
  "ai_analyze": 1000,
  "regex_analyze": 1000,
  "time_ms": 1234,
  "anomalies": [
    {
      "value": "#ERROR!",
      "column": "\"Phone\"",
      "score": 0.9670525,
      "line": 71049
    },
    {
      "value": "??",
      "column": "\"Comment\"",
      "score": 0.90427655,
      "line": 75392
    }
  ]
}
```

## PyTorch Runtime Setup

- Install PyTorch: Follow the official installation guide
- Copy DLLs: Place the following PyTorch DLL files in the same directory as the DataLint executable:
  - `torch_cpu.dll`
  - `torch_cuda.dll` (if using GPU)
  - `c10.dll`
  - `fbgemm.dll`
  - Additional dependency DLLs as required
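Reports in the shape shown above can be triaged directly from the shell. The sketch below pulls the flagged CSV line numbers out of a sample report written to match the documented structure (a JSON-aware tool such as jq would be more robust in practice):

```shell
# Sample report matching the documented structure (normally produced by
# DataLint itself).
cat > report.json <<'EOF'
{
  "analysed_file": "file.csv",
  "ai_analyze": 1000,
  "regex_analyze": 1000,
  "time_ms": 1234,
  "anomalies": [
    { "value": "#ERROR!", "column": "\"Phone\"", "score": 0.9670525, "line": 71049 },
    { "value": "??", "column": "\"Comment\"", "score": 0.90427655, "line": 75392 }
  ]
}
EOF

# Extract the CSV line number of each flagged row.
grep -o '"line": [0-9]*' report.json | awk '{print $2}'
# prints:
# 71049
# 75392
```

The extracted line numbers can then be fed back to whatever produced the CSV, e.g. to re-export or quarantine just the offending rows.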
## Building and Testing

To build DataLint from source, ensure you have Rust and Cargo installed, then run:

```shell
cargo build --release
```

Run the test suite with:

```shell
cargo test
```

Code is benchmarked using the criterion crate. To run benchmarks, use:

```shell
cargo bench
```

## Contributing

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.