πŸ“Š DataLint

High-performance CSV data validation and anomaly detection tool


πŸš€ Overview

DataLint is a production-ready tool for preventing the ingestion of erroneous or malicious data from CSV files. Built in Rust for speed and memory efficiency, it validates CSV files by detecting erroneous, malicious, or anomalous values with a pre-trained neural network.

✨ Key Features

  • πŸ” AI-Powered Detection: Leverages pre-trained neural networks for intelligent data anomaly detection, use TinyBERT tokenizer for efficient data indexing
  • ⚑ High Performance: Built with Rust for maximum speed and memory efficiency
  • πŸ“ CSV Processing: Specialized for CSV file validation and analysis
  • πŸ›‘οΈ Security Focus: Identifies potentially dangerous or malicious data patterns
  • πŸ”§ Production Ready: Optimized for server-side deployment in production environments
  • πŸ“Š JSON Output: Generates detailed analysis reports in JSON format

🎯 Use Cases

  • Data Quality Assurance: Validate CSV imports before processing
  • Security Scanning: Detect potentially malicious data injections
  • Data Pipeline Integration: Automated validation in ETL processes
  • Compliance Checking: Ensure data meets quality standards
  • Anomaly Detection: Identify outliers and unusual patterns

πŸ“‹ Prerequisites

Required Tools

  • Rust (latest stable version)
  • Cargo (included with Rust)

External Dependencies

  • AI Model: Pre-trained PyTorch model for data anomaly detection
  • Tokenizer: JSON-formatted vocabulary file for data indexing and tokenization
  • PyTorch Runtime: Required DLLs and libraries for model inference

πŸ› οΈ Installation

1. Clone the Repository

git clone https://github.com/Maxime-Cllt/DataLint.git
cd DataLint

2. Build the Project

# Development build
cargo build

# Optimized release build (recommended for production)
cargo build --release

βš™οΈ Configuration

Create a config.json file in the same directory as the executable (note that backslashes in Windows paths must be escaped in JSON):

{
  "model_path": "C:\\Users\\model\\neural\\perfage_ia",
  "vocabulary_path": "C:\\Users\\tokenizer\\tokenizer.json"
}

Configuration Options

  • model_path: Path to the pre-trained PyTorch model directory
  • vocabulary_path: Path to the tokenizer JSON file used for data processing

πŸš€ Usage

Command Line Interface

# Using cargo (development)
cargo run --release -- "input_file.csv" "output_report.json"

# Using compiled executable (production)
./target/release/DataLint "input_file.csv" "output_report.json"

# On Windows
.\target\release\DataLint.exe "input_file.csv" "output_report.json"

Parameters

  • Input File: Path to the CSV file to be validated
  • Output File: Path where the JSON analysis report will be saved

Example Usage

# Analyze a customer data file
./DataLint "data/customers.csv" "reports/customer_analysis.json"

# Validate uploaded user data
./DataLint "uploads/user_data.csv" "validation/results.json"
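For automated pipelines, a thin wrapper can invoke the executable and load the resulting report for downstream checks. The Python sketch below assumes the CLI behaves exactly as described above; `run_datalint` and `load_report` are illustrative helper names, not part of DataLint itself.

```python
import json
import subprocess
from pathlib import Path


def load_report(report_path: str) -> dict:
    """Parse a DataLint JSON report from disk."""
    return json.loads(Path(report_path).read_text(encoding="utf-8"))


def run_datalint(binary: str, csv_path: str, report_path: str) -> dict:
    """Invoke the DataLint CLI on a CSV file and return the parsed report.

    `binary` is the path to the compiled executable,
    e.g. ./target/release/DataLint.
    """
    # DataLint takes the input CSV and the output report path as
    # positional arguments and writes the analysis report as JSON.
    subprocess.run([binary, csv_path, report_path], check=True)
    return load_report(report_path)
```

Passing check=True makes the wrapper raise if DataLint exits with a non-zero status, so a failed analysis stops the pipeline instead of silently producing a stale report.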

πŸ“Š Output Format

DataLint generates detailed JSON reports with the following structure:

{
  "analysed_file": "file.csv",
  "ai_analyze": 1000,
  "regex_analyze": 1000,
  "time_ms": 1234,
  "anomalies": [
    {
      "value": "#ERROR!",
      "column": "\"Phone\"",
      "score": 0.9670525,
      "line": 71049
    },
    {
      "value": "??",
      "column": "\"Comment\"",
      "score": 0.90427655,
      "line": 75392
    }
  ]
}
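Downstream code can act on this structure, for example by rejecting a file when any anomaly scores above a chosen cut-off. The Python sketch below works against the sample report shown above; the 0.95 threshold and the `high_risk_anomalies` helper are illustrative choices, not part of DataLint.

```python
def high_risk_anomalies(report: dict, threshold: float = 0.9) -> list:
    """Return the anomalies whose model score meets or exceeds `threshold`."""
    return [a for a in report.get("anomalies", []) if a["score"] >= threshold]


# Sample report matching the structure documented above.
report = {
    "analysed_file": "file.csv",
    "ai_analyze": 1000,
    "regex_analyze": 1000,
    "time_ms": 1234,
    "anomalies": [
        {"value": "#ERROR!", "column": "\"Phone\"", "score": 0.9670525, "line": 71049},
        {"value": "??", "column": "\"Comment\"", "score": 0.90427655, "line": 75392},
    ],
}

flagged = high_risk_anomalies(report, threshold=0.95)
# Only the "#ERROR!" entry (score 0.967) clears a 0.95 threshold.
```

Each anomaly carries the offending value, the quoted column name, the model's confidence score, and the 1-based line number, which is enough to locate and triage the record in the source CSV.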

πŸ—οΈ Dependencies Setup

PyTorch Installation

  1. Install PyTorch: Follow the official installation guide
  2. Copy DLLs: Place all PyTorch DLL files in the same directory as the DataLint executable

Required PyTorch DLLs (Windows)

  • torch_cpu.dll
  • torch_cuda.dll (if using GPU)
  • c10.dll
  • fbgemm.dll
  • Additional dependency DLLs as required

πŸ”§ Development

Building from Source

To build DataLint from source, ensure you have Rust and Cargo installed, then run:

cargo build --release

πŸ§ͺ Code Quality

Unit Tests

Run the test suite with:

cargo test

Benchmarks

Benchmarks are written with the criterion crate. Run them with:

cargo bench

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“„ License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.
