DataLint is a production-ready tool for validating CSV files before ingestion. Built in Rust for performance, it detects erroneous, malicious, or anomalous data patterns using a pre-trained machine-learning model.
## Features

- AI-Powered Detection: leverages a pre-trained neural network for intelligent anomaly detection, using a TinyBERT tokenizer for efficient data indexing
- High Performance: built with Rust for maximum speed and memory efficiency
- CSV Processing: specialized for CSV file validation and analysis
- Security Focus: identifies potentially dangerous or malicious data patterns
- Production Ready: optimized for server-side deployment in production environments
- JSON Output: generates detailed analysis reports in JSON format
## Use Cases

- Data Quality Assurance: Validate CSV imports before processing
- Security Scanning: Detect potentially malicious data injections
- Data Pipeline Integration: Automated validation in ETL processes
- Compliance Checking: Ensure data meets quality standards
- Anomaly Detection: Identify outliers and unusual patterns
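For pipeline integration, a shell gate can block a batch whenever the report lists anomalies. A minimal sketch, assuming the report format shown later in this README; the hand-written sample report stands in for a real DataLint run, and all file paths are illustrative:

```shell
# Illustrative ETL gate. In a real pipeline DataLint would generate the
# report first, e.g.:
#   ./target/release/DataLint "incoming/batch.csv" "reports/batch.json"
# Here a hand-written sample report stands in for that step.
mkdir -p reports
cat > reports/batch.json <<'EOF'
{"analysed_file": "batch.csv", "anomalies": [{"value": "#ERROR!", "column": "\"Phone\"", "score": 0.96, "line": 71049}]}
EOF

# Each anomaly entry carries a "value" key; count the occurrences.
count=$(grep -o '"value"' reports/batch.json | wc -l)
if [ "$count" -eq 0 ]; then
    echo "batch clean: safe to load"
else
    echo "anomalies found ($count): quarantining batch" >&2
fi
```

A JSON-aware tool such as jq would be more robust than `grep` for real reports; the point is only that the report file makes validation results easy to act on in automation.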
## Requirements

- AI Model: Pre-trained PyTorch model for data anomaly detection
- Tokenizer: JSON-formatted vocabulary file for data indexing and tokenization
- PyTorch Runtime: Required DLLs and libraries for model inference
## Installation

```shell
git clone https://github.com/Maxime-Cllt/DataLint.git
cd DataLint

# Development build
cargo build

# Optimized release build (recommended for production)
cargo build --release
```

## Configuration

Create a `config.json` file in the same directory as the executable:
```json
{
  "model_path": "C:\\Users\\model\\neural\\perfage_ia",
  "vocabulary_path": "C:\\Users\\tokenizer\\tokenizer.json"
}
```

| Option | Description |
|---|---|
| `model_path` | Path to the pre-trained PyTorch model directory |
| `vocabulary_path` | Path to the tokenizer JSON file for data processing |
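The paths above are Windows-style; on Linux or macOS the same configuration might look like the following (the directory names are illustrative, not defaults):

```json
{
  "model_path": "/opt/datalint/model/perfage_ia",
  "vocabulary_path": "/opt/datalint/tokenizer/tokenizer.json"
}
```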
## Usage

```shell
# Using cargo (development)
cargo run --release "input_file.csv" "output_report.json"

# Using compiled executable (production)
./target/release/DataLint "input_file.csv" "output_report.json"

# On Windows
.\target\release\DataLint.exe "input_file.csv" "output_report.json"
```

- Input File: Path to the CSV file to be validated
- Output File: Path where the JSON analysis report will be saved
## Examples

```shell
# Analyze a customer data file
./DataLint "data/customers.csv" "reports/customer_analysis.json"

# Validate uploaded user data
./DataLint "uploads/user_data.csv" "validation/results.json"
```

## Output Format

DataLint generates detailed JSON reports with the following structure:
```json
{
  "analysed_file": "file.csv",
  "ai_analyze": 1000,
  "regex_analyze": 1000,
  "time_ms": 1234,
  "anomalies": [
    {
      "value": "#ERROR!",
      "column": "\"Phone\"",
      "score": 0.9670525,
      "line": 71049
    },
    {
      "value": "??",
      "column": "\"Comment\"",
      "score": 0.90427655,
      "line": 75392
    }
  ]
}
```

## PyTorch Runtime Setup

- Install PyTorch: Follow the official installation guide
- Copy DLLs: Place the following PyTorch DLL files in the same directory as the DataLint executable:
  - `torch_cpu.dll`
  - `torch_cuda.dll` (if using GPU)
  - `c10.dll`
  - `fbgemm.dll`
  - Additional dependency DLLs as required
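Reports in the shape shown above can be triaged directly from the shell. The sketch below pulls the flagged CSV line numbers out of a sample report written to match the documented structure (a JSON-aware tool such as jq would be more robust in practice):

```shell
# Sample report matching the documented structure (normally produced by
# DataLint itself).
cat > report.json <<'EOF'
{
  "analysed_file": "file.csv",
  "ai_analyze": 1000,
  "regex_analyze": 1000,
  "time_ms": 1234,
  "anomalies": [
    { "value": "#ERROR!", "column": "\"Phone\"", "score": 0.9670525, "line": 71049 },
    { "value": "??", "column": "\"Comment\"", "score": 0.90427655, "line": 75392 }
  ]
}
EOF

# Extract the CSV line number of each flagged row.
grep -o '"line": [0-9]*' report.json | awk '{print $2}'
# prints:
# 71049
# 75392
```

The extracted line numbers can then be fed back to whatever produced the CSV, e.g. to re-export or quarantine just the offending rows.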
## Building and Testing

To build DataLint from source, ensure you have Rust and Cargo installed, then run:

```shell
cargo build --release
```

Run the test suite with:

```shell
cargo test
```

Code is benchmarked using the criterion crate. To run benchmarks, use:

```shell
cargo bench
```

## Contributing

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.