nerdjerry/DataSentinel

🛡️ DataSentinel

A multi-agent orchestration platform for comprehensive data quality analysis using AutoGen and Snowflake.

🌟 Overview

DataSentinel coordinates five specialized AI agents to analyze data quality in Snowflake databases. The agents work through four phases: planning, investigation, analysis, and reporting.

🚀 Quick Start

Using Streamlit Web Interface (Recommended)

The easiest way to use DataSentinel is through the Streamlit web interface:

macOS/Linux:

./run_streamlit.sh

Or manually:

streamlit run streamlit_app.py

The web interface will open at http://localhost:8501 where you can:

  • Enter data quality goals interactively
  • Monitor the 4-phase workflow in real-time
  • View execution logs and metrics
  • Download generated reports

📋 Features

Multi-Agent Architecture

  • PlannerAgent: Creates comprehensive execution plans
  • DataAgent: Executes SQL queries to gather evidence
  • DataProfilingAgent: Generates statistical profiles using ydata-profiling
  • SummarizerAgent: Synthesizes findings and identifies issues
  • ReportAgent: Creates professional HTML reports

4-Phase Workflow

  1. Planning 📋: Break down goals into query and profiling tasks
  2. Investigation 🔍: Execute tasks concurrently for fast results
  3. Analysis 📊: Correlate findings and identify data quality issues
  4. Reporting 📄: Generate comprehensive HTML reports

Key Capabilities

  • ✅ Concurrent task execution for optimal performance
  • ✅ Structured Pydantic outputs for type safety
  • ✅ Real-time progress monitoring (Streamlit)
  • ✅ Professional HTML report generation
  • ✅ Statistical profiling with ydata-profiling
  • ✅ Snowflake integration with connection pooling
  • ✅ Comprehensive error handling and logging

🔧 Installation

Prerequisites

  • Python 3.11 or higher
  • Snowflake account with access credentials
  • OpenAI API key

Setup

  1. Clone the repository

    git clone <repository-url>
    cd DataSentinel
  2. Create virtual environment

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Configure environment variables

    Create a .env file in the project root:

    # Snowflake Connection
    SNOWFLAKE_ACCOUNT=your_account.snowflakecomputing.com
    SNOWFLAKE_USER=your_username
    SNOWFLAKE_PASSWORD=your_password
    SNOWFLAKE_WAREHOUSE=COMPUTE_WH
    SNOWFLAKE_DATABASE=YOUR_DB
    SNOWFLAKE_SCHEMA=PUBLIC
    SNOWFLAKE_ROLE=SYSADMIN
    
    # OpenAI API
    OPENAI_API_KEY=your_openai_api_key
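
The variables above could be checked once at startup so that a missing credential fails early with a clear message. The following is a minimal sketch using only the standard library. The `load_settings` helper is hypothetical and not part of DataSentinel; the variable names are taken from the `.env` example, and in the project `python-dotenv` would normally load the file into the environment first.

```python
import os
from typing import Mapping

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_WAREHOUSE",
    "SNOWFLAKE_DATABASE",
    "SNOWFLAKE_SCHEMA",
    "SNOWFLAKE_ROLE",
    "OPENAI_API_KEY",
)


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Hypothetical helper: return the required settings or raise, listing what is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing the environment as a mapping keeps the check easy to test without touching real credentials.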

📦 Dependencies

# AutoGen Framework
autogen-core==0.7.5
autogen-agentchat==0.7.5
autogen-ext[openai]==0.7.5

# Snowflake Integration
snowflake-connector-python==3.18.0
snowflake-sqlalchemy==1.7.7

# Data Analysis
pandas==2.3.3
ydata-profiling==4.17.0

# Web Interface
streamlit==1.39.0

# Configuration
python-dotenv==1.1.1

🎯 Usage Examples

Example 1: Missing Values Analysis

Streamlit UI:

  1. Open the app: ./run_streamlit.sh
  2. Enter goal: "Analyze missing values in the RIDEBOOKING table"
  3. Click "Run Analysis"
  4. Watch the progress through each phase

Example 2: Data Distribution Analysis

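# Assumes `orchestrator` is an already-initialized instance from agent/Orchestrator.py
# and that this code runs inside an async function.
goal = "Check data distribution and identify outliers in booking amounts"
results = await orchestrator.run_analysis(goal)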

Example 3: Completeness Check

# Assumes an initialized `orchestrator` and an enclosing async function.
goal = "Verify data completeness across all critical columns"
results = await orchestrator.run_analysis(goal)

📊 Workflow Phases

Phase 1: Planning 📋

User Goal → PlannerAgent
  ↓
Loads schema.json
  ↓
Creates plan:
  - Query tasks for DataAgent
  - Profiling tasks for DataProfilingAgent
  - Execution sequence
  - Success criteria
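
The plan described above can be pictured as a small data structure. This is a sketch only: the README says the project uses Pydantic models, but the actual model is not shown here, so standard-library dataclasses stand in and every class and field name below is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class QueryTask:
    task_id: str
    sql: str  # SQL that DataAgent would run


@dataclass
class ProfilingTask:
    task_id: str
    table: str  # table that DataProfilingAgent would profile


@dataclass
class Plan:
    """Hypothetical plan shape mirroring the four items in the diagram above."""
    goal: str
    query_tasks: list[QueryTask] = field(default_factory=list)
    profiling_tasks: list[ProfilingTask] = field(default_factory=list)
    execution_sequence: list[str] = field(default_factory=list)  # task ids in order
    success_criteria: list[str] = field(default_factory=list)

    def validate(self) -> None:
        """Check that every id in the sequence refers to a defined task."""
        known = {t.task_id for t in self.query_tasks}
        known |= {t.task_id for t in self.profiling_tasks}
        unknown = [tid for tid in self.execution_sequence if tid not in known]
        if unknown:
            raise ValueError(f"Unknown task ids in sequence: {unknown}")


plan = Plan(
    goal="Analyze missing values in the RIDEBOOKING table",
    query_tasks=[QueryTask("q1", "SELECT COUNT(*) FROM RIDEBOOKING")],
    profiling_tasks=[ProfilingTask("p1", "RIDEBOOKING")],
    execution_sequence=["q1", "p1"],
    success_criteria=["Missing-value rate reported for every column"],
)
plan.validate()  # raises ValueError if the sequence names an undefined task
```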

Phase 2: Investigation 🔍

Plan → [DataAgent + DataProfilingAgent]
  ↓
Concurrent execution:
  - Multiple query tasks (async)
  - Multiple profiling tasks (async)
  ↓
Results aggregated
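
The concurrent step above can be sketched with `asyncio.gather`. The two task functions are stand-ins that only sleep; they are hypothetical and do not reflect the real DataAgent or DataProfilingAgent interfaces. Passing `return_exceptions=True` is one way to get the error isolation the README mentions, since a failed task is recorded instead of aborting the batch.

```python
import asyncio


async def run_query_task(name: str) -> dict:
    """Stand-in for a DataAgent query task (hypothetical)."""
    await asyncio.sleep(0.01)  # simulates waiting on Snowflake
    if name == "bad_query":
        raise ValueError("simulated SQL error")
    return {"task": name, "kind": "query", "status": "ok"}


async def run_profiling_task(name: str) -> dict:
    """Stand-in for a DataProfilingAgent task (hypothetical)."""
    await asyncio.sleep(0.01)
    return {"task": name, "kind": "profiling", "status": "ok"}


async def investigate(query_names: list[str], profiling_names: list[str]) -> list[dict]:
    """Run all tasks concurrently and aggregate results in submission order."""
    names = query_names + profiling_names
    coros = [run_query_task(n) for n in query_names]
    coros += [run_profiling_task(n) for n in profiling_names]
    # return_exceptions=True returns failures as values instead of raising,
    # so the results of the other tasks are still collected.
    outcomes = await asyncio.gather(*coros, return_exceptions=True)
    results = []
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            results.append({"task": name, "status": "failed", "error": str(outcome)})
        else:
            results.append(outcome)
    return results


results = asyncio.run(investigate(["null_counts", "bad_query"], ["RIDEBOOKING"]))
```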

Phase 3: Analysis 📊

Combined Results → SummarizerAgent
  ↓
Correlates findings
Identifies issues
Assigns severity
Creates recommendations
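
In DataSentinel the SummarizerAgent is LLM-driven, so how it assigns severity is not specified here. As a rough illustration of the idea, the sketch below uses a rule-based stand-in that maps per-column missing-value rates to a severity and a recommendation. The thresholds, class, and function names are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    column: str
    missing_rate: float
    severity: str
    recommendation: str


def assign_severity(missing_rate: float) -> str:
    """Map a missing-value rate to a severity label (illustrative thresholds)."""
    if missing_rate >= 0.5:
        return "critical"
    if missing_rate >= 0.2:
        return "high"
    if missing_rate >= 0.05:
        return "medium"
    return "low"


def summarize(missing_rates: dict[str, float]) -> list[Issue]:
    """Turn per-column missing rates into issues, highest rate first."""
    issues = []
    ranked = sorted(missing_rates.items(), key=lambda kv: kv[1], reverse=True)
    for column, rate in ranked:
        if rate == 0:
            continue  # complete columns produce no issue
        issues.append(
            Issue(
                column=column,
                missing_rate=rate,
                severity=assign_severity(rate),
                recommendation=f"Investigate why {rate:.0%} of {column} values are missing",
            )
        )
    return issues
```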

Phase 4: Reporting 📄

All Results → ReportAgent
  ↓
Generates HTML report with:
  - Executive summary
  - Data profiles
  - Quality assessment
  - Issues & severity
  - Recommendations
  - Visualizations
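
A report with some of the sections listed above could be assembled from plain strings. This is a minimal sketch, not the ReportAgent's actual template: the `render_report` function and the issue dictionary keys are hypothetical. It escapes all inserted text so that column names or query fragments cannot break the markup.

```python
from html import escape


def render_report(summary: str, issues: list[dict]) -> str:
    """Hypothetical renderer: executive summary plus an issues table."""
    rows = "\n".join(
        f"<tr><td>{escape(i['title'])}</td><td>{escape(i['severity'])}</td>"
        f"<td>{escape(i['recommendation'])}</td></tr>"
        for i in issues
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        "<title>Data Quality Report</title></head><body>\n"
        f"<h1>Executive summary</h1>\n<p>{escape(summary)}</p>\n"
        "<h2>Issues &amp; severity</h2>\n"
        "<table><tr><th>Issue</th><th>Severity</th><th>Recommendation</th></tr>\n"
        f"{rows}\n</table>\n</body></html>"
    )
```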

📁 Project Structure

DataSentinel/
├── agent/                         # Agent implementations
│   ├── Orchestrator.py            # Workflow coordinator
│   ├── PlannerAgent.py            # Planning agent
│   ├── DataAgent.py               # Query execution agent
│   ├── DataProfilingAgent.py      # Profiling agent
│   ├── SummarizerAgent.py         # Analysis agent
│   ├── ReportAgent.py             # Report generation agent
│   ├── model/                     # Model factory
│   └── tool/                      # Tools and engines
├── tests/                         # Unit tests
├── metadata/                      # Schema definitions
├── ge_reports/                    # Generated reports
├── streamlit_app.py               # Web interface
├── WorkflowRunner.py              # CLI runner
├── run_streamlit.sh               # Streamlit launcher (Unix)
├── run_streamlit.bat              # Streamlit launcher (Windows)
├── requirements.txt               # Dependencies
├── ARCHITECTURE.md                # Architecture documentation
├── STREAMLIT_README.md            # Streamlit guide
└── README.md                      # This file

🧪 Testing

Run all tests:

./run_tests.sh

Run specific tests:

python -m pytest tests/agent/PlannerAgent_test.py
python -m pytest tests/agent/DataAgent_test.py
python -m pytest tests/agent/ReportAgent_test.py

📈 Output

DataSentinel generates several types of output:

HTML Reports

  • Main Report: ge_reports/data_quality_report_*.html
    • Executive summary
    • Detailed findings
    • Visualizations
    • Recommendations

Profiling Reports

  • HTML Profile: ge_reports/*_profile_*.html
    • Statistical analysis
    • Distribution charts
    • Correlation matrices
    • Missing value analysis
  • JSON Profile: ge_reports/*_profile_*.json
    • Raw profiling data
    • Metrics and statistics

Workflow Results

  • Results JSON: ge_reports/workflow_results_*.json
    • Complete workflow output
    • All phase results
    • Timestamps

🔒 Security

  • ✅ All credentials stored in environment variables
  • ✅ No hardcoded secrets
  • ✅ Encrypted Snowflake connections
  • ✅ Role-based access control (RBAC)
  • ✅ Secure report storage

Important: Never commit .env files to version control!

⚡ Performance

  • Concurrent Execution: Query and profiling tasks run in parallel using asyncio
  • Connection Pooling: Efficient Snowflake connection management
  • Smart Sampling: Profiling is capped at 100,000 rows to balance speed and accuracy
  • Caching: Query result caching in SnowflakeQueryEngine
  • Error Isolation: Individual task failures don't crash the workflow
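
Query result caching, as mentioned in the list above, can be illustrated with a small in-memory wrapper. This is a sketch of the general technique and not the project's `SnowflakeQueryEngine`: the `QueryCache` class is hypothetical, and the query function is injected so the example runs without a database.

```python
from typing import Callable


class QueryCache:
    """Hypothetical in-memory cache keyed by normalized SQL text."""

    def __init__(self, run_query: Callable[[str], list]) -> None:
        self._run_query = run_query
        self._results: dict[str, list] = {}
        self.hits = 0

    @staticmethod
    def _key(sql: str) -> str:
        # Collapse whitespace and drop a trailing semicolon so trivially
        # different spellings of the same statement share one entry.
        return " ".join(sql.split()).rstrip(";").strip()

    def execute(self, sql: str) -> list:
        key = self._key(sql)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        rows = self._run_query(sql)
        self._results[key] = rows
        return rows
```

The key deliberately preserves letter case, since string literals in SQL are case-sensitive. A cache like this never expires entries, so it only suits data that does not change during a run.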

🐛 Troubleshooting

Common Issues

Import Error:

ModuleNotFoundError: No module named 'agent'

Solution: Run from project root directory

Connection Error:

Snowflake connection failed

Solution: Check .env file credentials and network connectivity

Streamlit Not Found:

streamlit: command not found

Solution: Install Streamlit: pip install streamlit

📖 Documentation

  • ARCHITECTURE.md: Architecture documentation
  • STREAMLIT_README.md: Streamlit guide

🛠️ Technology Stack

  • Python 3.11+: Programming language
  • AutoGen 0.7.5: Multi-agent framework
  • OpenAI GPT-4o-mini: Language model (via gpt-5-mini)
  • Streamlit 1.39.0: Web interface
  • ydata-profiling 4.17.0: Statistical profiling
  • Snowflake: Cloud data warehouse
  • Pandas 2.3.3: Data manipulation

🚧 Roadmap

  • Support for additional databases (PostgreSQL, MySQL, BigQuery)
  • Custom agent plugins
  • Real-time monitoring dashboard
  • Scheduled workflow execution
  • Multi-user collaboration features
  • Advanced caching strategies
  • Distributed agent execution

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

📄 License

[Your License Here]

👥 Authors

Prateek Singhal

🙏 Acknowledgments

  • AutoGen team for the multi-agent framework
  • ydata-profiling for statistical profiling capabilities
  • Streamlit for the amazing web framework

Version: 2.1
Last Updated: October 16, 2025
Status: Production-Ready

Happy Data Quality Analysis! 🛡️
