A multi-agent orchestration platform for comprehensive data quality analysis using AutoGen and Snowflake.
DataSentinel coordinates five specialized AI agents through four phases (planning, investigation, analysis, and reporting) to investigate, profile, analyze, and report on data quality issues in Snowflake databases.
The easiest way to use DataSentinel is through the Streamlit web interface:
macOS/Linux:
./run_streamlit.sh
Or manually:
streamlit run streamlit_app.py
The web interface will open at http://localhost:8501 where you can:
- Enter data quality goals interactively
- Monitor the 4-phase workflow in real-time
- View execution logs and metrics
- Download generated reports
- PlannerAgent: Creates comprehensive execution plans
- DataAgent: Executes SQL queries to gather evidence
- DataProfilingAgent: Generates statistical profiles using ydata-profiling
- SummarizerAgent: Synthesizes findings and identifies issues
- ReportAgent: Creates professional HTML reports
- Planning: Break down goals into query and profiling tasks
- Investigation: Execute tasks concurrently for fast results
- Analysis: Correlate findings and identify data quality issues
- Reporting: Generate comprehensive HTML reports
- Concurrent task execution for optimal performance
- Structured Pydantic outputs for type safety
- Real-time progress monitoring (Streamlit)
- Professional HTML report generation
- Statistical profiling with ydata-profiling
- Snowflake integration with connection pooling
- Comprehensive error handling and logging
- Python 3.11 or higher
- Snowflake account with access credentials
- OpenAI API key
- Clone the repository
git clone <repository-url>
cd DataSentinel
- Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Configure environment variables
Create a .env file in the project root:
# Snowflake Connection
SNOWFLAKE_ACCOUNT=your_account.snowflakecomputing.com
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_WAREHOUSE=COMPUTE_WH
SNOWFLAKE_DATABASE=YOUR_DB
SNOWFLAKE_SCHEMA=PUBLIC
SNOWFLAKE_ROLE=SYSADMIN
# OpenAI API
OPENAI_API_KEY=your_openai_api_key
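Before launching a workflow it can help to confirm that every setting from the .env example is actually present. The following is a minimal sketch, not part of DataSentinel: `missing_settings` and `REQUIRED_SETTINGS` are hypothetical names, and whether the project treats all eight variables as mandatory is an assumption.

```python
import os
from typing import Mapping

# Variable names copied from the .env example in this README.
REQUIRED_SETTINGS = (
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_WAREHOUSE",
    "SNOWFLAKE_DATABASE",
    "SNOWFLAKE_SCHEMA",
    "SNOWFLAKE_ROLE",
    "OPENAI_API_KEY",
)


def missing_settings(env: Mapping[str, str]) -> list[str]:
    """Return the required names that are absent or blank in `env` (hypothetical helper)."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


if __name__ == "__main__":
    # With python-dotenv installed, calling dotenv.load_dotenv() first copies
    # the values from .env into os.environ before this check runs.
    missing = missing_settings(os.environ)
    if missing:
        print("Missing settings: " + ", ".join(missing))
    else:
        print("All required settings are present.")
```

Taking the environment as a plain mapping keeps the check easy to test without touching real credentials.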
# AutoGen Framework
autogen-core==0.7.5
autogen-agentchat==0.7.5
autogen-ext[openai]==0.7.5
# Snowflake Integration
snowflake-connector-python==3.18.0
snowflake-sqlalchemy==1.7.7
# Data Analysis
pandas==2.3.3
ydata-profiling==4.17.0
# Web Interface
streamlit==1.39.0
# Configuration
python-dotenv==1.1.1
Streamlit UI:
- Open the app: ./run_streamlit.sh
- Enter goal: "Analyze missing values in the RIDEBOOKING table"
- Click "Run Analysis"
- Watch the progress through each phase
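Outside the web interface, `run_analysis` is a coroutine, so a script has to await it inside an event loop. The sketch below shows one way to do that with `asyncio.run`. `StubOrchestrator` is a hypothetical stand-in (the real class lives in agent/Orchestrator.py and its constructor is not documented here), and the shape of the returned dictionary is an assumption.

```python
import asyncio


class StubOrchestrator:
    """Hypothetical stand-in for the orchestrator in agent/Orchestrator.py."""

    async def run_analysis(self, goal: str) -> dict:
        await asyncio.sleep(0)  # placeholder for the real four-phase workflow
        return {"goal": goal, "status": "completed"}  # assumed result shape


async def main() -> list[dict]:
    orchestrator = StubOrchestrator()
    goals = [
        "Check data distribution and identify outliers in booking amounts",
        "Verify data completeness across all critical columns",
    ]
    results = []
    for goal in goals:
        # Goals run one after another; each call is awaited to completion.
        results.append(await orchestrator.run_analysis(goal))
    return results


if __name__ == "__main__":
    for result in asyncio.run(main()):
        print(result["goal"], "->", result["status"])
```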
goal = "Check data distribution and identify outliers in booking amounts"
results = await orchestrator.run_analysis(goal)

goal = "Verify data completeness across all critical columns"
results = await orchestrator.run_analysis(goal)

User Goal → PlannerAgent
    ↓
Loads schema.json
    ↓
Creates plan:
  - Query tasks for DataAgent
  - Profiling tasks for DataProfilingAgent
  - Execution sequence
  - Success criteria
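The plan described above has four parts: query tasks, profiling tasks, an execution sequence, and success criteria. A rough sketch of such a structure follows. The project states that it uses Pydantic models; standard-library dataclasses are swapped in here so the example runs without extra packages, and every class and field name is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QueryTask:
    task_id: str
    sql: str  # statement intended for DataAgent


@dataclass
class ProfilingTask:
    task_id: str
    table: str  # table intended for DataProfilingAgent


@dataclass
class ExecutionPlan:
    goal: str
    query_tasks: list[QueryTask] = field(default_factory=list)
    profiling_tasks: list[ProfilingTask] = field(default_factory=list)
    execution_sequence: list[str] = field(default_factory=list)
    success_criteria: list[str] = field(default_factory=list)

    def unknown_task_ids(self) -> list[str]:
        """Return sequence entries that do not match any defined task."""
        known = {t.task_id for t in self.query_tasks}
        known |= {t.task_id for t in self.profiling_tasks}
        return [tid for tid in self.execution_sequence if tid not in known]


plan = ExecutionPlan(
    goal="Analyze missing values in the RIDEBOOKING table",
    query_tasks=[QueryTask("q1", "SELECT COUNT(*) FROM RIDEBOOKING")],
    profiling_tasks=[ProfilingTask("p1", "RIDEBOOKING")],
    execution_sequence=["q1", "p1"],
    success_criteria=["Null counts reported for every column"],
)
print(plan.unknown_task_ids())  # prints []
```

A check like `unknown_task_ids` is one way to catch a plan whose sequence refers to tasks that were never defined.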
Plan → [DataAgent + DataProfilingAgent]
    ↓
Concurrent execution:
  - Multiple query tasks (async)
  - Multiple profiling tasks (async)
    ↓
Results aggregated
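The concurrent step above can be pictured with `asyncio.gather`, which schedules all coroutines together and returns their results in input order. This is a sketch under assumptions: `run_query_task`, `run_profiling_task`, and `investigate` are hypothetical stand-ins for the agents' work, and the short sleeps replace real Snowflake and profiling calls.

```python
import asyncio


async def run_query_task(name: str) -> dict:
    """Hypothetical stand-in for one DataAgent query task."""
    await asyncio.sleep(0.01)  # simulated I/O wait
    return {"task": name, "kind": "query"}


async def run_profiling_task(name: str) -> dict:
    """Hypothetical stand-in for one DataProfilingAgent task."""
    await asyncio.sleep(0.01)  # simulated I/O wait
    return {"task": name, "kind": "profiling"}


async def investigate(query_names: list[str], profiling_names: list[str]) -> list[dict]:
    coros = [run_query_task(n) for n in query_names]
    coros += [run_profiling_task(n) for n in profiling_names]
    # All tasks run concurrently; results come back in the order submitted.
    return await asyncio.gather(*coros)


if __name__ == "__main__":
    for item in asyncio.run(investigate(["q1", "q2"], ["p1"])):
        print(item)
```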
Combined Results → SummarizerAgent
    ↓
Correlates findings
Identifies issues
Assigns severity
Creates recommendations
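To make "assigns severity" and "creates recommendations" concrete, here is a small rule-based illustration. It is not how SummarizerAgent works (that agent presumably derives its findings through the language model); the thresholds, the `Issue` fields, and the function names are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Issue:
    column: str
    description: str
    severity: Severity
    recommendation: str


def severity_for_null_ratio(ratio: float) -> Severity:
    """Map a missing-value ratio to a severity (illustrative thresholds only)."""
    if ratio >= 0.5:
        return Severity.HIGH
    if ratio >= 0.1:
        return Severity.MEDIUM
    return Severity.LOW


def build_issues(null_ratios: dict[str, float]) -> list[Issue]:
    """Create one issue per column with missing values, most severe first."""
    issues = [
        Issue(
            column=column,
            description=f"{ratio:.0%} of values are missing",
            severity=severity_for_null_ratio(ratio),
            recommendation=f"Review how {column} is populated upstream",
        )
        for column, ratio in null_ratios.items()
        if ratio > 0
    ]
    return sorted(issues, key=lambda issue: issue.severity, reverse=True)
```

Calling `build_issues` with per-column null ratios returns the issues ordered from highest to lowest severity, which is the order a report would usually present them in.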
All Results → ReportAgent
    ↓
Generates HTML report with:
  - Executive summary
  - Data profiles
  - Quality assessment
  - Issues & severity
  - Recommendations
  - Visualizations
DataSentinel/
├── agent/                     # Agent implementations
│   ├── Orchestrator.py        # Workflow coordinator
│   ├── PlannerAgent.py        # Planning agent
│   ├── DataAgent.py           # Query execution agent
│   ├── DataProfilingAgent.py  # Profiling agent
│   ├── SummarizerAgent.py     # Analysis agent
│   ├── ReportAgent.py         # Report generation agent
│   ├── model/                 # Model factory
│   └── tool/                  # Tools and engines
├── tests/                     # Unit tests
├── metadata/                  # Schema definitions
├── ge_reports/                # Generated reports
├── streamlit_app.py           # Web interface
├── WorkflowRunner.py          # CLI runner
├── run_streamlit.sh           # Streamlit launcher (Unix)
├── run_streamlit.bat          # Streamlit launcher (Windows)
├── requirements.txt           # Dependencies
├── ARCHITECTURE.md            # Architecture documentation
├── STREAMLIT_README.md        # Streamlit guide
└── README.md                  # This file
Run all tests:
./run_tests.sh
Run specific tests:
python -m pytest tests/agent/PlannerAgent_test.py
python -m pytest tests/agent/DataAgent_test.py
python -m pytest tests/agent/ReportAgent_test.py
DataSentinel generates several types of output:
- Main Report: ge_reports/data_quality_report_*.html
  - Executive summary
  - Detailed findings
  - Visualizations
  - Recommendations
- HTML Profile: ge_reports/*_profile_*.html
  - Statistical analysis
  - Distribution charts
  - Correlation matrices
  - Missing value analysis
- JSON Profile: ge_reports/*_profile_*.json
  - Raw profiling data
  - Metrics and statistics
- Results JSON: ge_reports/workflow_results_*.json
  - Complete workflow output
  - All phase results
  - Timestamps
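Because each output file name contains a wildcard segment, a run leaves several files per type in ge_reports/. One way to pick up the newest file of each type is sketched below. The glob patterns are copied from the list above; `latest_output` and `OUTPUT_PATTERNS` are hypothetical names, and choosing by modification time (rather than parsing the file name) is an assumption.

```python
from pathlib import Path
from typing import Optional

# Glob patterns taken from the output list in this README.
OUTPUT_PATTERNS = {
    "Main Report": "data_quality_report_*.html",
    "HTML Profile": "*_profile_*.html",
    "JSON Profile": "*_profile_*.json",
    "Results JSON": "workflow_results_*.json",
}


def latest_output(report_dir: Path, pattern: str) -> Optional[Path]:
    """Return the most recently modified file matching `pattern`, or None."""
    matches = list(report_dir.glob(pattern))
    if not matches:
        return None
    return max(matches, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    reports = Path("ge_reports")
    for label, pattern in OUTPUT_PATTERNS.items():
        print(f"{label}: {latest_output(reports, pattern)}")
```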
- All credentials stored in environment variables
- No hardcoded secrets
- Encrypted Snowflake connections
- Role-based access control (RBAC)
- Secure report storage
Important: Never commit .env files to version control!
- Concurrent Execution: Query and profiling tasks run in parallel using asyncio
- Connection Pooling: Efficient Snowflake connection management
- Smart Sampling: 100k row limit for profiling to balance speed and accuracy
- Caching: Query result caching in SnowflakeQueryEngine
- Error Isolation: Individual task failures don't crash the workflow
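Error isolation in an asyncio workflow is commonly achieved with `asyncio.gather(..., return_exceptions=True)`, which hands back exceptions as values instead of cancelling the whole batch. The sketch below illustrates that pattern; whether DataSentinel uses this exact mechanism is an assumption, and `run_isolated`, `ok_task`, `failing_task`, and `demo` are hypothetical names.

```python
import asyncio


async def run_isolated(tasks: dict) -> tuple[dict, dict]:
    """Await all named coroutines and separate results from failures."""
    names = list(tasks)
    outcomes = await asyncio.gather(*tasks.values(), return_exceptions=True)
    results, errors = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            errors[name] = repr(outcome)  # record the failure, keep going
        else:
            results[name] = outcome
    return results, errors


async def ok_task() -> str:
    await asyncio.sleep(0)
    return "done"


async def failing_task() -> str:
    await asyncio.sleep(0)
    raise RuntimeError("boom")  # simulated task failure


async def demo() -> tuple[dict, dict]:
    return await run_isolated({"good": ok_task(), "bad": failing_task()})


if __name__ == "__main__":
    results, errors = asyncio.run(demo())
    print("results:", results)
    print("errors:", errors)
```

The failing task ends up in `errors` while the successful one still returns its value, so the later phases can work with partial results.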
Import Error:
ModuleNotFoundError: No module named 'agent'
Solution: Run from project root directory
Connection Error:
Snowflake connection failed
Solution: Check .env file credentials and network connectivity
Streamlit Not Found:
streamlit: command not found
Solution: Install Streamlit: pip install streamlit
- ARCHITECTURE.md: Detailed system architecture
- AutoGen Documentation: AutoGen framework
- ydata-profiling Docs: Profiling library
- Python 3.11+: Programming language
- AutoGen 0.7.5: Multi-agent framework
- OpenAI GPT-4o-mini: Language model (via gpt-5-mini)
- Streamlit 1.39.0: Web interface
- ydata-profiling 4.17.0: Statistical profiling
- Snowflake: Cloud data warehouse
- Pandas 2.3.3: Data manipulation
- Support for additional databases (PostgreSQL, MySQL, BigQuery)
- Custom agent plugins
- Real-time monitoring dashboard
- Scheduled workflow execution
- Multi-user collaboration features
- Advanced caching strategies
- Distributed agent execution
Contributions are welcome! Please feel free to submit issues or pull requests.
[Your License Here]
Prateek Singhal
- AutoGen team for the multi-agent framework
- ydata-profiling for statistical profiling capabilities
- Streamlit for the amazing web framework
Version: 2.1
Last Updated: October 16, 2025
Status: Production-Ready
Happy Data Quality Analysis!