SciDataCopilot

An Agentic Data Preparation Framework for AGI-driven Scientific Discovery

SciDataCopilot is a multi-agent system that turns natural-language research requests into executable scientific data workflows.

It is designed for end-to-end automation: requirement understanding, data discovery/acquisition, hybrid planning (tool + code), execution, and result integration.

(Figure: SciDataCopilot framework architecture)


Core Features

  • LLM-first intent understanding with a structured requirement split (see the sketch after this list):
    • data requirements (what data is needed)
    • processing requirements (what analysis/transformation is needed)
  • Hybrid plan execution with explicit step dependencies:
    • tool steps (Tool Lake executors)
    • code steps (LLM-generated Python with execute-repair loop)
  • Multi-agent orchestration via LangGraph state workflow.
  • Knowledge-driven routing with Data Lake + Tool Lake + Case Lake.
  • Built-in support for multiple scenarios:
    • tabular/Polar-style data processing
    • EEG/MNE workflows
    • acquisition workflows (for example UniProt/PDF/JSON pipelines)
  • Data health and quality tracking across baseline vs post-processing stages.
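
As a rough illustration of the requirement split, a parsed request might produce a structure like the one below. The field names here are hypothetical; the authoritative keys are defined in core/plan_schema.py.

# Illustrative only: hypothetical field names, not the exact plan_schema.py keys.
structured_requirements = {
    "data_requirements": {
        "sources": ["MNE sample dataset (EEG/MEG)"],
        "modalities": ["eeg", "meg"],
    },
    "processing_requirements": {
        "steps": [
            "apply a 0.3 Hz high-pass filter",
            "fit and apply EOG regression to remove ocular artifacts",
            "epoch on visual/left and visual/right events",
        ],
        "expected_outputs": [
            "topomap of EOG regression weights",
            "before/after evoked comparison plots",
        ],
    },
}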

Architecture

Multi-Agent Coordination Workflow

User Prompt
   |
   v
DataAccessAgent
  - Resolve datasets (local path -> Data Lake -> acquisition tools)
   |
   v   
IntentParsingAgent
  - Requirement split
  - Case retrieval/adaptation
  - Hybrid plan generation + validation
   |
   v
DataAccessAgent
  - Profile/inspect data
   |
   v
DataProcessingAgent
  - Execute hybrid steps in dependency order
  - Tool execution + code generation/debug
  - Artifact registry + run logs
   |
   v
DataIntegrationAgent
  - Consolidate outputs
  - Quality assessment
  - Analysis/recommendations
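
A minimal sketch of how such a workflow can be wired with LangGraph is shown below. The node names, state fields, and stub bodies are illustrative assumptions; the actual graph is defined in sci_data_copilot.py and core/workflow.py.

from typing import TypedDict
from langgraph.graph import StateGraph, END

class CopilotState(TypedDict, total=False):
    prompt: str
    resolved_sources: list
    plan: dict
    artifacts: dict
    report: str

# Each node returns a partial state update; bodies are stubs for illustration.
def resolve_data(state): return {"resolved_sources": []}   # local path -> Data Lake -> acquisition
def parse_intent(state): return {"plan": {}}               # requirement split + hybrid plan
def profile_data(state): return {}                         # inspect/profile resolved data
def process_data(state): return {"artifacts": {}}          # execute tool/code steps in order
def integrate(state):    return {"report": ""}             # consolidate outputs + quality check

graph = StateGraph(CopilotState)
graph.add_node("resolve_data", resolve_data)
graph.add_node("parse_intent", parse_intent)
graph.add_node("profile_data", profile_data)
graph.add_node("process_data", process_data)
graph.add_node("integrate", integrate)
graph.set_entry_point("resolve_data")
graph.add_edge("resolve_data", "parse_intent")
graph.add_edge("parse_intent", "profile_data")
graph.add_edge("profile_data", "process_data")
graph.add_edge("process_data", "integrate")
graph.add_edge("integrate", END)
app = graph.compile()
# app.invoke({"prompt": "Process polar tabular data ..."})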

Core Agents

  • IntentParsingAgent (agents/intent_parsing_agent.py)
    • Responsibility: parse intent, split requirements, generate/repair hybrid plans
    • Core functions: RequirementAnalyzer, CaseRetriever, PlanGenerator, StrategyReviewer
    • Example outputs: processing_plan.json, structured requirement dict
  • DataAccessAgent (agents/data_access_agent.py)
    • Responsibility: resolve and inspect data sources before execution
    • Core functions: Data Lake search, tool-based acquisition, modality mapping, profiling
    • Example outputs: data_profile.json, perception_report.txt, resolved/unresolved sources
  • DataProcessingAgent (agents/data_processing_agent.py)
    • Responsibility: execute the hybrid plan (tool + code) and keep artifacts consistent
    • Core functions: topological execution, ExecuteRepairLoop, fallback strategy, step logs
    • Example outputs: hybrid_execution.json, step_<id>.py, step_<id>_result.json
  • DataIntegrationAgent (agents/data_integration_agent.py)
    • Responsibility: integrate outputs and evaluate final quality
    • Core functions: strategy analysis, output assembly, quality comparison
    • Example outputs: integration_analysis.txt, final output files, recommendations

Key Runtime Components

  • sci_data_copilot.py: main entry point, state machine wiring, CLI.
  • core/plan_schema.py: authoritative schema for structured requirements and hybrid plan steps (an illustrative step is sketched after this list).
  • tools/plan_generator.py: LLM plan generation + schema validation/repair.
  • tools/strategy_reviewer.py: tool I/O compatibility and dependency checks.
  • tools/tool_registry.py + tools/tool_lake.py: tool descriptors, executors, and registry.
  • knowledge_base/: Data Lake, Tool Lake, Case Lake persistence and retrieval.
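
For orientation, a hybrid plan step might carry roughly the following information. The keys shown are illustrative, not the authoritative fields from core/plan_schema.py, and the tool name is hypothetical.

plan_steps = [
    {
        "id": "step_1",
        "kind": "tool",                    # executed via a Tool Lake executor
        "tool": "highpass_filter",         # hypothetical tool name
        "params": {"cutoff_hz": 0.3},
        "depends_on": [],
    },
    {
        "id": "step_2",
        "kind": "code",                    # LLM-generated Python run through the execute-repair loop
        "instruction": "Fit and apply EOG regression to remove ocular artifacts",
        "depends_on": ["step_1"],          # executed after step_1 in topological order
    },
]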

Project Structure

sci-data-copilot/
|-- agents/
|   |-- intent_parsing_agent.py
|   |-- data_access_agent.py
|   |-- data_processing_agent.py
|   `-- data_integration_agent.py
|-- core/
|   |-- plan_schema.py
|   |-- execute_repair_loop.py
|   `-- workflow.py
|-- tools/
|   |-- plan_generator.py
|   |-- requirement_analyzer.py
|   |-- strategy_reviewer.py
|   |-- tool_registry.py
|   `-- ...
|-- knowledge_base/
|   |-- data/
|   |   |-- data_lake.json
|   |   `-- case_lake.json
|   `-- *.py
|-- prompts/
|   |-- intent_prompts.py
|   |-- eeg_prompts.py
|   |-- polar_prompts.py
|   `-- ...
|-- config/
|   `-- config.yaml
|-- scripts/
|   `-- init_knowledge_base.py
|-- xlsx/
|   `-- sample tabular data
|-- tests/
|   `-- tool and integration tests
|-- requirements.txt
`-- sci_data_copilot.py

Installation Guide

1) Prerequisites

  • Python 3.10+ (recommended)
  • pip

2) Install dependencies

pip install -r requirements.txt

3) Configure API keys (important)

Before running, configure both:

  • config/config.yaml (LLM endpoint/key used by the main workflow)
  • tools/acquire_config.py (acquisition-related API settings, if you run acquire workflows)

Minimal config/config.yaml example:

model_name: "gpt-5.2"
openai_api_key: "YOUR_API_KEY"
openai_base_url: "YOUR_COMPATIBLE_API_BASE"
max_iterations: 5
save_dir: "./exp"
temp_code_path: "./exp/generated_code.py"
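
A rough sketch of how these keys could be used (assuming the openai and PyYAML packages are installed; the project wires this up internally through its own config loading):

import yaml
from openai import OpenAI

# Sketch only: load the workflow config and build an OpenAI-compatible client from it.
with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

client = OpenAI(api_key=cfg["openai_api_key"], base_url=cfg["openai_base_url"])
response = client.chat.completions.create(
    model=cfg["model_name"],
    messages=[{"role": "user", "content": "ping"}],
)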

4) Initialize the knowledge base

python scripts/init_knowledge_base.py

Run this once after dependency/config setup. It seeds Case Lake/Data Lake defaults used by routing and planning.
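
The seeded Data Lake is a JSON index of known datasets. An entry might look roughly like the following (the field names and file path are hypothetical); this is also where the POLAR paths mentioned in the Polar example below need to point to real local files.

# Hypothetical shape of one entry in knowledge_base/data/data_lake.json.
polar_entry = {
    "name": "polar_hourly_records",
    "modality": "tabular",
    "local_path": "xlsx/polar_records.xlsx",   # must resolve to a real file on your machine
    "description": "Hourly polar station measurements plus a separate header file",
}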

Usage Guide (Reproducible)

All commands below are run from repository root.

A) EEG example (currently supported)

Note: the first run may need to download the MNE sample dataset, so it can take a while.

python sci_data_copilot.py -p "Please perform ocular artifact correction on EEG and MEG data using the MNE sample dataset. First, apply a 0.3 Hz high-pass filter to remove slow drifts. Then fit and apply an EOG regression model to remove eye movement artifacts from all EEG, magnetometer, and gradiometer channels. After correction, extract epochs for the event types visual/left and visual/right within the time window 0.1 s to 0.5 s, applying baseline correction between 0.1 s and 0 s. The expected outputs are a topographic map of the EOG regression weights showing how ocular activity projects to different sensors, and comparison plots of evoked responses before and after correction across EEG (59 channels), gradiometers (203 channels), and magnetometers (102 channels), demonstrating the reduction of eye movement artifacts."

B) Polar example (currently supported)

Before running this example, make sure the two POLAR files in Data Lake map to valid local paths. If needed, update the POLAR dataset section in scripts/init_knowledge_base.py, then re-run initialization.

python sci_data_copilot.py -p "Process polar tabular data: merge header and records, compute daily averages from hourly values, then split outputs by month."

C) Acquire example (protein / UniProt, currently supported)

python sci_data_copilot.py -p "Download all P450 enzyme records from UniProt, including sequence information and catalytic reaction information."

D) General mode (experimental)

For tasks outside the three examples above, you can force general mode:

python sci_data_copilot.py -t general -p "Analyze taxi GPS logs and detect rush-hour anomalies"

General mode is supported but currently weaker than the main tuned examples.

Output conventions

Each run writes an experiment directory under exp/, for example:

exp/eeg_exp_YYYYMMDD_HHMMSS/
|-- processing_plan.json
|-- data_profile.json
|-- hybrid_execution.json
|-- step_<id>.py
|-- step_<id>_result.json
`-- ...

For acquire workflows, downloaded artifacts are stored inside the run directory (rather than in a global top-level downloads/), for example:

exp/acquire_exp_YYYYMMDD_HHMMSS/
|-- acquire_results.json
|-- processing_results.json
|-- downloads/
|   |-- *.csv
|   `-- *.pdf
`-- ...

Additional Notes

Data Health Scoring

The project evaluates data quality along three weighted dimensions:

  • intrinsic quality (completeness/consistency)
  • distributional quality (statistical reasonableness)
  • utility quality (fitness for target task)

Baseline and post-processing scores can be compared to quantify improvement.
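
A minimal sketch of this kind of weighted scoring is shown below; the weights and the 0-1 scale are assumptions, not the project's exact values.

def data_health_score(intrinsic: float, distributional: float, utility: float,
                      weights=(0.4, 0.3, 0.3)) -> float:
    """Combine three sub-scores (each assumed in [0, 1]) into one weighted health score."""
    w_i, w_d, w_u = weights
    return w_i * intrinsic + w_d * distributional + w_u * utility

baseline_score = data_health_score(0.72, 0.80, 0.65)   # before processing
post_score     = data_health_score(0.90, 0.85, 0.88)   # after processing
improvement    = post_score - baseline_score           # > 0 means data health improved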

Design Principles

  • LLM-first reasoning, with deterministic schema validation.
  • Explicit artifacts and dependencies for reproducibility.
  • Robust execution through tool fallback and code repair loops (sketched below).
  • Keep backward compatibility when possible (data_type remains an optional override).
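
The execute-repair loop mentioned above (core/execute_repair_loop.py) can be pictured roughly as follows. The generate_code callable is a hypothetical stand-in for the LLM code-generation call, and max_iterations corresponds to the config/config.yaml setting.

import subprocess
import sys

def execute_with_repair(generate_code, code_path: str, max_iterations: int = 5) -> bool:
    """Sketch: ask the LLM for code, run it, and feed failures back until it passes or we give up."""
    feedback = None
    for _ in range(max_iterations):
        code = generate_code(feedback)        # hypothetical LLM call; returns a Python script
        with open(code_path, "w") as f:
            f.write(code)
        result = subprocess.run([sys.executable, code_path], capture_output=True, text=True)
        if result.returncode == 0:
            return True                       # step succeeded
        feedback = result.stderr              # traceback becomes the next repair prompt
    return False                              # attempts exhausted; caller falls back to another strategy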

Rule-based vs LLM-first (high level)

  • Legacy: regex/rule-based routing dominates, with weaker cross-domain generalization.
  • Current: LLM-driven requirement split plus hybrid planning, better suited to mixed tasks.
  • Safety net: schema validation, repair loops, and selective fallback paths.

FAQ

  • Do I need to pass data_type?

    • No. Auto routing is supported. data_type is an optional override.
  • Where are logs and intermediate files?

    • In the run-specific folder under exp/.
  • Can I provide a direct local data path?

    • Yes. Explicit path resolution is the first priority in DataAccessAgent.
  • What if data is missing?

    • DataAccessAgent tries a Data Lake search, then acquisition tool steps, and finally returns a follow-up message to the user; the sketch below summarizes this order.
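
That fallback order can be summarized in a small sketch; the helper functions below are hypothetical stand-ins for the actual Data Lake and acquisition logic.

import os

def search_data_lake(request):        # stub: the real lookup lives in knowledge_base/
    return None

def run_acquisition_tools(request):   # stub: the real acquisition uses Tool Lake executors
    return None

def resolve_dataset(request, explicit_path=None):
    """Hypothetical summary of DataAccessAgent's resolution priority."""
    if explicit_path and os.path.exists(explicit_path):
        return {"status": "resolved", "path": explicit_path}      # 1) user-provided local path
    hit = search_data_lake(request)                               # 2) Data Lake search
    if hit:
        return {"status": "resolved", "path": hit["local_path"]}
    acquired = run_acquisition_tools(request)                     # 3) acquisition tool steps
    if acquired:
        return {"status": "resolved", "path": acquired}
    return {"status": "unresolved",                               # 4) ask the user to clarify
            "follow_up": "Please provide a local path or a more specific data description."}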

If you are starting from scratch, add your API keys to config/config.yaml and tools/acquire_config.py, then run these three commands in order:

pip install -r requirements.txt
python scripts/init_knowledge_base.py
python sci_data_copilot.py -p "Your task here"
