SciDataCopilot

An Agentic Data Preparation Framework for AGI-driven Scientific Discovery

SciDataCopilot is a multi-agent system that turns natural-language research requests into executable scientific data workflows.

It is designed for end-to-end automation: requirement understanding, data discovery/acquisition, hybrid planning (tool + code), execution, and result integration.

(Figure: SciDataCopilot framework architecture)


Core Features

  • LLM-first intent understanding with a structured requirement split (see the sketch after this list):
    • data requirements (what data is needed)
    • processing requirements (what analysis/transformation is needed)
  • Hybrid plan execution with explicit step dependencies:
    • tool steps (Tool Lake executors)
    • code steps (LLM-generated Python with execute-repair loop)
  • Multi-agent orchestration via LangGraph state workflow.
  • Knowledge-driven routing with Data Lake + Tool Lake + Case Lake.
  • Built-in support for multiple scenarios:
    • tabular/Polar-style data processing
    • EEG/MNE workflows
    • acquisition workflows (for example UniProt/PDF/JSON pipelines)
  • Data health and quality tracking across baseline vs post-processing stages.
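
As a rough illustration of the requirement split, a parsed request might produce a structure like the one below. The field names here are hypothetical; the authoritative keys are defined in core/plan_schema.py.

# Illustrative only: hypothetical field names, not the exact plan_schema.py keys.
structured_requirements = {
    "data_requirements": {
        "sources": ["MNE sample dataset (EEG/MEG)"],
        "modalities": ["eeg", "meg"],
    },
    "processing_requirements": {
        "steps": [
            "apply a 0.3 Hz high-pass filter",
            "fit and apply EOG regression to remove ocular artifacts",
            "epoch on visual/left and visual/right events",
        ],
        "expected_outputs": [
            "topomap of EOG regression weights",
            "before/after evoked comparison plots",
        ],
    },
}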

Architecture

Multi-Agent Coordination Workflow

User Prompt
   |
   v
DataAccessAgent
  - Resolve datasets (local path -> Data Lake -> acquisition tools)
   |
   v   
IntentParsingAgent
  - Requirement split
  - Case retrieval/adaptation
  - Hybrid plan generation + validation
   |
   v
DataAccessAgent
  - Profile/inspect data
   |
   v
DataProcessingAgent
  - Execute hybrid steps in dependency order
  - Tool execution + code generation/debug
  - Artifact registry + run logs
   |
   v
DataIntegrationAgent
  - Consolidate outputs
  - Quality assessment
  - Analysis/recommendations
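
A minimal sketch of how such a workflow can be wired with LangGraph is shown below. The node names, state fields, and stub bodies are illustrative assumptions; the actual graph is defined in sci_data_copilot.py and core/workflow.py.

from typing import TypedDict
from langgraph.graph import StateGraph, END

class CopilotState(TypedDict, total=False):
    prompt: str
    resolved_sources: list
    plan: dict
    artifacts: dict
    report: str

# Each node returns a partial state update; bodies are stubs for illustration.
def resolve_data(state): return {"resolved_sources": []}   # local path -> Data Lake -> acquisition
def parse_intent(state): return {"plan": {}}               # requirement split + hybrid plan
def profile_data(state): return {}                         # inspect/profile resolved data
def process_data(state): return {"artifacts": {}}          # execute tool/code steps in order
def integrate(state):    return {"report": ""}             # consolidate outputs + quality check

graph = StateGraph(CopilotState)
graph.add_node("resolve_data", resolve_data)
graph.add_node("parse_intent", parse_intent)
graph.add_node("profile_data", profile_data)
graph.add_node("process_data", process_data)
graph.add_node("integrate", integrate)
graph.set_entry_point("resolve_data")
graph.add_edge("resolve_data", "parse_intent")
graph.add_edge("parse_intent", "profile_data")
graph.add_edge("profile_data", "process_data")
graph.add_edge("process_data", "integrate")
graph.add_edge("integrate", END)
app = graph.compile()
# app.invoke({"prompt": "Process polar tabular data ..."})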

Core Agents

  • IntentParsingAgent (agents/intent_parsing_agent.py)
    • Responsibility: parse intent, split requirements, generate/repair hybrid plans
    • Core functions: RequirementAnalyzer, CaseRetriever, PlanGenerator, StrategyReviewer
    • Example outputs: processing_plan.json, structured requirement dict
  • DataAccessAgent (agents/data_access_agent.py)
    • Responsibility: resolve and inspect data sources before execution
    • Core functions: Data Lake search, tool-based acquisition, modality mapping, profiling
    • Example outputs: data_profile.json, perception_report.txt, resolved/unresolved sources
  • DataProcessingAgent (agents/data_processing_agent.py)
    • Responsibility: execute the hybrid plan (tool + code) and keep artifacts consistent
    • Core functions: topological execution, ExecuteRepairLoop, fallback strategy, step logs
    • Example outputs: hybrid_execution.json, step_<id>.py, step_<id>_result.json
  • DataIntegrationAgent (agents/data_integration_agent.py)
    • Responsibility: integrate outputs and evaluate final quality
    • Core functions: strategy analysis, output assembly, quality comparison
    • Example outputs: integration_analysis.txt, final output files, recommendations

Key Runtime Components

  • sci_data_copilot.py: main entry point, state machine wiring, CLI.
  • core/plan_schema.py: authoritative schema for structured requirements and hybrid plan steps (an illustrative step is sketched after this list).
  • tools/plan_generator.py: LLM plan generation + schema validation/repair.
  • tools/strategy_reviewer.py: tool I/O compatibility and dependency checks.
  • tools/tool_registry.py + tools/tool_lake.py: tool descriptors, executors, and registry.
  • knowledge_base/: Data Lake, Tool Lake, Case Lake persistence and retrieval.
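
For orientation, a hybrid plan step might carry roughly the following information. The keys shown are illustrative, not the authoritative fields from core/plan_schema.py, and the tool name is hypothetical.

plan_steps = [
    {
        "id": "step_1",
        "kind": "tool",                    # executed via a Tool Lake executor
        "tool": "highpass_filter",         # hypothetical tool name
        "params": {"cutoff_hz": 0.3},
        "depends_on": [],
    },
    {
        "id": "step_2",
        "kind": "code",                    # LLM-generated Python run through the execute-repair loop
        "instruction": "Fit and apply EOG regression to remove ocular artifacts",
        "depends_on": ["step_1"],          # executed after step_1 in topological order
    },
]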

Project Structure

sci-data-copilot/
|-- agents/
|   |-- intent_parsing_agent.py
|   |-- data_access_agent.py
|   |-- data_processing_agent.py
|   `-- data_integration_agent.py
|-- core/
|   |-- plan_schema.py
|   |-- execute_repair_loop.py
|   `-- workflow.py
|-- tools/
|   |-- plan_generator.py
|   |-- requirement_analyzer.py
|   |-- strategy_reviewer.py
|   |-- tool_registry.py
|   `-- ...
|-- knowledge_base/
|   |-- data/
|   |   |-- data_lake.json
|   |   `-- case_lake.json
|   `-- *.py
|-- prompts/
|   |-- intent_prompts.py
|   |-- eeg_prompts.py
|   |-- polar_prompts.py
|   `-- ...
|-- config/
|   `-- config.yaml
|-- scripts/
|   `-- init_knowledge_base.py
|-- xlsx/
|   `-- sample tabular data
|-- tests/
|   `-- tool and integration tests
|-- requirements.txt
`-- sci_data_copilot.py

Installation Guide

1) Prerequisites

  • Python 3.10+ (recommended)
  • pip

2) Install dependencies

pip install -r requirements.txt

3) Configure API keys (important)

Before running, configure both:

  • config/config.yaml (LLM endpoint/key used by the main workflow)
  • tools/acquire_config.py (acquisition-related API settings, if you run acquire workflows)

Minimal config/config.yaml example:

model_name: "gpt-5.2"
openai_api_key: "YOUR_API_KEY"
openai_base_url: "YOUR_COMPATIBLE_API_BASE"
max_iterations: 5
save_dir: "./exp"
temp_code_path: "./exp/generated_code.py"
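
A rough sketch of how these keys could be used (assuming the openai and PyYAML packages are installed; the project wires this up internally through its own config loading):

import yaml
from openai import OpenAI

# Sketch only: load the workflow config and build an OpenAI-compatible client from it.
with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

client = OpenAI(api_key=cfg["openai_api_key"], base_url=cfg["openai_base_url"])
response = client.chat.completions.create(
    model=cfg["model_name"],
    messages=[{"role": "user", "content": "ping"}],
)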

4) Initialize the knowledge base

python scripts/init_knowledge_base.py

Run this once after dependency/config setup. It seeds Case Lake/Data Lake defaults used by routing and planning.
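
The seeded Data Lake is a JSON index of known datasets. An entry might look roughly like the following (the field names and file path are hypothetical); this is also where the POLAR paths mentioned in the Polar example below need to point to real local files.

# Hypothetical shape of one entry in knowledge_base/data/data_lake.json.
polar_entry = {
    "name": "polar_hourly_records",
    "modality": "tabular",
    "local_path": "xlsx/polar_records.xlsx",   # must resolve to a real file on your machine
    "description": "Hourly polar station measurements plus a separate header file",
}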

Usage Guide (Reproducible)

All commands below are run from repository root.

A) EEG example (currently supported)

Note: the first run may need to download the MNE sample dataset, so it can take a while.

python sci_data_copilot.py -p "Please perform ocular artifact correction on EEG and MEG data using the MNE sample dataset. First, apply a 0.3 Hz high-pass filter to remove slow drifts. Then fit and apply an EOG regression model to remove eye movement artifacts from all EEG, magnetometer, and gradiometer channels. After correction, extract epochs for the event types visual/left and visual/right within the time window 0.1 s to 0.5 s, applying baseline correction between 0.1 s and 0 s. The expected outputs are a topographic map of the EOG regression weights showing how ocular activity projects to different sensors, and comparison plots of evoked responses before and after correction across EEG (59 channels), gradiometers (203 channels), and magnetometers (102 channels), demonstrating the reduction of eye movement artifacts."

B) Polar example (currently supported)

Before running this example, make sure the two POLAR files in Data Lake map to valid local paths. If needed, update the POLAR dataset section in scripts/init_knowledge_base.py, then re-run initialization.

python sci_data_copilot.py -p "Process polar tabular data: merge header and records, compute daily averages from hourly values, then split outputs by month."

C) Acquire example (protein / UniProt, currently supported)

python sci_data_copilot.py -p "Download all P450 enzyme records from UniProt, including sequence information and catalytic reaction information."

D) General mode (experimental)

For tasks outside the three examples above, you can force general mode:

python sci_data_copilot.py -t general -p "Analyze taxi GPS logs and detect rush-hour anomalies"

General mode is supported but currently weaker than the main tuned examples.

Output conventions

Each run writes an experiment directory under exp/, for example:

exp/eeg_exp_YYYYMMDD_HHMMSS/
|-- processing_plan.json
|-- data_profile.json
|-- hybrid_execution.json
|-- step_<id>.py
|-- step_<id>_result.json
`-- ...

For acquire workflows, downloaded artifacts are stored inside the run directory (rather than in a global top-level downloads/), for example:

exp/acquire_exp_YYYYMMDD_HHMMSS/
|-- acquire_results.json
|-- processing_results.json
|-- downloads/
|   |-- *.csv
|   `-- *.pdf
`-- ...

Additional Notes

Data Health Scoring

The project evaluates data quality along three weighted dimensions:

  • intrinsic quality (completeness/consistency)
  • distributional quality (statistical reasonableness)
  • utility quality (fitness for target task)

Baseline and post-processing scores can be compared to quantify improvement.
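
A minimal sketch of this kind of weighted scoring is shown below; the weights and the 0-1 scale are assumptions, not the project's exact values.

def data_health_score(intrinsic: float, distributional: float, utility: float,
                      weights=(0.4, 0.3, 0.3)) -> float:
    """Combine three sub-scores (each assumed in [0, 1]) into one weighted health score."""
    w_i, w_d, w_u = weights
    return w_i * intrinsic + w_d * distributional + w_u * utility

baseline_score = data_health_score(0.72, 0.80, 0.65)   # before processing
post_score     = data_health_score(0.90, 0.85, 0.88)   # after processing
improvement    = post_score - baseline_score           # > 0 means data health improved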

Design Principles

  • LLM-first reasoning, with deterministic schema validation.
  • Explicit artifacts and dependencies for reproducibility.
  • Robust execution through tool fallback and code repair loops (sketched below).
  • Keep backward compatibility when possible (data_type remains an optional override).
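
The execute-repair loop mentioned above (core/execute_repair_loop.py) can be pictured roughly as follows. The generate_code callable is a hypothetical stand-in for the LLM code-generation call, and max_iterations corresponds to the config/config.yaml setting.

import subprocess
import sys

def execute_with_repair(generate_code, code_path: str, max_iterations: int = 5) -> bool:
    """Sketch: ask the LLM for code, run it, and feed failures back until it passes or we give up."""
    feedback = None
    for _ in range(max_iterations):
        code = generate_code(feedback)        # hypothetical LLM call; returns a Python script
        with open(code_path, "w") as f:
            f.write(code)
        result = subprocess.run([sys.executable, code_path], capture_output=True, text=True)
        if result.returncode == 0:
            return True                       # step succeeded
        feedback = result.stderr              # traceback becomes the next repair prompt
    return False                              # attempts exhausted; caller falls back to another strategy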

Rule-based vs LLM-first (high level)

  • Legacy: regex/rule-based routing dominates, with weaker cross-domain generalization.
  • Current: LLM-driven requirement split plus hybrid planning, better suited to mixed tasks.
  • Safety net: schema validation, repair loops, and selective fallback paths.

FAQ

  • Do I need to pass data_type?

    • No. Auto routing is supported. data_type is an optional override.
  • Where are logs and intermediate files?

    • In the run-specific folder under exp/.
  • Can I provide a direct local data path?

    • Yes. Explicit path resolution is the first priority in DataAccessAgent.
  • What if data is missing?

    • DataAccessAgent tries a Data Lake search, then acquisition tool steps, and finally returns a follow-up message to the user; the sketch below summarizes this order.
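
That fallback order can be summarized in a small sketch; the helper functions below are hypothetical stand-ins for the actual Data Lake and acquisition logic.

import os

def search_data_lake(request):        # stub: the real lookup lives in knowledge_base/
    return None

def run_acquisition_tools(request):   # stub: the real acquisition uses Tool Lake executors
    return None

def resolve_dataset(request, explicit_path=None):
    """Hypothetical summary of DataAccessAgent's resolution priority."""
    if explicit_path and os.path.exists(explicit_path):
        return {"status": "resolved", "path": explicit_path}      # 1) user-provided local path
    hit = search_data_lake(request)                               # 2) Data Lake search
    if hit:
        return {"status": "resolved", "path": hit["local_path"]}
    acquired = run_acquisition_tools(request)                     # 3) acquisition tool steps
    if acquired:
        return {"status": "resolved", "path": acquired}
    return {"status": "unresolved",                               # 4) ask the user to clarify
            "follow_up": "Please provide a local path or a more specific data description."}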

If you are starting from scratch, add your API keys to config/config.yaml and tools/acquire_config.py, then run these three commands in order:

pip install -r requirements.txt
python scripts/init_knowledge_base.py
python sci_data_copilot.py -p "Your task here"
