A comprehensive, enterprise-grade toolkit for analyzing, profiling, and migrating Hospital Information System (HIS) databases. Features a dual-interface architecture with both a Streamlit dashboard and a FastAPI REST API with real-time Socket.IO events for background pipeline execution.
- Multi-Database Support: Analyze MySQL, PostgreSQL, and MSSQL databases
- Deep Data Profiling: Column-level statistics, data quality metrics, and composition analysis
- Schema Analysis: Automatic DDL extraction with schema namespace support
- Smart Sampling: Intelligent data sampling with NULL and empty string filtering
- Auto-Dependency Management: Automatic installation of required database clients
- HTML Reports: Beautiful, interactive reports with DataTables integration
- Configuration Generator: Export migration configs in TypeScript/JSON format
- REST API: Full CRUD API via FastAPI with JSON:API responses
- Socket.IO: Real-time job events (batch progress, errors, completion)
- Background Jobs: POST /api/v1/jobs triggers the pipeline in a background thread
- Per-Step Datasource Resolution: Each pipeline step resolves its own source/target datasource from a UUID FK
- Pipeline Nodes & Edges: Visual workflow graph with dependency-based topological sort
- generate_sql Priority: Custom SQL queries take priority over auto-generated SELECT
- Datasource Management: Centralized database connection profiles with PostgreSQL storage
- Connection Pooling: Singleton pattern for efficient connection reuse across requests
- Enhanced Schema Mapper: Dual-mode source selection (Run ID or Live Datasource)
- Live Schema Discovery: Dynamic table and column loading from connected databases
- Smart Column Suggestions: Auto-suggest target columns from the actual database schema
- Configuration Repository: Save and load mapping configurations from the project database
- Configuration History: Version tracking with comparison and rollback capabilities
- Migration Engine: Production-ready ETL execution with batch processing and logging
- AI-Powered Mapping: Semantic column matching using ML transformers and healthcare dictionaries
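As a taste of the background-job features above: Socket.IO events are emitted from a worker thread into the async server's event loop. The sketch below shows the general `asyncio.run_coroutine_threadsafe` bridging idea; the `emit_from_thread` name comes from `api/socket_manager.py`, but its signature and the in-memory `emit` stand-in here are assumptions, not the actual implementation.

```python
import asyncio
import threading

received = []  # collected events, standing in for connected Socket.IO clients

async def emit(event, data):
    # Stand-in for sio.emit(event, data) on the async Socket.IO server.
    received.append((event, data))

def emit_from_thread(loop, event, data):
    # Schedule the async emit on the server's running event loop from a
    # plain worker thread, then block until it has been executed.
    future = asyncio.run_coroutine_threadsafe(emit(event, data), loop)
    return future.result(timeout=5)

def pipeline_worker(loop):
    # Background pipeline thread reporting batch progress.
    for batch in range(3):
        emit_from_thread(loop, "job:batch", {"batch": batch, "rows": 1000})
    emit_from_thread(loop, "job:completed", {"total_rows": 3000})

async def main():
    loop = asyncio.get_running_loop()
    worker = threading.Thread(target=pipeline_worker, args=(loop,))
    worker.start()
    # Keep the loop alive while the worker emits.
    while worker.is_alive():
        await asyncio.sleep(0.01)
    worker.join()

asyncio.run(main())
print(received[-1][0])  # job:completed
```

The same pattern lets a synchronous background job report `job:batch`, `job:error`, and `job:completed` without ever touching the event loop directly.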
- Architecture
- Requirements
- Installation
- Quick Start
- Configuration
- Usage
- Workflow
- Advanced Features
- Troubleshooting
- Testing
- Contributing
- License
The codebase follows Clean Architecture with a dual-interface pattern: Streamlit (MVC dashboard) and FastAPI (REST API).
his-analyzer/
├── app.py                        # Streamlit routing + sidebar navigation
├── config.py                     # Constants: TRANSFORMER_OPTIONS, VALIDATOR_OPTIONS, DB_TYPES
├── database.py                   # Legacy facade (deprecated, being removed)
│
├── api/                          # FastAPI REST API + Socket.IO
│   ├── main.py                   # App setup, CORS, router registration, /ws mount
│   ├── socket_manager.py         # Async Socket.IO server + emit_from_thread()
│   ├── base/                     # Shared API infrastructure
│   │   ├── controller.py         # BaseController (generic CRUD)
│   │   ├── service.py            # BaseService with pagination/sanitize
│   │   ├── exceptions.py         # JSON API error handlers
│   │   └── json_api.py           # JSON:API response builder
│   ├── datasources/              # /api/v1/datasources
│   ├── configs/                  # /api/v1/configs (+ /histories, /versions)
│   ├── pipelines/                # /api/v1/pipelines (with nodes/edges)
│   ├── pipeline_runs/            # /api/v1/pipeline-runs
│   └── jobs/                     # /api/v1/jobs (POST → trigger background pipeline)
│
├── models/                       # Data classes (pure Python, no I/O)
│   ├── datasource.py             # Datasource dataclass
│   ├── migration_config.py       # ConfigRecord, MigrationConfig, MappingItem
│   ├── job.py                    # JobRecord, JobUpdateRecord
│   └── pipeline_config.py        # PipelineConfig, PipelineStep, PipelineNodeRecord,
│                                 #   PipelineEdgeRecord, PipelineRunRecord
│
├── protocols/                    # Protocol interfaces (DIP)
│   └── repository.py             # Repository protocol interfaces
│
├── repositories/                 # Data access layer (PostgreSQL)
│   ├── connection.py             # SQLAlchemy engine singleton
│   ├── base.py                   # DDL + init_db()
│   ├── datasource_repo.py        # Datasource CRUD
│   ├── config_repo.py            # Config CRUD + versioning
│   ├── pipeline_repo.py          # Pipeline CRUD + get_by_id (JOIN nodes/edges/configs)
│   ├── pipeline_node_repo.py     # Pipeline node CRUD
│   ├── pipeline_edge_repo.py     # Pipeline edge CRUD
│   ├── pipeline_run_repo.py      # Pipeline run CRUD
│   └── job_repo.py               # Job CRUD
│
├── services/                     # Business logic (no Streamlit imports)
│   ├── datasource_repository.py  # DatasourceRepository facade
│   ├── db_connector.py           # SQLAlchemy engine factory (MySQL, PG, MSSQL)
│   ├── ml_mapper.py              # SmartMapper: HIS dictionary + semantic matching
│   ├── transformers.py           # DataTransformer: vectorised Pandas transformations
│   ├── checkpoint_manager.py     # Checkpoint save/load/clear for resumable migrations
│   ├── migration_logger.py       # Per-run ETL log files
│   ├── encoding_helper.py        # clean_dataframe for Thai legacy data
│   ├── migration_executor.py     # Single-table ETL engine (generate_sql priority)
│   ├── pipeline_service.py       # Pipeline orchestration + per-step datasource resolution
│   └── query_builder.py          # SELECT builder, batch transform, bulk insert
│
├── dialects/                     # Database dialects (OCP)
│   ├── mysql.py, postgresql.py, mssql.py
│   └── registry.py               # Dialect registry
│
├── data_transformers/            # Data transformations (OCP)
│   ├── text.py, dates.py, healthcare.py, names.py, data_type.py, lookup.py
│   └── registry.py               # @register_transformer decorator
│
├── validators/                   # Data validators (OCP)
│   ├── not_null.py, unique.py, range_check.py
│   └── registry.py               # @register_validator decorator
│
├── controllers/                  # MVC Controllers (6/6)
│   ├── settings_controller.py
│   ├── pipeline_controller.py
│   ├── file_explorer_controller.py
│   ├── er_diagram_controller.py
│   ├── schema_mapper_controller.py
│   └── migration_engine_controller.py
│
├── views/                        # MVC Views (pure rendering)
│   └── components/               # Reusable UI components
│       ├── schema_mapper/        # Source selector, mapping editor, config actions
│       ├── migration/            # Step config, connections, execution
│       └── shared/               # Dialogs, styles
│
├── scripts/                      # Utility scripts
│   ├── migrate_sqlite_to_pg.py   # One-time SQLite → PostgreSQL migration
│   └── migrate_add_jobs_table.py # Create jobs table + job_id FK
│
├── tests/                        # pytest test suite
├── analysis_report/              # Database Analysis Engine (Shell)
└── mini_his/                     # Mock HIS data
| Layer | Rule |
|---|---|
| `models/` | Pure dataclasses – no I/O, no Streamlit |
| `services/` | Business logic – no `st.*` calls |
| `views/` | Thin orchestrators – call components, manage step flow |
| `views/components/` | Reusable UI widgets – read/write `session_state`, render widgets |
| `utils/` | Stateless pure helpers + `PageState` session abstraction |
┌───────────────────────────────────────────────────────────────────┐
│                        ANALYSIS PHASE (Bash)                      │
├───────────────────────────────────────────────────────────────────┤
│  Source DB → unified_db_analyzer.sh → CSV/HTML Reports            │
└───────────────────────────────────────────────────────────────────┘
                                 ↓
┌───────────────────────────────────────────────────────────────────┐
│                     MAPPING PHASE (Python + AI)                   │
├───────────────────────────────────────────────────────────────────┤
│  Schema Mapper (Streamlit) + REST API                             │
│  ├── AI Auto-Map: ML model suggests column mappings               │
│  ├── Manual Review: User confirms/modifies                        │
│  ├── Transformer Selection: Date conv, trim, etc.                 │
│  ├── generate_sql: Custom SELECT (optional, priority)             │
│  └── Save: Config → PostgreSQL (with versioning)                  │
└───────────────────────────────────────────────────────────────────┘
                                 ↓
┌───────────────────────────────────────────────────────────────────┐
│                 PIPELINE PHASE (FastAPI + Background)             │
├───────────────────────────────────────────────────────────────────┤
│  POST /api/v1/jobs { pipeline_id } → 202 Accepted                 │
│  ├── PipelineExecutor resolves nodes + edges (topological sort)   │
│  ├── Per-step datasource resolution (UUID → engine per step)      │
│  ├── Each step: extract → transform → load                        │
│  │   ├── generate_sql (priority) or build_select_query (fallback) │
│  │   └── Transformers applied after pd.read_sql()                 │
│  ├── Socket.IO events: job:batch, job:error, job:completed        │
│  └── Frontend receives real-time progress                         │
└───────────────────────────────────────────────────────────────────┘
                                 ↓
┌───────────────────────────────────────────────────────────────────┐
│                        STORAGE (PostgreSQL)                       │
├───────────────────────────────────────────────────────────────────┤
│  datasources → configs (FK) → pipeline_nodes → pipeline_edges     │
│  jobs → pipeline_runs → pipeline_steps (JSON)                     │
│  config_histories (versioning)                                    │
└───────────────────────────────────────────────────────────────────┘
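The pipeline phase orders steps by their dependency edges before executing. A minimal sketch of that dependency-based topological sort using the standard library (node names and the `(upstream, downstream)` edge shape here are illustrative, not the actual record schema):

```python
from graphlib import TopologicalSorter

# Illustrative pipeline graph: each edge means "downstream depends on upstream".
nodes = ["patients", "visits", "diagnoses"]
edges = [("patients", "visits"), ("visits", "diagnoses")]

# TopologicalSorter expects {node: set_of_predecessors}.
graph = {node: set() for node in nodes}
for upstream, downstream in edges:
    graph[downstream].add(upstream)

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['patients', 'visits', 'diagnoses']
```

`TopologicalSorter` also raises `CycleError` on circular dependencies, which is a useful validation step before any ETL work starts.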
- Operating System: Linux, macOS, Windows (via WSL2)
- Shell: Bash 4.0+ (auto-switch on macOS)
- Python: 3.8 or higher
- RAM: 4GB minimum, 8GB+ recommended for large databases
The toolkit requires database-specific clients:
- MySQL: `mysql-client`
- PostgreSQL: `libpq` (PostgreSQL client)
- MSSQL: `mssql-tools18` (with ODBC driver)
Note: On macOS with Homebrew, these dependencies are auto-installed when missing.
- `streamlit >= 1.30.0` – Web dashboard framework
- `pandas >= 2.0.0` – Data manipulation
- `jq` – JSON processor (system package)
The repository includes a Makefile to simplify setup and development.
# 1. Clone the repository
git clone https://github.com/yourusername/his-analyzer.git
cd his-analyzer
# 2. One-command setup (creates venv + installs dependencies)
make setup
# 3. Start developing with hot-reload
make run
# 4. View all available commands
make help

Available Makefile Commands:
make setup # Create venv + install dependencies (first time)
make install # Install dependencies only (venv must exist)
make run # Start app with hot-reload (default)
make run-reload # Start app with hot-reload (explicit)
make run-no-reload # Start app without hot-reload
make test # Run all unit tests (pytest discovers everything)
make test-simple # Run AI pattern detection tests only
make test-column # Run column analysis tests only
make test-suite # Run tests/ directory only
make clean # Remove venv and __pycache__
make help           # Show all commands

Using a virtual environment prevents version conflicts with system Python packages.
# 1. Clone the repository
git clone https://github.com/yourusername/his-analyzer.git
cd his-analyzer
# 2. Create virtual environment
python3.11 -m venv venv
# 3. Activate virtual environment
source venv/bin/activate # macOS/Linux
# venv\Scripts\activate # Windows
# 4. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# 5. Start the app with hot-reload
python3.11 -m streamlit run app.py --server.runOnSave true

# Install Python dependencies
pip3 install -r requirements.txt
# Install system dependencies (macOS with Homebrew)
brew install jq
# Install database clients as needed
brew install mysql-client
brew install libpq
brew tap microsoft/mssql-release && brew install mssql-tools18
# Start the app
streamlit run app.py

On first run, the application automatically creates the migration_tool.db SQLite database:
# Using Makefile (recommended)
make run
# OR manually
python3.11 -m streamlit run app.py --server.runOnSave true

The database is created automatically with the following tables:
- `datasources` – Stores database connection profiles
- `configs` – Stores schema mapping configurations
No manual setup required! The database initialization happens on application startup.
Navigate to Settings page in the Streamlit interface to manage datasources:
- Click "Settings" in the sidebar
- Select "Datasources" tab
- Click "Add New Datasource"
- Fill in connection details:
- Name (unique identifier)
- Database Type (MySQL, PostgreSQL, MSSQL)
- Host, Port, Database Name
- Username, Password
- Test connection
- Save datasource
Datasources are stored in SQLite and reused across:
- Schema Mapper (source & target selection)
- Migration Engine (connection profiles)
- All database operations (via connection pool)
Edit analysis_report/config.json:
{
"database": {
"type": "mysql",
"host": "localhost",
"port": "3306",
"name": "hospital_db",
"user": "root",
"password": "your_password",
"schema": "",
"tables": []
},
"sampling": {
"default_limit": 10,
"max_text_length": 300,
"deep_analysis": true,
"exceptions": []
}
}

cd analysis_report
./unified_db_analyzer.sh

Output: Creates a timestamped report in migration_report/YYYYMMDD_HHMM/
Note: In v8.0, you can also connect directly to datasources in Schema Mapper, bypassing the need for analysis reports.
| Field | Description | Example |
|---|---|---|
| `type` | Database type | `mysql`, `postgresql`, `mssql` |
| `host` | Database host | `localhost`, `192.168.1.100` |
| `port` | Database port | `3306`, `5432`, `1433` |
| `name` | Database name | `hospital_db` |
| `user` | Username | `admin` |
| `password` | Password | `secure_password` |
| `schema` | Schema name (optional) | `public`, `dbo` |
| `tables` | Specific tables (optional) | `["patients", "visits"]` |
Specify database schema for PostgreSQL and MSSQL:
{
"database": {
"type": "postgresql",
"schema": "public",
...
}
}

Defaults:
- PostgreSQL: `public`
- MSSQL: `dbo`
- MySQL: Not applicable
| Parameter | Description | Default |
|---|---|---|
| `default_limit` | Number of sample rows | `10` |
| `max_text_length` | Max characters for text fields | `300` |
| `deep_analysis` | Enable detailed statistics | `true` |
| `exceptions` | Per-column overrides | `[]` |
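The `default_limit` and per-column `exceptions` settings combine into an effective sample limit per column. A minimal sketch of the assumed resolution logic (the function name is illustrative; the actual script implements this in Bash/jq):

```python
# Sample configuration matching the parameters documented above.
config = {
    "sampling": {
        "default_limit": 10,
        "exceptions": [
            {"table": "patients", "column": "notes", "limit": 3},
        ],
    },
}

def sample_limit(table, column, sampling):
    """Return the per-column exception limit if one exists, else the default."""
    for rule in sampling.get("exceptions", []):
        if rule["table"] == table and rule["column"] == column:
            return rule["limit"]
    return sampling.get("default_limit", 10)

print(sample_limit("patients", "notes", config["sampling"]))  # 3 (exception)
print(sample_limit("patients", "hn", config["sampling"]))     # 10 (default)
```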
Override sampling limits for specific columns:
{
"sampling": {
"exceptions": [
{ "table": "patients", "column": "notes", "limit": 3 },
{ "table": "visits", "column": "diagnosis", "limit": 5 }
]
}
}

This toolkit uses a pure Bash shell script (unified_db_analyzer.sh) for database profiling instead of Python ETL frameworks or commercial tools. Here's why:
| Aspect | Shell Script Approach | Python/Tools Alternative |
|---|---|---|
| Dependencies | Minimal: bash, jq, native DB clients | Heavy: pandas, SQLAlchemy, various libraries |
| Portability | Runs anywhere with Bash 4.0+ | Requires Python environment setup |
| Performance | Direct database access, minimal overhead | Abstraction layers slow down queries |
| Security | No code execution risks, simple audit | Complex dependency chains, supply-chain risks |
| Maintenance | Single 600-line script, easy to debug | Multiple packages, version conflicts |
| Installation | Auto-installs missing DB clients via Homebrew | Manual pip installs, virtual environments |
1. Zero-Setup Profiling
# No Python, no pip install, no virtual env - just run
cd analysis_report
./unified_db_analyzer.sh

2. Multi-Database Native Support
- Directly uses `mysql`, `psql`, and `sqlcmd` for optimal performance
- Schema-aware profiling (PostgreSQL `public`, MSSQL `dbo`)
- Handles database-specific quirks (MSSQL SSL certs, NULL warnings)
3. Production-Ready Features
- Smart Sampling: Filters NULL/empty values automatically
- Deep Analysis Mode: Min/Max, Top-5 frequencies, data composition
- Exception Rules: Per-column sampling limits
- Table Size Calculation: Actual disk usage in MB
- DDL Export: Complete schema with indexes and constraints
4. Migration-Friendly Output
- CSV Format: Universal, works with any ETL tool
- HTML Reports: Interactive DataTables for business users
- Timestamped Runs: Tracks profiling history (`YYYYMMDD_HHMM/`)
- Process Logs: Complete audit trail for compliance
5. Real-World Migration Use Cases
# Before migration: Profile source database
./unified_db_analyzer.sh # Analyzes source system
# Review data quality, identify issues
open migration_report/20251130_1523/data_profile/data_profile.html
# Load into Streamlit for schema mapping
# Use profiling data to design transformations
# Execute migration with confidence
# Knowing exact data types, null counts, value ranges

6. Shell Script Advantages for Migration
- Repeatable: Run daily to track data changes over time
- Scriptable: Integrate into CI/CD pipelines
- Offline: Profile production DB, analyze on laptop (CSV export)
- Auditable: Single script = complete transparency
- Fast: No Python overhead, direct SQL execution
cd analysis_report
./unified_db_analyzer.sh

Features:
- Auto-detects database type from `config.json`
- Checks and installs missing dependencies (macOS with Homebrew)
- Exports DDL schema to `schema.sql`
- Generates CSV data profile with smart NULL/empty filtering
- Creates interactive HTML report with DataTables
- Logs all operations to `process.log`
Output Structure:
migration_report/20251124_0023/
├── ddl_schema/
│   └── schema.sql            # Complete DDL export
├── data_profile/
│   ├── data_profile.csv      # Raw profiling data
│   └── data_profile.html     # Interactive report
└── process.log               # Execution log
The dashboard provides several interfaces:
Dual Source Mode:
- Run ID Mode: Load from CSV analysis reports (legacy)
- Datasource Mode: Connect directly to live database (new!)
Features:
- View table and column statistics
- Map source to target fields with live schema discovery
- Smart target column suggestions from actual database
- Select data transformers and validators
- Save/load configurations from project database
- Generate TypeScript/JSON configurations
- Export configurations as downloadable files
Workflow:
- Source Configuration: Choose Run ID or Datasource
- Run ID: Select from analysis report folders
- Datasource: Select datasource → Choose table (auto-loads schema)
- Target Configuration: Select target datasource and table
- Field Mapping: Map source columns to target with suggestions
- Save Configuration: Store in SQLite for reuse
- Export: Download as JSON for migration tools
Datasources Tab:
- Add/Edit/Delete datasource profiles
- Test database connections
- View all configured datasources
- Secure credential storage in SQLite
Saved Configs Tab:
- View all saved schema mapping configurations
- Load configurations for editing
- Delete unused configurations
- Export configurations
- Select source and target from datasource profiles
- Load saved configurations from project database
- Upload configuration files
- Execute data migration (simulation mode)
- Browse database schema
- Click tables to view CREATE statements
- Navigate foreign key relationships
- Generate test data for migration testing
- Configurable data volumes
- Realistic HIS data patterns
flowchart TD
    A[Configure config.json] --> B{Select Database Type}
    B -->|MySQL| C1[MySQL Client]
    B -->|PostgreSQL| C2[PostgreSQL Client]
    B -->|MSSQL| C3[MSSQL Client + SSL]
    C1 --> D[unified_db_analyzer.sh]
    C2 --> D
    C3 --> D
    D --> E{Check Dependencies}
    E -->|Missing| F[Auto-Install via Homebrew]
    E -->|Available| G
    F --> G[Start Analysis]
    G --> H1[Table Size & Row Count]
    G --> H2[Column Profiling]
    G --> H3[DDL Export]
    H2 --> I{Deep Analysis?}
    I -->|true| J[Min/Max/Top5/Composition]
    I -->|false| K[Basic Stats Only]
    J --> L[Smart Sample<br/>NOT NULL & NOT EMPTY]
    K --> L
    L --> M[Export to CSV]
    M --> N[Generate HTML Report]
    N --> O[migration_report/YYYYMMDD_HHMM/]
    O --> P[Open in Streamlit Dashboard]
    P --> Q[Schema Mapping & Config Generation]
style D fill:#4CAF50,color:#fff
style L fill:#FF9800,color:#fff
style O fill:#2196F3,color:#fff
style Q fill:#9C27B0,color:#fff
1. Database Analysis
   - Configure `config.json` with source database credentials
   - Run `./unified_db_analyzer.sh`
   - Review generated reports

2. Schema Mapping
   - Launch Streamlit dashboard
   - Navigate to Schema Mapper
   - Load analysis report
   - Map source fields to target schema
   - Select transformers (e.g., date format converters, string normalizers)

3. Configuration Export
   - Generate TypeScript/JSON configuration
   - Integrate with migration pipeline
   - Test with mock data if needed

4. Migration Execution
   - Use generated config with your ETL tool
   - Monitor data quality metrics
   - Validate migrated data
The toolkit uses a singleton connection pool pattern for efficient database operations:
Benefits:
- Reuses connections across multiple requests
- Automatic health checks and reconnection
- Significant performance improvement for repeated operations
- Thread-safe connection management
How it works:
# First call - creates connection
get_tables_from_datasource(...) # Creates new connection
# Second call - reuses connection (no overhead!)
get_columns_from_table(...) # Reuses existing connection
# Connection stays alive for future requests

Connection Management:
- Connections are identified by unique hash (host, port, db, user)
- Dead connections are automatically detected and recreated
- All functions use autocommit mode for stability
- Connections persist across Streamlit reruns
Manual Control:
from services.db_connector import close_connection, close_all_connections
# Close specific connection
close_connection(db_type, host, port, db_name, user)
# Close all connections (useful for cleanup)
close_all_connections()

SQLite Storage: migration_tool.db
Tables:
1. `datasources` – Database connection profiles

   CREATE TABLE datasources (
       id INTEGER PRIMARY KEY AUTOINCREMENT,
       name TEXT UNIQUE,
       db_type TEXT,
       host TEXT,
       port TEXT,
       dbname TEXT,
       username TEXT,
       password TEXT
   )

2. `configs` – Schema mapping configurations

   CREATE TABLE configs (
       id INTEGER PRIMARY KEY AUTOINCREMENT,
       config_name TEXT UNIQUE,
       table_name TEXT,
       json_data TEXT,
       updated_at TIMESTAMP
   )
Automatic Initialization:
- Database created on first application run
- No manual SQL scripts required
- Handles migrations automatically
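The automatic, repeat-safe initialization described above typically rests on `CREATE TABLE IF NOT EXISTS`. A sketch of what such an `init_db()` might look like (the function name matches `repositories/base.py`, but this body is an assumption, shown here against an in-memory SQLite database):

```python
import sqlite3

def init_db(conn):
    # Idempotent startup DDL: safe to run on every application launch,
    # no manual SQL scripts required. Schemas mirror the tables above.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS datasources (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT UNIQUE, db_type TEXT, host TEXT, port TEXT,
            dbname TEXT, username TEXT, password TEXT
        );
        CREATE TABLE IF NOT EXISTS configs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            config_name TEXT UNIQUE, table_name TEXT,
            json_data TEXT, updated_at TIMESTAMP
        );
    """)

conn = sqlite3.connect(":memory:")
init_db(conn)
init_db(conn)  # second startup: no error, no duplicate tables
tables = {r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(sorted(t for t in tables if not t.startswith("sqlite_")))
```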
Mode 1: Run ID (Traditional)
- Uses CSV analysis reports
- Offline operation
- Historical data analysis
- Best for: Initial exploration, documented analysis
Mode 2: Datasource (New)
- Connects directly to live database
- Real-time schema discovery
- Auto-loads tables and columns
- Best for: Active development, latest schema
Switching between modes:
- Open Schema Mapper
- Select source mode (Run ID / Datasource)
- Choose source accordingly
- Schema Mapper adapts automatically
Enable comprehensive data profiling:
{
"sampling": {
"deep_analysis": true
}
}

Metrics Collected:
| Metric | Basic Mode | Deep Mode |
|---|---|---|
| Row Count | ✅ | ✅ |
| Null Count | ✅ | ✅ |
| Distinct Values | ✅ | ✅ |
| Min/Max Values | ❌ | ✅ |
| Top 5 Frequency | ❌ | ✅ |
| Data Composition | ❌ | ✅ (Valid/Null/Empty/Zero) |
| Sample Data | ✅ | ✅ (Smart filtered) |
Performance Considerations:
- Basic Mode: Fast, suitable for large tables (millions of rows)
- Deep Mode: Slower, recommended for detailed migration planning
Automatically filters sample data to show only meaningful values:
Filtering Rules:
- Excludes `NULL` values
- Excludes empty strings (`''`)
- Shows actual representative data
Implementation (MySQL example):
SELECT DISTINCT column_name
FROM table_name
WHERE column_name IS NOT NULL
AND CAST(column_name AS CHAR) <> ''
LIMIT 10;

On macOS with Homebrew, missing database clients are automatically installed:
# Example: Installing MSSQL tools
# Example: Installing MSSQL tools
Error: Command 'sqlcmd' not found
Homebrew detected...
Install 'mssql-tools18' now? (y/N): y
Installing mssql-tools18...
-> Tapping microsoft/mssql-release...
-> Installing packages...
Installation successful!

Generated HTML reports include:
- Overview Tab: Table-level metrics with sortable DataTable
- Column Detail Tab: Comprehensive column-level statistics
- Formulas & Docs Tab: Data quality score explanations
- Process Log Tab: Complete execution logs
Features:
- Responsive design with Bootstrap 5
- Interactive tables with search/filter/sort
- Data quality visualizations
- Exportable to Excel/CSV/PDF
Track every change to your schema mapping configurations with built-in version control.
How It Works:
Every time you save a configuration, the system automatically:
- Creates a new version entry in the `config_histories` table
- Preserves a complete JSON snapshot with timestamp
- Increments the version number (v1, v2, v3...)
- Links to the parent configuration via foreign key
Database Schema:
-- Main configuration table
CREATE TABLE configs (
id TEXT PRIMARY KEY, -- UUID for relationships
config_name TEXT UNIQUE, -- User-facing name
table_name TEXT, -- Source table
json_data TEXT, -- Current config JSON
updated_at TIMESTAMP -- Last modification
);
-- Version history table
CREATE TABLE config_histories (
id TEXT PRIMARY KEY, -- Unique history entry ID
config_id TEXT, -- Links to parent config
version INTEGER, -- Sequential version number
json_data TEXT, -- Config snapshot at this version
created_at TIMESTAMP, -- When this version was created
FOREIGN KEY(config_id) REFERENCES configs(id) ON DELETE CASCADE
);

Key Features:
1. Automatic Versioning
   - No manual intervention needed
   - Every save creates a new version
   - Original version preserved forever

2. Version Comparison

   # Compare two versions to see what changed
   diff = db.compare_config_versions("PatientMigration", version1=1, version2=3)
   # Returns:
   # {
   #   'mappings_added': [...],     # New column mappings
   #   'mappings_removed': [...],   # Deleted mappings
   #   'mappings_modified': [...]   # Changed transformers/targets
   # }

3. Rollback Support
   - View all historical versions in the Settings page
   - Load any previous version
   - Restore deleted configurations from history

4. Audit Trail
   - Complete history of configuration changes
   - Timestamp for every modification
   - Useful for compliance and troubleshooting
Use Cases:
- Migration Testing: Try different mapping strategies, rollback if needed
- Team Collaboration: Track who changed what and when
- Production Safety: Restore last-known-good configuration quickly
- Documentation: Historical record of migration decisions
Production-ready ETL execution engine with enterprise features.
Architecture:
┌─────────────────┐       ┌──────────────────┐       ┌─────────────────┐
│    Source DB    │──────▶│    Migration     │──────▶│    Target DB    │
│  (via Profile)  │       │      Engine      │       │  (via Profile)  │
└─────────────────┘       └────────┬─────────┘       └─────────────────┘
                                   │
                                   ▼
                          ┌──────────────────┐
                          │   Transformers   │
                          │   - Date Conv.   │
                          │   - Trim/Clean   │
                          │   - JSON Parse   │
                          └──────────────────┘
                                   ▲
                          ┌────────┴─────────┐
                          │   Config JSON    │
                          │    (Mappings)    │
                          └──────────────────┘
Key Features:
1. Batch Processing
- Configurable batch size (default: 1000 rows)
- Streaming execution - handles millions of rows
- Memory-efficient: processes one chunk at a time
- Progress tracking with visual progress bar
2. Smart Query Generation
# Only selects mapped columns - reduces network overhead
SELECT "hn", "fname", "lname", "dob" FROM patients
# Instead of SELECT * (which transfers unused data)

3. Data Transformation Pipeline
# Stream batches of 1,000 rows from the source
for df_batch in pd.read_sql(query, source_engine, chunksize=1000):
    # 1. Apply transformers (in-memory)
    df_batch = DataTransformer.apply_transformers_to_batch(df_batch, config)
    # 2. Rename columns to match target schema
    df_batch.rename(columns=rename_map, inplace=True)
    # 3. Bulk insert to target
    df_batch.to_sql(target_table, target_engine, if_exists='append', index=False)

4. Comprehensive Logging
- Real-time log viewer in UI
- Persistent log files: `migration_logs/migration_NAME_TIMESTAMP.log`
- Audit trail: timestamps, row counts, errors
- Downloadable after completion
5. Test Mode
- Process only 1 batch (configurable limit)
- Validate mappings without full migration
- Dry-run capability for safety
6. Error Handling
- Transaction-safe batch commits
- Stops on first error (prevents data corruption)
- Detailed error messages with context
- Rollback support (database-dependent)
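The stop-on-first-error behaviour above amounts to a guarded batch loop: commit each batch, log progress, and halt the moment anything fails. A minimal sketch of the assumed control flow (function names are illustrative):

```python
def migrate(batches, insert_batch, log):
    """Run batches through insert_batch, stopping on the first failure."""
    total = 0
    for i, batch in enumerate(batches, start=1):
        try:
            insert_batch(batch)  # transaction-safe commit per batch
        except Exception as exc:
            # Detailed context, then halt — no further batches are attempted,
            # so corruption cannot spread past the failing batch.
            log(f"Batch {i} failed after {total} committed rows: {exc}")
            raise
        total += len(batch)
        log(f"Batch {i} committed ({total} rows total)")
    return total

logs = []
n = migrate([[1, 2], [3]], lambda b: None, logs.append)
print(n)  # 3
```

In test mode the same loop simply receives a single batch, which is why a dry run validates mappings without touching the bulk of the data.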
Execution Workflow:
Step 1: Select Configuration
  ├── Load from Project Database (saved configs)
  └── Upload JSON File (external configs)

Step 2: Test Connections
  ├── Select Source Datasource
  ├── Select Target Datasource
  ├── Verify connectivity
  └── Health checks

Step 3: Review & Settings
  ├── View configuration JSON
  ├── Set batch size
  ├── Enable/disable test mode
  └── Confirm execution

Step 4: Execute Migration
  ├── Connect to databases (SQLAlchemy engines)
  ├── Generate optimized SELECT query
  ├── Stream data in batches
  ├── Apply transformations
  ├── Bulk insert to target
  └── Log everything

Performance Characteristics:
| Dataset Size | Batch Size | Approx. Time | Memory Usage |
|---|---|---|---|
| 10K rows | 1000 | ~10 seconds | < 50 MB |
| 100K rows | 1000 | ~1-2 minutes | < 200 MB |
| 1M rows | 1000 | ~10-15 min | < 500 MB |
| 10M+ rows | 5000 | ~1-2 hours | < 1 GB |
Use Cases:
- One-time Migrations: Legacy system to new platform
- Continuous Sync: Nightly data transfers
- Data Warehouse ETL: OLTP β OLAP transformations
- Multi-tenant Migrations: Hospital A β Hospital B
- Testing: Validate transformations with test mode
Intelligent column matching using machine learning and healthcare domain knowledge.
Technology Stack:
- Model: `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
- Framework: Sentence Transformers (Hugging Face)
- Similarity: Cosine similarity on semantic embeddings
- Domain: Healthcare Information Systems (HIS)
How It Works:
1. Dual-Strategy Matching
# Strategy 1: Rule-Based Dictionary (Priority)
his_dictionary = {
"hn": ["hn", "hospital_number", "mrn", "patient_code"],
"cid": ["cid", "national_id", "card_id", "citizen_id"],
"vn": ["vn", "visit_number", "visit_no"],
# ... 30+ healthcare terms
}
# Strategy 2: Semantic AI Matching (Fallback)
# Encodes column names into 384-dimensional vectors
# Compares similarity using cosine distance
source_embedding = model.encode("patient_firstname")
target_embeddings = model.encode(["fname", "first_name", "given_name"])
best_match = argmax(cosine_similarity(source_embedding, target_embeddings))

2. Confidence Scoring
- Exact Match: 1.0 (100% confidence)
- Dictionary Match: 0.9 (90% confidence)
- Semantic Match: 0.4-0.9 (threshold-based)
- No Match: 0.0 (suggests manual review)
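The tiers above can be expressed as one scoring function. This is a sketch of the assumed logic — the dictionary entry is taken from the example earlier, while the function name and the way a precomputed semantic score is passed in are illustrative:

```python
# Subset of the HIS dictionary shown earlier; real version has 30+ terms.
HIS_DICTIONARY = {
    "hn": ["hn", "hospital_number", "mrn", "patient_code"],
    "cid": ["cid", "national_id", "card_id", "citizen_id"],
}

def score_match(source_col, target_col, semantic_score=0.0):
    """Tiered confidence: exact > dictionary > semantic > no match."""
    s, t = source_col.lower(), target_col.lower()
    if s == t:
        return 1.0  # exact match: 100% confidence
    for synonyms in HIS_DICTIONARY.values():
        if s in synonyms and t in synonyms:
            return 0.9  # both names are known synonyms of the same concept
    if semantic_score >= 0.4:
        return semantic_score  # semantic fallback, threshold-based (0.4-0.9)
    return 0.0  # no match: flag for manual review

print(score_match("hn", "hospital_number"))  # 0.9 (dictionary match)
```

In the real mapper the `semantic_score` argument would come from cosine similarity of the sentence-transformer embeddings, as shown in the strategy-2 snippet above.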
3. Sample Data Analysis
Analyzes actual column values to suggest transformers:
# Example: Detects Thai Buddhist year dates
sample_values = ["2566-01-15", "2567-03-20", "2565-12-01"]
analysis = ml_mapper.analyze_column_with_sample(
source_col="admit_date",
target_col="admission_date",
sample_values=sample_values
)
# Returns:
{
"confidence_score": 0.9,
"transformers": ["BUDDHIST_TO_ISO"], # Auto-suggested
"reason": "Detected Thai Buddhist year (25xx) in 3/3 samples"
}

4. Pattern Detection
| Pattern | Detection Logic | Suggested Transformer |
|---|---|---|
| Thai Buddhist Year | `25[5-9]\d` in >50% of samples | `BUDDHIST_TO_ISO` |
| Whitespace Issues | Leading/trailing spaces | `TRIM` |
| JSON Structures | `{...}` or `[...]` | `PARSE_JSON` |
| Float IDs | `123.0` pattern | `FLOAT_TO_INT` |
| Leading Zeros | ID with `0` prefix | Keep as string |
| All NULL/Empty | No valid data | Mark as `IGNORE` |
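A couple of rows from the table above, sketched as a detector function. The regex and the >50% threshold come from the table; the function name and the priority order between rules are assumptions:

```python
import re

# Thai Buddhist calendar years currently fall in the 25xx range.
BUDDHIST_YEAR = re.compile(r"25[5-9]\d")

def suggest_transformer(samples):
    """Suggest a transformer from sample values (subset of the rules above)."""
    non_empty = [s for s in samples if s]
    if not non_empty:
        return "IGNORE"  # all NULL/empty: no valid data
    hits = sum(1 for s in non_empty if BUDDHIST_YEAR.search(s))
    if hits / len(non_empty) > 0.5:
        return "BUDDHIST_TO_ISO"  # Buddhist year in >50% of samples
    if any(s != s.strip() for s in non_empty):
        return "TRIM"  # leading/trailing whitespace detected
    return None  # no rule fired: leave the column untransformed

print(suggest_transformer(["2566-01-15", "2567-03-20", "2565-12-01"]))
# BUDDHIST_TO_ISO
```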
5. Healthcare-Specific Validation
# Hospital Number (HN) validation
if "hn" in source_column:
hn_pattern = r'^\d{6,10}$' # 6-10 digits
valid_count = count_matches(samples, hn_pattern)
confidence = valid_count / total_samples
# National ID (CID) validation
if "cid" in source_column:
cid_pattern = r'^\d{13}$' # Exactly 13 digits
validate_thai_national_id_checksum(samples)

User Interface:
In Schema Mapper page:
- Click the "AI Auto-Map" button
- AI analyzes source columns vs target schema
- Displays suggestions with confidence scores
- User reviews and confirms/modifies mappings
- AI also suggests transformers based on sample data
Benefits:
- Time Savings: Auto-map 100 columns in seconds vs hours
- Accuracy: Semantic understanding, not just string matching
- Learning: Improves with healthcare-specific dictionary
- Transparency: Shows confidence scores for manual review
- Flexibility: Suggestions, not forced decisions
Limitations:
- Requires internet for first model download (~100 MB)
- Best for English/Thai column names (multilingual model)
- Suggestions need human validation
- Not trained on your specific schema (generic model)
Example Session:
Source Columns      Target Columns       AI Suggestion    Confidence
--------------      --------------       -------------    ----------
hn              →   hospital_number      ✅ Matched       100%
patient_name    →   full_name            ⚠️ Maybe         65%
admit_dt        →   admission_date       ✅ Matched       85%
                    • Transformer: BUDDHIST_TO_ISO
blood_press     →   bp_systolic          ⚠️ Uncertain     45%
old_id          →   [No Match]           ❌ Manual        0%
Cause: Outdated Streamlit version (< 1.30.0)
Solution:
# Option 1: Use virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate
pip install --upgrade streamlit
# Option 2: Force reinstall
pip uninstall streamlit -y
pip install --upgrade --force-reinstall streamlit

Cause: Self-signed or untrusted SSL certificate
Solution: The toolkit automatically adds -C flag to trust server certificates:
sqlcmd -S host,port -C -U user -P password ...Cause: T-SQL variable scope in dynamic SQL
Solution: Already handled in v7.1+ with proper variable injection
Cause: Incorrect schema name (e.g., using the default public for MSSQL)

Solution: Specify the correct schema in config.json:

```json
{
  "database": {
    "type": "mssql",
    "schema": "dbo"
  }
}
```

- Check the process.log in the report folder
- Review error messages in the terminal output
- Verify database connectivity with the native clients:

```bash
mysql -h host -u user -p
psql -h host -U user -d database
sqlcmd -S host,port -U user -P password
```
- Open an issue on GitHub with:
- Error message
- Database type and version
- Operating system
- Relevant log excerpts
```bash
cd mini_his
python gen_mini_his.py
```

```bash
cd analysis_report
# Edit config.json to point to the test database
./unified_db_analyzer.sh

# Verify output
ls -lh migration_report/*/data_profile/data_profile.csv
```

Approximate analysis times (single table):
| Rows | Columns | Basic Mode | Deep Mode |
|---|---|---|---|
| 10K | 20 | ~2s | ~5s |
| 100K | 50 | ~10s | ~30s |
| 1M | 100 | ~30s | ~2min |
| 10M+ | 200+ | ~2min | ~10min+ |
Optimization Tips:
- Use the `tables` filter to analyze specific tables only
- Disable `deep_analysis` for initial exploration
- Adjust `default_limit` for faster sampling
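Those knobs live in config.json alongside the database settings. A sketch combining them — the key names are taken from this README, but the exact nesting may differ in your version, so treat this as illustrative:

```json
{
  "database": {
    "type": "mysql",
    "schema": "his"
  },
  "tables": ["patient", "admission"],
  "deep_analysis": false,
  "default_limit": 1000
}
```

With `deep_analysis` off and a modest `default_limit`, a first pass over a 1M-row table should land near the Basic Mode column of the table above.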
```bash
# Run ALL unit tests (pytest auto-discovers everything)
make test

# Run a specific test suite
make test-simple   # AI pattern detection
make test-column   # Column analysis
make test-suite    # tests/ directory only
```

The project includes:
- test_analysis_simple.py → AI pattern detection tests
- test_column_analysis.py → Column analysis tests
- tests/ → pytest test suite with a `tmp_dir` fixture for filesystem isolation
```bash
# Activate venv first (or use: source venv/bin/activate)
make install

# Run standard tests
python test_analysis_simple.py
python test_column_analysis.py

# Run pytest suite
python -m pytest tests/ -v

# Run pytest with coverage
python -m pytest tests/ --cov=services --cov=models --cov=utils --cov-report=term-missing -v

# Run individual suites
python3.11 -m pytest tests/test_query_builder.py -v
python3.11 -m pytest tests/test_checkpoint_manager.py -v
python3.11 -m pytest tests/test_encoding_helper.py -v
python3.11 -m pytest tests/test_migration_logger.py -v
python3.11 -m pytest tests/test_models.py -v
python3.11 -m pytest tests/test_helpers.py -v
```

| Test File | Module Under Test | Tests |
|---|---|---|
| `test_checkpoint_manager.py` | `services/checkpoint_manager.py` | 4 |
| `test_encoding_helper.py` | `services/encoding_helper.py` | 8 |
| `test_helpers.py` | `utils/helpers.py` | 14 |
| `test_migration_logger.py` | `services/migration_logger.py` | 5 |
| `test_models.py` | `models/` (MigrationConfig, Datasource) | 5 |
| `test_query_builder.py` | `services/query_builder.py` | 10 |
Use the `tmp_dir` fixture from conftest.py for any test that writes to disk:

```python
def test_something(tmp_dir):
    # tmp_dir is a pathlib.Path pointing to a fresh temp directory,
    # automatically cleaned up after each test
    path = tmp_dir / "output.json"
    ...
```

Contributions are welcome! Please follow these guidelines:
- Use GitHub Issues
- Include error messages and logs
- Provide reproduction steps
- Specify environment details
- Open a GitHub Discussion
- Describe use case and benefits
- Provide examples if possible
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Code Standards:
- Bash scripts: Follow ShellCheck recommendations
- Python: PEP 8 style guide
- Add comments for complex logic
- Update documentation for new features
This project is licensed under the MIT License - see the LICENSE file for details.
- Built for healthcare professionals managing HIS migrations
- Inspired by enterprise database migration challenges
- Community feedback and contributions welcome
- Documentation: This README and inline code comments
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Datasource management with PostgreSQL storage (v8.0)
- Connection pooling and reuse (v8.0)
- Dual-mode schema mapper (Run ID / Datasource) (v8.0)
- Live schema discovery (v8.0)
- Configuration repository (v8.0)
- Smart column suggestions (v8.0)
- Configuration history and version control (v8.0)
- Version comparison and rollback (v8.0)
- Production-ready migration engine (v8.0)
- Batch processing with streaming (v8.0)
- AI-powered column mapping (v8.0)
- Healthcare-specific ML dictionary (v8.0)
- Automatic transformer suggestions (v8.0)
- FastAPI REST API with full CRUD (v9.0)
- Socket.IO real-time job events (v9.0)
- Background pipeline execution (v9.0)
- Pipeline nodes & edges with topological sort (v9.0)
- Per-step datasource resolution (v9.0)
- generate_sql priority over dynamic SELECT (v9.0)
- Docker containerization
- CI/CD pipeline integration
- Data validation dashboard with anomaly detection
- Scheduled migration jobs (cron-like)
- Multi-datasource data lineage tracking
- Custom AI model training for organization-specific schemas
- Incremental/delta migration support
- Data quality scoring engine
Made with ❤️ for the HIS migration community