A comprehensive, enterprise-grade toolkit for analyzing, profiling, and migrating Hospital Information System (HIS) databases. Features a dual-interface architecture with both a Streamlit dashboard and a FastAPI REST API with real-time Socket.IO events for background pipeline execution.
- Multi-Database Support: Analyze MySQL, PostgreSQL, and MSSQL databases
- Deep Data Profiling: Column-level statistics, data quality metrics, and composition analysis
- Schema Analysis: Automatic DDL extraction with schema namespace support
- Smart Sampling: Intelligent data sampling with NULL and empty string filtering
- Auto-Dependency Management: Automatic installation of required database clients
- HTML Reports: Beautiful, interactive reports with DataTables integration
- Configuration Generator: Export migration configs in TypeScript/JSON format
- REST API: Full CRUD API via FastAPI with JSON:API responses
- Socket.IO: Real-time job events (batch progress, errors, completion)
- Background Jobs: POST /api/v1/jobs triggers the pipeline in a background thread
- Per-Step Datasource Resolution: Each pipeline step resolves its own source/target datasource from a UUID FK
- Pipeline Nodes & Edges: Visual workflow graph with dependency-based topological sort
- generate_sql Priority: Custom SQL queries take priority over auto-generated SELECT
- Datasource Management: Centralized database connection profiles with PostgreSQL storage
- Connection Pooling: Singleton pattern for efficient connection reuse across requests
- Enhanced Schema Mapper: Dual-mode source selection (Run ID or Live Datasource)
- Live Schema Discovery: Dynamic table and column loading from connected databases
- Smart Column Suggestions: Auto-suggest target columns from the actual database schema
- Configuration Repository: Save and load mapping configurations from the project database
- Configuration History: Version tracking with comparison and rollback capabilities
- Migration Engine: Production-ready ETL execution with batch processing and logging
- AI-Powered Mapping: Semantic column matching using ML transformers and healthcare dictionaries
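As a taste of the background-job features above: Socket.IO events are emitted from a worker thread into the async server's event loop. The sketch below shows the general `asyncio.run_coroutine_threadsafe` bridging idea; the `emit_from_thread` name comes from `api/socket_manager.py`, but its signature and the in-memory `emit` stand-in here are assumptions, not the actual implementation.

```python
import asyncio
import threading

received = []  # collected events, standing in for connected Socket.IO clients

async def emit(event, data):
    # Stand-in for sio.emit(event, data) on the async Socket.IO server.
    received.append((event, data))

def emit_from_thread(loop, event, data):
    # Schedule the async emit on the server's running event loop from a
    # plain worker thread, then block until it has been executed.
    future = asyncio.run_coroutine_threadsafe(emit(event, data), loop)
    return future.result(timeout=5)

def pipeline_worker(loop):
    # Background pipeline thread reporting batch progress.
    for batch in range(3):
        emit_from_thread(loop, "job:batch", {"batch": batch, "rows": 1000})
    emit_from_thread(loop, "job:completed", {"total_rows": 3000})

async def main():
    loop = asyncio.get_running_loop()
    worker = threading.Thread(target=pipeline_worker, args=(loop,))
    worker.start()
    # Keep the loop alive while the worker emits.
    while worker.is_alive():
        await asyncio.sleep(0.01)
    worker.join()

asyncio.run(main())
print(received[-1][0])  # job:completed
```

The same pattern lets a synchronous background job report `job:batch`, `job:error`, and `job:completed` without ever touching the event loop directly.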
- Architecture
- Requirements
- Installation
- Quick Start
- Configuration
- Usage
- Workflow
- Advanced Features
- Troubleshooting
- Testing
- Contributing
- License
The codebase follows Clean Architecture with a dual-interface pattern: Streamlit (MVC dashboard) and FastAPI (REST API).
his-analyzer/
├── app.py                        # Streamlit routing + sidebar navigation
├── config.py                     # Constants: TRANSFORMER_OPTIONS, VALIDATOR_OPTIONS, DB_TYPES
├── database.py                   # Legacy facade (deprecated, being removed)
│
├── api/                          # FastAPI REST API + Socket.IO
│   ├── main.py                   # App setup, CORS, router registration, /ws mount
│   ├── socket_manager.py         # Async Socket.IO server + emit_from_thread()
│   ├── base/                     # Shared API infrastructure
│   │   ├── controller.py         # BaseController (generic CRUD)
│   │   ├── service.py            # BaseService with pagination/sanitize
│   │   ├── exceptions.py         # JSON API error handlers
│   │   └── json_api.py           # JSON:API response builder
│   ├── datasources/              # /api/v1/datasources
│   ├── configs/                  # /api/v1/configs (+ /histories, /versions)
│   ├── pipelines/                # /api/v1/pipelines (with nodes/edges)
│   ├── pipeline_runs/            # /api/v1/pipeline-runs
│   └── jobs/                     # /api/v1/jobs (POST → trigger background pipeline)
│
├── models/                       # Data classes (pure Python, no I/O)
│   ├── datasource.py             # Datasource dataclass
│   ├── migration_config.py       # ConfigRecord, MigrationConfig, MappingItem
│   ├── job.py                    # JobRecord, JobUpdateRecord
│   └── pipeline_config.py        # PipelineConfig, PipelineStep, PipelineNodeRecord,
│                                 #   PipelineEdgeRecord, PipelineRunRecord
│
├── protocols/                    # Protocol interfaces (DIP)
│   └── repository.py             # Repository protocol interfaces
│
├── repositories/                 # Data access layer (PostgreSQL)
│   ├── connection.py             # SQLAlchemy engine singleton
│   ├── base.py                   # DDL + init_db()
│   ├── datasource_repo.py        # Datasource CRUD
│   ├── config_repo.py            # Config CRUD + versioning
│   ├── pipeline_repo.py          # Pipeline CRUD + get_by_id (JOIN nodes/edges/configs)
│   ├── pipeline_node_repo.py     # Pipeline node CRUD
│   ├── pipeline_edge_repo.py     # Pipeline edge CRUD
│   ├── pipeline_run_repo.py      # Pipeline run CRUD
│   └── job_repo.py               # Job CRUD
│
├── services/                     # Business logic (no Streamlit imports)
│   ├── datasource_repository.py  # DatasourceRepository facade
│   ├── db_connector.py           # SQLAlchemy engine factory (MySQL, PG, MSSQL)
│   ├── ml_mapper.py              # SmartMapper: HIS dictionary + semantic matching
│   ├── transformers.py           # DataTransformer: vectorised Pandas transformations
│   ├── checkpoint_manager.py     # Checkpoint save/load/clear for resumable migrations
│   ├── migration_logger.py       # Per-run ETL log files
│   ├── encoding_helper.py        # clean_dataframe for Thai legacy data
│   ├── migration_executor.py     # Single-table ETL engine (generate_sql priority)
│   ├── pipeline_service.py       # Pipeline orchestration + per-step datasource resolution
│   └── query_builder.py          # SELECT builder, batch transform, bulk insert
│
├── dialects/                     # Database dialects (OCP)
│   ├── mysql.py, postgresql.py, mssql.py
│   └── registry.py               # Dialect registry
│
├── data_transformers/            # Data transformations (OCP)
│   ├── text.py, dates.py, healthcare.py, names.py, data_type.py, lookup.py
│   └── registry.py               # @register_transformer decorator
│
├── validators/                   # Data validators (OCP)
│   ├── not_null.py, unique.py, range_check.py
│   └── registry.py               # @register_validator decorator
│
├── controllers/                  # MVC Controllers (6/6)
│   ├── settings_controller.py
│   ├── pipeline_controller.py
│   ├── file_explorer_controller.py
│   ├── er_diagram_controller.py
│   ├── schema_mapper_controller.py
│   └── migration_engine_controller.py
│
├── views/                        # MVC Views (pure rendering)
│   └── components/               # Reusable UI components
│       ├── schema_mapper/        # Source selector, mapping editor, config actions
│       ├── migration/            # Step config, connections, execution
│       └── shared/               # Dialogs, styles
│
├── scripts/                      # Utility scripts
│   ├── migrate_sqlite_to_pg.py   # One-time SQLite → PostgreSQL migration
│   └── migrate_add_jobs_table.py # Create jobs table + job_id FK
│
├── tests/                        # pytest test suite
├── analysis_report/              # Database Analysis Engine (Shell)
└── mini_his/                     # Mock HIS data
| Layer | Rule |
|---|---|
| `models/` | Pure dataclasses – no I/O, no Streamlit |
| `services/` | Business logic – no `st.*` calls |
| `views/` | Thin orchestrators – call components, manage step flow |
| `views/components/` | Reusable UI widgets – read/write `session_state`, render widgets |
| `utils/` | Stateless pure helpers + `PageState` session abstraction |
┌───────────────────────────────────────────────────────────────────┐
│                        ANALYSIS PHASE (Bash)                      │
├───────────────────────────────────────────────────────────────────┤
│  Source DB → unified_db_analyzer.sh → CSV/HTML Reports            │
└───────────────────────────────────────────────────────────────────┘
                                 ↓
┌───────────────────────────────────────────────────────────────────┐
│                     MAPPING PHASE (Python + AI)                   │
├───────────────────────────────────────────────────────────────────┤
│  Schema Mapper (Streamlit) + REST API                             │
│  ├── AI Auto-Map: ML model suggests column mappings               │
│  ├── Manual Review: User confirms/modifies                        │
│  ├── Transformer Selection: Date conv, trim, etc.                 │
│  ├── generate_sql: Custom SELECT (optional, priority)             │
│  └── Save: Config → PostgreSQL (with versioning)                  │
└───────────────────────────────────────────────────────────────────┘
                                 ↓
┌───────────────────────────────────────────────────────────────────┐
│                 PIPELINE PHASE (FastAPI + Background)             │
├───────────────────────────────────────────────────────────────────┤
│  POST /api/v1/jobs { pipeline_id } → 202 Accepted                 │
│  ├── PipelineExecutor resolves nodes + edges (topological sort)   │
│  ├── Per-step datasource resolution (UUID → engine per step)      │
│  ├── Each step: extract → transform → load                        │
│  │   ├── generate_sql (priority) or build_select_query (fallback) │
│  │   └── Transformers applied after pd.read_sql()                 │
│  ├── Socket.IO events: job:batch, job:error, job:completed        │
│  └── Frontend receives real-time progress                         │
└───────────────────────────────────────────────────────────────────┘
                                 ↓
┌───────────────────────────────────────────────────────────────────┐
│                        STORAGE (PostgreSQL)                       │
├───────────────────────────────────────────────────────────────────┤
│  datasources → configs (FK) → pipeline_nodes → pipeline_edges     │
│  jobs → pipeline_runs → pipeline_steps (JSON)                     │
│  config_histories (versioning)                                    │
└───────────────────────────────────────────────────────────────────┘
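The pipeline phase orders steps by their dependency edges before executing. A minimal sketch of that dependency-based topological sort using the standard library (node names and the `(upstream, downstream)` edge shape here are illustrative, not the actual record schema):

```python
from graphlib import TopologicalSorter

# Illustrative pipeline graph: each edge means "downstream depends on upstream".
nodes = ["patients", "visits", "diagnoses"]
edges = [("patients", "visits"), ("visits", "diagnoses")]

# TopologicalSorter expects {node: set_of_predecessors}.
graph = {node: set() for node in nodes}
for upstream, downstream in edges:
    graph[downstream].add(upstream)

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['patients', 'visits', 'diagnoses']
```

`TopologicalSorter` also raises `CycleError` on circular dependencies, which is a useful validation step before any ETL work starts.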
- Operating System: Linux, macOS, Windows (via WSL2)
- Shell: Bash 4.0+ (auto-switch on macOS)
- Python: 3.8 or higher
- RAM: 4GB minimum, 8GB+ recommended for large databases
The toolkit requires database-specific clients:
- MySQL: `mysql-client`
- PostgreSQL: `libpq` (PostgreSQL client)
- MSSQL: `mssql-tools18` (with ODBC driver)
Note: On macOS with Homebrew, these dependencies are auto-installed when missing.
- `streamlit >= 1.30.0` – Web dashboard framework
- `pandas >= 2.0.0` – Data manipulation
- `jq` – JSON processor (system package)
The repository includes a Makefile to simplify setup and development.
# 1. Clone the repository
git clone https://github.com/yourusername/his-analyzer.git
cd his-analyzer
# 2. One-command setup (creates venv + installs dependencies)
make setup
# 3. Start developing with hot-reload
make run
# 4. View all available commands
make help

Available Makefile Commands:
make setup # Create venv + install dependencies (first time)
make install # Install dependencies only (venv must exist)
make run # Start app with hot-reload (default)
make run-reload # Start app with hot-reload (explicit)
make run-no-reload # Start app without hot-reload
make test # Run all unit tests (pytest discovers everything)
make test-simple # Run AI pattern detection tests only
make test-column # Run column analysis tests only
make test-suite # Run tests/ directory only
make clean # Remove venv and __pycache__
make help           # Show all commands

Using a virtual environment prevents version conflicts with system Python packages.
# 1. Clone the repository
git clone https://github.com/yourusername/his-analyzer.git
cd his-analyzer
# 2. Create virtual environment
python3.11 -m venv venv
# 3. Activate virtual environment
source venv/bin/activate # macOS/Linux
# venv\Scripts\activate # Windows
# 4. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# 5. Start the app with hot-reload
python3.11 -m streamlit run app.py --server.runOnSave true

# Install Python dependencies
pip3 install -r requirements.txt
# Install system dependencies (macOS with Homebrew)
brew install jq
# Install database clients as needed
brew install mysql-client
brew install libpq
brew tap microsoft/mssql-release && brew install mssql-tools18
# Start the app
streamlit run app.py

On first run, the application automatically creates the migration_tool.db SQLite database:
# Using Makefile (recommended)
make run
# OR manually
python3.11 -m streamlit run app.py --server.runOnSave true

The database is created automatically with the following tables:
- `datasources` – Stores database connection profiles
- `configs` – Stores schema mapping configurations
No manual setup required! The database initialization happens on application startup.
Navigate to Settings page in the Streamlit interface to manage datasources:
- Click "Settings" in the sidebar
- Select "Datasources" tab
- Click "Add New Datasource"
- Fill in connection details:
- Name (unique identifier)
- Database Type (MySQL, PostgreSQL, MSSQL)
- Host, Port, Database Name
- Username, Password
- Test connection
- Save datasource
Datasources are stored in SQLite and reused across:
- Schema Mapper (source & target selection)
- Migration Engine (connection profiles)
- All database operations (via connection pool)
Edit analysis_report/config.json:
{
"database": {
"type": "mysql",
"host": "localhost",
"port": "3306",
"name": "hospital_db",
"user": "root",
"password": "your_password",
"schema": "",
"tables": []
},
"sampling": {
"default_limit": 10,
"max_text_length": 300,
"deep_analysis": true,
"exceptions": []
}
}

cd analysis_report
./unified_db_analyzer.sh

Output: Creates a timestamped report in migration_report/YYYYMMDD_HHMM/
Note: In v8.0, you can also connect directly to datasources in Schema Mapper, bypassing the need for analysis reports.
| Field | Description | Example |
|---|---|---|
| `type` | Database type | `mysql`, `postgresql`, `mssql` |
| `host` | Database host | `localhost`, `192.168.1.100` |
| `port` | Database port | `3306`, `5432`, `1433` |
| `name` | Database name | `hospital_db` |
| `user` | Username | `admin` |
| `password` | Password | `secure_password` |
| `schema` | Schema name (optional) | `public`, `dbo` |
| `tables` | Specific tables (optional) | `["patients", "visits"]` |
Specify database schema for PostgreSQL and MSSQL:
{
"database": {
"type": "postgresql",
"schema": "public",
...
}
}

Defaults:
- PostgreSQL: `public`
- MSSQL: `dbo`
- MySQL: Not applicable
| Parameter | Description | Default |
|---|---|---|
| `default_limit` | Number of sample rows | `10` |
| `max_text_length` | Max characters for text fields | `300` |
| `deep_analysis` | Enable detailed statistics | `true` |
| `exceptions` | Per-column overrides | `[]` |
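The `default_limit` and per-column `exceptions` settings combine into an effective sample limit per column. A minimal sketch of the assumed resolution logic (the function name is illustrative; the actual script implements this in Bash/jq):

```python
# Sample configuration matching the parameters documented above.
config = {
    "sampling": {
        "default_limit": 10,
        "exceptions": [
            {"table": "patients", "column": "notes", "limit": 3},
        ],
    },
}

def sample_limit(table, column, sampling):
    """Return the per-column exception limit if one exists, else the default."""
    for rule in sampling.get("exceptions", []):
        if rule["table"] == table and rule["column"] == column:
            return rule["limit"]
    return sampling.get("default_limit", 10)

print(sample_limit("patients", "notes", config["sampling"]))  # 3 (exception)
print(sample_limit("patients", "hn", config["sampling"]))     # 10 (default)
```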
Override sampling limits for specific columns:
{
"sampling": {
"exceptions": [
{ "table": "patients", "column": "notes", "limit": 3 },
{ "table": "visits", "column": "diagnosis", "limit": 5 }
]
}
}

This toolkit uses a pure Bash shell script (unified_db_analyzer.sh) for database profiling instead of Python ETL frameworks or commercial tools. Here's why:
| Aspect | Shell Script Approach | Python/Tools Alternative |
|---|---|---|
| Dependencies | Minimal: bash, jq, native DB clients | Heavy: pandas, SQLAlchemy, various libraries |
| Portability | Runs anywhere with Bash 4.0+ | Requires Python environment setup |
| Performance | Direct database access, minimal overhead | Abstraction layers slow down queries |
| Security | No code execution risks, simple audit | Complex dependency chains, supply-chain risks |
| Maintenance | Single 600-line script, easy to debug | Multiple packages, version conflicts |
| Installation | Auto-installs missing DB clients via Homebrew | Manual pip installs, virtual environments |
1. Zero-Setup Profiling
# No Python, no pip install, no virtual env - just run
cd analysis_report
./unified_db_analyzer.sh

2. Multi-Database Native Support
- Directly uses `mysql`, `psql`, and `sqlcmd` for optimal performance
- Schema-aware profiling (PostgreSQL `public`, MSSQL `dbo`)
- Handles database-specific quirks (MSSQL SSL certs, NULL warnings)
3. Production-Ready Features
- Smart Sampling: Filters NULL/empty values automatically
- Deep Analysis Mode: Min/Max, Top-5 frequencies, data composition
- Exception Rules: Per-column sampling limits
- Table Size Calculation: Actual disk usage in MB
- DDL Export: Complete schema with indexes and constraints
4. Migration-Friendly Output
- CSV Format: Universal, works with any ETL tool
- HTML Reports: Interactive DataTables for business users
- Timestamped Runs: Tracks profiling history (`YYYYMMDD_HHMM/`)
- Process Logs: Complete audit trail for compliance
5. Real-World Migration Use Cases
# Before migration: Profile source database
./unified_db_analyzer.sh # Analyzes source system
# Review data quality, identify issues
open migration_report/20251130_1523/data_profile/data_profile.html
# Load into Streamlit for schema mapping
# Use profiling data to design transformations
# Execute migration with confidence
# Knowing exact data types, null counts, value ranges

6. Shell Script Advantages for Migration
- Repeatable: Run daily to track data changes over time
- Scriptable: Integrate into CI/CD pipelines
- Offline: Profile production DB, analyze on laptop (CSV export)
- Auditable: Single script = complete transparency
- Fast: No Python overhead, direct SQL execution
cd analysis_report
./unified_db_analyzer.sh

Features:
- Auto-detects database type from `config.json`
- Checks and installs missing dependencies (macOS with Homebrew)
- Exports DDL schema to `schema.sql`
- Generates CSV data profile with smart NULL/empty filtering
- Creates interactive HTML report with DataTables
- Logs all operations to `process.log`
Output Structure:
migration_report/20251124_0023/
├── ddl_schema/
│   └── schema.sql            # Complete DDL export
├── data_profile/
│   ├── data_profile.csv      # Raw profiling data
│   └── data_profile.html     # Interactive report
└── process.log               # Execution log
The dashboard provides several interfaces:
Dual Source Mode:
- Run ID Mode: Load from CSV analysis reports (legacy)
- Datasource Mode: Connect directly to live database (new!)
Features:
- View table and column statistics
- Map source to target fields with live schema discovery
- Smart target column suggestions from actual database
- Select data transformers and validators
- Save/load configurations from project database
- Generate TypeScript/JSON configurations
- Export configurations as downloadable files
Workflow:
- Source Configuration: Choose Run ID or Datasource
- Run ID: Select from analysis report folders
- Datasource: Select datasource → Choose table (auto-loads schema)
- Target Configuration: Select target datasource and table
- Field Mapping: Map source columns to target with suggestions
- Save Configuration: Store in SQLite for reuse
- Export: Download as JSON for migration tools
Datasources Tab:
- Add/Edit/Delete datasource profiles
- Test database connections
- View all configured datasources
- Secure credential storage in SQLite
Saved Configs Tab:
- View all saved schema mapping configurations
- Load configurations for editing
- Delete unused configurations
- Export configurations
- Select source and target from datasource profiles
- Load saved configurations from project database
- Upload configuration files
- Execute data migration (simulation mode)
- Browse database schema
- Click tables to view CREATE statements
- Navigate foreign key relationships
- Generate test data for migration testing
- Configurable data volumes
- Realistic HIS data patterns
flowchart TD
    A[Configure config.json] --> B{Select Database Type}
    B -->|MySQL| C1[MySQL Client]
    B -->|PostgreSQL| C2[PostgreSQL Client]
    B -->|MSSQL| C3[MSSQL Client + SSL]
    C1 --> D[unified_db_analyzer.sh]
    C2 --> D
    C3 --> D
    D --> E{Check Dependencies}
    E -->|Missing| F[Auto-Install via Homebrew]
    E -->|Available| G
    F --> G[Start Analysis]
    G --> H1[Table Size & Row Count]
    G --> H2[Column Profiling]
    G --> H3[DDL Export]
    H2 --> I{Deep Analysis?}
    I -->|true| J[Min/Max/Top5/Composition]
    I -->|false| K[Basic Stats Only]
    J --> L[Smart Sample<br/>NOT NULL & NOT EMPTY]
    K --> L
    L --> M[Export to CSV]
    M --> N[Generate HTML Report]
    N --> O[migration_report/YYYYMMDD_HHMM/]
    O --> P[Open in Streamlit Dashboard]
    P --> Q[Schema Mapping & Config Generation]
style D fill:#4CAF50,color:#fff
style L fill:#FF9800,color:#fff
style O fill:#2196F3,color:#fff
style Q fill:#9C27B0,color:#fff
1. Database Analysis
   - Configure `config.json` with source database credentials
   - Run `./unified_db_analyzer.sh`
   - Review generated reports

2. Schema Mapping
   - Launch Streamlit dashboard
   - Navigate to Schema Mapper
   - Load analysis report
   - Map source fields to target schema
   - Select transformers (e.g., date format converters, string normalizers)

3. Configuration Export
   - Generate TypeScript/JSON configuration
   - Integrate with migration pipeline
   - Test with mock data if needed

4. Migration Execution
   - Use generated config with your ETL tool
   - Monitor data quality metrics
   - Validate migrated data
The toolkit uses a singleton connection pool pattern for efficient database operations:
Benefits:
- Reuses connections across multiple requests
- Automatic health checks and reconnection
- Significant performance improvement for repeated operations
- Thread-safe connection management
How it works:
# First call - creates connection
get_tables_from_datasource(...) # Creates new connection
# Second call - reuses connection (no overhead!)
get_columns_from_table(...) # Reuses existing connection
# Connection stays alive for future requests

Connection Management:
- Connections are identified by unique hash (host, port, db, user)
- Dead connections are automatically detected and recreated
- All functions use autocommit mode for stability
- Connections persist across Streamlit reruns
Manual Control:
from services.db_connector import close_connection, close_all_connections
# Close specific connection
close_connection(db_type, host, port, db_name, user)
# Close all connections (useful for cleanup)
close_all_connections()

SQLite Storage: migration_tool.db
Tables:
1. `datasources` – Database connection profiles

   CREATE TABLE datasources (
       id INTEGER PRIMARY KEY AUTOINCREMENT,
       name TEXT UNIQUE,
       db_type TEXT,
       host TEXT,
       port TEXT,
       dbname TEXT,
       username TEXT,
       password TEXT
   )

2. `configs` – Schema mapping configurations

   CREATE TABLE configs (
       id INTEGER PRIMARY KEY AUTOINCREMENT,
       config_name TEXT UNIQUE,
       table_name TEXT,
       json_data TEXT,
       updated_at TIMESTAMP
   )
Automatic Initialization:
- Database created on first application run
- No manual SQL scripts required
- Handles migrations automatically
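The automatic, repeat-safe initialization described above typically rests on `CREATE TABLE IF NOT EXISTS`. A sketch of what such an `init_db()` might look like (the function name matches `repositories/base.py`, but this body is an assumption, shown here against an in-memory SQLite database):

```python
import sqlite3

def init_db(conn):
    # Idempotent startup DDL: safe to run on every application launch,
    # no manual SQL scripts required. Schemas mirror the tables above.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS datasources (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT UNIQUE, db_type TEXT, host TEXT, port TEXT,
            dbname TEXT, username TEXT, password TEXT
        );
        CREATE TABLE IF NOT EXISTS configs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            config_name TEXT UNIQUE, table_name TEXT,
            json_data TEXT, updated_at TIMESTAMP
        );
    """)

conn = sqlite3.connect(":memory:")
init_db(conn)
init_db(conn)  # second startup: no error, no duplicate tables
tables = {r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
print(sorted(t for t in tables if not t.startswith("sqlite_")))
```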
Mode 1: Run ID (Traditional)
- Uses CSV analysis reports
- Offline operation
- Historical data analysis
- Best for: Initial exploration, documented analysis
Mode 2: Datasource (New)
- Connects directly to live database
- Real-time schema discovery
- Auto-loads tables and columns
- Best for: Active development, latest schema
Switching between modes:
- Open Schema Mapper
- Select source mode (Run ID / Datasource)
- Choose source accordingly
- Schema Mapper adapts automatically
Enable comprehensive data profiling:
{
"sampling": {
"deep_analysis": true
}
}

Metrics Collected:
| Metric | Basic Mode | Deep Mode |
|---|---|---|
| Row Count | ✅ | ✅ |
| Null Count | ✅ | ✅ |
| Distinct Values | ✅ | ✅ |
| Min/Max Values | ❌ | ✅ |
| Top 5 Frequency | ❌ | ✅ |
| Data Composition | ❌ | ✅ (Valid/Null/Empty/Zero) |
| Sample Data | ✅ | ✅ (Smart filtered) |
Performance Considerations:
- Basic Mode: Fast, suitable for large tables (millions of rows)
- Deep Mode: Slower, recommended for detailed migration planning
Automatically filters sample data to show only meaningful values:
Filtering Rules:
- Excludes `NULL` values
- Excludes empty strings (`''`)
- Shows actual representative data
Implementation (MySQL example):
SELECT DISTINCT column_name
FROM table_name
WHERE column_name IS NOT NULL
AND CAST(column_name AS CHAR) <> ''
LIMIT 10;

On macOS with Homebrew, missing database clients are automatically installed:
# Example: Installing MSSQL tools
# Example: Installing MSSQL tools
Error: Command 'sqlcmd' not found
Homebrew detected...
Install 'mssql-tools18' now? (y/N): y
Installing mssql-tools18...
-> Tapping microsoft/mssql-release...
-> Installing packages...
Installation successful!

Generated HTML reports include:
- Overview Tab: Table-level metrics with sortable DataTable
- Column Detail Tab: Comprehensive column-level statistics
- Formulas & Docs Tab: Data quality score explanations
- Process Log Tab: Complete execution logs
Features:
- Responsive design with Bootstrap 5
- Interactive tables with search/filter/sort
- Data quality visualizations
- Exportable to Excel/CSV/PDF
Track every change to your schema mapping configurations with built-in version control.
How It Works:
Every time you save a configuration, the system automatically:
- Creates a new version entry in the `config_histories` table
- Preserves a complete JSON snapshot with timestamp
- Increments the version number (v1, v2, v3...)
- Links to the parent configuration via foreign key
Database Schema:
-- Main configuration table
CREATE TABLE configs (
id TEXT PRIMARY KEY, -- UUID for relationships
config_name TEXT UNIQUE, -- User-facing name
table_name TEXT, -- Source table
json_data TEXT, -- Current config JSON
updated_at TIMESTAMP -- Last modification
);
-- Version history table
CREATE TABLE config_histories (
id TEXT PRIMARY KEY, -- Unique history entry ID
config_id TEXT, -- Links to parent config
version INTEGER, -- Sequential version number
json_data TEXT, -- Config snapshot at this version
created_at TIMESTAMP, -- When this version was created
FOREIGN KEY(config_id) REFERENCES configs(id) ON DELETE CASCADE
);

Key Features:
1. Automatic Versioning
   - No manual intervention needed
   - Every save creates a new version
   - Original version preserved forever

2. Version Comparison

   # Compare two versions to see what changed
   diff = db.compare_config_versions("PatientMigration", version1=1, version2=3)
   # Returns:
   # {
   #   'mappings_added': [...],     # New column mappings
   #   'mappings_removed': [...],   # Deleted mappings
   #   'mappings_modified': [...]   # Changed transformers/targets
   # }

3. Rollback Support
   - View all historical versions in the Settings page
   - Load any previous version
   - Restore deleted configurations from history

4. Audit Trail
   - Complete history of configuration changes
   - Timestamp for every modification
   - Useful for compliance and troubleshooting
Use Cases:
- Migration Testing: Try different mapping strategies, rollback if needed
- Team Collaboration: Track who changed what and when
- Production Safety: Restore last-known-good configuration quickly
- Documentation: Historical record of migration decisions
Production-ready ETL execution engine with enterprise features.
Architecture:
┌─────────────────┐       ┌──────────────────┐       ┌─────────────────┐
│    Source DB    │──────▶│    Migration     │──────▶│    Target DB    │
│  (via Profile)  │       │      Engine      │       │  (via Profile)  │
└─────────────────┘       └────────┬─────────┘       └─────────────────┘
                                   │
                                   ▼
                          ┌──────────────────┐
                          │   Transformers   │
                          │   - Date Conv.   │
                          │   - Trim/Clean   │
                          │   - JSON Parse   │
                          └──────────────────┘
                                   ▲
                          ┌────────┴─────────┐
                          │   Config JSON    │
                          │    (Mappings)    │
                          └──────────────────┘
Key Features:
1. Batch Processing
- Configurable batch size (default: 1000 rows)
- Streaming execution - handles millions of rows
- Memory-efficient: processes one chunk at a time
- Progress tracking with visual progress bar
2. Smart Query Generation
# Only selects mapped columns - reduces network overhead
SELECT "hn", "fname", "lname", "dob" FROM patients
# Instead of SELECT * (which transfers unused data)

3. Data Transformation Pipeline
# Stream batches of 1,000 rows from the source
for df_batch in pd.read_sql(query, source_engine, chunksize=1000):
    # 1. Apply transformers (in-memory)
    df_batch = DataTransformer.apply_transformers_to_batch(df_batch, config)
    # 2. Rename columns to match target schema
    df_batch.rename(columns=rename_map, inplace=True)
    # 3. Bulk insert to target
    df_batch.to_sql(target_table, target_engine, if_exists='append', index=False)

4. Comprehensive Logging
- Real-time log viewer in UI
- Persistent log files: `migration_logs/migration_NAME_TIMESTAMP.log`
- Audit trail: timestamps, row counts, errors
- Downloadable after completion
5. Test Mode
- Process only 1 batch (configurable limit)
- Validate mappings without full migration
- Dry-run capability for safety
6. Error Handling
- Transaction-safe batch commits
- Stops on first error (prevents data corruption)
- Detailed error messages with context
- Rollback support (database-dependent)
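The stop-on-first-error behaviour above amounts to a guarded batch loop: commit each batch, log progress, and halt the moment anything fails. A minimal sketch of the assumed control flow (function names are illustrative):

```python
def migrate(batches, insert_batch, log):
    """Run batches through insert_batch, stopping on the first failure."""
    total = 0
    for i, batch in enumerate(batches, start=1):
        try:
            insert_batch(batch)  # transaction-safe commit per batch
        except Exception as exc:
            # Detailed context, then halt — no further batches are attempted,
            # so corruption cannot spread past the failing batch.
            log(f"Batch {i} failed after {total} committed rows: {exc}")
            raise
        total += len(batch)
        log(f"Batch {i} committed ({total} rows total)")
    return total

logs = []
n = migrate([[1, 2], [3]], lambda b: None, logs.append)
print(n)  # 3
```

In test mode the same loop simply receives a single batch, which is why a dry run validates mappings without touching the bulk of the data.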
Execution Workflow:
Step 1: Select Configuration
  ├── Load from Project Database (saved configs)
  └── Upload JSON File (external configs)

Step 2: Test Connections
  ├── Select Source Datasource
  ├── Select Target Datasource
  ├── Verify connectivity
  └── Health checks

Step 3: Review & Settings
  ├── View configuration JSON
  ├── Set batch size
  ├── Enable/disable test mode
  └── Confirm execution

Step 4: Execute Migration
  ├── Connect to databases (SQLAlchemy engines)
  ├── Generate optimized SELECT query
  ├── Stream data in batches
  ├── Apply transformations
  ├── Bulk insert to target
  └── Log everything

Performance Characteristics:
| Dataset Size | Batch Size | Approx. Time | Memory Usage |
|---|---|---|---|
| 10K rows | 1000 | ~10 seconds | < 50 MB |
| 100K rows | 1000 | ~1-2 minutes | < 200 MB |
| 1M rows | 1000 | ~10-15 min | < 500 MB |
| 10M+ rows | 5000 | ~1-2 hours | < 1 GB |
Use Cases:
- One-time Migrations: Legacy system to new platform
- Continuous Sync: Nightly data transfers
- Data Warehouse ETL: OLTP β OLAP transformations
- Multi-tenant Migrations: Hospital A β Hospital B
- Testing: Validate transformations with test mode
Intelligent column matching using machine learning and healthcare domain knowledge.
Technology Stack:
- Model: `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
- Framework: Sentence Transformers (Hugging Face)
- Similarity: Cosine similarity on semantic embeddings
- Domain: Healthcare Information Systems (HIS)
How It Works:
1. Dual-Strategy Matching
# Strategy 1: Rule-Based Dictionary (Priority)
his_dictionary = {
"hn": ["hn", "hospital_number", "mrn", "patient_code"],
"cid": ["cid", "national_id", "card_id", "citizen_id"],
"vn": ["vn", "visit_number", "visit_no"],
# ... 30+ healthcare terms
}
# Strategy 2: Semantic AI Matching (Fallback)
# Encodes column names into 384-dimensional vectors
# Compares similarity using cosine distance
source_embedding = model.encode("patient_firstname")
target_embeddings = model.encode(["fname", "first_name", "given_name"])
best_match = argmax(cosine_similarity(source_embedding, target_embeddings))

2. Confidence Scoring
- Exact Match: 1.0 (100% confidence)
- Dictionary Match: 0.9 (90% confidence)
- Semantic Match: 0.4-0.9 (threshold-based)
- No Match: 0.0 (suggests manual review)
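The tiers above can be expressed as one scoring function. This is a sketch of the assumed logic — the dictionary entry is taken from the example earlier, while the function name and the way a precomputed semantic score is passed in are illustrative:

```python
# Subset of the HIS dictionary shown earlier; real version has 30+ terms.
HIS_DICTIONARY = {
    "hn": ["hn", "hospital_number", "mrn", "patient_code"],
    "cid": ["cid", "national_id", "card_id", "citizen_id"],
}

def score_match(source_col, target_col, semantic_score=0.0):
    """Tiered confidence: exact > dictionary > semantic > no match."""
    s, t = source_col.lower(), target_col.lower()
    if s == t:
        return 1.0  # exact match: 100% confidence
    for synonyms in HIS_DICTIONARY.values():
        if s in synonyms and t in synonyms:
            return 0.9  # both names are known synonyms of the same concept
    if semantic_score >= 0.4:
        return semantic_score  # semantic fallback, threshold-based (0.4-0.9)
    return 0.0  # no match: flag for manual review

print(score_match("hn", "hospital_number"))  # 0.9 (dictionary match)
```

In the real mapper the `semantic_score` argument would come from cosine similarity of the sentence-transformer embeddings, as shown in the strategy-2 snippet above.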
3. Sample Data Analysis
Analyzes actual column values to suggest transformers:
# Example: Detects Thai Buddhist year dates
sample_values = ["2566-01-15", "2567-03-20", "2565-12-01"]
analysis = ml_mapper.analyze_column_with_sample(
source_col="admit_date",
target_col="admission_date",
sample_values=sample_values
)
# Returns:
{
"confidence_score": 0.9,
"transformers": ["BUDDHIST_TO_ISO"], # Auto-suggested
"reason": "Detected Thai Buddhist year (25xx) in 3/3 samples"
}

4. Pattern Detection
| Pattern | Detection Logic | Suggested Transformer |
|---|---|---|
| Thai Buddhist Year | `25[5-9]\d` in >50% of samples | `BUDDHIST_TO_ISO` |
| Whitespace Issues | Leading/trailing spaces | `TRIM` |
| JSON Structures | `{...}` or `[...]` | `PARSE_JSON` |
| Float IDs | `123.0` pattern | `FLOAT_TO_INT` |
| Leading Zeros | ID with `0` prefix | Keep as string |
| All NULL/Empty | No valid data | Mark as `IGNORE` |
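A couple of rows from the table above, sketched as a detector function. The regex and the >50% threshold come from the table; the function name and the priority order between rules are assumptions:

```python
import re

# Thai Buddhist calendar years currently fall in the 25xx range.
BUDDHIST_YEAR = re.compile(r"25[5-9]\d")

def suggest_transformer(samples):
    """Suggest a transformer from sample values (subset of the rules above)."""
    non_empty = [s for s in samples if s]
    if not non_empty:
        return "IGNORE"  # all NULL/empty: no valid data
    hits = sum(1 for s in non_empty if BUDDHIST_YEAR.search(s))
    if hits / len(non_empty) > 0.5:
        return "BUDDHIST_TO_ISO"  # Buddhist year in >50% of samples
    if any(s != s.strip() for s in non_empty):
        return "TRIM"  # leading/trailing whitespace detected
    return None  # no rule fired: leave the column untransformed

print(suggest_transformer(["2566-01-15", "2567-03-20", "2565-12-01"]))
# BUDDHIST_TO_ISO
```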
5. Healthcare-Specific Validation
# Hospital Number (HN) validation
if "hn" in source_column:
hn_pattern = r'^\d{6,10}$' # 6-10 digits
valid_count = count_matches(samples, hn_pattern)
confidence = valid_count / total_samples
# National ID (CID) validation
if "cid" in source_column:
cid_pattern = r'^\d{13}$' # Exactly 13 digits
validate_thai_national_id_checksum(samples)

User Interface:
In Schema Mapper page:
- Click the "AI Auto-Map" button
- AI analyzes source columns vs target schema
- Displays suggestions with confidence scores
- User reviews and confirms/modifies mappings
- AI also suggests transformers based on sample data
Benefits:
- Time Savings: Auto-map 100 columns in seconds vs hours
- Accuracy: Semantic understanding, not just string matching
- Learning: Improves with healthcare-specific dictionary
- Transparency: Shows confidence scores for manual review
- Flexibility: Suggestions, not forced decisions
Limitations:
- Requires internet for first model download (~100 MB)
- Best for English/Thai column names (multilingual model)
- Suggestions need human validation
- Not trained on your specific schema (generic model)
Example Session:
Source Columns      Target Columns       AI Suggestion    Confidence
--------------      --------------       -------------    ----------
hn              →   hospital_number      ✅ Matched       100%
patient_name    →   full_name            ⚠️ Maybe         65%
admit_dt        →   admission_date       ✅ Matched       85%
                    • Transformer: BUDDHIST_TO_ISO
blood_press     →   bp_systolic          ⚠️ Uncertain     45%
old_id          →   [No Match]           ❌ Manual        0%
Cause: Outdated Streamlit version (< 1.30.0)
Solution:
# Option 1: Use virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate
pip install --upgrade streamlit
# Option 2: Force reinstall
pip uninstall streamlit -y
pip install --upgrade --force-reinstall streamlit

Cause: Self-signed or untrusted SSL certificate
Solution: The toolkit automatically adds -C flag to trust server certificates:
sqlcmd -S host,port -C -U user -P password ...Cause: T-SQL variable scope in dynamic SQL
Solution: Already handled in v7.1+ with proper variable injection
Cause: Incorrect schema name (e.g., using the default public for MSSQL)

Solution: Specify the correct schema in config.json:

```json
{
  "database": {
    "type": "mssql",
    "schema": "dbo"
  }
}
```

- Check the process.log in the report folder
- Review error messages in the terminal output
- Verify database connectivity with the native clients:

```bash
mysql -h host -u user -p
psql -h host -U user -d database
sqlcmd -S host,port -U user -P password
```
- Open an issue on GitHub with:
- Error message
- Database type and version
- Operating system
- Relevant log excerpts
```bash
cd mini_his
python gen_mini_his.py
```

```bash
cd analysis_report
# Edit config.json to point to the test database
./unified_db_analyzer.sh

# Verify output
ls -lh migration_report/*/data_profile/data_profile.csv
```

Approximate analysis times (single table):
| Rows | Columns | Basic Mode | Deep Mode |
|---|---|---|---|
| 10K | 20 | ~2s | ~5s |
| 100K | 50 | ~10s | ~30s |
| 1M | 100 | ~30s | ~2min |
| 10M+ | 200+ | ~2min | ~10min+ |
Optimization Tips:
- Use the `tables` filter to analyze specific tables only
- Disable `deep_analysis` for initial exploration
- Adjust `default_limit` for faster sampling
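Those knobs live in config.json alongside the database settings. A sketch combining them — the key names are taken from this README, but the exact nesting may differ in your version, so treat this as illustrative:

```json
{
  "database": {
    "type": "mysql",
    "schema": "his"
  },
  "tables": ["patient", "admission"],
  "deep_analysis": false,
  "default_limit": 1000
}
```

With `deep_analysis` off and a modest `default_limit`, a first pass over a 1M-row table should land near the Basic Mode column of the table above.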
```bash
# Run ALL unit tests (pytest auto-discovers everything)
make test

# Run a specific test suite
make test-simple   # AI pattern detection
make test-column   # Column analysis
make test-suite    # tests/ directory only
```

The project includes:
- test_analysis_simple.py → AI pattern detection tests
- test_column_analysis.py → Column analysis tests
- tests/ → pytest test suite with a `tmp_dir` fixture for filesystem isolation
```bash
# Activate venv first (or use: source venv/bin/activate)
make install

# Run standard tests
python test_analysis_simple.py
python test_column_analysis.py

# Run pytest suite
python -m pytest tests/ -v

# Run pytest with coverage
python -m pytest tests/ --cov=services --cov=models --cov=utils --cov-report=term-missing -v

# Run individual suites
python3.11 -m pytest tests/test_query_builder.py -v
python3.11 -m pytest tests/test_checkpoint_manager.py -v
python3.11 -m pytest tests/test_encoding_helper.py -v
python3.11 -m pytest tests/test_migration_logger.py -v
python3.11 -m pytest tests/test_models.py -v
python3.11 -m pytest tests/test_helpers.py -v
```

| Test File | Module Under Test | Tests |
|---|---|---|
| `test_checkpoint_manager.py` | `services/checkpoint_manager.py` | 4 |
| `test_encoding_helper.py` | `services/encoding_helper.py` | 8 |
| `test_helpers.py` | `utils/helpers.py` | 14 |
| `test_migration_logger.py` | `services/migration_logger.py` | 5 |
| `test_models.py` | `models/` (MigrationConfig, Datasource) | 5 |
| `test_query_builder.py` | `services/query_builder.py` | 10 |
Use the `tmp_dir` fixture from conftest.py for any test that writes to disk:

```python
def test_something(tmp_dir):
    # tmp_dir is a pathlib.Path pointing to a fresh temp directory,
    # automatically cleaned up after each test
    path = tmp_dir / "output.json"
    ...
```

Contributions are welcome! Please follow these guidelines:
- Use GitHub Issues
- Include error messages and logs
- Provide reproduction steps
- Specify environment details
- Open a GitHub Discussion
- Describe use case and benefits
- Provide examples if possible
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Code Standards:
- Bash scripts: Follow ShellCheck recommendations
- Python: PEP 8 style guide
- Add comments for complex logic
- Update documentation for new features
This project is licensed under the MIT License - see the LICENSE file for details.
- Built for healthcare professionals managing HIS migrations
- Inspired by enterprise database migration challenges
- Community feedback and contributions welcome
- Documentation: This README and inline code comments
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Datasource management with PostgreSQL storage (v8.0)
- Connection pooling and reuse (v8.0)
- Dual-mode schema mapper (Run ID / Datasource) (v8.0)
- Live schema discovery (v8.0)
- Configuration repository (v8.0)
- Smart column suggestions (v8.0)
- Configuration history and version control (v8.0)
- Version comparison and rollback (v8.0)
- Production-ready migration engine (v8.0)
- Batch processing with streaming (v8.0)
- AI-powered column mapping (v8.0)
- Healthcare-specific ML dictionary (v8.0)
- Automatic transformer suggestions (v8.0)
- FastAPI REST API with full CRUD (v9.0)
- Socket.IO real-time job events (v9.0)
- Background pipeline execution (v9.0)
- Pipeline nodes & edges with topological sort (v9.0)
- Per-step datasource resolution (v9.0)
- generate_sql priority over dynamic SELECT (v9.0)
- Docker containerization
- CI/CD pipeline integration
- Data validation dashboard with anomaly detection
- Scheduled migration jobs (cron-like)
- Multi-datasource data lineage tracking
- Custom AI model training for organization-specific schemas
- Incremental/delta migration support
- Data quality scoring engine
Made with ❤️ for the HIS migration community