rapiddweller/datamimic

DATAMIMIC: Governed Test Data for Regulated Enterprises

This repository contains the DATAMIMIC Community Edition (CE). MIT-licensed, Python-native, MCP-ready.

CE is fully usable standalone for deterministic synthetic data generation and PII-aware pseudonymization. The Enterprise Platform adds governed workflows, PII scanning, role-based access, audit logging, scheduling, multi-system execution, and the full operational layer that regulated enterprises require.

👉 Enterprise Platform: datamimic.io | 📘 Docs: docs.datamimic.io | 📅 Book a strategy call: datamimic.io/contact




What is DATAMIMIC?

DATAMIMIC CE is the open-source deterministic data engine at the core of the DATAMIMIC Enterprise Platform. It is usable standalone for synthetic data generation and PII-aware pseudonymization in any local, CI, or agent-driven workflow.

The Enterprise Platform adds the governed workflows, scanners, dashboards, and execution layer that regulated enterprises require for production-scale test-data operations.

Available in CE (this repo):

  • Generate fully synthetic, deterministic datasets: model-driven, no source data required
  • Pseudonymize staging/QA exports: deterministic (seeded) or privacy-maximized (non-seeded) field transformation; PII fields identified and modeled manually in the XML pipeline
  • Execute single-system pipelines against PostgreSQL · MySQL · Oracle · MS SQL · SQLite · MongoDB · CSV · JSON · XML
  • Emit provenance: append-only execution logs and a per-output content hash for audit re-execution
  • Serve agents: bundled MCP server exposing generate as a deterministic tool for AI/LLM tooling

The Enterprise Platform adds:

  • PII scanner: probability-scored field detection with configurable thresholds via DataWorkbench
  • Multi-system execution: Oracle / MongoDB / Kafka in coordinated workflows with referential integrity
  • Industry message templates: EDIFACT / SWIFT MT / HL7 v2.x / HL7 FHIR generated as deterministic test/training artefacts
  • Governance layer: role-based dashboards, audit trails, approval flows, reusable enterprise templates, scheduler
  • Performance core: Rust fastpath, ML/auto-regressive engine for complex distributions, keyset and manifest building, optimised distributed execution
  • On-premise / air-gapped deployment: podman-compose or Helm, with consulting-led rollout

Deployed in regulated EU banking environments for deterministic test data across Oracle, MongoDB, and Kafka pipelines. Reference customers available under NDA; see also datamimic.io case studies.


CE vs Enterprise Platform

CE and the Enterprise Platform (EE) are not the same engine with a feature flag. They share the DSL and determinism contract, but EE is an independently optimised execution engine built for enterprise-scale throughput and operational control.

Engine comparison

| Capability | Community Edition (CE) | Enterprise Platform (EE) |
| --- | --- | --- |
| Deterministic data generation | ✅ | ✅ |
| Deterministic seeding in the DSL | ✅ entities (<setup rngSeed>, <variable rngSeed>) | ✅ entities + standalone literal <key generator> |
| Pseudonymization, seeded (GDPR Art. 4(5); supports Art. 25 / Art. 32) | ✅ manual model | ✅ automated via DataWorkbench |
| Pseudonymization, non-seeded (privacy-maximized) | ✅ manual model | ✅ automated via DataWorkbench |
| Python API + XML pipelines | ✅ | ✅ |
| Domain models: Finance, Healthcare, Demographics | ✅ | ✅ |
| Time-series generation (<generate start/end/interval>, ISO 8601, prefix-stable) | ✅ | ✅ |
| MCP server for AI agent integration | ✅ | ✅ |
| CLI + local execution | ✅ | ✅ |
| Scale | millions of records via Python multiprocessing (and optional Ray) | designed for billion-record workloads: Rust fastpath, optimised multi-process execution, and keyset/manifest building on top of the shared Ray distribution layer |
| PII scanner | ❌ | ✅ probability-scored field detection, configurable threshold, DataWorkbench integration |
| Runtime configuration profiles | ❌ | ✅ Performance · Balanced · Flexibility |
| Memory management | standard | optimised for high-volume batch and streaming |
| Logging granularity | flat execution log | configurable: minimal · standard · deep nested tracing |
| Nested structure evaluation | basic | deep nested generation with extended condition + ruleset evaluation |
| Importer / exporter logging | ❌ | per-stage logging for importers and exporters |
| Error handling | standard exceptions | structured error catalog with recovery strategies |
| Rust fastpath | ❌ | performance-critical paths in Rust |
| Keyset and manifest building | ❌ | reads live DB schemas to build coordinated multi-table generation plans |
| ML / auto-regressive engine | ❌ | combine statistical models with conditions, rulesets, validators for complex distributions |

Platform capabilities (EE only)

| Capability | EE |
| --- | --- |
| Multi-user collaboration | ✅ |
| Role-based access control (RBAC) | ✅ |
| Audit logs + provenance dashboards | ✅ |
| PII scanner: probability scoring, threshold-based field flagging | ✅ |
| DataWorkbench: visual field mapping and pseudonymization model builder | ✅ |
| Reusable enterprise template library | ✅ |
| Scheduled execution + task runner | ✅ |
| CI/CD pipeline integration (Tosca, Jenkins, GitLab) | ✅ |
| Multi-system execution: Oracle, MongoDB, Kafka | ✅ |
| Template engine: schema-aware editors for EDIFACT, SWIFT MT, HL7 v2.x, and HL7 FHIR; customer-uploadable specs, further industry formats built per engagement on the same framework | ✅ |
| Audit-evidence artefacts for GDPR Art. 30 records, PCI DSS 4.0 Req. 6.5.5 (test data) reviews, and, for US Covered Entities / Business Associates, HIPAA §164.312 evidence packs | ✅ |
| On-premise deployment + air-gapped environments | ✅ |
| LSP-powered IDE tooling for DSL authoring | ✅ |

👉 Explore the Enterprise Platform | Book a platform demo


EE runtime profiles

The EE core supports three runtime configuration profiles, selectable per execution context:

| Profile | Optimises for | Typical use case |
| --- | --- | --- |
| Performance | Maximum throughput via Rust fastpath, optimised multi-process execution, and Ray-based distribution | Bulk generation at billion-record volumes to PostgreSQL, Oracle, Kafka |
| Balanced | Throughput + full audit logging | Standard enterprise pipeline runs with compliance requirements |
| Flexibility | Deep nested evaluation, extended condition and ruleset processing | Complex domain models with ML engine combinations, multi-level referential structures |

Logging depth is independently configurable per profile, from minimal (throughput-optimised) to full nested tracing across importers, exporters, and generation stages.


EE template engine

The EE template engine generates industry-standard messages (financial, EDI, and healthcare formats) from DATAMIMIC models. The workbench parses uploaded message samples, auto-detects the message type, and validates edits against the registered spec version in real time.

Capabilities

  • Spec-aware form editing: segments and elements rendered as structured forms with mandatory/optional indicators, per-field value suggestions, and inline custom-extension support
  • Strict validation against baked spec versions, with segment- and element-level error reporting
  • Advisory mode when a spec is unregistered or in draft: editing stays enabled, validation continues as guidance
  • Round-trip between the structured form view and the authoritative template text, with no fidelity loss
  • Download / adjust / upload your own spec: customers can extend or override the baked spec catalogue without waiting for a release
  • Live structure tree + preview for every edit
  • File auto-detection: upload an existing message, the editor identifies the type and loads the matching spec

Format coverage

| Format | Coverage |
| --- | --- |
| UN/EDIFACT | Schema-aware form editor; spec versions and subsets per engagement |
| SWIFT MT | Schema-aware form editor; categories and SR versions per engagement |
| HL7 v2.x | Schema-aware form editor; versions per engagement |
| HL7 FHIR | Schema-aware form editor for FHIR resources (Patient, Observation, Encounter, …); profiles per engagement |
| Further industry formats (ISO 20022 / MX, vertical dialects) | Built into the editor catalogue per customer engagement, on the same framework |

Customers can extend the spec catalogue between releases by downloading, adjusting, and uploading their own spec files directly.

Generated messages are deterministic and traceable to their source model, and syntactically valid against the registered spec. They are intended for test and training environments only: they are not network-validated and must not be transmitted on production SWIFTNet or EDI networks. See the SWIFT CSP note below.


Who is DATAMIMIC for?

Enterprise Platform (EE)

| Role | What DATAMIMIC solves |
| --- | --- |
| QA / Test Manager | Eliminate manual test data requests. Self-service, governed, always ready. |
| Business Analyst | Define data requirements in business-readable models, no scripting needed. |
| Platform / DevOps Engineer | Integrate deterministic test data generation into CI/CD and scheduled pipelines. |
| Compliance / Audit | Full audit trail for every generation run. Regulator-ready logs, no production data exposure. |
| Enterprise Architect | One governed standard across Oracle, MongoDB, Kafka, flat files, and custom systems. |

Community Edition (CE)

Developers and data engineers who need deterministic synthetic data generation or PII-aware pseudonymization in local environments, CI pipelines, or agent-driven workflows. PII field identification is manual; the EE DataWorkbench automates this step.


Why deterministic generation matters

Most test data tools produce random output. That breaks regression tests, audit trails, and cross-team reproducibility.

DATAMIMIC's determinism contract (CE):

  • Same seed + same model = byte-identical output, every run, every machine. Holds at three layers: the generate_domain facade, every domain service called directly, and every literal generator that accepts an rng= argument. Verified per-service on every CI run via tests_ce/architecture/test_service_replay_determinism.py.
  • DSL-level seeding (entities): <setup rngSeed="N"> makes the whole model deterministic. Every seed-less <variable entity="…"> derives a reproducible child RNG from it, and <variable rngSeed="…"> overrides it for that block (no seed anywhere → wall-clock random). Verified by tests_ce/integration_tests/test_determinism_seed_scenarios. Deterministic DSL-level seeding of standalone literal generators (<key generator="…">) is an Enterprise (EE) feature; in CE such generators are seeded only when used directly from Python with rng=.
  • Source reads: distribution="ordered" reads a data source in stable file order; distribution="random" shuffles but replays identically when <setup rngSeed> is set (without a seed the shuffle is non-deterministic by design, for privacy-maximized one-time deliveries). Deterministic shuffling across distributed / multi-process execution is EE.
  • Provenance hash on every facade output = re-executable lineage. Same input → same determinism_proof.content_hash, always.
  • UUIDv5 entity identifiers = stable across runs and machines.
  • Single wall-clock SPOT (now_utc_naive()); raw datetime.now() is forbidden in production code and the clock-drift architecture gate fails CI on any reintroduction.
  • RNG/clock runtime SPOTs in datamimic_ce/domains/domain_core/runtime/: spawn_rng (reproducible child-RNG derivation), now_utc_naive, and resolve_clock. The same contract vocabulary the Enterprise Platform enforces end-to-end.
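
The child-RNG derivation and the seeded shuffle described in the list above can be illustrated in plain Python. This is a minimal sketch using only the standard library, not the code behind spawn_rng: the helper names derive_child_rng and read_rows, and the choice of SHA-256 for seed derivation, are assumptions made for illustration.

```python
import hashlib
import random
from typing import List, Optional


def derive_child_rng(root_seed: str, label: str) -> random.Random:
    """Derive a reproducible RNG for one named block (hypothetical helper)."""
    # Hash (root seed, label) so the child seed does not depend on call order.
    digest = hashlib.sha256(f"{root_seed}:{label}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def read_rows(rows: List[str], distribution: str,
              rng: Optional[random.Random] = None) -> List[str]:
    """Return rows in file order, or shuffled by the supplied RNG."""
    if distribution == "ordered":
        return list(rows)
    shuffled = list(rows)
    # No RNG supplied: fall back to an unseeded generator (not replayable).
    (rng or random.Random()).shuffle(shuffled)
    return shuffled


rows = ["r0", "r1", "r2", "r3", "r4"]
first = read_rows(rows, "random", derive_child_rng("42", "customers"))
second = read_rows(rows, "random", derive_child_rng("42", "customers"))
print(first == second)  # the seeded shuffle replays identically
```

Because the child seed in this sketch depends only on the root seed and the block label, adding or reordering other blocks would not change the values a given block sees.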

The Enterprise Platform (EE) goes further. Beyond the CE contract, EE makes the whole execution environment deterministic: a configurable/frozen wall-clock (not just CE's fixed anchor), DSL-level seeding of literal <key generator> generators, and deterministic SAFE_GLOBALS plus the Python random functions, so sandboxed script expressions and any stdlib random call replay identically as well.
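
The idea of a single clock access point can be sketched as follows. The function name make_clock is hypothetical, and the real now_utc_naive and resolve_clock helpers may behave differently; the sketch only shows how a fixed clock string removes the dependence on the wall clock.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def make_clock(fixed: Optional[str] = None) -> Callable[[], datetime]:
    """Return a zero-argument clock: frozen if `fixed` is given, live otherwise."""
    if fixed is None:
        # Live mode: the only place in this sketch that reads the wall clock.
        return lambda: datetime.now(timezone.utc).replace(tzinfo=None)
    # Normalise a trailing "Z" so older Python versions can parse the string.
    parsed = datetime.fromisoformat(fixed.replace("Z", "+00:00"))
    if parsed.tzinfo is not None:
        # Convert to UTC, then drop tzinfo to get a naive UTC datetime.
        parsed = parsed.astimezone(timezone.utc).replace(tzinfo=None)
    return lambda: parsed


clock = make_clock("2025-01-01T00:00:00Z")
print(clock().isoformat())
```

Code that needs the current time would call the injected clock instead of datetime.now(), so a test can pin the time context with one argument.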

from datamimic_ce.domains.facade import generate_domain

request = {
    "domain": "person",
    "version": "v1",
    "count": 1,
    "seed": "regression-suite-42",       # identical seed → identical output
    "locale": "en_US",
    "clock": "2025-01-01T00:00:00Z"      # fixed clock = stable time context
}

response = generate_domain(request)
# response["determinism_proof"]["content_hash"] is stable across runs.

Direct service use is equally deterministic when given a seeded RNG:

import random
from datamimic_ce.domains.finance.services import CreditCardService

# Same seeded Random → byte-identical CreditCard across runs.
card_a = CreditCardService(rng=random.Random(42)).generate()
card_b = CreditCardService(rng=random.Random(42)).generate()
assert card_a.bic == card_b.bic and card_a.card_number == card_b.card_number

Determinism contract: CE vs EE

| Scope | CE | Enterprise Platform |
| --- | --- | --- |
| Facade (generate_domain registered domains) | ✅ byte-identical, CI-gated | ✅ byte-identical |
| Domain services (direct use with seeded rng=...) | ✅ byte-identical, CI-gated | ✅ byte-identical |
| Literal generators (with seeded rng=...) | ✅ byte-identical | ✅ byte-identical |
| RNG / clock runtime SPOTs | ✅ spawn_rng, now_utc_naive, resolve_clock | ✅ same contract, enforced end-to-end |
| Architecture gates in CI | ✅ facade replay + service replay (every service) + clock drift | ✅ 5+ gates (RNG ownership, clock drift, DSL eval, seeded-mode propagation, dataset SPOT) |
| Custom XML pipelines | ⚠️ best-effort | ✅ byte-identical |
| Multi-system coordinated execution (Oracle + MongoDB + Kafka in one run) | - | ✅ byte-identical end-to-end |
| Seeded vs unseeded pseudonymization (deterministic clock anchor vs CSPRNG live-clock) | - | ✅ |
| Threat-led / TLPT-grade audit evidence (full contract enforcement, per-stage execution logging) | - | ✅ |

CE delivers contract-enforced determinism for the synthetic-data generation surface (facade, services, generators). The Enterprise Platform extends the same contract across the full pipeline (custom XML descriptors, multi-system writes with referential integrity, the seeded/unseeded pseudonymization modes) and adds the five drift-gates that lock the contract end-to-end for regulated deployments.
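
To make the provenance-hash and UUIDv5 points concrete, here is a small standard-library sketch. It assumes a canonical JSON encoding hashed with SHA-256 and a made-up namespace UUID; how DATAMIMIC actually computes determinism_proof.content_hash is not specified in this README, so content_hash and entity_id below are illustrative stand-ins.

```python
import hashlib
import json
import uuid

# Hypothetical namespace for illustration; any fixed UUID would work.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def content_hash(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the payload."""
    # Sorted keys and fixed separators make the byte stream order-independent.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def entity_id(seed: str, index: int) -> str:
    """Name-based UUIDv5: identical inputs give the identical identifier."""
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, f"{seed}:{index}"))


record = {"id": entity_id("regression-suite-42", 0), "name": "Ada", "age": 36}
reordered = {"age": 36, "name": "Ada", "id": entity_id("regression-suite-42", 0)}
print(content_hash(record) == content_hash(reordered))
```

A regression suite can store such a hash next to the seed and model version, then re-run the generation later and compare hashes instead of diffing full datasets.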


How DATAMIMIC differs from Faker and generic generators

| Capability | Faker / Random generators | DATAMIMIC CE | DATAMIMIC EE |
| --- | --- | --- | --- |
| Reproducible output | ❌ | ✅ | ✅ |
| Domain-aware relationships | ❌ | ✅ | ✅ |
| Business logic constraints | ❌ | ✅ | ✅ |
| Per-output provenance hash | ❌ | ✅ | ✅ |
| Source data pseudonymization | ❌ | ✅ manual | ✅ automated |
| PII field detection | ❌ | ❌ | ✅ probability-scored |
| Enterprise governance layer | ❌ | ❌ | ✅ |
| Multi-system execution | ❌ | ❌ | ✅ |
| Role-based workflows | ❌ | ❌ | ✅ |
| Designed for regulated-industry deployment (governance, audit, RBAC) | ❌ | ❌ | ✅ |

# Faker: broken relationships
from faker import Faker
fake = Faker()
patient_age = fake.random_int(1, 99)
conditions  = [fake.word()]
# "25-year-old with Alzheimer's": meaningless for any real test

# DATAMIMIC: domain-aware, deterministic with a seed
import random
from datamimic_ce.domains.healthcare.services import PatientService
patient = PatientService(rng=random.Random(42)).generate()
print(f"{patient.full_name}, {patient.age}, {patient.conditions}")
# Age-appropriate, domain-consistent, and identical every run with a fixed seed

Quickstart: Community Edition

pip install datamimic-ce

Healthcare domain

import random
from datamimic_ce.domains.healthcare.services import PatientService

patient = PatientService(rng=random.Random(42)).generate()
print(patient.full_name, patient.age, patient.conditions)
# Age-appropriate conditions, demographically realistic; deterministic with a seed

Finance domain

import random
from datamimic_ce.domains.finance.services import BankAccountService

account = BankAccountService(rng=random.Random(42)).generate()
print(account.account_number, account.balance)
# Balance-consistent, locale-correct; reproducible with a seed

Pseudonymization: CE (manual model)

DATAMIMIC supports two pseudonymization modes with different privacy postures:

| Mode | How | Legal classification | Use case |
| --- | --- | --- | --- |
| Seeded (rngSeed set) | Deterministic, reproducible | Pseudonymization (GDPR Art. 4(5)) | Regression testing, stable CI/CD pipelines |
| Non-seeded (no rngSeed) | Non-deterministic, no reversible mapping at field level | Privacy-maximized transformation | One-time data delivery, higher privacy posture |

Note on GDPR anonymization: Full anonymization status under GDPR depends on complete field coverage across all quasi-identifiers and a re-identification risk assessment on the complete record, not on individual field transformation alone. DATAMIMIC does not make anonymization claims on behalf of the customer. Non-seeded mode maximizes privacy at the transformation level; the customer is responsible for assessing re-identification risk across the full dataset.

In CE, PII fields are identified and modeled manually in the XML pipeline:

<setup>
  <generate name="customers" source="customer_export" target="customer_test" distribution="ordered">
    <!-- distribution="ordered" reads the source in a stable order, which is required so the
         Nth source row maps to the same seeded synthetic value on every run. The
         default ("random") shuffles non-deterministically and would break it.
         rngSeed on the <variable> makes the synthetic values reproducible; drop
         rngSeed for the privacy-maximized (non-deterministic) mode. -->
    <variable name="p"   entity="Person"      dataset="DE" rngSeed="42" />
    <variable name="acc" entity="BankAccount" dataset="DE" rngSeed="42" />

    <key name="first_name" script="p.given_name" />
    <key name="last_name"  script="p.family_name" />
    <key name="email"      script="p.email" />
    <key name="iban"       script="acc.iban" />
    <key name="birth_date" script="p.birthdate" />
  </generate>
</setup>

Built-in converters can additionally transform a key's value, e.g. irreversibly hash the original instead of replacing it, or partially mask it:

<key name="email" script="p.email" converter="Hash('sha256','hex')" />
<key name="iban"  script="acc.iban" converter="MiddleMask(8, 4)" />

Available converters: Mask, MiddleMask(start, end), CutLength(n), Hash(type, format[, salt]), DateFormat(fmt), Append, UpperCase, LowerCase, Date2Timestamp, Timestamp2Date.
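
For readers unfamiliar with these transformations, the two converters used above behave roughly like the following plain-Python functions. This is a sketch under assumptions: it reads MiddleMask(8, 4) as "keep the first 8 and last 4 characters", prepends the salt before hashing, and uses "*" as the mask character. None of these details is confirmed by this README, and hash_value and middle_mask are hypothetical names.

```python
import hashlib


def hash_value(value: str, algorithm: str = "sha256", salt: str = "") -> str:
    """One-way hex digest of the value, optionally salted (salt is prepended)."""
    hasher = hashlib.new(algorithm)
    hasher.update((salt + value).encode("utf-8"))
    return hasher.hexdigest()


def middle_mask(value: str, keep_start: int, keep_end: int,
                mask_char: str = "*") -> str:
    """Keep the first and last characters, mask everything in between."""
    if keep_start + keep_end >= len(value):
        # Too short to keep both ends safely: mask the whole value.
        return mask_char * len(value)
    hidden = len(value) - keep_start - keep_end
    return value[:keep_start] + mask_char * hidden + value[len(value) - keep_end:]


masked_iban = middle_mask("DE89370400440532013000", 8, 4)
hashed_email = hash_value("jane.doe@example.com")
print(masked_iban)  # DE893704**********3000
```

Hashing preserves equality (two identical inputs still match after transformation), while masking preserves a recognisable prefix and suffix; which one fits depends on what the downstream test needs to compare.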

datamimic run ./pseudonymize-customers/datamimic.xml

source is a controlled export or staging input, never a live production connection.

With rngSeed set: same source record → same pseudonymized output on every run. Stable for regression testing.

Without rngSeed: non-deterministic output; no reversible mapping exists at the field level. Stronger privacy posture for one-time delivery scenarios.
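
The difference between the two modes comes down to which random source drives the replacement values. The sketch below is a simplified stand-in, not the DATAMIMIC implementation: replacement_names and the SURNAMES list are hypothetical, and the unseeded branch assumes an OS-backed CSPRNG.

```python
import random
import secrets

# Hypothetical pool of synthetic replacement values.
SURNAMES = ["Meyer", "Schmidt", "Fischer", "Weber", "Wagner", "Becker"]


def replacement_names(count: int, seed=None) -> list:
    """Pick one synthetic surname per source row, in row order."""
    # Seeded: reproducible PRNG. Unseeded: OS-backed CSPRNG, not replayable.
    rng = random.Random(seed) if seed is not None else secrets.SystemRandom()
    return [rng.choice(SURNAMES) for _ in range(count)]


seeded_a = replacement_names(5, seed=42)
seeded_b = replacement_names(5, seed=42)
unseeded = replacement_names(5)
print(seeded_a == seeded_b)  # seeded runs replay identically
```

Note that the seeded result is only stable per row if rows arrive in the same order each run, which is why the pipeline above pairs rngSeed with distribution="ordered".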

In the Enterprise Platform (EE): the DataWorkbench PII scanner automatically scans source schemas, assigns probability scores to each field, and flags candidates above a configurable threshold. Flagged fields are wired into the pseudonymization model automatically, with no manual field mapping required.

A fully synthetic scenario (no source data) uses the same XML pipeline format:

<setup>
  <generate name="patients" count="1000" target="CSV">
    <variable name="patient" entity="Patient" dataset="US" ageMin="60" ageMax="80" rngSeed="42" />
    <key name="full_name"   script="patient.full_name" />
    <key name="age"         script="patient.age" />
    <array name="conditions" script="patient.conditions" />
  </generate>
</setup>

datamimic run ./patient-scenario/datamimic.xml

Time-series generation: CE

Any <generate> becomes a time-series loop when given strict ISO 8601 start/end/interval attributes. Per iteration the script context exposes a ts namespace:

| Variable | Type | Meaning |
| --- | --- | --- |
| ts.now | datetime | Current tick |
| ts.step | int | Position within one series (0..N-1) |
| ts.series | int | Which series this row belongs to (0..count-1) |

Output column names, including whether to even emit a timestamp or series-id column, are entirely the user's choice via <key>. The primitive is domain-agnostic; the same DSL covers IoT readings, financial ticks, log streams, smart meters, anything time-indexed.

<setup>
  <!-- Stock ticks: three symbols, 5-min interval, 30-min window -->
  <generate name="ticks" count="3"
            start="2026-01-01T09:30:00+00:00"
            end="2026-01-01T10:00:00+00:00"
            interval="PT5M"
            target="ticks.csv">
    <key name="timestamp" script="ts.now.isoformat()"/>
    <key name="symbol"    script="['AAPL','MSFT','GOOG'][ts.series]"/>
    <key name="price"     script="100 + ts.step * 0.25"/>
  </generate>

  <!-- Sensor with diurnal seasonality, single series (count defaults to 1) -->
  <generate name="readings"
            start="2026-01-01T00:00:00+00:00"
            end="2026-01-08T00:00:00+00:00"
            interval="PT1H"
            target="readings.csv">
    <key name="timestamp" script="ts.now.isoformat()"/>
    <key name="value"     script="20 - 10 * math.cos(ts.now.hour * math.pi / 12)"/>
  </generate>
</setup>

Guarantees:

  • Prefix-stable by construction: the first N ticks of series 0 are byte-identical regardless of total window length, because each row's ts.now is a pure function of start + interval * step.
  • Loop order is contiguous per series: series 0's full sequence, then series 1's, etc. This makes downstream grouping trivial.
  • Strict ISO 8601: start/end via datetime.fromisoformat (Z-suffix supported); interval via the isodate library (PT1H, PT15M, PT5S, P1D, P1W, P1DT12H, fractional seconds for sub-second precision). Resolution: PT0.001S = 1 ms, PT0.000001S = 1 µs (the microsecond floor of Python's datetime.timedelta; sub-µs intervals and units without a constant length, such as months and years, are rejected with a clear error).
  • count is orthogonal, not overloaded: it means "outer-loop iterations of this <generate>" in all modes (same as nested <generate count=…>). In time-series mode each outer iteration is one series of N ticks, so total rows = count × ticks_per_series. Default count="1" keeps single-series fixtures terse.
  • Naming caveat: a <key name="ts"> output column would shadow the namespace (current_product overrides current_variables in script scope), and a <variable name="ts"> is rejected at parse time. Use a different name, e.g. timestamp for the column.
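
The first, second, and fourth guarantees follow directly from how ticks are computed, which a few lines of plain Python can show. This sketch is not the engine's loop: the ticks function is hypothetical, and it treats end as exclusive, which this README does not state.

```python
from datetime import datetime, timedelta


def ticks(start: datetime, end: datetime, interval: timedelta, count: int = 1):
    """Yield (series, step, now) rows, one full series after another."""
    if interval < timedelta(microseconds=1):
        raise ValueError("interval must be at least one microsecond")
    for series in range(count):
        step = 0
        # Assumption: `end` is treated as exclusive in this sketch.
        while start + interval * step < end:
            # Each tick is a pure function of start, interval, and step.
            yield series, step, start + interval * step
            step += 1


start = datetime.fromisoformat("2026-01-01T09:30:00+00:00")
five_min = timedelta(minutes=5)
short = list(ticks(start, start + timedelta(minutes=15), five_min))
full = list(ticks(start, start + timedelta(minutes=30), five_min, count=3))
print(full[:3] == short)  # prefix-stable: a longer window keeps the same first rows
```

Computing each tick as start + interval * step, rather than accumulating now += interval, is what keeps the prefix identical when the window grows; the row count is count multiplied by the ticks per series.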

Composes with the existing <variable> mechanism for multi-source merges (e.g. join each tick with a sensor-metadata CSV via <variable source="meta.csv" cyclic="True"> inside the same <generate>), with <key condition="..."> filtering, and with <nestedKey> sub-scopes. The ts namespace is visible everywhere a <key script> runs. See tests_ce/integration_tests/test_timeseries/ for committed DSL fixtures + proofs (including pagination invariance).


MCP Server: AI Agent Integration

DATAMIMIC CE ships with a Model Context Protocol (MCP) server, making it directly callable from AI agents, Claude, Cursor, and any MCP-compatible runtime.

pip install datamimic-ce[mcp]

export DATAMIMIC_MCP_HOST=127.0.0.1
export DATAMIMIC_MCP_PORT=8765
export DATAMIMIC_MCP_API_KEY=your-key
datamimic-mcp

Agents can call generate with a domain, seed, count, and locale and receive deterministic, provenance-hashed output, which makes DATAMIMIC a natural test data runtime for agent-driven workflows.
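
At the protocol level, an MCP tool invocation is a JSON-RPC 2.0 request with method tools/call. The sketch below only builds such a message with the standard library; it does not talk to a server. The build_generate_call helper is hypothetical, and nesting the arguments under "args" is an assumption taken from the client example in this section.

```python
import json


def build_generate_call(request_id: int, domain: str, seed: int,
                        count: int, locale: str) -> str:
    """Serialise one MCP `tools/call` request for the `generate` tool."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "generate",
            # Assumption: arguments are nested under "args", as in the client example.
            "arguments": {"args": {"domain": domain, "locale": locale,
                                   "seed": seed, "count": count}},
        },
    }
    # Sorted keys keep the serialised request itself reproducible.
    return json.dumps(message, sort_keys=True)


wire = build_generate_call(1, "person", 42, 2, "en_US")
print(wire)
```

In practice an MCP client library builds and sends this message for you; seeing the shape mainly helps when logging or replaying agent calls.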

import anyio, json
from fastmcp.client import Client
from datamimic_ce.mcp.models import GenerateArgs
from datamimic_ce.mcp.server import create_server

async def main():
    args = GenerateArgs(domain="person", locale="en_US", seed=42, count=2)
    payload = args.model_dump(mode="python")
    async with Client(create_server()) as c:
        a = await c.call_tool("generate", {"args": payload})
        b = await c.call_tool("generate", {"args": payload})
        # Determinism proof: identical hashes across calls
        assert (json.loads(a[0].text)["determinism_proof"]["content_hash"]
             == json.loads(b[0].text)["determinism_proof"]["content_hash"])

anyio.run(main)

📘 Full guide: docs/mcp_quickstart.md


Where CE fits on its own

Most teams adopt CE for one of three reasons. EE is not required for any of them.

1. Reproducible test data for CI/CD pipelines. Pin a seed against the generate_domain facade, or hand a seeded random.Random to any domain service, and you get byte-identical output across runs and machines. Both layers are gated on every CI run by tests_ce/architecture/. Regression tests stop being flaky because the input data is stable across runs.

from datamimic_ce.domains.facade import generate_domain

response = generate_domain({
    "domain": "person", "version": "v1", "count": 1,
    "seed": "ci-pipeline-42", "locale": "en_US",
    "clock": "2026-01-01T00:00:00Z",
})
# Same input → same output, every machine, every run.

2. Deterministic data backend for AI agents and LLM tooling. The bundled MCP server (pip install datamimic-ce[mcp]) exposes generate as an MCP tool. Agents call it with seed, locale, count; outputs ship with a determinism_proof.content_hash so the same call can be re-executed and verified later. This is useful for agent regression tests and for any workflow where the data the agent saw needs to be reconstructable.

3. Pseudonymization of staging and QA exports. Manual model in CE (XML pipeline), no scanner license required. Seeded mode for stable regression test data; non-seeded mode for one-time deliveries with maximized privacy posture. See the Pseudonymization section above.


Where DATAMIMIC fits in your compliance program

DATAMIMIC produces evidence and reproducible artifacts that support compliance work. It does not replace your DPO, your CISO, or your auditor. The following are pointers for where DATAMIMIC outputs commonly slot into established programs:

Both editions produce reproducible artefacts. CE covers single-system fixtures and provenance evidence; multi-system audit evidence with role-based dashboards is EE.

| Regulation / standard | Where DATAMIMIC contributes |
| --- | --- |
| DORA (Reg. 2022/2554), Art. 24 (testing of ICT tools, systems and processes; non-TLPT scope) | Reproducible test datasets for non-TLPT resilience tests; deterministic data fixtures for ICT testing programmes |
| ISO/IEC 27701:2019, A.7.2.8 (records related to processing PII) and A.7.4.5 (PII minimisation) | Synthetic data in lieu of PII in non-production environments; documented model definitions as supporting evidence |
| HIPAA Security Rule, §164.312 technical safeguards (US Covered Entities / Business Associates only) | Synthetic Patient/MedicalDevice/MedicalProcedure data for dev and test environments without ePHI exposure |
| GDPR, Art. 4(5) pseudonymization definition; Art. 25 privacy by design; Art. 32 security of processing | Seeded pseudonymization with deterministic mapping; non-seeded mode for stronger privacy posture |
| PCI DSS 4.0, Req. 6.5.5 (live PANs prohibited in test/development) | Synthetic PAN generation for test environments; deterministic tokenisation reproducible across runs |

These pointers do not constitute legal advice or a compliance attestation. Consult your DPO, CISO, or qualified counsel for formal compliance determinations. Full anonymization status under GDPR depends on re-identification risk across the complete dataset; see the pseudonymization disclaimer above.
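
As background for the PCI DSS row, test card numbers are normally expected to pass the Luhn checksum so that downstream validation accepts them. The sketch below shows seeded, Luhn-valid number generation with the standard library only. It is not the CE CreditCard generator: synthetic_pan and the "400000" prefix are hypothetical choices for illustration.

```python
import random


def luhn_check_digit(partial: str) -> int:
    """Check digit that makes `partial + digit` pass the Luhn test."""
    total = 0
    # Walk right to left; the rightmost digit of `partial` is doubled first.
    for position, char in enumerate(reversed(partial)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def luhn_valid(number: str) -> bool:
    """True when the final digit matches the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


def synthetic_pan(rng: random.Random, prefix: str = "400000", length: int = 16) -> str:
    """Build a Luhn-valid test number from a prefix and seeded random digits."""
    body = prefix + "".join(str(rng.randrange(10))
                            for _ in range(length - len(prefix) - 1))
    return body + str(luhn_check_digit(body))


pan = synthetic_pan(random.Random(42))
print(luhn_valid(pan), pan == synthetic_pan(random.Random(42)))
```

Seeding the generator means the same test PAN appears in every run, so fixtures and expected results can be committed alongside the tests.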


Architecture

CE and EE share the DATAMIMIC DSL and the determinism contract. The execution layer is separate: CE is a Python execution engine using multiprocessing (with optional Ray for distribution); EE is an independently-optimised execution engine with a Rust fastpath, ML/auto-regressive generation, keyset and manifest building from live schemas, and optimised distributed execution at billion-record scale.

╔══════════════════════════════════════════════════════════════════╗
║              DATAMIMIC ENTERPRISE PLATFORM (EE)                  ║
║                                                                  ║
║  ┌──────────────────────────────────────────────────────────┐    ║
║  │  PLATFORM LAYER                                          │    ║
║  │  UI · RBAC · Governance · Audit Dashboards               │    ║
║  │  DataWorkbench · PII Scanner · Pseudonymization Builder  │    ║
║  │  Scheduler · Task Runner · CI/CD · Template Engine       │    ║
║  └──────────────────────────────────────────────────────────┘    ║
║                                                                  ║
║  ┌──────────────────────────────────────────────────────────┐    ║
║  │  EE CORE  (separately maintained, more advanced than CE) │    ║
║  │                                                          │    ║
║  │  Rust fastpath for performance-critical paths            │    ║
║  │  ML / auto-regressive engine for complex distributions   │    ║
║  │  Keyset and manifest building from live DB schemas       │    ║
║  │  Optimised distributed execution at billion-record scale │    ║
║  │  Runtime profiles: Performance · Balanced · Flexibility  │    ║
║  │  Deep nested evaluation · Conditions · Rulesets          │    ║
║  │  Structured error catalog · Per-stage execution logging  │    ║
║  └──────────────────────────────────────────────────────────┘    ║
╚══════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════╗
║              DATAMIMIC COMMUNITY EDITION (CE)  - this repo       ║
║                                                                  ║
║  Determinism Kit · Domain Services · Schema Validators           ║
║  Synthetic Generation · Pseudonymization (manual model)          ║
║  Python API · XML Pipelines · CLI · MCP Server                   ║
╚══════════════════════════════════════════════════════════════════╝

         ↓              ↓              ↓              ↓
    PostgreSQL       Oracle         MongoDB      CSV / JSON / XML

EE adds Kafka, EDIFACT, SWIFT MT, HL7 v2.x, and HL7 FHIR as additional targets; see Supported systems below. Both editions share the DATAMIMIC DSL and determinism contract.


Supported systems

| System | CE | EE | Notes |
| --- | --- | --- | --- |
| PostgreSQL | ✅ | ✅ | EE adds schema introspection and referential integrity |
| MySQL | ✅ | ✅ | |
| Oracle | ✅ | ✅ | EE production-validated in regulated banking environments |
| MS SQL Server | ✅ | ✅ | |
| SQLite | ✅ | ✅ | Lightweight CI/CD fixtures |
| MongoDB | ✅ | ✅ | EE adds nested document generation |
| CSV / JSON / XML | ✅ | ✅ | Flat file pipelines |
| Apache Kafka | - | ✅ | Real-time streaming, payment scenarios |
| HL7 v2.x | - | ✅ | Test/training output via template engine |
| HL7 FHIR | - | ✅ | Test/training output via template engine |
| EDIFACT / SWIFT MT | - | ✅ | Test/training output only; does not satisfy SWIFT CSCF v2025 secure-zone controls (1.1 environment protection, 1.4 internet restriction). Generated messages must not be transmitted from a CSP-attested secure zone. |

CE domains

| Domain | Services available |
| --- | --- |
| Healthcare | Patient, Doctor, Hospital, MedicalDevice, MedicalProcedure |
| Finance | Bank, BankAccount, CreditCard, Transaction |
| Insurance | InsuranceCompany, InsuranceProduct, InsurancePolicy, InsuranceCoverage |
| E-commerce | Order, Product |
| Public sector | AdministrationOffice, EducationalInstitution, PoliceOfficer |
| Demographics | Person (DE / US / VN locale packs), Address |

All services are versioned and seeded; each generation emits a provenance hash suitable as evidence in audit reviews. Domain services can be used directly via constructor injection, or driven through the higher-level generate_domain({...}) facade for seed/locale/clock/count parameterisation (currently supports person, address, patient, doctor at v1).


CLI reference

# Initialize a new project
datamimic init ./my-scenario

# Validate an XML descriptor without executing it
datamimic validate ./my-scenario/datamimic.xml

# Run a scenario
datamimic run ./my-scenario/datamimic.xml

# Demos
datamimic demo list
datamimic demo create healthcare-example
datamimic demo create --all --target ./my_demos

# System and version info
datamimic info
datamimic version

Documentation

| Resource | Link |
| --- | --- |
| Full documentation | docs.datamimic.io |
| MCP quickstart | docs/mcp_quickstart.md |
| Developer guide | docs/developer_guide.md |
| Enterprise platform | datamimic.io |
| GitHub Discussions | Discussions |
| Issue tracker | Issues |
| Email support | support@rapiddweller.com |

Contributing

See CONTRIBUTING.md. CE is MIT licensed and community contributions are welcome.

The CE engine is the foundation. If you are building integrations, domain extensions, or MCP tooling on top of DATAMIMIC, we want to hear from you.


License

MIT; see LICENSE.

The DATAMIMIC Enterprise Platform (EE) is a commercial product. Contact us for licensing.


DATAMIMIC: Deterministic, governed test data for regulated enterprises.

datamimic.io | Book a demo | LinkedIn

About

Model-driven synthetic test data for CI/CD and analytics - deterministic, privacy-preserving, and domain-aware. Includes Python APIs, XML pipelines, and MCP/IDE integration to orchestrate realistic datasets for finance, healthcare, and other regulated environments.
