Objective
Improve the existing rule-based synthetic data generation system by introducing:
- keyword-based pruning and lightweight uncertainty-aware ranking for generator selection
- a minimal quantitative evaluation layer
without modifying the core generator implementations.
──────────────────────────────────────────────────────────────
│ 4. Governance & Human Review (IG / Clinicians)             │
──────────────────────────────────────────────────────────────
│ 3. Evaluation Layer (Fidelity / Utility / Privacy metrics) │ - evaluate
──────────────────────────────────────────────────────────────
│ 2. Synthetic Data Generation Layer                         │
──────────────────────────────────────────────────────────────
│ 1. Schema creation + Generator Selection Layer             │ - configure-generators
──────────────────────────────────────────────────────────────
Enhancement to the generator selection layer
Improve the propose command by adding keyword-based pruning and lightweight uncertainty-aware ranking of candidate generators, without modifying generator logic or fit computation.
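One possible shape for the pruning and ranking steps is sketched below. Everything in it is an assumption made for illustration: the Candidate record, its keywords and fit_std fields, the helper names, and the scoring rule (fit minus a multiple of its spread). It assumes the existing fit computation yields a score where higher is better, and it only consumes that score without changing how it is produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One generator proposed for a column. All fields are hypothetical."""
    name: str            # generator identifier
    keywords: frozenset  # column-name tokens this generator is meant for
    fit: float           # score from the existing fit computation (assumed: higher is better)
    fit_std: float       # assumed spread of that score, e.g. across resamples


def prune_by_keyword(column_name, candidates):
    """Keep candidates whose keywords match a token of the column name.

    Falls back to the full list when nothing matches, so pruning never
    leaves a column without candidates.
    """
    tokens = set(column_name.lower().replace("-", "_").split("_"))
    kept = [c for c in candidates if c.keywords & tokens]
    return kept or list(candidates)


def rank_with_uncertainty(candidates, penalty=1.0):
    """Sort by fit minus penalty * fit_std, best first.

    A high but unstable fit is ranked below a slightly lower, stable one.
    """
    return sorted(candidates, key=lambda c: c.fit - penalty * c.fit_std, reverse=True)


# Toy data, not taken from the real system.
candidates = [
    Candidate("date_uniform", frozenset({"date", "dob"}), fit=0.80, fit_std=0.05),
    Candidate("date_seasonal", frozenset({"date"}), fit=0.85, fit_std=0.20),
    Candidate("postcode", frozenset({"postcode"}), fit=0.90, fit_std=0.01),
]
shortlist = prune_by_keyword("birth_date", candidates)
ranked = rank_with_uncertainty(shortlist)
print([c.name for c in ranked])  # ['date_uniform', 'date_seasonal']
```

In this toy run the postcode generator has the best raw fit but is pruned because none of its keywords appear in the column name, and date_seasonal drops below date_uniform once its larger spread is penalised. The penalty weight would need tuning against real columns.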
New evaluation layer
Add a new evaluation layer that measures how closely the synthetic data matches the real data (fidelity), how useful it is for analysis (utility), and whether it poses any privacy risks (privacy). To support this, we will add a new command, evaluate.
Minimal metrics:
- fidelity metrics (how well the distributions match the original data, e.g. KL divergence, correlation matrix distance):
  - column-level fidelity evaluates each column independently
  - table-level fidelity examines relationships and dependencies within a table
  - dataset-level fidelity evaluates relationships across tables within the dataset
- dataset-level utility metric (whether the data preserves analytical usefulness for downstream tasks, e.g. downstream ML performance, accuracy drop, calibration shift)
- dataset-level privacy metric (whether there is any risk of revealing sensitive or unique records, e.g. nearest-neighbour distance ratio)
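To make three of the example metrics concrete, here is a minimal standard-library sketch of a column-level KL divergence, a table-level correlation matrix distance, and a nearest-neighbour distance ratio. The function names, the equal-width binning, the smoothing constant, the Frobenius norm, and the Euclidean distance are all assumptions chosen for illustration, not decisions already made for the evaluate command. Utility metrics and cross-table fidelity are not covered.

```python
import math


def kl_divergence(real, synth, bins=10, eps=1e-9):
    """Column-level fidelity: KL(real || synth) over shared equal-width bins."""
    lo = min(min(real), min(synth))
    hi = max(max(real), max(synth))
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def hist(values):
        counts = [eps] * bins  # eps smoothing keeps log() finite for empty bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        total = sum(counts)
        return [c / total for c in counts]

    p, q = hist(real), hist(synth)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError for a constant column."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(real_cols, synth_cols):
    """Table-level fidelity: Frobenius distance between correlation matrices.

    Each argument is a list of numeric columns in the same order. Only the
    off-diagonal pairs are visited, since the diagonals are both 1.
    """
    total = 0.0
    for i in range(len(real_cols)):
        for j in range(i + 1, len(real_cols)):
            diff = pearson(real_cols[i], real_cols[j]) - pearson(synth_cols[i], synth_cols[j])
            total += 2 * diff ** 2  # the matrix is symmetric, so count (i, j) and (j, i)
    return math.sqrt(total)


def nn_distance_ratio(real_rows, synth_rows):
    """Privacy: per synthetic row, distance to the nearest real row divided by
    distance to the second-nearest real row. Needs at least two real rows."""
    ratios = []
    for s in synth_rows:
        d1, d2 = sorted(math.dist(s, r) for r in real_rows)[:2]
        ratios.append(d1 / d2 if d2 > 0 else 0.0)  # 0.0 when s coincides with real rows
    return ratios


# Toy data, not taken from any real dataset.
real_cols = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]]
synth_cols = [[1, 2, 3, 4, 5], [10, 8, 6, 4, 2]]
print(kl_divergence(real_cols[0], synth_cols[0]))
print(correlation_distance(real_cols, synth_cols))
print(nn_distance_ratio([[0, 0], [3, 4]], [[0, 0], [0, 1]]))
```

A ratio close to 0 means a synthetic row sits much closer to one real record than to any other, which is the pattern worth flagging for review. Ratios close to 1 are less concerning. Any thresholds would need to be agreed with IG rather than fixed in code.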
The goal is not to replace human oversight, but to make it more informed, consistent, and auditable. Users can use the fidelity metrics to tune generators and regenerate data, while the utility and privacy metrics are likely to be of greater interest to IG reviewers.