Enhancement to generators selection layer and new evaluation layer #112

Description

@yhong123

Objective

Improve the existing rule-based synthetic data generation system by introducing:

  • keyword-based pruning and lightweight uncertainty-aware ranking for generator selection
  • a minimal quantitative evaluation layer

without modifying the core generator implementations.

┌────────────────────────────────────────────────────────────┐
│ 4. Governance & Human Review (IG / Clinicians)             │
├────────────────────────────────────────────────────────────┤
│ 3. Evaluation Layer (Fidelity / Utility / Privacy metrics) │ - evaluate
├────────────────────────────────────────────────────────────┤
│ 2. Synthetic Data Generation Layer                         │
├────────────────────────────────────────────────────────────┤
│ 1. Schema creation + Generator Selection Layer             │ - configure-generators
└────────────────────────────────────────────────────────────┘

Enhancement to generators selection layer

Improve the propose command by adding keyword-based pruning and lightweight uncertainty-aware ranking of candidate generators, without modifying generator logic or fit computation.
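To make the keyword-based pruning and uncertainty-aware ranking idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration, not the project's API: the `Candidate` fields, the generator names, the `KEYWORD_GENERATORS` map, the convention that a higher fit score is better, and the use of a per-candidate `fit_std` as the uncertainty estimate. It only consumes fit scores computed elsewhere, so generator logic and fit computation stay untouched.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str       # hypothetical generator name
    fit: float      # fit score computed elsewhere (assumed: higher is better)
    fit_std: float  # assumed uncertainty of that score, e.g. spread over resamples


# Hypothetical map from column-name keywords to plausible generators.
KEYWORD_GENERATORS = {
    "age": {"normal_generator", "uniform_generator"},
    "date": {"date_generator"},
}


def prune_by_keyword(column_name, candidates):
    """Keep candidates allowed by any matching keyword; keep all if none match."""
    lowered = column_name.lower()
    allowed = set()
    for keyword, names in KEYWORD_GENERATORS.items():
        if keyword in lowered:  # plain substring match, deliberately simple
            allowed |= names
    if not allowed:
        return list(candidates)
    return [c for c in candidates if c.name in allowed]


def rank_with_uncertainty(candidates, penalty=1.0):
    """Sort by fit minus a penalty proportional to the score's uncertainty."""
    return sorted(candidates, key=lambda c: c.fit - penalty * c.fit_std, reverse=True)


candidates = [
    Candidate("normal_generator", fit=0.90, fit_std=0.30),
    Candidate("uniform_generator", fit=0.80, fit_std=0.05),
    Candidate("date_generator", fit=0.95, fit_std=0.01),
]
shortlist = rank_with_uncertainty(prune_by_keyword("patient_age", candidates))
```

In this sketch a candidate with a slightly lower but more stable fit can outrank one with a higher but noisier fit; the `penalty` weight controls how strongly uncertainty is discounted.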

New evaluation layer

Add a new evaluation layer that measures how closely the synthetic data matches the real data, how useful it is for analysis, and whether it poses any privacy risks. To support this, we will add a new command, evaluate.
Minimal metrics:

  • fidelity metrics (how well the distributions match the original data, e.g. KL divergence, correlation matrix distance), at three levels:
    • column-level fidelity evaluates each column independently
    • table-level fidelity examines relationships and dependencies within a table
    • dataset-level fidelity evaluates relationships across tables within the dataset
  • dataset-level utility metric (whether the data preserves analytical usefulness for downstream tasks, e.g. downstream ML performance, accuracy drop, calibration shift)
  • dataset-level privacy metric (whether there is any risk of revealing sensitive or unique records, e.g. nearest-neighbour distance ratio)
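Two of the example metrics named above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the planned implementation of evaluate: the function names, the additive smoothing constant, and the choice of Euclidean distance are all assumptions, and definitions of the nearest-neighbour distance ratio vary between tools.

```python
import math
from collections import Counter


def kl_divergence(real, synthetic, smoothing=1e-9):
    """Column-level fidelity: KL(real || synthetic) for one categorical column, in nats."""
    categories = set(real) | set(synthetic)
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    k = len(categories)
    total = 0.0
    for cat in categories:
        # Additive smoothing (assumed) avoids log(0) for categories missing on one side.
        p = (real_counts[cat] + smoothing) / (len(real) + smoothing * k)
        q = (synth_counts[cat] + smoothing) / (len(synthetic) + smoothing * k)
        total += p * math.log(p / q)
    return total


def nndr(synthetic_row, real_rows):
    """Privacy: distance to the nearest real row divided by distance to the second nearest.

    Values near 0 mean the synthetic row sits much closer to one real record
    than to any other, which may indicate a near-copy of that record.
    """
    distances = sorted(math.dist(synthetic_row, row) for row in real_rows)
    if len(distances) < 2:
        raise ValueError("need at least two real rows")
    nearest, second = distances[0], distances[1]
    # Exact duplicates of real rows are reported as the riskiest value.
    return 0.0 if second == 0 else nearest / second


real_rows = [(0.0, 0.0), (10.0, 0.0)]
close_ratio = nndr((1.0, 0.0), real_rows)  # close to the first real row
mid_ratio = nndr((5.0, 0.0), real_rows)    # equidistant from both real rows
```

A KL divergence of 0 means the two category distributions match; larger values mean the synthetic column drifts further from the real one. Numeric columns would need binning first, which this sketch leaves out.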

The goal is not to replace human oversight, but to make it more informed, consistent, and auditable. Users can use the fidelity metrics to tune generators and regenerate data, while the utility and privacy metrics are likely to be of greater interest to IG reviewers.
