Enhancement to generators selection layer and new evaluation layer #112

**Objective**

Improve the existing rule-based synthetic data generation system by introducing:
- keyword-based pruning and lightweight uncertainty-aware ranking for generator selection
- a minimal quantitative evaluation layer

without modifying the core generator implementations.

 ──────────────────────────────────────────────────────────────
 │ 4. Governance & Human Review (IG / Clinicians)             │
 ──────────────────────────────────────────────────────────────
 │ 3. Evaluation Layer (Fidelity / Utility / Privacy metrics) │  - `evaluate`
 ──────────────────────────────────────────────────────────────
 │ 2. Synthetic Data Generation Layer                         │
 ──────────────────────────────────────────────────────────────
 │ 1. Schema Creation + Generator Selection Layer             │  - `configure-generators`
 ──────────────────────────────────────────────────────────────


**Enhancement to the generator selection layer**

Improve the `propose` command by:
- adding a keyword-based pruning method (as suggested by Tim in sqlsynthgen #243)
- normalising heterogeneous raw fit scores into a probabilistic ranking, then shortening the output with entropy-based truncation

without modifying generator logic or fit computation.
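The two steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the function names, the keyword metadata, the "lower raw fit is better" convention, the z-score plus softmax normalisation, and the perplexity-based cut-off are all hypothetical choices, and the pruning rule is only one possible reading of the suggestion in sqlsynthgen #243.

```python
import math


def prune_by_keywords(column_name, generator_keywords):
    """Keep generators whose keywords match the column name.

    generator_keywords maps a generator name to a list of keywords
    (hypothetical metadata). Generators with no keywords are treated
    as generic and are always kept.
    """
    name = column_name.lower()
    return [
        generator
        for generator, keywords in generator_keywords.items()
        if not keywords or any(k.lower() in name for k in keywords)
    ]


def rank_generators(fits, temperature=1.0):
    """Turn raw fit scores into probabilities, best first.

    Assumption: a lower raw fit means a better match. Scores are
    z-scored so that differently scaled fits become comparable,
    then passed through a softmax.
    """
    if not fits:
        return []
    names = list(fits)
    values = [fits[n] for n in names]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    logits = [-(v - mean) / (std * temperature) for v in values]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)


def entropy_truncate(ranked):
    """Keep roughly exp(entropy) candidates (the perplexity of the ranking).

    A confident ranking has low entropy and yields a short list;
    a flat ranking keeps every candidate.
    """
    entropy = -sum(p * math.log(p) for _, p in ranked if p > 0)
    keep = max(1, math.ceil(math.exp(entropy)))
    return ranked[: min(keep, len(ranked))]


# Usage with made-up generator names and scores
candidates = prune_by_keywords(
    "patient_postcode",
    {"postcode_gen": ["postcode", "zip"], "email_gen": ["email"], "uniform": []},
)
# candidates == ["postcode_gen", "uniform"]
shortlist = entropy_truncate(rank_generators({"postcode_gen": 0.1, "uniform": 4.8}))
```

Neither function touches generator logic or the fit computation; both operate only on the names and scores that are already produced.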


**New evaluation layer**

Add a new evaluation layer that measures how closely the synthetic data matches the real data, how useful it is for analysis, and whether it poses any privacy risks. To support this, we will add a new command, `evaluate`.

Minimal metrics:
- fidelity metrics (how well the distributions match the original data, e.g. KL divergence, correlation matrix distance)
    - column-level fidelity evaluates each column independently
    - table-level fidelity examines relationships and dependencies within a table
    - dataset-level fidelity evaluates relationships across tables within the dataset
- dataset-level utility metric (whether the data preserves analytical usefulness for downstream tasks, e.g. downstream ML performance, accuracy drop, calibration shift)
- dataset-level privacy metric (whether there is any risk of revealing sensitive or unique records, e.g. nearest-neighbour distance ratio)
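As a rough illustration of the fidelity metrics listed above, the sketch below computes a column-level KL divergence for a categorical column and a table-level distance between correlation matrices. The function names and data layout are hypothetical, the smoothing constant is an arbitrary choice, and the table-level distance is taken here to be the Frobenius norm of the difference between the two Pearson correlation matrices, which is only one of several possible definitions.

```python
import math
from collections import Counter


def kl_divergence(real, synthetic, smoothing=1e-9):
    """KL(real || synthetic) over the categories of a single column."""
    categories = set(real) | set(synthetic)
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    # Additive smoothing keeps the result finite when a category
    # appears on only one side.
    real_total = len(real) + smoothing * len(categories)
    synth_total = len(synthetic) + smoothing * len(categories)
    kl = 0.0
    for category in categories:
        p = (real_counts[category] + smoothing) / real_total
        q = (synth_counts[category] + smoothing) / synth_total
        kl += p * math.log(p / q)
    return kl


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(table):
    """table maps a column name to a list of numeric values."""
    columns = sorted(table)
    return [[pearson(table[a], table[b]) for b in columns] for a in columns]


def correlation_distance(real_table, synth_table):
    """Frobenius norm of the difference between the correlation matrices.

    Both tables are assumed to have the same numeric columns.
    """
    real_corr = correlation_matrix(real_table)
    synth_corr = correlation_matrix(synth_table)
    return math.sqrt(
        sum(
            (r - s) ** 2
            for real_row, synth_row in zip(real_corr, synth_corr)
            for r, s in zip(real_row, synth_row)
        )
    )
```

Both functions return 0 when the synthetic data reproduces the real distribution or correlation structure exactly, and grow as the two diverge, so they can be used as a tuning signal when regenerating data.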

The goal is not to replace human oversight, but to make it more informed, consistent, and auditable. Users can apply the fidelity metrics to tune generators and regenerate data, while the utility and privacy metrics are likely to be of greater interest to IG reviewers.
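For the privacy side, the nearest-neighbour distance ratio mentioned in the metric list could be computed along these lines. This is a minimal sketch assuming rows are already encoded as equal-length numeric tuples and that Euclidean distance is appropriate; the function name and the handling of the degenerate case are hypothetical choices.

```python
import math


def nearest_neighbour_distance_ratios(synthetic_rows, real_rows):
    """For each synthetic row, distance to the nearest real row divided by
    distance to the second-nearest real row.

    Ratios close to 0 mean the synthetic row sits unusually close to one
    particular real record; ratios close to 1 mean it is not tied to any
    single record.
    """
    if len(real_rows) < 2:
        raise ValueError("need at least two real rows to form a ratio")
    ratios = []
    for synthetic in synthetic_rows:
        distances = sorted(math.dist(synthetic, real) for real in real_rows)
        nearest, second = distances[0], distances[1]
        # If both distances are zero the ratio is undefined; it is treated
        # as highest risk here, which is a conservative, arbitrary choice.
        ratios.append(nearest / second if second > 0 else 0.0)
    return ratios


# Usage with made-up points
real = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
synthetic = [(0.1, 0.0), (5.0, 5.0)]
ratios = nearest_neighbour_distance_ratios(synthetic, real)
# The first ratio is small (near-copy of a real row); the second is 1.0.
```

A summary such as the minimum or a low percentile of these ratios is the kind of figure that could be reported for review, though the threshold for concern would need to be agreed with the reviewers.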

