feat(text-metrics): split oneig_alignment #646
Conversation
Cursor Bugbot has reviewed your changes and found 3 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 2627d78.
```python
Benchmark(
    name="OneIG Portrait",
    description="OneIG subset: people and portraits.",
    metrics=["oneig_alignment"],
```
OneIG subset benchmarks fail to register at import
High Severity
The new OneIG benchmarks create lookup keys (e.g., OneIGAnimeStylization) that are missing from base_datasets. This causes BenchmarkRegistry._register to raise a ValueError, failing module import for pruna.evaluation.benchmarks and its dependents. The original OneIG entry was also removed.
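For reference, a minimal sketch of the failure mode as described above; the registry internals, dict contents, and names below are assumptions for illustration, not the actual pruna code.

```python
# Hypothetical reconstruction of the reported failure; not the real pruna registry.
base_datasets = {"OneIG": "oneig"}  # the new subset keys (e.g. "OneIGAnimeStylization") are absent


def _register(benchmark_key: str) -> None:
    # Mirrors the described check: every benchmark key must have a base dataset entry.
    if benchmark_key not in base_datasets:
        raise ValueError(f"No base dataset registered for {benchmark_key!r}")


_register("OneIGAnimeStylization")  # raises ValueError, so the benchmarks module fails to import
```

The likely fix is either to add matching base_datasets entries for each new subset or to key the subsets off the existing OneIG entry.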
| assert "21" not in record["questions"] | ||
| assert "21" not in record["dependencies"] | ||
| assert record["questions"] == {"1": "Is there a cat?"} | ||
| assert record["dependencies"] == {"1": [0]} |
OneIG record test calls function with wrong signature
Medium Severity
The test_to_oneig_record_strips_null_questions_and_dependencies test calls _to_oneig_record with an incorrect number of arguments, causing a TypeError. The test also asserts that null-valued questions and dependencies are stripped, a behavior not implemented in _to_oneig_record.
| metrics=["oneig_alignment"], | ||
| task_type="text_to_image", | ||
| reference="https://arxiv.org/abs/2506.07977", | ||
| ), |
OneIG Multilingualism subset has no alignment questions
Medium Severity
The OneIG Multilingualism benchmark wires the oneig_alignment metric, but _CATEGORY_TO_QD in prompt.py only maps Anime_Stylization, Portrait, and General_Object to Q_D files. Multilingualism rows therefore receive an empty questions dict in _to_oneig_record, and OneIGAlignmentMetric.update raises ValueError whenever questions is empty. Running this benchmark would error out on every sample rather than produce alignment scores.
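Two possible directions, sketched under the assumption that `_CATEGORY_TO_QD` is a plain dict keyed by category name; the file names below are guesses, not the actual files.

```python
# Option A (only if a Multilingualism Q&D file actually exists upstream): extend the mapping.
_CATEGORY_TO_QD = {
    "Anime_Stylization": "anime_stylization_qd.json",  # guessed file names
    "Portrait": "portrait_qd.json",
    "General_Object": "general_object_qd.json",
    # "Multilingualism": "multilingualism_qd.json",
}

# Option B: drop "oneig_alignment" from the OneIG Multilingualism Benchmark entry until
# questions exist for that subset, so OneIGAlignmentMetric.update never sees an empty dict.
```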
Isolates qa_accuracy metric implementation and GenEval benchmark wiring so it can be reviewed independently before stacking the remaining text metrics. Made-with: Cursor
Adds oneig_alignment metric implementation, its focused tests, and benchmark subset wiring while keeping reasoning and text-rendering metrics for later stacked PRs. Made-with: Cursor
Force-pushed from 15db155 to 04ab2e5
Force-pushed from 2627d78 to c51653e
Force-pushed from 04ab2e5 to 161223e
begumcig left a comment
Thank you so much David, this is really super close to the original implementation right now! Thank you so much for implementing the dependency masks and per-prompt averaging! Just added a few nitpicky details and some general questions about the metric logic, but this already looks super good to me!! Also, I noticed the tests haven't run for this PR; do you have any idea why? (This is probably something I should know as the package team lol) Thanks again :)
| """ | ||
| inputs = metric_data_processor(x, gt, outputs, self.call_type) | ||
| images = _process_images(inputs[0]) | ||
| aux_list = inputs[1] if len(inputs) > 1 else [] |
If the call type is `y_gt`, I don't see this ever not having `len(inputs) > 1`.
```python
inputs = metric_data_processor(x, gt, outputs, self.call_type)
images = _process_images(inputs[0])
aux_list = inputs[1] if len(inputs) > 1 else []
if isinstance(aux_list, torch.Tensor):
```
I am a bit confused by this: if `aux_list` is a `torch.Tensor` we turn it into a list, but below we index it and expect a list of dicts. How does a tensor turn into a list of dicts? Maybe if it's a tensor, it's already not what we are looking for?
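If a tensor can indeed never carry the per-sample question dicts this metric needs, one option (just a sketch, with a made-up helper name) would be to reject it up front rather than converting it:

```python
import torch
from typing import Any


def _ensure_aux_dicts(aux_list: Any) -> list[dict]:
    """Sketch of a stricter guard: fail loudly instead of tolist() followed by indexing errors."""
    if isinstance(aux_list, torch.Tensor):
        raise TypeError("oneig_alignment expects a list of question/dependency dicts, got a Tensor")
    return list(aux_list)
```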
```python
raw_scores_list = self.vlm.score(
    [image] * len(question_texts),
    question_texts,
    ["Yes"] * len(question_texts),
```
I am asking this just out of curiosity, as I also don't know the answer: can the expected answer here be "No" instead of "Yes", or perhaps a number, etc.?
```python
    metric_name: str = "oneig_alignment"
    metric_units: str = "alignment"

    def update(self, x: list[Any] | torch.Tensor, gt: torch.Tensor, outputs: torch.Tensor) -> None:
```
This is more of a general comment, but I noticed the base class for this, QAAccuracyMetric, also implements many of the lines in its update function (processing the input, checking whether questions exist in the data, calling the vlm.score function). If these genuinely overlap for many metrics and only the middle "computation" part differs from metric to metric, maybe we should make these functions standalone in the base class and reuse them in the child classes?
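A rough shape of what that could look like; the class and method names here are illustrative, not the existing API:

```python
from abc import ABC, abstractmethod
from typing import Any


class QAVLMMetricBase(ABC):
    """Hypothetical shared base that lifts the common steps out of each metric's update()."""

    def update(self, x: Any, gt: Any, outputs: Any) -> None:
        images, aux_list = self._prepare_inputs(x, gt, outputs)  # shared input processing
        for image, aux in zip(images, aux_list):
            questions = self._require_questions(aux)             # shared "questions exist" check
            raw_scores = self._score_with_vlm(image, questions)  # shared vlm.score call
            self._accumulate(raw_scores, aux)                    # the only metric-specific step

    def _prepare_inputs(self, x: Any, gt: Any, outputs: Any) -> tuple[list[Any], list[dict]]:
        raise NotImplementedError  # would hold the current metric_data_processor/_process_images logic

    def _require_questions(self, aux: dict) -> dict[str, str]:
        questions = aux.get("questions") or {}
        if not questions:
            raise ValueError("Record has no questions to score.")
        return questions

    def _score_with_vlm(self, image: Any, questions: dict[str, str]) -> list[float]:
        raise NotImplementedError  # would wrap the shared self.vlm.score(...) call

    @abstractmethod
    def _accumulate(self, raw_scores: list[float], aux: dict) -> None:
        """qa_accuracy, oneig_alignment, etc. each implement their own aggregation here."""
```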
```
@@ -0,0 +1,234 @@
# Copyright 2025 - Pruna AI GmbH. All rights reserved.
```
Thank you so much for the refactoring, David! I think this is much closer to the original implementation for OneIG Alignment! A couple of things I have noticed that we could perhaps discuss, whether or not we want to address them:
- In the original Alignment score, I believe for each prompt we generate a grid of images (the default being 2x2), so for each prompt the VLM sees multiple images, which accounts for problems that can come from the seed. I believe here we process images one by one, which is fine by me, but perhaps we should address this in the docstring or in the documentation?
- Also related to the above topic, I think in the original implementation the questions for each prompt are passed to the VLM one by one (so if a prompt has 10 questions for that generation, each is one request). Please correct me if I am wrong (maybe the VLM PR handles these requests one by one), but here we seem to pass these questions all together.
So in the original implementation we evaluate multiple images of the same prompt multiple times, while in our implementation we evaluate one image at a time with multiple questions. I think this may shape the scores quite differently, so maybe at least we should also evaluate the questions one by one?
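If we do want to mirror that one-request-per-question behaviour, a rough sketch (assuming `vlm.score` accepts parallel lists of images, questions, and expected answers as in the snippet above; the helper name is made up for the example):

```python
def score_questions_one_by_one(vlm, image, question_texts: list[str]) -> list[float]:
    """Sketch: one VLM request per question, mirroring the original OneIG flow."""
    raw_scores: list[float] = []
    for question in question_texts:
        # One image, one question, one expected answer per call.
        raw_scores.extend(vlm.score([image], [question], ["Yes"]))
    return raw_scores
```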
```
@@ -0,0 +1,234 @@
# Copyright 2025 - Pruna AI GmbH. All rights reserved.
#
```
This is not really an issue at all, but I think the original implementation uses a Qwen model for scoring, while our VLM implementation usually defaults to GPT-4o? This is totally fine by me, but maybe we should note it somewhere?
If kwargs only supports aggregation right now, maybe we should also make it an explicit keyword argument?
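Something like this, sketched without knowing the rest of the constructor signature:

```python
from typing import Any


class OneIGAlignmentMetric:  # illustrative shell only, not the real class hierarchy
    def __init__(self, *, aggregation: str = "mean", **kwargs: Any) -> None:
        # aggregation becomes an explicit keyword-only parameter instead of a **kwargs entry
        self.aggregation = aggregation
        self.extra_kwargs = kwargs
```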
begumcig left a comment
Something is wrong with the reviewing, so I have to make this placeholder comment.


Summary
Splits `oneig_alignment` into its own stacked PR, adds `OneIGAlignmentMetric`, and wires OneIG alignment subset benchmarks.

Stack Position
- feat/vlm-pr-3a-qa-accuracy
- feat/vlm-pr-3c-text-score-pair
- feat/vlm-pr-5-e2e-tests
- feat/metrics-vlm-support

Files
- src/pruna/evaluation/metrics/metric_oneig_alignment.py
- src/pruna/evaluation/benchmarks.py
- tests/evaluation/test_text_metrics.py

Test Plan
Review Focus
Review Flow (Order)
Review the stack in this exact order:
This PR in the flow (4/10)