Bug Report: "total_score" calculation ignores partial metrics in "get_final_score" #77

@Evan-Joseph

Summary

In evaluation/utils.py, the get_final_score function calculates four distinct metrics (skill_match_score, entity_match_score, skill_with_entity_match_score, exact_match_score). However, the weighted total_score is computed by iterating only over the keys of the skill_entity_scores dictionary, which holds just skill_match_score and entity_match_score. The other two metrics, skill_with_entity_match_score (10% weight) and exact_match_score (10% weight), are ignored entirely.

With each metric on a 0-100 scale, the maximum possible total for any task is therefore 80.0 instead of 100.0, and models are not rewarded for correct structural dependencies or joint skill-entity matching.

Code Analysis

The issue is located in evaluation/utils.py:

# ... lines 319-323
skill_entity_scores = calculate_skill_and_entity_scores(standard_skill_sequence, model_skill_sequence)
# This dict ONLY contains: ['skill_match_score', 'entity_match_score']

skill_with_entity_scores = calculate_skill_with_entity_scores(standard_skill_sequence, model_skill_sequence)
# This is a separate variable

exact_match_score = get_exact_match(standard_skill_sequence, model_skill_sequence, dependency)
# This is a separate variable

score_weight = {
    "skill_match_score": 0.4,
    "entity_match_score": 0.4,
    "skill_with_entity_match_score": 0.1,
    "exact_match_score": 0.1
}

# BUG HERE: This loop only iterates over keys in `skill_entity_scores`,
# effectively ignoring the other two metrics computed above.
total_score = sum(score_weight[key] * value for key, value in skill_entity_scores.items())

Reproduction Steps

  1. Run evaluation on any task where skill_with_entity_match_score > 0.
  2. Observe the generated output.json or final_score.json.
  3. Manually calculate the weighted sum: 0.4*skill + 0.4*entity + 0.1*joint + 0.1*exact.
  4. Compare it with the logged total_score.

Example:
If a model gets:

  • Skill Match: 100 (weighted 40)
  • Entity Match: 0 (weighted 0)
  • Skill+Entity Match: 50 (weighted 5)
  • Exact Match: 0 (weighted 0)

Expected Score: 45.0
Actual Score: 40.0
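
The arithmetic in this example can be checked with a short standalone sketch. The metric values are the hypothetical ones listed above, not output from a real evaluation run, and the dictionaries only stand in for the return values of the functions in evaluation/utils.py.

```python
# Weights copied from the issue; metric values are the hypothetical example above.
score_weight = {
    "skill_match_score": 0.4,
    "entity_match_score": 0.4,
    "skill_with_entity_match_score": 0.1,
    "exact_match_score": 0.1,
}

# Stand-ins (assumed shapes) for the three intermediate results in get_final_score.
skill_entity_scores = {"skill_match_score": 100.0, "entity_match_score": 0.0}
skill_with_entity_scores = {"skill_with_entity_match_score": 50.0}
exact_match_score = 0.0

# Current behaviour: only the two keys of skill_entity_scores are summed.
actual = sum(score_weight[key] * value for key, value in skill_entity_scores.items())

# Expected behaviour: all four weighted metrics contribute.
all_scores = {
    **skill_entity_scores,
    **skill_with_entity_scores,
    "exact_match_score": exact_match_score,
}
expected = sum(score_weight[key] * all_scores[key] for key in score_weight)

print(f"actual={actual:.1f} expected={expected:.1f}")
```

The 5-point gap between the two values is exactly the dropped 0.1 * 50 contribution from skill_with_entity_match_score.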

Suggested Fix

Explicitly sum all weighted components instead of iterating through a partial dictionary.

    total_score = (
        skill_entity_scores["skill_match_score"] * score_weight["skill_match_score"] +
        skill_entity_scores["entity_match_score"] * score_weight["entity_match_score"] +
        skill_with_entity_scores["skill_with_entity_match_score"] * score_weight["skill_with_entity_match_score"] +
        exact_match_score * score_weight["exact_match_score"]
    )
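
A possible alternative, sketched here under assumptions, is to merge all metrics into one dictionary and iterate over the weights rather than over the scores, so a missing metric raises an error instead of being dropped silently. The helper name weighted_total is hypothetical and does not exist in the repository; the dictionary shapes follow the comments in the snippet quoted under Code Analysis.

```python
def weighted_total(scores: dict, weights: dict) -> float:
    """Weighted sum that fails loudly if any weighted metric is missing.

    Hypothetical helper for illustration; not part of evaluation/utils.py.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"metrics missing from scores: {sorted(missing)}")
    # Iterate over the weights so every weighted metric is counted exactly once.
    return sum(weights[name] * scores[name] for name in weights)


score_weight = {
    "skill_match_score": 0.4,
    "entity_match_score": 0.4,
    "skill_with_entity_match_score": 0.1,
    "exact_match_score": 0.1,
}

# Hypothetical perfect run: every metric at 100 should now total 100, not 80.
perfect = {name: 100.0 for name in score_weight}
print(weighted_total(perfect, score_weight))
```

Inside get_final_score, the merged dictionary could be built as {**skill_entity_scores, **skill_with_entity_scores, "exact_match_score": exact_match_score}, assuming skill_with_entity_scores is a dictionary keyed by "skill_with_entity_match_score" as the suggested fix implies.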
