Description
Summary
In evaluation/utils.py, the get_final_score function calculates four distinct metrics (skill_match_score, entity_match_score, skill_with_entity_match_score, exact_match_score). However, when computing the weighted total_score, the code iterates only over the keys present in the skill_entity_scores dictionary. This causes skill_with_entity_match_score (10% weight) and exact_match_score (10% weight) to be completely ignored.
As a result, with each metric on a 0-100 scale, the maximum possible score for any task is capped at 80.0 instead of 100.0, and models are not rewarded for correct structural dependencies or joint skill-entity matching.
Code Analysis
The issue is located in evaluation/utils.py:
# ... lines 319-323
skill_entity_scores = calculate_skill_and_entity_scores(standard_skill_sequence, model_skill_sequence)
# This dict ONLY contains: ['skill_match_score', 'entity_match_score']
skill_with_entity_scores = calculate_skill_with_entity_scores(standard_skill_sequence, model_skill_sequence)
# This is a separate variable
exact_match_score = get_exact_match(standard_skill_sequence, model_skill_sequence, dependency)
# This is a separate variable
score_weight = {
"skill_match_score": 0.4,
"entity_match_score": 0.4,
"skill_with_entity_match_score": 0.1,
"exact_match_score": 0.1
}
# BUG HERE: This loop only iterates over keys in `skill_entity_scores`,
# effectively ignoring the other two metrics computed above.
total_score = sum(score_weight[key] * value for key, value in skill_entity_scores.items())
Reproduction Steps
- Run evaluation on any task where skill_with_entity_match_score > 0.
- Observe the generated output.json or final_score.json.
- Manually calculate the weighted sum: 0.4*skill + 0.4*entity + 0.1*joint + 0.1*exact.
- Compare it with the logged total_score.
Example:
If a model gets:
- Skill Match: 100 (weighted 40)
- Entity Match: 0 (weighted 0)
- Skill+Entity Match: 50 (weighted 5)
- Exact Match: 0 (weighted 0)
Expected Score: 45.0
Actual Score: 40.0
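The gap between the two numbers can be reproduced in isolation. The snippet below is a minimal sketch, not the repository's code: the weights are copied from this issue, the score values are the hypothetical ones from the example, and the shape of `skill_with_entity_scores` (a dict with a single key) is an assumption.

```python
# Weights as quoted in this issue.
score_weight = {
    "skill_match_score": 0.4,
    "entity_match_score": 0.4,
    "skill_with_entity_match_score": 0.1,
    "exact_match_score": 0.1,
}

# Hypothetical metric values taken from the example above.
skill_entity_scores = {"skill_match_score": 100.0, "entity_match_score": 0.0}
skill_with_entity_scores = {"skill_with_entity_match_score": 50.0}  # assumed dict shape
exact_match_score = 0.0

# Current behaviour: only the two keys in skill_entity_scores contribute.
actual_total = sum(
    score_weight[key] * value for key, value in skill_entity_scores.items()
)

# Intended behaviour: every weighted metric contributes.
all_scores = {
    **skill_entity_scores,
    **skill_with_entity_scores,
    "exact_match_score": exact_match_score,
}
expected_total = sum(score_weight[key] * all_scores[key] for key in score_weight)

print(f"actual:   {actual_total:.1f}")
print(f"expected: {expected_total:.1f}")
```

The difference between the two totals is exactly the weighted contribution of the two dropped metrics.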
Suggested Fix
Explicitly sum all weighted components instead of iterating through a partial dictionary.
total_score = (
skill_entity_scores["skill_match_score"] * score_weight["skill_match_score"] +
skill_entity_scores["entity_match_score"] * score_weight["entity_match_score"] +
skill_with_entity_scores["skill_with_entity_match_score"] * score_weight["skill_with_entity_match_score"] +
exact_match_score * score_weight["exact_match_score"]
)
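An alternative that may guard against the same mistake recurring is to drive the sum from the weight table rather than from a score dictionary, and to fail loudly when a weighted metric is missing. This is only a sketch: `weighted_total` is a hypothetical helper, not an existing function in evaluation/utils.py, and the sample values are the ones from the example in this issue.

```python
def weighted_total(scores, weights):
    """Return the weighted sum over every metric listed in `weights`.

    Raises KeyError if a weighted metric has no score, so a metric
    cannot be dropped silently.
    """
    missing = sorted(set(weights) - set(scores))
    if missing:
        raise KeyError(f"missing scores for weighted metrics: {missing}")
    return sum(weights[name] * scores[name] for name in weights)


# Weights as quoted in this issue.
score_weight = {
    "skill_match_score": 0.4,
    "entity_match_score": 0.4,
    "skill_with_entity_match_score": 0.1,
    "exact_match_score": 0.1,
}

# Hypothetical merged scores; in the real function these would come from
# skill_entity_scores, skill_with_entity_scores and exact_match_score.
merged_scores = {
    "skill_match_score": 100.0,
    "entity_match_score": 0.0,
    "skill_with_entity_match_score": 50.0,
    "exact_match_score": 0.0,
}

total_score = weighted_total(merged_scores, score_weight)
print(f"total_score: {total_score:.1f}")
```

Iterating over `score_weight` instead of the score dictionary means that adding a new weighted metric without supplying its score raises an error instead of quietly lowering the maximum achievable total.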