Dear MMDR-Bench team,
Thank you for releasing this benchmark! The multi-dimensional scoring approach seems very well thought out.
I ran into an issue when trying to use the evaluation pipeline on my own reports.
The GEN and EVI scoring prompts in detail/scoring_judge.py inject Rules.txt as context for the LLM judge:
# scoring_judge.py:137
prompt = f"""
You are a judge for General Metrics in a Deep Research benchmark.
Rules (excerpt):
{rules_text[:1400]}
...
But Rules.txt is not included in the repository. load_rules_text falls back to "No specific rules provided.", which means the judge receives no definitions for the fields it's asked to return (E, p_analysis, rho_connective, U, O, T, kappa, theta, support_strength, hallucination_prob, ...).
These fields feed directly into the scoring formulas (e.g., sigmoid(a0 + a1*TTR + a2*H - a3*L - a4*E) in scoring_general.py), so their interpretation matters. Without Rules.txt, a third party running the evaluation pipeline gets scores that depend entirely on whatever the judge LLM guesses these field names mean.
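To illustrate how much the interpretation of a single field can matter, here is a minimal sketch of the formula quoted above. The coefficients a0..a4 and the input values are placeholders I made up for illustration; they are not the values used in scoring_general.py, and general_score is a hypothetical name.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def general_score(ttr, h, l, e, coeffs=(0.0, 1.0, 1.0, 1.0, 1.0)):
    """Evaluate sigmoid(a0 + a1*TTR + a2*H - a3*L - a4*E).

    The default coefficients are illustrative placeholders, NOT the
    values from scoring_general.py.
    """
    a0, a1, a2, a3, a4 = coeffs
    return sigmoid(a0 + a1 * ttr + a2 * h - a3 * l - a4 * e)


# Same report, same TTR/H/L, but the judge reads "E" in two different ways:
# once as "almost absent" (0.0) and once as "strongly present" (1.0).
score_if_e_low = general_score(ttr=0.5, h=0.5, l=0.2, e=0.0)
score_if_e_high = general_score(ttr=0.5, h=0.5, l=0.2, e=1.0)

print(f"E = 0.0 -> {score_if_e_low:.3f}")
print(f"E = 1.0 -> {score_if_e_high:.3f}")
```

With these made-up numbers, the score moves from above 0.5 to below 0.5 purely because of how the judge reads E, so two judge models with different guesses could rank the same report differently.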
Could you include Rules.txt in the repo (or document the expected field definitions somewhere)?
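In the meantime, failing loudly instead of silently falling back might help others notice the problem. The following is only a sketch of that idea: load_rules_text_strict is a hypothetical helper, and I have not checked it against the actual signature or behavior of load_rules_text in the repo.

```python
from pathlib import Path


def load_rules_text_strict(path: str) -> str:
    """Return the contents of the rules file, or raise if it is unusable.

    Hypothetical alternative to a silent fallback string: the caller finds
    out immediately that the judge would run without field definitions.
    """
    rules_path = Path(path)
    if not rules_path.is_file():
        raise FileNotFoundError(
            f"{rules_path} not found; the judge would receive no field definitions"
        )
    text = rules_path.read_text(encoding="utf-8").strip()
    if not text:
        raise ValueError(f"{rules_path} is empty")
    return text
```

A warning log would also work if raising is too disruptive for existing users; the point is only that a missing Rules.txt should be visible in the run output.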