Add benchmark for model evaluation on small clusters by frostedoyster · Pull Request #547 · ddmms/ml-peg

frostedoyster · 2026-05-12T09:22:43Z

Pre-review checklist for PR author

PR author must check the checkboxes below when creating the PR.

I've confirmed the contribution guidelines.

Summary

As described in #546, this benchmark evaluates force accuracies on small atomic clusters.

Linked issue

Resolves #546

Progress

Calculations
Analysis
Application
Documentation

Potential aspects of the benchmark to be discussed with the maintainers:

weighting of clusters of different sizes (defaults to one for all sizes at the moment)
energies are excluded for the moment because models can have arbitrary baselines; however, we could run a few DFT calculations to determine isolated atom energies and subtract those from the cluster energies
the benchmark is marked as "very slow" for the moment because it evaluates 60k structures. This still needs to be tested, though, as all structures are very small. It would also be possible to select a smaller subset.
the benchmark contains random elements up to (and including) the third row of the periodic table. This includes clusters containing noble gases, which are often not well represented in training sets. Depending on how the existing models perform on those, we might decide to exclude clusters containing noble gases.
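
The isolated-atom idea from the energy point above can be sketched as follows. This is a minimal illustration, not code from the PR: the function name, the `isolated_atom_energies` mapping and the numbers in the usage line are hypothetical placeholders, and real values would have to come from DFT calculations at the same level of theory as the cluster labels.

```python
from collections import Counter


def atomization_energy(symbols, total_energy, isolated_atom_energies):
    """Return the total energy minus the sum of isolated-atom energies.

    symbols: chemical symbols of the cluster, e.g. ["O", "H", "H"].
    total_energy: cluster energy (reference or model prediction), in eV.
    isolated_atom_energies: hypothetical mapping symbol -> energy in eV,
        obtained with the same method that produced total_energy.
    """
    counts = Counter(symbols)
    missing = sorted(set(counts) - set(isolated_atom_energies))
    if missing:
        raise KeyError(f"No isolated-atom energy for: {missing}")
    baseline = sum(n * isolated_atom_energies[s] for s, n in counts.items())
    return total_energy - baseline


# Placeholder numbers for illustration only, not real DFT values.
example = atomization_energy(["O", "H", "H"], -30.0, {"O": -20.0, "H": -4.0})
```

Applying the same subtraction to the reference (with DFT atom energies) and to each model (with that model's own isolated-atom predictions) would remove the arbitrary per-element offset before energies are compared.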

Testing

We carefully checked consistency of the labels with the publicly available OMol25 and MAD-1.5 datasets. The benchmark has not been tested on any models yet.

New decorators/callbacks

No new callbacks are needed.

ElliottKasoar

Hi @frostedoyster, thanks for this, it's looking really nice already, and should be a really interesting addition!

@joehart2001 and I will try to look in more detail as soon as we can, but I've left a few comments/questions from an initial pass of everything.

ElliottKasoar · 2026-05-12T13:37:50Z

+        out_atoms.arrays["pred_forces"] = np.asarray(
+            atoms_pred.get_forces(), dtype=float
+        )


Could we add a try/except around this, and maybe store the forces as NaNs if this fails?

Otherwise (unless I'm missing something) models which support a more limited set of elements, for example, will fail at some point.
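
A minimal sketch of the suggested fallback, assuming the object passed in behaves like an `ase.Atoms` instance with a calculator attached (it only needs `len()` and `get_forces()`). The helper name `forces_or_nan` and the two stub classes are hypothetical and exist only to make the example self-contained; catching `Exception` broadly is a design choice here, since different calculators raise different error types for unsupported elements.

```python
import numpy as np


def forces_or_nan(atoms):
    """Return an (n_atoms, 3) float array of forces, or NaNs on failure."""
    n_atoms = len(atoms)
    try:
        forces = np.asarray(atoms.get_forces(), dtype=float)
    except Exception:  # broad on purpose: calculators raise different errors
        return np.full((n_atoms, 3), np.nan)
    if forces.shape != (n_atoms, 3):
        # Treat an unexpected shape as a failed prediction as well.
        return np.full((n_atoms, 3), np.nan)
    return forces


class _WorkingAtoms:
    """Stub standing in for a structure the model supports."""

    def __len__(self):
        return 2

    def get_forces(self):
        return [[0.0, 0.0, 0.1], [0.0, 0.0, -0.1]]


class _UnsupportedAtoms:
    """Stub standing in for a structure with an unsupported element."""

    def __len__(self):
        return 2

    def get_forces(self):
        raise ValueError("element not supported")


ok = forces_or_nan(_WorkingAtoms())
failed = forces_or_nan(_UnsupportedAtoms())
```

In the quoted line, the result would then be stored as `out_atoms.arrays["pred_forces"] = forces_or_nan(atoms_pred)`, so a failing structure still produces an array of the right shape.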

ElliottKasoar · 2026-05-12T13:40:07Z

+MAD2_FORCES_KEY = "mad2_forces"
+OMOL25_FORCES_KEY = "omol25_forces"
+ORGANIC_MODEL_MARKERS = ("omol", "off", "polar")


Generally when we have multiple references, we'd still compare all models to each one. Otherwise the scores we assign mean different things to different models, which could be misleading/unfair

I agree, so we would test all models on all sets.
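
As a rough sketch of what "all models on all sets" could look like, the snippet below scores one set of predicted forces against every reference. Only the two key strings come from the diff; `force_mae` and `mae_per_reference` are hypothetical helper names, and the arrays are toy data.

```python
import numpy as np

# Key names as they appear in the quoted diff.
REFERENCE_KEYS = ("mad2_forces", "omol25_forces")


def force_mae(pred, ref):
    """Component-wise mean absolute error between two force arrays.

    NaN predictions propagate to a NaN score instead of being dropped.
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return float(np.mean(np.abs(pred - ref)))


def mae_per_reference(pred_forces, ref_forces_by_key):
    """Score one prediction against every reference, not a chosen one."""
    return {
        key: force_mae(pred_forces, ref_forces_by_key[key])
        for key in REFERENCE_KEYS
    }


# Toy data: a 2-atom cluster with constant reference forces.
pred = np.zeros((2, 3))
refs = {
    "mad2_forces": np.full((2, 3), 0.1),
    "omol25_forces": np.full((2, 3), 0.2),
}
scores = mae_per_reference(pred, refs)
```

With this layout every model gets one score per reference, so a column in the metrics table means the same thing for every model.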

ElliottKasoar · 2026-05-12T13:42:10Z

+            "to their training domain."
+        ),
+        table_path=DATA_PATH / "cluster_forces_metrics_table.json",
+        extra_components=[Div(id=f"{BENCHMARK_NAME}-figure-placeholder")],


Could you add a docs_url to this, corresponding to the URL that will take you to the new documentation?

ElliottKasoar · 2026-05-12T13:50:07Z

+    level_of_theory: null
+    weight: 1
+  Force MAE (4 atoms):
+    good: 0.1
+    bad: 1.0
+    unit: eV/A
+    tooltip: "Component-wise force MAE on neutral 4-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces."
+    level_of_theory: null
+    weight: 1
+  Force MAE (5 atoms):
+    good: 0.1
+    bad: 1.0
+    unit: eV/A
+    tooltip: "Component-wise force MAE on neutral 5-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces."
+    level_of_theory: null
+    weight: 1
+  Force MAE (6 atoms):
+    good: 0.1
+    bad: 1.0
+    unit: eV/A
+    tooltip: "Component-wise force MAE on neutral 6-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces."
+    level_of_theory: null
+    weight: 1
+  Force MAE (7 atoms):
+    good: 0.1
+    bad: 1.0
+    unit: eV/A
+    tooltip: "Component-wise force MAE on neutral 7-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces."
+    level_of_theory: null
+    weight: 1
+  Force MAE (8 atoms):
+    good: 0.1
+    bad: 1.0
+    unit: eV/A
+    tooltip: "Component-wise force MAE on neutral 8-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces."
+    level_of_theory: null


I thought the level of theory for these was either ωB97M-V/def2-TZVPD or r2SCAN? Are the nulls because it could be either?
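
If each metric ends up being tied to a single reference, the level of theory would no longer be ambiguous. The sketch below generates one entry per cluster size and reference. The thresholds, unit and weight are copied from the quoted YAML; everything else is an assumption: the metric names, the list of sizes (only those visible in the quoted diff), and in particular the reference-to-theory mapping, which is taken from the question above and would need to be confirmed by the PR author.

```python
# Sizes visible in the quoted diff; the actual file may list more.
CLUSTER_SIZES = (4, 5, 6, 7, 8)

# ASSUMPTION: mapping inferred from the review question, not confirmed.
LEVEL_OF_THEORY = {
    "MAD2": "r2SCAN",
    "OMOL25": "ωB97M-V/def2-TZVPD",
}


def build_metric_entries(sizes=CLUSTER_SIZES, levels=LEVEL_OF_THEORY):
    """Build one metric entry per (cluster size, reference) pair."""
    entries = {}
    for size in sizes:
        for reference, theory in levels.items():
            # Hypothetical naming scheme for the per-reference metrics.
            name = f"Force MAE ({size} atoms, {reference})"
            entries[name] = {
                "good": 0.1,
                "bad": 1.0,
                "unit": "eV/A",
                "tooltip": (
                    f"Component-wise force MAE on neutral {size}-atom "
                    f"clusters, compared to {reference} forces."
                ),
                "level_of_theory": theory,
                "weight": 1,
            }
    return entries


METRICS = build_metric_entries()
```

The resulting dictionary mirrors the structure of the YAML above and could be dumped to that file, which would also avoid maintaining the near-identical blocks by hand.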

ElliottKasoar · 2026-05-12T13:56:16Z

+
+Input/reference data:
+
+* Cluster structures and reference forces are distributed as a separate zip archive and


Can you add comments on how the structures were originally obtained/built, and the level of theory of the reference data?

I agree, it would be useful to include a lot of the info from your email.

ElliottKasoar · 2026-05-12T13:59:07Z

+
+    frames = iread(data_path, index=":")
+    if sys.stderr.isatty():
+        frames = tqdm(frames, desc=f"{model_name} cluster forces", unit="cluster")


If we know the total number of structures, it might be nice to pass it here, since otherwise you don't get the progress bar.
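
A small sketch of the idea: `iread` returns a generator with no length, so `tqdm` can only draw a bar if it is given `total`. The wrapper name `with_progress` is hypothetical; `total`, `desc` and `unit` are standard `tqdm` arguments.

```python
import sys


def with_progress(frames, total=None, desc="cluster forces"):
    """Wrap an iterable in a tqdm bar when stderr is a terminal.

    Passing `total` lets tqdm draw a bar and an ETA even when `frames`
    is a generator without a length.
    """
    if not sys.stderr.isatty():
        return frames
    # Imported lazily so non-interactive runs do not need tqdm.
    from tqdm import tqdm

    return tqdm(frames, total=total, desc=desc, unit="cluster")
```

In the quoted code this would replace the two `tqdm` lines, for example `frames = with_progress(frames, total=n_structures, desc=f"{model_name} cluster forces")`, where `n_structures` is a placeholder for the known size of the dataset.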

ElliottKasoar · 2026-05-12T14:07:08Z

+            )
+            for cluster_size in CLUSTER_SIZES
+        }
+        plot_from_table_column(


It might be nice to add structure visualisation to these plots. @joehart2001 can probably help, since it's slightly more complicated for density plots than normal scatters.

For the density scatter, you can look at e.g. the analysis example and app example, but let us know if anything is unclear, including any of the formats.

joehart2001 · 2026-05-12T19:48:32Z

+    return ref_forces
+
+
+@pytest.mark.very_slow


Suggested change

-@pytest.mark.very_slow
+@pytest.mark.slow

very_slow is reserved for calculations which take on the order of days, so I think slow is more appropriate.

I wondered about this. If it does actually take several hours to run, say, locally, I think it could be borderline, but yeah slow is probably slightly more accurate still

From testing locally, it took less than an hour.

joehart2001 · 2026-05-12T20:06:32Z

Overall it's looking great! Would we want to apply dispersion corrections for models which have not been trained on dispersion-corrected data? We have a utility to do this automatically, see here. I'm not suggesting that this needs dispersion, I would just like to check.

Implement draft

f11534e

ElliottKasoar requested review from ElliottKasoar and joehart2001 May 12, 2026 10:21

ElliottKasoar added the "new benchmark" label (Proposals and suggestions for new benchmarks) May 12, 2026

ElliottKasoar reviewed May 12, 2026

joehart2001 reviewed May 12, 2026


3 participants
