Add benchmark for model evaluation on small clusters #547

Open: frostedoyster wants to merge 1 commit into ddmms:main from frostedoyster:main

@frostedoyster

Pre-review checklist for PR author

PR author must check the checkboxes below when creating the PR.

Summary

As described in #546, this benchmark evaluates force accuracies on small atomic clusters.

Linked issue

Resolves #546

Progress

  • Calculations
  • Analysis
  • Application
  • Documentation

Potential aspects of the benchmark to be discussed with the maintainers:

  • weighting of clusters of different sizes (currently 1 for all sizes)
  • energies are excluded for now because models can have arbitrary energy baselines; however, we could run a few DFT calculations to determine isolated-atom energies and subtract those from the cluster energies
  • the benchmark is marked as "very slow" for now because it evaluates 60k structures. This still needs to be tested, though, since all structures are very small. It would also be possible to select a smaller subset.
  • the benchmark contains random elements up to and including the third row of the periodic table. This means some clusters contain noble gases, which are often poorly represented in training sets. Depending on how existing models perform on those, we might decide to exclude clusters containing noble gases.
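
Two of the open points above lend themselves to a short sketch: referencing energies to isolated atoms, and flagging clusters that contain noble gases. The function names and all numerical values below are hypothetical placeholders and are not part of the PR. Real isolated-atom energies would have to come from DFT calculations at the same level of theory as the reference data.

```python
# Hypothetical helpers; names and values are illustrative only.
NOBLE_GASES = frozenset({"He", "Ne", "Ar"})  # noble gases up to the third row


def atomization_energy(total_energy, symbols, isolated_atom_energies):
    """Subtract per-element isolated-atom energies from a cluster's total energy."""
    return total_energy - sum(isolated_atom_energies[s] for s in symbols)


def contains_noble_gas(symbols):
    """Return True if any atom in the cluster is a noble gas."""
    return any(s in NOBLE_GASES for s in symbols)


# Placeholder numbers, not real DFT energies (eV).
atom_refs = {"H": -1.0, "O": -5.0}
print(atomization_energy(-10.0, ["H", "H", "O"], atom_refs))  # -3.0
print(contains_noble_gas(["Ne", "H", "H"]))  # True
```

For a fair energy comparison, the same subtraction would presumably have to be applied to each model's predictions using that model's own isolated-atom energies.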

Testing

We carefully checked the consistency of the labels with the publicly available OMol25 and MAD-1.5 datasets. The benchmark has not yet been tested on any models.

New decorators/callbacks

No new callbacks are needed.

@ElliottKasoar added the "new benchmark" label (Proposals and suggestions for new benchmarks) on May 12, 2026
Collaborator

@ElliottKasoar left a comment

Hi @frostedoyster, thanks for this, it's looking really nice already, and should be a really interesting addition!

@joehart2001 and I will try to look in more detail as soon as we can, but I've left a few comments/questions from an initial pass of everything.

Comment on lines +213 to +215
out_atoms.arrays["pred_forces"] = np.asarray(
    atoms_pred.get_forces(), dtype=float
)
Collaborator

Could we add a try/except around this, and maybe store the forces as NaNs if this fails?

Otherwise (unless I'm missing something) models which support a more limited set of elements, for example, will fail at some point.
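
A minimal sketch of that fallback, assuming the object passed in behaves like an ASE `Atoms` with a calculator attached (it supports `get_forces()` and `len()`). The helper name `forces_or_nan` is hypothetical, and the broad `except Exception` is a deliberate simplification; the real code may want to catch narrower errors and log the failure.

```python
import numpy as np


def forces_or_nan(atoms):
    """Return predicted forces as an (n_atoms, 3) float array, or NaNs on failure."""
    try:
        return np.asarray(atoms.get_forces(), dtype=float)
    except Exception:
        # E.g. the model does not support one of the elements in this cluster.
        return np.full((len(atoms), 3), np.nan)


class _UnsupportedCluster:
    """Stand-in for an Atoms object whose calculator rejects the structure."""

    def __len__(self):
        return 4

    def get_forces(self):
        raise ValueError("element not supported")


fallback = forces_or_nan(_UnsupportedCluster())
print(fallback.shape, np.isnan(fallback).all())
```

Whichever policy is chosen, the downstream metrics would need to treat NaNs explicitly, since a plain `np.mean` over an array containing NaN returns NaN.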

Comment on lines +28 to +30
MAD2_FORCES_KEY = "mad2_forces"
OMOL25_FORCES_KEY = "omol25_forces"
ORGANIC_MODEL_MARKERS = ("omol", "off", "polar")
Collaborator

Generally, when we have multiple references, we'd still compare all models to each one. Otherwise, the scores we assign mean different things for different models, which could be misleading or unfair.

Collaborator

I agree, so we would test all models on all sets.
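
One way to read this is that every model gets one score per reference, regardless of its training domain. A minimal sketch, reusing the `mad2_forces` and `omol25_forces` keys from the diff; the function name and the dictionary layout are assumptions made for illustration.

```python
import numpy as np

REFERENCE_KEYS = ("mad2_forces", "omol25_forces")  # array keys used in the PR


def force_mae_per_reference(pred_forces, ref_forces):
    """Component-wise force MAE of one model against every reference."""
    pred = np.asarray(pred_forces, dtype=float)
    return {
        key: float(np.mean(np.abs(pred - np.asarray(ref_forces[key], dtype=float))))
        for key in REFERENCE_KEYS
    }


# Toy data for a 2-atom cluster, not real forces.
pred = np.zeros((2, 3))
refs = {
    "mad2_forces": np.full((2, 3), 1.0),
    "omol25_forces": np.full((2, 3), 0.5),
}
maes = force_mae_per_reference(pred, refs)
print(maes)
```

With this layout, each reference would get its own metric column, and no marker-based model classification is needed to decide which reference applies.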

"to their training domain."
),
table_path=DATA_PATH / "cluster_forces_metrics_table.json",
extra_components=[Div(id=f"{BENCHMARK_NAME}-figure-placeholder")],
Collaborator

Could you add a docs_url to this, corresponding to the URL that will take you to the new documentation?

Comment on lines +7 to +42
  level_of_theory: null
  weight: 1
Force MAE (4 atoms):
  good: 0.1
  bad: 1.0
  unit: eV/A
  tooltip: "Component-wise force MAE on neutral 4-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces."
  level_of_theory: null
  weight: 1
Force MAE (5 atoms):
  good: 0.1
  bad: 1.0
  unit: eV/A
  tooltip: "Component-wise force MAE on neutral 5-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces."
  level_of_theory: null
  weight: 1
Force MAE (6 atoms):
  good: 0.1
  bad: 1.0
  unit: eV/A
  tooltip: "Component-wise force MAE on neutral 6-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces."
  level_of_theory: null
  weight: 1
Force MAE (7 atoms):
  good: 0.1
  bad: 1.0
  unit: eV/A
  tooltip: "Component-wise force MAE on neutral 7-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces."
  level_of_theory: null
  weight: 1
Force MAE (8 atoms):
  good: 0.1
  bad: 1.0
  unit: eV/A
  tooltip: "Component-wise force MAE on neutral 8-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces."
  level_of_theory: null
Collaborator

I thought the level of theory for these was either ωB97M-V/def2-TZVPD or r2SCAN? Are the nulls because it could be either?


Input/reference data:

* Cluster structures and reference forces are distributed as a separate zip archive and
Collaborator

Can you add comments on how the structures were originally obtained/built, and the level of theory of the reference data?

Collaborator

I agree, it would be useful to include a lot of the info from your email.


frames = iread(data_path, index=":")
if sys.stderr.isatty():
    frames = tqdm(frames, desc=f"{model_name} cluster forces", unit="cluster")
Collaborator

If we know the total number of structures, it might be nice to pass it to tqdm here (via its total argument), since otherwise you don't get the progress bar, only a running count.
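
`tqdm` accepts a `total` argument for iterables that have no length, which is the case for the generator returned by `iread`. A sketch of how a known count could be passed through; `wrap_with_progress` and its parameters are hypothetical names, and the count itself would have to come from the dataset.

```python
import sys


def wrap_with_progress(frames, total, desc):
    """Wrap an iterator in tqdm with a known length so a bar and ETA can be shown."""
    if not sys.stderr.isatty():
        return frames  # no terminal attached: leave the iterator untouched
    from tqdm import tqdm  # third-party; imported lazily so non-TTY runs do not need it

    return tqdm(frames, total=total, desc=desc, unit="cluster")


# Toy iterator standing in for the frames read from the dataset.
frames = wrap_with_progress(iter(range(3)), total=3, desc="demo cluster forces")
consumed = list(frames)
```

If the count were ever wrong, tqdm would still iterate correctly; only the displayed percentage and ETA would be off.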

)
for cluster_size in CLUSTER_SIZES
}
plot_from_table_column(
Collaborator

It might be nice to add structure visualisation to these plots. @joehart2001 can probably help, since it's slightly more complicated for density plots than normal scatters.

Collaborator

For the density scatter, you can look at e.g. the analysis example and the app example, but let us know if you're confused about anything, such as formats.

return ref_forces


@pytest.mark.very_slow
Collaborator

Suggested change
- @pytest.mark.very_slow
+ @pytest.mark.slow

Collaborator

very_slow is reserved for calculations that take on the order of days, so I think slow is more appropriate.

Collaborator

@ElliottKasoar, May 12, 2026

I wondered about this. If it does actually take several hours to run locally, say, I think it could be borderline, but yeah, slow is probably still slightly more accurate.

Collaborator

From testing locally, it took less than an hour.

Collaborator

@joehart2001 commented May 12, 2026

Overall it's looking great! Would we want to apply dispersion corrections for models which have not been trained on dispersion-corrected data? We have a utility to do this automatically, see here. I'm not suggesting that this needs dispersion, I'd just like to check.

Labels

new benchmark Proposals and suggestions for new benchmarks

Development

Successfully merging this pull request may close these issues.

Model evaluation on small clusters

3 participants