Add benchmark for model evaluation on small clusters #547
ElliottKasoar left a comment:
Hi @frostedoyster, thanks for this, it's looking really nice already, and should be a really interesting addition!
@joehart2001 and I will try to look in more detail as soon as we can, but I've left a few comments/questions from an initial pass of everything.
| out_atoms.arrays["pred_forces"] = np.asarray( | ||
| atoms_pred.get_forces(), dtype=float | ||
| ) |
Could we add a try/except around this, and maybe store the forces as NaNs if this fails?
Otherwise (unless I'm missing something) models which support a more limited set of elements, for example, will fail at some point.
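A minimal sketch of the guard suggested above, assuming the prediction object exposes `get_forces()` as in the diff. The helper name `safe_forces` and the two stand-in classes are hypothetical and only illustrate the pattern; the real exception type raised by an unsupported element is not known from this thread, so the sketch catches `Exception` broadly.

```python
import numpy as np


def safe_forces(atoms_pred, n_atoms):
    """Return predicted forces, or an (n_atoms, 3) array of NaNs on failure."""
    try:
        forces = np.asarray(atoms_pred.get_forces(), dtype=float)
    except Exception:
        # Assumption: any calculator failure (e.g. an unsupported element)
        # should be recorded as missing data rather than abort the run.
        forces = np.full((n_atoms, 3), np.nan)
    return forces


class _FailingAtoms:
    """Hypothetical stand-in for a structure the model cannot evaluate."""

    def get_forces(self):
        raise ValueError("element not supported")


class _WorkingAtoms:
    """Hypothetical stand-in for a structure the model can evaluate."""

    def get_forces(self):
        return [[0.0, 0.1, -0.1], [0.0, -0.1, 0.1]]


failed = safe_forces(_FailingAtoms(), n_atoms=2)
ok = safe_forces(_WorkingAtoms(), n_atoms=2)
```

If NaNs are stored this way, the downstream metric code would need to handle them explicitly (for example by skipping or flagging such clusters), since a plain mean over NaN values is itself NaN.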
| MAD2_FORCES_KEY = "mad2_forces" | ||
| OMOL25_FORCES_KEY = "omol25_forces" | ||
| ORGANIC_MODEL_MARKERS = ("omol", "off", "polar") |
Generally when we have multiple references, we'd still compare all models to each one. Otherwise the scores we assign mean different things to different models, which could be misleading/unfair.
I agree, so we would test all models on all sets.
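As a rough illustration of scoring every model against every reference, the sketch below computes a component-wise force MAE per reference key. The key strings come from the diff above; the function names and the toy arrays are hypothetical and not taken from the PR.

```python
import numpy as np

# Reference keys as they appear in the diff (MAD2_FORCES_KEY, OMOL25_FORCES_KEY).
REFERENCE_KEYS = ("mad2_forces", "omol25_forces")


def force_mae(pred, ref):
    """Component-wise mean absolute error between two (n_atoms, 3) arrays."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return float(np.mean(np.abs(pred - ref)))


def mae_per_reference(pred_forces, references):
    """Score one model's predictions against every reference, not just one."""
    return {key: force_mae(pred_forces, references[key]) for key in REFERENCE_KEYS}


# Toy data for a 2-atom cluster (illustrative values only).
pred = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
references = {
    "mad2_forces": [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]],
    "omol25_forces": [[0.0, 0.0, 0.4], [0.0, 0.0, -0.4]],
}
scores = mae_per_reference(pred, references)
```

Reporting one column per reference keeps each score's meaning the same for all models, which is the concern raised in the comments above.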
| "to their training domain." | ||
| ), | ||
| table_path=DATA_PATH / "cluster_forces_metrics_table.json", | ||
| extra_components=[Div(id=f"{BENCHMARK_NAME}-figure-placeholder")], |
Could you add a docs_url to this, corresponding to the URL that will take you to the new documentation?
| level_of_theory: null | ||
| weight: 1 | ||
| Force MAE (4 atoms): | ||
| good: 0.1 | ||
| bad: 1.0 | ||
| unit: eV/A | ||
| tooltip: "Component-wise force MAE on neutral 4-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces." | ||
| level_of_theory: null | ||
| weight: 1 | ||
| Force MAE (5 atoms): | ||
| good: 0.1 | ||
| bad: 1.0 | ||
| unit: eV/A | ||
| tooltip: "Component-wise force MAE on neutral 5-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces." | ||
| level_of_theory: null | ||
| weight: 1 | ||
| Force MAE (6 atoms): | ||
| good: 0.1 | ||
| bad: 1.0 | ||
| unit: eV/A | ||
| tooltip: "Component-wise force MAE on neutral 6-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces." | ||
| level_of_theory: null | ||
| weight: 1 | ||
| Force MAE (7 atoms): | ||
| good: 0.1 | ||
| bad: 1.0 | ||
| unit: eV/A | ||
| tooltip: "Component-wise force MAE on neutral 7-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces." | ||
| level_of_theory: null | ||
| weight: 1 | ||
| Force MAE (8 atoms): | ||
| good: 0.1 | ||
| bad: 1.0 | ||
| unit: eV/A | ||
| tooltip: "Component-wise force MAE on neutral 8-atom clusters. Materials/general models are compared to MAD2 forces; organic-focused models are compared to OMOL25 forces." | ||
| level_of_theory: null |
I thought the level of theory for these was either ωB97M-V/def2-TZVPD or r2SCAN? Are the nulls because it could be either?
| Input/reference data: | ||
| * Cluster structures and reference forces are distributed as a separate zip archive and |
Can you add comments on how the structures were originally obtained/built, and the level of theory of the reference data?
I agree, it would be useful to include much of the information from your email.
| frames = iread(data_path, index=":") | ||
| if sys.stderr.isatty(): | ||
| frames = tqdm(frames, desc=f"{model_name} cluster forces", unit="cluster") |
If we know the total number of clusters, it might be nice to pass it here (e.g. via tqdm's `total` argument), since otherwise you don't get the progress bar, only a running count.
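A minimal sketch of passing a known length to tqdm, assuming the frame count is available up front. The helper `with_progress` and the count of 5 are hypothetical; the generator stands in for the lazy `iread(...)` call in the diff, and the fallback import only keeps the sketch runnable where tqdm is not installed.

```python
import sys

try:
    from tqdm import tqdm
except ImportError:  # keep the sketch usable without tqdm installed
    tqdm = None


def with_progress(frames, total=None, desc="cluster forces"):
    """Wrap an iterable in tqdm; `total` lets tqdm draw a bar for generators."""
    if tqdm is None or not sys.stderr.isatty():
        return frames
    return tqdm(frames, total=total, desc=desc, unit="cluster")


# Stand-in for a lazy frame reader, which has no len().
frames = (i for i in range(5))
processed = [frame for frame in with_progress(frames, total=5)]
```

Without `total`, tqdm cannot infer the length of a generator and only reports the iteration count and rate.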
| ) | ||
| for cluster_size in CLUSTER_SIZES | ||
| } | ||
| plot_from_table_column( |
It might be nice to add structure visualisation to these plots. @joehart2001 can probably help, since it's slightly more complicated for density plots than normal scatters.
For the density scatter, you can look at e.g. the analysis example and app example, but let us know if you're confused about anything (formats etc.).
| return ref_forces | ||
| @pytest.mark.very_slow |
Suggested change: replace `@pytest.mark.very_slow` with `@pytest.mark.slow`.
`very_slow` is reserved for calculations which take on the order of days, so I think `slow` is more appropriate.
I wondered about this. If it does actually take several hours to run, say, locally, I think it could be borderline, but yeah, `slow` is probably still slightly more accurate.
From testing locally, it took less than an hour.
Overall it's looking great! Would we want to apply dispersion corrections for models which have not been trained on dispersion-corrected data? We have a utility to do this automatically, see here. I'm not suggesting that this needs dispersion, I would just like to check.
Pre-review checklist for PR author
PR author must check the checkboxes below when creating the PR.
Summary
As described in #546, this benchmark evaluates force accuracies on small atomic clusters.
Linked issue
Resolves #546
Progress
Potential aspects of the benchmark to be discussed with the maintainers:
Testing
We carefully checked consistency of the labels with the publicly available OMol25 and MAD-1.5 datasets. The benchmark has not been tested on any models yet.
New decorators/callbacks
No new callbacks are needed.