Support glob patterns in open_datatree(group=...) for selective group loading#11302

Open
aladinor wants to merge 16 commits into
pydata:mainfrom
aladinor:glob-group-filtering-standalone

Conversation

@aladinor
Contributor

Summary

When the group parameter contains glob metacharacters (*, ?, [), filter which groups are opened instead of re-rooting the tree. This avoids loading the entire hierarchy when only a subset is needed.

Use cases

  • Radar data: xr.open_datatree("radar.nc", group="*/sweep_0") — load only the lowest elevation sweep from each volume scan
  • CMIP archives: xr.open_datatree("cmip.zarr", group="*/historical/tas") — load only temperature across all models

Changes

  • Added shared utilities _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter in common.py
  • Updated NetCDF4, H5NetCDF, and Zarr backends to use a discover → filter → open pipeline
  • Uses the same matching engine as DataTree.match() (PurePosixPath.match)
  • Root (/) and all ancestors of matched nodes are always included to form a valid tree
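The discover → filter step described above can be sketched roughly as follows. This is a minimal illustration, not the code in common.py: the names `is_glob_pattern`, `filter_group_paths`, and `example_paths` are hypothetical stand-ins, and the only behavior assumed is what the list above states (glob detection on `*`, `?`, `[`; matching via `PurePosixPath.match`; root and ancestors always kept).

```python
from pathlib import PurePosixPath

# Characters that switch `group` from root selection to filter mode.
GLOB_CHARS = frozenset("*?[")


def is_glob_pattern(group):
    """Return True if ``group`` contains a glob metacharacter (hypothetical helper)."""
    return group is not None and any(ch in GLOB_CHARS for ch in group)


def filter_group_paths(paths, pattern):
    """Keep paths matching ``pattern``, plus the root and every ancestor of a match."""
    keep = {"/"}  # the root is always part of the resulting tree
    for path in paths:
        if path == "/":
            continue
        node = PurePosixPath(path)
        if node.match(pattern):
            keep.add(str(node))
            # Ancestors are needed so the matched nodes form a valid tree.
            keep.update(str(parent) for parent in node.parents)
    return sorted(keep)


# Assumed layout of a small radar file, for illustration only.
example_paths = [
    "/",
    "/VCP-34",
    "/VCP-34/sweep_0",
    "/VCP-34/sweep_1",
    "/VCP-212",
    "/VCP-212/sweep_0",
]

selected = filter_group_paths(example_paths, "*/sweep_0")
```

With this layout, `selected` holds the root, both volume groups, and their `sweep_0` children, while `/VCP-34/sweep_1` is dropped. A pattern that matches nothing leaves only the root.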

Behavior summary

| group value | Behavior |
|---|---|
| None | Load all groups (unchanged) |
| "VCP-34" (no glob chars) | Root selection (unchanged) |
| "*/sweep_0" (glob chars) | Filter mode: only matched groups + ancestors |
| Pattern matches nothing | Root-only tree |

Test plan

  • 27 new tests covering all backends (netCDF4, h5netcdf, zarr v2/v3)
  • Unit tests for _is_glob_pattern, _filter_group_paths, _resolve_group_and_filter with *, ?, []
  • Integration tests: glob match, no-match, data preservation, open_groups API
  • Full test_backends_datatree.py suite passes (228 passed, 0 failures)
  • Pre-commit checks pass

@github-actions github-actions Bot added topic-backends topic-zarr Related to zarr storage library io labels Apr 16, 2026
Add _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter
to common.py for detecting and applying glob patterns to group paths.
Use _resolve_group_and_filter in open_groups_as_dict to support glob
patterns in the group parameter for selective group loading.
Update docstrings for the group kwarg in open_datatree and open_groups
to describe glob metacharacter behavior.
Add integration tests for netCDF4, h5netcdf, and zarr backends, plus
unit tests for _is_glob_pattern, _filter_group_paths, and
_resolve_group_and_filter covering *, ?, and [] metacharacters.
@aladinor aladinor force-pushed the glob-group-filtering-standalone branch from e892524 to 5fb46e1 Compare April 16, 2026 17:09
@kmuehlbauer
Contributor

@aladinor Thanks, that's a great feature. I'd instantly use it.

There might be some pitfalls if group names contain one or more of the glob metacharacters. Will this be handled, too?

my_nifty_group_with_a_star_*_01
my_nifty_group_with_a_star_*_11
my_nifty_group_with_a_star_*_12

@kmuehlbauer
Contributor

XRef: h5py/h5py#2059 for discussion of adding globbing in h5py

@aladinor
Contributor Author

aladinor commented Apr 22, 2026

@kmuehlbauer, thanks for taking the time to check this out.

my_nifty_group_with_a_star_*_01
my_nifty_group_with_a_star_*_11
my_nifty_group_with_a_star_*_12

This seems to be a strange way to name a group, but yes. It will work via the same character-class escape that fnmatch / PurePath.match supports.

For example, if we have something like this

  paths = ['/my_nifty_group_with_a_star_*_01',
           '/my_nifty_group_with_a_star_*_11',
           '/my_nifty_group_with_a_star_*_12']

The pattern "*star_[*]_*" selects those groups. It matches all three, because [*] matches a literal *.
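The escape can be checked directly against `PurePosixPath.match`, independent of any backend. In this small sketch the `/plain_01` entry mirrors the test fixture mentioned in the commits, and `/my_nifty_group_with_a_star_x_01` is a hypothetical extra path added only to show that `[*]` is a literal match rather than a wildcard.

```python
from pathlib import PurePosixPath

paths = [
    "/my_nifty_group_with_a_star_*_01",
    "/my_nifty_group_with_a_star_*_11",
    "/my_nifty_group_with_a_star_*_12",
    "/my_nifty_group_with_a_star_x_01",  # hypothetical: no literal '*' in the name
    "/plain_01",
]

# '[*]' is a one-character class containing only '*', so it matches a literal star.
escaped = "*star_[*]_*"
# Without the character class, the middle '*' is an ordinary wildcard.
unescaped = "*star_*_*"

literal_star_only = [p for p in paths if PurePosixPath(p).match(escaped)]
wildcard_matches = [p for p in paths if PurePosixPath(p).match(unescaped)]
```

Here `literal_star_only` contains just the three names with a literal `*`, whereas `wildcard_matches` also picks up the `_x_` variant. `/plain_01` is excluded in both cases.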

aladinor and others added 8 commits April 22, 2026 08:40
Add coverage for group names containing literal ``*`` / ``?`` / ``[``.
These are reachable with ``[*]`` / ``[?]`` / ``[[]`` character-class
escaping (inherited from ``fnmatch`` / ``PurePath.match`` semantics).

New tests:
- ``test_open_datatree_glob_char_class_escape_literal_metachar`` on
  ``NetCDFIOBase`` and ``TestZarrDatatreeIO`` — end-to-end verification
  that groups with literal metacharacters in their names can be
  targeted across all supported backends.
- ``test_filter_group_paths_literal_metachar_via_char_class`` on
  ``TestGlobPatternUtilities`` — unit-level check of the filter.
Explain that matching follows ``fnmatch`` / :py:meth:`pathlib.PurePath.match`
semantics and that literal ``*`` / ``?`` / ``[`` in group names can be
targeted via character-class escapes (``[*]``, ``[?]``, ``[[]``), with a
short example. Applied to both :py:func:`open_datatree` and
:py:func:`open_groups` for consistency.
Add ``/plain_01`` to the zarr ``test_open_datatree_glob_char_class_escape_literal_metachar``
fixture so it matches the NetCDF version and confirms plain (no-metachar)
group names are excluded when the pattern targets literal-metachar names.
Windows forbids ``*`` and ``?`` in filesystem directory/file names, and
zarr stores each group as an on-disk directory. That makes writing the
fixture impossible before the test can exercise the filter. NetCDF4/H5
store groups inside the HDF5 container so they are unaffected.

Skip the zarr variant on Windows with a clear reason; the NetCDF
variants still cover the escape behavior on all platforms.
The previous commit skipped the zarr variant on Windows because the
filesystem rejects ``*`` and ``?`` in directory names. Using
``zarr.storage.MemoryStore`` side-steps the filesystem entirely, so the
test now runs on every platform and still exercises the escape logic.

This is also a more realistic target for the feature on Windows — users
who hit group names with glob metacharacters are likely reading from
cloud/icechunk stores (dict-keyed like ``MemoryStore``), not an on-disk
zarr directory tree.
``open_datatree``'s static signature doesn't list zarr store objects
(``MemoryStore`` etc.) among its accepted first-argument types, but the
zarr backend handles them correctly at runtime. Apply a narrow
``# type: ignore[arg-type]`` on the three test calls rather than
widening the public signature.
@kmuehlbauer
Contributor

@aladinor Thanks for adding the glob escapes. Is this ready from your side?

@aladinor
Contributor Author

aladinor commented May 8, 2026

Yep, it is ready to merge @kmuehlbauer

@kmuehlbauer kmuehlbauer added the plan to merge Final call for comments label May 8, 2026
Contributor

@kmuehlbauer kmuehlbauer left a comment

This is looking good to me. Can't say much wrt typing, though.

@kmuehlbauer
Contributor

@pydata/xarray Another set of eyes much appreciated here. If there are no concerns, I'd move on and merge early next week. Thanks!

@shoyer
Member

shoyer commented May 11, 2026

I like the idea of this feature, but worry about ambiguity with the existing group argument -- are we sure that names with these characters are invalid in netCDF/Zarr?

A safer strategy would be to make a new argument, something like group_filter.
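To make the ambiguity concrete, here is a rough sketch contrasting the two designs. Both functions are hypothetical illustrations, not xarray API: `resolve_single_argument` mimics the overloaded `group` behavior described in this PR, and `resolve_separate_arguments` assumes a `group_filter` keyword along the lines suggested above.

```python
GLOB_CHARS = frozenset("*?[")


def resolve_single_argument(group):
    """Overloaded ``group``: glob characters switch to filter mode (sketch)."""
    if group is None:
        return ("all", None)
    if any(ch in GLOB_CHARS for ch in group):
        return ("filter", group)
    return ("root", group)


def resolve_separate_arguments(group=None, group_filter=None):
    """Hypothetical split: ``group`` always re-roots, ``group_filter`` always filters."""
    return {"root": group, "filter": group_filter}


# A group whose name really contains '*' can no longer be chosen as the new root
# through the overloaded argument, because it is classified as a pattern.
literal_name = "my_nifty_group_with_a_star_*_01"
overloaded = resolve_single_argument(literal_name)
separate = resolve_separate_arguments(group=literal_name)
```

With the overloaded argument, `overloaded` reports filter mode for a name that was meant literally. With separate keywords the intent is carried by which argument is used, so no inspection of the string is needed.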
