Dataclass for "VirtualZarrStore" #375

@TomNicholas

Problem

Kerchunk user code currently passes around an obscure multiply-nested "reference dict" object. This is hard to read, interrogate, validate, or reason about.
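For illustration, here is roughly what such a reference dict looks like and what it takes to pull basic facts out of it. This is a minimal sketch assuming the version 1 reference layout; the array name `temp`, the URL, and the byte offsets are made up for the example.

```python
import json

# Illustrative kerchunk-style references (version 1 layout).
# The array name, URL, offsets and lengths are invented for this example.
refs = {
    "version": 1,
    "refs": {
        # Zarr metadata documents are stored as JSON *strings* inside the dict.
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/.zarray": json.dumps(
            {
                "zarr_format": 2,
                "shape": [4, 4],
                "chunks": [2, 2],
                "dtype": "<f4",
                "compressor": None,
                "filters": None,
                "fill_value": None,
                "order": "C",
            }
        ),
        "temp/.zattrs": json.dumps({"units": "K"}),
        # Chunk references are positional lists: [url, offset, length].
        "temp/0.0": ["s3://example-bucket/file.nc", 4096, 16],
        "temp/0.1": ["s3://example-bucket/file.nc", 4112, 16],
    },
}

# Reading an array's shape needs a string-key lookup plus a JSON decode.
shape = json.loads(refs["refs"]["temp/.zarray"])["shape"]

# Reading a chunk location relies on remembering the positional layout.
url, offset, length = refs["refs"]["temp/0.0"]
```

Nothing in the structure itself says which keys are metadata, which are chunks, or what the three list positions mean; that knowledge lives in the reader's head.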

Suggestion

Instead, create a new VirtualZarrStore dataclass that contains all the information currently stored in the reference dict, but in a more structured form. This would then be the principal object passed between user calls to the kerchunk API.
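As a rough sketch of the shape this could take (not a proposed final API): apart from `VirtualZarrStore`, `.to_dict` and `.to_json`, every class, field and method name below is hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical record of where one chunk's bytes live."""

    path: str
    offset: int
    length: int


@dataclass
class VirtualArray:
    """Hypothetical per-array metadata plus its chunk references."""

    shape: Tuple[int, ...]
    chunks: Tuple[int, ...]
    dtype: str
    chunk_refs: Dict[str, ChunkRef] = field(default_factory=dict)
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class VirtualZarrStore:
    """Structured stand-in for the nested reference dict."""

    arrays: Dict[str, VirtualArray] = field(default_factory=dict)
    attrs: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        # asdict recurses through nested dataclasses, dicts and tuples.
        return asdict(self)

    def to_json(self) -> str:
        # Tuples are emitted as JSON arrays.
        return json.dumps(self.to_dict())


store = VirtualZarrStore(
    arrays={
        "temp": VirtualArray(
            shape=(4, 4),
            chunks=(2, 2),
            dtype="<f4",
            chunk_refs={"0.0": ChunkRef("s3://example-bucket/file.nc", 4096, 16)},
            attrs={"units": "K"},
        )
    }
)

# Interrogation becomes attribute access instead of key lookups and JSON decoding.
first_chunk = store.arrays["temp"].chunk_refs["0.0"]
```

With this layout, `store.arrays["temp"].shape` and `first_chunk.offset` are self-describing, and the serialized form follows directly from the field names.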

Advantages

  • Easier to read and interrogate than multiply-nested dicts
  • Allows direct validation
  • Serializes in obvious ways (via .to_json, .to_parquet, .to_dict, or similar)
  • Easier to write tests, by using fixtures to generate VirtualZarrStore objects
  • Concentrates concerns over changes/enhancements to Zarr Spec in one class
  • A v2->v3 converter could act directly on these objects
  • Possibly easier to understand whenever anyone reimplements kerchunk in other languages?
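To make the validation and testing points concrete, here is a small sketch of how a dataclass could reject inconsistent references at construction time, with a plain factory function standing in for a test fixture. The `VirtualArray` class, its fields, and `make_example_array` are hypothetical names, and the checks shown are only examples of what might be validated.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class VirtualArray:
    """Hypothetical array record that validates itself on creation."""

    shape: Tuple[int, ...]
    chunks: Tuple[int, ...]
    # Maps a chunk key such as "0.1" to (path, offset, length).
    chunk_refs: Dict[str, Tuple[str, int, int]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if len(self.shape) != len(self.chunks):
            raise ValueError("shape and chunks must have the same number of dimensions")
        if any(c <= 0 for c in self.chunks):
            raise ValueError("chunk sizes must be positive")
        # Number of chunks along each dimension.
        grid = tuple(math.ceil(s / c) for s, c in zip(self.shape, self.chunks))
        for key, (path, offset, length) in self.chunk_refs.items():
            # int() raises ValueError for non-numeric key parts.
            index = tuple(int(part) for part in key.split("."))
            if len(index) != len(grid) or any(
                not 0 <= i < n for i, n in zip(index, grid)
            ):
                raise ValueError(f"chunk key {key!r} is outside the chunk grid {grid}")
            if offset < 0 or length <= 0:
                raise ValueError(f"chunk {key!r} has an invalid byte range")


def make_example_array(**overrides) -> VirtualArray:
    """Fixture-style helper: a valid array that tests can tweak via overrides."""
    params = {
        "shape": (4, 4),
        "chunks": (2, 2),
        "chunk_refs": {"0.0": ("file.nc", 0, 16), "1.1": ("file.nc", 48, 16)},
    }
    params.update(overrides)
    return VirtualArray(**params)
```

A test can then call `make_example_array(chunk_refs={"5.0": ("file.nc", 0, 16)})` and expect a `ValueError`, without hand-building any nested dicts.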

Implementation ideas

  • Implementation could subclass Zarr Object Model (ZOM) classes (where .to_json is analogous to the ZOM's .serialize), which would then be solidified as the recommended abstract representation once ZEP006 is accepted
  • Can't use a bare ZOM class because we need to add some extra attributes for byte ranges etc. However, information on where to find chunks is essentially a "Chunk Manifest", a generalizable idea that @jhamman has also been working on (for a nascent ZEP007??)
  • Attributes of this dataclass always need to be serializable, so the VirtualZarrStore should basically be a JSON schema (see #373)
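One way to picture the "Chunk Manifest" idea is as a small, always-serializable mapping from chunk keys to byte ranges that can be converted to and from flat `[path, offset, length]` reference entries. The sketch below is an assumption about how that might look, not the ZOM or any ZEP design; `ChunkEntry`, `ChunkManifest`, `to_refs` and `from_refs` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ChunkEntry:
    """Hypothetical byte-range record for one chunk."""

    path: str
    offset: int
    length: int


@dataclass
class ChunkManifest:
    """Hypothetical manifest: chunk key (e.g. "0.1") -> where its bytes live."""

    entries: Dict[str, ChunkEntry]

    def to_refs(self, array_name: str) -> Dict[str, list]:
        # Emit flat "<array>/<chunk key>": [path, offset, length] entries.
        return {
            f"{array_name}/{key}": [e.path, e.offset, e.length]
            for key, e in self.entries.items()
        }

    @classmethod
    def from_refs(cls, array_name: str, refs: Dict[str, Any]) -> "ChunkManifest":
        prefix = f"{array_name}/"
        entries = {}
        for key, value in refs.items():
            if not key.startswith(prefix):
                continue
            chunk_key = key[len(prefix):]
            if chunk_key.startswith("."):
                continue  # metadata documents such as .zarray or .zattrs
            if not isinstance(value, list) or len(value) != 3:
                continue  # anything that is not a [path, offset, length] triple
            path, offset, length = value
            entries[chunk_key] = ChunkEntry(path, offset, length)
        return cls(entries)


manifest = ChunkManifest(
    {
        "0.0": ChunkEntry("s3://example-bucket/file.nc", 4096, 16),
        "0.1": ChunkEntry("s3://example-bucket/file.nc", 4112, 16),
    }
)

# Every field is a str or int, so a JSON round trip loses nothing.
flat = json.loads(json.dumps(manifest.to_refs("temp")))
restored = ChunkManifest.from_refs("temp", flat)
```

Keeping the manifest separate from the array metadata means the metadata part could reuse a ZOM class unchanged, with only the manifest carrying the byte-range extras.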

Questions

  • Is it possible to do this in a broadly backwards-compatible manner?
