Format images as webp within HDF5 with Parquet metadata #38

@thompsonmj

Issue:
The access patterns for TreeOfLife image data have evolved toward high-frequency random access, exposing a key performance limitation of the current pure Parquet design. Storage costs are also high. Together, these motivate a search for more performant formats.

Solution:
Storing the images in HDF5 groups has proven advantageous over pure Parquet: it enables fast random access that uses no more memory than is needed for the image(s) of interest. Image metadata is stored alongside the .h5 files in query-optimized Parquet files, resulting in a hybrid format.
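A minimal sketch of the random-access side of this layout, assuming one 1-D uint8 dataset of encoded bytes per image, keyed by UUID inside a single group. The group name `images` and the function names are hypothetical, not part of the toolbox:

```python
import h5py
import numpy as np


def store_encoded(h5_path, images):
    """Write {uuid: encoded_bytes} as one 1-D uint8 dataset per image."""
    with h5py.File(h5_path, "w") as f:
        group = f.create_group("images")  # hypothetical group name
        for uuid, encoded in images.items():
            group.create_dataset(uuid, data=np.frombuffer(encoded, dtype=np.uint8))


def fetch_encoded(h5_path, uuids):
    """Read only the requested datasets; all others stay on disk."""
    with h5py.File(h5_path, "r") as f:
        group = f["images"]
        return {u: group[u][()].tobytes() for u in uuids}
```

In this sketch the UUIDs to fetch would come from a query against the accompanying Parquet metadata, so only the matching images are ever read from the .h5 file.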

Furthermore, WebP has proven to be an ideal image format: it dramatically reduces disk utilization while remaining lossless and decoding quickly. This was determined from a comparison among JPG, PNG, AVIF (min/max effort), WebP (min/max effort), and JXL.

Support:
The tests covered:

  • Disk utilization for lossless (where possible) formats of 10k sampled images.
  • Throughput (wall time to randomly access and decode 1, 10, 100, 1k, and 10k individual images by UUID from the 10k set, N=3).
  • Technical complexity / library support.
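The throughput measurement above can be sketched roughly as follows, assuming N=3 means three repetitions per batch size and that `load` is whatever callable fetches and decodes a single image by UUID for the format under test. The function name and parameters are hypothetical:

```python
import random
import time


def time_random_access(load, uuids, batch_sizes=(1, 10, 100, 1_000, 10_000),
                       repeats=3, seed=0):
    """Return {batch_size: [wall_time_seconds, ...]} for random UUID batches.

    `uuids` must be a list; batch sizes larger than the pool are skipped.
    """
    rng = random.Random(seed)  # fixed seed so every format sees the same batches
    results = {}
    for size in batch_sizes:
        if size > len(uuids):
            continue
        timings = []
        for _ in range(repeats):
            batch = rng.sample(uuids, size)  # random selection without replacement
            start = time.perf_counter()
            for uuid in batch:
                load(uuid)  # access and decode one image
            timings.append(time.perf_counter() - start)
        results[size] = timings
    return results
```

Calling it once per format with the same seed keeps the sampled UUIDs identical across formats, so the timings are comparable.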

From the tests, the following were determined:

  • Parquet bytes are inefficient for storage and slow for random access: wall time was identical for 1 and 10k images, and even bulk access was slower than with the other formats.
  • AVIF is superb in storage and throughput, but is more technically complicated to support for bit-perfect hashing, which is needed for data validation and duplicate management.
  • JPG has great performance but is not lossless.
  • JXL has superb performance but lags in library support.
  • PNG has good throughput but produces heavy files.
  • WebP has excellent storage performance and decent throughput.
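For the bit-perfect hashing requirement, one possible approach is to hash the decoded pixel content rather than the encoded file bytes, so the hash does not depend on the encoder or its settings. This is an assumption for illustration, not necessarily how the toolbox computes hashes, and `pixel_hash` is a hypothetical name:

```python
import hashlib


def pixel_hash(width, height, mode, pixels):
    """SHA-256 over image shape, color mode, and raw decoded pixel bytes.

    Including shape and mode keeps images with identical byte buffers
    but different layouts from colliding.
    """
    h = hashlib.sha256()
    h.update(f"{mode}:{width}x{height}:".encode("ascii"))
    h.update(pixels)
    return h.hexdigest()
```

With Pillow, for example, the arguments could be taken from a decoded image as `pixel_hash(img.width, img.height, img.mode, img.tobytes())`. A lossless format should yield the same hash before and after re-encoding, which is what makes the check usable for validation and duplicate detection.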

Taken together, encoding images as WebP for random access in HDF5 files, with metadata querying supported by accompanying Parquet files, appears to be an ideal hybrid format for the current use case.

Required action:
A tool should be added to the toolbox to convert images stored as Parquet bytes into the hybrid HDF5(WebP)-Parquet(metadata) format.
