Description
Issue:
The access patterns for TreeOfLife image data have evolved toward high-frequency random access, exposing a key performance limitation of the current pure Parquet design. Storage costs are also high. Together, these motivate a search for more performant formats.
Solution:
Storing the images in HDF5 groups has proven advantageous over pure Parquet: it enables fast random access that uses no more memory than the image(s) of interest require. Image metadata is stored alongside the .h5 files in query-optimized Parquet files, resulting in a hybrid format.
Furthermore, WebP has proven to be an ideal image format: it dramatically reduces disk utilization while remaining lossless and decoding quickly. This was determined from a comparison among JPG, PNG, AVIF (min/max effort), WebP (min/max effort), and JXL.
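As a rough illustration of the random-access pattern, the sketch below stores each encoded image as its own dataset keyed by UUID and reads one back without touching the others. The group name `images`, the function names, and the one-dataset-per-image layout are assumptions for illustration, not the layout used in the tests. Random bytes stand in for real WebP payloads.

```python
import os
import tempfile
import uuid

import h5py
import numpy as np


def write_images(h5_path, encoded_images):
    """Store each encoded image (e.g. WebP bytes) as a uint8 dataset keyed by UUID."""
    with h5py.File(h5_path, "w") as f:
        group = f.create_group("images")  # hypothetical group name
        for image_id, payload in encoded_images.items():
            group.create_dataset(image_id, data=np.frombuffer(payload, dtype=np.uint8))


def read_image(h5_path, image_id):
    """Read back the bytes of a single image; other datasets are not loaded."""
    with h5py.File(h5_path, "r") as f:
        return f["images"][image_id][()].tobytes()


# Placeholder payloads stand in for real WebP-encoded bytes.
samples = {str(uuid.uuid4()): os.urandom(64) for _ in range(5)}
path = os.path.join(tempfile.mkdtemp(), "images.h5")
write_images(path, samples)

wanted = next(iter(samples))
restored = read_image(path, wanted)  # only this image's bytes are read
```

In the hybrid format, the UUID passed to `read_image` would presumably come from a query against the accompanying Parquet metadata.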
Support:
The tests covered:
- Disk utilization of 10k sampled images encoded in each format, losslessly where possible.
- Throughput (wall time to randomly access and decode 1, 10, 100, 1k, and 10k individual images by UUID from the 10k set, N=3).
- Technical complexity / library support.
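The throughput measurement described above could be structured roughly as follows. This is a minimal sketch, not the harness used for the reported results: `time_random_access` and its parameters are hypothetical names, `fetch` stands for whatever "access and decode one image by UUID" means for a given format, and an in-memory dict stands in for a real store.

```python
import random
import statistics
import time
import uuid


def time_random_access(fetch, ids, batch_sizes, repeats=3, seed=0):
    """Return {batch_size: mean wall time in seconds} for fetching random ids."""
    rng = random.Random(seed)
    results = {}
    for n in batch_sizes:
        timings = []
        for _ in range(repeats):
            batch = rng.sample(ids, n)  # n distinct ids, chosen at random
            start = time.perf_counter()
            for image_id in batch:
                fetch(image_id)
            timings.append(time.perf_counter() - start)
        results[n] = statistics.mean(timings)
    return results


# An in-memory dict stands in for a real image store (assumption).
store = {str(uuid.uuid4()): bytes(16) for _ in range(10_000)}
results = time_random_access(
    store.__getitem__, list(store), [1, 10, 100, 1_000, 10_000]
)
```

Swapping `store.__getitem__` for a per-format reader would let the same loop compare formats under identical sampling.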
From the tests, the following were determined:
- Parquet bytes are inefficient for storage and slow for random access: wall time was identical for 1 and 10k images, and even bulk access was slower than with the other formats.
- AVIF is superb in storage and throughput, but it is more technically complicated to support for bit-perfect hashing, which is needed for data validation and duplicate management.
- JPG has great performance but is not lossless.
- JXL has superb performance but lags in library support.
- PNG has good throughput but heavy files.
- WebP has excellent storage performance and decent throughput.
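To make the bit-perfect hashing requirement concrete, here is a minimal sketch that assumes the hash is taken over decoded pixel data plus dimensions and mode, so that a lossless round trip through any container yields the same digest. The issue does not specify how hashes are computed; `pixel_digest` and `find_duplicates` are hypothetical helpers.

```python
import hashlib


def pixel_digest(width, height, mode, pixels):
    """SHA-256 over image dimensions, mode, and raw decoded pixel bytes."""
    h = hashlib.sha256()
    h.update(f"{width}x{height}:{mode}:".encode("ascii"))
    h.update(pixels)
    return h.hexdigest()


def find_duplicates(images):
    """Group image ids by digest; return only groups with more than one id."""
    groups = {}
    for image_id, (width, height, mode, pixels) in images.items():
        digest = pixel_digest(width, height, mode, pixels)
        groups.setdefault(digest, []).append(image_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Two-pixel RGB images; "a" and "b" carry identical pixel data.
red_green = bytes([255, 0, 0, 0, 255, 0])
green_red = bytes([0, 255, 0, 255, 0, 0])
images = {
    "a": (2, 1, "RGB", red_green),
    "b": (2, 1, "RGB", red_green),
    "c": (2, 1, "RGB", green_red),
}
duplicates = find_duplicates(images)
```

Under this assumption, a format is only usable for validation if decoding always reproduces the original pixels exactly, which is the property that makes lossless encoding a requirement here.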
Taken together, the results point to a hybrid format as ideal for the current use case: images encoded as WebP and stored in HDF5 files for random access, with metadata queries served by accompanying Parquet files.
Required action:
A tool should be added to the toolbox that can convert Parquet bytes into the hybrid HDF5(WebP)-Parquet(metadata) format.

