Parquet output format #870

Merged
Josef-Haupt merged 3 commits into birdnet-team:main from mschulist:mschulist-parquet-output
Feb 26, 2026

@mschulist
Contributor

This PR adds Parquet as an output format for analyze. It uses PyArrow for all Parquet reading and writing.

Currently, it creates a separate "table" for each timestamp, which might result in suboptimal compression when there are few results per timestamp. There are a few options to improve this:

  • Have a buffer that creates a new table for every $n$ rows (slightly more complex implementation, but still not too bad).
  • Put all rows in a single table (which might use a lot of memory for large datasets).

Combining the results into a single file (with --combine_results) does make the output much smaller, but it would be ideal to have good compression without having to do this extra step.

Either way, I have found that Parquet's columnar compression works particularly well on classifier outputs because they are so repetitive (e.g. the same filename is repeated across many rows). For large datasets, Parquet should provide a significant reduction in file size.

This is somewhat related to #230 as well.

@Josef-Haupt
Member

Sounds good. The birdnet lib also has a Parquet output, and we are currently replacing the core of the analyzer with the lib anyway. We can merge the PR and I'll update the code in #867 to match it.

@mschulist
Contributor Author

Ok great! There are a decent number of unrelated failing tests (which should be fixed once the PR you mentioned gets merged). Is that okay?

I'll change this PR to write everything to a single table. I realized that it's probably not worth batching because all of the results are in memory anyway...

@mschulist
Contributor Author

@Josef-Haupt I just pushed the changes to make a single table per file. Let me know if you have any thoughts :)

Thanks!

Josef-Haupt merged commit 59ddce6 into birdnet-team:main on Feb 26, 2026
1 of 8 checks passed