
fix: update to perch-hoplite 1.0.0 API (Deployment → Recording → Window)#871

Merged
max-mauermann merged 6 commits into birdnet-team:birdnet-lib from LimitlessGreen:fix/perch-hoplite-1.0-api-compat
Feb 26, 2026

Conversation


LimitlessGreen (Contributor) commented Feb 16, 2026

Summary

Migrate from the deprecated perch-hoplite API (EmbeddingSource model) to the new Deployment → Recording → Window data model introduced in perch-hoplite v1.0.0.

The previous code used EmbeddingSource, get_embedding_source(), insert_embedding(), and SQLiteUsearchDB (lowercase "s"), all of which have been removed or renamed in perch-hoplite 1.0.0.

Changes

Embedding pipeline (birdnet_analyzer/embeddings/core.py)

  • Rewrite to use insert_deployment() / insert_recording() / insert_window() instead of the removed insert_embedding() + EmbeddingSource
  • Add ghost segment filtering: birdnet pads shorter files in a batch up to max_n_segments, and not all padded segments are masked. The pipeline now additionally checks s_start >= input_durations[i] and clamps s_end = min(s_end, file_dur) to avoid inserting phantom windows
  • Use handle_duplicates="skip" on insert_window() for resume support
  • Fix create_csv_output() to use match_window_ids() + get_window() + get_recording()
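The ghost-segment filtering described above boils down to a small, self-contained helper. This is a minimal sketch, not the PR's actual code: the `(s_start, s_end)` pair list and the standalone `file_dur` argument are illustrative stand-ins for the per-file values birdnet reports via `input_durations`.

```python
def filter_ghost_segments(starts_ends, file_dur):
    """Drop padded ("ghost") windows and clamp the final real one.

    birdnet pads shorter files in a batch up to max_n_segments, so some
    (s_start, s_end) pairs lie entirely past the end of the audio and
    must not be inserted as windows.
    """
    windows = []
    for s_start, s_end in starts_ends:
        if s_start >= file_dur:       # phantom window produced by batch padding
            continue
        s_end = min(s_end, file_dur)  # clamp a partial final window to the file end
        windows.append((s_start, s_end))
    return windows
```

For a 7-second file segmented in 3-second steps, `filter_ghost_segments([(0, 3), (3, 6), (6, 9)], 7.0)` keeps the first two windows and clamps the third to `(6, 7.0)`.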

Model utilities (birdnet_analyzer/model_utils.py)

  • Replace removed model.encode_array() with model.encode_session() + session.run_arrays() (birdnet library API change)
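The session-based encoding pattern might look roughly like the sketch below. Everything here is hypothetical scaffolding: `StubSession`/`StubModel` stand in for the real birdnet model, and the exact signatures of `encode_session()` and `run_arrays()` are assumptions based only on the names mentioned in this PR (including the "result squeezing" noted in the review summary).

```python
import numpy as np

class StubSession:
    """Stand-in for the session object that encode_session() would return."""
    def run_arrays(self, audio_batch):
        # The real session would run the model; here we fabricate
        # per-chunk embeddings with a singleton middle axis.
        return np.zeros((len(audio_batch), 1, 8))

class StubModel:
    """Stand-in for the birdnet model exposing the new session API."""
    def encode_session(self):
        return StubSession()

def embed(model, audio_batch):
    # Old API (removed): model.encode_array(audio_batch)
    # New pattern: open a session, run arrays, squeeze the singleton axis.
    session = model.encode_session()
    result = session.run_arrays(audio_batch)
    return np.squeeze(result, axis=1)
```

With the stub, `embed(StubModel(), np.zeros((4, 48000)))` yields an array of shape `(4, 8)`.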

Search (birdnet_analyzer/search/utils.py, search/core.py)

  • Fix SQLiteUsearchDB → SQLiteUSearchDB casing (renamed in perch-hoplite 1.0)
  • Replace embedding_id with window_id in SearchResult
  • Replace removed get_embedding_source() with get_window() + get_recording()
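The two-step lookup that replaces `get_embedding_source()` can be sketched against a stub database. The field names (`recording_id`, `offset_s`, `filename`) are assumptions chosen for illustration; the real perch-hoplite record types may differ.

```python
from dataclasses import dataclass

@dataclass
class Window:        # stand-in for a perch-hoplite window record
    recording_id: int
    offset_s: float

@dataclass
class Recording:     # stand-in for a perch-hoplite recording record
    filename: str

class StubDB:
    """Minimal fake exposing the new-style lookups."""
    def __init__(self):
        self._recordings = {1: Recording("XC12345.wav")}
        self._windows = {10: Window(recording_id=1, offset_s=3.0)}
    def get_window(self, window_id):
        return self._windows[window_id]
    def get_recording(self, recording_id):
        return self._recordings[recording_id]

def resolve_source(db, window_id):
    # Old API (removed): db.get_embedding_source(embedding_id)
    # New pattern: window -> recording two-step lookup.
    window = db.get_window(window_id)
    recording = db.get_recording(window.recording_id)
    return recording.filename, window.offset_s
```

Here `resolve_source(StubDB(), 10)` resolves window 10 back to `("XC12345.wav", 3.0)`.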

GUI (birdnet_analyzer/gui/search.py, gui/embeddings.py)

  • Same get_window() + get_recording() migration
  • Fix SQLiteUSearchDB casing

Tests (tests/embeddings/test_embeddings.py)

  • Update mock for get_embeddings() to return a proper AcousticFileEncodingResult-like object with segment_duration_s, overlap_duration_s, n_inputs, embeddings, embeddings_masked, inputs, input_durations
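A minimal mock carrying exactly the attributes listed above can be built with `types.SimpleNamespace`. The attribute names come from this PR description; the shapes and values are purely illustrative.

```python
import numpy as np
from types import SimpleNamespace

def make_mock_encoding_result(n_segments=3, dim=8):
    """Mimic an AcousticFileEncodingResult-like object with the
    attributes the updated test expects (names from the PR)."""
    return SimpleNamespace(
        segment_duration_s=3.0,
        overlap_duration_s=0.0,
        n_inputs=1,
        embeddings=np.zeros((1, n_segments, dim)),
        embeddings_masked=np.ones((1, n_segments), dtype=bool),
        inputs=["recording.wav"],
        input_durations=[9.0],
    )
```

A test can then patch `get_embeddings()` to return `make_mock_encoding_result()` instead of the old raw-array mock.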

Testing

  • 373 tests pass, no regressions introduced (7 pre-existing failures in tests/analyze/test_analyze.py and tests/test_utils.py are unrelated to this PR and also fail on the base birdnet-lib branch)
  • tests/embeddings/test_embeddings.py passes with updated mock
  • Verified embedding creation with a real audio dataset (100 recordings → 495 windows)
  • Verified search and CSV export functionality

Copilot AI review requested due to automatic review settings February 16, 2026 17:56

Copilot AI left a comment

Pull request overview

This PR migrates BirdNET-Analyzer from the deprecated perch-hoplite EmbeddingSource API to the new v1.0.0 Deployment → Recording → Window data model. The migration includes API renames (e.g., SQLiteUsearchDB → SQLiteUSearchDB), replacement of removed methods (model.encode_array() → model.encode_session() + session.run_arrays()), and implementation of ghost segment filtering to prevent invalid padded segments from being inserted into the database.

Changes:

  • Migrated embedding pipeline to use new deployment/recording/window hierarchy with improved resume support via handle_duplicates="skip"
  • Fixed model utilities to use new birdnet library encoding session API
  • Updated search and GUI components to retrieve window and recording data using the new perch-hoplite 1.0 API

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/embeddings/test_embeddings.py | Updated mock to return a proper encoding result structure with the required attributes |
| birdnet_analyzer/search/utils.py | Fixed SQLiteUSearchDB casing and migrated from embedding_id to window_id |
| birdnet_analyzer/search/core.py | Updated to use get_window() and get_recording() instead of the removed get_embedding_source() |
| birdnet_analyzer/model_utils.py | Replaced encode_array() with encode_session() + run_arrays() and added result squeezing |
| birdnet_analyzer/gui/search.py | Updated GUI search to use the new window/recording API |
| birdnet_analyzer/gui/embeddings.py | Fixed SQLiteUSearchDB casing in database creation |
| birdnet_analyzer/embeddings/core.py | Comprehensive rewrite to use the deployment/recording/window model with ghost segment filtering |


LimitlessGreen and others added 4 commits February 17, 2026 14:59
Migrate from deprecated perch-hoplite API (EmbeddingSource model) to the
new Deployment → Recording → Window data model introduced in v1.0.0.

Changes:
- embeddings/core.py: Rewrite embedding pipeline to use
  insert_deployment/insert_recording/insert_window instead of
  insert_embedding+EmbeddingSource. Add ghost segment filtering
  for birdnet's padded AcousticFileEncodingResult. Use
  handle_duplicates="skip" for resume support.
- model_utils.py: Replace removed encode_array() with
  encode_session()+run_arrays() API.
- search/utils.py: Fix SQLiteUsearchDB → SQLiteUSearchDB casing,
  replace embedding_id with window_id in SearchResult.
- search/core.py: Use get_window()+get_recording() instead of
  removed get_embedding_source().
- gui/search.py: Same get_window()+get_recording() migration.
- gui/embeddings.py: Fix SQLiteUSearchDB casing.
- tests/embeddings/test_embeddings.py: Update mock to match new
  AcousticFileEncodingResult structure.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
LimitlessGreen force-pushed the fix/perch-hoplite-1.0-api-compat branch from 8e5d479 to 289c346 on February 17, 2026 14:01

LimitlessGreen commented Feb 17, 2026

I rebased it. Just note that this is for the current WIP birdnet-team:birdnet-lib (#867) branch. Since the refactoring to birdnetlib is ongoing, this should not interfere with it.

mschulist (Contributor) commented:

If possible, you will see significant performance improvements using insert_windows_batch instead of insert_window (about a 5x improvement). Perhaps you could do a single batch per file (which would also remove the redundant get_all_recordings calls)?
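The suggested batching can be illustrated with a stub database that counts write calls. `insert_windows_batch` is named in this comment as a real perch-hoplite method, but its signature here, and the counting stub, are assumptions for illustration only.

```python
class CountingDB:
    """Stub DB that counts write calls to contrast the two ingest styles."""
    def __init__(self):
        self.calls = 0
    def insert_window(self, window):
        self.calls += 1          # one round trip per window
    def insert_windows_batch(self, windows):
        self.calls += 1          # one round trip per file

def ingest_per_window(db, files):
    # Old style: one insert_window call per window.
    for windows in files:
        for w in windows:
            db.insert_window(w)
    return db.calls

def ingest_per_file(db, files):
    # Suggested style: one insert_windows_batch call per file.
    for windows in files:
        db.insert_windows_batch(windows)
    return db.calls
```

For two files with 2 and 1 windows, the per-window style issues 3 write calls while the per-file batch style issues 2; with realistic window counts per file the gap is far larger.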

In addition, you'll see a significant speedup if you use the USearch index to perform the ANN/KNN instead of the brute-force approach currently used. This might require changing the interface, because the search metric must be defined at DB creation; however, I imagine most people use the inner-product metric anyway...

LimitlessGreen (Author) commented:

@mschulist Thanks a lot for the pointers here, they were super helpful. I implemented both suggestions and ran benchmarks.

What I changed

  1. I switched embedding writes to insert_windows_batch (instead of per-window inserts).
  2. I added USearch ANN for score_function=dot when the DB metric is IP (with brute-force fallback otherwise).

Benchmark setup

  • 100 WAV files from a representative sample subset
  • Same machine and DB configuration for before/after comparison

Results

| Workload | Insert (before → after) | Insert speedup | Search dot (before → after) | Search speedup |
| --- | --- | --- | --- | --- |
| 30 segments/file (3 runs) | 20.84 s → 20.69 s | 1.01x | 0.1245 s → 0.0031 s | 40.37x |
| 60 segments/file (2 runs) | 85.19 s → 80.84 s | 1.05x | 0.1556 s → 0.0042 s | 36.86x |
| 90 segments/file (1 run) | 189.11 s → 185.40 s | 1.02x | 0.2414 s → 0.0056 s | 42.74x |

Note on score-function interfaces

ANN is metric-bound at DB/index level, while score function is currently selected per query.
So right now I use ANN for compatible combinations (dot + IP), and I keep brute-force fallback for other combinations (cosine/euclidean) to preserve correctness.
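The compatibility rule described above reduces to a small dispatch function. This is a sketch of the decision only: the string labels for score functions, metrics, and backends are hypothetical names, not identifiers from the actual codebase.

```python
def choose_search_backend(score_function, db_metric):
    """Use the USearch ANN index only when the per-query score function
    matches the metric the index was built with; otherwise fall back to
    exact brute-force search to preserve correctness."""
    if score_function == "dot" and db_metric == "ip":
        return "usearch_ann"
    return "brute_force"  # cosine/euclidean queries, or a metric mismatch
```

So a `dot` query against an inner-product index takes the fast ANN path, while `cosine` or `euclidean` queries always fall back to brute force.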

Why insert gain is modest (my current hypothesis)

insert_windows_batch removes some overhead (for example, fewer repeated recording lookups), but the dominant cost still seems to be the per-window DB/index writes and duplicate handling, so the ingest improvement is measurable but small in this setup.

I’m still thinking through the cleanest interface changes for cosine/euclidean (and whether to expose backend choice more explicitly), so I can make that behavior clearer and less surprising.

mschulist (Contributor) commented:

Yeah it is a bit unfortunate that there is so much overhead with insert_windows_batch when checking for duplicates... But at least the indexing is fast!

Josef-Haupt (Member) commented:

Looks good!

max-mauermann (Member) commented:

Looks good to me also.
Thanks for providing this!

max-mauermann merged commit 5cb7939 into birdnet-team:birdnet-lib on Feb 26, 2026