fix: update to perch-hoplite 1.0.0 API (Deployment → Recording → Window) #871
Pull request overview
This PR migrates BirdNET-Analyzer from the deprecated perch-hoplite EmbeddingSource API to the new v1.0.0 Deployment → Recording → Window data model. The migration includes API renames (e.g., SQLiteUsearchDB → SQLiteUSearchDB), replacement of removed methods (model.encode_array() → model.encode_session() + session.run_arrays()), and implementation of ghost segment filtering to prevent invalid padded segments from being inserted into the database.
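For readers unfamiliar with the new model, the hierarchy can be pictured roughly as below. This is a minimal sketch under assumptions: the dataclasses, the `ToyStore` class, and its `add_*` / `locate` methods are hypothetical stand-ins, not perch-hoplite's classes or signatures. It only illustrates why a search hit stored as a window ID needs two lookups (window, then recording) to recover a file name and time range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    id: int
    name: str


@dataclass(frozen=True)
class Recording:
    id: int
    deployment_id: int
    filename: str


@dataclass(frozen=True)
class Window:
    id: int
    recording_id: int
    start_s: float
    end_s: float


class ToyStore:
    """Hypothetical in-memory store; NOT the perch-hoplite database class."""

    def __init__(self):
        self.deployments = {}
        self.recordings = {}
        self.windows = {}

    def add_deployment(self, name):
        dep = Deployment(len(self.deployments) + 1, name)
        self.deployments[dep.id] = dep
        return dep.id

    def add_recording(self, deployment_id, filename):
        if deployment_id not in self.deployments:
            raise KeyError(f"unknown deployment {deployment_id}")
        rec = Recording(len(self.recordings) + 1, deployment_id, filename)
        self.recordings[rec.id] = rec
        return rec.id

    def add_window(self, recording_id, start_s, end_s):
        if recording_id not in self.recordings:
            raise KeyError(f"unknown recording {recording_id}")
        win = Window(len(self.windows) + 1, recording_id, start_s, end_s)
        self.windows[win.id] = win
        return win.id

    def locate(self, window_id):
        # Two hops: window -> recording, mirroring the window/recording lookups
        # that replace the old single embedding-source lookup.
        window = self.windows[window_id]
        recording = self.recordings[window.recording_id]
        return recording.filename, window.start_s, window.end_s


store = ToyStore()
dep_id = store.add_deployment("site-a")
rec_id = store.add_recording(dep_id, "a.wav")
win_id = store.add_window(rec_id, 3.0, 6.0)
located = store.locate(win_id)
```

In this toy version the embedding itself is left out; the point is only the parent/child relationship between the three record types.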
Changes:
- Migrated embedding pipeline to use new deployment/recording/window hierarchy with improved resume support via handle_duplicates="skip"
- Fixed model utilities to use new birdnet library encoding session API
- Updated search and GUI components to retrieve window and recording data using the new perch-hoplite 1.0 API
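The resume behaviour mentioned in the first bullet can be illustrated with a skip-on-duplicate insert. The sketch below is an assumption-laden stand-in built on the standard-library `sqlite3` module: the `windows` table, its uniqueness key, and the helper names are hypothetical and do not reflect how perch-hoplite implements handle_duplicates="skip" internally. It only shows the idea that re-running an interrupted job re-inserts nothing that is already stored.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a toy window table with a uniqueness constraint (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS windows (
            id INTEGER PRIMARY KEY,
            recording TEXT NOT NULL,
            start_s REAL NOT NULL,
            end_s REAL NOT NULL,
            UNIQUE (recording, start_s, end_s)
        )
        """
    )
    return conn


def insert_window_skip_duplicates(conn, recording, start_s, end_s):
    """Insert a window; return True if a row was added, False if it already existed."""
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO windows (recording, start_s, end_s) VALUES (?, ?, ?)",
        (recording, start_s, end_s),
    )
    return conn.total_changes > before


conn = open_store()

# First run is interrupted after three windows.
first_run = [("a.wav", 0.0, 3.0), ("a.wav", 3.0, 6.0), ("b.wav", 0.0, 3.0)]
added_first = sum(insert_window_skip_duplicates(conn, *w) for w in first_run)

# Second run replays everything plus the one window that was still missing.
second_run = first_run + [("b.wav", 3.0, 6.0)]
added_second = sum(insert_window_skip_duplicates(conn, *w) for w in second_run)

total_rows = conn.execute("SELECT COUNT(*) FROM windows").fetchone()[0]
```

The design choice here is to let the database enforce uniqueness rather than checking for existing rows in application code, which keeps a resumed run idempotent.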
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/embeddings/test_embeddings.py | Updated mock to return proper encoding result structure with required attributes |
| birdnet_analyzer/search/utils.py | Fixed SQLiteUSearchDB casing and migrated from embedding_id to window_id |
| birdnet_analyzer/search/core.py | Updated to use get_window() and get_recording() instead of removed get_embedding_source() |
| birdnet_analyzer/model_utils.py | Replaced encode_array() with encode_session() + run_arrays() and added result squeezing |
| birdnet_analyzer/gui/search.py | Updated GUI search to use new window/recording API |
| birdnet_analyzer/gui/embeddings.py | Fixed SQLiteUSearchDB casing in database creation |
| birdnet_analyzer/embeddings/core.py | Comprehensive rewrite to use deployment/recording/window model with ghost segment filtering |
Migrate from deprecated perch-hoplite API (EmbeddingSource model) to the new Deployment → Recording → Window data model introduced in v1.0.0.
Changes:
- embeddings/core.py: Rewrite embedding pipeline to use insert_deployment/insert_recording/insert_window instead of insert_embedding+EmbeddingSource. Add ghost segment filtering for birdnet's padded AcousticFileEncodingResult. Use handle_duplicates="skip" for resume support.
- model_utils.py: Replace removed encode_array() with encode_session()+run_arrays() API.
- search/utils.py: Fix SQLiteUsearchDB → SQLiteUSearchDB casing, replace embedding_id with window_id in SearchResult.
- search/core.py: Use get_window()+get_recording() instead of removed get_embedding_source().
- gui/search.py: Same get_window()+get_recording() migration.
- gui/embeddings.py: Fix SQLiteUSearchDB casing.
- tests/embeddings/test_embeddings.py: Update mock to match new AcousticFileEncodingResult structure.
…uard os.makedirs against empty dirname
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
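The guard named in the commit title above addresses a common pitfall: `os.path.dirname()` returns an empty string for a bare file name, and `os.makedirs("")` raises `FileNotFoundError`. A minimal sketch of such a guard follows; the helper name `ensure_parent_dir` is hypothetical and is not taken from the PR.

```python
import os


def ensure_parent_dir(path):
    """Create the parent directory of `path` if the path has one.

    Returns the parent directory string ("" for a bare file name).
    """
    parent = os.path.dirname(path)
    if parent:  # "" for a bare file name such as "results.csv"
        os.makedirs(parent, exist_ok=True)
    return parent


# A bare file name has no directory component, so nothing is created
# and no exception is raised.
bare_parent = ensure_parent_dir("results.csv")
```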
8e5d479 to 289c346
I rebased it. Just note that this is for the current WIP birdnet-team:birdnet-lib (#867) branch. Since the refactoring to birdnetlib is ongoing, this should not interfere with it.
If it is possible, you will see significant performance improvements using
In addition, you'll also see a significant speedup if you use the USearch index to perform the ANN/KNN instead of the brute force approach it currently uses. This might require changing the interface because the search metric must be defined during the db creation; however, I imagine most people are using the inner product metric anyways...
@mschulist Thanks a lot for the pointers here, they were super helpful. I implemented both suggestions and ran benchmarks.
What I changed
Benchmark setup
Results
Note on score-function interfaces
ANN is metric-bound at DB/index level, while score function is currently selected per query.
Why insert gain is modest (my current hypothesis)
I’m still thinking through the cleanest interface changes for
Yeah it is a bit unfortunate that there is so much overhead with
Looks good!
Looks good to me also.
Summary
Migrate from the deprecated perch-hoplite API (EmbeddingSource model) to the new Deployment → Recording → Window data model introduced in perch-hoplite v1.0.0.
The previous code used EmbeddingSource, get_embedding_source(), insert_embedding(), and SQLiteUsearchDB (lowercase "s"), all of which have been removed or renamed in perch-hoplite 1.0.0.
Changes
Embedding pipeline (birdnet_analyzer/embeddings/core.py)
- Use insert_deployment() / insert_recording() / insert_window() instead of the removed insert_embedding() + EmbeddingSource
- Ghost segment filtering: birdnet pads shorter files in a batch to match max_n_segments, and not all padded segments are masked. The pipeline now additionally skips segments where s_start >= input_durations[i] and clamps s_end = min(s_end, file_dur) to avoid inserting phantom windows
- Use handle_duplicates="skip" on insert_window() for resume support
- Update create_csv_output() to use match_window_ids() + get_window() + get_recording()
Model utilities (birdnet_analyzer/model_utils.py)
- Replace model.encode_array() with model.encode_session() + session.run_arrays() (birdnet library API change)
Search (birdnet_analyzer/search/utils.py, search/core.py)
- Fix SQLiteUsearchDB → SQLiteUSearchDB casing (renamed in perch-hoplite 1.0)
- Replace embedding_id with window_id in SearchResult
- Replace get_embedding_source() with get_window() + get_recording()
GUI (birdnet_analyzer/gui/search.py, gui/embeddings.py)
- Same get_window() + get_recording() migration
- Fix SQLiteUSearchDB casing
Tests (tests/embeddings/test_embeddings.py)
- Update the get_embeddings() mock to return a proper AcousticFileEncodingResult-like object with segment_duration_s, overlap_duration_s, n_inputs, embeddings, embeddings_masked, inputs, input_durations
Testing
- Failures in tests/analyze/test_analyze.py and tests/test_utils.py are unrelated to this PR and also occur on the base birdnet-lib branch
- tests/embeddings/test_embeddings.py passes with updated mock
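
The ghost segment filtering described under "Embedding pipeline" can be sketched as follows. This is an illustration under assumptions, not the code from embeddings/core.py: the function name `valid_windows`, the plain-list layout of the mask, and the way segment start times are derived from a fixed hop are all hypothetical. Only the two checks (skip when s_start >= input_durations[i], clamp s_end to the file duration) come from the description above.

```python
def valid_windows(input_durations, n_segments, masked,
                  segment_duration_s=3.0, overlap_duration_s=0.0):
    """Yield (file_index, s_start, s_end) for segments that contain real audio.

    input_durations: duration in seconds of each file in the batch.
    n_segments: segments per file after padding to the longest file.
    masked: masked[i][j] is True when segment j of file i is flagged as padding.
    """
    hop = segment_duration_s - overlap_duration_s
    if hop <= 0:
        raise ValueError("overlap must be smaller than the segment duration")

    for i, file_dur in enumerate(input_durations):
        for j in range(n_segments):
            if masked[i][j]:
                continue  # padding that the mask already identifies
            s_start = j * hop
            if s_start >= file_dur:
                continue  # ghost segment: unmasked padding past the end of the file
            # Clamp the final partial segment to the real file length.
            s_end = min(s_start + segment_duration_s, file_dur)
            yield i, s_start, s_end


# Two files of 9 s and 4 s, padded to 3 segments each; the mask flags nothing,
# which is the situation the extra duration check guards against.
no_mask = [[False, False, False], [False, False, False]]
windows = list(valid_windows([9.0, 4.0], 3, no_mask))
```

In this example the 4 s file contributes two windows, 0 to 3 s and 3 to 4 s, and its third, fully padded segment is dropped.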