Bug: cannot query metadata with unsupported annotations#653
Bug: cannot query metadata with unsupported annotations#653
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis change introduces support for handling unsupported metadata annotations by adding a new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
ⓘ You are approaching your monthly quota for Qodo. Upgrade your plan Review Summary by QodoFix unsupported annotations breaking queries and double-loading
WalkthroughsDescription• Handle unsupported metadata annotations gracefully without breaking queries • Create UnsupportedMetadataAnnotations subclass to prevent double-parsing errors • Add defensive checks to prevent re-processing already-converted annotation fields • Improve error messaging for unsupported or invalid annotations Diagramflowchart LR
A["Unsupported Annotation"] --> B["UnsupportedMetadataAnnotations"]
B --> C["Prevents Double-Parsing"]
C --> D["Query Succeeds"]
E["Defensive Checks"] --> F["Skip MetadataAnnotations"]
F --> D
File Changes1. dagshub/data_engine/annotation/metadata.py
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/data_engine/annotation_import/res/audio_annotation.json (1)
44-71: Consider using fake PII in test fixtures.The test data contains what appears to be a real email address and name. Even in test fixtures, it's better practice to use obviously fake data (e.g.,
test@example.com,John,Doe) to avoid any compliance concerns around PII in source control.dagshub/data_engine/util/not_implemented.py (1)
9-12: Edge case:dir(base)includes inherited attributes fromobjectand all ancestors.
dir(base)returns attributes from the entire MRO, not justbaseitself. IfMetadataAnnotationsinherits public methods from other classes in the future, those will also be replaced withNotImplementedErrorin the subclass. This is likely the desired behavior, but worth being aware of.Also,
staticmethodandclassmethoddescriptors from the base are not detected separately — they'll be seen ascallableand wrapped as instance methods, which changes their calling convention. Currently not a problem sincefrom_ls_task(a@staticmethod) is always called on the base class directly, but could be surprising if the pattern is reused.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
dagshub/data_engine/annotation/metadata.pydagshub/data_engine/model/query_result.pydagshub/data_engine/util/__init__.pydagshub/data_engine/util/not_implemented.pytests/data_engine/annotation_import/res/audio_annotation.jsontests/data_engine/annotation_import/test_annotation_parsing.py
🧰 Additional context used
🧬 Code graph analysis (2)
dagshub/data_engine/annotation/metadata.py (3)
dagshub/common/helpers.py (1)
log_message(100-107)dagshub/data_engine/util/not_implemented.py (1)
NotImplementedMeta(1-48)dagshub/data_engine/model/datapoint.py (1)
Datapoint(29-263)
tests/data_engine/annotation_import/test_annotation_parsing.py (2)
dagshub/data_engine/annotation/metadata.py (7)
MetadataAnnotations(36-319)UnsupportedMetadataAnnotations(322-339)add_image_bbox(146-183)value(104-114)value(332-333)to_ls_task(88-101)to_ls_task(335-336)dagshub/data_engine/model/query_result.py (1)
get_blob_fields(334-437)
🪛 Ruff (0.15.0)
tests/data_engine/annotation_import/test_annotation_parsing.py
[warning] 64-64: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Agent
- GitHub Check: build (3.11)
- GitHub Check: build (3.10)
- GitHub Check: build (3.9)
- GitHub Check: build (3.13)
- GitHub Check: build (3.12)
🔇 Additional comments (5)
dagshub/data_engine/annotation/metadata.py (1)
322-339: LGTM!The
UnsupportedMetadataAnnotationsclass correctly wraps unsupported annotations while preservingvalueandto_ls_taskbehavior. The metaclass will replace all non-overridden public methods (likeadd_image_bbox,add_image_segmentation, etc.) withNotImplementedError, which is the intended behavior.dagshub/data_engine/model/query_result.py (2)
429-435: Good defensive guard against re-processing parsed annotations as documents.This correctly prevents the
TypeError: expected str, bytes or os.PathLike object, not MetadataAnnotationsby skipping fields that were already parsed as annotations. Solid fix for Bug 2.
449-463: LGTM — annotation conversion guards and unsupported annotation fallback.The
isinstanceguard at line 451 prevents double-loading, and theValidationErrorcatch at line 460 gracefully wraps invalid annotations inUnsupportedMetadataAnnotationsinstead of crashing.tests/data_engine/annotation_import/test_annotation_parsing.py (2)
99-115: LGTM — good coverage of basic annotation parsing and double-load safety.
test_double_loading_annotation_worksat line 104 is a valuable regression test ensuring that callingget_blob_fieldsa second time doesn't crash or corrupt the already-parsed annotations.
118-134: Good test coverage for unsupported annotations.Verifies the
UnsupportedMetadataAnnotationssubclass behavior: isinstance checks,NotImplementedErroron unsupported methods, and content preservation viavalue/to_ls_task.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
There was a problem hiding this comment.
Pull request overview
Fixes Data Engine metadata blob handling so unsupported/invalid Label Studio annotations no longer break Datasource.all() / QueryResult.get_annotations(), and adds regression tests for double-loading behavior.
Changes:
- Add
UnsupportedMetadataAnnotationsto represent annotations that fail parsing while keeping consistentMetadataAnnotationstyping. - Harden
QueryResult.get_blob_fields()conversions to avoid re-processing already-parsed annotation fields and improve handling of path-vs-bytes blobs. - Extend annotation parsing tests and add an “unsupported” annotation fixture.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
dagshub/data_engine/model/query_result.py |
Adjusts blob/document/annotation conversion logic and introduces unsupported-annotation wrapper usage. |
dagshub/data_engine/annotation/metadata.py |
Adds UnsupportedMetadataAnnotations type for unsupported/invalid LS tasks. |
dagshub/data_engine/util/not_implemented.py |
Introduces a metaclass to block inherited methods/properties on unsupported annotations. |
tests/data_engine/annotation_import/test_annotation_parsing.py |
Adds regression tests for double-loading and unsupported annotations. |
tests/data_engine/annotation_import/res/audio_annotation.json |
Adds a fixture representing an unsupported (audio) LS task. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Defensive check to not mangle annotation fields by accident | ||
| if isinstance(dp.metadata[fld], MetadataAnnotations): | ||
| continue | ||
| # Force load the content into memory, even if load_into_memory was set to False | ||
| if not load_into_memory or isinstance(dp.metadata[fld], Path): | ||
| dp.metadata[fld] = Path(dp.metadata[fld]).read_bytes() | ||
| dp.metadata[fld] = dp.metadata[fld].decode("utf-8") |
There was a problem hiding this comment.
When _get_blob fails it returns a string like "Error while downloading binary blob: ..." (see Datapoint._get_blob). In that case this conversion path will call Path(<error string>).read_bytes() and raise FileNotFoundError, turning a download warning into a hard failure. Add a guard before Path(...).read_bytes() to detect error strings / non-path values (or make _get_blob raise and handle it), and skip decoding/conversion for that field when download failed.
| # Already loaded - skip | ||
| if isinstance(dp.metadata[fld], MetadataAnnotations): | ||
| continue | ||
| # Override the load_into_memory flag, because we need the contents | ||
| if not load_into_memory: | ||
| # Force load the content into memory, even if load_into_memory was set to False | ||
| if not load_into_memory or isinstance(dp.metadata[fld], Path): | ||
| dp.metadata[fld] = Path(dp.metadata[fld]).read_bytes() | ||
| try: | ||
| dp.metadata[fld] = MetadataAnnotations.from_ls_task( | ||
| datapoint=dp, field=fld, ls_task=dp.metadata[fld] | ||
| ) | ||
| except ValidationError: | ||
| dp.metadata[fld] = UnsupportedMetadataAnnotations( | ||
| datapoint=dp, field=fld, original_value=dp.metadata[fld] | ||
| ) |
There was a problem hiding this comment.
Similarly for annotation conversion: if _get_blob returned an error string (or any unexpected non-path, non-bytes value), Path(dp.metadata[fld]).read_bytes() will raise and the exception is not caught (only ValidationError is handled). Add a type/value guard before reading from disk, and consider treating download failures separately from ValidationError so unsupported/invalid annotations don’t crash get_blob_fields().
| @@ -0,0 +1,48 @@ | |||
| class NotImplementedMeta(type): | |||
| """ | |||
| A metaclass that replaces all parent class method and properties that aren't overridden in the subclass | |||
There was a problem hiding this comment.
Docstring grammar: “method and properties” should be “methods and properties”.
| A metaclass that replaces all parent class method and properties that aren't overridden in the subclass | |
| A metaclass that replaces all parent class methods and properties that aren't overridden in the subclass |
| # Get all attributes from base classes | ||
| for base in bases: | ||
| for attr_name in dir(base): | ||
| if attr_name.startswith("_"): | ||
| continue |
There was a problem hiding this comment.
NotImplementedMeta iterates over dir(base) which includes attributes coming from type (e.g. mro) and other inherited members that are not actually defined on the parent class. This means the metaclass can unintentionally overwrite built-in class/introspection behavior on subclasses. Consider iterating over base.__dict__ instead (and handling staticmethod/classmethod descriptors explicitly) to only wrap methods/properties actually declared on the parent class.
Code Review by Qodo
✅ 1.
|
| return self._original_value | ||
|
|
||
| def __repr__(self): | ||
| return "Unsupported LS annotations" |
There was a problem hiding this comment.
| return "Unsupported LS annotations" | |
| return "Label Studio annotations of unrecognized type" |
| # Force load the content into memory, even if load_into_memory was set to False | ||
| if not load_into_memory or isinstance(dp.metadata[fld], Path): |
There was a problem hiding this comment.
I don't get the condition and the comment.
The comment says to ignore load_into_memory - then why do you check it? I assume it's because when load_into_memory == True, then in some earlier part of the code the field was already read as bytes. Then why not just assert that? I guess maybe because then the field content would be string or bytes and you can't know for sure whether that's the blob content or a path without relying on the assumed earlier side effects of the field (would be solved by the type safety we discussed, but out of scope in this PR).
Assuming everything I've guessed is correct - I think a comment explaining your assumptions would really help to make this code understandable
| # Defensive check to not mangle annotation fields by accident | ||
| if isinstance(dp.metadata[fld], MetadataAnnotations): | ||
| continue | ||
| # Force load the content into memory, even if load_into_memory was set to False | ||
| if not load_into_memory or isinstance(dp.metadata[fld], Path): | ||
| dp.metadata[fld] = Path(dp.metadata[fld]).read_bytes() | ||
| dp.metadata[fld] = dp.metadata[fld].decode("utf-8") |
| # Force load the content into memory, even if load_into_memory was set to False | ||
| if not load_into_memory or isinstance(dp.metadata[fld], Path): |
There was a problem hiding this comment.
This condition should probably be refactored to a function and made DRY
| ) | ||
| bad_annotations[fld].append(dp.path) | ||
| else: | ||
| # Empty annotation container |
There was a problem hiding this comment.
I recommend flipping the if-else order to make it readable. Or even just use continue and get rid of a nesting level.
| "Warning: The following datapoints had unsupported or invalid annotations, " | ||
| "any annotation-related operations will not work on these:" |
There was a problem hiding this comment.
IMO this warning message is too scary - the users can do whatever they want with the result just fine.
If what you mean is that they won't be able to use our fancy add_bounding_box etc methods - then just say that. The users don't know what you mean.
This PR addresses two bugs:
Bug 1: unsupported metadata annotations break
ds.all(), if they were also tagged as document.Reproduction:
DOCUMENTtag on it (will happen automatically right now on DagsHub)ds.all()Expected:
The warning shows up, but otherwise the program continues execution.
Actual:
Following stack trace, that happens due to trying to read the annotation content a second time:
Solution:
Created a new subclass of
MetadataAnnotationsnamedUnsupportedMetadataAnnotations, that prevents interacting with the annotations, but still allows to get the value via.valueand.to_ls_task(), making it soqr.get_annotations()acts consistently.No matter if an annotation is valid or not, all metadata values of this field will be of
MetadataAnnotationsThis makes it so that trying to load the annotation a second time will fall through due to this line:
client/dagshub/data_engine/model/query_result.py
Line 451 in 511801c
This bug has a reproduction test added, that fails on current main, and is fixed in this PR
Bug 2: Possible to load an annotation as a document a second time, despite the annotation already having been parsed
I couldn't get an actual reproduction for it, aside from the stack trace that was throwing in a similar way, but in a function
get_annotations, leading toget_blob_fields:client/dagshub/data_engine/model/query_result.py
Line 430 in 4e254d4
Solution:
Added a safety guard against it at this line, and added a test that loads the annotations a second time.
Implementation Details
Core Components
NotImplementedMeta Metaclass (
dagshub/data_engine/util/not_implemented.py)A new metaclass that automatically creates stubs for inherited methods and properties. For each attribute not overridden in a subclass, it either injects a property that raises
NotImplementedErroron access (for properties) or a method that raisesNotImplementedErrorwhen called (for callables). This enables subclasses to explicitly indicate unsupported functionality while maintaining a consistent interface.UnsupportedMetadataAnnotations Class (
dagshub/data_engine/annotation/metadata.py)A new subclass of
MetadataAnnotationsusingNotImplementedMetaas its metaclass. It wraps unsupported annotation values (e.g., audio metadata) and exposes them throughvalueandto_ls_task()properties without attempting full metadata annotation processing. This allows consistent handling of unsupported annotations across the codebase while preventing type errors when accessing unsupported features.Query Processing Improvements (
dagshub/data_engine/model/query_result.py)The blob and annotation loading logic now includes safety guards:
MetadataAnnotationsinstances, the loader skips further processing to prevent double-loading errorsUnsupportedMetadataAnnotationsinstead of raising an exception, ensuring downstream code can still access the raw annotation dataload_into_memoryis False or when the existing field is a Path, then decoded to UTF-8Test Coverage
Tests now exercise the critical double-loading scenario with a new fixture (
ds_with_unsupported_annotation) and explicit multi-load verification. Test mocks were enhanced to support both bytes and PathLike return types, allowing more realistic blob retrieval simulation. Audio annotation test data (audio_annotation.json) provides a concrete example of unsupported metadata for testing purposes.