
Bug: cannot query metadata with unsupported annotations#653

Open
kbolashev wants to merge 4 commits into main from bug/document-annotation-fields

Conversation

kbolashev (Member) commented Feb 15, 2026

This PR addresses two bugs:

Bug 1: unsupported metadata annotations break ds.all() if they are also tagged as documents.

Reproduction:

  • Create datasource and upload metadata that is not currently supported by the converter (e.g. audio metadata)
  • Make sure that the metadata has the DOCUMENT tag on it (this currently happens automatically on DagsHub)
  • Try to do ds.all()

Expected:
The warning shows up, but otherwise the program continues execution.

Actual:
The following stack trace appears, caused by trying to read the annotation content a second time:

  File "/Users/kirillbolashev/Job/GitRepos/client-test-suite/manual_tests/bugs/querying/autoloading_bugs.py", line 10, in load_audio_dataset
    qr = ds.all()
  File "/Users/kirillbolashev/Job/GitRepos/client/dagshub/data_engine/model/datasource.py", line 337, in all
    return ds.fetch(load_documents=load_documents, load_annotations=load_annotations)
           ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kirillbolashev/Job/GitRepos/client/dagshub/data_engine/model/datasource.py", line 289, in fetch
    res._load_autoload_fields(documents=load_documents, annotations=load_annotations)
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kirillbolashev/Job/GitRepos/client/dagshub/data_engine/model/query_result.py", line 1105, in _load_autoload_fields
    self.get_annotations()
    ~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/kirillbolashev/Job/GitRepos/client/dagshub/data_engine/model/query_result.py", line 679, in get_annotations
    self.get_blob_fields(*self.annotation_fields, **kwargs)
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kirillbolashev/Job/GitRepos/client/dagshub/data_engine/model/query_result.py", line 418, in get_blob_fields
    self._convert_annotation_fields(*fields, load_into_memory=load_into_memory)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kirillbolashev/Job/GitRepos/client/dagshub/data_engine/model/query_result.py", line 451, in _convert_annotation_fields
    dp.metadata[fld] = Path(dp.metadata[fld]).read_bytes()
                       ~~~~^^^^^^^^^^^^^^^^^^
  File "/Users/kirillbolashev/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py", line 503, in __init__
    super().__init__(*args)
    ~~~~~~~~~~~~~~~~^^^^^^^
  File "/Users/kirillbolashev/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py", line 132, in __init__
    raise TypeError(
    ...<2 lines>...
        f"not {type(path).__name__!r}")
TypeError: argument should be a str or an os.PathLike object where __fspath__ returns a str, not 'bytes'
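The root cause is easy to see in isolation: pathlib refuses bytes input. A minimal standalone illustration (not the client's code):

```python
from pathlib import Path

# Minimal illustration of the root cause: once blob content has already been
# read into bytes, feeding it to Path() again raises the TypeError above.
blob = b'{"annotations": "already-loaded content"}'
try:
    Path(blob).read_bytes()
except TypeError as err:
    print(err)  # bytes are rejected at Path construction time
```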

Solution:

Created a new subclass of MetadataAnnotations named UnsupportedMetadataAnnotations that prevents interacting with the annotations but still allows getting the value via .value and .to_ls_task(), so that qr.get_annotations() behaves consistently.
Whether or not an annotation is valid, every metadata value of this field will be an instance of MetadataAnnotations.

This means that an attempt to load the annotation a second time falls through harmlessly because of this line:

if isinstance(dp.metadata[fld], MetadataAnnotations):

A reproduction test for this bug is included; it fails on current main and passes with this PR.

Bug 2: Possible to load an annotation as a document a second time, despite the annotation already having been parsed

I couldn't get an actual reproduction for this one, aside from a stack trace that failed in a similar way, except inside get_annotations, which leads into get_blob_fields:

dp.metadata[fld] = Path(dp.metadata[fld]).read_bytes()

TypeError: expected str, bytes or os.PathLike object, not MetadataAnnotations

Solution:

Added a safety guard against it at this line, and added a test that loads the annotations a second time.

Implementation Details

Core Components

NotImplementedMeta Metaclass (dagshub/data_engine/util/not_implemented.py)
A new metaclass that automatically creates stubs for inherited methods and properties. For each attribute not overridden in a subclass, it either injects a property that raises NotImplementedError on access (for properties) or a method that raises NotImplementedError when called (for callables). This enables subclasses to explicitly indicate unsupported functionality while maintaining a consistent interface.

UnsupportedMetadataAnnotations Class (dagshub/data_engine/annotation/metadata.py)
A new subclass of MetadataAnnotations using NotImplementedMeta as its metaclass. It wraps unsupported annotation values (e.g., audio metadata) and exposes them through value and to_ls_task() properties without attempting full metadata annotation processing. This allows consistent handling of unsupported annotations across the codebase while preventing type errors when accessing unsupported features.
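The described behavior can be sketched with simplified stand-ins; the base class and metaclass below are illustrative, not the real implementations in the dagshub package:

```python
class NotImplementedMeta(type):
    # Simplified stand-in for the metaclass described above: stub out every
    # inherited public method/property that the subclass does not override.
    def __new__(mcs, name, bases, namespace):
        def make_stub(attr):
            def stub(self, *args, **kwargs):
                raise NotImplementedError(f"'{attr}' is not supported")
            return stub
        for base in bases:
            for attr in dir(base):
                if attr.startswith("_") or attr in namespace:
                    continue
                val = getattr(base, attr)
                if isinstance(val, property):
                    namespace[attr] = property(make_stub(attr))
                elif callable(val):
                    namespace[attr] = make_stub(attr)
        return super().__new__(mcs, name, bases, namespace)


class MetadataAnnotations:  # simplified stand-in for the real base class
    def __init__(self, original_value: bytes = b""):
        self._original_value = original_value

    @property
    def value(self):
        return self._original_value

    def to_ls_task(self):
        return self._original_value

    def add_image_bbox(self, *args, **kwargs):
        return "bbox added"


class UnsupportedMetadataAnnotations(MetadataAnnotations, metaclass=NotImplementedMeta):
    @property
    def value(self):  # explicitly overridden, so the raw content stays accessible
        return self._original_value

    def to_ls_task(self):  # consistent with the base API
        return self._original_value


ann = UnsupportedMetadataAnnotations(b'{"task": "audio"}')
assert isinstance(ann, MetadataAnnotations)  # downstream isinstance guards still match
assert ann.value == b'{"task": "audio"}'
try:
    ann.add_image_bbox()  # any unsupported operation is blocked
except NotImplementedError as err:
    print(err)
```

The key property is the last isinstance check: code that filters on MetadataAnnotations treats both supported and unsupported annotations uniformly.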

Query Processing Improvements (dagshub/data_engine/model/query_result.py)

The blob and annotation loading logic now includes safety guards:

  • Skip re-loading: When processing fields that are already MetadataAnnotations instances, the loader skips further processing to prevent double-loading errors
  • Graceful fallback: When annotation validation fails, the original value is wrapped in UnsupportedMetadataAnnotations instead of raising an exception, ensuring downstream code can still access the raw annotation data
  • Conditional memory loading: Content is force-loaded into memory when load_into_memory is False or when the existing field is a Path, then decoded to UTF-8
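The guard sequence above can be condensed into a sketch with hypothetical stand-ins for the real classes (the real code also honors the load_into_memory flag; parse_ls_task stands in for MetadataAnnotations.from_ls_task):

```python
from pathlib import Path


class MetadataAnnotations:  # stand-in for the real annotation class
    def __init__(self, raw: bytes):
        self.raw = raw


class UnsupportedMetadataAnnotations(MetadataAnnotations):
    pass


class ValidationError(Exception):
    pass


def parse_ls_task(raw: bytes) -> MetadataAnnotations:
    # Stand-in validator: real code parses a Label Studio task payload
    if not raw.startswith(b"{"):
        raise ValidationError("not a recognized LS task")
    return MetadataAnnotations(raw)


def convert_annotation_field(metadata: dict, fld: str) -> None:
    value = metadata[fld]
    # Guard 1: field was parsed by a previous call -> skip (no double-loading)
    if isinstance(value, MetadataAnnotations):
        return
    # Guard 2: content still on disk -> force it into memory first
    if isinstance(value, Path):
        value = value.read_bytes()
    # Guard 3: wrap unparseable content instead of raising
    try:
        metadata[fld] = parse_ls_task(value)
    except ValidationError:
        metadata[fld] = UnsupportedMetadataAnnotations(value)


md = {"ann": b'{"ok": true}', "audio": b"RIFF...not an LS task"}
convert_annotation_field(md, "ann")
convert_annotation_field(md, "ann")    # second call is a no-op
convert_annotation_field(md, "audio")  # wrapped instead of raising
```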

Test Coverage

Tests now exercise the critical double-loading scenario with a new fixture (ds_with_unsupported_annotation) and explicit multi-load verification. Test mocks were enhanced to support both bytes and PathLike return types, allowing more realistic blob retrieval simulation. Audio annotation test data (audio_annotation.json) provides a concrete example of unsupported metadata for testing purposes.

@kbolashev kbolashev self-assigned this Feb 15, 2026
@kbolashev kbolashev added the bug Something isn't working label Feb 15, 2026


coderabbitai bot commented Feb 15, 2026

📝 Walkthrough

This change introduces support for handling unsupported metadata annotations by adding a new UnsupportedMetadataAnnotations class backed by a NotImplementedMeta metaclass. The annotation loading logic is updated to gracefully wrap invalid annotations instead of failing, and tests are refactored to verify this behavior.

Changes

Cohort / File(s): Summary

Core Metaclass Support (dagshub/data_engine/util/not_implemented.py):
New NotImplementedMeta metaclass that injects stubs raising NotImplementedError for inherited methods and properties not explicitly overridden in subclasses.

Metadata Annotation Handling (dagshub/data_engine/annotation/metadata.py):
New UnsupportedMetadataAnnotations class extending MetadataAnnotations with the NotImplementedMeta metaclass; stores and exposes original_value without full metadata annotation support.

Query Result Processing (dagshub/data_engine/model/query_result.py):
Updated blob loading and annotation conversion logic to skip already-loaded MetadataAnnotations, conditionally force-load content into memory, and wrap validation errors in UnsupportedMetadataAnnotations instead of failing.

Test Infrastructure (tests/data_engine/annotation_import/test_annotation_parsing.py):
Refactored fixtures with a _ds_with_annotation helper; added the ds_with_unsupported_annotation fixture; updated mock_get_blob to return Union[bytes, PathLike]; added imports for QueryResult, UnsupportedMetadataAnnotations, and type hints.

Test Fixture Data (tests/data_engine/annotation_import/res/audio_annotation.json):
New JSON test fixture containing a complete audio annotation payload with metadata and result items.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 passed
  • Description Check: Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Merge Conflict Detection: Passed (no merge conflicts detected when merging into main)



Review Summary by Qodo

Fix unsupported annotations breaking queries and double-loading

🐞 Bug fix


Walkthroughs

Description
• Handle unsupported metadata annotations gracefully without breaking queries
• Create UnsupportedMetadataAnnotations subclass to prevent double-parsing errors
• Add defensive checks to prevent re-processing already-converted annotation fields
• Improve error messaging for unsupported or invalid annotations
Diagram
flowchart LR
  A["Unsupported Annotation"] --> B["UnsupportedMetadataAnnotations"]
  B --> C["Prevents Double-Parsing"]
  C --> D["Query Succeeds"]
  E["Defensive Checks"] --> F["Skip MetadataAnnotations"]
  F --> D

File Changes

1. dagshub/data_engine/annotation/metadata.py ✨ Enhancement +29/-7

Create UnsupportedMetadataAnnotations for graceful handling

• Added UnsupportedMetadataAnnotations subclass using NotImplementedMeta metaclass
• Stores original bytes value for unsupported annotations
• Implements value property and to_ls_task() method to return original content
• Prevents interaction with unsupported annotations while preserving data access


2. dagshub/data_engine/model/query_result.py 🐞 Bug fix +14/-6

Add defensive checks and unsupported annotation handling

• Added import for UnsupportedMetadataAnnotations
• Added defensive check in _get_blob_fn to skip MetadataAnnotations instances
• Added check in _convert_annotation_fields to skip already-loaded annotations
• Wrap ValidationError exceptions by creating UnsupportedMetadataAnnotations instead
• Improved warning message to mention "unsupported or invalid annotations"


3. dagshub/data_engine/util/not_implemented.py ✨ Enhancement +48/-0

Implement NotImplementedMeta metaclass for method stubbing

• New metaclass NotImplementedMeta that automatically raises NotImplementedError
• Replaces all non-overridden parent methods and properties with error-raising stubs
• Handles both regular methods and properties correctly
• Skips private attributes (starting with underscore)


4. tests/data_engine/annotation_import/test_annotation_parsing.py 🧪 Tests +54/-7

Add tests for double-loading and unsupported annotations

• Refactored mock_get_blob to support both in-memory and path-based returns
• Extracted common fixture logic into _ds_with_annotation helper function
• Added ds_with_unsupported_annotation fixture for testing unsupported annotations
• Added test_double_loading_annotation_works to verify double-loading doesn't break
• Added test_handling_unsupported_annotation to verify unsupported annotations work correctly
• Extracted annotation assertion logic into _test_annotation helper


5. tests/data_engine/annotation_import/res/audio_annotation.json 🧪 Tests +82/-0

Add audio annotation test fixture file

• New test fixture file containing audio metadata annotation in Label Studio format
• Includes audio data reference and multiple annotation types (choices, rating, textarea)
• Used to test handling of unsupported annotation types


6. dagshub/data_engine/util/__init__.py Additional files +0/-0




coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/data_engine/annotation_import/res/audio_annotation.json (1)

44-71: Consider using fake PII in test fixtures.

The test data contains what appears to be a real email address and name. Even in test fixtures, it's better practice to use obviously fake data (e.g., test@example.com, John, Doe) to avoid any compliance concerns around PII in source control.

dagshub/data_engine/util/not_implemented.py (1)

9-12: Edge case: dir(base) includes inherited attributes from object and all ancestors.

dir(base) returns attributes from the entire MRO, not just base itself. If MetadataAnnotations inherits public methods from other classes in the future, those will also be replaced with NotImplementedError in the subclass. This is likely the desired behavior, but worth being aware of.

Also, staticmethod and classmethod descriptors from the base are not detected separately — they'll be seen as callable and wrapped as instance methods, which changes their calling convention. Currently not a problem since from_ls_task (a @staticmethod) is always called on the base class directly, but could be surprising if the pattern is reused.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c964d0 and e4ace64.

📒 Files selected for processing (6)
  • dagshub/data_engine/annotation/metadata.py
  • dagshub/data_engine/model/query_result.py
  • dagshub/data_engine/util/__init__.py
  • dagshub/data_engine/util/not_implemented.py
  • tests/data_engine/annotation_import/res/audio_annotation.json
  • tests/data_engine/annotation_import/test_annotation_parsing.py
🧰 Additional context used
🧬 Code graph analysis (2)
dagshub/data_engine/annotation/metadata.py (3)
dagshub/common/helpers.py (1)
  • log_message (100-107)
dagshub/data_engine/util/not_implemented.py (1)
  • NotImplementedMeta (1-48)
dagshub/data_engine/model/datapoint.py (1)
  • Datapoint (29-263)
tests/data_engine/annotation_import/test_annotation_parsing.py (2)
dagshub/data_engine/annotation/metadata.py (7)
  • MetadataAnnotations (36-319)
  • UnsupportedMetadataAnnotations (322-339)
  • add_image_bbox (146-183)
  • value (104-114)
  • value (332-333)
  • to_ls_task (88-101)
  • to_ls_task (335-336)
dagshub/data_engine/model/query_result.py (1)
  • get_blob_fields (334-437)
🪛 Ruff (0.15.0)
tests/data_engine/annotation_import/test_annotation_parsing.py

[warning] 64-64: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Agent
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.9)
  • GitHub Check: build (3.13)
  • GitHub Check: build (3.12)
🔇 Additional comments (5)
dagshub/data_engine/annotation/metadata.py (1)

322-339: LGTM!

The UnsupportedMetadataAnnotations class correctly wraps unsupported annotations while preserving value and to_ls_task behavior. The metaclass will replace all non-overridden public methods (like add_image_bbox, add_image_segmentation, etc.) with NotImplementedError, which is the intended behavior.

dagshub/data_engine/model/query_result.py (2)

429-435: Good defensive guard against re-processing parsed annotations as documents.

This correctly prevents the TypeError: expected str, bytes or os.PathLike object, not MetadataAnnotations by skipping fields that were already parsed as annotations. Solid fix for Bug 2.


449-463: LGTM — annotation conversion guards and unsupported annotation fallback.

The isinstance guard at line 451 prevents double-loading, and the ValidationError catch at line 460 gracefully wraps invalid annotations in UnsupportedMetadataAnnotations instead of crashing.

tests/data_engine/annotation_import/test_annotation_parsing.py (2)

99-115: LGTM — good coverage of basic annotation parsing and double-load safety.

test_double_loading_annotation_works at line 104 is a valuable regression test ensuring that calling get_blob_fields a second time doesn't crash or corrupt the already-parsed annotations.


118-134: Good test coverage for unsupported annotations.

Verifies the UnsupportedMetadataAnnotations subclass behavior: isinstance checks, NotImplementedError on unsupported methods, and content preservation via value/to_ls_task.


Copilot AI (Contributor) left a comment

Pull request overview

Fixes Data Engine metadata blob handling so unsupported/invalid Label Studio annotations no longer break Datasource.all() / QueryResult.get_annotations(), and adds regression tests for double-loading behavior.

Changes:

  • Add UnsupportedMetadataAnnotations to represent annotations that fail parsing while keeping consistent MetadataAnnotations typing.
  • Harden QueryResult.get_blob_fields() conversions to avoid re-processing already-parsed annotation fields and improve handling of path-vs-bytes blobs.
  • Extend annotation parsing tests and add an “unsupported” annotation fixture.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
dagshub/data_engine/model/query_result.py Adjusts blob/document/annotation conversion logic and introduces unsupported-annotation wrapper usage.
dagshub/data_engine/annotation/metadata.py Adds UnsupportedMetadataAnnotations type for unsupported/invalid LS tasks.
dagshub/data_engine/util/not_implemented.py Introduces a metaclass to block inherited methods/properties on unsupported annotations.
tests/data_engine/annotation_import/test_annotation_parsing.py Adds regression tests for double-loading and unsupported annotations.
tests/data_engine/annotation_import/res/audio_annotation.json Adds a fixture representing an unsupported (audio) LS task.


Comment on lines +429 to 435
# Defensive check to not mangle annotation fields by accident
if isinstance(dp.metadata[fld], MetadataAnnotations):
continue
# Force load the content into memory, even if load_into_memory was set to False
if not load_into_memory or isinstance(dp.metadata[fld], Path):
dp.metadata[fld] = Path(dp.metadata[fld]).read_bytes()
dp.metadata[fld] = dp.metadata[fld].decode("utf-8")
Copilot AI commented Feb 15, 2026

When _get_blob fails it returns a string like "Error while downloading binary blob: ..." (see Datapoint._get_blob). In that case this conversion path will call Path(<error string>).read_bytes() and raise FileNotFoundError, turning a download warning into a hard failure. Add a guard before Path(...).read_bytes() to detect error strings / non-path values (or make _get_blob raise and handle it), and skip decoding/conversion for that field when download failed.

Member replied:

That also sounds correct

Comment on lines 450 to +463
# Already loaded - skip
if isinstance(dp.metadata[fld], MetadataAnnotations):
continue
# Override the load_into_memory flag, because we need the contents
if not load_into_memory:
# Force load the content into memory, even if load_into_memory was set to False
if not load_into_memory or isinstance(dp.metadata[fld], Path):
dp.metadata[fld] = Path(dp.metadata[fld]).read_bytes()
try:
dp.metadata[fld] = MetadataAnnotations.from_ls_task(
datapoint=dp, field=fld, ls_task=dp.metadata[fld]
)
except ValidationError:
dp.metadata[fld] = UnsupportedMetadataAnnotations(
datapoint=dp, field=fld, original_value=dp.metadata[fld]
)
Copilot AI commented Feb 15, 2026

Similarly for annotation conversion: if _get_blob returned an error string (or any unexpected non-path, non-bytes value), Path(dp.metadata[fld]).read_bytes() will raise and the exception is not caught (only ValidationError is handled). Add a type/value guard before reading from disk, and consider treating download failures separately from ValidationError so unsupported/invalid annotations don’t crash get_blob_fields().

@@ -0,0 +1,48 @@
class NotImplementedMeta(type):
"""
A metaclass that replaces all parent class method and properties that aren't overridden in the subclass
Copilot AI commented Feb 15, 2026

Docstring grammar: “method and properties” should be “methods and properties”.

Suggested change
A metaclass that replaces all parent class method and properties that aren't overridden in the subclass
A metaclass that replaces all parent class methods and properties that aren't overridden in the subclass

Comment on lines +8 to +12
# Get all attributes from base classes
for base in bases:
for attr_name in dir(base):
if attr_name.startswith("_"):
continue
Copilot AI commented Feb 15, 2026

NotImplementedMeta iterates over dir(base) which includes attributes coming from type (e.g. mro) and other inherited members that are not actually defined on the parent class. This means the metaclass can unintentionally overwrite built-in class/introspection behavior on subclasses. Consider iterating over base.__dict__ instead (and handling staticmethod/classmethod descriptors explicitly) to only wrap methods/properties actually declared on the parent class.


qodo-code-review bot commented Feb 15, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (2) 📎 Requirement gaps (0)



Action required

✅ 1. __repr__ leaks raw bytes 📘 Rule violation ⛨ Security
Description
UnsupportedMetadataAnnotations.__repr__() embeds _original_value directly, which can expose
potentially sensitive annotation content via logs, tracebacks, or debug output. This conflicts with
secure logging/error-handling expectations and can leak user-provided data.
Code

dagshub/data_engine/annotation/metadata.py[R338-339]

+    def __repr__(self):
+        return f"Unsupported LS annotations with original value: {self._original_value}"
Evidence
Secure logging/error handling requires avoiding sensitive data exposure; the new __repr__ returns
a string containing the full original annotation bytes, which may include PII or other sensitive
content and can surface in logs or user-visible errors.

Rule 5: Generic: Secure Logging Practices
Rule 4: Generic: Secure Error Handling
dagshub/data_engine/annotation/metadata.py[338-339]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`UnsupportedMetadataAnnotations.__repr__()` currently includes the full `_original_value` bytes in the returned string, which may leak sensitive annotation content through logs, tracebacks, or debug printing.
## Issue Context
This object may be implicitly stringified (logging, exceptions, debugging). To align with secure logging/error handling, the representation should avoid printing raw content.
## Fix Focus Areas
- dagshub/data_engine/annotation/metadata.py[338-339]




Remediation recommended

2. Warning logs full datapoints 📘 Rule violation ⛨ Security
Description
The warning for unsupported/invalid annotations logs full datapoint paths, which may contain
sensitive identifiers. This reduces log hygiene and can leak sensitive information into application
logs.
Code

dagshub/data_engine/model/query_result.py[R471-478]

+                "Warning: The following datapoints had unsupported or invalid annotations, "
               "any annotation-related operations will not work on these:"
           )
           err_msg = ""
           for fld, dps in bad_annotations.items():
-                err_msg += f'Field "{fld}" in datapoints:\n\t'
+                err_msg += f'\nField "{fld}" in datapoints:\n\t'
               err_msg += "\n\t".join(dps)
           log_message(err_msg)
Evidence
Secure logging guidance requires avoiding sensitive data in logs; the updated warning builds and
logs a message containing field names and the complete list of affected datapoint paths.

Rule 5: Generic: Secure Logging Practices
dagshub/data_engine/model/query_result.py[471-478]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The warning emitted for unsupported/invalid annotations logs full datapoint paths, which may embed sensitive identifiers and should not be written to general logs.
## Issue Context
The code constructs `err_msg` by joining all datapoint paths and then calls `log_message(err_msg)`.
## Fix Focus Areas
- dagshub/data_engine/model/query_result.py[471-478]



3. Metaclass too broad 🐞 Bug ⛯ Reliability
Description
NotImplementedMeta iterates dir(base) and replaces every callable/property not overridden, which
can unintentionally override inherited type attributes (e.g., mro) and descriptor semantics,
leading to surprising runtime/introspection behavior for UnsupportedMetadataAnnotations. This is
risky and hard to debug if any code (or tooling) relies on class attributes inherited from type or
uses descriptors/classmethods.
Code

dagshub/data_engine/util/not_implemented.py[R7-46]

+    def __new__(mcs, name, bases, namespace):
+        # Get all attributes from base classes
+        for base in bases:
+            for attr_name in dir(base):
+                if attr_name.startswith("_"):
+                    continue
+
+                # Skip if already defined in subclass
+                if attr_name in namespace:
+                    continue
+
+                base_attr = getattr(base, attr_name)
+
+                # Handle properties
+                if isinstance(base_attr, property):
+                    # Create a property that raises NotImplementedError
+                    def make_not_implemented_property(prop_name):
+                        def getter(self):
+                            raise NotImplementedError(f"Property '{prop_name}' not implemented")
+
+                        def setter(self, value):
+                            raise NotImplementedError(f"Property '{prop_name}' not implemented")
+
+                        def deleter(self):
+                            raise NotImplementedError(f"Property '{prop_name}' not implemented")
+
+                        return property(getter, setter, deleter)
+
+                    namespace[attr_name] = make_not_implemented_property(attr_name)
+
+                # Handle regular methods
+                elif callable(base_attr):
+
+                    def make_not_implemented(method_name):
+                        def not_impl(self, *args, **kwargs):
+                            raise NotImplementedError(f"Method '{method_name}' not implemented")
+
+                        return not_impl
+
+                    namespace[attr_name] = make_not_implemented(attr_name)
Evidence
The metaclass walks dir(base) (includes inherited attributes, not just attributes defined on the
base class) and replaces any callable/property with a stub raising NotImplementedError, without
filtering to base.__dict__ or preserving descriptor types (staticmethod/classmethod).

dagshub/data_engine/util/not_implemented.py[7-46]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`NotImplementedMeta` currently iterates `dir(base)` and replaces any callable/property not overridden. This can also catch attributes inherited from `type` and can change descriptor semantics.
### Issue Context
The intent is to disable the *instance API* of `MetadataAnnotations` for `UnsupportedMetadataAnnotations`, not to shadow inherited class/type behaviors.
### Fix Focus Areas
- dagshub/data_engine/util/not_implemented.py[7-46]
### Suggested approach
- Iterate `for attr_name, raw in base.__dict__.items()` instead of `dir(base)`.
- Skip dunder/private names.
- Preserve descriptor wrappers:
  - if `isinstance(raw, staticmethod)`: wrap the underlying function in a `staticmethod` that raises
  - if `isinstance(raw, classmethod)`: wrap in a `classmethod` that raises
  - if `isinstance(raw, property)`: create a property that raises
  - if `inspect.isfunction(raw)`: create a normal method that raises
- Consider an allowlist of methods/properties that should remain usable (e.g., `value`, `to_ls_task`, `__repr__`).
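A sketch of what that suggested approach could look like; the class name, allowlist contents, and example base class are illustrative, not the PR's code:

```python
import inspect


class SaferNotImplementedMeta(type):
    """Walk base.__dict__ (only names the base itself defines) and preserve
    descriptor types when stubbing, instead of scanning the full dir()."""

    ALLOWLIST = {"value", "to_ls_task"}  # names that should stay usable

    def __new__(mcs, name, bases, namespace):
        def raiser_for(attr):
            def raiser(*args, **kwargs):
                raise NotImplementedError(f"'{attr}' is not supported")
            return raiser

        for base in bases:
            for attr, raw in base.__dict__.items():
                if attr.startswith("_") or attr in namespace or attr in mcs.ALLOWLIST:
                    continue
                if isinstance(raw, staticmethod):
                    namespace[attr] = staticmethod(raiser_for(attr))
                elif isinstance(raw, classmethod):
                    namespace[attr] = classmethod(raiser_for(attr))
                elif isinstance(raw, property):
                    namespace[attr] = property(raiser_for(attr))
                elif inspect.isfunction(raw):
                    namespace[attr] = raiser_for(attr)
        return super().__new__(mcs, name, bases, namespace)


class Base:
    @staticmethod
    def from_ls_task(task):  # stays a staticmethod after stubbing
        return task

    def add_image_bbox(self):
        return "bbox"

    @property
    def value(self):  # allowlisted, so it keeps working
        return "base value"


class Limited(Base, metaclass=SaferNotImplementedMeta):
    pass
```

Because only `base.__dict__` is walked, attributes inherited from `type` (such as `mro`) are never shadowed, which addresses the main concern above.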



4. Document Path(bytes) risk 🐞 Bug ✓ Correctness
Description
Document conversion still unconditionally runs Path(dp.metadata[fld]).read_bytes() whenever
load_into_memory=False, even if the field value is already bytes from a prior in-memory load.
Calling get_blob_fields on document fields multiple times with different load_into_memory values
can still raise TypeError: argument should be a str or os.PathLike, not 'bytes'.
Code

dagshub/data_engine/model/query_result.py[R429-435]

+                        # Defensive check to not mangle annotation fields by accident
+                        if isinstance(dp.metadata[fld], MetadataAnnotations):
+                            continue
+                        # Force load the content into memory, even if load_into_memory was set to False
+                        if not load_into_memory or isinstance(dp.metadata[fld], Path):
                           dp.metadata[fld] = Path(dp.metadata[fld]).read_bytes()
                       dp.metadata[fld] = dp.metadata[fld].decode("utf-8")
Evidence
The document conversion block converts requested document fields regardless of prior state. Its
condition is based on load_into_memory and isinstance(..., Path) only, but does not handle the
case where the value is already bytes, and then passes it to Path(...).

dagshub/data_engine/model/query_result.py[425-435]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Document conversion can crash if a document field was previously loaded into `bytes` and then `get_blob_fields(..., load_into_memory=False)` is called again: the code will do `Path(bytes)`.
### Issue Context
This is a common pattern when users call `get_blob_fields` multiple times with different flags.
### Fix Focus Areas
- dagshub/data_engine/model/query_result.py[425-436]
### Suggested approach
- Normalize handling:
- `val = dp.metadata[fld]`
- if `isinstance(val, (Path, os.PathLike))`: `val = Path(val).read_bytes()`
- elif `isinstance(val, str)`:
  - if it is a filesystem path and exists, read bytes
  - else assume it is already decoded content and skip reading
- if `isinstance(val, bytes)`: `dp.metadata[fld] = val.decode('utf-8')`
- elif `isinstance(val, str)`: leave as-is
- Avoid `Path(val)` unless `val` is path-like.
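Under the assumption that a document field may hold a `Path`, a path-like string, raw `bytes` from a prior in-memory load, or already-decoded `str` content, the normalization steps above could be sketched like this (`normalize_document_field` is a hypothetical helper, not existing client code):

```python
import os
from pathlib import Path


def normalize_document_field(val):
    """Return decoded document content regardless of the field's prior state.

    A sketch of the suggested normalization: read from disk only when the
    value is actually path-like, so a value already loaded as bytes is
    never passed to Path().
    """
    if isinstance(val, (Path, os.PathLike)):
        val = Path(val).read_bytes()
    elif isinstance(val, str):
        # A str may be a filesystem path or already-decoded content
        if os.path.exists(val):
            val = Path(val).read_bytes()
        else:
            return val
    if isinstance(val, bytes):
        return val.decode("utf-8")
    return val
```

With this shape, calling `get_blob_fields` repeatedly with different `load_into_memory` values is idempotent: a second pass over an already-decoded field is a no-op instead of a `TypeError`.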




return self._original_value

def __repr__(self):
return "Unsupported LS annotations"
Suggested change
return "Unsupported LS annotations"
return "Label Studio annotations of unrecognized type"

Comment on lines +432 to +433
# Force load the content into memory, even if load_into_memory was set to False
if not load_into_memory or isinstance(dp.metadata[fld], Path):
I don't get the condition and the comment.

The comment says to ignore load_into_memory - then why do you check it? I assume it's because when load_into_memory == True, then in some earlier part of the code the field was already read as bytes. Then why not just assert that? I guess maybe because then the field content would be string or bytes and you can't know for sure whether that's the blob content or a path without relying on the assumed earlier side effects of the field (would be solved by the type safety we discussed, but out of scope in this PR).

Assuming everything I've guessed is correct - I think a comment explaining your assumptions would really help to make this code understandable

Comment on lines +429 to 435
# Defensive check to not mangle annotation fields by accident
if isinstance(dp.metadata[fld], MetadataAnnotations):
continue
# Force load the content into memory, even if load_into_memory was set to False
if not load_into_memory or isinstance(dp.metadata[fld], Path):
dp.metadata[fld] = Path(dp.metadata[fld]).read_bytes()
dp.metadata[fld] = dp.metadata[fld].decode("utf-8")
That also sounds correct

Comment on lines +453 to +454
# Force load the content into memory, even if load_into_memory was set to False
if not load_into_memory or isinstance(dp.metadata[fld], Path):
This condition should probably be refactored to a function and made DRY

)
bad_annotations[fld].append(dp.path)
else:
# Empty annotation container
I recommend flipping the if-else order to make it readable. Or even just use continue and get rid of a nesting level.

Comment on lines +471 to 472
"Warning: The following datapoints had unsupported or invalid annotations, "
"any annotation-related operations will not work on these:"
IMO this warning message is too scary - the users can do whatever they want with the result just fine.
If what you mean is that they won't be able to use our fancy add_bounding_box etc methods - then just say that. The users don't know what you mean.
