[DataLoader] Add performance observability with tags/metrics separation #540

Open
sumedhsakdeo wants to merge 2 commits into linkedin:main from sumedhsakdeo:ssakdeo/dataloader-perf-observability

@sumedhsakdeo commented Apr 12, 2026

Summary

  • Adds a pluggable PerfObserver pattern for performance instrumentation of the data loader
  • New _observability.py module with PerfEvent, PerfObserver protocol, NullPerfObserver, LoggingPerfObserver, CompositeObserver, and perf_timer context manager
  • PerfEvent separates tags (low-cardinality dimensions) from metrics (measured values) for Prometheus/OTEL/Kafka compatibility
  • Instruments 7 stages: dataloader.iter, dataloader.split_iter, dataloader.resolve_snapshot, dataloader.build_query, dataloader.apply_transform, dataloader.create_transform_session, catalog.load_table
  • Default observer is LoggingPerfObserver, which logs at DEBUG level to the openhouse.dataloader.perf logger; it is set automatically on the first OpenHouseDataLoader creation unless a custom observer is already configured
  • set_observer() bridges to pyiceberg.observability.set_observer() when available
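
To illustrate the tags/metrics split described above, here is a minimal sketch of how the pieces might fit together. It is not the code in this PR: the class names come from the summary, but the field names (`duration_ms`, `tags`, `metrics`), the `emit` method, and the `perf_timer` signature are assumptions. The observer is passed explicitly to keep the sketch self-contained, whereas the PR configures it through set_observer().

```python
# Hypothetical sketch: names mirror the PR summary; fields and signatures are assumed.
import logging
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Protocol

logger = logging.getLogger("openhouse.dataloader.perf")


@dataclass(frozen=True)
class PerfEvent:
    name: str                       # stage name, e.g. "dataloader.iter"
    duration_ms: float              # wall-clock duration of the stage
    tags: dict = field(default_factory=dict)     # low-cardinality dimensions
    metrics: dict = field(default_factory=dict)  # measured values


class PerfObserver(Protocol):
    def emit(self, event: PerfEvent) -> None: ...


class NullPerfObserver:
    """Discards every event."""

    def emit(self, event: PerfEvent) -> None:
        pass


class LoggingPerfObserver:
    """Writes each event at DEBUG level to the perf logger."""

    def emit(self, event: PerfEvent) -> None:
        logger.debug(
            "%s took %.2f ms tags=%s metrics=%s",
            event.name, event.duration_ms, event.tags, event.metrics,
        )


class CompositeObserver:
    """Fans each event out to several observers in order."""

    def __init__(self, *observers: PerfObserver) -> None:
        self._observers = observers

    def emit(self, event: PerfEvent) -> None:
        for observer in self._observers:
            observer.emit(event)


@contextmanager
def perf_timer(observer: PerfObserver, name: str, **tags: str):
    """Time the enclosed block; the yielded dict collects metrics."""
    metrics: dict = {}
    start = time.perf_counter()
    try:
        yield metrics
    finally:
        duration_ms = (time.perf_counter() - start) * 1000.0
        observer.emit(PerfEvent(name, duration_ms, dict(tags), dict(metrics)))


# Usage: tags are passed up front, metrics are filled in while the block runs.
observer = CompositeObserver(LoggingPerfObserver(), NullPerfObserver())
with perf_timer(observer, "dataloader.iter", table="db.tbl") as metrics:
    metrics["row_count"] = 3
```

Keeping tags as a small set of string dimensions and metrics as numeric values lets a backend map tags to labels and metrics to samples without inspecting value types.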

Why a separate _observability.py instead of importing from pyiceberg?

The dataloader's _observability.py is intentionally independent from pyiceberg.observability (li-iceberg-python#52). The two modules share the same design (PerfEvent/PerfObserver/perf_timer) but are not coupled because:

  1. Different release cycles. The dataloader ships on its own cadence; depending on an unreleased pyiceberg observability module would block this PR until the upstream fork is published.
  2. Different logger namespaces. The dataloader logs to openhouse.dataloader.perf while pyiceberg logs to pyiceberg.perf, keeping the two layers distinguishable in log output.
  3. Bridge, not dependency. set_observer() in the dataloader forwards to pyiceberg.observability.set_observer() when available, so a single call enables both layers without a hard import dependency.
  4. Convergence path. Once pyiceberg publishes observability.py in a released version, the dataloader module can be reduced to a thin re-export + bridge. The PerfEvent/PerfObserver contracts are identical by design to make this migration straightforward.
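
The "bridge, not dependency" point can be sketched as an optional import that forwards the observer only when the upstream module is present. This is an assumed shape, not the PR's implementation: the `upstream_module` parameter, the boolean return value, and `get_observer()` are hypothetical additions for illustration.

```python
# Hypothetical sketch of the bridge; parameter names and return value are assumed.
import importlib

_observer = None  # dataloader-level observer


def get_observer():
    return _observer


def set_observer(observer, upstream_module="pyiceberg.observability"):
    """Set the local observer and forward it upstream if the module exists.

    Returns True when the observer was also forwarded, False otherwise.
    """
    global _observer
    _observer = observer
    try:
        upstream = importlib.import_module(upstream_module)
    except ImportError:
        # Upstream observability is not installed: local-only mode.
        return False
    forward = getattr(upstream, "set_observer", None)
    if forward is None:
        return False
    forward(observer)
    return True
```

Because the import happens inside the function, the dataloader has no import-time dependency on the upstream module, and one call still configures both layers when it is available.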

Test plan

  • 217 unit tests pass (make verify)
  • 15 dedicated observability tests covering PerfEvent, all observer types, perf_timer, tag/metric separation, and full integration with data loader iteration and transform paths
  • Integration tests against Docker OpenHouse

🤖 Generated with Claude Code

Add a pluggable PerfObserver pattern to instrument data loader stages
(iter, split_iter, resolve_snapshot, build_transform_sql, apply_transform,
create_transform_session, catalog.load_table) with duration tracking and
typed attributes split into tags (dimensions) and metrics (values) for
compatibility with Prometheus/OTEL/Kafka backends.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sumedhsakdeo force-pushed the ssakdeo/dataloader-perf-observability branch from 273dddc to ef2fa9b on April 12, 2026 00:39
…vability

PerfConfig is a frozen, pickle-safe dataclass that carries session-level
tags, observer type, and observer kwargs through TableScanContext to
remote workers. bootstrap_observer() performs idempotent worker-side
setup, and EnrichingObserver wraps any observer to prepend session tags
to every event. This enables lipy-openhouse to inject cluster/tenant
tags and custom observers without subclassing or process-global state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
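
The second commit's design (a frozen config carried to workers, idempotent bootstrap, and an enriching wrapper) could look roughly like the sketch below. The field names, the string-keyed observer registry, the `ListObserver` helper, the use of `functools.lru_cache` for idempotence, and the rule that event tags win over session tags on a key collision are all assumptions made for illustration, not details confirmed by the PR.

```python
# Hypothetical sketch: field names, registry, and merge order are assumed.
from dataclasses import dataclass, field, replace
from functools import lru_cache


@dataclass(frozen=True)
class PerfEvent:
    name: str
    duration_ms: float
    tags: dict = field(default_factory=dict)


@dataclass(frozen=True)
class PerfConfig:
    # Tuples of pairs keep the config immutable and hashable; it holds only
    # plain values, so it can be serialized and shipped to remote workers.
    session_tags: tuple = ()
    observer_type: str = "list"
    observer_kwargs: tuple = ()


class ListObserver:
    """Toy observer that collects events in memory."""

    def __init__(self) -> None:
        self.events = []

    def emit(self, event: PerfEvent) -> None:
        self.events.append(event)


class EnrichingObserver:
    """Wraps an observer and adds session tags to every event."""

    def __init__(self, inner, session_tags: dict) -> None:
        self.inner = inner
        self.session_tags = session_tags

    def emit(self, event: PerfEvent) -> None:
        # Session tags come first; per-event tags win on a key collision.
        merged = {**self.session_tags, **event.tags}
        self.inner.emit(replace(event, tags=merged))


_OBSERVER_TYPES = {"list": ListObserver}  # hypothetical registry


@lru_cache(maxsize=None)
def bootstrap_observer(config: PerfConfig) -> EnrichingObserver:
    """Build the observer once per distinct config; repeat calls are no-ops."""
    inner = _OBSERVER_TYPES[config.observer_type](**dict(config.observer_kwargs))
    return EnrichingObserver(inner, dict(config.session_tags))


# Usage on a worker: equal configs resolve to the same observer instance.
config = PerfConfig(session_tags=(("cluster", "c1"), ("tenant", "t1")))
observer = bootstrap_observer(config)
observer.emit(PerfEvent("dataloader.split_iter", 1.5, {"table": "db.tbl"}))
```

Caching on the config value is one simple way to get idempotent setup; the PR may track bootstrap state differently.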