[DataLoader] Add performance observability with tags/metrics separation by sumedhsakdeo · Pull Request #540 · linkedin/openhouse

sumedhsakdeo · 2026-04-12T00:31:40Z

Summary

Adds a pluggable PerfObserver pattern for performance instrumentation of the data loader
New _observability.py module with PerfEvent, PerfObserver protocol, NullPerfObserver, LoggingPerfObserver, CompositeObserver, and perf_timer context manager
PerfEvent separates tags (low-cardinality dimensions) from metrics (measured values) for Prometheus/OTEL/Kafka compatibility
Instruments 7 stages: dataloader.iter, dataloader.split_iter, dataloader.resolve_snapshot, dataloader.build_query, dataloader.apply_transform, dataloader.create_transform_session, catalog.load_table
Default observer is LoggingPerfObserver (DEBUG level to openhouse.dataloader.perf), auto-set on first OpenHouseDataLoader creation unless a custom observer is already configured
set_observer() bridges to pyiceberg.observability.set_observer() when available
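
To make the tags/metrics split concrete, here is a minimal sketch of how the pieces listed above could fit together. It is not the code in `_observability.py`: the field names (`duration_ms`, `tags`, `metrics`), the `on_event` method name, the `perf_timer` signature, and the `CollectingObserver` helper are all assumptions made for illustration.

```python
import logging
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Protocol


@dataclass
class PerfEvent:
    """One timed stage. Field names here are assumed, not taken from the PR."""
    name: str
    duration_ms: float = 0.0
    tags: dict[str, str] = field(default_factory=dict)       # low-cardinality dimensions
    metrics: dict[str, float] = field(default_factory=dict)  # measured values


class PerfObserver(Protocol):
    def on_event(self, event: PerfEvent) -> None: ...


class NullPerfObserver:
    """Discards every event."""
    def on_event(self, event: PerfEvent) -> None:
        pass


class LoggingPerfObserver:
    """Logs each event at DEBUG level to the perf logger named in the summary."""
    def __init__(self, logger_name: str = "openhouse.dataloader.perf") -> None:
        self._log = logging.getLogger(logger_name)

    def on_event(self, event: PerfEvent) -> None:
        self._log.debug(
            "%s duration_ms=%.3f tags=%s metrics=%s",
            event.name, event.duration_ms, event.tags, event.metrics,
        )


class CollectingObserver:
    """Hypothetical helper that keeps events in memory, handy in tests."""
    def __init__(self) -> None:
        self.events: list[PerfEvent] = []

    def on_event(self, event: PerfEvent) -> None:
        self.events.append(event)


@contextmanager
def perf_timer(observer: PerfObserver, name: str, **tags: str) -> Iterator[PerfEvent]:
    """Time the enclosed block and emit one PerfEvent, even if the block raises."""
    event = PerfEvent(name=name, tags={k: str(v) for k, v in tags.items()})
    start = time.perf_counter()
    try:
        yield event  # the caller may attach metrics while the block runs
    finally:
        event.duration_ms = (time.perf_counter() - start) * 1000.0
        observer.on_event(event)
```

With this shape, a stage is wrapped as `with perf_timer(obs, "dataloader.resolve_snapshot", table="db.tbl") as ev: ev.metrics["rows"] = 3`. Keeping `tags` as strings and `metrics` as numbers means a backend such as Prometheus can map tags to labels and metrics to sample values without inspecting types.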

Why a separate `_observability.py` instead of importing from pyiceberg?

The dataloader's _observability.py is intentionally independent from pyiceberg.observability (li-iceberg-python#52). The two modules share the same design (PerfEvent/PerfObserver/perf_timer) but are not coupled because:

Different release cycles. The dataloader ships on its own cadence; depending on an unreleased pyiceberg observability module would block this PR until the upstream fork is published.
Different logger namespaces. The dataloader logs to openhouse.dataloader.perf while pyiceberg logs to pyiceberg.perf, keeping the two layers distinguishable in log output.
Bridge, not dependency. set_observer() in the dataloader forwards to pyiceberg.observability.set_observer() when available, so a single call enables both layers without a hard import dependency.
Convergence path. Once pyiceberg publishes observability.py in a released version, the dataloader module can be reduced to a thin re-export + bridge. The PerfEvent/PerfObserver contracts are identical by design to make this migration straightforward.
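
The "bridge, not dependency" point can be sketched as an optional import. The snippet below is an assumption about how such a bridge might look, not the PR's implementation: the module-level `_observer` variable, `get_observer()`, and the injectable `upstream_loader` parameter are hypothetical, and `pyiceberg.observability` exists only in the fork referenced above, so the lookup is expected to return `None` on a stock pyiceberg install.

```python
import importlib
from typing import Any, Callable, Optional

_observer: Optional[Any] = None  # hypothetical process-wide observer slot


def _load_pyiceberg_set_observer() -> Optional[Callable[[Any], None]]:
    """Return pyiceberg's set_observer if that module is importable, else None."""
    try:
        module = importlib.import_module("pyiceberg.observability")
    except ImportError:
        # pyiceberg is not installed, or it predates the observability module.
        return None
    return getattr(module, "set_observer", None)


def set_observer(
    observer: Any,
    upstream_loader: Callable[[], Optional[Callable[[Any], None]]] = _load_pyiceberg_set_observer,
) -> bool:
    """Install the observer locally and forward it upstream when possible.

    Returns True if the observer was also forwarded to pyiceberg.
    """
    global _observer
    _observer = observer
    upstream = upstream_loader()
    if upstream is None:
        return False
    upstream(observer)
    return True


def get_observer() -> Optional[Any]:
    return _observer
```

Because the import happens inside the function and failure is swallowed, the dataloader has no hard import-time dependency on the upstream module, which matches the release-cycle argument above.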

Test plan

217 unit tests pass (make verify)
15 dedicated observability tests covering PerfEvent, all observer types, perf_timer, tag/metric separation, and full integration with data loader iteration and transform paths
Integration tests against Docker OpenHouse

🤖 Generated with Claude Code

Add a pluggable PerfObserver pattern to instrument data loader stages (iter, split_iter, resolve_snapshot, build_transform_sql, apply_transform, create_transform_session, catalog.load_table) with duration tracking and typed attributes split into tags (dimensions) and metrics (values) for compatibility with Prometheus/OTEL/Kafka backends. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…vability PerfConfig is a frozen, pickle-safe dataclass that carries session-level tags, observer type, and observer kwargs through TableScanContext to remote workers. bootstrap_observer() performs idempotent worker-side setup, and EnrichingObserver wraps any observer to prepend session tags to every event. This enables lipy-openhouse to inject cluster/tenant tags and custom observers without subclassing or process-global state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
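
As a rough illustration of the second commit, the sketch below shows one way a frozen config, an idempotent bootstrap, and a tag-enriching wrapper could interact. Everything beyond the names `PerfConfig`, `bootstrap_observer`, and `EnrichingObserver` is assumed: the field names, the string-keyed observer registry, the `ListObserver` stand-in, the simplified `on_event(name, tags)` signature, and the choice to let event tags win over session tags on a key collision.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class PerfConfig:
    """Immutable config built only from tuples and strings, so it hashes and pickles."""
    session_tags: tuple[tuple[str, str], ...] = ()
    observer_type: str = "list"
    observer_kwargs: tuple[tuple[str, Any], ...] = ()  # values must be hashable


class ListObserver:
    """Hypothetical stand-in observer that records events in memory."""
    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix
        self.events: list[tuple[str, dict[str, str]]] = []

    def on_event(self, name: str, tags: dict[str, str]) -> None:
        self.events.append((self.prefix + name, tags))


class EnrichingObserver:
    """Wraps another observer and adds session tags to every event."""
    def __init__(self, inner: Any, session_tags: tuple[tuple[str, str], ...]) -> None:
        self.inner = inner
        self._session_tags = dict(session_tags)

    def on_event(self, name: str, tags: dict[str, str]) -> None:
        # Session tags go first; event-level tags override them on collision.
        self.inner.on_event(name, {**self._session_tags, **tags})


# Assumed registry mapping observer_type strings to observer classes.
_OBSERVER_TYPES = {"list": ListObserver}
_bootstrapped: dict[PerfConfig, EnrichingObserver] = {}


def bootstrap_observer(config: PerfConfig) -> EnrichingObserver:
    """Build the observer for a config once; repeat calls return the same object."""
    if config not in _bootstrapped:
        inner = _OBSERVER_TYPES[config.observer_type](**dict(config.observer_kwargs))
        _bootstrapped[config] = EnrichingObserver(inner, config.session_tags)
    return _bootstrapped[config]
```

Passing the observer as a type name plus kwargs, rather than as a live object, is what lets the config travel to a remote worker and be rebuilt there.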

sumedhsakdeo force-pushed the ssakdeo/dataloader-perf-observability branch from 273dddc to ef2fa9b Compare April 12, 2026 00:39

sumedhsakdeo requested review from ShreyeshArangath, cbb330 and robreeves April 12, 2026 05:10

sumedhsakdeo wants to merge 2 commits into linkedin:main from sumedhsakdeo:ssakdeo/dataloader-perf-observability