Skip to content

feat(console): rewrite TUI with worker-centric dashboard and live metrics#363

Merged
gabotechs merged 41 commits intodatafusion-contrib:mainfrom
EdsonPetry:edson.petry/console-dashboard
Mar 17, 2026
Merged

feat(console): rewrite TUI with worker-centric dashboard and live metrics#363
gabotechs merged 41 commits intodatafusion-contrib:mainfrom
EdsonPetry:edson.petry/console-dashboard

Conversation

@EdsonPetry
Copy link
Copy Markdown
Contributor

@EdsonPetry EdsonPetry commented Mar 4, 2026

Summary

  • Rewrite the console TUI with a worker-centric dashboard featuring a cluster overview and per-worker detail views, replacing the previous global task view
  • Extend the observability service to collect worker-level metrics including CPU usage, RSS memory, output rows, unique query IDs, and task counts via updated protobuf definitions
  • Add CPU/RSS sparklines and sortable columns to the worker table, with queries-in-flight metric in the cluster metrics panel
  • Add new examples: a cluster spawner/killer example and concurrent TPC-DS query runner with explain analyze support

Changes

Console TUI (console/)

  • Split monolithic ui.rs into modular components: cluster.rs, worker.rs, header.rs, footer.rs, help.rs
  • New input.rs for keyboard event handling and state.rs for shared UI state
  • Worker table with sortable columns (Name, Stage, Tasks, Queries, CPU, Memory)
  • Sparkline charts for CPU and RSS per worker
  • Cluster-level metrics: throughput, active/total workers, completed queries, avg duration
  • Help overlay (toggle with ?)

Observability (src/observability/)

  • Extended protobuf schema with cpu_usage, rss_bytes, output_rows, query_ids, num_tasks fields
  • Updated service to generate and collect the new per-worker metrics
  • Worker now reports richer metrics through the progress API

Flight service (src/flight_service/)

  • Changed observability_service() generator method on Worker to with_observability_service(), users now no longer need to wrap the ObservbilityServiceImpl that the old observability service generator created with a ObservabilityServiceServer.
    old:
    let worker = Worker::default();
   let observability_service = worker.observability_service();

    Server::builder()
        .add_service(ObservabilityServiceServer::new(observability_service))
        .add_service(worker.into_flight_server())
        .serve(SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), args.port))
        .await?;

new

    let worker = Worker::default();

    Server::builder()
        .add_service(worker.with_observability_service())
        .add_service(worker.into_flight_server())
        .serve(SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), args.port))
        .await?;

Examples

  • cluster.rs: spawns and manages a local cluster of workers with observability
  • tpcds_runner.rs: supports concurrent query execution and explain analyze

EdsonPetry and others added 29 commits March 3, 2026 12:42
Replace the proof-of-concept console with a production-quality
monitoring dashboard focused on cluster health and worker load
distribution rather than misleading per-partition progress bars.

Two-view architecture:
- Cluster Overview: scrollable worker table with task counts,
  query counts, longest task duration, hot spot highlighting,
  task distribution summary, and sortable columns
- Worker Detail: active tasks sorted by duration, recently
  completed tasks with observed duration, connection info

Key improvements:
- Scrollable tables supporting 50+ workers
- Tab navigation (1/2), vim keys (j/k/h/l), drill-down (Enter)
- Sort cycling (s key): name, tasks, status, longest task
- Pause/resume polling (p key), help overlay (? key)
- Separate poll rate (250ms) from render rate (60fps)
- Task duration tracking via first-seen timestamps
- Hot spot detection: workers with >2x avg tasks highlighted red
- Stuck task detection: tasks >30s yellow, >60s red
- Query aggregation across workers
- Replace structopt with clap v4, remove unused hex dep
- Fix ClusterStats.completed always being 0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add WorkerMetrics (RSS, CPU) proto message and output_rows field to
TaskProgress. Introduce system-metrics feature flag with sysinfo for
per-process metric collection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Display per-worker CPU and RSS in the cluster table, replacing longest-task
column. Add cluster-wide metrics panel with throughput and task stats.
Worker detail view now includes sparkline graphs for CPU, memory, and
row throughput. Replace sort-mode cycling with column-based sorting
(arrow keys + space). Update keybindings and help overlay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pass explain_analyze and show_distributed_plan as parameters to
run_single_query instead of re-parsing CLI args inside the function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevent panic if worker_idx becomes stale after workers list changes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use unwrap_or_else to recover from a poisoned mutex instead of panicking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deduplicate format_bytes, format_duration, format_row_count,
format_rows_throughput, and cpu_color from cluster.rs and worker.rs
into a shared format module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid silent truncation when casting count to u32 by computing the
average with f64 seconds instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove commented-out pretty_format_batches in tpcds_runner
- Remove unreachable '\t' key branch in worker detail input handler
- Replace unused QuerySummary struct with a simple active_query_count,
  eliminating the O(queries * workers * tasks) second pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add hours support to the shared format_duration in ui/format.rs and
reuse it from header.rs, removing the duplicate implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents silent overflow when base_port + worker index exceeds u16::MAX.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…unner

Return result messages from spawned tasks and print them from the
join_next loop, preventing concurrent println interleaving.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use consistent "Memory" column label in narrow mode (was "RSS")
- Derive Copy for View enum and pass by value in footer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace structopt derive macros with clap v4 Parser/arg across all
four console examples, matching the main binary. Remove the structopt
dev-dependency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move WorkerConn, ConnectionStatus, CompletedTaskRecord, and all
per-worker gRPC polling/connection logic out of app.rs into worker.rs.

app.rs is now a 247-line coordinator (tick, sorting, cluster stats,
throughput). worker.rs is a 366-line self-contained module for
connection management, task tracking, and metric history.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@EdsonPetry EdsonPetry marked this pull request as ready for review March 4, 2026 21:55
@EdsonPetry EdsonPetry force-pushed the edson.petry/console-dashboard branch from b4842dd to 6dbcb45 Compare March 5, 2026 16:02
@EdsonPetry EdsonPetry force-pushed the edson.petry/console-dashboard branch from 68f2286 to abbc05e Compare March 12, 2026 14:11
@EdsonPetry EdsonPetry force-pushed the edson.petry/console-dashboard branch from ad30424 to cee8747 Compare March 12, 2026 19:07
@EdsonPetry EdsonPetry force-pushed the edson.petry/console-dashboard branch from 55a93c9 to 5491902 Compare March 13, 2026 14:06
Copy link
Copy Markdown
Collaborator

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I have not made a deep review over the code in the console/ folder, but I trust you on that. It looks great!

@gabotechs gabotechs merged commit e56922a into datafusion-contrib:main Mar 17, 2026
7 checks passed
// Spawn background thread to send system metrics.
// This is done to prevent stalling the tokio thread
// due to the sys call, leading to task pool starvation.
thread::spawn(async move || {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gabotechs I didn't catch this before, but this code here isn't correct. I should use move here instead of async move because the body of the closure is never executed within the spawned thread. The spawned thread receives a future here that is goes unpolled. Also because this is a separate background thread that isn't in tokio's thread pool thus I believe calling thread::sleep is correct.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll push a fix for this in the PR #372

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I don't understand why this should be incorrect. Can you push the fix as a separate PR so that we can review it in isolation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, the issue I was running into is that because the Future within that async closure was going unpolled the system metrics collection loop was never executing and thus the console was only displaying 0 values, this compiles fine because what the compiler sees is just thread::spawn(async move || { Future<...> }).

Opening a separate PR for this now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants