feat(console): rewrite TUI with worker-centric dashboard and live metrics #363
Replace the proof-of-concept console with a production-quality monitoring dashboard focused on cluster health and worker load distribution rather than misleading per-partition progress bars.

Two-view architecture:
- Cluster Overview: scrollable worker table with task counts, query counts, longest task duration, hot spot highlighting, task distribution summary, and sortable columns
- Worker Detail: active tasks sorted by duration, recently completed tasks with observed duration, connection info

Key improvements:
- Scrollable tables supporting 50+ workers
- Tab navigation (1/2), vim keys (j/k/h/l), drill-down (Enter)
- Sort cycling (s key): name, tasks, status, longest task
- Pause/resume polling (p key), help overlay (? key)
- Separate poll rate (250ms) from render rate (60fps)
- Task duration tracking via first-seen timestamps
- Hot spot detection: workers with >2x avg tasks highlighted red
- Stuck task detection: tasks >30s yellow, >60s red
- Query aggregation across workers
- Replace structopt with clap v4, remove unused hex dep
- Fix ClusterStats.completed always being 0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
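The duration tracking and the two highlighting rules in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FirstSeen`, `Severity`, `task_severity`, and `hot_spots` are hypothetical names, and only the thresholds (2x average, 30s, 60s) come from the commit message.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical tracker: remembers when each task id was first observed,
/// so a poll loop can derive a running duration without server timestamps.
#[derive(Default)]
struct FirstSeen {
    seen: HashMap<String, Instant>,
}

impl FirstSeen {
    /// Records `now` on first sight of `task_id`, then returns the elapsed time.
    fn elapsed(&mut self, task_id: &str, now: Instant) -> Duration {
        let first = *self.seen.entry(task_id.to_string()).or_insert(now);
        now.duration_since(first)
    }
}

/// Hypothetical severity levels; the dashboard maps these to colors.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Severity {
    Normal,   // default color
    Warning,  // yellow: running for more than 30s
    Critical, // red: running for more than 60s
}

fn task_severity(elapsed: Duration) -> Severity {
    if elapsed > Duration::from_secs(60) {
        Severity::Critical
    } else if elapsed > Duration::from_secs(30) {
        Severity::Warning
    } else {
        Severity::Normal
    }
}

/// Returns the indices of workers holding more than twice the average task count.
fn hot_spots(task_counts: &[usize]) -> Vec<usize> {
    if task_counts.is_empty() {
        return Vec::new();
    }
    let avg = task_counts.iter().sum::<usize>() as f64 / task_counts.len() as f64;
    task_counts
        .iter()
        .enumerate()
        .filter(|(_, &count)| count as f64 > 2.0 * avg)
        .map(|(idx, _)| idx)
        .collect()
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside the tracker, keeps the poll loop in charge of time and makes the logic easy to exercise in tests.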
…enerator on worker
Add WorkerMetrics (RSS, CPU) proto message and output_rows field to TaskProgress. Introduce system-metrics feature flag with sysinfo for per-process metric collection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Display per-worker CPU and RSS in the cluster table, replacing longest-task column. Add cluster-wide metrics panel with throughput and task stats. Worker detail view now includes sparkline graphs for CPU, memory, and row throughput. Replace sort-mode cycling with column-based sorting (arrow keys + space). Update keybindings and help overlay. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pass explain_analyze and show_distributed_plan as parameters to run_single_query instead of re-parsing CLI args inside the function. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevent panic if worker_idx becomes stale after workers list changes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
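A stale index is usually handled by replacing direct indexing with a bounds-checked lookup. The sketch below assumes the workers live in a slice and uses hypothetical helper names; the actual types in the console may differ.

```rust
/// Hypothetical helper: look up the selected worker without panicking.
/// `workers[worker_idx]` would panic if the list shrank since the index was stored.
fn selected_worker<'a>(workers: &'a [String], worker_idx: usize) -> Option<&'a String> {
    workers.get(worker_idx)
}

/// Hypothetical helper: pull a stale selection back to the last valid row.
/// Returns `None` when the list is empty and nothing can be selected.
fn clamp_selection(worker_idx: usize, len: usize) -> Option<usize> {
    len.checked_sub(1).map(|last| worker_idx.min(last))
}
```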
Use unwrap_or_else to recover from a poisoned mutex instead of panicking. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
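For reference, the recovery pattern looks roughly like this with `std::sync::Mutex`. The helper name `lock_or_recover` and the demo function are illustrative assumptions, not code from this PR.

```rust
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

/// Hypothetical helper: take the lock even if another thread panicked while
/// holding it. `PoisonError::into_inner` hands back the guard, so the data
/// stays readable; the caller accepts that it may be mid-update.
fn lock_or_recover<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    mutex.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Demo: poison a mutex on purpose, then read through it anyway.
fn read_after_poison() -> u64 {
    let shared = Arc::new(Mutex::new(7u64));
    let poisoner = Arc::clone(&shared);
    // The panic happens while the guard is alive, which poisons the mutex.
    let _ = thread::spawn(move || {
        let _guard = poisoner.lock().unwrap();
        panic!("simulated failure while holding the lock");
    })
    .join();
    // `shared.lock().unwrap()` would panic here; the helper does not.
    let value = *lock_or_recover(&shared);
    value
}
```

This trade-off fits display-only state such as metrics history, where showing slightly inconsistent data is preferable to crashing the UI.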
Deduplicate format_bytes, format_duration, format_row_count, format_rows_throughput, and cpu_color from cluster.rs and worker.rs into a shared format module. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid silent truncation when casting count to u32 by computing the average with f64 seconds instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
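To illustrate the truncation concern: `Duration` only implements division by `u32`, so dividing by a `usize` count tempts an `as u32` cast that silently wraps. A minimal sketch of the f64-based alternative, with `average_duration` as a hypothetical name:

```rust
use std::time::Duration;

/// Hypothetical helper: average of `count` task durations summing to `total`.
/// Working in f64 seconds avoids `total / (count as u32)`, where the cast
/// would wrap for counts above `u32::MAX`.
fn average_duration(total: Duration, count: usize) -> Option<Duration> {
    if count == 0 {
        return None; // no completed tasks, no average
    }
    Some(Duration::from_secs_f64(total.as_secs_f64() / count as f64))
}

/// Shows what the lossy cast does to an oversized count.
fn truncated(count: u64) -> u32 {
    count as u32
}
```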
- Remove commented-out pretty_format_batches in tpcds_runner
- Remove unreachable '\t' key branch in worker detail input handler
- Replace unused QuerySummary struct with a simple active_query_count, eliminating the O(queries * workers * tasks) second pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add hours support to the shared format_duration in ui/format.rs and reuse it from header.rs, removing the duplicate implementation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
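A shared formatter with hours support might look like the sketch below. The output layout ("1h 02m 03s") is an assumption chosen for illustration; the actual format in ui/format.rs may differ.

```rust
use std::time::Duration;

/// Illustrative formatter: drops leading zero units, zero-pads the rest.
fn format_duration(d: Duration) -> String {
    let total = d.as_secs();
    let hours = total / 3600;
    let minutes = (total % 3600) / 60;
    let seconds = total % 60;
    if hours > 0 {
        format!("{hours}h {minutes:02}m {seconds:02}s")
    } else if minutes > 0 {
        format!("{minutes}m {seconds:02}s")
    } else {
        format!("{seconds}s")
    }
}
```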
Prevents silent overflow when base_port + worker index exceeds u16::MAX. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
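The overflow guard can be expressed with `checked_add`, roughly as below. `worker_port` is a hypothetical name, and returning `Option` is one possible design; the example in this PR may report the error differently.

```rust
/// Hypothetical helper: port for the worker at `worker_idx`, or `None` when
/// `base_port + worker_idx` does not fit in a u16. A plain `+` would panic in
/// debug builds and wrap around in release builds.
fn worker_port(base_port: u16, worker_idx: usize) -> Option<u16> {
    // Reject indices that cannot be represented as a u16 at all.
    let offset = u16::try_from(worker_idx).ok()?;
    base_port.checked_add(offset)
}
```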
…unner
Return result messages from spawned tasks and print them from the join_next loop, preventing concurrent println interleaving. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
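The idea is that workers return their message and a single place prints it. The sketch below uses plain `std::thread` handles instead of the tokio join_next loop the commit refers to, so it stays dependency-free; `run_queries` and the message text are hypothetical.

```rust
use std::thread;

/// Hypothetical runner: each worker builds its result message and returns it
/// instead of calling println! itself.
fn run_queries(query_ids: &[u32]) -> Vec<String> {
    let handles: Vec<_> = query_ids
        .iter()
        .map(|&id| thread::spawn(move || format!("query {id}: ok")))
        .collect();

    // Only the collecting thread touches the results, so when the caller
    // prints them the lines cannot interleave.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("query thread panicked"))
        .collect()
}
```

A caller would then loop over the returned messages and print each one from a single thread.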
- Use consistent "Memory" column label in narrow mode (was "RSS")
- Derive Copy for View enum and pass by value in footer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace structopt derive macros with clap v4 Parser/arg across all four console examples, matching the main binary. Remove the structopt dev-dependency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move WorkerConn, ConnectionStatus, CompletedTaskRecord, and all per-worker gRPC polling/connection logic out of app.rs into worker.rs. app.rs is now a 247-line coordinator (tick, sorting, cluster stats, throughput). worker.rs is a 366-line self-contained module for connection management, task tracking, and metric history. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b4842dd to 6dbcb45
68f2286 to abbc05e
ad30424 to cee8747
55a93c9 to 5491902
gabotechs left a comment
Looks good! I have not done a deep review of the code in the console/ folder, but I trust you on that. It looks great!
// Spawn background thread to send system metrics.
// This is done to prevent stalling the tokio thread
// due to the sys call, leading to task pool starvation.
thread::spawn(async move || {
@gabotechs I didn't catch this before, but this code isn't correct. I should use move here instead of async move, because the body of the closure is never executed within the spawned thread: the thread only receives a future, which goes unpolled. Also, because this is a separate background thread that isn't in tokio's thread pool, I believe calling thread::sleep is correct.
🤔 I don't understand why this should be incorrect. Can you push the fix as a separate PR so that we can review it in isolation?
Sounds good. The issue I was running into is that, because the Future returned by that async closure was going unpolled, the system metrics collection loop never executed, so the console only displayed 0 values. This compiles fine because all the compiler sees is thread::spawn(async move || { Future<...> }), a closure that returns a future.
Opening a separate PR for this now.
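The behavior described in this thread can be reproduced in isolation. In the sketch below the async closure is spelled as a closure returning an async block, which behaves the same way for this purpose and also compiles on toolchains without async closures; the function name and flags are illustrative only.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Returns (body_ran_with_async_closure, body_ran_with_plain_closure).
fn spawn_both_ways() -> (bool, bool) {
    let async_ran = Arc::new(AtomicBool::new(false));
    let plain_ran = Arc::new(AtomicBool::new(false));

    // The closure does run on the new thread, but all it does is build a future.
    let flag = Arc::clone(&async_ran);
    let handle = thread::spawn(move || async move {
        flag.store(true, Ordering::SeqCst);
    });
    // join() hands the future back. Nothing ever polls it, so the body never runs.
    let _unpolled = handle.join().expect("thread panicked");

    // A plain `move` closure executes its body directly on the spawned thread.
    let flag = Arc::clone(&plain_ran);
    thread::spawn(move || {
        flag.store(true, Ordering::SeqCst);
    })
    .join()
    .expect("thread panicked");

    let async_result = async_ran.load(Ordering::SeqCst);
    let plain_result = plain_ran.load(Ordering::SeqCst);
    (async_result, plain_result)
}
```

Calling `spawn_both_ways()` yields `(false, true)`: the first flag is never set, which matches the symptom of the metrics loop never executing.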
Summary
Changes
Console TUI (console/)
- Split ui.rs into modular components: cluster.rs, worker.rs, header.rs, footer.rs, help.rs
- Added input.rs for keyboard event handling and state.rs for shared UI state
- Added a help overlay (?)

Observability (src/observability/)
- Added cpu_usage, rss_bytes, output_rows, query_ids, num_tasks fields

Flight service (src/flight_service/)
- Changed the observability_service() generator method on Worker to with_observability_service(). Users no longer need to wrap the ObservabilityServiceImpl that the old observability service generator created with an ObservabilityServiceServer.

old:
new:
Examples
- cluster.rs: spawns and manages a local cluster of workers with observability
- tpcds_runner.rs: supports concurrent query execution and explain analyze