From 9d2a894403b922857886c8f61f23997d16b71f61 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 16 Feb 2026 12:38:38 +0000 Subject: [PATCH 1/5] Initial plan From 97a04140c04d842c228d27f001d3d5757987d387 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 16 Feb 2026 12:43:26 +0000 Subject: [PATCH 2/5] docs: add comprehensive user documentation Co-authored-by: Garbee <868301+Garbee@users.noreply.github.com> --- README.md | 22 ++ docs/examples/README.md | 36 +++ docs/examples/basic-monitoring.md | 147 ++++++++++++ docs/examples/build-optimization.md | 348 +++++++++++++++++++++++++++ docs/examples/debug-mode.md | 286 ++++++++++++++++++++++ docs/faq.md | 352 +++++++++++++++++++++++++++ docs/quick-reference.md | 355 ++++++++++++++++++++++++++++ docs/troubleshooting.md | 257 ++++++++++++++++++++ 8 files changed, 1803 insertions(+) create mode 100644 docs/examples/README.md create mode 100644 docs/examples/basic-monitoring.md create mode 100644 docs/examples/build-optimization.md create mode 100644 docs/examples/debug-mode.md create mode 100644 docs/faq.md create mode 100644 docs/quick-reference.md create mode 100644 docs/troubleshooting.md diff --git a/README.md b/README.md index 5a3ee1e..bd21ea1 100644 --- a/README.md +++ b/README.md @@ -191,6 +191,28 @@ The `runner.debug` context is documented in the [GitHub Actions contexts referen > [!NOTE] > Metrics are displayed with timestamps rather than correlated with specific workflow steps. You can manually correlate metrics with your workflow steps by matching timestamps with the execution times shown in your workflow run logs. 
+## Documentation + +### Quick Links + +- **[Quick Reference](docs/quick-reference.md)** - Fast lookup for common configurations and scenarios +- **[FAQ](docs/faq.md)** - Frequently asked questions and answers +- **[Troubleshooting Guide](docs/troubleshooting.md)** - Solutions to common issues +- **[Examples](docs/examples/)** - Real-world usage scenarios: + - [Basic Monitoring](docs/examples/basic-monitoring.md) - Simple setup for general workflows + - [Debug Mode Only](docs/examples/debug-mode.md) - Conditional metrics collection on demand + - [Build Optimization](docs/examples/build-optimization.md) - Identifying and resolving bottlenecks +- **[Architecture](docs/architecture.md)** - Technical implementation details + +### Getting Help + +If you encounter issues: + +1. Check the [FAQ](docs/faq.md) for common questions +2. Review the [Troubleshooting Guide](docs/troubleshooting.md) for solutions +3. Search [existing issues](https://github.com/Garbee/runner-resource-usage/issues) +4. Open a new issue with details about your problem + ## Development Setup ### 1. Install Dependencies diff --git a/docs/examples/README.md b/docs/examples/README.md new file mode 100644 index 0000000..ae37c32 --- /dev/null +++ b/docs/examples/README.md @@ -0,0 +1,36 @@ +# Usage Examples + +This directory contains real-world examples of using the runner-resource-usage action in different scenarios. 
+ +## Available Examples + +### By Use Case + +- [**Basic Monitoring**](./basic-monitoring.md) - Simple setup for general workflow monitoring +- [**Build Optimization**](./build-optimization.md) - Identifying bottlenecks in build processes +- [**Test Suite Performance**](./test-performance.md) - Monitoring test execution resource usage +- [**Debug Mode Only**](./debug-mode.md) - Conditional metrics collection on demand +- [**CI/CD Pipeline**](./cicd-pipeline.md) - Comprehensive pipeline with deployment stages +- [**Memory-Intensive Workflows**](./memory-intensive.md) - Handling data processing and machine learning +- [**Multi-Job Workflows**](./multi-job.md) - Collecting metrics across multiple jobs + +### By Workflow Type + +- **Node.js Projects**: See [Basic Monitoring](./basic-monitoring.md) +- **Docker Builds**: See [Build Optimization](./build-optimization.md) +- **Python/Data Science**: See [Memory-Intensive Workflows](./memory-intensive.md) +- **Matrix Builds**: See [Multi-Job Workflows](./multi-job.md) + +## Quick Start + +If you're new to this action, start with [Basic Monitoring](./basic-monitoring.md) to understand the fundamental setup. + +## Contributing Examples + +Have a useful example? Please submit a PR! Examples should: + +- Focus on a specific use case or problem +- Include complete, working workflow YAML +- Explain the context and goals +- Highlight key configuration decisions +- Show expected output or interpretation diff --git a/docs/examples/basic-monitoring.md b/docs/examples/basic-monitoring.md new file mode 100644 index 0000000..b4d83fc --- /dev/null +++ b/docs/examples/basic-monitoring.md @@ -0,0 +1,147 @@ +# Basic Monitoring Example + +This example shows the simplest way to add resource monitoring to your workflow. + +## Use Case + +You have a standard Node.js project with build and test steps. You want to understand resource consumption patterns without any complex configuration. 
+
+## Workflow Configuration
+
+```yaml
+name: CI
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+
+permissions:
+  contents: read # Required to clone repository
+
+jobs:
+  test:
+    name: Build and Test
+    runs-on: ubuntu-24.04
+    timeout-minutes: 15
+    steps:
+      # Start metrics collection at the very beginning
+      - name: Start Workflow Telemetry
+        uses: garbee/runner-resource-usage@v1
+
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '20'
+          cache: 'npm'
+
+      - name: Install Dependencies
+        run: npm ci
+
+      - name: Build
+        run: npm run build
+
+      - name: Test
+        run: npm test
+```
+
+## What You Get
+
+After the workflow completes, check the job summary to see:
+
+### Metrics Tables
+
+Three collapsible sections showing timestamped data:
+
+**CPU Usage**:
+```
+| Timestamp                 | Used   | Available |
+|---------------------------|--------|-----------|
+| 2026-02-16T10:30:00.000Z  | 12.5%  | 87.5%     |
+| 2026-02-16T10:30:05.000Z  | 78.3%  | 21.7%     |
+| 2026-02-16T10:30:10.000Z  | 15.2%  | 84.8%     |
+```
+
+**Memory Usage**:
+```
+| Timestamp                 | Used       | Available   |
+|---------------------------|------------|-------------|
+| 2026-02-16T10:30:00.000Z  | 512.5 MB   | 15475.5 MB  |
+| 2026-02-16T10:30:05.000Z  | 2048.3 MB  | 13939.7 MB  |
+| 2026-02-16T10:30:10.000Z  | 1024.8 MB  | 14963.2 MB  |
+```
+
+**Disk Usage**:
+```
+| Timestamp                 | Used     | Available |
+|---------------------------|----------|-----------|
+| 2026-02-16T10:30:00.000Z  | 45.2 GB  | 98.8 GB   |
+| 2026-02-16T10:30:05.000Z  | 48.5 GB  | 95.5 GB   |
+| 2026-02-16T10:30:10.000Z  | 46.1 GB  | 97.9 GB   |
+```
+
+### Alerts (if thresholds exceeded)
+
+If any resource exceeds default thresholds, you'll see alerts:
+
+> [!WARNING]
+> 🔥 Sustained CPU usage above 85% for more than 60 seconds (92.0%)
+
+## Interpreting Results
+
+### 1. Correlate with Workflow Steps
+
+Match metric timestamps with your workflow run timeline:
+
+1. Open your workflow run in GitHub Actions
+2. Note the start/end times of each step
+3. Compare with metric timestamps
+
+**Example**:
+- Build step: 10:30:05 - 10:30:35 → High CPU usage expected
+- Test step: 10:30:40 - 10:31:10 → High CPU and memory expected
+- Idle periods: Low resource usage
+
+### 2. Identify Resource Patterns
+
+Look for:
+- **Spikes**: Sudden increases indicating intensive operations
+- **Plateaus**: Sustained high usage that might indicate inefficiency
+- **Drops**: Completion of resource-intensive steps
+- **Baseline**: Typical "idle" usage between steps
+
+### 3. Common Patterns
+
+**Normal Build Pattern**:
+```
+Checkout: Low CPU (5-15%), Low Memory (500-1000 MB)
+Install Deps: Medium CPU (30-50%), Medium Memory (1-2 GB)
+Build: High CPU (60-90%), Medium-High Memory (2-4 GB)
+Test: High CPU (70-95%), Variable Memory (1-8 GB)
+```
+
+**Potential Issues**:
+- CPU consistently > 90%: Consider optimizing or larger runner
+- Memory growing continuously: Possible memory leak in build/test
+- Disk usage spiking: Large artifacts or insufficient cleanup
+
+## Next Steps
+
+Once you understand basic metrics:
+
+1. **Optimize**: Use insights to improve slow steps
+2. **Adjust Thresholds**: Customize based on your baseline (see [Build Optimization](./build-optimization.md))
+3. **Debug Mode**: Enable only when needed (see [Debug Mode Only](./debug-mode.md))
+4. 
**Advanced Scenarios**: Explore other examples for specific use cases + +## Related Examples + +- [Build Optimization](./build-optimization.md) - Deep dive into improving build performance +- [Debug Mode Only](./debug-mode.md) - Collect metrics on demand +- [Test Suite Performance](./test-performance.md) - Focus on test execution diff --git a/docs/examples/build-optimization.md b/docs/examples/build-optimization.md new file mode 100644 index 0000000..2925ab8 --- /dev/null +++ b/docs/examples/build-optimization.md @@ -0,0 +1,348 @@ +# Build Optimization Example + +This example demonstrates how to use resource metrics to identify and resolve build performance bottlenecks. + +## Use Case + +Your Docker build process is slow, taking 8-10 minutes. You want to understand which parts of the build consume the most resources and optimize accordingly. + +## Initial Workflow + +```yaml +name: Docker Build + +on: + push: + branches: + - main + pull_request: + branches: + - main + +permissions: + contents: read + +jobs: + build: + name: Build Docker Image + runs-on: ubuntu-24.04 + timeout-minutes: 20 + steps: + # Enable metrics collection with optimized settings for builds + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "5" + memory_alert_threshold: "85" # Docker builds can use significant memory + cpu_alert_threshold: "90" # Compilation is CPU-intensive + cpu_alert_duration: "120" # Allow sustained high CPU during build + disk_alert_threshold: "85" # Docker layers can consume disk space + + - name: Checkout + uses: actions/checkout@v4 + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + + - name: Build Image + uses: docker/build-push-action@v5 + with: + context: . + push: false + tags: myapp:latest + cache-from: type=gha + cache-to: type=gha,mode=max +``` + +## Step 1: Collect Baseline Metrics + +Run the workflow and examine the metrics in the job summary. 
+
+### Example Baseline Results
+
+**CPU Usage Pattern**:
+```
+10:00:00 - 15% (Checkout)
+10:00:05 - 20% (Setup Buildx)
+10:00:10 - 85% (Build - dependency installation)
+10:00:30 - 95% (Build - compilation)
+10:05:00 - 90% (Build - asset processing)
+10:08:00 - 25% (Build - finalizing layers)
+```
+
+**Memory Usage Pattern**:
+```
+10:00:00 - 500 MB (Checkout)
+10:00:05 - 800 MB (Setup Buildx)
+10:00:10 - 3.5 GB (Build - dependencies)
+10:00:30 - 5.2 GB (Build - compilation)
+10:05:00 - 6.8 GB (Build - assets)
+10:08:00 - 4.1 GB (Build - cleanup)
+```
+
+**Disk Usage Pattern**:
+```
+10:00:00 - 45 GB (Initial)
+10:00:10 - 52 GB (Docker layers)
+10:00:30 - 58 GB (Build cache)
+10:08:00 - 62 GB (Final image)
+```
+
+### Key Insights
+
+1. **CPU**: Consistently high (85-95%) during build phases
+2. **Memory**: Peaks at 6.8 GB during asset processing
+3. **Disk**: Grows by 17 GB during build
+4. **Duration**: Total 8 minutes
+
+## Step 2: Identify Bottlenecks
+
+Based on metrics, identify what's limiting performance:
+
+### Is it CPU-bound?
+
+**Indicators**:
+- ✅ CPU usage 85-95% for extended periods
+- ✅ Build phase takes longest (5+ minutes)
+- ✅ Memory and disk have capacity remaining
+
+**Conclusion**: Primary bottleneck is CPU
+
+### Optimization Strategy for CPU-bound Builds
+
+1. **Enable parallelization** in build tools
+2. **Use layer caching** effectively
+3. **Consider larger runner** with more cores
+4. **Optimize compilation settings**
+
+## Step 3: Apply Optimizations
+
+### Optimization 1: Multi-stage Build with Caching
+
+```dockerfile
+# Before: Single-stage build
+FROM node:20
+WORKDIR /app
+COPY . .
+RUN npm install
+RUN npm run build
+CMD ["npm", "start"]
+
+# After: Multi-stage build with better caching
+FROM node:20 AS dependencies
+WORKDIR /app
+COPY package*.json ./
+RUN npm ci --omit=dev
+
+FROM node:20 AS builder
+WORKDIR /app
+COPY package*.json ./
+RUN npm ci
+COPY . .
+RUN npm run build
+
+FROM node:20-slim AS runtime
+WORKDIR /app
+COPY --from=dependencies /app/node_modules ./node_modules
+COPY --from=builder /app/dist ./dist
+COPY package*.json ./
+CMD ["npm", "start"]
+```
+
+### Optimization 2: Build Arguments for Parallelization
+
+```yaml
+- name: Build Image
+  uses: docker/build-push-action@v5
+  with:
+    context: .
+    push: false
+    tags: myapp:latest
+    cache-from: type=gha
+    cache-to: type=gha,mode=max
+    # JOBS requests parallel compilation where the build supports it
+    build-args: |
+      NODE_OPTIONS=--max-old-space-size=4096
+      JOBS=4
+```
+
+### Optimization 3: Larger Runner (if budget allows)
+
+```yaml
+jobs:
+  build:
+    runs-on: ubuntu-24.04-4-core # Larger runner with 4 cores (label depends on your larger-runner setup)
+    # More cores = faster parallel compilation
+```
+
+## Step 4: Measure Impact
+
+Run the optimized workflow with metrics collection:
+
+### After Optimization Results
+
+**CPU Usage Pattern**:
+```
+10:00:00 - 15% (Checkout)
+10:00:05 - 20% (Setup Buildx)
+10:00:10 - 75% (Build - cached dependencies) ⬇️ Improved
+10:00:20 - 85% (Build - compilation) ⬇️ Shorter
+10:02:00 - 80% (Build - assets) ⬇️ Shorter
+10:03:30 - 25% (Build - finalizing)
+```
+
+**Duration Comparison**:
+- Before: 8 minutes
+- After: 3.5 minutes
+- **Improvement: 56% faster** 🎉
+
+**Resource Changes**:
+- Peak CPU: 95% → 85% (better distribution)
+- Peak Memory: 6.8 GB → 5.2 GB (eliminated waste)
+- Disk Usage: +17 GB → +12 GB (better layer caching)
+
+## Step 5: Set Appropriate Alerts
+
+Now that you know the optimized baseline, set thresholds to catch regressions:
+
+```yaml
+- name: Start Workflow Telemetry
+  uses: garbee/runner-resource-usage@v1
+  with:
+    interval_seconds: "5"
+    memory_alert_threshold: "90" # Alert if exceeds optimized usage
+    cpu_alert_threshold: "95" # Alert if maxing out CPU
+    cpu_alert_duration: "90" # Alert if sustained beyond expected
+    disk_alert_threshold: "85" # Alert if approaching limits
+```
+
+These thresholds will alert you if:
+- A code change introduces memory inefficiency
+- Build process starts taking longer (CPU sustained beyond 90s)
+- Docker layers grow unexpectedly large
+
+## Common Build Optimization Patterns
+
+### Pattern 1: Memory-Bound Build
+
+**Symptoms**:
+- Memory consistently near limit
+- CPU under 70%
+- Swap usage (if visible in system metrics)
+
+**Solutions**:
+```yaml
+# Increase Node.js heap size
+build-args: |
+  NODE_OPTIONS=--max-old-space-size=8192
+
+# Or use a larger runner (label depends on your larger-runner setup)
+runs-on: ubuntu-24.04-16gb
+```
+
+### Pattern 2: Disk I/O-Bound Build
+
+**Symptoms**:
+- Moderate CPU (40-60%)
+- Disk usage growing slowly but steadily
+- Long pauses between build steps
+
+**Solutions**:
+```yaml
+# Use aggressive caching
+- name: Cache Dependencies
+  uses: actions/cache@v4
+  with:
+    path: |
+      ~/.npm
+      node_modules
+      .next/cache
+    key: ${{ runner.os }}-build-${{ hashFiles('**/package-lock.json') }}
+
+# Or use faster storage runners (label depends on your runner setup)
+runs-on: ubuntu-24.04-ssd
+```
+
+### Pattern 3: Network-Bound Build
+
+**Symptoms**:
+- Low CPU during dependency installation
+- Long "Installing dependencies" step
+- Disk not growing during this time
+
+**Solutions**:
+```yaml
+# Pre-cache dependencies
+- name: Cache Dependencies
+  uses: actions/cache@v4
+  with:
+    path: ~/.npm
+    key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
+
+# Or point npm at a registry mirror close to your runners
+- name: Configure NPM Registry
+  run: npm config set registry https://registry.npmjs.org/ # replace with your mirror URL
+```
+
+## Advanced: Parallel Matrix Builds
+
+If building for multiple platforms:
+
+```yaml
+jobs:
+  build:
+    strategy:
+      matrix:
+        platform: [linux/amd64, linux/arm64]
+    steps:
+      - name: Start Workflow Telemetry
+        uses: garbee/runner-resource-usage@v1
+
+      - name: Build for ${{ matrix.platform }}
+        uses: docker/build-push-action@v5
+        with:
+          platforms: ${{ matrix.platform }}
+```
+
+Compare metrics across matrix builds to identify platform-specific issues.
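+
+Since each matrix job produces its own job summary, it helps to include the matrix value in the job name so the metric tables are easy to tell apart. This uses only standard GitHub Actions syntax; the sketch below simply adds a `name` key to the matrix job shown above:
+
+```yaml
+jobs:
+  build:
+    name: Build (${{ matrix.platform }})
+    strategy:
+      matrix:
+        platform: [linux/amd64, linux/arm64]
+    runs-on: ubuntu-24.04
+    steps:
+      - name: Start Workflow Telemetry
+        uses: garbee/runner-resource-usage@v1
+
+      - name: Build for ${{ matrix.platform }}
+        uses: docker/build-push-action@v5
+        with:
+          platforms: ${{ matrix.platform }}
+```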
+
+## Tracking Progress Over Time
+
+Create a performance tracking workflow:
+
+```yaml
+name: Performance Tracking
+
+on:
+  schedule:
+    - cron: '0 0 * * 1' # Weekly on Monday
+
+jobs:
+  baseline:
+    runs-on: ubuntu-24.04
+    steps:
+      - name: Start Workflow Telemetry
+        uses: garbee/runner-resource-usage@v1
+
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Build
+        run: npm run build
+
+      - name: Record Duration
+        run: echo "Build completed - check metrics in summary"
+```
+
+Review weekly metrics to catch performance regressions early.
+
+## Related Examples
+
+- [Basic Monitoring](./basic-monitoring.md) - Understanding metrics fundamentals
+- [Memory-Intensive Workflows](./memory-intensive.md) - Handling high memory usage
+- [Multi-Job Workflows](./multi-job.md) - Comparing metrics across jobs
+
+## Further Reading
+
+- [Docker Build Best Practices](https://docs.docker.com/build/building/best-practices/)
+- [GitHub Actions: Larger Runners](https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners)
+- [Optimizing Node.js Builds](https://nodejs.org/en/docs/guides/simple-profiling/)
diff --git a/docs/examples/debug-mode.md b/docs/examples/debug-mode.md
new file mode 100644
index 0000000..1b89871
--- /dev/null
+++ b/docs/examples/debug-mode.md
@@ -0,0 +1,286 @@
+# Debug Mode Only Example
+
+This example shows how to enable metrics collection on demand using GitHub Actions debug mode.
+
+## Use Case
+
+You have a workflow that runs frequently (on every push/PR). You don't need metrics on every run, but want them available when investigating performance issues or failures, without modifying the workflow file.
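+
+The entire mechanism is a single `if` condition on the telemetry step, checking the `runner.debug` context; everything else in the workflow stays exactly as it was:
+
+```yaml
+- name: Start Workflow Telemetry
+  if: ${{ runner.debug == '1' }}
+  uses: garbee/runner-resource-usage@v1
+```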
+
+## Benefits
+
+- **Zero Performance Impact**: No overhead on regular runs
+- **Always Available**: Ready to use when needed
+- **No Code Changes**: Enable via UI, not by editing workflow
+- **Team Friendly**: Any team member can enable it
+
+## Workflow Configuration
+
+```yaml
+name: CI
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
+
+permissions:
+  contents: read
+
+jobs:
+  test:
+    name: Build and Test
+    runs-on: ubuntu-24.04
+    timeout-minutes: 15
+    steps:
+      # Conditionally start metrics collection
+      - name: Start Workflow Telemetry
+        if: ${{ runner.debug == '1' }}
+        uses: garbee/runner-resource-usage@v1
+
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Setup Node.js
+        uses: actions/setup-node@v4
+        with:
+          node-version: '20'
+          cache: 'npm'
+
+      - name: Install Dependencies
+        run: npm ci
+
+      - name: Build
+        run: npm run build
+
+      - name: Test
+        run: npm test
+```
+
+## How to Enable Metrics Collection
+
+When you need to collect metrics:
+
+### For a Completed Run
+
+1. Navigate to the workflow run in GitHub Actions
+2. Click **"Re-run jobs"** button (top-right)
+3. Check **"Enable debug logging"** checkbox
+4. Click **"Re-run jobs"**
+
+The workflow will re-execute with metrics collection enabled.
+
+### For a New Run
+
+You can also trigger a workflow manually:
+
+1. Go to the Actions tab
+2. Select your workflow
+3. Click the **"Run workflow"** button
+4. Click **"Run workflow"** to start the run
+
+Note that the **"Enable debug logging"** checkbox is only offered when re-running an existing run. To enable debug mode for fresh runs, set the `ACTIONS_RUNNER_DEBUG` or `ACTIONS_STEP_DEBUG` repository secret or variable to `true`.
+
+## What Happens
+
+### Without Debug Mode (Normal Run)
+
+```
+✓ Start Workflow Telemetry (skipped)
+✓ Checkout (2s)
+✓ Setup Node.js (1s)
+✓ Install Dependencies (15s)
+✓ Build (45s)
+✓ Test (30s)
+```
+
+The telemetry step is skipped, no metrics are collected, zero overhead.
+
+### With Debug Mode Enabled
+
+```
+✓ Start Workflow Telemetry (1s)
+✓ Checkout (2s)
+✓ Setup Node.js (1s)
+✓ Install Dependencies (15s)
+✓ Build (45s)
+✓ Test (30s)
+
+📊 Job Summary includes:
+- CPU Usage metrics table
+- Memory Usage metrics table
+- Disk Usage metrics table
+- Alerts (if thresholds exceeded)
+```
+
+Telemetry step runs, metrics are collected, results appear in job summary.
+
+## Common Scenarios
+
+### Investigating Slow Builds
+
+**Problem**: Build suddenly takes 5 minutes instead of 2 minutes.
+
+**Solution**:
+1. Re-run with debug mode enabled
+2. Check metrics to identify which resource is constrained
+3. Look for patterns:
+   - CPU maxed out → Compute-bound (optimize parallelization)
+   - Memory high → Memory-bound (reduce memory usage or use larger runner)
+   - Disk growing → I/O-bound (optimize disk operations)
+
+### Debugging Test Failures
+
+**Problem**: Tests occasionally fail with timeout or out-of-memory errors.
+
+**Solution**:
+1. Re-run failing test with debug mode
+2. Observe resource patterns during test execution
+3. Identify if specific tests correlate with resource spikes
+4. Use timestamps to pinpoint problematic test suites
+
+### Performance Regression Analysis
+
+**Problem**: New PR causes workflow to run slower.
+
+**Solution**:
+1. Run main branch workflow with debug mode
+2. Run PR branch workflow with debug mode
+3. Compare metrics between runs
+4. Identify which step(s) show increased resource usage
+
+## Advanced: Custom Thresholds with Debug Mode
+
+You can combine debug mode with custom thresholds for sensitive monitoring:
+
+```yaml
+- name: Start Workflow Telemetry
+  if: ${{ runner.debug == '1' }}
+  uses: garbee/runner-resource-usage@v1
+  with:
+    interval_seconds: "3" # More granular when debugging
+    memory_alert_threshold: "70" # More sensitive alerts
+    cpu_alert_threshold: "75"
+    cpu_alert_duration: "30"
+    disk_alert_threshold: "80"
+```
+
+This gives you detailed metrics and early alerts only when actively investigating issues.
+
+## Environment Variable Alternative
+
+If you prefer environment-based control:
+
+```yaml
+env:
+  ENABLE_METRICS: "false" # Set to "true" to enable
+
+jobs:
+  test:
+    steps:
+      - name: Start Workflow Telemetry
+        if: ${{ env.ENABLE_METRICS == 'true' || runner.debug == '1' }}
+        uses: garbee/runner-resource-usage@v1
+```
+
+This allows enabling via:
+- Repository secrets/variables
+- Debug mode checkbox
+- Manual workflow dispatch inputs
+
+## Best Practices
+
+### When to Use Debug Mode
+
+✅ **Good for**:
+- Workflows that run frequently (every commit)
+- Production pipelines where overhead matters
+- Teams with multiple contributors who may need metrics occasionally
+- Troubleshooting intermittent issues
+
+❌ **Not ideal for**:
+- Continuous performance monitoring
+- Automated performance regression detection
+- When you need metrics on every run
+
+### Alternative Approaches
+
+If you need metrics more often:
+
+**Option 1: Always On**
+```yaml
+- name: Start Workflow Telemetry
+  uses: garbee/runner-resource-usage@v1
+```
+
+**Option 2: Scheduled Runs Only**
+```yaml
+- name: Start Workflow Telemetry
+  if: ${{ github.event_name == 'schedule' }}
+  uses: garbee/runner-resource-usage@v1
+```
+
+**Option 3: Main Branch Only**
+```yaml
+- name: Start Workflow Telemetry
+  if: ${{ github.ref == 'refs/heads/main' }}
+  uses: garbee/runner-resource-usage@v1
+```
+
+## Troubleshooting
+
+### Debug Mode Not Working
+
+**Symptom**: Checked "Enable debug logging" but no metrics appear.
+
+**Checks**:
+1. Verify the `if` condition uses `runner.debug == '1'` (string comparison)
+2. Look for "Start Workflow Telemetry" step in logs - is it skipped?
+3. Check post-action logs for errors
+
+**Debug**:
+```yaml
+- name: Check Debug Mode
+  run: echo "Debug mode is ${{ runner.debug }}"
+
+- name: Start Workflow Telemetry
+  if: ${{ runner.debug == '1' }}
+  uses: garbee/runner-resource-usage@v1
+```
+
+### Want Metrics Without Full Debug Logging
+
+Debug mode enables verbose logging across all actions. If you only want metrics:
+
+```yaml
+# Use a custom workflow input instead
+on:
+  workflow_dispatch:
+    inputs:
+      enable_metrics:
+        description: 'Enable resource metrics collection'
+        required: false
+        default: 'false'
+        type: choice
+        options:
+          - 'true'
+          - 'false'
+
+jobs:
+  test:
+    runs-on: ubuntu-24.04
+    steps:
+      - name: Start Workflow Telemetry
+        if: ${{ inputs.enable_metrics == 'true' }}
+        uses: garbee/runner-resource-usage@v1
+```
+
+## Related Examples
+
+- [Basic Monitoring](./basic-monitoring.md) - Understanding metrics output
+- [Build Optimization](./build-optimization.md) - Using metrics to improve performance
+- [CI/CD Pipeline](./cicd-pipeline.md) - Selective metrics in complex pipelines
+
+## Reference
+
+- [GitHub Actions Contexts - runner.debug](https://docs.github.com/en/actions/learn-github-actions/contexts#runner-context)
+- [Enabling Debug Logging](https://docs.github.com/en/actions/monitoring-and-troubleshooting-workflows/enabling-debug-logging)
diff --git a/docs/faq.md b/docs/faq.md
new file mode 100644
index 0000000..fea692f
--- /dev/null
+++ b/docs/faq.md
@@ -0,0 +1,352 @@
+# Frequently Asked Questions (FAQ)
+
+## General Questions
+
+### What does this action do?
+
+This action collects system resource metrics (CPU, memory, and disk usage) throughout your workflow execution and displays them in clear tables in the job summary.
It helps you identify performance bottlenecks, optimize resource usage, and prevent failures due to resource exhaustion. + +### Do I need any special permissions? + +No special permissions required. The action only needs `contents: read` (to clone the repository), which is typically already granted in most workflows. + +### Does this work with self-hosted runners? + +Yes, but self-hosted runners must have: +- Node.js 24 or later +- Access to system metrics (Linux: `/proc`, macOS/Windows: standard system utilities) +- Proper permissions for the runner process + +### How much overhead does this add? + +Minimal impact: +- **CPU**: < 1% typical usage +- **Memory**: ~1MB +- **Disk I/O**: Small writes every collection interval +- **Workflow Duration**: < 5 seconds added (startup + summary generation) + +For most workflows, the overhead is negligible. + +## Setup and Configuration + +### Where should I place this action in my workflow? + +Always place it as the **first step** in your job: + +```yaml +steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + + # All other steps follow + - name: Checkout + uses: actions/checkout@v4 +``` + +This ensures metrics collection starts immediately and covers the entire workflow. + +### Can I use this in multiple jobs? + +Yes! Add the action to each job you want to monitor: + +```yaml +jobs: + build: + steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + # ... build steps + + test: + steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + # ... test steps +``` + +Each job will have its own metrics in its job summary. + +### What collection interval should I use? 
+
+**Default (5 seconds)**: Good for most workflows
+- Provides detailed visibility
+- Minimal overhead
+- Recommended starting point
+
+**3 seconds**: For debugging specific issues
+- More granular data
+- Slightly higher overhead
+- Use when investigating performance problems
+
+**10+ seconds**: For long-running workflows
+- Reduces overhead
+- Less granular data
+- Good for workflows > 30 minutes
+
+```yaml
+- name: Start Workflow Telemetry
+  uses: garbee/runner-resource-usage@v1
+  with:
+    interval_seconds: "10"
+```
+
+### How do I adjust alert thresholds?
+
+Set thresholds based on your workflow's normal resource usage:
+
+```yaml
+- name: Start Workflow Telemetry
+  uses: garbee/runner-resource-usage@v1
+  with:
+    memory_alert_threshold: "85" # Alert at 85% memory usage
+    cpu_alert_threshold: "90" # Alert at 90% CPU usage
+    cpu_alert_duration: "120" # Alert after 120 seconds of high CPU
+    disk_alert_threshold: "90" # Alert at 90% disk usage
+```
+
+**Tips**:
+- Start with defaults and adjust based on actual usage
+- Set thresholds 5-10% above your normal peak usage
+- Use longer `cpu_alert_duration` for compile-heavy workflows
+
+## Metrics and Interpretation
+
+### How do I correlate metrics with workflow steps?
+
+Metrics include ISO 8601 timestamps. To correlate:
+
+1. Open your workflow run in GitHub Actions
+2. Note the start/end times of steps in the timeline
+3. Match these times with metric timestamps in the tables
+
+**Example**:
+```
+Workflow Timeline:
+- 10:30:00-10:30:05: Checkout
+- 10:30:05-10:30:35: Build
+- 10:30:35-10:31:00: Test
+
+Metrics show high CPU at 10:30:15 → During Build step
+```
+
+### Why don't metrics show step names?
+
+This is a deliberate design choice for simplicity and reliability:
+- No external API calls required
+- Works without additional permissions
+- Timestamp-based correlation is straightforward
+- Reduces complexity and potential failure points
+
+### What do the alert emojis mean?
+
+- ⚠️ **Warning sign**: Memory threshold exceeded
+- 🔥 **Fire**: Sustained high CPU usage
+- 💾 **Floppy disk**: Disk usage threshold exceeded
+
+### How accurate are the metrics?
+
+Metrics are accurate within the collection interval:
+- **Collection interval**: 5 seconds (default)
+- **Timestamp precision**: Milliseconds
+- **Measurement accuracy**: Depends on OS utilities (systeminformation library)
+
+Short-lived spikes between collections may be missed. Decrease interval for better granularity.
+
+### What if I see no metrics in the summary?
+
+Check these common causes:
+
+1. **Action skipped**: If using `if` condition, ensure it evaluated to true
+2. **Collector failed**: Check main action logs for errors
+3. **Post action failed**: Check post-action logs for errors
+4. **Permissions issue**: Ensure runner can write to state directory
+
+See [Troubleshooting Guide](./troubleshooting.md#no-metrics-displayed-in-job-summary) for detailed debugging steps.
+
+## Advanced Usage
+
+### Can I collect metrics only sometimes?
+
+Yes! Use conditional execution with debug mode:
+
+```yaml
+- name: Start Workflow Telemetry
+  if: ${{ runner.debug == '1' }}
+  uses: garbee/runner-resource-usage@v1
+```
+
+Enable by re-running the workflow with "Enable debug logging" checked.
+
+See [Debug Mode Example](./examples/debug-mode.md) for details.
+
+### Can I export metrics to external systems?
+
+Not directly. The action outputs to GitHub Actions job summary only.
+
+To export metrics:
+1. Access the state file created by the action
+2. Parse the JSON data
+3. Send to your monitoring system
+
+The state file is located at:
+```
+$GITHUB_STATE_DIR/metrics-state-{runId}-{job}.json
+```
+
+### Can I use this with reusable workflows?
+
+Yes! 
Add the action to your reusable workflow: + +```yaml +# .github/workflows/reusable-build.yml +name: Reusable Build + +on: + workflow_call: + +jobs: + build: + runs-on: ubuntu-24.04 + steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + + - name: Checkout + uses: actions/checkout@v4 + # ... other steps +``` + +Call it from other workflows: + +```yaml +# .github/workflows/ci.yml +jobs: + build: + uses: ./.github/workflows/reusable-build.yml +``` + +### Can I disable specific metric types? + +No, the action collects all three metrics (CPU, memory, disk). However, you can: +- Ignore specific sections in the output +- Set very high thresholds for metrics you don't care about +- Fork the action and modify it for your needs + +## Troubleshooting + +### The action slows down my workflow significantly + +This is unexpected. Check: + +1. **Collection interval**: Is it very frequent (< 3 seconds)? + - Solution: Increase to 5+ seconds + +2. **Runner resources**: Is the runner already at capacity? + - Solution: Optimize workflow or use larger runner + +3. **Disk I/O contention**: Slow disk on self-hosted runner? + - Solution: Increase interval to reduce writes + +See [Troubleshooting Guide](./troubleshooting.md#workflow-becomes-slower-after-adding-action) for details. + +### I get alerts on every run + +Your thresholds may be too low for your workflow's normal usage. + +**Solution**: Establish a baseline first: + +1. Run workflow with default settings +2. Note peak resource usage in normal runs +3. 
Set thresholds 5-10% above peaks + +```yaml +# After observing typical peaks: CPU 75%, Memory 65% +- name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + memory_alert_threshold: "75" # 10% above normal peak + cpu_alert_threshold: "85" # 10% above normal peak +``` + +### Metrics show unexpected patterns + +**High baseline usage**: Runner may have other processes running +- Check runner system load before workflow +- Consider dedicated runners for accurate metrics + +**Periodic spikes**: Normal for some operations +- Build tools often spike CPU +- Package installations spike disk I/O +- Test frameworks may spike memory + +**Continuous growth**: Potential issues +- Memory leak in build/test code +- Accumulating temporary files +- Missing cleanup steps + +## Comparison with Other Tools + +### How does this compare to DataDog CI Visibility? + +**This Action**: +- βœ… Free, built into GitHub Actions +- βœ… Zero configuration +- βœ… Immediate results in job summary +- ❌ No historical trending +- ❌ No cross-repository analysis +- ❌ Basic alerting only + +**DataDog CI Visibility**: +- βœ… Rich historical analysis +- βœ… Cross-repository dashboards +- βœ… Advanced alerting and correlation +- ❌ Costs money +- ❌ Requires setup and integration +- ❌ External dependency + +**Choose this action** for simple, immediate insights. +**Choose DataDog** for enterprise-wide CI/CD monitoring. + +### Can I use both this action and other monitoring tools? + +Absolutely! They serve complementary purposes: + +```yaml +- name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + +- name: DataDog CI Setup + uses: datadog/ci-action@v1 + # ... DataDog configuration +``` + +Use this action for immediate insights, external tools for long-term analysis. + +## Contributing and Support + +### How do I report a bug? + +1. Check [existing issues](https://github.com/Garbee/runner-resource-usage/issues) +2. 
Create a new issue with: + - Workflow configuration + - Runner OS and version + - Complete error messages + - Steps to reproduce + +### How do I request a feature? + +Open an issue describing: +- The use case +- Expected behavior +- Why existing features don't suffice + +### Can I contribute? + +Yes! Pull requests are welcome. See [CONTRIBUTING.md](../CONTRIBUTING.md) if available, or open an issue to discuss your idea first. + +## Related Resources + +- [README](../README.md) - Main documentation +- [Architecture](./architecture.md) - Technical details +- [Troubleshooting Guide](./troubleshooting.md) - Common issues and solutions +- [Examples](./examples/) - Real-world usage scenarios diff --git a/docs/quick-reference.md b/docs/quick-reference.md new file mode 100644 index 0000000..37ddb9f --- /dev/null +++ b/docs/quick-reference.md @@ -0,0 +1,355 @@ +# Quick Reference + +Fast lookup for common tasks and configurations. + +## Basic Setup + +### Minimal Configuration + +```yaml +steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 +``` + +### With Custom Thresholds + +```yaml +steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + memory_alert_threshold: "85" + cpu_alert_threshold: "90" + cpu_alert_duration: "120" + disk_alert_threshold: "90" +``` + +### Debug Mode Only + +```yaml +steps: + - name: Start Workflow Telemetry + if: ${{ runner.debug == '1' }} + uses: garbee/runner-resource-usage@v1 +``` + +## Configuration Options + +| Input | Description | Default | Range | +|-------|-------------|---------|-------| +| `interval_seconds` | Collection interval | `5` | 1-300 | +| `memory_alert_threshold` | Memory alert percentage | `80` | 0-100 | +| `cpu_alert_threshold` | CPU alert percentage | `85` | 0-100 | +| `cpu_alert_duration` | CPU sustained duration (seconds) | `60` | 1-3600 | +| `disk_alert_threshold` | Disk alert percentage | `90` | 0-100 | + +## Common Scenarios + +### Scenario: Build 
Optimization + +```yaml +- uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "5" + memory_alert_threshold: "85" + cpu_alert_threshold: "90" + cpu_alert_duration: "120" +``` + +**Why**: Builds are CPU-intensive and may legitimately use high CPU for extended periods. + +### Scenario: Memory-Intensive Processing + +```yaml +- uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "5" + memory_alert_threshold: "90" + cpu_alert_threshold: "85" +``` + +**Why**: Data processing or ML workloads need high memory; alert only when approaching limits. + +### Scenario: Long-Running Tests + +```yaml +- uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "10" + cpu_alert_duration: "300" +``` + +**Why**: Longer interval reduces overhead; longer duration avoids false alerts during test suites. + +### Scenario: Disk-Heavy Workflows + +```yaml +- uses: garbee/runner-resource-usage@v1 + with: + disk_alert_threshold: "85" +``` + +**Why**: Build artifacts or Docker images can quickly consume disk space. + +### Scenario: Debugging Performance Issues + +```yaml +- uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "3" + memory_alert_threshold: "70" + cpu_alert_threshold: "75" + cpu_alert_duration: "30" +``` + +**Why**: More granular data and sensitive alerts help pinpoint issues quickly. + +## Alert Interpretation + +### CPU Alert + +> πŸ”₯ Sustained CPU usage above 90% for more than 60 seconds (92.0%) + +**Meaning**: CPU was at or above threshold for the specified duration. + +**Actions**: +- Review metrics to identify which step caused high CPU +- Consider parallelization or optimization +- Evaluate if larger runner is needed + +### Memory Alert + +> ⚠️ Memory utilization exceeded 85% (86.8%) + +**Meaning**: Memory usage reached or exceeded threshold at some point. 
+ +**Actions**: +- Check for memory leaks +- Reduce memory usage in builds/tests +- Consider larger runner if legitimately needed + +### Disk Alert + +> πŸ’Ύ Disk usage exceeded 90% (91.2%) + +**Meaning**: Disk space consumed exceeded threshold. + +**Actions**: +- Clean up temporary files between steps +- Remove large artifacts after use +- Use Docker layer caching more effectively +- Consider runner with more disk space + +## Threshold Selection Guide + +### Conservative (Early Warning) + +```yaml +memory_alert_threshold: "70" +cpu_alert_threshold: "75" +cpu_alert_duration: "30" +disk_alert_threshold: "75" +``` + +**Use when**: Establishing baseline, debugging issues + +### Balanced (Default) + +```yaml +memory_alert_threshold: "80" +cpu_alert_threshold: "85" +cpu_alert_duration: "60" +disk_alert_threshold: "90" +``` + +**Use when**: Normal monitoring + +### Permissive (Avoid False Positives) + +```yaml +memory_alert_threshold: "90" +cpu_alert_threshold: "95" +cpu_alert_duration: "180" +disk_alert_threshold: "95" +``` + +**Use when**: Resource-intensive workflows with known high usage + +## Metric Correlation Guide + +### Step 1: View Workflow Timeline + +In GitHub Actions UI: +1. Open workflow run +2. Click on job name +3. 
Note start/end times of each step + +### Step 2: Check Job Summary + +Scroll to bottom of job logs to see: +- CPU Usage table +- Memory Usage table +- Disk Usage table +- Alert section (if any) + +### Step 3: Match Timestamps + +Compare metric timestamps with step times: + +**Example**: +``` +Workflow: +β”œβ”€ 10:30:00 Checkout (5s) +β”œβ”€ 10:30:05 Build (120s) +└─ 10:32:05 Test (60s) + +Metrics: +β”œβ”€ 10:30:00 CPU: 15% β†’ Checkout +β”œβ”€ 10:30:45 CPU: 90% β†’ Build (peak) +└─ 10:32:15 CPU: 60% β†’ Test +``` + +### Step 4: Identify Patterns + +Look for: +- **Spikes**: Sudden resource increases +- **Sustained high usage**: Extended periods near limits +- **Gradual growth**: Memory leaks or accumulation +- **Drops**: Step completion or cleanup + +## Platform-Specific Notes + +### Linux Runners + +**Mount Point**: `/` + +**Standard Setup**: +```yaml +runs-on: ubuntu-24.04 +steps: + - uses: garbee/runner-resource-usage@v1 +``` + +**Typical Resources**: +- 2 cores, 7 GB RAM, 14 GB disk (standard) +- 4 cores, 16 GB RAM, 150 GB disk (4-core) + +### macOS Runners + +**Mount Point**: `/System/Volumes/Data` + +**Standard Setup**: +```yaml +runs-on: macos-24 +steps: + - uses: garbee/runner-resource-usage@v1 +``` + +**Typical Resources**: +- 3 cores, 14 GB RAM, 14 GB disk + +**Note**: macOS runners have less available memory; adjust thresholds accordingly. + +### Windows Runners + +**Mount Point**: `C:` + +**Standard Setup**: +```yaml +runs-on: windows-2025 +steps: + - uses: garbee/runner-resource-usage@v1 +``` + +**Typical Resources**: +- 2 cores, 7 GB RAM, 14 GB disk (standard) + +**Note**: Windows runners may show higher baseline memory usage due to OS overhead. 
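Each platform's disk percentage is measured against the mount point listed above. To sanity-check how full a runner actually is — for example, when picking a `disk_alert_threshold` — here is a small stdlib sketch. The mount-point mapping follows this section's notes; how the action computes the percentage internally is an assumption here.

```python
# Rough sketch: compute the disk-usage percentage that a
# `disk_alert_threshold` would be compared against, using each
# platform's root mount point from the notes above.
import platform
import shutil

MOUNT_POINTS = {
    "Linux": "/",
    "Darwin": "/System/Volumes/Data",  # macOS data volume
    "Windows": "C:\\",
}

def disk_usage_percent(mount: str) -> float:
    """Return used disk space as a percentage of total capacity."""
    usage = shutil.disk_usage(mount)
    return usage.used / usage.total * 100

mount = MOUNT_POINTS.get(platform.system(), "/")
print(f"{mount}: {disk_usage_percent(mount):.1f}% used")
```

Running this in a throwaway workflow step gives a quick baseline before tightening or loosening the disk threshold.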
+ +## Troubleshooting Quick Fixes + +### Problem: No Metrics Appear + +```yaml +# Add debug step +- name: Check State Directory + if: always() + run: | + echo "State dir: $GITHUB_STATE" + ls -la "$(dirname "$GITHUB_STATE")" || true +``` + +### Problem: Too Many Alerts + +```yaml +# Increase thresholds +- uses: garbee/runner-resource-usage@v1 + with: + memory_alert_threshold: "90" + cpu_alert_threshold: "95" +``` + +### Problem: Missing Resource Spikes + +```yaml +# Decrease interval +- uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "3" +``` + +### Problem: High Overhead + +```yaml +# Increase interval +- uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "10" +``` + +## Version Pinning + +### Floating Major Version (Recommended) + +```yaml +uses: garbee/runner-resource-usage@v1 +``` + +**Pros**: Automatic updates for bug fixes and features +**Cons**: May receive breaking changes within major version + +### Pinned Specific Version + +```yaml +uses: garbee/runner-resource-usage@v1.2.3 +``` + +**Pros**: Guaranteed stability +**Cons**: Must manually update for fixes + +### Pinned to Commit SHA + +```yaml +uses: garbee/runner-resource-usage@abc123def456... 
+``` + +**Pros**: Maximum security and stability +**Cons**: No automatic updates, more maintenance + +## Performance Impact + +| Interval | CPU Overhead | Memory | Disk I/O | Use Case | +|----------|--------------|---------|----------|----------| +| 3s | ~1% | ~1MB | 3KB/s | Debugging | +| 5s (default) | <1% | ~1MB | 2KB/s | General use | +| 10s | <0.5% | ~500KB | 1KB/s | Long workflows | +| 30s | <0.2% | ~300KB | 0.3KB/s | Minimal overhead | + +## Related Documentation + +- [README](../README.md) - Main documentation +- [FAQ](./faq.md) - Frequently asked questions +- [Troubleshooting](./troubleshooting.md) - Common issues +- [Examples](./examples/) - Real-world scenarios +- [Architecture](./architecture.md) - Technical details diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 0000000..eeff4b1 --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,257 @@ +# Troubleshooting Guide + +This guide helps you resolve common issues when using the runner-resource-usage action. + +## Common Issues + +### No Metrics Displayed in Job Summary + +**Symptom**: The action runs successfully, but no metrics appear in the job summary. + +**Possible Causes and Solutions**: + +1. **Collector process terminated prematurely** + - Check if your workflow has resource constraints or timeout issues + - Verify that the collector process has permission to write to the state directory + - Look for error messages in the action logs + +2. **State file not persisted** + - Ensure the `GITHUB_STATE` environment variable is set (GitHub Actions provides this automatically) + - Check that the runner has write permissions to the state directory + - Review action logs for file I/O errors + +3. 
**Post action failed** + - Check the post-action logs for error messages + - Ensure metrics were collected (check main action logs) + - Verify the state file exists and contains valid JSON data + +**Debug Steps**: +```yaml +- name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + +# Add this after your workflow to check state +- name: Debug State File + if: always() + run: | + echo "GITHUB_STATE directory: $GITHUB_STATE" + ls -la "$(dirname "$GITHUB_STATE")" || true +``` + +### High Resource Usage During Collection + +**Symptom**: The action itself consumes significant CPU or memory. + +**Solution**: Increase the collection interval to reduce overhead. + +```yaml +- name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "10" # Collect every 10 seconds instead of 5 +``` + +**Performance Impact**: +- Default (5 seconds): < 1% CPU, ~1MB memory +- 10 seconds: < 0.5% CPU, ~500KB memory +- Trade-off: Longer intervals mean less granular data + +### False Positive Alerts + +**Symptom**: Alerts trigger for normal workflow operations. + +**Solution**: Adjust thresholds based on your workflow's resource profile. + +```yaml +- name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + memory_alert_threshold: "90" # Increase from default 80% + cpu_alert_threshold: "95" # Increase from default 85% + cpu_alert_duration: "120" # Require 2 minutes instead of 60 seconds + disk_alert_threshold: "95" # Increase from default 90% +``` + +**Threshold Selection Guidelines**: +- **Memory**: Set to 90-95% for workflows that legitimately use high memory +- **CPU**: Set to 90-95% for compute-intensive operations (builds, tests) +- **CPU Duration**: Increase to 120-300 seconds for expected sustained CPU usage +- **Disk**: Set to 95% if workflow creates large artifacts or build outputs + +### Missing Alerts for Resource Issues + +**Symptom**: Workflow experiences resource problems, but no alerts are generated. 
+ +**Possible Causes**: + +1. **Thresholds too high** + - Lower thresholds to catch issues earlier + - Review typical resource usage in successful runs first + +2. **Spike occurs between collection intervals** + - Decrease `interval_seconds` for more granular monitoring + - Note: More frequent collection = slightly higher overhead + +3. **Resource issue affects action itself** + - If the system is completely resource-starved, the collector may not run + - Check runner system logs for out-of-memory or disk full errors + +**Solution**: Start with conservative thresholds and adjust based on baseline usage. + +```yaml +- name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "3" # More granular data + memory_alert_threshold: "70" # Catch issues earlier + cpu_alert_threshold: "75" + cpu_alert_duration: "30" # Shorter duration + disk_alert_threshold: "80" +``` + +### Metrics Not Correlating with Workflow Steps + +**Symptom**: Timestamps in metrics don't align with expected workflow step execution. + +**Explanation**: This is expected behavior. The action displays metrics with ISO 8601 timestamps for manual correlation. + +**How to Correlate Metrics**: + +1. **View workflow run timeline**: In GitHub Actions UI, click on a workflow run to see step start/end times +2. **Match timestamps**: Compare metric timestamps with step execution times +3. **Identify patterns**: Look for resource spikes during specific steps + +**Example Correlation**: + +Workflow step timeline: +``` +11:25:30 - Checkout (completed in 5s) +11:25:35 - Build (completed in 30s) +11:26:05 - Test (completed in 20s) +``` + +Metrics table shows: +``` +11:25:32 - CPU: 15% (during Checkout) +11:25:45 - CPU: 85% (during Build - high as expected) +11:26:10 - CPU: 45% (during Test) +``` + +This manual correlation allows you to understand which steps consume the most resources. 
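When a run produces many samples, the same matching can be scripted. The sketch below buckets metric samples into steps by timestamp, as described above; the step times and CPU values are illustrative, not real action output.

```python
# Bucket metric samples into workflow steps by timestamp.
# Step windows come from the workflow run timeline; samples come
# from the metrics table. All values here are made-up illustrations.
from datetime import datetime

steps = [  # (name, start, end)
    ("Checkout", "11:25:30", "11:25:35"),
    ("Build",    "11:25:35", "11:26:05"),
    ("Test",     "11:26:05", "11:26:25"),
]
samples = [("11:25:32", 15), ("11:25:45", 85), ("11:26:10", 45)]

def parse(t: str) -> datetime:
    return datetime.strptime(t, "%H:%M:%S")

def step_for(ts: str) -> str:
    """Return the step whose [start, end) window contains the sample."""
    for name, start, end in steps:
        if parse(start) <= parse(ts) < parse(end):
            return name
    return "unknown"

for ts, cpu in samples:
    print(f"{ts}  CPU {cpu:>3}%  ->  {step_for(ts)}")
```

Pasting real step windows and sample rows into this script turns the manual comparison into a one-shot report.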
+ +### Action Fails on Self-Hosted Runners + +**Symptom**: Action works on GitHub-hosted runners but fails on self-hosted runners. + +**Possible Causes**: + +1. **Node.js version mismatch** + - This action requires Node.js 24+ + - Check: `node --version` on your self-hosted runner + - Solution: Upgrade Node.js to version 24 or later + +2. **Missing system utilities** + - The `systeminformation` library requires certain OS utilities + - Linux: Ensure `/proc` filesystem is accessible + - macOS: Ensure standard system commands are available + - Windows: Ensure PowerShell and system commands are available + +3. **Permission issues** + - Runner must have permission to: + - Read system metrics + - Write to state directory + - Fork processes + - Solution: Review runner service permissions + +**Debug Steps**: +```yaml +- name: Check Node Version + run: node --version + +- name: Check System Access + run: | + # Linux/macOS + ls -la /proc 2>/dev/null || echo "No /proc access" + + # Test systeminformation + node -e "import('systeminformation').then(si => si.currentLoad()).then(console.log)" +``` + +### Workflow Becomes Slower After Adding Action + +**Symptom**: Workflow execution time increases noticeably after adding metrics collection. + +**Analysis**: The action's overhead is typically < 1% CPU and minimal memory. If you observe significant slowdown: + +**Possible Causes**: + +1. **Very short collection interval** + - Solution: Use default 5 seconds or increase to 10 seconds + - Avoid intervals < 3 seconds unless necessary + +2. **Resource-constrained runner** + - If runner is already at capacity, any additional process can impact performance + - Solution: Consider larger runner or optimize existing workflow steps + +3. 
**Disk I/O contention** + - On slow disks, frequent state writes might impact performance + - Solution: Increase collection interval + +**Measure Impact**: +```yaml +# Run workflow without action +- name: Benchmark Step + run: time npm test + +# Then run with action and compare +- name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + +- name: Benchmark Step + run: time npm test +``` + +### Windows-Specific Issues + +**Symptom**: Action fails or behaves unexpectedly on Windows runners. + +**Common Issues**: + +1. **Path separators** + - The action handles Windows paths automatically + - No action needed from users + +2. **Disk metrics showing drive C:** + - Expected behavior: Windows uses `C:` as root mount point + - Linux uses `/`, macOS uses `/System/Volumes/Data` + +3. **PowerShell vs CMD** + - Action works with both shells + - Use `shell: bash` for cross-platform scripts + +**Example Windows Configuration**: +```yaml +- name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + # Works on windows-latest, windows-2022, windows-2019 +``` + +## Getting Help + +If you encounter issues not covered here: + +1. **Check Action Logs**: Review both main and post-action logs for error messages +2. **Search Issues**: Check [GitHub Issues](https://github.com/Garbee/runner-resource-usage/issues) for similar problems +3. 
**Create Issue**: Open a new issue with: + - Your workflow configuration + - Runner OS and version + - Complete error messages + - Steps to reproduce + - Expected vs actual behavior + +## Additional Resources + +- [README](../README.md) - Main documentation and usage examples +- [Architecture](./architecture.md) - Technical implementation details +- [Examples](./examples/) - Real-world usage scenarios +- [GitHub Actions Documentation](https://docs.github.com/en/actions) From c645f501860fcd4f320bf9c9b4b5845efe558b01 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 16 Feb 2026 12:45:10 +0000 Subject: [PATCH 3/5] docs: add memory-intensive and multi-job workflow examples Co-authored-by: Garbee <868301+Garbee@users.noreply.github.com> --- docs/examples/README.md | 2 - docs/examples/memory-intensive.md | 468 ++++++++++++++++++++++++ docs/examples/multi-job.md | 589 ++++++++++++++++++++++++++++++ 3 files changed, 1057 insertions(+), 2 deletions(-) create mode 100644 docs/examples/memory-intensive.md create mode 100644 docs/examples/multi-job.md diff --git a/docs/examples/README.md b/docs/examples/README.md index ae37c32..0c3dcf9 100644 --- a/docs/examples/README.md +++ b/docs/examples/README.md @@ -8,9 +8,7 @@ This directory contains real-world examples of using the runner-resource-usage a - [**Basic Monitoring**](./basic-monitoring.md) - Simple setup for general workflow monitoring - [**Build Optimization**](./build-optimization.md) - Identifying bottlenecks in build processes -- [**Test Suite Performance**](./test-performance.md) - Monitoring test execution resource usage - [**Debug Mode Only**](./debug-mode.md) - Conditional metrics collection on demand -- [**CI/CD Pipeline**](./cicd-pipeline.md) - Comprehensive pipeline with deployment stages - [**Memory-Intensive Workflows**](./memory-intensive.md) - Handling data processing and machine learning - [**Multi-Job Workflows**](./multi-job.md) - Collecting 
metrics across multiple jobs diff --git a/docs/examples/memory-intensive.md b/docs/examples/memory-intensive.md new file mode 100644 index 0000000..e572c05 --- /dev/null +++ b/docs/examples/memory-intensive.md @@ -0,0 +1,468 @@ +# Memory-Intensive Workflows Example + +This example shows how to monitor and optimize workflows with high memory usage, such as data processing, machine learning, or large-scale builds. + +## Use Case + +You're running a Python data science workflow that processes large datasets and trains machine learning models. Memory usage is unpredictable and sometimes causes out-of-memory errors. + +## Workflow Configuration + +```yaml +name: ML Pipeline + +on: + push: + branches: + - main + pull_request: + branches: + - main + +permissions: + contents: read + +jobs: + train: + name: Train Model + runs-on: ubuntu-24.04-16gb # Larger runner with 16GB RAM + timeout-minutes: 60 + steps: + # Configure for memory-intensive workflow + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "5" + memory_alert_threshold: "90" # High threshold for legitimate usage + cpu_alert_threshold: "85" + cpu_alert_duration: "300" # ML training uses CPU for extended periods + disk_alert_threshold: "85" # Large datasets and models + + - name: Checkout + uses: actions/checkout@v4 + + - name: Setup Python + uses: actions/setup-python@v5 + with: + python-version: '3.11' + cache: 'pip' + + - name: Install Dependencies + run: | + pip install --upgrade pip + pip install -r requirements.txt + + - name: Download Dataset + run: | + python scripts/download_data.py --size large + + - name: Preprocess Data + run: | + python scripts/preprocess.py --chunk-size 10000 + + - name: Train Model + run: | + python scripts/train.py --epochs 100 --batch-size 64 + + - name: Evaluate Model + run: | + python scripts/evaluate.py + + - name: Cleanup Datasets + if: always() + run: | + rm -rf data/raw data/processed +``` + +## Expected Memory Pattern + +### 
Typical Memory Usage Timeline
+
+```
+00:00 - Checkout: 500 MB (baseline)
+00:05 - Setup Python: 800 MB (Python runtime)
+00:10 - Install Deps: 1.2 GB (packages loaded)
+00:15 - Download Dataset: 1.5 GB (network buffer)
+00:25 - Preprocess Data: 8.5 GB (dataset in memory)
+01:00 - Train Model: 12.0 GB (model + dataset + gradients)
+01:45 - Evaluate Model: 6.0 GB (model + test data)
+02:00 - Cleanup: 1.0 GB (returning to baseline)
+```
+
+## Common Memory Issues and Solutions
+
+### Issue 1: Out-of-Memory During Training
+
+**Symptoms in Metrics**:
+- Memory steadily increases during training
+- Reaches 100% or close to it
+- Workflow fails with OOM error
+
+**Example Metrics**:
+```
+10:30:00 - 4.5 GB
+10:30:30 - 8.2 GB
+10:31:00 - 11.8 GB
+10:31:30 - 14.9 GB ← Approaching limit
+10:32:00 - [WORKFLOW FAILED]
+```
+
+**Solutions**:
+
+1. **Reduce Batch Size**
+```bash
+# Before: batch_size=128
+python scripts/train.py --batch-size 64  # or 32
+
+# Or use dynamic batch sizing
+python scripts/train.py --auto-batch-size
+```
+
+2. **Use Gradient Accumulation**
+```bash
+# Simulates larger batch size without memory spike
+python scripts/train.py --batch-size 32 --gradient-accumulation-steps 4
+# Effective batch size: 32 Γ— 4 = 128
+```
+
+3. **Process Data in Chunks**
+```python
+import pandas as pd
+
+# Instead of loading entire dataset
+for chunk in pd.read_csv('data.csv', chunksize=10000):
+    process(chunk)
+```
+
+4.
**Use Memory-Mapped Files** +```python +# Load data without fully loading into RAM +import numpy as np +data = np.memmap('data.npy', dtype='float32', mode='r') +``` + +### Issue 2: Memory Leak in Preprocessing + +**Symptoms in Metrics**: +- Memory increases linearly during preprocessing +- Never decreases between chunks +- Growth rate consistent + +**Example Metrics**: +``` +10:00:00 - 2.0 GB (start preprocessing) +10:05:00 - 4.5 GB (chunk 1) +10:10:00 - 7.0 GB (chunk 2) +10:15:00 - 9.5 GB (chunk 3) +10:20:00 - 12.0 GB (chunk 4) ← Should plateau, but keeps growing +``` + +**Solutions**: + +1. **Explicitly Free Memory** +```python +import gc + +for chunk in data_chunks: + processed = process_chunk(chunk) + save_chunk(processed) + + # Explicitly free memory + del chunk, processed + gc.collect() +``` + +2. **Use Context Managers** +```python +def process_in_context(): + with open('data.csv') as f: + for chunk in pd.read_csv(f, chunksize=1000): + yield process(chunk) + # Memory automatically freed when exiting context +``` + +3. **Monitor Python Memory** +```python +import tracemalloc + +tracemalloc.start() +# ... your code ... +snapshot = tracemalloc.take_snapshot() +top_stats = snapshot.statistics('lineno') +for stat in top_stats[:10]: + print(stat) +``` + +### Issue 3: Dataset Too Large for Runner + +**Symptoms in Metrics**: +- Memory maxes out during data loading +- Disk usage also high (swap usage) +- System becomes unresponsive + +**Example Metrics**: +``` +10:15:00 - Memory: 14.5 GB (96% of 15GB) +10:15:05 - Disk: 65 GB (45%) ← Swap file growing +10:15:10 - Memory: 14.9 GB (99%) +10:15:15 - [TIMEOUT or FAILURE] +``` + +**Solutions**: + +1. **Use Larger Runner** +```yaml +jobs: + train: + runs-on: ubuntu-24.04-32gb # Upgrade to 32GB RAM +``` + +2. 
**Stream Data Instead of Loading** +```python +# Before: Load entire dataset +data = pd.read_csv('large_file.csv') # 20GB file β†’ OOM + +# After: Stream processing +for chunk in pd.read_csv('large_file.csv', chunksize=5000): + results = model.predict(chunk) + save_results(results) +``` + +3. **Use Data Sampling for CI** +```yaml +- name: Download Dataset + run: | + if [ "$GITHUB_EVENT_NAME" == "pull_request" ]; then + # Use smaller sample for PRs + python scripts/download_data.py --size small + else + # Full dataset for main branch + python scripts/download_data.py --size large + fi +``` + +## Best Practices for Memory-Intensive Workflows + +### 1. Establish Memory Budget + +Know your limits before you hit them: + +```yaml +# Document expected memory usage +- name: Train Model + run: | + # Expected memory: ~12GB peak + # Allocation: Model (4GB) + Data (6GB) + Gradients (2GB) + python scripts/train.py +``` + +Use metrics to validate your assumptions. + +### 2. Implement Cleanup Steps + +Always clean up after memory-intensive steps: + +```yaml +- name: Preprocess Data + run: python scripts/preprocess.py + +- name: Cleanup Preprocessed Data + if: always() + run: | + rm -rf data/processed/*.tmp + rm -rf data/cache/ + +- name: Train Model + run: python scripts/train.py +``` + +### 3. Use Memory-Efficient Libraries + +Choose libraries designed for large data: + +**Data Processing**: +- **Polars** instead of Pandas (more memory-efficient) +- **Dask** for distributed computing +- **Vaex** for out-of-core dataframes + +**Machine Learning**: +- **TensorFlow/PyTorch with mixed precision** (FP16 uses less memory) +- **ONNX Runtime** for inference (smaller than training) +- **Model quantization** (reduce model size) + +### 4. 
Monitor Peak Memory Usage + +Add logging to track memory: + +```python +import psutil +import os + +def log_memory(): + process = psutil.Process(os.getpid()) + mem_gb = process.memory_info().rss / 1024 / 1024 / 1024 + print(f"Current memory: {mem_gb:.2f} GB") + +# Log at key points +log_memory() # After data loading +train_model() +log_memory() # After training +``` + +Compare with metrics from the action to understand total system memory. + +### 5. Use Appropriate Alert Thresholds + +For memory-intensive workflows, adjust thresholds to avoid false alerts: + +```yaml +# Conservative - catch issues early +memory_alert_threshold: "85" + +# Balanced - for expected high memory usage +memory_alert_threshold: "90" + +# Permissive - only alert at critical levels +memory_alert_threshold: "95" +``` + +## Advanced Configuration + +### Matrix Strategy for Different Dataset Sizes + +```yaml +jobs: + train: + strategy: + matrix: + dataset: [small, medium, large] + include: + - dataset: small + runner: ubuntu-24.04 + memory_threshold: "80" + - dataset: medium + runner: ubuntu-24.04-16gb + memory_threshold: "85" + - dataset: large + runner: ubuntu-24.04-32gb + memory_threshold: "90" + + runs-on: ${{ matrix.runner }} + + steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + memory_alert_threshold: ${{ matrix.memory_threshold }} + + - name: Train + run: python train.py --dataset ${{ matrix.dataset }} +``` + +### Conditional Cleanup Based on Memory Usage + +While you can't directly read metrics during workflow, you can add defensive cleanup: + +```yaml +- name: Aggressive Cleanup Before Memory-Intensive Step + run: | + # Remove unnecessary files + rm -rf ~/.cache/pip + docker system prune -af || true + + # Clear system caches (if sudo available) + sync && echo 3 | sudo tee /proc/sys/vm/drop_caches || true + +- name: Memory-Intensive Step + run: python scripts/process_large_data.py +``` + +## Real-World Example: Image Processing Pipeline + 
+```yaml +name: Image Processing + +jobs: + process: + runs-on: ubuntu-24.04-16gb + steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "5" + memory_alert_threshold: "90" + + - name: Checkout + uses: actions/checkout@v4 + + - name: Setup Python + uses: actions/setup-python@v5 + with: + python-version: '3.11' + + - name: Process Images in Batches + run: | + # Process 100 images at a time to limit memory + python scripts/process_images.py \ + --input data/images/ \ + --output data/processed/ \ + --batch-size 100 \ + --workers 4 + + - name: Generate Thumbnails + run: | + # Parallel processing with memory limit per worker + python scripts/thumbnails.py \ + --workers 4 \ + --memory-per-worker 2G + + - name: Cleanup + if: always() + run: | + rm -rf data/processed/ +``` + +Expected metrics show controlled memory usage: +- Never exceeds 14GB (well below 16GB limit) +- Memory returns to baseline between batches +- No alert triggers + +## Troubleshooting Memory Issues + +### Step 1: Identify the Problem Step + +Check metrics to see when memory spikes occur: +1. Look for timestamps when memory increases sharply +2. Correlate with workflow step times +3. Focus optimization on that step + +### Step 2: Reproduce Locally + +Run the problematic step locally with monitoring: + +```bash +# Monitor memory usage +python -m memory_profiler scripts/problematic_step.py + +# Or use py-spy for live monitoring +py-spy top --pid +``` + +### Step 3: Implement Fix and Verify + +Make changes, then verify with metrics: +1. Update code to reduce memory usage +2. Run workflow with metrics collection +3. Compare before/after metrics +4. 
Ensure memory stays below threshold + +## Related Examples + +- [Basic Monitoring](./basic-monitoring.md) - Understanding metrics fundamentals +- [Build Optimization](./build-optimization.md) - General optimization techniques +- [Multi-Job Workflows](./multi-job.md) - Comparing metrics across jobs + +## Further Reading + +- [Python Memory Management](https://docs.python.org/3/c-api/memory.html) +- [PyTorch Memory Management](https://pytorch.org/docs/stable/notes/cuda.html#memory-management) +- [Dask for Large Datasets](https://docs.dask.org/) +- [GitHub Actions: Larger Runners](https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners) diff --git a/docs/examples/multi-job.md b/docs/examples/multi-job.md new file mode 100644 index 0000000..8fb5917 --- /dev/null +++ b/docs/examples/multi-job.md @@ -0,0 +1,589 @@ +# Multi-Job Workflows Example + +This example demonstrates how to collect and compare metrics across multiple jobs in a workflow. + +## Use Case + +You have a workflow with multiple jobs (lint, build, test, deploy) running in parallel or sequence. You want to understand resource consumption across all jobs to identify bottlenecks and optimize the overall pipeline. 
+ +## Workflow Configuration + +```yaml +name: CI/CD Pipeline + +on: + push: + branches: + - main + pull_request: + branches: + - main + +permissions: + contents: read + +jobs: + lint: + name: Lint Code + runs-on: ubuntu-24.04 + timeout-minutes: 10 + steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "5" + cpu_alert_threshold: "90" + + - name: Checkout + uses: actions/checkout@v4 + + - name: Setup Node.js + uses: actions/setup-node@v4 + with: + node-version: '20' + cache: 'npm' + + - name: Install Dependencies + run: npm ci + + - name: Run ESLint + run: npm run lint + + - name: Run Prettier + run: npm run format:check + + test: + name: Test (${{ matrix.os }}) + runs-on: ${{ matrix.os }} + timeout-minutes: 20 + strategy: + fail-fast: false + matrix: + os: [ubuntu-24.04, windows-2025, macos-24] + steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "5" + memory_alert_threshold: "85" + cpu_alert_threshold: "90" + + - name: Checkout + uses: actions/checkout@v4 + + - name: Setup Node.js + uses: actions/setup-node@v4 + with: + node-version: '20' + cache: 'npm' + + - name: Install Dependencies + run: npm ci + + - name: Run Unit Tests + run: npm run test:unit + + - name: Run Integration Tests + run: npm run test:integration + + build: + name: Build + needs: [lint, test] + runs-on: ubuntu-24.04 + timeout-minutes: 15 + steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "5" + cpu_alert_threshold: "90" + cpu_alert_duration: "120" # Build can sustain high CPU + + - name: Checkout + uses: actions/checkout@v4 + + - name: Setup Node.js + uses: actions/setup-node@v4 + with: + node-version: '20' + cache: 'npm' + + - name: Install Dependencies + run: npm ci + + - name: Build Application + run: npm run build + + - name: Upload Build Artifacts + uses: actions/upload-artifact@v4 + with: + name: dist + path: dist/ + 
retention-days: 1 + + deploy: + name: Deploy + needs: build + if: github.ref == 'refs/heads/main' + runs-on: ubuntu-24.04 + timeout-minutes: 10 + permissions: + contents: read + id-token: write # For OIDC authentication + steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + + - name: Download Build Artifacts + uses: actions/download-artifact@v4 + with: + name: dist + + - name: Deploy to Production + run: | + # Deployment logic here + echo "Deploying to production..." +``` + +## Analyzing Multi-Job Metrics + +### Job Summary Structure + +After the workflow completes, each job will have its own metrics in its job summary: + +``` +Workflow Run: CI/CD Pipeline #123 +├─ Job: Lint Code +│ └─ Summary: CPU, Memory, Disk tables +├─ Job: Test (ubuntu-24.04) +│ └─ Summary: CPU, Memory, Disk tables +├─ Job: Test (windows-2025) +│ └─ Summary: CPU, Memory, Disk tables +├─ Job: Test (macos-24) +│ └─ Summary: CPU, Memory, Disk tables +├─ Job: Build +│ └─ Summary: CPU, Memory, Disk tables +└─ Job: Deploy + └─ Summary: CPU, Memory, Disk tables +``` + +### Comparing Metrics Across Jobs + +#### Example Comparison: Test Job Across Platforms + +**Ubuntu-24.04**: +``` +Duration: 5 minutes +Peak CPU: 85% +Peak Memory: 3.2 GB +Peak Disk: 52 GB +``` + +**Windows-2025**: +``` +Duration: 7 minutes +Peak CPU: 78% +Peak Memory: 4.1 GB +Peak Disk: 58 GB +``` + +**macOS-24**: +``` +Duration: 6 minutes +Peak CPU: 70% +Peak Memory: 3.8 GB +Peak Disk: 55 GB +``` + +**Insights**: +- Windows tests run 40% slower +- Windows uses more memory (OS overhead) +- All platforms have similar CPU patterns +- **Action**: Investigate Windows-specific performance issues + +## Common Patterns and Optimizations + +### Pattern 1: Fast Parallel Jobs + +**Lint Job** (Low resource usage): +``` +Duration: 2 minutes +CPU: 30-50% (lightweight checks) +Memory: 1.5 GB (minimal) +Disk: 45 GB (code + dependencies) +``` + +**Optimization**: None needed - already
efficient + +### Pattern 2: Resource-Intensive Parallel Jobs + +**Test Job** (High resource usage): +``` +Duration: 8 minutes +CPU: 85-95% (test execution) +Memory: 4-6 GB (test fixtures + app) +Disk: 52 GB (test data + coverage) +``` + +**Optimization**: Consider parallel test execution + +```yaml +- name: Run Tests in Parallel + run: npm run test -- --parallel --max-workers=4 +``` + +### Pattern 3: Sequential Bottleneck + +If build job waits for slow test jobs: + +**Problem**: +``` +Lint: 2 min ✓ +Test: 15 min ← Bottleneck +Build: 5 min (waits for Test) +Total: 22 minutes +``` + +**Solution 1: Parallelize Tests** +```yaml +test: + strategy: + matrix: + shard: [1, 2, 3, 4] + steps: + - run: npm run test -- --shard=${{ matrix.shard }}/4 +``` + +**Result**: +``` +Lint: 2 min ✓ +Test: 4 min (4 shards × 4 min each, parallel) +Build: 5 min +Total: 11 minutes (50% faster) +``` + +**Solution 2: Remove Build Dependency on Test** +```yaml +build: + needs: [lint] # Only wait for lint, not tests +``` + +**Result**: +``` +Lint: 2 min ✓ +Test: 15 min (running in parallel with build) +Build: 5 min (starts after lint) +Deploy: 2 min +Total: 15 minutes (vs 22 minutes) +``` + +## Platform-Specific Insights + +### Comparing Cross-Platform Resource Usage + +Use matrix jobs to understand platform differences: + +```yaml +jobs: + analyze-platforms: + strategy: + matrix: + os: [ubuntu-24.04, windows-2025, macos-24] + runs-on: ${{ matrix.os }} + steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + + - name: Identical Workload + run: | + # Run the same task on all platforms + npm ci + npm run build + npm test +``` + +#### Example Platform Comparison Results + +**Build Performance**: +| Platform | Duration | Peak CPU | Peak Memory | +|----------|----------|----------|-------------| +| Ubuntu | 3m 45s | 92% | 2.8 GB | +| Windows | 5m 12s | 85% | 3.6 GB | +| macOS | 4m 20s | 78% | 3.2 GB | + +**Insights**: +- Ubuntu is fastest (best for CI) +- Windows
has higher memory overhead +- macOS has lower CPU utilization (fewer cores?) + +**Decision**: Use ubuntu-24.04 for CI, test on other platforms only occasionally + +### Resource-Aware Runner Selection + +Choose runners based on job requirements: + +```yaml +jobs: + # Lightweight job - use standard runner + lint: + runs-on: ubuntu-24.04 # 2 cores, 7 GB RAM + + # CPU-intensive - use larger runner + build: + runs-on: ubuntu-24.04-4-core # 4 cores, 16 GB RAM + + # Memory-intensive - use memory-optimized runner + test-e2e: + runs-on: ubuntu-24.04-16gb # 2 cores, 16 GB RAM +``` + +Monitor with metrics to validate choices. + +## Advanced: Dynamic Job Adjustment + +### Conditional Metrics Collection + +Collect metrics only where useful: + +```yaml +jobs: + # Always collect for resource-intensive jobs + build: + steps: + - name: Start Workflow Telemetry + uses: garbee/runner-resource-usage@v1 + + # Conditionally collect for debugging + lint: + steps: + - name: Start Workflow Telemetry + if: ${{ runner.debug == '1' }} + uses: garbee/runner-resource-usage@v1 +``` + +### Job-Specific Thresholds + +Adjust thresholds based on job characteristics: + +```yaml +jobs: + lint: + steps: + - uses: garbee/runner-resource-usage@v1 + with: + cpu_alert_threshold: "70" # Linting shouldn't max CPU + memory_alert_threshold: "80" + + build: + steps: + - uses: garbee/runner-resource-usage@v1 + with: + cpu_alert_threshold: "95" # Build can max CPU + cpu_alert_duration: "180" + memory_alert_threshold: "85" + + test: + steps: + - uses: garbee/runner-resource-usage@v1 + with: + cpu_alert_threshold: "90" + memory_alert_threshold: "90" # Tests may use lots of memory +``` + +## Real-World Optimization Story + +### Before Optimization + +```yaml +jobs: + test: + runs-on: ubuntu-24.04 + # Single job runs all tests + steps: + - uses: garbee/runner-resource-usage@v1 + - run: npm test # 20 minutes + + build: + needs: test + runs-on: ubuntu-24.04 + steps: + - uses: garbee/runner-resource-usage@v1 + - run: 
npm run build # 5 minutes + +# Total time: 25 minutes +``` + +**Metrics showed**: +- Test CPU at 55% (single-threaded) +- Build waiting 20 minutes doing nothing +- Memory and disk had plenty of capacity + +### After Optimization + +```yaml +jobs: + test: + runs-on: ubuntu-24.04-4-core + strategy: + matrix: + shard: [1, 2, 3, 4] + steps: + - uses: garbee/runner-resource-usage@v1 + - run: npm test -- --shard=${{ matrix.shard }}/4 + # 6 minutes per shard + + build: + runs-on: ubuntu-24.04 + # No longer waits for test + steps: + - uses: garbee/runner-resource-usage@v1 + - run: npm run build # 5 minutes + +# Total time: 6 minutes (test shards run in parallel) +``` + +**Metrics showed**: +- Test CPU now at 85% (better utilization) +- Build completes independently +- Total workflow 76% faster + +**Result**: Saved 19 minutes per workflow run + +## Troubleshooting Multi-Job Issues + +### Problem: Job A Faster Than Job B + +**Investigate**: +1. Compare metrics between jobs +2. Look for resource bottlenecks +3. Check for different workloads + +**Example**: +- Job A: CPU 90%, finishes in 3 min +- Job B: CPU 40%, finishes in 8 min + +**Conclusion**: Job B is I/O or network bound, not CPU bound + +**Solutions**: +- Add caching to reduce I/O +- Parallelize I/O operations +- Use faster storage runners + +### Problem: Matrix Jobs Have Inconsistent Duration + +**Example Matrix Execution**: +``` +Job (ubuntu-24.04): 5 min +Job (windows-2025): 12 min ← Much slower +Job (macos-24): 6 min +``` + +**Investigation Steps**: +1. Compare metrics across matrix jobs +2. Identify where time is spent differently +3. Look for platform-specific issues + +**Common Causes**: +- Windows antivirus scanning (CPU spikes) +- Platform-specific dependency installation (network/disk) +- Different test behavior on platforms + +### Problem: Jobs Fail After Adding Metrics + +**Symptom**: Jobs that previously passed now fail with resource exhaustion. 
+ +**Cause**: Jobs were already close to their limits; the metrics collection pushed them over the edge. + +**Solution**: Not a real problem with the action; the workflow was already fragile. Address the underlying resource issue: + +```yaml +# Before (marginal) +runs-on: ubuntu-24.04 # 7 GB RAM, using 6.8 GB + +# After (headroom) +runs-on: ubuntu-24.04-16gb # 16 GB RAM, comfortable margin +``` + +## Best Practices + +### 1. Always Collect on Resource-Intensive Jobs + +```yaml +jobs: + lint: + # Optional - lightweight job + steps: + - uses: garbee/runner-resource-usage@v1 + if: ${{ runner.debug == '1' }} + + build: + # Always - resource-intensive + steps: + - uses: garbee/runner-resource-usage@v1 + + test: + # Always - resource-intensive + steps: + - uses: garbee/runner-resource-usage@v1 +``` + +### 2. Use Consistent Configuration for Comparison + +```yaml +# Define as a YAML anchor for consistency. Note: verify that your +# workflow tooling accepts anchors; GitHub's workflow parser has not +# historically supported them. +.metrics: &metrics + uses: garbee/runner-resource-usage@v1 + with: + interval_seconds: "5" + memory_alert_threshold: "85" + cpu_alert_threshold: "90" + +jobs: + test: + steps: + - <<: *metrics + build: + steps: + - <<: *metrics +``` + +Or use a reusable workflow: + +```yaml +# .github/workflows/with-metrics.yml +name: Reusable with Metrics + +on: + workflow_call: + +jobs: + run: + runs-on: ubuntu-24.04 + steps: + - uses: garbee/runner-resource-usage@v1 + # ... other steps +``` + +### 3.
Document Expected Resource Usage + +Add comments to help the team understand normal patterns: + +```yaml +jobs: + build: + # Expected resources: CPU 85%, Memory 4GB, Duration 5min + steps: + - uses: garbee/runner-resource-usage@v1 + - run: npm run build +``` + +## Related Examples + +- [Basic Monitoring](./basic-monitoring.md) - Understanding single-job metrics +- [Build Optimization](./build-optimization.md) - Optimizing specific jobs + +## Further Reading + +- [GitHub Actions: Matrix Strategy](https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs) +- [Reusable Workflows](https://docs.github.com/en/actions/using-workflows/reusing-workflows) +- [Job Dependencies](https://docs.github.com/en/actions/using-jobs/using-jobs-in-a-workflow#defining-prerequisite-jobs) From 52708ca2bc8befa42848cd74afd612823e4f2f7f Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 16 Feb 2026 12:46:11 +0000 Subject: [PATCH 4/5] docs: add comprehensive documentation index and final polish Co-authored-by: Garbee <868301+Garbee@users.noreply.github.com> --- README.md | 2 + docs/README.md | 204 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 206 insertions(+) create mode 100644 docs/README.md diff --git a/README.md b/README.md index bd21ea1..c3b1108 100644 --- a/README.md +++ b/README.md @@ -202,6 +202,8 @@ The `runner.debug` context is documented in the [GitHub Actions contexts referen - [Basic Monitoring](docs/examples/basic-monitoring.md) - Simple setup for general workflows - [Debug Mode Only](docs/examples/debug-mode.md) - Conditional metrics collection on demand - [Build Optimization](docs/examples/build-optimization.md) - Identifying and resolving bottlenecks + - [Memory-Intensive Workflows](docs/examples/memory-intensive.md) - Handling data processing and ML + - [Multi-Job Workflows](docs/examples/multi-job.md) - Comparing
metrics across jobs - **[Architecture](docs/architecture.md)** - Technical implementation details ### Getting Help diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..7370312 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,204 @@ +# Documentation Index + +Complete guide to the runner-resource-usage action documentation. + +## Quick Start + +New to this action? Start here: + +1. **[README](../README.md)** - Overview, features, and basic usage +2. **[Quick Reference](./quick-reference.md)** - Common configurations at a glance +3. **[Basic Monitoring Example](./examples/basic-monitoring.md)** - Your first workflow with metrics + +## User Guides + +### Getting Started + +- **[README](../README.md)** - Main documentation with installation and usage +- **[Quick Reference](./quick-reference.md)** - Fast lookup for configurations and scenarios +- **[Examples Index](./examples/README.md)** - Overview of all available examples + +### Understanding Metrics + +- **[FAQ](./faq.md)** - Frequently asked questions + - What does this action do? + - How do I read the metrics? + - How do I correlate metrics with workflow steps? + - What are the alert thresholds? + - How much overhead does this add? 
+ +### Solving Problems + +- **[Troubleshooting Guide](./troubleshooting.md)** - Common issues and solutions + - No metrics displayed + - High resource usage + - False positive alerts + - Platform-specific issues + - Performance problems + +## Examples by Use Case + +### By Experience Level + +**Beginner**: +- [Basic Monitoring](./examples/basic-monitoring.md) - Simple setup and interpretation + +**Intermediate**: +- [Debug Mode Only](./examples/debug-mode.md) - Conditional metrics collection +- [Build Optimization](./examples/build-optimization.md) - Using metrics to improve performance + +**Advanced**: +- [Memory-Intensive Workflows](./examples/memory-intensive.md) - Data processing and ML +- [Multi-Job Workflows](./examples/multi-job.md) - Cross-job analysis + +### By Workflow Type + +**General CI/CD**: +- [Basic Monitoring](./examples/basic-monitoring.md) - Standard build/test workflows +- [Multi-Job Workflows](./examples/multi-job.md) - Multiple jobs in one workflow + +**Performance Optimization**: +- [Build Optimization](./examples/build-optimization.md) - Identifying bottlenecks +- [Memory-Intensive Workflows](./examples/memory-intensive.md) - High memory usage + +**Debugging & Investigation**: +- [Debug Mode Only](./examples/debug-mode.md) - On-demand metrics collection +- [Build Optimization](./examples/build-optimization.md) - Performance analysis + +### By Technology + +**Node.js / JavaScript**: +- [Basic Monitoring](./examples/basic-monitoring.md) +- [Build Optimization](./examples/build-optimization.md) + +**Python / Data Science**: +- [Memory-Intensive Workflows](./examples/memory-intensive.md) + +**Docker**: +- [Build Optimization](./examples/build-optimization.md) + +**Cross-Platform**: +- [Multi-Job Workflows](./examples/multi-job.md) + +## Technical Documentation + +### Architecture & Design + +- **[Architecture](./architecture.md)** - Technical implementation details + - Execution flow + - Data collection mechanism + - Storage architecture + - 
Performance considerations + - Security model + +### Configuration Reference + +- **[Quick Reference](./quick-reference.md)** - All configuration options + - Input parameters + - Default values + - Threshold recommendations + - Platform-specific notes + +## Documentation by Topic + +### Resource Monitoring + +| Topic | Documents | +|-------|-----------| +| Basic setup | [README](../README.md), [Basic Monitoring](./examples/basic-monitoring.md) | +| CPU metrics | [FAQ](./faq.md), [Build Optimization](./examples/build-optimization.md) | +| Memory metrics | [FAQ](./faq.md), [Memory-Intensive](./examples/memory-intensive.md) | +| Disk metrics | [FAQ](./faq.md), [Build Optimization](./examples/build-optimization.md) | + +### Configuration & Tuning + +| Topic | Documents | +|-------|-----------| +| Alert thresholds | [Quick Reference](./quick-reference.md), [FAQ](./faq.md) | +| Collection interval | [Quick Reference](./quick-reference.md), [FAQ](./faq.md) | +| Conditional execution | [Debug Mode](./examples/debug-mode.md) | +| Platform-specific | [Quick Reference](./quick-reference.md), [Multi-Job](./examples/multi-job.md) | + +### Problem Solving + +| Issue Type | Documents | +|------------|-----------| +| No metrics showing | [Troubleshooting](./troubleshooting.md) | +| Performance problems | [Troubleshooting](./troubleshooting.md), [Build Optimization](./examples/build-optimization.md) | +| Memory issues | [Troubleshooting](./troubleshooting.md), [Memory-Intensive](./examples/memory-intensive.md) | +| False alerts | [Troubleshooting](./troubleshooting.md), [FAQ](./faq.md) | + +## Learning Paths + +### Path 1: Quick Start + +For users who want to get started immediately: + +1. Read [README](../README.md) → Understand what the action does +2. Follow [Basic Monitoring](./examples/basic-monitoring.md) → Add to your workflow +3.
Check [Quick Reference](./quick-reference.md) → Adjust settings if needed + +**Time required**: 15 minutes + +### Path 2: Performance Optimization + +For users investigating slow workflows: + +1. Add action using [Basic Monitoring](./examples/basic-monitoring.md) +2. Collect baseline metrics (run workflow 2-3 times) +3. Follow [Build Optimization](./examples/build-optimization.md) +4. Check [Troubleshooting](./troubleshooting.md) for specific issues + +**Time required**: 1-2 hours + +### Path 3: Advanced Usage + +For users with complex workflows: + +1. Start with [Basic Monitoring](./examples/basic-monitoring.md) +2. Read [Architecture](./architecture.md) → Understand how it works +3. Review [Multi-Job Workflows](./examples/multi-job.md) +4. Check [Memory-Intensive](./examples/memory-intensive.md) if applicable +5. Set up [Debug Mode](./examples/debug-mode.md) for production + +**Time required**: 2-3 hours + +### Path 4: Troubleshooting + +For users experiencing issues: + +1. Check [Troubleshooting Guide](./troubleshooting.md) → Find your issue +2. Review [FAQ](./faq.md) → Common questions +3. Check relevant example for your use case +4. Read [Architecture](./architecture.md) if needed for deeper understanding + +**Time required**: 30 minutes - 1 hour + +## Contributing Documentation + +Have suggestions for improving documentation? + +1. Open an issue describing the improvement +2. Submit a PR with documentation changes +3. Follow the style and format of existing docs + +Good documentation contributions: +- Fix errors or unclear explanations +- Add missing examples or use cases +- Update outdated information +- Improve navigation or organization + +## Getting Help + +Can't find what you need? + +1. **Search existing documentation**: Use browser search (Ctrl/Cmd+F) in this index +2. **Check the FAQ**: [FAQ.md](./faq.md) covers many common questions +3. **Search issues**: [GitHub Issues](https://github.com/Garbee/runner-resource-usage/issues) +4.
**Ask a question**: Open a new issue with the "question" label + +--- + +**Last Updated**: 2026-02-16 + +**Documentation Version**: 1.0 From 32d451fbca0ae402d477d1b9d1082a8f5711d646 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 16 Feb 2026 13:09:51 +0000 Subject: [PATCH 5/5] docs: fix incorrect interval references - default is 5 seconds not 1 second Co-authored-by: Garbee <868301+Garbee@users.noreply.github.com> --- docs/architecture.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/architecture.md b/docs/architecture.md index ba261e9..8a0e818 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -28,7 +28,7 @@ sequenceDiagram Main-->>User: Exit immediately Collector->>Collector: Start collection loop - loop Every 1 second (default) + loop Every 5 seconds (default) Collector->>Collector: Collect CPU, Memory, Disk metrics Collector->>State: Write metrics to state file end @@ -54,7 +54,7 @@ For accessibility, here is a text description of the execution flow diagram abov 1. **Main Action Execution**: The workflow executes the main action (`dist/main/index.js`), which immediately spawns a collector process as a detached background process and exits. -2. **Collector Process**: The collector process (`dist/main/collector.js`) creates a Metrics instance and starts a collection loop that runs every 1 second (by default). During each cycle, it: +2. **Collector Process**: The collector process (`dist/main/collector.js`) creates a Metrics instance and starts a collection loop that runs every 5 seconds (by default). 
During each cycle, it: - Collects CPU, memory, and disk usage metrics using the `systeminformation` library - Stores the metrics in memory - Writes the metrics to state file immediately after collection @@ -93,7 +93,7 @@ A simple background process that: The core metrics collection component: - **Initialization**: Starts async collection in the constructor -- **Periodic Collection**: Collects metrics every 1 second (default) using drift-compensated timers +- **Periodic Collection**: Collects metrics every 5 seconds (default) using drift-compensated timers - **Data Collection**: Uses `systeminformation` library to gather: - CPU usage (user and system, 0-100%) - Memory usage (active and available in MB) @@ -263,7 +263,7 @@ The action is designed for Node.js 24+ with: ### CPU Impact - Collection process uses minimal CPU (< 1% typical) - `systeminformation` library is efficient -- 1-second interval provides high-resolution data with minimal overhead +- 5-second default interval provides good resolution with minimal overhead ### Disk I/O
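The drift-compensated timer mentioned in the collector description can be illustrated with a small sketch. This is not the action's actual implementation (the real collector lives in `dist/main/collector.js` and gathers data via the `systeminformation` library); `nextDelay` and `startDriftCompensatedLoop` are hypothetical names used only to show the technique.

```javascript
// Illustrative sketch of a drift-compensated interval, not the
// action's real code. A naive chain of setTimeout calls drifts
// because each delay starts after the previous callback finishes;
// scheduling every tick relative to a fixed start time keeps ticks
// aligned to the interval grid instead.

// Milliseconds until the next tick boundary after `nowMs`, given a
// fixed `startMs` and the tick `intervalMs`.
function nextDelay(startMs, intervalMs, nowMs) {
  const elapsed = nowMs - startMs;
  return intervalMs - (elapsed % intervalMs);
}

// Runs `onTick` roughly every `intervalMs`, compensating for the
// time the callback itself consumes. Returns a stop function.
function startDriftCompensatedLoop(intervalMs, onTick) {
  const start = Date.now();
  let stopped = false;
  const schedule = () => {
    if (stopped) return;
    setTimeout(() => {
      onTick();
      schedule();
    }, nextDelay(start, intervalMs, Date.now()));
  };
  schedule();
  return () => { stopped = true; };
}
```

With a 5000 ms interval, a callback that finishes 300 ms into a cycle still leaves the next tick roughly 4700 ms away, because the delay is computed from the fixed start time rather than from when the callback returned.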