Autonomous AI-driven code optimization framework. Inspired by karpathy/autoresearch.
Required tools: docker, kubectl, kind, git, envsubst (from gettext), bc, curl
See examples/lifecycle/ for annotated shell scripts showing each phase of the
optimization lifecycle (build, deploy, workload, collect, profile, analyze, validate,
teardown). These are teaching patterns — adapt them for each target rather than
running them directly.
See examples/kind-cluster/setup.sh for local Kubernetes cluster setup.
See examples/demo/ for a complete end-to-end demo with the pyserver target.
- Agree on target, environment, metric, and tag
- Read target config: `targets/<target>/target.md`, `targets/<target>/hints.md`
- Clone target source (shallow, selective submodules for large C++ projects)
- Verify environment: docker, kubectl, kind cluster (see examples/kind-cluster/setup.sh)
- Initialize results: `mkdir -p "results/<target>/<env>/logs" "results/<target>/<env>/profiles"`
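The setup steps above can be sketched as two small helpers (the tool list mirrors "Required tools" above; the target/env names in the usage comment are placeholders):

```shell
#!/bin/sh
# Sketch: verify required tools are installed, then create the results layout.

check_tools() {
  missing=""
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || missing="$missing $t"
  done
  # Prints "ok" when everything is present, otherwise the missing tool names
  if [ -z "$missing" ]; then echo "ok"; else echo "missing:$missing"; fi
}

init_results() {
  # $1 = target, $2 = env
  mkdir -p "results/$1/$2/logs" "results/$1/$2/profiles"
}

# Usage:
#   check_tools docker kubectl kind git envsubst bc curl
#   init_results pyserver kind-local   # hypothetical target/env
```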
Before profiling, you need queries/requests that represent real usage. An idle process or trivial workload produces meaningless profiles. The workload determines what code paths are exercised and what shows up in the profiler.
- Target's own benchmark suite — most projects ship standard benchmarks:
  - ClickHouse: `tests/performance/*.xml` (382 XML test files), `clickhouse-benchmark` tool
  - PostgreSQL: pgbench, TPC-H
  - Redis: redis-benchmark
  - Any project: check `tests/`, `benchmarks/`, `perf/` directories
- Production-representative queries — from documentation, tutorials, or user forums:
- Official docs "getting started" queries
- Blog posts showing real use cases
- GitHub issues tagged "performance" (users share their slow queries)
- Slow query logs if available
- Metric-targeted stress patterns:
- Memory (peak_rss): high-cardinality GROUP BY, large JOINs, window functions, groupArray with many values
- CPU (latency): complex WHERE filters, regex matching, sorting large results, aggregation with many functions
- I/O: full table scans, FINAL queries, heavy merge operations
- Search online for the target's known performance issues:
- GitHub issues/PRs tagged "performance"
- Blog posts about optimization
- Conference talks about internals
- This reveals which workload patterns actually stress the system
- Queries exercise the code paths relevant to the primary metric
- Data scale is production-representative (not toy: 100K rows is rarely enough)
- Workload is reproducible (deterministic data generation, fixed row counts)
- Multiple query types cover different code paths (aggregation, scan, sort, join)
- Concurrent queries are mandatory — production services always handle multiple simultaneous requests. Single-query testing misses contention effects and under-represents peak memory by 2-3×. Run at least 4 concurrent queries during profiling.
- Both single-query AND concurrent profiling should be collected — some bottlenecks only appear under contention (MemoryTracker races, lock contention, allocator fragmentation)
Write `targets/<target>/workload.sh` and `targets/<target>/profile_workload.sh` with queries that:
- Create test data at sufficient scale
- Run representative queries both sequentially and concurrently
- Output standardized metrics (latency_p99_ms, throughput_qps, error_rate)
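Two patterns those scripts share can be sketched as follows; the client command in the usage comment is a placeholder, and the metric values are whatever the real script measures, not the literals shown:

```shell
#!/bin/sh
# Sketch: launch N concurrent queries, and emit the standardized metric lines.

run_concurrent() {
  # $1 = concurrency, remaining args = one query command
  n="$1"; shift
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" &                 # each query runs in the background
    i=$((i + 1))
  done
  wait                     # block until every concurrent query finishes
}

emit_metrics() {
  # Standardized output contract consumed by collect.sh
  printf 'latency_p99_ms=%s\nthroughput_qps=%s\nerror_rate=%s\n' "$1" "$2" "$3"
}

# Usage (placeholder client and values):
#   run_concurrent 4 clickhouse-client --query "SELECT ..."
#   emit_metrics "$p99" "$qps" "$err"
```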
This is the most important phase. Do NOT scan code or guess optimizations. Profile FIRST with stack-level allocation tracing to find WHERE resources are actually spent.
Build with RelWithDebInfo (or equivalent) so profiling tools can resolve function names and source locations. Without symbols, profiling data is useless.
```
-DCMAKE_BUILD_TYPE=RelWithDebInfo   # not Release
```
The workload should stress the primary metric (e.g., peak_rss_mb for memory, latency for CPU). Use production-representative data sizes, not toy data.
Use the target's built-in profiling tools first. Examples:
- ClickHouse: `system.trace_log` with `trace_type = 'Memory'`, `memory_profiler_step = 1048576`
- Go: pprof heap profile via HTTP endpoint
- Python: `tracemalloc` with stack traces
- Generic C/C++: `heaptrack`, jemalloc prof (`MALLOC_CONF=prof:true`), or `valgrind --tool=massif`

If no built-in tools exist, use:
- `/proc/PID/smaps` for memory region breakdown (coarse)
- `perf record -g` for CPU flame graphs
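A sketch of the fallback path, assuming a Linux /proc filesystem (the perf commands are shown as comments because they need perf_event access and a live target PID):

```shell
#!/bin/sh
# Coarse memory view without a built-in profiler: sum the Rss field over all
# mappings in /proc/PID/smaps. Region-level only, no stacks -- hence "coarse".
rss_kb() {
  awk '/^Rss:/ {sum += $2} END {print sum + 0}' "/proc/$1/smaps"
}

# CPU side, against the target's PID:
#   perf record -g -p "$PID" -- sleep 30
#   perf script > profiles/cpu-stacks.txt
```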
From the traces, answer:
- Which function allocates the most bytes? (not "which module" — the FUNCTION)
- What call stack leads to it? (at least 5 frames deep)
- What data structure is growing? (PODArray, Arena, std::vector, hash table buffer?)
- How much does it allocate? (absolute bytes, % of total)
- WHY is it allocating? (realloc doubling, new element insertion, copy-on-write, serialization?)
Output: Write profiles/baseline-stacks.txt with the top 10 allocation paths and their byte counts.
If the profiler gives aggregate data but not enough detail (e.g., "Arena = 56% of peak" but which Arena method?), add lightweight counters directly to the suspected hot path:
```cpp
// Example: count calls and bytes per method
++stats_alloc_calls;   stats_alloc_bytes   += size;
++stats_realloc_calls; stats_realloc_bytes += old_size;
```

Log stats in the destructor for objects > 10MB. This gives ground truth about which methods are actually called, not which methods COULD be called based on code reading.
This step prevented us from wasting experiments on Arena::realloc (0 calls) and Arena::allocContinue (0 calls) when 100% of allocation was through alignedAlloc.
Now scan source code — but only the functions identified in Phase 1 traces. Don't scan broadly; focus on the specific call stacks from the profiler.
For each top allocation path from Phase 1:
- Read the source code for that specific function
- Understand WHY it allocates that much
- Identify whether it's: avoidable, reducible, or deferrable
- Propose a code-level change (data structure, algorithm, or logic)
Write candidates to candidates.md with:
- Profile evidence: exact function, bytes, % of total (from Phase 1)
- Root cause: why the allocation happens (realloc doubling, unbounded growth, etc.)
- Proposed fix: what structural change would reduce it
- Targeted workload: a query/test that exercises THIS specific path at scale
Before any candidate proceeds:
- Profile stack trace shows this function as a top allocator (>5% of total)
- Root cause is understood (not guessed from code reading)
- Proposed fix is structural (data structure, algorithm, logic — NOT constant/threshold tuning)
- Targeted workload is designed that exercises the EXACT code path
- Size calculation shows the workload produces allocations large enough to trigger the inefficiency
Deploy unmodified baseline, run targeted workload, profile again:
- Does the targeted function appear in the top allocators?
- Are the allocation sizes in the expected range?
- If NOT → the workload is wrong, redesign it. DO NOT PROCEED.
This step prevented us from optimizing Arena::realloc when groupArray doesn't use Arena for its array growth — it uses PODArray via the system allocator.
Run experiments only on profile-confirmed candidates with verified workloads.
1. Pick top confirmed candidate from candidates.md
2. If none remain: report to human or re-profile with different workload
3. Implement the optimization
4. Generate reproducible scripts for both baseline and experiment
(see "Reproducible Experiment Scripts" below)
5. Build (Release mode for benchmarking — debug symbols not needed here)
6. Run N>=3 iterations with TARGETED workload:
deploy → workload → collect RSS → teardown
7. Run N>=3 baseline iterations (reuse if already collected for this workload)
8. Compute statistics: mean, stddev, range for both
9. Decision:
- Mean improved > 1 stddev AND distributions don't overlap → KEEP
- Distributions overlap → DISCARD (not statistically significant)
- Profile shows no change in targeted function → DISCARD (wrong hypothesis)
10. Record in results.tsv with multi-run stats
11. Save metrics.log and diff.patch in the experiment directory
12. Repeat
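The statistics step (8) and the decision rule (9) can be sketched with awk. This is a population-stddev sketch over one metric value per line; adapt the input format to what collect.sh emits:

```shell
#!/bin/sh
# Mean and (population) stddev over N runs, one value per line on stdin.
stats() {
  awk '{ s += $1; ss += $1 * $1; n++ }
       END { m = s / n; printf "%.4f %.4f\n", m, sqrt(ss / n - m * m) }'
}

# Example: three baseline runs
#   printf '100\n102\n101\n' | stats    # -> 101.0000 0.8165
#
# KEEP only if the experiment mean improves by more than one baseline stddev
# AND the min/max ranges of the two run sets do not overlap.
```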
Every experiment MUST produce runnable scripts that a human can execute step-by-step to reproduce the results. This is the primary trust mechanism — if someone can't re-run your experiment, the numbers are just claims.
After each experiment (baseline or optimization), generate scripts in:
```
results/<target>/<env>/<exp_id>/
  build.sh      # exact docker build command(s)
  deploy.sh     # exact kubectl commands, port-forward setup
  workload.sh   # exact queries/requests that were run
  collect.sh    # exact metric collection commands
  profile.sh    # exact profiling commands (if profiling was done)
  teardown.sh   # cleanup commands
  metrics.log   # collected metrics output
  diff.patch    # the code change (experiment only, not baseline)
  README.md     # what this experiment tests and what to expect
```
Script requirements:
- Self-contained — each script must run independently with no external state. Include all env vars, paths, and parameters inline. A human should be able to `cd` into the directory and run each script in order.
- Idempotent where possible — re-running a script should produce the same result (use deterministic data, fixed seeds).
- Commented — explain what each command does and what to look for in the output. A human unfamiliar with the target should be able to follow along.
- Use patterns from `examples/lifecycle/` — adapt the reference scripts for the specific target. For example, use `/proc/1/status` VmHWM for peak RSS collection (from `examples/lifecycle/collect.sh`), or target-specific profiling hooks (from `examples/lifecycle/profile.sh`).
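The VmHWM pattern mentioned above can be sketched as follows (the pod name in the in-cluster form is a placeholder):

```shell
#!/bin/sh
# Peak RSS via the kernel's high-water mark: the VmHWM line in
# /proc/<pid>/status records the largest resident set the process ever
# reached, in kB -- exactly what peak_rss_mb needs.
peak_rss_kb() {
  awk '/^VmHWM:/ {print $2}' "$1"
}

# In-cluster form (hypothetical pod name; PID 1 is the target process):
#   kubectl exec "$POD" -- cat /proc/1/status | awk '/^VmHWM:/ {print $2}'
```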
README.md for each experiment should include:
```markdown
# <exp_id>: <one-line description>

## Hypothesis
What optimization is being tested and why (with profiling evidence).

## How to reproduce
1. ./build.sh      # builds the container image (~X min)
2. ./deploy.sh     # deploys to K8s and sets up port-forward
3. ./workload.sh   # runs the benchmark workload
4. ./collect.sh    # collects metrics
5. ./teardown.sh   # cleans up

## Expected results
<metric>: <expected value> (baseline was <baseline value>)

## Actual results
Run 1: <value>, Run 2: <value>, Run 3: <value>
Mean: <mean> ± <stddev>
Decision: KEEP/DISCARD — <reason>
```

A human should be able to verify any experiment by running 5 scripts in order. If they can't, the experiment is incomplete.
Not every finding leads to a code PR. The right output depends on what you find:
| Finding | Right Output |
|---|---|
| Small, self-contained fix with measured impact | Pull Request with benchmark data |
| Design-level problem requiring API changes | Issue with profiling evidence and proposed design |
| No optimization opportunity after profiling | Report documenting what was investigated and why it's not a target |
An issue with stack traces and a concrete proposal is more valuable than a PR with no measurable impact.
Code-level only — NOT configuration changes.
- Data structure changes — replace a container with a more efficient one
- Memory lifecycle changes — pre-reserve, pool, recycle, release earlier
- Algorithm changes — reduce complexity, eliminate redundant work
- Logic changes — move semantics, lazy evaluation, deferred computation
NOT optimization:
- Changing growth factors (2x → 1.5x)
- Changing thresholds (128MB → 64MB)
- Changing initial sizes (4096 → 1024)
- Changing buffer counts or pool sizes
Ask: "Does this change HOW the code works, or just WHAT numbers it uses?"
- Profile before coding — never implement without stack-level allocation evidence
- Verify workload exercises the path — before implementing, not after
- Multi-run benchmarks — N>=3, distributions must not overlap for KEEP
- Instrument when uncertain — add counters to get ground truth, don't guess from code
- Right output type — PR for measured fixes, issue for design proposals, report for dead ends
- Only edit files in editable scope from target.md
- No config tuning — constants, thresholds, buffer sizes are NOT optimizations
- Same-version A/B only — baseline and experiment MUST be built from the same commit; never compare stock image vs from-source build of a different version
- Deterministic benchmark data — never use `rand()` in data generation for A/B tests; use `number`, `sipHash64(number)`, or fixed seeds so control and experiment have identical data
- Judge absolute impact first — estimate real-world savings before implementing; if <50 MB or <10% of total query memory, skip the candidate
- Verify all code paths — before changing allocation strategy, check every path that touches the buffer post-allocation (overflow retries, appends, downstream growth)
- Self-review before PR — ask: is methodology sound, is impact meaningful, are edge cases tested, would I approve this from someone else?
- Log everything — results.tsv captures all experiments including failures
- Generate reproducible scripts — every experiment must produce runnable scripts in `results/<target>/<env>/<exp_id>/` that a human can execute to verify results (see "Reproducible Experiment Scripts" above)
These are hard-won lessons from 18+ experiments across 3 pipeline versions:
- Aggregate profiling metrics (`/proc/status`, `query_log`) don't tell you WHERE memory goes. You need stack traces. "Arena = 56% of peak" is not actionable; "PODArray::resize called from GroupArrayGeneralImpl::insertResultInto = 536MB" IS actionable.
- Code reading produces hypotheses, not facts. ClickHouse devs documented "quadratic waste in allocContinue" but our workload never triggered it (0 calls). Always validate hypotheses with instrumentation before implementing.
- Your optimization must target the function that ACTUALLY allocates, not the one that LOOKS like it should. Arena::realloc was the obvious target but had 0 calls. The real allocator was PODArray::resize through the system allocator, called during result materialization (not during aggregation).
- Generic workloads may never exercise the targeted path. 100K keys × 50 values = 400 bytes/key — too small to trigger realloc. 1K keys × 50K values = 500MB total — exercises the path. Size calculations matter.
- Single-run results are noise. Process RSS varies 10-20% between identical runs with generic workloads, and ~0.1% with targeted workloads. Always run N>=3 and compare distributions.
- Sometimes the right contribution is an issue, not a PR. ClickHouse's PODArray realloc peak during result materialization requires an API change (estimateResultBytes). Filing an issue with profiling evidence is more valuable than a PR that doesn't work.
- Build with debug symbols for profiling, Release for benchmarking. addressToSymbol/addressToLine return empty strings without debug info. Switch to Release for the actual A/B benchmark.
- A/B benchmarks MUST use the same source version. Build baseline and experiment from the SAME commit — the only difference must be your optimization patch. Never compare a stock Docker image (version X) against a from-source build (version Y). Different versions have hundreds of unrelated changes that contaminate results. The correct process: build from source WITHOUT your change (control), then build WITH your change (experiment), run identical workloads, compare. Incremental rebuilds with ccache make this fast — only the changed files recompile.
- Judge impact BEFORE implementing. After profiling, estimate the absolute memory/CPU savings in the real-world scenario, not just percentages. "20% of FormatStringImpl allocations" sounds good but is only 3 MB per query — not worth a PR. Ask: would a ClickHouse user notice this? Would it change their capacity planning? If the absolute savings are under ~50 MB or ~10% of total query memory, the optimization is too marginal. Move on to the next candidate.
- Use deterministic data for A/B tests. Never use `rand()` in INSERT for benchmark data — control and experiment get different data, contaminating the comparison. Use `number`, `sipHash64(number)`, or fixed seeds. The data must be byte-identical between control and experiment runs.
- Verify ALL code paths before changing an allocation strategy. When switching from `resize()` (with headroom) to `resize_exact()` (no headroom), check every code path that touches the buffer AFTER allocation. Look for: overflow/retry paths, append operations, downstream consumers that may grow the buffer. If ANY path relies on headroom, the change is unsafe or counterproductive. In ClickHouse, the `U_BUFFER_OVERFLOW_ERROR` retry in `LowerUpperUTF8Impl` relied on power-of-two headroom — removing it made overflows more frequent.
- Critically evaluate your own PR before submitting. Before creating a PR, ask: (a) Is the benchmark methodology sound? (b) Is the absolute impact meaningful to the project? (c) Have I tested edge cases? (d) Would I approve this PR if someone else submitted it? If any answer is no, fix it or don't submit. A weak PR wastes reviewer time and damages credibility.
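The deterministic-data lesson can be sketched in plain awk; the constant below is Knuth's multiplicative hash standing in for sipHash64, but the property is the same: identical calls produce byte-identical data.

```shell
#!/bin/sh
# Deterministic row generator: no rand(), so control and experiment runs
# see exactly the same bytes. Knuth's multiplicative-hash constant is a
# stand-in for sipHash64(number) from the ClickHouse example.
gen_rows() {
  awk -v n="$1" 'BEGIN {
    for (i = 0; i < n; i++) print i "\t" (i * 2654435761) % 4294967296
  }'
}

# ClickHouse equivalent (from the lesson above):
#   INSERT INTO t SELECT number, sipHash64(number) FROM numbers(1000000)
```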
```shell
# Find where you left off
cat results/<target>/<env>/candidates.md
tail -5 results/<target>/<env>/results.tsv
ls results/<target>/<env>/profiles/

# Read latest profile
cat results/<target>/<env>/profiles/<latest>-stacks.txt

# Review previous experiments (each has reproducible scripts)
ls results/<target>/<env>/
# Each exp_id/ directory has: build.sh, deploy.sh, workload.sh, collect.sh,
# teardown.sh, metrics.log, README.md, and optionally diff.patch + profile.sh
```

From target.md: `direction: lower` (e.g., peak_rss_mb) or `direction: higher` (e.g., throughput_qps).