Autonomous AI-driven code optimization framework. Inspired by karpathy/autoresearch.
Required tools: docker, kubectl, kind, git, envsubst (from gettext), bc, curl
See examples/lifecycle/ for annotated shell scripts showing each phase of the
optimization lifecycle (build, deploy, workload, collect, profile, analyze, validate,
teardown). These are teaching patterns — adapt them for each target rather than
running them directly.
See examples/kind-cluster/setup.sh for local Kubernetes cluster setup.
See examples/demo/ for a complete end-to-end demo with the pyserver target.
- Agree on target, environment, metric, and tag
- Read target config: `targets/<target>/target.md`, `targets/<target>/hints.md`
- Clone target source (shallow, selective submodules for large C++ projects)
- Verify environment: docker, kubectl, kind cluster (see examples/kind-cluster/setup.sh)
- Initialize results: `mkdir -p "results/<target>/<env>/logs" "results/<target>/<env>/profiles"`
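The setup steps above can be sketched as two small helpers (the tool list mirrors "Required tools" above; the target/env names in the usage comment are placeholders):

```shell
#!/bin/sh
# Sketch: verify required tools are installed, then create the results layout.

check_tools() {
  missing=""
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || missing="$missing $t"
  done
  # Prints "ok" when everything is present, otherwise the missing tool names
  if [ -z "$missing" ]; then echo "ok"; else echo "missing:$missing"; fi
}

init_results() {
  # $1 = target, $2 = env
  mkdir -p "results/$1/$2/logs" "results/$1/$2/profiles"
}

# Usage:
#   check_tools docker kubectl kind git envsubst bc curl
#   init_results pyserver kind-local   # hypothetical target/env
```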
Before profiling, you need queries/requests that represent real usage. An idle process or trivial workload produces meaningless profiles. The workload determines what code paths are exercised and what shows up in the profiler.
- Target's own benchmark suite — most projects ship standard benchmarks:
  - ClickHouse: `tests/performance/*.xml` (382 XML test files), `clickhouse-benchmark` tool
  - PostgreSQL: pgbench, TPC-H
  - Redis: redis-benchmark
  - Any project: check `tests/`, `benchmarks/`, `perf/` directories
- Production-representative queries — from documentation, tutorials, or user forums:
- Official docs "getting started" queries
- Blog posts showing real use cases
- GitHub issues tagged "performance" (users share their slow queries)
- Slow query logs if available
- Metric-targeted stress patterns:
- Memory (peak_rss): high-cardinality GROUP BY, large JOINs, window functions, groupArray with many values
- CPU (latency): complex WHERE filters, regex matching, sorting large results, aggregation with many functions
- I/O: full table scans, FINAL queries, heavy merge operations
- Search online for the target's known performance issues:
- GitHub issues/PRs tagged "performance"
- Blog posts about optimization
- Conference talks about internals
- This reveals which workload patterns actually stress the system
- Queries exercise the code paths relevant to the primary metric
- Data scale is production-representative (not toy: 100K rows is rarely enough)
- Workload is reproducible (deterministic data generation, fixed row counts)
- Multiple query types cover different code paths (aggregation, scan, sort, join)
- Concurrent queries are mandatory — production services always handle multiple simultaneous requests. Single-query testing misses contention effects and under-represents peak memory by 2-3×. Run at least 4 concurrent queries during profiling.
- Both single-query AND concurrent profiling should be collected — some bottlenecks only appear under contention (MemoryTracker races, lock contention, allocator fragmentation)
Write `targets/<target>/workload.sh` and `targets/<target>/profile_workload.sh` with queries that:
- Create test data at sufficient scale
- Run representative queries both sequentially and concurrently
- Output standardized metrics (latency_p99_ms, throughput_qps, error_rate)
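Two patterns those scripts share can be sketched as follows; the client command in the usage comment is a placeholder, and the metric values are whatever the real script measures, not the literals shown:

```shell
#!/bin/sh
# Sketch: launch N concurrent queries, and emit the standardized metric lines.

run_concurrent() {
  # $1 = concurrency, remaining args = one query command
  n="$1"; shift
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" &                 # each query runs in the background
    i=$((i + 1))
  done
  wait                     # block until every concurrent query finishes
}

emit_metrics() {
  # Standardized output contract consumed by collect.sh
  printf 'latency_p99_ms=%s\nthroughput_qps=%s\nerror_rate=%s\n' "$1" "$2" "$3"
}

# Usage (placeholder client and values):
#   run_concurrent 4 clickhouse-client --query "SELECT ..."
#   emit_metrics "$p99" "$qps" "$err"
```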
This is the most important phase. Do NOT scan code or guess optimizations. Profile FIRST with stack-level allocation tracing to find WHERE resources are actually spent.
Build with RelWithDebInfo (or equivalent) so profiling tools can resolve function names and source locations. Without symbols, profiling data is useless.
```
-DCMAKE_BUILD_TYPE=RelWithDebInfo   # not Release
```
The workload should stress the primary metric (e.g., peak_rss_mb for memory, latency for CPU). Use production-representative data sizes, not toy data.
Use the target's built-in profiling tools first. Examples:
- ClickHouse: `system.trace_log` with `trace_type = 'Memory'`, `memory_profiler_step = 1048576`
- Go: pprof heap profile via HTTP endpoint
- Python: `tracemalloc` with stack traces
- Generic C/C++: `heaptrack`, jemalloc prof (`MALLOC_CONF=prof:true`), or `valgrind --tool=massif`

If no built-in tools exist, use:
- `/proc/PID/smaps` for memory region breakdown (coarse)
- `perf record -g` for CPU flame graphs
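A sketch of the fallback path, assuming a Linux /proc filesystem (the perf commands are shown as comments because they need perf_event access and a live target PID):

```shell
#!/bin/sh
# Coarse memory view without a built-in profiler: sum the Rss field over all
# mappings in /proc/PID/smaps. Region-level only, no stacks -- hence "coarse".
rss_kb() {
  awk '/^Rss:/ {sum += $2} END {print sum + 0}' "/proc/$1/smaps"
}

# CPU side, against the target's PID:
#   perf record -g -p "$PID" -- sleep 30
#   perf script > profiles/cpu-stacks.txt
```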
From the traces, answer:
- Which function allocates the most bytes? (not "which module" — the FUNCTION)
- What call stack leads to it? (at least 5 frames deep)
- What data structure is growing? (PODArray, Arena, std::vector, hash table buffer?)
- How much does it allocate? (absolute bytes, % of total)
- WHY is it allocating? (realloc doubling, new element insertion, copy-on-write, serialization?)
Output: Write profiles/baseline-stacks.txt with the top 10 allocation paths and their byte counts.
If the profiler gives aggregate data but not enough detail (e.g., "Arena = 56% of peak" but which Arena method?), add lightweight counters directly to the suspected hot path:
```cpp
// Example: count calls and bytes per method
++stats_alloc_calls;   stats_alloc_bytes   += size;
++stats_realloc_calls; stats_realloc_bytes += old_size;
```

Log stats in the destructor for objects > 10MB. This gives ground truth about which methods are actually called, not which methods COULD be called based on code reading.
This step prevented us from wasting experiments on Arena::realloc (0 calls) and Arena::allocContinue (0 calls) when 100% of allocation was through alignedAlloc.
Now scan source code — but only the functions identified in Phase 1 traces. Don't scan broadly; focus on the specific call stacks from the profiler.
For each top allocation path from Phase 1:
- Read the source code for that specific function
- Understand WHY it allocates that much
- Identify whether it's: avoidable, reducible, or deferrable
- Propose a code-level change (data structure, algorithm, or logic)
Write candidates to candidates.md with:
- Profile evidence: exact function, bytes, % of total (from Phase 1)
- Root cause: why the allocation happens (realloc doubling, unbounded growth, etc.)
- Proposed fix: what structural change would reduce it
- Targeted workload: a query/test that exercises THIS specific path at scale
Before any candidate proceeds:
- Profile stack trace shows this function as a top allocator (>5% of total)
- Root cause is understood (not guessed from code reading)
- Proposed fix is structural (data structure, algorithm, logic — NOT constant/threshold tuning)
- Targeted workload is designed that exercises the EXACT code path
- Size calculation shows the workload produces allocations large enough to trigger the inefficiency
Deploy unmodified baseline, run targeted workload, profile again:
- Does the targeted function appear in the top allocators?
- Are the allocation sizes in the expected range?
- If NOT → the workload is wrong, redesign it. DO NOT PROCEED.
This step prevented us from optimizing Arena::realloc when groupArray doesn't use Arena for its array growth — it uses PODArray via the system allocator.
Run experiments only on profile-confirmed candidates with verified workloads.
1. Pick top confirmed candidate from candidates.md
2. If none remain: report to human or re-profile with different workload
3. Implement the optimization
4. Generate reproducible scripts for both baseline and experiment
(see "Reproducible Experiment Scripts" below)
5. Build (Release mode for benchmarking — debug symbols not needed here)
6. Run N>=3 iterations with TARGETED workload:
deploy → workload → collect RSS → teardown
7. Run N>=3 baseline iterations (reuse if already collected for this workload)
8. Compute statistics: mean, stddev, range for both
9. Decision:
- Mean improved > 1 stddev AND distributions don't overlap → KEEP
- Distributions overlap → DISCARD (not statistically significant)
- Profile shows no change in targeted function → DISCARD (wrong hypothesis)
10. Record in results.tsv with multi-run stats
11. Save metrics.log and diff.patch in the experiment directory
12. Repeat
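The statistics step (8) and the decision rule (9) can be sketched with awk. This is a population-stddev sketch over one metric value per line; adapt the input format to what collect.sh emits:

```shell
#!/bin/sh
# Mean and (population) stddev over N runs, one value per line on stdin.
stats() {
  awk '{ s += $1; ss += $1 * $1; n++ }
       END { m = s / n; printf "%.4f %.4f\n", m, sqrt(ss / n - m * m) }'
}

# Example: three baseline runs
#   printf '100\n102\n101\n' | stats    # -> 101.0000 0.8165
#
# KEEP only if the experiment mean improves by more than one baseline stddev
# AND the min/max ranges of the two run sets do not overlap.
```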
Every experiment MUST produce runnable scripts that a human can execute step-by-step to reproduce the results. This is the primary trust mechanism — if someone can't re-run your experiment, the numbers are just claims.
After each experiment (baseline or optimization), generate scripts in:
```
results/<target>/<env>/<exp_id>/
  build.sh      # exact docker build command(s)
  deploy.sh     # exact kubectl commands, port-forward setup
  workload.sh   # exact queries/requests that were run
  collect.sh    # exact metric collection commands
  profile.sh    # exact profiling commands (if profiling was done)
  teardown.sh   # cleanup commands
  metrics.log   # collected metrics output
  diff.patch    # the code change (experiment only, not baseline)
  README.md     # what this experiment tests and what to expect
```
Script requirements:
- Self-contained — each script must run independently with no external state. Include all env vars, paths, and parameters inline. A human should be able to `cd` into the directory and run each script in order.
- Idempotent where possible — re-running a script should produce the same result (use deterministic data, fixed seeds).
- Commented — explain what each command does and what to look for in the output. A human unfamiliar with the target should be able to follow along.
- Use patterns from `examples/lifecycle/` — adapt the reference scripts for the specific target. For example, use `/proc/1/status` VmHWM for peak RSS collection (from `examples/lifecycle/collect.sh`), or target-specific profiling hooks (from `examples/lifecycle/profile.sh`).
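The VmHWM pattern mentioned above can be sketched as follows (the pod name in the in-cluster form is a placeholder):

```shell
#!/bin/sh
# Peak RSS via the kernel's high-water mark: the VmHWM line in
# /proc/<pid>/status records the largest resident set the process ever
# reached, in kB -- exactly what peak_rss_mb needs.
peak_rss_kb() {
  awk '/^VmHWM:/ {print $2}' "$1"
}

# In-cluster form (hypothetical pod name; PID 1 is the target process):
#   kubectl exec "$POD" -- cat /proc/1/status | awk '/^VmHWM:/ {print $2}'
```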
README.md for each experiment should include:
```markdown
# <exp_id>: <one-line description>

## Hypothesis
What optimization is being tested and why (with profiling evidence).

## How to reproduce
1. ./build.sh      # builds the container image (~X min)
2. ./deploy.sh     # deploys to K8s and sets up port-forward
3. ./workload.sh   # runs the benchmark workload
4. ./collect.sh    # collects metrics
5. ./teardown.sh   # cleans up

## Expected results
<metric>: <expected value> (baseline was <baseline value>)

## Actual results
Run 1: <value>, Run 2: <value>, Run 3: <value>
Mean: <mean> ± <stddev>
Decision: KEEP/DISCARD — <reason>
```

A human should be able to verify any experiment by running 5 scripts in order. If they can't, the experiment is incomplete.
Not every finding leads to a code PR. The right output depends on what you find:
| Finding | Right Output |
|---|---|
| Small, self-contained fix with measured impact | Pull Request with benchmark data |
| Design-level problem requiring API changes | Issue with profiling evidence and proposed design |
| No optimization opportunity after profiling | Report documenting what was investigated and why it's not a target |
An issue with stack traces and a concrete proposal is more valuable than a PR with no measurable impact.
Code-level only — NOT configuration changes.
- Data structure changes — replace a container with a more efficient one
- Memory lifecycle changes — pre-reserve, pool, recycle, release earlier
- Algorithm changes — reduce complexity, eliminate redundant work
- Logic changes — move semantics, lazy evaluation, deferred computation
NOT optimization:
- Changing growth factors (2x → 1.5x)
- Changing thresholds (128MB → 64MB)
- Changing initial sizes (4096 → 1024)
- Changing buffer counts or pool sizes
Ask: "Does this change HOW the code works, or just WHAT numbers it uses?"
- Profile before coding — never implement without stack-level allocation evidence
- Verify workload exercises the path — before implementing, not after
- Multi-run benchmarks — N>=3, distributions must not overlap for KEEP
- Instrument when uncertain — add counters to get ground truth, don't guess from code
- Right output type — PR for measured fixes, issue for design proposals, report for dead ends
- Only edit files in editable scope from target.md
- No config tuning — constants, thresholds, buffer sizes are NOT optimizations
- Same-version A/B only — baseline and experiment MUST be built from the same commit; never compare stock image vs from-source build of a different version
- Deterministic benchmark data — never use `rand()` in data generation for A/B tests; use `number`, `sipHash64(number)`, or fixed seeds so control and experiment have identical data
- Judge absolute impact first — estimate real-world savings before implementing; if <50 MB or <10% of total query memory, skip the candidate
- Verify all code paths — before changing allocation strategy, check every path that touches the buffer post-allocation (overflow retries, appends, downstream growth)
- Self-review before PR — ask: is methodology sound, is impact meaningful, are edge cases tested, would I approve this from someone else?
- Log everything — results.tsv captures all experiments including failures
- Generate reproducible scripts — every experiment must produce runnable scripts in `results/<target>/<env>/<exp_id>/` that a human can execute to verify results (see "Reproducible Experiment Scripts" above)
These are hard-won lessons from 18+ experiments across 3 pipeline versions:
- Aggregate profiling metrics (`/proc/status`, `query_log`) don't tell you WHERE memory goes. You need stack traces. "Arena = 56% of peak" is not actionable; "PODArray::resize called from GroupArrayGeneralImpl::insertResultInto = 536MB" IS actionable.
- Code reading produces hypotheses, not facts. ClickHouse devs documented "quadratic waste in allocContinue" but our workload never triggered it (0 calls). Always validate hypotheses with instrumentation before implementing.
- Your optimization must target the function that ACTUALLY allocates, not the one that LOOKS like it should. Arena::realloc was the obvious target but had 0 calls. The real allocator was PODArray::resize through the system allocator, called during result materialization (not during aggregation).
- Generic workloads may never exercise the targeted path. 100K keys × 50 values = 400 bytes/key — too small to trigger realloc. 1K keys × 50K values = 500MB total — exercises the path. Size calculations matter.
- Single-run results are noise. Process RSS varies 10-20% between identical runs with generic workloads, and ~0.1% with targeted workloads. Always run N>=3 and compare distributions.
- Sometimes the right contribution is an issue, not a PR. ClickHouse's PODArray realloc peak during result materialization requires an API change (estimateResultBytes). Filing an issue with profiling evidence is more valuable than a PR that doesn't work.
- Build with debug symbols for profiling, Release for benchmarking. addressToSymbol/addressToLine return empty strings without debug info. Switch to Release for the actual A/B benchmark.
- A/B benchmarks MUST use the same source version. Build baseline and experiment from the SAME commit — the only difference must be your optimization patch. Never compare a stock Docker image (version X) against a from-source build (version Y). Different versions have hundreds of unrelated changes that contaminate results. The correct process: build from source WITHOUT your change (control), then build WITH your change (experiment), run identical workloads, compare. Incremental rebuilds with ccache make this fast — only the changed files recompile.
- Judge impact BEFORE implementing. After profiling, estimate the absolute memory/CPU savings in the real-world scenario, not just percentages. "20% of FormatStringImpl allocations" sounds good but is only 3 MB per query — not worth a PR. Ask: would a ClickHouse user notice this? Would it change their capacity planning? If the absolute savings are under ~50 MB or ~10% of total query memory, the optimization is too marginal. Move on to the next candidate.
- Use deterministic data for A/B tests. Never use `rand()` in INSERT for benchmark data — control and experiment get different data, contaminating the comparison. Use `number`, `sipHash64(number)`, or fixed seeds. The data must be byte-identical between control and experiment runs.
- Verify ALL code paths before changing an allocation strategy. When switching from `resize()` (with headroom) to `resize_exact()` (no headroom), check every code path that touches the buffer AFTER allocation. Look for: overflow/retry paths, append operations, downstream consumers that may grow the buffer. If ANY path relies on headroom, the change is unsafe or counterproductive. In ClickHouse, the `U_BUFFER_OVERFLOW_ERROR` retry in `LowerUpperUTF8Impl` relied on power-of-two headroom — removing it made overflows more frequent.
- Critically evaluate your own PR before submitting. Before creating a PR, ask: (a) Is the benchmark methodology sound? (b) Is the absolute impact meaningful to the project? (c) Have I tested edge cases? (d) Would I approve this PR if someone else submitted it? If any answer is no, fix it or don't submit. A weak PR wastes reviewer time and damages credibility.
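The deterministic-data lesson can be sketched in plain awk; the constant below is Knuth's multiplicative hash standing in for sipHash64, but the property is the same: identical calls produce byte-identical data.

```shell
#!/bin/sh
# Deterministic row generator: no rand(), so control and experiment runs
# see exactly the same bytes. Knuth's multiplicative-hash constant is a
# stand-in for sipHash64(number) from the ClickHouse example.
gen_rows() {
  awk -v n="$1" 'BEGIN {
    for (i = 0; i < n; i++) print i "\t" (i * 2654435761) % 4294967296
  }'
}

# ClickHouse equivalent (from the lesson above):
#   INSERT INTO t SELECT number, sipHash64(number) FROM numbers(1000000)
```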
```shell
# Find where you left off
cat results/<target>/<env>/candidates.md
tail -5 results/<target>/<env>/results.tsv
ls results/<target>/<env>/profiles/

# Read latest profile
cat results/<target>/<env>/profiles/<latest>-stacks.txt

# Review previous experiments (each has reproducible scripts)
ls results/<target>/<env>/
# Each exp_id/ directory has: build.sh, deploy.sh, workload.sh, collect.sh,
# teardown.sh, metrics.log, README.md, and optionally diff.patch + profile.sh
```

From target.md: `direction: lower` (e.g., peak_rss_mb) or `direction: higher` (e.g., throughput_qps).