diff --git a/benchmark/README.md b/benchmark/README.md
deleted file mode 100644
index db4c1a2b..00000000
--- a/benchmark/README.md
+++ /dev/null
@@ -1,190 +0,0 @@
-# Rune Benchmark
-
-Benchmark suite for evaluating Rune's capture, recall, and extraction quality.
-
-## Architecture
-
-Since v0.2.0, Rune operates in **agent-delegated mode**: the calling agent (Claude/Codex/Gemini) performs LLM reasoning using the scribe prompt, then passes pre-extracted JSON to the MCP server. The server handles only embedding, encryption, and storage.
-
-This means **capture quality is determined by the scribe prompt**, not by Rune's internal modules. The benchmark reflects this:
-
-| Benchmark | What it tests | Method |
-|-----------|---------------|--------|
-| **scribe_bench** (capture) | Does the scribe prompt correctly guide capture/skip decisions? | Feed scribe prompt + scenario input to LLM, score the output JSON |
-| **scribe_bench** (extraction) | Is the extracted JSON well-structured? | Same LLM call, score field coverage and extraction type |
-| **retriever_bench** | Can stored decisions be retrieved with various queries? | Offline embedding similarity (FHE is transparent to scores) |
-
-## Running
-
-```bash
-# Default: uses claude CLI (no API key needed)
-python benchmark/runners/scribe_bench.py
-
-# Use a different agent CLI
-python benchmark/runners/scribe_bench.py --agent gemini
-python benchmark/runners/scribe_bench.py --agent codex
-
-# Capture only / extraction only
-python benchmark/runners/scribe_bench.py --mode capture
-python benchmark/runners/scribe_bench.py --agent gemini --mode extraction
-
-# Filter by category
-python benchmark/runners/scribe_bench.py --category pr_review
-
-# Retriever benchmark (no API key or CLI needed — uses local embeddings)
-python benchmark/runners/retriever_bench.py
-python benchmark/runners/retriever_bench.py --category semantic_match
-
-# Save report
-python benchmark/runners/scribe_bench.py --report benchmark/reports/scribe.json
-
-# Fallback: direct API call (for CI without CLI auth)
-python benchmark/runners/scribe_bench.py --api-key $ANTHROPIC_API_KEY --provider anthropic
-python benchmark/runners/scribe_bench.py --api-key $OPENAI_API_KEY --provider openai --model gpt-4o
-```
-
-## Scenarios (104 total)
-
-```
-scenarios/
-├── capture/
-│   ├── should_capture/          55 scenarios
-│   │   ├── architecture/        10 (incl. mixed-language edge cases)
-│   │   ├── debugging/            9 (incl. implicit decision in FYI)
-│   │   ├── incident/             6
-│   │   ├── product/              8
-│   │   ├── tradeoff/             6
-│   │   ├── process/              6
-│   │   └── pr_review/           10 (incl. cross-team, subtle standard-setting)
-│   └── should_not_capture/      23 scenarios
-│       ├── casual/               4
-│       ├── status_update/        4
-│       ├── question/             4
-│       ├── slop/                 6 (AI fluff, non-committal, verbose nothing)
-│       └── pr_noise/             5 (LGTM, nitpick, lint, merge conflict)
-├── recall/                      16 scenarios
-│   ├── exact_match/              4
-│   ├── semantic_match/           6 (incl. cross-language Korean→English)
-│   ├── cross_domain/             3
-│   └── temporal/                 3 (superseded decisions)
-└── extraction/                  10 scenarios
-    ├── single/                   4
-    ├── phase_chain/              3
-    └── bundle/                   3
-```
-
-## Scoring
-
-### Capture Accuracy
-- **True Positive**: should_capture scenario correctly flagged
-- **True Negative**: should_not_capture scenario correctly skipped
-- **False Positive**: noise incorrectly captured (worse than FN for storage cost)
-- **False Negative**: decision missed
-
-### Extraction Quality
-- **Type accuracy**: correct extraction mode (single / phase_chain / bundle)
-- **Title keyword match**: extracted title contains expected keywords
-- **Status accuracy**: correct status_hint (accepted / proposed / rejected)
-- **Field coverage**: sufficient alternatives and trade-offs extracted
-- **Phase count**: reasonable number of phases for multi-phase/bundle
-
-### Retriever Quality
-- **Hit@K**: target record appears in results above min_score threshold
-- **MRR**: Mean Reciprocal Rank of target records
-
-## Adding Your Own Scenarios
-
-When Rune doesn't behave as expected — a decision was missed, noise was captured, or the extraction structure was wrong — paste the actual conversation into a scenario to turn it into a regression test.
-
-### Field Reference
-
-#### Capture scenario fields
-
-| Field | Required | Description |
-|-------|----------|-------------|
-| `id` | yes | Unique identifier. Format: `{category}-{description}-{number}` (e.g., `debug-grpc-timeout-009`) |
-| `category` | yes | Directory path under `scenarios/`. Must match where the file lives (e.g., `capture/should_capture/debugging`) |
-| `language` | yes | Language of the input: `en`, `ko`, `ja`, or `mixed` |
-| `input` | yes | The actual conversation text to evaluate. Paste the real message as-is |
-| `expected_capture` | yes | `true` if this should be captured, `false` if it should be skipped |
-| `expected_fields` | no | Expected metadata when captured (see below). Use `{}` for should_not_capture scenarios |
-| `expected_fields.domain` | no | Expected domain classification: `architecture`, `security`, `product`, `debugging`, `incident`, `ops`, `process`, `qa`, `hr`, `data`, `finance`, `general`, etc. |
-| `expected_fields.status_hint` | no | Expected decision status: `accepted`, `proposed`, or `rejected` |
-| `expected_fields.title_keywords` | no | List of keywords that should appear in the extracted title (case-insensitive, any match passes) |
-| `recall_queries` | no | Optional queries that should (or should not) retrieve this decision after capture |
-| `notes` | no | Free-text annotation explaining why this scenario is interesting or tricky |
-
-#### Retriever scenario fields
-
-| Field | Required | Description |
-|-------|----------|-------------|
-| `id` | yes | Format: `recall-{subcategory}-{description}-{number}` |
-| `category` | yes | One of: `recall/exact_match`, `recall/semantic_match`, `recall/cross_domain`, `recall/temporal` |
-| `language` | yes | Language of the query |
-| `seed_records` | yes | Array of records to index before querying. Each has `title`, `domain`, `content`, and optionally `tags` |
-| `query` | yes | The recall query to test |
-| `expected_match_titles` | yes | Titles of seed records that should appear in results above `min_score` |
-| `min_score` | no | Minimum cosine similarity threshold (default: 0.35). Lower for semantic/cross-domain tests |
-| `notes` | no | Free-text annotation |
-
-#### Extraction scenario fields
-
-| Field | Required | Description |
-|-------|----------|-------------|
-| `id` | yes | Format: `extract-{type}-{description}-{number}` |
-| `category` | yes | One of: `extraction/single`, `extraction/phase_chain`, `extraction/bundle` |
-| `language` | yes | Language of the input |
-| `input` | yes | The conversation text to extract from |
-| `expected_extraction_type` | yes | Expected extraction mode: `single`, `phase_chain`, or `bundle` |
-| `expected_fields.title_keywords` | no | Keywords expected in extracted title |
-| `expected_fields.status_hint` | no | Expected status: `accepted`, `proposed`, `rejected` |
-| `expected_fields.min_alternatives` | no | Minimum number of alternatives that should be extracted |
-| `expected_fields.min_trade_offs` | no | Minimum number of trade-offs that should be extracted |
-| `expected_fields.min_phases` | no | Minimum phase count (for phase_chain/bundle) |
-| `expected_fields.max_phases` | no | Maximum phase count (for phase_chain/bundle) |
-| `notes` | no | Free-text annotation |
-
-### 1. Capture Scenarios
-
-A decision that should have been captured but wasn't — append a line to the matching category's JSONL:
-
-```bash
-# e.g., a debugging decision that was missed
-vi benchmark/scenarios/capture/should_capture/debugging/scenarios.jsonl
-```
-
-```json
-{"id": "debug-your-case-009", "category": "capture/should_capture/debugging", "language": "en", "input": "Paste the actual conversation here", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["key", "terms"]}, "recall_queries": [{"query": "A question you'd use to find this decision later", "should_match": true}]}
-```
-
-Noise that was incorrectly captured:
-
-```json
-{"id": "slop-your-case-007", "category": "capture/should_not_capture/slop", "language": "en", "input": "Content that should not have been captured", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-```
-
-### 2. Retriever Scenarios
-
-A query that failed to surface a known decision:
-
-```json
-{"id": "recall-semantic-your-case-007", "category": "recall/semantic_match", "language": "en", "seed_records": [{"title": "Title of the stored record", "domain": "architecture", "content": "Body of the stored record"}], "query": "The query that should have matched but didn't", "expected_match_titles": ["Title of the stored record"], "min_score": 0.3}
-```
-
-### 3. Extraction Scenarios
-
-Extraction produced the wrong structure (e.g., single decision extracted as phase_chain):
-
-```json
-{"id": "extract-single-your-case-005", "category": "extraction/single", "language": "en", "input": "Content to extract from", "expected_extraction_type": "single", "expected_fields": {"title_keywords": ["expected", "keywords"], "status_hint": "accepted", "min_alternatives": 1, "min_trade_offs": 1}}
-```
-
-### Tips
-
-- **Avoid duplicate ids** — increment the number past the highest existing one
-- **Paste real conversations** as input — real-world data makes better benchmarks than synthetic examples
-- **Redact sensitive data**: mask API keys, passwords, and PII before adding
-- Verify immediately after adding:
- ```bash
- python benchmark/runners/scribe_bench.py --category debugging -v
- ```
diff --git a/benchmark/datasets/embedding_token_length.json b/benchmark/datasets/embedding_token_length.json
deleted file mode 100644
index 415d1c89..00000000
--- a/benchmark/datasets/embedding_token_length.json
+++ /dev/null
@@ -1,202 +0,0 @@
-{
- "metadata": {
- "description": "Embedding token length benchmark dataset",
- "languages": {"en": 2, "ko": 2, "ja": 1, "fr": 1},
- "token_lengths": [128, 256, 512, 768],
- "variants": ["original", "duplicate", "evolution", "unrelated"]
- },
- "topics": [
- {
- "id": "arch_postgres_en",
- "language": "en",
- "domain": "architecture",
- "description": "Adopted PostgreSQL over MongoDB",
- "variants": {
- "original": {
- "128": "We decided to adopt PostgreSQL over MongoDB as our primary datastore. The key driver was the need for strong ACID transaction guarantees across our financial transaction processing pipeline. MongoDB's eventual consistency model posed unacceptable risks for balance calculations and audit trails. PostgreSQL also offered mature tooling for schema migrations and a robust ecosystem of extensions like PostGIS for our geolocation features.",
- "256": "We decided to adopt PostgreSQL over MongoDB as our primary datastore after a three-month evaluation period involving the backend and data engineering teams. The key driver was the need for strong ACID transaction guarantees across our financial transaction processing pipeline. MongoDB's eventual consistency model posed unacceptable risks for balance calculations and audit trails. We considered CockroachDB as a distributed SQL alternative but ruled it out due to operational complexity and higher infrastructure costs for our current scale. PostgreSQL also offered mature tooling for schema migrations via Flyway, a robust ecosystem of extensions like PostGIS for our geolocation features, and excellent support for JSON columns which addressed the semi-structured data use cases that initially drew us to MongoDB.",
- "512": "We decided to adopt PostgreSQL over MongoDB as our primary datastore after a three-month evaluation period involving the backend and data engineering teams. The key driver was the need for strong ACID transaction guarantees across our financial transaction processing pipeline. MongoDB's eventual consistency model posed unacceptable risks for balance calculations and audit trails, particularly after an incident in Q2 where a race condition in MongoDB caused a temporary ledger discrepancy of twelve thousand dollars that took two days to reconcile. We considered CockroachDB as a distributed SQL alternative but ruled it out due to operational complexity and higher infrastructure costs at our current scale of roughly fifty million rows. PostgreSQL also offered mature tooling for schema migrations via Flyway, a robust ecosystem of extensions like PostGIS for our geolocation features, and excellent support for JSON columns which addressed the semi-structured data use cases that initially drew us to MongoDB. The trade-off we accepted is that PostgreSQL requires more careful capacity planning for horizontal scaling compared to MongoDB's native sharding, but we determined that vertical scaling with read replicas would serve our needs for the next eighteen months. The team unanimously agreed after running comparative load tests showing PostgreSQL handled our p99 latency targets of under fifty milliseconds while MongoDB exceeded seventy milliseconds under equivalent write-heavy workloads.",
- "768": "We decided to adopt PostgreSQL over MongoDB as our primary datastore after a three-month evaluation period involving the backend and data engineering teams. The key driver was the need for strong ACID transaction guarantees across our financial transaction processing pipeline. MongoDB's eventual consistency model posed unacceptable risks for balance calculations and audit trails, particularly after an incident in Q2 where a race condition in MongoDB caused a temporary ledger discrepancy of twelve thousand dollars that took two days to reconcile. We considered CockroachDB as a distributed SQL alternative but ruled it out due to operational complexity and higher infrastructure costs at our current scale of roughly fifty million rows. We also briefly evaluated Amazon Aurora but preferred the vendor neutrality of open-source PostgreSQL given our multi-cloud roadmap. PostgreSQL offered mature tooling for schema migrations via Flyway, a robust ecosystem of extensions like PostGIS for our geolocation features, and excellent support for JSON columns which addressed the semi-structured data use cases that initially drew us to MongoDB. The trade-off we accepted is that PostgreSQL requires more careful capacity planning for horizontal scaling compared to MongoDB's native sharding, but we determined that vertical scaling with read replicas would serve our needs for the next eighteen months based on projected growth from three hundred to eight hundred concurrent users. The team unanimously agreed after running comparative load tests showing PostgreSQL handled our p99 latency targets of under fifty milliseconds while MongoDB exceeded seventy milliseconds under equivalent write-heavy workloads. We plan to revisit this decision if we exceed one billion rows or need multi-region active-active replication, at which point CockroachDB or Citus would become the natural next step. Migration is scheduled over six weeks with a parallel-write strategy to minimize downtime."
- },
- "duplicate": {
- "128": "The team chose PostgreSQL instead of MongoDB for the primary database. The main reason was our requirement for robust ACID transactions in the financial processing pipeline. MongoDB's eventual consistency introduced unacceptable risks for ledger accuracy and audit compliance. Additionally, PostgreSQL provided well-established migration tools and a rich extension ecosystem including PostGIS for location-based features.",
- "256": "The team chose PostgreSQL instead of MongoDB for the primary database following a three-month assessment by the backend and data engineering groups. The main reason was our requirement for robust ACID transactions in the financial processing pipeline. MongoDB's eventual consistency introduced unacceptable risks for ledger accuracy and audit compliance. CockroachDB was evaluated as a distributed SQL option but was eliminated because of operational overhead and elevated infrastructure expenses at our present scale. PostgreSQL further provided well-established migration tools through Flyway, a rich extension ecosystem including PostGIS for location-based features, and strong JSON column support that covered the semi-structured data scenarios which originally attracted us to MongoDB.",
- "512": "The team chose PostgreSQL instead of MongoDB for the primary database following a three-month assessment by the backend and data engineering groups. The main reason was our requirement for robust ACID transactions in the financial processing pipeline. MongoDB's eventual consistency introduced unacceptable risks for ledger accuracy and audit compliance, especially following a Q2 incident where a MongoDB race condition produced a temporary twelve-thousand-dollar ledger mismatch requiring two days to resolve. CockroachDB was evaluated as a distributed SQL option but was eliminated because of operational overhead and elevated infrastructure expenses at our present volume of approximately fifty million records. PostgreSQL further provided well-established migration tools through Flyway, a rich extension ecosystem including PostGIS for location-based features, and strong JSON column support that covered the semi-structured data scenarios which originally attracted us to MongoDB. The accepted compromise is that PostgreSQL demands more deliberate capacity planning for horizontal growth versus MongoDB's built-in sharding, but our analysis showed that vertical scaling with read replicas would meet our requirements for the next year and a half. The entire team reached consensus after comparative load testing demonstrated PostgreSQL meeting our sub-fifty-millisecond p99 latency goals while MongoDB surpassed seventy milliseconds under the same write-intensive conditions.",
- "768": "The team chose PostgreSQL instead of MongoDB for the primary database following a three-month assessment by the backend and data engineering groups. The main reason was our requirement for robust ACID transactions in the financial processing pipeline. MongoDB's eventual consistency introduced unacceptable risks for ledger accuracy and audit compliance, especially following a Q2 incident where a MongoDB race condition produced a temporary twelve-thousand-dollar ledger mismatch requiring two days to resolve. CockroachDB was evaluated as a distributed SQL option but was eliminated because of operational overhead and elevated infrastructure expenses at our present volume of approximately fifty million records. Amazon Aurora was also briefly considered but we preferred the vendor independence of open-source PostgreSQL in light of our multi-cloud strategy. PostgreSQL further provided well-established migration tools through Flyway, a rich extension ecosystem including PostGIS for location-based features, and strong JSON column support that covered the semi-structured data scenarios which originally attracted us to MongoDB. The accepted compromise is that PostgreSQL demands more deliberate capacity planning for horizontal growth versus MongoDB's built-in sharding, but our analysis showed that vertical scaling with read replicas would meet our requirements for the next year and a half given projected growth from three hundred to eight hundred simultaneous users. The entire team reached consensus after comparative load testing demonstrated PostgreSQL meeting our sub-fifty-millisecond p99 latency goals while MongoDB surpassed seventy milliseconds under the same write-intensive conditions. If we surpass one billion records or require multi-region active-active replication in the future, CockroachDB or Citus would be the logical progression. The migration is planned across six weeks using a parallel-write approach to minimize service interruption."
- },
- "evolution": {
- "128": "We are migrating from PostgreSQL to CockroachDB as our primary datastore. Our user base grew to multi-region and PostgreSQL's single-leader replication became a bottleneck. CockroachDB's distributed SQL with automatic sharding and multi-region active-active support addresses our new scale requirements while maintaining ACID guarantees.",
- "256": "We are migrating from PostgreSQL to CockroachDB as our primary datastore after eighteen months of running PostgreSQL in production. Our user base expanded to three geographic regions and PostgreSQL's single-leader replication became a latency bottleneck for users in Asia and Europe. CockroachDB's distributed SQL with automatic sharding and multi-region active-active replication addresses our new scale requirements while maintaining the ACID guarantees we originally chose PostgreSQL for. We evaluated Yugabyte as an alternative but CockroachDB's simpler operational model and better documentation won out. The trade-off is a fifteen percent increase in write latency for single-region operations, which we consider acceptable given the global consistency benefits.",
- "512": "We are migrating from PostgreSQL to CockroachDB as our primary datastore after eighteen months of running PostgreSQL in production. Our user base expanded from a single US region to three geographic regions covering North America, Europe, and Asia Pacific. PostgreSQL's single-leader replication became a latency bottleneck with p99 response times exceeding two hundred milliseconds for users in Singapore and Frankfurt. CockroachDB's distributed SQL with automatic sharding and multi-region active-active replication addresses our new scale requirements while maintaining the ACID guarantees we originally chose PostgreSQL for. We evaluated Yugabyte as an alternative but CockroachDB's simpler operational model, better Kubernetes integration, and more comprehensive documentation won out during our two-month proof of concept. The trade-off is a fifteen percent increase in write latency for single-region operations and roughly thirty percent higher infrastructure costs, which we consider acceptable given the global consistency benefits and elimination of complex replication lag management. Our data volume has grown from fifty million to one point two billion rows, validating the scaling concern we noted when originally choosing PostgreSQL.",
- "768": "We are migrating from PostgreSQL to CockroachDB as our primary datastore after eighteen months of running PostgreSQL in production. Our user base expanded from a single US region to three geographic regions covering North America, Europe, and Asia Pacific, growing from eight hundred to four thousand concurrent users. PostgreSQL's single-leader replication became a latency bottleneck with p99 response times exceeding two hundred milliseconds for users in Singapore and Frankfurt, well above our fifty-millisecond target. CockroachDB's distributed SQL with automatic sharding and multi-region active-active replication addresses our new scale requirements while maintaining the ACID guarantees we originally chose PostgreSQL for. We evaluated Yugabyte as an alternative but CockroachDB's simpler operational model, better Kubernetes integration, and more comprehensive documentation won out during our two-month proof of concept. We also revisited Amazon Aurora Global Database but again preferred vendor neutrality. The trade-off is a fifteen percent increase in write latency for single-region operations and roughly thirty percent higher infrastructure costs, which we consider acceptable given the global consistency benefits and elimination of complex replication lag management. Our data volume has grown from fifty million to one point two billion rows, validating the scaling concern we noted when originally choosing PostgreSQL. The migration is planned over eight weeks using a dual-write strategy with CockroachDB's built-in import tools. We expect to complete cutover by end of Q3 and will monitor closely for any query performance regressions given CockroachDB's different query optimizer characteristics compared to PostgreSQL."
- },
- "unrelated": {
- "128": "The design team adopted Figma as the primary interface design tool replacing Sketch. The main motivation was real-time collaboration across our distributed team spanning four time zones. Sketch's file-based workflow caused constant version conflicts and required manual handoff processes. Figma's browser-based approach eliminated installation dependencies and enabled instant feedback loops with product managers.",
- "256": "The design team adopted Figma as the primary interface design tool replacing Sketch after six months of growing friction with file-based workflows. The main motivation was real-time collaboration across our distributed team spanning four time zones in San Francisco, London, Bangalore, and Tokyo. Sketch's file-based workflow caused constant version conflicts and required manual handoff processes through Zeplin. Figma's browser-based approach eliminated installation dependencies, enabled instant feedback loops with product managers, and reduced design-to-developer handoff time from two days to four hours. We evaluated Adobe XD as well but found its collaboration features less mature. The trade-off is Figma's subscription cost is roughly forty percent higher than Sketch per seat, but the productivity gains justified the expense.",
- "512": "The design team adopted Figma as the primary interface design tool replacing Sketch after six months of growing friction with file-based workflows. The main motivation was real-time collaboration across our distributed team spanning four time zones in San Francisco, London, Bangalore, and Tokyo. Sketch's file-based workflow caused constant version conflicts particularly when multiple designers worked on the same component library simultaneously. The handoff process through Zeplin added an extra two days to every sprint cycle. Figma's browser-based approach eliminated installation dependencies across Mac and Windows, enabled instant feedback loops with product managers who could comment directly on designs, and reduced design-to-developer handoff time from two days to four hours through its built-in dev mode. We evaluated Adobe XD as well but found its collaboration features less mature and its plugin ecosystem smaller. The trade-off is Figma's subscription cost is roughly forty percent higher than Sketch per seat at twenty-five dollars versus fifteen dollars monthly, but the productivity gains justified the expense. Additionally, our existing Sketch component library of over three hundred components was migrated using an automated conversion tool with roughly ninety percent fidelity, requiring two weeks of manual cleanup for the remaining edge cases.",
- "768": "The design team adopted Figma as the primary interface design tool replacing Sketch after six months of growing friction with file-based workflows. The main motivation was real-time collaboration across our distributed team spanning four time zones in San Francisco, London, Bangalore, and Tokyo. Sketch's file-based workflow caused constant version conflicts particularly when multiple designers worked on the same component library simultaneously, resulting in an average of three merge conflicts per week that each took about ninety minutes to resolve. The handoff process through Zeplin added an extra two days to every sprint cycle because developers had to wait for designers to export and upload updated specs. Figma's browser-based approach eliminated installation dependencies across Mac and Windows, enabled instant feedback loops with product managers who could comment directly on designs, and reduced design-to-developer handoff time from two days to four hours through its built-in dev mode. We evaluated Adobe XD as well but found its collaboration features less mature and its plugin ecosystem smaller with roughly one thousand plugins versus Figma's three thousand. We also considered Penpot as an open-source alternative but its feature set was not yet mature enough for production design work. The trade-off is Figma's subscription cost is roughly forty percent higher than Sketch per seat at twenty-five dollars versus fifteen dollars monthly, but the productivity gains justified the expense given our team of twelve designers. Additionally, our existing Sketch component library of over three hundred components was migrated using an automated conversion tool with roughly ninety percent fidelity, requiring two weeks of manual cleanup for the remaining edge cases. We plan to build a shared design system in Figma that will serve as the single source of truth for all product teams."
- }
- }
- },
- {
- "id": "cache_redis_en",
- "language": "en",
- "domain": "infrastructure",
- "description": "Implemented Redis caching with 15-minute TTL",
- "variants": {
- "original": {
- "128": "We implemented a Redis caching layer with a fifteen-minute TTL for our product catalog API. Database queries were consuming sixty percent of request latency at p95. Redis reduced average response time from two hundred milliseconds to twelve milliseconds for cached endpoints. We chose Redis over Memcached because we needed sorted sets for our leaderboard feature and pub/sub for cache invalidation.",
- "256": "We implemented a Redis caching layer with a fifteen-minute TTL for our product catalog API after profiling revealed that database queries were consuming sixty percent of request latency at the p95 level. Redis reduced average response time from two hundred milliseconds to twelve milliseconds for cached endpoints, a ninety-four percent improvement. We chose Redis over Memcached because we needed sorted sets for our leaderboard feature and pub/sub for real-time cache invalidation across our four application servers. We considered Varnish as an HTTP-level cache but needed finer-grained control over cache keys based on user roles and permissions. The fifteen-minute TTL was chosen as a balance between freshness and hit rate, achieving an eighty-seven percent cache hit ratio in production while keeping stale data exposure within acceptable limits for our catalog use case.",
- "512": "We implemented a Redis caching layer with a fifteen-minute TTL for our product catalog API after profiling revealed that database queries were consuming sixty percent of request latency at the p95 level. Our catalog service handles approximately twelve thousand requests per minute during peak hours and the PostgreSQL backend was showing signs of connection pool exhaustion. Redis reduced average response time from two hundred milliseconds to twelve milliseconds for cached endpoints, a ninety-four percent improvement that also decreased our database connection usage by roughly seventy percent. We chose Redis over Memcached because we needed sorted sets for our leaderboard feature, pub/sub for real-time cache invalidation across our four application servers, and the ability to persist cache state across restarts. We considered Varnish as an HTTP-level cache but needed finer-grained control over cache keys based on user roles, subscription tiers, and geographic pricing variations. The fifteen-minute TTL was chosen after analyzing our catalog update frequency of approximately forty changes per day, balancing freshness against hit rate. We achieved an eighty-seven percent cache hit ratio in production while keeping stale data exposure within acceptable limits. The trade-off is increased infrastructure complexity with Redis Sentinel for high availability and the need for careful cache key design to avoid the thundering herd problem during TTL expiration of popular items.",
- "768": "We implemented a Redis caching layer with a fifteen-minute TTL for our product catalog API after profiling revealed that database queries were consuming sixty percent of request latency at the p95 level. Our catalog service handles approximately twelve thousand requests per minute during peak hours and the PostgreSQL backend was showing signs of connection pool exhaustion with sporadic timeout errors during flash sales. Redis reduced average response time from two hundred milliseconds to twelve milliseconds for cached endpoints, a ninety-four percent improvement that also decreased our database connection usage by roughly seventy percent and eliminated the timeout errors entirely. We chose Redis over Memcached because we needed sorted sets for our leaderboard feature, pub/sub for real-time cache invalidation across our four application servers, and the ability to persist cache state across restarts to avoid cold-start penalties after deployments. We also evaluated KeyDB as a multi-threaded Redis alternative but determined that standard Redis with pipelining was sufficient for our throughput requirements. We considered Varnish as an HTTP-level cache but needed finer-grained control over cache keys based on user roles, subscription tiers, and geographic pricing variations that would have required complex VCL configuration. The fifteen-minute TTL was chosen after analyzing our catalog update frequency of approximately forty changes per day, balancing freshness against hit rate. We achieved an eighty-seven percent cache hit ratio in production while keeping stale data exposure within acceptable limits given that price-sensitive updates use explicit cache invalidation via pub/sub rather than relying on TTL expiration. The trade-off is increased infrastructure complexity with Redis Sentinel for high availability and the need for careful cache key design to avoid the thundering herd problem during TTL expiration of popular items. We implemented a jittered TTL strategy adding random variance of up to two minutes to spread expiration across time windows."
- },
- "duplicate": {
- "128": "A Redis cache with fifteen-minute expiration was deployed for the product catalog API. Profiling showed database queries accounted for sixty percent of p95 latency. After deploying Redis, average response times dropped from two hundred to twelve milliseconds on cached routes. Redis was selected over Memcached due to our need for sorted sets in the leaderboard and pub/sub for invalidating cache entries.",
- "256": "A Redis cache with fifteen-minute expiration was deployed for the product catalog API after profiling showed database queries accounted for sixty percent of p95 latency. After deploying Redis, average response times dropped from two hundred to twelve milliseconds on cached routes, representing a ninety-four percent gain. Redis was selected over Memcached due to our need for sorted sets in the leaderboard and pub/sub for invalidating cache entries across four application servers. Varnish was evaluated as an HTTP-layer caching solution but rejected because we required granular cache key control tied to user roles and permission levels. We settled on a fifteen-minute TTL after weighing data freshness against cache performance, ultimately reaching an eighty-seven percent hit ratio while maintaining acceptable staleness levels for catalog data.",
- "512": "A Redis cache with fifteen-minute expiration was deployed for the product catalog API after profiling showed database queries accounted for sixty percent of p95 latency. Our catalog service processes around twelve thousand requests each minute at peak load and the PostgreSQL database was approaching connection pool limits. After deploying Redis, average response times dropped from two hundred to twelve milliseconds on cached routes, representing a ninety-four percent gain, while database connection consumption fell by about seventy percent. Redis was selected over Memcached due to our need for sorted sets in the leaderboard, pub/sub for invalidating cache entries across four application servers, and persistent storage to maintain cache state through restarts. Varnish was evaluated as an HTTP-layer caching solution but rejected because we required granular cache key control tied to user roles, subscription levels, and region-specific pricing. We settled on a fifteen-minute TTL after studying our catalog's update pattern of roughly forty modifications daily, ultimately reaching an eighty-seven percent hit ratio while maintaining acceptable staleness levels. The downside is added operational burden from running Redis Sentinel for failover and engineering effort required to design cache keys that prevent thundering herd scenarios when popular items expire simultaneously.",
- "768": "A Redis cache with fifteen-minute expiration was deployed for the product catalog API after profiling showed database queries accounted for sixty percent of p95 latency. Our catalog service processes around twelve thousand requests each minute at peak load and the PostgreSQL database was approaching connection pool limits, causing intermittent timeout failures during promotional events. After deploying Redis, average response times dropped from two hundred to twelve milliseconds on cached routes, representing a ninety-four percent gain, while database connection consumption fell by about seventy percent and timeout errors were fully eliminated. Redis was selected over Memcached due to our need for sorted sets in the leaderboard, pub/sub for invalidating cache entries across four application servers, and persistent storage to maintain cache state through restarts avoiding cold-start overhead after releases. KeyDB was also assessed as a multi-threaded Redis fork but standard Redis with request pipelining proved adequate for our load profile. Varnish was evaluated as an HTTP-layer caching solution but rejected because we required granular cache key control tied to user roles, subscription levels, and region-specific pricing, which would demand complicated VCL rules. We settled on a fifteen-minute TTL after studying our catalog's update pattern of roughly forty modifications daily, ultimately reaching an eighty-seven percent hit ratio while maintaining acceptable staleness levels since price-critical changes are propagated instantly through pub/sub rather than waiting for TTL to lapse. The downside is added operational burden from running Redis Sentinel for failover and engineering effort required to design cache keys that prevent thundering herd scenarios when popular items expire simultaneously. We addressed this by introducing randomized TTL jitter of up to two minutes to distribute cache expiration events."
- },
- "evolution": {
- "128": "We are replacing our Redis caching layer with a Cloudflare Workers KV edge cache. As our user base expanded globally, even Redis with regional replicas added forty milliseconds of latency for users far from our data centers. Edge caching brings content within five milliseconds of end users. We are moving cache invalidation from pub/sub to Cloudflare's purge API.",
- "256": "We are replacing our Redis caching layer with a Cloudflare Workers KV edge cache after our user base expanded to thirty countries. Even Redis with regional replicas in three availability zones added forty milliseconds of latency for users in Southeast Asia and Africa. Edge caching through Cloudflare's global network of over three hundred points of presence brings content within five milliseconds of end users. We are moving cache invalidation from Redis pub/sub to Cloudflare's purge API which propagates globally within seconds. The trade-off is losing Redis's sorted sets for leaderboards, which we are migrating to a separate Redis instance dedicated to real-time features. Our cache hit ratio improved from eighty-seven to ninety-three percent with edge caching.",
- "512": "We are replacing our Redis caching layer with a Cloudflare Workers KV edge cache after our user base expanded to thirty countries across six continents. Even Redis with regional replicas in three availability zones added forty milliseconds of latency for users in Southeast Asia and Africa due to the physical distance to our nearest data center. Edge caching through Cloudflare's global network of over three hundred points of presence brings content within five milliseconds of end users regardless of location. We evaluated AWS CloudFront and Fastly as alternatives but Cloudflare Workers KV offered the most cost-effective combination of edge computing and key-value storage for our catalog use case. We are moving cache invalidation from Redis pub/sub to Cloudflare's purge API which propagates globally within seconds. The trade-off is losing Redis's sorted sets for leaderboards and the rich data structures we relied on, which we are migrating to a separate Redis instance dedicated to real-time features. We also accept slightly higher eventual consistency windows of up to thirty seconds versus Redis's near-instant invalidation. Our cache hit ratio improved from eighty-seven to ninety-three percent with edge caching and our p99 latency for catalog reads dropped from fifty milliseconds to eight milliseconds globally.",
- "768": "We are replacing our Redis caching layer with a Cloudflare Workers KV edge cache after our user base expanded to thirty countries across six continents with peak traffic now reaching forty thousand requests per minute. Even Redis with regional replicas in three availability zones added forty milliseconds of latency for users in Southeast Asia and Africa due to the physical distance to our nearest data center in Singapore. Edge caching through Cloudflare's global network of over three hundred points of presence brings content within five milliseconds of end users regardless of location, a dramatic improvement for our growing international customer base. We evaluated AWS CloudFront and Fastly as alternatives but Cloudflare Workers KV offered the most cost-effective combination of edge computing and key-value storage for our catalog use case at roughly sixty percent lower cost than CloudFront with Lambda at Edge. We are moving cache invalidation from Redis pub/sub to Cloudflare's purge API which propagates globally within seconds. The trade-off is losing Redis's sorted sets for leaderboards and the rich data structures we relied on, which we are migrating to a separate Redis instance dedicated to real-time features that do not benefit from edge distribution. We also accept slightly higher eventual consistency windows of up to thirty seconds versus Redis's near-instant invalidation, though for our catalog data this is acceptable. Our cache hit ratio improved from eighty-seven to ninety-three percent with edge caching and our p99 latency for catalog reads dropped from fifty milliseconds to eight milliseconds globally. Infrastructure costs decreased by twenty-two percent despite the additional Cloudflare subscription because we were able to downsize our Redis cluster from six to two nodes."
- },
- "unrelated": {
- "128": "The engineering team standardized on Conventional Commits for all git commit messages across twelve repositories. Previously, inconsistent commit message formats made automated changelog generation impossible. The standard enforces type prefixes like feat, fix, and chore followed by a scope and description. This enabled automated semantic versioning through commitlint and standard-version tooling.",
- "256": "The engineering team standardized on Conventional Commits for all git commit messages across twelve repositories after months of inconsistent formatting that made automated changelog generation impossible. The standard enforces type prefixes like feat, fix, and chore followed by an optional scope in parentheses and a concise description. This enabled automated semantic versioning through commitlint integrated with our pre-commit hooks and standard-version for release management. We evaluated Angular commit conventions and Karma-style formats but Conventional Commits had the broadest tooling ecosystem. The trade-off is additional friction for developers who must remember the format, but IDE plugins and commitizen's interactive CLI reduced this barrier. After three months, ninety-five percent of commits follow the convention and our changelog quality improved dramatically.",
- "512": "The engineering team standardized on Conventional Commits for all git commit messages across twelve repositories after months of inconsistent formatting that made automated changelog generation impossible and made it difficult to determine the scope of changes in any given release. The standard enforces type prefixes like feat, fix, chore, docs, and refactor followed by an optional scope in parentheses and a concise description. Breaking changes are indicated with an exclamation mark before the colon or a footer. This enabled automated semantic versioning through commitlint integrated with our pre-commit hooks and standard-version for release management, replacing our previous manual process that required a release manager to review every pull request description to compile changelogs. We evaluated Angular commit conventions and Karma-style formats but Conventional Commits had the broadest tooling ecosystem and was framework-agnostic. The trade-off is additional friction for developers who must remember the format, but IDE plugins for VS Code and JetBrains plus commitizen's interactive CLI reduced this barrier significantly. We also added a CI check that blocks merges with non-conforming commit messages. After three months, ninety-five percent of commits follow the convention, our changelog quality improved dramatically, and we reduced the release preparation time from four hours to fifteen minutes.",
- "768": "The engineering team standardized on Conventional Commits for all git commit messages across twelve repositories after months of inconsistent formatting that made automated changelog generation impossible and made it difficult to determine the scope of changes in any given release. Previously, commit messages ranged from single words like fixed to paragraphs of stream-of-consciousness notes, with no consistent way to distinguish features from bug fixes or breaking changes. The Conventional Commits standard enforces type prefixes like feat, fix, chore, docs, and refactor followed by an optional scope in parentheses and a concise description. Breaking changes are indicated with an exclamation mark before the colon or a BREAKING CHANGE footer. This enabled automated semantic versioning through commitlint integrated with our pre-commit hooks and standard-version for release management, replacing our previous manual process that required a release manager to spend four hours reviewing every pull request description to compile changelogs for each release. We evaluated Angular commit conventions and Karma-style formats but Conventional Commits had the broadest tooling ecosystem, was framework-agnostic, and had clearer documentation. The trade-off is additional friction for developers who must remember the format, but IDE plugins for VS Code and JetBrains plus commitizen's interactive CLI reduced this barrier significantly. We also added a CI check that blocks merges with non-conforming commit messages, which initially caused some pushback from the team but was accepted after we demonstrated the time savings. After three months, ninety-five percent of commits follow the convention, our changelog quality improved dramatically, and we reduced the release preparation time from four hours to fifteen minutes. We plan to extend this to automate GitHub release notes and Slack notifications for each production deployment."
- }
- }
- },
- {
- "id": "pipeline_kafka_ko",
- "language": "ko",
- "domain": "data_engineering",
- "description": "데이터 파이프라인에 Kafka Streams 도입",
- "variants": {
- "original": {
- "128": "데이터 파이프라인의 실시간 처리를 위해 Apache Kafka Streams를 도입하기로 결정했다. 기존의 배치 기반 Spark 파이프라인은 데이터 지연이 평균 15분이었으며, 실시간 대시보드와 알림 시스템의 요구사항을 충족하지 못했다. Kafka Streams는 별도 클러스터 없이 애플리케이션 내에서 스트림 처리가 가능하다는 점이 결정적이었다.",
- "256": "데이터 파이프라인의 실시간 처리를 위해 Apache Kafka Streams를 도입하기로 결정했다. 기존의 배치 기반 Apache Spark 파이프라인은 데이터 지연이 평균 15분이었으며, 실시간 대시보드와 이상 거래 알림 시스템의 요구사항을 충족하지 못했다. Apache Flink도 검토했으나 별도의 클러스터 관리가 필요하고 운영 복잡도가 높아 현재 팀 규모로는 부담이 컸다. Kafka Streams는 이미 사용 중인 Kafka 인프라 위에서 애플리케이션 라이브러리로 동작하므로 추가 인프라 비용 없이 도입할 수 있었다. 트레이드오프로 Flink 대비 복잡한 윈도우 처리나 CEP 기능이 부족하지만, 현재 요구사항인 이벤트별 변환과 집계에는 충분했다.",
- "512": "데이터 파이프라인의 실시간 처리를 위해 Apache Kafka Streams를 도입하기로 결정했다. 기존의 배치 기반 Apache Spark 파이프라인은 하루 네 번 실행되며 데이터 지연이 평균 15분, 최악의 경우 45분까지 발생했다. 이로 인해 실시간 대시보드는 항상 뒤처진 데이터를 보여줬고, 이상 거래 탐지 알림이 지연되어 사기 거래 대응 시간이 목표 5분을 크게 초과하고 있었다. 대안으로 Apache Flink를 검토했으나 별도의 클러스터 구축과 운영이 필요했고, 현재 데이터 엔지니어 3명으로는 운영 부담이 과도했다. AWS Kinesis Data Analytics도 평가했으나 벤더 종속성과 Kafka 토픽에서 Kinesis로 데이터를 복제해야 하는 추가 복잡도가 문제였다. Kafka Streams는 이미 운영 중인 Kafka 클러스터 위에서 일반 자바 애플리케이션으로 동작하므로 추가 인프라 구축 없이 기존 배포 파이프라인에 통합할 수 있었다. 도입 후 데이터 지연은 평균 2초로 줄었고, 이상 거래 알림 응답 시간은 3분 이내로 개선되었다. 트레이드오프로 Flink 대비 복잡한 이벤트 처리나 세션 윈도우 기능이 제한적이지만, 현재 이벤트별 변환, 집계, 조인 수준의 요구사항에는 충분히 대응 가능하다.",
- "768": "데이터 파이프라인의 실시간 처리를 위해 Apache Kafka Streams를 도입하기로 결정했다. 기존의 배치 기반 Apache Spark 파이프라인은 하루 네 번 실행되며 데이터 지연이 평균 15분, 최악의 경우 45분까지 발생했다. 이로 인해 실시간 대시보드는 항상 뒤처진 데이터를 보여줬고, 이상 거래 탐지 알림이 지연되어 사기 거래 대응 시간이 목표 5분을 크게 초과하고 있었다. 비즈니스 팀에서는 고객 이탈 징후를 실시간으로 감지하고 싶다는 요구사항도 추가로 제기한 상태였다. 대안으로 Apache Flink를 2개월간 PoC로 검토했으나 별도의 TaskManager 클러스터 구축과 운영이 필요했고, 현재 데이터 엔지니어 3명으로는 운영 부담이 과도했다. AWS Kinesis Data Analytics도 평가했으나 벤더 종속성과 Kafka 토픽에서 Kinesis로 데이터를 복제해야 하는 추가 복잡도, 그리고 월 예상 비용이 Kafka Streams 대비 3배 높다는 점이 문제였다. Kafka Streams는 이미 운영 중인 Kafka 클러스터 위에서 일반 자바 애플리케이션으로 동작하므로 추가 인프라 구축 없이 기존 쿠버네티스 배포 파이프라인에 통합할 수 있었다. 도입 후 데이터 지연은 평균 2초로 줄었고, 이상 거래 알림 응답 시간은 3분 이내로 개선되었다. 처리량은 초당 약 5만 이벤트로 현재 피크 트래픽의 3배 여유가 있다. 트레이드오프로 Flink 대비 복잡한 이벤트 처리(CEP)나 세션 윈도우 기능이 제한적이지만, 현재 이벤트별 변환, 집계, 스트림-테이블 조인 수준의 요구사항에는 충분히 대응 가능하다. 향후 요구사항이 복잡해지면 Flink로의 전환을 재검토할 계획이다."
- },
- "duplicate": {
- "128": "실시간 데이터 처리를 위해 Kafka Streams를 채택했다. 기존 Spark 배치 파이프라인은 평균 15분의 데이터 지연을 보였고, 실시간 대시보드와 알림 시스템 요구를 만족시키지 못했다. Kafka Streams는 추가 클러스터 구축 없이 앱 내부에서 바로 스트림 처리를 수행할 수 있다는 장점이 핵심이었다.",
- "256": "실시간 데이터 처리를 위해 Kafka Streams를 채택했다. 기존 Spark 배치 파이프라인은 평균 15분의 데이터 지연을 보였고, 실시간 대시보드와 이상 거래 감지 시스템의 요건을 만족시키지 못했다. Flink를 대안으로 살펴봤으나 전용 클러스터를 별도로 구축해야 했고 운영 부담이 현재 팀 규모에 비해 과중했다. Kafka Streams는 기존에 사용하고 있던 Kafka 인프라에서 애플리케이션 라이브러리 형태로 작동하기 때문에 별도 인프라 투자 없이 적용할 수 있었다. Flink에 비해 복잡한 윈도우 처리나 CEP 기능은 부족하지만, 이벤트 단위 변환과 집계라는 현재 요구사항에는 충분히 부합했다.",
- "512": "실시간 데이터 처리를 위해 Kafka Streams를 채택했다. 기존 Spark 배치 파이프라인은 하루 4회 실행 기준으로 평균 15분, 최대 45분까지 데이터 지연이 발생했다. 이 때문에 실시간 대시보드는 늘 지난 데이터를 표시했고, 이상 거래 감지 알림 지연으로 사기 대응 시간이 목표인 5분을 크게 넘기고 있었다. Flink를 대안으로 살펴봤지만 전용 클러스터 구축이 필요했고 데이터 엔지니어 3명으로는 운영이 부담스러웠다. AWS Kinesis Data Analytics도 평가했으나 벤더 종속과 Kafka에서 Kinesis로의 데이터 복제 추가 복잡도가 걸림돌이었다. Kafka Streams는 기존 Kafka 클러스터에서 일반 자바 앱으로 동작해 추가 인프라 없이 기존 배포 환경에 바로 통합 가능했다. 도입 결과 데이터 지연은 평균 2초로 줄었고 이상 거래 알림 응답은 3분 이내로 개선되었다. Flink 대비 복잡한 이벤트 처리나 세션 윈도우 지원이 부족하지만, 현재의 이벤트 변환 및 집계 요구사항에는 충분하다.",
- "768": "실시간 데이터 처리를 위해 Kafka Streams를 채택했다. 기존 Spark 배치 파이프라인은 하루 4회 실행 기준으로 평균 15분, 최대 45분까지 데이터 지연이 발생했다. 이 때문에 실시간 대시보드는 늘 지난 데이터를 표시했고, 이상 거래 감지 알림 지연으로 사기 대응 시간이 목표인 5분을 크게 넘기고 있었다. 사업 부서에서는 고객 이탈 신호를 실시간으로 포착하고 싶다는 추가 요구도 있었다. Flink를 2개월간 PoC로 검토했지만 전용 TaskManager 클러스터 구축이 필요했고 데이터 엔지니어 3명으로는 운영이 과중했다. AWS Kinesis Data Analytics도 평가했으나 벤더 종속, Kafka에서 Kinesis로의 데이터 복제 복잡도, 월 비용이 Kafka Streams 대비 3배라는 점이 문제였다. Kafka Streams는 기존 Kafka 클러스터에서 일반 자바 앱으로 동작해 추가 인프라 없이 쿠버네티스 배포 환경에 바로 통합 가능했다. 도입 결과 데이터 지연은 평균 2초로 줄었고 이상 거래 알림 응답은 3분 이내로 개선되었다. 처리량은 초당 5만 이벤트로 피크 트래픽 대비 3배 여유가 있다. Flink 대비 CEP나 세션 윈도우 지원이 제한적이지만 현재의 이벤트 변환, 집계, 스트림-테이블 조인 요구에는 충분하다. 추후 요구사항이 복잡해지면 Flink 전환을 다시 고려할 계획이다."
- },
- "evolution": {
- "128": "Kafka Streams에서 Apache Flink로 스트림 처리 플랫폼을 전환하기로 결정했다. 실시간 사기 탐지에 복잡 이벤트 처리(CEP)가 필요해졌고, 세션 윈도우 기반의 사용자 행동 분석 요구가 추가되었다. Kafka Streams로는 이러한 고급 스트리밍 패턴을 구현하기 어려웠다.",
- "256": "Kafka Streams에서 Apache Flink로 스트림 처리 플랫폼을 전환하기로 결정했다. 지난 1년간 실시간 사기 탐지에 복잡 이벤트 처리(CEP) 패턴이 필요해졌고, 세션 윈도우 기반의 사용자 행동 분석과 실시간 ML 모델 서빙 연동 요구가 추가되었다. Kafka Streams로는 이런 고급 스트리밍 패턴을 구현하기가 제한적이었다. 팀 규모도 3명에서 7명으로 늘어나 Flink 클러스터 운영이 가능해졌다. 데이터 지연은 Kafka Streams의 2초에서 Flink의 500밀리초로 더 줄었고, CEP를 통한 패턴 매칭 정확도가 크게 향상되었다.",
- "512": "Kafka Streams에서 Apache Flink로 스트림 처리 플랫폼을 전환하기로 결정했다. 지난 1년간 요구사항이 크게 변화했다. 실시간 사기 탐지에 복잡 이벤트 처리(CEP) 패턴이 필요해졌고, 세션 윈도우 기반의 사용자 행동 분석, 실시간 ML 모델 서빙과의 연동, 그리고 여러 데이터 소스를 통합하는 스트림 조인이 추가되었다. Kafka Streams로는 이런 고급 스트리밍 패턴을 구현하기가 제한적이었으며, 특히 CEP 기능의 부재가 사기 탐지 정확도 개선의 병목이 되고 있었다. 결정적으로 팀 규모가 데이터 엔지니어 3명에서 7명으로 확대되어 Flink 클러스터의 운영이 현실적으로 가능해졌다. Flink on Kubernetes 구성을 채택하여 기존 인프라와의 통합을 최소화했다. 전환 후 데이터 지연은 Kafka Streams의 2초에서 Flink의 500밀리초로 줄었고, CEP 기반 패턴 매칭으로 사기 탐지율이 78퍼센트에서 94퍼센트로 향상되었다. 트레이드오프는 월 인프라 비용 40퍼센트 증가와 Flink의 학습 곡선이다.",
- "768": "Kafka Streams에서 Apache Flink로 스트림 처리 플랫폼을 전환하기로 결정했다. 지난 1년간 요구사항이 크게 변화했다. 실시간 사기 탐지에 복잡 이벤트 처리(CEP) 패턴이 필요해졌고, 세션 윈도우 기반의 사용자 행동 분석, 실시간 ML 모델 서빙과의 연동, 그리고 Kafka뿐 아니라 CDC와 외부 API를 포함한 다수 데이터 소스를 통합하는 복합 스트림 처리가 추가되었다. Kafka Streams로는 이런 고급 스트리밍 패턴을 구현하기가 제한적이었으며, 특히 CEP 기능의 부재가 사기 탐지 정확도 개선의 병목이 되고 있었다. 초당 처리량도 10만 이벤트를 넘어서면서 Kafka Streams의 단일 파티션 처리 모델의 한계가 드러났다. 결정적으로 팀 규모가 데이터 엔지니어 3명에서 7명으로 확대되어 Flink 클러스터의 운영이 현실적으로 가능해졌다. Flink on Kubernetes 구성을 채택하여 기존 인프라와의 통합을 최소화했고, Flink의 savepoint 기능으로 무중단 배포도 가능해졌다. 전환 후 데이터 지연은 Kafka Streams의 2초에서 Flink의 500밀리초로 줄었고, CEP 기반 패턴 매칭으로 사기 탐지율이 78퍼센트에서 94퍼센트로 향상되었다. 트레이드오프는 월 인프라 비용 40퍼센트 증가와 Flink의 학습 곡선이나, 사기 탐지 성능 개선으로 인한 비용 절감이 이를 상쇄한다. Kafka Streams는 단순한 이벤트 라우팅 용도로 일부 유지하되 핵심 처리는 모두 Flink로 이관할 계획이다."
- },
- "unrelated": {
- "128": "프론트엔드 상태 관리 라이브러리를 Redux에서 Zustand로 전환하기로 결정했다. Redux의 보일러플레이트 코드가 과도했고 새로운 팀원의 온보딩 시간이 평균 2주나 걸렸다. Zustand는 최소한의 API로 동일한 기능을 제공하며 번들 크기도 Redux 대비 90퍼센트 작다.",
- "256": "프론트엔드 상태 관리 라이브러리를 Redux에서 Zustand로 전환하기로 결정했다. Redux의 보일러플레이트 코드가 과도했고 새로운 팀원의 온보딩 시간이 평균 2주나 걸렸다. Zustand는 최소한의 API로 동일한 기능을 제공하며 번들 크기도 Redux 대비 90퍼센트 작다. Recoil과 Jotai도 검토했으나 Zustand가 TypeScript 지원과 미들웨어 생태계에서 가장 균형 잡힌 선택이었다. 전환 후 상태 관리 관련 코드가 40퍼센트 줄었고 신규 개발자 온보딩 시간이 2주에서 3일로 단축되었다. 트레이드오프는 Redux DevTools의 시간 여행 디버깅 기능을 일부 포기해야 했지만, Zustand의 간결한 코드 덕분에 디버깅 자체가 훨씬 간단해졌다.",
- "512": "프론트엔드 상태 관리 라이브러리를 Redux에서 Zustand로 전환하기로 결정했다. 기존 Redux 코드베이스는 액션 타입 정의, 리듀서, 미들웨어 설정 등 보일러플레이트가 전체 상태 관리 코드의 60퍼센트를 차지했고, 이로 인해 새로운 기능 추가 시 평균 4개의 파일을 수정해야 했다. 신규 팀원의 온보딩 시간도 평균 2주가 소요되었으며, Redux Saga의 제너레이터 패턴이 특히 진입 장벽이 높았다. Zustand는 최소한의 API로 동일한 글로벌 상태 관리 기능을 제공하며 번들 크기도 Redux 생태계 전체 대비 90퍼센트 작았다. Recoil과 Jotai도 검토했으나 Recoil은 아직 실험적 상태였고, Jotai는 atomic 모델이 기존 코드베이스와의 점진적 마이그레이션에 불리했다. Zustand가 TypeScript 지원, 미들웨어 생태계, 그리고 Redux에서의 점진적 마이그레이션 용이성에서 가장 균형 잡힌 선택이었다. 전환 후 상태 관리 관련 코드가 40퍼센트 줄었고 신규 개발자 온보딩 시간이 2주에서 3일로 단축되었다. 트레이드오프는 Redux DevTools의 시간 여행 디버깅을 일부 포기한 것이나 코드 간결화로 디버깅 필요성 자체가 줄었다.",
- "768": "프론트엔드 상태 관리 라이브러리를 Redux에서 Zustand로 전환하기로 결정했다. 기존 Redux 코드베이스는 액션 타입 정의, 리듀서, 미들웨어 설정 등 보일러플레이트가 전체 상태 관리 코드의 60퍼센트를 차지했고, 이로 인해 새로운 기능 추가 시 평균 4개의 파일을 수정해야 했다. 신규 팀원의 온보딩 시간도 평균 2주가 소요되었으며, Redux Saga의 제너레이터 패턴이 특히 진입 장벽이 높았다. 코드 리뷰에서도 상태 관리 관련 PR이 전체의 35퍼센트를 차지하며 병목이 되고 있었다. Zustand는 최소한의 API로 동일한 글로벌 상태 관리 기능을 제공하며 번들 크기도 Redux 생태계 전체 대비 90퍼센트 작았다. Recoil과 Jotai도 검토했으나 Recoil은 아직 Meta 내부에서도 실험적 단계였고, Jotai의 atomic 모델은 기존 코드베이스와의 점진적 마이그레이션에 불리했다. MobX도 후보였으나 데코레이터 기반 API가 팀의 함수형 프로그래밍 선호와 맞지 않았다. Zustand가 TypeScript 지원, 미들웨어 생태계, 그리고 Redux에서의 점진적 마이그레이션 용이성에서 가장 균형 잡힌 선택이었다. 6주에 걸쳐 모듈별로 점진적 마이그레이션을 진행했고, 전환 후 상태 관리 관련 코드가 40퍼센트 줄었으며 신규 개발자 온보딩 시간이 2주에서 3일로 단축되었다. 앱 초기 로드 시간도 번들 축소 효과로 1.2초에서 0.8초로 개선되었다. 트레이드오프는 Redux DevTools의 시간 여행 디버깅을 일부 포기한 것이나 코드 간결화로 디버깅 필요성 자체가 현저히 줄었다."
- }
- }
- },
- {
- "id": "privacy_aes_ko",
- "language": "ko",
- "domain": "security",
- "description": "개인정보 AES-256 암호화 적용",
- "variants": {
- "original": {
- "128": "개인정보보호법 준수를 위해 사용자 개인정보에 AES-256-GCM 암호화를 적용하기로 결정했다. 이름, 전화번호, 주소 등 식별 가능한 정보를 데이터베이스 레벨이 아닌 애플리케이션 레벨에서 필드 단위로 암호화한다. 키 관리는 AWS KMS를 사용하며 키 로테이션은 90일 주기로 자동 수행된다.",
- "256": "개인정보보호법과 ISMS-P 인증 요건 준수를 위해 사용자 개인정보에 AES-256-GCM 암호화를 적용하기로 결정했다. 이름, 전화번호, 주소, 이메일 등 식별 가능한 정보를 데이터베이스의 TDE가 아닌 애플리케이션 레벨에서 필드 단위로 암호화한다. 이는 DB 관리자도 평문 데이터에 접근할 수 없게 하여 내부자 위협을 최소화하기 위한 설계다. 키 관리는 AWS KMS의 봉투 암호화 방식을 사용하며, 데이터 키는 레코드별로 생성하고 마스터 키 로테이션은 90일 주기로 자동 수행된다. 트레이드오프로 검색 성능이 영향을 받지만 블라인드 인덱스를 도입하여 암호화된 필드에 대한 동등 비교 검색을 지원한다.",
- "512": "개인정보보호법과 ISMS-P 인증 요건 준수를 위해 사용자 개인정보에 AES-256-GCM 암호화를 적용하기로 결정했다. 최근 개인정보보호위원회의 감사에서 데이터베이스 수준 암호화만으로는 내부자 위협에 대한 보호가 불충분하다는 지적을 받은 것이 직접적 계기였다. 이름, 전화번호, 주소, 이메일, 주민등록번호 후반부 등 식별 가능한 정보를 데이터베이스의 TDE가 아닌 애플리케이션 레벨에서 필드 단위로 암호화한다. 이는 DBA를 포함한 인프라 관리자도 평문 데이터에 접근할 수 없게 하여 내부자 위협을 구조적으로 차단하기 위한 설계다. 대안으로 검토한 데이터베이스 수준 TDE는 성능 오버헤드가 낮지만 DB 접근 권한이 있으면 복호화가 가능해 보안 수준이 부족했고, 동형암호는 검색 기능 지원이 가능하지만 현재 기술 성숙도와 성능이 프로덕션 적용에 부적합했다. 키 관리는 AWS KMS의 봉투 암호화 방식을 사용하며, 데이터 키는 레코드별로 생성하고 마스터 키 로테이션은 90일 주기로 자동 수행된다. 트레이드오프로 암호화된 필드의 범위 검색이나 패턴 매칭이 불가능해지지만, HMAC 기반 블라인드 인덱스를 도입하여 동등 비교 검색은 지원한다. 암호화 적용 후 해당 API의 응답 시간이 평균 12밀리초 증가했으나 사용자 체감 범위 내이다.",
- "768": "개인정보보호법과 ISMS-P 인증 요건 준수를 위해 사용자 개인정보에 AES-256-GCM 암호화를 적용하기로 결정했다. 최근 개인정보보호위원회의 감사에서 데이터베이스 수준 암호화만으로는 내부자 위협에 대한 보호가 불충분하다는 지적을 받은 것이 직접적 계기였으며, 업계에서 내부자에 의한 개인정보 유출 사고가 연이어 발생한 상황도 반영되었다. 이름, 전화번호, 주소, 이메일, 주민등록번호 후반부 등 식별 가능한 정보를 데이터베이스의 TDE가 아닌 애플리케이션 레벨에서 필드 단위로 암호화한다. 이는 DBA를 포함한 인프라 관리자도 평문 데이터에 접근할 수 없게 하여 내부자 위협을 구조적으로 차단하기 위한 설계다. 대안으로 검토한 데이터베이스 수준 TDE는 성능 오버헤드가 낮지만 DB 접근 권한이 있으면 복호화가 가능해 감사 요건을 충족하지 못했다. 동형암호(FHE)는 암호화 상태에서의 검색이 가능하다는 장점이 있었으나 현재 기술 성숙도와 성능이 프로덕션 적용에 부적합했고, 3년 내 재검토를 계획하고 있다. 하시코프 Vault도 키 관리 대안으로 검토했으나 AWS 환경에서의 운영 복잡도와 추가 비용을 고려하여 AWS KMS를 선택했다. 키 관리는 AWS KMS의 봉투 암호화 방식을 사용하며, 데이터 키는 레코드별로 생성하고 마스터 키 로테이션은 90일 주기로 자동 수행된다. 키 접근 로그는 CloudTrail을 통해 실시간 모니터링되며 비정상 접근 패턴 감지 시 즉시 알림이 발송된다. 트레이드오프로 암호화된 필드의 범위 검색이나 패턴 매칭이 불가능해지지만, HMAC 기반 블라인드 인덱스를 도입하여 동등 비교 검색은 지원한다. 암호화 적용 후 해당 API의 응답 시간이 평균 12밀리초 증가했으나 사용자 체감 범위 내이며, 전체 시스템 처리량에는 영향이 없었다."
- },
- "duplicate": {
- "128": "개인정보보호법 대응을 위해 사용자 식별 정보에 AES-256-GCM 방식의 암호화를 도입했다. 이름, 연락처, 주소 등을 DB 레벨이 아닌 앱 레벨에서 필드별로 암호화 처리한다. 암호화 키는 AWS KMS로 관리하며 90일마다 자동으로 키 로테이션이 이루어진다.",
- "256": "개인정보보호법 및 ISMS-P 인증 충족을 위해 사용자 식별 정보에 AES-256-GCM 방식의 암호화를 도입했다. 이름, 연락처, 주소, 이메일 등 식별 데이터를 DB의 TDE 대신 앱 레벨에서 필드별로 암호화 처리한다. DB 관리자조차 원본 데이터에 접근하지 못하도록 하여 내부자 위협을 최소화하는 구조다. 암호화 키는 AWS KMS의 봉투 암호화로 관리하며, 데이터 키는 레코드 단위로 발급하고 마스터 키는 90일마다 자동 교체된다. 검색 성능 저하가 예상되었으나 블라인드 인덱스를 적용해 암호화 필드의 등가 비교 검색을 가능하게 했다.",
- "512": "개인정보보호법 및 ISMS-P 인증 충족을 위해 사용자 식별 정보에 AES-256-GCM 방식의 암호화를 도입했다. 개인정보보호위원회 감사에서 DB 레벨 암호화만으로는 내부자 위협 대응이 미흡하다는 지적이 직접적 동기가 되었다. 이름, 연락처, 주소, 이메일, 주민번호 뒷자리 등 식별 데이터를 DB의 TDE 대신 앱 레벨에서 필드별로 암호화 처리한다. DBA 포함 인프라 관리자가 원본에 접근할 수 없는 구조로 내부자 위협을 구조적으로 차단한다. DB 수준 TDE는 성능 부담은 적으나 DB 접근 권한으로 복호화 가능해 보안이 불충분했고, 동형암호는 암호문 상태 검색이 가능하나 성능과 성숙도가 프로덕션에 미달이었다. AWS KMS 봉투 암호화를 채택하여 데이터 키는 레코드별 생성, 마스터 키는 90일 자동 로테이션을 적용했다. 암호화 필드의 범위 검색이나 패턴 매칭은 불가하나 HMAC 블라인드 인덱스로 등가 비교 검색은 지원한다. 적용 후 API 응답 시간이 평균 12밀리초 늘었으나 체감 수준 이내다.",
- "768": "개인정보보호법 및 ISMS-P 인증 충족을 위해 사용자 식별 정보에 AES-256-GCM 방식의 암호화를 도입했다. 개인정보보호위원회 감사에서 DB 레벨 암호화만으로는 내부자 위협 대응이 미흡하다는 지적이 직접적 동기가 되었고, 동종 업계의 연이은 내부자 유출 사고도 배경이 되었다. 이름, 연락처, 주소, 이메일, 주민번호 뒷자리 등 식별 데이터를 DB의 TDE 대신 앱 레벨에서 필드별로 암호화 처리한다. DBA 포함 인프라 관리자가 원본에 접근할 수 없는 구조로 내부자 위협을 구조적으로 차단한다. DB 수준 TDE는 성능 부담은 적으나 DB 접근 권한으로 복호화 가능해 감사 요건 미충족이었고, 동형암호(FHE)는 암호문 상태 검색이 가능하나 현재 성능과 성숙도가 프로덕션에 미달이어서 3년 내 재검토 예정이다. 하시코프 Vault도 키 관리 대안으로 살펴봤으나 AWS 환경의 운영 복잡도와 비용을 고려해 AWS KMS를 택했다. 봉투 암호화 방식으로 데이터 키는 레코드별 생성, 마스터 키는 90일 자동 로테이션이며 CloudTrail로 키 접근을 실시간 감시하고 이상 패턴 시 즉시 통보한다. 암호화 필드의 범위 검색이나 패턴 매칭은 불가하나 HMAC 블라인드 인덱스로 등가 비교 검색은 지원한다. 적용 후 API 응답 시간이 평균 12밀리초 늘었으나 체감 수준 이내이며 전체 처리량에 영향은 없었다."
- },
- "evolution": {
- "128": "기존 AES-256-GCM 필드 암호화에서 동형암호(FHE) 기반의 검색 가능 암호화로 전환을 결정했다. 암호화된 상태에서 검색이 불가능한 문제로 블라인드 인덱스 유지보수 부담이 커졌고, 신규 검색 요건이 계속 추가되고 있었다. FHE 기술의 성숙도가 프로덕션 수준에 도달했다고 판단했다.",
- "256": "기존 AES-256-GCM 필드 암호화에서 동형암호(FHE) 기반의 검색 가능 암호화로 전환을 결정했다. 지난 2년간 암호화된 필드에 대한 신규 검색 요건이 분기마다 추가되면서 블라인드 인덱스의 유지보수 부담이 급격히 증가했다. 범위 검색이나 유사 매칭 등 블라인드 인덱스로는 지원 불가능한 쿼리 패턴도 요구되기 시작했다. FHE 기술이 크게 발전하여 암호문 상태에서의 연산 속도가 실용적 수준에 도달했고, CryptoLab의 enVector 플랫폼이 프로덕션급 FHE 검색을 지원하게 되었다. 전환 후 블라인드 인덱스 관리가 불필요해지고 암호화 상태 그대로 모든 검색이 가능해진다.",
- "512": "기존 AES-256-GCM 필드 암호화에서 동형암호(FHE) 기반의 검색 가능 암호화로 전환을 결정했다. 지난 2년간 암호화된 필드에 대한 신규 검색 요건이 분기마다 평균 3건씩 추가되면서 블라인드 인덱스의 유지보수 부담이 급격히 증가했다. 각 새로운 검색 패턴마다 별도의 HMAC 인덱스를 설계하고 기존 데이터를 마이그레이션해야 했으며, 범위 검색이나 유사 매칭 등 블라인드 인덱스로는 원천적으로 지원 불가능한 쿼리 패턴도 요구되기 시작했다. 2년 전 검토 당시 성능 부적합으로 보류했던 FHE 기술이 크게 발전하여 암호문 상태에서의 연산 속도가 실용적 수준에 도달했다. 특히 CryptoLab의 enVector 플랫폼이 FHE 기반 벡터 검색을 프로덕션급 성능으로 지원하게 되면서 전환이 현실적으로 가능해졌다. 전환 후 블라인드 인덱스 관리가 완전히 불필요해지고, 서버 관리자조차 검색 쿼리의 내용을 알 수 없는 진정한 제로 트러스트 아키텍처가 실현된다. 트레이드오프는 초기 암호화 비용이 AES 대비 높고 FHE 전문 인력 확보가 필요하다는 점이다.",
- "768": "기존 AES-256-GCM 필드 암호화에서 동형암호(FHE) 기반의 검색 가능 암호화로 전환을 결정했다. 지난 2년간 암호화된 필드에 대한 신규 검색 요건이 분기마다 평균 3건씩 추가되면서 블라인드 인덱스의 유지보수 부담이 급격히 증가했다. 각 새로운 검색 패턴마다 별도의 HMAC 인덱스를 설계하고 기존 데이터를 마이그레이션해야 했으며, 이 작업에만 엔지니어 1명이 풀타임으로 투입되고 있었다. 범위 검색, 유사도 매칭, 집계 쿼리 등 블라인드 인덱스로는 원천적으로 지원 불가능한 쿼리 패턴도 비즈니스팀에서 지속적으로 요구했다. 2년 전 검토 당시 성능 부적합으로 보류했던 FHE 기술이 크게 발전하여 암호문 상태에서의 연산 속도가 실용적 수준에 도달했다. 특히 CryptoLab의 enVector 플랫폼이 FHE 기반 벡터 검색을 프로덕션급 성능으로 지원하게 되면서 전환이 현실적으로 가능해졌다. PoC 결과 검색 지연이 평문 대비 3배 이내로, 사용자 체감 한계인 200밀리초 내에 들어왔다. 전환 후 블라인드 인덱스 관리가 완전히 불필요해지고, 서버 관리자조차 검색 쿼리의 내용을 알 수 없는 진정한 제로 트러스트 아키텍처가 실현된다. AWS KMS 키 관리 체계는 그대로 유지하되, 암호화 방식만 AES에서 FHE로 전환한다. 트레이드오프는 초기 암호화 비용이 AES 대비 높고 FHE 전문 인력 확보가 필요하다는 점이나, 블라인드 인덱스 유지보수 인력의 재배치로 상쇄된다."
- },
- "unrelated": {
- "128": "로그 수집 시스템을 ELK 스택에서 Loki와 Grafana 조합으로 전환하기로 결정했다. Elasticsearch 클러스터의 운영 비용이 월 350만원에 달했고 인덱스 관리에 전담 인력이 필요했다. Loki는 로그 텍스트를 인덱싱하지 않아 스토리지 비용이 ELK 대비 80퍼센트 절감된다.",
- "256": "로그 수집 시스템을 ELK 스택에서 Loki와 Grafana 조합으로 전환하기로 결정했다. Elasticsearch 클러스터의 운영 비용이 월 350만원에 달했고, 인덱스 수명 주기 관리와 클러스터 헬스 모니터링에 전담 인력이 필요했다. Loki는 로그 텍스트를 인덱싱하지 않고 레이블만 인덱싱하여 스토리지 비용이 ELK 대비 80퍼센트 절감된다. Datadog과 Splunk도 평가했으나 SaaS 비용이 우리 로그 볼륨에서는 월 800만원을 초과했다. 트레이드오프는 전문 검색 속도가 Elasticsearch보다 느리지만, 대부분의 운영 로그 조회가 시간 범위와 레이블 필터 기반이므로 실사용에 지장이 없다고 판단했다.",
- "512": "로그 수집 시스템을 ELK 스택에서 Loki와 Grafana 조합으로 전환하기로 결정했다. Elasticsearch 클러스터 3노드의 운영 비용이 월 350만원에 달했고, 인덱스 수명 주기 관리, 샤드 리밸런싱, 클러스터 헬스 모니터링에 DevOps 엔지니어의 주당 약 10시간이 투입되고 있었다. 특히 일 평균 50기가바이트의 로그 볼륨에서 30일 보존 정책을 유지하려면 지속적인 스토리지 확장이 필요했다. Loki는 로그 텍스트를 인덱싱하지 않고 메타데이터 레이블만 인덱싱하는 아키텍처로 스토리지 비용이 ELK 대비 80퍼센트 절감된다. Datadog과 Splunk도 평가했으나 우리 로그 볼륨 기준 SaaS 비용이 각각 월 800만원과 1200만원을 초과했다. Loki의 LogQL 쿼리 언어가 Elasticsearch의 KQL보다 학습 곡선이 있으나, 이미 Prometheus와 Grafana를 메트릭 모니터링에 사용하고 있어 통합 대시보드 구성이 가능하다는 점이 추가 장점이었다. 트레이드오프는 전문 검색 속도가 느리지만 실제 운영 로그 조회의 92퍼센트가 시간 범위와 레이블 필터 기반이므로 체감 영향은 미미했다.",
- "768": "로그 수집 시스템을 ELK 스택에서 Loki와 Grafana 조합으로 전환하기로 결정했다. Elasticsearch 클러스터 3노드의 운영 비용이 월 350만원에 달했고, 인덱스 수명 주기 관리, 샤드 리밸런싱, 클러스터 헬스 모니터링에 DevOps 엔지니어의 주당 약 10시간이 투입되고 있었다. 특히 일 평균 50기가바이트의 로그 볼륨에서 30일 보존 정책을 유지하려면 지속적인 스토리지 확장이 필요했고, 최근 블랙프라이데이 트래픽에서 로그 유실이 발생하면서 클러스터 안정성에도 문제가 드러났다. Loki는 로그 텍스트를 인덱싱하지 않고 메타데이터 레이블만 인덱싱하는 아키텍처로 스토리지 비용이 ELK 대비 80퍼센트 절감되며 S3 호환 오브젝트 스토리지를 백엔드로 사용해 사실상 무제한 확장이 가능하다. Datadog과 Splunk도 평가했으나 우리 로그 볼륨 기준 SaaS 비용이 각각 월 800만원과 1200만원을 초과했다. 자체 호스팅 Graylog도 검토했으나 결국 Elasticsearch 기반이라 근본적 비용 구조는 동일했다. Loki의 LogQL 쿼리 언어가 Elasticsearch의 KQL보다 학습 곡선이 있으나, 이미 Prometheus와 Grafana를 메트릭 모니터링에 사용하고 있어 통합 관찰가능성 플랫폼 구성이 가능하다는 점이 추가 장점이었다. 전환 후 로그 시스템 운영 비용이 월 70만원으로 줄었고, DevOps 팀의 로그 시스템 관련 업무 시간이 주 10시간에서 2시간으로 감소했다. 트레이드오프는 전문 검색 속도가 느리지만 실제 운영 로그 조회의 92퍼센트가 시간 범위와 레이블 필터 기반이므로 체감 영향은 미미했다."
- }
- }
- },
- {
- "id": "micro_split_ja",
- "language": "ja",
- "domain": "architecture",
- "description": "マイクロサービスのドメイン境界決定",
- "variants": {
- "original": {
- "128": "モノリシックアプリケーションをマイクロサービスに分割するにあたり、DDDのバウンデッドコンテキストに基づいてドメイン境界を決定した。注文、在庫、顧客、決済の四つのサービスに分離する。サービス間通信はイベント駆動とし、Apache Kafkaを採用した。同期通信が必要な箇所にはgRPCを使用する。",
- "256": "モノリシックアプリケーションをマイクロサービスに分割するにあたり、DDDのバウンデッドコンテキストに基づいてドメイン境界を決定した。注文管理、在庫管理、顧客管理、決済処理の四つのサービスに分離する。当初は機能別に八つのサービスへの分割を検討したが、チーム規模が二十名と小さく運用負荷が過大になると判断し四つに絞った。サービス間通信は非同期のイベント駆動を原則としApache Kafkaを採用した。注文確定から在庫引当のように即時応答が必要な箇所にはgRPCによる同期通信を使用する。REST APIも検討したがgRPCの型安全性とパフォーマンスが優れていたため採用しなかった。トレードオフとしてデバッグの複雑さが増すがJaegerによる分散トレーシングで対処する。",
- "512": "モノリシックアプリケーションをマイクロサービスに分割するにあたり、DDDのバウンデッドコンテキストに基づいてドメイン境界を決定した。既存のモノリスは五年間の開発で約三十万行に膨れ上がり、一つの変更がシステム全体のリグレッションテストを要する状態になっていた。デプロイ頻度は月二回に低下しビジネスの敏捷性を損なっていた。注文管理、在庫管理、顧客管理、決済処理の四つのサービスに分離する。当初は機能別に八つのサービスへの分割を検討したが、エンジニア二十名のチーム規模では各サービスの運用負荷が過大になると判断し四つに絞った。サービスごとに独立したデータベースを持つDatabase per Service パターンを採用し、データ整合性はSagaパターンで担保する。サービス間通信は非同期のイベント駆動を原則としApache Kafkaを採用した。注文確定から在庫引当のように百ミリ秒以内の応答が必要な箇所にはgRPCによる同期通信を使用する。REST APIも検討したがgRPCの型安全性、バイナリプロトコルによるパフォーマンス、そしてProtobufによるスキーマ管理が優れていた。トレードオフとしてシステム全体のデバッグ複雑さが大幅に増すが、Jaegerによる分散トレーシングとGrafanaダッシュボードで可観測性を確保する。移行は六ヶ月をかけてStrangler Figパターンで段階的に実施する。",
- "768": "モノリシックアプリケーションをマイクロサービスに分割するにあたり、DDDのバウンデッドコンテキストに基づいてドメイン境界を決定した。既存のモノリスはJava Spring Bootで構築され五年間の開発で約三十万行に膨れ上がっていた。一つの機能変更がシステム全体のリグレッションテストを要する状態で、デプロイ頻度は月二回に低下しビジネス部門からの機能追加要望に迅速に対応できなくなっていた。特に決済モジュールの変更が在庫管理に影響するような予期しない結合が頻発していた。注文管理、在庫管理、顧客管理、決済処理の四つのサービスに分離する。当初は配送、通知、分析、管理画面を含む八つのサービスへの分割を検討したが、エンジニア二十名のチーム規模では各サービスの運用負荷が過大になると判断し四つに絞った。配送と通知は注文管理サービスのサブモジュールとして残し、将来のチーム拡大時に分離する計画である。サービスごとに独立したPostgreSQLデータベースを持つDatabase per Serviceパターンを採用し、サービス間のデータ整合性はSagaパターンのコレオグラフィ方式で担保する。サービス間通信は非同期のイベント駆動を原則としApache Kafkaを採用した。注文確定から在庫引当のように百ミリ秒以内の応答が必要な箇所にはgRPCによる同期通信を使用する。REST APIも検討したがgRPCの型安全性、バイナリプロトコルによるパフォーマンス優位、そしてProtobufによる後方互換性のあるスキーマ管理が優れていた。トレードオフとしてシステム全体のデバッグ複雑さが大幅に増すが、Jaegerによる分散トレーシングとGrafanaダッシュボードで可観測性を確保する。移行は六ヶ月をかけてStrangler Figパターンで段階的に実施し、各フェーズでモノリスとマイクロサービスを並行稼働させて動作検証を行う。"
- },
- "duplicate": {
- "128": "モノリスからマイクロサービスへの移行にあたり、DDDバウンデッドコンテキストを基にサービス境界を定めた。注文、在庫、顧客、決済の四サービスに切り出す。サービス間はイベント駆動通信を基本としKafkaを導入した。即時性が求められる通信にはgRPCを選択している。",
- "256": "モノリスからマイクロサービスへの移行にあたり、DDDバウンデッドコンテキストを基にサービス境界を定めた。注文、在庫、顧客、決済の四サービスに切り出す。最初は八サービスへの細分化を検討していたが、二十名のチーム体制では運用コストが重すぎるため四つに集約した。サービス間は非同期のイベント駆動通信を基本としKafkaを導入した。注文から在庫引当のようにリアルタイム応答が必要なケースにはgRPCの同期通信を選択している。RESTも候補だったがgRPCの型保証と処理速度の優位性から見送った。デバッグの難度が上がるトレードオフはJaegerの分散トレーシングで補う方針である。",
- "512": "モノリスからマイクロサービスへの移行にあたり、DDDバウンデッドコンテキストを基にサービス境界を定めた。五年間で約三十万行に成長したモノリスは一箇所の修正でシステム全体のリグレッションテストが必要な状態であり、リリース頻度が月二回まで落ち込みビジネスの俊敏性を阻害していた。注文、在庫、顧客、決済の四サービスに切り出す。最初は八サービスへの細分化を検討していたが、エンジニア二十名の体制では運用コストが重すぎるため四つに集約した。各サービスに独立したデータベースを割り当てるDatabase per Serviceパターンを採用し、データの整合性はSagaパターンで維持する。サービス間は非同期のイベント駆動通信を基本としKafkaを導入した。注文から在庫引当のように百ミリ秒以内の応答が求められるケースにはgRPCの同期通信を選択している。RESTも候補だったがgRPCの型保証、バイナリプロトコルの速度、Protobufによるスキーマ管理の利点から見送った。デバッグの複雑化というトレードオフはJaeger分散トレーシングとGrafanaダッシュボードで可観測性を担保して対処する。六ヶ月をかけStrangler Figパターンで段階的に移行を進める。",
- "768": "モノリスからマイクロサービスへの移行にあたり、DDDバウンデッドコンテキストを基にサービス境界を定めた。Java Spring Bootで構築された既存モノリスは五年間で約三十万行に成長し、一箇所の修正でシステム全体のリグレッションテストが必要な状態であった。リリース頻度は月二回まで落ち込み、ビジネス部門の機能追加要望に迅速に応えられなくなっていた。決済モジュールの変更が在庫管理に予期せず波及するような結合問題も多発していた。注文、在庫、顧客、決済の四サービスに切り出す。配送や通知など含め八サービスへの細分化も検討したが、エンジニア二十名の体制では運用コストが重すぎるため四つに集約した。配送と通知は注文管理のサブモジュールとして残し、チーム拡大時に分離する計画だ。各サービスに独立したPostgreSQLを割り当てるDatabase per Serviceパターンを採用し、データ整合性はSagaのコレオグラフィ方式で維持する。サービス間は非同期イベント駆動を基本としKafkaを導入し、注文確定から在庫引当のように百ミリ秒以内の応答が求められるケースにはgRPCの同期通信を選んだ。RESTも候補だったがgRPCの型保証、バイナリプロトコルの速度、Protobufの後方互換スキーマ管理が優位だった。デバッグの複雑化というトレードオフはJaeger分散トレーシングとGrafanaで可観測性を確保して対処する。移行は六ヶ月をかけStrangler Figパターンで段階的に実施し、各フェーズでモノリスとマイクロサービスを並行稼働させ検証する。"
- },
- "evolution": {
- "128": "マイクロサービスアーキテクチャからモジュラーモノリスへの回帰を決定した。四つのマイクロサービスの運用で分散システム特有の障害が頻発し、チームの半分の時間がインフラ問題の対処に費やされていた。モジュラーモノリスでサービス境界は論理的に維持しつつ単一デプロイに戻す。",
- "256": "マイクロサービスアーキテクチャからモジュラーモノリスへの回帰を決定した。二年間四つのマイクロサービスを運用した結果、ネットワーク障害やデータ不整合など分散システム特有の問題が月平均五件発生し、チームの作業時間の四十パーセントがインフラ問題の対処に費やされていた。ビジネス機能の開発速度は分割前とほぼ変わらず、分割の恩恵を十分に享受できていなかった。モジュラーモノリスでDDDのバウンデッドコンテキストによるサービス境界は論理モジュールとして維持しつつ、単一のデプロイユニットに戻す。Kafka経由のイベント通信をインプロセスのイベントバスに置き換え、gRPCをダイレクトメソッド呼び出しに変更する。",
- "512": "マイクロサービスアーキテクチャからモジュラーモノリスへの回帰を決定した。二年間四つのマイクロサービスを運用した結果、ネットワーク障害、Kafkaパーティションのリバランス、Sagaの補償トランザクション失敗など分散システム特有の問題が月平均五件発生していた。チームの作業時間の四十パーセントがインフラ問題の調査と対処に費やされ、ビジネス機能の開発速度はモノリス時代とほぼ変わらないという皮肉な状況に陥っていた。特にチーム規模が二十名から十五名に縮小された後は、四つのサービスそれぞれのオンコール体制を維持するのが困難になった。モジュラーモノリスでDDDのバウンデッドコンテキストによるサービス境界は論理モジュールとして厳密に維持しつつ、単一のKubernetesデプロイメントに統合する。Kafka経由のイベント通信をSpring Application Eventsベースのインプロセスイベントバスに置き換え、gRPCコールをダイレクトメソッド呼び出しに変更する。データベースは各モジュール専用のスキーマに分離したまま単一PostgreSQLインスタンスでホストする。デプロイ頻度は週三回に向上する見込みであり、障害率の大幅な低下を期待している。",
- "768": "マイクロサービスアーキテクチャからモジュラーモノリスへの回帰を決定した。二年間四つのマイクロサービスを運用した結果、ネットワーク障害、Kafkaパーティションのリバランス、Sagaの補償トランザクション失敗など分散システム特有の問題が月平均五件発生していた。チームの作業時間の四十パーセントがインフラ問題の調査と対処に費やされ、ビジネス機能の開発速度はモノリス時代とほぼ変わらないという皮肉な状況に陥っていた。特にチーム規模が二十名から十五名に縮小された後は、四つのサービスそれぞれのオンコール体制を維持するのが困難で、深夜のKafka障害対応で疲弊するエンジニアが続出していた。当初のマイクロサービス分割判断自体は誤りではなかったが、現在のチーム規模とビジネス成長速度を考慮すると運用コストが価値を上回っている。モジュラーモノリスでDDDのバウンデッドコンテキストによるサービス境界は論理モジュールとして厳密に維持し、モジュール間はインターフェースを通じた疎結合を保ちつつ、単一のKubernetesデプロイメントに統合する。Kafka経由のイベント通信をSpring Application Eventsベースのインプロセスイベントバスに置き換え、gRPCコールをダイレクトメソッド呼び出しに変更する。データベースは各モジュール専用のスキーマに分離したまま単一PostgreSQLインスタンスでホストし、将来再び分割が必要になった際にはスキーマ分離が維持されているため移行が容易である。移行は三ヶ月で完了する見込みで、デプロイ頻度は月二回から週三回に向上し、障害率の大幅な低下を期待している。"
- },
- "unrelated": {
- "128": "社内のナレッジ管理プラットフォームをConfluenceからNotionに移行することを決定した。Confluenceのページ読み込みが平均三秒と遅く、検索精度も低いため社員の利用率が三十パーセントまで低下していた。Notionのデータベース機能とリアルタイム共同編集が決め手となった。",
- "256": "社内のナレッジ管理プラットフォームをConfluenceからNotionに移行することを決定した。Confluenceのページ読み込みが平均三秒と遅く検索精度も低いため、社員のアクティブ利用率が三十パーセントまで低下し重要な情報がSlackの断片的なメッセージに埋もれる状態だった。Notionのデータベース機能、リアルタイム共同編集、直感的なUIが決め手となった。SharePoint OnlineとGitBook も評価したが、SharePointは操作が複雑でエンジニア以外の利用が困難、GitBookは技術文書には強いが営業やマーケティング部門のユースケースに対応しきれなかった。移行後三ヶ月でアクティブ利用率は七十五パーセントに回復した。トレードオフはConfluenceの高度なマクロ機能の一部が失われることと、JiraとのネイティブIntegrationがなくなる点だが、Notion APIとZapier連携で補完している。",
- "512": "社内のナレッジ管理プラットフォームをConfluenceからNotionに移行することを決定した。Confluenceは七年間使用してきたが、ページ読み込みが平均三秒と遅く検索精度も低いため、社員のアクティブ利用率が三十パーセントまで低下していた。その結果、重要な意思決定の記録がSlackの断片的なメッセージやGoogle Docsの個人フォルダに散在し、組織知識の断絶が深刻化していた。新入社員のオンボーディングでも必要な情報を見つけるのに平均二日かかるという調査結果が出ていた。Notionのデータベース機能による構造化されたナレッジ管理、リアルタイム共同編集、直感的なUIが決め手となった。SharePoint OnlineとGitBookも評価した。SharePointはMicrosoft365との統合が魅力だったがUIの複雑さからエンジニア以外の部門での利用が困難と判断した。GitBookは技術ドキュメントには優れていたが営業やマーケティング部門のユースケースに対応しきれなかった。移行はチーム別に三ヶ月かけて段階的に実施し、移行後三ヶ月でアクティブ利用率は七十五パーセントに回復した。新入社員のオンボーディング情報検索時間も二日から四時間に短縮された。トレードオフはConfluenceの高度なマクロ機能の一部とJiraのネイティブ統合が失われる点だが、Notion APIとZapier連携で主要なワークフローは再現している。",
- "768": "社内のナレッジ管理プラットフォームをConfluenceからNotionに移行することを決定した。Confluenceは七年間使用してきたが、Atlassian Cloudへの移行後もページ読み込みが平均三秒と遅く検索精度も低いため、社員のアクティブ利用率が三十パーセントまで低下していた。その結果、重要な意思決定の記録がSlackの断片的なメッセージやGoogle Docsの個人フォルダに散在し、組織知識の断絶が深刻化していた。新入社員のオンボーディングでも必要な情報を見つけるのに平均二日かかるという調査結果が出ており、人事部門から改善要請が上がっていた。また年間のConfluenceライセンス費用が二百名規模で約五百万円に達しており、コスト面でも見直しの機運があった。Notionのデータベース機能による構造化されたナレッジ管理、リアルタイム共同編集、直感的なUIが決め手となった。特にデータベースビューの柔軟性が部門横断的な情報整理に有効であると評価された。SharePoint OnlineとGitBookも評価した。SharePointはMicrosoft365との統合が魅力だったがUIの複雑さからエンジニア以外の部門での利用が困難と判断した。GitBookは技術ドキュメントには優れていたが営業やマーケティング部門のプロジェクト管理やクライアント提案書の共同作成には不向きだった。移行はチーム別に三ヶ月かけて段階的に実施し、各チームにNotionチャンピオンを任命して支援した。移行後三ヶ月でアクティブ利用率は七十五パーセントに回復し、新入社員のオンボーディング情報検索時間も二日から四時間に短縮された。年間ライセンス費用も三百五十万円に削減できた。トレードオフはConfluenceの高度なマクロ機能の一部とJiraのネイティブ統合が失われる点だが、Notion APIとZapier連携で主要なワークフローは再現している。"
- }
- }
- },
- {
- "id": "cloud_multi_fr",
- "language": "fr",
- "domain": "infrastructure",
- "description": "Stratégie multi-cloud AWS/GCP",
- "variants": {
- "original": {
- "128": "Nous avons adopté une stratégie multi-cloud en combinant AWS et GCP pour renforcer la résilience de notre plateforme. La dépendance exclusive à AWS posait un risque de concentration après une panne régionale ayant impacté notre service pendant quatre heures. GCP a été choisi comme cloud secondaire grâce à son réseau mondial performant et ses tarifs compétitifs pour le calcul GPU nécessaire à nos modèles de machine learning.",
- "256": "Nous avons adopté une stratégie multi-cloud en combinant AWS et GCP pour renforcer la résilience de notre plateforme. La dépendance exclusive à AWS constituait un risque majeur de concentration, mis en évidence par une panne régionale en mars qui a impacté notre service pendant quatre heures avec un impact estimé à soixante-quinze mille euros de revenus perdus. GCP a été choisi comme cloud secondaire grâce à son réseau mondial performant, ses tarifs compétitifs pour le calcul GPU nécessaire à nos modèles de machine learning, et la maturité de BigQuery pour nos pipelines analytiques. Azure a été écarté en raison de son écosystème Kubernetes moins abouti que GKE et de la complexité de sa facturation. Le compromis accepté est un surcoût opérationnel de trente pour cent lié à la gestion de deux fournisseurs cloud et la nécessité de maintenir une couche d'abstraction via Terraform et des conteneurs Docker pour garantir la portabilité des déploiements.",
- "512": "Nous avons adopté une stratégie multi-cloud en combinant AWS et GCP pour renforcer la résilience de notre plateforme. La dépendance exclusive à AWS constituait un risque majeur de concentration, mis en évidence par une panne régionale en mars qui a impacté notre service pendant quatre heures avec un impact estimé à soixante-quinze mille euros de revenus perdus. Cette panne a également révélé que notre plan de reprise après sinistre était insuffisant car il reposait entièrement sur la redondance intra-AWS. GCP a été choisi comme cloud secondaire après une évaluation de trois mois comparant GCP et Azure. GCP offre un réseau mondial avec une latence inter-régions inférieure de vingt pour cent à celle d'Azure, des tarifs compétitifs pour le calcul GPU nécessaire à nos modèles de machine learning en production, et la maturité de BigQuery qui surpasse Redshift pour nos pipelines analytiques traitant deux téraoctets de données quotidiennes. Azure a été écarté en raison de son écosystème Kubernetes moins abouti que GKE et de la complexité de sa facturation qui rendait la prévision budgétaire difficile. L'architecture repose sur Terraform pour l'infrastructure as code multi-cloud, des conteneurs Docker pour la portabilité et Kubernetes pour l'orchestration sur les deux plateformes. Le compromis accepté est un surcoût opérationnel d'environ trente pour cent lié à la gestion de deux fournisseurs cloud, la formation des équipes sur GCP et la nécessité de maintenir une couche d'abstraction évitant les services propriétaires spécifiques. Les services critiques fonctionnent en actif-actif sur les deux clouds tandis que les services secondaires utilisent un modèle actif-passif avec basculement automatique.",
- "768": "Nous avons adopté une stratégie multi-cloud en combinant AWS et GCP pour renforcer la résilience de notre plateforme qui dessert environ deux millions d'utilisateurs actifs mensuels. La dépendance exclusive à AWS constituait un risque majeur de concentration, mis en évidence par une panne régionale en mars qui a impacté notre service pendant quatre heures avec un impact estimé à soixante-quinze mille euros de revenus perdus et une dégradation mesurable de la confiance client selon notre enquête NPS post-incident. Cette panne a également révélé que notre plan de reprise après sinistre était insuffisant car il reposait entièrement sur la redondance intra-AWS au sein de la région eu-west-1. Le conseil d'administration a mandaté l'équipe technique pour éliminer tout point de défaillance unique lié à un fournisseur cloud. GCP a été choisi comme cloud secondaire après une évaluation de trois mois comparant GCP et Azure sur des critères de performance, coût, écosystème Kubernetes et support technique. GCP offre un réseau mondial avec une latence inter-régions inférieure de vingt pour cent à celle d'Azure, des tarifs compétitifs pour le calcul GPU nécessaire à nos modèles de machine learning en production avec des instances T4 trente pour cent moins chères, et la maturité de BigQuery qui surpasse Redshift pour nos pipelines analytiques traitant deux téraoctets de données quotidiennes. Azure a été écarté en raison de son écosystème Kubernetes AKS moins abouti que GKE en termes de mise à jour automatique et d'intégration Istio, et de la complexité de sa facturation qui rendait la prévision budgétaire difficile avec des écarts mensuels de quinze pour cent par rapport aux estimations. L'architecture repose sur Terraform pour l'infrastructure as code multi-cloud, des conteneurs Docker pour la portabilité et Kubernetes pour l'orchestration sur les deux plateformes. Le compromis accepté est un surcoût opérationnel d'environ trente pour cent incluant la formation des équipes sur GCP estimée à six semaines et la nécessité de maintenir une couche d'abstraction évitant les services propriétaires comme Lambda ou Cloud Functions au profit de conteneurs standards. Les services critiques fonctionnent en actif-actif sur les deux clouds tandis que les services secondaires utilisent un modèle actif-passif avec basculement automatique testé mensuellement."
- },
- "duplicate": {
- "128": "Notre équipe a mis en place une architecture multi-cloud associant AWS et GCP afin d'améliorer la résilience de la plateforme. Le recours exclusif à AWS représentait un risque de concentration, démontré par une panne régionale ayant affecté notre service durant quatre heures. GCP a été retenu comme fournisseur secondaire pour la qualité de son réseau mondial et ses prix avantageux sur les instances GPU utilisées pour nos modèles de machine learning.",
- "256": "Notre équipe a mis en place une architecture multi-cloud associant AWS et GCP afin d'améliorer la résilience de la plateforme. Le recours exclusif à AWS représentait un risque de concentration important, démontré par une panne régionale en mars ayant affecté notre service durant quatre heures pour un manque à gagner estimé à soixante-quinze mille euros. GCP a été retenu comme fournisseur secondaire pour la qualité de son réseau mondial, ses prix avantageux sur les instances GPU utilisées pour nos modèles de machine learning, et les capacités de BigQuery pour nos traitements analytiques. Azure a été éliminé à cause de son environnement Kubernetes moins mature que GKE et de l'opacité de sa tarification. Le compromis consenti est une augmentation de trente pour cent des coûts d'exploitation due à la gestion de deux plateformes cloud et au besoin de maintenir une couche d'abstraction avec Terraform et Docker pour assurer la portabilité des déploiements.",
- "512": "Notre équipe a mis en place une architecture multi-cloud associant AWS et GCP afin d'améliorer la résilience de la plateforme. Le recours exclusif à AWS représentait un risque de concentration important, démontré par une panne régionale en mars ayant affecté notre service durant quatre heures pour un manque à gagner estimé à soixante-quinze mille euros. Cet incident a aussi mis en lumière les lacunes de notre plan de continuité qui dépendait uniquement de la redondance interne à AWS. GCP a été retenu comme fournisseur secondaire à l'issue d'une évaluation comparative de trois mois avec Azure. GCP propose un réseau mondial affichant une latence inter-régions vingt pour cent inférieure à celle d'Azure, des prix compétitifs pour le calcul GPU destiné à nos modèles de machine learning en production, et BigQuery dont la maturité dépasse Redshift pour nos pipelines traitant deux téraoctets de données par jour. Azure a été éliminé à cause de son environnement Kubernetes moins mature que GKE et de l'opacité de sa tarification compliquant les prévisions budgétaires. L'architecture s'appuie sur Terraform pour l'infrastructure as code, des conteneurs Docker pour la portabilité et Kubernetes pour l'orchestration sur les deux plateformes. Le compromis consenti est une hausse d'environ trente pour cent des coûts d'exploitation liée à la gestion de deux fournisseurs, la formation des équipes et le maintien d'une couche d'abstraction excluant les services propriétaires. Les services essentiels tournent en actif-actif sur les deux clouds tandis que les services non critiques fonctionnent en actif-passif avec bascule automatique.",
- "768": "Notre équipe a mis en place une architecture multi-cloud associant AWS et GCP afin d'améliorer la résilience de la plateforme qui sert environ deux millions d'utilisateurs actifs par mois. Le recours exclusif à AWS représentait un risque de concentration important, démontré par une panne régionale en mars ayant affecté notre service durant quatre heures pour un manque à gagner estimé à soixante-quinze mille euros et une baisse mesurable de la satisfaction client selon notre enquête NPS réalisée après l'incident. Cet incident a aussi mis en lumière les lacunes de notre plan de continuité qui dépendait uniquement de la redondance interne à AWS dans la région eu-west-1. La direction a demandé à l'équipe technique d'éliminer tout point de défaillance lié à un fournisseur unique. GCP a été retenu comme fournisseur secondaire à l'issue d'une évaluation de trois mois comparant GCP et Azure sur les critères de performance, coût, écosystème Kubernetes et qualité du support. GCP propose un réseau mondial affichant une latence inter-régions vingt pour cent inférieure à celle d'Azure, des instances GPU T4 trente pour cent moins onéreuses pour nos modèles de machine learning, et BigQuery dont la maturité dépasse Redshift pour nos pipelines traitant deux téraoctets de données par jour. Azure a été éliminé à cause de son environnement Kubernetes AKS moins abouti que GKE sur les mises à jour automatiques et l'intégration Istio, et de l'opacité de sa tarification avec des écarts mensuels de quinze pour cent par rapport aux prévisions. L'architecture s'appuie sur Terraform pour l'infrastructure as code, Docker pour la portabilité et Kubernetes pour l'orchestration multi-cloud. Le compromis consenti est une hausse d'environ trente pour cent des coûts d'exploitation incluant six semaines de formation GCP et le maintien d'une couche d'abstraction excluant les services propriétaires comme Lambda ou Cloud Functions au profit de conteneurs standards. Les services essentiels tournent en actif-actif sur les deux clouds tandis que les services non critiques fonctionnent en actif-passif avec bascule automatique testée chaque mois."
- },
- "evolution": {
- "128": "Nous abandonnons notre stratégie multi-cloud AWS et GCP pour nous recentrer exclusivement sur AWS. La complexité opérationnelle du multi-cloud a dépassé les bénéfices de résilience attendus. La maintenance de la couche d'abstraction Terraform multi-cloud consommait vingt pour cent du temps de l'équipe infrastructure. AWS a renforcé ses garanties de disponibilité avec les zones de disponibilité locales.",
- "256": "Nous abandonnons notre stratégie multi-cloud AWS et GCP pour nous recentrer exclusivement sur AWS après dix-huit mois d'exploitation en configuration bi-cloud. La complexité opérationnelle a dépassé les bénéfices de résilience attendus. La maintenance de la couche d'abstraction Terraform multi-cloud et le refus d'utiliser les services managés propriétaires consommaient vingt pour cent du temps de l'équipe infrastructure soit l'équivalent de deux ingénieurs à temps plein. De plus AWS a significativement renforcé ses garanties de disponibilité avec les zones de disponibilité locales et le programme de résilience multi-AZ rendant le risque de panne régionale nettement plus faible. Nous conservons une sauvegarde froide de nos données sur GCP Cloud Storage comme filet de sécurité à moindre coût. Cette simplification permettra d'utiliser les services natifs AWS comme Lambda et Aurora sans contrainte de portabilité.",
- "512": "Nous abandonnons notre stratégie multi-cloud AWS et GCP pour nous recentrer exclusivement sur AWS après dix-huit mois d'exploitation en configuration bi-cloud. L'analyse coûts-bénéfices a montré que la complexité opérationnelle dépassait les gains de résilience. La maintenance de la couche d'abstraction Terraform multi-cloud, l'interdiction d'utiliser les services managés propriétaires et la formation continue des équipes sur deux plateformes consommaient vingt pour cent du temps de l'équipe infrastructure soit l'équivalent de deux ingénieurs à temps plein. Les incidents liés à des incompatibilités subtiles entre les comportements AWS et GCP représentaient quarante pour cent de nos alertes d'astreinte. Parallèlement AWS a significativement renforcé ses garanties avec les zones de disponibilité locales, le programme AWS Resilience Hub et des SLA améliorés à quatre nines pour les services critiques, rendant le risque de panne régionale catastrophique nettement plus faible qu'au moment de notre décision initiale. Nous conservons une sauvegarde froide de nos données sur GCP Cloud Storage comme filet de sécurité à un coût marginal de trois cents euros par mois. Cette simplification permettra d'adopter les services natifs AWS comme Lambda pour les fonctions événementielles, Aurora Serverless pour la base de données et Step Functions pour l'orchestration, avec un gain de productivité estimé à trente pour cent. Le compromis est le retour à une dépendance mono-fournisseur mais avec des mesures de mitigation bien plus robustes qu'il y a dix-huit mois.",
- "768": "Nous abandonnons notre stratégie multi-cloud AWS et GCP pour nous recentrer exclusivement sur AWS après dix-huit mois d'exploitation en configuration bi-cloud desservant deux millions d'utilisateurs mensuels. L'analyse coûts-bénéfices réalisée par l'équipe architecture a montré que la complexité opérationnelle dépassait significativement les gains de résilience. La maintenance de la couche d'abstraction Terraform multi-cloud, l'interdiction d'utiliser les services managés propriétaires et la formation continue des équipes sur deux plateformes consommaient vingt pour cent du temps de l'équipe infrastructure soit l'équivalent de deux ingénieurs senior à temps plein représentant environ deux cent quarante mille euros de coût annuel. Les incidents liés à des incompatibilités subtiles entre les comportements AWS et GCP comme les différences de gestion du DNS ou les variations de performance réseau représentaient quarante pour cent de nos alertes d'astreinte nocturne. En dix-huit mois nous n'avons subi aucune panne régionale AWS justifiant le basculement vers GCP. Parallèlement AWS a significativement renforcé ses garanties avec les zones de disponibilité locales, le programme AWS Resilience Hub et des SLA améliorés à quatre nines pour les services critiques. Le conseil d'administration a approuvé le retour au mono-cloud après présentation de ces données. Nous conservons une sauvegarde froide de nos données sur GCP Cloud Storage comme filet de sécurité à un coût marginal de trois cents euros par mois et nous maintenons les images Docker permettant théoriquement un redéploiement sur GCP en cas de besoin critique. Cette simplification permettra d'adopter les services natifs AWS comme Lambda, Aurora Serverless et Step Functions avec un gain de productivité estimé à trente pour cent. La migration de retour vers AWS natif est planifiée sur quatre mois avec une approche service par service."
- },
- "unrelated": {
- "128": "L'équipe a décidé de remplacer notre processus de code review manuel par une intégration systématique de linters automatisés dans la pipeline CI. Les revues de code prenaient en moyenne deux jours par pull request, créant un goulot d'étranglement majeur. ESLint, Prettier et SonarQube sont maintenant exécutés automatiquement à chaque push, réduisant le temps de review humaine aux décisions architecturales.",
- "256": "L'équipe a décidé de remplacer notre processus de code review manuel par une intégration systématique de linters automatisés dans la pipeline CI. Les revues de code prenaient en moyenne deux jours par pull request créant un goulot d'étranglement majeur qui bloquait les livraisons. Les commentaires de review portaient à soixante-dix pour cent sur des problèmes de formatage et de style détectables par des outils automatiques. ESLint avec notre configuration personnalisée, Prettier pour le formatage et SonarQube pour l'analyse de qualité sont maintenant exécutés automatiquement à chaque push. Les reviewers humains se concentrent désormais exclusivement sur les décisions architecturales, la logique métier et les aspects de sécurité. Le temps moyen de review est passé de deux jours à quatre heures. Le compromis est l'investissement initial de deux semaines pour configurer et personnaliser les règles de linting aux conventions de l'équipe.",
- "512": "L'équipe a décidé de remplacer notre processus de code review manuel par une intégration systématique de linters automatisés dans la pipeline CI après avoir constaté que les revues de code constituaient le principal goulot d'étranglement de notre cycle de livraison. Les pull requests attendaient en moyenne deux jours avant d'être revues, et l'analyse de six mois de commentaires de review a révélé que soixante-dix pour cent des remarques portaient sur des problèmes de formatage, de conventions de nommage et de style détectables par des outils automatiques. Ce constat était particulièrement frustrant pour les auteurs des PR qui recevaient des retours superficiels avant même que la logique métier ne soit examinée. Nous avons déployé ESLint avec une configuration personnalisée alignée sur nos conventions internes, Prettier pour le formatage automatique avec résolution des conflits de style et SonarQube Community Edition pour l'analyse statique de qualité incluant la détection de code dupliqué et les métriques de complexité cyclomatique. Ces outils sont exécutés automatiquement à chaque push via GitHub Actions et bloquent le merge si des violations critiques sont détectées. Les reviewers humains se concentrent désormais exclusivement sur les décisions architecturales, la logique métier complexe et les aspects de sécurité. Le temps moyen de review est passé de deux jours à quatre heures et la satisfaction des développeurs mesurée par enquête interne a augmenté de vingt-cinq points. Le compromis a été un investissement initial de deux semaines pour configurer les règles et former l'équipe, plus quelques faux positifs de SonarQube qui ont nécessité des ajustements de règles pendant le premier mois.",
- "768": "L'équipe a décidé de remplacer notre processus de code review manuel par une intégration systématique de linters automatisés dans la pipeline CI après avoir constaté que les revues de code constituaient le principal goulot d'étranglement de notre cycle de livraison pour une équipe de vingt-cinq développeurs travaillant sur un monorepo TypeScript. Les pull requests attendaient en moyenne deux jours avant d'être revues et l'analyse détaillée de six mois de commentaires de review environ quatre mille deux cents au total a révélé que soixante-dix pour cent des remarques portaient sur des problèmes de formatage, de conventions de nommage et de style parfaitement détectables par des outils automatiques. Ce constat était particulièrement frustrant pour les auteurs qui recevaient des retours superficiels avant même que la logique métier ne soit examinée, créant une culture de review perçue comme bureaucratique plutôt que constructive. Nous avons déployé ESLint avec une configuration personnalisée de cent vingt règles alignées sur nos conventions internes incluant des règles spécifiques pour React hooks et les patterns async, Prettier pour le formatage automatique éliminant tout débat sur le style et SonarQube Community Edition pour l'analyse statique incluant la détection de code dupliqué les métriques de complexité cyclomatique et les vulnérabilités OWASP courantes. Ces trois outils sont exécutés automatiquement à chaque push via GitHub Actions en parallèle pour minimiser le temps de feedback avec un temps d'exécution moyen de quatre-vingt-dix secondes et bloquent le merge si des violations critiques sont détectées. Nous avons également ajouté husky pour les pre-commit hooks locaux permettant aux développeurs de corriger les problèmes avant même de pousser. Les reviewers humains se concentrent désormais exclusivement sur les décisions architecturales, la logique métier complexe et les aspects de sécurité. Le temps moyen de review est passé de deux jours à quatre heures, le nombre de cycles de review aller-retour a diminué de quarante pour cent et la satisfaction des développeurs mesurée par enquête interne a augmenté de vingt-cinq points. Le compromis a été un investissement initial de deux semaines plus quelques faux positifs de SonarQube nécessitant des ajustements pendant le premier mois."
- }
- }
- }
- ]
-}
diff --git a/benchmark/reports/.gitignore b/benchmark/reports/.gitignore
deleted file mode 100644
index a5a7700d..00000000
--- a/benchmark/reports/.gitignore
+++ /dev/null
@@ -1,3 +0,0 @@
-# Benchmark reports are generated, not committed
-*.json
-!.gitignore
diff --git a/benchmark/requirements.txt b/benchmark/requirements.txt
deleted file mode 100644
index 46965282..00000000
--- a/benchmark/requirements.txt
+++ /dev/null
@@ -1,9 +0,0 @@
-# rune benchmark dependencies
-
-# Embedding similarity (recall bench)
-fastembed>=0.7.4
-numpy>=1.24.0
-
-# Optional: direct API fallback (only needed with --api-key)
-# anthropic>=0.40.0
-# openai>=1.40.0
diff --git a/benchmark/runners/__init__.py b/benchmark/runners/__init__.py
deleted file mode 100644
index e69de29b..00000000
diff --git a/benchmark/runners/common.py b/benchmark/runners/common.py
deleted file mode 100644
index 912a297f..00000000
--- a/benchmark/runners/common.py
+++ /dev/null
@@ -1,149 +0,0 @@
-"""Shared utilities for rune benchmark runners."""
-
-from __future__ import annotations
-
-import json
-import sys
-from dataclasses import asdict, dataclass, field
-from pathlib import Path
-from typing import Any
-
-BENCHMARK_DIR = Path(__file__).resolve().parent.parent
-SCENARIOS_DIR = BENCHMARK_DIR / "scenarios"
-REPORTS_DIR = BENCHMARK_DIR / "reports"
-RUNE_DIR = BENCHMARK_DIR.parent # rune repo root
-
-
-@dataclass
-class ScenarioResult:
-    scenario_id: str
-    category: str
-    passed: bool
-    expected: Any = None
-    actual: Any = None
-    details: dict = field(default_factory=dict)
-
-    def to_dict(self) -> dict:
-        return asdict(self)
-
-
-@dataclass
-class BenchmarkReport:
-    bench_type: str
-    total: int = 0
-    passed: int = 0
-    failed: int = 0
-    results: list[ScenarioResult] = field(default_factory=list)
-    summary: dict = field(default_factory=dict)
-
-    @property
-    def accuracy(self) -> float:
-        return self.passed / self.total if self.total > 0 else 0.0
-
-    def add(self, result: ScenarioResult) -> None:
-        self.results.append(result)
-        self.total += 1
-        if result.passed:
-            self.passed += 1
-        else:
-            self.failed += 1
-
-    def compute_summary(self) -> None:
-        categories: dict[str, dict[str, int]] = {}
-        for r in self.results:
-            cat = r.category
-            if cat not in categories:
-                categories[cat] = {"total": 0, "passed": 0, "failed": 0}
-            categories[cat]["total"] += 1
-            if r.passed:
-                categories[cat]["passed"] += 1
-            else:
-                categories[cat]["failed"] += 1
-
-        self.summary = {
-            "overall_accuracy": round(self.accuracy, 4),
-            "total": self.total,
-            "passed": self.passed,
-            "failed": self.failed,
-            "by_category": {
-                cat: {
-                    **stats,
-                    "accuracy": round(
-                        stats["passed"] / stats["total"] if stats["total"] else 0, 4
-                    ),
-                }
-                for cat, stats in sorted(categories.items())
-            },
-        }
-
-    def save(self, path: Path | None = None) -> Path:
-        self.compute_summary()
-        if path is None:
-            REPORTS_DIR.mkdir(parents=True, exist_ok=True)
-            path = REPORTS_DIR / f"{self.bench_type}_report.json"
-        path.write_text(json.dumps(self.to_dict(), indent=2, ensure_ascii=False), encoding="utf-8")
-        return path
-
-    def to_dict(self) -> dict:
-        return {
-            "bench_type": self.bench_type,
-            "summary": self.summary,
-            "results": [r.to_dict() for r in self.results],
-        }
-
-    def print_summary(self) -> None:
-        self.compute_summary()
-        print(f"\n{'=' * 60}")
-        print(f" rune benchmark: {self.bench_type}")
-        print(f"{'=' * 60}")
-        print(
-            f" Overall: {self.passed}/{self.total} passed "
-            f"({self.summary['overall_accuracy']:.1%})"
-        )
-        print(f"{'─' * 60}")
-
-        for cat, stats in self.summary["by_category"].items():
-            status = "PASS" if stats["failed"] == 0 else "FAIL"
-            print(
-                f" [{status}] {cat}: "
-                f"{stats['passed']}/{stats['total']} ({stats['accuracy']:.0%})"
-            )
-
-        print(f"{'=' * 60}\n")
-
-        failed = [r for r in self.results if not r.passed]
-        if failed:
-            print(f" Failed scenarios ({len(failed)}):")
-            for r in failed:
-                print(f" - {r.scenario_id}: {r.details.get('reason', 'unknown')}")
-            print()
-
-
-def load_scenarios(category_prefix: str) -> list[dict]:
-    """Load all scenarios matching a category prefix from JSONL files."""
-    scenarios = []
-    search_dir = SCENARIOS_DIR / category_prefix
-    if not search_dir.exists():
-        print(f"Warning: directory not found: {search_dir}", file=sys.stderr)
-        return scenarios
-
-    for jsonl_file in sorted(search_dir.rglob("*.jsonl")):
-        with open(jsonl_file, encoding="utf-8") as f:
-            for line_num, line in enumerate(f, 1):
-                line = line.strip()
-                if not line:
-                    continue
-                try:
-                    scenarios.append(json.loads(line))
-                except json.JSONDecodeError as e:
-                    print(
-                        f"Warning: invalid JSON in {jsonl_file}:{line_num}: {e}",
-                        file=sys.stderr,
-                    )
-    return scenarios
-
-
-def check_title_keywords(title: str, keywords: list[str]) -> bool:
-    """Check if a title contains any of the expected keywords (case-insensitive)."""
-    title_lower = title.lower()
-    return any(kw.lower() in title_lower for kw in keywords)
diff --git a/benchmark/runners/embedding_bench.py b/benchmark/runners/embedding_bench.py
deleted file mode 100644
index 1e72a80c..00000000
--- a/benchmark/runners/embedding_bench.py
+++ /dev/null
@@ -1,242 +0,0 @@
-#!/usr/bin/env python3
-"""Embedding token length benchmark.
-
-Measures how reusable_insight token length affects:
-1. Novelty classification accuracy (duplicate/evolution/unrelated detection)
-2. Recall precision (retrieval quality for known decisions)
-
-Usage:
-    python benchmark/runners/embedding_bench.py
-    python benchmark/runners/embedding_bench.py --model Qwen/Qwen3-Embedding-0.6B
-    python benchmark/runners/embedding_bench.py --report benchmark/reports/embedding_token_length.json
-"""
-
-from __future__ import annotations
-
-import argparse
-import json
-import sys
-import time
-from pathlib import Path
-
-import numpy as np
-
-# Add rune root to path for imports
-RUNE_DIR = Path(__file__).resolve().parent.parent.parent
-sys.path.insert(0, str(RUNE_DIR))
-
-from agents.common.schemas.embedding import classify_novelty
-
-DATASET_PATH = Path(__file__).resolve().parent.parent / "datasets" / "embedding_token_length.json"
-REPORTS_DIR = Path(__file__).resolve().parent.parent / "reports"
-
-# Acceptable novelty classes for each variant type
-ACCEPTABLE_CLASSES = {
- "duplicate": {"near_duplicate"},
- "evolution": {"evolution", "related"},
- "unrelated": {"novel"},
-}
-
-TOKEN_LENGTHS = ["128", "256", "512", "768"]
-
-
-def load_model(model_name: str):
-    """Load sentence-transformers model."""
-    from sentence_transformers import SentenceTransformer
-    print(f"Loading model: {model_name}")
-    t0 = time.monotonic()
-    model = SentenceTransformer(model_name, trust_remote_code=True)
-    print(f"Model loaded in {time.monotonic() - t0:.1f}s")
-    return model
-
-
-def embed_all(model, texts: list[str]) -> np.ndarray:
-    """Embed all texts in one batch, return L2-normalized embeddings."""
-    embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)
-    return np.array(embeddings)
-
-
-def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
-    """Cosine similarity between two L2-normalized vectors."""
-    return float(np.dot(a, b))
-
-
-def run_benchmark(model, dataset: dict) -> dict:
- """Run full benchmark and return results."""
- topics = dataset["topics"]
-
- # Collect all texts for batch embedding
- text_index = {} # (topic_id, variant, length) -> index
- all_texts = []
- for topic in topics:
- for variant in ["original", "duplicate", "evolution", "unrelated"]:
- for length in TOKEN_LENGTHS:
- key = (topic["id"], variant, length)
- text_index[key] = len(all_texts)
- all_texts.append(topic["variants"][variant][length])
-
- print(f"\nEmbedding {len(all_texts)} texts...")
- embeddings = embed_all(model, all_texts)
- print(f"Embedding shape: {embeddings.shape}")
-
- # === Novelty Classification ===
- novelty_results = []
- for topic in topics:
- for variant in ["duplicate", "evolution", "unrelated"]:
- for length in TOKEN_LENGTHS:
- orig_idx = text_index[(topic["id"], "original", length)]
- var_idx = text_index[(topic["id"], variant, length)]
- sim = cosine_sim(embeddings[orig_idx], embeddings[var_idx])
- classification = classify_novelty(sim)
- correct = classification["class"] in ACCEPTABLE_CLASSES[variant]
- novelty_results.append({
- "topic_id": topic["id"],
- "language": topic["language"],
- "variant": variant,
- "length": length,
- "similarity": round(sim, 4),
- "predicted_class": classification["class"],
- "expected_classes": list(ACCEPTABLE_CLASSES[variant]),
- "correct": correct,
- })
-
- # === Recall Precision ===
- recall_results = []
- for length in TOKEN_LENGTHS:
- # Build "memory DB" from all originals at this length
- orig_embeddings = []
- orig_ids = []
- for topic in topics:
- idx = text_index[(topic["id"], "original", length)]
- orig_embeddings.append(embeddings[idx])
- orig_ids.append(topic["id"])
- orig_matrix = np.array(orig_embeddings)
-
- for topic in topics:
- for variant in ["duplicate", "evolution", "unrelated"]:
- var_idx = text_index[(topic["id"], variant, length)]
- query_emb = embeddings[var_idx]
-
- # Compute similarities to all originals
- sims = orig_matrix @ query_emb
- ranked = np.argsort(-sims)
- top_ids = [orig_ids[i] for i in ranked[:3]]
-
- target = topic["id"]
- recall_at_1 = target == top_ids[0]
- recall_at_3 = target in top_ids
-
- recall_results.append({
- "topic_id": topic["id"],
- "language": topic["language"],
- "variant": variant,
- "length": length,
- "recall_at_1": recall_at_1,
- "recall_at_3": recall_at_3,
- "top3": top_ids,
- "top3_sims": [round(float(sims[ranked[i]]), 4) for i in range(3)],
- })
-
- return {
- "embedding_dim": int(embeddings.shape[1]),
- "total_texts": len(all_texts),
- "novelty": novelty_results,
- "recall": recall_results,
- }
-
-
-def print_summary(results: dict) -> None:
- """Print summary tables to terminal."""
- novelty = results["novelty"]
- recall = results["recall"]
-
- print(f"\n{'='*62}")
- print(f"Dim: {results['embedding_dim']} | Texts: {results['total_texts']}")
- print(f"{'='*62}")
-
- # --- Novelty by length ---
- print(f"\nNovelty Classification Accuracy")
- print(f"{'Length':<8} | {'duplicate':<12} | {'evolution':<12} | {'unrelated':<12} | {'Overall':<10}")
- print("-" * 62)
- for length in TOKEN_LENGTHS:
- row = [r for r in novelty if r["length"] == length]
- by_var = {}
- for variant in ["duplicate", "evolution", "unrelated"]:
- subset = [r for r in row if r["variant"] == variant]
- acc = sum(r["correct"] for r in subset) / len(subset) if subset else 0
- by_var[variant] = acc
- overall = sum(r["correct"] for r in row) / len(row) if row else 0
- print(f"{length:<8} | {by_var['duplicate']:>10.1%} | {by_var['evolution']:>10.1%} | {by_var['unrelated']:>10.1%} | {overall:>8.1%}")
-
- # --- Recall by length ---
- print(f"\nRecall Precision")
- print(f"{'Length':<8} | {'R@1 dup':<10} | {'R@3 evo':<10} | {'Unrel !@3':<10} | {'Overall':<10}")
- print("-" * 55)
- for length in TOKEN_LENGTHS:
- row = [r for r in recall if r["length"] == length]
- dup = [r for r in row if r["variant"] == "duplicate"]
- evo = [r for r in row if r["variant"] == "evolution"]
- unr = [r for r in row if r["variant"] == "unrelated"]
- r1_dup = sum(r["recall_at_1"] for r in dup) / len(dup) if dup else 0
- r3_evo = sum(r["recall_at_3"] for r in evo) / len(evo) if evo else 0
- unr_not3 = sum(not r["recall_at_3"] for r in unr) / len(unr) if unr else 0
- overall = (r1_dup + r3_evo + unr_not3) / 3
- print(f"{length:<8} | {r1_dup:>8.1%} | {r3_evo:>8.1%} | {unr_not3:>8.1%} | {overall:>8.1%}")
-
- # --- By language ---
- print(f"\nBy Language")
- print(f"{'Lang':<6} | {'Nov. Acc':<10} | {'R@1 dup':<10} | {'R@3 evo':<10}")
- print("-" * 42)
- langs = sorted(set(r["language"] for r in novelty))
- for lang in langs:
- nov_sub = [r for r in novelty if r["language"] == lang]
- nov_acc = sum(r["correct"] for r in nov_sub) / len(nov_sub) if nov_sub else 0
- rec_sub = [r for r in recall if r["language"] == lang]
- dup_sub = [r for r in rec_sub if r["variant"] == "duplicate"]
- evo_sub = [r for r in rec_sub if r["variant"] == "evolution"]
- r1 = sum(r["recall_at_1"] for r in dup_sub) / len(dup_sub) if dup_sub else 0
- r3 = sum(r["recall_at_3"] for r in evo_sub) / len(evo_sub) if evo_sub else 0
- print(f"{lang:<6} | {nov_acc:>8.1%} | {r1:>8.1%} | {r3:>8.1%}")
-
- # --- Similarity distributions ---
- print(f"\nSimilarity Distribution (mean +/- std)")
- print(f"{'Length':<8} | {'duplicate':<18} | {'evolution':<18} | {'unrelated':<18}")
- print("-" * 68)
- for length in TOKEN_LENGTHS:
- row = [r for r in novelty if r["length"] == length]
- parts = []
- for variant in ["duplicate", "evolution", "unrelated"]:
- subset = [r["similarity"] for r in row if r["variant"] == variant]
- mean = np.mean(subset) if subset else 0
- std = np.std(subset) if subset else 0
- parts.append(f"{mean:.3f} +/- {std:.3f}")
- print(f"{length:<8} | {parts[0]:<18} | {parts[1]:<18} | {parts[2]:<18}")
-
-
-def main():
- parser = argparse.ArgumentParser(description="Embedding token length benchmark")
- parser.add_argument("--model", default="Qwen/Qwen3-Embedding-0.6B",
- help="Sentence-transformers model name")
- parser.add_argument("--dataset", default=str(DATASET_PATH),
- help="Path to dataset JSON")
- parser.add_argument("--report", default=None,
- help="Path to save JSON report")
- args = parser.parse_args()
-
- with open(args.dataset) as f:
- dataset = json.load(f)
-
- model = load_model(args.model)
- results = run_benchmark(model, dataset)
- print_summary(results)
-
- if args.report:
- report_path = Path(args.report)
- report_path.parent.mkdir(parents=True, exist_ok=True)
- with open(report_path, "w") as f:
- json.dump(results, f, indent=2, ensure_ascii=False)
- print(f"\nReport saved to {report_path}")
-
-
-if __name__ == "__main__":
- main()
diff --git a/benchmark/runners/retriever_bench.py b/benchmark/runners/retriever_bench.py
deleted file mode 100644
index 0ddfe5c9..00000000
--- a/benchmark/runners/retriever_bench.py
+++ /dev/null
@@ -1,143 +0,0 @@
-#!/usr/bin/env python3
-"""Retriever benchmark runner.
-
-Evaluates retriever quality using offline embedding similarity.
-FHE is transparent to similarity scores, so offline cosine similarity
-accurately predicts enVector Cloud recall performance.
-
-Usage:
- python benchmark/runners/retriever_bench.py
- python benchmark/runners/retriever_bench.py --category exact_match
- python benchmark/runners/retriever_bench.py --report benchmark/reports/retriever.json
-"""
-
-from __future__ import annotations
-
-import argparse
-import sys
-from pathlib import Path
-
-RUNE_DIR = Path(__file__).resolve().parent.parent.parent
-
-sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
-sys.path.insert(0, str(RUNE_DIR / "agents"))
-
-from runners.common import (
- BenchmarkReport,
- ScenarioResult,
- load_scenarios,
-)
-
-
-def evaluate_offline(scenarios: list[dict]) -> BenchmarkReport:
-    """Evaluate recall using embedding similarity (no server needed)."""
-    from common.embedding_service import EmbeddingService
-
-    embedding_service = EmbeddingService()
-    report = BenchmarkReport(bench_type="retriever")
-
-    for i, scenario in enumerate(scenarios):
-        sid = scenario["id"]
-        seed_records = scenario["seed_records"]
-        query = scenario["query"]
-        expected_titles = scenario.get("expected_match_titles", [])
-        min_score = scenario.get("min_score", 0.35)
-
-        print(f" [{i + 1}/{len(scenarios)}] {sid}...", end=" ", flush=True)
-
-        # Embed seed records
-        record_embeddings = []
-        for record in seed_records:
-            text = f"{record['title']}. {record['content']}"
-            emb = embedding_service.embed(text)
-            record_embeddings.append((record["title"], emb))
-
-        # Embed query
-        query_emb = embedding_service.embed(query)
-
-        # Compute similarities
-        scores = []
-        for title, rec_emb in record_embeddings:
-            sim = embedding_service.cosine_similarity(query_emb, rec_emb)
-            scores.append((title, sim))
-
-        scores.sort(key=lambda x: x[1], reverse=True)
-
-        matched_titles = [t for t, s in scores if s >= min_score]
-        hits = [t for t in expected_titles if t in matched_titles]
-        passed = len(hits) == len(expected_titles)
-
-        # MRR
-        mrr_values = []
-        for expected_title in expected_titles:
-            for rank, (title, _) in enumerate(scores, 1):
-                if title == expected_title:
-                    mrr_values.append(1.0 / rank)
-                    break
-            else:
-                mrr_values.append(0.0)
-        mrr = sum(mrr_values) / len(mrr_values) if mrr_values else 0.0
-
-        details: dict = {
-            "scores": [(t, round(s, 4)) for t, s in scores],
-            "matched_titles": matched_titles,
-            "hits": hits,
-            "mrr": round(mrr, 4),
-            "min_score": min_score,
-        }
-        if not passed:
-            missed = [t for t in expected_titles if t not in matched_titles]
-            details["reason"] = f"Missing matches: {missed}"
-
-        report.add(
-            ScenarioResult(
-                scenario_id=sid,
-                category=scenario["category"],
-                passed=passed,
-                expected=expected_titles,
-                actual=matched_titles,
-                details=details,
-            )
-        )
-        print("PASS" if passed else "FAIL")
-
-    return report
-
-
-def main() -> None:
- parser = argparse.ArgumentParser(description="Rune retriever benchmark")
- parser.add_argument(
- "--report", type=Path, default=None, help="Save report to this path"
- )
- parser.add_argument(
- "--category",
- default=None,
- help="Filter to specific category (e.g. 'exact_match', 'semantic_match')",
- )
- args = parser.parse_args()
-
- all_scenarios = load_scenarios("recall")
-
- if args.category:
- all_scenarios = [
- s for s in all_scenarios if args.category in s["category"]
- ]
-
- if not all_scenarios:
- print("No retriever scenarios found.", file=sys.stderr)
- sys.exit(1)
-
- print(f"=== Retriever Benchmark ({len(all_scenarios)} scenarios) ===\n")
-
- report = evaluate_offline(all_scenarios)
- report.print_summary()
-
- if args.report:
- saved = report.save(args.report)
- print(f"Report saved to: {saved}")
- else:
- report.save()
-
-
-if __name__ == "__main__":
- main()
diff --git a/benchmark/runners/scribe_bench.py b/benchmark/runners/scribe_bench.py
deleted file mode 100644
index 7626ac51..00000000
--- a/benchmark/runners/scribe_bench.py
+++ /dev/null
@@ -1,658 +0,0 @@
-#!/usr/bin/env python3
-"""Scribe benchmark runner.
-
-Tests the agent-delegated capture pipeline by feeding the actual scribe prompt
-+ scenario input to an agent CLI, then scoring the output JSON against expectations.
-
-This benchmarks what actually determines capture quality in v0.2.0:
-the scribe prompt's ability to guide the agent's policy evaluation and extraction.
-
-Usage:
- # Default: use claude CLI (no API key needed)
- python benchmark/runners/scribe_bench.py
-
- # Use a specific agent CLI
- python benchmark/runners/scribe_bench.py --agent gemini
- python benchmark/runners/scribe_bench.py --agent codex
-
- # Capture only / extraction only
- python benchmark/runners/scribe_bench.py --mode capture
- python benchmark/runners/scribe_bench.py --mode extraction
-
- # Filter by category
- python benchmark/runners/scribe_bench.py --category pr_review
-
- # Fallback: direct API call (for CI/automation without CLI auth)
- python benchmark/runners/scribe_bench.py --api-key $ANTHROPIC_API_KEY --provider anthropic
-
- # Save report
- python benchmark/runners/scribe_bench.py --report benchmark/reports/scribe.json
-"""
-
-from __future__ import annotations
-
-import argparse
-import json
-import re
-import shlex
-import subprocess
-import sys
-import time
-from pathlib import Path
-
-sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
-from runners.common import (
- RUNE_DIR,
- BenchmarkReport,
- ScenarioResult,
- check_title_keywords,
- load_scenarios,
-)
-
-# Known agent CLI configurations.
-# Each maps to (command, args) where the prompt is piped via stdin.
-AGENT_CLI = {
- "claude": (["claude"], ["-p"]),
- "gemini": (["gemini"], []),
- "codex": (["codex"], ["-q"]),
-}
-
-# Default models for direct API fallback
-API_MODELS = {
- "anthropic": "claude-haiku-4-5-20251001",
- "openai": "gpt-4o-mini",
-}
-
-VERBOSE = False
-
-
-def load_scribe_prompt() -> str:
- """Load and extract the evaluation-relevant sections from scribe.md.
-
- Strips out the Activation Check, Step 3 (MCP tool call), Handling Results,
- and Rules sections — the benchmark only needs Steps 1-2 (policy + extraction).
- """
- scribe_path = RUNE_DIR / "agents" / "claude" / "scribe.md"
- if not scribe_path.exists():
- print(f"Error: scribe prompt not found at {scribe_path}", file=sys.stderr)
- sys.exit(1)
-
- full_text = scribe_path.read_text()
-
- # Extract Steps 1-2 only (policy evaluation + structured extraction).
- # Cut everything before "## Step 1" and everything from "## Step 3" onward.
- sections_to_keep: list[str] = []
- current_section: list[str] = []
- keeping = False
-
- for line in full_text.splitlines():
- if line.startswith("## Step 1"):
- keeping = True
- elif line.startswith("## Step 3"):
- # Save accumulated section and stop
- if current_section:
- sections_to_keep.append("\n".join(current_section))
- keeping = False
- break
-
- if keeping:
- current_section.append(line)
-
- if current_section and keeping:
- sections_to_keep.append("\n".join(current_section))
-
- if not sections_to_keep:
- # Fallback: use full text if section markers changed
- return full_text
-
- return "\n\n".join(sections_to_keep)
-
-
-def build_evaluation_prompt(scribe_prompt: str, input_text: str) -> str:
- """Build the prompt that simulates the scribe's evaluation of a message."""
- return f"""You are evaluating a workplace message for organizational memory capture.
-
-{scribe_prompt}
-
----
-
-Evaluate the following message. Output ONLY a single JSON object — no explanation, no markdown fences, no other text. Either a rejection:
-{{"tier2": {{"capture": false, "reason": "...", "domain": "general"}}}}
-Or a full extraction (Format A, B, or C as described above).
-
-Message:
-{input_text}"""
-
-
-def call_agent_cli(prompt: str, agent: str) -> str:
-    """Call an agent CLI, piping the prompt via stdin. Returns response text."""
-    if agent in AGENT_CLI:
-        cmd, args = AGENT_CLI[agent]
-        full_cmd = cmd + args
-    else:
-        # Unknown agent: try splitting as a shell command
-        full_cmd = shlex.split(agent)
-
-    result = subprocess.run(
-        full_cmd,
-        input=prompt,
-        capture_output=True,
-        text=True,
-        timeout=120,
-    )
-
-    if result.returncode != 0:
-        stderr = result.stderr.strip()
-        raise RuntimeError(
-            f"{' '.join(full_cmd)} exited with code {result.returncode}: {stderr}"
-        )
-
-    return result.stdout
-
-
-def call_api(prompt: str, provider: str, api_key: str, model: str) -> str:
- """Fallback: call LLM via API SDK directly."""
- if provider == "anthropic":
- import anthropic
-
- client = anthropic.Anthropic(api_key=api_key)
- response = client.messages.create(
- model=model,
- max_tokens=2048,
- messages=[{"role": "user", "content": prompt}],
- )
- return response.content[0].text
-
- elif provider == "openai":
- import openai
-
- client = openai.OpenAI(api_key=api_key)
- response = client.chat.completions.create(
- model=model,
- max_tokens=2048,
- messages=[{"role": "user", "content": prompt}],
- )
- return response.choices[0].message.content or ""
-
- else:
- raise ValueError(f"Unsupported provider: {provider}")
-
-
-def call_llm(
-    prompt: str,
-    *,
-    agent: str | None = None,
-    provider: str | None = None,
-    api_key: str | None = None,
-    model: str | None = None,
-) -> str:
-    """Unified LLM call: prefers agent CLI, falls back to API."""
-    if api_key and provider:
-        mdl = model or API_MODELS.get(provider, "claude-haiku-4-5-20251001")
-        return call_api(prompt, provider, api_key, mdl)
-    else:
-        return call_agent_cli(prompt, agent or "claude")
-
-
-def parse_json_response(text: str) -> dict | None:
-    """Extract JSON from LLM response, handling various wrapping formats."""
-    text = text.strip()
-
-    # 1. Direct parse (clean JSON)
-    if text.startswith("{"):
-        try:
-            return json.loads(text)
-        except json.JSONDecodeError:
-            pass
-
-    # 2. Markdown code block
-    match = re.search(r"```(?:json)?\s*\n(.*?)\n\s*```", text, re.DOTALL)
-    if match:
-        try:
-            return json.loads(match.group(1).strip())
-        except json.JSONDecodeError:
-            pass
-
-    # 3. Find the outermost balanced { ... } using brace counting
-    start = text.find("{")
-    if start != -1:
-        depth = 0
-        in_string = False
-        escape_next = False
-        for i in range(start, len(text)):
-            ch = text[i]
-            if escape_next:
-                escape_next = False
-                continue
-            if ch == "\\":
-                escape_next = True
-                continue
-            if ch == '"' and not escape_next:
-                in_string = not in_string
-                continue
-            if in_string:
-                continue
-            if ch == "{":
-                depth += 1
-            elif ch == "}":
-                depth -= 1
-                if depth == 0:
-                    candidate = text[start : i + 1]
-                    try:
-                        return json.loads(candidate)
-                    except json.JSONDecodeError:
-                        break
-
-    return None
-
-
-def evaluate_capture(
- scenarios: list[dict],
- scribe_prompt: str,
- llm_kwargs: dict,
-) -> BenchmarkReport:
- """Evaluate capture accuracy: does the scribe correctly capture or skip?"""
- report = BenchmarkReport(bench_type="scribe-capture")
-
- for i, scenario in enumerate(scenarios):
- sid = scenario["id"]
- expected = scenario["expected_capture"]
- text = scenario["input"]
-
- print(f" [{i + 1}/{len(scenarios)}] {sid}...", end=" ", flush=True)
-
- prompt = build_evaluation_prompt(scribe_prompt, text)
-
- try:
- response_text = call_llm(prompt, **llm_kwargs)
- parsed = parse_json_response(response_text)
- except Exception as e:
- report.add(
- ScenarioResult(
- scenario_id=sid,
- category=scenario["category"],
- passed=False,
- expected=expected,
- actual=None,
- details={"error": str(e), "reason": f"LLM call failed: {e}"},
- )
- )
- print("ERROR")
- if VERBOSE:
- print(f" {e}")
- continue
-
- if parsed is None:
- report.add(
- ScenarioResult(
- scenario_id=sid,
- category=scenario["category"],
- passed=False,
- expected=expected,
- actual=None,
- details={
- "reason": "Failed to parse JSON from response",
- "raw_response": response_text[:500],
- },
- )
- )
- print("PARSE_ERROR")
- if VERBOSE:
- print(f" Raw ({len(response_text)} chars): {response_text[:200]}")
- continue
-
- # Determine actual capture decision
- tier2 = parsed.get("tier2", {})
- actual = tier2.get("capture", False)
-
- # Capture implied by presence of extraction fields without tier2
- if "title" in parsed and "tier2" not in parsed:
- actual = True
- if "phases" in parsed and "tier2" not in parsed:
- actual = True
-
- passed = actual == expected
- details: dict = {
- "tier2_capture": actual,
- "tier2_domain": tier2.get("domain"),
- "tier2_reason": tier2.get("reason"),
- }
-
- if not passed:
- if expected and not actual:
- details["reason"] = "False negative: should have been captured"
- else:
- details["reason"] = "False positive: should not have been captured"
-
- expected_fields = scenario.get("expected_fields", {})
- if passed and expected and "domain" in expected_fields:
- details["domain_match"] = (
- tier2.get("domain") == expected_fields["domain"]
- )
-
- report.add(
- ScenarioResult(
- scenario_id=sid,
- category=scenario["category"],
- passed=passed,
- expected=expected,
- actual=actual,
- details=details,
- )
- )
- print("PASS" if passed else "FAIL")
- time.sleep(0.1)
-
- return report
-
-
-def evaluate_extraction(
- scenarios: list[dict],
- scribe_prompt: str,
- llm_kwargs: dict,
-) -> BenchmarkReport:
- """Evaluate extraction quality: is the extracted JSON well-structured?"""
- report = BenchmarkReport(bench_type="scribe-extraction")
-
- for i, scenario in enumerate(scenarios):
- sid = scenario["id"]
- text = scenario["input"]
- expected_type = scenario["expected_extraction_type"]
- expected_fields = scenario.get("expected_fields", {})
-
- print(f" [{i + 1}/{len(scenarios)}] {sid}...", end=" ", flush=True)
-
- prompt = build_evaluation_prompt(scribe_prompt, text)
-
- try:
- response_text = call_llm(prompt, **llm_kwargs)
- parsed = parse_json_response(response_text)
- except Exception as e:
- report.add(
- ScenarioResult(
- scenario_id=sid,
- category=scenario["category"],
- passed=False,
- expected=expected_type,
- actual=None,
- details={"error": str(e), "reason": f"LLM call failed: {e}"},
- )
- )
- print("ERROR")
- if VERBOSE:
- print(f" {e}")
- continue
-
- if parsed is None:
- report.add(
- ScenarioResult(
- scenario_id=sid,
- category=scenario["category"],
- passed=False,
- expected=expected_type,
- actual=None,
- details={
- "reason": "Failed to parse JSON from response",
- "raw_response": response_text[:500],
- },
- )
- )
- print("PARSE_ERROR")
- if VERBOSE:
- print(f" Raw ({len(response_text)} chars): {response_text[:200]}")
- continue
-
- # Check if scribe decided to capture at all
- tier2 = parsed.get("tier2", {})
- if not tier2.get("capture", True):
- report.add(
- ScenarioResult(
- scenario_id=sid,
- category=scenario["category"],
- passed=False,
- expected=expected_type,
- actual="rejected",
- details={
- "reason": "Scribe rejected capture for extraction scenario",
- "tier2_reason": tier2.get("reason"),
- },
- )
- )
- print("REJECTED")
- continue
-
- # Determine actual extraction type
- group_type = parsed.get("group_type")
- if group_type == "bundle":
- actual_type = "bundle"
- elif group_type == "phase_chain" or "phases" in parsed:
- actual_type = "phase_chain"
- else:
- actual_type = "single"
-
- checks: dict[str, bool] = {}
- reasons: list[str] = []
-
- type_match = actual_type == expected_type
- checks["type_match"] = type_match
- if not type_match:
- reasons.append(f"Type: expected {expected_type}, got {actual_type}")
-
- if "title_keywords" in expected_fields:
- title = parsed.get("title") or parsed.get("group_title") or ""
- kw_match = check_title_keywords(title, expected_fields["title_keywords"])
- checks["title_keywords"] = kw_match
- if not kw_match:
- reasons.append(
- f"Title '{title}' missing keywords: "
- f"{expected_fields['title_keywords']}"
- )
-
- if "status_hint" in expected_fields:
- actual_status = parsed.get("status_hint", "")
- status_match = actual_status == expected_fields["status_hint"]
- checks["status_hint"] = status_match
- if not status_match:
- reasons.append(
- f"Status: expected {expected_fields['status_hint']}, got {actual_status}"
- )
-
- if "min_alternatives" in expected_fields:
- if actual_type == "single":
- alt_count = len(parsed.get("alternatives", []))
- else:
- alt_count = sum(
- len(p.get("alternatives", []))
- for p in parsed.get("phases", [])
- )
- alt_ok = alt_count >= expected_fields["min_alternatives"]
- checks["min_alternatives"] = alt_ok
- if not alt_ok:
- reasons.append(
- f"Alternatives: {alt_count} < {expected_fields['min_alternatives']}"
- )
-
- if "min_trade_offs" in expected_fields:
- if actual_type == "single":
- to_count = len(parsed.get("trade_offs", []))
- else:
- to_count = sum(
- len(p.get("trade_offs", []))
- for p in parsed.get("phases", [])
- )
- to_ok = to_count >= expected_fields["min_trade_offs"]
- checks["min_trade_offs"] = to_ok
- if not to_ok:
- reasons.append(
- f"Trade-offs: {to_count} < {expected_fields['min_trade_offs']}"
- )
-
- phases = parsed.get("phases", [])
- if "min_phases" in expected_fields:
- min_ok = len(phases) >= expected_fields["min_phases"]
- checks["min_phases"] = min_ok
- if not min_ok:
- reasons.append(
- f"Phases: {len(phases)} < {expected_fields['min_phases']}"
- )
- if "max_phases" in expected_fields:
- max_ok = len(phases) <= expected_fields["max_phases"]
- checks["max_phases"] = max_ok
- if not max_ok:
- reasons.append(
- f"Phases: {len(phases)} > {expected_fields['max_phases']}"
- )
-
- passed = all(checks.values())
- details = {
- "checks": checks,
- "actual_type": actual_type,
- "title": parsed.get("title") or parsed.get("group_title"),
- }
- if phases:
- details["phase_count"] = len(phases)
- details["phase_titles"] = [p.get("phase_title", "") for p in phases]
- if reasons:
- details["reason"] = "; ".join(reasons)
-
- report.add(
- ScenarioResult(
- scenario_id=sid,
- category=scenario["category"],
- passed=passed,
- expected=expected_type,
- actual=actual_type,
- details=details,
- )
- )
- print("PASS" if passed else "FAIL")
- time.sleep(0.1)
-
- return report
-
-
-def main() -> None:
- parser = argparse.ArgumentParser(
- description="Rune scribe benchmark — tests capture quality via agent-delegated flow"
- )
-
- # Agent CLI mode (default)
- parser.add_argument(
- "--agent",
- default="claude",
- help=(
- "Agent CLI to use. Built-in: claude, gemini, codex. "
- "Or pass a custom command, e.g. 'my-agent --flag'. (default: claude)"
- ),
- )
-
- # API fallback mode (for CI / environments without CLI auth)
- api_group = parser.add_argument_group("API fallback (optional)")
- api_group.add_argument(
- "--api-key",
- default=None,
- help="Use direct API call instead of agent CLI",
- )
- api_group.add_argument(
- "--provider",
- default="anthropic",
- help="API provider: anthropic, openai (only with --api-key)",
- )
- api_group.add_argument(
- "--model",
- default=None,
- help="Model name (only with --api-key)",
- )
-
- parser.add_argument(
- "--mode",
- choices=["capture", "extraction", "all"],
- default="all",
- help="What to benchmark (default: all)",
- )
- parser.add_argument(
- "--report", type=Path, default=None, help="Save report to this path"
- )
- parser.add_argument(
- "--category",
- default=None,
- help="Filter to specific category (e.g. 'architecture', 'pr_review')",
- )
- parser.add_argument(
- "-v", "--verbose",
- action="store_true",
- help="Show raw LLM response on parse errors",
- )
- args = parser.parse_args()
-
- global VERBOSE
- VERBOSE = args.verbose
-
- # Build LLM call kwargs
- if args.api_key:
- llm_kwargs = {
- "provider": args.provider,
- "api_key": args.api_key,
- "model": args.model,
- }
- backend_desc = f"API: {args.provider} / {args.model or API_MODELS.get(args.provider)}"
- else:
- llm_kwargs = {"agent": args.agent}
- backend_desc = f"Agent CLI: {args.agent}"
-
- scribe_prompt = load_scribe_prompt()
-
- print(f"Backend: {backend_desc}")
- print(f"Scribe prompt: {len(scribe_prompt)} chars\n")
-
- if args.mode in ("capture", "all"):
- should_capture = load_scenarios("capture/should_capture")
- should_not_capture = load_scenarios("capture/should_not_capture")
- capture_scenarios = should_capture + should_not_capture
-
- if args.category:
- capture_scenarios = [
- s for s in capture_scenarios if args.category in s["category"]
- ]
-
- if capture_scenarios:
- print(f"=== Capture Benchmark ({len(capture_scenarios)} scenarios) ===\n")
- capture_report = evaluate_capture(
- capture_scenarios, scribe_prompt, llm_kwargs
- )
- capture_report.print_summary()
-
- if args.report:
- p = args.report.with_stem(args.report.stem + "-capture")
- capture_report.save(p)
- print(f"Report saved to: {p}")
- else:
- capture_report.save()
-
- if args.mode in ("extraction", "all"):
- extraction_scenarios = load_scenarios("extraction")
-
- if args.category:
- extraction_scenarios = [
- s for s in extraction_scenarios if args.category in s["category"]
- ]
-
- if extraction_scenarios:
- print(
- f"\n=== Extraction Benchmark ({len(extraction_scenarios)} scenarios) ===\n"
- )
- extraction_report = evaluate_extraction(
- extraction_scenarios, scribe_prompt, llm_kwargs
- )
- extraction_report.print_summary()
-
- if args.report:
- p = args.report.with_stem(args.report.stem + "-extraction")
- extraction_report.save(p)
- print(f"Report saved to: {p}")
- else:
- extraction_report.save()
-
-
-if __name__ == "__main__":
- main()
diff --git a/benchmark/scenarios/capture/should_capture/architecture/edge_cases.jsonl b/benchmark/scenarios/capture/should_capture/architecture/edge_cases.jsonl
deleted file mode 100644
index 32fcbe3f..00000000
--- a/benchmark/scenarios/capture/should_capture/architecture/edge_cases.jsonl
+++ /dev/null
@@ -1,2 +0,0 @@
-{"id": "arch-edge-implicit-001", "category": "capture/should_capture/architecture", "language": "en", "input": "After spending 3 days fighting with Terraform state locks in DynamoDB, we're moving to Terraform Cloud for state management. The $20/month cost is nothing compared to the engineering time lost. The DynamoDB approach works fine for single-developer projects but breaks down with 5 people running terraform plan concurrently. Terraform Cloud also gives us a UI for state inspection and drift detection, which we've been doing manually.", "expected_capture": true, "expected_fields": {"domain": "ops", "status_hint": "accepted", "title_keywords": ["Terraform", "state management"]}, "recall_queries": [{"query": "How do we manage Terraform state?", "should_match": true}], "notes": "Decision is embedded in a frustration narrative — tests if detector catches decisions in informal language"}
-{"id": "arch-edge-mixed-lang-002", "category": "capture/should_capture/architecture", "language": "mixed", "input": "API rate limiting 전략 확정. Token bucket 알고리즘 사용하기로. Redis에서 EVALSHA로 atomic하게 처리. Per-user limit은 1000 req/min, per-IP는 5000 req/min. 429 Too Many Requests 응답에 Retry-After 헤더 포함. 이전에 검토한 sliding window 방식은 Redis memory 사용량이 너무 높아서 탈락.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["rate limiting", "token bucket"]}, "recall_queries": [{"query": "rate limiting strategy", "should_match": true}, {"query": "API 속도 제한", "should_match": true}], "notes": "Mixed Korean/English — common in Korean engineering teams"}
diff --git a/benchmark/scenarios/capture/should_capture/architecture/scenarios.jsonl b/benchmark/scenarios/capture/should_capture/architecture/scenarios.jsonl
deleted file mode 100644
index 7387e83e..00000000
--- a/benchmark/scenarios/capture/should_capture/architecture/scenarios.jsonl
+++ /dev/null
@@ -1,8 +0,0 @@
-{"id": "arch-event-sourcing-001", "category": "capture/should_capture/architecture", "language": "en", "input": "We decided to use event sourcing for the order service instead of traditional CRUD. The main reason is we need full audit trails for compliance, and event replay lets us rebuild read models without data migration. The trade-off is increased storage and complexity in the event store, but compliance requirements make this non-negotiable.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["event sourcing", "order service"]}, "recall_queries": [{"query": "Why did we choose event sourcing?", "should_match": true}, {"query": "Order service architecture decision", "should_match": true}]}
-{"id": "arch-graphql-gateway-002", "category": "capture/should_capture/architecture", "language": "en", "input": "After evaluating REST, gRPC, and GraphQL for our API gateway, we're going with GraphQL. The mobile team needs flexible queries without over-fetching, and our current REST endpoints require 3-4 calls per screen load. gRPC was considered for internal services but doesn't solve the client query flexibility problem. We'll keep REST for webhooks and public APIs.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["GraphQL", "API gateway"]}, "recall_queries": [{"query": "What API protocol did we choose for the gateway?", "should_match": true}, {"query": "Why not gRPC for the gateway?", "should_match": true}]}
-{"id": "arch-microservices-split-003", "category": "capture/should_capture/architecture", "language": "en", "input": "We're splitting the monolith's payment module into a separate service. The billing team needs independent deploy cycles and the shared database is causing lock contention during invoice runs. We'll use the strangler fig pattern — new endpoints go to the service, old ones proxy through until migration is complete. Estimated 3 months for full cutover.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["payment", "microservice", "split"]}, "recall_queries": [{"query": "How are we decomposing the monolith?", "should_match": true}]}
-{"id": "arch-cache-strategy-004", "category": "capture/should_capture/architecture", "language": "en", "input": "For the product catalog caching strategy, we're using a write-through cache with Redis instead of cache-aside. The catalog updates are infrequent (< 100/day) but reads are 50K/min. Write-through ensures consistency without TTL-based staleness. We rejected cache-aside because the 30-second staleness window caused pricing discrepancies in the checkout flow last quarter.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["cache", "write-through", "Redis"]}, "recall_queries": [{"query": "What caching strategy do we use for product catalog?", "should_match": true}, {"query": "Why did we reject cache-aside?", "should_match": true}]}
-{"id": "arch-db-migration-005", "category": "capture/should_capture/architecture", "language": "en", "input": "We're migrating from MongoDB to PostgreSQL for the user profile service. The document model doesn't help us anymore since profiles have become highly relational (teams, roles, permissions). Mongo's lack of JOINs forces us to denormalize everything and maintain sync jobs. PostgreSQL with JSONB columns gives us the best of both worlds. Migration will use dual-write for 2 weeks.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["MongoDB", "PostgreSQL", "migration"]}, "recall_queries": [{"query": "Why did we move away from MongoDB?", "should_match": true}]}
-{"id": "arch-k8s-service-mesh-006", "category": "capture/should_capture/architecture", "language": "ko", "input": "서비스 메시로 Istio 대신 Linkerd를 채택하기로 했습니다. Istio는 기능이 풍부하지만 리소스 오버헤드가 너무 크고 운영 복잡도가 높습니다. 우리 규모(서비스 20개 미만)에서는 Linkerd의 경량 사이드카가 더 적합하고, mTLS와 관찰성 기능만으로 충분합니다. Consul Connect도 검토했으나 HashiCorp 라이선스 변경 리스크가 있어 제외했습니다.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["Linkerd", "service mesh"]}, "recall_queries": [{"query": "서비스 메시 어떤 거 쓰기로 했어?", "should_match": true}, {"query": "Why not Istio?", "should_match": true}]}
-{"id": "arch-queue-system-007", "category": "capture/should_capture/architecture", "language": "en", "input": "Rejecting RabbitMQ in favor of Kafka for the event bus. Our use case requires event replay for new consumers, and RabbitMQ's replay story is poor. Kafka's log-based model with consumer groups fits perfectly. The operational overhead is higher, but we already have a Kafka cluster for analytics ingestion, so incremental cost is manageable.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["Kafka", "event bus"]}, "recall_queries": [{"query": "What message queue do we use?", "should_match": true}]}
-{"id": "arch-frontend-framework-008", "category": "capture/should_capture/architecture", "language": "en", "input": "We're standardizing on Next.js App Router for all new frontend projects. The Pages Router is being deprecated internally. Reasons: React Server Components reduce client bundle size by 40% in our tests, and the App Router's nested layouts eliminate the layout prop-drilling problem we've been fighting. The team has completed the migration guide and all new features must use App Router starting next sprint.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["Next.js", "App Router"]}, "recall_queries": [{"query": "What frontend framework are we using?", "should_match": true}, {"query": "Are we using Pages Router or App Router?", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/coding_context/architecture_pivot.jsonl b/benchmark/scenarios/capture/should_capture/coding_context/architecture_pivot.jsonl
deleted file mode 100644
index 14f7ee5e..00000000
--- a/benchmark/scenarios/capture/should_capture/coding_context/architecture_pivot.jsonl
+++ /dev/null
@@ -1,3 +0,0 @@
-{"id": "coding-pivot-001", "category": "capture/should_capture/coding_context", "language": "en", "input": "We're pivoting the real-time notifications system from REST polling to WebSocket. The product requirement says users should see new messages within 500ms. Our current setup polls `GET /api/notifications` every 3 seconds, which means:\n\n1. Worst-case latency: 3 seconds (unacceptable per product)\n2. At 10K concurrent users, that's 3,333 requests/second just for polling — most returning empty 304s\n3. Mobile battery drain from constant HTTP keep-alive cycling\n\nWe prototyped both approaches:\n\n```\n// Polling baseline\nPolling interval: 3s\nAvg notification delivery: 1.5s\nServer load: 3,333 req/s at 10K users\nHTTP overhead per poll: ~800 bytes headers for empty response\n\n// WebSocket prototype\nAvg notification delivery: 47ms\nServer load: 10K persistent connections (2.1GB RAM on c5.xlarge)\nPer-message overhead: ~6 bytes framing\n```\n\nWebSocket wins on latency and efficiency. The trade-off is operational complexity — we need sticky sessions or a Redis pub/sub fan-out layer for multi-instance deploys. We're going with Socket.io on the server side with a Redis adapter for horizontal scaling. Fallback to long-polling for corporate networks that block WebSocket upgrades.\n\nThe REST endpoint stays for mobile push notification triggers — those don't need sub-second delivery.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["WebSocket", "polling", "real-time", "notifications"], "evidence_type": "benchmark", "has_reusable_insight": true}, "recall_queries": [{"query": "why WebSocket instead of polling for notifications", "should_match": true}, {"query": "real-time notification architecture", "should_match": true}]}
-{"id": "coding-pivot-002", "category": "capture/should_capture/coding_context", "language": "en", "input": "Decision: we're NOT extracting the billing module into a microservice. I know we planned this for Q2 and it's on the roadmap, but after spiking it for two weeks, the costs outweigh the benefits right now.\n\nReasons to keep it as a module in the monolith:\n\n1. **Data coupling is too tight** — billing reads from 8 different tables (users, subscriptions, invoices, payments, coupons, tax_rates, usage_events, credit_notes). Extracting means either replicating all this data or making 8 cross-service calls per invoice generation.\n\n2. **Transaction boundaries** — invoice creation needs to atomically update `subscriptions.current_period_end`, insert into `invoices`, and create `payment_intents`. With a microservice, we'd need saga orchestration for what's currently a single DB transaction.\n\n3. **Team size** — we're 4 backend engineers. Running a separate service means separate CI/CD, monitoring, on-call rotation. The operational overhead isn't justified.\n\nWhat we WILL do instead:\n- Extract billing into a well-defined module with a clean interface (`BillingService` class with no direct table access from outside)\n- Add integration tests at the module boundary\n- Revisit microservice extraction when we hit 10+ engineers or need independent scaling\n\nThis is a \"keep it boring\" decision. The monolith is serving us fine at current scale.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "rejected", "title_keywords": ["microservice", "monolith", "billing", "module"], "evidence_type": "code_change", "has_reusable_insight": true}, "recall_queries": [{"query": "should we extract billing into microservice", "should_match": true}, {"query": "why we kept billing in the monolith", "should_match": true}]}
-{"id": "coding-pivot-003", "category": "capture/should_capture/coding_context", "language": "en", "input": "Dropping the ORM for the analytics query pipeline. We've been fighting Sequelize for weeks trying to express the reporting queries, and it's not working.\n\nThe query we need generates a cohort retention matrix — users grouped by signup week, showing what percentage returned in weeks 1-12. In raw SQL it's a CTE with window functions:\n\n```sql\nWITH cohorts AS (\n SELECT user_id, DATE_TRUNC('week', created_at) AS cohort_week\n FROM users WHERE created_at >= NOW() - INTERVAL '12 weeks'\n),\nactivity AS (\n SELECT DISTINCT user_id, DATE_TRUNC('week', event_time) AS active_week\n FROM events WHERE event_time >= NOW() - INTERVAL '24 weeks'\n)\nSELECT \n c.cohort_week,\n EXTRACT(WEEK FROM a.active_week - c.cohort_week) AS week_number,\n COUNT(DISTINCT a.user_id)::float / COUNT(DISTINCT c.user_id) AS retention_rate\nFROM cohorts c\nLEFT JOIN activity a ON c.user_id = a.user_id AND a.active_week >= c.cohort_week\nGROUP BY c.cohort_week, week_number\nORDER BY c.cohort_week, week_number;\n```\n\nSequelize can't do CTEs with window functions cleanly. We tried `sequelize.literal()` and `sequelize.query()` but at that point we're writing raw SQL inside ORM wrappers — worst of both worlds.\n\nDecision: analytics queries use raw SQL via `pg` driver directly, wrapped in a `AnalyticsQueryService` class. CRUD operations on domain models (users, events, etc.) still use Sequelize. This gives us clean separation — ORM for simple CRUD, raw SQL for complex analytics. We're not switching ORMs; we're just admitting that ORMs aren't the right tool for analytical queries.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["ORM", "raw SQL", "analytics", "Sequelize"], "evidence_type": "code_change", "has_reusable_insight": true}, "recall_queries": [{"query": "why raw SQL instead of ORM for analytics", "should_match": true}, {"query": "Sequelize limitations complex queries", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/coding_context/optimization.jsonl b/benchmark/scenarios/capture/should_capture/coding_context/optimization.jsonl
deleted file mode 100644
index e0561a22..00000000
--- a/benchmark/scenarios/capture/should_capture/coding_context/optimization.jsonl
+++ /dev/null
@@ -1,4 +0,0 @@
-{"id": "coding-optim-001", "category": "capture/should_capture/coding_context", "language": "en", "input": "Profiled the tag-matching endpoint that was timing out for large accounts. The `findMatchingTags()` function was O(n^2) — nested loop comparing every user tag against every filter tag.\n\nBefore (from the flame graph, this was 94% of request time):\n```js\nfunction findMatchingTags(userTags, filterTags) {\n return userTags.filter(ut => filterTags.some(ft => ft.id === ut.id));\n}\n```\n\nAfter — O(n) with a Set lookup:\n```js\nfunction findMatchingTags(userTags, filterTags) {\n const filterSet = new Set(filterTags.map(ft => ft.id));\n return userTags.filter(ut => filterSet.has(ut.id));\n}\n```\n\nBenchmark results on a real account with 12K user tags and 800 filter tags:\n- Before: 3,200ms avg (p99: 8,400ms, frequently timing out at the 5s gateway limit)\n- After: 11ms avg (p99: 28ms)\n\nThe fix is trivial but the impact is enormous. This pattern keeps recurring — any time we're doing membership checks in a loop, we should be using a Set or Map. Adding this to the team's code review checklist.", "expected_capture": true, "expected_fields": {"domain": "performance", "status_hint": "accepted", "title_keywords": ["O(n²)", "hash set", "tag matching"], "evidence_type": "benchmark", "has_reusable_insight": true}, "recall_queries": [{"query": "tag matching performance optimization", "should_match": true}, {"query": "O(n^2) to O(n) refactor", "should_match": true}]}
-{"id": "coding-optim-002", "category": "capture/should_capture/coding_context", "language": "en", "input": "Fixed the dashboard page load regression. New Relic showed the `/api/dashboard` endpoint jumped from 200ms to 4.2s after we added the activity feed widget. Root cause: classic N+1 query problem.\n\nThe ORM code was doing this:\n```python\nactivities = Activity.objects.filter(team_id=team.id)[:50]\nfor activity in activities:\n # Each of these triggers a separate DB query\n activity.user # lazy load User\n activity.project # lazy load Project\n```\n\nThat's 50 activities x 2 relations = 100 extra queries on top of the initial one.\n\nFix with eager loading:\n```diff\n- activities = Activity.objects.filter(team_id=team.id)[:50]\n+ activities = Activity.objects.filter(team_id=team.id).select_related('user', 'project')[:50]\n```\n\nQuery count dropped from 101 to 1. Response time went from 4.2s back to 220ms. Also added `nplusone` to our test suite to catch this pattern automatically:\n```python\n# settings/test.py\nNPLUSONE_RAISE = True\n```\n\nThis is the third N+1 we've hit this quarter. The `nplusone` library should prevent future ones from reaching production.", "expected_capture": true, "expected_fields": {"domain": "performance", "status_hint": "accepted", "title_keywords": ["N+1", "query", "eager loading", "select_related"], "evidence_type": "code_change", "has_reusable_insight": true}, "recall_queries": [{"query": "N+1 query dashboard slow", "should_match": true}, {"query": "how to prevent N+1 queries", "should_match": true}]}
-{"id": "coding-optim-003", "category": "capture/should_capture/coding_context", "language": "en", "input": "Completed the frontend bundle size audit. The main bundle was 2.1MB gzipped, causing 6-8 second load times on mobile. After analysis with `webpack-bundle-analyzer`, the biggest offenders were:\n\n1. `moment.js` with all locales (480KB) — replaced with `dayjs` (2KB + only en/ko locales)\n2. `lodash` full import (71KB) — switched to per-function imports (`lodash/debounce`)\n3. Three chart libraries loaded on every page — code-split behind `React.lazy()`\n\n```diff\n- import moment from 'moment';\n+ import dayjs from 'dayjs';\n\n- import _ from 'lodash';\n+ import debounce from 'lodash/debounce';\n+ import groupBy from 'lodash/groupBy';\n\n- import { BarChart, LineChart, PieChart } from 'recharts';\n+ const BarChart = React.lazy(() => import('recharts').then(m => ({ default: m.BarChart })));\n```\n\nResults:\n- Main bundle: 2.1MB → 380KB gzipped (82% reduction)\n- Dashboard page (with charts): lazy loads additional 120KB on demand\n- Lighthouse Performance score: 34 → 89\n- LCP on 3G: 8.2s → 2.1s\n\nWe should enforce a bundle budget going forward. Adding a CI check that fails if any chunk exceeds 500KB.", "expected_capture": true, "expected_fields": {"domain": "performance", "status_hint": "accepted", "title_keywords": ["bundle size", "tree-shaking", "code splitting"], "evidence_type": "benchmark", "has_reusable_insight": true}, "recall_queries": [{"query": "bundle size reduction frontend", "should_match": true}, {"query": "how did we shrink the JavaScript bundle?", "should_match": true}]}
-{"id": "coding-optim-004", "category": "capture/should_capture/coding_context", "language": "ko", "input": "주문 조회 API의 레이턴시 문제를 해결했습니다. p95가 12초까지 치솟아서 EXPLAIN ANALYZE를 돌려봤습니다:\n\n```sql\nEXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 1234 AND status = 'completed' ORDER BY created_at DESC LIMIT 20;\n\n-- Before:\n-- Seq Scan on orders (cost=0.00..458932.00 rows=892 width=312) (actual time=3421.532..11842.109 rows=18 loops=1)\n-- Filter: ((user_id = 1234) AND (status = 'completed'))\n-- Rows Removed by Filter: 12483921\n-- Planning Time: 0.089 ms\n-- Execution Time: 11842.301 ms\n```\n\n1,200만 건 테이블에서 풀스캔. 복합 인덱스를 추가했습니다:\n\n```sql\nCREATE INDEX CONCURRENTLY idx_orders_user_status_created \n ON orders (user_id, status, created_at DESC);\n```\n\nAfter:\n```\n-- Index Scan using idx_orders_user_status_created on orders (cost=0.56..48.21 rows=892 width=312) (actual time=0.038..0.142 rows=18 loops=1)\n-- Planning Time: 0.312 ms\n-- Execution Time: 0.198 ms\n```\n\n11.8초 → 0.2ms로 개선. `CONCURRENTLY` 옵션으로 무중단 인덱스 생성. 인덱스 크기는 약 340MB인데, 쿼리 빈도(분당 2,000회)를 고려하면 충분히 가치 있습니다. 앞으로 신규 테이블 설계 시 주요 조회 패턴에 대한 인덱스를 DDL에 포함하기로 합니다.", "expected_capture": true, "expected_fields": {"domain": "performance", "status_hint": "accepted", "title_keywords": ["DB index", "풀스캔", "EXPLAIN", "복합 인덱스"], "evidence_type": "benchmark", "has_reusable_insight": true}, "recall_queries": [{"query": "주문 조회 느린 문제 인덱스", "should_match": true}, {"query": "PostgreSQL index optimization orders table", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/coding_context/pattern_establish.jsonl b/benchmark/scenarios/capture/should_capture/coding_context/pattern_establish.jsonl
deleted file mode 100644
index bc91e04f..00000000
--- a/benchmark/scenarios/capture/should_capture/coding_context/pattern_establish.jsonl
+++ /dev/null
@@ -1,3 +0,0 @@
-{"id": "coding-pattern-001", "category": "capture/should_capture/coding_context", "language": "en", "input": "After today's incident, we're establishing a mandatory error boundary pattern for all React component trees. Here's what happened: the analytics dashboard crashed completely because a single chart component received `null` for its `data` prop. The uncaught TypeError propagated up and took down the entire page — users couldn't even access navigation.\n\nThe fix is two-fold:\n\n1. Every route-level component MUST be wrapped in an error boundary:\n```jsx\n// app/routes/dashboard.tsx\nexport default function DashboardRoute() {\n return (\n }>\n \n \n );\n}\n```\n\n2. Widget-level error boundaries for independently-failing sections:\n```jsx\n// components/Dashboard/AnalyticsWidget.tsx\nexport function AnalyticsWidget() {\n return (\n \n \n \n \n \n \n );\n}\n```\n\nThe `WidgetErrorBoundary` logs to Sentry with the widget name, renders a \"This section encountered an error\" message, and provides a retry button. The rest of the page remains functional.\n\nAdding an ESLint rule (`no-unbounded-route-component`) to enforce this in CI. This is now a team standard — any new route without an error boundary will fail the lint check.", "expected_capture": true, "expected_fields": {"domain": "engineering_practice", "status_hint": "accepted", "title_keywords": ["error boundary", "React", "crash cascade", "team standard"], "evidence_type": "code_change", "has_reusable_insight": true}, "recall_queries": [{"query": "React error boundary team pattern", "should_match": true}, {"query": "how to prevent full page crash from component error", "should_match": true}]}
-{"id": "coding-pattern-002", "category": "capture/should_capture/coding_context", "language": "en", "input": "We're making idempotency keys mandatory for all payment-related endpoints. This comes after the duplicate charge incident last Thursday where a customer was charged $4,200 twice.\n\nRoot cause: the frontend retry logic fired during a network blip. The first request succeeded (charge was created), but the client got a timeout before receiving the response. The retry created a second charge. No idempotency protection on the endpoint.\n\nNew standard for all `POST` endpoints that create financial transactions:\n\n```typescript\n// middleware/idempotency.ts\nexport function requireIdempotencyKey(req: Request, res: Response, next: NextFunction) {\n const key = req.headers['idempotency-key'];\n if (!key) return res.status(400).json({ error: 'Idempotency-Key header required' });\n \n const cached = await redis.get(`idempotency:${key}`);\n if (cached) return res.status(200).json(JSON.parse(cached));\n \n res.on('finish', () => {\n if (res.statusCode >= 200 && res.statusCode < 300) {\n redis.set(`idempotency:${key}`, JSON.stringify(res.body), 'EX', 86400);\n }\n });\n next();\n}\n```\n\nThe key is a UUID generated by the client, stored in Redis for 24 hours. If a request with the same key arrives, we return the cached response without re-executing the handler.\n\nApplies to: `/api/charges`, `/api/refunds`, `/api/transfers`, `/api/subscriptions`. Adding to the API design guidelines doc. Frontend SDK updated to auto-generate and attach idempotency keys for these endpoints.", "expected_capture": true, "expected_fields": {"domain": "engineering_practice", "status_hint": "accepted", "title_keywords": ["idempotency", "payment", "duplicate charge", "team standard"], "evidence_type": "code_change", "has_reusable_insight": true}, "recall_queries": [{"query": "idempotency key pattern for payments", "should_match": true}, {"query": "how to prevent duplicate charges", "should_match": true}]}
-{"id": "coding-pattern-003", "category": "capture/should_capture/coding_context", "language": "ko", "input": "오늘 프로덕션 장애 원인이 환경설정 오타였습니다. `REDIS_HOST`가 `redis-prod.internal`이어야 하는데 `redis-pord.internal`로 오타가 나 있었고, 서비스가 시작은 되지만 첫 Redis 호출 시점에 크래시했습니다. 배포 후 15분 뒤에야 발견.\n\n이런 런타임 에러를 방지하기 위해 시작 시 config 검증 규칙을 확립합니다:\n\n```typescript\n// config/validator.ts\nimport { z } from 'zod';\n\nconst envSchema = z.object({\n REDIS_HOST: z.string().min(1),\n REDIS_PORT: z.coerce.number().int().positive(),\n DATABASE_URL: z.string().url(),\n JWT_SECRET: z.string().min(32),\n STRIPE_SECRET_KEY: z.string().startsWith('sk_'),\n NODE_ENV: z.enum(['development', 'staging', 'production']),\n});\n\nexport function validateConfig() {\n const result = envSchema.safeParse(process.env);\n if (!result.success) {\n console.error('Config validation failed:');\n result.error.issues.forEach(issue => {\n console.error(` ${issue.path.join('.')}: ${issue.message}`);\n });\n process.exit(1);\n }\n return result.data;\n}\n```\n\n서비스 진입점에서 가장 먼저 호출:\n```typescript\n// index.ts\nconst config = validateConfig(); // 실패 시 즉시 종료\nconst app = createApp(config);\n```\n\n이제 잘못된 config로 서비스가 시작되는 일은 없습니다. 모든 신규 서비스에 이 패턴 적용을 팀 규칙으로 정합니다. CI에서도 각 환경의 `.env` 파일에 대해 스키마 검증을 실행합니다.", "expected_capture": true, "expected_fields": {"domain": "engineering_practice", "status_hint": "accepted", "title_keywords": ["config validation", "startup", "zod", "환경설정 검증"], "evidence_type": "code_change", "has_reusable_insight": true}, "recall_queries": [{"query": "config 검증 시작 시 환경변수 validation", "should_match": true}, {"query": "prevent runtime config errors with startup validation", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/coding_context/reframing.jsonl b/benchmark/scenarios/capture/should_capture/coding_context/reframing.jsonl
deleted file mode 100644
index 10a50806..00000000
--- a/benchmark/scenarios/capture/should_capture/coding_context/reframing.jsonl
+++ /dev/null
@@ -1,3 +0,0 @@
-{"id": "coding-reframe-001", "category": "capture/should_capture/coding_context", "language": "en", "input": "Sharing a debugging postmortem that completely changed our understanding of the issue. For two weeks, we thought we had an API timeout problem — the payment confirmation endpoint was sporadically returning 504s. We added retries, increased timeout limits from 10s to 30s, and even opened a ticket with the payment provider.\n\nTurns out it was never an API problem. The real issue was a cache invalidation race in our own service. Here's what was happening:\n\n1. User completes payment → webhook writes `payment_status = 'completed'` to DB\n2. A separate cache-warming job reads the DB and updates Redis — but with a 5-second delay\n3. The confirmation page polls our API, which reads from Redis\n4. If the poll lands in that 5-second window, Redis still has `payment_status = 'pending'`\n5. Frontend retries, Redis eventually catches up, but sometimes the retry storm causes connection pool exhaustion → 504\n\nThe fix wasn't retry logic or timeout tuning. It was updating the Redis cache synchronously in the webhook handler:\n```diff\n- await db.update('payments', { status: 'completed' });\n+ await db.update('payments', { status: 'completed' });\n+ await redis.set(`payment:${id}:status`, 'completed', 'EX', 3600);\n```\n\n504s dropped to zero. Lesson: when the symptom is \"timeout\", check your read path, not just the write path.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["API timeout", "cache invalidation", "reframing"], "evidence_type": "code_change", "has_reusable_insight": true}, "recall_queries": [{"query": "payment confirmation 504 timeout root cause", "should_match": true}, {"query": "misdiagnosed API timeout was actually cache issue", "should_match": true}]}
-{"id": "coding-reframe-002", "category": "capture/should_capture/coding_context", "language": "en", "input": "Plot twist on the dashboard performance investigation. Everyone (including me) assumed the UI lag was a React rendering issue. The dashboard has 40+ components and we spent a week optimizing with `React.memo()`, `useMemo()`, and virtualized lists. React DevTools profiler showed render times of 8-15ms per component, which is fine.\n\nThe actual bottleneck was API response serialization on the backend. Our Django REST Framework serializer for the analytics endpoint was doing this:\n\n```python\nclass AnalyticsSerializer(serializers.ModelSerializer):\n computed_metrics = serializers.SerializerMethodField()\n \n def get_computed_metrics(self, obj):\n # This recalculates metrics from raw events on EVERY serialization\n return calculate_metrics(obj.events.all()) # N+1 + CPU-heavy\n```\n\nThe serializer was taking 3.8 seconds to serialize 200 records, each triggering a `calculate_metrics()` call that hit the DB. The frontend was fast — it just had nothing to render while waiting for the response.\n\nFix: pre-compute metrics in a materialized view, serialize from that instead. API response time went from 4.1s to 180ms. All our React optimizations were unnecessary noise — we reverted them to reduce code complexity.\n\nTakeaway: always profile the full request lifecycle before assuming where the bottleneck is. Use the Network tab first, not React DevTools.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["frontend", "backend", "serialization", "misdiagnosis"], "evidence_type": "code_change", "has_reusable_insight": true}, "recall_queries": [{"query": "dashboard slow React rendering vs backend", "should_match": true}, {"query": "DRF serializer performance bottleneck", "should_match": true}]}
-{"id": "coding-reframe-003", "category": "capture/should_capture/coding_context", "language": "ko", "input": "2주 동안 네트워크 문제로 분류했던 이슈의 실제 원인을 찾았습니다. 증상은 마이크로서비스 간 호출이 간헐적으로 타임아웃되는 것이었고, 인프라팀에서 네트워크 패킷 로스 분석까지 했었습니다. tcpdump 결과도 정상이었습니다.\n\n진짜 원인: 커넥션 풀 고갈이었습니다.\n\n```\n[2024-03-05T14:22:33Z] WARN HikariPool-1 - Connection is not available, request timed out after 30000ms.\nActive: 10, Idle: 0, Waiting: 47, Total: 10\n```\n\nHikariCP 풀 사이즈가 기본값 10으로 설정되어 있었는데, 트래픽 증가로 동시 DB 접속이 이를 초과했습니다. 대기 중인 스레드가 30초 타임아웃에 걸리면서 상위 서비스에서는 네트워크 타임아웃처럼 보였던 것입니다.\n\n```diff\n# application.yml\n spring:\n datasource:\n hikari:\n- maximum-pool-size: 10\n+ maximum-pool-size: 30\n+ minimum-idle: 10\n+ connection-timeout: 5000\n+ idle-timeout: 300000\n```\n\n추가로 커넥션 풀 모니터링 메트릭을 Grafana에 등록했습니다. `hikaricp_connections_active`, `hikaricp_connections_pending` 알림을 설정해서 80% 이상이면 슬랙 알림이 오도록 했습니다.\n\n교훈: 타임아웃이 발생하면 네트워크부터 의심하지 말고, 커넥션 풀/스레드 풀 같은 리소스 풀 상태를 먼저 확인해야 합니다.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["connection pool", "커넥션 풀", "network timeout", "HikariCP"], "evidence_type": "runtime_observation", "has_reusable_insight": true}, "recall_queries": [{"query": "마이크로서비스 타임아웃 커넥션 풀", "should_match": true}, {"query": "network timeout actually connection pool exhaustion", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/coding_context/root_cause.jsonl b/benchmark/scenarios/capture/should_capture/coding_context/root_cause.jsonl
deleted file mode 100644
index 70be1b90..00000000
--- a/benchmark/scenarios/capture/should_capture/coding_context/root_cause.jsonl
+++ /dev/null
@@ -1,4 +0,0 @@
-{"id": "coding-root-cause-001", "category": "capture/should_capture/coding_context", "language": "en", "input": "Found the memory leak in the notification service. The WebSocket handler was registering a `message` listener on the shared EventEmitter every time a client connected, but never removing it on disconnect. After a few thousand connect/disconnect cycles, we had tens of thousands of orphaned listeners piling up.\n\nHere's the problematic code in `ws-handler.ts`:\n```\nwsServer.on('connection', (socket) => {\n const handler = (msg) => socket.send(JSON.stringify(msg));\n eventBus.on('notification', handler);\n // no cleanup on 'close' — this is the leak\n});\n```\n\nFix:\n```diff\n wsServer.on('connection', (socket) => {\n const handler = (msg) => socket.send(JSON.stringify(msg));\n eventBus.on('notification', handler);\n- // no cleanup on 'close'\n+ socket.on('close', () => {\n+ eventBus.removeListener('notification', handler);\n+ });\n });\n```\n\nAlso added a safety net — `eventBus.setMaxListeners(100)` with a monitoring alert if listener count exceeds 50. The process was OOM-killed every ~36 hours in production before this fix. Heap snapshot confirmed: 47K `handler` closures holding references to dead sockets.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["memory leak", "WebSocket", "listener", "removeListener"], "evidence_type": "code_change", "has_reusable_insight": true}, "recall_queries": [{"query": "memory leak WebSocket event listener", "should_match": true}, {"query": "notification service OOM crash root cause", "should_match": true}]}
-{"id": "coding-root-cause-002", "category": "capture/should_capture/coding_context", "language": "en", "input": "Root cause of the duplicate order bug: classic TOCTOU race condition in the order creation flow. The frontend calls `GET /api/inventory/:sku` to check stock, displays \"In Stock\", then submits `POST /api/orders`. Between those two requests, another user can claim the last unit.\n\nThe fix uses `SELECT FOR UPDATE SKIP LOCKED` to atomically claim inventory within the order transaction:\n\n```diff\n- const available = await db.query('SELECT quantity FROM inventory WHERE sku = $1', [sku]);\n- if (available.rows[0].quantity < requested) throw new InsufficientStockError();\n- await db.query('UPDATE inventory SET quantity = quantity - $1 WHERE sku = $2', [requested, sku]);\n+ const claimed = await db.query(\n+ 'UPDATE inventory SET quantity = quantity - $1 WHERE sku = $2 AND quantity >= $1 RETURNING quantity',\n+ [requested, sku]\n+ );\n+ if (claimed.rowCount === 0) throw new InsufficientStockError();\n```\n\nWe also wrapped this in a transaction with `SELECT ... FOR UPDATE SKIP LOCKED` so concurrent requests on the same SKU serialize properly without deadlocking. The frontend stock check remains for UX purposes but is no longer the source of truth. Verified with a load test: 500 concurrent orders for 1 remaining unit, exactly 1 succeeds now.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["race condition", "TOCTOU", "duplicate order", "SELECT FOR UPDATE"], "evidence_type": "code_change", "has_reusable_insight": true}, "recall_queries": [{"query": "duplicate order race condition fix", "should_match": true}, {"query": "TOCTOU inventory check", "should_match": true}]}
-{"id": "coding-root-cause-003", "category": "capture/should_capture/coding_context", "language": "en", "input": "Used git bisect to track down the auth regression. Users were getting 401s on the dashboard after last Tuesday's deploy. Bisected across 47 commits:\n\n```\n$ git bisect start\n$ git bisect bad HEAD\n$ git bisect good v2.14.0\nBisecting: 23 revisions left to test after this (roughly 5 steps)\n...\n$ git bisect bad\nBisecting: 11 revisions left to test after this\n...\ncommit a3f7c2e is the first bad commit\nAuthor: dev-charlie\nDate: Tue Mar 3 14:22:11\n\n refactor: extract token validation into middleware\n\n Moved validateToken() from route handler to Express middleware.\n The middleware runs before body parsing, so req.body is undefined\n when it tries to read the refresh token from the request body.\n```\n\nThe issue: `validateToken()` was moved to middleware that runs before `express.json()`. It worked for access tokens (read from `Authorization` header) but failed for refresh token rotation which reads from `req.body`. Fix: reorder middleware so `express.json()` runs before `validateToken`, or read refresh token from a cookie instead. We went with the cookie approach since it's more secure anyway (httpOnly, sameSite=strict). The bisect took 6 steps — worth documenting since we almost blamed the CDN cache.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["git bisect", "auth", "middleware", "token validation"], "evidence_type": "git_bisect", "has_reusable_insight": true}, "recall_queries": [{"query": "auth 401 regression after deploy", "should_match": true}, {"query": "git bisect auth flow broken", "should_match": true}]}
-{"id": "coding-root-cause-004", "category": "capture/should_capture/coding_context", "language": "ko", "input": "스택 트레이스가 완전히 오해를 유발한 케이스를 공유합니다. 프로덕션에서 간헐적 500 에러가 발생했고, 스택 트레이스는 API 레이어를 가리켰습니다:\n\n```\nError: Cannot read properties of undefined (reading 'userId')\n at UserController.getProfile (/app/src/controllers/user.ts:45:32)\n at Layer.handle [as handle_request] (/app/node_modules/express/lib/router/layer.js:95:5)\n```\n\n`user.ts:45`를 아무리 봐도 문제가 없었습니다. 알고 보니 실제 원인은 Redis 캐시 무효화 레이스였습니다. 유저가 프로필을 업데이트하면:\n1. DB에 업데이트 → 성공\n2. 캐시 무효화 요청 → Redis에 전송\n3. 다른 요청이 캐시 미스 → DB에서 조회 → 새 값 캐시\n4. 2번의 무효화가 뒤늦게 도착 → 3번에서 넣은 캐시를 삭제\n5. 또 다른 요청 → 캐시 미스 → DB 조회하는데 이때 connection pool 고갈 → undefined 반환\n\n```diff\n- await redis.del(`user:${userId}`);\n+ await redis.set(`user:${userId}`, JSON.stringify(updatedUser), 'EX', 300);\n```\n\n무효화 대신 write-through로 변경하여 레이스를 제거했습니다. 스택 트레이스만 보고 API 코드를 2일 동안 디버깅한 게 아까웠습니다.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["stack trace", "cache invalidation", "race condition"], "evidence_type": "error_trace", "has_reusable_insight": true}, "recall_queries": [{"query": "스택 트레이스 잘못된 원인 캐시 레이스", "should_match": true}, {"query": "cache invalidation race condition undefined error", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/debugging/edge_cases.jsonl b/benchmark/scenarios/capture/should_capture/debugging/edge_cases.jsonl
deleted file mode 100644
index 145bcf91..00000000
--- a/benchmark/scenarios/capture/should_capture/debugging/edge_cases.jsonl
+++ /dev/null
@@ -1 +0,0 @@
-{"id": "debug-edge-subtle-003", "category": "capture/should_capture/debugging", "language": "en", "input": "FYI — the flaky test in CI (test_concurrent_checkout) is not actually flaky. It exposed a real race condition. The test creates 10 concurrent checkout requests for the last item in stock. On fast machines, the requests serialize through the connection pool and pass. On CI (slower), true concurrency exposes the missing row-level lock. I added SELECT FOR UPDATE and the test now passes deterministically. Don't mark it as @flaky — it was doing its job.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["flaky test", "race condition", "SELECT FOR UPDATE"]}, "recall_queries": [{"query": "flaky test concurrent checkout", "should_match": true}], "notes": "Insight disguised as a casual FYI — contains a debugging breakthrough and a policy statement (don't mark as @flaky)"}
diff --git a/benchmark/scenarios/capture/should_capture/debugging/scenarios.jsonl b/benchmark/scenarios/capture/should_capture/debugging/scenarios.jsonl
deleted file mode 100644
index 3d63d630..00000000
--- a/benchmark/scenarios/capture/should_capture/debugging/scenarios.jsonl
+++ /dev/null
@@ -1,8 +0,0 @@
-{"id": "debug-grpc-keepalive-001", "category": "capture/should_capture/debugging", "language": "en", "input": "Found the root cause of the intermittent gRPC timeouts. The AWS NLB has a 350-second idle timeout, but our gRPC keepalive was set to 600 seconds. Connections were being silently dropped by the NLB, and the client didn't detect it until the next RPC attempt. Fix: set GRPC_KEEPALIVE_TIME_MS to 60000 and GRPC_KEEPALIVE_TIMEOUT_MS to 20000. This matches AWS's recommendation for NLB-backed gRPC services.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["gRPC", "keepalive", "NLB"]}, "recall_queries": [{"query": "gRPC timeout issues with AWS NLB", "should_match": true}, {"query": "What keepalive settings should we use for gRPC?", "should_match": true}]}
-{"id": "debug-memory-leak-002", "category": "capture/should_capture/debugging", "language": "en", "input": "The Node.js memory leak in the notification service was caused by event listeners not being removed on WebSocket disconnect. Each reconnection added a new listener without cleaning up the old one. After 48 hours, the process had 50K+ dangling listeners and OOM-killed. Fix: use AbortController to tie listener lifecycle to the WebSocket connection. Added a process-level listener count alert at 1000.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["memory leak", "WebSocket", "listener"]}, "recall_queries": [{"query": "What caused the notification service memory leak?", "should_match": true}]}
-{"id": "debug-dns-resolution-003", "category": "capture/should_capture/debugging", "language": "en", "input": "Traced the sporadic 5xx errors to DNS resolution caching in the JVM. Our Java services cache DNS indefinitely by default (networkaddress.cache.ttl=-1), but the upstream service sits behind an ALB whose IPs rotate. When ALB scaled in, cached IPs pointed to terminated instances. Fix: set networkaddress.cache.ttl=60 in jvm.options. This is a known JVM footgun that bit us before in the auth service.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["DNS", "JVM", "cache"]}, "recall_queries": [{"query": "JVM DNS caching issue", "should_match": true}, {"query": "Why are Java services getting 5xx to ALB?", "should_match": true}]}
-{"id": "debug-deadlock-004", "category": "capture/should_capture/debugging", "language": "en", "input": "The PostgreSQL deadlock in the inventory service was caused by two concurrent transactions acquiring row locks in different order. Transaction A locks product row then warehouse row; Transaction B locks warehouse row then product row. Solution: enforce a canonical lock ordering — always lock by table name alphabetically (product → warehouse). Added a locking convention doc to the wiki.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["deadlock", "PostgreSQL", "lock ordering"]}, "recall_queries": [{"query": "How do we prevent deadlocks in PostgreSQL?", "should_match": true}]}
-{"id": "debug-cors-preflight-005", "category": "capture/should_capture/debugging", "language": "en", "input": "The CORS preflight failures on the new API endpoints were because our API gateway strips custom headers by default on OPTIONS requests. The browser sends Access-Control-Request-Headers with X-Request-ID, but the gateway's OPTIONS handler doesn't echo it back in Access-Control-Allow-Headers. Fix: explicitly list all custom headers in the gateway's CORS config rather than relying on wildcard, which doesn't work for credentialed requests.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["CORS", "preflight", "gateway"]}, "recall_queries": [{"query": "CORS issues with API gateway", "should_match": true}]}
-{"id": "debug-timezone-bug-006", "category": "capture/should_capture/debugging", "language": "ko", "input": "정산 금액 불일치의 원인을 찾았습니다. 서버는 UTC로 날짜 경계를 계산하는데, 비즈니스 로직의 '오늘 매출'은 KST 기준이어야 합니다. UTC 자정~오전 9시 사이 발생한 거래가 전날로 집계되어 일별 정산이 어긋났습니다. 해결: 정산 쿼리에 KST 오프셋(+09:00) 명시적 적용. 향후 모든 비즈니스 날짜 쿼리는 타임존을 필수 파라미터로 받도록 변경.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["timezone", "정산", "KST"]}, "recall_queries": [{"query": "정산 금액 불일치 원인", "should_match": true}, {"query": "timezone billing bug", "should_match": true}]}
-{"id": "debug-race-condition-007", "category": "capture/should_capture/debugging", "language": "en", "input": "The duplicate order bug was a classic TOCTOU race condition. The frontend checks stock availability, then submits the order in a separate request. Between check and submit, another user can claim the last item. Fix: use SELECT FOR UPDATE SKIP LOCKED in the order creation transaction to atomically claim inventory. The optimistic check on the frontend remains for UX but is not authoritative.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["race condition", "TOCTOU", "duplicate order"]}, "recall_queries": [{"query": "How did we fix the duplicate order bug?", "should_match": true}]}
-{"id": "debug-ssl-handshake-008", "category": "capture/should_capture/debugging", "language": "en", "input": "Intermittent SSL handshake failures between our Go service and the payment provider were caused by TLS session ticket rotation. The provider rotates session tickets every 4 hours, and our Go HTTP client's connection pool holds connections longer than that. When a pooled connection tries to resume with an expired ticket, the handshake fails. Fix: set IdleConnTimeout to 3 hours in the HTTP transport, shorter than the ticket rotation window.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["SSL", "TLS", "session ticket"]}, "recall_queries": [{"query": "SSL handshake failures with payment provider", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/incident/scenarios.jsonl b/benchmark/scenarios/capture/should_capture/incident/scenarios.jsonl
deleted file mode 100644
index c527d83b..00000000
--- a/benchmark/scenarios/capture/should_capture/incident/scenarios.jsonl
+++ /dev/null
@@ -1,6 +0,0 @@
-{"id": "incident-db-failover-001", "category": "capture/should_capture/incident", "language": "en", "input": "Post-mortem: 2-hour outage caused by RDS failover during a maintenance window we didn't know about. AWS scheduled maintenance hit our primary at 3am UTC. The failover itself took 30 seconds, but our connection pool (HikariCP) held stale connections for 10 minutes because we had no connection validation query. The app returned 500s until connections aged out. Action items: 1) Enable connection validation with SELECT 1, 2) Set maxLifetime to 15 minutes, 3) Subscribe to AWS Health events for RDS.", "expected_capture": true, "expected_fields": {"domain": "incident", "status_hint": "accepted", "title_keywords": ["RDS", "failover", "outage"]}, "recall_queries": [{"query": "What happened during the database outage?", "should_match": true}, {"query": "HikariCP connection pool configuration", "should_match": true}]}
-{"id": "incident-cascading-failure-002", "category": "capture/should_capture/incident", "language": "en", "input": "Post-mortem: Cascading failure in the checkout flow. The recommendation service started responding slowly (p99 went from 200ms to 8s) due to a bad model deployment. The checkout service had no timeout on the recommendation call, causing thread pool exhaustion. This backed up into the API gateway, which started rejecting all requests including health checks, triggering auto-scaling to replace 'unhealthy' instances. Resolution: 1) Add 500ms timeout to all non-critical service calls, 2) Implement bulkhead pattern — recommendation calls get their own thread pool, 3) Circuit breaker with 50% error threshold.", "expected_capture": true, "expected_fields": {"domain": "incident", "status_hint": "accepted", "title_keywords": ["cascading failure", "checkout", "timeout"]}, "recall_queries": [{"query": "What caused the checkout cascade failure?", "should_match": true}, {"query": "Bulkhead pattern implementation", "should_match": true}]}
-{"id": "incident-data-loss-003", "category": "capture/should_capture/incident", "language": "en", "input": "Incident: Lost 4 hours of analytics events because the Kafka consumer group offset was reset during a deployment. The Helm chart had `auto.offset.reset=earliest` in dev but `latest` in prod. A misconfigured environment variable override caused the prod consumer to start from latest, skipping all unconsumed messages from the deployment window. Recovery: replayed from the Kafka retention window (7 days). Prevention: remove auto.offset.reset from application config; manage offsets explicitly via consumer group commits.", "expected_capture": true, "expected_fields": {"domain": "incident", "status_hint": "accepted", "title_keywords": ["Kafka", "offset", "data loss"]}, "recall_queries": [{"query": "Kafka consumer offset reset incident", "should_match": true}]}
-{"id": "incident-security-breach-004", "category": "capture/should_capture/incident", "language": "en", "input": "Security incident: An exposed .env file in a public S3 bucket contained database credentials. The bucket was created by a developer for a demo and never had its ACL reviewed. The credentials were for a read-only replica, limiting blast radius. Actions taken: 1) Rotated all credentials immediately, 2) Enabled S3 Block Public Access at the org level, 3) Added Prowler scan to CI that checks for public buckets, 4) Mandatory bucket policy review in PR checklist for any IaC changes.", "expected_capture": true, "expected_fields": {"domain": "security", "status_hint": "accepted", "title_keywords": ["S3", "credentials", "security"]}, "recall_queries": [{"query": "S3 security incident", "should_match": true}, {"query": "How do we prevent public S3 buckets?", "should_match": true}]}
-{"id": "incident-dns-propagation-005", "category": "capture/should_capture/incident", "language": "en", "input": "Post-mortem: 45-minute partial outage during DNS migration from Route53 to Cloudflare. We set the TTL to 300 seconds 24 hours before migration, but some enterprise ISP resolvers ignore low TTLs and cache for up to 48 hours. About 15% of users hit stale DNS for the first hour. Lesson: for critical DNS changes, run dual-stack (old and new) for at least 72 hours, and use a canary domain to verify propagation before cutting over the primary.", "expected_capture": true, "expected_fields": {"domain": "incident", "status_hint": "accepted", "title_keywords": ["DNS", "migration", "propagation"]}, "recall_queries": [{"query": "DNS migration lessons learned", "should_match": true}]}
-{"id": "incident-runaway-query-006", "category": "capture/should_capture/incident", "language": "ko", "input": "장애 리포트: 어제 새벽 대시보드 전체 다운. 원인은 마케팅팀이 실행한 ad-hoc 쿼리가 프로덕션 DB의 전체 orders 테이블을 풀스캔한 것. 3억 건 테이블에 인덱스 없는 LIKE '%keyword%' 쿼리로 CPU 100%, 커넥션 풀 고갈. 대응: 1) 프로덕션 DB에 statement_timeout=30s 설정, 2) 마케팅팀에게 읽기 전용 레플리카 별도 제공, 3) 분석 쿼리용 read replica 자동 라우팅 미들웨어 추가 예정.", "expected_capture": true, "expected_fields": {"domain": "incident", "status_hint": "accepted", "title_keywords": ["runaway query", "풀스캔", "timeout"]}, "recall_queries": [{"query": "프로덕션 DB 다운 원인", "should_match": true}, {"query": "How to prevent runaway queries?", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/pr_review/edge_cases.jsonl b/benchmark/scenarios/capture/should_capture/pr_review/edge_cases.jsonl
deleted file mode 100644
index e6cd6fca..00000000
--- a/benchmark/scenarios/capture/should_capture/pr_review/edge_cases.jsonl
+++ /dev/null
@@ -1,2 +0,0 @@
-{"id": "pr-edge-subtle-decision-001", "category": "capture/should_capture/pr_review", "language": "en", "input": "PR #1105 — I see you're using sync file I/O in the request handler. This works now but will become a bottleneck at scale. Our load tests show the event loop blocks for 200ms on large file uploads with sync reads. Let's use aiofiles instead. I know it adds a dependency, but we already use aiohttp, so the async ecosystem cost is already paid. This should be our standard for any file I/O in async code paths going forward.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["aiofiles", "async", "file I/O"]}, "recall_queries": [{"query": "async file I/O standard", "should_match": true}], "notes": "PR review that establishes a forward-looking standard — captures a process decision, not just a code review"}
-{"id": "pr-edge-cross-team-002", "category": "capture/should_capture/pr_review", "language": "en", "input": "PR #892 — Data team review: This PR adds a new user_events table but doesn't follow our event naming convention (subject_verb_object). The column 'type' should be 'event_type', and 'data' should be 'event_payload'. Also, we need a 'processed_at' timestamp column for the ETL pipeline — without it, we can't track ingestion lag. These aren't optional — they're required for compatibility with our dbt models. Blocking until the schema matches our data contracts.", "expected_capture": true, "expected_fields": {"domain": "data", "status_hint": "rejected", "title_keywords": ["event", "naming convention", "schema"]}, "recall_queries": [{"query": "event table naming convention", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/pr_review/scenarios.jsonl b/benchmark/scenarios/capture/should_capture/pr_review/scenarios.jsonl
deleted file mode 100644
index 6b4ed6f5..00000000
--- a/benchmark/scenarios/capture/should_capture/pr_review/scenarios.jsonl
+++ /dev/null
@@ -1,8 +0,0 @@
-{"id": "pr-arch-rejection-001", "category": "capture/should_capture/pr_review", "language": "en", "input": "PR #342 Review — Rejecting this approach. The PR adds a Redis-based distributed lock for the payment idempotency check, but this introduces a single point of failure. If Redis goes down, all payments are blocked. We should use the database's native advisory locks (pg_advisory_lock) instead — they're already consistent with the transaction and don't require an additional infrastructure dependency. The performance difference is negligible for our payment volume (< 100 TPS).", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "rejected", "title_keywords": ["distributed lock", "payment", "advisory lock"]}, "recall_queries": [{"query": "How do we handle payment idempotency?", "should_match": true}, {"query": "Why not Redis for distributed locks?", "should_match": true}]}
-{"id": "pr-perf-tradeoff-002", "category": "capture/should_capture/pr_review", "language": "en", "input": "PR #567 Review — This ORM query generates N+1 selects. I profiled it locally: for a user with 50 teams, it fires 51 queries (1 for user + 50 for team memberships). With eager loading (joinedload), it's a single query with JOINs. The trade-off is the joined query returns more data over the wire, but for our typical team count (< 20), the network overhead is trivial compared to 20 round trips. Requesting change to use joinedload here.", "expected_capture": true, "expected_fields": {"domain": "debugging", "status_hint": "accepted", "title_keywords": ["N+1", "ORM", "eager loading"]}, "recall_queries": [{"query": "N+1 query optimization approach", "should_match": true}]}
-{"id": "pr-security-finding-003", "category": "capture/should_capture/pr_review", "language": "en", "input": "PR #891 Security Review — Found an SSRF vulnerability. The webhook URL validation only checks the scheme (http/https) but doesn't block internal IPs. An attacker could register a webhook pointing to http://169.254.169.254/latest/meta-data/ to exfiltrate AWS instance metadata. We need to: 1) Block RFC 1918 ranges and link-local addresses, 2) Resolve the hostname and validate the IP before making the request, 3) Use a dedicated egress proxy for webhook deliveries.", "expected_capture": true, "expected_fields": {"domain": "security", "status_hint": "accepted", "title_keywords": ["SSRF", "webhook", "validation"]}, "recall_queries": [{"query": "How do we prevent SSRF in webhooks?", "should_match": true}, {"query": "Webhook URL validation requirements", "should_match": true}]}
-{"id": "pr-api-design-004", "category": "capture/should_capture/pr_review", "language": "en", "input": "PR #234 — Pushing back on the REST endpoint design. The PR uses PATCH /users/:id/settings with a flat body, but settings is becoming a deeply nested object (notification preferences, privacy, display, integrations). We should use JSON Patch (RFC 6902) instead of merge patch for fine-grained updates. This prevents the 'null vs. absent' ambiguity that bit us in the billing settings last month. Also, it gives us an audit trail of exactly which fields changed.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "rejected", "title_keywords": ["JSON Patch", "REST", "settings"]}, "recall_queries": [{"query": "How do we handle partial updates to user settings?", "should_match": true}]}
-{"id": "pr-migration-strategy-005", "category": "capture/should_capture/pr_review", "language": "en", "input": "PR #445 — The database migration adds a NOT NULL column without a default value. This will lock the table for the duration of ALTER TABLE on PostgreSQL < 11, and even on PG 14+ it requires a full table rewrite. For our users table (80M rows), this could mean 10+ minutes of downtime. Better approach: 1) Add column as nullable, 2) Backfill in batches of 10K, 3) Add NOT NULL constraint with NOT VALID, 4) VALIDATE CONSTRAINT separately. This is our standard zero-downtime migration pattern.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "rejected", "title_keywords": ["migration", "NOT NULL", "zero-downtime"]}, "recall_queries": [{"query": "How do we add NOT NULL columns safely?", "should_match": true}, {"query": "Zero-downtime migration pattern", "should_match": true}]}
-{"id": "pr-dependency-decision-006", "category": "capture/should_capture/pr_review", "language": "en", "input": "PR #678 — Approving the switch from moment.js to date-fns. Moment is effectively dead (in maintenance mode since 2020), and it adds 230KB to our bundle because it can't be tree-shaken. date-fns is modular — we only import the 12 functions we use, bringing the date library footprint from 230KB to 8KB. The API surface is different enough that we should do this in one PR rather than incrementally to avoid having two date libraries in the bundle.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["date-fns", "moment.js", "bundle size"]}, "recall_queries": [{"query": "What date library do we use?", "should_match": true}]}
-{"id": "pr-error-handling-007", "category": "capture/should_capture/pr_review", "language": "ko", "input": "PR #912 리뷰 — 에러 처리 방식 변경 요청. 현재 모든 API 에러를 catch-all로 500 반환하고 있는데, 클라이언트가 재시도 가능한 에러(429, 503)와 불가능한 에러(400, 404)를 구분할 수 없습니다. RFC 7807 Problem Details 형식을 채택합시다. type URI로 에러 카테고리를 구분하고, retryable 필드를 추가해서 클라이언트의 재시도 로직을 단순화해야 합니다. 이건 API v2의 표준 에러 형식으로 확정합시다.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["RFC 7807", "error handling", "retryable"]}, "recall_queries": [{"query": "API 에러 처리 표준", "should_match": true}, {"query": "RFC 7807 error format", "should_match": true}]}
-{"id": "pr-testing-gap-008", "category": "capture/should_capture/pr_review", "language": "en", "input": "PR #1023 — This PR adds a new payment provider integration but has zero test coverage for the webhook signature verification. Last time we shipped a payment webhook without proper signature validation (Stripe integration, March), we had to hotfix it within 2 hours after a security audit flagged it. Blocking this PR until we have: 1) Unit tests for HMAC signature verification, 2) Integration test with a known test payload, 3) Negative test for tampered signatures. Non-negotiable for any payment-related code.", "expected_capture": true, "expected_fields": {"domain": "qa", "status_hint": "rejected", "title_keywords": ["webhook", "signature", "test coverage"]}, "recall_queries": [{"query": "Payment webhook testing requirements", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/process/scenarios.jsonl b/benchmark/scenarios/capture/should_capture/process/scenarios.jsonl
deleted file mode 100644
index 05ad9c87..00000000
--- a/benchmark/scenarios/capture/should_capture/process/scenarios.jsonl
+++ /dev/null
@@ -1,6 +0,0 @@
-{"id": "proc-code-review-policy-001", "category": "capture/should_capture/process", "language": "en", "input": "New code review policy: all PRs require at least 2 approvals, with one from a code owner of the modified path. Exception: documentation-only changes need 1 approval. We're also introducing a 24-hour SLA for first review response. Reviews older than 24 hours will be flagged in the team Slack channel. This was prompted by the auth service incident where a single-approval PR introduced the session fixation vulnerability.", "expected_capture": true, "expected_fields": {"domain": "process", "status_hint": "accepted", "title_keywords": ["code review", "policy", "approval"]}, "recall_queries": [{"query": "What's our code review policy?", "should_match": true}, {"query": "How many approvals do PRs need?", "should_match": true}]}
-{"id": "proc-on-call-rotation-002", "category": "capture/should_capture/process", "language": "en", "input": "Restructuring on-call rotations. Moving from per-service on-call to a tiered model. Tier 1 (first responder): rotates weekly across all backend engineers, handles triage and initial response. Tier 2 (domain expert): the service owner, escalated to within 15 minutes if Tier 1 can't resolve. This ensures no single person is on-call for their service 24/7. Compensation: $200/week for Tier 1 on-call, $100/week for Tier 2 standby.", "expected_capture": true, "expected_fields": {"domain": "ops", "status_hint": "accepted", "title_keywords": ["on-call", "rotation", "tiered"]}, "recall_queries": [{"query": "How does our on-call rotation work?", "should_match": true}]}
-{"id": "proc-deployment-freeze-003", "category": "capture/should_capture/process", "language": "en", "input": "Implementing deployment freezes for the last week of each quarter. Sales team reports that production issues during quarter-end close cause deal slippage. The freeze applies to all production deployments except critical security patches (P0/P1). Feature branches can still be merged to main; they just won't be deployed until the freeze lifts. This is a compromise — the sales team wanted a 2-week freeze but we negotiated to 1 week.", "expected_capture": true, "expected_fields": {"domain": "process", "status_hint": "accepted", "title_keywords": ["deployment freeze", "quarter-end"]}, "recall_queries": [{"query": "Do we have deployment freezes?", "should_match": true}]}
-{"id": "proc-sprint-cadence-004", "category": "capture/should_capture/process", "language": "en", "input": "Switching from 2-week sprints to 1-week sprints for the platform team. The 2-week cycle had too much scope creep — by day 8, the sprint goal was unrecognizable. Shorter sprints force smaller, shippable increments. We'll keep 2-week sprints for the product team since their features naturally take longer and they have better scope discipline. Retros move from biweekly to monthly to avoid meeting fatigue.", "expected_capture": true, "expected_fields": {"domain": "process", "status_hint": "accepted", "title_keywords": ["sprint", "cadence", "1-week"]}, "recall_queries": [{"query": "What sprint cadence does the platform team use?", "should_match": true}]}
-{"id": "proc-documentation-standard-005", "category": "capture/should_capture/process", "language": "en", "input": "Adopting ADR (Architecture Decision Records) as our standard documentation format for technical decisions. Each ADR follows the MADR template: Status, Context, Decision, Consequences. ADRs live in the repo under docs/adr/ and are numbered sequentially. Every architectural decision discussed in Slack or meetings must have a corresponding ADR within 48 hours. This replaces our ad-hoc Confluence pages which nobody can find.", "expected_capture": true, "expected_fields": {"domain": "process", "status_hint": "accepted", "title_keywords": ["ADR", "documentation", "standard"]}, "recall_queries": [{"query": "How do we document architectural decisions?", "should_match": true}]}
-{"id": "proc-hiring-bar-006", "category": "capture/should_capture/process", "language": "en", "input": "Updated our engineering hiring bar. We're adding a system design round for all candidates (previously only for seniors). Too many mid-level hires couldn't reason about distributed systems, which is table stakes for our microservices architecture. The new loop: 1 coding round (LC medium), 1 system design (simplified for mids), 1 behavioral, 1 pair programming on a real codebase task. Dropping the take-home assignment — candidates report it takes 8+ hours and it filters out people with families.", "expected_capture": true, "expected_fields": {"domain": "hr", "status_hint": "accepted", "title_keywords": ["hiring", "interview", "system design"]}, "recall_queries": [{"query": "What's our engineering interview process?", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/product/scenarios.jsonl b/benchmark/scenarios/capture/should_capture/product/scenarios.jsonl
deleted file mode 100644
index 094e2514..00000000
--- a/benchmark/scenarios/capture/should_capture/product/scenarios.jsonl
+++ /dev/null
@@ -1,8 +0,0 @@
-{"id": "prod-pricing-model-001", "category": "capture/should_capture/product", "language": "en", "input": "We're switching from per-seat pricing to usage-based pricing for the API product. Customer interviews show that seat-based pricing penalizes teams with many occasional users. Usage-based (per-API-call with a free tier of 10K calls/month) better aligns with how customers perceive value. The free tier ensures we don't lose the long-tail of small customers who drive word-of-mouth growth. Finance modeled this and expects 15% revenue increase within 6 months.", "expected_capture": true, "expected_fields": {"domain": "product", "status_hint": "accepted", "title_keywords": ["pricing", "usage-based"]}, "recall_queries": [{"query": "What pricing model are we using for the API product?", "should_match": true}, {"query": "Why did we move away from per-seat pricing?", "should_match": true}]}
-{"id": "prod-feature-flag-002", "category": "capture/should_capture/product", "language": "en", "input": "Decided to kill the real-time collaboration feature. After 3 months in beta, usage data shows only 4% of users ever enable it, and it accounts for 30% of our WebSocket infrastructure costs. The engineering effort to make it production-ready (conflict resolution, offline sync) would take 2 more quarters. We'll redirect that investment into the async commenting feature, which has 60% adoption in the same beta cohort.", "expected_capture": true, "expected_fields": {"domain": "product", "status_hint": "rejected", "title_keywords": ["real-time collaboration", "killed"]}, "recall_queries": [{"query": "Why did we kill real-time collaboration?", "should_match": true}, {"query": "What happened to the collaboration feature?", "should_match": true}]}
-{"id": "prod-onboarding-flow-003", "category": "capture/should_capture/product", "language": "en", "input": "Redesigning the onboarding flow. Currently 40% drop-off at the email verification step. We're moving to a progressive onboarding model: let users explore the product immediately with a temporary workspace, then prompt for email verification when they try to save or share. A/B test on 10% of traffic showed this reduces drop-off to 12% and increases Day-7 retention by 25%.", "expected_capture": true, "expected_fields": {"domain": "product", "status_hint": "accepted", "title_keywords": ["onboarding", "progressive", "verification"]}, "recall_queries": [{"query": "How does our onboarding flow work?", "should_match": true}]}
-{"id": "prod-mobile-first-004", "category": "capture/should_capture/product", "language": "en", "input": "We're deprioritizing the native mobile app in favor of a PWA approach. Our mobile DAU is only 8% of total, and maintaining iOS and Android codebases doubles our frontend team's workload. The PWA with service workers covers 90% of the mobile use cases (offline access, push notifications). We'll keep the native app in maintenance mode but no new features. The 3 mobile engineers will transition to the core web team.", "expected_capture": true, "expected_fields": {"domain": "product", "status_hint": "accepted", "title_keywords": ["PWA", "mobile app", "deprioritize"]}, "recall_queries": [{"query": "Are we still building the native mobile app?", "should_match": true}]}
-{"id": "prod-i18n-strategy-005", "category": "capture/should_capture/product", "language": "en", "input": "Adopting a phased i18n strategy. Phase 1 (Q1): Japanese and Korean — our two largest non-English markets by revenue. Phase 2 (Q2): German and French for EU expansion. We rejected machine translation for UI strings because quality directly impacts trust; instead, we'll use a professional localization vendor with in-app review by native-speaking team members. Cost: $15K per language, which is justified by $200K+ ARR opportunity in each market.", "expected_capture": true, "expected_fields": {"domain": "product", "status_hint": "accepted", "title_keywords": ["i18n", "localization", "Japanese", "Korean"]}, "recall_queries": [{"query": "What's our internationalization strategy?", "should_match": true}, {"query": "Which languages are we localizing first?", "should_match": true}]}
-{"id": "prod-data-export-006", "category": "capture/should_capture/product", "language": "en", "input": "Adding a full data export feature for enterprise compliance. Customers need to export all their data in a portable format for GDPR Article 20 (data portability). We'll support JSON and CSV formats. The export job runs async and sends a download link via email when ready. Max export size: 10GB. This is a blocker for 3 enterprise deals worth $500K combined ARR.", "expected_capture": true, "expected_fields": {"domain": "product", "status_hint": "accepted", "title_keywords": ["data export", "GDPR", "enterprise"]}, "recall_queries": [{"query": "Do we have a data export feature?", "should_match": true}]}
-{"id": "prod-ab-test-result-007", "category": "capture/should_capture/product", "language": "ko", "input": "A/B 테스트 결과 공유. 결제 페이지에서 '연간 구독' 옵션을 기본 선택으로 변경한 변형이 월간 대비 연간 전환율을 34%에서 51%로 올렸습니다. 하지만 전체 결제 전환율은 2% 하락했습니다 — 연간 가격이 부담스러워 이탈하는 사용자가 있었기 때문. 결론: 연간 기본 선택을 적용하되, 월간 가격을 더 눈에 띄게 표시하는 디자인으로 재테스트 예정.", "expected_capture": true, "expected_fields": {"domain": "product", "status_hint": "accepted", "title_keywords": ["A/B test", "연간 구독", "결제"]}, "recall_queries": [{"query": "결제 페이지 A/B 테스트 결과", "should_match": true}]}
-{"id": "prod-sunset-v1-api-008", "category": "capture/should_capture/product", "language": "en", "input": "Setting the timeline to sunset API v1. Currently 12% of API traffic still hits v1 endpoints. We'll send deprecation notices starting next month, with a hard shutdown in 6 months. Migration guide is published, and we're offering 1-on-1 migration support for the top 20 v1 consumers. v1 will return 410 Gone after the shutdown date. This frees up 2 engineers from v1 maintenance.", "expected_capture": true, "expected_fields": {"domain": "product", "status_hint": "accepted", "title_keywords": ["API v1", "sunset", "deprecation"]}, "recall_queries": [{"query": "When are we shutting down API v1?", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_capture/tradeoff/scenarios.jsonl b/benchmark/scenarios/capture/should_capture/tradeoff/scenarios.jsonl
deleted file mode 100644
index 2dad7ac0..00000000
--- a/benchmark/scenarios/capture/should_capture/tradeoff/scenarios.jsonl
+++ /dev/null
@@ -1,6 +0,0 @@
-{"id": "tradeoff-consistency-latency-001", "category": "capture/should_capture/tradeoff", "language": "en", "input": "Trade-off analysis for the inventory system: strong consistency vs. eventual consistency. Strong consistency (synchronous replication) adds 50ms latency per write and limits throughput to ~2K writes/sec. Eventual consistency gives us 10K writes/sec but allows a 2-second window where two users could order the last item simultaneously. Decision: eventual consistency for browse/search, strong consistency for the actual purchase transaction. The checkout path can tolerate 50ms extra since the payment provider already adds 800ms.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["consistency", "inventory", "trade-off"]}, "recall_queries": [{"query": "Consistency model for inventory system", "should_match": true}]}
-{"id": "tradeoff-build-vs-buy-002", "category": "capture/should_capture/tradeoff", "language": "en", "input": "Build vs. buy analysis for the notification system. Building in-house gives us full control and costs ~$30K/year in infrastructure. Buying (Twilio Notify + SendGrid) costs ~$80K/year but saves 3 engineer-months of development and ongoing maintenance. We're going with buy. The $50K premium is justified because our core competency is not notification delivery, and the engineering time is better spent on product features that drive revenue.", "expected_capture": true, "expected_fields": {"domain": "product", "status_hint": "accepted", "title_keywords": ["build vs buy", "notification"]}, "recall_queries": [{"query": "Did we build or buy the notification system?", "should_match": true}, {"query": "Build vs buy decision for notifications", "should_match": true}]}
-{"id": "tradeoff-mono-vs-poly-repo-003", "category": "capture/should_capture/tradeoff", "language": "en", "input": "Mono-repo vs. poly-repo: staying with poly-repo. We evaluated moving to a monorepo (Turborepo/Nx) for better code sharing and atomic cross-service changes. However, our CI pipeline is already 15 minutes and a monorepo would make it worse without significant investment in incremental builds. Also, our teams are organized around services and prefer autonomous deploy cycles. The code sharing problem will be solved with a shared npm package registry instead.", "expected_capture": true, "expected_fields": {"domain": "process", "status_hint": "rejected", "title_keywords": ["monorepo", "poly-repo"]}, "recall_queries": [{"query": "Are we using a monorepo?", "should_match": true}, {"query": "Why didn't we adopt monorepo?", "should_match": true}]}
-{"id": "tradeoff-serverless-vs-k8s-004", "category": "capture/should_capture/tradeoff", "language": "en", "input": "For the new image processing pipeline: Lambda vs. Kubernetes Jobs. Lambda is simpler to deploy and auto-scales to zero, but has a 15-minute timeout and 10GB memory limit. Our image processing can take 20+ minutes for large batches and needs 16GB RAM for the ML model. Going with K8s Jobs on spot instances. Cost is comparable (Lambda's per-invocation pricing is actually more expensive at our volume), and we get full control over resources. Downside: we need to manage the spot interruption handler.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["Lambda", "Kubernetes", "image processing"]}, "recall_queries": [{"query": "Why didn't we use Lambda for image processing?", "should_match": true}]}
-{"id": "tradeoff-testing-strategy-005", "category": "capture/should_capture/tradeoff", "language": "en", "input": "Shifting our testing strategy from heavy E2E to contract testing. Currently 60% of our test suite is Selenium E2E tests that take 45 minutes and are flaky (15% failure rate on clean code). We're inverting the pyramid: 70% unit, 20% contract (Pact), 10% E2E for critical paths only. The E2E suite will be trimmed to ~50 tests covering checkout, auth, and billing. This should bring CI from 45 to 12 minutes.", "expected_capture": true, "expected_fields": {"domain": "qa", "status_hint": "accepted", "title_keywords": ["testing", "contract", "E2E"]}, "recall_queries": [{"query": "What's our testing strategy?", "should_match": true}, {"query": "Why are we reducing E2E tests?", "should_match": true}]}
-{"id": "tradeoff-vendor-lock-006", "category": "capture/should_capture/tradeoff", "language": "ko", "input": "AWS 종속성 분석. 현재 17개 AWS 서비스 사용 중. 완전한 멀티클라우드는 비현실적(비용 2배, 복잡도 3배). 대신 '탈출 가능한 종속(escapable lock-in)' 전략 채택: 컴퓨팅과 스토리지는 표준 인터페이스(컨테이너, S3 호환 API) 유지, 매니지드 서비스(DynamoDB, SQS 등)는 추상화 레이어 없이 직접 사용. 이유: 추상화 레이어 자체가 유지보수 부담이고, 실제로 클라우드 이전이 필요할 확률은 낮음.", "expected_capture": true, "expected_fields": {"domain": "architecture", "status_hint": "accepted", "title_keywords": ["AWS", "vendor lock-in", "멀티클라우드"]}, "recall_queries": [{"query": "AWS 종속성 전략", "should_match": true}, {"query": "Are we doing multi-cloud?", "should_match": true}]}
diff --git a/benchmark/scenarios/capture/should_not_capture/casual/scenarios.jsonl b/benchmark/scenarios/capture/should_not_capture/casual/scenarios.jsonl
deleted file mode 100644
index 17e60a50..00000000
--- a/benchmark/scenarios/capture/should_not_capture/casual/scenarios.jsonl
+++ /dev/null
@@ -1,4 +0,0 @@
-{"id": "casual-lunch-001", "category": "capture/should_not_capture/casual", "language": "en", "input": "Hey team, anyone want to grab lunch? I'm thinking Thai food today.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "casual-weekend-002", "category": "capture/should_not_capture/casual", "language": "en", "input": "Happy Friday everyone! Any fun plans for the weekend? I'm going hiking if the weather holds up.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "casual-birthday-003", "category": "capture/should_not_capture/casual", "language": "ko", "input": "민수 생일 축하해! 🎂 오늘 저녁 회식 7시에 강남역 근처에서 하는 거 맞지?", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "casual-coffee-004", "category": "capture/should_not_capture/casual", "language": "en", "input": "The new coffee machine on the 3rd floor is amazing. Way better than the old Keurig. Anyone tried the oat milk option?", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
diff --git a/benchmark/scenarios/capture/should_not_capture/code_noise/scenarios.jsonl b/benchmark/scenarios/capture/should_not_capture/code_noise/scenarios.jsonl
deleted file mode 100644
index 20af1b8a..00000000
--- a/benchmark/scenarios/capture/should_not_capture/code_noise/scenarios.jsonl
+++ /dev/null
@@ -1,4 +0,0 @@
-{"id": "coding-noise-001", "category": "capture/should_not_capture/code_noise", "language": "en", "input": "Quick type fix in the API client.\n\n```diff\n// src/services/api-client.ts\n- export async function fetchUsers(): Promise {\n+ export async function fetchUsers(): Promise> {\n const response = await axios.get('/api/users');\n return response.data;\n }\n```\n\nWas using `any` as a placeholder when we first wrote this. Now that the `ApiResponse` type exists, swapped it in. No behavior change, just type safety.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "coding-noise-002", "category": "capture/should_not_capture/code_noise", "language": "en", "input": "Renamed some variables for clarity. No logic changes.\n\n```diff\n// src/hooks/useData.ts\n- const getData = async () => {\n+ const fetchUserProfile = async () => {\n const res = await api.get(`/users/${userId}`);\n- setData(res.data);\n+ setUserProfile(res.data);\n };\n\n- useEffect(() => { getData(); }, [userId]);\n+ useEffect(() => { fetchUserProfile(); }, [userId]);\n```\n\nAlso renamed the file from `useData.ts` to `useUserProfile.ts` and updated all 3 import sites. The old name was too generic — we have 6 hooks that could be called `useData`.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "coding-noise-003", "category": "capture/should_not_capture/code_noise", "language": "en", "input": "Bumped lodash from 4.17.20 to 4.17.21. Security patch for prototype pollution vulnerability (CVE-2021-23337).\n\n```diff\n// package.json\n- \"lodash\": \"4.17.20\",\n+ \"lodash\": \"4.17.21\",\n```\n\nRan the full test suite — all 312 tests pass. No breaking changes in this patch version. The CVE affects `_.template()` which we don't use, but Snyk is flagging it and we want a clean security report.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "coding-noise-004", "category": "capture/should_not_capture/code_noise", "language": "en", "input": "All 47 tests pass after the refactor. CI is green.\n\n```\n PASS src/services/__tests__/auth.test.ts (3.421s)\n PASS src/services/__tests__/billing.test.ts (1.892s)\n PASS src/services/__tests__/inventory.test.ts (2.103s)\n PASS src/middleware/__tests__/cors.test.ts (0.445s)\n PASS src/utils/__tests__/format.test.ts (0.312s)\n\nTest Suites: 5 passed, 5 total\nTests: 47 passed, 47 total\nSnapshots: 0 total\nTime: 8.173s\n```\n\nLooks good to merge. No new tests needed since we only moved code around without changing behavior.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
diff --git a/benchmark/scenarios/capture/should_not_capture/pr_noise/scenarios.jsonl b/benchmark/scenarios/capture/should_not_capture/pr_noise/scenarios.jsonl
deleted file mode 100644
index 25c2f840..00000000
--- a/benchmark/scenarios/capture/should_not_capture/pr_noise/scenarios.jsonl
+++ /dev/null
@@ -1,5 +0,0 @@
-{"id": "pr-noise-lgtm-001", "category": "capture/should_not_capture/pr_noise", "language": "en", "input": "LGTM! 👍", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "pr-noise-nitpick-002", "category": "capture/should_not_capture/pr_noise", "language": "en", "input": "nit: missing trailing comma on line 42. Also, can you rename `temp` to something more descriptive like `pendingOrders`?", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "pr-noise-lint-003", "category": "capture/should_not_capture/pr_noise", "language": "en", "input": "CI failed: ESLint found 3 warnings.\n- src/utils/format.ts:12 — Unexpected any. Specify a different type. (@typescript-eslint/no-explicit-any)\n- src/utils/format.ts:25 — 'result' is assigned a value but never used. (@typescript-eslint/no-unused-vars)\n- src/components/Table.tsx:8 — Missing return type on function. (@typescript-eslint/explicit-function-return-type)", "expected_capture": false, "expected_fields": {}, "recall_queries": [], "notes": "Automated lint output — no human decision"}
-{"id": "pr-noise-merge-conflict-004", "category": "capture/should_not_capture/pr_noise", "language": "en", "input": "This branch has merge conflicts with main. Please rebase and resolve the conflicts in:\n- src/services/auth.ts\n- src/middleware/cors.ts\n\nI'll re-review once the conflicts are resolved.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "pr-noise-approval-005", "category": "capture/should_not_capture/pr_noise", "language": "en", "input": "Looks good to me. Tested locally and the new endpoint works as expected. Approved.", "expected_capture": false, "expected_fields": {}, "recall_queries": [], "notes": "Simple approval without architectural rationale"}
diff --git a/benchmark/scenarios/capture/should_not_capture/question/scenarios.jsonl b/benchmark/scenarios/capture/should_not_capture/question/scenarios.jsonl
deleted file mode 100644
index 0814d2a8..00000000
--- a/benchmark/scenarios/capture/should_not_capture/question/scenarios.jsonl
+++ /dev/null
@@ -1,4 +0,0 @@
-{"id": "question-env-setup-001", "category": "capture/should_not_capture/question", "language": "en", "input": "Does anyone know how to set up the local dev environment for the billing service? I keep getting a Docker compose error about the postgres volume.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "question-api-docs-002", "category": "capture/should_not_capture/question", "language": "en", "input": "Where can I find the API documentation for the internal user service? Is it in Confluence or in the repo?", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "question-permission-003", "category": "capture/should_not_capture/question", "language": "en", "input": "Who has admin access to the production AWS account? I need to check the CloudWatch logs for the payment service.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "question-deadline-004", "category": "capture/should_not_capture/question", "language": "ko", "input": "이번 스프린트 마감일이 금요일인가요 다음 주 월요일인가요? 캘린더에 두 개가 다르게 되어 있어서 헷갈립니다.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
diff --git a/benchmark/scenarios/capture/should_not_capture/slop/edge_cases.jsonl b/benchmark/scenarios/capture/should_not_capture/slop/edge_cases.jsonl
deleted file mode 100644
index 487326e2..00000000
--- a/benchmark/scenarios/capture/should_not_capture/slop/edge_cases.jsonl
+++ /dev/null
@@ -1,2 +0,0 @@
-{"id": "slop-edge-decision-words-001", "category": "capture/should_not_capture/slop", "language": "en", "input": "I think we should probably consider looking into maybe switching to a different database at some point in the future when we have time. PostgreSQL or MongoDB could be interesting options. Just something to think about. No rush on this, just an idea I had.", "expected_capture": false, "expected_fields": {}, "recall_queries": [], "notes": "Contains decision-related words but is entirely non-committal — no actual decision made"}
-{"id": "slop-edge-verbose-nothing-002", "category": "capture/should_not_capture/slop", "language": "en", "input": "Just wanted to follow up on our earlier conversation about the authentication flow. I've been thinking about it more and I agree that it's a really important topic that we should definitely prioritize. Let's schedule a meeting to discuss further. Would next Tuesday work for everyone? I'll send a calendar invite.", "expected_capture": false, "expected_fields": {}, "recall_queries": [], "notes": "Sounds important but zero decisions or conclusions — pure scheduling fluff"}
diff --git a/benchmark/scenarios/capture/should_not_capture/slop/scenarios.jsonl b/benchmark/scenarios/capture/should_not_capture/slop/scenarios.jsonl
deleted file mode 100644
index 24845075..00000000
--- a/benchmark/scenarios/capture/should_not_capture/slop/scenarios.jsonl
+++ /dev/null
@@ -1,4 +0,0 @@
-{"id": "slop-ai-summary-001", "category": "capture/should_not_capture/slop", "language": "en", "input": "Great question! Here's a comprehensive overview of our system architecture. Our platform leverages cutting-edge microservices architecture with state-of-the-art containerization. We utilize industry-leading best practices including CI/CD pipelines, infrastructure as code, and observability. The system is designed to be highly scalable, resilient, and maintainable. Each component follows SOLID principles and clean architecture patterns to ensure maximum flexibility and extensibility.", "expected_capture": false, "expected_fields": {}, "recall_queries": [], "notes": "AI-generated fluff with no concrete decisions or specifics"}
-{"id": "slop-ai-advice-002", "category": "capture/should_not_capture/slop", "language": "en", "input": "I'd recommend considering several key factors when making this decision. First, you'll want to evaluate the trade-offs between performance and maintainability. Second, consider the long-term implications for your team's productivity. Third, think about how this aligns with your overall strategic goals. It's important to weigh all these factors carefully before making a final decision. Hope this helps! Let me know if you need any further clarification.", "expected_capture": false, "expected_fields": {}, "recall_queries": [], "notes": "Generic advice with no actual decision or specific context"}
-{"id": "slop-meeting-recap-003", "category": "capture/should_not_capture/slop", "language": "en", "input": "Meeting Recap: Today's meeting was very productive! We discussed several important topics including the roadmap, team structure, and upcoming milestones. Everyone shared their thoughts and we had a great discussion. We agreed that these are important areas to focus on and will continue the conversation in our next meeting. Action items will be shared separately.", "expected_capture": false, "expected_fields": {}, "recall_queries": [], "notes": "Meeting summary with zero substance — no decisions, no specifics"}
-{"id": "slop-ai-analysis-004", "category": "capture/should_not_capture/slop", "language": "en", "input": "After careful analysis of the current landscape, it's clear that the technology ecosystem is rapidly evolving. Organizations need to stay ahead of the curve by embracing digital transformation and leveraging emerging technologies. Key trends include artificial intelligence, cloud computing, and DevOps practices. By adopting these technologies, teams can improve efficiency, reduce costs, and deliver better outcomes for their stakeholders.", "expected_capture": false, "expected_fields": {}, "recall_queries": [], "notes": "Pure buzzword soup with no team-specific decision or insight"}
diff --git a/benchmark/scenarios/capture/should_not_capture/status_update/scenarios.jsonl b/benchmark/scenarios/capture/should_not_capture/status_update/scenarios.jsonl
deleted file mode 100644
index c7e72a2e..00000000
--- a/benchmark/scenarios/capture/should_not_capture/status_update/scenarios.jsonl
+++ /dev/null
@@ -1,4 +0,0 @@
-{"id": "status-standup-001", "category": "capture/should_not_capture/status_update", "language": "en", "input": "Standup update: Yesterday I worked on the login page redesign, got the Figma mockups approved. Today I'll start implementing the new layout. No blockers.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "status-progress-002", "category": "capture/should_not_capture/status_update", "language": "en", "input": "Quick update — the database migration script is 70% done. Should be ready for review by EOD tomorrow. Running into some edge cases with null values but nothing blocking.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "status-deploy-003", "category": "capture/should_not_capture/status_update", "language": "en", "input": "Deployed v2.3.1 to staging. All smoke tests passing. Will promote to production after QA signs off, probably Thursday.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
-{"id": "status-meeting-004", "category": "capture/should_not_capture/status_update", "language": "ko", "input": "미팅 참석 알림: 내일 오후 2시 디자인 리뷰 미팅 참석합니다. 준비 자료는 Figma 링크로 공유했습니다.", "expected_capture": false, "expected_fields": {}, "recall_queries": []}
diff --git a/benchmark/scenarios/extraction/bundle/scenarios.jsonl b/benchmark/scenarios/extraction/bundle/scenarios.jsonl
deleted file mode 100644
index ca7078a6..00000000
--- a/benchmark/scenarios/extraction/bundle/scenarios.jsonl
+++ /dev/null
@@ -1,3 +0,0 @@
-{"id": "extract-bundle-observability-001", "category": "extraction/bundle", "language": "en", "input": "Comprehensive observability stack decision.\n\nCore decision: Adopting the Grafana LGTM stack (Loki for logs, Grafana for dashboards, Tempo for traces, Mimir for metrics) as our unified observability platform.\n\nAlternatives evaluated:\n- Datadog: Best UX but $45K/year at our scale. Vendor lock-in with proprietary query language.\n- New Relic: Similar pricing to Datadog, better APM but weaker log management.\n- ELK Stack: Open source but operationally expensive. Our previous Elasticsearch cluster required a full-time SRE.\n- Splunk: Overkill for our size. Licensing model penalizes high-volume logging.\n\nMetrics (Mimir): Prometheus-compatible, so existing alerting rules and Grafana dashboards work unchanged. Long-term storage in S3 with compaction. Cost: ~$200/month for storage.\n\nLogs (Loki): Label-based indexing instead of full-text. 10x cheaper than Elasticsearch for our volume (50GB/day). Trade-off: slower ad-hoc searches, but structured logging with labels covers 95% of our debugging patterns.\n\nTraces (Tempo): OpenTelemetry-native. Auto-instrumentation for Go and Python services. Trace-to-log correlation via trace IDs in log labels. Storage in S3, cost negligible.\n\nDashboards (Grafana): Already using Grafana for metrics. Extending to unified dashboards for logs, traces, and metrics. Explore view replaces our current Kibana setup.\n\nImplementation timeline: 4 weeks. Week 1: Mimir + Loki deployment. Week 2: Agent rollout (Grafana Alloy). Week 3: Trace instrumentation. Week 4: Dashboard migration and Elasticsearch decommission.", "expected_extraction_type": "bundle", "expected_fields": {"title_keywords": ["observability", "Grafana", "LGTM"], "status_hint": "accepted", "min_phases": 3, "max_phases": 5}, "notes": "Should split into core decision + detail facets (alternatives, metrics, logs, traces)"}
-{"id": "extract-bundle-data-platform-002", "category": "extraction/bundle", "language": "en", "input": "Data platform architecture decision.\n\nCore: Building a modern data platform with dbt + Snowflake + Airbyte.\n\nIngestion layer (Airbyte): Open-source ELT connectors for our 23 data sources (PostgreSQL, Stripe, Salesforce, Google Analytics, etc.). Self-hosted on K8s. Rejected Fivetran ($30K/year) because Airbyte covers all our connectors and we have the ops capacity to self-host. Rejected Stitch for limited connector coverage.\n\nStorage layer (Snowflake): Separation of storage and compute lets us scale independently. Pay-per-query model works for our bursty analytics workload (heavy during business hours, idle at night). Alternatives: BigQuery (comparable but our team has more Snowflake experience), Redshift (too coupled to AWS, and we might go multi-cloud), Databricks (overkill — we don't need ML/streaming yet).\n\nTransformation layer (dbt): SQL-based transformations with version control, testing, and documentation. Models organized as staging → intermediate → marts. dbt Cloud for scheduling and CI. The data team can own their transformations without waiting for data engineering.\n\nOrchestration (Dagster): Replaced Airflow. Dagster's software-defined assets model aligns better with dbt's DAG-based approach. Better local development experience and type-safe configuration. Airflow's task-centric model was causing 'invisible dependencies' between data pipelines.\n\nGovernance: Column-level lineage via dbt + Snowflake's ACCESS_HISTORY. PII columns tagged and masked in non-production environments. Data catalog in dbt docs, auto-generated from model descriptions and column-level documentation.", "expected_extraction_type": "bundle", "expected_fields": {"title_keywords": ["data platform", "dbt", "Snowflake"], "status_hint": "accepted", "min_phases": 3, "max_phases": 6}, "notes": "Rich multi-faceted decision that should split into core + ingestion + storage + transformation + orchestration"}
-{"id": "extract-bundle-security-posture-003", "category": "extraction/bundle", "language": "en", "input": "Security posture overhaul following the Q3 penetration test.\n\nCore finding: Our security posture is 'reactive' — we fix vulnerabilities after they're found but lack proactive prevention. Moving to a 'shift-left' model.\n\nSAST (Static analysis): Adopting Semgrep for custom rule authoring. It covers our primary languages (Python, TypeScript, Go) and lets us write org-specific rules (e.g., 'never use string concatenation for SQL queries'). Replaces SonarQube which was too noisy (80% false positive rate).\n\nDAST (Dynamic analysis): Adding ZAP scans to the CI pipeline for all web-facing services. Runs against the staging environment after each deployment. Blocking on high-severity findings only — medium and below create Jira tickets for the next sprint.\n\nDependency scanning: Switching from Dependabot (too noisy, creates PRs nobody reviews) to Snyk. Snyk's reachability analysis tells us if a vulnerable dependency is actually called in our code, reducing noise by 70%. Auto-merge for minor version patches with passing tests.\n\nSecrets management: Migrating from environment variables to HashiCorp Vault. Applications authenticate via Kubernetes service account tokens. Secrets rotated automatically every 30 days. The 142 secrets currently in .env files across our repos will be removed in a coordinated migration.\n\nIncident response: Establishing a formal IR playbook. Severity levels P0-P3 with defined escalation paths and SLAs. Security incidents get a dedicated Slack channel and war room within 15 minutes of P0/P1 declaration. Post-incident reviews within 72 hours.", "expected_extraction_type": "bundle", "expected_fields": {"title_keywords": ["security", "shift-left", "posture"], "status_hint": "accepted", "min_phases": 3, "max_phases": 6}, "notes": "Should split into core + SAST + DAST + dependency scanning + secrets + incident response facets"}
diff --git a/benchmark/scenarios/extraction/phase_chain/scenarios.jsonl b/benchmark/scenarios/extraction/phase_chain/scenarios.jsonl
deleted file mode 100644
index 9452b227..00000000
--- a/benchmark/scenarios/extraction/phase_chain/scenarios.jsonl
+++ /dev/null
@@ -1,3 +0,0 @@
-{"id": "extract-phase-migration-001", "category": "extraction/phase_chain", "language": "en", "input": "Here's the full migration plan for moving from our legacy auth system to the new identity platform.\n\nPhase 1 - Assessment: We audited all 47 services that depend on the legacy auth. 12 use session cookies directly, 23 use the auth SDK, and 12 use raw JWT validation. The SDK consumers are the easiest to migrate since we control the SDK interface.\n\nPhase 2 - Dual-stack: We'll run both auth systems in parallel. The new identity platform issues tokens that the legacy system also accepts (backward-compatible claims). This lets us migrate services one at a time without a big-bang cutover. The dual-stack adds latency (extra token validation) but eliminates migration risk.\n\nPhase 3 - Service migration: Migrate in dependency order — leaf services first, then work inward. Each service switches its auth SDK version, runs in shadow mode for 1 week (validating against both systems), then cuts over. Estimated 3 months for all 47 services.\n\nPhase 4 - Legacy decommission: After all services are migrated, keep the legacy system running for 30 days as a rollback safety net. Then decommission: revoke all legacy service accounts, archive the codebase, and redirect the old auth endpoints to the new system with appropriate error messages.", "expected_extraction_type": "phase_chain", "expected_fields": {"title_keywords": ["auth", "migration", "identity"], "status_hint": "accepted", "min_phases": 3, "max_phases": 5}}
-{"id": "extract-phase-incident-002", "category": "extraction/phase_chain", "language": "en", "input": "Incident timeline and resolution for the payment processing outage on March 5th.\n\nDiscovery: Alerts fired at 14:23 UTC when payment success rate dropped below 95%. On-call engineer confirmed — Stripe webhook endpoint returning 502s. Initial suspicion: Stripe outage, but Stripe status page showed all green.\n\nRoot cause investigation: Traced to our webhook handler running out of database connections. The connection pool (max 20) was exhausted because a background job (invoice reconciliation) was holding long-running transactions — each one taking 30+ seconds due to an unindexed query on the invoices table. The reconciliation job runs hourly and usually completes in 2 seconds, but the invoices table grew past 10M rows last week without anyone noticing.\n\nImmediate fix: Killed the long-running queries, restarted the webhook handler. Payment processing recovered at 14:47 UTC (24-minute outage). Added the missing index on invoices(customer_id, created_at) — reconciliation job now completes in 200ms.\n\nPrevention: 1) Added statement_timeout=10s for the reconciliation job's database role. 2) Set up table size alerts — any table crossing 5M rows triggers a review. 3) Connection pool exhaustion now fires a P1 alert instead of silently queuing. 4) Added a dedicated database connection pool for webhook handlers, isolated from batch jobs.", "expected_extraction_type": "phase_chain", "expected_fields": {"title_keywords": ["payment", "outage", "connection pool"], "status_hint": "accepted", "min_phases": 3, "max_phases": 5}}
-{"id": "extract-phase-refactor-003", "category": "extraction/phase_chain", "language": "en", "input": "Planning the permission system refactor from role-based (RBAC) to attribute-based (ABAC).\n\nWhy: Our current RBAC system has 47 roles with overlapping permissions. Adding a new feature requires touching 10+ role definitions. The \"custom role\" feature request from enterprise customers is impossible to implement cleanly with the current model.\n\nDesign decision: Adopting ABAC with a policy engine (Open Policy Agent). Policies are expressed as Rego rules that evaluate user attributes (department, team, clearance level) against resource attributes (owner, sensitivity, environment). This replaces the rigid role-permission matrix with composable policies.\n\nMigration approach: Every existing role will be expressed as an ABAC policy — so the migration is backward-compatible. Users won't notice any change initially. Then we gradually introduce attribute-based policies for new features while keeping the role-based ones for existing features. No big-bang migration.\n\nRisk: OPA adds a network hop to every authorization check. Mitigation: bundle OPA as a sidecar with local policy cache, evaluated in < 1ms. If OPA sidecar is unavailable, fall back to a cached last-known-good policy decision.", "expected_extraction_type": "phase_chain", "expected_fields": {"title_keywords": ["ABAC", "permission", "OPA"], "status_hint": "accepted", "min_phases": 3, "max_phases": 5}}
diff --git a/benchmark/scenarios/extraction/single/scenarios.jsonl b/benchmark/scenarios/extraction/single/scenarios.jsonl
deleted file mode 100644
index 61b11132..00000000
--- a/benchmark/scenarios/extraction/single/scenarios.jsonl
+++ /dev/null
@@ -1,4 +0,0 @@
-{"id": "extract-single-redis-001", "category": "extraction/single", "language": "en", "input": "We're going with Redis for session storage. It's faster than our PostgreSQL-based sessions and we already run a Redis cluster for caching. The risk is adding another state dependency, but the performance gain for auth flows justifies it.", "expected_extraction_type": "single", "expected_fields": {"title_keywords": ["Redis", "session"], "status_hint": "accepted", "min_alternatives": 1, "min_trade_offs": 1}}
-{"id": "extract-single-lint-002", "category": "extraction/single", "language": "en", "input": "Adopting Biome as our linter/formatter, replacing ESLint + Prettier. Biome is 10-30x faster and handles both linting and formatting in a single tool. The config is simpler and it supports TypeScript out of the box. We lose some ESLint plugins (import sorting, accessibility checks), but we'll add those back via separate CI checks.", "expected_extraction_type": "single", "expected_fields": {"title_keywords": ["Biome", "linter"], "status_hint": "accepted", "min_alternatives": 1, "min_trade_offs": 1}}
-{"id": "extract-single-rejected-003", "category": "extraction/single", "language": "en", "input": "Rejected the proposal to move our static assets to a CDN. Our traffic is 90% US-based, and the S3 + CloudFront setup we have now delivers p95 < 100ms domestically. A multi-region CDN would cost $5K/month extra for marginal improvement. We'll revisit when international traffic exceeds 25%.", "expected_extraction_type": "single", "expected_fields": {"title_keywords": ["CDN", "rejected"], "status_hint": "rejected", "min_alternatives": 0, "min_trade_offs": 1}}
-{"id": "extract-single-korean-004", "category": "extraction/single", "language": "ko", "input": "로그 수집 시스템으로 Loki를 선택했습니다. Elasticsearch 대비 스토리지 비용이 1/10이고, 우리 규모(일 50GB 로그)에서는 전문 검색보다 라벨 기반 필터링이 더 실용적입니다. Grafana와의 네이티브 통합도 장점. 다만 복잡한 로그 분석이 필요한 경우 ES 클러스터를 별도로 유지할 수 있습니다.", "expected_extraction_type": "single", "expected_fields": {"title_keywords": ["Loki", "로그"], "status_hint": "accepted", "min_alternatives": 1, "min_trade_offs": 1}}
diff --git a/benchmark/scenarios/recall/cross_domain/scenarios.jsonl b/benchmark/scenarios/recall/cross_domain/scenarios.jsonl
deleted file mode 100644
index c2c8c9b3..00000000
--- a/benchmark/scenarios/recall/cross_domain/scenarios.jsonl
+++ /dev/null
@@ -1,3 +0,0 @@
-{"id": "recall-cross-security-arch-001", "category": "recall/cross_domain", "language": "en", "seed_records": [{"title": "SSRF prevention in webhook delivery", "domain": "security", "content": "Webhook URL validation must block RFC 1918 ranges and link-local addresses. Resolve hostname and validate IP before request. Use dedicated egress proxy for webhook deliveries."}, {"title": "Adopted ArgoCD for GitOps deployment", "domain": "ops", "content": "All deployments declarative via GitOps repo. ArgoCD reconciles. Jenkins for CI only."}, {"title": "Write-through cache with Redis for product catalog", "domain": "architecture", "content": "Write-through cache, catalog updates < 100/day, reads 50K/min. Rejected cache-aside."}], "query": "What security measures do we have for external integrations?", "expected_match_titles": ["SSRF prevention in webhook delivery"], "min_score": 0.35, "notes": "Should match security record despite query using 'external integrations' not 'webhook'"}
-{"id": "recall-cross-incident-debug-002", "category": "recall/cross_domain", "language": "en", "seed_records": [{"title": "gRPC keepalive fix for AWS NLB idle timeout", "domain": "debugging", "content": "NLB 350s idle timeout vs gRPC 600s keepalive. Connections silently dropped. Fix: GRPC_KEEPALIVE_TIME_MS=60000."}, {"title": "Cascading failure post-mortem: checkout flow", "domain": "incident", "content": "Recommendation service slowdown caused thread pool exhaustion in checkout. Added 500ms timeout, bulkhead pattern, circuit breaker."}, {"title": "PostgreSQL deadlock fix with canonical lock ordering", "domain": "debugging", "content": "Concurrent transactions acquiring row locks in different order. Fix: lock by table name alphabetically."}], "query": "Have we had reliability issues with AWS networking?", "expected_match_titles": ["gRPC keepalive fix for AWS NLB idle timeout"], "min_score": 0.3, "notes": "Cross-domain: query about AWS networking should surface a debugging record"}
-{"id": "recall-cross-product-arch-003", "category": "recall/cross_domain", "language": "en", "seed_records": [{"title": "API v1 sunset timeline", "domain": "product", "content": "12% traffic on v1. Deprecation notices next month, hard shutdown in 6 months. v1 returns 410 Gone. Frees 2 engineers."}, {"title": "Switched to JWT with short-lived tokens", "domain": "security", "content": "Access tokens 15min, refresh tokens 7 days. Stateless validation. Critical ops still check session store."}, {"title": "JSON Patch (RFC 6902) for settings API", "domain": "architecture", "content": "Using JSON Patch instead of merge patch for user settings. Prevents null vs absent ambiguity. Audit trail of field changes."}], "query": "What are we changing about our API?", "expected_match_titles": ["API v1 sunset timeline", "JSON Patch (RFC 6902) for settings API"], "min_score": 0.3, "notes": "Broad query should surface multiple relevant records across domains"}
diff --git a/benchmark/scenarios/recall/exact_match/scenarios.jsonl b/benchmark/scenarios/recall/exact_match/scenarios.jsonl
deleted file mode 100644
index 8e9b6ad3..00000000
--- a/benchmark/scenarios/recall/exact_match/scenarios.jsonl
+++ /dev/null
@@ -1,4 +0,0 @@
-{"id": "recall-exact-db-001", "category": "recall/exact_match", "language": "en", "seed_records": [{"title": "Adopted PostgreSQL for analytics warehouse", "domain": "architecture", "content": "We chose PostgreSQL over ClickHouse for our analytics warehouse because our query patterns are more OLTP-like with occasional analytics. PostgreSQL's ecosystem and our team's familiarity reduce operational risk. ClickHouse would require a dedicated ops team.", "tags": ["database", "analytics", "postgresql"]}], "query": "What database did we choose for the analytics warehouse?", "expected_match_titles": ["Adopted PostgreSQL for analytics warehouse"], "min_score": 0.5}
-{"id": "recall-exact-auth-002", "category": "recall/exact_match", "language": "en", "seed_records": [{"title": "Switched to JWT with short-lived tokens for authentication", "domain": "security", "content": "Moving from session-based auth to JWT. Access tokens expire in 15 minutes, refresh tokens in 7 days. JWTs are stateless so we can validate without a database call. The trade-off is we can't revoke individual tokens, but the 15-minute window limits exposure. For critical operations (password change, payment), we still verify against the session store.", "tags": ["auth", "jwt", "security"]}], "query": "What authentication mechanism do we use?", "expected_match_titles": ["Switched to JWT with short-lived tokens for authentication"], "min_score": 0.5}
-{"id": "recall-exact-deploy-003", "category": "recall/exact_match", "language": "en", "seed_records": [{"title": "Adopted ArgoCD for GitOps deployment pipeline", "domain": "ops", "content": "Replacing Jenkins-based deployment with ArgoCD. All deployments are now declarative — push a manifest change to the GitOps repo and ArgoCD reconciles. This eliminates the 'deploy button' anti-pattern and gives us full audit trail. Jenkins remains for CI (build + test), but CD is now ArgoCD's responsibility.", "tags": ["deployment", "gitops", "argocd"]}], "query": "How do we deploy to production?", "expected_match_titles": ["Adopted ArgoCD for GitOps deployment pipeline"], "min_score": 0.5}
-{"id": "recall-exact-pricing-004", "category": "recall/exact_match", "language": "ko", "seed_records": [{"title": "API 제품 사용량 기반 과금 모델 채택", "domain": "product", "content": "시트 기반 과금에서 API 호출량 기반 과금으로 전환. 월 10K 호출 무료 티어 제공. 고객 인터뷰 결과 시트 기반은 비정기 사용자가 많은 팀에 불리하다는 피드백. 재무팀 모델링으로 6개월 내 15% 매출 증가 전망.", "tags": ["pricing", "api", "usage-based"]}], "query": "우리 API 과금 모델이 뭐야?", "expected_match_titles": ["API 제품 사용량 기반 과금 모델 채택"], "min_score": 0.5}
diff --git a/benchmark/scenarios/recall/semantic_match/edge_cases.jsonl b/benchmark/scenarios/recall/semantic_match/edge_cases.jsonl
deleted file mode 100644
index b884be5c..00000000
--- a/benchmark/scenarios/recall/semantic_match/edge_cases.jsonl
+++ /dev/null
@@ -1,2 +0,0 @@
-{"id": "recall-semantic-indirect-005", "category": "recall/semantic_match", "language": "ko", "seed_records": [{"title": "Deployment freeze during quarter-end close", "domain": "process", "content": "Implementing deployment freezes for the last week of each quarter. Sales team reports production issues during close cause deal slippage. Feature branches merge to main but deploy waits. Compromise: wanted 2-week freeze, negotiated to 1 week.", "tags": ["deployment", "process", "sales"]}], "query": "영업팀이 요청한 변경사항이 뭐가 있었지?", "expected_match_titles": ["Deployment freeze during quarter-end close"], "min_score": 0.25, "notes": "Korean query about sales team requests should surface an English record about deployment freezes prompted by sales"}
-{"id": "recall-semantic-negation-006", "category": "recall/semantic_match", "language": "en", "seed_records": [{"title": "Rejected monorepo adoption", "domain": "process", "content": "Staying with poly-repo. Evaluated Turborepo/Nx but CI would worsen. Teams prefer autonomous deploy cycles. Code sharing via shared npm registry instead.", "tags": ["monorepo", "repo-structure"]}], "query": "What's our code sharing strategy?", "expected_match_titles": ["Rejected monorepo adoption"], "min_score": 0.25, "notes": "Query about code sharing should surface the monorepo rejection which contains the alternative (npm registry)"}
diff --git a/benchmark/scenarios/recall/semantic_match/scenarios.jsonl b/benchmark/scenarios/recall/semantic_match/scenarios.jsonl
deleted file mode 100644
index 3441620d..00000000
--- a/benchmark/scenarios/recall/semantic_match/scenarios.jsonl
+++ /dev/null
@@ -1,4 +0,0 @@
-{"id": "recall-semantic-cache-001", "category": "recall/semantic_match", "language": "en", "seed_records": [{"title": "Write-through cache with Redis for product catalog", "domain": "architecture", "content": "Using write-through cache for product catalog. Catalog updates < 100/day but reads 50K/min. Rejected cache-aside due to 30-second staleness window causing pricing discrepancies in checkout.", "tags": ["cache", "redis", "catalog"]}], "query": "How do we keep product prices consistent?", "expected_match_titles": ["Write-through cache with Redis for product catalog"], "min_score": 0.35, "notes": "Query uses 'prices consistent' — semantically related to 'pricing discrepancies' and 'cache staleness'"}
-{"id": "recall-semantic-testing-002", "category": "recall/semantic_match", "language": "en", "seed_records": [{"title": "Shifted from E2E-heavy to contract testing strategy", "domain": "qa", "content": "Inverting the test pyramid: 70% unit, 20% contract (Pact), 10% E2E. Selenium E2E suite was 45 minutes with 15% flake rate. Trimming to 50 critical-path E2E tests. Target CI time: 12 minutes.", "tags": ["testing", "contract", "pact"]}], "query": "Why are our CI builds so slow?", "expected_match_titles": ["Shifted from E2E-heavy to contract testing strategy"], "min_score": 0.3, "notes": "Query about CI speed should surface testing strategy change that addressed CI duration"}
-{"id": "recall-semantic-mobile-003", "category": "recall/semantic_match", "language": "en", "seed_records": [{"title": "Deprioritized native mobile app in favor of PWA", "domain": "product", "content": "Mobile DAU is 8% of total. PWA with service workers covers 90% of mobile use cases. Native app in maintenance mode, no new features. 3 mobile engineers transitioning to core web team.", "tags": ["mobile", "pwa", "native"]}], "query": "Do we still have an iOS app?", "expected_match_titles": ["Deprioritized native mobile app in favor of PWA"], "min_score": 0.3, "notes": "Query about iOS should surface the PWA decision even though 'iOS' isn't in the record"}
-{"id": "recall-semantic-hiring-004", "category": "recall/semantic_match", "language": "en", "seed_records": [{"title": "Updated engineering hiring bar with system design round", "domain": "hr", "content": "Adding system design round for all candidates. New loop: coding (LC medium), system design, behavioral, pair programming. Dropped take-home assignment — candidates report 8+ hours and filters out people with families.", "tags": ["hiring", "interview", "process"]}], "query": "What's our policy on take-home assignments for candidates?", "expected_match_titles": ["Updated engineering hiring bar with system design round"], "min_score": 0.3}
diff --git a/benchmark/scenarios/recall/temporal/scenarios.jsonl b/benchmark/scenarios/recall/temporal/scenarios.jsonl
deleted file mode 100644
index 8356204b..00000000
--- a/benchmark/scenarios/recall/temporal/scenarios.jsonl
+++ /dev/null
@@ -1,3 +0,0 @@
-{"id": "recall-temporal-recent-001", "category": "recall/temporal", "language": "en", "seed_records": [{"title": "Migrated from MongoDB to PostgreSQL for user profiles", "domain": "architecture", "content": "Document model no longer helpful — profiles highly relational. Mongo lacks JOINs, forced denormalization. PostgreSQL with JSONB gives best of both worlds. Dual-write migration for 2 weeks.", "tags": ["database", "migration"]}, {"title": "Original MongoDB adoption for user profiles", "domain": "architecture", "content": "Chose MongoDB for user profiles because schema was evolving rapidly during early product development. Flexible documents avoided constant migrations.", "tags": ["database", "mongodb"]}], "query": "What database do we currently use for user profiles?", "expected_match_titles": ["Migrated from MongoDB to PostgreSQL for user profiles"], "min_score": 0.4, "notes": "Query about 'currently' should rank the migration decision (more recent) higher than the original adoption"}
-{"id": "recall-temporal-evolution-002", "category": "recall/temporal", "language": "en", "seed_records": [{"title": "1-week sprint cadence for platform team", "domain": "process", "content": "Switching from 2-week to 1-week sprints. 2-week cycle had too much scope creep. Product team keeps 2-week sprints."}, {"title": "Adopted 2-week sprint cadence company-wide", "domain": "process", "content": "Standardizing on 2-week sprints across all teams. Aligns with biweekly planning cycle."}], "query": "How long are our sprints?", "expected_match_titles": ["1-week sprint cadence for platform team", "Adopted 2-week sprint cadence company-wide"], "min_score": 0.3, "notes": "Both should match — the answer depends on which team. Both records are relevant."}
-{"id": "recall-temporal-superseded-003", "category": "recall/temporal", "language": "en", "seed_records": [{"title": "Adopted ArgoCD for GitOps deployment pipeline", "domain": "ops", "content": "Replacing Jenkins-based deployment with ArgoCD. All deployments declarative via GitOps repo. Jenkins remains for CI only."}, {"title": "Standardized on Jenkins for CI/CD pipeline", "domain": "ops", "content": "Using Jenkins for the full CI/CD pipeline including builds, tests, and deployments. Jenkinsfile in each repo defines the pipeline."}], "query": "Do we still use Jenkins for deployments?", "expected_match_titles": ["Adopted ArgoCD for GitOps deployment pipeline"], "min_score": 0.35, "notes": "The ArgoCD decision supersedes Jenkins for CD. Both may surface but ArgoCD should rank higher."}
diff --git a/benchmark/scenarios/schema.json b/benchmark/scenarios/schema.json
deleted file mode 100644
index d15748b3..00000000
--- a/benchmark/scenarios/schema.json
+++ /dev/null
@@ -1,160 +0,0 @@
-{
- "$schema": "https://json-schema.org/draft/2020-12/schema",
- "title": "Rune Benchmark Scenario",
- "description": "Schema for rune-bench scenario JSONL files",
- "oneOf": [
- { "$ref": "#/$defs/CaptureScenario" },
- { "$ref": "#/$defs/RecallScenario" },
- { "$ref": "#/$defs/ExtractionScenario" }
- ],
- "$defs": {
- "CaptureScenario": {
- "type": "object",
- "required": ["id", "category", "language", "input", "expected_capture"],
- "properties": {
- "id": {
- "type": "string",
- "pattern": "^[a-z0-9-]+-\\d{3}$"
- },
- "category": {
- "type": "string",
- "pattern": "^capture/(should_capture|should_not_capture)/"
- },
- "language": {
- "type": "string",
- "enum": ["en", "ko", "ja", "mixed"]
- },
- "input": {
- "type": "string",
- "minLength": 10
- },
- "expected_capture": {
- "type": "boolean"
- },
- "expected_fields": {
- "type": "object",
- "properties": {
- "domain": { "type": "string" },
- "status_hint": {
- "type": "string",
- "enum": ["proposed", "accepted", "rejected"]
- },
- "title_keywords": {
- "type": "array",
- "items": { "type": "string" }
- },
- "evidence_type": {
- "type": "string",
- "enum": ["code_change", "git_bisect", "benchmark", "error_trace", "runtime_observation"]
- },
- "has_reusable_insight": {
- "type": "boolean"
- }
- }
- },
- "recall_queries": {
- "type": "array",
- "items": {
- "type": "object",
- "required": ["query", "should_match"],
- "properties": {
- "query": { "type": "string" },
- "should_match": { "type": "boolean" }
- }
- }
- },
- "notes": { "type": "string" }
- }
- },
- "RecallScenario": {
- "type": "object",
- "required": ["id", "category", "language", "seed_records", "query"],
- "properties": {
- "id": {
- "type": "string",
- "pattern": "^recall-[a-z0-9-]+-\\d{3}$"
- },
- "category": {
- "type": "string",
- "pattern": "^recall/"
- },
- "language": {
- "type": "string",
- "enum": ["en", "ko", "ja", "mixed"]
- },
- "seed_records": {
- "type": "array",
- "items": {
- "type": "object",
- "required": ["title", "domain", "content"],
- "properties": {
- "title": { "type": "string" },
- "domain": { "type": "string" },
- "content": { "type": "string" },
- "status": { "type": "string" },
- "tags": {
- "type": "array",
- "items": { "type": "string" }
- }
- }
- }
- },
- "query": { "type": "string" },
- "expected_match_titles": {
- "type": "array",
- "items": { "type": "string" }
- },
- "min_score": {
- "type": "number",
- "minimum": 0,
- "maximum": 1
- },
- "notes": { "type": "string" }
- }
- },
- "ExtractionScenario": {
- "type": "object",
- "required": ["id", "category", "language", "input", "expected_extraction_type"],
- "properties": {
- "id": {
- "type": "string",
- "pattern": "^extract-[a-z0-9-]+-\\d{3}$"
- },
- "category": {
- "type": "string",
- "pattern": "^extraction/"
- },
- "language": {
- "type": "string",
- "enum": ["en", "ko", "ja", "mixed"]
- },
- "input": {
- "type": "string",
- "minLength": 20
- },
- "expected_extraction_type": {
- "type": "string",
- "enum": ["single", "phase_chain", "bundle"]
- },
- "expected_fields": {
- "type": "object",
- "properties": {
- "title_keywords": {
- "type": "array",
- "items": { "type": "string" }
- },
- "status_hint": {
- "type": "string",
- "enum": ["proposed", "accepted", "rejected"]
- },
- "min_alternatives": { "type": "integer", "minimum": 0 },
- "min_trade_offs": { "type": "integer", "minimum": 0 },
- "min_phases": { "type": "integer", "minimum": 1 },
- "max_phases": { "type": "integer", "minimum": 1 }
- }
- },
- "notes": { "type": "string" }
- }
- }
- }
-}
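
Reviewer note: the removed `schema.json` was the only machine-readable description of the scenario files. A minimal sketch of how a scenario line could be checked against its core rules using only the standard library is shown below. The required keys, id patterns, and language enum are copied from the `$defs` above; the function names `check_scenario` and `check_jsonl` are hypothetical and are not part of the benchmark. The sketch deliberately simplifies the schema: it checks only the category prefix, not the full `capture/(should_capture|should_not_capture)/` pattern, and it ignores `expected_fields`, `min_score`, and the `minLength` constraints.

```python
import json
import re

# Required keys and id patterns copied from the $defs in schema.json.
RULES = {
    "capture": {
        "required": ["id", "category", "language", "input", "expected_capture"],
        "id_pattern": r"^[a-z0-9-]+-\d{3}$",
    },
    "recall": {
        "required": ["id", "category", "language", "seed_records", "query"],
        "id_pattern": r"^recall-[a-z0-9-]+-\d{3}$",
    },
    "extraction": {
        "required": ["id", "category", "language", "input", "expected_extraction_type"],
        "id_pattern": r"^extract-[a-z0-9-]+-\d{3}$",
    },
}
LANGUAGES = {"en", "ko", "ja", "mixed"}


def check_scenario(scenario):
    """Return a list of problems found in one parsed scenario (empty if none)."""
    # Simplification: only the part of the category before the first "/" is used.
    kind, sep, _ = str(scenario.get("category", "")).partition("/")
    if not sep or kind not in RULES:
        return [f"unknown category prefix: {kind!r}"]
    rules = RULES[kind]
    problems = [f"missing field: {key}" for key in rules["required"] if key not in scenario]
    if "id" in scenario and not re.search(rules["id_pattern"], str(scenario["id"])):
        problems.append(f"id does not match {rules['id_pattern']}")
    if "language" in scenario and scenario["language"] not in LANGUAGES:
        problems.append(f"unsupported language: {scenario['language']!r}")
    return problems


def check_jsonl(text):
    """Check every non-empty line of a JSONL string; map line number to problems."""
    report = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            problems = check_scenario(json.loads(line))
            if problems:
                report[number] = problems
    return report
```

Calling `check_jsonl` on the contents of one of the `scenarios.jsonl` files would return an empty dict when every line satisfies these core rules. A full validation would still need a JSON Schema validator that supports draft 2020-12.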