77 changes: 77 additions & 0 deletions bench/PROBE.md
@@ -0,0 +1,77 @@
# Wick self-improvement probe harness

Closes the loop from the public stats page back into Wick's routing: read which
sites Wick is **failing on**, empirically test access methods through a
residential proxy, and publish measured per-site rules that every client picks
up — so the curated "known behaviors" list (`rust/data/site-rules.json` +
`GET /v1/site-rules`) constantly evolves instead of relying on hand-seeds.

```
/v1/stats/summary ──→ probe.sh ──→ site-rules.measured.json ──→ publish-rules.sh ──→ POST /v1/site-rules
(failing sites)       (matrix)     (measured verdicts)          (merge w/ seed)      (clients refresh)
```

## The pipeline

| stage | script | what it does |
|---|---|---|
| select | `probe.sh` step 1 | Pull `/v1/stats/summary`, aggregate per host, keep **site-side** failing hosts. Drops hosts whose failures are mostly `error_kind="offline"` (the user's own network) — so we never chase phantom "this site is hard" signals. |
| probe | `probe.sh` step 2–3 | Per host, run a matrix via `wick fetch --json`: `cronet` \| `cronet+residential` \| `cef`. Derive `render` (cef only if it beats a cronet failure) and `needs_residential` (residential beats a cronet failure). |
| emit | `probe.sh` step 4 | Write `~/.wick/probe/site-rules.measured.json` — a measured verdict for every host where *some* strategy worked (incl. `render:cronet`, so a measurement can **correct** an over-aggressive seed). Key is the host with a leading `www.` stripped, matching the seed convention. |
| publish | `publish-rules.sh` | Merge seed ∪ measured (**measured wins per host**) and `POST /v1/site-rules/:key`. |
| consume | client | `wick` refreshes `GET /v1/site-rules` into `<wick-home>/site-rules.json` daily; that overlay overrides the bundled seed (`site_rules.rs`). |
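
The `select` row's filter can be sketched in plain shell arithmetic. This is a minimal illustration, not the harness's implementation (`probe.sh` does the aggregation and filtering in `jq`). The function name `is_candidate` is hypothetical, and the thresholds are hard-coded to the script's defaults (at least 4 fetches, success rate below 0.5, offline fraction of failures below 0.5).

```bash
# Hypothetical helper: decide whether one host's aggregated counters make it a
# probe candidate. Thresholds mirror probe.sh's defaults, hard-coded here.
# usage: is_candidate <fetches> <successes> <offline_failures>
is_candidate() {
  local fetches="$1" successes="$2" offline="$3"
  local failures=$(( fetches - successes ))
  # Ignore low-volume noise.
  [ "$fetches" -ge 4 ] || return 1
  # Success rate below 0.5, written as 2 * successes < fetches.
  [ $(( 2 * successes )) -lt "$fetches" ] || return 1
  # Reject hosts whose failures are mostly the user's own network:
  # offline / failures >= 0.5, written as 2 * offline >= failures.
  if [ "$failures" -gt 0 ] && [ $(( 2 * offline )) -ge "$failures" ]; then
    return 1
  fi
  return 0
}

is_candidate 20 2 3  && echo "candidate"                # 18 failures, 3 offline
is_candidate 20 2 12 || echo "skipped: mostly offline"  # 12 of 18 failures offline
is_candidate 3 0 0   || echo "skipped: low volume"      # under 4 fetches
```

Integer arithmetic avoids floating-point comparisons in the shell; the real script keeps the thresholds tunable through `WICK_PROBE_MIN_FETCHES` and `WICK_PROBE_MAX_SR`.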

## Running it

```bash
# 1. residential creds from Vault (prod tailnet + GCP ADC required)
source <skills>/scripts/residential-proxy-env.sh # OXY_USER/OXY_PASS, ...

# 2. ALWAYS probe availability first — it's time-varying
bash <skills>/scripts/residential-probe.sh US

# 3. sweep (oxylabs is the reliable US provider; HTTP CONNECT :443-only)
bash bench/probe.sh --provider=oxylabs --country=us --max-hosts=15
# → ~/.wick/probe/probe-<ts>.jsonl (per-host trace)
# → ~/.wick/probe/site-rules.measured.json

# 4. publish (needs a Worker API key)
WICK_PUBLISH_KEY=<key> bash bench/publish-rules.sh # or --dry-run to preview

# candidate selection alone (no creds): bash bench/probe.sh --dry-run
```
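
After a sweep, the per-host trace can be inspected with `jq`. A minimal sketch, assuming each trace line carries the fields `probe.sh` writes (`host`, `render`, `needs_residential`, `cells`); the function name `summarize_trace` is hypothetical.

```bash
# Hypothetical helper: print one tab-separated line per probed host:
#   host  render  needs_residential  cronet  cef  cronet_residential
# usage: summarize_trace ~/.wick/probe/probe-<ts>.jsonl
summarize_trace() {
  jq -r '[.host, .render, (.needs_residential | tostring),
          .cells.cronet, .cells.cef, .cells.cronet_residential] | @tsv' "$1"
}
```

Reading the raw cells next to the derived verdict makes it easy to spot hosts where every cell failed, which emit no measured rule.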

## Scheduling

Rules change slowly and residential probing has cost, so **weekly** is plenty.
Cron (sources creds at runtime — never bake Vault creds into the job):

```
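# Sundays 04:00: sweep the current failing set and republish.
# A crontab entry must be a single line (no backslash continuation), and cron's
# default /bin/sh may lack `source`, so the whole chain runs under one `bash -c`.
0 4 * * 0 bash -c 'source $HOME/.../scripts/residential-proxy-env.sh && bash /abs/path/wick/bench/probe.sh --provider=oxylabs --country=us --max-hosts=25 && WICK_PUBLISH_KEY=$WICK_PUBLISH_KEY bash /abs/path/wick/bench/publish-rules.sh'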
```

## Methodology caveats (read before trusting a single sweep)

- **The `cronet` baseline cell uses the operator's own IP.** If that IP is clean
(residential / office), `cronet` succeeds and we conclude `needs_residential:false`
— even though the *datacenter*-hosted clients that generate much of the failing
telemetry would need residential. To detect `needs_residential` faithfully, run
the harness **from a datacenter VM** so the baseline matches the failing
population. (First live sweep, 2026-06-26, ran from a clean US vantage and found
reuters/cfr/tradingview/apkmirror/apkcombo all work on plain Cronet — i.e. the
telemetry failures were vantage-specific or user-side noise, and the hand-seeds
for those hosts were over-aggressive. The loop corrected them to `cronet`.)
- **`--proxy` (SOCKS/HTTP) routes only Cronet, not CEF.** CEF's residential path is
a WireGuard `LD_PRELOAD` (`bindwg.so`) that exists only on tunneled Linux servers,
so the `cef+residential` combination is **not** tested here. `render:cef` and
`needs_residential` are derived as independent signals; a site needing *both*
(e.g. apkpure — DataDome, failed every testable cell) is left to its seed / PR4's
agent.
- **Single residential IP per session, single country.** A site reachable from a
different country/ISP won't show it. Sweep multiple `--country` values for
geo-sensitive targets.
- A `200` with fewer than `MIN_OK_BYTES` bytes of extracted content (default 1000,
overridable via `WICK_PROBE_MIN_BYTES`) is treated as a block/challenge shell,
not a success (matches `fetch.rs`'s `is_acceptable_render`).
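
The success test in the last bullet and the independent-signal derivation from the second bullet can be sketched together. This is an illustration under assumptions, not the harness code: `classify_cell` and `derive_rule` are hypothetical names, the 1000-byte floor is the documented default, and the output format is invented for the example.

```bash
MIN_OK_BYTES="${MIN_OK_BYTES:-1000}"

# Hypothetical helper: a cell is "ok" only on HTTP 200 with at least
# MIN_OK_BYTES of extracted content; anything else counts as a failure.
# usage: classify_cell <status_code> <content_bytes>
classify_cell() {
  if [ "$1" = "200" ] && [ "$2" -ge "$MIN_OK_BYTES" ]; then
    echo ok
  else
    echo fail
  fi
}

# Hypothetical helper: derive the two independent signals from three cells.
# usage: derive_rule <cronet> <cef> <cronet_residential>   (each "ok" or "fail")
derive_rule() {
  local render=cronet needs_residential=false
  if [ "$1" != ok ]; then
    # Only a cronet-direct failure can promote either signal.
    if [ "$2" = ok ]; then render=cef; fi
    if [ "$3" = ok ]; then needs_residential=true; fi
  fi
  echo "render=$render needs_residential=$needs_residential"
}

derive_rule "$(classify_cell 200 48000)" fail fail  # cronet works directly
derive_rule "$(classify_cell 200 312)"   ok   fail  # tiny 200 is a shell; cef rescues
derive_rule "$(classify_cell 403 0)"     fail ok    # residential rescues
```

In the harness itself, a host where every cell fails emits no measured rule at all, so its seed stays in place; the sketch does not model that case.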
245 changes: 245 additions & 0 deletions bench/probe.sh
@@ -0,0 +1,245 @@
#!/usr/bin/env bash
# Wick self-improvement probe harness (PR2).
#
# Closes the loop: read which sites Wick is FAILING on (from the public stats
# endpoint), then empirically test access methods against each through a
# residential proxy, and emit measured per-site rules in the site-rules.json
# schema that fetch.rs consumes.
#
# stats → candidates → strategy matrix → winner → site-rules.measured.json
#
# Pipeline:
# 1. Pull releases.getwick.dev/v1/stats/summary, aggregate per host, and
# select genuinely SITE-SIDE failing hosts — explicitly dropping hosts
# whose failures are mostly error_kind="offline" (the user's own network),
# so we never chase phantom "this site is hard" signals.
# 2. For each candidate, run a strategy matrix via `wick fetch --json`:
#      - cronet              (--render cronet)                [direct]
#      - cronet+residential  (--render cronet --proxy <url>)  [datacenter-block test]
#      - cef                 (--render cef)                   [JS / bot-managed test]
# (cef+residential is NOT tested here: --proxy routes only the Cronet/
# reqwest engine, not CEF, whose residential path is a WireGuard preload
# on tunneled servers. The rule still combines render:cef + needs_residential
# when both independent signals fire; PR4's agent refines.)
# 3. Decide the winner and derive the rule:
# render = "cef" if cef succeeds AND cronet-direct fails
# needs_residential = true if cronet+residential succeeds AND cronet-direct fails
# 4. Emit measured rules (source:"measured", with sample count + date) and a
# per-host JSONL trace.
#
# Residential creds come from the env (same convention as run.sh /
# proxy-providers.sh). Source the residential-proxy skill's env first:
# source <skills>/scripts/residential-proxy-env.sh # exports OXY_USER, ...
# bash bench/probe.sh --provider=oxylabs --country=us
#
# Safe under cron/launchd: serial, per-request timeout, polite sleep.

set -u # NOT -e: a single failed probe must never kill the sweep.

REPO_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROXY_BUILDER="$REPO_DIR/proxy-providers.sh"
STATS_URL="${WICK_STATS_URL:-https://releases.getwick.dev/v1/stats/summary}"

OUT_DIR="${WICK_PROBE_OUT_DIR:-$HOME/.wick/probe}"
TS="$(date -u +'%Y-%m-%dT%H:%M:%SZ')"
DAY="$(date -u +'%Y-%m-%d')"
RESULTS="$OUT_DIR/probe-$TS.jsonl"
RULES_OUT="$OUT_DIR/site-rules.measured.json"

# Tunables.
PROVIDER="${WICK_PROBE_PROVIDER:-}"
COUNTRY="${WICK_PROBE_COUNTRY:-us}"
MAX_HOSTS="${WICK_PROBE_MAX_HOSTS:-25}"
MIN_FETCHES="${WICK_PROBE_MIN_FETCHES:-4}" # ignore low-volume noise
MAX_SUCCESS_RATE="${WICK_PROBE_MAX_SR:-0.5}" # candidate if site-side SR below this
MIN_OK_BYTES="${WICK_PROBE_MIN_BYTES:-1000}" # a 200 with < this many bytes of extracted content = block/shell
PER_REQUEST_TIMEOUT="${WICK_PROBE_TIMEOUT:-40}"
SLEEP_BETWEEN="${WICK_PROBE_SLEEP:-2}"
DRY_RUN=0

for arg in "$@"; do
  case $arg in
    --provider=*)  PROVIDER="${arg#*=}" ;;
    --country=*)   COUNTRY="${arg#*=}" ;;
    --max-hosts=*) MAX_HOSTS="${arg#*=}" ;;
    --dry-run)     DRY_RUN=1 ;;
    *) echo "WARN: unknown arg ignored: $arg" >&2 ;;
  esac
done

command -v jq >/dev/null || { echo "ERROR: jq required" >&2; exit 1; }
command -v curl >/dev/null || { echo "ERROR: curl required" >&2; exit 1; }
WICK_BIN="${WICK_BIN:-$(command -v wick)}"
if [[ -z "$WICK_BIN" && "$DRY_RUN" -eq 0 ]]; then
echo "ERROR: wick not found on PATH (set WICK_BIN), or pass --dry-run" >&2
exit 1
fi
mkdir -p "$OUT_DIR"

# Resolve a timeout command: GNU coreutils ships `timeout`; macOS only has it
# as `gtimeout` (after `brew install coreutils`). Without either, run with no
# per-request timeout (and warn) rather than failing the whole sweep — the
# README documents macOS/launchd usage, so a hard dependency on `timeout`
# would break the default Mac.
TIMEOUT_BIN="$(command -v timeout 2>/dev/null || command -v gtimeout 2>/dev/null || true)"
if [[ -z "$TIMEOUT_BIN" && "$DRY_RUN" -eq 0 ]]; then
echo "WARN: no 'timeout'/'gtimeout' on PATH — running without a per-request timeout (macOS: brew install coreutils)" >&2
fi

# ── Step 1: candidate selection ─────────────────────────────────────────────
# Aggregate the per-(host,strategy) rows into per-host totals. A host is a
# candidate when it has real volume, a low overall success rate, AND its
# failures are predominantly site-side (offline fraction < 0.5). error_kind_dist
# may be absent until the worker is deployed + clients ship; treat missing as
# offline=0 (so we don't accidentally exclude everything in the meantime).
echo "[$TS] fetching stats: $STATS_URL" >&2
STATS_JSON=""
for attempt in 1 2 3 4; do
  # --retry handles curl's own transient transport errors; the outer loop
  # also retries an empty/non-JSON body (a transient edge blip we've seen).
  if STATS_JSON="$(curl -s --max-time 30 --retry 2 "$STATS_URL")" \
     && printf '%s' "$STATS_JSON" | jq -e '.rows' >/dev/null 2>&1; then
    break
  fi
  echo " stats fetch attempt $attempt failed; retrying in 3s…" >&2
  STATS_JSON=""
  sleep 3
done
[[ -n "$STATS_JSON" ]] || { echo "ERROR: stats fetch failed after retries" >&2; exit 1; }

CANDIDATES="$(printf '%s' "$STATS_JSON" | jq -r --argjson minf "$MIN_FETCHES" --argjson maxsr "$MAX_SUCCESS_RATE" '
  [ .rows[]
    | { host, fetches, successes,
        offline: ((.error_kind_dist // {}).offline // 0) }
  ]
  # group_by already sorts by the key internally in jq, so this sort_by is
  # belt-and-suspenders: it makes the host-grouping intent explicit and is
  # robust to any future jq change.
  | sort_by(.host)
  | group_by(.host)
  | map({
      host: .[0].host,
      fetches: (map(.fetches) | add),
      successes: (map(.successes) | add),
      offline: (map(.offline) | add),
    })
  | map(. + {
      failures: (.fetches - .successes),
      sr: (if .fetches > 0 then (.successes / .fetches) else 1 end),
    })
  # real volume, low success, and failures that are mostly NOT user-offline
  | map(select(.fetches >= $minf and .sr < $maxsr
               and (.failures <= 0 or (.offline / .failures) < 0.5)))
  | sort_by(.sr, (-.fetches))
  | .[].host
')"

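# Portable stand-in for `mapfile` (bash 4+ only); stock macOS ships bash 3.2.
HOSTS=()
while IFS= read -r h; do
  HOSTS+=("$h")
done < <(printf '%s\n' "$CANDIDATES" | grep -v '^[[:space:]]*$' | head -n "$MAX_HOSTS")

echo "[$TS] ${#HOSTS[@]} site-side failing candidate host(s) (max=$MAX_HOSTS):" >&2
# Guard the expansion: an empty array trips `set -u` on bash older than 4.4.
[[ ${#HOSTS[@]} -gt 0 ]] && printf ' %s\n' "${HOSTS[@]}" >&2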

if [[ "$DRY_RUN" -eq 1 ]]; then
echo "[$TS] --dry-run: stopping before probing. Matrix per host would be: cronet | cronet+residential | cef" >&2
exit 0
fi

# Residential proxy is required to test the needs_residential signal.
if [[ -z "$PROVIDER" ]]; then
echo "WARN: no --provider set; testing cronet-direct and cef-direct only (cannot derive needs_residential)." >&2
fi

# Build a fresh residential proxy URL (new session → new exit IP) per call.
# Scheme is provider-specific (oxylabs = HTTP CONNECT, others = SOCKS5).
build_proxy() {
[[ -z "$PROVIDER" ]] && return 1
"$PROXY_BUILDER" --provider="$PROVIDER" --country="$COUNTRY" 2>>"$RESULTS.err"
}

# Run one matrix cell. Echoes one of:
#   "ok <status> <bytes>"          200 with enough extracted content
#   "fail <rc>"                    wick exited non-zero (includes a timeout)
#   "fail-block <status> <bytes>"  non-200, or a 200 below MIN_OK_BYTES
probe_cell() {
  local url="$1" render="$2" proxy="$3"
  local args=(fetch --json --no-robots --render "$render")
  [[ -n "$proxy" ]] && args+=(--proxy "$proxy")
  args+=("$url")
  local out rc
  if [[ -n "$TIMEOUT_BIN" ]]; then
    out="$(WICK_AUTO_INSTALL_CEF=1 "$TIMEOUT_BIN" "$PER_REQUEST_TIMEOUT" "$WICK_BIN" "${args[@]}" 2>/dev/null)"; rc=$?
  else
    out="$(WICK_AUTO_INSTALL_CEF=1 "$WICK_BIN" "${args[@]}" 2>/dev/null)"; rc=$?
  fi
  if [[ $rc -ne 0 ]]; then
    echo "fail $rc"
    return
  fi
  local status bytes
  status="$(printf '%s' "$out" | jq -r '.status_code // 0' 2>/dev/null)"
  # content_bytes = extracted-content size; a challenge/JS shell extracts to
  # near nothing, so a small value below means a block (not bytes-on-wire).
  bytes="$(printf '%s' "$out" | jq -r '.content_bytes // 0' 2>/dev/null)"
  if [[ "$status" == "200" && "${bytes:-0}" -ge "$MIN_OK_BYTES" ]]; then
    echo "ok $status $bytes"
  else
    echo "fail-block ${status:-0} ${bytes:-0}"
  fi
}

# ── Step 2 + 3: matrix + decision ───────────────────────────────────────────
: > "$RESULTS"
# ${HOSTS[@]+...} expands to nothing when HOSTS is empty, which keeps the loop
# safe under `set -u` on bash older than 4.4.
for host in ${HOSTS[@]+"${HOSTS[@]}"}; do
  url="https://$host/"
  cronet="$(probe_cell "$url" cronet "")"; sleep "$SLEEP_BETWEEN"
  cef="$(probe_cell "$url" cef "")"; sleep "$SLEEP_BETWEEN"
  cronet_res="n/a"
  if [[ -n "$PROVIDER" ]]; then
    if px="$(build_proxy)"; then
      cronet_res="$(probe_cell "$url" cronet "$px")"; sleep "$SLEEP_BETWEEN"
    fi
  fi

  cronet_ok=0; [[ "$cronet" == ok* ]] && cronet_ok=1
  cef_ok=0; [[ "$cef" == ok* ]] && cef_ok=1
  cronet_res_ok=0; [[ "$cronet_res" == ok* ]] && cronet_res_ok=1

  # render: cef only when cef rescues a cronet-direct failure.
  render="cronet"
  [[ "$cronet_ok" -eq 0 && "$cef_ok" -eq 1 ]] && render="cef"
  # needs_residential: residential rescues a cronet-direct failure.
  needs_res="false"
  [[ "$cronet_ok" -eq 0 && "$cronet_res_ok" -eq 1 ]] && needs_res="true"

  jq -nc \
    --arg host "$host" --arg render "$render" --argjson needs_res "$needs_res" \
    --arg cronet "$cronet" --arg cef "$cef" --arg cronet_res "$cronet_res" \
    --arg ts "$TS" \
    '{host:$host, render:$render, needs_residential:$needs_res,
      cells:{cronet:$cronet, cef:$cef, cronet_residential:$cronet_res}, probed_at:$ts}' \
    | tee -a "$RESULTS" >&2
done

# ── Step 4: emit measured rules ─────────────────────────────────────────────
# Emit the measured verdict for every host where SOME strategy worked — including
# render:cronet. That's deliberate: a measurement of "cronet works here" must be
# able to CORRECT an over-aggressive hand-seed (the published overlay overrides
# the bundled seed per host). A host where every cell failed (e.g. apkpure, hard
# even via residential) emits nothing, so its seed stays until we learn a method
# that works. confidence is modest for a single sweep; repeated sweeps / PR4's
# agent raise it.
jq -s --arg day "$DAY" '
  { version: 1, updated_at: $day,
    note: "measured by bench/probe.sh",
    rules: (
      [ .[]
        | select(.cells | to_entries | any(.value | startswith("ok")))
        # Key on the bare host (strip leading www.) to match the seed
        # convention plus the client parent-domain walk, so a measurement
        # OVERRIDES a same-host seed instead of sitting beside it.
        | { key: (.host | sub("^www\\."; "")),
            value: { render: .render, needs_residential: .needs_residential,
                     vendor: "measured", confidence: 0.7, source: "measured",
                     updated_at: $day } }
      ] | from_entries)
  }' "$RESULTS" > "$RULES_OUT"

echo "[$TS] wrote $(jq '.rules | length' "$RULES_OUT") measured rule(s) → $RULES_OUT" >&2
echo "[$TS] per-host trace → $RESULTS" >&2
5 changes: 4 additions & 1 deletion bench/proxy-providers.sh
@@ -135,9 +135,12 @@ country_name_for() {

 case "$PROVIDER" in
   oxylabs)
+    # Oxylabs residential is an HTTP CONNECT proxy on :7777 and is
+    # :443-only; SOCKS5 (and non-443 dest ports) return 403/errors. All
+    # Wick fetch targets are https, so CONNECT-to-443 is exactly right.
     require OXY_USER OXY_PASS
     login="customer-${OXY_USER}-cc-${CC}-sessid-$(session_id 10)-sesstime-10"
-    echo "socks5://${login}:${OXY_PASS}@pr.oxylabs.io:7777"
+    echo "http://${login}:${OXY_PASS}@pr.oxylabs.io:7777"
     ;;
   brightdata)
     # BD's SOCKS5 endpoint runs on a different port than HTTP (33335).