Cloudflare Browser Rendering fallback for JS/SPA pages #13

Description

@harrymove-ctrl

Context

The production extractor fetchPageContent() (apps/api/src/worker.ts ~2884) uses Firecrawl only when FIRECRAWL_API_KEY is set, otherwise falls back to raw fetchText + regex htmlToText() (line 5217). The Node core path (packages/core/src/web.ts fetchHtml) is also raw-fetch with no JS execution. As a result, SPA shells (Next.js/React <div id="root">) extract to near-empty content whenever Firecrawl is absent — a real risk during a live demo on a modern site. There is no Browser Rendering binding in apps/api/cloudflare/wrangler.jsonc (bindings today: D1, R2, AI, Queues).

Goal / user story

As a user extracting a JavaScript-rendered site, I want ContextMEM to render the page headlessly when a raw fetch returns an empty shell, so I get real content without Firecrawl being mandatory.

Acceptance criteria

  • A browser binding is added to wrangler.jsonc and WorkerEnv, and @cloudflare/puppeteer is wired in the queue consumer.
  • fetchPageContent order becomes firecrawl (if key) → Browser Rendering (if empty-shell detected) → raw fetch + htmlToText, with the engine recorded on FetchedText.engine.
  • An empty-shell heuristic triggers rendering: extracted main-content text below a threshold while the HTML shows SPA markers (id="root", __next, app-root, or large inline script payload with little prose).
  • Rendering honors a wait (CrawlOptions.waitForMs / networkidle) and a hard timeout; failures degrade gracefully to raw fetch rather than failing the run.
  • Rendered HTML flows through the existing markdown/headings/brand path so downstream artifacts (chunks, facts, design-system) are unchanged in shape.
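
The empty-shell heuristic and engine order above could look roughly like the following. This is a minimal sketch, not the project's code: `looksLikeEmptyShell`, `chooseEngine`, `roughVisibleText`, and both thresholds are hypothetical names and values, and it assumes detection runs on HTML that a raw fetch has already returned.

```typescript
// Hypothetical names throughout; thresholds are illustrative, not tuned.
type Engine = "firecrawl" | "browser-rendering" | "raw";

const MIN_TEXT_CHARS = 200; // assumed cutoff for "near-empty" extracted text
const LARGE_INLINE_SCRIPT_CHARS = 20_000; // assumed size of a "large" inline payload

// SPA markers named in the acceptance criteria: id="root", __next, app-root.
const SPA_MARKERS: RegExp[] = [
  /id=["']root["']/i,
  /id=["']__next["']/i,
  /<app-root[\s>]/i,
];

// Crude visible-text estimate: drop script/style/noscript bodies, then tags.
function roughVisibleText(html: string): string {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Total characters inside inline <script> elements (those without a src attribute).
function inlineScriptChars(html: string): number {
  const re = /<script\b(?![^>]*\bsrc=)[^>]*>([\s\S]*?)<\/script>/gi;
  let total = 0;
  let match: RegExpExecArray | null;
  while ((match = re.exec(html)) !== null) {
    total += match[1].length;
  }
  return total;
}

// True when there is little prose AND the page shows SPA markers
// or carries a large inline script payload.
function looksLikeEmptyShell(html: string): boolean {
  if (roughVisibleText(html).length >= MIN_TEXT_CHARS) return false;
  const hasMarker = SPA_MARKERS.some((re) => re.test(html));
  return hasMarker || inlineScriptChars(html) >= LARGE_INLINE_SCRIPT_CHARS;
}

// Engine order from the acceptance criteria:
// firecrawl (if key), then Browser Rendering (if empty shell), then raw.
function chooseEngine(opts: {
  hasFirecrawlKey: boolean;
  hasBrowserBinding: boolean;
  rawHtml: string;
}): Engine {
  if (opts.hasFirecrawlKey) return "firecrawl";
  if (opts.hasBrowserBinding && looksLikeEmptyShell(opts.rawHtml)) {
    return "browser-rendering";
  }
  return "raw";
}
```

The value returned by `chooseEngine` is what would be recorded on `FetchedText.engine`. Requiring both signals (little text and an SPA marker or heavy inline script) keeps genuinely short static pages from triggering a paid render.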

Implementation notes

  • Touch apps/api/cloudflare/wrangler.jsonc (add "browser": { "binding": "BROWSER" }), apps/api/src/worker.ts (WorkerEnv, fetchPageContent, empty-shell detector). Run rendering only in the queue consumer, never the request path, due to CPU/time limits.
  • Use puppeteer.launch(env.BROWSER) → browser.newPage() → page.goto(url, { waitUntil: 'networkidle0' }) → page.content() → feed into htmlToText/markdown. Reuse fetchText's 1.5MB size guard semantics.
  • Gotchas: Browser Rendering has session/concurrency caps and per-invocation cost — gate it behind the empty-shell heuristic, cap concurrent sessions, and reuse a single browser per batch. Keep local Playwright (packages/core/src/screenshots.ts) for the CLI; this is the hosted parity path.
  • Open question to resolve in PR: is Firecrawl permanent or does Browser Rendering become the default renderer (cost tradeoff)?
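
A hedged sketch of the render step with a hard timeout and graceful fallback to the raw HTML. `renderOrFallback`, `withTimeout`, `PageLike`, `BrowserLike`, and `MAX_HTML_CHARS` are hypothetical; the two interfaces only mirror the Puppeteer methods named in the notes above, so the block compiles without `@cloudflare/puppeteer`. The size guard is approximated by character count, since the exact semantics of fetchText's guard are not shown here.

```typescript
// Minimal structural types mirroring the Puppeteer calls used below (assumed shapes).
interface PageLike {
  goto(url: string, opts: { waitUntil: "networkidle0" }): Promise<unknown>;
  content(): Promise<string>;
  close(): Promise<void>;
}
interface BrowserLike {
  newPage(): Promise<PageLike>;
}
interface RenderResult {
  html: string;
  engine: "browser-rendering" | "raw";
}

const MAX_HTML_CHARS = 1_500_000; // stand-in for the 1.5MB guard; counts characters, not bytes

// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`render timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Render `url` in an already-launched browser. Any failure (timeout, session
// error, oversized output) returns the raw HTML instead of throwing.
async function renderOrFallback(
  browser: BrowserLike,
  url: string,
  rawHtml: string,
  timeoutMs: number,
): Promise<RenderResult> {
  const fallback: RenderResult = { html: rawHtml, engine: "raw" };
  // Deferred so a synchronous throw from newPage() also becomes a rejection.
  const pagePromise = Promise.resolve().then(() => browser.newPage());
  try {
    const html = await withTimeout(
      pagePromise.then(async (page) => {
        await page.goto(url, { waitUntil: "networkidle0" });
        return page.content();
      }),
      timeoutMs,
    );
    if (html.length > MAX_HTML_CHARS) return fallback;
    return { html, engine: "browser-rendering" };
  } catch {
    return fallback;
  } finally {
    // Not awaited: close the page whenever it exists, without letting a hung
    // newPage() defeat the hard timeout.
    pagePromise.then((page) => page.close()).catch(() => undefined);
  }
}
```

In the queue consumer, the browser would presumably come from `puppeteer.launch(env.BROWSER)` once per batch and be closed by the caller after the batch, which is why this function closes only the page. A `CrawlOptions.waitForMs` delay, if set, would be applied between `goto` and `content` and still count against `timeoutMs`.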

Sui Overflow angle

Many Sui/Walrus ecosystem sites and dapp docs are SPA-rendered; guaranteeing non-empty extraction without a paid Firecrawl key makes the demo robust on any URL a judge throws at it, including *.wal.app portals.

Dependencies

None (independent of chunking); complements the Walrus Sites on-chain enumeration issue for .wal.app targets.

Part of the ContextMEM roadmap (#4) • Sui Overflow build.

Metadata

Labels

  • P0 (Demo-blocking: required for a working Sui Overflow demo)
  • crawling (Web/Walrus crawling and context-extraction quality)
  • feature (User- or agent-facing capability)
  • platform (Backend platform plumbing: Worker, D1, queues, secrets, metering)
