forked from tung-lee/contextMeM
Cloudflare Browser Rendering fallback for JS/SPA pages #13
Open
Labels
- P0 (Demo-blocking: required for a working Sui Overflow demo)
- crawling (Web/Walrus crawling and context-extraction quality)
- feature (User- or agent-facing capability)
- platform (Backend platform plumbing: Worker, D1, queues, secrets, metering)
Context
The production extractor `fetchPageContent()` (`apps/api/src/worker.ts` ~2884) uses Firecrawl only when `FIRECRAWL_API_KEY` is set, otherwise falls back to raw `fetchText` + regex `htmlToText()` (line 5217). The Node core path (`packages/core/src/web.ts` `fetchHtml`) is also raw-fetch with no JS execution. As a result, SPA shells (Next.js/React `<div id="root">`) extract to near-empty content whenever Firecrawl is absent, which is a real risk during a live demo on a modern site. There is no Browser Rendering binding in `apps/api/cloudflare/wrangler.jsonc` (bindings today: D1, R2, AI, Queues).
Goal / user story
As a user extracting a JavaScript-rendered site, I want ContextMEM to render the page headlessly when a raw fetch returns an empty shell, so I get real content without Firecrawl being mandatory.
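To make "an empty shell" concrete, here is a minimal sketch of a detector built on the signals this issue names (`id="root"`, `__next`, `app-root`, and a large inline script payload with little prose). The function name `looksLikeEmptyShell`, the helper names, and both thresholds are hypothetical and not taken from the codebase; real values would need tuning against actual pages.

```typescript
// Hypothetical sketch: none of these names or thresholds exist in the repo.
// Mount-point markers listed in this issue, expressed as regexes (an assumption).
const SHELL_MARKERS: RegExp[] = [
  /id=["']root["']/i,   // React-style mount point
  /id=["']__next["']/i, // Next.js mount point
  /<app-root[\s>]/i,    // custom-element mount point
];

// Rough visible-text estimate: drop scripts, styles and tags, collapse whitespace.
function visibleText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Total length of inline <script> bodies (external scripts contribute 0).
function inlineScriptChars(html: string): number {
  const re = /<script\b[^>]*>([\s\S]*?)<\/script>/gi;
  let total = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    total += m[1].length;
  }
  return total;
}

// True when the page has little prose AND either a known mount-point marker
// or an inline script payload that dwarfs the prose.
function looksLikeEmptyShell(
  html: string,
  minProseChars = 200, // assumed threshold
  scriptRatio = 10,    // assumed threshold
): boolean {
  const prose = visibleText(html).length;
  if (prose >= minProseChars) return false;
  const hasMarker = SHELL_MARKERS.some((re) => re.test(html));
  const scriptHeavy = inlineScriptChars(html) > scriptRatio * Math.max(prose, 1);
  return hasMarker || scriptHeavy;
}
```

Checking prose length first keeps server-rendered pages that happen to contain `id="root"` on the cheap raw-fetch path; only pages that are both thin and shell-like would trigger a render.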
Acceptance criteria
- `browser` binding is added to `wrangler.jsonc` and `WorkerEnv`, and `@cloudflare/puppeteer` is wired in the queue consumer.
- `fetchPageContent` order becomes firecrawl (if key) → Browser Rendering (if empty-shell detected) → raw fetch + `htmlToText`, with the engine recorded on `FetchedText.engine`.
- […] `id="root"`, `__next`, `app-root`, or large inline script payload with little prose).
- […] `CrawlOptions.waitForMs` / networkidle) and a hard timeout; failures degrade gracefully to raw fetch rather than failing the run.
Implementation notes
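
The engine order and graceful degradation from the acceptance criteria could look roughly like the sketch below. It is not the existing `fetchPageContent`: the engines are injected as plain functions so the control flow can be read and tested without a Worker, and `FetchedTextSketch`, `PageFetchers`, `withTimeout`, the engine labels, and the 15-second default are all hypothetical. It also assumes the raw fetch runs first so the empty-shell check has HTML to inspect, and that a Firecrawl error simply propagates; the issue does not specify either point.

```typescript
// Hypothetical sketch of the fallback chain; all names here are assumptions.
type EngineLabel = "firecrawl" | "browser-rendering" | "raw-fetch";

interface FetchedTextSketch {
  text: string;
  engine: EngineLabel; // which engine produced the text
}

interface PageFetchers {
  firecrawl?: (url: string) => Promise<string>; // provided only when a key is configured
  render?: (url: string) => Promise<string>;    // headless render, returns HTML
  rawFetch: (url: string) => Promise<string>;   // plain fetch, returns HTML
  htmlToText: (html: string) => string;
  looksLikeEmptyShell: (html: string) => boolean;
}

// Hard timeout: reject if the promise has not settled within `ms`.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function fetchPageContentSketch(
  url: string,
  deps: PageFetchers,
  renderTimeoutMs = 15_000, // assumed default
): Promise<FetchedTextSketch> {
  // 1. Firecrawl wins whenever it is configured.
  if (deps.firecrawl) {
    return { text: await deps.firecrawl(url), engine: "firecrawl" };
  }

  // 2. Raw fetch first, so the detector has something to look at.
  const rawHtml = await deps.rawFetch(url);

  // 3. Render only when the raw HTML looks like an empty SPA shell.
  if (deps.render && deps.looksLikeEmptyShell(rawHtml)) {
    try {
      const renderedHtml = await withTimeout(deps.render(url), renderTimeoutMs);
      return { text: deps.htmlToText(renderedHtml), engine: "browser-rendering" };
    } catch {
      // Render failure or timeout: fall through to the raw HTML
      // instead of failing the run.
    }
  }

  // 4. Default path: raw fetch + htmlToText.
  return { text: deps.htmlToText(rawHtml), engine: "raw-fetch" };
}
```

In the hosted path, `render` would wrap the Browser Rendering call and would be supplied only inside the queue consumer, which keeps rendering off the request path.
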
- `apps/api/cloudflare/wrangler.jsonc` (add `"browser": { "binding": "BROWSER" }`), `apps/api/src/worker.ts` (`WorkerEnv`, `fetchPageContent`, empty-shell detector). Run rendering only in the queue consumer, never the request path, due to CPU/time limits.
- `puppeteer.launch(env.BROWSER)` → `page.goto(url, { waitUntil: 'networkidle0' })` → `page.content()` → feed into `htmlToText`/markdown. Reuse `fetchText`'s 1.5MB size guard semantics.
- […] `packages/core/src/screenshots.ts`) for the CLI; this is the hosted parity path.
Sui Overflow angle
Many Sui/Walrus ecosystem sites and dapp docs are SPA-rendered; guaranteeing non-empty extraction without a paid Firecrawl key makes the demo robust on any URL a judge throws at it, including `*.wal.app` portals.
Dependencies
None (independent of chunking); complements the Walrus Sites on-chain enumeration issue for `.wal.app` targets.
Part of the ContextMEM roadmap (#4) • Sui Overflow build.