Crawl Walrus Sites via on-chain resource enumeration (Site object + blobIds) #22

Description

@harrymove-ctrl

Context

The production Worker treats *.wal.app targets as plain HTTP through the portal: it sets mode: 'walrus' and scrapes x-walrus-* response headers (apps/api/src/worker.ts ~2462/2554) but never reads the on-chain Site object. Real, verifiable enumeration already exists in listWalrusResources(site) in packages/walrus/src/resources.ts, which reads each resource's blob_id from the Sui Site object (~line 103) using a WalrusSiteContext from resolve.ts. That code is Node-only today (CLI materializeWalrusSite) and is not used in the hosted crawl. As a result, Walrus-native sites are crawled by HTML link-guessing instead of from their authoritative on-chain manifest.

Goal / user story

As a user pointing ContextMEM at a Walrus Site (a .wal.app URL, SuiNS name, or 0x site object id), I want it to enumerate the site's resources and blobIds directly from the on-chain Site object, so extraction is complete and provenance is verifiable rather than guessed from anchors.

Acceptance criteria

  • For a Walrus-Site target, the run resolves a WalrusSiteContext and enumerates resource paths + blobIds from the on-chain Site object (not HTML link discovery).
  • Each resource is fetched via the Walrus aggregator (quilt.ts blobAggregatorEndpoint) and mapped into manifest pages/resources carrying source.blobId / resourcePath.
  • buildSiteStructure (packages/core/src/site-structure.ts) populates its existing Walrus Provenance group from input.walrus.resources for these runs.
  • Extracted chunks/facts carry blobId-backed FactSourceRef provenance so the "why" links resolve to on-chain blobs.
  • If the on-chain read fails (RPC/gRPC error), the crawl falls back to today's HTTP portal crawl and the run still completes.
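
The criteria above can be sketched as a small enumerate-or-fall-back wrapper. This is only an illustration under stated assumptions: ResourceRecord and ManifestEntry are hypothetical stand-ins for the repo's WalrusResourceRecord and manifest types, and enumerate / endpointFor are injected placeholders for listWalrusResources and blobAggregatorEndpoint, whose real signatures are not shown in this issue.

```typescript
// Hypothetical shapes; the real WalrusResourceRecord and manifest types live in the repo.
interface ResourceRecord {
  path: string;
  blobId: string;
}

interface ManifestEntry {
  url: string;
  source: { blobId: string; resourcePath: string };
}

type CrawlPlan =
  | { mode: 'onchain'; entries: ManifestEntry[] }
  | { mode: 'portal'; reason: string };

// Map one on-chain resource to a manifest entry that carries blobId provenance.
// endpointFor stands in for whatever builds the aggregator URL for a blob.
function toManifestEntry(
  r: ResourceRecord,
  endpointFor: (blobId: string) => string,
): ManifestEntry {
  return {
    url: endpointFor(r.blobId),
    source: { blobId: r.blobId, resourcePath: r.path },
  };
}

// Try the on-chain enumeration first; on any error, signal the caller to run
// the existing HTTP portal crawl so the run still completes.
async function enumerateOrFallback(
  enumerate: () => Promise<ResourceRecord[]>,
  endpointFor: (blobId: string) => string,
): Promise<CrawlPlan> {
  try {
    const records = await enumerate();
    return {
      mode: 'onchain',
      entries: records.map((r) => toManifestEntry(r, endpointFor)),
    };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { mode: 'portal', reason };
  }
}
```

Returning a tagged plan rather than throwing keeps the fallback decision in one place, so the Worker's walrus branch can dispatch on mode and log reason for failed on-chain reads.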

Implementation notes

  • Reuse packages/walrus/src/resolve.ts (resolveWalrusTarget) + resources.ts (listWalrusResources) from the Worker walrus branch, or port a Worker-safe variant.
  • SPIKE first: confirm @mysten/sui/grpc SuiGrpcClient.getObject runs in the Worker under nodejs_compat (the roadmap notes proof.ts/history.ts/resolve.ts already use it for reads, but they run in Node today). resources.ts uses node:buffer — verify under nodejs_compat.
  • Touch apps/api/src/worker.ts (walrus branch dispatch), wire WalrusResourceRecord[] into the existing SiteStructureInput.walrus.resources seam, and reuse aggregatorUrl from the resolved context.
  • Gotchas: large sites need a resource cap/concurrency limit; aggregator fetch sizes must respect the Worker's existing 1.5MB guard; non-HTML resources (assets) should be enumerated but not chunked.
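
The gotchas above could be handled with a few small helpers, sketched here under assumptions. The limits MAX_RESOURCES and MAX_CONCURRENCY are made-up values, MAX_BYTES reads "1.5MB" as decimal megabytes, and all function names are hypothetical rather than existing Worker code.

```typescript
// Assumed limits; the real cap, concurrency, and byte guard are defined by the Worker.
const MAX_RESOURCES = 200;
const MAX_CONCURRENCY = 4;
const MAX_BYTES = 1_500_000; // "1.5MB" read as decimal megabytes (assumption)

interface ResourceRecord {
  path: string;
  blobId: string;
  contentType?: string;
}

// Keep at most `cap` resources and report how many were dropped.
function capResources(records: ResourceRecord[], cap: number = MAX_RESOURCES) {
  return { kept: records.slice(0, cap), dropped: Math.max(0, records.length - cap) };
}

// Assets are still enumerated, but only HTML is sent to the chunker.
function shouldChunk(contentType: string | undefined): boolean {
  return (contentType ?? '').toLowerCase().startsWith('text/html');
}

// Pre-check a Content-Length header against the guard. A missing or unparsable
// header returns false here, so the caller must count bytes while reading instead.
function withinSizeGuard(contentLength: string | null, maxBytes: number = MAX_BYTES): boolean {
  if (contentLength === null || contentLength.trim() === '') return false;
  const n = Number(contentLength);
  return Number.isFinite(n) && n >= 0 && n <= maxBytes;
}

// Run fn over items with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i] as T);
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A caller would typically run capResources first, then pass kept to mapWithConcurrency with MAX_CONCURRENCY and a fetch function that consults withinSizeGuard before reading the body.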

Sui Overflow angle

This is the uniquely Sui-native crawl: reading content straight from on-chain Walrus Site objects and Walrus blobs gives verifiable, tamper-evident provenance no traditional crawler can match. Paired with the chunkGraphDigest-keyed extract receipt, it tells a complete on-chain provenance story — the most differentiated thing to demo at a Sui hackathon.

Dependencies

Spike: @mysten/sui/grpc (SuiGrpcClient) in the Cloudflare Worker (shared with the on-chain receipt work). Coordinates with, but is not blocked by, the on-chain attribution-receipt issue.

Part of the ContextMEM roadmap (#4) • Sui Overflow build.

Metadata

Assignees

No one assigned

Labels

P1 (Important: hardens the demo and core product)
crawling (Web/Walrus crawling and context-extraction quality)
sui (Sui chain: tx signing, objects, wallet, zkLogin, explorer)
walrus (Walrus blob storage / Sites / aggregator)
