feat: clean HTML fetches with reader mode #1

Merged: code-yeongyu merged 1 commit into main from code-yeongyu/webfetch-reader-mode on Jun 22, 2026
code-yeongyu (Owner) commented Jun 22, 2026

Summary

  • Add Mozilla Readability + jsdom extraction before markdown/text conversion for HTML responses.
  • Keep raw html format unchanged and fall back to the previous whole-document conversion when Readability cannot extract an article.
  • Add markdown and text regressions proving page chrome is removed while article content remains.
  • Align runtime Node engine with the new reader dependencies and pin undici@7.28.0 so the new dependency set does not introduce the prior undici advisories.
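The control flow described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `renderFetchedHtml`, `Extractor`, `Converter`, and the stub helpers are hypothetical names, the extractor is injected so the sketch runs without `jsdom` or `@mozilla/readability` installed, and treating a thrown extractor error as "no article" is an assumption.

```typescript
// Hypothetical sketch of reader-mode dispatch with fallback; not the PR's code.
type Format = "html" | "markdown" | "text";

// Subset of an extraction result used here. In a real setup this could wrap
// `new Readability(new JSDOM(html).window.document).parse()`, which returns
// null when no article can be extracted.
interface Article {
  content: string; // article body as HTML
}

type Extractor = (html: string) => Article | null;
type Converter = (html: string) => string;

function renderFetchedHtml(
  html: string,
  format: Format,
  extract: Extractor,
  convert: Record<"markdown" | "text", Converter>,
): string {
  // Raw html requests bypass reader mode and are returned unchanged.
  if (format === "html") return html;

  let article: Article | null = null;
  try {
    article = extract(html);
  } catch {
    // Assumption: an extractor failure is treated the same as "no article".
    article = null;
  }

  // Fall back to whole-document conversion when nothing usable was extracted.
  const source =
    article !== null && article.content.trim() !== "" ? article.content : html;
  return convert[format](source);
}

// Stand-ins for the real extractor and converters, for demonstration only.
const stripTags: Converter = (h) =>
  h.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
const stubExtract: Extractor = (h) => {
  const m = /<article>([\s\S]*?)<\/article>/.exec(h);
  return m ? { content: m[1] } : null;
};
const noArticle: Extractor = () => null;

const page = "<nav>Menu</nav><article><p>Body</p></article>";
const converters = { markdown: stripTags, text: stripTags };

console.log(renderFetchedHtml(page, "text", stubExtract, converters));
console.log(renderFetchedHtml(page, "text", noArticle, converters));
```

With the stub extractor the navigation text is dropped; with the extractor that finds no article, the whole document is converted as before.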

Evidence

  • RED: /Users/yeongyu/local-workspaces/senpi/local-ignore/qa-evidence/20260622-webfetch-reader-mode/pi-webfetch-red.txt
  • Focused GREEN: /Users/yeongyu/local-workspaces/senpi/local-ignore/qa-evidence/20260622-webfetch-reader-mode/pi-webfetch-green-focused-4.txt
  • Full tests: /Users/yeongyu/local-workspaces/senpi/local-ignore/qa-evidence/20260622-webfetch-reader-mode/pi-webfetch-test-full-4.txt
  • Type/lint: /Users/yeongyu/local-workspaces/senpi/local-ignore/qa-evidence/20260622-webfetch-reader-mode/pi-webfetch-check-4.txt
  • Manual QA: /Users/yeongyu/local-workspaces/senpi/local-ignore/qa-evidence/20260622-webfetch-reader-mode/pi-webfetch-manual-qa-4.txt
  • Cleanup receipt: /Users/yeongyu/local-workspaces/senpi/local-ignore/qa-evidence/20260622-webfetch-reader-mode/pi-webfetch-manual-cleanup-4.txt
  • Prod audit summary: /Users/yeongyu/local-workspaces/senpi/local-ignore/qa-evidence/20260622-webfetch-reader-mode/pi-webfetch-audit-prod-2.json

Reviewer approved after verifying the Node engine and text-format coverage fixes.


Summary by cubic

Use reader mode to extract the main article content before converting HTML responses to markdown or text. Raw html responses remain unchanged, and we fall back to the previous whole-document conversion when no article is detected.

  • New Features

    • Use @mozilla/readability + jsdom to strip chrome and convert only the article body; fallback to previous conversion if extraction fails.
    • Preserve titles by adding an H1 or prefixing text when missing; tests cover markdown/text and verify chrome removal.
  • Dependencies

    • Add @mozilla/readability, jsdom, and @types/jsdom.
    • Pin undici@7.28.0 and require Node >=20.19.0.
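The title-preservation behavior mentioned under New Features might look something like the sketch below. The function name `withTitle` and both "is the title missing?" checks (an existing level-1 heading for markdown, a leading title string for text) are assumptions made for illustration; the PR's actual heuristics are not shown in this conversation.

```typescript
// Hypothetical sketch of title preservation; the detection rules are assumptions.
function withTitle(
  body: string,
  title: string | null | undefined,
  format: "markdown" | "text",
): string {
  const t = title?.trim();
  if (!t) return body; // nothing to preserve

  if (format === "markdown") {
    // Assumption: any existing level-1 heading counts as the title being present.
    const hasH1 = /^#\s+\S/m.test(body);
    return hasH1 ? body : `# ${t}\n\n${body}`;
  }

  // Assumption: for text, the title is present if the body already starts with it.
  return body.trimStart().startsWith(t) ? body : `${t}\n\n${body}`;
}

console.log(withTitle("Body", "Hello", "markdown"));
console.log(withTitle("Body", "Hello", "text"));
```

Keeping this step separate from extraction means the same helper can run on both the reader-mode output and the whole-document fallback.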

Written for commit 749ff2a. Summary will update on new commits.

code-yeongyu merged commit b3922b1 into main on Jun 22, 2026
6 checks passed
code-yeongyu deleted the code-yeongyu/webfetch-reader-mode branch on June 22, 2026, 06:28