diff --git a/.gitignore b/.gitignore index 6240da8b..0a37991a 100644 --- a/.gitignore +++ b/.gitignore @@ -16,6 +16,7 @@ pnpm-debug.log* # environment variables .env .env.production +!examples/deployment/.env # macOS-specific files .DS_Store diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 00000000..ea2e2c7d --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,94 @@ +# Repository Guidelines + +## Scope and Ownership + +This repository (`html2rss.github.io/`) is the public docs and feed-directory site built with Astro/Starlight. +Classify most work here as `docs`. + +What this repo owns: + +- docs content and navigation under `src/content/docs/` +- docs-specific components and styling under `src/components/` +- feed-directory presentation and client behavior +- generated docs data consumed by the site (`src/data/configs.json`) + +What this repo does not own: + +- runtime extractor behavior and CLI semantics (`html2rss/`) +- web API behavior and OpenAPI generation (`html2rss-web/`) +- feed YAML catalog definitions (`html2rss-configs/`) + +When docs describe behavior from other repos, treat those repos as source-of-truth and update docs to match them. + +## Cross-Repo Contracts + +Before substantial edits, state cross-repo context in your notes: + +- Source-of-truth repo +- Downstream consumer repo(s) +- Whether this change needs coordinated follow-up outside `html2rss.github.io/` + +Common contracts: + +- Feed directory data comes from `html2rss-configs` via `bin/data-update`. +- Ruby gem docs should match `html2rss` behavior and CLI output. +- Web application docs should match `html2rss-web` behavior and published OpenAPI. + +If a cross-repo behavior changed but upstream is not updated yet, document the gap clearly instead of inventing new behavior. + +## Generated Artifacts + +Treat `src/data/configs.json` as generated. + +- Do not hand-edit it. 
+- Regenerate with repo-native commands: + - `make update` + - or `bin/data-update` (after dependencies are installed) +- `bin/data-update` reads packaged configs (from `html2rss-configs`) and writes `src/data/configs.json`. + +If a change only affects generated data, include the source change rationale in the PR description. + +## Build, Test, and Dev Commands + +Run commands from `html2rss.github.io/`: + +- `make setup` installs dependencies and refreshes generated data +- `make dev` runs Astro locally +- `make build` builds production output +- `make lint` checks formatting +- `make lintfix` applies formatting fixes +- `make update` refreshes feed-directory data from packaged configs + +Preferred verification flow for docs/content changes: + +1. Run targeted check(s) first (`make lint` or `make build`). +2. Run the broader check set before PR (`make lint` and `make build`). +3. If feed directory or config references changed, run `make update` and verify resulting diffs. + +## Docs Authoring Rules + +### Code Snippets + +In docs content (`src/content/docs/**`) and docs-supporting components: + +- Do not use triple-backtick fenced code blocks. +- Always render snippets with the `<Code>` component. +- Use this import: + `import { Code } from '@astrojs/starlight/components';` +- Do not use: + `import Code from "astro/components/Code.astro";` + +### Accuracy Rules + +- Prefer concrete, verifiable statements over aspirational wording. +- Keep repo and path references explicit when guidance is cross-repo. +- When referencing commands that belong to another repo, include that repo directory in the command example. + +## Commit and PR Expectations + +- Keep each commit scoped to one logical docs change. +- Do not mix unrelated changes or unrelated generated diffs. 
+- In PRs, call out: + - cross-repo assumptions + - generated files updated + - verification commands run diff --git a/Gemfile.lock b/Gemfile.lock index ca8724c5..f831f582 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -1,13 +1,13 @@ GIT remote: https://github.com/html2rss/html2rss-configs.git - revision: 2d64c4766e12835d64b6ecaa46ee23491a2be4a2 + revision: 24d38d64e578e93340e228c7bbfd2b187a2b94b6 specs: html2rss-configs (0.2.0) html2rss GIT remote: https://github.com/html2rss/html2rss.git - revision: 2e0fa0ca5835cc49268a85f7ad79e2a5b33b4b79 + revision: 04720b83c69f8b2fcf6d9ba7f360b84f6bcb5fe6 specs: html2rss (0.17.0) addressable (~> 2.7) @@ -59,7 +59,7 @@ GEM protocol-rack (~> 0.7) protocol-websocket (~> 0.17) base64 (0.3.0) - bigdecimal (4.0.1) + bigdecimal (4.1.0) brotli (0.8.0) concurrent-ruby (1.3.6) console (1.34.3) @@ -118,9 +118,9 @@ GEM fiber-storage fiber-storage (1.0.1) io-endpoint (0.17.2) - io-event (1.14.4) + io-event (1.14.5) io-stream (0.11.1) - json (2.19.2) + json (2.19.3) kramdown (2.5.2) rexml (>= 3.4.4) logger (1.7.0) @@ -178,6 +178,7 @@ GEM PLATFORMS arm64-darwin-24 + arm64-darwin-25 x86_64-linux DEPENDENCIES @@ -185,4 +186,4 @@ DEPENDENCIES html2rss-configs! BUNDLED WITH - 2.7.1 + 4.0.9 diff --git a/README.md b/README.md index 044845b4..773adb1a 100644 --- a/README.md +++ b/README.md @@ -13,21 +13,24 @@ website. ### Quick Setup ```bash -# Install dependencies and start development server +# Install dependencies and refresh generated feed data make setup + +# Start the local Astro development server +make dev ``` ### 💻 Try in Browser You can develop html2rss directly in your browser using GitHub Codespaces: -[Open in GitHub Codespaces](https://github.com/codespaces/new?repo=html2rss/html2rss) +[Open in GitHub Codespaces](https://github.com/codespaces/new?repo=html2rss/html2rss.github.io) The Codespace provides a cloud development environment with Node.js and Ruby pre-installed. Run `make setup` to install dependencies and get started! 
### Available Commands -- `make setup` - Install dependencies and start dev server +- `make setup` - Install dependencies and refresh generated feed data - `make dev` - Start Astro development server - `make build` - Build for production - `make preview` - Preview production build diff --git a/examples/deployment/.env b/examples/deployment/.env new file mode 100644 index 00000000..62f8201d --- /dev/null +++ b/examples/deployment/.env @@ -0,0 +1,26 @@ +# Domain & routing +CADDY_HOST=example.com + +# Core runtime +RACK_ENV=production + +# Security +# Generate with: openssl rand -hex 32 +HTML2RSS_SECRET_KEY=replace-with-64-hex-characters-generated-by-openssl-rand-hex-32 + +# Authenticated health endpoint token +# Required by the documented Compose stack. +# If you build a custom stack and probe only /api/v1/health/live and /api/v1/health/ready, +# you can omit this value. +HEALTH_CHECK_TOKEN=replace-with-strong-health-token + +# Auto source (optional; keep false unless you need automatic feed generation) +AUTO_SOURCE_ENABLED=false + +# Observability (optional) +#SENTRY_DSN= + +# Performance tuning (override defaults only when needed) +WEB_CONCURRENCY=2 +WEB_MAX_THREADS=5 +RACK_TIMEOUT_SERVICE_TIMEOUT=15 diff --git a/examples/deployment/docker-compose.yml b/examples/deployment/docker-compose.yml index 5b98fa66..e22e093c 100644 --- a/examples/deployment/docker-compose.yml +++ b/examples/deployment/docker-compose.yml @@ -1,19 +1,24 @@ services: - html2rss: + html2rss-web: image: html2rss/web:latest - env_file: .env + restart: unless-stopped + env_file: + - path: .env + required: false + environment: + PORT: 4000 caddy: image: caddy:2-alpine depends_on: - - html2rss + - html2rss-web command: - caddy - reverse-proxy - --from - ${CADDY_HOST} - --to - - html2rss:3000 + - html2rss-web:4000 ports: - "80:80" - "443:443" @@ -23,14 +28,12 @@ services: watchtower: image: containrrr/watchtower depends_on: - - html2rss + - html2rss-web - caddy command: - --cleanup - --interval - - 
"300" - - html2rss - - caddy + - "7200" volumes: - /var/run/docker.sock:/var/run/docker.sock:ro restart: unless-stopped diff --git a/src/components/docs/DockerComposeSnippet.astro b/src/components/docs/DockerComposeSnippet.astro index e2706e43..2a37676e 100644 --- a/src/components/docs/DockerComposeSnippet.astro +++ b/src/components/docs/DockerComposeSnippet.astro @@ -1,5 +1,5 @@ --- -import Code from "astro/components/Code.astro"; +import { Code } from "@astrojs/starlight/components"; import { browserlessImage, caddyImage, watchtowerImage, webImage } from "../../data/docker"; interface Props { @@ -15,13 +15,19 @@ const snippets: Record = { restart: unless-stopped ports: - "127.0.0.1:4000:4000" + env_file: + - path: .env + required: false environment: RACK_ENV: production PORT: 4000 - HTML2RSS_SECRET_KEY: your-generated-secret-key - HEALTH_CHECK_TOKEN: your-health-check-token + BUILD_TAG: \${BUILD_TAG:-local} + GIT_SHA: \${GIT_SHA:-local} + HTML2RSS_SECRET_KEY: \${HTML2RSS_SECRET_KEY:?set HTML2RSS_SECRET_KEY} + HEALTH_CHECK_TOKEN: \${HEALTH_CHECK_TOKEN:?set HEALTH_CHECK_TOKEN} + SENTRY_DSN: \${SENTRY_DSN:-} BROWSERLESS_IO_WEBSOCKET_URL: ws://browserless:4002 - BROWSERLESS_IO_API_TOKEN: your-browserless-token + BROWSERLESS_IO_API_TOKEN: \${BROWSERLESS_IO_API_TOKEN:?set BROWSERLESS_IO_API_TOKEN} browserless: image: "${browserlessImage}" @@ -31,10 +37,11 @@ const snippets: Record = { environment: PORT: 4002 CONCURRENT: 10 - TOKEN: your-browserless-token`, + TOKEN: \${BROWSERLESS_IO_API_TOKEN:?set BROWSERLESS_IO_API_TOKEN}`, productionCaddy: `services: caddy: image: ${caddyImage} + restart: unless-stopped ports: - "80:80" - "443:443" @@ -46,39 +53,71 @@ const snippets: Record = { - --from - \${CADDY_HOST} - --to - - html2rss:3000 - html2rss: + - html2rss-web:4000 + + html2rss-web: image: ${webImage} - env_file: .env + restart: unless-stopped + env_file: + - path: .env + required: false + environment: + RACK_ENV: production + PORT: 4000 + BUILD_TAG: \${BUILD_TAG:-local} 
+ GIT_SHA: \${GIT_SHA:-local} + HTML2RSS_SECRET_KEY: \${HTML2RSS_SECRET_KEY:?set HTML2RSS_SECRET_KEY} + HEALTH_CHECK_TOKEN: \${HEALTH_CHECK_TOKEN:?set HEALTH_CHECK_TOKEN} + SENTRY_DSN: \${SENTRY_DSN:-} + BROWSERLESS_IO_WEBSOCKET_URL: ws://browserless:4002 + BROWSERLESS_IO_API_TOKEN: \${BROWSERLESS_IO_API_TOKEN:?set BROWSERLESS_IO_API_TOKEN} + + browserless: + image: "${browserlessImage}" + restart: unless-stopped + environment: + PORT: 4002 + CONCURRENT: 10 + TOKEN: \${BROWSERLESS_IO_API_TOKEN:?set BROWSERLESS_IO_API_TOKEN} volumes: caddy_data:`, secure: `services: - html2rss: + html2rss-web: image: ${webImage} + restart: unless-stopped + env_file: + - path: .env + required: false environment: RACK_ENV: production - LOG_LEVEL: warn - HEALTH_CHECK_USERNAME: your-secure-username - HEALTH_CHECK_PASSWORD: your-very-secure-password - BASE_URL: https://yourdomain.com`, + PORT: 4000 + BUILD_TAG: \${BUILD_TAG:-local} + GIT_SHA: \${GIT_SHA:-local} + HTML2RSS_SECRET_KEY: \${HTML2RSS_SECRET_KEY:?set HTML2RSS_SECRET_KEY} + HEALTH_CHECK_TOKEN: \${HEALTH_CHECK_TOKEN:?set HEALTH_CHECK_TOKEN} + SENTRY_DSN: \${SENTRY_DSN:-} + BROWSERLESS_IO_WEBSOCKET_URL: ws://browserless:4002 + BROWSERLESS_IO_API_TOKEN: \${BROWSERLESS_IO_API_TOKEN:?set BROWSERLESS_IO_API_TOKEN} + + browserless: + image: "${browserlessImage}" + restart: unless-stopped + environment: + PORT: 4002 + CONCURRENT: 10 + TOKEN: \${BROWSERLESS_IO_API_TOKEN:?set BROWSERLESS_IO_API_TOKEN}`, watchtower: `services: watchtower: image: ${watchtowerImage} - depends_on: - - html2rss - - caddy - command: - - --cleanup - - --interval - - "300" - - html2rss - - caddy + restart: unless-stopped volumes: - /var/run/docker.sock:/var/run/docker.sock:ro - restart: unless-stopped`, + # Optional for private registries only: + # - "\${HOME}/.docker/config.json:/config.json:ro" + command: --cleanup --interval 7200 html2rss-web browserless caddy`, resourceGuardrails: `services: - html2rss: + html2rss-web: image: ${webImage} deploy: resources: 
diff --git a/src/content/docs/creating-custom-feeds.mdx b/src/content/docs/creating-custom-feeds.mdx index 16ddd656..24a38fed 100644 --- a/src/content/docs/creating-custom-feeds.mdx +++ b/src/content/docs/creating-custom-feeds.mdx @@ -5,8 +5,7 @@ sidebar: order: 2 --- -import { Aside } from "@astrojs/starlight/components"; -import Code from "astro/components/Code.astro"; +import { Aside, Code } from "@astrojs/starlight/components"; When auto-sourcing isn't enough, you can write your own configuration files to create custom RSS feeds for any website. This guide shows you how to take full control with YAML configs. @@ -70,11 +69,14 @@ This tells html2rss basic information about your feed - like giving it a name an **Example:** -```yaml -channel: - url: https://example.com/blog - title: My Awesome Blog -``` + This says: "Look at this website and call the feed 'My Awesome Blog'" @@ -84,16 +86,19 @@ This is where you tell the html2rss engine exactly what to find on the page. You **Example:** -```yaml -selectors: - items: - selector: "article.post" - title: - selector: "h2 a" - link: - selector: "h2 a" - attribute: href -``` + This says: "Find each article, get the title from the h2 anchor, and get the link from the same h2 anchor's href attribute" @@ -107,20 +112,23 @@ This says: "Find each article, get the title from the h2 anchor, and get the lin **Step 2:** Create a file called `example.com.yml` with this basic structure: -```yaml -channel: - url: https://example.com/blog - title: My Blog - -selectors: - items: - selector: "article.post" - title: - selector: "h2 a" - link: - selector: "h2 a" - attribute: href -``` + **Step 3:** Test it with your html2rss-web instance or the [Ruby gem](/ruby-gem/installation). @@ -147,15 +155,11 @@ html2rss supports many configuration options: 1. **Validate the config first:** - ```bash - html2rss validate your-config.yml - ``` + 2. **Then render the feed with the Ruby gem:** - ```bash - html2rss feed your-config.yml - ``` + 3. 
**Test with `html2rss-web`:** Add your config to the `feeds.yml` file and restart your instance @@ -169,9 +173,11 @@ Some sites need a little more request budget than the defaults. - Use `--max-requests` when your config needs more than one request, for example pagination or other follow-up fetches. diff --git a/src/content/docs/get-involved/contributing.mdx b/src/content/docs/get-involved/contributing.mdx index 87ddbf6c..1fe89d3d 100644 --- a/src/content/docs/get-involved/contributing.mdx +++ b/src/content/docs/get-involved/contributing.mdx @@ -3,6 +3,8 @@ title: Contributing description: Learn how to contribute to the html2rss project --- +import { Code } from "@astrojs/starlight/components"; + We're thrilled that you're interested in contributing to `html2rss`! There are many ways to get involved, and we welcome contributions of all kinds. --- @@ -31,9 +33,7 @@ Are you missing an RSS feed for a website? You can create your own feed config a **Want to test your config first?** Use the [Ruby gem](/ruby-gem/installation) to test it locally: -```bash -html2rss feed your-config.yml -``` + ### 2. Host a Public Instance diff --git a/src/content/docs/get-involved/self-hosting.mdx b/src/content/docs/get-involved/self-hosting.mdx index ea519ddd..111768cb 100644 --- a/src/content/docs/get-involved/self-hosting.mdx +++ b/src/content/docs/get-involved/self-hosting.mdx @@ -1,54 +1,28 @@ --- title: "Self-Host Your Own Instance" -description: "Take control of your information diet. Host your own html2rss instance and join the decentralized web movement." +description: "Start self-hosting with the current html2rss-web setup and deployment docs." sidebar: order: 3 --- -Turn any website into an RSS feed. Self-host your own instance to take back control of your information diet and help the html2rss ecosystem grow for everyone. +This page is the short routing point for self-hosting. 
The current setup and deployment instructions live under the `html2rss-web` docs so the Docker, token, and Browserless guidance only exists in one place. -## Before You Begin +## Recommended Path -This guide walks you through running a production-ready instance that friends, teams, or communities can rely on. You'll need: +1. **[Run html2rss-web locally](/web-application/getting-started/)** to verify your own instance with an included feed first. +2. **[Deploy html2rss-web to production](/web-application/how-to/deployment/)** when you are ready to expose or operate it. +3. **[Use automatic feed generation](/web-application/how-to/use-automatic-feed-generation/)** only if you want the token-gated page-URL workflow. -- A server you control (a VPS, home lab, or cloud instance) with Docker support. -- Comfort running a few terminal commands and editing configuration files. +## What To Expect -If that feels new, start with the [Getting Started guide](/web-application/getting-started/) for a friendly local install. It introduces the same concepts at a slower pace. When you're ready to go live, come back here and review the full [Deployment & Production guide](/web-application/how-to/deployment) for sizing tips, proxy examples, and hardening advice. - -Before you deploy, double-check this quick checklist: - -- Docker Engine and Docker Compose Plugin are installed on the host. -- Ports 80/443 (or the ports used by your TLS terminator) are open to the internet if you plan to serve other users. -- You can publish DNS for your chosen domain. - -## Deployment Overview - -1. Generate your `docker-compose.yml` and `config/feeds.yml` by following [Step 2 of the Getting Started guide](/web-application/getting-started/#step-2-create-the-configuration-file), then copy the resulting files into your deployment directory. -2. Create an `.env` file with production credentials and the values documented in the [environment reference](/web-application/reference/env-variables). 
Generate new secrets (`openssl rand -hex 32`) and avoid reusing the samples from local testing. -3. Adjust the compose file to match your host (volumes, proxy service, watchtower, resource limits). The [deployment guide](/web-application/how-to/deployment) shows complete examples for Caddy, health-check protection, and automatic updates. -4. Start the stack with `docker compose up -d` and verify the application is reachable at your chosen domain or internal endpoint. - -For extra reliability, integrate the instance with your existing reverse proxy, DNS, or platform tooling rather than running it ad hoc on a laptop. Treat it like any other production service so readers can trust it. - -## Harden & Secure - -- Follow the [Secure Your Instance](/web-application/how-to/deployment#secure-your-instance) checklist to lock down credentials, TLS, and network access. -- Enforce HTTPS by configuring a reverse proxy (see [Option A: Caddy](/web-application/how-to/deployment#option-a-caddy-automatic-https)) or your preferred terminator. If you manage certificates separately, document the renewal procedure alongside your deployment scripts. -- Review the [production preparation guidelines](/web-application/how-to/deployment#prepare-for-production) and keep secrets outside of version control. - -## Monitor & Maintain - -- Point your uptime monitor at `/health_check.txt` and review container logs regularly. The [Operate & Monitor](/web-application/how-to/deployment#operate--monitor) section outlines suggested thresholds. -- Automate updates with Watchtower (a Docker container that updates running containers) or your container management platform to receive the latest html2rss-web releases quickly. -- Track storage usage for feed cache volumes and prune unused images. Schedule periodic configuration reviews so feeds and credentials remain accurate. +- `html2rss-web` is the recommended self-hosted product surface. 
+- Included feeds are the lowest-maintenance way to prove a deployment. +- Automatic feed generation is disabled by default in production. +- The generated API contract is published as OpenAPI at `/openapi.yaml`. +- Custom config work belongs in the core `html2rss` docs and JSON Schema. ## Share Your Instance Running a reliable deployment benefits the broader community. Share your server with the broader community by adding it to the [community instance list](https://github.com/html2rss/html2rss-web/wiki/Instances) once it is stable and you are ready for other readers. Include details such as uptime expectations, moderation policy, and contact information so people know what to expect. Thanks for investing the time to share html2rss with others. Each new instance expands the open web and helps readers stay in control of the stories they follow. - -## License - -[MIT](https://github.com/html2rss/html2rss/blob/main/LICENSE) diff --git a/src/content/docs/getting-started.mdx b/src/content/docs/getting-started.mdx index c3874cfe..d0061e77 100644 --- a/src/content/docs/getting-started.mdx +++ b/src/content/docs/getting-started.mdx @@ -5,7 +5,7 @@ sidebar: order: 1 --- -import Code from "astro/components/Code.astro"; +import { Code } from "@astrojs/starlight/components"; This page points to the main onboarding flow. diff --git a/src/content/docs/ruby-gem/how-to/advanced-features.mdx b/src/content/docs/ruby-gem/how-to/advanced-features.mdx index 7d1088b3..ab429c29 100644 --- a/src/content/docs/ruby-gem/how-to/advanced-features.mdx +++ b/src/content/docs/ruby-gem/how-to/advanced-features.mdx @@ -3,6 +3,8 @@ title: "Advanced Features" description: "Advanced features and performance optimizations for html2rss." --- +import { Code } from "@astrojs/starlight/components"; + This guide covers advanced features and performance optimizations for html2rss. 
## Parallel Processing @@ -28,18 +30,21 @@ html2rss is designed to be memory-efficient: For websites with many items: -```yaml -channel: - url: "https://example.com/articles" -selectors: - items: - selector: ".article:not(.advertisement)" # Exclude ads - title: - selector: "h2" # More specific than generic selectors - url: - selector: "a" - extractor: "href" -``` + ## Error Recovery @@ -53,37 +58,42 @@ html2rss includes built-in error handling: Optimize requests with appropriate headers: -```yaml -headers: - Accept: "text/html,application/xhtml+xml" # Avoid JSON if not needed - Accept-Encoding: "gzip, deflate" # Enable compression -channel: - url: "https://example.com/articles" -selectors: - items: - selector: "article" - title: - selector: "h2" - url: - selector: "a" - extractor: "href" -``` + ## Monitoring and Debugging ### Enable Debug Logging -```bash -LOG_LEVEL=debug html2rss feed config.yml -``` + ### Web Application Health Checks -Use the health check endpoint to monitor feed generation: +Use the authenticated health endpoint to monitor the web application, or use liveness/readiness endpoints when you do not use an auth token: -```bash -curl -u username:password http://localhost:4000/health_check.txt -``` + ## Article Validation @@ -105,22 +115,25 @@ Invalid articles are automatically filtered out to prevent empty or broken feed You can add custom validation by using post-processors: -```yaml -channel: - url: "https://example.com/articles" -selectors: - items: - selector: "article" - title: - selector: "h2" - post_process: - - name: "gsub" - pattern: "^\\s*$" - replacement: "Untitled" - url: - selector: "a" - extractor: "href" -``` + ## Best Practices diff --git a/src/content/docs/ruby-gem/how-to/backward-compatibility.mdx b/src/content/docs/ruby-gem/how-to/backward-compatibility.mdx index 7ea074f8..92a7a6df 100644 --- a/src/content/docs/ruby-gem/how-to/backward-compatibility.mdx +++ b/src/content/docs/ruby-gem/how-to/backward-compatibility.mdx @@ -3,6 +3,8 
@@ title: "Backward Compatibility" description: "html2rss maintains backward compatibility with older configuration formats and attribute names." --- +import { Code } from "@astrojs/starlight/components"; + html2rss maintains backward compatibility with older configuration formats and attribute names. ## Renamed Attributes @@ -17,17 +19,20 @@ Some attribute names have been renamed for clarity, but the old names still work Both of these configurations work identically: -```yaml -# Current format (recommended) -selectors: - published_at: - selector: ".date" + ## Migration Guide diff --git a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx index 23fdfa7b..e361f410 100644 --- a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx +++ b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx @@ -3,7 +3,7 @@ title: "Custom HTTP Requests" description: "Learn how to customize HTTP requests with custom headers, authentication, and API interactions for html2rss." --- -import Code from "astro/components/Code.astro"; +import { Code } from "@astrojs/starlight/components"; Some sites only work when requests carry the headers, tokens, or cookies your browser uses. `html2rss` supports those cases without changing the rest of your feed workflow. @@ -28,19 +28,21 @@ You might need custom HTTP requests when: Add a `headers` section to your feed configuration. 
This example is a complete, valid config: object" - title: - selector: "title" - url: - selector: "url"`} + code={` + headers: + User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)" + Authorization: "Bearer YOUR_API_TOKEN" + Accept: "application/json" + channel: + url: https://api.example.com/posts + selectors: + items: + selector: "array > object" + title: + selector: "title" + url: + selector: "url" + `} lang="yaml" /> @@ -49,21 +51,23 @@ selectors: Request budgets are configured under `request`, not as top-level keys: @@ -77,99 +81,114 @@ selectors: Many APIs require authentication tokens: -```yaml -headers: - Authorization: "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..." - X-API-Key: "your-api-key-here" -channel: - url: "https://api.example.com/posts" -selectors: - items: - selector: "array > object" - title: - selector: "title" - url: - selector: "url" -``` + object" + title: + selector: "title" + url: + selector: "url" + `} + lang="yaml" +/> ### User Agent Spoofing Some websites block requests that don't look like real browsers: -```yaml -headers: - User-Agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" - Accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" - Accept-Language: "en-US,en;q=0.5" - Accept-Encoding: "gzip, deflate" -channel: - url: "https://example.com/articles" -selectors: - items: - selector: "article" - title: - selector: "h2" - url: - selector: "a" - extractor: "href" -``` + ### Content Type Negotiation Request specific content types: -```yaml -headers: - Accept: "application/json" -channel: - url: "https://api.example.com/posts" -selectors: - items: - selector: "array > object" - title: - selector: "title" - url: - selector: "url" -``` + object" + title: + selector: "title" + url: + selector: "url" + `} + lang="yaml" +/> ### Custom API Headers Some APIs require specific headers: -```yaml -headers: - X-Requested-With: "XMLHttpRequest" - X-Custom-Header: "your-value" - Content-Type: "application/json" 
-channel: - url: "https://api.example.com/posts" -selectors: - items: - selector: "array > object" - title: - selector: "title" - url: - selector: "url" -``` + object" + title: + selector: "title" + url: + selector: "url" + `} + lang="yaml" +/> ## Dynamic Headers You can use dynamic parameters in headers for runtime values: -```yaml -headers: - Authorization: "Bearer %s" - X-User-ID: "%s" -channel: - url: "https://api.example.com/users/%s/posts" -selectors: - items: - selector: "array > object" - title: - selector: "title" - url: - selector: "url" -``` +s" + X-User-ID: "%s" + channel: + url: "https://api.example.com/users/%s/posts" + selectors: + items: + selector: "array > object" + title: + selector: "title" + url: + selector: "url" + `} + lang="yaml" +/> See our [Dynamic Parameters guide](/ruby-gem/how-to/dynamic-parameters) for more details. @@ -183,13 +202,14 @@ See our [Dynamic Parameters guide](/ruby-gem/how-to/dynamic-parameters) for more Test your configuration to ensure headers work correctly: -```bash -# Test with curl first -curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com/posts - -# Then test with html2rss -html2rss feed your-config.yml -``` + ## Troubleshooting @@ -211,38 +231,44 @@ html2rss feed your-config.yml ### GitHub API -```yaml -headers: - Authorization: "token YOUR_GITHUB_TOKEN" - Accept: "application/vnd.github.v3+json" - User-Agent: "html2rss/1.0" -channel: - url: https://api.github.com/repos/owner/repo/issues -selectors: - items: - selector: "array > object" - title: - selector: "title" - url: - selector: "html_url" -``` + object" + title: + selector: "title" + url: + selector: "html_url" + `} + lang="yaml" +/> ### Reddit API -```yaml -headers: - User-Agent: "html2rss/1.0 by your-username" - Accept: "application/json" -channel: - url: https://www.reddit.com/r/programming.json -selectors: - items: - selector: "data > children > object > data" - title: - selector: "title" - url: - selector: "url" -``` + children > object > 
data" + title: + selector: "title" + url: + selector: "url" + `} + lang="yaml" +/> ## Related Topics diff --git a/src/content/docs/ruby-gem/how-to/dynamic-parameters.mdx b/src/content/docs/ruby-gem/how-to/dynamic-parameters.mdx index ad74ac31..92b1ca67 100644 --- a/src/content/docs/ruby-gem/how-to/dynamic-parameters.mdx +++ b/src/content/docs/ruby-gem/how-to/dynamic-parameters.mdx @@ -3,32 +3,35 @@ title: Dynamic Parameters description: "Learn how to use dynamic parameters in URLs and headers for creating reusable feed configurations. Pass runtime values to customize feeds." --- +import { Code } from "@astrojs/starlight/components"; + Use dynamic parameters when websites share the same structure but vary by URL or header values. ## Solution You can add dynamic parameters to the `channel` and `headers` values. This is useful for creating feeds from structurally similar pages with different URLs. -```yaml -channel: - url: "https://domainname.tld/whatever/%s.html" -headers: - X-Something: "%s" -selectors: - items: - selector: "article" - title: - selector: "h2" - url: - selector: "a" - extractor: "href" -``` +s.html" + headers: + X-Something: "%s" + selectors: + items: + selector: "article" + title: + selector: "h2" + url: + selector: "a" + extractor: "href" + `} + lang="yaml" +/> You can then pass the values for these parameters when you run `html2rss`: -```bash -html2rss feed the_feed_config.yml --params id:42 foo:bar -``` + --- diff --git a/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx b/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx index dbdc7dec..f7b739ad 100644 --- a/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx +++ b/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx @@ -3,7 +3,7 @@ title: Handling Dynamic Content description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss. Use browserless strategy for sites that load content dynamically." 
--- -import Code from "astro/components/Code.astro"; +import { Code } from "@astrojs/starlight/components"; Some websites load their content dynamically using JavaScript. The default `html2rss` strategy might not see this content. @@ -14,23 +14,25 @@ Use the [`browserless` strategy](/ruby-gem/reference/strategy) to render JavaScr Keep the strategy at the top level and put request-specific options under `request`: @@ -50,40 +52,49 @@ HTML snapshot is taken. ### Wait Before Capturing Dynamic Content -```yaml -strategy: browserless -request: - browserless: - preload: - wait_after_ms: 4000 -``` + ### Click "Load More" Buttons -```yaml -strategy: browserless -request: - browserless: - preload: - wait_after_ms: 3000 - click_selectors: - - selector: ".load-more" - max_clicks: 3 - wait_after_ms: 250 -``` + ### Scroll Infinite Lists -```yaml -strategy: browserless -request: - browserless: - preload: - scroll_down: - iterations: 5 - wait_after_ms: 200 - wait_after_ms: 2500 -``` + These preload steps can be combined in a single config when a site needs several interactions before all items appear. diff --git a/src/content/docs/ruby-gem/how-to/managing-feed-configs.mdx b/src/content/docs/ruby-gem/how-to/managing-feed-configs.mdx index 89441f6a..a28968bf 100644 --- a/src/content/docs/ruby-gem/how-to/managing-feed-configs.mdx +++ b/src/content/docs/ruby-gem/how-to/managing-feed-configs.mdx @@ -3,68 +3,78 @@ title: Managing Feed Configs description: "Learn how to manage feed configurations with YAML files. Organize, version control, and maintain your html2rss feed configurations effectively." --- +import { Code } from "@astrojs/starlight/components"; + For easier management, especially when using the CLI or `html2rss-web`, you can store your feed configurations in a YAML file. ## Global and Feed-Specific Configurations You can define global settings that apply to all feeds, and then define individual feed configurations under the `feeds` key. 
-```yml -# Global settings -headers: - "User-Agent": "Mozilla/5.0 (compatible; html2rss/1.0; Mobile)" - "Accept": "text/html" - -# Feed-specific settings -feeds: - my-first-feed: - channel: - url: "https://example.com/blog" - selectors: - items: - selector: ".post" - title: - selector: "h2" - url: - selector: "a" - extractor: "href" - my-second-feed: - channel: - url: "https://example.com/news" - selectors: - items: - selector: ".news-item" - title: - selector: "h2" - url: - selector: "a" - extractor: "href" -``` + ## Building Feeds from a YAML File ### Ruby -```ruby -require 'html2rss' + ### Command Line -```sh -# Build a specific feed -html2rss feed feeds.yml my-first-feed - -# Build a feed from a single-feed YAML file -html2rss feed single.yml -``` + diff --git a/src/content/docs/ruby-gem/how-to/scraping-json.mdx b/src/content/docs/ruby-gem/how-to/scraping-json.mdx index 5c375efc..d859ccec 100644 --- a/src/content/docs/ruby-gem/how-to/scraping-json.mdx +++ b/src/content/docs/ruby-gem/how-to/scraping-json.mdx @@ -3,6 +3,8 @@ title: Scraping JSON Responses description: "Learn how to scrape JSON APIs and responses with html2rss. Convert JSON to XML and use CSS selectors for data extraction." --- +import { Code } from "@astrojs/starlight/components"; + When a website returns a JSON response (i.e., with a `Content-Type` of `application/json`), `html2rss` converts the JSON to XML, allowing you to use CSS selectors for data extraction. > [!NOTE] @@ -14,26 +16,32 @@ When a website returns a JSON response (i.e., with a `Content-Type` of `applicat A JSON object like this: -```json -{ - "data": [{ "title": "Headline", "url": "https://example.com" }] -} -``` + is converted to this XML structure: -```xml - - - - - Headline - https://example.com - - - - -``` + + + + + Headline + https://example.com + + + + +`} + lang="xml" +/> You would use `array > object` as your `items` selector. @@ -41,20 +49,21 @@ You would use `array > object` as your `items` selector. 
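To make the mapping concrete, here is a minimal Ruby sketch of a JSON-to-XML conversion that produces the shape implied by the `array > object` selector. It is an illustration under assumptions, not the gem's implementation: the `json_to_xml` helper is hypothetical, the element names `object` and `array` are inferred from the selector, and the sketch skips XML escaping.

```ruby
require "json"

# Hypothetical helper: objects become <object>, arrays become <array>,
# and each object key becomes a child element named after the key.
def json_to_xml(value)
  case value
  when Hash
    children = value.map { |key, child| "<#{key}>#{json_to_xml(child)}</#{key}>" }
    "<object>#{children.join}</object>"
  when Array
    "<array>#{value.map { |child| json_to_xml(child) }.join}</array>"
  else
    value.to_s # scalars are emitted as plain text (no escaping in this sketch)
  end
end

json = '{"data": [{"title": "Headline", "url": "https://example.com"}]}'
xml = json_to_xml(JSON.parse(json))
puts xml
```

Because every array item ends up as an `object` element directly inside an `array` element, the same `array > object` selector works whether the array is nested under a key or sits at the top level.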
A JSON array like this: -```json -[{ "title": "Headline", "url": "https://example.com" }] -``` + is converted to this XML structure: -```xml - - - Headline - https://example.com - - -``` + + + Headline + https://example.com + + +`} + lang="xml" +/> You would use `array > object` as your `items` selector. @@ -62,34 +71,40 @@ You would use `array > object` as your `items` selector. ### Ruby -```ruby -Html2rss.feed( - headers: { - Accept: 'application/json' - }, - channel: { - url: 'https://domainname.tld/whatever.json' - }, - selectors: { - items: { selector: 'array > object' }, - title: { selector: 'title' }, - url: { selector: 'url' } - } -) -``` + object' }, + title: { selector: 'title' }, + url: { selector: 'url' } + } + ) +`} + lang="ruby" +/> ### YAML -```yml -headers: - Accept: application/json -channel: - url: "https://domainname.tld/whatever.json" -selectors: - items: - selector: "array > object" - title: - selector: ".title" - url: - selector: "url" -``` + object" + title: + selector: "title" + url: + selector: "url" + `} + lang="yml" +/> diff --git a/src/content/docs/ruby-gem/index.mdx b/src/content/docs/ruby-gem/index.mdx index 903e4128..31b96436 100644 --- a/src/content/docs/ruby-gem/index.mdx +++ b/src/content/docs/ruby-gem/index.mdx @@ -7,6 +7,12 @@ sidebar: This section provides comprehensive documentation for the `html2rss` Ruby gem. +If you are looking for the stable machine-readable contract for config authoring, use the JSON Schema exported by the core repo: + +- Repository file: [`schema/html2rss-config.schema.json`](https://github.com/html2rss/html2rss/blob/master/schema/html2rss-config.schema.json) +- CLI export: `html2rss schema` +- Runtime validation: `html2rss validate config.yml` + ## Getting Started If you are getting started with `html2rss`, we recommend starting with the [tutorials](/ruby-gem/tutorials). 
diff --git a/src/content/docs/ruby-gem/installation.mdx b/src/content/docs/ruby-gem/installation.mdx index 2d388c9c..7ba024a3 100644 --- a/src/content/docs/ruby-gem/installation.mdx +++ b/src/content/docs/ruby-gem/installation.mdx @@ -3,6 +3,8 @@ title: "Installation" description: "Install the html2rss Ruby gem on your system. Choose from multiple installation methods including gem install, Bundler, or GitHub Codespaces." --- +import { Code } from "@astrojs/starlight/components"; + This guide will walk you through installing the html2rss Ruby gem on your system. Choose the method that works best for your setup - we'll walk you through each option step by step. --- @@ -18,9 +20,7 @@ This guide will walk you through installing the html2rss Ruby gem on your system The simplest way to get html2rss for command-line usage is to install it as a Ruby gem. -```bash -gem install html2rss -``` + After installation, you should be able to run `html2rss --version` to confirm it's working. @@ -30,10 +30,13 @@ After installation, you should be able to run `html2rss --version` to confirm it If you're integrating html2rss into an existing Ruby project, add it to your `Gemfile`: -```ruby -# Gemfile -gem 'html2rss' -``` + Then, run `bundle install` in your project directory. @@ -53,9 +56,7 @@ The Codespace comes pre-configured with Ruby, all dependencies, and VS Code exte To ensure html2rss is installed correctly, open your terminal and run: -```bash -html2rss --version -``` + You should see the installed version number. If you encounter any issues, please refer to the [Troubleshooting Guide](/troubleshooting/troubleshooting). 
diff --git a/src/content/docs/ruby-gem/reference/auto-source.mdx b/src/content/docs/ruby-gem/reference/auto-source.mdx index 82e92df0..277db98e 100644 --- a/src/content/docs/ruby-gem/reference/auto-source.mdx +++ b/src/content/docs/ruby-gem/reference/auto-source.mdx @@ -3,15 +3,20 @@ title: Auto Source description: "Learn about the auto_source scraper that automatically finds items on a page. No CSS selectors needed - html2rss intelligently detects content." --- +import { Code } from "@astrojs/starlight/components"; + The `auto_source` scraper automatically finds items on a page, so you don't have to specify CSS selectors. To enable it, add `auto_source: {}` to your configuration: -```yaml -channel: - url: https://example.com -auto_source: {} -``` + ## How It Works @@ -38,37 +43,43 @@ You can customize `auto_source` to improve its accuracy. Enable or disable specific scrapers and adjust their settings: -```yaml -channel: - url: https://example.com -auto_source: - scraper: - wordpress_api: - enabled: false # default: true - schema: - enabled: false # default: true - semantic_html: - enabled: false # default: true - json_state: - enabled: false # default: true - html: - enabled: true - minimum_selector_frequency: 3 # default: 2 - use_top_selectors: 3 # default: 5 -``` + ### Cleanup Options Remove unwanted items from the results: -```yaml -channel: - url: https://example.com -auto_source: - cleanup: - keep_different_domain: false # default: true - min_words_title: 4 # default: 3 -``` + --- diff --git a/src/content/docs/ruby-gem/reference/channel.mdx b/src/content/docs/ruby-gem/reference/channel.mdx index ccc12948..a074a532 100644 --- a/src/content/docs/ruby-gem/reference/channel.mdx +++ b/src/content/docs/ruby-gem/reference/channel.mdx @@ -3,28 +3,33 @@ title: Channel description: "Learn about the channel configuration block for RSS feed metadata. Configure feed title, description, author, and other RSS channel properties." 
--- +import { Code } from "@astrojs/starlight/components"; + The `channel` configuration block defines your feed metadata. This example is a complete feed config so you can see the `channel` block in context: -```yaml -channel: - url: https://example.com - title: "My Custom Feed" - description: "A feed of the latest news from Example.com" - author: "jane.doe@example.com (Jane Doe)" - ttl: 60 - language: "en" - time_zone: "Europe/Berlin" -selectors: - items: - selector: "article" - title: - selector: "h2" - url: - selector: "a" - extractor: "href" -``` + ## Options diff --git a/src/content/docs/ruby-gem/reference/cli-reference.mdx b/src/content/docs/ruby-gem/reference/cli-reference.mdx index 458e3bfa..8e4ba22f 100644 --- a/src/content/docs/ruby-gem/reference/cli-reference.mdx +++ b/src/content/docs/ruby-gem/reference/cli-reference.mdx @@ -3,6 +3,8 @@ title: CLI Reference description: Complete reference for the html2rss command-line interface --- +import { Code } from "@astrojs/starlight/components"; + This page documents the `html2rss` command-line interface (CLI). For detailed documentation on the Ruby API, please refer to the official YARD documentation. @@ -17,42 +19,98 @@ The `html2rss` executable is the primary way to interact with the gem from your Automatically discovers items from a page and prints the generated RSS feed to stdout. -```bash -# Use the default faraday strategy -html2rss auto https://example.com/articles + -# Force browserless for JavaScript-heavy pages -html2rss auto https://example.com/app --strategy browserless +Command: `html2rss auto URL` -# Set custom request budgets -html2rss auto https://example.com/app --strategy browserless --max-redirects 5 --max-requests 6 +#### URL Surface Guidance For `auto` -# Hint the item selector while keeping auto enhancement -html2rss auto https://example.com/articles --items_selector ".post-card" -``` +`auto` works best when the input URL already exposes a server-rendered list of entries. 
-Command: `html2rss auto URL` +- High-success surfaces: + - newsroom or press listing pages + - blog/category/tag listing pages + - changelog/release notes/update listing pages + - paginated archive/list views +- Low-success surfaces: + - generic homepages with heavy promo/navigation chrome + - search results pages + - client-rendered app shells (`#app`, `#root`, `#__next`, etc.) -### Feed +When possible, pass a direct listing/update URL instead of a top-level homepage or app entrypoint. -Loads a YAML config, builds the feed, and prints the RSS XML to stdout. +#### Failure Outcomes You Should Expect + +When no extractable items are found, `auto` now classifies likely causes instead of only returning a generic message: + +- `blocked surface likely (anti-bot or interstitial)`: + - retry with `--strategy browserless` + - try a more specific public listing URL +- `app-shell surface detected`: + - retry with `--strategy browserless` + - switch to a direct listing/update URL +- `unsupported extraction surface for auto mode`: + - switch to listing/changelog/category URLs + - use explicit selectors in a feed config + +Known anti-bot interstitial responses (for example Cloudflare challenge pages) are surfaced explicitly as blocked-surface errors. + +#### Browserless Setup And Diagnostics (CLI) + +`browserless` is opt-in for CLI usage. + + -# Pass dynamic parameters into %s placeholders -html2rss feed single.yml --params id:42 foo:bar -``` +If you see `Browserless connection failed`, check: + +- `BROWSERLESS_IO_WEBSOCKET_URL` points to a reachable Browserless endpoint +- `BROWSERLESS_IO_API_TOKEN` matches the Browserless `TOKEN` +- the Browserless service is running and reachable from your shell environment + +For custom Browserless endpoints, `BROWSERLESS_IO_API_TOKEN` is required. + +### Feed + +Loads a YAML config, builds the feed, and prints the RSS XML to stdout. 
+ + Command: `html2rss feed YAML_FILE [feed_name]` @@ -62,16 +120,14 @@ The CLI keeps `strategy` as a top-level override and writes runtime request limi Prints the exported JSON Schema for the current gem version. -```bash -# Pretty-printed JSON (default) -html2rss schema - -# Compact JSON -html2rss schema --no-pretty - -# Write the schema to a file -html2rss schema --write tmp/html2rss-config.schema.json -``` + Command: `html2rss schema` @@ -79,13 +135,13 @@ Command: `html2rss schema` Validates a config with the runtime validator without generating a feed. -```bash -# Validate a single-feed file -html2rss validate single.yml - -# Validate one feed from a multi-feed file -html2rss validate feeds.yml my-first-feed -``` + Command: `html2rss validate YAML_FILE [feed_name]` diff --git a/src/content/docs/ruby-gem/reference/headers.mdx b/src/content/docs/ruby-gem/reference/headers.mdx index c2ff34ed..52053d05 100644 --- a/src/content/docs/ruby-gem/reference/headers.mdx +++ b/src/content/docs/ruby-gem/reference/headers.mdx @@ -3,27 +3,32 @@ title: Headers description: "Learn how to set custom HTTP headers for html2rss requests. Add authentication, user agents, and API keys to access protected content." --- +import { Code } from "@astrojs/starlight/components"; + The `headers` key allows you to set custom HTTP headers for your requests. This is useful for accessing APIs or other protected content. ## Configuration You can add any number of headers to your configuration. 
This example is a complete, valid feed config: -```yaml -headers: - User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)" - Authorization: "Bearer YOUR_TOKEN" - Accept: "application/json" -channel: - url: "https://api.example.com/posts" -selectors: - items: - selector: "array > object" - title: - selector: "title" - url: - selector: "url" -``` + object" + title: + selector: "title" + url: + selector: "url" + `} + lang="yaml" +/> ## Dynamic Parameters diff --git a/src/content/docs/ruby-gem/reference/selectors.mdx b/src/content/docs/ruby-gem/reference/selectors.mdx index e5adaa87..d25da5df 100644 --- a/src/content/docs/ruby-gem/reference/selectors.mdx +++ b/src/content/docs/ruby-gem/reference/selectors.mdx @@ -3,6 +3,8 @@ title: "Selectors" description: "The selectors scraper gives you fine-grained control over content extraction using CSS selectors." --- +import { Code } from "@astrojs/starlight/components"; + The `selectors` scraper gives you fine-grained control over content extraction using CSS selectors. > A valid RSS item requires at least a `title` or a `description`. @@ -11,37 +13,46 @@ The `selectors` scraper gives you fine-grained control over content extraction u At a minimum, you need an `items` selector to define the list of articles and a `title` selector for the article titles. -```yml -channel: - url: "https://example.com" -selectors: - items: - selector: ".article" - title: - selector: "h1" -``` + ## Automatic Item Enhancement To simplify configuration, `html2rss` can automatically extract the `title`, `url`, and `image` from each item. This feature is enabled by default. 
-```yml -selectors: - items: - selector: ".article" - enhance: true # default: true -``` + ## Item Ordering You can control the order of items in your feed: -```yml -selectors: - items: - selector: ".article" - order: "reverse" # Reverse the order of items (newest first) -``` + Available options: @@ -52,20 +63,23 @@ Available options: `html2rss` can follow a single `rel="next"` pagination chain when you configure `selectors.items.pagination.max_pages`. -```yml -channel: - url: "https://example.com/news" -selectors: - items: - selector: "article" - pagination: - max_pages: 3 - title: - selector: "h1" - url: - selector: "a" - extractor: "href" -``` + Behavior: @@ -135,48 +149,57 @@ Post-processors manipulate the extracted value. To add categories to an item, provide a list of selector names to the `categories` selector. -```yml -selectors: - genre: - selector: ".genre" - branch: - selector: ".branch" - categories: - - genre - - branch -``` + ### Custom GUID To create a custom GUID for an item, provide a list of selector names to the `guid` selector. -```yml -selectors: - title: - selector: "h1" - url: - selector: "a" - extractor: "href" - guid: - - url -``` + ### Enclosures To add an enclosure (e.g., an image, audio, or video file) to an item, use the `enclosure` selector to specify the URL of the file. -```yml -selectors: - items: - selector: ".post" - title: - selector: "h2" - enclosure: - selector: "audio" - extractor: "attribute" - attribute: "src" - content_type: "audio/mp3" -``` + --- diff --git a/src/content/docs/ruby-gem/reference/strategy.mdx b/src/content/docs/ruby-gem/reference/strategy.mdx index cd5e4d61..f35f383b 100644 --- a/src/content/docs/ruby-gem/reference/strategy.mdx +++ b/src/content/docs/ruby-gem/reference/strategy.mdx @@ -3,6 +3,8 @@ title: Strategy description: "Learn about different strategies for fetching website content with html2rss. Choose between faraday and browserless strategies for optimal performance." 
--- +import { Code } from "@astrojs/starlight/components"; + The `strategy` key defines how `html2rss` fetches a website's content. - **`faraday`** (default): Makes a direct HTTP request. It is fast but does not execute JavaScript. @@ -10,6 +12,8 @@ The `strategy` key defines how `html2rss` fetches a website's content. `strategy` is a top-level config key. Request-specific controls live under `request`. +Use `faraday` first for direct newsroom/listing/changelog pages. Prefer `browserless` when the target is client-rendered, protected by anti-bot checks, or otherwise requires JavaScript to expose article links. + ## `browserless` To use the `browserless` strategy, you need a running instance of [Browserless.io](https://www.browserless.io/). @@ -18,35 +22,41 @@ To use the `browserless` strategy, you need a running instance of [Browserless.i You can run a local Browserless.io instance using Docker: -```sh -docker run \ - --rm \ - -p 3000:3000 \ - -e "CONCURRENT=10" \ - -e "TOKEN=6R0W53R135510" \ - ghcr.io/browserless/chromium -``` + ### Configuration Set the `strategy` at the top level of your feed configuration and put request controls under `request`: -```yml -strategy: browserless -request: - max_redirects: 5 - max_requests: 6 -channel: - url: "https://example.com/app" -selectors: - items: - selector: ".article" - title: - selector: "h2" - url: - selector: "a" - extractor: "href" -``` + ### Request Structure @@ -60,47 +70,53 @@ Use this split consistently: Example: -```yml -strategy: browserless -headers: - User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)" -request: - max_redirects: 5 - max_requests: 6 - browserless: - preload: - wait_after_ms: 5000 -channel: - url: "https://example.com/app" -selectors: - items: - selector: ".article" - title: - selector: "h2" - url: - selector: "a" - extractor: "href" -``` + ### Browserless Preload Browserless can interact with the page before html2rss captures the final HTML. 
Configure preload steps under `request.browserless.preload`. -```yml -strategy: browserless -request: - browserless: - preload: - wait_after_ms: 5000 - click_selectors: - - selector: ".load-more" - max_clicks: 3 - wait_after_ms: 250 - scroll_down: - iterations: 5 - wait_after_ms: 200 -``` + - `wait_after_ms`: inserts a fixed wait before or after preload steps - `click_selectors`: clicks matching elements until they disappear or `max_clicks` is reached @@ -113,18 +129,29 @@ pagination therefore resolve against the page that was actually rendered after p You can also specify the strategy on the command line: -```sh -# Set environment variables for your Browserless.io instance -BROWSERLESS_IO_WEBSOCKET_URL="ws://127.0.0.1:3000" \ -BROWSERLESS_IO_API_TOKEN="6R0W53R135510" \ - html2rss feed my_config.yml --strategy browserless + + +### Browserless Troubleshooting + +If Browserless cannot connect, html2rss surfaces a `Browserless connection failed (...)` error with endpoint/token hints. + +Check these first: -# Override request budgets at runtime -html2rss feed my_config.yml --max-redirects 5 --max-requests 6 +- `BROWSERLESS_IO_WEBSOCKET_URL` is reachable from where html2rss runs +- `BROWSERLESS_IO_API_TOKEN` matches your Browserless `TOKEN` +- your Browserless service is running and accepting connections -# Or rely on the strategy stored in the YAML config -html2rss feed my_config.yml -``` +For custom Browserless websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is mandatory. The local default endpoint (`ws://127.0.0.1:3000`) can use the default local token `6R0W53R135510`. --- diff --git a/src/content/docs/ruby-gem/reference/stylesheets.mdx b/src/content/docs/ruby-gem/reference/stylesheets.mdx index 6110cf10..df4b1c62 100644 --- a/src/content/docs/ruby-gem/reference/stylesheets.mdx +++ b/src/content/docs/ruby-gem/reference/stylesheets.mdx @@ -3,6 +3,8 @@ title: Stylesheets description: "Learn how to add CSS and XSLT stylesheets to your RSS feeds. 
Customize the appearance and formatting of your generated RSS feeds." --- +import { Code } from "@astrojs/starlight/components"; + The `stylesheets` key allows you to add CSS or XSLT stylesheets to your RSS feed, improving its appearance in web browsers. This makes your RSS feeds look professional and branded when viewed directly in a browser. ## Why Style Your RSS Feed? @@ -18,25 +20,28 @@ Styling your RSS feed provides several benefits: You can add multiple stylesheets to a normal feed configuration: -```yaml -stylesheets: - - href: "/path/to/style.xsl" - media: "all" - type: "text/xsl" - - href: "https://example.com/rss.css" - media: "all" - type: "text/css" -channel: - url: "https://example.com/articles" -selectors: - items: - selector: "article" - title: - selector: "h2" - url: - selector: "a" - extractor: "href" -``` + ## Further Reading diff --git a/src/content/docs/ruby-gem/reference/wordpress-api.mdx b/src/content/docs/ruby-gem/reference/wordpress-api.mdx index 0e0a71ee..7798eab4 100644 --- a/src/content/docs/ruby-gem/reference/wordpress-api.mdx +++ b/src/content/docs/ruby-gem/reference/wordpress-api.mdx @@ -3,6 +3,8 @@ title: "WordPress API" description: "Use the WordPress API scraper inside auto_source to read WordPress posts through the site's REST API." --- +import { Code } from "@astrojs/starlight/components"; + The `wordpress_api` scraper is part of `auto_source`. When a WordPress site exposes its public REST API, `html2rss` can read posts from that API instead of scraping article HTML. This usually gives cleaner results because WordPress already exposes fields such as the title, content, excerpt, permalink, publish date, and category IDs. @@ -11,11 +13,14 @@ This usually gives cleaner results because WordPress already exposes fields such Enable `auto_source` as usual: -```yml -channel: - url: "https://example.com/blog" -auto_source: {} -``` + If the target is a standard WordPress site with a public API, no selectors are required. 
@@ -23,9 +28,7 @@ If the target is a standard WordPress site with a public API, no selectors are r The scraper works when the page exposes the standard WordPress API link in its ``: -```html - -``` +`} lang="html" /> If that link is missing or the API is blocked, `auto_source` falls back to its other discovery strategies. @@ -33,14 +36,17 @@ If that link is missing or the API is blocked, `auto_source` falls back to its o You can disable `wordpress_api` while keeping the rest of `auto_source` enabled: -```yml -channel: - url: "https://example.com/blog" -auto_source: - scraper: - wordpress_api: - enabled: false -``` + ## What Gets Extracted diff --git a/src/content/docs/ruby-gem/tutorials/simple-blog-list.mdx b/src/content/docs/ruby-gem/tutorials/simple-blog-list.mdx index df36c40c..8fa0781a 100644 --- a/src/content/docs/ruby-gem/tutorials/simple-blog-list.mdx +++ b/src/content/docs/ruby-gem/tutorials/simple-blog-list.mdx @@ -3,6 +3,8 @@ title: "Scraping a Simple Blog List" description: "This example demonstrates how to create a feed from a typical blog that has a list of articles on its homepage." --- +import { Code } from "@astrojs/starlight/components"; + This example demonstrates how to create a feed from a typical blog that has a list of articles on its homepage. --- @@ -17,18 +19,21 @@ We want to create an RSS feed that contains the title, link, and summary of each Here's a simplified view of the HTML structure we're targeting. The key is to find a container element that wraps each blog post (in this case, `.post-item`) and then find the selectors for the title, link, and summary within that container. -```html -
-
-

First Post Title

-

Summary of the first post...

-
-
-

Second Post Title

-

Summary of the second post...

+ +
+

First Post Title

+

Summary of the first post...

+
+
+

Second Post Title

+

Summary of the second post...

+
-
-``` +`} + lang="html" +/> --- @@ -36,20 +41,23 @@ Here's a simplified view of the HTML structure we're targeting. The key is to fi This configuration uses the `selectors` scraper to precisely extract the content we want. -```yaml -channel: - url: https://example.com/blog -selectors: - items: - selector: ".post-item" - title: - selector: ".post-title a" - url: - selector: ".post-title a" - extractor: "href" - description: - selector: ".post-summary" -``` + ### Configuration Breakdown diff --git a/src/content/docs/ruby-gem/tutorials/your-first-feed.mdx b/src/content/docs/ruby-gem/tutorials/your-first-feed.mdx index da94010f..3e8adb68 100644 --- a/src/content/docs/ruby-gem/tutorials/your-first-feed.mdx +++ b/src/content/docs/ruby-gem/tutorials/your-first-feed.mdx @@ -3,6 +3,8 @@ title: "Your First Feed" description: "Welcome to html2rss! This guide will walk you through creating your first RSS feed." --- +import { Code } from "@astrojs/starlight/components"; + Welcome to `html2rss`! This guide will walk you through creating your first RSS feed. We'll start with the easiest method and gradually introduce more powerful options. --- @@ -13,9 +15,7 @@ The easiest way to create a feed is with the `auto` command. It requires no conf Open your terminal and run this command: -```bash -html2rss auto https://unmatchedstyle.com/ -``` + `html2rss` will analyze the website and generate an RSS feed for you, printing it directly to the console. This is a great way to see if `html2rss` can automatically handle your target website. @@ -32,26 +32,27 @@ Let's create a feed for a simple article listing page. 1. **Create a file** named `example.yml`. 2. **Add the following content:** - ```yaml - channel: - url: https://example.com/articles - selectors: - items: - selector: ".article-card" - title: - selector: "h2 a" - url: - selector: "h2 a" - extractor: "href" - description: - selector: ".summary" - ``` + 3. 
**Run the `feed` command:** - ```bash - html2rss feed example.yml - ``` + This configuration tells `html2rss`: diff --git a/src/content/docs/troubleshooting/troubleshooting.mdx b/src/content/docs/troubleshooting/troubleshooting.mdx index b961b036..a02a7290 100644 --- a/src/content/docs/troubleshooting/troubleshooting.mdx +++ b/src/content/docs/troubleshooting/troubleshooting.mdx @@ -3,6 +3,8 @@ title: "Troubleshooting" description: "This guide provides solutions to common issues encountered when using html2rss." --- +import { Code } from "@astrojs/starlight/components"; + This guide provides solutions to common issues encountered when using `html2rss`. ## Essential Tools @@ -15,6 +17,21 @@ Your browser's developer tools are essential for troubleshooting. Use them to in ## Common Issues (Ruby Gem / CLI) +### `auto` Picks The Wrong Surface Or Finds No Items + +The `auto` flow is URL-surface sensitive. + +- Higher success inputs: + - newsroom/press listing URLs + - category/tag/listing/archive URLs + - changelog/release/update listing URLs +- Lower success inputs: + - generic homepages + - search result pages + - client-rendered app-shell entrypoints + +If extraction quality is poor, switch to a more specific listing/update URL before tuning selectors. + ### Empty Feeds If your feed is empty, check the following: @@ -25,6 +42,50 @@ If your feed is empty, check the following: - **JavaScript Content:** If the content is loaded via JavaScript, use the `browserless` strategy instead of `faraday`. - **Authentication:** Some sites require authentication — check if you need to add headers or use a different strategy. 
+### `No scrapers found` Failure Taxonomy (`auto`) + +`auto` classifies no-scraper failures with actionable hints: + +- **Blocked surface likely (anti-bot or interstitial):** + - retry with `--strategy browserless` + - try a more specific public listing URL +- **App-shell surface detected:** + - retry with `--strategy browserless` + - target a direct listing/update page instead of homepage/shell entrypoint +- **Unsupported extraction surface for auto mode:** + - switch to listing/changelog/category URLs + - or use explicit selectors in YAML config + +Known anti-bot interstitial patterns (for example Cloudflare challenge pages) are surfaced as blocked-surface errors instead of silent empty extraction results. + +### Browserless Connection / Setup Failures + +If you receive `Browserless connection failed (...)`: + +1. Confirm Browserless is running and reachable from the machine running `html2rss`. +2. Confirm `BROWSERLESS_IO_WEBSOCKET_URL` points at that running service. +3. Confirm `BROWSERLESS_IO_API_TOKEN` matches the Browserless `TOKEN`. + +Example local startup: + + + +Then run with: + + + +For custom websocket endpoints, `BROWSERLESS_IO_API_TOKEN` is required. + ### Configuration Errors Common configuration-related errors: @@ -40,7 +101,7 @@ If parts of your items (e.g., title, link) are missing, check the following: - **Selector:** Ensure the selector for the missing part is correct and relative to the `items.selector`. - **Extractor:** Verify that you are using the correct `extractor` (e.g., `text`, `href`, `attribute`). -- **Dynamic Content:** `html2rss` does not execute JavaScript. If the content is loaded dynamically, `html2rss` may not be able to see it. +- **Dynamic Content:** `faraday` does not render JavaScript. If content loads dynamically, run with `--strategy browserless` (with the Browserless service available) so the page can be rendered before extraction. 
### Date/Time Parsing Errors @@ -63,42 +124,33 @@ If you are getting a "command not found" error, try the following: ### Instance Won’t Start - Verify Docker is installed and running: - ```bash - docker --version - ``` + - Check logs for errors: - ```bash - docker compose logs - ``` -- Ensure the port (default: 3000) isn’t already in use: - ```bash - netstat -tulpn | grep :3000 - ``` + +- Ensure the app port (default compose binding: 4000) isn’t already in use: + +- If the app exits immediately in production, check that `HTML2RSS_SECRET_KEY` is set. ### Can’t Access the Web Interface -- Confirm your firewall allows traffic on port 3000 (or the port you configured) +- Confirm your firewall allows traffic on port 4000 or your reverse-proxy ports - Try accessing via the server’s IP instead of a domain name - Double-check that containers are running: - ```bash - docker compose ps - ``` + ### Authentication Errors -- **401 Unauthorized:** Check your `AUTO_SOURCE_USERNAME` and `AUTO_SOURCE_PASSWORD` environment variables. -- **403 Forbidden:** The URL is not in the `AUTO_SOURCE_ALLOWED_URLS` list, or the origin is not in `AUTO_SOURCE_ALLOWED_ORIGINS`. +- **401 Unauthorized when creating feeds:** The create-feed API expects a bearer token. Re-enter a valid access token in the UI or send `Authorization: Bearer ...` to `POST /api/v1/feeds`. +- **403 Forbidden when creating feeds:** Automatic feed generation may be disabled (`AUTO_SOURCE_ENABLED=false`) or the requested URL may not be allowed for the authenticated account. - **500 Internal Server Error:** Check the application logs for detailed error information. -- **Health check failures:** Use the `/health_check.txt` endpoint to identify which specific feed configurations are broken. +- **Health endpoint failures:** Use `GET /api/v1/health/live`, `GET /api/v1/health/ready`, or authenticated `GET /api/v1/health` depending on which probe you are testing. 
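The health probes listed above can be scripted. The Ruby sketch below builds the three requests with the standard library. It assumes the instance runs at `http://localhost:4000` and that the authenticated endpoint accepts the same `Authorization: Bearer ...` scheme as feed creation. The `health_request` and `probe` helpers are hypothetical names used only for illustration.

```ruby
require "net/http"
require "uri"

# Assumed local address of the html2rss-web instance.
BASE_URL = "http://localhost:4000"

# Builds a GET request for a health endpoint; adds a bearer token when given.
def health_request(path, token: nil)
  request = Net::HTTP::Get.new(URI.join(BASE_URL, path))
  request["Authorization"] = "Bearer #{token}" if token
  request
end

# Sends the request and returns the HTTP status code as a string.
# Needs a running instance, so it is defined here but not called.
def probe(request)
  uri = request.uri
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request).code }
end

live = health_request("/api/v1/health/live")
ready = health_request("/api/v1/health/ready")
full = health_request("/api/v1/health", token: "your-access-token")
puts [live, ready, full].map(&:path).inspect
```

Against a running instance, calling `probe(live)` or `probe(full)` shows which probe fails, which helps separate container startup problems from token problems.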
### Feed Problems - Some sites may require JavaScript rendering; ensure the `browserless` service is running - Check the feed configuration in `feeds.yml` for typos or invalid selectors - Look for parsing errors in the logs: - ```bash - docker compose logs html2rss-web - ``` + --- diff --git a/src/content/docs/web-application/getting-started.mdx b/src/content/docs/web-application/getting-started.mdx index 2da2b129..1beb338a 100644 --- a/src/content/docs/web-application/getting-started.mdx +++ b/src/content/docs/web-application/getting-started.mdx @@ -5,10 +5,12 @@ sidebar: order: 2 --- +import { Code } from "@astrojs/starlight/components"; + import AutoGenerationOptional from "../../../components/docs/AutoGenerationOptional.astro"; import MinimalDockerCompose from "../../../components/docs/MinimalDockerCompose.astro"; -Run `html2rss-web` locally with Docker, open the web interface, and verify that your instance can serve a working included feed. +Run `html2rss-web` locally with Docker, open the web interface, and verify that your instance can serve a working included feed before you enable direct feed generation. ## What You Will Have When This Works @@ -17,7 +19,7 @@ After this guide, you should have: - `html2rss-web` running at `http://localhost:4000` - the web interface loading correctly - a first included feed URL you can copy into your reader -- a clear path to either custom configs or more advanced setup +- a clear path to either token-gated feed generation or custom configs ## Installation Guide @@ -34,10 +36,13 @@ If you do not already have Docker, [install it first](https://docs.docker.com/ge Create a new folder for `html2rss-web`: -```bash -mkdir html2rss-web -cd html2rss-web -``` + ### Step 2: Create a Minimal Configuration File @@ -45,15 +50,26 @@ Create a file called `docker-compose.yml` in that folder and start with the mini -Add update automation later, after the first run works. +This minimal stack intentionally proves the included-feed path first. 
Add automatic updates, reverse proxying, or your own config file only after this first run works. ### Step 3: Start html2rss-web -Run: +Create a `.env` file in the same folder (minimum required values for this stack): + + .env < -```bash -docker compose up -d -``` +Then run: + + ## First Success Check @@ -68,7 +84,7 @@ At this point, `html2rss-web` should be running. 4. Confirm the feed opens 5. Copy that feed URL into your reader -If that works, the deployment, static feed path, and reader subscription path are working together. +If that works, the deployment, included-config path, and reader subscription path are working together. ## What To Do First @@ -78,7 +94,18 @@ Start with an included config from your own instance: 2. copy that feed URL into your reader 3. confirm your reader can subscribe successfully -That proves the core path before you invest in automatic generation or custom configs. +That proves the lowest-friction path before you invest in automatic generation or custom configs. + +## What Changes If You Enable Feed Generation + +Automatic feed generation is off by default in production. When you enable it later: + +- the web app creates feeds through `POST /api/v1/feeds` +- that API requires a bearer token +- the UI starts with `faraday` and automatically retries once with `browserless` when appropriate +- Browserless still needs to be configured for JavaScript-heavy pages + +If you are integrating this flow programmatically, the generated OpenAPI is available at `/openapi.yaml`. diff --git a/src/content/docs/web-application/how-to/automatic-updates.mdx b/src/content/docs/web-application/how-to/automatic-updates.mdx index de414469..cda50295 100644 --- a/src/content/docs/web-application/how-to/automatic-updates.mdx +++ b/src/content/docs/web-application/how-to/automatic-updates.mdx @@ -5,6 +5,23 @@ sidebar: order: 10 --- -The [watchtower](https://containrrr.dev/watchtower/) service automatically pulls running Docker images and checks for updates. 
If an update is available, it will automatically start the updated image with the same configuration as the running one. Please read its manual. +import { Code } from "@astrojs/starlight/components"; -The `docker-compose.yml` in the [Installation Guide](/web-application/getting-started#installation-guide) contains a service description for watchtower. +import DockerComposeSnippet from "../../../../components/docs/DockerComposeSnippet.astro"; + +Use [watchtower](https://containrrr.dev/watchtower/) to periodically pull and restart containers when newer images are available. + +Add this service to your existing `docker-compose.yml`: + + + +Then restart the stack: + + + +Operational notes: + +- Keep the Docker socket mount (read-only in this example). +- Add the optional Docker config mount only if you pull private images that require registry auth. +- The shown Watchtower command updates all running containers by default. +- Check `docker compose logs watchtower` to confirm scans and update runs. diff --git a/src/content/docs/web-application/how-to/deployment.mdx b/src/content/docs/web-application/how-to/deployment.mdx index 8a8cbea4..8e55d075 100644 --- a/src/content/docs/web-application/how-to/deployment.mdx +++ b/src/content/docs/web-application/how-to/deployment.mdx @@ -1,15 +1,26 @@ --- title: "Deployment & Production" -description: "Deploy html2rss-web to production with Docker. Learn best practices for hosting public instances with security, monitoring, and reliability." +description: "Deploy html2rss-web with Docker, keep the included-feed path simple, and only enable token-gated feed generation when you are ready to operate it." --- +import { Code } from "@astrojs/starlight/components"; + import DockerComposeSnippet from "../../../../components/docs/DockerComposeSnippet.astro"; -html2rss-web ships on Docker Hub, so you can launch this self-hosted service wherever Docker runs.
Start with the official [`docker-compose.yml`](https://github.com/html2rss/html2rss-web/blob/master/docker-compose.yml) from the [Installation Guide](/web-application/getting-started) as your baseline. +html2rss-web ships on Docker Hub, so you can launch this self-hosted service wherever Docker runs. Start with the official [`docker-compose.yml`](https://github.com/html2rss/html2rss-web/blob/main/docker-compose.yml) as your baseline, and treat the [Getting Started guide](/web-application/getting-started) as the required first proof that your instance can already serve included feeds locally. If you have not yet created a local instance, complete the [Getting Started guide](/web-application/getting-started) first. It walks through the one-time project directory setup, creating a minimal compose file, and confirming the application locally, which gives you the right baseline before exposing a self-hosted instance publicly. -Already running html2rss-web on your workstation? Great! The sections below focus on what changes when you take that setup to production. +Already running html2rss-web on your workstation? The sections below focus on what changes when you take that setup to production. + +## Choose Your Production Scope First + +There are two materially different deployment modes: + +- **Included feeds only:** lowest-maintenance path, suitable when the packaged feed set already covers your needs +- **Included feeds plus automatic generation:** requires `AUTO_SOURCE_ENABLED=true`, bearer-token distribution, and Browserless capacity planning + +If you do not need page-URL generation yet, keep `AUTO_SOURCE_ENABLED` off and ship the simpler mode first. 
## Prepare for Production @@ -18,6 +29,14 @@ Before exposing html2rss-web, ensure: - Your domain (for example `yourdomain.com`) resolves to the host running Docker - Inbound TCP ports 80 and 443 reach the host (check firewalls and cloud security groups) - You are ready to watch the first deployment logs for certificate issuance +- You have a value ready for `HTML2RSS_SECRET_KEY` +- You have a value ready for `HEALTH_CHECK_TOKEN` if you plan to monitor authenticated `GET /api/v1/health` (the documented Compose stack uses it; `/api/v1/health/live` and `/api/v1/health/ready` do not require it) + +If you plan to enable automatic feed generation, also prepare: + +- `BROWSERLESS_IO_API_TOKEN` +- Browserless capacity appropriate for the sites you expect to render +- an operator plan for how users obtain valid bearer tokens ### Why a Reverse Proxy? @@ -31,21 +50,41 @@ Caddy handles certificates and redirects with almost no configuration. - Create a `.env` file beside your compose file with the following variables: - ```env - # Required for reverse proxy and application - CADDY_HOST=yourdomain.com - BASE_URL=https://yourdomain.com - HTML2RSS_SECRET_KEY= # Generate with: openssl rand -hex 32 - HEALTHCHECK_USER= # Set a strong username - HEALTHCHECK_PASS= # Set a strong password - # Optional: see [environment reference](/web-application/reference/env-variables) - ``` + - Update your `.env` before starting the stack: - - Set `CADDY_HOST` and `BASE_URL` for your domain (for example `yourdomain.com` / `https://yourdomain.com`). + - Set `CADDY_HOST` for your domain. - Generate a production secret (`openssl rand -hex 32`) and assign it to `HTML2RSS_SECRET_KEY`. - - Replace the sample health-check credentials with strong values. - - Adjust optional knobs (auto source, Sentry, worker counts) as needed and refer to the [environment reference](/web-application/reference/env-variables) for details. 
+ - Set a strong `HEALTH_CHECK_TOKEN` when you use authenticated `GET /api/v1/health`; liveness/readiness probes can use `/api/v1/health/live` and `/api/v1/health/ready` without it. + - Set `BUILD_TAG` and `GIT_SHA` to real release metadata for production. + - Adjust optional knobs such as `AUTO_SOURCE_ENABLED` and `SENTRY_DSN` as needed; refer to the [environment reference](/web-application/reference/env-variables) for details. - After `docker compose up -d`, run `docker compose logs caddy --tail 20`; look for `certificate obtained`. - Re-test after DNS changes with [SSL Labs](https://www.ssllabs.com/ssltest/). - Want automatic updates? Add the Watchtower service shown below. @@ -54,10 +93,9 @@ Caddy handles certificates and redirects with almost no configuration. Harden the application before inviting other users: -- Set strong credentials for health checks and any protected feeds -- Configure `BASE_URL` with the final HTTPS domain before the first public run +- Set a strong `HEALTH_CHECK_TOKEN` for authenticated `GET /api/v1/health`, and separate strong bearer tokens for any protected feeds - Prefer environment files (`.env`) stored outside version control for secrets -- Keep admin-only routes behind basic auth or IP restrictions in your proxy +- Keep any operator-only token distribution flow outside public docs and outside version control @@ -67,7 +105,7 @@ Store these variables in a `.env` file and reference it with `env_file:` as demo Keep the instance healthy once it is in production: -- Monitor `https://yourdomain.com/health_check.txt` from your uptime tool +- Monitor `https://yourdomain.com/api/v1/health` with the configured bearer token for authenticated health checks - Review `docker compose logs` regularly for feed errors or certificate renewals - Enable automatic image updates so security patches roll out quickly - Right-size CPU and memory to avoid starvation when parsing large feeds @@ -76,6 +114,8 @@ Keep the instance healthy once it is in 
production: +This Watchtower shape scopes updates to `html2rss-web`, `browserless`, and `caddy`; replace service names if your stack differs. + Check `docker compose logs watchtower` occasionally to confirm updates are applied. ### Resource Guardrails @@ -87,6 +127,7 @@ Adjust the limits to match your host capacity. Increase memory if you process ma ## Share & Support - Test different feed sources before inviting external users +- Publish `/openapi.yaml` from the running instance if you expect agents or integrations to call the API directly - Add your instance to the [community wiki](https://github.com/html2rss/html2rss-web/wiki/Instances) if you want it listed publicly - Visit the [troubleshooting guide](/troubleshooting/troubleshooting) or join the [community discussions](https://github.com/orgs/html2rss/discussions) when you need help - Ready to contribute fixes or docs? See the [contributing guide](/get-involved/contributing) diff --git a/src/content/docs/web-application/how-to/setup-for-development.mdx b/src/content/docs/web-application/how-to/setup-for-development.mdx index b100e49b..61c3f7ab 100644 --- a/src/content/docs/web-application/how-to/setup-for-development.mdx +++ b/src/content/docs/web-application/how-to/setup-for-development.mdx @@ -3,6 +3,8 @@ title: "Setup for development" description: "Check out the git repository and set up your development environment." --- +import { Code } from "@astrojs/starlight/components"; + Check out the git repository and set up your development environment. ### Using Docker @@ -10,27 +12,34 @@ Check out the git repository and set up your development environment. This approach allows you to experiment without installing Ruby on your machine. All you need to do is install and run Docker. -```sh -# Build image from Dockerfile and name/tag it as html2rss-web: -docker build -t html2rss-web -f Dockerfile . 
+ ### Using installed Ruby diff --git a/src/content/docs/web-application/how-to/use-automatic-feed-generation.mdx b/src/content/docs/web-application/how-to/use-automatic-feed-generation.mdx index b7ed824e..15c7c06a 100644 --- a/src/content/docs/web-application/how-to/use-automatic-feed-generation.mdx +++ b/src/content/docs/web-application/how-to/use-automatic-feed-generation.mdx @@ -1,35 +1,48 @@ --- title: "Use automatic feed generation" -description: "Enable the web UI flow that generates a feed directly from a page URL." +description: "Enable the token-gated web UI flow that creates a stable feed from a page URL." --- -Automatic feed generation is a standout `html2rss-web` feature: paste a page URL, create a feed, then copy the generated feed URL. +import { Code } from "@astrojs/starlight/components"; -> **Note:** This feature is disabled by default. Enabling it should be a conscious decision on your self-hosted instance. +Automatic feed generation lets `html2rss-web` create a stable feed from a page URL. It is useful when the included config set does not already cover the site you want. + +Use this only after you have already verified your instance with an included feed. In production, this feature is disabled by default and should be enabled consciously on your own instance. + +## What This Flow Actually Requires + +This flow depends on three separate things: + +- `AUTO_SOURCE_ENABLED=true` on the server +- a bearer token that the instance accepts for feed creation +- Browserless configured if the target page needs JavaScript rendering + +The generated API contract for this flow is published at `/openapi.yaml`. ## How to Enable It Edit your `docker-compose.yml` and enable automatic feed generation: -```yaml -environment: - AUTO_SOURCE_ENABLED: "true" - AUTO_SOURCE_ALLOWED_ORIGINS: 127.0.0.1:4000 -``` + + +Keep the existing `BROWSERLESS_IO_WEBSOCKET_URL` and `BROWSERLESS_IO_API_TOKEN` settings if you want JavaScript-heavy pages to work reliably. 
-Then restart: +Then restart the stack: -```bash -docker compose down -docker compose up -d -``` + ## How to Use It 1. Open your instance at `http://localhost:4000` 2. Paste a page URL into `Create a feed` -3. Submit the form -4. If the instance requires access, provide a configured access token +3. Add a valid access token when prompted +4. Choose a strategy if needed, then submit 5. Copy the generated feed URL or open it directly ## What Success Looks Like @@ -40,9 +53,66 @@ When the flow works, you should see: - a copy action - an open-feed action - a preview of recent entries when available +- the same feed staying available at its tokenized URL That is enough to confirm the self-hosted flow is working. +## Strategy Behavior + +- `faraday` is the default strategy and should be your first try for most pages. +- During the feed-creation API request (`POST /api/v1/feeds`) from the web UI, a `faraday` submission may be retried once with `browserless` when the first failure looks retryable. +- If that fallback attempt fails, or if the first failure is clearly auth/URL/unsupported-strategy related, the UI stops and shows an error. +- This retry behavior is scoped to feed creation. It is not a general retry layer for later feed rendering (`GET /api/v1/feeds/:token`) or preview loading. + +## Input URL Guidance (Quality First) + +Automatic generation is most successful when the input URL is already a listing/update surface. + +- Higher-success inputs: +- newsroom/press listing pages +- category/tag/archive/listing pages +- changelog/release/update pages +- Lower-success inputs: +- generic homepages +- search pages +- app-shell entrypoints (client-rendered shells) + +If output quality is poor, switch the input to a direct listing/update URL before assuming the feature is broken. 
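Reviewer sketch (not part of the patch): the retry rule in the "Strategy Behavior" section above can be read as a single-fallback policy. The Python below is a minimal illustration under stated assumptions. `RetryableError`, `FinalError`, `create_with_fallback`, and the `create` callable are hypothetical names introduced here; they are not html2rss-web identifiers, and the real UI may classify failures differently.

```python
from typing import Callable


class RetryableError(Exception):
    """Hypothetical: a first failure that 'looks retryable'."""


class FinalError(Exception):
    """Hypothetical: auth, URL, or unsupported-strategy failures."""


def create_with_fallback(create: Callable[[str, str], str], page_url: str) -> str:
    """Try `faraday` first; retry exactly once with `browserless`.

    `create(page_url, strategy)` stands in for the feed-creation request
    (POST /api/v1/feeds) and is assumed to return the feed URL.
    """
    try:
        return create(page_url, "faraday")
    except RetryableError:
        # One fallback attempt only; if this raises, the error propagates
        # and the caller shows it instead of retrying again.
        return create(page_url, "browserless")
    # FinalError is deliberately not caught: it stops the flow immediately.
```

Because the fallback lives only in the creation path, later feed rendering (`GET /api/v1/feeds/:token`) and preview loading would not pass through this function, which matches the scoping note in the section.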
+ +## Failure Meanings You May See + +The backend runtime classifies common extraction failures with clearer intent: + +- blocked/interstitial surface likely +- app-shell surface likely +- unsupported extraction surface for auto mode + +In the current web product flow, these categories are mostly internal/operator-level signals (runtime/logging). They are not guaranteed to appear as labeled categories in the UI. + +What users typically see today: + +- feed-creation API errors (for example auth/URL/unsupported strategy) +- preview-level fallback text such as `Preview unavailable right now.` +- feed render error payloads when opening feed URLs directly + +## Browserless Troubleshooting In `html2rss-web` + +If Browserless-backed attempts fail: + +- verify the Browserless container/service is running +- verify `BROWSERLESS_IO_WEBSOCKET_URL` is reachable from the web container +- verify `BROWSERLESS_IO_API_TOKEN` matches the Browserless `TOKEN` + +For local Compose-based setups, check container health/logs with: + + + ## When to Stop and Switch Automatic feed generation is the fast first pass, not the final answer for every site. @@ -52,4 +122,4 @@ Move on to [Creating Custom Feeds](/creating-custom-feeds) when: - the generated feed misses important fields - the wrong items are extracted - the site needs a stable, reviewable setup -- you need `browserless` or more precise selectors to make the feed reliable +- you need repeatable selector-level control to make the feed reliable diff --git a/src/content/docs/web-application/index.mdx b/src/content/docs/web-application/index.mdx index aa689b1c..c39dac1b 100644 --- a/src/content/docs/web-application/index.mdx +++ b/src/content/docs/web-application/index.mdx @@ -6,7 +6,7 @@ sidebar: order: 1 --- -`html2rss-web` is the recommended way to get started. Run it locally with Docker, verify a working included feed from your own instance, and only then decide whether you need direct generation or custom configs. 
+`html2rss-web` is the recommended way to get started. Run it locally with Docker, verify a working included feed from your own instance, and only then decide whether you need token-gated direct generation or custom configs. ## Get Started @@ -21,9 +21,10 @@ Start with **[Getting Started](/web-application/getting-started)** to: - **Included feed catalog:** real embedded configs you can use immediately from your own deployment - **Web interface:** direct feed creation when you explicitly enable it -- **Optional access control:** unlock custom generation with an access token when configured +- **Access-controlled generation:** `POST /api/v1/feeds` requires a bearer token - **Config-based extension path:** move to custom feeds when you need reviewable rules - **Caching and HTTP handling:** shipped as part of the deployment +- **Generated API contract:** OpenAPI is published at `/openapi.yaml` The scraping and feed-building engine is provided by the Ruby gem [`html2rss`](https://github.com/html2rss/html2rss). @@ -34,3 +35,9 @@ The scraping and feed-building engine is provided by the Ruby gem [`html2rss`](h 3. **[Browse working feed examples](/feed-directory/)**: compare against existing outputs 4. **[Use automatic feed generation](/web-application/how-to/use-automatic-feed-generation/)**: enable direct page-URL conversion when you want that workflow 5. 
**[Create Custom Feeds](/creating-custom-feeds)**: build a stable custom setup when needed + +## For Integrations + +- **OpenAPI:** `/openapi.yaml` on your instance, or [`public/openapi.yaml`](https://github.com/html2rss/html2rss-web/blob/main/public/openapi.yaml) in the repo +- **API metadata:** `/api/v1` +- **Config schema for the core gem:** [`schema/html2rss-config.schema.json`](https://github.com/html2rss/html2rss/blob/master/schema/html2rss-config.schema.json) diff --git a/src/content/docs/web-application/reference/env-variables.mdx b/src/content/docs/web-application/reference/env-variables.mdx index 11b050fd..42b357b4 100644 --- a/src/content/docs/web-application/reference/env-variables.mdx +++ b/src/content/docs/web-application/reference/env-variables.mdx @@ -5,22 +5,19 @@ description: "Configuration reference for html2rss-web environment variables." ## Supported ENV variables -| Name | Description | -| ------------------------------ | -------------------------------- | -| `BASE_URL` | default: 'http://localhost:3000' | -| `LOG_LEVEL` | default: 'warn' | -| `HEALTH_CHECK_USERNAME` | default: auto-generated on start | -| `HEALTH_CHECK_PASSWORD` | default: auto-generated on start | -| | | -| `AUTO_SOURCE_ENABLED` | default: false | -| `AUTO_SOURCE_USERNAME` | no default. | -| `AUTO_SOURCE_PASSWORD` | no default. | -| `AUTO_SOURCE_ALLOWED_ORIGINS` | no default. | -| | | -| `PORT` | default: 3000 | -| `RACK_ENV` | default: 'development' | -| `RACK_TIMEOUT_SERVICE_TIMEOUT` | default: 15 | -| `WEB_CONCURRENCY` | default: 2 | -| `WEB_MAX_THREADS` | default: 5 | -| | | -| `SENTRY_DSN` | no default. 
| +| Name | Description | +| --------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `HTML2RSS_SECRET_KEY` | required in production; development/test gets a temporary default | +| `HEALTH_CHECK_TOKEN` | bearer token for authenticated `GET /api/v1/health`; optional unless you use that endpoint (the documented Compose stack includes it); `/api/v1/health/live` and `/api/v1/health/ready` do not require it | +| `BUILD_TAG` | defaults to `local` in the Compose stack; set release metadata explicitly in production | +| `GIT_SHA` | defaults to `local` in the Compose stack; set deployed commit SHA explicitly in production | +| `SENTRY_DSN` | optional; enables Sentry errors/logs when set | +| `BROWSERLESS_IO_WEBSOCKET_URL` | optional; Browserless websocket endpoint for `browserless` strategy | +| `BROWSERLESS_IO_API_TOKEN` | required by this site's Compose stack and by custom websocket endpoints; standalone `html2rss` local defaults can omit it | +| `AUTO_SOURCE_ENABLED` | `true` by default in development/test, `false` otherwise | +| `ASYNC_FEED_REFRESH_ENABLED` | optional boolean; default `false` | +| `ASYNC_FEED_REFRESH_STALE_FACTOR` | optional integer `>= 1`; default `3` | +| `PORT` | app listen port; compose uses `4000` | +| `RACK_ENV` | Rack environment; compose uses `production` | + +Older environment-variable examples from previous docs revisions are obsolete. Use only the supported table above and the `Environment & Runtime Flags` table in [`html2rss-web/docs/README.md`](https://github.com/html2rss/html2rss-web/blob/main/docs/README.md). 
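Reviewer sketch (not part of the patch): two rows of the new table state checkable constraints, so a small pre-deploy check may help operators catch mistakes early. This is a minimal illustration, assuming that "production" means `RACK_ENV=production` as in the Compose stack; `check_env` and `new_secret_key` are hypothetical helpers, not part of html2rss-web.

```python
import secrets


def check_env(env: dict[str, str]) -> list[str]:
    """Return human-readable problems found in an environment mapping."""
    problems = []

    # Assumption: production is signalled by RACK_ENV=production.
    production = env.get("RACK_ENV") == "production"
    if production and not env.get("HTML2RSS_SECRET_KEY"):
        problems.append("HTML2RSS_SECRET_KEY is required in production")

    # The table documents this value as an optional integer >= 1.
    raw = env.get("ASYNC_FEED_REFRESH_STALE_FACTOR")
    if raw is not None:
        try:
            valid = int(raw) >= 1
        except ValueError:
            valid = False
        if not valid:
            problems.append("ASYNC_FEED_REFRESH_STALE_FACTOR must be an integer >= 1")

    return problems


def new_secret_key() -> str:
    """Return 64 hex characters (32 random bytes), like `openssl rand -hex 32`."""
    return secrets.token_hex(32)
```

Calling `check_env(dict(os.environ))` after `import os` would run the same checks against the current process environment.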
diff --git a/src/content/docs/web-application/reference/index.mdx b/src/content/docs/web-application/reference/index.mdx index f8956e7d..81786b95 100644 --- a/src/content/docs/web-application/reference/index.mdx +++ b/src/content/docs/web-application/reference/index.mdx @@ -1,9 +1,17 @@ --- title: "Reference" -description: "This section contains detailed reference material." +description: "Detailed reference material for operating and integrating html2rss-web." sidebar: label: "Reference" order: 20 --- -This section contains detailed reference material. +This section contains the stable reference material for `html2rss-web`. + +## Integration Entry Points + +- **OpenAPI:** `/openapi.yaml` on a running instance, or [`public/openapi.yaml`](https://github.com/html2rss/html2rss-web/blob/main/public/openapi.yaml) in the repository +- **API root metadata:** `/api/v1` +- **Environment reference:** [env variables](/web-application/reference/env-variables/) + +If you need the config contract for the core `html2rss` gem rather than the web API, use the JSON Schema at [`schema/html2rss-config.schema.json`](https://github.com/html2rss/html2rss/blob/master/schema/html2rss-config.schema.json). diff --git a/src/content/docs/web-application/reference/monitoring.mdx b/src/content/docs/web-application/reference/monitoring.mdx index a4d7645a..9218ae05 100644 --- a/src/content/docs/web-application/reference/monitoring.mdx +++ b/src/content/docs/web-application/reference/monitoring.mdx @@ -1,25 +1,39 @@ --- title: "Monitoring" -description: "Runtime monitoring and application performance monitoring for html2rss-web." +description: "Health endpoints and Sentry-based monitoring for html2rss-web." --- -## Runtime monitoring via `GET /health_check.txt` +import { Code } from "@astrojs/starlight/components"; -It is recommended to set up monitoring of the `/health_check.txt` endpoint. With that, you can find out when one of _your own_ configs breaks. 
The endpoint uses HTTP Basic authentication. +## Health Endpoints -First, set the username and password via these environment variables: `HEALTH_CHECK_USERNAME` and `HEALTH_CHECK_PASSWORD`. If these are not set, html2rss-web will generate a new random username and password on _each_ start. +`html2rss-web` exposes these health endpoints: -An authenticated `GET /health_check.txt` request will respond with: +- `GET /api/v1/health/live`: liveness probe, no auth +- `GET /api/v1/health/ready`: readiness probe, no auth +- `GET /api/v1/health`: authenticated health probe, bearer token required -- If the feeds are generatable: `success`. -- Otherwise: the names of the broken configs. +## Authenticated Health Checks -To get notified when one of your configs breaks, set up monitoring of this endpoint. +Use `GET /api/v1/health` when you want an authenticated operator-facing probe. -[UptimeRobot's free plan](https://uptimerobot.com/) is sufficient for basic monitoring (every 5 minutes). -Create a monitor of type _Keyword_ with this information and make it aware of your username and password: +Set the environment variable `HEALTH_CHECK_TOKEN`, then send it as a bearer token: -![A screenshot showing the Keyword Monitor: a name, the instance's URL to /health_check.txt, and an interval.](/assets/images/uptimerobot_monitor.jpg) + + +The response is JSON and reports the current status, timestamp, environment, uptime, and reserved checks payload. + +## Probe Selection + +- Use `/api/v1/health/live` for a simple process-alive signal. +- Use `/api/v1/health/ready` for config-readiness checks without auth. +- Use `/api/v1/health` for authenticated monitoring from your uptime system or operator tooling. 
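Reviewer sketch (not part of the patch): for integrators, the authenticated probe can be exercised from a script. The Python below is a minimal illustration, assuming a local instance at `http://localhost:4000` and the standard `Authorization: Bearer <token>` header form; `build_health_request` and `fetch_health` are hypothetical helper names, and the response fields are not asserted here because only their general shape is documented.

```python
import json
import urllib.request

# Assumption: local instance from the Getting Started guide.
BASE_URL = "http://localhost:4000"


def build_health_request(base_url: str, token: str) -> urllib.request.Request:
    """Build the authenticated GET /api/v1/health request without sending it."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/health",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_health(base_url: str, token: str, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON body; HTTP errors raise."""
    request = build_health_request(base_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With `HEALTH_CHECK_TOKEN` exported in the shell, `fetch_health(BASE_URL, os.environ["HEALTH_CHECK_TOKEN"])` (after `import os`) would return the decoded payload. The unauthenticated `/api/v1/health/live` and `/api/v1/health/ready` probes need no header at all.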
## Application Performance Monitoring using Sentry diff --git a/src/content/docs/web-application/reference/versioning-and-releases.mdx b/src/content/docs/web-application/reference/versioning-and-releases.mdx index 903adff8..168aa8d6 100644 --- a/src/content/docs/web-application/reference/versioning-and-releases.mdx +++ b/src/content/docs/web-application/reference/versioning-and-releases.mdx @@ -5,11 +5,11 @@ description: Learn about versioning and release strategy for html2rss-web import { dockerHubRepository, dockerHubUrl } from "../../../../data/docker"; -This web application is distributed in a [rolling release](https://en.wikipedia.org/wiki/Rolling_release) fashion from the `master` branch. +This web application is distributed in a [rolling release](https://en.wikipedia.org/wiki/Rolling_release) fashion from the `main` branch. -For the latest commit passing GitHub CI/CD on the master branch, an updated Docker image will be pushed to Docker Hub: {dockerHubRepository}. +For the latest commit passing GitHub CI/CD on the main branch, an updated Docker image will be pushed to Docker Hub: {dockerHubRepository}. The [SBOM](https://en.wikipedia.org/wiki/Software_supply_chain) is embedded in the Docker image. -GitHub's @dependabot is enabled for dependency updates and they are automatically merged to the `master` branch when the CI gives the green light. +GitHub's @dependabot is enabled for dependency updates and they are automatically merged to the `main` branch when the CI gives the green light. If you use Docker, you should update to the latest image automatically by [setting up _watchtower_ as described](/web-application/how-to/automatic-updates).