diff --git a/src/content/docs/creating-custom-feeds.mdx b/src/content/docs/creating-custom-feeds.mdx
index 3814969c..16ddd656 100644
--- a/src/content/docs/creating-custom-feeds.mdx
+++ b/src/content/docs/creating-custom-feeds.mdx
@@ -6,6 +6,7 @@ sidebar:
 ---

 import { Aside } from "@astrojs/starlight/components";
+import Code from "astro/components/Code.astro";

 When auto-sourcing isn't enough, you can write your own configuration files to create custom RSS feeds for any website. This guide shows you how to take full control with YAML configs.

@@ -160,6 +161,22 @@ html2rss supports many configuration options:

 4. **Check the output:** Make sure all items have titles, links, and descriptions

+### Useful CLI flags when a site is difficult
+
+Some sites need a little more request budget than the defaults.
+
+- Use `--max-redirects` when the site bounces through several canonicalization or tracking redirects before the real page loads.
+- Use `--max-requests` when your config needs more than one request, for example pagination or other follow-up fetches.
+
+
+
+Keep these values tight. Raise them only when the site proves it needs more.
+
 ## Add It To html2rss-web

 Once the config works locally, add it to your `feeds.yml` or shared config repository and restart your
diff --git a/src/content/docs/getting-started.mdx b/src/content/docs/getting-started.mdx
index f08de5ba..f5e2a871 100644
--- a/src/content/docs/getting-started.mdx
+++ b/src/content/docs/getting-started.mdx
@@ -5,6 +5,8 @@ sidebar:
   order: 1
 ---

+import Code from "astro/components/Code.astro";
+
 This page points to the main onboarding flow.
 ## Start Here
@@ -23,3 +25,15 @@ That guide is the canonical setup flow for:
 - **[Browse working feed examples](/feed-directory/)** - See what success looks like
 - **[Create Custom Feeds](/creating-custom-feeds)** - Write configs when you need more control
 - **[Troubleshooting Guide](/troubleshooting/troubleshooting)** - Fix startup or extraction problems
+
+## Using the Ruby CLI
+
+If you are working directly with the gem instead of `html2rss-web`, start with:
+
+
+
+If the target site is unusually redirect-heavy or needs extra follow-up requests, the CLI also supports:
+
+
+
+For config-driven runs, the same flags are available on `html2rss feed`.
diff --git a/src/content/docs/ruby-gem/how-to/advanced-features.mdx b/src/content/docs/ruby-gem/how-to/advanced-features.mdx
index 703bd9e9..7d1088b3 100644
--- a/src/content/docs/ruby-gem/how-to/advanced-features.mdx
+++ b/src/content/docs/ruby-gem/how-to/advanced-features.mdx
@@ -7,13 +7,7 @@ This guide covers advanced features and performance optimizations for html2rss.

 ## Parallel Processing

-html2rss uses parallel processing to improve performance when scraping multiple items. This happens automatically and doesn't require any configuration.
-
-### How It Works
-
-- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the page
-- **Item processing:** Each scraped item is processed in parallel
-- **Performance benefit:** Significantly faster when dealing with many items
+html2rss uses parallel processing in auto-source discovery. This happens automatically and doesn't require any configuration.
 ### Performance Tips
@@ -88,7 +82,7 @@ LOG_LEVEL=debug html2rss feed config.yml
 Use the health check endpoint to monitor feed generation:

 ```bash
-curl -u username:password http://localhost:3000/health_check.txt
+curl -u username:password http://localhost:4000/health_check.txt
 ```

 ## Article Validation
diff --git a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx
index 33b6cca3..23fdfa7b 100644
--- a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx
+++ b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx
@@ -3,7 +3,15 @@ title: "Custom HTTP Requests"
 description: "Learn how to customize HTTP requests with custom headers, authentication, and API interactions for html2rss."
 ---

-Some websites require custom HTTP headers, authentication, or other request settings to access their content. `html2rss` lets you customize requests for those cases.
+import Code from "astro/components/Code.astro";
+
+Some sites only work when requests carry the headers, tokens, or cookies your browser uses. `html2rss` supports those cases without changing the rest of your feed workflow.
+
+Keep this structure in mind:
+
+- `headers` stays top-level
+- `strategy` stays top-level
+- request-specific controls such as budgets and Browserless options live under `request`

 ## When You Need Custom Headers

@@ -19,8 +27,8 @@ You might need custom HTTP requests when:
 Add a `headers` section to your feed configuration.
 This example is a complete, valid config:
-```yaml
-headers:
+
+
+## Request Controls
+
+Request budgets are configured under `request`, not as top-level keys:
+
+
+
+- `request.max_redirects` limits redirect hops
+- `request.max_requests` limits the total request budget for the feed build
+- `request.browserless.*` is reserved for Browserless-only behavior such as preload actions
 ## Common Use Cases
diff --git a/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx b/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx
index c0e5e379..2ca0db72 100644
--- a/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx
+++ b/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx
@@ -3,12 +3,38 @@ title: Handling Dynamic Content
 description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss. Use browserless strategy for sites that load content dynamically."
 ---

+import Code from "astro/components/Code.astro";
+
 Some websites load their content dynamically using JavaScript. The default `html2rss` strategy might not see this content.

 ## Solution

 Use the [`browserless` strategy](/ruby-gem/reference/strategy) to render JavaScript-heavy websites with a headless browser.

+Keep the strategy at the top level and put request-specific options under `request`:
+
+
+
 ## When to Use Browserless

 The `browserless` strategy is necessary when:
@@ -18,6 +44,53 @@ The `browserless` strategy is necessary when:
 - **Infinite scroll** - Content loads as you scroll
 - **Dynamic forms** - Content changes based on user interaction

+## Preload Actions
+
+For dynamic sites, rendering once is often not enough. Use `request.browserless.preload` to wait, click, or scroll before the
+HTML snapshot is taken.
+
+### Wait for JavaScript Requests
+
+```yaml
+strategy: browserless
+request:
+  browserless:
+    preload:
+      wait_for_network_idle:
+        timeout_ms: 4000
+```
+
+### Click "Load More" Buttons
+
+```yaml
+strategy: browserless
+request:
+  browserless:
+    preload:
+      click_selectors:
+        - selector: ".load-more"
+          max_clicks: 3
+          delay_ms: 250
+      wait_for_network_idle:
+        timeout_ms: 3000
+```
+
+### Scroll Infinite Lists
+
+```yaml
+strategy: browserless
+request:
+  browserless:
+    preload:
+      scroll_down:
+        iterations: 5
+        delay_ms: 200
+      wait_for_network_idle:
+        timeout_ms: 2500
+```
+
+These preload steps can be combined in a single config when a site needs several interactions before all items appear.
+
 ## Performance Considerations

 The `browserless` strategy is slower than the default `faraday` strategy because it:
diff --git a/src/content/docs/ruby-gem/reference/auto-source.mdx b/src/content/docs/ruby-gem/reference/auto-source.mdx
index 33454232..82e92df0 100644
--- a/src/content/docs/ruby-gem/reference/auto-source.mdx
+++ b/src/content/docs/ruby-gem/reference/auto-source.mdx
@@ -17,16 +17,19 @@ auto_source: {}

 `auto_source` uses the following strategies to find content:

-1. **`schema`:** Parses `