Merged
17 changes: 17 additions & 0 deletions src/content/docs/creating-custom-feeds.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ sidebar:
---

import { Aside } from "@astrojs/starlight/components";
import Code from "astro/components/Code.astro";

When auto-sourcing isn't enough, you can write your own configuration files to create custom RSS feeds for any website. This guide shows you how to take full control with YAML configs.

Expand Down Expand Up @@ -160,6 +161,22 @@ html2rss supports many configuration options:

4. **Check the output:** Make sure all items have titles, links, and descriptions

### Useful CLI flags when a site is difficult

Some sites need a little more request budget than the defaults.

- Use `--max-redirects` when the site bounces through several canonicalization or tracking redirects before the real page loads.
- Use `--max-requests` when your config needs more than one request, for example pagination or other follow-up fetches.

<Code
code={`html2rss feed your-config.yml --max-redirects 10
html2rss feed your-config.yml --max-requests 5
html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5`}
lang="bash"
/>

Keep these values tight. Raise them only when the site proves it needs more.

## Add It To html2rss-web

Once the config works locally, add it to your `feeds.yml` or shared config repository and restart your
Expand Down
14 changes: 14 additions & 0 deletions src/content/docs/getting-started.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@ sidebar:
order: 1
---

import Code from "astro/components/Code.astro";

This page points to the main onboarding flow.

## Start Here
Expand All @@ -23,3 +25,15 @@ That guide is the canonical setup flow for:
- **[Browse working feed examples](/feed-directory/)** - See what success looks like
- **[Create Custom Feeds](/creating-custom-feeds)** - Write configs when you need more control
- **[Troubleshooting Guide](/troubleshooting/troubleshooting)** - Fix startup or extraction problems

## Using the Ruby CLI

If you are working directly with the gem instead of `html2rss-web`, start with:

<Code code={`html2rss auto https://example.com/blog`} lang="bash" />

If the target site is unusually redirect-heavy or needs extra follow-up requests, the CLI also supports:

<Code code={`html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5`} lang="bash" />

For config-driven runs, the same flags are available on `html2rss feed`.
10 changes: 2 additions & 8 deletions src/content/docs/ruby-gem/how-to/advanced-features.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -7,13 +7,7 @@ This guide covers advanced features and performance optimizations for html2rss.

## Parallel Processing

html2rss uses parallel processing to improve performance when scraping multiple items. This happens automatically and doesn't require any configuration.

### How It Works

- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the page
- **Item processing:** Each scraped item is processed in parallel
- **Performance benefit:** Significantly faster when dealing with many items
html2rss uses parallel processing in auto-source discovery. This happens automatically and doesn't require any configuration.

### Performance Tips

Expand Down Expand Up @@ -88,7 +82,7 @@ LOG_LEVEL=debug html2rss feed config.yml
Use the health check endpoint to monitor feed generation:

```bash
curl -u username:password http://localhost:3000/health_check.txt
curl -u username:password http://localhost:4000/health_check.txt
```

## Article Validation
Expand Down
46 changes: 41 additions & 5 deletions src/content/docs/ruby-gem/how-to/custom-http-requests.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,15 @@ title: "Custom HTTP Requests"
description: "Learn how to customize HTTP requests with custom headers, authentication, and API interactions for html2rss."
---

Some websites require custom HTTP headers, authentication, or other request settings to access their content. `html2rss` lets you customize requests for those cases.
import Code from "astro/components/Code.astro";

Some sites only work when requests carry the headers, tokens, or cookies your browser uses. `html2rss` supports those cases without changing the rest of your feed workflow.

Keep this structure in mind:

- `headers` stays top-level
- `strategy` stays top-level
- request-specific controls such as budgets and Browserless options live under `request`

## When You Need Custom Headers

Expand All @@ -19,8 +27,8 @@ You might need custom HTTP requests when:

Add a `headers` section to your feed configuration. This example is a complete, valid config:

```yaml
headers:
<Code
code={`headers:
User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
Authorization: "Bearer YOUR_API_TOKEN"
Accept: "application/json"
Expand All @@ -32,8 +40,36 @@ selectors:
title:
selector: "title"
url:
selector: "url"
```
selector: "url"`}
lang="yaml"
/>

## Request Controls

Request budgets are configured under `request`, not as top-level keys:

<Code
code={`headers:
User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
request:
max_redirects: 5
max_requests: 6
channel:
url: https://example.com/articles
selectors:
items:
selector: article
title:
selector: h2
url:
selector: a
extractor: href`}
lang="yaml"
/>

- `request.max_redirects` limits redirect hops
- `request.max_requests` limits the total request budget for the feed build
- `request.browserless.*` is reserved for Browserless-only behavior such as preload actions
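
The placement rule above can be checked mechanically before a config is handed to the gem. The following is a minimal sketch using only Ruby's standard library; `misplaced_request_keys` and `REQUEST_ONLY_KEYS` are hypothetical names for illustration and are not part of html2rss.

```ruby
require "yaml"

# Keys that, per the structure described above, belong under `request`.
REQUEST_ONLY_KEYS = %w[max_redirects max_requests browserless].freeze

# Hypothetical helper (not an html2rss API): returns request-only keys
# that were placed at the top level of a parsed config hash.
def misplaced_request_keys(config)
  REQUEST_ONLY_KEYS & config.keys
end

GOOD_CONFIG = YAML.safe_load(<<~YAML)
  headers:
    User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)"
  request:
    max_redirects: 5
    max_requests: 6
  channel:
    url: https://example.com/articles
YAML

# Same budget, but written as a top-level key by mistake.
BAD_CONFIG = YAML.safe_load(<<~YAML)
  max_redirects: 5
  channel:
    url: https://example.com/articles
YAML

puts misplaced_request_keys(GOOD_CONFIG).inspect
puts misplaced_request_keys(BAD_CONFIG).inspect
```

A check like this only inspects key placement; whether the gem itself rejects misplaced keys is not stated here.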

## Common Use Cases

Expand Down
73 changes: 73 additions & 0 deletions src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,38 @@ title: Handling Dynamic Content
description: "Learn how to handle JavaScript-heavy websites and dynamic content with html2rss. Use browserless strategy for sites that load content dynamically."
---

import Code from "astro/components/Code.astro";

Some websites load their content dynamically using JavaScript. The default `html2rss` strategy might not see this content.

## Solution

Use the [`browserless` strategy](/ruby-gem/reference/strategy) to render JavaScript-heavy websites with a headless browser.

Keep the strategy at the top level and put request-specific options under `request`:

<Code
code={`strategy: browserless
request:
max_redirects: 5
max_requests: 6
browserless:
preload:
wait_for_network_idle:
timeout_ms: 5000
channel:
url: https://example.com/app
selectors:
items:
selector: .article
title:
selector: h2
url:
selector: a
extractor: href`}
lang="yaml"
/>

## When to Use Browserless

The `browserless` strategy is necessary when:
Expand All @@ -18,6 +44,53 @@ The `browserless` strategy is necessary when:
- **Infinite scroll** - Content loads as you scroll
- **Dynamic forms** - Content changes based on user interaction

## Preload Actions

For dynamic sites, rendering once is often not enough. Use `request.browserless.preload` to wait, click, or scroll before the
HTML snapshot is taken.

### Wait for JavaScript Requests

```yaml
strategy: browserless
request:
browserless:
preload:
wait_for_network_idle:
timeout_ms: 4000
```

### Click "Load More" Buttons

```yaml
strategy: browserless
request:
browserless:
preload:
click_selectors:
- selector: ".load-more"
max_clicks: 3
delay_ms: 250
wait_for_network_idle:
timeout_ms: 3000
```

### Scroll Infinite Lists

```yaml
strategy: browserless
request:
browserless:
preload:
scroll_down:
iterations: 5
delay_ms: 200
wait_for_network_idle:
timeout_ms: 2500
```

These preload steps can be combined in a single config when a site needs several interactions before all items appear.
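
As a rough illustration of combining preload steps, the sketch below embeds one config with all three keys shown above and parses it with Ruby's standard `YAML` library. The exact nesting (for example, `max_clicks` and `delay_ms` sitting inside the `click_selectors` list item) is an assumption reconstructed from the snippets above, so compare it against the gem's exported schema before relying on it.

```ruby
require "yaml"

# Assumed shape: one preload hash carrying all three documented steps.
COMBINED_CONFIG = YAML.safe_load(<<~YAML)
  strategy: browserless
  request:
    browserless:
      preload:
        click_selectors:
          - selector: ".load-more"
            max_clicks: 3
            delay_ms: 250
        scroll_down:
          iterations: 5
          delay_ms: 200
        wait_for_network_idle:
          timeout_ms: 3000
YAML

# Pull out the preload section to see which steps are configured.
PRELOAD = COMBINED_CONFIG.dig("request", "browserless", "preload")

puts PRELOAD.keys.inspect
```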

## Performance Considerations

The `browserless` strategy is slower than the default `faraday` strategy because it:
Expand Down
13 changes: 9 additions & 4 deletions src/content/docs/ruby-gem/reference/auto-source.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -17,16 +17,19 @@ auto_source: {}

`auto_source` uses the following strategies to find content:

1. **`schema`:** Parses `<script type="json/ld">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
2. **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
3. **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
4. **json_state:** Single-page applications often stash pre-rendered article data in `<script type="application/json">` tags or global variables
1. **`wordpress_api`:** Detects the `<link rel="https://api.w.org/">` tag used by WordPress and pulls posts from the REST API without parsing article HTML. See [WordPress API](/ruby-gem/reference/wordpress-api/).
2. **`schema`:** Parses `<script type="application/ld+json">` tags containing structured data (e.g., [Schema.org](https://schema.org/)).
3. **`semantic_html`:** Searches for semantic HTML5 tags like `<article>`, `<main>`, and `<section>`.
4. **`html`:** Analyzes the HTML structure to find frequently occurring selectors that are likely to contain the main content.
5. **`json_state`:** Single-page applications often stash pre-rendered article data in `<script type="application/json">` tags or global variables
such as `window.__NEXT_DATA__`, `window.__NUXT__`, or `window.STATE`. The JSON-state scraper walks those blobs, finds arrays with
`title`/`url` pairs, and converts them into the same hashes produced by `HtmlExtractor`.

**`json_state` Limitations:** the scraper requires discoverable arrays of hashes containing clear `title` and `url` fields. Minified or
obfuscated state objects, heavily encoded values, or blobs that require executing embedded functions are ignored.
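
To make the "arrays of hashes with `title` and `url`" requirement concrete, here is a minimal sketch of that kind of walk using only Ruby's standard `JSON` library. It is not the gem's implementation; `find_article_arrays` and the sample blob are hypothetical.

```ruby
require "json"

# Hypothetical helper: recursively collect arrays whose elements are all
# hashes carrying both "title" and "url" keys.
def find_article_arrays(node, found = [])
  case node
  when Array
    if !node.empty? && node.all? { |e| e.is_a?(Hash) && e.key?("title") && e.key?("url") }
      found << node
    else
      node.each { |child| find_article_arrays(child, found) }
    end
  when Hash
    node.each_value { |child| find_article_arrays(child, found) }
  end
  found
end

# Sample state blob, shaped loosely like what a single-page app might embed.
STATE_JSON = <<~JSON
  {
    "props": {
      "pageProps": {
        "articles": [
          { "title": "First post", "url": "/posts/1" },
          { "title": "Second post", "url": "/posts/2" }
        ],
        "tags": ["ruby", "rss"]
      }
    }
  }
JSON

find_article_arrays(JSON.parse(STATE_JSON)).flatten.each do |article|
  puts "#{article['title']} -> #{article['url']}"
end
```

The `tags` array is skipped because its elements are plain strings, which mirrors the limitation described above: without clear `title` and `url` fields there is nothing to extract.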

**`wordpress_api` Limitations:** this scraper depends on the page exposing a public WordPress REST API root. The current implementation fetches post records directly, but it does not yet resolve category names or featured media metadata.
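
The detection step can be pictured as follows. This is a simplified sketch with standard-library Ruby and a regular expression, not the gem's actual scraper; `wordpress_api_root` and `posts_endpoint` are hypothetical helpers. The `wp/v2/posts` route is the standard WordPress REST API posts collection.

```ruby
require "uri"

# Hypothetical helper: return the href of the <link rel="https://api.w.org/">
# tag, or nil when the page does not advertise a WordPress REST API root.
def wordpress_api_root(html)
  tag = html[%r{<link\b[^>]*rel=["']https://api\.w\.org/["'][^>]*>}i]
  return nil unless tag

  tag[/href=["']([^"']+)["']/i, 1]
end

# Build the posts collection URL relative to the advertised API root.
def posts_endpoint(api_root)
  URI.join(api_root, "wp/v2/posts").to_s
end

SAMPLE_HTML = <<~HTML
  <html>
    <head>
      <link rel="https://api.w.org/" href="https://example.com/wp-json/" />
    </head>
  </html>
HTML

root = wordpress_api_root(SAMPLE_HTML)
puts posts_endpoint(root) if root
```

A real scraper would use an HTML parser rather than a regular expression; the regex keeps this sketch dependency-free.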

## Fine-Tuning

You can customize `auto_source` to improve its accuracy.
Expand All @@ -40,6 +43,8 @@ channel:
url: https://example.com
auto_source:
scraper:
wordpress_api:
enabled: false # default: true
schema:
enabled: false # default: true
semantic_html:
Expand Down
8 changes: 8 additions & 0 deletions src/content/docs/ruby-gem/reference/cli-reference.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,9 @@ html2rss auto https://example.com/articles
# Force browserless for JavaScript-heavy pages
html2rss auto https://example.com/app --strategy browserless

# Set custom request budgets
html2rss auto https://example.com/app --strategy browserless --max-redirects 5 --max-requests 6

# Hint the item selector while keeping auto enhancement
html2rss auto https://example.com/articles --items_selector ".post-card"
```
Expand All @@ -44,12 +47,17 @@ html2rss feed feeds.yml my-first-feed
# Override the request strategy at runtime
html2rss feed single.yml --strategy browserless

# Override request budgets at runtime
html2rss feed single.yml --max-redirects 5 --max-requests 6

# Pass dynamic parameters into %<param>s placeholders
html2rss feed single.yml --params id:42 foo:bar
```

Command: `html2rss feed YAML_FILE [feed_name]`

The CLI keeps `strategy` as a top-level override and writes runtime request limits into the generated config under `request`.
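
One way to picture that mapping is the sketch below. It is not the CLI's actual code; `apply_cli_overrides` is a hypothetical helper that only mirrors the described behaviour, with `strategy` staying top-level and the limits landing under `request`.

```ruby
# Hypothetical helper: merge runtime overrides into a parsed config hash.
# Only options that were actually given are written.
def apply_cli_overrides(config, strategy: nil, max_redirects: nil, max_requests: nil)
  result = config.dup
  result["strategy"] = strategy if strategy

  limits = { "max_redirects" => max_redirects, "max_requests" => max_requests }.compact
  result["request"] = (config["request"] || {}).merge(limits) unless limits.empty?
  result
end

BASE_CONFIG = {
  "channel" => { "url" => "https://example.com/app" },
  "request" => { "max_requests" => 2 }
}.freeze

merged = apply_cli_overrides(BASE_CONFIG, strategy: "browserless", max_redirects: 5, max_requests: 6)
puts merged.inspect
```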

### Schema

Prints the exported JSON Schema for the current gem version.
Expand Down
6 changes: 4 additions & 2 deletions src/content/docs/ruby-gem/reference/selectors.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,9 @@ selectors:
Behavior:

- `max_pages` is the total page budget for the item selector chain, including the initial page.
- `max_pages` is capped by the system request ceiling of 10 pages per feed build.
- Pagination follows strict `link[rel~="next"]` or `a[rel~="next"]` targets only.
- Follow-up pages use the current page's effective origin after redirects.
- Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted.
- The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial.
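
The stopping rules listed above can be sketched as a small loop. This is an illustration under stated assumptions, not the gem's pagination code: `PAGES` is an in-memory stand-in for fetched pages and their `rel="next"` targets, and `paginate` and `SYSTEM_PAGE_CEILING` are hypothetical names.

```ruby
require "set"

# Ceiling taken from the behavior list above.
SYSTEM_PAGE_CEILING = 10

# Stand-in for the network: each URL maps to its rel="next" target.
PAGES = {
  "https://example.com/articles" => "https://example.com/articles?page=2",
  "https://example.com/articles?page=2" => "https://example.com/articles?page=3",
  "https://example.com/articles?page=3" => "https://example.com/articles?page=2" # points back
}.freeze

# Returns the URLs visited, in order. Stops when there is no next link,
# a page repeats, the page limit is reached, or the budget runs out.
def paginate(start_url, max_pages:, request_budget:)
  page_limit = [max_pages, SYSTEM_PAGE_CEILING].min
  visited = []
  seen = Set.new
  url = start_url

  while url && visited.size < page_limit && request_budget.positive? && !seen.include?(url)
    seen << url
    visited << url
    request_budget -= 1
    url = PAGES[url]
  end

  visited
end

puts paginate("https://example.com/articles", max_pages: 5, request_budget: 10).inspect
```

Because `max_pages` counts the initial page, `max_pages: 2` yields the start page plus one follow-up page.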

Expand Down Expand Up @@ -120,10 +122,10 @@ Post-processors manipulate the extracted value.
- `html_to_markdown`: Converts HTML to Markdown.
- `markdown_to_html`: Converts Markdown to HTML.
- `parse_time`: Parses a string into a `Time` object.
- `parse_uri`: Parses a string into a `URI` object.
- `parse_uri`: Resolves a relative URL against `channel.url` and returns the normalized URL string.
- `sanitize_html`: Sanitizes HTML to prevent security vulnerabilities.
- `substring`: Extracts a substring from a string.
- `template`: Creates a new string from a template and other selector values.
- `template`: Creates a new string from a template and other selector values. Use `%{self}` for the current selector value.

> Always use the `sanitize_html` post-processor for any HTML content to prevent security risks.
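
For intuition about `parse_uri` and `template`, the sketch below reproduces the described effects with Ruby's standard library. It is an approximation, not the post-processors' source: `resolve_url` and `apply_template` are hypothetical helpers, `CHANNEL_URL` is a made-up base, and the assumption that `template` behaves like Ruby's `format` with named references is inferred from the `%{self}` syntax.

```ruby
require "uri"

# Hypothetical stand-in for channel.url.
CHANNEL_URL = "https://example.com/blog/"

# Resolve a possibly relative URL against the channel URL and return
# the normalized result as a string.
def resolve_url(value, base = CHANNEL_URL)
  URI.join(base, value).normalize.to_s
end

# Fill named %{...} references; :self stands for the current selector value.
def apply_template(template, values)
  format(template, values)
end

puts resolve_url("../posts/42")
puts apply_template("%{self} (by %{author})", self: "Hello", author: "Ada")
```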

Expand Down