26 changes: 17 additions & 9 deletions docs/01_introduction/quick-start.mdx
Original file line number Diff line number Diff line change
@@ -67,7 +67,7 @@ The Actor's source code is in the `src` folder. This folder contains two importa
{MainExample}
</CodeBlock>
</TabItem>
<TabItem value="__main__.py" label="__main.py__">
<TabItem value="__main__.py" label="__main__.py">
<CodeBlock className="language-python">
{UnderscoreMainExample}
</CodeBlock>
@@ -97,12 +97,20 @@ To learn more about the features of the Apify SDK and how to use them, check out

### Guides

To see how you can integrate the Apify SDK with popular web scraping libraries, check out our guides:
To see how you can integrate the Apify SDK with popular scraping libraries and frameworks, check out these guides:

- [BeautifulSoup with HTTPX](../guides/beautifulsoup-httpx)
- [Parsel with Impit](../guides/parsel-impit)
- [Playwright](../guides/playwright)
- [Selenium](../guides/selenium)
- [Crawlee](../guides/crawlee)
- [Scrapy](../guides/scrapy)
- [Running webserver](../guides/running-webserver)
- [Scraping with BeautifulSoup and HTTPX](../guides/beautifulsoup-httpx)
- [Scraping with Parsel and Impit](../guides/parsel-impit)
- [Browser automation with Playwright](../guides/playwright)
- [Browser automation with Selenium](../guides/selenium)
- [Building crawlers with Crawlee](../guides/crawlee)
- [Building crawlers with Scrapy](../guides/scrapy)
- [Adaptive scraping with Scrapling](../guides/scrapling)
- [LLM-ready scraping with Crawl4AI](../guides/crawl4ai)
- [Browser AI agents with Browser Use](../guides/browser-use)

For other aspects of Actor development, explore these guides:

- [Project management with uv](../guides/uv)
- [Input validation with Pydantic](../guides/input-validation)
- [Running a web server](../guides/running-webserver)
10 changes: 7 additions & 3 deletions docs/03_guides/01_beautifulsoup_httpx.mdx
@@ -1,14 +1,14 @@
---
id: beautifulsoup-httpx
title: Use BeautifulSoup with HTTPX
title: Scraping with BeautifulSoup and HTTPX
description: Build an Apify Actor that scrapes web pages using BeautifulSoup and HTTPX.
---

import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import BeautifulSoupHttpxExample from '!!raw-loader!roa-loader!./code/01_beautifulsoup_httpx.py';

In this guide, you'll learn how to use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library with the [HTTPX](https://www.python-httpx.org/) library in your Apify Actors.
In this guide, you'll learn how to scrape web pages with the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) and [HTTPX](https://www.python-httpx.org/) libraries in your Apify Actors.

## Introduction

@@ -20,12 +20,16 @@ To create an Actor which uses those libraries, start from the [BeautifulSoup & P

## Example Actor

Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract titles and links to other pages.
Below is a simple Actor that recursively scrapes data from linked pages on the same site, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages through [Apify Proxy](https://docs.apify.com/platform/proxy) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract the title, headings, and links to other pages.

<RunnableCodeBlock className="language-python" language="python">
{BeautifulSoupHttpxExample}
</RunnableCodeBlock>
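
The traversal rule described above (follow only links on the starting site, and stop at a maximum depth) can be sketched without any network access. This is a minimal illustration, not the code from `01_beautifulsoup_httpx.py`: the `FAKE_SITE` mapping, the `crawl_order` function, and the exact meaning of `max_depth` (start URLs count as depth 0) are assumptions made for the example.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site": page URL -> hrefs found on that page.
FAKE_SITE = {
    "https://example.com/": ["/a", "/b", "https://other.example.org/x"],
    "https://example.com/a": ["/a/deep", "/"],
    "https://example.com/b": [],
    "https://example.com/a/deep": ["/too-deep"],
}


def crawl_order(start_url: str, max_depth: int, links_of=FAKE_SITE.get) -> list[tuple[str, int]]:
    """Return (url, depth) pairs in the order a same-site, depth-limited crawl visits them."""
    start_host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    visited = []

    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # pages at the depth limit are scraped, but their links are not followed
        for href in links_of(url) or []:
            absolute = urljoin(url, href)  # resolve relative links against the current page
            if urlparse(absolute).netloc != start_host:
                continue  # skip links that leave the starting site
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited
```

In a real Actor, `links_of` would be replaced by fetching the page and extracting its anchors, and the queue by the Actor's request queue.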

## Using Apify Proxy

Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and fetches a fresh proxy URL for every request, so each page goes through a different IP. A new HTTPX client is created per request to apply that URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.

## Conclusion

In this guide, you learned how to use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) with [HTTPX](https://www.python-httpx.org/) in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
10 changes: 7 additions & 3 deletions docs/03_guides/02_parsel_impit.mdx
@@ -1,14 +1,14 @@
---
id: parsel-impit
title: Use Parsel with Impit
title: Scraping with Parsel and Impit
description: Build an Apify Actor that scrapes web pages using Parsel selectors and the Impit HTTP client.
---

import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import ParselImpitExample from '!!raw-loader!roa-loader!./code/02_parsel_impit.py';

In this guide, you'll learn how to combine the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries when building Apify Actors.
In this guide, you'll learn how to scrape web pages with the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries in your Apify Actors.

## Introduction

@@ -18,12 +18,16 @@ In this guide, you'll learn how to combine the [Parsel](https://github.com/scrap

## Example Actor

The following example shows a simple Actor that recursively scrapes titles from linked pages, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages and [Parsel](https://github.com/scrapy/parsel) to extract titles and discover new links.
The following example shows a simple Actor that recursively scrapes data from linked pages on the same site, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages through [Apify Proxy](https://docs.apify.com/platform/proxy) and [Parsel](https://github.com/scrapy/parsel) to extract the title, headings, and links.

<RunnableCodeBlock className="language-python" language="python">
{ParselImpitExample}
</RunnableCodeBlock>

## Using Apify Proxy

Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and fetches a fresh proxy URL for every request, so each page goes through a different IP. A new Impit client is created per request to apply that URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.

## Conclusion

In this guide, you learned how to use [Parsel](https://github.com/scrapy/parsel) with [Impit](https://github.com/apify/impit) in your Apify Actors. By combining these libraries, you get a powerful and efficient solution for web scraping: [Parsel](https://github.com/scrapy/parsel) provides excellent CSS selector and XPath support for data extraction, while [Impit](https://github.com/apify/impit) offers a fast and simple HTTP client built by Apify. This combination makes it easy to build scalable web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
12 changes: 8 additions & 4 deletions docs/03_guides/03_playwright.mdx
@@ -1,6 +1,6 @@
---
id: playwright
title: Use Playwright
title: Browser automation with Playwright
description: Build an Apify Actor that scrapes dynamic web pages using Playwright browser automation.
---

@@ -11,7 +11,7 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import PlaywrightExample from '!!raw-loader!roa-loader!./code/03_playwright.py';

In this guide, you'll learn how to use [Playwright](https://playwright.dev) for web scraping in your Apify Actors.
In this guide, you'll learn how to use [Playwright](https://playwright.dev) for browser automation and web scraping in your Apify Actors.

## Introduction

@@ -48,14 +48,18 @@ playwright install --with-deps`

## Example Actor

This is a simple Actor that recursively scrapes titles from all linked websites, up to a maximum depth, starting from URLs in the Actor input.
This is a simple Actor that recursively scrapes data from linked pages on the same site, up to a maximum depth, starting from URLs in the Actor input.

It uses Playwright to open the pages in an automated Chrome browser, and to extract the title and anchor elements after the pages load.
It uses Playwright to open the pages in an automated Chrome browser, and to extract the title, headings, and links after the pages load.

<RunnableCodeBlock className="language-python" language="python">
{PlaywrightExample}
</RunnableCodeBlock>

## Using Apify Proxy

Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and launches the browser through it. Playwright applies the proxy at the browser level, so the whole run shares a single proxy URL rather than rotating per request; the `to_playwright_proxy` helper splits that URL into the `server`, `username`, and `password` fields Playwright expects. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
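
To make the URL-splitting step concrete, here is a minimal sketch of what a helper like `to_playwright_proxy` might do. The implementation in `03_playwright.py` is not shown in this diff, so the function name `split_proxy_url` and its details are assumptions; only the `server`, `username`, and `password` field names come from the text above.

```python
from urllib.parse import unquote, urlsplit


def split_proxy_url(proxy_url: str) -> dict[str, str]:
    """Split a proxy URL with embedded credentials into Playwright-style proxy fields.

    Hypothetical stand-in for the example's `to_playwright_proxy` helper.
    """
    parts = urlsplit(proxy_url)

    # The server field carries only scheme, host, and port, never the credentials.
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port is not None:
        server += f":{parts.port}"

    proxy = {"server": server}
    # Credentials may be percent-encoded inside a URL, so decode them before use.
    if parts.username:
        proxy["username"] = unquote(parts.username)
    if parts.password:
        proxy["password"] = unquote(parts.password)
    return proxy
```

The resulting dictionary has the shape Playwright accepts for its `proxy` launch option. The credentials are split out because Playwright takes them as separate fields rather than as part of the server address.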

## Conclusion

In this guide, you learned how to create Actors that use Playwright to scrape websites. Playwright is a powerful tool that can be used to manage browser instances and scrape websites that require JavaScript execution. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
14 changes: 10 additions & 4 deletions docs/03_guides/04_selenium.mdx
@@ -1,14 +1,14 @@
---
id: selenium
title: Use Selenium
title: Browser automation with Selenium
description: Build an Apify Actor that scrapes dynamic web pages using Selenium WebDriver.
---

import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import SeleniumExample from '!!raw-loader!roa-loader!./code/04_selenium.py';

In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for web scraping in your Apify Actors.
In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for browser automation and web scraping in your Apify Actors.

## Introduction

@@ -32,14 +32,20 @@ Refer to the [Selenium documentation](https://www.selenium.dev/documentation/web

## Example Actor

This is a simple Actor that recursively scrapes titles from all linked websites, up to a maximum depth, starting from URLs in the Actor input.
This is a simple Actor that recursively scrapes data from linked pages on the same site, up to a maximum depth, starting from URLs in the Actor input.

It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title and anchor elements after the pages load.
It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title, headings, and links after the pages load.

<RunnableCodeBlock className="language-python" language="python">
{SeleniumExample}
</RunnableCodeBlock>

## Using Apify Proxy

Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and routes the browser through it for the whole run.

Chrome ignores the credentials passed in the `--proxy-server` flag, so an authenticated proxy such as Apify Proxy has to be configured from inside a small extension. The `proxy_auth_extension` helper builds one at runtime: its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. Note that the new headless mode (`--headless=new`) is required for Chrome to load the extension. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.
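
As a rough illustration of the extension approach, the sketch below writes an unpacked Manifest V3 extension to a directory. It is not the `proxy_auth_extension` helper from `04_selenium.py`, whose code is not shown in this diff: the function name `build_proxy_auth_extension`, the file layout, and the choice of an unpacked directory rather than a packed archive are all assumptions. The service worker uses Chrome's `chrome.proxy.settings.set` and `chrome.webRequest.onAuthRequired` extension APIs.

```python
import json
from pathlib import Path
from urllib.parse import unquote, urlsplit

# Manifest V3 needs the "webRequestAuthProvider" permission to answer auth challenges.
MANIFEST = {
    "manifest_version": 3,
    "name": "Proxy auth (sketch)",
    "version": "1.0",
    "permissions": ["proxy", "webRequest", "webRequestAuthProvider"],
    "host_permissions": ["<all_urls>"],
    "background": {"service_worker": "background.js"},
}

# The %s placeholder is filled with a JSON object holding the proxy settings.
SERVICE_WORKER_TEMPLATE = """\
const cfg = %s;
chrome.proxy.settings.set({
  value: {
    mode: "fixed_servers",
    rules: { singleProxy: { scheme: cfg.scheme, host: cfg.host, port: cfg.port } },
  },
  scope: "regular",
});
chrome.webRequest.onAuthRequired.addListener(
  (details, callback) => {
    callback({ authCredentials: { username: cfg.username, password: cfg.password } });
  },
  { urls: ["<all_urls>"] },
  ["asyncBlocking"]
);
"""


def build_proxy_auth_extension(proxy_url: str, target_dir: Path) -> Path:
    """Write manifest.json and background.js for the given proxy URL; return the directory."""
    parts = urlsplit(proxy_url)
    cfg = {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "username": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),
    }
    target_dir.mkdir(parents=True, exist_ok=True)
    (target_dir / "manifest.json").write_text(json.dumps(MANIFEST, indent=2), encoding="utf-8")
    (target_dir / "background.js").write_text(SERVICE_WORKER_TEMPLATE % json.dumps(cfg), encoding="utf-8")
    return target_dir
```

The proxy settings are embedded as JSON so that special characters in the credentials are escaped correctly in the generated JavaScript. How the directory is then handed to Chrome depends on the driver setup and is left out of this sketch.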

## Conclusion

In this guide, you learned how to use Selenium for web scraping in Apify Actors. You can now create your own Actors that use Selenium to scrape dynamic websites and interact with web pages just like a human would. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
8 changes: 6 additions & 2 deletions docs/03_guides/05_crawlee.mdx
@@ -1,6 +1,6 @@
---
id: crawlee
title: Use Crawlee
title: Building crawlers with Crawlee
description: Build Apify Actors using Crawlee's BeautifulSoupCrawler, ParselCrawler, or PlaywrightCrawler.
---

@@ -10,7 +10,7 @@ import CrawleeBeautifulSoupExample from '!!raw-loader!roa-loader!./code/05_crawl
import CrawleeParselExample from '!!raw-loader!roa-loader!./code/05_crawlee_parsel.py';
import CrawleePlaywrightExample from '!!raw-loader!roa-loader!./code/05_crawlee_playwright.py';

In this guide, you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.
In this guide, you'll learn how to build web crawlers with the [Crawlee](https://crawlee.dev/python) library in your Apify Actors.

## Introduction

@@ -42,6 +42,10 @@ The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler
{CrawleePlaywrightExample}
</RunnableCodeBlock>

## Using Apify Proxy

All three crawlers above route their requests through [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. `Actor.create_proxy_configuration` returns a Crawlee-compatible proxy configuration, which is passed to the crawler as `proxy_configuration`; Crawlee then rotates the proxy IP for every request on its own. Because the configuration is only available inside the running Actor, the crawler is created in `main` and the request handler is registered on a standalone [`Router`](https://crawlee.dev/python/api/class/Router) up front. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide.

## Conclusion

In this guide, you learned how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. By using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
10 changes: 5 additions & 5 deletions docs/03_guides/06_scrapy.mdx
@@ -1,6 +1,6 @@
---
id: scrapy
title: Use Scrapy
title: Building crawlers with Scrapy
description: Convert Scrapy spiders into Apify Actors with platform storage and proxy integration.
---

@@ -15,17 +15,17 @@ import ItemsExample from '!!raw-loader!./code/scrapy_project/src/items.py';
import SpidersExample from '!!raw-loader!./code/scrapy_project/src/spiders/title.py';
import SettingsExample from '!!raw-loader!./code/scrapy_project/src/settings.py';

In this guide, you'll learn how to use the [Scrapy](https://scrapy.org/) framework in your Apify Actors.
In this guide, you'll learn how to build web crawlers with the [Scrapy](https://scrapy.org/) framework in your Apify Actors.

## Introduction

[Scrapy](https://scrapy.org/) is an open-source web scraping framework for Python. It provides tools for defining scrapers, extracting data from web pages, following links, and handling pagination. With the Apify SDK, Scrapy projects can be converted into Apify [Actors](https://docs.apify.com/platform/actors), integrated with Apify [storages](https://docs.apify.com/platform/storage), and executed on the Apify [platform](https://docs.apify.com/platform).

## Integrating Scrapy with the Apify platform

The Apify SDK provides an Apify-Scrapy integration. The main challenge of this is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install the Twisted's `asyncioreactor` to run Twisted's asyncio compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.
The Apify SDK provides an Apify-Scrapy integration. The main challenge is combining two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key step is installing Twisted's `asyncioreactor`, which runs Twisted on an asyncio-compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.

<CodeBlock className="language-python" title="__main.py__: The Actor entry point ">
<CodeBlock className="language-python" title="__main__.py: The Actor entry point">
{UnderscoreMainExample}
</CodeBlock>

@@ -74,7 +74,7 @@ For further details, see the [Scrapy migration guide](https://docs.apify.com/cli
The following example shows a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.

<Tabs>
<TabItem value="__main__.py" label="__main.py__">
<TabItem value="__main__.py" label="__main__.py">
<CodeBlock className="language-python">
{UnderscoreMainExample}
</CodeBlock>