Lightweight web search: three complementary providers, any-URL fetching.
pip install webba
scrapling install # one-time: fetch browser binaries for the stealth/dynamic fetch tiers

from webba import search, fetch
# Search — zero API keys needed (SearXNG by default)
results = search('python asyncio tutorial', n=5)
print(results.to_md())
# Fetch any URL as clean text
text = fetch('https://github.com/AnswerDotAI/ContextKit/blob/main/contextkit/read.py')

webba "python asyncio" --n 5 --fmt md
webba "latest AI news" --provider searxng --fmt json
webba --purge-cache
webba --start-searxng
webba --stop-searxng
webba --chrome-debug # launch an isolated Chrome for login-gated search

webba uses three providers that together cover the search space:
| Provider | Role | Needs |
|---|---|---|
| searxng | Meta-search aggregator (Google/Bing/Brave/DDG + news/science/code/video) | Docker (auto-started) |
| perplexity | LLM-synthesised cited answers for Q&A / research | PERPLEXITY_API_KEY |
| fastcdp | Real-browser SERP scrape via Chrome DevTools Protocol | Chrome/Chromium |
provider='auto' (default) routes Q&A/academic queries to Perplexity (when a key
is set) and everything else to SearXNG, then cascades to fastcdp if a tier
returns nothing. provider='all' runs all available providers and RRF-merges.
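
For readers unfamiliar with the RRF merge mentioned above, here is a minimal sketch of Reciprocal Rank Fusion over ranked URL lists. It is an illustration of the general technique, not webba's actual code: the function name `rrf_merge`, the constant `k=60` (the conventional default for RRF), and the example hostnames are all assumptions.

```python
from collections import defaultdict

def rrf_merge(rankings, k=60):
    """Merge several ranked lists of URLs with Reciprocal Rank Fusion.

    Each URL scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, conventional constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by URL for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))

# Hypothetical results from two providers for the same query.
searxng_hits = ["a.example", "b.example", "c.example"]
fastcdp_hits = ["b.example", "d.example", "a.example"]
merged = rrf_merge([searxng_hits, fastcdp_hits])
print(merged)
```

URLs that rank well in several providers rise to the top, while a URL seen by only one provider still keeps a place in the merged list.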
- Zero-key search: SearXNG runs locally in Docker, no accounts needed
- Smart routing: keyword intent detection picks the provider + SearXNG category
- Perplexity error handling: HTTP error codes drive retry-vs-fail decisions (401/403/404/400 fatal; 429/5xx retried once with backoff)
- TTL cache: literal-query results cached in ~/.webba/cache via diskcache
- Any-URL fetch: GitHub files/repos, arxiv, gists, PDFs, YouTube, docs, HTML
- scrapling-powered scraping: battle-tested fetch cascade (fast HTTP → stealth browser → dynamic browser)
- YouTube: fetch() on a video URL returns metadata + transcript via yt-dlp
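
The retry-vs-fail rule in the Perplexity bullet above can be sketched as follows. This is a hedged illustration of the policy as described (429/5xx retried once with backoff, everything else such as 400/401/403/404 treated as fatal), not webba's implementation: `ProviderError`, `is_retryable`, `call_with_retry`, and the `request` callable returning `(status, body)` are hypothetical names.

```python
import time

class ProviderError(Exception):
    """Raised when a request fails for good (hypothetical error type)."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def is_retryable(status):
    # Rate limiting and server errors are worth one more attempt.
    return status == 429 or 500 <= status <= 599

def call_with_retry(request, backoff=1.0, sleep=time.sleep):
    """Call request() -> (status, body); retry at most once on 429/5xx."""
    for attempt in range(2):  # first try plus at most one retry
        status, body = request()
        if 200 <= status < 300:
            return body
        if attempt == 0 and is_retryable(status):
            sleep(backoff)  # wait before the single retry
            continue
        # Fatal codes (400/401/403/404) or a second failure end here.
        raise ProviderError(status)

# Simulated provider: rate-limited once, then succeeds.
responses = iter([(429, None), (200, "answer")])
body = call_with_retry(lambda: next(responses), sleep=lambda s: None)
print(body)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.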
fetch(url, sel=None, heavy=False, cdp=False, save_pdf=False, **kwargs)
- sel: CSS selector to extract specific content from HTML
- heavy: skip fast HTTP, go straight to a browser engine for JS-heavy pages
- cdp: attach to a running debug Chrome (see below) for login-gated/enterprise pages; reuses that Chrome's cookies
- kwargs: forwarded to read_gh_repo for repo URLs (branch, as_dict, etc.)
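
To make the effect of heavy concrete, here is a rough sketch of a tiered fetch cascade in which each tier is tried in order and heavy=True skips the fast HTTP tier. It only illustrates the idea: `fetch_cascade` and the stand-in tier functions are hypothetical and do not call scrapling or webba.

```python
def fetch_cascade(url, tiers, heavy=False):
    """Try (name, fetcher) tiers in order; return (name, text) of the first hit."""
    # heavy=True drops the first (fast HTTP) tier and starts at a browser tier.
    candidates = tiers[1:] if heavy else tiers
    for name, fetcher in candidates:
        try:
            text = fetcher(url)
        except Exception:
            continue  # a failing tier falls through to the next one
        if text:
            return name, text
    return None, ""  # every tier failed or came back empty

# Stand-in fetchers that simulate the three tiers.
def fast_http(url):
    return "static text"

def stealth_browser(url):
    raise RuntimeError("blocked")

def dynamic_browser(url):
    return "rendered text"

tiers = [("fast", fast_http), ("stealth", stealth_browser), ("dynamic", dynamic_browser)]
default_result = fetch_cascade("https://example.com", tiers)
heavy_result = fetch_cascade("https://example.com", tiers, heavy=True)
print(default_result, heavy_result)
```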
The fastcdp provider drives a real Chrome over the Chrome DevTools Protocol.
- Headless (default): webba auto-launches a headless Chrome with an isolated profile at ~/.webba/chrome-profile. Zero-config; handles most public sites.
- Login-gated / enterprise: run chrome_debug_setup() (or webba --chrome-debug). This launches a dedicated, isolated Chrome profile with remote debugging. Log into the SSO/enterprise sites you want webba to reach, once, in that window; webba reuses those cookies.
Best practice: webba never attaches to your primary Chrome profile. Chrome's remote-debugging port is unauthenticated, and a daily-driver profile exposes every logged-in session (mail, banking, cloud consoles) to anything that can connect. The dedicated ~/.webba/chrome-profile keeps automation isolated from your personal browsing, so log into only what webba needs.
| Variable | Effect |
|---|---|
| PERPLEXITY_API_KEY | Enables the Perplexity provider |
| SEARXNG_URL | Use an existing SearXNG instead of the auto-started one |
| WEBBA_SEARXNG | Set to false to disable SearXNG entirely (default: true) |
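
As a rough illustration of how these variables could combine, the sketch below derives a provider configuration from an environment mapping. It is an assumption-laden example rather than webba's own logic: `provider_config` is a hypothetical helper, and treating only the literal string "false" (case-insensitive) as the off switch is a guess based on the table.

```python
import os

def provider_config(env=None):
    """Derive provider availability from environment variables (sketch)."""
    env = os.environ if env is None else env
    # Assumed: SearXNG stays enabled unless WEBBA_SEARXNG is exactly "false".
    searxng_on = env.get("WEBBA_SEARXNG", "true").strip().lower() != "false"
    return {
        # A non-empty key enables the Perplexity provider.
        "perplexity": bool(env.get("PERPLEXITY_API_KEY")),
        "searxng": searxng_on,
        # None here stands for "use the auto-started local instance".
        "searxng_url": env.get("SEARXNG_URL") if searxng_on else None,
    }

defaults = provider_config({})
custom = provider_config({
    "PERPLEXITY_API_KEY": "placeholder-key",
    "SEARXNG_URL": "http://localhost:8888",
})
disabled = provider_config({"WEBBA_SEARXNG": "false", "SEARXNG_URL": "http://localhost:8888"})
print(defaults, custom, disabled)
```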
Search the web. provider: 'auto' | 'searxng' | 'perplexity' | 'fastcdp' | 'all'.
Fetch any URL or local path as clean text. Handles GitHub, arxiv, gists, PDFs, YouTube, docs, HTML.
crawl(seed, link_pat=None, sel=None, max_pages=500, delay=0.5, same_domain=True, save_dir=None) -> L
Crawl a site, follow links, return L of {url, text} dicts.
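
The behaviour of link_pat, max_pages, and same_domain can be pictured with a small breadth-first crawl over an in-memory site. This is a simplified sketch, not webba's crawl: `crawl_sketch` and the `get_page` stand-in fetcher are hypothetical, delay, sel, and save_dir are left out, and it returns a plain list instead of an L.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl_sketch(seed, get_page, link_pat=None, max_pages=500, same_domain=True):
    """Breadth-first crawl; get_page(url) -> (text, [hrefs]) is a stand-in fetcher."""
    seen, queue, out = {seed}, deque([seed]), []
    domain = urlparse(seed).netloc
    while queue and len(out) < max_pages:
        url = queue.popleft()
        text, hrefs = get_page(url)
        out.append({"url": url, "text": text})
        for href in hrefs:
            link = urljoin(url, href)  # resolve relative links
            if link in seen:
                continue
            if same_domain and urlparse(link).netloc != domain:
                continue  # stay on the seed's host
            if link_pat and not re.search(link_pat, link):
                continue  # follow only links matching the pattern
            seen.add(link)
            queue.append(link)
    return out

# A tiny fake site: URL -> (text, outgoing links).
site = {
    "https://docs.example/": ("home", ["/a", "/b", "https://other.example/x"]),
    "https://docs.example/a": ("page a", ["/b", "/"]),
    "https://docs.example/b": ("page b", []),
}
pages = crawl_sketch("https://docs.example/", lambda u: site[u])
print([p["url"] for p in pages])
```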
Format as markdown / concatenate snippets as LLM context / fetch all result URLs.
Launch the isolated debug Chrome for login-gated search.
Start / stop the local SearXNG container. searxng_start() is idempotent.
Wipe the search cache.
| File | Owns |
|---|---|
| webba/search.py | Providers (searxng/perplexity/fastcdp), routing, SearXNG setup, CLI |
| webba/cache.py | diskcache-backed literal-query TTL cache |
| webba/fetch.py | URL classification, scrapling fetch cascade, YouTube, crawl |
Apache-2.0