vedicreader/webba

webba

Lightweight web search: three complementary providers, any-URL fetching.

Install

pip install webba
scrapling install   # one-time: fetch browser binaries for the stealth/dynamic fetch tiers

Quick Start

from webba import search, fetch

# Search — zero API keys needed (SearXNG by default)
results = search('python asyncio tutorial', n=5)
print(results.to_md())

# Fetch any URL as clean text
text = fetch('https://github.com/AnswerDotAI/ContextKit/blob/main/contextkit/read.py')

CLI

webba "python asyncio" --n 5 --fmt md
webba "latest AI news" --provider searxng --fmt json
webba --purge-cache
webba --start-searxng
webba --stop-searxng
webba --chrome-debug          # launch an isolated Chrome for login-gated search

Providers

webba uses three providers that together cover the search space:

| Provider | Role | Needs |
|---|---|---|
| searxng | Meta-search aggregator (Google/Bing/Brave/DDG + news/science/code/video) | Docker (auto-started) |
| perplexity | LLM-synthesised cited answers for Q&A / research | PERPLEXITY_API_KEY |
| fastcdp | Real-browser SERP scrape via Chrome DevTools Protocol | Chrome/Chromium |

provider='auto' (default) routes Q&A/academic queries to Perplexity (when a key is set) and everything else to SearXNG, then cascades to fastcdp if a tier returns nothing. provider='all' runs every available provider and merges the results with Reciprocal Rank Fusion (RRF).
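To make the merge step concrete, here is a minimal sketch of Reciprocal Rank Fusion in general. It is not taken from webba's source: the function name `rrf_merge`, the constant `k=60`, and the example URLs are assumptions for illustration only.

```python
from collections import defaultdict

def rrf_merge(rankings, k=60):
    """Merge several best-first URL lists with Reciprocal Rank Fusion.

    Each URL scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is a commonly used default, assumed here, not read from webba.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties fall back to URL order so output is deterministic.
    return sorted(scores, key=lambda u: (-scores[u], u))

# Hypothetical per-provider rankings
searxng = ["https://a.example", "https://b.example", "https://c.example"]
perplexity = ["https://b.example", "https://d.example"]
fastcdp = ["https://b.example", "https://a.example"]

merged = rrf_merge([searxng, perplexity, fastcdp])
```

A URL that several providers rank highly rises to the top even if no single provider ranked it first everywhere, which is the reason to fuse ranks rather than concatenate lists.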

Features

  • Zero-key search: SearXNG runs locally in Docker, no accounts needed
  • Smart routing: keyword intent detection picks the provider + SearXNG category
  • Perplexity error handling: HTTP error codes drive retry-vs-fail decisions (401/403/404/400 fatal; 429/5xx retried once with backoff)
  • TTL cache: literal-query results cached in ~/.webba/cache via diskcache
  • Any-URL fetch: GitHub files/repos, arxiv, gists, PDFs, YouTube, docs, HTML
  • scrapling-powered scraping: battle-tested fetch cascade (fast HTTP → stealth browser → dynamic browser)
  • YouTube: fetch() on a video URL returns metadata + transcript via yt-dlp
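The retry-vs-fail rule for Perplexity described in the list above can be sketched as follows. This is an illustration of the stated policy, not webba's code: `should_retry`, `call_with_retry`, the `request` callable, and the one-second backoff are all hypothetical.

```python
import time

def should_retry(status):
    # 429 (rate limited) and 5xx (server error) are treated as transient.
    return status == 429 or 500 <= status <= 599

def call_with_retry(request, backoff=1.0, sleep=time.sleep):
    """Call `request` (a zero-argument callable returning (status, body)).

    Transient errors are retried exactly once after `backoff` seconds.
    Everything else at 400 or above (400/401/403/404 included) fails at once.
    """
    status, body = request()
    if status < 400:
        return body
    if should_retry(status):
        sleep(backoff)
        status, body = request()
        if status < 400:
            return body
    raise RuntimeError(f"provider request failed with HTTP {status}")
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.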

Fetching

fetch(url, sel=None, heavy=False, cdp=False, save_pdf=False, **kwargs)
  • sel: CSS selector to extract specific content from HTML
  • heavy: skip fast HTTP, go straight to a browser engine for JS-heavy pages
  • cdp: attach to a running debug Chrome (see below) for login-gated/enterprise pages — reuses that Chrome's cookies
  • kwargs: forwarded to read_gh_repo for repo URLs (branch, as_dict, etc.)

fastcdp & enterprise / login-gated search

The fastcdp provider drives a real Chrome over the Chrome DevTools Protocol.

  • Headless (default): webba auto-launches a headless Chrome with an isolated profile at ~/.webba/chrome-profile. Zero-config — handles most public sites.
  • Login-gated / enterprise: run chrome_debug_setup() (or webba --chrome-debug). This launches a dedicated, isolated Chrome profile with remote debugging. Log into the SSO/enterprise sites you want webba to reach, once, in that window — webba reuses those cookies.

Best practice: webba never attaches to your primary Chrome profile. Chrome's remote-debugging port is unauthenticated, and a daily-driver profile exposes every logged-in session (mail, banking, cloud consoles) to anything that can connect. The dedicated ~/.webba/chrome-profile keeps automation isolated from your personal browsing — log into only what webba needs.
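For readers who want to see what an isolated debug Chrome amounts to, the sketch below builds the standard Chrome command-line flags for a dedicated profile and the DevTools HTTP endpoint used to check that the browser is reachable. The helper names are hypothetical and this is not how webba necessarily launches Chrome; only the flags and the `/json/version` endpoint are Chrome's own.

```python
from pathlib import Path

def chrome_debug_args(profile_dir, port=9222, headless=False):
    """Flags for a Chrome instance with its own profile and a debug port."""
    args = [
        f"--remote-debugging-port={port}",                    # enables the DevTools endpoint
        f"--user-data-dir={Path(profile_dir).expanduser()}",  # separate profile, separate cookies
        "--no-first-run",
    ]
    if headless:
        args.append("--headless=new")
    return args

def version_endpoint(port=9222, host="127.0.0.1"):
    # A GET on this URL returns JSON that includes "webSocketDebuggerUrl".
    return f"http://{host}:{port}/json/version"

args = chrome_debug_args("~/.webba/chrome-profile")
```

Because `--user-data-dir` points at a separate directory, the only sessions exposed on the debug port are the ones logged into in that window.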

Environment Variables (all optional)

| Variable | Effect |
|---|---|
| PERPLEXITY_API_KEY | Enables the Perplexity provider |
| SEARXNG_URL | Use an existing SearXNG instead of the auto-started one |
| WEBBA_SEARXNG | Set to false to disable SearXNG entirely (default: true) |
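As a rough illustration of how these variables could combine, the sketch below derives a provider configuration from an environment mapping. `provider_config` is a hypothetical helper, and the case-insensitive parsing of `false` is an assumption rather than documented behaviour.

```python
import os

def provider_config(env=os.environ):
    """Summarise which providers the environment enables (illustrative only)."""
    # Assumption: only the literal string "false" (any case) disables SearXNG.
    searxng_on = env.get("WEBBA_SEARXNG", "true").strip().lower() != "false"
    return {
        "perplexity": bool(env.get("PERPLEXITY_API_KEY")),
        "searxng": searxng_on,
        # None means: no existing instance given, use the auto-started one.
        "searxng_url": env.get("SEARXNG_URL"),
    }

defaults = provider_config({})
```

With no variables set, SearXNG is on, Perplexity is off, and no external SearXNG URL is configured, which matches the zero-key default described above.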

API

search(q, n=10, provider='auto', cache=True, cache_ttl=3600) -> SearchResults

Search the web. provider: 'auto' | 'searxng' | 'perplexity' | 'fastcdp' | 'all'.
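The `cache` and `cache_ttl` parameters imply a time-to-live cache keyed on the literal query. webba uses diskcache for this; the in-memory class below is only a standard-library sketch of the TTL idea, and the key layout `(q, n, provider)` is an assumption.

```python
import time

class TTLCache:
    """Tiny in-memory TTL cache; entries expire `ttl` seconds after being set."""

    def __init__(self, ttl=3600, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._data = {}

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)

    def get(self, key):
        hit = self._data.get(key)
        if hit is None:
            return None
        expires, value = hit
        if self.clock() >= expires:
            del self._data[key]  # drop stale entries lazily, on read
            return None
        return value

cache = TTLCache(ttl=3600)
key = ("python asyncio tutorial", 5, "auto")  # hypothetical key layout
cache.set(key, ["result-1", "result-2"])
```

Keying on the exact query string means that two differently worded queries never share a cache entry, even when they would return the same results.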

fetch(url, sel=None, heavy=False, cdp=False, save_pdf=False, **kwargs) -> str|dict

Fetch any URL or local path as clean text. Handles GitHub, arxiv, gists, PDFs, YouTube, docs, HTML.
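Dispatching on the kind of URL is the first step in handling such a mix of sources. The sketch below shows one plausible way to classify a URL with the standard library. `classify_url` and its category names are hypothetical; the real rules in webba/fetch.py may differ.

```python
from urllib.parse import urlparse

def classify_url(url):
    """Return a coarse category for a URL or local path (illustrative rules)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower().removeprefix("www.")
    parts = [seg for seg in parsed.path.split("/") if seg]
    if host in {"youtube.com", "m.youtube.com", "youtu.be"}:
        return "youtube"
    if host == "gist.github.com":
        return "gist"
    if host == "github.com":
        if len(parts) >= 4 and parts[2] == "blob":
            return "github_file"   # /owner/repo/blob/<ref>/<path>
        if len(parts) == 2:
            return "github_repo"   # /owner/repo
        return "html"
    if host == "arxiv.org":
        return "arxiv"
    if parsed.path.lower().endswith(".pdf"):
        return "pdf"
    return "html"

kind = classify_url("https://github.com/AnswerDotAI/ContextKit/blob/main/contextkit/read.py")
```

Checking the host before the file extension matters: an arXiv PDF link should still take the arXiv path rather than the generic PDF one.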

crawl(seed, link_pat=None, sel=None, max_pages=500, delay=0.5, same_domain=True, save_dir=None) -> L

Crawl a site starting from seed, following links, and return an L of {url, text} dicts.
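The `link_pat` and `same_domain` parameters suggest a link-filtering step between pages. A minimal sketch of such a filter follows. The helper `next_links` is hypothetical, and treating `link_pat` as a regular expression is an assumption.

```python
import re
from urllib.parse import urldefrag, urljoin, urlparse

def next_links(page_url, hrefs, seen, link_pat=None, same_domain=True):
    """Resolve hrefs found on page_url and keep the ones worth crawling next."""
    base_host = urlparse(page_url).netloc
    pattern = re.compile(link_pat) if link_pat else None
    out = []
    for href in hrefs:
        # Make the link absolute and drop any #fragment.
        url, _ = urldefrag(urljoin(page_url, href))
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if same_domain and parsed.netloc != base_host:
            continue
        if pattern and not pattern.search(url):
            continue
        if url in seen:
            continue
        seen.add(url)
        out.append(url)
    return out

page = "https://docs.example.com/guide/"
hrefs = ["intro.html", "/api/ref.html#top", "https://other.example.org/x",
         "mailto:a@b.c", "intro.html"]
links = next_links(page, hrefs, seen={page})
```

A crawl loop would call this for each fetched page, stop once `max_pages` is reached, and sleep `delay` seconds between requests.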

SearchResults.to_md() / .to_context(max_chars=4000) / .fetch_all(sel=None, heavy=False)

Format as markdown / concatenate snippets as LLM context / fetch all result URLs.
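To illustrate what concatenating snippets under a character budget can look like, here is a small sketch. It is not the library's `to_context`: the function name `snippets_to_context` and the result fields `title`, `url`, and `snippet` are assumptions.

```python
def snippets_to_context(results, max_chars=4000):
    """Join result snippets into one string no longer than max_chars.

    `results` is a list of dicts with 'title', 'url', 'snippet' keys
    (assumed field names). Whole entries are kept or dropped, never cut.
    """
    parts, used = [], 0
    for r in results:
        block = f"{r['title']} ({r['url']})\n{r['snippet']}"
        cost = len(block) + (2 if parts else 0)  # 2 chars for the blank-line separator
        if used + cost > max_chars:
            break
        parts.append(block)
        used += cost
    return "\n\n".join(parts)

# Hypothetical results
results = [
    {"title": "Asyncio docs", "url": "https://a.example", "snippet": "x" * 50},
    {"title": "Tutorial", "url": "https://b.example", "snippet": "y" * 50},
]
context = snippets_to_context(results, max_chars=100)
```

Dropping whole entries rather than truncating mid-snippet keeps each citation intact for the model that consumes the context.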

chrome_debug_setup(headless=False, port=9222) -> str

Launch the isolated debug Chrome for login-gated search.

searxng_start() / searxng_stop()

Start / stop the local SearXNG container. searxng_start() is idempotent.

purge_cache(db_path='~/.webba/cache')

Wipe the search cache.

Architecture

| File | Owns |
|---|---|
| webba/search.py | Providers (searxng/perplexity/fastcdp), routing, SearXNG setup, CLI |
| webba/cache.py | diskcache-backed literal-query TTL cache |
| webba/fetch.py | URL classification, scrapling fetch cascade, YouTube, crawl |

License

Apache-2.0

About

Search the web with Google, SearXNG, ddgs, and paid search providers, capped at their free tiers.
