Small scraper for https://lab.neos.eu/sitemap.xml.
It scrapes pages whose URLs start with:

- https://lab.neos.eu/thinktank/glossar
- https://lab.neos.eu/thinktank/publikationen
It also:

- follows matching links found inside main#main,
- downloads allowed assets referenced inside that content area,
- rewrites local links for offline viewing,
- writes Markdown files into ./LabsWebsiteContent next to the script, and
- stores assets flat in ./LabsWebsiteContent/Media.
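The link-following step can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name, the use of plain href strings, and the base URL handling are assumptions; only links that resolve to one of the two allowed prefixes are kept for scraping.

```python
from urllib.parse import urljoin

# The two URL prefixes the scraper is allowed to follow (from the README above).
ALLOWED_PREFIXES = (
    "https://lab.neos.eu/thinktank/glossar",
    "https://lab.neos.eu/thinktank/publikationen",
)

def matching_links(hrefs, base_url="https://lab.neos.eu/"):
    """Resolve hrefs found inside main#main and keep only allowed scrape targets."""
    found = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # relative hrefs become absolute
        if absolute.startswith(ALLOWED_PREFIXES):
            found.append(absolute)
    return found
```

In the real scraper the hrefs would come from anchors parsed out of the main#main element; here they are passed in as plain strings to keep the sketch self-contained.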
For responsive image markup, the scraper prefers img[data-src] and ignores generated srcset variants, so it downloads the original image URL instead of every generated size.
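The selection rule for lazy-loaded images amounts to a one-line preference. A hedged sketch, with an attribute dict standing in for a parsed img tag (the function name is an assumption):

```python
def image_url(img_attrs):
    """Pick the URL to download for one <img> tag's attributes.

    data-src holds the original, un-resized image on lazy-loaded markup;
    srcset variants are deliberately ignored so only one file is fetched.
    """
    return img_attrs.get("data-src") or img_attrs.get("src")
```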
Relative links that are not scraped or downloaded are rewritten to absolute https://lab.neos.eu/... links in the generated Markdown.
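This rewrite is essentially URL resolution against the site root. A minimal sketch, assuming standard urljoin semantics (the function name is hypothetical):

```python
from urllib.parse import urljoin

def absolutize(href, base="https://lab.neos.eu/"):
    """Rewrite a relative link to an absolute https://lab.neos.eu/... URL.

    Already-absolute URLs pass through unchanged.
    """
    return urljoin(base, href)
```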
Custom YouTube players are converted into linked poster images, for example [](https://www.youtube.com/watch?v=...).
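The conversion produces a Markdown image wrapped in a link to the watch URL. A sketch under assumptions: the function name, the alt text, and the poster path format are illustrative, not taken from the script.

```python
def youtube_markdown(video_id, poster_path):
    """Turn a custom YouTube player into a linked poster image in Markdown."""
    watch_url = f"https://www.youtube.com/watch?v={video_id}"
    # Clicking the poster image opens the video on YouTube.
    return f"[![Video poster]({poster_path})]({watch_url})"
```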
Inline SVG icons and XML processing instructions are removed before Markdown conversion to keep button/link text clean.
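As a rough illustration of that cleanup step, the sketch below strips both kinds of noise with regular expressions. This is only an approximation: the real scraper more likely removes the corresponding nodes from a parse tree, and regexes on HTML are fragile in general.

```python
import re

# Inline <svg>...</svg> icons and <?...?> processing instructions.
SVG_RE = re.compile(r"<svg\b.*?</svg>", re.DOTALL | re.IGNORECASE)
PI_RE = re.compile(r"<\?.*?\?>", re.DOTALL)

def strip_noise(html):
    """Drop inline SVG icons and XML processing instructions from an HTML snippet."""
    return PI_RE.sub("", SVG_RE.sub("", html))
```

Without this step, icon paths and stray processing instructions would end up as garbage characters inside button and link text in the converted Markdown.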
Setup:

    python3 -m venv .venv
    . .venv/bin/activate
    pip install -r requirements.txt

From this folder:

    python scrape_labs_content.py

To override the output path:

    python scrape_labs_content.py --output-dir /tmp/LabsWebsiteContent

The scraper uses a deliberately defensive 2.1 second wait after every request. A full run currently takes roughly 35 minutes, so it is best started during a quiet time, for example at 1am.
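The throttling can be pictured as a thin wrapper around whatever fetch function the script uses. A sketch, not the actual implementation; the names and the wrapper shape are assumptions, only the 2.1 second figure comes from this README:

```python
import time

REQUEST_DELAY_SECONDS = 2.1  # defensive pause after every request (per README)

def throttled(fetch, delay=REQUEST_DELAY_SECONDS):
    """Wrap a fetch function so every call is followed by a fixed pause."""
    def wrapper(url):
        result = fetch(url)
        time.sleep(delay)  # be polite to the server before the next request
        return result
    return wrapper
```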
The output path is configured in scrape_labs_content.py:

    OUTPUT_DIR = BASE_DIR / "LabsWebsiteContent"

The script deletes and recreates that output folder on each run.
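That delete-and-recreate step might look like the following sketch, assuming pathlib/shutil are used (the function name and the explicit Media subfolder creation are assumptions):

```python
import shutil
from pathlib import Path

def reset_output_dir(output_dir: Path) -> Path:
    """Delete and recreate the output folder, including its flat Media subfolder.

    Warning: this wipes any previous run's Markdown files and assets.
    """
    if output_dir.exists():
        shutil.rmtree(output_dir)
    (output_dir / "Media").mkdir(parents=True)
    return output_dir
```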