This script backs up Archive-It WARC files for collections listed as active in a Google Sheets tracking spreadsheet.
At a high level, it checks which collections are active, asks WASAPI (Archive-It's Web Archiving Systems API) what WARC files are available, downloads anything missing, writes local fixity information, and updates the spreadsheet with collection-level progress.
The local filesystem is treated as the source of truth. The spreadsheet is mainly there to help monitor activity and control which collections are in scope.
Clone the repository:
cd /path/to/warc_tracker_script_stuff/
git clone git@github.com:Brown-University-Library/warc_tracker_script.git
cd warc_tracker_script
Run commands from the project root with uv.
Install/sync dependencies:
uv sync
Create a .env file. Required values for the production script are:
GSHEET_CREDENTIALS_JSON='{"type":"service_account", "...":"..."}'
GSHEET_SPREADSHEET_ID="the-google-sheet-id"
LOG_PATH="./logs/warc_tracker_script.log"
ARCHIVEIT_WASAPI_USERNAME="archive-it-username"
ARCHIVEIT_WASAPI_PASSWORD="archive-it-password"
Optional values:
LOG_LEVEL="INFO"
WARC_STORAGE_ROOT="/path/to/storage"
ARCHIVEIT_WASAPI_BASE_URL="https://warcs.archive-it.org/wasapi/v1/webdata"
RUN_COORDINATION_MODE="skip_spreadsheet_coordination_check"
UNKNOWN_SEED_ALERT_RECIPIENTS='[["Name One", "name.one@example.edu"], ["Name Two", "name.two@example.edu"]]'
UNKNOWN_SEED_ALERT_FROM_EMAIL="warc-tracker@example.edu"
UNKNOWN_SEED_ALERT_SMTP_HOST="localhost"
UNKNOWN_SEED_ALERT_SMTP_PORT="25"
RUN_COORDINATION_MODE is normally unset. When it is unset, startup checks active spreadsheet rows and refuses to start if any row already has a blocking in-progress status such as discovery-in-progress or downloading-in-progress. Set RUN_COORDINATION_MODE="skip_spreadsheet_coordination_check" only when an external cron or scheduler lock already guarantees that two copies of the script cannot run at the same time; that setting skips the spreadsheet coordination preflight.
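The startup check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the helper name startup_is_blocked and the BLOCKING_STATUSES set are hypothetical, and the set contains only the two statuses named in this README.

```python
from typing import Iterable, Optional

# Assumption: only the two blocking statuses named in this README are listed.
BLOCKING_STATUSES = {"discovery-in-progress", "downloading-in-progress"}
SKIP_MODE = "skip_spreadsheet_coordination_check"


def startup_is_blocked(row_statuses: Iterable[str], coordination_mode: Optional[str]) -> bool:
    """Return True when the run should refuse to start (hypothetical helper)."""
    if coordination_mode == SKIP_MODE:
        # An external lock is trusted, so the spreadsheet preflight is skipped.
        return False
    # Any active row that is already in progress blocks a new run.
    return any(status in BLOCKING_STATUSES for status in row_statuses)


# "" stands in for a row with no in-progress status.
statuses = ["", "downloading-in-progress"]
print(startup_is_blocked(statuses, None))       # True
print(startup_is_blocked(statuses, SKIP_MODE))  # False
```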
UNKNOWN_SEED_ALERT_RECIPIENTS is used by cron_scripts/check_for_unknown_seeds.py. It must be JSON that parses to a list of (name, email_address) pairs.
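As a rough illustration of the expected shape, the value could be parsed and validated like this. The function name parse_recipients is hypothetical, and the real script may validate differently.

```python
import json
from typing import List, Tuple


def parse_recipients(raw: str) -> List[Tuple[str, str]]:
    """Parse a JSON string into a list of (name, email_address) pairs."""
    parsed = json.loads(raw)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON list of [name, email_address] pairs")
    recipients: List[Tuple[str, str]] = []
    for entry in parsed:
        # Each entry must be a two-item list of strings.
        if (
            not isinstance(entry, list)
            or len(entry) != 2
            or not all(isinstance(part, str) for part in entry)
        ):
            raise ValueError(f"invalid recipient entry: {entry!r}")
        name, email_address = entry
        recipients.append((name, email_address))
    return recipients


example = '[["Name One", "name.one@example.edu"], ["Name Two", "name.two@example.edu"]]'
print(parse_recipients(example))
```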
Run the backup workflow:
uv run ./main.py
Validate that a spreadsheet can be opened, parsed, and edited before running the backup workflow:
uv run ./validate_spreadsheet_connection.py --spreadsheet-id the-google-sheet-id
Run tests:
uv run ./run_tests.py
uv run ./run_tests.py -v tests.test_orchestration
Capture WASAPI metadata for one collection without downloading WARC files:
uv run ./tmp_inspect_collection_wasapi.py --collection-id 12345 --output-dir ./wasapi_inspection
Check for downloaded WARC files that could not be assigned to a seed folder:
uv run ./cron_scripts/check_for_unknown_seeds.py --dry-run
uv run ./cron_scripts/check_for_unknown_seeds.py
- Reads the tracking spreadsheet and selects active collections.
- Checks Archive-It WASAPI for WARC files associated with those collections.
- Downloads WARC files that are not yet backed up locally.
- Writes SHA-256 fixity-checksum files for downloaded WARCs.
- Records per-collection state on disk so later runs can continue safely.
- Updates the spreadsheet with simple collection-level progress and summary information.
- The script is intended to run as a cron job, but it can also be run manually.
- On a collection's first successful run, the script aims to do a full historical backfill.
- On later runs, it re-checks a recent overlap window so that interrupted or partial runs are less likely to miss files.
- Files are downloaded into a predictable collection/seed/year/month folder structure.
- Each collection keeps a local state.json file so the script can remember what it has already seen and what may need retrying.
- The current production flow processes collections sequentially.
- It already performs collection discovery, download planning, downloading, fixity writing, and collection-level spreadsheet updates.
- The design plan still leaves room for a later concurrent version with dedicated download workers and a separate spreadsheet updater.
- One of the first steps is determining which WARC files need to be downloaded for each active collection listed in the tracking spreadsheet.
- WASAPI exposes several timestamps, including crawl-start-time, crawl-time, and store-time. This script uses only store-time for discovery and checkpointing.
- That choice is intentional: store-time reflects when the WARC is actually available in WASAPI, and it can be later than the crawl-related timestamps. Since this script is about backup tracking rather than crawl tracking, store-time is the safest single clock to follow.
- The per-collection local state stores one checkpoint value: enumeration_checkpoint_store_time_max. This is a bookmark for how far the script got in listing candidate files from WASAPI.
- On each run, the script:
  - reads that saved checkpoint
  - subtracts 30 days from it
  - queries WASAPI with store-time-after=<checkpoint minus 30 days>
- On a first run, when no checkpoint exists yet, the script does a full historical backfill for that collection instead of limiting itself to only the last 30 days.
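The checkpoint arithmetic above can be sketched as a small helper that builds the WASAPI query parameters. This is a sketch, not the code in lib/wasapi_discovery.py: the function name build_discovery_params, the collection parameter name, and the timestamp format are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

OVERLAP_WINDOW = timedelta(days=30)


def build_discovery_params(collection_id: int, checkpoint: Optional[datetime]) -> Dict[str, str]:
    """Build query parameters for one collection (hypothetical helper).

    checkpoint is expected to be timezone-aware, or None on a first run.
    """
    # Assumption: the collection is selected with a "collection" parameter.
    params = {"collection": str(collection_id)}
    if checkpoint is not None:
        # Re-query from 30 days before the saved checkpoint.
        start = (checkpoint - OVERLAP_WINDOW).astimezone(timezone.utc)
        # Assumption: an ISO 8601 UTC timestamp is an acceptable value.
        params["store-time-after"] = start.strftime("%Y-%m-%dT%H:%M:%SZ")
    # With no checkpoint there is no lower bound: a full historical backfill.
    return params


saved = datetime(2024, 3, 31, 12, 0, 0, tzinfo=timezone.utc)
print(build_discovery_params(12345, saved))
# {'collection': '12345', 'store-time-after': '2024-03-01T12:00:00Z'}
print(build_discovery_params(12345, None))
# {'collection': '12345'}
```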
- Why keep the 30-day overlap window?
- The overlap protects against incomplete or interrupted enumeration and download work.
- Example:
  - a run sees files with store-time values of Feb-02, Feb-04, and Feb-06
  - the script successfully enumerates all three files
  - but a later step fails before every needed file is downloaded or before all local state is updated as intended
- If the next run queried only for files strictly after Feb-06, it could miss a file that should still be retried.
- By querying again from 30 days before the saved checkpoint, the script deliberately re-sees a recent slice of already-known records. That overlap is then made safe by local filename-based state and deduplication logic.
- In short, the 30-day window is a recovery buffer: it reduces the chance that a partial run or transient failure causes the script to permanently skip a WARC that should have been backed up.
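A minimal sketch of filename-based deduplication, assuming local state maps each filename to a status string. The function name plan_downloads and the status values "downloaded" and "failed" are hypothetical, not the real state.json schema.

```python
from typing import Dict, Iterable, List


def plan_downloads(listed_filenames: Iterable[str], known_status: Dict[str, str]) -> List[str]:
    """Return the filenames that still need a download attempt."""
    planned: List[str] = []
    seen = set()
    for filename in listed_filenames:
        if filename in seen:
            # The same record can show up more than once in a listing.
            continue
        seen.add(filename)
        if known_status.get(filename) == "downloaded":
            # Already backed up locally; the overlap window re-listed it.
            continue
        # New files and previously failed files are both planned.
        planned.append(filename)
    return planned


listed = ["a.warc.gz", "b.warc.gz", "c.warc.gz", "b.warc.gz"]
state = {"a.warc.gz": "downloaded", "b.warc.gz": "failed"}
print(plan_downloads(listed, state))  # ['b.warc.gz', 'c.warc.gz']
```

Because the decision depends only on the filename and its recorded status, re-listing the same 30-day slice on every run does not trigger repeat downloads.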
- The script is meant to make safe backup decisions based on what is actually present on disk.
- The spreadsheet is useful for visibility, but it is not detailed enough to serve as the authoritative record of every file and retry state.
- By keeping the main truth locally, the script can recover more safely from interruptions, partial downloads, or spreadsheet write issues.
- In practice, that means the most important record of progress is the collection's local folder plus its state.json file.
- Note that this state.json file is updated as each download attempt is made. If a file fails to download, subsequent runs will retry it, even if the checkpoint (bookmark) date has moved forward.
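One common way to update a state file after every attempt without risking a half-written file is to write a temporary file and rename it into place. The sketch below shows that pattern. The function names and the {"files": {...}} layout are assumptions, not the actual schema used by lib/local_state.py.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state_atomically(state_path: Path, state: dict) -> None:
    """Write state to a temp file, then rename it over state_path."""
    state_path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=state_path.parent, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the file in one step, so readers never see a partial file.
        os.replace(tmp_name, state_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def record_attempt(state_path: Path, filename: str, outcome: str) -> None:
    """Record one download attempt (hypothetical state layout)."""
    if state_path.exists():
        state = json.loads(state_path.read_text(encoding="utf-8"))
    else:
        state = {}
    state.setdefault("files", {})[filename] = {"status": outcome}
    save_state_atomically(state_path, state)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "collections" / "12345" / "state.json"
        record_attempt(path, "a.warc.gz", "failed")
        print(path.read_text(encoding="utf-8"))
```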
- Each collection gets its own local directory.
- That directory includes:
  - downloaded WARC files
  - fixity metadata files
  - a state.json file describing what the script has discovered and recorded for that collection
WARC and fixity files are stored by seed id:
collections/<collection_id>/<seed_id>/<year>/<month>/<filename>
collections/<collection_id>/<seed_id>/<year>/<month>/<filename>.sha256
collections/<collection_id>/<seed_id>/<year>/<month>/<filename>.json
If a WARC filename does not include a parseable SEED... value, the file is stored under UNKNOWN_SEED. The cron_scripts/check_for_unknown_seeds.py script can be scheduled to report those files by email.
- This layout is meant to keep each collection self-contained and easier to inspect.
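To make the layout concrete, here is a sketch of deriving a destination path from a WARC filename. It is not the code in lib/storage_layout.py. It assumes the seed folder is named by the digits after SEED and that year and month come from a 14-or-more-digit timestamp in the filename. The function name and the example filename are made up for illustration.

```python
import re
from pathlib import PurePosixPath

# Assumption: the seed id is the run of digits following "SEED".
SEED_PATTERN = re.compile(r"SEED(\d+)")
# Assumption: the filename has a hyphen-delimited timestamp of 14+ digits.
TIMESTAMP_PATTERN = re.compile(r"-(\d{4})(\d{2})\d{8,}-")
UNKNOWN_SEED = "UNKNOWN_SEED"


def planned_warc_path(collection_id: int, filename: str) -> PurePosixPath:
    """Compute collections/<collection_id>/<seed_id>/<year>/<month>/<filename>."""
    seed_match = SEED_PATTERN.search(filename)
    # Files without a parseable SEED value go under UNKNOWN_SEED.
    seed_id = seed_match.group(1) if seed_match else UNKNOWN_SEED
    time_match = TIMESTAMP_PATTERN.search(filename)
    if time_match is None:
        # How the real script handles this case is not covered here.
        raise ValueError(f"no timestamp found in filename: {filename!r}")
    year, month = time_match.groups()
    return PurePosixPath("collections", str(collection_id), seed_id, year, month, filename)


with_seed = "ARCHIVEIT-12345-EXAMPLE-JOB1111111-SEED2222222-20240204123456789-00000-h3.warc.gz"
without_seed = "ARCHIVEIT-12345-20240204123456-00000.warc.gz"
print(planned_warc_path(12345, with_seed))
print(planned_warc_path(12345, without_seed))
```

The fixity files would sit beside the WARC, at the same path with .sha256 and .json appended.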
- The tracking spreadsheet is used as a reporting and control interface for collection-level backup activity.
- It helps an operator quickly see whether a collection is currently being checked, whether downloads are planned, whether there is nothing new to fetch, and what the final collection outcome was.
- The spreadsheet is not the source of truth for file correctness or retry logic.
- The local filesystem plus each collection's state.json remain authoritative for what has been discovered, downloaded, and recorded durably. These files can be viewed at (server)/warc_downloads/collections/collection-ID/state.json.
- In the current sequential flow, spreadsheet updates are written at a small number of collection-level checkpoints:
  - when discovery begins
  - after download planning completes
  - when no new files need download
  - when downloading begins
  - at coarse in-progress milestones during downloading
  - when final collection reporting is written
- The in-progress download updates are intentionally coarse rather than per-file chatter.
- status-last-fetch holds the coarse machine-readable status, such as discovery-in-progress or downloading-in-progress.
- status-detail holds the human-readable detail for that status, including discovery mode, no-new-files notes, final outcome details, and coarse download progress such as 40% (2/5 files).
- status-last-fetch-file-count holds the numeric count of WARC filename records returned by the latest WASAPI fetch.
- This keeps the sheet useful for monitoring without making spreadsheet state responsible for correctness.
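A coarse progress string like the one above could be produced as follows. Both helper names and the 20% milestone step are assumptions made for illustration. The README does not state which milestones the script actually uses.

```python
def format_progress(completed: int, total: int) -> str:
    """Format coarse download progress for the status-detail column."""
    if total <= 0:
        raise ValueError("total must be positive")
    percent = (completed * 100) // total
    return f"{percent}% ({completed}/{total} files)"


def crossed_milestone(previous: int, completed: int, total: int, step_percent: int = 20) -> bool:
    """Return True when progress moved into a new step_percent bucket.

    Assumption: the sheet is updated only when a bucket boundary is crossed,
    not after every file.
    """
    before = (previous * 100) // total // step_percent
    after = (completed * 100) // total // step_percent
    return after > before


print(format_progress(2, 5))         # 40% (2/5 files)
print(crossed_milestone(1, 2, 5))    # True
print(crossed_milestone(0, 1, 100))  # False
```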
- main.py remains a thin entry point that loads config, configures logging, opens an authenticated httpx.Client, and iterates collection jobs.
- lib/orchestration.py processes collections sequentially.
- lib/collection_sheet.py loads active collection jobs from the spreadsheet.
- lib/local_state.py loads and saves state.json atomically and records durable [1] per-file download/fixity outcomes.
- lib/wasapi_discovery.py performs production WASAPI discovery with overlap-window checkpoint logic.
- lib/storage_layout.py derives seed/year/month partitions from WARC filenames and computes planned WARC/fixity destinations.
- lib/downloader.py streams WARC files, writes to *.partial, removes stale partial files on retry, and atomically renames successful downloads into place.
- lib/fixity.py computes SHA-256 and writes .sha256 and .json fixity files for successfully downloaded WARCs.
- cron_scripts/check_for_unknown_seeds.py scans for WARC files under UNKNOWN_SEED folders and sends an email alert when any are found.
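The downloader and fixity behavior listed above can be sketched together. This is not the code in lib/downloader.py or lib/fixity.py: the function names, the sha256sum-style line format, and the JSON fields are assumptions. The download source is a plain iterable of byte chunks so the example runs without network access.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path
from typing import Iterable


def write_via_partial(chunks: Iterable[bytes], destination: Path) -> None:
    """Stream chunks into <name>.partial, then rename into place."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    partial = destination.with_name(destination.name + ".partial")
    if partial.exists():
        partial.unlink()  # Remove a stale partial file left by an earlier attempt.
    with partial.open("wb") as handle:
        for chunk in chunks:
            handle.write(chunk)
    # The final name only ever refers to a completely written file.
    os.replace(partial, destination)


def write_fixity(warc_path: Path) -> str:
    """Compute SHA-256 and write <name>.sha256 and <name>.json beside the WARC."""
    digest = hashlib.sha256()
    with warc_path.open("rb") as handle:
        # Read in 1 MiB blocks so large WARCs are never loaded whole.
        for block in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(block)
    hex_digest = digest.hexdigest()
    # Assumption: a sha256sum-style "<digest>  <filename>" line.
    sha_path = warc_path.with_name(warc_path.name + ".sha256")
    sha_path.write_text(f"{hex_digest}  {warc_path.name}\n", encoding="utf-8")
    # Assumption: these JSON field names are illustrative only.
    details = {
        "filename": warc_path.name,
        "sha256": hex_digest,
        "size_bytes": warc_path.stat().st_size,
    }
    json_path = warc_path.with_name(warc_path.name + ".json")
    json_path.write_text(json.dumps(details, indent=2), encoding="utf-8")
    return hex_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "example.warc.gz"
        write_via_partial([b"hello ", b"world"], target)
        print(write_fixity(target))
```

With httpx, the chunks argument could plausibly be the iter_bytes() iterator of a streamed response, although how the real downloader wires this up is not shown here.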
Footnotes
1. Here, durable means the recorded outcomes are meant to survive process exits, crashes, and later reruns because they are written into state.json on disk, not just kept in memory for the current execution.