The Boost Data Collector is a Django project with multiple Django apps. The main workflow is driven by Django's manage.py and management commands (or by Celery tasks that run the same commands). Each data-collection or processing step is a Django management command (e.g. python manage.py run_boost_library_tracker). The project uses one virtual environment and one database; all apps share the same Django settings and INSTALLED_APPS. Within a single command run—e.g. python manage.py run_all_collectors or python manage.py run_scheduled_collectors—collectors run one after another with no parallel execution; different Celery Beat entries (such as workflow.tasks.run_all_collectors_task or the YAML-driven groups in config/boost_collector_schedule.yaml) can run in parallel across workers.
You can run collectors in two ways:
- workflow app – Fixed list: `python manage.py run_all_collectors`. Celery Beat can run `workflow.tasks.run_all_collectors_task` on a schedule.
- boost_collector_runner app – YAML-driven schedule: `config/boost_collector_schedule.yaml` defines groups, schedule types (daily, weekly, monthly, interval, on_release), and optional args. Use `python manage.py run_scheduled_collectors --schedule daily` (or weekly/monthly/interval/on_release). Celery Beat is built from the YAML, so adding or reordering collectors requires no code changes, only editing the YAML.
This document covers: main workflow process, Boost Collector Runner and YAML schedule, project details, execution order, error handling, and branching.
The main task runs at a set time (e.g. Celery Beat) or on demand. Each Django app exposes one or more management commands (e.g. run_boost_library_tracker). The runner runs them in order, one at a time, to avoid write conflicts and keep data dependencies in order.
- Start – Trigger the run (Beat, cron, or `python manage.py run_all_collectors` / `run_scheduled_collectors`).
- Run commands in order – For each command: run it, wait for completion, check the exit code (0 = success, non-zero = failure), then run the next. Optionally stop on the first failure.
- Finalize – Log how many succeeded or failed; exit with an overall success or failure code.
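The loop above can be sketched without any Django dependency (a minimal illustration, not the project's actual runner; in the real management command each step would invoke `django.core.management.call_command` or run `python manage.py <name>` as a subprocess):

```python
# Framework-free sketch of the sequential runner loop. Callables stand in for
# management commands so the control flow is visible on its own.
def run_in_order(commands, stop_on_failure=False):
    """Run (name, callable) pairs one at a time; return (succeeded, failed)."""
    succeeded, failed = [], []
    for name, fn in commands:
        try:
            fn()  # in Django: call_command(name), or check a subprocess exit code
            succeeded.append(name)
        except Exception:
            failed.append(name)
            if stop_on_failure:
                break  # "stop on first failure" mode
    return succeeded, failed


def overall_exit_code(failed):
    # 0 only when every task succeeded, so CI and schedulers can detect failure.
    return 0 if not failed else 1
```

In "continue on failure" mode the loop records the failure and moves on; either way the process exits non-zero if anything failed.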
The boost_collector_runner app runs collectors from a single config file so you can add, reorder, or reschedule tasks without changing Python code.
- Path: `config/boost_collector_schedule.yaml`
- Setting: `BOOST_COLLECTOR_SCHEDULE_YAML` in `config/settings.py` (defaults to that path).
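In `config/settings.py` this might look like the following sketch. Only the setting name and its default path come from the text above; the environment-variable override is an assumption for illustration:

```python
import os
from pathlib import Path

# Standard Django convention: BASE_DIR points at the project root.
BASE_DIR = Path(__file__).resolve().parent.parent

# Path to the YAML schedule; defaults to config/boost_collector_schedule.yaml.
# (Env-var override is hypothetical, not confirmed by the project docs.)
BOOST_COLLECTOR_SCHEDULE_YAML = os.environ.get(
    "BOOST_COLLECTOR_SCHEDULE_YAML",
    str(BASE_DIR / "config" / "boost_collector_schedule.yaml"),
)
```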
| Type | Meaning |
|---|---|
| daily | Run every day at the group's `default_time`. |
| weekly | Run once per week. Use `on` with a weekday name: monday, mon, tuesday, tue, etc. |
| monthly | Run once per month. Use `on` with a day of the month (1–31). |
| interval | Run every N minutes. Use `minutes` (1–180, i.e. at most 3 hours). Suitable for short periodic runs (e.g. every 15 min). |
| on_release | Run when a new version release is detected. There is no dedicated Beat entry for on_release; grouped on_release tasks are evaluated during group batch runs, and standalone checks can also be triggered manually or from release-detection code (e.g. `run_scheduled_collectors_task.delay(schedule_kind="on_release")`). |
- groups: Each group has `default_time` (required; 24-hour `"HH:MM"`, UTC) and a `tasks` list.
- Each task:
  - command (required) – Management command name (e.g. `run_boost_library_tracker`).
  - schedule (required) – `daily`, `weekly`, `monthly`, `interval`, or `on_release`.
  - on – For weekly: a weekday name (`monday` or `mon`, etc.). For monthly: a day of the month (1–31). Omit for daily, interval, and on_release.
  - minutes – For interval only: run every N minutes (1–180, i.e. at most 3 hours).
  - enabled (optional) – `true` (default) or `false` to skip the task without removing the entry.
  - args (optional) – List of strings passed to the command (e.g. `["--format", "json"]`).

Tasks do not have their own time; the group's `default_time` is when that group's non-interval tasks run. Within a group, tasks run sequentially. Each group has its own Celery Beat entry, so groups can run in parallel on different workers. Interval tasks are configured under groups but excluded from the group batch; they get separate Beat entries and run independently. Tasks with `schedule: on_release` do not get a dedicated Beat entry but are included in the group batch when the group runs (and run if a new release is detected).
```yaml
groups:
  github:
    default_time: "04:10"
    tasks:
      - command: run_boost_library_tracker
        schedule: daily
      - command: run_boost_usage_tracker
        schedule: weekly
        on: monday
  reporting:
    default_time: "06:00"
    tasks:
      - command: run_monthly_report
        schedule: monthly
        on: 3
      - command: run_on_release_sync
        schedule: on_release
      - command: run_export
        schedule: daily
        args: ["--format", "json"]
```

- Daily: `python manage.py run_scheduled_collectors --schedule daily` (all groups) or `--schedule daily --group github` (one group).
- Weekly (e.g. Monday): `python manage.py run_scheduled_collectors --schedule weekly --day-of-week monday`, or add `--group <name>` for one group.
- Monthly (e.g. the 3rd): `python manage.py run_scheduled_collectors --schedule monthly --day-of-month 3`, or add `--group <name>` for one group.
- Interval (e.g. every 15 min): `python manage.py run_scheduled_collectors --schedule interval --interval-minutes 15` (runs all interval tasks with that `minutes` value; no group).
- On release: `python manage.py run_scheduled_collectors --schedule on_release`
Add `--stop-on-failure` to stop after the first failing command.
`CELERY_BEAT_SCHEDULE` is built from the YAML: one Beat entry per group for daily/weekly/monthly (so groups run in parallel), and one entry per interval-minutes value for interval tasks (run independently, not tied to a group). Tasks with `schedule: on_release` do not get dedicated Beat entries; grouped on_release tasks are checked during group runs, and standalone on_release runs can be triggered from release-detection logic.
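The YAML-to-Beat mapping can be pictured with a simplified, dependency-free sketch. This is not the project's actual code: real settings would use `celery.schedules.crontab` objects, and the Celery task name here is illustrative; schedules are shown as plain dicts so the structure is testable on its own:

```python
def build_beat_schedule(config):
    """Build one Beat entry per group (time-of-day schedules) and one per
    interval-minutes value; on_release tasks get no dedicated entry."""
    schedule = {}
    interval_minutes = set()
    for group_name, group in config.get("groups", {}).items():
        hour, minute = group["default_time"].split(":")
        has_batch = False
        for task in group.get("tasks", []):
            kind = task["schedule"]
            if kind == "interval":
                interval_minutes.add(task["minutes"])  # separate Beat entry
            elif kind != "on_release":
                has_batch = True  # on_release rides along in the batch run
        if has_batch:
            schedule[f"collectors-{group_name}"] = {
                "task": "run_scheduled_collectors_task",  # illustrative name
                "schedule": {"hour": int(hour), "minute": int(minute)},
                "kwargs": {"group": group_name},
            }
    for m in sorted(interval_minutes):
        schedule[f"collectors-interval-{m}"] = {
            "task": "run_scheduled_collectors_task",
            "schedule": {"every_minutes": m},
            "kwargs": {"interval_minutes": m},
        }
    return schedule
```

The key point the sketch shows: groups become parallel-capable Beat entries, interval tasks are pulled out of their groups, and on_release tasks contribute no entry of their own.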
- Framework – Django. One Django project with multiple Django apps; all apps share the same settings and database.
- ORM – Django ORM. All data access goes through Django models and the ORM; migrations are used for schema changes.
- Database – PostgreSQL. The project uses one PostgreSQL database (e.g. `boost_dashboard`); there are no separate databases or schema-based isolation per app.
- Task scheduling – Celery and Celery Beat run tasks on a schedule defined by configuration (for YAML-driven runs, by each group's `default_time`). The boost_collector_runner app builds the Beat schedule from `config/boost_collector_schedule.yaml` when the YAML loads successfully; if the YAML is missing or invalid, `CELERY_BEAT_SCHEDULE` is set to `{}` (no autogenerated schedule); there is no fallback to the workflow app's daily task. Redis is the message broker. Run by hand: `python manage.py run_all_collectors` or `python manage.py run_scheduled_collectors --schedule daily`. Start the worker with `celery -A config worker -l info` and the scheduler with `celery -A config beat -l info`.
- Configuration – Django settings (e.g. `settings.py`); environment variables for database URL, credentials, and API keys (e.g. via `django-environ` or `python-decouple`).
- Structure – One Django project (e.g. `config/` or project root with `manage.py`, `settings.py`). Multiple Django apps (see table below); each app can expose management commands in `management/commands/`. All apps are in `INSTALLED_APPS` and use the shared database.
The runner executes each app's command one after another. Order is defined by the workflow app's fixed list (run_all_collectors) or by the boost_collector_runner YAML (order of groups and order of tasks within each group). Order matters:
- Data dependencies - App tasks that produce reference or core data (e.g. Boost Library Tracker, GitHub Activity) run before app tasks that use that data (e.g. Boost Usage Tracker).
- Shared reference data - App tasks that own reference tables (e.g. language, license) run early so other app tasks can read that data.
Typical order: data collection first, then processing or transforms, then analysis or reporting. When using the YAML, set the order by arranging groups and tasks in `config/boost_collector_schedule.yaml`.
- If startup checks fail (e.g. missing settings, database unreachable), the main task can exit right away with a non-zero code.
- When an app's task returns non-zero or raises an uncaught exception, the main task records the failure. The project can choose "stop on first failure" or "continue and run remaining app tasks".
- The overall exit code is 0 only when all app tasks succeeded; otherwise it is non-zero so CI or schedulers can detect failure.
- The Django project sets up logging in `settings.LOGGING`. App tasks (management commands or Celery tasks) use this configuration.
- Log the start and end of each app task, success or failure, and exit codes. You can also write a final summary (how many ran, how many succeeded or failed) to the log or stdout.
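A minimal `settings.LOGGING` sketch along those lines. The logger name (`collectors`) and the format string are assumptions for illustration, not the project's actual configuration:

```python
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        # Timestamp + level + logger name, enough to trace start/end of each task.
        "simple": {"format": "%(asctime)s %(levelname)s %(name)s %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "simple"},
    },
    "loggers": {
        # Hypothetical logger used by collector commands to record
        # start/end, success/failure, and exit codes.
        "collectors": {"handlers": ["console"], "level": "INFO"},
    },
}
```

Commands would then log via `logging.getLogger("collectors")` at the start and end of each run.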
The repository uses two long-lived branches:
- main – Default branch; production-ready code. CI and deployments typically track `main`.
- develop – Integration branch for active development. Feature branches are created from `develop`, and pull requests target `develop`. Code is merged from `develop` into `main` for releases.
See the README for the full branching strategy.
- `Schema.md` – Database schema and table relationships.
- `README.md` – Project overview and quick start.
- `Development_guideline.md` – Development setup, app structure, and code examples (if present).