Add WG21 paper tracker with fetch, download, GCS upload, and tests (#24) #104
Merged
23 commits
e3c1a65 Initial wg21-paper-tracker added (leostar0412)
9892a45 wg21_paper_tracker: features, tests, and cleanup #24 (leostar0412)
18f07c3 Fix lint/format error #24 (leostar0412)
f4388ff Validate mailing_date in get_raw_dir; WG21 author order/resolution an… (leostar0412)
62d5d42 Fix: WG21 tracker (year, GCS guard, IntegrityError), author_alias, Pi… (leostar0412)
e3e91c8 Fix: WG21 – optional Cloud Run, per-blob isolation, PDF priority, yea… (leostar0412)
2159f53 wg21: fix author_alias migration default, fail job when bucket unset,… (leostar0412)
be392a0 wg21: honor settings.RAW_DIR for raw paper storage #24 (leostar0412)
005278a Fix: lint/format error #24 (leostar0412)
3547652 fix(openai_converter): use neutral page placeholder for failed pages #24 (leostar0412)
7403033 Fix: doc and converter fixes #24 (leostar0412)
c33c475 Fix: default sqlite, document #24 (leostar0412)
93ee8b7 Fix: author profile merge avoidance, blank paper_id rejection, mailin… (leostar0412)
61a6c7f Fix: author profile merge avoidance, blank paper_id rejection, pipeli… (leostar0412)
c246241 Fix: OpenRouter retries, CSV year from parsed date, placeholder race … (leostar0412)
c79c1dd Merge branch 'develop' into dev-24 (leostar0412)
3e32ee2 refactor(wg21): pipeline dispatch + mailing range; remove Cloud Run s… (leostar0412)
637b0e8 Merge remote-tracking branch 'origin/dev-24' into dev-24 (leostar0412)
818dcaf Remove migration #24 (leostar0412)
4eb5c9b wg21 paper updates, WG21 profile test fix, revert separate test DB UR… (leostar0412)
22d740e Merge branch 'develop' into dev-24 (snowfox1003)
57d9334 Fix: lint/format error (leostar0412)
e122eb7 Fix: compose error (leostar0412)
19 changes: 19 additions & 0 deletions
cppa_user_tracker/migrations/0005_wg21paperauthorprofile_author_alias.py
@@ -0,0 +1,19 @@
# Generated by Django 4.2.28

from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ("cppa_user_tracker", "0004_alter_slackuser_slack_user_id_and_more"),
    ]

    operations = [
        migrations.AddField(
            model_name="wg21paperauthorprofile",
            name="author_alias",
            field=models.CharField(blank=True, db_index=True, default="", max_length=255),
            preserve_default=False,
        ),
    ]
13 changes: 13 additions & 0 deletions
cppa_user_tracker/migrations/0008_merge_wg21_author_alias_youtubespeaker_external_id.py
@@ -0,0 +1,13 @@
# Merge parallel branches from 0004: WG21 author_alias vs YouTube speaker chain.

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ("cppa_user_tracker", "0005_wg21paperauthorprofile_author_alias"),
        ("cppa_user_tracker", "0007_youtubespeaker_external_id"),
    ]

    operations = []
@@ -0,0 +1,69 @@
# WG21 Paper Tracker → GitHub Actions (`repository_dispatch`)

The Django app **`run_wg21_paper_tracker`** scrapes WG21 mailings and stores paper metadata in the database. It does **not** download PDFs or other documents. When a run creates **new** paper rows, it can send **one** [repository dispatch](https://docs.github.com/en/rest/repos/repos#create-a-repository-dispatch-event) to another GitHub repository so that a workflow there fetches each URL and runs conversion (e.g. PDF → Markdown).

## Environment variables

| Variable | Required | Description |
|----------|----------|-------------|
| `WG21_GITHUB_DISPATCH_ENABLED` | No (default `false`) | Set to `true` to send `repository_dispatch` when there are new papers. |
| `WG21_GITHUB_DISPATCH_REPO` | Yes, if enabled | Target repo as `owner/repo` (the repo whose workflow will run). |
| `WG21_GITHUB_DISPATCH_TOKEN` | Yes, if enabled | PAT or token with permission to create repository dispatch events on that repo (classic PAT: `repo` scope for private repos). |
| `WG21_GITHUB_DISPATCH_EVENT_TYPE` | No | Must match `on.repository_dispatch.types` in the target workflow. Default: `wg21_papers_convert`. |
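
As a rough illustration of how these variables fit together, the sketch below reads them from a mapping and validates the "required if enabled" rules. The `DispatchConfig` class and `load_dispatch_config` function are hypothetical names invented for this example; the tracker's actual settings code may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DispatchConfig:
    """Hypothetical container for the four documented variables."""
    enabled: bool
    repo: str
    token: str
    event_type: str


def load_dispatch_config(env: Optional[Mapping[str, str]] = None) -> DispatchConfig:
    """Read the dispatch settings and enforce the 'required if enabled' rules."""
    env = os.environ if env is None else env
    enabled = env.get("WG21_GITHUB_DISPATCH_ENABLED", "false").strip().lower() == "true"
    repo = env.get("WG21_GITHUB_DISPATCH_REPO", "").strip()
    token = env.get("WG21_GITHUB_DISPATCH_TOKEN", "").strip()
    # Fall back to the documented default event type when unset or blank.
    event_type = env.get("WG21_GITHUB_DISPATCH_EVENT_TYPE", "").strip() or "wg21_papers_convert"
    if enabled:
        # The repo must be exactly "owner/repo" with both parts non-empty.
        if repo.count("/") != 1 or not all(repo.split("/")):
            raise ValueError("WG21_GITHUB_DISPATCH_REPO must look like 'owner/repo'")
        if not token:
            raise ValueError("WG21_GITHUB_DISPATCH_TOKEN is required when dispatch is enabled")
    return DispatchConfig(enabled, repo, token, event_type)
```

Validating at load time means a misconfigured run fails before any scraping work is done, rather than at the moment the dispatch would be sent.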

## `client_payload` contract

The JSON body includes only a list of URL strings:

```json
{
  "papers": [
    "https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/…",
    "https://www.open-std.org/…"
  ]
}
```

- **`papers`**: array of strings (WG21 document URLs), all new papers from **that** pipeline run in a **single** event.
- There is **no** `new_paper_count` field; if the workflow needs the count, compute it from the array (for example with `jq length` on the JSON), since GitHub Actions expressions have no built-in length function.
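
A minimal sketch of the producer side of this contract follows. `build_client_payload` is a hypothetical helper name and the `example.org` URLs are placeholders, not real WG21 documents.

```python
import json
from typing import Dict, Iterable, List


def build_client_payload(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Return a payload whose only key is 'papers', a list of URL strings."""
    papers = [str(url) for url in urls]
    if any(not url.strip() for url in papers):
        raise ValueError("paper URLs must be non-empty strings")
    return {"papers": papers}


# Placeholder URLs for illustration only.
payload = build_client_payload([
    "https://example.org/papers/p0001.pdf",
    "https://example.org/papers/p0002.html",
])
body = json.dumps(payload)
# The count is not a field in the payload; it is derived from the list.
new_paper_count = len(payload["papers"])
```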

## Target repository workflow (example)

```yaml
on:
  repository_dispatch:
    types: [wg21_papers_convert]

jobs:
  convert:
    runs-on: ubuntu-latest
    steps:
      - name: URLs
        env:
          # Pass the payload through an environment variable instead of
          # interpolating it into the script, so quotes in the data
          # cannot break the shell command.
          PAPERS_JSON: ${{ toJson(github.event.client_payload.papers) }}
        run: |
          echo "$PAPERS_JSON"
          # Fetch each URL, convert, store artifacts / upload elsewhere
```

In expressions, `github.event.client_payload.papers` is a JSON array of strings.
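
On the consumer side, a workflow step could hand the array to a small script. The sketch below assumes the step exports the array as JSON in an environment variable named `PAPERS_JSON`; that name and the `parse_papers` helper are assumptions made for this example.

```python
import json
import os
from typing import List


def parse_papers(raw: str) -> List[str]:
    """Decode the JSON array and check that it matches the documented contract."""
    data = json.loads(raw)
    if not isinstance(data, list) or not all(isinstance(u, str) for u in data):
        raise ValueError("expected a JSON array of URL strings")
    return data


if __name__ == "__main__":
    # PAPERS_JSON is an assumed variable name set by the workflow step.
    papers = parse_papers(os.environ.get("PAPERS_JSON", "[]"))
    print(f"{len(papers)} paper URL(s) received")
    for url in papers:
        print(url)
```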

## Token security

Store `WG21_GITHUB_DISPATCH_TOKEN` in a secret manager or CI secret; never commit it. Prefer a fine-grained PAT scoped to the conversion repo if possible.

## Payload size

Very large mailings could produce many URLs in one payload. If you approach GitHub or runner limits, document a split strategy (multiple dispatches) as an edge case; the default is one dispatch per tracker run with the full list.
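
One possible split strategy is sketched below: URLs are grouped into batches whose serialized `client_payload` stays under a byte budget, and each batch would become its own dispatch. This is not the tracker's default behavior. `split_papers` is a hypothetical helper, and the 60,000 byte default is an assumed safety margin rather than a documented limit; check GitHub's current documentation for the real cap.

```python
import json
from typing import Iterable, Iterator, List


def split_papers(urls: Iterable[str], max_bytes: int = 60_000) -> Iterator[List[str]]:
    """Yield URL batches whose JSON payload fits within max_bytes.

    max_bytes is an assumed budget, not a value taken from GitHub's docs.
    A single URL larger than the budget is still yielded on its own.
    """
    batch: List[str] = []
    for url in urls:
        candidate = batch + [url]
        size = len(json.dumps({"papers": candidate}).encode("utf-8"))
        if batch and size > max_bytes:
            # Adding this URL would overflow: close the current batch.
            yield batch
            batch = [url]
        else:
            batch = candidate
    if batch:
        yield batch
```

Measuring the serialized size rather than counting URLs keeps the batching correct when URL lengths vary widely.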

## CLI options

- **`--from-date YYYY-MM`**: Process mailings with `mailing_date >= YYYY-MM` (WG21 / CSV style). Backfills from that key onward when used alone.
- **`--to-date YYYY-MM`**: Upper bound: `mailing_date <= YYYY-MM`. With `--from-date`, the run uses the inclusive range `[from, to]`. Without `--from-date`, behavior stays incremental (only mailings **newer than** the latest `WG21Mailing` in the DB), but capped at `to`. This is useful for keeping very new mailings out of a controlled run.
- **`--dry-run`**: Log only; do not run the pipeline or send dispatch.
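
The date-range rules above can be sketched as a small filter over `YYYY-MM` keys. `select_mailings` and its `latest_in_db` parameter are hypothetical names for this example; the real pipeline may implement the selection differently. Zero-padded `YYYY-MM` strings sort chronologically, so plain string comparison is enough here.

```python
from typing import Iterable, List, Optional


def select_mailings(
    available: Iterable[str],
    from_date: Optional[str] = None,
    to_date: Optional[str] = None,
    latest_in_db: Optional[str] = None,
) -> List[str]:
    """Pick the mailing keys (YYYY-MM) that a run would process."""
    selected = []
    for key in sorted(available):
        if from_date is not None:
            # Explicit lower bound: inclusive, ignores what is already in the DB.
            if key < from_date:
                continue
        elif latest_in_db is not None and key <= latest_in_db:
            # Incremental mode: only mailings newer than the latest stored one.
            continue
        if to_date is not None and key > to_date:
            # Inclusive upper bound, applied in both modes.
            continue
        selected.append(key)
    return selected
```

For example, with mailings `2024-10`, `2024-12`, `2025-01`, and `2025-03` available and `2024-12` as the latest stored mailing, passing only `to_date="2025-01"` selects just `2025-01`.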

## Flow summary

1. Scheduler runs `run_wg21_paper_tracker` (optionally with `--from-date` / `--to-date`).
2. Pipeline fetches mailings, upserts `WG21Mailing` / `WG21Paper` (metadata only).
3. For each row **newly created** in that run, its document URL is collected.
4. If the list is non-empty and dispatch is enabled, the app sends a single `POST /repos/{owner}/{repo}/dispatches` request with `event_type` and `client_payload: { "papers": [ ... ] }`.
5. The conversion repo’s workflow runs and downloads each URL.
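
Step 4 can be sketched with the standard library alone. `build_dispatch_request` is a hypothetical helper that only constructs the request; it is not the tracker's actual implementation, which may use a different HTTP client. The endpoint path and the body shape come from the flow above.

```python
import json
import urllib.request
from typing import List, Optional


def build_dispatch_request(
    repo: str, token: str, event_type: str, papers: List[str]
) -> Optional[urllib.request.Request]:
    """Build the single repository_dispatch POST, or None when there is nothing to send."""
    if not papers:
        return None  # No new papers in this run: no dispatch.
    body = json.dumps(
        {"event_type": event_type, "client_payload": {"papers": papers}}
    ).encode("utf-8")
    return urllib.request.Request(
        f"https://api.github.com/repos/{repo}/dispatches",
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Sending would then be a call such as `urllib.request.urlopen(request, timeout=30)`, with a 204 response indicating that GitHub accepted the event. Keeping construction separate from sending makes the request easy to inspect in tests and in a `--dry-run` style mode.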