Merged
91 commits
e3c1a65
Initial wg21-paper-tracker added
leostar0412 Mar 9, 2026
9892a45
wg21_paper_tracker: features, tests, and cleanup #24
leostar0412 Mar 10, 2026
18f07c3
Fix lint/format error #24
leostar0412 Mar 10, 2026
f4388ff
Validate mailing_date in get_raw_dir; WG21 author order/resolution an…
leostar0412 Mar 10, 2026
62d5d42
Fix: WG21 tracker (year, GCS guard, IntegrityError), author_alias, Pi…
leostar0412 Mar 10, 2026
e3e91c8
Fix: WG21 – optional Cloud Run, per-blob isolation, PDF priority, yea…
leostar0412 Mar 10, 2026
2159f53
wg21: fix author_alias migration default, fail job when bucket unset,…
leostar0412 Mar 11, 2026
be392a0
wg21: honor settings.RAW_DIR for raw paper storage #24
leostar0412 Mar 11, 2026
005278a
Fix: lint/format error #24
leostar0412 Mar 11, 2026
3547652
fix(openai_converter): use neutral page placeholder for failed pages #24
leostar0412 Mar 11, 2026
7403033
Fix: doc and converter fixes #24
leostar0412 Mar 11, 2026
c33c475
Fix: default sqlite, document #24
leostar0412 Mar 11, 2026
93ee8b7
Fix: author profile merge avoidance, blank paper_id rejection, mailin…
leostar0412 Mar 11, 2026
61a6c7f
Fix: author profile merge avoidance, blank paper_id rejection, pipeli…
leostar0412 Mar 11, 2026
c9023f5
#38-add youtube tracker. confirmed that transcripts are downloaded.
Mar 11, 2026
c246241
Fix: OpenRouter retries, CSV year from parsed date, placeholder race …
leostar0412 Mar 11, 2026
5051258
#38-fixed lint errors
Mar 11, 2026
17827f0
#38-fix a bug
Mar 11, 2026
5dce586
#38-updated youtube_tracker by adding search term, processing QuotaEx…
Mar 13, 2026
6abde94
#38-fixed lint error
Mar 13, 2026
516a0f8
#38-addressed the review results of coderabbitai
Mar 13, 2026
5f445ee
#38-addressed the coderabbitai's suggestions
Mar 13, 2026
bf571ec
#38-fixed a query issue
Mar 13, 2026
36eb704
#38-fix timezone issue
Mar 13, 2026
282046e
Merge branch 'develop' into dev-38
jonathanMLDev Mar 17, 2026
c79c1dd
Merge branch 'develop' into dev-24
leostar0412 Mar 18, 2026
3e32ee2
refactor(wg21): pipeline dispatch + mailing range; remove Cloud Run s…
leostar0412 Mar 21, 2026
637b0e8
Merge remote-tracking branch 'origin/dev-24' into dev-24
leostar0412 Mar 21, 2026
818dcaf
Remove migration #24
leostar0412 Mar 21, 2026
6748f28
unified issues/PR sync, backward commit pagination, Link-based REST -…
snowfox1003 Mar 25, 2026
85e0d6a
Fix: lint/format error - #125
snowfox1003 Mar 25, 2026
47bdf3f
fix(github): reconcile issue/PR labels and PR assignees; document uni…
snowfox1003 Mar 25, 2026
fcad9d9
fix(github_activity_tracker): paginate commits without rel=last; hard…
snowfox1003 Mar 25, 2026
1c9c741
#126-fixed this app and cppa-pinecone app
Mar 25, 2026
d778826
#126-fixed ci test errors
Mar 25, 2026
5532eb4
#126-added the removing logic for downloaded zip file
Mar 25, 2026
6ac0142
#126-removed the seen in sync.py
Mar 25, 2026
fe5ec76
Remove fetch issue and pr functions respectively in fetcher function …
snowfox1003 Mar 25, 2026
819634c
#126-addressed the coderabbitai review results
Mar 25, 2026
7219b54
#126-added version_operations and updated version operation logics
Mar 26, 2026
4e7b433
#126-update docs/service_api/boost_library_docs_tracker.md
Mar 26, 2026
5f61c33
Refactor fetch_issues_and_prs_from_github to separate the ETag/params…
snowfox1003 Mar 26, 2026
7c76ff0
#126-created text_processing.py for general purpose and applied concu…
Mar 27, 2026
6d5cc42
#126-update max concurrent number
Mar 27, 2026
cd06037
#126-removed the practice code
Mar 27, 2026
4eb5c9b
wg21 paper updates, WG21 profile test fix, revert separate test DB UR…
leostar0412 Mar 27, 2026
09b623a
#126-updated all preprocessors to contain "source_ids" key in metadata
Mar 27, 2026
2d8a312
#126-rename typo of boost mailing preprocessor
Mar 27, 2026
026b22b
#106-fixed test for renamed file
Mar 27, 2026
c993d3c
#126-addressed minor errors
Mar 27, 2026
9fa9a77
feat(ops): Slack/Discord startup notification after deploy health che…
snowfox1003 Mar 28, 2026
ae865fb
Fix: Lint/format error - #125
snowfox1003 Mar 28, 2026
4369f90
Fix: workspace and logs folder in docker compose
snowfox1003 Mar 28, 2026
430e3a0
Merge branch 'develop' into feature-125
snowfox1003 Mar 28, 2026
c6d929b
Merge pull request #127 from snowfox1003/feature-125
snowfox1003 Mar 28, 2026
965a046
Merge branch 'develop' into 126-bug-fix-boost-library-docs-tracker
jonathanMLDev Mar 30, 2026
1885fec
fix(dashboard): reliable publish and CLI cleanup - #132
snowfox1003 Mar 30, 2026
92944da
Fix: lint/format error - #132
snowfox1003 Mar 30, 2026
3968a2f
add 403 error fix logic in upload folder to github - #134
snowfox1003 Mar 30, 2026
da881ac
fix(dashboard): bootstrap publish clone and block publish without HTM…
snowfox1003 Mar 30, 2026
9a20939
fix: harden dashboard publish paths, GitHub HTTPS auth, and git commi…
snowfox1003 Mar 30, 2026
f688060
Update: use 60s base for blob 403 fallback - #134
snowfox1003 Mar 30, 2026
9de37f2
Merge branch 'develop' into dev-38
snowfox1003 Mar 31, 2026
b870f17
Fix: Update unecessary logic - #134
snowfox1003 Mar 31, 2026
22678cd
Merge pull request #133 from snowfox1003/bug/132
snowfox1003 Mar 31, 2026
00d23a5
Merge pull request #135 from snowfox1003/bug/134
snowfox1003 Mar 31, 2026
b8ccf7b
#126-fixed file name and duplications
Apr 6, 2026
256b352
Merge branch '126-bug-fix-boost-library-docs-tracker' of https://gith…
Apr 6, 2026
763ac54
#126-fixed content typo of test file
Apr 6, 2026
8d9e91a
Merge pull request #128 from jonathanMLDev/126-bug-fix-boost-library-…
snowfox1003 Apr 6, 2026
625e60a
Merge pull request #107 from jonathanMLDev/dev-38
snowfox1003 Apr 6, 2026
edd54e4
Update the clang_github_tracker
snowfox1003 Mar 31, 2026
7d50b2d
Fix: end_date update
snowfox1003 Apr 1, 2026
d5c9534
Add Clang markdown publishing with context-repo settings, chunked raw…
snowfox1003 Apr 2, 2026
113713b
Fix: lint/format error - #136
snowfox1003 Apr 2, 2026
dc5a43b
fix(clang): docstrings, batch updated_at, CSV/CLI/Pinecone fixes; red…
snowfox1003 Apr 2, 2026
2a1f4de
Fix: lint/format error
snowfox1003 Apr 2, 2026
9249784
fix(clang): max-merge duplicate rows in batch upserts; doc start_afte…
snowfox1003 Apr 2, 2026
abe1519
Fix: map prepare/pull git errors to CommandError - #136
snowfox1003 Apr 3, 2026
4968e5c
fix: defensive clang GitHub upserts, Docker safe.directory, and redac…
snowfox1003 Apr 3, 2026
6922fd5
fix: sanitize git_ops remote errors; clang publisher stale-title clea…
snowfox1003 Apr 3, 2026
f8c1880
docs: fix ClangGithubIssueItem watermark (+1ms); format services.py -…
snowfox1003 Apr 3, 2026
10daaa3
fix(clang): issue number validation + lowercase SHAs; git_ops timeout…
snowfox1003 Apr 3, 2026
9464188
fix: clang backfill raw-only; sanitize clone/publish git errors; sync…
snowfox1003 Apr 3, 2026
e7384f4
Update the code
snowfox1003 Apr 7, 2026
dfc7d69
Fix: staging logic in clange github tracker, use core functions - #136
snowfox1003 Apr 7, 2026
b59fc99
Merge pull request #137 from snowfox1003/dev-136
snowfox1003 Apr 8, 2026
22d740e
Merge branch 'develop' into dev-24
snowfox1003 Apr 20, 2026
57d9334
Fix: lint/format error
leostar0412 Apr 20, 2026
e122eb7
Fix: compose error
leostar0412 Apr 20, 2026
cf8f184
Merge pull request #104 from leostar0412/dev-24
snowfox1003 Apr 20, 2026
45 changes: 26 additions & 19 deletions .env.example
@@ -76,19 +76,6 @@ DATABASE_URL=postgres://user:password@localhost:5432/boost_dashboard
# Slack webhook URL (get from Slack: https://api.slack.com/messaging/webhooks)
# SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL

# =============================================================================
# Clang GitHub Tracker
# =============================================================================
# GitHub repo to sync (default: llvm/llvm-project).
# CLANG_GITHUB_OWNER=llvm
# CLANG_GITHUB_REPO=llvm-project
#
# Private repo for Markdown export (optional).
# Issues/PRs are exported to: issues/YYYY/YYYY-MM/#N - title.md
# If unset, upload is skipped and an error is logged.
# CLANG_GITHUB_TRACKER_PRIVATE_REPO_OWNER=your-org
# CLANG_GITHUB_TRACKER_PRIVATE_REPO_NAME=your-private-repo
# CLANG_GITHUB_TRACKER_PRIVATE_REPO_BRANCH=main

# =============================================================================
# GitHub tokens (multiple use cases)
@@ -108,6 +95,13 @@ DATABASE_URL=postgres://user:password@localhost:5432/boost_dashboard
# GitHub repo to sync (default: llvm/llvm-project).
# CLANG_GITHUB_OWNER=llvm
# CLANG_GITHUB_REPO=llvm-project
# Markdown publish target (optional; see also Clang section above).
# CLANG_GITHUB_CONTEXT_REPO_OWNER=your-org
# CLANG_GITHUB_CONTEXT_REPO_NAME=your-repo
# CLANG_GITHUB_CONTEXT_REPO_BRANCH=main
# If that repo is private: set GITHUB_TOKEN_WRITE to a PAT that can read+push it
# (classic: repo scope; fine-grained: grant this repository). Publish uses the
# write token, not GITHUB_TOKENS_SCRAPING.
# Pinecone sync (run_cppa_pinecone_sync) — app_type and namespace when triggering from this app.
# CLANG_GITHUB_PINECONE_APP_TYPE=github-clang
# CLANG_GITHUB_PINECONE_NAMESPACE=github-clang
@@ -170,17 +164,18 @@ DATABASE_URL=postgres://user:password@localhost:5432/boost_dashboard
# REPO_COUNT_LANGUAGES=C++,Python,Rust

# =============================================================================
# Boost Library Usage Dashboard (optional; for --publish)
# Boost Library Usage Dashboard
# =============================================================================
# When set, run_boost_library_usage_dashboard --publish uses a persistent clone
# at raw/boost_library_usage_dashboard/<owner>/<repo> (clone if missing, pull, copy, push).
# Target repo for publishing (run_boost_library_usage_dashboard without --skip-publish).
# Clone/pull/push uses GITHUB_TOKEN_WRITE (see GitHub tokens above).
# BOOST_LIBRARY_USAGE_DASHBOARD_PUBLISH_OWNER=your-org
# BOOST_LIBRARY_USAGE_DASHBOARD_PUBLISH_REPO=your-dashboard-repo
# Token for clone/pull/push (defaults to GITHUB_TOKEN_WRITE if unset)
# BOOST_LIBRARY_USAGE_DASHBOARD_PUBLISH_TOKEN=ghp_xxxx
# Branch to publish to
# BOOST_LIBRARY_USAGE_DASHBOARD_PUBLISH_BRANCH=main

# Git commit author identity used when publishing (defaults shown)
# GIT_AUTHOR_NAME=unknown
# GIT_AUTHOR_EMAIL=unknown@noreply.github.com

# =============================================================================
# Workspace (optional; default: project_root/workspace)
# =============================================================================
@@ -256,3 +251,15 @@ DATABASE_URL=postgres://user:password@localhost:5432/boost_dashboard

# Path to context repository (where markdown files are exported)
# DISCORD_CONTEXT_REPO_PATH=F:\boost\discord-cplusplus-together-context

# =============================================================================
# YouTube (cppa_youtube_script_tracker)
# =============================================================================
# YouTube Data API v3 key (console.cloud.google.com → APIs & Services → Credentials)
# YOUTUBE_API_KEY=...

# Pinecone namespace for YouTube video/transcript sync (default: youtube-scripts)
# YOUTUBE_PINECONE_NAMESPACE=youtube-scripts

# Earliest published_at to use when DB is empty (ISO 8601, e.g. 2015-01-01T00:00:00Z)
# YOUTUBE_DEFAULT_PUBLISHED_AFTER=2015-01-01T00:00:00Z
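
A minimal sketch of how a tracker might read the `YOUTUBE_DEFAULT_PUBLISHED_AFTER` setting documented above. The helper name `read_published_after` and the fallback value are assumptions for illustration (the example file only shows `2015-01-01T00:00:00Z` as a sample value); this is not the project's actual settings code.

```python
import os
from datetime import datetime, timezone

# Assumed fallback; mirrors the sample value in .env.example.
DEFAULT_PUBLISHED_AFTER = "2015-01-01T00:00:00Z"


def read_published_after(env=None):
    """Return the earliest published_at cutoff as a timezone-aware UTC datetime.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict in tests.
    """
    env = os.environ if env is None else env
    raw = env.get("YOUTUBE_DEFAULT_PUBLISHED_AFTER", DEFAULT_PUBLISHED_AFTER)
    # datetime.fromisoformat() accepts a trailing "Z" only on Python 3.11+,
    # so rewrite it as an explicit UTC offset first.
    parsed = datetime.fromisoformat(raw.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Treat naive timestamps as UTC rather than local time.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

Normalizing to UTC up front keeps later comparisons against API timestamps free of naive/aware datetime mismatches.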
3 changes: 3 additions & 0 deletions .github/workflows/deploy-script/deploy.sh
@@ -60,4 +60,7 @@ until make health >/dev/null 2>&1; do
done
log "Stack is healthy."

log "Sending startup notification..."
DEPLOY_BRANCH="$BRANCH" make notify || log "WARNING: Startup notification failed (non-fatal)."

log "Deploy completed."
1 change: 1 addition & 0 deletions .gitignore
@@ -52,3 +52,4 @@ discord_activity_tracker/tools/
config/boost_collector_schedule.yaml
# temp files
temp/
nul
4 changes: 4 additions & 0 deletions Dockerfile
@@ -33,6 +33,10 @@ RUN chmod +x /app/docker-entrypoint.sh

# Entrypoint runs as root, chowns mounted dirs, then exec's CMD as appuser via gosu
RUN useradd --create-home appuser && chown -R appuser /app
# Git 2.35+ blocks repos when directory owner != current user; bind mounts often
# disagree (e.g. Docker Desktop on Windows). System config applies to root and appuser
# (e.g. docker exec as root vs gosu appuser in entrypoint).
RUN git config --system --add safe.directory '/app/workspace/*'
ENTRYPOINT ["/app/docker-entrypoint.sh"]
# Container starts as root so entrypoint can chown; CMD runs as appuser via gosu

6 changes: 6 additions & 0 deletions Makefile
@@ -9,6 +9,7 @@
SHELL := /bin/bash
COMPOSE := docker compose
APP := web
BEAT := celery_beat
MANAGE := $(COMPOSE) run --rm $(APP) python manage.py

.DEFAULT_GOAL := help
@@ -32,6 +33,7 @@ help:
@echo " Logs & status"
@echo " ps Show running containers"
@echo " health Verify DB, Redis, Selenium, and Celery containers"
@echo " notify Send Slack/Discord startup notification (celery_beat; optional DEPLOY_BRANCH)"
@echo " logs Follow logs for all services"
@echo " logs-web Follow logs for the web service"
@echo " logs-worker Follow logs for the Celery worker"
@@ -101,6 +103,10 @@ health:
$(COMPOSE) ps --status running celery_worker | grep -q celery_worker
$(COMPOSE) ps --status running celery_beat | grep -q celery_beat

.PHONY: notify
notify:
$(COMPOSE) exec -T -e DEPLOY_BRANCH="$(DEPLOY_BRANCH)" $(BEAT) python manage.py send_startup_notification

.PHONY: logs
logs:
$(COMPOSE) logs -f
36 changes: 23 additions & 13 deletions boost_library_docs_tracker/fetcher.py
@@ -76,9 +76,9 @@ def download_source_zip(version: str, dest_dir: Path) -> Path:
zip_name = f"boost_{normalized.replace('.', '_')}.zip"
zip_path = dest_dir / zip_name

# if zip_path.exists():
# logger.info("Source zip already present, skipping download: %s", zip_path)
# return zip_path
if zip_path.exists():
logger.info("Source zip already present, skipping download: %s", zip_path)
return zip_path

dest_dir.mkdir(parents=True, exist_ok=True)
session = _get_session()
@@ -320,16 +320,26 @@ def crawl_library_pages(

# Enqueue in-scope links
soup = BeautifulSoup(resp.text, "lxml")
for a in soup.find_all("a", href=True):
href: str = a["href"]
abs_url = urljoin(final_url, href)
# Strip fragment
abs_url = abs_url.split("#")[0]
if (
abs_url not in visited
and abs_url.startswith(start_url)
and abs_url not in queue
):
lib_segment = lib_key.split("/")[-1]
if not lib_segment:
logger.warning(
"Empty library key segment for lib_key=%r; skipping link discovery for %s",
lib_key,
final_url,
)
else:
for a in soup.find_all("a", href=True):
href: str = a["href"]
abs_url = urljoin(final_url, href)
# Strip fragment
abs_url = abs_url.split("#")[0]
if not abs_url.startswith(base_url):
continue
# Stay within this library's doc subtree (path contains lib segment)
if lib_segment not in abs_url:
continue
if abs_url in visited or abs_url in queue:
continue
queue.append(abs_url)

logger.debug(
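
The link-discovery rules in the hunk above can be read in isolation with a small sketch like the one below. It assumes the hrefs have already been extracted from the page; the function name `in_scope_links` and the example URLs are hypothetical and are not part of `fetcher.py`.

```python
from urllib.parse import urljoin


def in_scope_links(hrefs, page_url, base_url, lib_key, visited=(), queued=()):
    """Return new in-scope absolute URLs from `hrefs`, in discovery order.

    Hypothetical helper mirroring the checks in the diff: resolve relative
    links, strip fragments, stay under `base_url`, require the library's
    last path segment in the URL, and skip anything already seen.
    """
    lib_segment = lib_key.split("/")[-1]
    if not lib_segment:
        # An empty segment would match every URL, so discover nothing.
        return []
    seen = set(visited) | set(queued)
    found = []
    for href in hrefs:
        # Resolve against the page URL, then drop any "#fragment".
        abs_url = urljoin(page_url, href).split("#")[0]
        if not abs_url.startswith(base_url):
            continue
        if lib_segment not in abs_url:
            continue
        if abs_url in seen:
            continue
        seen.add(abs_url)
        found.append(abs_url)
    return found


base = "https://example.org/doc/libs/1_90_0/"
page = base + "libs/asio/doc/index.html"
links = in_scope_links(
    ["overview.html#intro", "overview.html", "../../beast/index.html",
     "https://other.example/asio/", "index.html"],
    page_url=page, base_url=base, lib_key="libs/asio", visited=[page],
)
```

Note that the segment check is a plain substring test, so a key such as `asio` would also admit a URL containing `asio_extras`; a stricter path-component comparison is one possible refinement if that ever matters.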
9 changes: 7 additions & 2 deletions boost_library_docs_tracker/html_to_md.py
@@ -9,14 +9,16 @@
--------
1. _preprocess_html – remove Boost boilerplate from HTML before pandoc sees it
2. _pandoc_convert – HTML → GFM via pypandoc (CLI fallback)
3. _postprocess_markdown – strip residual HTML artefacts and rejoin split lines
3. _postprocess_markdown – strip residual HTML artefacts, rejoin split lines, then clean_text (unicode/line endings only)
"""

import re
import subprocess

from bs4 import BeautifulSoup

from core.utils.text_processing import clean_text

try:
import pypandoc
except Exception: # optional runtime dependency
@@ -299,4 +301,7 @@ def _postprocess_markdown(md: str) -> str:
# 12. Collapse excessive blank lines to at most two
md = _RE_EXCESS_BLANK.sub("\n\n", md)

return md.strip() + "\n"
# 13. Unicode / line-ending cleanup (no space collapsing — preserves markdown indent)
md = clean_text(md, remove_extra_spaces=False)

return md.rstrip() + "\n"
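
To illustrate why the final step switches from `strip()` to `rstrip()`, here is a small sketch. `finalize_markdown` is a hypothetical stand-in: the real `clean_text` implementation is not shown in this diff, so the NFC normalization and line-ending handling below are assumptions based only on the "unicode/line endings only" description.

```python
import unicodedata


def finalize_markdown(md: str) -> str:
    """Hypothetical stand-in for the last steps of _postprocess_markdown."""
    # Assumed behaviour of clean_text(md, remove_extra_spaces=False):
    # normalize line endings and Unicode form, but leave spaces untouched.
    md = md.replace("\r\n", "\n").replace("\r", "\n")
    md = unicodedata.normalize("NFC", md)
    # rstrip() trims only trailing whitespace, so a leading indented
    # code block (four spaces in Markdown) keeps its indentation.
    return md.rstrip() + "\n"


indented = "    int x = 0;\r\n\r\n"
kept = finalize_markdown(indented)   # leading indent preserved
lost = indented.strip() + "\n"       # old behaviour: indent removed
```

With `strip()`, a document that begins with an indented code block would have its first line silently turned into ordinary paragraph text.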