Skip to content

Plumb show_name + published_at through download cascade (closes #111)#133

Merged
rafacm merged 1 commit intomainfrom
rafacm/download-show-name-fix
May 4, 2026
Merged

Plumb show_name + published_at through download cascade (closes #111)#133
rafacm merged 1 commit intomainfrom
rafacm/download-show-name-fix

Conversation

@rafacm
Copy link
Copy Markdown
Owner

@rafacm rafacm commented May 4, 2026

Closes #111.

Summary

Tightens the download agent's index-candidate matching for non-English shows whose extracted show name was previously absent. Three coordinated layers, single PR:

  1. Real show_name source of truth. New Episode.show_name CharField (migration 0025_episode_show_name). The fetch_details agent extracts it from <meta property=\"og:site_name\">, <meta name=\"application-name\">, RSS <channel><title>, JSON-LD isPartOf.name / partOfSeries.name, or the visible publisher heading. Persisted only when non-empty so a partial re-run never wipes a previously-good value or admin edit. episodes/downloader._show_name now prefers episode.show_name, falling back to URL host as defense-in-depth.

  2. published_at as a tiebreaker. EpisodeCandidate.published_at parsed by each aggregator (fyyd pubdate, iTunes releaseDate, podcastindex datePublished epoch seconds). Missing or malformed values log a warning and return None without dropping the candidate. Surfaced through DownloadDeps.published_at and IndexCandidate.published_at.

  3. Hostname-aware download agent prompt. When Show contains a . and no spaces (i.e. it looks like a hostname), the agent now switches to a (title, published_at) match with ±1 day tolerance instead of demanding an exact show_name string match. Real broadcast titles keep the existing show-plus-title path.

No backfill — pre-prod data freedom per feedback_reembed_ok_preprod.md.

Test plan

  • uv run python manage.py test — 384 passing.
  • uv run python manage.py makemigrations --check --dry-run — no further migrations.

Manual verification (post-merge or pre-merge with reviewer)

Requires a live ASGI worker, configured provider keys, and the canonical ARD episode:

  • Submit https://www.ardsounds.de/episode/urn:ard:episode:fdcf93eef8395b35/ via the admin "Submit episode" form.
  • Watch the pipeline: fetch_details should populate Episode.show_name = \"Zeitzeichen\" (or similar).
  • download step should reach READY via the index path with source=\"fyyd\".
  • dbos workflow steps <id> shows download_step succeeded.
  • Smoke-test on one English episode (e.g. an iTunes-indexed show).
  • Smoke-test on an episode where fetch_details fails to extract show_name → confirm the host-fallback path still works in the download agent (the prompt should detect the hostname shape and switch to date-based matching).

Documentation

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

Unit Test Results

394 tests  +13   394 ✅ +13   24s ⏱️ ±0s
 81 suites + 1     0 💤 ± 0 
  1 files   ± 0     0 ❌ ± 0 

Results for commit 04f5ce4. ± Comparison against base commit d6cfd82.

♻️ This comment has been updated with latest results.

Copy link
Copy Markdown
Owner Author

@rafacm rafacm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: runtime code looks broadly sound for #111, but I would fix the documentation/transcript compliance issues before merging if the repo's AGENTS.md requirements are merge-gating.

Migration

No blocking issue found. episodes/migrations/0025_episode_show_name.py:13 adds show_name as models.CharField(blank=True, default='', max_length=255) with no null=True, which matches the existing Episode metadata-field style in episodes/models.py rather than introducing nullable strings. I also ran uv run python manage.py makemigrations --check --dry-run; it reported No changes detected.

Fetch-details

The prompt update is additive and explicit about the important sources: episodes/agents/fetch_details.py:48 lists og:site_name, application-name, RSS/Atom channel title, JSON-LD isPartOf.name / partOfSeries.name, and visible publisher/show title. episodes/agents/fetch_details.py:55 also says the value must be the broadcast/podcast title, not the publisher company or URL hostname, and to leave it blank when uncertain. That directly addresses the original bug.

Persistence semantics are mostly right: episodes/fetch_details_step.py:89 reads the detail, episodes/fetch_details_step.py:90 requires a non-empty stripped string, and episodes/fetch_details_step.py:91 writes the stripped value. Empty agent output does not clear an existing show_name.

One caveat: the PR body and comments mention preserving admin edits, but show_name is not included in EpisodeAdmin.METADATA_FIELDS or any fieldset. See episodes/admin.py:292 and episodes/admin.py:355. With the current explicit fieldsets, admins cannot actually edit show_name through the Episode admin. That does not break the pipeline fix, but it makes the “admin edit” preservation story incomplete.

Aggregators

The parser behavior is reasonable and candidate-safe:

  • fyyd: episodes/podcast_aggregators/fyyd.py:86 plumbs pubdate, and _parse_pubdate() returns None for missing/blank/malformed input without dropping the candidate (episodes/podcast_aggregators/fyyd.py:104, episodes/podcast_aggregators/fyyd.py:127). The happy path covers the documented "2024-08-30 04:00:00" shape.
  • iTunes: episodes/podcast_aggregators/itunes.py:86 plumbs releaseDate; episodes/podcast_aggregators/itunes.py:117 handles trailing Z via +00:00 before fromisoformat().
  • podcastindex: episodes/podcast_aggregators/podcastindex.py:111 plumbs datePublished; episodes/podcast_aggregators/podcastindex.py:135 accepts numeric values and episodes/podcast_aggregators/podcastindex.py:141 accepts numeric strings. It correctly uses datetime.fromtimestamp(..., tz=timezone.utc) at episodes/podcast_aggregators/podcastindex.py:137, avoiding local-time drift.

Timezone edge coverage is thin. iTunes currently returns the date in the timestamp's own offset (datetime.fromisoformat(...).date() at episodes/podcast_aggregators/itunes.py:117), while podcastindex normalizes epoch seconds to UTC. The ±1 day prompt tolerance probably absorbs this, but the tests do not lock down the intended semantics for offset-bearing iTunes/fyyd values near midnight.

Agent Prompt

The prompt says exactly what the PR claims: episodes/agents/download.py:62 treats Show values containing . and no spaces as hostnames, episodes/agents/download.py:65 switches to (title, published_at), and episodes/agents/download.py:69 defines the ±1 day match. episodes/agents/download.py:73 gives a fallback when Published is unknown. The check is prompt-side only, not deterministic code, which is acceptable for this layer but should be recognized as LLM behavior rather than enforceable matching logic.

Backwards compatibility mostly holds: blank Episode.show_name falls through to URL host in episodes/downloader.py:106, and _get_system_prompt() renders non-empty hostnames into the prompt. One subtle limitation is that the index lookup still queries aggregators with show_name + title; fyyd/iTunes build that term at episodes/podcast_aggregators/fyyd.py:44 and episodes/podcast_aggregators/itunes.py:46. For the canonical ARD case, the issue says fyyd still returned the right candidate, so this PR should address the observed failure. For existing no-backfill rows in general, hostname pollution can still reduce index recall before the LLM ever gets candidates.

Test Coverage

Layer 2 has useful parser coverage for happy/missing/malformed cases in episodes/tests/test_podcast_aggregators.py:71, episodes/tests/test_podcast_aggregators.py:90, episodes/tests/test_podcast_aggregators.py:105, and equivalents for iTunes/podcastindex. Layer 1 has model/default and _show_name cascade coverage in episodes/tests/test_models.py:32 and episodes/tests/test_download.py:181.

Gaps I would add but would not block on runtime:

  • No direct fetch-details persistence test for show_name: set a prior value, run _apply_details/fetch_episode_details with blank details.show_name, and assert the prior value survives; run with whitespace-padded non-empty value and assert it is stripped.
  • No prompt rendering test for Published: YYYY-MM-DD vs (unknown) or the hostname instructions.
  • No timezone tests for iTunes/fyyd offset timestamps near a date boundary.

I attempted a focused local test run, but the local Postgres test DB disappeared mid-run (test_ragtime / episodes_episode missing). I did verify makemigrations --check --dry-run successfully.

Documentation

The PR includes the expected plan, feature doc, planning session, implementation session, and changelog entry. The PR body also has an actionable manual verification checklist for the canonical ARD URL.

Two documentation problems:

  1. doc/README.md is stale after this change. doc/README.md:39 lists the EpisodeDetails.details fields and omits show_name. doc/README.md:55 says Episode columns are overwritten directly by authoritative output with “no empty-field-only merge”; that is no longer true for show_name, which is explicitly additive-only. README.md's summary table is still broadly accurate, but the detailed docs should be updated.
  2. The session transcripts do not satisfy the AGENTS.md verbatim-user-message requirement. The planning transcript says its user messages are “summarized for brevity” at doc/sessions/2026-05-04-download-show-name-fix-planning-session.md:30 and doc/sessions/2026-05-04-download-show-name-fix-planning-session.md:68; the implementation transcript does the same at doc/sessions/2026-05-04-download-show-name-fix-implementation-session.md:26. The implementation note is honest, which is better than fabricating text, but the prompt for this review specifically asked to verify the parent prompt was reproduced verbatim rather than summarized. It was not.

Other Findings

No unused imports or obvious candidate-dropping regressions found in the touched code. The fyyd list comprehension still calls _candidate(item) twice at episodes/podcast_aggregators/fyyd.py:61, which predates or at least is adjacent to this change; now it also parses/logs twice for malformed dates. Not a blocker, but easy to clean up later.

Suggested follow-ups

  • Update doc/README.md for show_name and the additive-only exception.
  • Replace summarized transcript user sections with verbatim parent/user prompts where recoverable, or explicitly get maintainer acceptance for unavailable/summarized transcripts.
  • Add focused tests for show_name persistence semantics and prompt rendering.
  • Consider omitting hostname-shaped show_name from aggregator search terms in code, so existing no-backfill rows don't depend on index search tolerating hostnames.
  • Add show_name to the Episode admin metadata fieldset if manual/admin correction is intended.

— Codex review (independent second-opinion pass)

rafacm added a commit that referenced this pull request May 4, 2026
AGENTS.md's transcript policy assumed the human-with-Claude workflow
where every session has real `### User` messages. Conductor's parallel-
launch model breaks that assumption: the implementation session has no
direct user, only an agent-orchestrated launching prompt.

Add a new "Agent-orchestrated sessions" subsection to AGENTS.md and a
matching note to the Feature PR Documentation Bundle AI check. The new
convention: use `### Parent agent (orchestrator)` headings instead of
`### User`, reproduce the parent-agent prompt verbatim, and declare the
session as agent-orchestrated at the top of `## Detailed conversation`.

Same verbatim rule applies — summarized parent prompts are still
rejected. This eliminates a recurring false-negative on the docs check
that was raised on PRs #129, #130, #131, #133.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rafacm rafacm force-pushed the rafacm/download-show-name-fix branch from 5b4e033 to 6cc9d06 Compare May 4, 2026 13:46
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

AI Checks summary

3 fail · ✅ 4 pass · ⏭️ 0 skip

❌ Branching & PR Strategy

No direct commits to main. Feature branches off latest main. Rebase merge only — squash and merge-commit are forbidden on this repo.

PR is from main branch, violating workflow rules.

Details

The PR was created from branch main, which violates the rule stating that all new features and fixes must be implemented in a dedicated branch off main. This is seen in the use of an existing main branch, which is not allowed.

❌ Pipeline Step Documentation Sync

When the 10-step ingestion pipeline changes, README.md, doc/README.md, and diagrams must be updated in lockstep.

Documentation not updated to reflect pipeline changes.

Details

The diff adds a new field show_name to the Episode model in episodes/models.py and modifies the download process, but it does not update the documentation in README.md or doc/README.md to reflect these changes.

❌ RAGTIME_* Env Var Sync

New, renamed, or removed RAGTIME_* env vars must be reflected in .env.sample and the configure wizard.

The diff violates the RAGTIME_ Env Var Sync rule.*

Details

The PR introduces a new runtime configuration variable show_name in the Episode model and its related files without updating .env.sample or core/management/commands/configure.py to include this new environment variable, violating rule point 1.


Other checks (4 passing · 0 skipped)

Show details
  • Comment Discipline — Rule does not apply.
  • Entity Creation Race Safety — Rule does not apply.
  • Feature PR Documentation Bundle — Rule does not apply.
  • gh api Shell Escaping & Endpoints — Rule does not apply.

@rafacm
Copy link
Copy Markdown
Owner Author

rafacm commented May 4, 2026

Merging with AI Check: RAGTIME_* Env Var Sync failing — model hallucination

The verdict artifact says:

New environment variables were added but .env.sample and configure.py are not updated accordingly.
The code introduces the variable RAGTIME_SHOW_NAME, which is not referenced in .env.sample or core/management/commands/configure.py. The pull request omits necessary updates for new environment variables.

This is a hallucination. RAGTIME_SHOW_NAME does not exist anywhere in this branch:

$ git grep RAGTIME_SHOW_NAME origin/rafacm/download-show-name-fix
(no output)

$ git diff origin/main...origin/rafacm/download-show-name-fix | grep RAGTIME_
     @override_settings(RAGTIME_PODCAST_AGGREGATORS=\"\")

The only RAGTIME_* reference in the diff is an @override_settings on a pre-existing variable in a test — not a new env var. This PR adds Episode.show_name (a Django model field), not a configuration env var. The driver (gpt-4o-mini) appears to have confused the model field's name with an environment variable based on the PR title / branch name.

No env vars were added, removed, or renamed in this PR, so .env.sample and configure.py are correctly untouched per the rule's own intent.

All other signals pass: Unit Test Results (343 tests), all 6 other AI Checks, aggregate, list, and test. Merging via rebase strategy.

This is the third distinct false-positive pattern from the AI Checks driver in this PR batch (after #129's Branching & PR Strategy synthetic-merge-commit hallucination and #131's Comment Discipline diff-removed-line confusion). Driver hardening is tracked in #125.

Closes #111.

Three coordinated layers fix the canonical ARD Sounds episode failure
where the download agent rejected every fyyd candidate:

1. Real show_name source of truth. New Episode.show_name CharField
   (migration 0025), extracted by fetch_details from og:site_name,
   application-name, RSS <channel><title>, JSON-LD isPartOf.name, or
   the visible publisher heading. Persisted only when non-empty.
   downloader._show_name now prefers it, with URL host as a
   defense-in-depth fallback.

2. published_at as a tiebreaker. EpisodeCandidate.published_at
   parsed by each aggregator (fyyd pubdate, iTunes releaseDate,
   podcastindex datePublished epoch). Surfaced through DownloadDeps
   and IndexCandidate.

3. Hostname-aware download agent prompt. When show_name looks like
   a hostname (contains . and no spaces), the agent switches to a
   (title, published_at ±1 day) match instead of requiring an exact
   show_name string match. Real broadcast titles keep the existing
   path.

Tests: 384 passing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rafacm rafacm force-pushed the rafacm/download-show-name-fix branch from 6cc9d06 to 04f5ce4 Compare May 4, 2026 14:48
@rafacm
Copy link
Copy Markdown
Owner Author

rafacm commented May 4, 2026

Response to codex review (review 4219890521)

Mapping each follow-up to the current PR head (04f5ce4, rebased onto post-#131 main):

Suggested follow-up Status Detail
1. Update doc/README.md for show_name and the additive-only exception ✅ Done details field list now includes show_name with description (doc/README.md:39); the "no empty-field-only merge" sentence at doc/README.md:55 is qualified with the additive-only exception for show_name
2. Replace summarized transcript user sections with verbatim or explicitly-accepted summaries ✅ Done via the agent-orchestrated convention Both transcripts now use ### Parent agent (orchestrator) headings and the AGENTS.md-mandated declaration paragraph at the top of ## Detailed conversation (introduced in PR #136 for exactly this case). The verbatim parent prompt could not be recovered from session logs, so each transcript explicitly states *Verbatim launching prompt unavailable.* rather than fabricating one — same honesty rule as for missing user messages
3. Add focused tests for show_name persistence + prompt rendering + iTunes/fyyd offset-near-midnight ❌ Not addressed Codex itself flagged these as non-blocking. Tracking as follow-up rather than scope-creeping this PR
4. Omit hostname-shaped show_name from aggregator search terms in code ❌ Not addressed (deliberate) The prompt-side hostname detection in episodes/agents/download.py:62-69 is sufficient for the canonical ARD case this PR fixes (codex confirmed at episodes/podcast_aggregators/fyyd.py:44). For other hostname-polluted rows, this is a defense-in-depth code change worth doing — tracking as follow-up
5. Add show_name to the Episode admin metadata fieldset ✅ Done EpisodeAdmin.METADATA_FIELDS now includes "show_name" between "title" and "description" (episodes/admin.py:300). Existing fieldsets reference METADATA_FIELDS, so both display and editability propagate automatically

Other notes from the review

Manual ARD verification

In progress against the rebased branch. Will tick the PR-body checklist as items complete.

Known noise on this PR

@rafacm
Copy link
Copy Markdown
Owner Author

rafacm commented May 4, 2026

Three failing AI Checks — all model hallucinations (tracked in #137)

CI returned with three AI Check failures on the rebased branch (04f5ce4). All three are hallucinations on the model side; none reflect real issues in the PR.

Check Pattern Tracked
Branching & PR Strategy Synthetic refs/pull/N/merge two-parent commit on the runner — model concludes "branch not rebased" even though it is #137 pattern #1
Pipeline Step Documentation Sync Two-layer hallucination — claims doc/README.md isn't updated for show_name (it is, line 39 + line 55), and stretches "changing a pipeline step" to mean "any change touching a step's source file" (the steps list is unchanged; only behavior inside fetch_details changed) #137 pattern #4 (newly added)
RAGTIME_* Env Var Sync Invented the env var RAGTIME_SHOW_NAME from the PR title token show_name. No env var was added in this PR — Episode.show_name is a Django model field #137 pattern #3

Diff evidence for Pipeline Step Documentation Sync

--- a/doc/README.md
+++ b/doc/README.md
@@ -36,7 +36,7 @@
-- `details` — episode-level facts: `title`, `description`, `published_at`, ...
+- `details` — episode-level facts: `title`, `show_name` (broadcast / podcast title, ...), `description`, `published_at`, ...
@@ -52,7 +52,7 @@
- ... `Episode` columns are overwritten directly by the agent's authoritative output (no empty-field-only merge); a re-run via the admin `reprocess` action increments `run_index` and overwrites again.
+ ... `Episode` columns are overwritten directly by the agent's authoritative output (no empty-field-only merge); a re-run via the admin `reprocess` action increments `run_index` and overwrites again. The one exception is `show_name`, which is **additive only** — when a fresh run fails to extract a value, any previously-good `show_name` ... is preserved rather than cleared.

README.md is intentionally unchanged: its pipeline-step summary describes each step at the right level of abstraction ("🕷️ Fetch Details ... extracts metadata") and shouldn't enumerate every field. AGENTS.md scopes the rule to "adding, removing, or changing a pipeline step" — this PR doesn't change the pipeline; it changes behavior inside an existing step.

What's green

  • Unit Test Results (356 tests pass)
  • Comment Discipline, Entity Creation Race Safety, Feature PR Documentation Bundle, gh api Shell Escaping & Endpoints
  • aggregate, list, test

What's left before merging

  • ✅ Manual ARD verification: confirmed locally — Episode.show_name = "WDR Zeitzeichen" extracted from the canonical ARD URL, visible and editable in the Episode admin.
  • Merge via gh pr merge 133 --rebase.

Adding the fourth pattern's reproducer to #137; once that issue's driver-hardening lands, this whole class of false positives goes away.

@rafacm rafacm merged commit 14a6461 into main May 4, 2026
8 of 11 checks passed
@rafacm rafacm deleted the rafacm/download-show-name-fix branch May 4, 2026 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tighten the download agent's index-candidate matching heuristic for non-English shows

1 participant