fix(ci): detect npm publish failures, split releases per language, surface credential runbooks #30

Merged
konard merged 8 commits into main from
issue-29-7f70f0d87db9
May 3, 2026

konard commented May 3, 2026

Summary

Fixes #29 — the v0.3.3 release looked successful in CI but never landed on npm, and the
publishing workflows did not give operators enough information to recover from broken
credentials. Comparison with the four template repos (js-template, python-template,
rust-template, csharp-template) showed several features adopted in templates but
missing here: per-language tag prefixes, multi-layer publish-failure detection, and
in-log credential runbooks. This PR ports those patterns and adds end-to-end
verification.

This PR is CI/CD only: no source code or test files were changed. The existing test
suite (146 tests) still passes locally.

What was wrong

  1. False-positive npm publish. js/scripts/publish-to-npm.mjs reported
    ✅ Published my-package@0.3.3 to npm even though changeset publish actually emitted
    E404 Not Found - PUT https://registry.npmjs.org/lino-objects-codec. Two bugs combined:
    (a) the script hardcoded PACKAGE_NAME = 'my-package' so the existence probe queried
    the wrong artifact, and (b) npm run changeset:publish exits 0 even when individual
    packages fail, so trusting the exit code masked the real outcome. Live HTTP probes
    confirm npm has only 0.3.1; v0.3.2 and v0.3.3 return 404, yet their GitHub releases
    exist, advertising versions that were never published.
  2. Tag namespace collision. JS releases used the bare v<X.Y.Z> tag while Rust used
    rust-v<X.Y.Z> and C# used csharp-v<X.Y.Z>. Only one language can own v*.*.*
    (JS, by default), so the others were silently squeezed out.
  3. Unhelpful credential errors. When NUGET_API_KEY, CARGO_REGISTRY_TOKEN, or
    PyPI Trusted Publisher was missing/expired, CI emitted a single opaque line and
    exited. Operators had to read the case studies to figure out which page to open.
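
To make the collision in item 2 concrete, here is a minimal sketch of prefix-based tag construction. The helper names `buildTag` and `ownsTag` are hypothetical and are not taken from the repository's scripts; only the prefixes `v`, `js-v`, `rust-v`, and `csharp-v` come from this PR.

```javascript
// Hypothetical helpers for illustration; not the repository's actual code.

// Build a release tag from a version and a prefix, e.g. "js-v" + "0.3.3".
function buildTag(version, tagPrefix = 'v') {
  return `${tagPrefix}${version}`;
}

// True when `tag` is exactly `<tagPrefix><X.Y.Z>`.
function ownsTag(tag, tagPrefix) {
  if (!tag.startsWith(tagPrefix)) return false;
  return /^\d+\.\d+\.\d+$/.test(tag.slice(tagPrefix.length));
}

const tags = ['v0.3.3', 'js-v0.3.4', 'rust-v0.3.3', 'csharp-v0.3.3'];

// The bare "v" namespace holds a single tag with no language attached to it.
console.log(tags.filter((tag) => ownsTag(tag, 'v')));    // only 'v0.3.3'
// A per-language prefix makes ownership unambiguous.
console.log(tags.filter((tag) => ownsTag(tag, 'js-v'))); // only 'js-v0.3.4'
```

With a dedicated prefix per language, two languages can release the same version number without competing for one tag.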

Full investigation, evidence, run logs, and HTTP probes are in
docs/case-studies/issue-29/.

What changed

  • js/scripts/publish-to-npm.mjs — read name and version dynamically from
    ./package.json; capture stdout/stderr from npm run changeset:publish; scan for
    failure markers (packages failed to publish, error occurred while publishing,
    npm error 4xx, eneedauth, …); re-query the registry with npm view after
    publish to confirm the version actually landed; surface a credential-recovery
    runbook on auth failures. Pattern is ported from
    link-assistant/agent#116.
  • js/scripts/format-release-notes.mjs & create-manual-changeset.mjs — read
    package name from package.json instead of the template default 'my-package'.
  • js/scripts/create-github-release.mjs & format-github-release.mjs — accept
    --tag-prefix (default v); the JS workflow now passes --tag-prefix "js-v" so JS
    releases get their own tag namespace, matching csharp/python/rust which already had
    their prefixes.
  • .github/workflows/csharp.yml — when NUGET_API_KEY is missing OR NuGet rejects
    it (401/403/unauthorized/forbidden), emit ::error:: with links to
    https://www.nuget.org/account/apikeys and the secret-update path.
  • .github/workflows/rust.yml — same treatment for CARGO_REGISTRY_TOKEN
    (https://crates.io/me) on both auto-release and manual-release jobs.
  • .github/workflows/python.yml: pypa/gh-action-pypi-publish hides its own
    diagnostics, so add (1) a failure() follow-up step that prints the trusted-publisher
    configuration URL with this repo's owner/name pre-filled and (2) a post-publish
    verification loop against the PyPI JSON API to catch the same false-positive class
    as #29.
  • experiments/issue-29/test-failure-detection.mjs — regression test that drives
    the new detector against real captured CI fragments from run 25280681547.
  • docs/case-studies/issue-29/ — full case study with three CI run logs/JSON,
    HTTP probe evidence, timeline, root causes, and the solution plan.
  • js/.changeset/issue-29-publish-detection.md: patch changeset documenting
    the JS-side fixes for the next release.
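
The output-scanning layer described for publish-to-npm.mjs can be sketched roughly as follows. This is an illustration under assumptions, not the script's real code: `FAILURE_PATTERNS` is the name used in the commit message, but the regular expressions, `detectPublishFailure`, and `publishSucceeded` are hypothetical and only modelled on the failure markers listed above.

```javascript
// Hypothetical sketch; the patterns are modelled on the markers named in this PR,
// not copied from js/scripts/publish-to-npm.mjs.
const FAILURE_PATTERNS = [
  /packages failed to publish/i,
  /error occurred while publishing/i,
  /npm error code E\d{3}/i,
  /ENEEDAUTH/i,
];

// Return the first matching marker text, or null when the output looks clean.
function detectPublishFailure(output) {
  for (const pattern of FAILURE_PATTERNS) {
    const match = output.match(pattern);
    if (match) return match[0];
  }
  return null;
}

// An exit code of 0 is necessary but not sufficient: the captured output
// must also be free of failure markers.
function publishSucceeded(exitCode, output) {
  return exitCode === 0 && detectPublishFailure(output) === null;
}

// Fragment shaped like the failing log quoted in this PR.
const failedRun = [
  '🦋  error an error occurred while publishing lino-objects-codec: E404 Not Found',
  '🦋  error npm error code E404',
  '🦋  error packages failed to publish:',
  '🦋  error   - lino-objects-codec',
].join('\n');

console.log(detectPublishFailure(failedRun)); // 'packages failed to publish'
console.log(publishSucceeded(0, failedRun));  // false, despite exit code 0
```

Scanning output is only one layer; the registry re-query after publish remains the authoritative check, because a new failure message that matches no pattern would otherwise slip through.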

How to reproduce

The smoking gun is in docs/case-studies/issue-29/run-25280681547-js.log line ~4178:

🦋  error an error occurred while publishing lino-objects-codec: E404 Not Found - PUT https://registry.npmjs.org/lino-objects-codec
🦋  error npm error code E404
🦋  error packages failed to publish:
🦋  error   - lino-objects-codec
…
✅ Published my-package@0.3.3 to npm   <-- false positive emitted by the old script

Run the regression test against the live failure fragments:

node experiments/issue-29/test-failure-detection.mjs
# PASS real CI E404 output (run 25280681547) -> detected=packages failed to publish
# PASS clean success output                -> detected=null
# PASS credential failure E401             -> detected=npm error code E
# All 3 tests passed

Live registry probe (also in docs/case-studies/issue-29/registry-presence.txt):

GET https://registry.npmjs.org/lino-objects-codec/0.3.1 -> 200
GET https://registry.npmjs.org/lino-objects-codec/0.3.2 -> 404
GET https://registry.npmjs.org/lino-objects-codec/0.3.3 -> 404
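
The probes above could be reproduced with a short script along these lines. It is a sketch that assumes Node 18+ (for the global `fetch`) and an unscoped package name; `registryUrl`, `isPublished`, and `probe` are hypothetical names, and only the URL shape and the 200/404 outcomes come from the probe log.

```javascript
// Hypothetical helpers; the URL shape is taken from the probe log above.
// Assumes an unscoped package name such as "lino-objects-codec".
function registryUrl(name, version) {
  return `https://registry.npmjs.org/${name}/${version}`;
}

// Interpret a probe status: 200 means the version exists, 404 means it does not.
// Any other status is treated as inconclusive rather than as success.
function isPublished(status) {
  if (status === 200) return true;
  if (status === 404) return false;
  throw new Error(`Inconclusive registry response: HTTP ${status}`);
}

// Probe a single version. Needs network access and Node 18+ for global fetch.
async function probe(name, version) {
  const response = await fetch(registryUrl(name, version));
  return isPublished(response.status);
}

// Usage (left commented out because it requires the network):
// probe('lino-objects-codec', '0.3.3').then((ok) => console.log(ok));
```

Treating unexpected statuses as errors keeps a registry outage from being misread as either "published" or "missing".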

Test plan

  • node experiments/issue-29/test-failure-detection.mjs — 3/3 pass
  • cd js && npm test — 146 tests pass, 0 fail
  • cd js && npm run lint — only pre-existing src/ complexity warnings remain (no errors)
  • cd js && npm run format:check — clean
  • cd js && node scripts/validate-changeset.mjs — passes
  • node --check on every modified .mjs
  • python3 -c "yaml.safe_load(...)" on all four workflow files
  • CI passes on this branch (push triggered, awaiting result)
  • After merge: next JS release should produce js-v<version> tag and a green
    ✅ Verified … on npm line; if npm publish actually fails, the workflow should
    now fail loudly with a credential runbook in the log.
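
As a rough picture of what "fail loudly with a credential runbook" means, the sketch below classifies a credential failure and formats GitHub Actions `::error::` lines. The real workflows do this in YAML steps; this JavaScript version is only an illustration, and `RUNBOOKS`, `isCredentialFailure`, `runbookLines`, and the message wording are hypothetical. The secret names and the two URLs are the ones listed in this PR.

```javascript
// Hypothetical sketch of an in-log credential runbook; names are illustrative.
const RUNBOOKS = {
  NUGET_API_KEY: 'https://www.nuget.org/account/apikeys',
  CARGO_REGISTRY_TOKEN: 'https://crates.io/me',
};

// A missing secret or a 401/403-style rejection counts as a credential failure.
function isCredentialFailure(secretValue, output) {
  if (!secretValue) return true;
  return /\b(401|403)\b|unauthorized|forbidden/i.test(output);
}

// Build `::error::` workflow-command lines that point the operator at the fix.
function runbookLines(secretName) {
  const url = RUNBOOKS[secretName];
  if (!url) throw new Error(`No runbook for ${secretName}`);
  return [
    `::error::${secretName} is missing or was rejected by the registry.`,
    `::error::Create a new token at ${url}`,
    `::error::Then update the ${secretName} repository secret.`,
  ];
}

// Made-up sample output, not a real registry log line.
const sampleOutput = 'push failed: HTTP 403 Forbidden';
if (isCredentialFailure('some-token', sampleOutput)) {
  runbookLines('NUGET_API_KEY').forEach((line) => console.log(line));
}
```

Each line is emitted as its own `::error::` command so that it shows up as a separate annotation in the Actions log.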

Out of scope (deliberately not in this PR)

  • Bumping JS to 0.3.4 and re-publishing 0.3.3 → the changeset added here will do that
    on the next merge to main.
  • Reverting the false-positive GitHub releases v0.3.2 and v0.3.3 — those exist on
    GitHub but not on npm; deletion is operator-driven.
  • Changing source code in any language — the issue called out CI/CD only.

Adding .gitkeep for PR creation (default mode).
This file will be removed when the task is complete.

Issue: #29
konard self-assigned this May 3, 2026
konard added 6 commits May 3, 2026 17:10
…ng per-language releases, unhelpful credential errors

Captures live registry probes, three failed CI run logs, and JSON metadata
proving:
- npm v0.3.3 reported "published" but registry has only v0.3.1
- changeset publish exits 0 even when npm publish fails internally
- js/scripts/publish-to-npm.mjs hardcodes PACKAGE_NAME='my-package'
- JS releases tagged v* without language prefix, colliding with rust-v*/csharp-v*
- C#/Python/Rust credential guards print one-liner errors with no recovery steps

Reference: docs/case-studies/issue-25/README.md (predecessor work)

…cally

The v0.3.3 release looked successful in CI but the package was never
published to npm. Two bugs combined to produce that false positive:

1. publish-to-npm.mjs hardcoded the package name as 'my-package' and
   pulled the version from a stale source, so the post-publish
   verification queried the wrong artifact.
2. `npm run changeset:publish` exits 0 even when individual packages
   fail; the script trusted the exit code and never inspected the
   "packages failed to publish" / "error occurred while publishing"
   markers that @changesets/cli prints in that case.

This commit ports the multi-layer detection pattern from
link-assistant/agent PR #116 and reads name+version from
./package.json so the verifier always queries the real artifact.

- publish-to-npm.mjs: capture changeset stdout/stderr, scan for
  failure markers (FAILURE_PATTERNS), re-query the registry with
  `npm view`, and surface a credential runbook on auth failures.
- format-release-notes.mjs, create-manual-changeset.mjs: read package
  name from package.json instead of hardcoded 'my-package'.
- experiments/issue-29/test-failure-detection.mjs: regression test
  driving the new detector against real captured CI fragments.

Refs: docs/case-studies/issue-29/README.md, #29

The csharp, python, and rust workflows already pass `--tag-prefix
<lang>-v` so each language ships its own GitHub release. The JS
workflow used the bare `v<version>` tag, which caused the JS release
to share a tag namespace with whatever other release happened to use
the same version — exactly the conflation issue called out in #29.

- create-github-release.mjs / format-github-release.mjs: accept
  --tag-prefix (default "v"), build the tag as `${tagPrefix}${version}`
  and use that for both the tag name and the human-readable release
  title (matches js-template's pattern).
- .github/workflows/js.yml: pass --tag-prefix "js-v" to both the
  release and instant-release jobs.

Refs: docs/case-studies/issue-29/README.md, #29

When a publish step failed because a token was missing, expired, or a
PyPI Trusted Publisher hadn't been configured, CI emitted a single
opaque line and exited. Operators were left to dig through the case
studies to figure out what to do. This commit adds inline runbooks so
the GitHub Actions log itself tells you exactly which page to open and
which secret to set.

- csharp.yml: when NUGET_API_KEY is missing OR NuGet rejects it
  (401/403/unauthorized/forbidden), emit ::error directives with
  links to https://www.nuget.org/account/apikeys and the secret path.
  Wrap `dotnet nuget push` so we can scan its stderr.
- rust.yml: same treatment for CARGO_REGISTRY_TOKEN
  (https://crates.io/me) on both auto-release and manual-release jobs.
- python.yml: pypa/gh-action-pypi-publish hides its own diagnostics, so
  add (1) a `failure()` follow-up step that prints the trusted-publisher
  configuration URL with this repo's owner/name pre-filled, and (2) a
  post-publish `npm view`-style verification loop that retries against
  the PyPI JSON API to catch the same false-positive class as #29.

Refs: docs/case-studies/issue-29/README.md, #29

The publish script and the GitHub-release tag-prefix changes ship as
part of the JS package's CI tooling, so bump to patch on next release.
konard changed the title from "[WIP] Fix all CI/CD errors" to "fix(ci): detect npm publish failures, split releases per language, surface credential runbooks" May 3, 2026
konard marked this pull request as ready for review May 3, 2026 17:25
konard mentioned this pull request May 3, 2026
konard commented May 3, 2026

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost: $11.456119

📊 Context and tokens usage:

Claude Opus 4.7: (4 sub-sessions)

  1. 115.3K / 1M (12%) input tokens, 5.7K / 128K (4%) output tokens
  2. 116.6K / 1M (12%) input tokens, 23.5K / 128K (18%) output tokens
  3. 117.2K / 1M (12%) input tokens, 28.5K / 128K (22%) output tokens
  4. 40.6K / 1M (4%) input tokens, 2.9K / 128K (2%) output tokens

Total: (10.8K new + 350.9K cache writes + 14.4M cache reads) input tokens, 80.9K output tokens, $11.456119 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.7 (claude-opus-4-7)

📎 Log file uploaded as Gist (4813KB)


The working session has now ended; feel free to review and add any feedback on the solution draft.

konard commented May 3, 2026

✅ Ready to merge

This pull request is now ready to be merged:

  • All CI checks have passed
  • No merge conflicts
  • No pending changes

Monitored by hive-mind with --auto-restart-until-mergeable flag

konard merged commit ddbf255 into main May 3, 2026
35 checks passed