feat: skip cloud-init ready report and add standalone report_ready script#8056
feat: skip cloud-init ready report and add standalone report_ready script#8056awesomenix wants to merge 3 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a VHD-baked mechanism to suppress cloud-init’s built-in “ready” report and replaces it with an explicit, CSE-invoked readiness/failure report to Azure wireserver.
Changes:
- Add a standalone
report_ready.pyscript that writes Hyper-V KVP provisioning status and POSTs health to wireserver. - Bake
report_ready.pyinto multiple VHD build variants and copy it to/opt/azure/containers/. - Add
skipCloudInitReadyReportto write cloud-init config, and invoke the reporting script fromcse_start.sh.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| vhdbuilder/packer/vhd-image-builder-mariner.json | Adds report_ready.py to files uploaded into the build VM. |
| vhdbuilder/packer/vhd-image-builder-mariner-cvm.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-mariner-arm64.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-flatcar.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-cvm.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-base.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-arm64-gen2.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-acl.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/packer_source.sh | Copies report_ready.py into /opt/azure/containers/ with execute perms. |
| vhdbuilder/packer/install-dependencies.sh | Invokes skipCloudInitReadyReport during VHD build. |
| vhdbuilder/packer/imagecustomizer/azlosguard/azlosguard.yml | Ensures OSGuard imagecustomizer also places report_ready.py into /opt/azure/containers/. |
| parts/linux/cloud-init/artifacts/report_ready.py | New standalone readiness/failure reporting implementation. |
| parts/linux/cloud-init/artifacts/cse_start.sh | Calls report_ready.py on provisioning success/failure if present on the VHD. |
| parts/linux/cloud-init/artifacts/cse_install.sh | Adds skipCloudInitReadyReport() helper that writes cloud-init config to skip built-in ready report. |
| addMarinerNvidiaRepo | ||
| updateDnfWithNvidiaPkg | ||
| overrideNetworkConfig || exit 1 | ||
| skipCloudInitReadyReport || exit 1 |
There was a problem hiding this comment.
skipCloudInitReadyReport is already invoked unconditionally earlier in this script, so calling it again in the Mariner/AzureLinux block is redundant and makes the flow harder to reason about. Consider removing this second call to keep the configuration applied in a single place.
| skipCloudInitReadyReport || exit 1 |
| if [ -x /opt/azure/containers/report_ready.py ]; then | ||
| if [ "$EXIT_CODE" -eq 0 ]; then | ||
| python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric" | ||
| else | ||
| python3 /opt/azure/containers/report_ready.py -v --failure --description "ExitCode: ${EXIT_CODE}, ${message_string}" || echo "WARNING: Failed to report failure to Azure fabric" | ||
| fi |
There was a problem hiding this comment.
This report_ready.py invocation runs synchronously before log upload/exit, and can block provisioning for up to ~100s on wireserver timeouts/retries (GET/POST timeouts are 30s with multiple retries). If this is intended to be best-effort (as suggested by || echo "WARNING"), consider running it in the background on success and/or moving it after upload_logs (especially on failure) or passing tighter retry/timeout settings to avoid delaying provisioning and log upload.
| if [ -x /opt/azure/containers/report_ready.py ]; then | ||
| if [ "$EXIT_CODE" -eq 0 ]; then | ||
| python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric" | ||
| else | ||
| python3 /opt/azure/containers/report_ready.py -v --failure --description "ExitCode: ${EXIT_CODE}, ${message_string}" || echo "WARNING: Failed to report failure to Azure fabric" | ||
| fi |
There was a problem hiding this comment.
This change updates cloud-init/CSE scripts under parts/, which are covered by snapshot-style golden tests in pkg/agent/testdata/* (e.g., baker_test.go reads ./testdata/<folder>/CustomData). Please run make generate (or regenerate the testdata via the repo’s standard workflow) and include the updated golden files in this PR; otherwise CI is likely to fail due to mismatched expected CustomData/CSE outputs.
89a690e to
ed3d302
Compare
vhdbuilder/packer/packer_source.sh
Outdated
| cpAndMode $CLOUD_INIT_BASE_PKG_SRC $CLOUD_INIT_BASE_PKG_DEST 0644 | ||
|
|
||
| CLOUD_INIT_AZURE_PKG_SRC=/home/packer/cloud-init-packages/cloud-init-azure_all.deb | ||
| CLOUD_INIT_AZURE_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init-azure_all.deb | ||
| cpAndMode $CLOUD_INIT_AZURE_PKG_SRC $CLOUD_INIT_AZURE_PKG_DEST 0644 | ||
|
|
||
| CLOUD_INIT_PKG_SRC=/home/packer/cloud-init-packages/cloud-init_all.deb | ||
| CLOUD_INIT_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init_all.deb | ||
| cpAndMode $CLOUD_INIT_PKG_SRC $CLOUD_INIT_PKG_DEST 0644 | ||
|
|
There was a problem hiding this comment.
copyPackerFiles() now unconditionally copies Ubuntu-only cloud-init .deb artifacts from /home/packer/cloud-init-packages into /opt/azure/cloud-init-packages. Those source files are only provisioned in vhd-image-builder-base.json, so Mariner/Flatcar/ACL packer templates will not have them and the VHD build will fail when copyPackerFiles runs. Guard these cpAndMode calls behind an Ubuntu check (and/or a file existence check) so non-Ubuntu builds don’t require these artifacts.
| cpAndMode $CLOUD_INIT_BASE_PKG_SRC $CLOUD_INIT_BASE_PKG_DEST 0644 | |
| CLOUD_INIT_AZURE_PKG_SRC=/home/packer/cloud-init-packages/cloud-init-azure_all.deb | |
| CLOUD_INIT_AZURE_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init-azure_all.deb | |
| cpAndMode $CLOUD_INIT_AZURE_PKG_SRC $CLOUD_INIT_AZURE_PKG_DEST 0644 | |
| CLOUD_INIT_PKG_SRC=/home/packer/cloud-init-packages/cloud-init_all.deb | |
| CLOUD_INIT_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init_all.deb | |
| cpAndMode $CLOUD_INIT_PKG_SRC $CLOUD_INIT_PKG_DEST 0644 | |
| CLOUD_INIT_AZURE_PKG_SRC=/home/packer/cloud-init-packages/cloud-init-azure_all.deb | |
| CLOUD_INIT_AZURE_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init-azure_all.deb | |
| CLOUD_INIT_PKG_SRC=/home/packer/cloud-init-packages/cloud-init_all.deb | |
| CLOUD_INIT_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init_all.deb | |
| # Ubuntu-only cloud-init .deb artifacts may not be present for non-Ubuntu images. | |
| # Guard these copies so Mariner/Flatcar/ACL builds do not fail when the source | |
| # directory or files are missing. | |
| if [ -d "/home/packer/cloud-init-packages" ]; then | |
| cpAndMode $CLOUD_INIT_BASE_PKG_SRC $CLOUD_INIT_BASE_PKG_DEST 0644 | |
| cpAndMode $CLOUD_INIT_AZURE_PKG_SRC $CLOUD_INIT_AZURE_PKG_DEST 0644 | |
| cpAndMode $CLOUD_INIT_PKG_SRC $CLOUD_INIT_PKG_DEST 0644 | |
| fi |
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init-base_all.deb", | ||
| "destination": "/home/packer/cloud-init-packages/cloud-init-base_all.deb" | ||
| }, | ||
| { | ||
| "type": "file", | ||
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init-azure_all.deb", | ||
| "destination": "/home/packer/cloud-init-packages/cloud-init-azure_all.deb" | ||
| }, | ||
| { | ||
| "type": "file", | ||
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init_all.deb", | ||
| "destination": "/home/packer/cloud-init-packages/cloud-init_all.deb" | ||
| }, | ||
| { | ||
| "type": "file", |
There was a problem hiding this comment.
This adds prebuilt cloud-init .deb artifacts into the repo and bakes them into the VHD. This creates a hard-to-audit dependency surface (no explicit version metadata/checksums, no provenance, and manual updates), and it increases repo/VHD build maintenance burden. Prefer downloading versioned packages during the VHD build from an approved source (and tracking the version via components.json/renovate where applicable), or at least include explicit versioning + integrity verification for these artifacts.
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init-base_all.deb", | |
| "destination": "/home/packer/cloud-init-packages/cloud-init-base_all.deb" | |
| }, | |
| { | |
| "type": "file", | |
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init-azure_all.deb", | |
| "destination": "/home/packer/cloud-init-packages/cloud-init-azure_all.deb" | |
| }, | |
| { | |
| "type": "file", | |
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init_all.deb", | |
| "destination": "/home/packer/cloud-init-packages/cloud-init_all.deb" | |
| }, | |
| { | |
| "type": "file", |
14bad23 to
e1d2a33
Compare
e1d2a33 to
bb6f742
Compare
|
|
||
| if [ -x /opt/azure/containers/report_ready.py ]; then | ||
| if [ "$EXIT_CODE" -eq 0 ]; then | ||
| python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric" |
There was a problem hiding this comment.
Running report_ready.py synchronously here can add noticeable tail latency to successful provisioning (each attempt can block up to the HTTP timeout(s) plus retry delay). Consider running the success-path report in the background (similar to log upload) and/or tightening the per-request timeouts so a transient wireserver issue doesn't extend CSE completion by ~minutes.
| python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric" | |
| python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric" & |
| def write_provisioning_kvp(report: str) -> None: | ||
| """Write a provisioning report to the KVP pool file.""" | ||
| if len(report) >= KVP_AZURE_MAX_VALUE_SIZE: | ||
| report = report[:KVP_AZURE_MAX_VALUE_SIZE - 1] |
There was a problem hiding this comment.
KVP_AZURE_MAX_VALUE_SIZE is a byte limit, but the current truncation uses Python string length/slicing, which can still exceed the byte limit for non-ASCII input (e.g., descriptions containing UTF-8 multibyte chars). Truncate based on the encoded UTF-8 byte length before packing the record to avoid producing oversized KVP values.
| def write_provisioning_kvp(report: str) -> None: | |
| """Write a provisioning report to the KVP pool file.""" | |
| if len(report) >= KVP_AZURE_MAX_VALUE_SIZE: | |
| report = report[:KVP_AZURE_MAX_VALUE_SIZE - 1] | |
| def _truncate_utf8_to_max_bytes(text: str, max_bytes: int) -> str: | |
| """Truncate a string so its UTF-8 encoding is at most max_bytes - 1 bytes.""" | |
| if max_bytes <= 0: | |
| return "" | |
| limit = max_bytes - 1 | |
| encoded = text.encode("utf-8") | |
| if len(encoded) <= limit: | |
| return text | |
| truncated = encoded[:limit] | |
| # Ignore incomplete multibyte sequences at the end to keep valid UTF-8. | |
| return truncated.decode("utf-8", errors="ignore") | |
| def write_provisioning_kvp(report: str) -> None: | |
| """Write a provisioning report to the KVP pool file.""" | |
| report = _truncate_utf8_to_max_bytes(report, KVP_AZURE_MAX_VALUE_SIZE) |
| retry_delay=args.retry_delay, | ||
| ) | ||
| return 0 | ||
| except RuntimeError: |
There was a problem hiding this comment.
main() only catches RuntimeError; other unexpected exceptions (e.g., parsing issues, subprocess errors) will emit a full traceback. Since this is invoked from provisioning, consider catching Exception at the top-level and returning a non-zero exit with a concise logged error message to keep logs readable while still signaling failure.
| except RuntimeError: | |
| except Exception as exc: | |
| LOG.error("report_ready failed: %s", exc) |
…ript Bake experimental_skip_ready_report into VHD via cloud.cfg.d to skip cloud-init's built-in health ready report to Azure fabric. Add a standalone Python script (report_ready.py) that can be invoked from CSE to report ready at the appropriate time during node provisioning. Depends on canonical/cloud-init#6771. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bb6f742 to
fb7f2c9
Compare
Copy upstream cloud-init DataSourceAzure.py into Ubuntu VHD builds so experimental_skip_ready_report is honored during provisioning. Clean up the staged file after installation to avoid leaving an extra artifact on the image. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
fb7f2c9 to
c40dacc
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 48 out of 92 changed files in this pull request and generated no new comments.
You can also share your feedback on Copilot code review. Take the survey.
Bake experimental_skip_ready_report into VHD via cloud.cfg.d to skip
cloud-init's built-in health ready report to Azure fabric. Add a
standalone Python script (report_ready.py) that can be invoked from
CSE to report ready at the appropriate time during node provisioning.
Depends on canonical/cloud-init#6771.
I tried importing the cloud init library directly and copilot suggested this
The problems:
more. It works on the VM (cloud-init's deps are installed), but it's a large dependency surface.
construct one or monkey-patch it.
instantiated_handler_registry.registered_items["telemetry"], which is only populated during cloud-init's normal boot. Outside of cloud-init's lifecycle, this
returns None and silently skips KVP reporting.
cloud-init versions.
A middle ground: you could import just the KVP handler class directly and the low-level wireserver pieces, bypassing the high-level functions:
from cloudinit.reporting.handlers import HyperVKvpReportingHandler
handler = HyperVKvpReportingHandler()
handler.write_key("PROVISIONING_REPORT", report_string)
But the wireserver reporting (GoalState + health POST) has deep entanglement with url_helper, distro objects, and telemetry decorators that make it hard to
call standalone.