Skip to content

Comments

RFC: Initialize vTPM with correct NVRAM size#2819

Draft
stunes-ms wants to merge 12 commits intomicrosoft:mainfrom
stunes-ms:user/mikestunes/service-tpm-size
Draft

RFC: Initialize vTPM with correct NVRAM size#2819
stunes-ms wants to merge 12 commits intomicrosoft:mainfrom
stunes-ms:user/mikestunes/service-tpm-size

Conversation

@stunes-ms
Copy link
Contributor

OpenHCL has a bug where the vTPM can try to write past the end of its NVRAM after OpenHCL servicing (the old TpmFail bug). The TPM's serialized state does not contain the NVRAM size, so after servicing, the TPM core thinks it has 32kB available regardless of the actual size.

OpenHCL creates the TPM device by creating a new TPM engine with a blank state, then reinitializing it with the saved state from the VMGS. This reinit makes it see the correct NVRAM size. In servicing, though, we skip that reinitialization and instead restore the serialized state. The TPM will have the size it was first initialized with, which is 32kB by default.

This is a proposed fix:

  • When creating the TPM device, get the correct NVRAM size from the VMGS.
  • Create the initial empty TPM with the correct size, instead of the default 32kB.

This fix depends on an ms-tpm-20-ref-rs change: microsoft/ms-tpm-20-ref-rs@32a4da7

The downside of this approach is that we now access the VMGS during servicing init. One way around that would be to save the NVRAM size in the OpenHCL servicing save state, but that doesn't fix the problem for existing VMs. AFAICT, the NVRAM size isn't available anywhere else when the TPM device is created.

The best solution is to fix this in the TPM core code by adding the NVRAM size to its serialized state. However, that would change the format of its save state, which would break UH's ability to live-service the TPM core. Doing this fix correctly would probably require adding versioning to the TPM core save-state, and also figuring out how to handle a serialized TPM state that doesn't have the NVRAM size in the TPM core.

@stunes-ms stunes-ms requested a review from a team as a code owner February 19, 2026 21:59
Copilot AI review requested due to automatic review settings February 19, 2026 21:59
@stunes-ms stunes-ms requested a review from a team as a code owner February 19, 2026 21:59
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This RFC proposes a fix for a bug in OpenHCL where the vTPM incorrectly uses a default 32KB NVRAM size after servicing, even when the actual NVRAM is a different size (e.g., 16KB). The issue occurs because the TPM's serialized state doesn't include the NVRAM size, and during servicing restore (when is_restoring: true), the TPM device skips the on_first_boot() initialization that would normally load the saved NVRAM and detect the correct size.

Changes:

  • Modified TPM device initialization to read the existing NVRAM blob before creating the TPM engine and use its size for initialization
  • Updated the dependency to point to a fork of ms-tpm-20-ref-rs that supports InitKind::ColdInitWithSize
  • Added the Cargo.lock changes reflecting the dependency update

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.

File Description
vm/devices/tpm/tpm_device/src/lib.rs Reads NVRAM blob size during TPM initialization and uses it to create the TPM engine with the correct size instead of the default 32KB
Cargo.toml Updates ms-tpm-20-ref dependency to point to a personal fork with the ColdInitWithSize feature
Cargo.lock Reflects the dependency change, including transitive updates to windows-sys versions

Comment on lines 401 to 403
let blob_size = existing_nvmem_blob
.map(|v| v.len())
.unwrap_or(ms_tpm_20_ref::NV_MEMORY_SIZE);
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blob size is determined directly from the stored NVRAM data without validation. While the expected sizes are 16KB (LEGACY_VTPM_SIZE) or 32KB (STANDARD_VTPM_SIZE), this code will accept any blob size. Consider adding validation to ensure the blob_size is within expected bounds (e.g., at least a minimum size and not exceeding a maximum reasonable size) to prevent issues if the NVRAM store is corrupted or contains unexpected data. This is especially important in the servicing restore path where on_first_boot() is skipped and the blob won't be validated by reset().

Copilot uses AI. Check for mistakes.
Comment on lines 397 to 412
let existing_nvmem_blob = nvram_store
.restore()
.await
.map_err(TpmErrorKind::ReadNvramState)?;
let blob_size = existing_nvmem_blob
.map(|v| v.len())
.unwrap_or(ms_tpm_20_ref::NV_MEMORY_SIZE);

let tpm_engine = MsTpm20RefPlatform::initialize(
Box::new(TpmPlatformCallbacks {
pending_nvram: pending_nvram.clone(),
monotonic_timer,
}),
ms_tpm_20_ref::InitKind::ColdInit,
ms_tpm_20_ref::InitKind::ColdInitWithSize(blob_size),
)
.map_err(TpmErrorKind::InstantiateTpm)?;
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces logic to initialize the TPM with the correct NVRAM size from the stored blob, which is critical for the servicing restore path. However, there's no test coverage for the servicing scenario where is_restoring is true. Consider adding a test that verifies:

  1. A TPM can be created with is_restoring: true and an existing NVRAM blob
  2. The TPM correctly uses the blob's size instead of the default 32KB
  3. The TPM can be successfully restored from servicing state without encountering size mismatch errors

This would help ensure the fix addresses the TpmFail bug in servicing scenarios.

Copilot uses AI. Check for mistakes.
Comment on lines 397 to 403
let existing_nvmem_blob = nvram_store
.restore()
.await
.map_err(TpmErrorKind::ReadNvramState)?;
let blob_size = existing_nvmem_blob
.map(|v| v.len())
.unwrap_or(ms_tpm_20_ref::NV_MEMORY_SIZE);
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces a second call to nvram_store.restore() during TPM initialization. The first call occurs here to determine the blob size, and a second call happens in on_first_boot() (when is_restoring is false). While this is safe according to the NonVolatileStore trait semantics, it may have performance implications depending on the store implementation, especially if the NVRAM store involves expensive I/O operations (e.g., reading from VMGS). Consider whether the blob could be cached after the first read to avoid the duplicate I/O, or add a comment explaining why the double read is acceptable.

Copilot uses AI. Check for mistakes.
macaddr = "1.0"
mimalloc = { version = "0.1.39", default-features = false }
ms-tpm-20-ref = { version = "0.1", git = "https://github.com/microsoft/ms-tpm-20-ref-rs.git", branch = "main" }
ms-tpm-20-ref = { version = "0.1", git = "https://github.com/stunes-ms/ms-tpm-20-ref-rs.git", branch = "user/mikestunes/init-size" }
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR points to a personal fork of ms-tpm-20-ref-rs on a user branch rather than the official Microsoft repository. This is appropriate for an RFC but should NOT be merged until the upstream changes are merged to the main branch of the official repository. The dependency change introduces a risk of using unvetted code and makes the build non-reproducible for others. Please ensure the upstream PR is merged and update this dependency to point to the official repository before merging.

Suggested change
ms-tpm-20-ref = { version = "0.1", git = "https://github.com/stunes-ms/ms-tpm-20-ref-rs.git", branch = "user/mikestunes/init-size" }
ms-tpm-20-ref = { version = "0.1", git = "https://github.com/microsoft/ms-tpm-20-ref-rs.git" }

Copilot uses AI. Check for mistakes.
@stunes-ms stunes-ms force-pushed the user/mikestunes/service-tpm-size branch from abdcb38 to d6a897a Compare February 20, 2026 01:18
@github-actions
Copy link

@stunes-ms stunes-ms marked this pull request as draft February 20, 2026 19:49
struct TpmPlatformCallbacks {
pending_nvram: Arc<Mutex<Vec<u8>>>,
monotonic_timer: MonotonicTimer,
monotonic_timer: Arc<Mutex<MonotonicTimer>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this become an arc mutex?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, for the unknown case?

};
self.requested_locality = requested_locality;

// TODO: UNKNOWN: if the NVRAM sizes in the save state and the VMGS don't match,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what case is this even possible? It seems like this is fatal or we migrated in a way that's not really expected, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that's right -- if we get here, we screwed up somewhere, somehow. Maybe I'll just crash here. (That'll let me undo the Arc also.)

@mebersol
Copy link
Collaborator

can we add a servicing test here?

dependencies = [
"libc",
"windows-sys 0.60.2",
"windows-sys 0.59.0",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"windows-sys 0.59.0",

rebase needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deps need to be cleaned up altogether -- I'll take care of this after microsoft/ms-tpm-20-ref-rs#9 merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants