RFC: Initialize vTPM with correct NVRAM size#2819
RFC: Initialize vTPM with correct NVRAM size#2819stunes-ms wants to merge 12 commits intomicrosoft:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This RFC proposes a fix for a bug in OpenHCL where the vTPM incorrectly uses a default 32KB NVRAM size after servicing, even when the actual NVRAM is a different size (e.g., 16KB). The issue occurs because the TPM's serialized state doesn't include the NVRAM size, and during servicing restore (when is_restoring: true), the TPM device skips the on_first_boot() initialization that would normally load the saved NVRAM and detect the correct size.
Changes:
- Modified TPM device initialization to read the existing NVRAM blob before creating the TPM engine and use its size for initialization
- Updated the dependency to point to a fork of ms-tpm-20-ref-rs that supports
InitKind::ColdInitWithSize - Added the Cargo.lock changes reflecting the dependency update
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vm/devices/tpm/tpm_device/src/lib.rs | Reads NVRAM blob size during TPM initialization and uses it to create the TPM engine with the correct size instead of the default 32KB |
| Cargo.toml | Updates ms-tpm-20-ref dependency to point to a personal fork with the ColdInitWithSize feature |
| Cargo.lock | Reflects the dependency change, including transitive updates to windows-sys versions |
vm/devices/tpm/tpm_device/src/lib.rs
Outdated
| let blob_size = existing_nvmem_blob | ||
| .map(|v| v.len()) | ||
| .unwrap_or(ms_tpm_20_ref::NV_MEMORY_SIZE); |
There was a problem hiding this comment.
The blob size is determined directly from the stored NVRAM data without validation. While the expected sizes are 16KB (LEGACY_VTPM_SIZE) or 32KB (STANDARD_VTPM_SIZE), this code will accept any blob size. Consider adding validation to ensure the blob_size is within expected bounds (e.g., at least a minimum size and not exceeding a maximum reasonable size) to prevent issues if the NVRAM store is corrupted or contains unexpected data. This is especially important in the servicing restore path where on_first_boot() is skipped and the blob won't be validated by reset().
vm/devices/tpm/tpm_device/src/lib.rs
Outdated
| let existing_nvmem_blob = nvram_store | ||
| .restore() | ||
| .await | ||
| .map_err(TpmErrorKind::ReadNvramState)?; | ||
| let blob_size = existing_nvmem_blob | ||
| .map(|v| v.len()) | ||
| .unwrap_or(ms_tpm_20_ref::NV_MEMORY_SIZE); | ||
|
|
||
| let tpm_engine = MsTpm20RefPlatform::initialize( | ||
| Box::new(TpmPlatformCallbacks { | ||
| pending_nvram: pending_nvram.clone(), | ||
| monotonic_timer, | ||
| }), | ||
| ms_tpm_20_ref::InitKind::ColdInit, | ||
| ms_tpm_20_ref::InitKind::ColdInitWithSize(blob_size), | ||
| ) | ||
| .map_err(TpmErrorKind::InstantiateTpm)?; |
There was a problem hiding this comment.
This change introduces logic to initialize the TPM with the correct NVRAM size from the stored blob, which is critical for the servicing restore path. However, there's no test coverage for the servicing scenario where is_restoring is true. Consider adding a test that verifies:
- A TPM can be created with
is_restoring: trueand an existing NVRAM blob - The TPM correctly uses the blob's size instead of the default 32KB
- The TPM can be successfully restored from servicing state without encountering size mismatch errors
This would help ensure the fix addresses the TpmFail bug in servicing scenarios.
vm/devices/tpm/tpm_device/src/lib.rs
Outdated
| let existing_nvmem_blob = nvram_store | ||
| .restore() | ||
| .await | ||
| .map_err(TpmErrorKind::ReadNvramState)?; | ||
| let blob_size = existing_nvmem_blob | ||
| .map(|v| v.len()) | ||
| .unwrap_or(ms_tpm_20_ref::NV_MEMORY_SIZE); |
There was a problem hiding this comment.
This PR introduces a second call to nvram_store.restore() during TPM initialization. The first call occurs here to determine the blob size, and a second call happens in on_first_boot() (when is_restoring is false). While this is safe according to the NonVolatileStore trait semantics, it may have performance implications depending on the store implementation, especially if the NVRAM store involves expensive I/O operations (e.g., reading from VMGS). Consider whether the blob could be cached after the first read to avoid the duplicate I/O, or add a comment explaining why the double read is acceptable.
| macaddr = "1.0" | ||
| mimalloc = { version = "0.1.39", default-features = false } | ||
| ms-tpm-20-ref = { version = "0.1", git = "https://github.com/microsoft/ms-tpm-20-ref-rs.git", branch = "main" } | ||
| ms-tpm-20-ref = { version = "0.1", git = "https://github.com/stunes-ms/ms-tpm-20-ref-rs.git", branch = "user/mikestunes/init-size" } |
There was a problem hiding this comment.
This PR points to a personal fork of ms-tpm-20-ref-rs on a user branch rather than the official Microsoft repository. This is appropriate for an RFC but should NOT be merged until the upstream changes are merged to the main branch of the official repository. The dependency change introduces a risk of using unvetted code and makes the build non-reproducible for others. Please ensure the upstream PR is merged and update this dependency to point to the official repository before merging.
| ms-tpm-20-ref = { version = "0.1", git = "https://github.com/stunes-ms/ms-tpm-20-ref-rs.git", branch = "user/mikestunes/init-size" } | |
| ms-tpm-20-ref = { version = "0.1", git = "https://github.com/microsoft/ms-tpm-20-ref-rs.git" } |
abdcb38 to
d6a897a
Compare
vm/devices/tpm/tpm_device/src/lib.rs
Outdated
| struct TpmPlatformCallbacks { | ||
| pending_nvram: Arc<Mutex<Vec<u8>>>, | ||
| monotonic_timer: MonotonicTimer, | ||
| monotonic_timer: Arc<Mutex<MonotonicTimer>>, |
There was a problem hiding this comment.
why did this become an arc mutex?
vm/devices/tpm/tpm_device/src/lib.rs
Outdated
| }; | ||
| self.requested_locality = requested_locality; | ||
|
|
||
| // TODO: UNKNOWN: if the NVRAM sizes in the save state and the VMGS don't match, |
There was a problem hiding this comment.
In what case is this even possible? It seems like this is fatal or we migrated in a way that's not really expected, right?
There was a problem hiding this comment.
Yes, I think that's right -- if we get here, we screwed up somewhere, somehow. Maybe I'll just crash here. (That'll let me undo the Arc also.)
|
can we add a servicing test here? |
| dependencies = [ | ||
| "libc", | ||
| "windows-sys 0.60.2", | ||
| "windows-sys 0.59.0", |
There was a problem hiding this comment.
Deps need to be cleaned up altogether -- I'll take care of this after microsoft/ms-tpm-20-ref-rs#9 merges.
OpenHCL has a bug where the vTPM can try to write past the end of its NVRAM after OpenHCL servicing (the old TpmFail bug). The TPM's serialized state does not contain the NVRAM size, so after servicing, the TPM core thinks it has 32kB available regardless of the actual size.
OpenHCL creates the TPM device by creating a new TPM engine with a blank state, then reinitializing it with the saved state from the VMGS. This reinit makes it see the correct NVRAM size. In servicing, though, we skip that reinitialization and instead restore the serialized state. The TPM will have the size it was first initialized with, which is 32kB by default.
This is a proposed fix:
This fix depends on an ms-tpm-20-ref-rs change: microsoft/ms-tpm-20-ref-rs@32a4da7
The downside of this approach is that we now access the VMGS during servicing init. One way around that would be to save the NVRAM size in the OpenHCL servicing save state, but that doesn't fix the problem for existing VMs. AFAICT, the NVRAM size isn't available anywhere else when the TPM device is created.
The best solution is to fix this in the TPM core code by adding the NVRAM size to its serialized state. However, that would change the format of its save state, which would break UH's ability to live-service the TPM core. Doing this fix correctly would probably require adding versioning to the TPM core save-state, and also figuring out how to handle a serialized TPM state that doesn't have the NVRAM size in the TPM core.