feat(cache): replace Git-based hub cache with HF Storage Buckets #1102

Open
dacorvo wants to merge 9 commits into review/hf-buckets-cache from hf_buckets_cache

Conversation

@dacorvo (Collaborator) commented Apr 1, 2026

Summary

Replace the Git-based CompileCacheHfProxy with HF Storage Bucket-based NEFF cache.

  • Bucket operations run in an isolated subprocess via `uv run --with "huggingface_hub>=1.0"` to avoid a transformers version conflict
  • Context manager (hub_neuronx_cache) handles fetch on enter, sync on exit
  • NEFFs stored in both flat (hf-mount compatible) and per-model areas (Xet dedup)
  • Export records are flat neuron config dicts (advisory, for lookup only)
  • No monkey-patching of libneuronxla
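The fetch-on-enter/sync-on-exit behavior described above can be pictured with a small stand-alone sketch — `hub_neuronx_cache_sketch` and its event list are illustrative stand-ins for the real context manager, which delegates the actual bucket operations to the isolated subprocess:

```python
from contextlib import contextmanager

@contextmanager
def hub_neuronx_cache_sketch(model_id="unknown", task=None, cache_dir=None):
    """Illustrative stand-in: fetch cached NEFFs on enter, sync new ones on exit."""
    events = []
    events.append(("fetch", model_id, task))  # fetch from bucket on enter
    try:
        yield events
    finally:
        events.append(("sync", model_id, task))  # upload new NEFFs on exit

# "org/model" is a placeholder model id, not a real repo.
with hub_neuronx_cache_sketch(model_id="org/model", task="text-generation") as ev:
    ev.append(("compile",))  # compilation happens inside the context
```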

Test plan

  • Unit tests: pytest tests/cache/ (40 tests, CPU)
  • Integration: real Neuron compilation against aws-neuron/optimum-neuron-neff-cache
  • Phase 1: compile + sync to bucket
  • Phase 2: clean cache + fetch from bucket (zero recompilation)
  • Phase 3-4: warm cache (no fetch, no compilation)
  • vLLM model loader select_hub_cached_entries key name normalization (known TODO)

Ref: #1101

🤖 Generated with Claude Code

@dacorvo dacorvo force-pushed the hf_buckets_cache branch from 9f1d2d2 to 252039c Compare April 1, 2026 10:38
dacorvo and others added 9 commits April 1, 2026 15:21
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dacorvo dacorvo force-pushed the hf_buckets_cache branch from 69c68f0 to d894880 Compare April 1, 2026 15:31
@dacorvo dacorvo marked this pull request as ready for review April 1, 2026 15:32
Copilot AI review requested due to automatic review settings April 1, 2026 15:32
Contributor

Copilot AI left a comment

Pull request overview

This PR migrates Optimum Neuron’s remote NEFF cache from a Git-backed Hub repo to Hugging Face Storage Buckets, using a hub_neuronx_cache context manager that fetches on enter and syncs on exit via a long-running uv-launched bucket API proxy server.

Changes:

  • Replace the legacy CompileCacheHfProxy/registry/sync implementation with bucket-based fetch_cache, sync_cache, and lookup_cache plus a Unix-socket bucket server.
  • Update exporters/inference/training integration points and CLI commands to use the new bucket cache flow.
  • Add new bucket cache tests and remove the old Hub sync retry unit tests; update cache docs/AGENTS guidance.
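The Unix-socket bucket server mentioned above can be pictured as a small JSON request/response loop. The sketch below is only an assumed shape — the `ping` method and message framing are invented, not the actual `bucket_server.py` protocol, which proxies HfApi bucket methods:

```python
import json
import os
import socket
import tempfile
import threading
import time

def serve_once(path: str) -> None:
    # Accept one connection, decode a JSON request, answer with a JSON reply.
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    conn, _ = srv.accept()
    req = json.loads(conn.recv(4096).decode())
    # A real server would dispatch req["method"] to an HfApi call here.
    conn.sendall(json.dumps({"ok": True, "method": req["method"]}).encode())
    conn.close()
    srv.close()

sock_path = os.path.join(tempfile.mkdtemp(), "bucket.sock")
t = threading.Thread(target=serve_once, args=(sock_path,))
t.start()

cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
for _ in range(100):  # wait for the server thread to start listening
    try:
        cli.connect(sock_path)
        break
    except (FileNotFoundError, ConnectionRefusedError):
        time.sleep(0.01)
cli.sendall(json.dumps({"method": "ping"}).encode())
reply = json.loads(cli.recv(4096).decode())
cli.close()
t.join()
```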

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 14 comments.

Summary per file:

| File | Description |
|------|-------------|
| tests/decoder/test_cache_sync_retry.py | Removed tests tied to the old Git/commit-conflict sync strategy. |
| tests/conftest.py | Removes Hub repo cache fixtures and notes sync is now context-managed. |
| tests/cache/test_bucket_utils.py | Adds unit tests for bucket resolution + hashing + path helpers. |
| tests/cache/test_bucket_cache.py | Adds end-to-end bucket fetch/sync/lookup tests via the uv bucket server. |
| pyproject.toml | Adds uv dependency required to isolate bucket operations. |
| optimum/neuron/utils/runner.py | Switches “cache configured” checks from Hub repo to bucket configuration warnings. |
| optimum/neuron/utils/cache_utils.py | Removes Hub repo selection/write-access utilities; keeps local cache path helpers. |
| optimum/neuron/trainers/transformers.py | Removes explicit sync call; continues wrapping training with hub_neuronx_cache. |
| optimum/neuron/models/inference/backend/pretrained_model.py | Moves export caching to new hub_neuronx_cache usage. |
| optimum/neuron/modeling_traced.py | Removes Hub-based cache lookup/load path; always compiles via exporters. |
| optimum/neuron/modeling_diffusion.py | Removes Hub-based cache lookup/load path; always compiles via exporters. |
| optimum/neuron/cache/traced.py | Deleted legacy traced-artifact caching that depended on Hub proxy. |
| optimum/neuron/cache/optimum_neuron_cc_wrapper.py | Wraps compiler wrapper in hub_neuronx_cache without Hub repo config. |
| optimum/neuron/cache/hub_cache.py | Converts to a shim that re-exports bucket cache and makes synchronize a no-op. |
| optimum/neuron/cache/bucket_utils.py | Implements bucket ID persistence, hashing, and path helpers. |
| optimum/neuron/cache/bucket_server.py | Adds Unix-socket server proxying HfApi bucket methods in uv-isolated env. |
| optimum/neuron/cache/bucket_cache.py | Implements context manager + fetch/sync/lookup using the bucket server. |
| optimum/neuron/cache/AGENTS.md | Updates cache subsystem documentation to bucket architecture (needs alignment). |
| optimum/neuron/cache/__init__.py | Reduces exported API surface to cached-entries + synchronize. |
| optimum/exporters/neuron/convert.py | Wraps tracing in hub_neuronx_cache for bucket fetch/sync. |
| optimum/commands/neuron/cache.py | Refactors CLI from Hub repo commands to bucket set/fetch/lookup/status/cleanup. |
| AGENTS.md | Updates root onboarding note to describe bucket-based cache. |
Comments suppressed due to low confidence (1)

optimum/commands/neuron/cache.py:288

  • The optimum-cli neuron cache synchronize subcommand has been removed from SUBCOMMANDS, but internal tests/fixtures still invoke it (e.g. tests/decoder/test_cache.py). This will break CI unless those callers are updated in the same PR, or a backward-compatible synchronize command is kept (even if it delegates to a new bucket-based sync implementation).
class CustomCacheRepoCommand(BaseOptimumCLICommand):
    SUBCOMMANDS = (
        CommandInfo(
            name="create",
            help="Create a storage bucket on the Hugging Face Hub for Neuron compilation files.",
            subcommand_class=CreateCacheBucketCommand,
        ),
        CommandInfo(
            name="set",
            help="Set the name of the Neuron cache bucket to use locally.",
            subcommand_class=SetCacheBucketCommand,
        ),
        CommandInfo(
            name="fetch",
            help="Pre-warm the local cache by downloading MODULE dirs for a model from the bucket.",
            subcommand_class=FetchCommand,
        ),
        CommandInfo(
            name="lookup",
            help="Lookup cached export configurations for the specified model id.",
            subcommand_class=LookupCommand,
        ),
        CommandInfo(
            name="status",
            help="Show local Neuron compile cache status (entry counts by state, disk usage, compiler versions).",
            subcommand_class=StatusCacheCommand,
        ),
        CommandInfo(
            name="cleanup",
            help="Clean up poisoned entries (failed compilations, stale locks) from the local Neuron compile cache.",
            subcommand_class=CleanupCacheCommand,
        ),
    )

if cache_dir is not None:
    return Path(cache_dir)
cache_url = os.environ.get("NEURON_COMPILE_CACHE_URL")
if cache_url is not None:

Copilot AI Apr 1, 2026


_get_cache_dir() treats NEURON_COMPILE_CACHE_URL as a filesystem path unconditionally. In this codebase the same env var can be an S3 URL (see cleanup.py:_get_cache_path which ignores values starting with "s3://"). Returning Path("s3://...") will break mount detection and local file operations; consider rejecting/ignoring s3:// here (or parsing it explicitly) to match the local-cache utilities’ behavior.

Suggested change:

    - if cache_url is not None:
    + # NEURON_COMPILE_CACHE_URL may sometimes be an S3 URL (e.g. "s3://...").
    + # In that case, treat it as a remote location and fall back to the local
    + # cache resolution logic, matching the behavior of the local-cache utilities.
    + if cache_url is not None and not cache_url.startswith("s3://"):

Comment on lines +374 to +376
flat_path = f"{local_to_flat_bucket_path(module_name, compiler_version)}/{file_path.name}"
files_to_add.append([str(file_path), flat_path])
model_path = f"{model_task_prefix}/{module_name}/{file_path.name}"

Copilot AI Apr 1, 2026


sync_cache() uploads every file in each MODULE_* directory, including transient/poison files like "*.lock" and "model.log" (failed compilation marker). Since cleanup_local_cache() treats these specially, pushing them to a shared bucket can propagate stale locks or failed entries to other users. Filter uploads to only the expected success artifacts and explicitly exclude lock/log/temporary files.

Suggested change:

    - flat_path = f"{local_to_flat_bucket_path(module_name, compiler_version)}/{file_path.name}"
    - files_to_add.append([str(file_path), flat_path])
    - model_path = f"{model_task_prefix}/{module_name}/{file_path.name}"
    + # Skip transient/poison files that should not be shared via the bucket.
    + # These include lock files and failed-compilation markers like model.log.
    + name = file_path.name
    + if name.endswith(".lock") or name == "model.log":
    +     continue
    + flat_path = f"{local_to_flat_bucket_path(module_name, compiler_version)}/{name}"
    + files_to_add.append([str(file_path), flat_path])
    + model_path = f"{model_task_prefix}/{module_name}/{name}"

Comment on lines +24 to 26
# Re-export lookup_cache under the old name for callers that use it
get_hub_cached_entries = lookup_cache


Copilot AI Apr 1, 2026


get_hub_cached_entries is now a direct alias to lookup_cache, but lookup_cache does not accept the legacy parameters (e.g. cache_repo_id) that existing callers/tests still pass. This will raise TypeError at runtime. Consider adding a small compatibility wrapper that accepts the old signature (and maps/ignores cache_repo_id) rather than re-exporting lookup_cache directly.

Suggested change:

    - # Re-export lookup_cache under the old name for callers that use it
    - get_hub_cached_entries = lookup_cache
    + def get_hub_cached_entries(
    +     model_id: str,
    +     task: str | None = None,
    +     cache_repo_id: str | None = None,
    + ):
    +     """
    +     Compatibility wrapper for the legacy get_hub_cached_entries API.
    +     The underlying implementation is lookup_cache, which does not accept
    +     the deprecated `cache_repo_id` parameter, so it is accepted here but
    +     ignored to avoid breaking existing callers.
    +     """
    +     return lookup_cache(model_id=model_id, task=task)

Comment on lines +31 to 34
    """No-op — immediate sync in hub_neuronx_cache context replaces explicit sync."""
    logger.info("synchronize_hub_cache is a no-op with bucket-based cache (sync happens in context).")



Copilot AI Apr 1, 2026


synchronize_hub_cache() is now a no-op, but large parts of the test suite/fixtures still call it expecting remote synchronization (and CLI still historically exposed a cache synchronize command). If explicit sync is intentionally removed, the dependent tests/fixtures need updating; otherwise keep a functional implementation (or at least raise a clear error) to avoid silently masking missing uploads.

Suggested change:

    - """No-op — immediate sync in hub_neuronx_cache context replaces explicit sync."""
    - logger.info("synchronize_hub_cache is a no-op with bucket-based cache (sync happens in context).")
    + """
    + Explicit hub cache synchronization is no longer supported.
    + Cache uploads and synchronization are handled immediately within the
    + `hub_neuronx_cache` context manager. This function used to trigger an
    + explicit sync, but is now deprecated and will raise to avoid silently
    + masking missing uploads or incorrect cache usage.
    + """
    + logger.error(
    +     "synchronize_hub_cache() has been deprecated and no longer performs an explicit "
    +     "synchronization. Synchronization now happens automatically within the "
    +     "hub_neuronx_cache context manager. Please remove calls to synchronize_hub_cache "
    +     "or update your code/tests accordingly."
    + )
    + raise RuntimeError(
    +     "synchronize_hub_cache() is deprecated and no longer supported. "
    +     "Synchronization is now performed automatically by the hub_neuronx_cache "
    +     "context manager. Please update your code to rely on that behavior."
    + )


  # Only expose the hub cache public API
- from .hub_cache import get_hub_cached_entries, get_hub_cached_models, synchronize_hub_cache
+ from .hub_cache import get_hub_cached_entries, synchronize_hub_cache

Copilot AI Apr 1, 2026


optimum.neuron.cache no longer exports get_hub_cached_models, but existing tests (e.g. tests/decoder/test_cache.py) and possibly external callers still import it. This is a breaking public API change; either keep exporting a compatibility implementation or update/remove all internal references in the same PR to keep CI green.

Suggested change:

      from .hub_cache import get_hub_cached_entries, synchronize_hub_cache
    + def get_hub_cached_models(*args, **kwargs):
    +     """
    +     Backwards-compatible alias for :func:`get_hub_cached_entries`.
    +     This function is kept for compatibility with older code that imports
    +     ``get_hub_cached_models`` from :mod:`optimum.neuron.cache`.
    +     """
    +     return get_hub_cached_entries(*args, **kwargs)

Comment on lines +7 to 16
+ | `bucket_utils.py` | Bucket resolution, model ID encoding, config hashing, path translation | No |
+ | `bucket_cache.py` | Context manager (`hub_neuronx_cache`), `fetch_cache`, `sync_cache`, `lookup_cache` | No |
+ | `bucket_cli.py` | Standalone subprocess for bucket API (runs via `uv run --with "huggingface_hub>=1.0"`) | No |
+ | `hub_cache.py` | Public API shim — re-exports `hub_neuronx_cache`, `synchronize_hub_cache`, `select_hub_cached_entries` | No |
+ | `cleanup.py` | Local compile cache inspection and cleanup (`get_local_cache_status`, `cleanup_local_cache`) | No |
- | `hub_cache.py` | HF Hub cache proxy, monkey-patching, registry, sync with retry | Yes (runtime) |
- | `entries/cache_entry.py` | `ModelCacheEntry` dataclass — serialization, arch digest, hash | No |
- | `entries/single_model.py`, `entries/multi_model.py` | Concrete entry types | No |
- | `training.py`, `traced.py` | Legacy cache wrappers for training and traced model paths | Yes |
+ | `entries/cache_entry.py` | `ModelCacheEntry` dataclass — kept for backwards compat with callers | No |
+ | `entries/single_model.py`, `entries/multi_model.py` | Concrete entry types — kept for backwards compat | No |
+ | `training.py` | Training cache wrapper | Yes |
+ | `traced.py` | Deprecated — traced model caching now handled by bucket context | No |
  | `optimum_neuron_cc_wrapper.py` | Compiler wrapper hooks | Yes |

Copilot AI Apr 1, 2026


AGENTS.md references bucket_cli.py and a cache synchronize command, but this PR adds bucket_server.py (not bucket_cli.py) and the CLI SUBCOMMANDS no longer include synchronize. The document also states model_id encoding uses --, but bucket_model_prefix currently uses the raw model_id with /. Please update this guide to match the implemented files/layout so it remains a reliable onboarding source.
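To make the encoding discrepancy concrete, here is what the two schemes look like side by side. Both helpers are hypothetical sketches, not the actual `bucket_model_prefix`, and the `models/` prefix is invented:

```python
def encode_model_id(model_id: str) -> str:
    """Flatten 'org/name' into a single path segment, as AGENTS.md describes."""
    return model_id.replace("/", "--")

def raw_model_prefix(model_id: str, task: str) -> str:
    """What the review says the code currently does: keep the raw '/'.

    The "models/" base prefix here is an invented placeholder.
    """
    return f"models/{model_id}/{task}"
```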

Comment on lines 121 to 133
## CLI Commands

All defined in `optimum/commands/neuron/cache.py`:

| Command | Description |
|---------|-------------|
- | `cache create` | Create a Hub cache repo |
- | `cache set` | Set active Hub cache repo locally |
- | `cache synchronize` | Upload local cache to Hub (cleans up first) |
- | `cache lookup` | Search Hub cache for compiled models |
+ | `cache create` | Create/verify a cache bucket |
+ | `cache set` | Set active cache bucket locally |
+ | `cache synchronize` | Upload deferred-sync MODULE dirs (scans for `.bucket_meta.json`) |
+ | `cache fetch` | Pre-warm local cache from bucket for a model |
+ | `cache lookup` | List cached export configs for a model |
+ | `cache status` | Show local cache entry counts, sizes, compiler versions |
+ | `cache cleanup` | Remove failed/locked/empty entries from local cache |

Copilot AI Apr 1, 2026


The CLI command table/documentation here still lists cache synchronize (deferred sync via .bucket_meta.json), but the current implementation in optimum/commands/neuron/cache.py removed synchronize and there is no deferred-sync tagging/upload logic in bucket_cache.py. Either implement the deferred/manual sync flow or remove it from the docs to avoid promising a feature that isn’t present.

Comment on lines 1232 to 1234
    """
    with hub_neuronx_cache(cache_dir=get_neuron_cache_path()):
        output = self._train(resume_from_checkpoint=resume_from_checkpoint)

Copilot AI Apr 1, 2026


hub_neuronx_cache is now bucket-based and uses model_id/task to scope fetch/sync. Calling it with only cache_dir means it will default to model_id="unknown" and can upload new MODULE_* dirs into an "unknown" bucket prefix (and fetch will never find model-specific entries). Consider plumbing a real model_id/task from the training config (e.g., self.model.config._name_or_path + task) or disable remote bucket operations when the model identity isn’t known.

Comment on lines 22 to 24
  def main():
-     with hub_neuronx_cache(cache_repo_id=get_hf_hub_cache_repos()[0], cache_dir=get_neuron_cache_path()):
+     with hub_neuronx_cache(cache_dir=get_neuron_cache_path()):
          return neuron_cc_wrapper_main()

Copilot AI Apr 1, 2026


hub_neuronx_cache now scopes bucket fetch/sync by model_id/task. Wrapping neuron_cc_wrapper_main() with only cache_dir will default to model_id="unknown" and may sync compiler artifacts into an ambiguous remote prefix. If this wrapper can’t reliably determine the model identity, consider disabling per-model bucket sync here (or requiring callers to provide model_id/task explicitly).

Copilot uses AI. Check for mistakes.
Comment on lines +242 to +249
def hub_neuronx_cache(
    model_id: str = "unknown",
    task: str | None = None,
    export_config: dict | None = None,
    compiler_version: str | None = None,
    bucket_id: str | None = None,
    cache_dir: str | Path | None = None,
):

Copilot AI Apr 1, 2026


hub_neuronx_cache defaults model_id to "unknown" and will still perform bucket fetch/sync under that prefix when callers omit model_id (several call sites still do). That can pollute the shared bucket with ambiguous entries. Consider making model_id optional and skipping remote bucket operations (or raising) when it isn’t provided, so callers must opt in with a stable identifier.
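The opt-in behavior proposed here could look like the following guard. This is a sketch of the suggestion, not the PR's actual implementation — the real `fetch_cache`/`sync_cache` calls are represented by flags:

```python
from contextlib import contextmanager

@contextmanager
def hub_neuronx_cache(model_id=None, task=None, cache_dir=None):
    # Sketch: only touch the remote bucket when the caller supplies a
    # stable model_id; otherwise the context is local-only, so nothing
    # is ever uploaded under an ambiguous "unknown" prefix.
    remote_enabled = model_id is not None and model_id != "unknown"
    actions = {"fetched": False, "synced": False}
    if remote_enabled:
        actions["fetched"] = True  # fetch_cache(...) would run here
    try:
        yield actions
    finally:
        if remote_enabled:
            actions["synced"] = True  # sync_cache(...) would run here
```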
