Skip to content

Add optional module allowlist to PickleLoader deserialization#1648

Open
elijahbenizzy wants to merge 1 commit into
mainfrom
improve/pickle-loader-validation
Open

Add optional module allowlist to PickleLoader deserialization#1648
elijahbenizzy wants to merge 1 commit into
mainfrom
improve/pickle-loader-validation

Conversation

@elijahbenizzy

Copy link
Copy Markdown
Contributor

Summary

Adds an optional allowlist to PickleLoader.load_data in hamilton/io/default_data_loaders.py. Without an allowlist: backward-compatible behavior, but emits a one-time UserWarning per load to surface that unrestricted pickle deserialization is in use. With an allowlist: a _RestrictedUnpickler(pickle.Unpickler) subclass overrides find_class(module, name) to validate against the allowlist and raises UnpicklingError before any __reduce__ payload can resolve.

API

Two ways to set the allowlist, in resolution order (highest priority first):

  1. Per-instance: PickleLoader(allowlist=[("myapp.models", "MyClass"), ...])
  2. Process-wide: set_pickle_loader_allowlist([...])
  3. Otherwise: legacy unrestricted path + a one-time UserWarning

Allowlist entries are (module, qualname) tuples — per-class granularity rather than prefix-string. The motivation: pickle attacks routinely reach for specific symbols within otherwise-trusted modules (builtins.eval, os.system, subprocess.Popen), so denying by full (module, qualname) pair is more useful than module-prefix matching.

Trust model (now in the docstring)

PickleLoader reads pickle bytes from a caller-supplied path. Pickle deserialization is unsafe when those bytes originate from a source the application doesn't fully control — files passed between teams, intermediate caches written by upstream pipelines, shared persistence backends. The allowlist closes the __reduce__ arbitrary-code-execution primitive while keeping the historical contract of PickleLoader available for callers who explicitly opt out of restriction.

Tests

5 new tests in tests/io/test_default_adapters.py:

  • Legacy unrestricted-with-warning path roundtrip
  • Allowlist permits a legitimate roundtrip when the type is listed
  • Per-instance allowlist blocks a malicious __reduce__ payload (sentinel side effect never fires)
  • Module-level set_pickle_loader_allowlist applies to fresh PickleLoader instances
  • Per-instance allowlist overrides module-level default

Full tests/io/ runs 36/36 green.

PickleLoader has historically called pickle.load on caller-supplied bytes
without restriction, mirroring pandas.read_pickle / joblib.load. That
puts the safety burden entirely on the caller, which is brittle once
pickle files are shared across teams or produced by upstream systems.

Introduce an optional allowlist of (module, qualname) pairs, either
per-instance via PickleLoader(allowlist=...) or process-wide via
set_pickle_loader_allowlist(...). When configured, a RestrictedUnpickler
overrides find_class to reject anything off the allowlist, raising
UnpicklingError before any global is resolved. When no allowlist is set,
behavior is unchanged for backward compatibility but a warning is now
emitted on every load to signal the unrestricted mode.

Tests cover: the legacy unrestricted-with-warning path, allowlist
permitting a legitimate roundtrip, per-instance allowlist blocking a
malicious __reduce__ payload, module-level allowlist applying to fresh
instances, and per-instance overriding the module-level default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elijahbenizzy elijahbenizzy requested a review from skrawcz June 22, 2026 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant