Skip to content

fix(cache): cap actualEvict by totalUsed_ to prevent eviction deadloop#400

Open
yuchen0cc wants to merge 2 commits intocontainerd:mainfrom
yuchen0cc:main
Open

fix(cache): cap actualEvict by totalUsed_ to prevent eviction deadloop#400
yuchen0cc wants to merge 2 commits intocontainerd:mainfrom
yuchen0cc:main

Conversation

@yuchen0cc
Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

Problem

FileCachePool::eviction() computes the eviction target as:

actualEvict = max(evictByCache, evictByDisk)

evictByDisk is derived from the whole-disk free space, which is not under the cache pool's control. When other processes occupy the disk, evictByDisk can exceed totalUsed_ — the total bytes the pool actually owns. After truncating every cache file to zero, actualEvict is still positive while the LRU holds only entries with openCount > 0 and size == 0. These entries are moved to the LRU head on every iteration without reducing actualEvict, causing an infinite loop.

Concrete example:

diskAvailInBytes_ = 150 MB (configured minimum free space)
disk free = 50 MB (occupied by other processes)
evictByDisk = 100 MB
totalUsed_ = 30 MB (all cache pool owns)

actualEvict starts at 100 MB. After clearing all 30 MB of cache,
actualEvict = 70 MB > 0, but the LRU has no more evictable entries.

Fix

Cap actualEvict to at most totalUsed_ before entering the loop:

actualEvict = min(max(evictByCache, evictByDisk), totalUsed_)

The cache pool cannot free more than it owns. With this bound, after all truncations complete, actualEvict ≤ 0 and the loop exits normally.

A diagnostic log line is added at loop entry to aid future analysis.

Verification

Reproduced with a 200 MB ext4 loop device as the cache directory. 120 MB is pre-occupied by a noise file, leaving ~50 MB free against a 150 MB diskAvailInBytes_ threshold. Three cache files are kept open (openCount > 0) to hold path-B entries in the LRU after truncation. Without the fix eviction() never returns; with the fix it exits in O(n) iterations.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Please check the following list:

  • Does the affected code have corresponding tests, e.g. unit test, E2E test?
  • Does this change require a documentation update?
  • Does this introduce breaking changes that would require an announcement or bumping the major version?
  • Do all new files have an appropriate license header?

Signed-off-by: yuchen.cc <yuchen.cc@alibaba-inc.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Caps the eviction target in FileCachePool::eviction() to avoid an infinite-loop deadlock when disk-pressure-based eviction exceeds what the cache pool actually owns, and adds a diagnostic log at eviction start.

Changes:

  • Cap actualEvict to totalUsed_ before entering the eviction loop to prevent dead-looping when the pool cannot free enough bytes.
  • Add an eviction start log line with key inputs (actualEvict, evictByCache, evictByDisk, totalUsed_).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +206 to +209
auto actualEvict = std::min(
static_cast<int64_t>(std::max(evictByCache, evictByDisk)),
totalUsed_
);
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new behavior (capping actualEvict by totalUsed_) fixes a previously-reported infinite-loop scenario; please add a regression test to prevent reintroduction. There are existing gtest-based cache tests under src/overlaybd/cache/test/; consider adding a case that forces evictByDisk > totalUsed_ (e.g., via an IFileSystem wrapper that overrides statvfs) and verifies FileCachePool::eviction() terminates even when the LRU contains entries with openCount > 0 and size == 0.

Copilot uses AI. Check for mistakes.
isFull_ = true;

if (!lru_.empty() && !exit_) {
LOG_INFO("start eviction ", VALUE(actualEvict), VALUE(evictByCache), VALUE(evictByDisk), VALUE(totalUsed_));
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eviction() is triggered periodically by a timer; logging at INFO on every eviction start can generate high-volume logs in steady state. Consider lowering this to DEBUG or gating it so it only logs when the cap is actually applied (i.e., when max(evictByCache, evictByDisk) > totalUsed_) so the message remains diagnostic without being noisy.

Suggested change
LOG_INFO("start eviction ", VALUE(actualEvict), VALUE(evictByCache), VALUE(evictByDisk), VALUE(totalUsed_));
LOG_DEBUG("start eviction ", VALUE(actualEvict), VALUE(evictByCache), VALUE(evictByDisk), VALUE(totalUsed_));

Copilot uses AI. Check for mistakes.
isFull_ = true;

if (!lru_.empty() && !exit_) {
LOG_INFO("start eviction ", VALUE(actualEvict), VALUE(evictByCache), VALUE(evictByDisk), VALUE(totalUsed_));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be too many logs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I will use LOG_AUDIT instead of LOG_INFO.

Signed-off-by: yuchen.cc <yuchen.cc@alibaba-inc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants