fix(cache): cap actualEvict by totalUsed_ to prevent eviction deadloop#400
fix(cache): cap actualEvict by totalUsed_ to prevent eviction deadloop#400yuchen0cc wants to merge 2 commits intocontainerd:mainfrom
Conversation
Signed-off-by: yuchen.cc <yuchen.cc@alibaba-inc.com>
There was a problem hiding this comment.
Pull request overview
Caps the eviction target in FileCachePool::eviction() to avoid an infinite-loop deadlock when disk-pressure-based eviction exceeds what the cache pool actually owns, and adds a diagnostic log at eviction start.
Changes:
- Cap
actualEvicttototalUsed_before entering the eviction loop to prevent dead-looping when the pool cannot free enough bytes. - Add an eviction start log line with key inputs (
actualEvict,evictByCache,evictByDisk,totalUsed_).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| auto actualEvict = std::min( | ||
| static_cast<int64_t>(std::max(evictByCache, evictByDisk)), | ||
| totalUsed_ | ||
| ); |
There was a problem hiding this comment.
The new behavior (capping actualEvict by totalUsed_) fixes a previously-reported infinite-loop scenario; please add a regression test to prevent reintroduction. There are existing gtest-based cache tests under src/overlaybd/cache/test/; consider adding a case that forces evictByDisk > totalUsed_ (e.g., via an IFileSystem wrapper that overrides statvfs) and verifies FileCachePool::eviction() terminates even when the LRU contains entries with openCount > 0 and size == 0.
| isFull_ = true; | ||
|
|
||
| if (!lru_.empty() && !exit_) { | ||
| LOG_INFO("start eviction ", VALUE(actualEvict), VALUE(evictByCache), VALUE(evictByDisk), VALUE(totalUsed_)); |
There was a problem hiding this comment.
eviction() is triggered periodically by a timer; logging at INFO on every eviction start can generate high-volume logs in steady state. Consider lowering this to DEBUG or gating it so it only logs when the cap is actually applied (i.e., when max(evictByCache, evictByDisk) > totalUsed_) so the message remains diagnostic without being noisy.
| LOG_INFO("start eviction ", VALUE(actualEvict), VALUE(evictByCache), VALUE(evictByDisk), VALUE(totalUsed_)); | |
| LOG_DEBUG("start eviction ", VALUE(actualEvict), VALUE(evictByCache), VALUE(evictByDisk), VALUE(totalUsed_)); |
| isFull_ = true; | ||
|
|
||
| if (!lru_.empty() && !exit_) { | ||
| LOG_INFO("start eviction ", VALUE(actualEvict), VALUE(evictByCache), VALUE(evictByDisk), VALUE(totalUsed_)); |
There was a problem hiding this comment.
Will there be too many logs?
There was a problem hiding this comment.
Good point, I will use LOG_AUDIT instead of LOG_INFO.
Signed-off-by: yuchen.cc <yuchen.cc@alibaba-inc.com>
What this PR does / why we need it:
Problem
FileCachePool::eviction() computes the eviction target as:
evictByDisk is derived from the whole-disk free space, which is not under the cache pool's control. When other processes occupy the disk, evictByDisk can exceed totalUsed_ — the total bytes the pool actually owns. After truncating every cache file to zero, actualEvict is still positive while the LRU holds only entries with openCount > 0 and size == 0. These entries are moved to the LRU head on every iteration without reducing actualEvict, causing an infinite loop.
Concrete example:
diskAvailInBytes_ = 150 MB (configured minimum free space)
disk free = 50 MB (occupied by other processes)
evictByDisk = 100 MB
totalUsed_ = 30 MB (all cache pool owns)
actualEvict starts at 100 MB. After clearing all 30 MB of cache,
actualEvict = 70 MB > 0, but the LRU has no more evictable entries.
Fix
Cap actualEvict to at most totalUsed_ before entering the loop:
The cache pool cannot free more than it owns. With this bound, after all truncations complete, actualEvict ≤ 0 and the loop exits normally.
A diagnostic log line is added at loop entry to aid future analysis.
Verification
Reproduced with a 200 MB ext4 loop device as the cache directory. 120 MB is pre-occupied by a noise file, leaving ~50 MB free against a 150 MB diskAvailInBytes_ threshold. Three cache files are kept open (openCount > 0) to hold path-B entries in the LRU after truncation. Without the fix eviction() never returns; with the fix it exits in O(n) iterations.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)format, will close the issue(s) when PR gets merged):Fixes #
Please check the following list: