feat(ragfs): add CachedFileSystem and Redis/Mooncake/Yuanrong cache providers#2520

Open
tuofang wants to merge 21 commits into volcengine:main from tuofang:ragfs-add-cache
@tuofang tuofang commented Jun 8, 2026


Related Discussion

add CachedFileSystem + CacheProvider trait and Redis/Mooncake/Yuanrong providers

Summary

This PR adds the RAGFS cache provider architecture and introduces native cache provider adapters for Mooncake and Yuanrong.

Implemented functionality:

  • Added a shared CacheProvider abstraction used by CachedFileSystem.
  • Kept file and directory cache semantics in the common cache layer:
    • file cache keys and directory cache keys
    • cache hit / miss handling
    • backend fallback and cache fill
    • write/delete/rename invalidation
    • remove_all / directory rename subtree invalidation
    • metrics, timeout, and degraded fallback behavior
  • Added feature-gated Mooncake cache adapter:
    • MooncakeProvider
    • MooncakeClient
    • MooncakeConfig
    • native Mooncake Store setup and health check
    • blocking native calls isolated behind concurrency limits
    • provider contract support for get/put/delete/batch/close
  • Added feature-gated Yuanrong cache adapter:
    • YuanrongProvider
    • YuanrongClient
    • YuanrongConfig
    • ragfs-cache-yuanrong-sys FFI crate
    • C ABI bridge over Yuanrong C++ SDK / KVClient
    • health/get/set/delete/exists/batch/flush/close support
    • timeout, error mapping, concurrency control, key hashing, and empty-value encoding
  • Added Docker-based native test environments for provider validation.
  • Default build behavior remains unchanged: native cache providers are behind features and default RAGFS behavior does not require linking Mooncake or Yuanrong.
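The split described above, where the common layer owns the cache semantics and a provider only stores bytes, can be sketched as follows. This is a minimal, synchronous illustration using only the standard library; the trait shape, the `MemoryProvider` and `CachedFs` types, and the `file:` key prefix are assumptions made for the example, not the PR's actual API (which is async and feature-gated).

```rust
use std::collections::HashMap;

/// Hypothetical, synchronous stand-in for a cache provider contract.
trait CacheProvider {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&mut self, key: &str, value: Vec<u8>);
    fn delete(&mut self, key: &str);
}

/// In-memory provider, playing the role of a mock backend.
#[derive(Default)]
struct MemoryProvider {
    entries: HashMap<String, Vec<u8>>,
}

impl CacheProvider for MemoryProvider {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.get(key).cloned()
    }
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.entries.insert(key.to_owned(), value);
    }
    fn delete(&mut self, key: &str) {
        self.entries.remove(key);
    }
}

/// Wrapper that owns hit/miss handling, cache fill, and invalidation.
struct CachedFs<P: CacheProvider> {
    backend: HashMap<String, Vec<u8>>, // stand-in for the real file system
    provider: P,
    hits: u64,
    misses: u64,
}

impl<P: CacheProvider> CachedFs<P> {
    fn new(provider: P) -> Self {
        Self { backend: HashMap::new(), provider, hits: 0, misses: 0 }
    }

    /// Assumed key scheme for file entries.
    fn file_key(path: &str) -> String {
        format!("file:{path}")
    }

    /// Read-through: serve from the cache, else fall back and fill.
    fn read(&mut self, path: &str) -> Option<Vec<u8>> {
        let key = Self::file_key(path);
        if let Some(bytes) = self.provider.get(&key) {
            self.hits += 1;
            return Some(bytes);
        }
        self.misses += 1;
        let bytes = self.backend.get(path)?.clone();
        self.provider.put(&key, bytes.clone());
        Some(bytes)
    }

    /// Write to the backend, then invalidate the cached copy.
    fn write(&mut self, path: &str, data: &[u8]) {
        self.backend.insert(path.to_owned(), data.to_vec());
        self.provider.delete(&Self::file_key(path));
    }
}

fn main() {
    let mut fs = CachedFs::new(MemoryProvider::default());
    fs.write("/a.txt", b"v1");
    fs.read("/a.txt"); // miss, fills the cache
    fs.read("/a.txt"); // hit
    fs.write("/a.txt", b"v2"); // invalidates the cached v1
    let latest = fs.read("/a.txt");
    println!("hits={} misses={} latest={:?}", fs.hits, fs.misses, latest);
}
```

Because the wrapper is generic over the provider, swapping the in-memory mock for another backend would not change the read or invalidation logic in this sketch.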

Testing

Mooncake validation:

  • Docker Mooncake native environment built successfully.
  • Mooncake official Rust native smoke: 1/1 passed.
  • OpenViking Mooncake native smoke: 1/1 passed.
  • Mooncake Provider contract: 8/8 passed.
  • Mooncake Provider + CachedFileSystem: 5/5 passed.
  • MemoryMock + CachedFileSystem regression: 18/18 passed.
  • RAGFS default feature unit tests: 111/111 passed.
  • RAGFS default feature doc tests: 1/1 passed.
  • Static checks for this change: passed.

Yuanrong validation:

  • Default feature provider contract: 6/6 passed.
  • Default feature CachedFileSystem tests: 3/3 passed.
  • Native feature provider contract: 5/5 passed.
  • Real ETCD + Yuanrong worker CachedFileSystem tests: 4/4 passed.
  • Real ETCD + Yuanrong worker native smoke: 1/1 passed.

Covered behavior:

  • cache hit reads
  • miss fallback and cache fill
  • write-after-read correctness
  • directory cache
  • rename/delete/remove_all invalidation
  • batch get/put/delete
  • provider unavailable fallback to backend
  • close lifecycle
  • timeout and concurrency control
  • default build without native provider linking

Note: Yuanrong native validation used the currently available Docker image, whose installed openyuanrong-datasystem version is 0.6.3.

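Load testing

One backend configuration was tested:

storage backend = localfs
vector db = local
cache provider = redis

With the storage backend and the vector database both fixed to local implementations, the goal is to compare the end-to-end benefit and extra overhead of cache off vs. cache on (redis) on OpenViking's service APIs.

Scale: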

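| Dataset | Count | File size | Purpose |
| --- | --- | --- | --- |
| small docs | 1,000 | 1 KB - 16 KB | High-frequency content/read, fs/stat |
| medium docs | 500 | 64 KB - 512 KB | Main benchmark read path |
| large docs | 100 | 1 MB - 8 MB | Large-object read-through / warm read |
| hot set | 20% of total | Mixed sizes | Receives 80% of requests in hot/cold mixed |
| cold set | 80% of total | Mixed sizes | Receives 20% of requests in hot/cold mixed |

1,600 files in total.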

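1. Workload matrix

| Phase | Workload | Goal | Users | Spawn rate | Duration | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| smoke | warm_read | Smoke check of scripts, auth, and metrics | 5 | 5/s | 2m | Once each for cache off/on |
| baseline | cold_read | First-miss cost with cache on | 32 | 32/s | 10m | Flush Redis before the cache on run |
| benchmark | warm_read | Maximum hit benefit | 64 | 64/s | 15m | Primary scenario |
| benchmark | hot_cold | Realistic hot/cold benefit | 64 | 64/s | 15m | hot 80% / cold 20% |
| benchmark | read_write | Write-through cost and read-after-write | 64 | 64/s | 15m | read 90% / write 10% |
| benchmark | stat_ls | Metadata cache benefit | 64 | 64/s | 15m | Focus on stat/list/tree |
| correctness | invalidation | Invalidation correctness | 4 | 4/s | 5m | Run with cache on (redis) |
| supplement | search_supplement | Local vector db stability supplement | 16 | 16/s | 10m | Not used for the main cache-benefit conclusion |
| sweep | warm_read | Concurrency curve | 16/32/64/128 | Same as users | 10m per level | cache off/on run in pairs |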

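2. Executive summary

  • The formal A/B runs and the sweep total 20 runs and 798,792 requests, with Locust failure=0.
  • With cache on, aggregate average latency improves for cold_read, warm_read, hot_cold, read_write, and stat_ls.
  • In read_write, read requests improve markedly, but write P95 is on par with cache off and average write latency increases slightly.
  • In the 128-user sweep, Avg/P50 still improve, but P95 is slightly worse and P99 is flat, which suggests that tail latency under high concurrency is no longer dominated by the cache.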

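3. Main workload A/B gains

| Workload | RPS change | Avg improvement | P50 improvement | P95 improvement | P99 improvement | Max improvement |
| --- | --- | --- | --- | --- | --- | --- |
| cold_read | +0.0% | +17.4% | +24.4% | +11.8% | +9.5% | -18.6% |
| warm_read | +0.0% | +15.7% | +17.6% | +0.0% | -3.3% | -42.8% |
| hot_cold | +0.0% | +18.5% | +25.5% | +26.3% | +27.3% | +9.0% |
| read_write | +0.0% | +17.2% | +18.8% | +3.8% | +3.2% | +2.7% |
| stat_ls | -0.0% | +23.3% | +16.7% | +12.1% | +11.3% | -89.2% |
| invalidation | -0.0% | -8.5% | -7.7% | -9.5% | -4.2% | -21.2% |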

4. OpenViking Prometheus comparison

| Workload | Cache | Route | Δ count | Server Avg ms | Server Avg gain | Locust Avg ms |
| --- | --- | --- | --- | --- | --- | --- |
| cold_read | off | content/read:cold_read | 19,200 | 46.1 | baseline | 96.5 |
| cold_read | on | content/read:cold_read | 19,200 | 36.6 | +20.7% | 79.7 |
| warm_read | off | content/read:warm_read | 57,600 | 88.5 | baseline | 171.3 |
| warm_read | on | content/read:warm_read | 57,600 | 70.8 | +20.0% | 144.3 |
| hot_cold | off | content/read:hot_cold | 57,600 | 49.8 | baseline | 119.3 |
| hot_cold | on | content/read:hot_cold | 57,600 | 42.1 | +15.4% | 97.2 |
| read_write | off | content/read:read_write | 51,742 | 85.1 | baseline | 165.8 |
| read_write | off | content/write:read_write | 5,858 | 139.9 | baseline | 193.5 |
| read_write | on | content/read:read_write | 51,921 | 66.5 | +21.9% | 132.7 |
| read_write | on | content/write:read_write | 5,679 | 155.7 | -11.3% | 203.2 |
| stat_ls | off | fs/stat:stat_ls | 28,840 | 425.6 | baseline | 480.3 |
| stat_ls | on | fs/stat:stat_ls | 28,666 | 337.6 | +20.7% | 382.2 |
| invalidation | off | content/read:invalidation (2 reads/sequence) | 2,400 | 3.3 | baseline | 5.7 |
| invalidation | off | content/write:invalidation | 1,200 | 30.7 | baseline | 39.4 |
| invalidation | on | content/read:invalidation (2 reads/sequence) | 2,400 | 3.4 | -3.7% | 5.6 |
| invalidation | on | content/write:invalidation | 1,200 | 35.2 | -14.7% | 42.8 |
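The gain column reads naturally as a relative latency change against the cache off baseline. The sketch below assumes the formula (off - on) / off, which is not stated in the PR; applied to the two read_write server averages it gives +21.9% for reads and -11.3% for writes, matching the listed figures. Rows computed from already-rounded averages can differ from the listed value in the last digit.

```rust
/// Relative latency improvement in percent; positive means cache on is faster.
/// The formula is an assumption about how the gain column was derived.
fn improvement_pct(off_ms: f64, on_ms: f64) -> f64 {
    (off_ms - on_ms) / off_ms * 100.0
}

fn main() {
    // Server Avg ms values taken from the read_write rows.
    let read = improvement_pct(85.1, 66.5);
    let write = improvement_pct(139.9, 155.7);
    println!("read_write read:  {read:+.1}%");
    println!("read_write write: {write:+.1}%");
}
```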

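5. Concurrency sweep

| Users | Avg improvement | P50 improvement | P95 improvement | P99 improvement |
| --- | --- | --- | --- | --- |
| 16 | +24.1% | +27.3% | +6.0% | +0.0% |
| 32 | +29.0% | +34.0% | +27.8% | +18.2% |
| 64 | +24.0% | +23.5% | +11.1% | +6.5% |
| 128 | +10.3% | +9.1% | -2.3% | +0.0% |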

add CachedFileSystem + CacheProvider trait and Mooncake/Yuanrong Provide
github-actions Bot commented Jun 8, 2026


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 90
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add RAGFS cache framework

Relevant files:

  • crates/ragfs/src/cache/mod.rs
  • crates/ragfs/src/cache/wrapper.rs
  • crates/ragfs/src/cache/metrics.rs
  • crates/ragfs/src/cache/policy.rs
  • crates/ragfs/src/cache/provider.rs
  • crates/ragfs/src/cache/envelope.rs
  • crates/ragfs/src/cache/memory.rs
  • crates/ragfs/tests/cache_wrapper.rs
  • crates/ragfs/src/lib.rs

Sub-PR theme: Add Mooncake cache provider

Relevant files:

  • crates/ragfs-cache-mooncake/**/*

Sub-PR theme: Add Yuanrong cache provider

Relevant files:

  • crates/ragfs-cache-yuanrong/**/*
  • crates/ragfs-cache-yuanrong-sys/**/*

⚡ No major issues detected

github-actions Bot commented Jun 8, 2026


PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Handle mutex poisoning for known_keys instead of unwrapping

Handle mutex poisoning for known_keys instead of using unwrap(), consistent with the
flush method. A poisoned mutex can cause unexpected panics in async code. Replace
lock().unwrap() with lock().map_err(...) to return a CacheError::Internal when the
mutex is poisoned.

crates/ragfs-cache-mooncake/src/provider.rs [82-125]

 async fn put(&self, key: &str, value: Bytes) -> CacheResult<()> {
     self.client.put(key, &value, &self.replicate).await?;
-    self.known_keys.lock().unwrap().insert(key.to_owned());
+    self.known_keys
+        .lock()
+        .map_err(|_| CacheError::Internal("Mooncake key tracker is poisoned".into()))?
+        .insert(key.to_owned());
     Ok(())
 }
 
 async fn delete_object(&self, key: &str) -> CacheResult<()> {
     if !self.client.is_exist(key).await? {
-        self.known_keys.lock().unwrap().remove(key);
+        self.known_keys
+            .lock()
+            .map_err(|_| CacheError::Internal("Mooncake key tracker is poisoned".into()))?
+            .remove(key);
         return Ok(());
     }
     match self.client.remove(key).await {
         Ok(()) => {
-            self.known_keys.lock().unwrap().remove(key);
+            self.known_keys
+                .lock()
+                .map_err(|_| CacheError::Internal("Mooncake key tracker is poisoned".into()))?
+                .remove(key);
             Ok(())
         }
         Err(error) => match self.client.is_exist(key).await {
             Ok(false) => {
-                self.known_keys.lock().unwrap().remove(key);
+                self.known_keys
+                    .lock()
+                    .map_err(|_| CacheError::Internal("Mooncake key tracker is poisoned".into()))?
+                    .remove(key);
                 Ok(())
             }
             Ok(true) | Err(_) => Err(error),
         },
     }
 }
Suggestion importance[1-10]: 7


Why: The suggestion adds proper mutex poisoning handling to put and delete_object methods, consistent with the existing flush method, improving robustness and preventing unexpected panics.

Impact: Medium
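The poisoning behavior behind this suggestion can be reproduced with the standard library alone. The sketch below is a simplified illustration: `track_key` and the local `CacheError` enum are hypothetical stand-ins, not the Mooncake provider's types. A thread panics while holding the lock, after which `lock()` returns `Err`; mapping that to an error value lets the caller fail gracefully where `unwrap()` would panic.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical error type standing in for the provider's CacheError.
#[derive(Debug, PartialEq)]
enum CacheError {
    Internal(String),
}

/// Records a key, turning a poisoned lock into an error instead of a panic.
fn track_key(known_keys: &Mutex<HashSet<String>>, key: &str) -> Result<(), CacheError> {
    known_keys
        .lock()
        .map_err(|_| CacheError::Internal("key tracker is poisoned".into()))?
        .insert(key.to_owned());
    Ok(())
}

fn main() {
    let known_keys: Arc<Mutex<HashSet<String>>> = Arc::new(Mutex::new(HashSet::new()));
    println!("before poisoning: {:?}", track_key(&known_keys, "a"));

    // Poison the mutex: a thread panics while holding the guard.
    let poisoner = Arc::clone(&known_keys);
    let _ = thread::spawn(move || {
        let _guard = poisoner.lock().unwrap();
        panic!("simulated failure while holding the lock");
    })
    .join();

    // The same call now reports an error; unwrap() here would have panicked.
    println!("after poisoning: {:?}", track_key(&known_keys, "b"));
}
```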

Mijamind719 changed the title from "add RAGFS CachedFileSystem" to "feat(ragfs): add CachedFileSystem and Redis/Mooncake/Yuanrong cache providers" on Jun 9, 2026
@Mijamind719 Mijamind719 self-assigned this Jun 9, 2026
@baojun-zhang

Collaborator

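Thanks for implementing distributed caching in the Rust file system; it is a big help for the overall performance of our system. Could you also add usage documentation and configuration examples? For example: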

@Mijamind719 Mijamind719 self-requested a review June 9, 2026 02:46
@baojun-zhang baojun-zhang requested review from qin-ctx and zhoujh01 June 9, 2026 02:51
@Mijamind719

Collaborator

Findings

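P1 crates/ragfs/src/cache/wrapper.rs:139-217
current_generation() caches the subtree generation locally in self.generations, and generations_match() afterwards reads only that local value; it never goes back to the provider for the latest generation. As a result, when multiple CachedFileSystem instances share the same provider + namespace and one instance bumps the generation, another instance may keep hitting stale cache entries.
I verified this with a temporary reproduction test on the latest head 8ac5904: two wrappers share one MemoryCacheProvider, and after second.remove_all("/tree"), first.read("/tree/leaf.txt") still returns the old value old instead of the new value new. This directly breaks the core semantics of distributed cache invalidation.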
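The failure mode can be illustrated with a toy model of generation-tagged entries. Everything in the sketch is hypothetical (the `Provider` and `Wrapper` types, the single global generation, the `provider_first` switch) and much simpler than wrapper.rs; it only contrasts trusting a locally cached generation with always asking the shared provider.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

/// Shared store: the authoritative generation plus entries tagged with the
/// generation they were written under.
#[derive(Default)]
struct Provider {
    generation: u64,
    entries: HashMap<String, (u64, String)>,
}

/// One cache wrapper instance; several may share the same provider.
struct Wrapper {
    provider: Rc<RefCell<Provider>>,
    local_generation: Option<u64>, // per-instance copy, may go stale
    provider_first: bool,
}

impl Wrapper {
    fn new(provider: Rc<RefCell<Provider>>, provider_first: bool) -> Self {
        Self { provider, local_generation: None, provider_first }
    }

    fn current_generation(&mut self) -> u64 {
        match (self.provider_first, self.local_generation) {
            // Local-first: trust the cached copy, even if another instance bumped it.
            (false, Some(local)) => local,
            // Provider-first (or nothing cached yet): ask the shared provider.
            _ => {
                let latest = self.provider.borrow().generation;
                self.local_generation = Some(latest);
                latest
            }
        }
    }

    /// Returns the cached value only if its tag matches the generation in use.
    fn read(&mut self, path: &str) -> Option<String> {
        let generation = self.current_generation();
        let provider = self.provider.borrow();
        let (tag, value) = provider.entries.get(path)?;
        if *tag == generation { Some(value.clone()) } else { None }
    }

    fn put(&mut self, path: &str, value: &str) {
        let generation = self.current_generation();
        self.provider
            .borrow_mut()
            .entries
            .insert(path.to_owned(), (generation, value.to_owned()));
    }

    /// Invalidates everything by bumping the shared generation.
    fn remove_all(&mut self) {
        let mut provider = self.provider.borrow_mut();
        provider.generation += 1;
        self.local_generation = Some(provider.generation);
    }
}

/// Two wrappers share one provider; the second invalidates, the first reads.
fn demo(provider_first: bool) -> Option<String> {
    let shared = Rc::new(RefCell::new(Provider::default()));
    let mut first = Wrapper::new(Rc::clone(&shared), provider_first);
    let mut second = Wrapper::new(Rc::clone(&shared), provider_first);
    first.put("/tree/leaf.txt", "old");
    second.remove_all();
    first.read("/tree/leaf.txt")
}

fn main() {
    println!("local-first read:    {:?}", demo(false));
    println!("provider-first read: {:?}", demo(true));
}
```

In the local-first run the first wrapper still serves the stale entry, while the provider-first run reports a miss, which in a real wrapper would fall through to the backend. The cost of provider-first in this model is one extra provider lookup per read.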

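P2 .cargo/config.toml:1-5
This PR forces the crates.io source for the entire workspace to the Aliyun mirror. The change affects dependency resolution for every developer and for CI, and it has no direct relation to the cache-layer implementation itself. It is a repo-level environment policy and should not be merged together with a feature PR.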

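P2 crates/ragfs-cache-yuanrong/src/client.rs:16-76 and crates/ragfs-cache-yuanrong-sys/native/yuanrong_bridge.cpp:24-27,170,185,211,224,238,276,351,391,415
The Rust layer exposes sdk_concurrency, which appears to allow concurrent calls, but the C++ bridge hangs a global call_mutex on YrClientHandle, so every health/get/set/delete/exists/mget/mset/mdelete/shutdown call is serialized. In other words, a single YuanrongProvider instance crosses the bridge through one channel, and sdk_concurrency does not buy real backend concurrency in native mode. This needs to be spelled out at least in the code or the docs; otherwise performance expectations will be overestimated.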
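The effect of one lock shared by every operation can be shown with a small standard-library model. The `Bridge` type below is a hypothetical stand-in for the native handle, and the caller threads play the role of calls already admitted by a concurrency limit such as sdk_concurrency; the sketch measures how many calls are ever inside the bridge at once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a native bridge handle with one call-wide lock.
struct Bridge {
    call_mutex: Mutex<()>,
    in_flight: AtomicUsize,
    max_in_flight: AtomicUsize,
}

impl Bridge {
    fn call(&self) {
        // Every operation takes the same lock, so calls cannot overlap.
        let _guard = self.call_mutex.lock().expect("call mutex poisoned");
        let now = self.in_flight.fetch_add(1, Ordering::SeqCst) + 1;
        self.max_in_flight.fetch_max(now, Ordering::SeqCst);
        thread::sleep(Duration::from_millis(5)); // simulated SDK work
        self.in_flight.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Runs `callers` threads against one bridge and reports the peak overlap.
fn peak_concurrency(callers: usize) -> usize {
    let bridge = Arc::new(Bridge {
        call_mutex: Mutex::new(()),
        in_flight: AtomicUsize::new(0),
        max_in_flight: AtomicUsize::new(0),
    });
    let handles: Vec<_> = (0..callers)
        .map(|_| {
            let bridge = Arc::clone(&bridge);
            thread::spawn(move || bridge.call())
        })
        .collect();
    for handle in handles {
        handle.join().expect("caller thread panicked");
    }
    bridge.max_in_flight.load(Ordering::SeqCst)
}

fn main() {
    println!("peak overlap with 8 callers: {}", peak_concurrency(8));
}
```

However many callers are admitted, the peak overlap inside the bridge stays at one in this model, so total time grows linearly with the number of calls.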

Open Questions / Assumptions

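The PR title says Redis/Mooncake/Yuanrong cache providers, but I do not see a Redis adapter or Redis test code in this version of the diff; if Redis is not yet part of this PR, the title and description currently over-claim.
The Mooncake/Yuanrong known_keys.lock().unwrap() issue I raised last time has been fixed in this version, so I no longer count it as a finding.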

tuofang commented Jun 12, 2026

Author

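1. current_generation() is now provider-first: when the provider has a generation, the provider's value always wins. commit
2. Removed. commit
3. Added a code comment explaining that a single native YuanrongProvider instance serializes SDK calls at the C++ bridge, so sdk_concurrency > 1 does not yield real Yuanrong backend concurrency. sdk_concurrency can still provide concurrent channels for the native client, which could be implemented through a native client pool (nativeclientpool), but that does not guarantee a global order for conflicting concurrent writes. commit
4. Added a RedisProvider implementation. commit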
