feat(retrieve): add knowledge-graph-aware scoring to retrieval pipeline #2555

Open
huang-yi-dae wants to merge 1 commit into volcengine:main from huang-yi-dae:feature/kg-retrieval-scoring-integration

@huang-yi-dae

Summary

  • Add graph_alpha and graph_saturation_k config fields to RetrievalConfig, enabling optional graph-aware scoring
  • New graph_loader.py module loads relation data concurrently from two sources: .relations.json and MEMORY_FIELDS.links/backlinks
  • Integrate graph scoring into HierarchicalRetriever._convert_to_matched_contexts with lazy loading (graph data is loaded only for the top candidates), blending graph_score = tanh(total_relations / graph_saturation_k) into the final score
  • VikingFS passes viking_fs=self to both find() and search() retriever construction calls
  • Default graph_alpha=0 preserves full backward compatibility - no behavior change when disabled
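
For reviewers, a minimal sketch of the saturation and blending described in the list above. The tanh term is taken from the summary. The linear blend with graph_alpha is an assumption, because the summary does not state the exact blending formula. `blend_score` is a hypothetical helper name and `graph_saturation_k=5.0` is an illustrative default, neither is taken from the PR.

```python
import math


def graph_score(total_relations: int, graph_saturation_k: float) -> float:
    """Saturating connectivity score in [0, 1), as stated in the summary."""
    return math.tanh(total_relations / graph_saturation_k)


def blend_score(
    base_score: float,
    total_relations: int,
    graph_alpha: float = 0.0,
    graph_saturation_k: float = 5.0,  # illustrative default, not from the PR
) -> float:
    """Hypothetical linear blend; the PR's actual formula may differ."""
    if graph_alpha <= 0:
        # Graph scoring disabled: return the original score unchanged.
        return base_score
    g = graph_score(total_relations, graph_saturation_k)
    return (1.0 - graph_alpha) * base_score + graph_alpha * g
```

Under this assumed blend, graph_alpha=0 returns the base score exactly, and the tanh term means each additional relation adds less than the one before it, so heavily linked entries cannot dominate the ranking.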

Test plan

  • test_convert_to_matched_contexts_returns_empty_relations - backward compat, graph_alpha=0 keeps relations=[]
  • test_graph_alpha_zero_returns_empty_relations - explicit zero, VikingFS present but not invoked
  • test_graph_scoring_with_relations_json - .relations.json loading + tanh blending
  • test_graph_scoring_with_memory_file_links - MEMORY_FIELDS links/backlinks parsing from .md files
  • test_graph_lazy_loading - only top candidates trigger graph data I/O

All 14 tests pass (10 existing + 4 new).
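
The lazy-loading behavior checked by test_graph_lazy_loading can be illustrated with a small stand-alone sketch. This is not the test from the PR: `load_graph_for_top`, `top_n`, and `fake_loader` are hypothetical names, and the real retriever may select its top candidates differently.

```python
from typing import Callable, Dict, List, Tuple


def load_graph_for_top(
    candidates: List[Tuple[str, float]],
    top_n: int,
    loader: Callable[[List[str]], Dict[str, int]],
) -> Dict[str, int]:
    """Call `loader` only for the `top_n` highest-scoring URIs."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    top_uris = [uri for uri, _ in ranked[:top_n]]
    # Skip the loader entirely when there is nothing to load.
    return loader(top_uris) if top_uris else {}


# Fake loader that records which URIs it was asked about.
requested: List[str] = []


def fake_loader(uris: List[str]) -> Dict[str, int]:
    requested.extend(uris)
    return {uri: 3 for uri in uris}


result = load_graph_for_top(
    [("a", 0.9), ("b", 0.2), ("c", 0.7)], top_n=2, loader=fake_loader
)
assert requested == ["a", "c"]  # "b" never triggers graph I/O
```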

Integrate graph connectivity (from .relations.json and MEMORY_FIELDS
links/backlinks) into the retrieval scoring pipeline. When graph_alpha > 0,
top candidates get a graph_score blended via tanh saturation, boosting
well-connected results. Default graph_alpha=0 preserves existing behavior.
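
A rough sketch of loading from the two sources concurrently, assuming an asyncio-based loader. The two source coroutines are stubs with hypothetical names and made-up return values. The on-disk format of .relations.json and the MEMORY_FIELDS parsing are not shown here, and the order-preserving de-duplication is an assumption rather than confirmed behavior of graph_loader.py.

```python
import asyncio
from typing import List


async def load_relations_json(uri: str) -> List[str]:
    """Stub standing in for reading .relations.json (hypothetical)."""
    await asyncio.sleep(0)
    return [f"{uri}/related-a"]


async def load_memory_links(uri: str) -> List[str]:
    """Stub standing in for parsing links/backlinks (hypothetical)."""
    await asyncio.sleep(0)
    return [f"{uri}/related-a", f"{uri}/linked-b"]


async def load_relations(uri: str) -> List[str]:
    # Run both source lookups concurrently and wait for both results.
    from_json, from_links = await asyncio.gather(
        load_relations_json(uri), load_memory_links(uri)
    )
    # Merge in order, dropping duplicates (assumed behavior).
    return list(dict.fromkeys(from_json + from_links))
```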

🤖 Generated with [Qoder](https://qoder.com)

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 90
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Redundant Truncation

The relations list is truncated twice: once in load_graph_data_for_uris (to max_relations_per_uri) and again in _convert_to_matched_contexts (to self.MAX_RELATIONS). This is redundant and could cause confusion if the two limits differ.

    max_relations_per_uri=self.MAX_RELATIONS,
)

for mc in graph_candidates:
    gd = uri_to_graph.get(mc.uri)
    if gd is None:
        graph_score = 0.0
        mc.relations = []
    else:
        graph_score = math.tanh(gd.total_count / self.graph_saturation_k)
        mc.relations = gd.relations[: self.MAX_RELATIONS]
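
One possible way to address this, sketched under assumptions: truncate once in the loader and let the caller use the list as returned. `GraphData` mirrors the two fields used in the excerpt above (`relations`, `total_count`). `build_graph_data`, `attach_relations`, and the value of `MAX_RELATIONS` are hypothetical, as is the assumption that `total_count` holds the count before truncation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence

MAX_RELATIONS = 5  # hypothetical value; the PR's constant is not shown


@dataclass
class GraphData:
    relations: List[str] = field(default_factory=list)
    total_count: int = 0  # assumed to be the count before truncation


def build_graph_data(
    all_relations: Sequence[str], max_relations_per_uri: int
) -> GraphData:
    """Loader side: the only place where the list is truncated."""
    return GraphData(
        relations=list(all_relations[:max_relations_per_uri]),
        total_count=len(all_relations),
    )


def attach_relations(gd: Optional[GraphData]) -> List[str]:
    """Caller side: no second slice, the loader already applied the limit."""
    return [] if gd is None else gd.relations
```

With a single truncation point, the two limits can no longer drift apart, and the score still uses the full `total_count`.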

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.
