<p><strong>Instructions:</strong> To view the interactive diagram, render the Mermaid diagram from <a href="architecture.md">architecture.md</a> into an SVG file named <code>architecture.svg</code> and place it in the same directory as this HTML file. You can use the <a href="https://mermaid.live" target="_blank">Mermaid Live Editor</a> for this.</p>
<h2>2. Narrative on Capabilities and Performance</h2>
<h3>Overview</h3>
<p>CodeGraph is a codebase intelligence platform built on MCP (Model Context Protocol) and designed to turn any compatible Large Language Model (LLM) into a codebase expert. It does this through semantic analysis powered by the Qwen2.5-Coder-14B-128K model. The system is built with a local-first philosophy, for both privacy and performance.</p>
<h3>Core Capabilities</h3>
<ul>
<li><strong>Semantic Intelligence</strong>: CodeGraph uses the Qwen2.5-Coder-14B model with a 128K context window as the basis for its semantic understanding of the codebase.</li>
<li><strong>Single-Pass Edge Processing</strong>: A unified Abstract Syntax Tree (AST) parsing approach extracts both nodes (code symbols) and edges (relationships) in a single pass, which speeds up processing.</li>
<li><strong>AI-Enhanced Symbol Resolution</strong>: Links code entities with an 85-90% success rate using a multi-tiered approach whose final tier applies semantic similarity matching to symbols that the earlier tiers cannot resolve.</li>
<li><strong>Conversational AI (RAG)</strong>: The system provides a Retrieval-Augmented Generation (RAG) engine, enabling users to interact with their codebase using natural language. This is exposed through tools like <code>codebase_qa</code> and <code>code_documentation</code>.</li>
<li><strong>Intelligent Caching</strong>: A caching layer that uses semantic similarity matching to reach cache hit rates of 50-80% or higher, which speeds up subsequent queries.</li>
<li><strong>Pattern Detection</strong>: An ML pipeline analyzes team conventions and coding patterns, providing insights into codebase health and consistency.</li>
<li><strong>MCP Protocol Integration</strong>: CodeGraph is compatible with any MCP-enabled agent, including Claude Code, Codex CLI, and Gemini CLI, so it fits into existing developer workflows.</li>
</ul>
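<p>The idea behind a cache keyed by semantic similarity can be sketched as follows. This is a minimal illustration, not CodeGraph's actual implementation: the type <code>SemanticCache</code>, the function <code>cosine</code>, the linear scan, and the fixed similarity threshold are all assumptions introduced here. A lookup returns the stored result of the most similar earlier query, provided its cosine similarity to the new query's embedding reaches the threshold.</p>

```rust
/// A cached result, keyed by the embedding of the query that produced it.
struct Entry {
    embedding: Vec<f32>,
    result: String,
}

/// Hypothetical similarity-keyed cache (illustrative only, not a CodeGraph type).
pub struct SemanticCache {
    entries: Vec<Entry>,
    threshold: f32,
}

/// Cosine similarity of two vectors assumed to have equal length.
/// Returns 0.0 if either vector has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

impl SemanticCache {
    /// `threshold` is the minimum cosine similarity that counts as a hit.
    pub fn new(threshold: f32) -> Self {
        SemanticCache { entries: Vec::new(), threshold }
    }

    pub fn insert(&mut self, embedding: Vec<f32>, result: String) {
        self.entries.push(Entry { embedding, result });
    }

    /// Returns the result of the most similar cached query,
    /// or `None` if no entry reaches the threshold.
    pub fn get(&self, query: &[f32]) -> Option<&str> {
        self.entries
            .iter()
            .map(|e| (cosine(&e.embedding, query), e))
            .filter(|(score, _)| *score >= self.threshold)
            .max_by(|a, b| a.0.total_cmp(&b.0))
            .map(|(_, e)| e.result.as_str())
    }
}
```

<p>A linear scan keeps the sketch short; a real cache over many entries would more plausibly reuse a vector index for the lookup.</p>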
<h3>Architecture Deep Dive</h3>
<p>The CodeGraph system is a modular, multi-crate Rust workspace, designed for performance, maintainability, and scalability.</p>
<h4>Component Breakdown:</h4>
<ul>
<li><span class="component" data-component-id="A"><code>codegraph-core</code></span>: The foundational crate of the entire system. It defines the core data structures, traits, and types used across all other components, ensuring a consistent data model. It has no internal dependencies.</li>
<li><span class="component" data-component-id="B"><code>codegraph-parser</code></span>: Parses source code into ASTs using Tree-sitter. It supports 11 programming languages and performs the initial extraction of semantic nodes and their relationships (edges).</li>
<li><span class="component" data-component-id="C"><code>codegraph-graph</code></span>: Manages the storage and retrieval of the code graph data (nodes and edges) using RocksDB, a high-performance embedded key-value store. It provides the backbone for dependency analysis and architectural exploration.</li>
<li><span class="component" data-component-id="D"><code>codegraph-vector</code></span>: Creates vector embeddings from code snippets and provides fast similarity search using FAISS. It supports multiple embedding providers, including local ONNX models and Ollama.</li>
<li><span class="component" data-component-id="E"><code>codegraph-ai</code></span>: The intelligence layer of the system. It integrates with the Qwen model and uses the data from the graph and vector stores to provide features such as AI-powered symbol resolution, impact analysis, and semantic search.</li>
<li><span class="component" data-component-id="F"><code>codegraph-mcp</code></span>: The main entry point for the command-line interface (CLI) and the primary MCP server. It orchestrates the other components to deliver the full suite of CodeGraph tools.</li>
<li><span class="component" data-component-id="G"><code>codegraph-api</code></span>: Provides a REST and GraphQL API server (using Axum) for programmatic access to CodeGraph's capabilities, allowing integration with external tools and services.</li>
<li><span class="component" data-component-id="H"><code>core-rag-mcp-server</code></span>: A dedicated, production-ready MCP server that exposes the RAG (Retrieval-Augmented Generation) functionality, enabling conversational AI features.</li>
<li><span class="component" data-component-id="I"><code>codegraph-cache</code></span>: An AI-powered caching system that stores and retrieves results from vector operations, improving performance for repeated or similar queries.</li>
<li><strong>Utility Crates</strong>:
<ul>
<li><span class="component" data-component-id="J"><code>codegraph-concurrent</code></span>: Provides concurrent data structures and utilities for parallel processing.</li>
<li><span class="component" data-component-id="K"><code>codegraph-git</code></span>: Integrates with Git repositories to enable features like incremental indexing based on file changes.</li>
<li><span class="component" data-component-id="L"><code>codegraph-queue</code></span>: A priority queue system for managing tasks and operations.</li>
<li><span class="component" data-component-id="M"><code>codegraph-lb</code></span>: An intelligent load balancer for distributing requests and managing resources.</li>
<li><span class="component" data-component-id="N"><code>codegraph-zerocopy</code></span>: Implements zero-copy data structures and serialization for highly efficient data handling.</li>
</ul>
</li>
</ul>
<h4>Data Flow (Indexing):</h4>
<ol>
<li>The <code>codegraph index</code> command is initiated via the <code>codegraph-mcp</code> CLI.</li>
<li><code>codegraph-parser</code> recursively scans the target directory, parsing files for supported languages into ASTs.</li>
<li>In the same pass, the parser extracts semantic nodes (functions, classes, etc.) and edges (calls, imports).</li>
<li>The extracted nodes and edges are sent to <code>codegraph-graph</code>, which stores them in a RocksDB database.</li>
<li>The semantic nodes are also passed to <code>codegraph-vector</code>, which generates 384-dimensional vector embeddings using the configured provider (ONNX or Ollama).</li>
<li>These embeddings are stored in a FAISS index for fast similarity search.</li>
</ol>
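<p>The single-pass extraction in the steps above can be illustrated with a small sketch. It assumes a toy syntax tree in place of a real Tree-sitter AST; the types <code>Ast</code>, <code>Edge</code>, and <code>Extracted</code> and the function <code>extract</code> are hypothetical names chosen for this example, not part of CodeGraph. The point it shows is that one traversal can record symbols and relationships in the same visit, so no second walk over the tree is needed to build edges.</p>

```rust
/// Toy syntax tree standing in for a Tree-sitter AST (hypothetical shape).
pub enum Ast {
    Function { name: String, body: Vec<Ast> },
    Call { callee: String },
    Import { module: String },
}

/// Relationships between symbols.
#[derive(Debug, PartialEq)]
pub enum Edge {
    Calls { from: String, to: String },
    Imports { module: String },
}

/// Everything collected during the traversal.
#[derive(Debug, Default, PartialEq)]
pub struct Extracted {
    pub nodes: Vec<String>,
    pub edges: Vec<Edge>,
}

/// Walks the tree once, collecting nodes and edges together.
pub fn extract(items: &[Ast]) -> Extracted {
    let mut out = Extracted::default();
    for item in items {
        visit(item, None, &mut out);
    }
    out
}

/// Visits one node; `enclosing` is the name of the surrounding function, if any.
fn visit(node: &Ast, enclosing: Option<&str>, out: &mut Extracted) {
    match node {
        Ast::Function { name, body } => {
            // Record the symbol, then descend with this function as the context.
            out.nodes.push(name.clone());
            for child in body {
                visit(child, Some(name.as_str()), out);
            }
        }
        Ast::Call { callee } => {
            // Calls outside any function are attributed to a module-level pseudo-symbol.
            let from = enclosing.unwrap_or("<module>").to_string();
            out.edges.push(Edge::Calls { from, to: callee.clone() });
        }
        Ast::Import { module } => {
            out.edges.push(Edge::Imports { module: module.clone() });
        }
    }
}
```

<p>In a two-phase design, the nodes would be collected first and the tree walked again to link them; here the call edge is emitted while the enclosing function is still on the traversal stack.</p>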
<h3>Performance Analysis</h3>
<p>CodeGraph is engineered for high performance, especially on modern, high-memory systems.</p>
<ul>
<li><strong>Indexing Speed</strong>: Parsing and indexing are fast. For instance, the system can process over 170,000 lines of code in just under half a second. The single-pass extraction process contributes a 50% speed improvement over traditional two-phase methods.</li>
<li><strong>Embedding Performance</strong>: The choice of embedding provider offers a trade-off between speed and quality.
<ul>
<li><strong>ONNX (<code>all-MiniLM-L6-v2</code>)</strong>: The faster option, capable of indexing a 2.5 million line codebase in about 32 minutes. It suits large codebases and rapid, iterative development.</li>
<li><strong>Ollama (<code>nomic-embed-code</code>)</strong>: Provides code-specialized embeddings aimed at maximum retrieval accuracy, at the cost of slower embedding generation.</li>
</ul>
</li>
<li><strong>High-Memory Optimization</strong>: The system automatically detects the available system memory and adjusts its performance parameters accordingly. On a 128 GB M4 Max system, it can raise the number of workers to 16 and the batch size to 20,480.</li>
<li><strong>Query Latency</strong>: Vector searches with FAISS typically complete in under a second, and the caching layer reduces latency for repeated queries to milliseconds.</li>
</ul>
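<p>To put the parsing and embedding figures above on a common scale, the following sketch converts them to lines per second. The helper <code>lines_per_second</code> is a hypothetical name introduced for this example. The inputs are the figures quoted above taken at face value, with "just under half a second" rounded to 0.5 s and "about 32 minutes" to 1,920 s, so the results are rough estimates and not measurements.</p>

```rust
/// Throughput in lines per second for `lines` lines processed in `seconds` seconds.
pub fn lines_per_second(lines: u64, seconds: f64) -> f64 {
    lines as f64 / seconds
}

/// Parsing figure quoted above: 170,000 lines in roughly 0.5 s.
pub fn parse_throughput() -> f64 {
    lines_per_second(170_000, 0.5)
}

/// ONNX indexing figure quoted above: 2.5 million lines in roughly 32 minutes.
pub fn onnx_index_throughput() -> f64 {
    lines_per_second(2_500_000, 32.0 * 60.0)
}
```

<p>With these inputs, parsing works out to roughly 340,000 lines per second and ONNX-based indexing to roughly 1,300 lines per second, which suggests that embedding generation, not parsing, accounts for most of the end-to-end indexing time.</p>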
<h3>Conclusion</h3>
<p>CodeGraph's architecture is modular and combines modern AI capabilities with high-performance engineering. Its local-first approach, together with its semantic analysis and conversational AI features, makes it a useful tool for developers who want a deeper understanding of their codebases. The system is also highly configurable, allowing users to balance performance and accuracy to suit their specific needs.</p>
This document provides a detailed overview of the CodeGraph system architecture, including a component-level dependency diagram.

## 1. Interactive Architecture Diagram

An interactive and animated version of the architecture diagram is available in [`architecture.html`](architecture.html).

To view it, open the HTML file in your browser. For the interactivity to work, you will first need to generate the `architecture.svg` file by rendering the Mermaid diagram below using the [Mermaid Live Editor](https://mermaid.live).

## 2. Component-Level Dependency Architecture

The following diagram illustrates the dependencies between the various crates (components) in the CodeGraph system.