Conversation

@kaiming-cheng (Contributor) commented on Jan 27, 2026

This PR introduces a hierarchical optimization database that stores GPU kernel optimization techniques and code examples for RAG-based optimization.

Key components:

  • OptNode / OptHierarchy: Tree structure organizing optimizations by bottleneck type (latency, memory, utilization) → technique → code example
  • docs/: Optimization technique documentation (TMA, PID swizzling, persistence)
  • code_samples/: Reference Triton kernel implementations (matmul, matadd with various optimizations applied)

Optimization techniques covered:

  • Host-side and device-side Tensor Memory Accelerator (TMA)
  • PID swizzling for L2 cache locality
  • Persistent kernel programming style

This database enables the agent to retrieve relevant optimization strategies and reference implementations based on diagnosed performance bottlenecks.
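
For a sense of the shape, here is a minimal sketch of how the hierarchy fits together. add_children, add_parents, and opt_desc match names in the actual diff; the constructor, other field names, and the example wiring are illustrative assumptions, not the PR's exact implementation.

# Minimal sketch of the hierarchy shape; details beyond the diff's names are assumed.
class OptNode:
    def __init__(self, name: str, opt_desc: str):
        self.name = name
        self.opt_desc = opt_desc            # description (or code) used for retrieval
        self.opt_children: list["OptNode"] = []
        self.opt_parents: list["OptNode"] = []

    def add_children(self, child_nodes: list["OptNode"]) -> None:
        """Adds child nodes to the current node."""
        self.opt_children.extend(child_nodes)

    def add_parents(self, parent_nodes: list["OptNode"]) -> None:
        """Adds parent nodes to the current node."""
        self.opt_parents.extend(parent_nodes)

# Level 1 (bottleneck) -> Level 2 (technique) -> Level 3 (code example)
root = OptNode("root", "GPU kernel optimization")
memory = OptNode("memory", "memory-bound bottlenecks")
tma = OptNode("tma", "on-device Tensor Memory Accelerator (TMA)")
root.add_children([memory]); memory.add_parents([root])
memory.add_children([tma]); tma.add_parents([memory])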

Test

query = "use TMA for memory optimization"
prescriber = RAGPrescriber()
opt_node, similarities = prescriber.retrieve(query)
Retrieved: 
============================= On-Device Tensor Memory Accel... (similarity: 0.573)

Generated context (4620 chars):
--------------------------------------------------------------------------------
## Optimization Technique
...
context = prescriber.build_context(opt_node, max_code_examples=1, max_chars=2000)
## Code Examples
...
 
add_kernel[grid](
        x,
        y,
        output,
        M,
        N,
        x.stride(0),
        x.stride(1),
        BLOCK_SIZE_M=BLOCK_SIZE_M,
        BLOCK_SIZE_N=BLOCK_SIZE_N,
    )
    return output

(Showing 1 of 2 examples)

meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Jan 27, 2026.
@kaiming-cheng changed the title from "[Optimization 7/n] Add Database in Kernel_opt" to "[Optimization 7/n] Add Knowledge Database to Kernel optimization" on Jan 27, 2026.
@kaiming-cheng force-pushed the kaiming/opt_component_7_clean branch from b9cb0d7 to 84708fd on January 28, 2026 at 00:29.
@Jack-Khuu (Contributor) left a comment:

Looks good. Is it hard to also add the integration code that uses the RAG into this PR?

Remember to cite sources for the code_samples/docs.

  • Drop [Optimization 7/n] from the title just to avoid confusion

"""Adds a child node to the current node."""
self.opt_parents.extend(parent_nodes)

def remove_parents(self, parent_nodes):
Contributor:

Do we need this for any reason?

@kaiming-cheng (Contributor, author):

Good call - I've removed it in the following commit.

Comment on lines 105 to 109
level_1_opts = [optnode_latency, optnode_memory, optnode_utilization]
self.root.add_children(level_1_opts)
optnode_latency.add_parents([self.root])
optnode_memory.add_parents([self.root])
optnode_utilization.add_parents([self.root])
Contributor:

nit: For legibility, can we add a helper like add_relation or something that updates the child and parent symmetrically?

It's easy to parse here, but level 3 is harder to parse.
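
Something like this hypothetical helper (name and signature are assumptions) would keep both sides in sync:

def add_relation(parent: OptNode, children: list[OptNode]) -> None:
    """Link a parent and its children symmetrically in one call."""
    parent.add_children(children)
    for child in children:
        child.add_parents([parent])

# The level-1 wiring above would then collapse to one line:
# add_relation(self.root, [optnode_latency, optnode_memory, optnode_utilization])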

@kaiming-cheng changed the title from "[Optimization 7/n] Add Knowledge Database to Kernel optimization" to "Add Knowledge Database to Kernel optimization" on Jan 28, 2026.

# Default path
if database_path is None:
    database_path = (
Contributor:

database_path seems wrong; use Path(__file__).resolve().parents[...] and walk up until you hit the project root (where pyproject.toml is).
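
One way to do that (the helper name is an assumption; it just walks parents until it finds pyproject.toml):

from pathlib import Path

def find_project_root(marker: str = "pyproject.toml") -> Path:
    """Walk up from this file until a directory containing `marker` is found."""
    here = Path(__file__).resolve()
    for parent in here.parents:
        if (parent / marker).exists():
            return parent
    raise FileNotFoundError(f"no {marker} found above {here}")

# database_path = find_project_root() / "kernel_perf_agent" / "kernel_opt" / "database"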

    return 0.0
return dot_product / (norm_vec1 * norm_vec2)

def retrieve(self, opt_prompt: str) -> tuple[OptNode | None, dict[OptNode, float]]:
Contributor:

retrieve() calls embeddings.embed_query(node.opt_desc) for every node on each call. Some nodes include full code examples, which is slow and costly.
Precompute embeddings once at init and cache them per node, or at least cache an in-memory dict {OptNode: embedding} after the first compute.
Also consider embedding only the L1/L2 text nodes for retrieval, then traversing down for code examples; embedding code blobs is noisy and expensive.
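
A sketch of the caching idea; embed_query matches the call above, while the cache dict and helper name are illustrative:

class RAGPrescriber:
    def __init__(self, embeddings):
        self.embeddings = embeddings
        # filled lazily: each node's opt_desc is embedded at most once
        self._embedding_cache: dict[OptNode, list[float]] = {}

    def _node_embedding(self, node: OptNode) -> list[float]:
        if node not in self._embedding_cache:
            self._embedding_cache[node] = self.embeddings.embed_query(node.opt_desc)
        return self._embedding_cache[node]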


return best_node, opt_similarity

def build_context(self, opt_node: OptNode) -> str:
Contributor:

It traverses from the selected node down and concatenates every descendant's opt_desc, including entire code files. That will quickly blow context limits and drown out the signal.
Put a max character/token budget in place and stop after N leaf examples.
Add separators between nodes (right now it just concatenates).
Optionally include only (a) the technique description and (b) the top-k leaf code examples.
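
A sketch of a budgeted version; the max_code_examples/max_chars parameters match the test output in the PR description, while the traversal details are assumptions:

def build_context(self, opt_node: OptNode, max_code_examples: int = 1,
                  max_chars: int = 2000) -> str:
    parts = [f"## Optimization Technique\n{opt_node.opt_desc}"]
    # assume the leaf children hold the code examples
    leaves = [c for c in opt_node.opt_children if not c.opt_children]
    examples = [leaf.opt_desc[:max_chars] for leaf in leaves[:max_code_examples]]
    if examples:
        parts.append("## Code Examples\n" + "\n\n---\n\n".join(examples))
        parts.append(f"(Showing {len(examples)} of {len(leaves)} examples)")
    return "\n\n".join(parts)  # separators between sections instead of raw concatenation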


from pathlib import Path

from kernel_perf_agent.kernel_opt.database.docs import (
Contributor:

I don’t see a kernel_perf_agent/kernel_opt/database/docs/__init__.py added in this PR.

@kaiming-cheng (Contributor, author):

Thanks for the catch - updated this in c57e13c.

pyproject.toml (outdated)
"python-dotenv",
"gradio>=5.5.0",
"requests",
"langchain-openai",
Contributor:

Adding langchain-openai is a big dependency. If you only need embeddings, consider using the project’s existing LLM client (if any) or a thinner dependency.

If you keep it, I’d suggest pinning compatible versions or adding it as an optional dependency for the RAG feature.

@kaiming-cheng (Contributor, author):

Good point! We can use OpenAI's text-embedding model for simplicity.
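
If we drop langchain-openai, a thin call through the openai client could stand in for embed_query (the model name here is just an example):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_query(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding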
