Agent Discussions: Context Engineering & Context Rot Research

A framework for running multi-turn conversations between two AI agents, built to test different context engineering strategies and investigate context decay (also called context rot) -- the tendency for LLM output quality to degrade as the context window fills up.

Research Questions

How do different context management strategies affect conversation quality over extended dialogues? Sliding window vs. truncation vs. full history -- when does each break down?
At what point does context decay set in? As conversation history grows, do agents start repeating themselves, losing coherence, or hallucinating more?
Can context engineering mitigate degradation? Does summarizing older context, compressing history, or selectively pruning messages preserve quality better than naive truncation?

How It Works

Two AI agents (configurable roles, system prompts, and models) take turns responding to each other through a LangGraph-based orchestrator. The system tracks:

Token usage per turn -- how context consumption grows over time
Response latency -- does generation slow as context fills?
Context utilization -- what percentage of the window is used at each turn?
Conversation quality signals -- repetition, coherence drift, topic abandonment

Config (YAML) ──> Orchestrator ──> LangGraph State Machine
                                         │
              Agent A ◄── Context Manager ──► Agent B
                                │
                          Telemetry Logger
                                │
                    Perf Logs / Token Tracking / Analytics
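
The turn-taking flow in the diagram above can be sketched in plain Python. This is a minimal illustration, not the project's LangGraph orchestrator: `run_conversation`, the `Agent` callable type, and `select_context` are hypothetical names, and it assumes each agent reply counts as one turn.

```python
from typing import Callable, List, Tuple

# Hypothetical agent type: takes the visible context, returns a reply.
Agent = Callable[[List[str]], str]


def run_conversation(
    agent_a: Agent,
    agent_b: Agent,
    opening: str,
    max_turns: int,
    select_context: Callable[[List[str]], List[str]] = lambda history: history,
) -> List[Tuple[str, str]]:
    """Alternate between two agents, passing each a managed view of history."""
    history: List[str] = [opening]
    transcript: List[Tuple[str, str]] = []
    agents = [("A", agent_a), ("B", agent_b)]
    for turn in range(max_turns):
        name, agent = agents[turn % 2]
        # The context manager decides what the agent actually sees.
        context = select_context(history)
        reply = agent(context)
        history.append(reply)
        transcript.append((name, reply))
    return transcript


# Stub agents that only report how much context they received.
stub_a: Agent = lambda ctx: f"A saw {len(ctx)}"
stub_b: Agent = lambda ctx: f"B saw {len(ctx)}"
print(run_conversation(stub_a, stub_b, "hi", max_turns=3))
```

Swapping `select_context` for a function that trims the history is where a context strategy would plug in.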

Context Strategies Tested

Strategy	Description	Hypothesis
Full history	Send entire conversation each turn	Best quality early, degrades or fails at token limits
Sliding window	Keep only the last N tokens of history	Maintains recency but loses early context
Truncation	Drop oldest messages when limit is reached	Abrupt context loss may cause incoherence
Summarization (planned)	Compress older turns into summaries	Best quality retention, but summaries lose nuance
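
To make the difference between the first three strategies concrete, here is one possible reading of them. The function names are hypothetical, tokens are approximated as whitespace-separated words, and the actual logic in context_manager.py may differ.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per word.
    return len(text.split())


def full_history(messages: List[str], limit: int) -> List[str]:
    """Send everything, ignoring the limit."""
    return list(messages)


def sliding_window(messages: List[str], limit: int) -> List[str]:
    """Keep the last `limit` tokens, cutting into a message if needed."""
    kept: List[str] = []
    budget = limit
    for msg in reversed(messages):
        if budget <= 0:
            break
        words = msg.split()
        if len(words) > budget:
            words = words[-budget:]  # keep only the tail of this message
        kept.append(" ".join(words))
        budget -= len(words)
    kept.reverse()
    return kept


def truncation(messages: List[str], limit: int) -> List[str]:
    """Drop whole oldest messages until the total fits the limit."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > limit:
        kept.pop(0)
    return kept


history = ["a b c", "d e", "f g h i"]
print(sliding_window(history, 5))  # ['e', 'f g h i']
print(truncation(history, 5))      # ['f g h i']
```

Under this reading, the sliding window always fills the budget and may keep a partial message, while truncation keeps only whole messages and can leave part of the budget unused.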

Strategies are configured per-experiment via YAML:

conversation:
  max_turns: 20
  context_window_strategy: "sliding"  # or "truncation", "full"
  context_window_size: 8000

model:
  model_name: "gpt-4"
  temperature: 0.7
  max_tokens: 2000

Example Configurations

Pre-built experiment configs in examples/:

Config	Use Case
`research-discussion.yaml`	Deep research dialogue -- tests sustained reasoning
`creative-collaboration.yaml`	Creative writing -- tests coherence and style drift
`technical-architecture.yaml`	System design debate -- tests factual consistency
`performance-optimized.yaml`	Tuned for throughput and token efficiency

Running an Experiment

# Install
pip install -e .

# Set API key
export OPENAI_API_KEY="your-key"

# Run a conversation
agentic-conversation run --config examples/research-discussion.yaml

# Analyze results
agentic-conversation analyze ./logs --format detailed

Architecture

src/agentic_conversation/
├── orchestrator.py       # Turn management, termination conditions
├── conversation_graph.py # LangGraph state machine
├── context_manager.py    # Context window strategies (core research component)
├── token_counter.py      # Accurate token counting via tiktoken
├── agents.py             # Base agent interface
├── langchain_agent.py    # LangChain-based agent implementation
├── circuit_breaker.py    # API resilience (exponential backoff, failure detection)
├── telemetry.py          # Performance monitoring, structured logging
├── config.py             # YAML config loading, env var substitution
└── models.py             # Type-safe data structures

tests/                    # Unit, integration, and performance tests
perf_logs/                # Benchmarking results (token scaling, throughput, concurrency)
examples/                 # Pre-built experiment configurations

Key Components

Context Manager (context_manager.py) -- The core research component. Implements pluggable strategies for managing what goes into the context window at each turn. This is where the context decay mitigation logic lives.
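
One way to make strategies pluggable is a registry keyed by the `context_window_strategy` value from the YAML config. The sketch below is an assumption about how that could look: `register`, `SketchContextManager`, and the character-based size measure are all hypothetical and not taken from the repository.

```python
from typing import Callable, Dict, List

# A strategy maps (history, size) to the messages that should be sent.
Strategy = Callable[[List[str], int], List[str]]
_STRATEGIES: Dict[str, Strategy] = {}


def register(name: str) -> Callable[[Strategy], Strategy]:
    """Decorator that adds a strategy to the registry under `name`."""
    def decorator(fn: Strategy) -> Strategy:
        _STRATEGIES[name] = fn
        return fn
    return decorator


@register("full")
def _full(history: List[str], size: int) -> List[str]:
    return list(history)


@register("truncation")
def _truncation(history: List[str], size: int) -> List[str]:
    # Size is measured in characters here purely to keep the sketch short.
    kept = list(history)
    while kept and sum(len(m) for m in kept) > size:
        kept.pop(0)
    return kept


class SketchContextManager:
    """Hypothetical manager that dispatches to a registered strategy."""

    def __init__(self, strategy: str, size: int) -> None:
        if strategy not in _STRATEGIES:
            raise ValueError(f"unknown strategy: {strategy!r}")
        self._select = _STRATEGIES[strategy]
        self._size = size
        self._history: List[str] = []

    def add(self, message: str) -> None:
        self._history.append(message)

    def build_context(self) -> List[str]:
        return self._select(self._history, self._size)


manager = SketchContextManager("truncation", size=10)
manager.add("hello")
manager.add("world!")
print(manager.build_context())  # ['world!']
```

With this layout, a new strategy such as summarization would only need a new registered function, leaving the manager untouched.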

Token Counter (token_counter.py) -- Accurate token counting using tiktoken. Tracks per-turn and cumulative token usage to show where quality degradation correlates with context saturation.
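
Per-turn and cumulative tracking might look roughly like the following. `TokenTracker` is a hypothetical name, and the counting function is injected so the sketch runs without tiktoken installed; the word-count lambda is only a placeholder for a real tokenizer.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TokenTracker:
    """Hypothetical tracker for per-turn and cumulative token usage."""

    count_fn: Callable[[str], int]
    window_size: int
    per_turn: List[int] = field(default_factory=list)

    def record(self, text: str) -> int:
        """Count the tokens in one turn and remember the result."""
        n = self.count_fn(text)
        self.per_turn.append(n)
        return n

    @property
    def cumulative(self) -> int:
        return sum(self.per_turn)

    def utilization(self, context_tokens: int) -> float:
        """Percentage of the context window occupied by `context_tokens`."""
        return 100.0 * context_tokens / self.window_size


# Placeholder counter. With tiktoken installed, one could instead pass:
#   enc = tiktoken.encoding_for_model("gpt-4")
#   count_fn = lambda s: len(enc.encode(s))
tracker = TokenTracker(count_fn=lambda s: len(s.split()), window_size=8000)
tracker.record("one two three")
tracker.record("four five")
print(tracker.per_turn, tracker.cumulative, tracker.utilization(2000))
```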

Telemetry (telemetry.py) -- Structured JSON logs capturing response times, token counts, context utilization, and error rates per turn. Enables post-hoc analysis of conversation quality over time.
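
A per-turn log record could be shaped along these lines. The field names and the `TurnRecord` class are illustrative assumptions, not the schema telemetry.py actually emits.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class TurnRecord:
    """Hypothetical per-turn telemetry record."""

    turn: int
    agent: str
    response_time_s: float
    prompt_tokens: int
    completion_tokens: int
    context_utilization_pct: float
    error: Optional[str] = None


def to_json_line(record: TurnRecord) -> str:
    """Serialize one record as a single JSON line for a log file."""
    return json.dumps(asdict(record), sort_keys=True)


def error_rate(records: List[TurnRecord]) -> float:
    """Fraction of turns that ended in an error."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.error is not None) / len(records)


records = [
    TurnRecord(1, "agent_a", 0.5, 100, 20, 1.5),
    TurnRecord(2, "agent_b", 0.1, 0, 0, 0.0, error="timeout"),
]
for r in records:
    print(to_json_line(r))
print(error_rate(records))  # 0.5
```

Writing one JSON object per line keeps the logs easy to stream into an analysis step after the run.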

Circuit Breaker (circuit_breaker.py) -- Handles API failures gracefully so long-running multi-turn experiments don't crash mid-conversation.
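
The combination of exponential backoff and failure detection can be illustrated with a deliberately simplified breaker. It is a sketch under stated assumptions: the class and parameter names are hypothetical, and there is no half-open state or reset timeout, which a production breaker would normally have.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when too many consecutive failures have occurred."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        base_delay: float = 1.0,
        max_delay: float = 30.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._sleep = sleep  # injectable so tests need not actually wait
        self.consecutive_failures = 0

    def backoff_delay(self, attempt: int) -> float:
        """Delay doubles with each attempt, capped at max_delay."""
        return min(self.max_delay, self.base_delay * (2 ** attempt))

    def call(self, fn: Callable[[], T]) -> T:
        """Run fn, retrying with backoff until it succeeds or the breaker opens."""
        if self.consecutive_failures >= self.failure_threshold:
            raise CircuitOpenError("circuit is open")
        attempt = 0
        while True:
            try:
                result = fn()
            except Exception as exc:
                self.consecutive_failures += 1
                if self.consecutive_failures >= self.failure_threshold:
                    raise CircuitOpenError("too many failures") from exc
                self._sleep(self.backoff_delay(attempt))
                attempt += 1
            else:
                self.consecutive_failures = 0  # success closes the circuit
                return result
```

A caller would wrap each API request, for example `breaker.call(lambda: client_request())`, where `client_request` stands in for whatever function performs the model call.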

Performance Benchmarks

The perf_logs/ directory contains benchmarking results across:

Single conversation runs at varying context sizes
Concurrent multi-conversation scaling (2, 5, 10, 20 simultaneous)
Throughput under sustained load
Context scaling tests (measuring latency as context grows)

Tech Stack

Python, LangGraph, LangChain, tiktoken, OpenAI/Anthropic APIs, YAML configuration

Status

This is an active research proof of concept (POC). The framework is functional and instrumented. Next steps:

Implement summarization-based context strategy
Build automated quality scoring (repetition detection, coherence metrics)
Run systematic comparisons across context strategies with controlled topics
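
As a starting point for the planned repetition detection, one simple candidate metric is the share of a response's word n-grams that already appeared in earlier responses. This is only one possible approach, with hypothetical function names, and it is not implemented in the framework.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def repetition_score(response: str, previous: Iterable[str], n: int = 3) -> float:
    """Fraction of the response's n-grams already seen in earlier responses."""
    current = ngrams(response, n)
    if not current:
        return 0.0  # response is shorter than n words
    seen: Set[Tuple[str, ...]] = set()
    for earlier in previous:
        seen |= ngrams(earlier, n)
    return len(current & seen) / len(current)


print(repetition_score("a b c d", ["a b c x"]))  # 0.5
```

A score that climbs steadily across turns would be one signal that an agent has started repeating itself.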
