This project investigates the impact of inference-time compute on the logical reasoning capabilities of Small Language Models (SLMs). By implementing a critique-then-correct loop, the framework evaluates whether agentic workflows can mitigate the inherent reasoning limitations of lower-parameter models.
The primary aim was to determine if an iterative self-reflection cycle could improve accuracy on mathematical reasoning tasks without requiring model fine-tuning. The hypothesis was that providing a model with the opportunity to evaluate its own initial reasoning path would lead to the discovery and correction of arithmetic and logical errors.
Figure 1: Comparative analysis of Task Accuracy and Inference Latency for the Qwen 1.5B model. While the success rate remained consistent at 50%, the agentic overhead resulted in a 12.5% increase in computational steps per question.
The framework is built upon a modular architecture designed for local execution and reproducible benchmarking.
The system utilizes Ollama for local model orchestration, specifically targeting the Qwen 2.5 Coder series.
- Primary Testbed: Qwen 2.5 Coder 1.5B (Optimized for high-speed local inference).
- Comparative Baseline: Qwen 3.5 9B (Utilized to analyze reasoning density at scale).
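All inference calls are routed through Ollama's local HTTP API. The sketch below shows what a single-pass call might look like against Ollama's default endpoint; the model tag, prompt handling, and timeout are illustrative assumptions rather than the project's exact code.

```python
# Minimal sketch of a single local inference call through Ollama's default
# HTTP API. The model tag, timeout, and helper name are illustrative.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str, model: str = "qwen2.5-coder:1.5b") -> str:
    """Send a prompt to a locally served model and return the raw completion."""
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]
```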
Two distinct inference strategies were implemented to isolate the effect of the reflection loop.
- Basic Agent: Performs standard single-pass inference. This serves as the control group, representing the raw reasoning ceiling of the model under zero-shot conditions.
- Reflexive Agent: Implements a multi-step "Internal Monologue" pattern. Upon generating an initial solution, the agent invokes a secondary reflection call to analyze the logic for fallacies. If an error is detected, the agent utilizes the critique to regenerate the solution.
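A minimal sketch of the two strategies, reusing the `generate()` helper from the Ollama sketch above. The prompts and the "NO ERRORS" convention are assumptions for illustration, not the project's exact implementation.

```python
# Illustrative critique-then-correct loop; prompts and the "NO ERRORS"
# convention are assumptions. Reuses generate() from the Ollama sketch above.

def basic_agent(question: str) -> str:
    """Single-pass (zero-shot) solution: the control condition."""
    return generate(f"Solve the following problem step by step:\n{question}")

def reflexive_agent(question: str) -> str:
    """Generate an initial solution, critique it, and regenerate if needed."""
    draft = basic_agent(question)

    critique = generate(
        "Review the reasoning below for arithmetic or logical errors. "
        "Reply with 'NO ERRORS' if the solution is sound; otherwise describe "
        f"the mistake.\n\nProblem: {question}\n\nSolution: {draft}"
    )

    if "NO ERRORS" in critique.upper():
        return draft

    # Use the critique as additional context for a corrected attempt.
    return generate(
        f"Problem: {question}\n\nPrevious attempt: {draft}\n\n"
        f"Critique: {critique}\n\nWrite a corrected solution."
    )
```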
One of the most significant engineering hurdles was the non-deterministic formatting of model outputs: high-capability models often wrap final answers in LaTeX notation or Markdown bolding. I engineered a hierarchical regex parser (sketched after this list) capable of extracting values from:
- LaTeX `\boxed{}` notation.
- Markdown bold headers.
- Trailing numeric values that follow keywords such as "result" or "total."
- Bare numeric strings, with model version identifiers filtered out.
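The sketch below illustrates the extraction hierarchy under the assumption that final answers are single numeric values; the project's actual patterns and priority order may differ.

```python
# Hypothetical sketch of the hierarchical answer parser; patterns are tried
# in priority order and the first numeric match wins.
import re

ANSWER_PATTERNS = [
    r"\\boxed\{\s*(-?\d+(?:\.\d+)?)\s*\}",      # LaTeX \boxed{} notation
    r"\*\*\s*(-?\d+(?:\.\d+)?)\s*\*\*",          # Markdown bold values
    r"(?:result|total)\D*?(-?\d+(?:\.\d+)?)",    # keyword-anchored values
    r"(-?\d+(?:\.\d+)?)\s*\.?\s*$",              # trailing bare number
]

def extract_answer(text: str) -> float | None:
    """Return the first numeric value matched by the pattern hierarchy."""
    # Drop obvious model-version strings (e.g. "Qwen 2.5") so the bare-number
    # fallback does not pick them up -- a simplifying assumption.
    text = re.sub(r"\b(?:qwen|llama|gpt)[\s-]*\d+(?:\.\d+)?\b", "", text,
                  flags=re.IGNORECASE)
    for pattern in ANSWER_PATTERNS:
        match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        if match:
            return float(match.group(1))
    return None
```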
Initial testing revealed a critical failure mode where currency symbols like the dollar sign triggered shell interpolation errors during local execution. The evaluation pipeline was updated to include a pre-processing layer that strips special characters and currency markers, ensuring the model focuses exclusively on the mathematical logic.
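A minimal sketch of that pre-processing step follows; the exact set of characters stripped by the pipeline is an assumption here.

```python
# Sketch of the sanitization layer; the stripped character set is assumed.
import re

def sanitize_question(text: str) -> str:
    """Remove currency markers and shell-sensitive characters before prompting."""
    text = re.sub(r"[$€£¥]", "", text)           # currency symbols
    text = re.sub(r"[`\\]", "", text)            # characters that break shell quoting
    return re.sub(r"\s+", " ", text).strip()     # collapse leftover whitespace
```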
Traditional string matching is insufficient for mathematical evaluation. The system implements an epsilon-based float comparison to ensure that equivalent values (e.g., 100.8 vs 100.80) are correctly identified as matches despite varying precision in model output.
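A minimal example of the comparison; the tolerance value is an assumption chosen for illustration.

```python
# Epsilon-based numeric comparison; the tolerance of 1e-6 is an assumed value.
def answers_match(predicted: float, expected: float, epsilon: float = 1e-6) -> bool:
    """Treat two answers as equal when they differ by less than epsilon."""
    return abs(predicted - expected) < epsilon

assert answers_match(100.8, 100.80)   # equivalent despite differing precision
```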
Testing on the 1.5B parameter model yielded a 50% accuracy rate for both the Basic and Reflexive agents. This result reveals a "Self-Correction Paradox" in smaller models. While the Reflexive agent successfully corrected complex errors in several instances, it also introduced "Hallucinated Critiques." In these cases, the model incorrectly identified a correct initial reasoning step as an error and subsequently changed a right answer to a wrong one.
The Reflexive agent introduced a 12.5% increase in computational overhead (1.12x average inference steps). The data suggests that at the 1.5B parameter scale, the internal critic is not significantly more capable than the generator, resulting in a performance plateau.
Preliminary testing with the 9B model indicated that self-reflection becomes a more reliable emergent property as parameter counts increase. Larger models possess the world-modeling depth required to effectively critique their own logic without succumbing to the hallucination loops observed in smaller models.
- Ensure Ollama is installed and the Qwen 2.5 Coder 1.5B model is pulled locally.
- Install dependencies: `pip install requests matplotlib numpy`
- Set the Python path to the root directory: `export PYTHONPATH=$PYTHONPATH:.`
- Execute the evaluation suite: `python3 benchmarks/evaluate.py`
- Generate research-grade visualizations: `python3 plots/plot_results.py`
This project successfully established a functioning benchmark for agentic inference. While the reflexive loop did not yield a net accuracy gain at the 1.5B scale, the experiment provided vital insights into the reasoning-latency trade-off and the necessity of a minimum parameter threshold for reliable autonomous self-correction.
