This project investigates the impact of inference-time compute on the logical reasoning capabilities of Small Language Models (SLMs). By implementing a critique-then-correct loop, the framework evaluates whether agentic workflows can mitigate the inherent reasoning limitations of lower-parameter models.
The primary aim was to determine if an iterative self-reflection cycle could improve accuracy on mathematical reasoning tasks without requiring model fine-tuning. The hypothesis was that providing a model with the opportunity to evaluate its own initial reasoning path would lead to the discovery and correction of arithmetic and logical errors.
Figure 1: Comparative analysis of Task Accuracy and Inference Latency for the Qwen 1.5B model. While the success rate remained consistent at 50%, the agentic overhead resulted in a 12.5% increase in computational steps per question.
The framework is built upon a modular architecture designed for local execution and reproducible benchmarking.
The system utilizes Ollama for local model orchestration, specifically targeting the Qwen 2.5 Coder series.
- Primary Testbed: Qwen 2.5 Coder 1.5B (Optimized for high-speed local inference).
- Comparative Baseline: Qwen 3.5 9B (Utilized to analyze reasoning density at scale).
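All inference calls are routed through Ollama's local HTTP API. The sketch below shows what a single-pass call might look like against Ollama's default endpoint; the model tag, prompt handling, and timeout are illustrative assumptions rather than the project's exact code.

```python
# Minimal sketch of a single local inference call through Ollama's default
# HTTP API. The model tag, timeout, and helper name are illustrative.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str, model: str = "qwen2.5-coder:1.5b") -> str:
    """Send a prompt to a locally served model and return the raw completion."""
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]
```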
Two distinct inference strategies were implemented to isolate the effect of the reflection loop.
- Basic Agent: Performs standard single-pass inference. This serves as the control group, representing the raw reasoning ceiling of the model under zero-shot conditions.
- Reflexive Agent: Implements a multi-step "Internal Monologue" pattern. Upon generating an initial solution, the agent invokes a secondary reflection call to analyze the logic for fallacies. If an error is detected, the agent utilizes the critique to regenerate the solution.
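A minimal sketch of the two strategies, reusing the `generate()` helper from the Ollama sketch above. The prompts and the "NO ERRORS" convention are assumptions for illustration, not the project's exact implementation.

```python
# Illustrative critique-then-correct loop; prompts and the "NO ERRORS"
# convention are assumptions. Reuses generate() from the Ollama sketch above.

def basic_agent(question: str) -> str:
    """Single-pass (zero-shot) solution: the control condition."""
    return generate(f"Solve the following problem step by step:\n{question}")

def reflexive_agent(question: str) -> str:
    """Generate an initial solution, critique it, and regenerate if needed."""
    draft = basic_agent(question)

    critique = generate(
        "Review the reasoning below for arithmetic or logical errors. "
        "Reply with 'NO ERRORS' if the solution is sound; otherwise describe "
        f"the mistake.\n\nProblem: {question}\n\nSolution: {draft}"
    )

    if "NO ERRORS" in critique.upper():
        return draft

    # Use the critique as additional context for a corrected attempt.
    return generate(
        f"Problem: {question}\n\nPrevious attempt: {draft}\n\n"
        f"Critique: {critique}\n\nWrite a corrected solution."
    )
```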
One of the most significant engineering hurdles was the non-deterministic formatting of model outputs: high-capability models often wrap final answers in LaTeX notation or Markdown bolding. I engineered a hierarchical regex parser (sketched after this list) capable of extracting values from:
- LaTeX `\boxed{}` notation.
- Markdown bold headers.
- Trailing numeric values that follow keywords such as "result" or "total."
- Bare numeric strings, with model version identifiers filtered out.
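The sketch below illustrates the extraction hierarchy under the assumption that final answers are single numeric values; the project's actual patterns and priority order may differ.

```python
# Hypothetical sketch of the hierarchical answer parser; patterns are tried
# in priority order and the first numeric match wins.
import re

ANSWER_PATTERNS = [
    r"\\boxed\{\s*(-?\d+(?:\.\d+)?)\s*\}",      # LaTeX \boxed{} notation
    r"\*\*\s*(-?\d+(?:\.\d+)?)\s*\*\*",          # Markdown bold values
    r"(?:result|total)\D*?(-?\d+(?:\.\d+)?)",    # keyword-anchored values
    r"(-?\d+(?:\.\d+)?)\s*\.?\s*$",              # trailing bare number
]

def extract_answer(text: str) -> float | None:
    """Return the first numeric value matched by the pattern hierarchy."""
    # Drop obvious model-version strings (e.g. "Qwen 2.5") so the bare-number
    # fallback does not pick them up -- a simplifying assumption.
    text = re.sub(r"\b(?:qwen|llama|gpt)[\s-]*\d+(?:\.\d+)?\b", "", text,
                  flags=re.IGNORECASE)
    for pattern in ANSWER_PATTERNS:
        match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        if match:
            return float(match.group(1))
    return None
```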
Initial testing revealed a critical failure mode where currency symbols like the dollar sign triggered shell interpolation errors during local execution. The evaluation pipeline was updated to include a pre-processing layer that strips special characters and currency markers, ensuring the model focuses exclusively on the mathematical logic.
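A minimal sketch of that pre-processing step follows; the exact set of characters stripped by the pipeline is an assumption here.

```python
# Sketch of the sanitization layer; the stripped character set is assumed.
import re

def sanitize_question(text: str) -> str:
    """Remove currency markers and shell-sensitive characters before prompting."""
    text = re.sub(r"[$€£¥]", "", text)           # currency symbols
    text = re.sub(r"[`\\]", "", text)            # characters that break shell quoting
    return re.sub(r"\s+", " ", text).strip()     # collapse leftover whitespace
```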
Traditional string matching is insufficient for mathematical evaluation. The system implements an epsilon-based float comparison to ensure that equivalent values (e.g., 100.8 vs 100.80) are correctly identified as matches despite varying precision in model output.
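A minimal example of the comparison; the tolerance value is an assumption chosen for illustration.

```python
# Epsilon-based numeric comparison; the tolerance of 1e-6 is an assumed value.
def answers_match(predicted: float, expected: float, epsilon: float = 1e-6) -> bool:
    """Treat two answers as equal when they differ by less than epsilon."""
    return abs(predicted - expected) < epsilon

assert answers_match(100.8, 100.80)   # equivalent despite differing precision
```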
Testing on the 1.5B parameter model yielded a 50% accuracy rate for both the Basic and Reflexive agents. This result reveals a "Self-Correction Paradox" in smaller models. While the Reflexive agent successfully corrected complex errors in several instances, it also introduced "Hallucinated Critiques." In these cases, the model incorrectly identified a correct initial reasoning step as an error and subsequently changed a right answer to a wrong one.
The Reflexive agent introduced a 12.5% increase in computational overhead (1.12x average inference steps). The data suggests that at the 1.5B parameter scale, the internal critic is not significantly more capable than the generator, resulting in a performance plateau.
Preliminary testing with the 9B model indicated that self-reflection becomes a more reliable emergent property as parameter counts increase. Larger models possess the world-modeling depth required to effectively critique their own logic without succumbing to the hallucination loops observed in smaller models.
- Ensure Ollama is installed and the Qwen 2.5 Coder 1.5B model is pulled locally.
- Install dependencies: `pip install requests matplotlib numpy`
- Set the Python path to the root directory: `export PYTHONPATH=$PYTHONPATH:.`
- Execute the evaluation suite: `python3 benchmarks/evaluate.py`
- Generate research-grade visualizations: `python3 plots/plot_results.py`
This project successfully established a functioning benchmark for agentic inference. While the reflexive loop did not yield a net accuracy gain at the 1.5B scale, the experiment provided vital insights into the reasoning-latency trade-off and the necessity of a minimum parameter threshold for reliable autonomous self-correction.
