
Aaweg Bhaladhare

I build evaluation harnesses and agent infrastructure. Pune, India.

What I'm working on

priorityjudge — a multi-pass agent that grades plans and PRDs on priority-definition quality. Extractor → 5 independent dimension scorers → deterministic citation verifier → synthesizer. Built for the Cursor-based APO recruiting quest.
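A minimal sketch of that pipeline shape, with hypothetical names and a trivial stand-in for the LLM scorers (the real extractor and dimension scorers are model calls; only the citation verifier is deterministic):

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    line_no: int  # line cited in the source document

def extract_claims(doc: str) -> list[Claim]:
    # Placeholder extractor: one claim per non-empty line.
    return [Claim(t.strip(), i + 1)
            for i, t in enumerate(doc.splitlines()) if t.strip()]

def verify_citations(claims: list[Claim], doc: str) -> list[Claim]:
    # Deterministic check: a citation survives only if its quoted text
    # actually appears on the cited line.
    lines = doc.splitlines()
    return [c for c in claims
            if c.line_no <= len(lines) and c.text in lines[c.line_no - 1]]

def score_dimension(claims: list[Claim], dimension: str) -> float:
    # Stand-in for one of the five independent dimension scorers.
    return min(10.0, len(claims) * 2.0)

def judge(doc: str) -> float:
    # Synthesizer: average the per-dimension scores over verified claims.
    claims = verify_citations(extract_claims(doc), doc)
    dims = ["clarity", "ranking", "tradeoffs", "evidence", "consistency"]
    return sum(score_dimension(claims, d) for d in dims) / len(dims)
```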

9887 / 10000 on a hand-rated calibration set. Spearman ρ = 1.0 against the human ranking. 96.5% citation precision (the verifier caught the 3.5% the LLM hallucinated). Test-retest CV = 0.4%.
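The two reliability numbers are standard formulas; a minimal tie-free Spearman ρ and a coefficient-of-variation helper (pure Python, not the project's actual harness) look like:

```python
def spearman_rho(xs, ys):
    # Spearman rank correlation for tie-free data, via the classic
    # 1 - 6*sum(d^2) / (n*(n^2 - 1)) shortcut on the ranks.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def cv_percent(scores):
    # Test-retest coefficient of variation: stddev as a percent of the mean.
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return 100 * var ** 0.5 / mean
```

ρ = 1.0 means the judge's ranking of the calibration set matched the human ranking exactly; a 0.4% CV means repeated runs on the same input barely move the score.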

The win comes from architecture, not the model. The head-to-head ablation is a single-shot Gemini baseline with the same rubric and the same calibration set, run by benchmarks/run_baseline.py.

Selected work

  • priorityjudge — APO-native plan scorer with verified citations. Python, Gemini, OpenClaw skill, GitHub Actions CI, Docker.
  • nibblecore — 4-bit quantization kernels for Apple Silicon, benchmarked against llama.cpp. C++ / SIMD / Metal.
  • Story-Character-Extractor — RAG pipeline that pulls structured character profiles from short stories. Embeddings + vector DB + LLM extraction.
  • Melody-Generation-using-LSTM — LSTM trained on monophonic MIDI to generate new melodies in the same idiom.
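The core idea behind nibblecore's 4-bit format, sketched as a Python reference model (the real kernels are C++/SIMD/Metal; this layout, signed nibbles with a per-block scale, is an assumption in the style of llama.cpp's quantized blocks):

```python
def quantize4(weights, scale):
    # Map floats to signed 4-bit ints in [-8, 7], two nibbles per byte.
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    packed = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)

def dequantize4(packed, scale):
    # Unpack each byte into two nibbles, sign-extend, and rescale.
    out = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            out.append((nib - 16 if nib >= 8 else nib) * scale)
    return out
```

Half the bytes of int8, with the scale amortized over a block of weights; the fast path is doing this unpack inside a fused matmul kernel.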

Background

Turing AI evaluation team lead — designed grading rubrics, spotted reward-hacking, and rewrote prompts that bumped rater agreement from "throw darts" to "actually scores reasoning". I care most about the boring half of agent work: making outputs measurable, falsifiable, and hard to fake.

Hardware-friendly software is the other half. nibblecore is what happens when "make inference cheap on the laptop you already own" turns into a side quest.

Stack I reach for

Python (pydantic, pytest, click) · Gemini / Anthropic / OpenAI APIs · OpenClaw multi-agent orchestration · Cursor with custom rules · C++17 / SIMD / Metal · Next.js for the rare time a UI is the right answer.

Reach me

LinkedIn

Pinned

  1. codeforces-notebook-generator — Python
  2. Melody-Generation-using-LSTM — Long Short-Term Memory (LSTM) networks for melody generation. Python
  3. nibblecore — 4-bit quantization kernels for Apple Silicon, benchmarked against llama.cpp. C++
  4. priorityjudge — Multi-pass agent that grades plans/PRDs on priority-definition quality. Line-cited evidence, deterministic citation verifier, 9887/10000 on a hand-rated calibration set. Python
  5. Review-Portal — Python
  6. Story-Character-Extractor — RAG pipeline that extracts structured character profiles, relationships, and roles from short stories using embeddings, a vector DB, and an LLM. Python