Plain C inference for PrismML's 1-bit Bonsai Qwen3-based GGUF models. Reads Q1_0_g128 quantized weights and runs the forward pass. No dependencies beyond libc and libm.
Nothing here is new. This is just a from-scratch C implementation written to learn how inference works. The architecture is Qwen3, the file format is GGUF, the quantization scheme is Q1_0_g128, and RoPE uses YaRN scaling. All of this already exists in llama.cpp, which is the source of truth. If you want a real engine, use that.
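To give a feel for what 1-bit weights with a group size of 128 mean in practice, here is a minimal sketch of a dot product over one quantized row. It is not the code in model.c or inference.c. The block layout (one float scale plus 16 bytes of sign bits, least significant bit first, set bit meaning +scale) and the names sketch_block and sketch_dot are assumptions made for illustration; the real Q1_0_g128 layout is whatever llama.cpp defines, and it most likely stores the scale as fp16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GROUP_SIZE 128

/* Hypothetical block layout, for illustration only. */
typedef struct {
    float   scale;                  /* assumed float here; likely fp16 on disk */
    uint8_t bits[GROUP_SIZE / 8];   /* one sign bit per weight (assumed LSB first) */
} sketch_block;

/* Dot product of one quantized row with a float vector of length n.
 * Each weight is +scale if its bit is set, -scale otherwise (assumption). */
float sketch_dot(const sketch_block *row, const float *x, size_t n)
{
    assert(n % GROUP_SIZE == 0);

    float total = 0.0f;
    for (size_t b = 0; b < n / GROUP_SIZE; b++) {
        float acc = 0.0f;
        for (size_t i = 0; i < GROUP_SIZE; i++) {
            int bit = (row[b].bits[i / 8] >> (i % 8)) & 1;
            float xi = x[b * GROUP_SIZE + i];
            acc += bit ? xi : -xi;   /* sign only, no multiply per weight */
        }
        total += row[b].scale * acc; /* one multiply per group of 128 */
    }
    return total;
}
```

The point of the scheme is visible in the inner loop: each weight contributes only an add or a subtract, and the scale is applied once per group.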
gcc -O3 -o bitc main.c gguf.c model.c inference.c tokenizer.c -lm
Drop a Bonsai-1.7B.gguf into models/ and run:
./bitc
It will prompt for input and stream a reply.
I plan to keep learning how inference works and to try out other weird experiments here.
gguf.{c,h}: GGUF parser
model.{c,h}: Qwen3 weight loading
inference.{c,h}: forward pass, RoPE, attention, SwiGLU, sampling
tokenizer.{c,h}: BPE merges + byte-level encoding
main.c: chat loop with the <|im_start|>/<|im_end|> template
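As a rough illustration of the chat template mentioned for main.c, the sketch below wraps a single user message in the <|im_start|>/<|im_end|> markers and opens the assistant turn. The function name sketch_format_turn is hypothetical, and the actual main.c may differ, for example by adding a system prompt or keeping multi-turn history.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes a single-turn prompt into out.
 * Returns the number of characters written (excluding the terminating NUL),
 * or -1 if the buffer is too small. */
int sketch_format_turn(char *out, size_t cap, const char *user_msg)
{
    assert(out != NULL && user_msg != NULL);

    int n = snprintf(out, cap,
                     "<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n",
                     user_msg);
    if (n < 0 || (size_t)n >= cap)
        return -1;   /* encoding error or truncated output */
    return n;
}
```

The resulting string would then be tokenized and fed to the forward pass, with generation stopping once the model emits <|im_end|>.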