πΊπΈ English | π§π· PortuguΓͺs
Submission for Rinha de Backend 2026: a fraud detection API using approximate nearest-neighbour vector search over 3 million labelled transactions.
This project was intentionally built in Python, even knowing that the competition is heavily dominated by lower-level languages such as Rust, Go and C, which naturally have advantages in raw throughput and latency.
The goal here is not necessarily to outperform those ecosystems, but to explore how far Python can be pushed when every millisecond matters.
The project combines low-overhead architectural decisions, Rust-powered dependencies, memory-conscious optimizations, vector-search compression techniques and careful hot-path tuning to make Python as performant as possible within the competition constraints.
At first, I tried to use hnswlib with an HNSW index.
The problem was discovered only after multiple local stress tests and full Docker orchestration runs: the memory footprint was fundamentally incompatible with the competition limits.
The competition rules require:
- at least 2 API instances
- 1 load balancer
- maximum total budget of:
- 1 CPU
- 350 MB RAM
The original HNSW approach became impossible because HNSW stores the entire graph structure in memory for every vector.
With:
- 3 million vectors
- 14 dimensions
- HNSW
M=8
the index consumed approximately:
~721 MB RAM
No amount of parameter tuning could make two API containers coexist inside the allowed memory budget.
The bottleneck was not Python itself.
It was the memory model of the vector index.
Faiss (Facebook AI Similarity Search) is a high-performance vector similarity search library developed by Meta AI and widely used in recommendation systems, semantic search, embeddings retrieval and large-scale nearest-neighbour workloads.
For this project, the chosen index type was:
faiss.IndexIVFScalarQuantizer(...)which combines two complementary techniques:
- IVF (Inverted File Index)
- SQ8 (8-bit Scalar Quantization)
The reason for this choice was memory efficiency.
While HNSW provides excellent recall and latency, its graph-based structure becomes extremely memory-intensive at large scale because the full graph connectivity must remain resident in RAM.
Faiss IVF+SQ8 trades a small amount of recall for a massive reduction in memory usage while still keeping sub-millisecond query times.
This made it possible to fit 3 million vectors inside the competition's strict container memory limits.
The architecture was redesigned around two complementary Faiss techniques:
The dataset is partitioned into 1000 clusters during build time.
At query time, only 10 clusters are searched (nprobe=10), reducing the search space to roughly 1% of the full dataset.
Instead of:
O(N)
queries become approximately:
O(N / nlist * nprobe)
This dramatically reduces search cost while maintaining good recall.
Each float32 dimension is compressed:
float32 (4 bytes) β uint8 (1 byte)
This provides:
- ~4Γ memory reduction
- minimal accuracy degradation
- measured ~97.4% recall compared to IVFFlat
Because the vectors are already normalized into [0,1], scalar quantization works extremely well for this dataset.
The migration from HNSW to Faiss IVF+SQ8 completely changed the feasibility of the architecture.
| Component | RAM |
|---|---|
| Faiss IVF+SQ8 index | ~64 MB |
| labels.npy | ~3 MB |
| Python + granian | ~50 MB |
| Total per container | ~117 MB |
Compared to the original HNSW approach:
| Index type | Approx RAM |
|---|---|
| hnswlib HNSW | ~721 MB |
| Faiss IVF+SQ8 | ~64 MB |
This reduction made the competition constraints achievable.
Client
β POST /fraud-score (port 9999)
βΌ
ββββββββββββββββββββββββββββββββββββ
β nginx 1.27-alpine β
β Round-robin load balancer β
β 0.20 CPU Β· 30 MB β
ββββββββββββββ¬ββββββββββββββββββββββ
β round-robin
βββββββββ΄ββββββββ
βΌ βΌ
βββββββββββ βββββββββββ
β api1 β β api2 β
β granian β β granian β
β 1 workerβ β 1 workerβ
β 0.50CPU β β 0.40CPU β
β 280 MB β β 160 MB β
βββββββββββ βββββββββββ
Faiss IVF+SQ8 index pre-built during docker build
Total declared resources: 1.00 CPU Β· 350 MB RAM
Fully compliant with the competition rules.
After solving the memory footprint problem, another issue appeared:
api1 started correctly, but api2 frequently crashed while loading the same index simultaneously.
The root cause was memory pressure during concurrent index loading.
Even though the steady-state RAM usage fit comfortably inside the limit, Docker briefly experienced a memory spike while both containers loaded their own copy of the index at the same time.
The operating system still had the first container's index pages warm in memory while the second container started reading the same file.
The solution was intentionally simple and deterministic:
api2 waits before starting.
command:
[
"sh",
"-c",
"sleep 15 && python -m granian ..."
]This allows:
- api1 to fully initialize
- the OS page cache to stabilize
- I/O pressure to drop
- memory spikes to disappear
After introducing startup sequencing, both containers became stable simultaneously under the competition limits.
FastAPI adds dependency injection and validation layers that cost measurable latency per request.
Using raw Starlette removes that overhead while preserving ASGI ergonomics.
orjson is Rust-based and significantly faster than Python's standard JSON implementation for both serialization and deserialization.
The competition guarantees valid payloads.
Avoiding schema validation removes unnecessary allocations and CPU overhead.
All fields are accessed directly from raw dictionaries.
Each worker allocates a single:
(1, 14) float32buffer and reuses it for every request.
Benefits:
- zero per-request NumPy allocations
- cache-friendly contiguous memory
- exact layout expected by Faiss
granian uses a Rust runtime and consistently performs better in throughput and p99 latency than pure-Python ASGI servers.
The Faiss index is built during Docker build time.
Containers only perform:
faiss.read_index(...)during startup.
Benefits:
- no runtime training cost
- deterministic startup
- avoids massive transient RAM spikes
The runtime image contains:
- no compiler
- no build-essential
- no pip install step
Only pre-built Python packages and the generated Faiss index are copied from the builder stage.
This reduces:
- image size
- startup complexity
- runtime dependencies
The load balancer is configured as a pure pass-through proxy with:
- upstream keepalive
tcp_nodelay- disabled buffering
- disabled access logs
The rules explicitly forbid business logic in the load balancer.
| Component | Choice | Why |
|---|---|---|
| ASGI server | granian 2.7.4 | Rust runtime, lower p99 latency |
| Web framework | Starlette 0.41.3 | Minimal overhead |
| JSON | orjson 3.11.9 | Rust-based JSON |
| Vector search | Faiss IVF+SQ8 | Massive RAM reduction |
| Numerics | numpy 2.4.4 | Vectorized math |
| Load balancer | nginx 1.27-alpine | Lightweight reverse proxy |
| Containerization | Docker Compose | Competition orchestration |
.
βββ app/
β βββ main.py
β βββ requirements.txt
βββ scripts/
β βββ build_index.py
βββ resources/
β βββ references.json.gz
βββ Dockerfile
βββ docker-compose.yml
βββ nginx.conf
βββ test_vectorize.py
βββ README.md
βββ README.pt-BR.md
βββ .gitignore
| Method | Path | Description |
|---|---|---|
| GET | /ready |
Health check |
| POST | /fraud-score |
Returns fraud decision and score |