AI-assisted apartment search for Berlin. Describe what you want in plain language — "I work near Alexanderplatz, budget €1,500, quiet area, good transport, cafes nearby" — and WohnIQ parses it, searches, ranks, explains each match, and flags suspicious listings.
Portfolio project. Listings are synthetic by design, so we hold ground-truth scam labels for measurable evals (see ADR-0003). The goal is to demonstrate production-grade AI engineering: reliable structured LLM output, clean language-vs-logic separation, a measurable scam detector, tests, evals, and a deployable system.
Berlin's rental market is scarce, fast-moving, and scam-prone. Traditional portals show you listings but don't help with the decision. WohnIQ's bet: the value isn't fancier filters, it's an AI reasoning layer on top of good data — one that synthesizes commute, neighborhood, budget fit, and risk into an explained recommendation.
Primary flow:
- User types a natural-language query.
- Gemini Flash parses it into a validated `SearchCriteria` object (budget, rooms, work location, amenities, etc.).
- Hard constraints filter the listing DB; pgvector cosine search surfaces semantically relevant candidates.
- A deterministic scoring function ranks results by commute fit, budget fit, neighborhood match, and amenity match.
- Real-time enrichment fetches public-transport commute times (BVG/VBB API) and POI counts (OSM Overpass).
- A hybrid scam engine runs on every listing: price z-score, pHash duplicate detection, and an LLM text pass produce a fused 0–100 risk score with evidence.
- Results render as cards with a natural-language explanation grounded in the score breakdown, a risk badge, and a neighborhood map.
- Users can select 2–4 listings for a side-by-side comparison.
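The first two steps of the flow can be sketched as a typed criteria object plus a hard-constraint filter. This is a minimal illustration, not the project's code: it uses a stdlib dataclass as a stand-in for the Pydantic `SearchCriteria` model, and the field names (`max_rent_eur`, `min_rooms`, and so on) and the listing dict shape are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SearchCriteria:
    """Hypothetical stand-in for the validated model the parser returns."""

    max_rent_eur: int | None = None
    min_rooms: float | None = None
    work_location: str | None = None
    amenities: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Reject values that downstream code could not use safely.
        if self.max_rent_eur is not None and self.max_rent_eur <= 0:
            raise ValueError("max_rent_eur must be positive")
        if self.min_rooms is not None and self.min_rooms <= 0:
            raise ValueError("min_rooms must be positive")


def passes_hard_constraints(listing: dict, criteria: SearchCriteria) -> bool:
    """Hard constraints only exclude; criteria left unset never filter."""
    if criteria.max_rent_eur is not None and listing["rent_eur"] > criteria.max_rent_eur:
        return False
    if criteria.min_rooms is not None and listing["rooms"] < criteria.min_rooms:
        return False
    return True


criteria = SearchCriteria(
    max_rent_eur=1500, work_location="Alexanderplatz", amenities=("cafes",)
)
listings = [
    {"id": "a", "rent_eur": 1400, "rooms": 2},
    {"id": "b", "rent_eur": 1700, "rooms": 3},
]
candidates = [item for item in listings if passes_hard_constraints(item, criteria)]
```

Soft preferences such as "quiet area" are deliberately absent from the filter; in the flow above they influence ranking, not eligibility.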
The architecture reflects a single principle: LLMs handle language; code handles logic. Here's how that played out in practice.
The tempting shortcut is to hand the model a list of listings and ask it to rank them. We rejected this: LLMs are inconsistent at ordering, unreliable at arithmetic, and produce opaque reasoning. Instead, ranking is a pure, weighted scoring function in code — same input always produces the same output. The LLM's roles are limited to three single-shot calls: parse a query, read fuzzy scam text, write an explanation from an already-computed breakdown.
This decision is load-bearing for everything else. Because ranking is deterministic, explanations are trustworthy (they're generated from the same numbers that produced the rank). Because the LLM is stateless, we don't need an agent framework. → ADR-0001
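A deterministic ranker of this kind might look roughly like the sketch below. The weights, the component names, and the `ScoreBreakdown` type are illustrative assumptions; this README does not state the actual values. Each component is assumed to be pre-normalized to the range 0..1.

```python
from __future__ import annotations

from dataclasses import dataclass

# Illustrative weights only; the real values are not given in this README.
WEIGHTS = {"commute": 0.4, "budget": 0.3, "neighborhood": 0.2, "amenities": 0.1}


@dataclass(frozen=True)
class ScoreBreakdown:
    listing_id: str
    components: dict[str, float]  # each component normalized to 0..1
    total: float


def score(listing_id: str, components: dict[str, float]) -> ScoreBreakdown:
    """Pure function: weighted sum of pre-computed component scores."""
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return ScoreBreakdown(listing_id, components, round(total, 4))


def rank(scored: list[ScoreBreakdown]) -> list[ScoreBreakdown]:
    # Tie-break on listing_id so equal totals still order deterministically.
    return sorted(scored, key=lambda s: (-s.total, s.listing_id))


a = score("a", {"commute": 0.9, "budget": 0.5, "neighborhood": 0.7, "amenities": 1.0})
b = score("b", {"commute": 0.6, "budget": 1.0, "neighborhood": 0.5, "amenities": 0.5})
ranked = rank([b, a])
```

Because the breakdown object carries the same numbers that produced the order, an explanation prompt can be built from it without asking the model to do any arithmetic.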
A pure LLM classifier ("is this a scam?") would be opaque, inconsistent, and unconvincing in a demo. Instead, the risk engine fuses three signal sources into an explained 0–100 score:
- Deterministic rule signals (€0): price z-score vs. Kiez+size median; metadata rules (off-platform contact, no-Anmeldung clauses, etc.).
- Image signals (local pHash, €0): perceptual-hash duplicate detection across the listing DB catches reused/stolen photo sets.
- LLM text pass (`scam_text.v1`): extracts fuzzy language signals (urgency, landlord-abroad, payment pressure) as validated structured output with verbatim evidence quotes, not a verdict.
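The two zero-cost signal families above can be sketched with the standard library alone. The exact formulas are assumptions: the sketch centers the z-score on the peer-group median (as the rule description suggests) and uses the population standard deviation as the spread, and it treats a perceptual hash as a 64-bit integer compared by Hamming distance. The `max_distance` threshold is hypothetical.

```python
from statistics import median, pstdev


def price_z_score(rent_per_m2: float, peer_rents_per_m2: list[float]) -> float:
    """Distance from the peer-group median, in standard deviations."""
    spread = pstdev(peer_rents_per_m2)
    if spread == 0:
        return 0.0  # no variation among peers, so no meaningful outlier score
    return (rent_per_m2 - median(peer_rents_per_m2)) / spread


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_probable_duplicate(hash_a: int, hash_b: int, max_distance: int = 6) -> bool:
    # Small distances mean visually similar images; the threshold is an assumption.
    return hamming_distance(hash_a, hash_b) <= max_distance


peers = [18.0, 20.0, 22.0, 20.0, 20.0]  # made-up rents per m² for one Kiez and size band
z = price_z_score(9.0, peers)           # strongly negative: far cheaper than peers
distance = hamming_distance(0b1011, 0b1000)
```

A strongly negative z-score on its own is not a verdict, which is why cheap-but-legit listings exist as hard negatives in the eval set.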
The fusion step emits contributing signals with their evidence, so the UI can show exactly why a listing was flagged. This "explainable by construction" design also produces the single best demo screenshot. → ADR-0002
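One simple way to get a fused score that stays explainable is additive fusion with a cap, sketched below. The additive rule, the point values, and the `Signal` type are assumptions for illustration; the actual fusion logic is documented in ADR-0002, not here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    name: str
    points: int    # contribution to the 0-100 risk score
    evidence: str  # human-readable reason shown in the UI


def fuse(signals: list[Signal]) -> dict:
    """Sum the points of fired signals, cap at 100, keep their evidence."""
    fired = [s for s in signals if s.points > 0]
    risk_score = min(100, sum(s.points for s in fired))
    ordered = sorted(fired, key=lambda s: (-s.points, s.name))
    return {
        "risk_score": risk_score,
        "signals": [
            {"name": s.name, "points": s.points, "evidence": s.evidence}
            for s in ordered
        ],
    }


assessment = fuse([
    Signal("price_z_score", 40, "rent is far below the peer-group median"),
    Signal("duplicate_photos", 30, "photo set matches another listing"),
    Signal("llm_urgency", 20, 'quote: "transfer the deposit today"'),
    Signal("off_platform_contact", 0, ""),
])
```

Every point in the final score traces back to a named signal and its evidence string, so the risk badge never shows a number the UI cannot explain.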
Real listings drag in legal, privacy, and freshness problems irrelevant to a portfolio project — and, critically, don't give us ground-truth scam labels. We generate ~100 synthetic listings with realistic Berlin distributions (Kieze, rents per m², sizes), including a labeled ~15% scam subset spanning four scam types plus hard negatives (cheap-but-legit, landlord-travels-but-legit). Photos are sourced via the Pexels API by room type and hotlinked (zero storage). The seeder deliberately reuses photo sets across scam listings to create realistic duplicate-photo fraud. → ADR-0003
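The key property of the seeder is that labels are fixed at generation time. A heavily simplified sketch of that idea follows; the value ranges are placeholders rather than real Berlin distributions, the function name is hypothetical, and hard negatives and photo assignment are omitted.

```python
import random

SCAM_TYPES = ["advance_fee", "overseas_landlord", "price_bait", "photo_reuse"]


def generate_listings(n: int = 100, scam_rate: float = 0.15, seed: int = 42) -> list[dict]:
    """Generate listings whose scam labels are known by construction."""
    rng = random.Random(seed)  # fixed seed keeps the labeled set reproducible
    scam_ids = set(rng.sample(range(n), round(n * scam_rate)))
    listings = []
    for i in range(n):
        is_scam = i in scam_ids
        size_m2 = rng.randint(25, 110)
        rent_per_m2 = rng.uniform(12.0, 24.0)  # placeholder range, not market data
        listings.append({
            "id": i,
            "size_m2": size_m2,
            "rent_eur": round(size_m2 * rent_per_m2),
            "is_scam": is_scam,
            "scam_type": rng.choice(SCAM_TYPES) if is_scam else None,
        })
    return listings


dataset = generate_listings()
```

Because the same seed always yields the same labeled set, eval numbers stay comparable across runs.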
The main reliability risk with LLMs is not capability but boundary integrity: free-form text breaks downstream code. Every Gemini call goes through a central client (`core/llm.py`) that requests JSON constrained to a Pydantic model, validates the response, re-prompts once on failure, and then falls back to a deterministic path (keyword extractor for parsing, rules-only for scam, template for explanation). No unvalidated model text reaches business logic or the user. The client logs prompt id, token counts, latency, and validation result for observability. → ADR-0004
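The validate, re-prompt once, then fall back pattern can be sketched as follows. This is not the contents of `core/llm.py`: the `Transport` callable, the single-field schema, the prompt wording, and the regex-based fallback are all assumptions, and plain `json` checks stand in for Pydantic validation.

```python
import json
import re
from typing import Callable

# Hypothetical boundary: prompt text in, raw model text out.
Transport = Callable[[str], str]


def validate_criteria(raw: str) -> dict:
    """Raise ValueError unless the reply is a JSON object with an int budget."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("max_rent_eur"), int):
        raise ValueError("max_rent_eur must be an integer")
    return data


def keyword_fallback(query: str) -> dict:
    """Deterministic path: treat the first number in the query as the budget."""
    match = re.search(r"\d+", query.replace(",", ""))
    return {"max_rent_eur": int(match.group()) if match else None, "source": "fallback"}


def parse_query(query: str, transport: Transport) -> dict:
    prompt = f"Return JSON with max_rent_eur for: {query}"
    for _ in range(2):  # the initial call plus exactly one re-prompt
        try:
            result = validate_criteria(transport(prompt))
            result["source"] = "llm"
            return result
        except ValueError:
            prompt = f"Your last reply was invalid. Return only JSON. Query: {query}"
    return keyword_fallback(query)


# A transport that never returns valid JSON forces the fallback path.
parsed = parse_query("budget €1,500 near Alexanderplatz", lambda prompt: "nope")
```

Passing the transport in as an argument also makes the failure paths easy to exercise in tests without network access.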
Supabase is one free service covering relational data, vector similarity search, and storage. pgvector keeps semantic search inside Postgres, which gives a simpler setup, one query path, and a credible production story. Cold-start risk (free-tier projects pause after ~1 week of inactivity) is mitigated by a GitHub Actions cron keep-alive. → ADR-0005
SQLAlchemy 2.0 typed `Mapped[...]` models are the single source of truth for the schema; Alembic handles migrations. But the ORM is not used everywhere: pgvector similarity search and ranking queries stay raw SQL (`text()`), because that's where the ORM abstraction fights clarity. Knowing where not to use the ORM is part of the decision. → ADR-0006
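For orientation, a raw similarity query of the kind described might look like the string below, which in the project would be wrapped in SQLAlchemy's `text()` and executed with bound parameters. The table and column names are hypothetical. pgvector's `<=>` operator returns cosine distance (1 minus cosine similarity), and the small pure-Python function shows what that operator computes, so the sketch runs without a database.

```python
from __future__ import annotations

import math

# Hypothetical table and column names; bind parameters use SQLAlchemy's :name style.
SIMILARITY_SQL = """
SELECT id, embedding <=> CAST(:query_embedding AS vector) AS distance
FROM listings
WHERE rent_eur <= :max_rent
ORDER BY distance
LIMIT :k
"""


def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> computes: 1 - cosine similarity (non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Toy 2-d "embeddings" to mirror the ORDER BY distance clause in plain Python.
query = [1.0, 0.0]
embeddings = {"quiet_flat": [0.9, 0.1], "party_flat": [0.1, 0.9]}
nearest = sorted(embeddings, key=lambda name: cosine_distance(query, embeddings[name]))
```

Keeping the query as plain SQL makes the hard filter and the distance ordering visible in one place.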
WohnIQ's LLM usage is three stateless, single-shot calls. LangGraph is an orchestration framework for stateful, cyclic, multi-actor agent workflows — using it here would be over-engineering. We call the Gemini SDK directly, behind a thin injectable client. Provider portability is achieved by an injectable transport boundary in ~10 lines — no framework dependency, readable reliability logic, and easy to swap later. → ADR-0007
| Layer | Tech |
|---|---|
| Frontend | Next.js + TypeScript, Tailwind CSS, shadcn/ui, Leaflet + OSM tiles |
| Backend | FastAPI, Pydantic, Uvicorn, Docker |
| AI | Gemini Flash (gemini-2.0-flash) — parsing, scam text, explanations; Gemini embedding-001 — semantic search |
| Database | Supabase Postgres + pgvector |
| ORM / migrations | SQLAlchemy 2.0, Alembic |
| Scam signals | imagehash (pHash), Pillow |
| Enrichment | BVG/VBB transport.rest (commute), OSM Overpass (POIs), Nominatim (geocoding) |
| Images | Pexels API (hotlinked, attributed in-app) |
| Deploy | Backend → Railway (Docker); Frontend → Vercel |
| Cost | €0 incremental — Gemini free tier, Supabase free, Railway existing, Vercel free |
The query parser was evaluated against 20 canonical queries covering budget-only, location-only, vague vibe queries, and adversarial under-specified inputs.
| Metric | Result |
|---|---|
| Cases fully correct | 20 / 20 (100%) |
| Fields correct | 57 / 57 (100%) |
The scam detector was evaluated against the labeled synthetic set. Precision was the tuning target: avoid crying wolf on cheap-but-legit listings.
| | Predicted scam | Predicted legit |
|---|---|---|
| Actual scam | 12 TP | 3 FN |
| Actual legit | 2 FP | 83 TN |
| Metric | Result |
|---|---|
| Precision | 86% |
| Recall | 80% |
| F1 | 0.83 |
| Accuracy | 95% |
| Hard negatives cleared | 6 / 8 |
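The headline metrics follow directly from the confusion matrix above; the short check below recomputes them from the four cell counts.

```python
# Cell counts taken from the confusion matrix above.
tp, fn, fp, tn = 12, 3, 2, 83

precision = tp / (tp + fp)                          # 12 / 14
recall = tp / (tp + fn)                             # 12 / 15
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)          # 95 / 100

print(f"precision={precision:.0%} recall={recall:.0%} f1={f1:.2f} accuracy={accuracy:.0%}")
# prints: precision=86% recall=80% f1=0.83 accuracy=95%
```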
Recall by scam type:
| Type | Caught | Recall |
|---|---|---|
| advance_fee | 4 / 4 | 100% |
| overseas_landlord | 4 / 4 | 100% |
| price_bait | 4 / 4 | 100% |
| photo_reuse | 0 / 3 | 0% |
Photo-reuse recall is 0% because pHash matching against hotlinked Pexels URLs requires fetching image bytes, which the test environment does not do. The signal works in production (the seeder fetches bytes at seed time); the eval isolates this as a known gap.
| Path | Purpose |
|---|---|
| `docs/SPEC.md` | Product spec: what WohnIQ does, features F1–F8, acceptance criteria |
| `docs/adr/` | Architecture decision records: why we chose things |
| `docs/system_prompts.md` | Versioned product LLM prompts (`parser.v1`, `scam_text.v1`, `explain.v1`) |
| `AGENTS.md` | Operating manual: conventions, golden rules, working loop |
| `TASKS.md` | Ordered task plan with milestone status |
| `backend/core/llm.py` | Central LLM client with validation, retry, fallback, and observability |
| `backend/search/` | Query parser, retrieval (pgvector), deterministic ranking, explanation |
| `backend/scam/` | Rule signals, pHash duplicate detection, LLM text pass, fusion |
| `backend/enrich/` | Commute (BVG/VBB) and neighborhood (OSM) enrichment, both cached |
| `backend/evals/` | Parser eval harness, ranking sanity eval, scam eval + confusion matrix |
| `frontend/` | Next.js app: search box, result cards, risk badge, comparison, Leaflet map |
Backend uses uv. Install it once: `curl -LsSf https://astral.sh/uv/install.sh | sh` (or `brew install uv`).
```bash
# Backend
cd backend
uv sync                              # creates .venv + installs deps
cp ../.env.example ../.env           # fill in GEMINI_API_KEY, DATABASE_URL
uv run alembic upgrade head          # apply schema to Supabase
uv run python -m data.seed_listings
uv run python -m data.seed_images
uv run python -m data.embed_listings
uv run python -m scam.score          # pre-compute risk assessments
uv run uvicorn app.main:app --reload

# Frontend
cd frontend
npm install
npm run dev

# Evals
cd backend
uv run python -m evals.parser_eval   # or: make eval-parser
uv run python -m evals.ranking_eval  # or: make eval-ranking
uv run python -m evals.scam_eval     # or: make eval-scam
```

From the repo root: `make install`, `make dev`, `make test`, `make lint`.
Apartment photos sourced via the Pexels API; attribution shown in-app per their terms.



