Platform for collection, persistence, and statistical analysis of how LLMs cite Brazilian companies across 4 economic sectors.
Longitudinal study (target: 90+ days, ~25,920 observations) focused on citation patterns, visibility, and source attribution by generative search engines (Generative Engine Optimization — GEO).
Following Paper 4 ("Three Ways to Fail to Conclude", doi.org/10.5281/zenodo.19712217), which documented a triple methodological failure of v1 (H1 RAG underpower, H2 fictitious-probe design-null, H3 asymmetric instrumentation), the codebase was rebooted across 5 waves. The v2 infrastructure closes each failure mode with hardened algorithms, a balanced 128-entity cohort, a 192-query balanced battery, and a pre-registered decision rule. 78/78 tests passing.
Canonical pillars:
- NER v2 — NFC+NFKD dual-pass, word-boundary matching, alias and stop-context tables. Dry-run on 2,000 rows: -45% false positives. `src/analysis/entity_extraction.py` (24 tests)
- Cluster-robust CR1 — sandwich estimator with cross-group covariance. `src/analysis/cluster_robust.py` (6 tests)
- Null simulation (Monte Carlo) — empirical null distribution replaces the arbitrary Jaccard 0.30 threshold. `src/analysis/null_simulation.py` (8 tests)
- Power analysis — Rule-of-3 inverse, Cohen's h, design effect, `reboot_roadmap()`. `src/analysis/power_analysis.py` (10 tests)
- Mixed-effects GLMM — `BinomialBayesMixedGLM` with random intercepts per entity and per query. `src/analysis/mixed_effects.py`
- Cohort v2 — 80 Brazilian entities + 32 international anchors + 16 fictional decoys (128 total). `src/config_v2.py` (16 tests)
- Query battery v2 — 192 balanced queries across verticals, framings, and directives.
- Hypothesis engine — BH-FDR correction with a pre-registered decision rule. `src/analysis/hypothesis_engine.py` (14 tests)
- Forward-only migrations — `0005` NER v2 columns, `0006` SHA-256 response hashes, `0007` fictitious-probe flag.
- Reproducibility — `Dockerfile`, `requirements-lock.txt`, `scripts/reproduce.sh`.
- Pipeline — standalone `collect validate-run --since-minutes N`, fail-loud per mandatory LLM, `routed_out` vs `api_failure` distinction, updated `daily-collect.yml`.
- Canonical docs — `docs/METHODOLOGY_V2.md` (source of truth) and `CHANGELOG.md`.
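The NER v2 pillar's dual-pass normalization plus word-boundary matching can be sketched as below. This is a minimal illustration, not the repo's `entity_extraction` API: function names, the alias handling, and the diacritic stripping strategy are assumptions.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Dual-pass normalization: NFC first, then NFKD with combining marks stripped."""
    nfc = unicodedata.normalize("NFC", text)
    nfkd = unicodedata.normalize("NFKD", nfc)
    return "".join(ch for ch in nfkd if not unicodedata.combining(ch)).lower()

def find_entity(entity: str, aliases: list[str], text: str) -> bool:
    """Word-boundary match of an entity (or any alias) in normalized text."""
    haystack = normalize(text)
    for candidate in [entity, *aliases]:
        pattern = r"\b" + re.escape(normalize(candidate)) + r"\b"
        if re.search(pattern, haystack):
            return True
    return False

print(find_entity("Itaú", ["Itau Unibanco"], "O Itau lidera o setor."))  # True
print(find_entity("Neon", [], "Luzes de neonilina brilham."))            # False
```

The second call shows why word boundaries matter: a naive substring match would flag "neonilina" as a citation of Neon.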
| Dimension | Value |
|---|---|
| Verticals | 4 (Fintech, Retail, Healthcare, Technology) |
| Entities | 69 (61 real + 8 fictional for calibration) |
| LLM Models | 4 (GPT-4o-mini, Claude Haiku 4.5, Gemini 2.5 Flash, Perplexity Sonar) |
| Queries per vertical | 12 specific + 6 cross-vertical = 18 |
| Daily observations | ~288 (18 queries x 4 models x 4 verticals) |
| Observations collected | 653 citations, 172 contexts, 11 runs |
| Code | 7,010 lines Python, 35 files, 91 commits |
| Schema | 21 tables (citations, contexts, finops, interventions, snapshots, model_versions) |
| Collection | Automated daily (GitHub Actions, 06:00 UTC) |
| Persistence | SQLite WAL (canonical ledger) + Supabase (read projection) |
| Publication target | 3 papers (ArXiv, SIGIR/WWW, Information Sciences Q1) |
- **Fintech** — Real (14): Nubank, PagBank, Cielo, Stone, Banco Inter, Mercado Pago, Itau, Bradesco, C6 Bank, PicPay, Neon, Safra, BTG Pactual, XP Investimentos. Cross-market (5): Revolut, Monzo, N26, Chime, Wise. Fictional (2): Banco Floresta Digital, FinPay Solutions.
- **Retail** — Real (14): Magazine Luiza, Casas Bahia, Americanas, Amazon Brasil, Mercado Livre, Shopee Brasil, Renner, Riachuelo, C&A Brasil, Leroy Merlin, Centauro, Netshoes, Via Varejo, Grupo Pao de Acucar. Fictional (2): MegaStore Brasil, ShopNova Digital.
- **Healthcare** — Real (14): Dasa, Hapvida, Unimed, Fleury, Rede D'Or, Einstein, Sirio-Libanes, Raia Drogasil, Eurofarma, Ache, EMS, Hypera Pharma, NotreDame Intermedica, SulAmerica Saude. Fictional (2): HealthTech Brasil, Clinica Horizonte Digital.
- **Technology** — Real (14): Totvs, Stefanini, Tivit, CI&T, Locaweb, Linx, Movile, iFood, Vtex, RD Station, Conta Azul, Involves, Accenture Brasil, IBM Brasil. Fictional (2): TechNova Solutions, DataBridge Brasil.
| Criterion | Target | Current |
|---|---|---|
| Total observations | >= 25,920 (288/day x 90 days) | 397 (1.5%) |
| N per LLM | >= 1,000 | 30-136 |
| Collection days | >= 90 continuous | 2 |
| Pre-registered hypotheses | >= 3 | 0 |
| A/B experiments | >= 2 | 0 |
| Fictional entity validation | 8 (false positive rate) | 0 queries |
- Effective N < Gross N: 54% of observations are cache hits (identical responses reused). N_eff ~181
- Sample imbalance: Gemini Flash has N=3 in 3 of 4 verticals (API failures in early rounds)
- Directive queries: Categories like "fintech_trust" produce 100% citation rate by design — do not represent spontaneous citation
- Non-stationarity: LLMs update models without notice. The `model_versions` table exists but is not being populated.
- Non-independent observations: similar queries to the same model in the same session share internal state.
Source of truth: `docs/METHODOLOGY_V2.md` (v2.0.0-reboot, 2026-04-23). The legacy `docs/METHODOLOGY.md` is preserved as a historical reference for v1 and Paper 4's failure analysis.
| Test | Use | Implementation |
|---|---|---|
| Chi-squared | Association between query category and citation | scipy.stats.chi2_contingency + Cramer's V |
| Kruskal-Wallis | Comparison of rates across 4+ LLM models (non-parametric) | scipy.stats.kruskal + eta-squared |
| ANOVA one-way | Group comparison (when Levene p > 0.05) | scipy.stats.f_oneway + eta-squared |
| Mann-Whitney U | Citation position (ordinal, non-normal) | scipy.stats.mannwhitneyu + rank-biserial r |
| T-test (ind/paired) | Mean comparison pre/post intervention | scipy.stats.ttest_ind/rel + Cohen's d |
| Logistic regression | Citation predictors (schema, word count, etc.) | statsmodels.Logit + pseudo R-squared, AIC, BIC, odds ratios |
| Correlation | Spearman (default) / Pearson | scipy.stats.spearmanr/pearsonr |
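As an illustration of the table's Kruskal-Wallis row, the sketch below compares four hypothetical per-model citation-rate samples. The eta-squared here uses the common (H - k + 1) / (n - k) approximation, which is an assumption on my part; the repo's exact formula may differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical citation-rate samples for 4 LLM models (40 observations each)
rates = [rng.beta(a, 5, size=40) for a in (2, 2, 3, 4)]

h_stat, p_value = stats.kruskal(*rates)

# Eta-squared for Kruskal-Wallis: (H - k + 1) / (n - k)
k = len(rates)
n = sum(len(g) for g in rates)
eta_sq = (h_stat - k + 1) / (n - k)

print(f"H={h_stat:.2f}, p={p_value:.4f}, eta^2={eta_sq:.3f}")
```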
| Method | Application |
|---|---|
| Bonferroni | Family-wise comparisons (across verticals) |
| Benjamini-Hochberg FDR | Per-entity tests (controls false discovery rate) |
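Both corrections are available through `statsmodels.stats.multitest.multipletests`; a minimal sketch on a hypothetical set of per-entity p-values:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical per-entity p-values
pvals = [0.001, 0.008, 0.012, 0.040, 0.049, 0.200, 0.740]

# Bonferroni: family-wise control (conservative)
rej_bonf, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
# Benjamini-Hochberg: false discovery rate control
rej_bh, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("Bonferroni keeps:", int(rej_bonf.sum()), "| BH-FDR keeps:", int(rej_bh.sum()))
# Bonferroni keeps: 1 | BH-FDR keeps: 3
```

The gap (1 vs 3 rejections) is why BH-FDR is the per-entity choice: at dozens of entity-level tests, Bonferroni's family-wise bound discards real effects.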
| Metric | Associated Test | Classification |
|---|---|---|
| Cohen's d | t-test | 0.2 small, 0.5 medium, 0.8 large |
| Cramer's V | chi-squared | sqrt(chi2 / (n * (min_dim-1))) |
| Eta-squared | ANOVA/KW | SS_between / SS_total |
| Rank-biserial r | Mann-Whitney | 1 - (2U)/(n1*n2) |
| Pseudo R-squared | Logistic | McFadden |
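Two of the table's formulas, Cramér's V and rank-biserial r, computed on hypothetical data (the contingency table and position samples are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical 2x4 contingency table: cited / not-cited across 4 query categories
table = np.array([[30, 12, 45, 9],
                  [20, 38, 15, 41]])

chi2, p, dof, _ = stats.chi2_contingency(table)
n = table.sum()
min_dim = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))  # sqrt(chi2 / (n * (min_dim - 1)))

# Rank-biserial r from Mann-Whitney U: r = 1 - 2U / (n1 * n2)
a = np.array([1, 2, 2, 3, 3, 3])        # hypothetical citation tertiles, group A
b = np.array([2, 3, 3, 3, 1, 2, 3])     # group B
u, _ = stats.mannwhitneyu(a, b, alternative="two-sided")
r_rb = 1 - (2 * u) / (len(a) * len(b))

print(f"Cramer's V={cramers_v:.3f}, rank-biserial r={r_rb:.3f}")
```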
Each detected citation undergoes analysis of:
| Field | Method |
|---|---|
| Sentiment | Regex against 16 positive + 12 negative signals (PT-BR + EN), 200-char window |
| Attribution | Hierarchy: linked (URL present) > named (entity in text) > paraphrased |
| Factual accuracy | Verification against canonical facts (founding year, CEO, HQ) for 5 key entities |
| Hedging | 16 regex patterns ("according to", "reportedly", "possivelmente") |
| Position | Tertile: 1 (first third), 2 (middle), 3 (last third) of response |
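The hedging and position fields can be sketched as below. The pattern list is an illustrative subset (the real table has 16 patterns) and the function names are assumptions, not the repo's `context_analyzer` API.

```python
import re

# Illustrative subset of the hedging patterns (PT-BR + EN)
HEDGE_PATTERNS = [
    r"\baccording to\b", r"\breportedly\b", r"\bpossivelmente\b",
    r"\bsegundo\b", r"\bit is believed\b",
]

def hedge_count(text: str) -> int:
    """Count hedging markers in a response (case-insensitive)."""
    return sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in HEDGE_PATTERNS)

def position_tertile(char_index: int, response_len: int) -> int:
    """Map a citation's character offset to tertile 1, 2, or 3 of the response."""
    return min(3, int(3 * char_index / response_len) + 1)

print(hedge_count("According to analysts, Nubank is possivelmente the leader."))  # 2
print(position_tertile(10, 300))   # 1 (first third)
print(position_tertile(250, 300))  # 3 (last third)
```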
| # | Title | Venue | Status | Main Methodology |
|---|---|---|---|---|
| 1 | How LLMs Cite Entities Across Industry Verticals | ArXiv | planned | Multi-vertical tracking, ANOVA/KW across models, time series |
| 2 | GEO vs SEO: Source Divergence | SIGIR/WWW | planned | Weekly Jaccard index (top-10 Google vs LLM sources), 12+ weeks |
| 3 | Industry-Specific Patterns in AI Citation | Information Sciences (Q1) | planned | Fisher exact test, odds ratios, 95% CI, 2 A/B experiments |
| 4 | Three Ways to Fail to Conclude: A Null-Triad Post-Mortem | SSRN / arXiv / SIGIR 2027 | submitted (10.5281/zenodo.19712217) | Null-triad decomposition: H1 underpower, H2 design-null, H3 instrumentation asymmetry |
| 5 | (in preparation) | Elsevier (target) | in preparation — v2 infrastructure operational, 90-day collection window pending OSF preregistration v2 | Balanced 128-entity cohort, 192-query battery, cluster-robust CR1, Monte Carlo null, GLMM, BH-FDR |
src/
config.py # Central configuration, cohorts per vertical, LLM configs
cli.py # Main CLI (click)
collectors/
base.py # Multi-provider LLM client + cache + FinOps tracking
citation_tracker.py # Module 1: Citation Tracker (4 LLMs x 4 verticals)
competitor.py # Module 2: Multi-Vertical Benchmark
serp_overlap.py # Module 3: SERP vs AI Overlap
intervention.py # Module 4: A/B Testing
context_analyzer.py # Module 7: Sentiment, attribution, accuracy
db/
schema.sql # Complete schema (21 tables)
client.py # SQLite/Supabase persistence
persistence/
timeseries.py # Module 5: Daily snapshots
analysis/
statistical.py # Module 6: 7 tests + corrections + effect sizes
visualization.py # Charts with 95% CI (matplotlib/seaborn)
finops/
tracker.py # Cost per token, 4 providers, budget control
monitor.py # Dashboard, alerts, security audit
secrets.py # Key rotation, leak scanning, health checks
api/
main.py # FastAPI (endpoints per vertical)
.github/workflows/
daily-collect.yml # Daily collection 06:00 UTC
weekly-benchmark.yml # Weekly benchmark (Sunday 08:00 UTC)
| Provider | Model | Cost/MTok (in/out) | Monthly Budget |
|---|---|---|---|
| OpenAI | gpt-4o-mini-2024-07-18 | $0.15 / $0.60 | $10 |
| Anthropic | claude-haiku-4-5 | $0.80 / $4.00 | $10 |
| Google | gemini-2.5-flash | $0.15 / $0.60 | $5 |
| Perplexity | sonar | $1.00 / $1.00 | $10 |
| Global | — | — | $70 (hard stop 95%) |
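Per-call cost under the rates above is linear in token counts; a minimal sketch (the dictionary keys are shorthand for illustration, not the providers' exact model IDs):

```python
# Per-MTok rates from the table above (USD, input / output)
RATES = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-haiku-4-5": (0.80, 4.00),
    "gemini-2.5-flash": (0.15, 0.60),
    "sonar": (1.00, 1.00),
}

def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Cost of a single call in USD, given per-million-token rates."""
    rate_in, rate_out = RATES[model]
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000

# e.g. a 500-in / 700-out call to claude-haiku-4-5
cost = call_cost("claude-haiku-4-5", 500, 700)
print(f"${cost:.6f}")  # $0.003200
```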
pip install -e ".[dev]"
cp .env.example .env # Configure API keys
python -m src.cli db migrate
# Collection
python -m src.cli collect all # All verticals
python -m src.cli collect all --vertical fintech # Fintech only
python -m src.cli collect citation # Citation Tracker only
# Analysis
python -m src.cli analyze --report # Full report
python -m src.cli analyze --report --vertical saude # Healthcare only
python -m src.cli analyze --visualize # Charts per vertical
# Database
python -m src.cli db migrate # Apply schema
python -m src.cli db export --format csv # Export data
python -m src.cli db health # Health per vertical
# Migrations (one-time, idempotent)
python -m src.db.migrate_normalize_models # Normalize GPT model strings
python -m src.db.migrate_cited_entity # Backfill cited_entity
python -m src.db.migrate_0003_eficacia_consistencia # query_type, composite indexes, backfills
# Consolidated export (replaces data/extract_*.py — Onda 4 refactor 2026-04-19)
python scripts/export_data.py --format text
python scripts/export_data.py --format json --output data/dashboard.json
python scripts/export_data.py --format csv --vertical fintech
python scripts/export_data.py --format html

- `query_type` (`directive` vs `exploratory`) isolates framing bias in Paper 1 ANOVA
- Fictional entities (`FICTIONAL_ENTITIES` in `src/config.py`) calibrate the false-positive rate — 8 entities, activatable via env `INCLUDE_FICTIONAL_ENTITIES=true`
- Mandatory LLMs (env `MANDATORY_LLMS`) enforce a balanced cohort — the pipeline fails loud if any mandatory provider drops
- 5 composite indexes on `citations` — `(vertical, cited)`, `(vertical, llm)`, `(timestamp, vertical)`, `(llm, model_version)`, `(query_type)` — prevent table scans at N > 10K
- Backfills applied on 940 legacy rows (model_version NULL → model)
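The effect of a composite index such as `(vertical, cited)` can be verified with SQLite's `EXPLAIN QUERY PLAN`; a minimal in-memory sketch with a deliberately simplified schema (the real `citations` table has more columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (vertical TEXT, cited TEXT, llm TEXT, ts TEXT)")
# Composite index mirroring (vertical, cited) from the list above
conn.execute("CREATE INDEX idx_citations_vertical_cited ON citations (vertical, cited)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM citations WHERE vertical = ? AND cited = ?",
    ("fintech", "Nubank"),
).fetchall()
print(plan)  # plan row names idx_citations_vertical_cited: an index search, not a scan
```

Because both WHERE columns are the leading columns of the index, the planner reports a SEARCH using the index rather than a full-table SCAN.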
See docs/ARCHITECTURE.md for the full flow and docs/audits/2026-03-26/ for historical audit context.
The v2.0.0-reboot ships a fully pinned, container-reproducible environment:
# One-shot container build + test suite + sample analysis
./scripts/reproduce.sh
# Or, manual Docker workflow
docker build -t papers-v2 .
docker run --rm -v "$PWD":/workspace papers-v2 pytest -q

| Artifact | Purpose |
|---|---|
| `Dockerfile` | Python 3.11 image with system deps and pinned requirements |
| `requirements-lock.txt` | Fully pinned dependency set (hashes) used by CI and Docker |
| `scripts/reproduce.sh` | End-to-end: build, migrate, run 78-test suite, emit analysis sample |
| `CHANGELOG.md` | Version history from v1 through v2.0.0-reboot |
| Document | Description |
|---|---|
| docs/METHODOLOGY_V2.md | Current statistical methodology (v2.0.0-reboot) — source of truth |
| CHANGELOG.md | Top-level version history (v1 through v2.0.0-reboot) |
| docs/ARCHITECTURE.md | Full pipeline flow, schema layers (core vs future), operational commands |
| docs/METHODOLOGY.md | Historical methodology (v1, pre-reboot) — kept for Paper 4 context |
| docs/REQUIREMENTS.md | Formal specification (functional/non-functional) |
| docs/GOVERNANCE.md | Spending policies, ADRs, roadmap |
| docs/MANUAL.md | Operational manual |
| docs/CHANGELOG.md | Legacy per-docs change log |
| docs/audits/2026-03-26/ | Archived statistical audit (N=397 snapshot) |
| output/critica_estatistica_panel.md | Critical review by panel of 7 specialists |
MIT
Author: Alexandre Caramaschi — CEO of Brasil GEO, former CMO at Semantix (Nasdaq), co-founder of AI Brasil.
| Property | Stack | Status |
|---|---|---|
| alexandrecaramaschi.com | Next.js 16 + React 19 + Supabase | Production — 35 courses, 25 insights, 122K+ lines |
| brasilgeo.ai | Cloudflare Workers | Production — 14 articles |
| geo-orchestrator | Python + 5 LLMs | Active — multi-LLM pipeline |
| curso-factory | Python + Jinja2 | Active — course generation pipeline |
| geo-checklist | Markdown | Open-source — GEO audit checklist |
| llms-txt-templates | Markdown + JSON | Open-source — llms.txt standard |
| geo-taxonomy | JSON + CSV + Markdown | Open-source — 60+ GEO terms |
| entity-consistency-playbook | Markdown | Open-source — entity consistency |
| papers | Python + Supabase | Research — LLM citation study |