Add benchmarks and optimize Gmail analysis + policy batch processing (concurrency, retries, chunking) by Mightyman14386 · Pull Request #9 · Mightyman14386/DataMap

Mightyman14386 · 2026-03-01T15:47:21Z

Improve Gmail analysis throughput and reliability by introducing concurrent fetching, retry/backoff logic, and metadata-only requests to reduce rate-limit pressure.
Make privacy-policy LLM analysis more robust and efficient by normalizing domains, chunking large batches, bounding prompt sizes, and falling back to cached/default values when LLMs are unavailable.
Provide lightweight benchmarking tools to measure parser, pipeline, and LLM-analysis optimization impacts.
Clean up domain/service normalization and logging to reduce duplicate work and improve observability.

Added three benchmark scripts (scripts/benchmark-gmail-parse.mjs, scripts/benchmark-gmail-pipeline.mjs, and scripts/benchmark-llm-analysis-optimization.mjs) and exposed them as npm scripts in package.json (bench:gmail-parse, bench:gmail-pipeline, bench:llm-analysis).
Rewrote the Gmail analyze route to paginate with LIST_PAGE_SIZE, fetch message metadata in parallel with FETCH_CONCURRENCY, use withRetry (exponential backoff + jitter) around Gmail API calls, extract headers with extractRelevantHeaders, and stream progress logging while building parsedEmails for downstream processing.
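For reviewers unfamiliar with the pattern, the retry and bounded-concurrency behavior described above can be sketched roughly as follows. This is not the code from the route: `retryWithBackoff`, `backoffDelay`, and `mapWithConcurrency` are hypothetical names, and the retry count and delays are assumed values, since the PR text does not state the real `withRetry` signature or the value of FETCH_CONCURRENCY.

```javascript
// Illustrative sketch only; names and defaults are assumptions, not the PR's code.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Exponential backoff with "full jitter": the ceiling doubles per attempt
// (capped at maxDelayMs) and the actual delay is a random point below it,
// so parallel workers do not all retry at the same instant.
function backoffDelay(attempt, baseDelayMs, maxDelayMs, random = Math.random) {
  const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Calls fn until it succeeds, the retry budget is spent, or the error is
// judged non-retryable (for example, anything other than a rate-limit error).
async function retryWithBackoff(fn, options = {}) {
  const { retries = 3, baseDelayMs = 250, maxDelayMs = 8000, isRetryable = () => true } = options;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      if (attempt >= retries || !isRetryable(err)) throw err;
      await sleep(backoffDelay(attempt, baseDelayMs, maxDelayMs));
    }
  }
}

// Runs worker over items with at most `limit` calls in flight.
// Results are stored by index, so output order matches input order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}
```

In the route, each message fetch would presumably be wrapped in the retry helper and the page of message IDs passed through the concurrency-limited map, so a single rate-limited request backs off without stalling the other in-flight fetches.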
Improved discovered-services analysis by normalizing domains, trimming service names, deduplicating by domain, fetching policies and breach checks in parallel, and using a DEFAULT_POLICY_ANALYSIS when LLM analysis is unavailable; results are still persisted in batches.
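As a rough illustration of the normalization and dedupe step, the sketch below reduces each service's domain to a bare lowercase hostname and keeps the first service seen per domain. The function names and the `{ name, domain }` shape are assumptions made for the example; the actual helpers in the PR may normalize differently.

```javascript
// Illustrative sketch only; the service shape and helper names are assumed.

// Reduces a URL or host string to a lowercase hostname without a leading "www."
// or a trailing dot. Returns "" when the input cannot be parsed as a host.
function normalizeDomain(input) {
  const raw = String(input ?? "").trim().toLowerCase();
  if (!raw) return "";
  try {
    const url = new URL(raw.includes("://") ? raw : `https://${raw}`);
    return url.hostname.replace(/^www\./, "").replace(/\.$/, "");
  } catch {
    return "";
  }
}

// Keeps the first service per normalized domain and trims its display name,
// so later policy fetches and breach checks run once per domain.
function dedupeServicesByDomain(services) {
  const byDomain = new Map();
  for (const service of services) {
    const domain = normalizeDomain(service.domain);
    if (!domain || byDomain.has(domain)) continue;
    byDomain.set(domain, { ...service, name: String(service.name ?? "").trim(), domain });
  }
  return [...byDomain.values()];
}
```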
Overhauled batch privacy-policy analysis: normalized policy domains, chunked policies into bounded LLM batches (chunkPoliciesForBatch) with per-item and per-batch character limits, added analyzePolicyChunks to iterate chunks against providers, reduced max_tokens/maxOutputTokens to 1800, and mapped and merged results by domain (with logging). Also made extractJSON more defensive.
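The chunking idea can be sketched as a greedy packer: truncate each policy to a per-item limit, then start a new batch whenever adding the next policy would exceed the per-batch limit. The function below is a stand-in for chunkPoliciesForBatch, not a copy of it; the `{ domain, text }` shape and both limit values are assumptions, since the PR text does not list the real ones.

```javascript
// Illustrative sketch only; limits and the policy shape are assumed values.
const MAX_CHARS_PER_POLICY = 6000;
const MAX_CHARS_PER_BATCH = 20000;

// Greedily packs policies into batches whose combined text stays within
// maxBatchChars. Each policy's text is first truncated to maxItemChars.
// If maxItemChars exceeds maxBatchChars, a single oversized item still gets
// its own batch rather than being dropped.
function chunkPolicies(policies, options = {}) {
  const { maxItemChars = MAX_CHARS_PER_POLICY, maxBatchChars = MAX_CHARS_PER_BATCH } = options;
  const chunks = [];
  let current = [];
  let currentChars = 0;
  for (const policy of policies) {
    const text = String(policy.text ?? "").slice(0, maxItemChars);
    if (current.length > 0 && currentChars + text.length > maxBatchChars) {
      chunks.push(current);
      current = [];
      currentChars = 0;
    }
    current.push({ ...policy, text });
    currentChars += text.length;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Bounding both limits keeps every prompt under a predictable size, which is also what makes a fixed output budget such as the 1800-token cap workable per batch.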

No automated test suite was executed as part of this rollout; no existing tests were modified.
The change adds runnable benchmark scripts via npm run bench:gmail-parse, npm run bench:gmail-pipeline, and npm run bench:llm-analysis for local performance validation.

Optimize server LLM analysis batching and dedupe pipeline

2136249

Mightyman14386 added the codex label Mar 1, 2026 — with ChatGPT Codex Connector
