ProductNormaliser is an open product-intelligence engine for turning messy retail and manufacturer page data into clean, canonical, comparable product records. It crawls source pages, extracts structured product evidence, normalises attributes into a category schema, resolves identity across sources, merges competing claims into a canonical product, and keeps learning over time from quality history, disagreement patterns, and page volatility.
Milestone 1 is centered on an end-to-end operator workflow for three rollout categories: `tv`, `monitor`, and `laptop`. The platform still keeps category and normalisation extension points broad enough for future electrical-goods expansion, but the completed milestone scope is the crawl, management, product, and quality experience for those three categories.
Product data on the public web is inconsistent:
- retailers describe the same product differently
- specifications appear in different units and formats
- some sources omit attributes while others contradict them
- offers change faster than technical specifications
- a page that was accurate last week may be stale today
ProductNormaliser addresses that by maintaining both:
- source-level truth: what each source said, when it said it, and how reliable it has been
- canonical truth: the best current merged view of the product with supporting evidence and merge confidence
ProductNormaliser overlaps with commercial product-intelligence platforms in outcome, but not in delivery model or scope.
GfK and NIQ are typically associated with large-scale market measurement, retail intelligence, panel data, and syndicated commercial datasets. ProductNormaliser is not a replacement for that kind of market infrastructure. It is better thought of as a transparent product-record intelligence layer:
- ProductNormaliser focuses on entity resolution, specification normalisation, source trust, change tracking, and canonical product construction.
- GfK and NIQ typically operate higher in the commercial stack with broader market coverage, proprietary data assets, and enterprise reporting products.
- ProductNormaliser gives you explainable product records and evidence trails that can feed your own analytics, catalog, pricing, or monitoring workflows.
CNET-style experiences are usually editorial, shopper-facing, and review-driven. ProductNormaliser does not generate consumer reviews or editorial verdicts. Instead, it provides the structured evidence layer such experiences can consume:
- canonical specs
- source comparisons
- change history
- offer history
- conflict and disagreement visibility
In other words, CNET-style product insight sits closer to presentation and interpretation; ProductNormaliser sits closer to data capture, reconciliation, and explainable product truth.
Euromonitor and Mintel are commonly used for macro market research, category strategy, and consumer or industry trend reporting. ProductNormaliser is much more operational and granular:
- it monitors live source pages rather than producing broad market reports
- it tracks product-level evidence rather than consumer-survey or strategy narratives
- it is designed for ingestion into internal systems, not mainly for analyst reports
The simplest way to position ProductNormaliser is:
an open, explainable product-record intelligence engine rather than a full syndicated research platform
It is most useful where you want to own the data pipeline, inspect merge decisions, adapt category logic, and integrate the output into your own downstream services.
The solution now contains ten projects:
- ProductNormaliser.Domain: domain models, category schema, normalisation contracts, merge logic, and intelligence interfaces
- ProductNormaliser.Application: application-layer seam for use cases, orchestration, and future category-agnostic workflow composition
- ProductNormaliser.Infrastructure: MongoDB persistence, crawl queue, fetch/robots services, delta detection, extraction, trust, stability, and disagreement services
- ProductNormaliser.AdminApi: operational and intelligence read API for queue state, crawl logs, conflicts, product history, and quality analytics
- ProductNormaliser.Worker: background processing host that executes the discovery, crawl, and merge pipeline
- ProductNormaliser.Web: web UI host that will consume backend APIs rather than talking to persistence directly
- ProductNormaliser.Domain.Tests: focused test project for domain-level rules and models
- ProductNormaliser.Application.Tests: current broad integration and orchestration test suite while responsibilities are being split by layer
- ProductNormaliser.AdminApi.Tests: focused test project for API host and controller-facing behavior
- ProductNormaliser.Web.Tests: focused test project for the web UI host
- Operators register or enable managed crawl sources and assign categories such as `tv`, `monitor`, and `laptop`.
- Each source carries a discovery profile with category entry pages, sitemap hints, allow or deny path rules, URL patterns, depth limits, and per-run budgets.
- A category crawl job now seeds deterministic discovery from eligible managed sources instead of relying only on pre-known targets.
- The discovery worker fetches sitemaps and listing pages while respecting robots rules, source throttling, depth limits, and URL budgets.
- Discovered URLs are classified, persisted, and promoted into crawl targets only when they look like valid product candidates.
- The crawl worker fetches confirmed product pages and extracts structured product evidence.
- A source product is built from extracted data and normalised into the canonical schema.
- Identity resolution decides whether the source product matches an existing canonical product.
- Merge logic computes evidence-weighted attribute winners, while conflicts, change events, trust snapshots, disagreement analytics, and discovery progress are persisted.
- Related-link expansion can feed nearby product and listing links back into discovery after successful product fetches.
- The admin API and Razor Pages console expose the operational and analytical view of source setup, discovery progress, product crawl progress, and catalogue quality.
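As a concrete sketch, a source discovery profile of the kind described above might be shaped roughly like this (the property names and values here are illustrative assumptions, not the actual persisted schema):

```json
{
  "sourceId": "example-retailer",
  "categories": ["tv", "monitor"],
  "entryPages": ["https://www.example-retailer.test/tvs"],
  "sitemapHints": ["https://www.example-retailer.test/sitemap.xml"],
  "allowPathRules": ["/tvs/", "/monitors/"],
  "denyPathRules": ["/reviews/", "/blog/"],
  "productUrlPatterns": ["/p/[a-z0-9-]+"],
  "maxDepth": 3,
  "perRunUrlBudget": 500
}
```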
The solution currently includes:
- category metadata and schema discovery for electrical-goods families
- category registry support for the Milestone 1 rollout set: TVs, Monitors, and Laptops
- schema-driven attribute normalisation with category-specific providers for TVs, Monitors, and Laptops
- alias handling and measurement parsing
- structured data extraction from HTML and JSON-LD
- MongoDB persistence for source and canonical records
- MongoDB persistence for managed crawl sources and per-source throttling policy
- source discovery profiles with category entry pages, sitemap hints, allow or deny rules, URL patterns, depth limits, and run budgets
- MongoDB persistence for discovered URLs and the discovery queue
- deterministic discovery infrastructure for sitemap parsing, listing traversal, product-page confirmation, and discovery link policy evaluation
- identity resolution across sources
- explainable merge weighting and conflict detection
- semantic delta detection for product changes
- worker orchestration with dedicated discovery and crawl workers, retry handling, and related-link expansion after successful product fetches
- admin endpoints for operational observability
- admin endpoints for category catalog management and crawl-source management
- admin endpoints for crawl job launch and tracking
- admin endpoints for product list, product detail, and product history inspection
- discovery-aware job, source, and dashboard views showing queue depth, discovered URL counts, confirmed product counts, failures, and per-category or source coverage
- quality analytics for coverage, unmapped attributes, source quality, and merge insights
- temporal intelligence for source trust history, attribute stability, and product change timelines
- adaptive crawl scheduling based on volatility, stability, freshness, and source behavior
- per-source disagreement tracking that feeds back into trust and merge decisions
- a Razor Pages operator console with source registration, category selection, seeded crawl launch, discovery-progress monitoring, product exploration, product detail explainability, quality dashboards, and source management
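The evidence-weighted merge listed above can be sketched in miniature. This is an illustrative Python sketch under assumed inputs, not the project's actual C# merge logic: each competing claim for an attribute carries a source-trust and freshness score, agreeing values reinforce each other, and the winner's share of total weight acts as a merge confidence.

```python
# Illustrative evidence-weighted attribute merge (assumed weighting
# model; the real project implements this in C# with richer signals).
from collections import defaultdict


def merge_attribute(claims):
    """Pick a winning value for one attribute from competing source claims.

    Each claim is (value, source_trust, freshness), with trust and
    freshness as 0..1 scores. Agreeing sources sum their weights, so
    consensus beats a single strong but contradicting source.
    Returns (winning_value, merge_confidence).
    """
    weights = defaultdict(float)
    for value, trust, freshness in claims:
        weights[value] += trust * freshness
    winner = max(weights, key=weights.get)
    total = sum(weights.values())
    confidence = weights[winner] / total if total else 0.0
    return winner, confidence


claims = [
    ("65in", 0.9, 1.0),  # trusted, fresh source
    ("65in", 0.6, 0.8),  # second source agrees
    ("55in", 0.8, 0.5),  # stale, contradicting source
]
winner, confidence = merge_attribute(claims)
```

With the claims above, `65in` wins because two agreeing sources outweigh one stale contradicting one, and the losing claim would surface as a recorded conflict rather than being silently dropped.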
The current Milestone 1 flow is intentionally "boot and populate":
- boot the API, web host, and worker against MongoDB
- register or enable sources from the source registry
- choose the active categories in the operator console
- launch a seeded category crawl
- watch discovery queue depth, discovered URL counts, confirmed product targets, crawl failures, and canonical product counts update in the dashboard and crawl-job detail views
- .NET SDK 10.0.x
- MongoDB running locally or a reachable MongoDB instance
One quick local option for MongoDB is Docker:
```shell
docker run -d --name productnormaliser-mongo -p 27017:27017 mongo:7
```

From the repository root:
```shell
dotnet restore ProductNormaliser.slnx
dotnet build ProductNormaliser.slnx
dotnet test ProductNormaliser.slnx
```

Worker runtime configuration lives in `ProductNormaliser.Worker/appsettings.json`.
Key settings:
- `Mongo:ConnectionString`
- `Mongo:DatabaseName`
- `Crawl:UserAgent`
- `Crawl:DefaultHostDelayMilliseconds`
- `Crawl:TransientRetryCount`
- `Crawl:WorkerIdleDelayMilliseconds`
- `Crawl:HostDelayMilliseconds`
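Put together, a minimal worker `appsettings.json` using those keys might look like this (the values shown are illustrative local-development defaults, not recommendations):

```json
{
  "Mongo": {
    "ConnectionString": "mongodb://localhost:27017",
    "DatabaseName": "productnormaliser"
  },
  "Crawl": {
    "UserAgent": "ProductNormaliserBot/1.0",
    "DefaultHostDelayMilliseconds": 2000,
    "TransientRetryCount": 3,
    "WorkerIdleDelayMilliseconds": 5000,
    "HostDelayMilliseconds": 2000
  }
}
```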
Admin API configuration lives in `ProductNormaliser.AdminApi/appsettings.json`.
The worker is the engine that runs both deterministic discovery and product crawling.
```shell
dotnet run --project ProductNormaliser.Worker
```

The worker:
- scans eligible managed sources for discovery work
- expands sitemaps and listing pages into bounded discovery queues
- promotes confirmed product URLs into crawl targets
- fetches and extracts source data from product pages
- builds or updates source products
- merges into canonical products
- records crawl logs, conflicts, trust signals, change events, disagreement data, and discovery progress
- reschedules future attempts using adaptive backoff and discovery budgets
The platform now exposes observability at two layers:
- structured lifecycle logs for crawl job creation, start, cancellation, per-target outcome recording, and terminal completion
- runtime telemetry via the `ProductNormaliser.Operations` `ActivitySource` and `Meter`
- persisted operational summary data via `GET /api/stats`
- operator-facing health panels on the web landing page for queue pressure, retry backlog, failure volume, at-risk sources, and category hotspots
The `Meter` emits counters and histograms for:
- crawl jobs created, started, and completed
- crawl job target outcomes by category and status
- queue dequeues, retries, and terminal outcomes
- processed crawl targets and extracted product counts
- crawl target duration and job target counts
The Admin API stats payload now includes:
- queue depth, retry depth, failed-queue depth, and active job count
- throughput and failure counts for the trailing 24 hours
- source-level health metrics derived from quality snapshots, queue state, and recent crawl logs
- category-level crawl pressure metrics derived from jobs, queue state, and recent crawl logs
Verified by automated tests:
- crawl job lifecycle logging is emitted during create, start, and completion flows
- stats aggregation includes the new operational summary from persisted jobs, queue items, crawl logs, sources, and quality snapshots
- the operator landing page renders the operational health panel and updated contract shape
Observed operationally rather than end-to-end tested:
- `ActivitySource` traces from the worker and crawl services
- `Meter` counters and histograms emitted at runtime for external collection
- the usefulness of the dashboard health summary under real crawl load patterns
The admin API is a read-side service over the same MongoDB database.
```shell
dotnet run --project ProductNormaliser.AdminApi
```

The included HTTP scratch file suggests a local development base address of `http://localhost:5209`, although the final URL depends on your local ASP.NET Core launch configuration.
OpenAPI is mapped in development builds.
- `GET /api/stats`: high-level counts and operational summary
- `GET /api/queue`: current queue state
- `GET /api/queue/priorities`: queue items with computed priority signals and next-attempt timings
- `GET /api/crawl/logs`: recent crawl logs
- `GET /api/crawl/logs/{id}`: individual crawl log detail
- `GET /api/conflicts`: merge conflicts requiring review or analysis
- `GET /api/products/{id}`: canonical product detail
- `GET /api/products/{id}/history`: product change timeline
- `GET /api/categories`: list known categories
- `GET /api/categories/families`: list category families for dashboard grouping
- `GET /api/categories/enabled`: list enabled crawlable categories
- `GET /api/categories/{categoryKey}`: get category metadata
- `GET /api/categories/{categoryKey}/schema`: get category schema
- `GET /api/categories/{categoryKey}/detail`: get metadata and schema in one payload
- `GET /api/sources`: list managed crawl sources
- `GET /api/sources/{sourceId}`: get one managed source
- `POST /api/sources`: register a managed source
- `PUT /api/sources/{sourceId}`: update display name, base URL, and description
- `POST /api/sources/{sourceId}/enable`: enable a source
- `POST /api/sources/{sourceId}/disable`: disable a source
- `PUT /api/sources/{sourceId}/categories`: update assigned category keys
- `PUT /api/sources/{sourceId}/throttling`: update host throttling policy
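For illustration, a `POST /api/sources` registration body might be shaped along these lines (the property names are assumptions inferred from the source model described in this README, not a documented contract; check the actual request DTO):

```json
{
  "displayName": "Example Retailer",
  "baseUrl": "https://www.example-retailer.test",
  "description": "Example consumer-electronics retailer",
  "categories": ["tv", "monitor", "laptop"]
}
```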
- `GET /api/crawljobs`: list crawl jobs with filter and paging support
- `POST /api/crawljobs`: create a crawl job for categories, sources, or products
- `GET /api/crawljobs/{jobId}`: inspect one crawl job and its progress
- `GET /api/products`: list canonical products with quality-aware filtering and sorting
- `GET /api/products/{id}`: canonical product detail
- `GET /api/products/{id}/history`: product change timeline
- `GET /api/quality/coverage/detailed`: category coverage against the schema
- `GET /api/quality/unmapped`: backlog of unmapped or unknown attributes
- `GET /api/quality/sources`: source quality scores
- `GET /api/quality/merge-insights`: merge and evidence quality summary
- `GET /api/quality/source-history`: historical source trust snapshots
- `GET /api/quality/attribute-stability`: per-attribute stability analytics
- `GET /api/quality/source-disagreements`: per-source disagreement metrics
The quality endpoints default to the `tv` category, which remains the current first-class category schema, but the scoring and completeness models are now category-aware.
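For local exploration, an HTTP scratch file in the style of the one shipped with the repo could exercise these endpoints roughly like this (the `category` query parameter and the port are assumptions; verify both against the actual scratch file and your launch configuration):

```http
GET http://localhost:5209/api/quality/coverage/detailed?category=tv

###

GET http://localhost:5209/api/quality/source-disagreements?category=monitor
```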
- Start MongoDB.
- Run the worker against a configured database.
- Seed crawl queue items through application code, scripts, or tests.
- Run the admin API against the same database.
- Inspect queue state, crawl logs, canonical products, and quality endpoints.
- The current admin API is primarily an internal operational interface, not a public hardened product API.
- The worker and API assume a shared MongoDB database.
- The system is designed to preserve evidence rather than flattening source data into a single opaque record.
- Crawl behavior is adaptive: successful, volatile, stable, or failure-prone pages will naturally drift toward different revisit cadences.
- Trust is temporal: source quality is treated as a changing signal, not a fixed source rank.
- TV remains the deepest category implementation; monitor and laptop support are included in the Milestone 1 rollout but still need broader extraction coverage and richer normalisation rules over time
- queue write flows are currently aimed at internal operator use rather than a public ingestion API
- the admin surface now has API-key authentication and role-based operator access, but production identity, secret rotation, and perimeter hardening are still not fully formalized
- production deployment concerns such as distributed workers, secret management, and externalized observability are not yet formalized in the repo
The separation between Domain, Application, Infrastructure, Worker, and AdminApi is deliberate:
- Domain stays focused on product intelligence rules, contracts, and shared models
- Application owns orchestration and workflow validation
- Infrastructure implements persistence and external-system adapters
- Worker owns the write-side discovery and crawl pipeline
- AdminApi owns the read-side operational and analytical experience
That split keeps the domain logic explainable and testable while still allowing the runtime services to evolve independently.