Datrix is designed as a cloud-native, horizontally scalable, multi-tenant platform. Every architectural decision is made against six guiding principles:
Stateless processing: all compute services are stateless and horizontally scalable
Hard tenant isolation: cross-customer data access is architecturally impossible, not just prohibited by policy
Async by default: all long-running operations are asynchronous with real-time progress
Idempotency: every operation can be safely retried without side effects
Observability first: every service emits structured logs, metrics, and traces from day one
API-first: every platform capability is exposed via a stable, versioned API before any UI is built
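The idempotency principle is commonly realised with a client-supplied key that lets the server recognise a retry. The sketch below is only an illustration of that idea: `IdempotentExecutor`, the in-memory dictionary, and the `req-123` key are hypothetical and not taken from the Datrix design. A real deployment would need a shared, persistent store.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs an operation at most once per idempotency key (hypothetical helper)."""

    def __init__(self) -> None:
        # In-memory cache for illustration only; an assumption, not Datrix's store.
        self._results: Dict[str, Any] = {}

    def run(self, key: str, operation: Callable[[], Any]) -> Any:
        # A retry with a known key returns the stored result instead of re-running.
        if key in self._results:
            return self._results[key]
        result = operation()
        self._results[key] = result
        return result


calls = []


def create_dataset() -> dict:
    calls.append(1)  # side effect that should happen exactly once
    return {"id": "ds-1", "status": "registered"}


executor = IdempotentExecutor()
first = executor.run("req-123", create_dataset)
retry = executor.run("req-123", create_dataset)  # safe retry, no second side effect
```

The retry returns the same result as the first call, and `create_dataset` runs only once.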
1.2 High-Level System Architecture
Datrix uses a microservices architecture on Kubernetes, with an event-driven backbone (Kafka), object storage for all data (S3-compatible), a metadata store (PostgreSQL + Redis), and a GPU compute pool for ML-intensive operations. All services communicate via gRPC internally and REST/GraphQL externally.
| Layer | Technology & Responsibility |
| --- | --- |
| Client Layer | Web app (React), Python SDK, REST API, WebSocket for real-time updates |
| API Gateway | Kong: rate limiting, auth, routing, request logging, SSL termination |
| Application Services | Microservices (Node.js/Python): business logic, orchestration, state management |


| Role | Permissions |
| --- | --- |
|  | All permissions including billing, user management, org configuration |
| Admin | All permissions except billing; can manage users and roles |
| Data Engineer | Manage datasets, pipelines, connectors; no compliance module access |
| Data Scientist | Quality scans, synthetic generation, active learning; read-only pipelines |
| Annotator | Annotation queue access only; no data access beyond assigned tasks |
| Viewer | Read-only access to dashboards, reports, quality scores |
| API User | Programmatic access matching configured permissions; no UI |

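The role descriptions above can be sketched as a role-to-permission lookup. Everything named here is an assumption made for illustration: the permission strings (such as `pipelines:read`), the `Role` enum, and `has_permission` do not come from the source, which describes permissions only in prose. The API User role is omitted because its permissions are configured per key.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Role(str, Enum):
    ADMIN = "admin"
    DATA_ENGINEER = "data_engineer"
    DATA_SCIENTIST = "data_scientist"
    ANNOTATOR = "annotator"
    VIEWER = "viewer"


# Hypothetical permission names; the source lists capabilities only in prose.
ALL_PERMISSIONS: FrozenSet[str] = frozenset({
    "billing:manage", "users:manage", "roles:manage", "org:configure",
    "datasets:manage", "pipelines:manage", "pipelines:read", "connectors:manage",
    "quality:scan", "synthetic:generate", "active_learning:use",
    "annotation:queue", "compliance:access",
    "dashboards:read", "reports:read", "quality_scores:read",
})

ROLE_PERMISSIONS: Dict[Role, FrozenSet[str]] = {
    # Admin: everything except billing.
    Role.ADMIN: ALL_PERMISSIONS - {"billing:manage"},
    # Data Engineer: datasets, pipelines, connectors; no compliance access.
    Role.DATA_ENGINEER: frozenset({
        "datasets:manage", "pipelines:manage", "pipelines:read", "connectors:manage",
    }),
    # Data Scientist: scans, synthetic data, active learning; pipelines read-only.
    Role.DATA_SCIENTIST: frozenset({
        "quality:scan", "synthetic:generate", "active_learning:use", "pipelines:read",
    }),
    # Annotator: annotation queue only.
    Role.ANNOTATOR: frozenset({"annotation:queue"}),
    # Viewer: read-only dashboards, reports, quality scores.
    Role.VIEWER: frozenset({"dashboards:read", "reports:read", "quality_scores:read"}),
}


def has_permission(role: Role, permission: str) -> bool:
    """Deny by default: unknown roles or permissions are never granted."""
    return permission in ROLE_PERMISSIONS.get(role, frozenset())
```

Denying by default keeps a missing mapping from silently granting access.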
2.3 Core API Endpoints
Datasets
POST /v1/datasets Upload or register a dataset
GET /v1/datasets List all datasets with pagination
GET /v1/datasets/{id} Get dataset metadata and quality score
DELETE /v1/datasets/{id} Delete dataset and all derived data
POST /v1/datasets/{id}/scan Trigger quality scan (async)
GET /v1/datasets/{id}/scan/{scanId} Get scan status and results
GET /v1/datasets/{id}/columns Get per-column quality profiles
GET /v1/datasets/{id}/versions List all versions of this dataset
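Because the scan endpoint is asynchronous, a client typically triggers the scan and then polls the status endpoint. The sketch below shows that pattern against the two scan routes listed above. The response fields (`scanId`, `status`), the status values, and the `request` callable are assumptions for illustration; the source does not specify the response schema. A fake transport stands in for a real HTTP client so the example runs offline.

```python
import time
from typing import Callable, Dict

# `request` is any callable (method, path) -> parsed JSON dict, for example a thin
# wrapper around an HTTP client. Field names and status values are assumed.
Request = Callable[[str, str], Dict]


def run_scan(request: Request, dataset_id: str,
             poll_seconds: float = 2.0, max_polls: int = 150) -> Dict:
    """Trigger an async quality scan and poll until it reaches a terminal state."""
    started = request("POST", f"/v1/datasets/{dataset_id}/scan")
    scan_id = started["scanId"]
    for _ in range(max_polls):
        scan = request("GET", f"/v1/datasets/{dataset_id}/scan/{scan_id}")
        if scan["status"] in ("completed", "failed"):
            return scan
        time.sleep(poll_seconds)
    raise TimeoutError(f"scan {scan_id} did not finish after {max_polls} polls")


def make_fake_request() -> Request:
    """Stand-in transport that reports 'running' twice, then 'completed'."""
    statuses = iter(["running", "running", "completed"])

    def fake(method: str, path: str) -> Dict:
        if method == "POST":
            return {"scanId": "scan-1", "status": "queued"}
        return {"scanId": "scan-1", "status": next(statuses)}

    return fake


result = run_scan(make_fake_request(), "ds-1", poll_seconds=0.0)
```

Bounding the number of polls avoids a client that waits forever on a stuck job.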
Quality
GET /v1/quality/{datasetId}/score Get current quality score with breakdown
GET /v1/quality/{datasetId}/issues List all issues sorted by impact
POST /v1/quality/{datasetId}/fix Apply automated fixes (specify issue IDs)
POST /v1/quality/{datasetId}/fix/preview Preview fixes before applying
DELETE /v1/quality/{datasetId}/fix/{fixId} Roll back a specific fix
GET /v1/quality/{datasetId}/history Quality score history over time
Pipelines
POST /v1/pipelines Create pipeline (NLP or visual definition)
GET /v1/pipelines/{id} Get pipeline definition
PUT /v1/pipelines/{id} Update pipeline definition
POST /v1/pipelines/{id}/run Execute pipeline (async)
POST /v1/pipelines/{id}/dry-run Dry run on a 1K-row sample (async)
GET /v1/pipelines/{id}/runs List execution history
GET /v1/runs/{runId}/status Get run status and progress
GET /v1/runs/{runId}/output Get run output metadata and download URL
Synthetic Data
POST /v1/synthetic/analyze Analyze gaps in dataset (async)
GET /v1/synthetic/analyze/{jobId} Get gap analysis results
POST /v1/synthetic/generate Start generation job (async)
GET /v1/synthetic/jobs/{jobId} Get generation job status
GET /v1/synthetic/jobs/{jobId}/results Get generated dataset
POST /v1/synthetic/validate Validate a synthetic dataset
POST /v1/synthetic/blend Blend real and synthetic datasets
Active Learning
POST /v1/active-learning/register Register a model for uncertainty tracking
POST /v1/active-learning/score Submit predictions to get uncertainty scores
POST /v1/active-learning/select Select examples from unlabeled pool
GET /v1/active-learning/queue Get annotation queue for current user
POST /v1/active-learning/label Submit labels for selected examples
POST /v1/active-learning/retrain Trigger model retraining with new labels
Compliance
GET /v1/compliance/dashboard Get compliance posture summary
GET /v1/compliance/regulations List applicable regulations for this org
GET /v1/compliance/{regulation}/controls Get control status for a regulation
POST /v1/compliance/reports/{regulation} Generate compliance report
GET /v1/compliance/audit-trail Query audit trail with filters
POST /v1/compliance/incidents Report a compliance incident
2.4 API Standards
Pagination: cursor-based for all list endpoints; page size default 20, max 100
Filtering: all list endpoints support field-based filtering via query params
Sorting: all list endpoints support sort field and direction
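The pagination rule above (default page size 20, maximum 100) can be sketched as follows. This is a simplified illustration, not the Datrix implementation: the cursor here is just a base64-encoded offset, whereas a production cursor would usually encode the sort key of the last row. The parameter names `cursor` and `limit`, the `next_cursor` field, and the choice to clamp rather than reject an oversized limit are all assumptions.

```python
import base64
from typing import Dict, List, Optional

DEFAULT_PAGE_SIZE = 20
MAX_PAGE_SIZE = 100


def encode_cursor(index: int) -> str:
    """Wrap a position in an opaque, URL-safe token."""
    return base64.urlsafe_b64encode(str(index).encode()).decode()


def decode_cursor(cursor: str) -> int:
    return int(base64.urlsafe_b64decode(cursor.encode()).decode())


def paginate(items: List[Dict], cursor: Optional[str] = None,
             limit: Optional[int] = None) -> Dict:
    # Apply the default when no limit is given; clamp explicit limits to 1..100.
    size = DEFAULT_PAGE_SIZE if limit is None else max(1, min(limit, MAX_PAGE_SIZE))
    start = decode_cursor(cursor) if cursor else 0
    page = items[start:start + size]
    end = start + len(page)
    # A missing next_cursor tells the client it has reached the last page.
    next_cursor = encode_cursor(end) if end < len(items) else None
    return {"data": page, "next_cursor": next_cursor}


items = [{"id": i} for i in range(250)]
first = paginate(items)                                            # default size
second = paginate(items, cursor=first["next_cursor"], limit=500)   # clamped to 100
last = paginate(items, cursor=encode_cursor(240))                  # final partial page
```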
CREATE TABLE pipelines (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id UUID NOT NULL REFERENCES organisations(id) ON DELETE CASCADE,
    name TEXT NOT NULL,
    description TEXT,
    definition JSONB NOT NULL, -- DAG: nodes, edges, config per node
    input_dataset UUID REFERENCES datasets(id),
    task_type TEXT,
    output_format TEXT,
    version INT NOT NULL DEFAULT 1,
    template_id UUID,
    is_template BOOLEAN NOT NULL DEFAULT FALSE,
    created_by UUID NOT NULL REFERENCES users(id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
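The `definition` column is described as a DAG of nodes and edges with per-node config. The sketch below shows one possible shape for that JSON and a check that it is a valid DAG before it is stored or run. The key names (`nodes`, `edges`, `id`, `from`, `to`, `config`) and the example node names are assumptions; the source does not define the schema.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List

# Hypothetical shape for the `definition` column; key names are assumed.
definition = {
    "nodes": [
        {"id": "load", "config": {"format": "csv"}},
        {"id": "dedupe", "config": {"keys": ["email"]}},
        {"id": "export", "config": {"format": "parquet"}},
    ],
    "edges": [
        {"from": "load", "to": "dedupe"},
        {"from": "dedupe", "to": "export"},
    ],
}


def execution_order(definition: Dict) -> List[str]:
    """Return node ids in a valid run order, or raise ValueError if invalid."""
    node_ids = {node["id"] for node in definition["nodes"]}
    sorter = TopologicalSorter({node_id: set() for node_id in node_ids})
    for edge in definition["edges"]:
        if edge["from"] not in node_ids or edge["to"] not in node_ids:
            raise ValueError(f"edge references unknown node: {edge}")
        sorter.add(edge["to"], edge["from"])  # "to" depends on "from"
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        raise ValueError("pipeline definition contains a cycle") from exc


order = execution_order(definition)
```

Rejecting cycles and dangling edges at write time means a stored pipeline can always be scheduled.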
3.3 Key Indexes
-- Quality scans: fast lookup by dataset, ordered by recency
CREATE INDEX idx_quality_scans_dataset ON quality_scans(dataset_id, created_at DESC);
-- Issues: fast filtering by severity and status
CREATE INDEX idx_issues_scan_severity ON quality_issues(scan_id, severity, status);
CREATE INDEX idx_issues_impact ON quality_issues(dataset_id, impact_score DESC);
-- Datasets: org-scoped queries with status filter
CREATE INDEX idx_datasets_org ON datasets(org_id, status, created_at DESC);
-- Pipeline runs: status monitoring
CREATE INDEX idx_runs_status ON pipeline_runs(org_id, status, created_at DESC);
4. Quality Engine — Technical Design
4.1 Architecture
The quality engine runs as a Spark job for datasets larger than 100 MB and as a Polars job for smaller datasets. The core algorithm is embarrassingly parallel: each quality dimension is computed independently across columns, and the results are then aggregated.
Performance target: a full quality scan of a 1M-row, 50-column dataset completes in under 5 minutes on the standard worker pool (4 workers, each with 8 CPU cores and 32 GB RAM).
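The engine choice and the compute-then-aggregate structure described above can be sketched in plain Python. This is an illustration only: `choose_engine`, `completeness`, and `scan` are hypothetical names, the 100 MB threshold is read as 100 MiB, a dataset of exactly that size is sent to Polars, and the unweighted mean is a placeholder for whatever aggregation the engine actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional

SPARK_THRESHOLD_BYTES = 100 * 1024 * 1024  # "100MB" read as 100 MiB (assumption)


def choose_engine(size_bytes: int) -> str:
    """Spark above the threshold, Polars otherwise."""
    return "spark" if size_bytes > SPARK_THRESHOLD_BYTES else "polars"


def completeness(values: List[Optional[float]]) -> float:
    """One example quality dimension: share of non-null values in a column."""
    if not values:
        return 1.0
    return sum(v is not None for v in values) / len(values)


def scan(columns: Dict[str, List[Optional[float]]]) -> Dict[str, float]:
    # Each column is scored independently, so the work maps cleanly onto a pool.
    with ThreadPoolExecutor() as pool:
        per_column = dict(zip(columns, pool.map(completeness, columns.values())))
    # Aggregation step: unweighted mean, chosen only for illustration.
    overall = sum(per_column.values()) / len(per_column) if per_column else 1.0
    return {**per_column, "_overall": overall}


columns = {
    "age": [1.0, None, 3.0, 4.0],
    "income": [10.0, 20.0, None, None],
}
result = scan(columns)
engine = choose_engine(250 * 1024 * 1024)
```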
def classify_missingness(df, col) -> str:
    # MCAR test: compare null rate across random splits
    # MAR test: check if nulls correlate with other column values
    # MNAR test: model the column, check if residuals predict nullness
    null_mask = df[col].is_null()
    if is_independent_of_all_columns(df, null_mask):
        return 'MCAR'
    if correlates_with_other_columns(df, null_mask):
        return 'MAR'
    return 'MNAR'
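The two helper checks used by `classify_missingness` are not defined in this section. One rough way to approximate the MAR check, assuming numeric columns held in plain Python lists, is to correlate a column's null indicator with every other column. The function names, the 0.3 threshold, and the use of Pearson correlation are all assumptions for illustration. A linear correlation only detects one kind of dependence, and MNAR generally cannot be confirmed from observed values alone.

```python
from math import sqrt
from typing import Dict, List, Optional

Column = List[Optional[float]]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either side has no variance."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def max_null_correlation(data: Dict[str, Column], col: str) -> float:
    """Largest |correlation| between col's null indicator and any other column."""
    null_mask = [1.0 if v is None else 0.0 for v in data[col]]
    best = 0.0
    for name, values in data.items():
        if name == col:
            continue
        # Use only rows where the other column is observed.
        pairs = [(m, v) for m, v in zip(null_mask, values) if v is not None]
        r = pearson([m for m, _ in pairs], [v for _, v in pairs])
        best = max(best, abs(r))
    return best


def looks_mar(data: Dict[str, Column], col: str, threshold: float = 0.3) -> bool:
    """Heuristic: nulls that track another column suggest MAR (threshold assumed)."""
    return max_null_correlation(data, col) >= threshold


# Toy data: income is missing only for the older half of the rows.
data = {
    "age": [20.0, 25.0, 30.0, 35.0, 60.0, 65.0, 70.0, 75.0],
    "income": [1.0, 2.0, 3.0, 4.0, None, None, None, None],
}
income_flag = looks_mar(data, "income")
age_flag = looks_mar(data, "age")
```

In the toy data, missing `income` values track `age`, so the heuristic flags `income`; `age` has no nulls and is not flagged.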