Skip to content

Add MS-SQL (SQL Server) support #109

Description

@myyong

This issue tracks the work done on the mssql branch to make Datafaker work end-to-end against Microsoft SQL Server databases.

Background

Datafaker was originally written targeting PostgreSQL and DuckDB. MS-SQL differs in several ways that required systematic fixes across the codebase: different SQL dialect (no RANDOM()/LIMIT, no EXTRACT, no STDDEV), mandatory schema qualification when a non-default schema is in use, different type names, identity column behaviour, and stricter rules around foreign-key cascade paths.


Changes by area

Foundation and initial driver support

  • b2f14ab — Add initial MS-SQL support: pyodbc/aioodbc dependencies, async DSN rewriting, schema routing via SQLAlchemy schema_translate_map
  • d872fd9 — Document macOS ODBC driver setup and add .env.example for MS-SQL connection strings
  • fc97b2c — Add TrustServerCertificate=yes to MS-SQL DSN examples

Type system

Schema and foreign keys

Primary keys

Dialect-correct SQL in generators and interactive shell

MS-SQL does not support RANDOM() / LIMIT n or EXTRACT(… FROM …) / STDDEV. Every code path that emits these had to be updated:

Schema qualification for SQL stored in src-stats

Queries written by configure-generators into config.yaml's src-stats section are later executed by make-stats via raw text() — SQLAlchemy's schema_translate_map does not apply. Each code path that writes these strings needed to embed the schema-qualified table name explicitly:

  • 91578da — Fix missing schema qualification in interactive shell SELECT statements (do_peek, do_counts, print_column_data, _get_column_data)
  • 4d03aa3 — Fix unqualified table names in raw SQL during configure-generators (ContinuousLogDistributionGeneratorFactory, MultivariateNormalGeneratorFactory)
  • 173f6da — Fix unqualified table names in src-stats queries written by configure-generators (_get_aggregate_query, PredefinedGenerator.SELECT_AGGREGATE_RE)
  • 1132dc6 — Fix schema-qualified table name in ChoiceGenerator stored queries (closes ChoiceGenerator stores unqualified table name in custom_queries SQL #108): use text(schema_qualified_name(...)) so the compiled SQL string is not bracket-quoted as [mimic100.person]

Tests

Dialect-correctness tests were added alongside each fix. A dedicated test module tests/test_generators_dialect.py covers:

  • DATEPART vs EXTRACT in MimesisDateTimeGenerator
  • STDEV vs STDDEV in Buckets
  • NEWID()/TOP vs RANDOM()/LIMIT in ChoiceGeneratorFactory, CovariateQuery, MissingnessType
  • Schema qualification in Buckets, ContinuousLogDistributionGeneratorFactory, ChoiceGeneratorFactory, _get_aggregate_query, and PredefinedGenerator

tests/test_interactive_dialect.py covers the interactive shell commands (do_peek, do_counts, _get_column_data, print_column_data) for both MS-SQL and PostgreSQL dialects.

Examples

  • 40fc121 — Rename mimic_omop example to omop-mssql
  • 1857e76 — Add omop-postgresql example

Known limitations / deferred work


Summary

The mssql branch adds end-to-end MS-SQL support to Datafaker. The main categories of change were: (1) driver and connection plumbing, (2) SQL dialect translation (RANDOMNEWID, LIMITTOP, EXTRACTDATEPART, STDDEVSTDEV), (3) schema qualification in stored SQL strings that bypass schema_translate_map, and (4) MS-SQL-specific schema/type/FK constraints. All changes are covered by unit tests using mocked engines and dialect instances.

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions