Open

Mssql #110

52 commits
c0897d4
Update to ignore opiates
myyong Apr 10, 2026
834a20b
Merge branch 'main' of https://github.com/SAFEHR-data/datafaker
myyong Apr 10, 2026
a053742
Template omop orm.yaml, config.yaml and df.py files
myyong Apr 13, 2026
675dd92
Fix ordered list formatting
myyong Apr 13, 2026
54d13b5
Fix pre-commit hooks.
myyong Apr 13, 2026
1c42c49
Fix headers that broke pre-commit hook
myyong Apr 13, 2026
b311776
Organised UI commands
myyong Apr 13, 2026
b2f14ab
Add initial MS-SQL support: driver deps, async DSN rewriting, schema …
myyong May 19, 2026
76fec75
Extend type parser for MS-SQL column types (issue #96)
myyong May 19, 2026
da4a69b
Strip SERIAL/IDENTITY for MS-SQL target databases (issue #97)
myyong May 19, 2026
fabb0cc
Add commit link to challenge 5 in MS-SQL README
myyong May 19, 2026
dfb698f
Add issue #98 link for challenge 6 (UUID type mapping) to MS-SQL README
myyong May 19, 2026
39709d8
Add tests for challenge 7 dialect-agnostic error handling
myyong May 19, 2026
0925a93
Add commit links for challenge 7 to MS-SQL README
myyong May 19, 2026
7183d05
Document challenge 8 as deferred in README and issue #99
myyong May 19, 2026
d872fd9
Document macOS ODBC driver setup and add .env.example for MS-SQL
myyong May 19, 2026
3c3c20a
Merge branch 'ui-update' into mssql
myyong May 19, 2026
db2e48b
Fix Steps paths in MS-SQL README to run from example directory
myyong May 20, 2026
fc97b2c
Add TrustServerCertificate=yes to MS-SQL DSN examples
myyong May 20, 2026
81a65eb
Fix schema-qualified FK resolution and MS-SQL multiple cascade paths …
myyong May 20, 2026
84a209f
Let database generate integer primary keys on MS-SQL (closes #104)
myyong May 21, 2026
40fc121
Rename mimic_omop example to omop-mssql
myyong May 21, 2026
1857e76
Add omop-postgresql example and missing test fixture
myyong May 21, 2026
3637258
Fix dialect-specific SQL in generator commands (closes #105)
myyong May 21, 2026
ad98acf
Update omop-postgresql README to use relative paths
myyong May 21, 2026
7c0add6
Fix RANDOM()/LIMIT dialect incompatibility in ChoiceGeneratorFactory …
myyong May 21, 2026
2bcea2b
Fix schema-qualified table missing from live queries in ChoiceGenerat…
myyong May 21, 2026
41b96f6
Fix schema-missing FROM clause in Buckets queries
myyong May 21, 2026
29e7889
Fix remaining RANDOM()/LIMIT dialect incompatibilities (closes #107)
myyong May 22, 2026
91578da
Fix missing schema qualification in interactive shell SELECT statements
myyong May 22, 2026
4d03aa3
Fix unqualified table names in raw SQL during configure-generators
myyong May 22, 2026
173f6da
Fix unqualified table names in src-stats queries written by configure…
myyong May 22, 2026
1132dc6
Fix schema-qualified table name in ChoiceGenerator stored queries (cl…
myyong May 22, 2026
3aefaa2
Config files for omop-mssql and omop-postgresl
myyong May 27, 2026
086e2d2
Merge branch 'main' of https://github.com/SAFEHR-data/datafaker
myyong Jun 19, 2026
3b09619
Add example source stats file
myyong Jun 29, 2026
cb4df26
Merge branch 'main' of https://github.com/SAFEHR-data/datafaker
myyong Jun 29, 2026
6ed6f97
Merge branch 'main' into mssql
myyong Jun 29, 2026
530ed2f
Add MS-SQL end-to-end test infrastructure
myyong Jun 29, 2026
f67c61e
Fix stale datafaker.generators imports in test_generators_dialect.py
myyong Jun 29, 2026
39d4fc5
Fix AttributeError in ContinuousLogDistributionProposerFactory
myyong Jun 30, 2026
5ce81ff
Changed name of dodgy test person
myyong Jun 30, 2026
2f1cb91
Merge branch 'main' of https://github.com/SAFEHR-data/datafaker
myyong Jun 30, 2026
a6efbc1
Merge branch 'main' into mssql
myyong Jun 30, 2026
78d2504
Fix MSSQL end-to-end tests: connection handling and type parsing
myyong Jun 30, 2026
c67b6cc
Remove debug logging of INSERT statements from create.py
myyong Jul 1, 2026
75731c7
Remove test_duckdb_serial_hook_still_works: remove_serial does not exist
myyong Jul 1, 2026
7a98a1c
Quick fixes
Jul 1, 2026
476fb42
Fix intermittent TCP reset errors in MSSQL tests
myyong Jul 1, 2026
f7ff564
Dispose engines before closing database in RequiresDBTestCase.tearDown
myyong Jul 1, 2026
7150945
Removed non-existing file
myyong Jul 1, 2026
178a394
Fix test_float: use assertIsInstance after FLOAT changed to numeric_type
myyong Jul 1, 2026
64 changes: 64 additions & 0 deletions .github/workflows/tests-mssql.yml
@@ -0,0 +1,64 @@
+---
+name: mssql integration tests
+on:
+  pull_request:
+  workflow_dispatch:
+
+env:
+  PYTHON_VERSION: "3.10"
+  MSSQL_SA_PASSWORD: "Datafaker!Test123"
+
+jobs:
+  mssql-tests:
+    runs-on: ubuntu-latest
+
+    services:
+      mssql:
+        image: mcr.microsoft.com/mssql/server:2022-latest
+        env:
+          ACCEPT_EULA: "Y"
+          MSSQL_SA_PASSWORD: "Datafaker!Test123"
+        ports:
+          - 1433:1433
+        options: >-
+          --health-cmd "/opt/mssql-tools18/bin/sqlcmd -S localhost -U sa -P 'Datafaker!Test123' -Q 'SELECT 1' -No"
+          --health-interval 10s
+          --health-timeout 5s
+          --health-retries 12
+          --health-start-period 30s
+
+    steps:
+      - name: Checkout Code
+        uses: actions/checkout@v6
+
+      - name: Install ODBC Driver 18 for SQL Server
+        shell: bash
+        run: |
+          curl -fsSL https://packages.microsoft.com/keys/microsoft.asc \
+            | sudo gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg
+          curl -fsSL "https://packages.microsoft.com/config/ubuntu/$(lsb_release -rs)/prod.list" \
+            | sudo tee /etc/apt/sources.list.d/mssql-release.list
+          sudo apt-get update
+          ACCEPT_EULA=Y sudo apt-get install -y msodbcsql18
+
+      - name: Install poetry
+        shell: bash
+        run: |
+          sudo apt install python3-poetry
+
+      - name: Configure poetry
+        shell: bash
+        run: |
+          python -m poetry config virtualenvs.in-project true
+
+      - name: Install dependencies (with mssql extras)
+        shell: bash
+        run: |
+          python -m poetry install --extras mssql
+
+      - name: Run MS-SQL integration tests
+        shell: bash
+        env:
+          MSSQL_TEST_DSN: "mssql+pyodbc://sa:Datafaker!Test123@localhost:1433/master?driver=ODBC+Driver+18+for+SQL+Server&TrustServerCertificate=yes"
+        run: |
+          poetry run python -m unittest tests.test_functional_mssql -v
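The workflow hands the connection string to the test run through `MSSQL_TEST_DSN`. The module `tests.test_functional_mssql` is not part of the diff shown here, so the following is only a sketch of how such a module could skip itself when the variable is unset. The class, test and helper names are hypothetical.

```python
import os
import unittest

# Assumption: the DSN is read from the same variable the workflow sets.
MSSQL_TEST_DSN = os.environ.get("MSSQL_TEST_DSN", "")


@unittest.skipUnless(MSSQL_TEST_DSN, "MSSQL_TEST_DSN is not set")
class MssqlSmokeTest(unittest.TestCase):
    """Hypothetical test case; it only runs when a DSN is configured."""

    def test_dsn_targets_mssql(self) -> None:
        self.assertTrue(MSSQL_TEST_DSN.startswith("mssql+"))


def run_suite() -> unittest.TestResult:
    """Run the hypothetical test case and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MssqlSmokeTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With this shape, a local `python -m unittest` run on a machine without SQL Server reports the tests as skipped rather than failed.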
16 changes: 16 additions & 0 deletions datafaker/create.py
@@ -36,6 +36,22 @@
 serial_re = re.compile(r"\bSERIAL\b")


+
+@compiles(CreateTable, "mssql")
+def remove_mssql_on_delete_cascade(element: CreateTable, compiler: Any, **kw: Any) -> str:
+    """
+    Strip ON DELETE CASCADE from MS-SQL table DDL.
+
+    MS-SQL rejects multiple cascading FK paths to the same table (error 1785).
+    OMOP-style schemas commonly have many FK columns on one table all pointing at
+    the same vocabulary table, which triggers this limit. Dropping CASCADE is
+    safe for datafaker because referential integrity is enforced by insert order,
+    not by the database engine.
+    """
+    text: str = compiler.visit_create_table(element, **kw)
+    return text.replace(" ON DELETE CASCADE", "")
+
+
 @compiles(CreateTable, "duckdb")
 def remove_on_delete_cascade(element: CreateTable, compiler: Any, **kw: Any) -> str:
     """
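The new hook post-processes the compiled `CREATE TABLE` text with a plain `str.replace`. Below is a minimal sketch of that string step in isolation, applied to a hand-written DDL sample. The table and column names are made up, and in the real hook the text comes from `compiler.visit_create_table`.

```python
def strip_on_delete_cascade(ddl: str) -> str:
    """Remove every ' ON DELETE CASCADE' clause, as the hook in the diff does."""
    return ddl.replace(" ON DELETE CASCADE", "")


# Hypothetical DDL with two cascading FKs to the same table, the shape that
# MS-SQL rejects with error 1785 according to the docstring in the diff.
SAMPLE_DDL = (
    "CREATE TABLE drug_exposure (\n"
    "    drug_concept_id INTEGER,\n"
    "    route_concept_id INTEGER,\n"
    "    FOREIGN KEY(drug_concept_id) REFERENCES concept (concept_id) ON DELETE CASCADE,\n"
    "    FOREIGN KEY(route_concept_id) REFERENCES concept (concept_id) ON DELETE CASCADE\n"
    ")"
)

cleaned = strip_on_delete_cascade(SAMPLE_DDL)
```

The replacement is literal, so it matches only the exact spacing and casing shown. The sketch assumes the compiler always emits the clause in that form.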
10 changes: 7 additions & 3 deletions datafaker/db_utils.py
@@ -38,6 +38,7 @@
     get_ignored_table_names,
     get_vocabulary_table_names,
     logger,
+    make_async_dsn,
     make_foreign_key_name,
 )

@@ -139,10 +140,10 @@ def create_db_engine(
     **kwargs: Any,
 ) -> MaybeAsyncEngine:
     """Create a SQLAlchemy Engine."""
+    kwargs.setdefault("pool_pre_ping", True)
     try:
         if use_asyncio:
-            async_dsn = db_dsn.replace("postgresql://", "postgresql+asyncpg://")
-            engine: MaybeAsyncEngine = create_async_engine(async_dsn, **kwargs)
+            engine: MaybeAsyncEngine = create_async_engine(make_async_dsn(db_dsn), **kwargs)
         else:
             engine = create_engine(db_dsn, **kwargs)
     except NoSuchModuleError as exc:
@@ -155,7 +156,10 @@ def create_db_engine(

     settings = {}
     if schema_name is not None:
-        settings["search_path"] = schema_name
+        if get_sync_engine(engine).dialect.name == "mssql":
+            engine = engine.execution_options(schema_translate_map={None: schema_name})
+        else:
+            settings["search_path"] = schema_name
     if parquet_dir is not None:
         joined = ",".join(_find_parquet_directories(parquet_dir))
         # double up single quotes
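The body of `make_async_dsn` is not included in this diff. As a rough sketch of what such a helper might do, the snippet below generalises the removed `postgresql://` rewrite to a small prefix table. Only the PostgreSQL mapping is confirmed by the removed line. The `mssql+aioodbc` entry and the name `make_async_dsn_sketch` are assumptions for illustration.

```python
# Assumption: sync-to-async scheme mapping. Only the PostgreSQL entry is
# confirmed by the line this diff removes; the MS-SQL entry is a guess.
ASYNC_SCHEMES = {
    "postgresql://": "postgresql+asyncpg://",
    "mssql+pyodbc://": "mssql+aioodbc://",
}


def make_async_dsn_sketch(db_dsn: str) -> str:
    """Rewrite a synchronous DSN to an async-driver DSN, if a mapping is known."""
    for sync_prefix, async_prefix in ASYNC_SCHEMES.items():
        if db_dsn.startswith(sync_prefix):
            return async_prefix + db_dsn[len(sync_prefix):]
    # Unknown scheme: return unchanged and let SQLAlchemy report any problem.
    return db_dsn
```

Matching on the prefix rather than calling `str.replace` avoids rewriting a scheme-like substring that happens to appear later in the DSN, for example inside a password.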
42 changes: 21 additions & 21 deletions datafaker/interactive/base.py
@@ -10,7 +10,7 @@

 import sqlalchemy
 from prettytable import PrettyTable
-from sqlalchemy import Engine, ForeignKey, MetaData, Table
+from sqlalchemy import Engine, ForeignKey, MetaData, Table, func, literal_column, or_, select
 from sqlalchemy.exc import DatabaseError, SQLAlchemyError
 from typing_extensions import Self

@@ -352,17 +352,13 @@ def do_counts(self, _arg: str) -> None:
             return
         table_name = self.table_name()
         nullable_columns = self.get_nullable_columns(table_name)
-        colcounts = [f', COUNT("{nnc}") AS "{nnc}"' for nnc in nullable_columns]
+        tbl = self.table_metadata()
+        count_exprs = [func.count().label("row_count")] + [
+            func.count(tbl.c[col]).label(col) for col in nullable_columns
+        ]
+        stmt = select(*count_exprs).select_from(tbl)
         with self.sync_engine.connect() as connection:
-            result = (
-                connection.execute(
-                    sqlalchemy.text(
-                        f'SELECT COUNT(*) AS row_count{"".join(colcounts)} FROM "{table_name}"'
-                    )
-                )
-                .mappings()
-                .first()
-            )
+            result = connection.execute(stmt).mappings().first()
             if result is None:
                 self.print("Could not count rows in table {0}", table_name)
                 return
@@ -415,19 +411,23 @@ def do_peek(self, arg: str) -> None:
         col_names = arg.split()
         if not col_names:
             col_names = self._get_column_names()
-        nonnulls = [f'"{cn}" IS NOT NULL' for cn in col_names]
+        random_fn = (
+            func.newid() if self.sync_engine.dialect.name == "mssql" else func.random()
+        )
+        col_exprs = [literal_column(f'"{cn}"') for cn in col_names]
+        nonnull_clauses = [literal_column(f'"{cn}"').isnot(None) for cn in col_names]
+        stmt = (
+            select(*col_exprs)
+            .select_from(self.table_metadata())
+            .where(or_(*nonnull_clauses))
+            .order_by(random_fn)
+            .limit(max_peek_rows)
+        )
         with self.sync_engine.connect() as connection:
-            cols = ", ".join(f'"{cn}"' for cn in col_names)
-            where = "WHERE" if nonnulls else ""
-            nonnull = " OR ".join(nonnulls)
-            query = sqlalchemy.text(
-                f'SELECT {cols} FROM "{table_name}" {where} {nonnull}'
-                f" ORDER BY RANDOM() LIMIT {max_peek_rows}"
-            )
             try:
-                result = connection.execute(query)
+                result = connection.execute(stmt)
             except SQLAlchemyError as exc:
-                self.print(self.ERROR_FAILED_SQL, exc=exc, query=query)
+                self.print(self.ERROR_FAILED_SQL, exc=exc, query=stmt)
                 return
             self.print_table(list(result.keys()), result.fetchmany(max_peek_rows))

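The rewritten `do_counts` relies on standard SQL semantics: `COUNT(*)` counts rows, while `COUNT(column)` counts only non-NULL values. A small illustration follows, using the standard-library `sqlite3` module rather than SQLAlchemy so that it runs without extra dependencies. The table and data are invented.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, birth_year INTEGER)")
connection.executemany(
    "INSERT INTO person (birth_year) VALUES (?)",
    [(1980,), (None,), (1975,), (None,)],
)

# Same shape as the statement built in do_counts: one COUNT(*) plus one
# COUNT(column) per nullable column.
row_count, birth_year_count = connection.execute(
    'SELECT COUNT(*) AS row_count, COUNT("birth_year") AS "birth_year" FROM "person"'
).fetchone()
connection.close()

# The number of NULLs in the column is the difference between the two counts.
null_count = row_count - birth_year_count
```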
27 changes: 17 additions & 10 deletions datafaker/interactive/generators.py
@@ -6,7 +6,7 @@
 from typing import Any, Callable, Optional, cast

 import sqlalchemy
-from sqlalchemy import Column
+from sqlalchemy import Column, and_, func, literal_column, select

 from datafaker.db_utils import MaybeAsyncEngine, primary_private_fks, table_is_private
 from datafaker.interactive.base import DbCmd, TableEntry, fk_column_name, or_default
@@ -16,6 +16,7 @@
     get_columns_assigned,
     get_row_generators,
     logger,
+    schema_qualified_name,
     split_column_full_name,
 )

@@ -61,8 +62,9 @@ def get_aggregate_query(
     ]
     if not clauses:
         return None
+    qualified = schema_qualified_name(table_name, engine)
     alias = f' AS "{table_name}"' if engine.dialect.name == "duckdb" else ""
-    return f'SELECT {", ".join(clauses)} FROM "{table_name}"{alias}'
+    return f'SELECT {", ".join(clauses)} FROM "{qualified}"{alias}'


 # pylint: disable=too-many-public-methods
@@ -779,15 +781,20 @@ def _get_column_data(
         self, count: int, to_str: Callable[[Any], str] = repr
     ) -> list[list[str]]:
         columns = self._get_column_names()
-        columns_string = ", ".join(columns)
-        pred = " AND ".join(f"{column} IS NOT NULL" for column in columns)
+        random_fn = (
+            func.newid() if self.sync_engine.dialect.name == "mssql" else func.random()
+        )
+        col_exprs = [literal_column(col) for col in columns]
+        nonnull_clauses = [literal_column(col).isnot(None) for col in columns]
+        stmt = (
+            select(*col_exprs)
+            .select_from(self.table_metadata())
+            .where(and_(*nonnull_clauses))
+            .order_by(random_fn)
+            .limit(count)
+        )
         with self.sync_engine.connect() as connection:
-            result = connection.execute(
-                sqlalchemy.text(
-                    f'SELECT {columns_string} FROM "{self.table_name()}"'
-                    f" WHERE {pred} ORDER BY RANDOM() LIMIT {count}"
-                )
-            )
+            result = connection.execute(stmt)
         return [[to_str(x) for x in xs] for xs in result.all()]

     def do_propose(self, _arg: str) -> None:
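`get_aggregate_query` now wraps the result of `schema_qualified_name` in one pair of double quotes. The body of that helper is not shown in this diff, so its return format is unknown here. If it were to return a dotted name such as `dbo.person`, quoting it as a single identifier would not address schema `dbo`. A sketch of quoting each part separately is given below. The function name and its parameters are hypothetical and are not part of datafaker.

```python
from typing import Optional


def quote_qualified(table_name: str, schema_name: Optional[str] = None) -> str:
    """Quote schema and table as separate identifiers, e.g. "dbo"."person"."""

    def quote(identifier: str) -> str:
        # Double any embedded double quote, per standard SQL identifier quoting.
        return '"' + identifier.replace('"', '""') + '"'

    if schema_name:
        return f"{quote(schema_name)}.{quote(table_name)}"
    return quote(table_name)
```

If `schema_qualified_name` already returns a correctly delimited string, this concern does not apply.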
16 changes: 15 additions & 1 deletion datafaker/interactive/missingness.py
@@ -23,19 +23,32 @@ class MissingnessType:
     columns: list[str]

     @classmethod
-    def sampled_query(cls, table: str, count: int, column_names: Iterable[str]) -> str:
+    def sampled_query(
+        cls,
+        table: str,
+        count: int,
+        column_names: Iterable[str],
+        dialect_name: str = "",
+    ) -> str:
         """
         Construct a query to make a sampling of the named rows of the table.

         :param table: The name of the table to sample.
         :param count: The number of samples to get.
         :param column_names: The columns to fetch.
+        :param dialect_name: The SQLAlchemy dialect name (e.g. ``"mssql"``).
         :return: The SQL query to do the sampling.
         """
         result_names = ", ".join([f"{c}__is_null" for c in column_names])
         column_is_nulls = ", ".join(
             [f"{c} IS NULL AS {c}__is_null" for c in column_names]
         )
+        if dialect_name == "mssql":
+            return (
+                f"SELECT COUNT(*) AS row_count, {result_names} FROM "
+                f"(SELECT TOP {count} {column_is_nulls} FROM {table} ORDER BY NEWID())"
+                f" AS __t GROUP BY {result_names}"
+            )
         return cls.SAMPLED_QUERY.format(
             result_names=result_names,
             column_is_nulls=column_is_nulls,
@@ -330,6 +343,7 @@ def do_sampled(self, arg: str) -> None:
                 entry.name,
                 count,
                 self.get_nullable_columns(entry.name),
+                dialect_name=self.sync_engine.dialect.name,
             ),
             [
                 "The missingness patterns and how often they appear in a"
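For reference, here is a standalone sketch of the MS-SQL branch added to `sampled_query`, extracted as a free function (the name `mssql_sampled_query` is hypothetical). It differs from the diff in one respect: to my knowledge T-SQL does not accept a bare predicate such as `c IS NULL` in a select list, so the sketch uses a `CASE` expression instead. That substitution is an assumption about SQL Server behaviour and has not been checked against the test suite in this PR.

```python
from typing import Iterable


def mssql_sampled_query(table: str, count: int, column_names: Iterable[str]) -> str:
    """Build a TOP/NEWID() sampling query that groups rows by null pattern."""
    names = list(column_names)  # materialise: the names are iterated twice
    result_names = ", ".join(f"{c}__is_null" for c in names)
    # Assumption: CASE replaces the 'c IS NULL AS c__is_null' form in the diff.
    column_is_nulls = ", ".join(
        f"CASE WHEN {c} IS NULL THEN 1 ELSE 0 END AS {c}__is_null" for c in names
    )
    return (
        f"SELECT COUNT(*) AS row_count, {result_names} FROM "
        f"(SELECT TOP {count} {column_is_nulls} FROM {table} ORDER BY NEWID())"
        f" AS __t GROUP BY {result_names}"
    )


query = mssql_sampled_query("person", 1000, ["birth_year", "death_date"])
```

Copying `column_names` into a list also guards against a caller passing a generator, which would otherwise be exhausted after the first `join`.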
39 changes: 26 additions & 13 deletions datafaker/interactive/table.py
@@ -4,6 +4,7 @@
 from typing import Any, cast

 import sqlalchemy
+from sqlalchemy import func, literal_column, select, text

 from datafaker.interactive.base import (
     TYPE_LETTER,
@@ -477,16 +478,23 @@ def print_column_data(self, column: str, count: int, min_length: int) -> None:
         :param count: The number of rows to sample.
         :param min_length: The minimum length of text to choose from (0 for any text).
         """
-        where = f"WHERE {column} IS NOT NULL"
+        random_fn = (
+            func.newid() if self.sync_engine.dialect.name == "mssql" else func.random()
+        )
+        col_expr = literal_column(column)
         if 0 < min_length:
-            where = f"WHERE LENGTH({column}) >= {min_length}"
+            where_clause = func.length(col_expr) >= min_length
+        else:
+            where_clause = col_expr.isnot(None)
+        stmt = (
+            select(col_expr)
+            .select_from(self.table_metadata())
+            .where(where_clause)
+            .order_by(random_fn)
+            .limit(count)
+        )
         with self.sync_engine.connect() as connection:
-            result = connection.execute(
-                sqlalchemy.text(
-                    f'SELECT {column} FROM "{self.table_name()}"'
-                    f" {where} ORDER BY RANDOM() LIMIT {count}"
-                )
-            )
+            result = connection.execute(stmt)
             self.columnize([str(x[0]) for x in result.all()])

     def print_row_data(self, count: int) -> None:
@@ -495,12 +503,17 @@ def print_row_data(self, count: int) -> None:

         :param count: The number of rows to report.
         """
+        random_fn = (
+            func.newid() if self.sync_engine.dialect.name == "mssql" else func.random()
+        )
+        stmt = (
+            select(text("*"))
+            .select_from(self.table_metadata())
+            .order_by(random_fn)
+            .limit(count)
+        )
         with self.sync_engine.connect() as connection:
-            result = connection.execute(
-                sqlalchemy.text(
-                    f'SELECT * FROM "{self.table_name()}" ORDER BY RANDOM() LIMIT {count}'
-                )
-            )
+            result = connection.execute(stmt)
             if result is None:
                 self.print("No rows in this table!")
                 return
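The same `random_fn` conditional now appears in four places across the interactive modules (`do_peek`, `_get_column_data`, `print_column_data` and `print_row_data`). Below is a sketch of the two query shapes it is meant to produce, written as plain strings so that it runs without a database. The helper name is hypothetical, and the SQL that SQLAlchemy actually emits will differ in details such as bound parameters. The `TOP`/`NEWID()` form mirrors the hand-written MS-SQL query in `missingness.py`.

```python
def random_sample_sql(dialect_name: str, table: str, column: str, count: int) -> str:
    """Return a raw-SQL sketch of 'pick `count` random non-null values'."""
    if dialect_name == "mssql":
        # MS-SQL has no LIMIT or RANDOM(); TOP and NEWID() play those roles.
        return (
            f'SELECT TOP {int(count)} "{column}" FROM "{table}" '
            f'WHERE "{column}" IS NOT NULL ORDER BY NEWID()'
        )
    return (
        f'SELECT "{column}" FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL ORDER BY RANDOM() LIMIT {int(count)}'
    )
```

Centralising the dialect check in one helper of this kind would keep later dialect additions to a single edit, though whether that fits the existing class layout is a judgment call for the author.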