Commit 6ec0949

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468 (#495)
### Work Item / Issue Reference

> [AB#43177](https://sqlclientdrivers.visualstudio.com/c6d89619-62de-46a0-8b46-70b92a84d85e/_workitems/edit/43177)

> GitHub Issue: #468

-------------------------------------------------------------------

### Summary

This pull request updates the default handling of SQL `CHAR`/`VARCHAR` columns to use UTF-16 (wide character) encoding instead of UTF-8, primarily to address encoding mismatches on Windows and ensure consistent Unicode decoding. The changes span the connection, cursor, and C++ binding layers, and update related tests to reflect the new default behavior.

**Default Encoding and Decoding Changes:**

* The default decoding for SQL `CHAR` columns is now set to use `"utf-16le"` encoding and the `SQL_WCHAR` ctype, replacing the previous `"utf-8"`/`SQL_CHAR` defaults. This avoids issues where Windows ODBC drivers return raw bytes in the server's native code page, which may not decode as UTF-8. (`mssql_python/connection.py`)
* All cursor fetch methods (`fetchone`, `fetchmany`, `fetchall`) are updated to request UTF-16 decoding and pass the correct ctype when fetching `CHAR` data, ensuring consistent behavior across platforms. (`mssql_python/cursor.py`)

**C++ Binding and Processing Updates:**

* The `ColumnInfoExt` struct now tracks whether wide character (UTF-16) fetching is used for a column, and the `ProcessChar` function is updated to handle both wide and narrow character paths, decoding appropriately based on the new setting. (`mssql_python/pybind/ddbc_bindings.h`)

**Test Adjustments:**

* Tests are updated to expect `"utf-16le"` and `SQL_WCHAR` as the default decoding settings for `SQL_CHAR` columns, and to validate the new default behavior. (`tests/test_013_encoding_decoding.py`)

---------

Co-authored-by: gargsaumya <saumyagarg.100@gmail.com>
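
The mismatch described in the summary can be reproduced without a database. The sketch below is illustrative only: it simulates the bytes a driver might return on each path, assumes CP-1252 as the server code page, and uses a made-up sample string. It does not call `mssql_python`.

```python
# Illustrative only: simulate the two byte streams; no ODBC driver involved.
text = "café €5"  # sample value, chosen to contain non-ASCII characters

# Narrow path (SQL_CHAR ctype): bytes in the server code page, assumed CP-1252.
narrow = text.encode("cp1252")

# Decoding those bytes as UTF-8 fails, because 0xE9 ("é" in CP-1252) starts a
# multi-byte UTF-8 sequence that the following byte does not continue.
try:
    narrow.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Wide path (SQL_WCHAR ctype): UTF-16LE bytes decode the same way on every
# platform, independent of the server code page.
wide = text.encode("utf-16le")
roundtrip = wide.decode("utf-16le")

print(utf8_ok, roundtrip == text)
```

Requesting wide data moves the code page conversion into the driver, so the Python side only ever decodes UTF-16LE.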
1 parent 6209889 commit 6ec0949

5 files changed

Lines changed: 1227 additions & 350 deletions

mssql_python/connection.py

Lines changed: 24 additions & 10 deletions
@@ -261,10 +261,14 @@ def __init__(
         }
 
         # Initialize decoding settings with Python 3 defaults
+        # SQL_CHAR default uses SQL_WCHAR ctype so the ODBC driver returns
+        # UTF-16 data for VARCHAR columns. This avoids encoding mismatches on
+        # Windows where the driver returns raw bytes in the server's native
+        # code page (e.g. CP-1252) that may fail to decode as UTF-8.
         self._decoding_settings = {
             ConstantsDDBC.SQL_CHAR.value: {
-                "encoding": "utf-8",
-                "ctype": ConstantsDDBC.SQL_CHAR.value,
+                "encoding": "utf-16le",
+                "ctype": ConstantsDDBC.SQL_WCHAR.value,
             },
             ConstantsDDBC.SQL_WCHAR.value: {
                 "encoding": "utf-16le",
@@ -643,9 +647,13 @@ def setdecoding(
             sqltype (int): The SQL type being configured: SQL_CHAR, SQL_WCHAR, or SQL_WMETADATA.
                 SQL_WMETADATA is a special flag for configuring column name decoding.
             encoding (str, optional): The Python encoding to use when decoding the data.
-                If None, uses default encoding based on sqltype.
+                If None, defaults to ``'utf-16le'`` for all sqltypes (SQL_CHAR,
+                SQL_WCHAR, and SQL_WMETADATA), matching the connection-level
+                defaults set in ``Connection.__init__``. Passing ``encoding=None``
+                therefore resets the sqltype to its initial default.
             ctype (int, optional): The C data type to request from SQLGetData:
-                SQL_CHAR or SQL_WCHAR. If None, uses default based on encoding.
+                SQL_CHAR or SQL_WCHAR. If None, uses default based on encoding
+                (SQL_WCHAR for UTF-16 variants, SQL_CHAR otherwise).
 
         Returns:
             None
@@ -655,7 +663,10 @@ def setdecoding(
             InterfaceError: If the connection is closed.
 
         Example:
-            # Configure SQL_CHAR to use UTF-8 decoding
+            # Reset SQL_CHAR to the connection default (utf-16le + SQL_WCHAR ctype)
+            cnxn.setdecoding(mssql_python.SQL_CHAR)
+
+            # Configure SQL_CHAR to use UTF-8 decoding (opt-in, non-default)
             cnxn.setdecoding(mssql_python.SQL_CHAR, encoding='utf-8')
 
             # Configure column metadata decoding
@@ -691,12 +702,15 @@ def setdecoding(
                 ),
             )
 
-        # Set default encoding based on sqltype if not provided
+        # Set default encoding based on sqltype if not provided.
+        # All sqltypes default to UTF-16LE to match Connection.__init__ defaults.
+        # SQL_CHAR uses utf-16le + SQL_WCHAR ctype so the ODBC driver returns
+        # UTF-16 data for VARCHAR columns, avoiding encoding mismatches on
+        # Windows where the driver may otherwise return raw bytes in the
+        # server's native code page (e.g. CP-1252). This makes
+        # ``setdecoding(SQL_CHAR)`` with no arguments a true reset-to-defaults.
         if encoding is None:
-            if sqltype == ConstantsDDBC.SQL_CHAR.value:
-                encoding = "utf-8"  # Default for SQL_CHAR in Python 3
-            else:  # SQL_WCHAR or SQL_WMETADATA
-                encoding = "utf-16le"  # Default for SQL_WCHAR in Python 3
+            encoding = "utf-16le"
 
         # Validate encoding using cached validation for better performance
         if not _validate_encoding(encoding):
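
The default-resolution rules documented in the `setdecoding` docstring above can be modelled as a small standalone function. This is a sketch, not the code in `connection.py`: `resolve_decoding` and `UTF16_VARIANTS` are hypothetical names, the exact set of encodings the driver treats as "UTF-16 variants" is an assumption, and the constants are the standard ODBC values rather than the `ConstantsDDBC` enum.

```python
# Standard ODBC type codes (stand-ins for ConstantsDDBC.SQL_CHAR / SQL_WCHAR).
SQL_CHAR = 1
SQL_WCHAR = -8

# Assumption: the encodings counted as "UTF-16 variants" for ctype selection.
UTF16_VARIANTS = {"utf-16", "utf-16le", "utf-16be"}


def resolve_decoding(encoding=None, ctype=None):
    """Hypothetical model of how missing setdecoding arguments are filled in."""
    # After this change, a missing encoding is utf-16le for every sqltype.
    if encoding is None:
        encoding = "utf-16le"
    # A missing ctype follows the encoding: wide for UTF-16, narrow otherwise.
    if ctype is None:
        ctype = SQL_WCHAR if encoding.lower() in UTF16_VARIANTS else SQL_CHAR
    return {"encoding": encoding, "ctype": ctype}


print(resolve_decoding())         # no arguments: reset to defaults
print(resolve_decoding("utf-8"))  # opt-in narrow decoding
```

With no arguments the model yields `utf-16le` with the wide ctype, which is the reset-to-defaults behaviour the new comments describe. Passing `encoding="utf-8"` alone selects the narrow ctype.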

mssql_python/cursor.py

Lines changed: 6 additions & 3 deletions
@@ -2462,8 +2462,9 @@ def fetchone(self) -> Union[None, Row]:
             ret = ddbc_bindings.DDBCSQLFetchOne(
                 self.hstmt,
                 row_data,
-                char_decoding.get("encoding", "utf-8"),
+                char_decoding.get("encoding", "utf-16le"),
                 wchar_decoding.get("encoding", "utf-16le"),
+                char_decoding.get("ctype", ddbc_sql_const.SQL_WCHAR.value),
             )
 
             if self.hstmt:
@@ -2528,8 +2529,9 @@ def fetchmany(self, size: Optional[int] = None) -> List[Row]:
                 self.hstmt,
                 rows_data,
                 size,
-                char_decoding.get("encoding", "utf-8"),
+                char_decoding.get("encoding", "utf-16le"),
                 wchar_decoding.get("encoding", "utf-16le"),
+                char_decoding.get("ctype", ddbc_sql_const.SQL_WCHAR.value),
             )
 
             if self.hstmt:
@@ -2586,8 +2588,9 @@ def fetchall(self) -> List[Row]:
             ret = ddbc_bindings.DDBCSQLFetchAll(
                 self.hstmt,
                 rows_data,
-                char_decoding.get("encoding", "utf-8"),
+                char_decoding.get("encoding", "utf-16le"),
                 wchar_decoding.get("encoding", "utf-16le"),
+                char_decoding.get("ctype", ddbc_sql_const.SQL_WCHAR.value),
             )
 
             # Check for errors
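
All three fetch methods now pass the same trio of decoding arguments, each with a `dict.get` fallback. The sketch below isolates that pattern. `fetch_decoding_args` is a hypothetical helper that does not exist in `cursor.py`, and `SQL_WCHAR = -8` is the standard ODBC value standing in for `ddbc_sql_const.SQL_WCHAR.value`.

```python
SQL_CHAR = 1    # standard ODBC value
SQL_WCHAR = -8  # standard ODBC value


def fetch_decoding_args(char_decoding, wchar_decoding):
    """Hypothetical helper: build the decoding arguments the fetch calls pass."""
    return (
        char_decoding.get("encoding", "utf-16le"),
        wchar_decoding.get("encoding", "utf-16le"),
        char_decoding.get("ctype", SQL_WCHAR),
    )


# Empty settings fall back to the new wide defaults.
print(fetch_decoding_args({}, {}))

# A fully specified opt-in passes through unchanged.
print(fetch_decoding_args({"encoding": "utf-8", "ctype": SQL_CHAR}, {}))

# Each key falls back independently, so a dict holding only "encoding"
# would pair that encoding with the wide ctype.
print(fetch_decoding_args({"encoding": "utf-8"}, {}))
```

The last call shows that the fallbacks are applied per key, not per pair. Whether a partially populated settings dict can actually reach these calls is not visible in this diff.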
