fix: support CHAR/VARCHAR types in SQL DDL write path #456
Open
LuciferYang wants to merge 5 commits into lance-format:main from
Conversation
- Add BaseCharVarcharRoundtripTest with SQL DDL E2E tests through the V2 catalog (CHAR, VARCHAR, and mixed columns with NULL handling)
- Add CharVarcharRoundtripTest subclasses for the Spark 3.4 and 3.5 modules
- Add unit tests for CharType/VarcharType → Arrow Utf8/LargeUtf8 mapping in LanceArrowUtilsSuite
- Remove old char/varchar tests from BaseSparkDataTypeRoundtripTest that used the DataFrame API path (which normalizes these types to StringType)
- Fix an inaccurate comment about V2 write path behavior
The LanceArrowWriter.createFieldWriter bug (value-equality pattern `case (StringType, ...)` does not match CharType/VarcharType) affects all Spark versions, not just 3.4/3.5. Add E2E test subclasses for the 4.0 and 4.1 modules to match.
…ules

These modules already compile 3.5's test sources via build-helper-maven-plugin (add-java-test-source), so adding a second copy causes "already defined" errors.
What
`CREATE TABLE ... (v CHAR(n))` or `VARCHAR(n)` via a V2 catalog blows up on the write path. Two separate match misses, different blast radii.

Why
Two Scala pattern styles with different semantics:

- `toArrowType` uses `case _: StringType` (a type test). On Spark 4.0+, `CharType`/`VarcharType` extend `StringType`, so they match fine. On 3.4/3.5 they extend `AtomicType` instead: no match, and the call throws `unsupportedDataTypeError`.
- `createFieldWriter` uses `case (StringType, ...)` (a value comparison). `CharType(10).equals(StringType)` is `false` on every Spark version: `CharType` carries a `FixedLength` constraint while the `StringType` singleton has `NoConstraint`. So even when `toArrowType` succeeds, the writer still fails.
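The difference between the two pattern styles can be reproduced with a minimal sketch. These are simplified stand-in classes, not Spark's real `DataType` hierarchy; `CharType` here extends `StringType` the way it does on Spark 4.0+:

```scala
// Stand-in types (NOT Spark's real classes) modeling the 4.0+ hierarchy,
// where CharType extends StringType.
sealed abstract class DataType
class StringType extends DataType
object StringType extends StringType                  // singleton used in value patterns
case class CharType(length: Int) extends StringType

object PatternDemo {
  // Type-test pattern, like toArrowType: any StringType subclass matches.
  def typeTest(dt: DataType): String = dt match {
    case _: StringType => "utf8"
    case _             => "unsupported"
  }

  // Value-equality pattern, like createFieldWriter: only the singleton
  // instance itself matches, so CharType(10) falls through.
  def valueEquality(dt: DataType): String = dt match {
    case StringType => "utf8"
    case _          => "unsupported"
  }

  def main(args: Array[String]): Unit = {
    println(typeTest(CharType(10)))      // utf8: the type test accepts the subclass
    println(valueEquality(CharType(10))) // unsupported: the equality check misses it
  }
}
```

Even with the 4.0+ hierarchy, where the type test succeeds, the equality check still fails, which is why the `createFieldWriter` bug affects every Spark version.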
Summary: `toArrowType` matches by type (`_: StringType`); `createFieldWriter` matches by value (`StringType`).

How
- `LanceArrowUtils.toArrowType`: add `case _: CharType | _: VarcharType` branches mapping to Utf8 / LargeUtf8 (respecting `largeVarTypes`). On 4.0+ these branches are dead code.
- `LanceArrowWriter.createFieldWriter`: add `case (_: CharType | _: VarcharType, vector)` branches dispatching to StringWriter / LargeStringWriter.

Length constraints are discarded: Arrow has no varchar concept.
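The shape of the fix can be sketched with stand-in types (hypothetical names, not the real Spark/Arrow classes): an alternative pattern routes CHAR and VARCHAR to the same branch as plain strings, and the declared length is simply dropped.

```scala
// Stand-in types (NOT the real Spark/Arrow classes) modeling 3.4/3.5, where
// CharType/VarcharType do not extend StringType.
sealed abstract class DataType
case object StringType extends DataType
case class CharType(length: Int) extends DataType
case class VarcharType(length: Int) extends DataType

object FixSketch {
  // Mirrors the shape of the toArrowType fix: CHAR/VARCHAR fold into the same
  // Arrow string type as StringType; the length is discarded because Arrow
  // has no varchar concept.
  def toArrowTypeName(dt: DataType, largeVarTypes: Boolean): String = dt match {
    case StringType | _: CharType | _: VarcharType =>
      if (largeVarTypes) "LargeUtf8" else "Utf8"
    case other =>
      throw new UnsupportedOperationException(s"unsupported data type: $other")
  }
}
```

With `largeVarTypes` off, `CharType(10)`, `VarcharType(20)`, and `StringType` all map to Utf8; with it on, all three map to LargeUtf8.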
Tests
- `LanceArrowUtilsSuite`: 4 unit tests covering CharType/VarcharType → Arrow type mapping with `largeVarTypes` on and off
- `BaseCharVarcharRoundtripTest`: E2E via the V2 catalog (DDL create + insert + read back), covering CHAR, VARCHAR, mixed columns, and NULLs
- `CharVarcharRoundtripTest` subclasses in the Spark 3.4 and 3.5 modules; Spark 4.0 and 4.1 inherit them automatically via shared test-source configuration