Nested Avro fields flattened as VARCHAR instead of native types when using useSchemaDiscovery with kafka input format wrapping avro_stream
Affected Version
31.0.0
Description
When ingesting Kafka data using the kafka input format wrapping avro_stream with Schema Registry, nested Avro fields lose their type information during auto-flattening and are ingested as VARCHAR, even with useSchemaDiscovery: true.
Setup:
- Single-node Druid 31.0.0 cluster (Helm chart, MiddleManager mode)
- Kafka supervisor with kafka input format + avro_stream value format + Confluent Schema Registry
- dimensionsSpec: { useSchemaDiscovery: true }
- No flattenSpec defined
Avro schema (simplified):
{"name": "items", "type": {
  "type": "array",
  "items": {
    "type": "record",
    "fields": [
      {"name": "price", "type": ["null", "double"]},
      {"name": "quantity", "type": ["null", "long"]}
    ]
  }
}}
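For reference, an illustrative message payload matching this schema (values are examples, reusing the price observed in the queries below):

```json
{
  "items": [
    { "price": 217000.0, "quantity": 3 }
  ]
}
```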
Expected behaviour:
With useSchemaDiscovery: true, auto-flattened fields like items_0_price_ should be detected as DOUBLE, and items_0_quantity_ as BIGINT.
Actual behaviour:
- Top-level Avro fields are typed correctly (e.g. createdNanos → BIGINT)
- Nested/array fields are auto-flattened with underscore notation (e.g. items_0_price_, items_0_quantity_, metadata_someField_) but typed as VARCHAR
- Values are string representations: "217000.0" instead of 217000.0
- This affects ~110 out of 118 columns in our table — only 4 are BIGINT (all top-level)
Verified with:
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'my_datasource';

SELECT "items_0_price_" FROM "my_datasource" LIMIT 1;
-- Returns: "217000.0" (string, not numeric)
Impact:
All arithmetic on nested numeric fields requires explicit CAST:
CAST("items_0_quantity_" AS DOUBLE) * CAST("items_0_price_" AS DOUBLE) AS notional
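The same casting burden leaks into client code. A minimal sketch, assuming rows come back with the string-typed nested fields shown above (the function name and row shape are illustrative, not part of any Druid API):

```python
# Sketch: nested numeric fields arrive as strings, so every consumer
# must cast before doing arithmetic -- the client-side equivalent of
# the CAST-heavy SQL above.
def notional(row: dict) -> float:
    # "items_0_price_" / "items_0_quantity_" are the auto-flattened
    # column names observed in the datasource.
    price = float(row["items_0_price_"])   # "217000.0" -> 217000.0
    qty = float(row["items_0_quantity_"])  # "3" -> 3.0
    return price * qty

row = {"items_0_price_": "217000.0", "items_0_quantity_": "3"}
print(notional(row))  # 651000.0
```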
Supervisor config (relevant parts):
{
  "inputFormat": {
    "type": "kafka",
    "headerFormat": { "type": "string" },
    "keyFormat": { "type": "json" },
    "valueFormat": {
      "type": "avro_stream",
      "avroBytesDecoder": {
        "type": "schema_registry",
        "url": "http://schema-registry:8081/"
      }
    }
  },
  "dimensionsSpec": {
    "useSchemaDiscovery": true
  }
}
Workaround:
Define explicit flattenSpec with useFieldDiscovery: true for top-level fields and manually list every nested path. Impractical for wide schemas with 100+ nested fields.
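For illustration, the explicit flattenSpec workaround looks roughly like this (paths derived from the schema above; every nested path would need its own entry, which is what makes it impractical at 100+ fields):

```json
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "path", "name": "items_0_price_", "expr": "$.items[0].price" },
    { "type": "path", "name": "items_0_quantity_", "expr": "$.items[0].quantity" }
  ]
}
```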
Questions:
- Is this expected — does auto-flattening serialise nested values to strings before schema discovery runs?
- Is there a configuration to preserve Avro types for auto-flattened nested fields without a manual flattenSpec?
- Would a PR to propagate Avro schema types through the flattening step be welcome?