Nested Avro fields flattened as VARCHAR instead of native types when using useSchemaDiscovery #19332

@greenwich

Nested Avro fields flattened as VARCHAR instead of native types when using useSchemaDiscovery with kafka input format wrapping avro_stream

Affected Version

31.0.0

Description

When ingesting Kafka data using the kafka input format wrapping avro_stream with Schema Registry, nested Avro fields lose their type information during auto-flattening and become VARCHAR, even with useSchemaDiscovery: true.

Setup:

  • Single-node Druid 31.0.0 cluster (Helm chart, MiddleManager mode)
  • Kafka supervisor with kafka input format + avro_stream value format + Confluent Schema Registry
  • dimensionsSpec: { useSchemaDiscovery: true }
  • No flattenSpec defined

Avro schema (simplified):

  {"name": "items", "type": {"type": "array", "items": {"type": "record", "fields": [
    {"name": "price", "type": ["null", "double"]},
    {"name": "quantity", "type": ["null", "long"]}
  ]}}}

Expected behaviour:
With useSchemaDiscovery: true, auto-flattened fields like items_0_price_ should be detected as DOUBLE, and items_0_quantity_ as BIGINT.

Actual behaviour:

  • Top-level Avro fields are typed correctly (e.g. createdNanos → BIGINT)
  • Nested/array fields are auto-flattened with underscore notation (e.g. items_0_price_, items_0_quantity_, metadata_someField_) but typed as VARCHAR
  • Values are string representations: "217000.0" instead of 217000.0
  • This affects ~110 out of 118 columns in our table — only 4 are BIGINT (all top-level)

Verified with:

  SELECT COLUMN_NAME, DATA_TYPE
  FROM INFORMATION_SCHEMA.COLUMNS
  WHERE TABLE_NAME = 'my_datasource';

  SELECT "items_0_price_" FROM "my_datasource" LIMIT 1;
  -- Returns: "217000.0" (string, not numeric)
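
For anyone trying to reproduce the column counts, the check above can be summarised with a small script. This is only a sketch: it assumes the rows come back as a list of objects keyed by column name (which is how the Druid SQL API returns results in its default object format, as far as I know), and the sample rows below are illustrative, not a dump of the real table. The helper names are made up for this example.

```python
from collections import Counter


def tally_column_types(rows):
    """Count columns per DATA_TYPE from INFORMATION_SCHEMA.COLUMNS rows."""
    return Counter(row["DATA_TYPE"] for row in rows)


def flattened_varchar_columns(rows):
    """List VARCHAR columns whose names end with '_' (the auto-flattened pattern seen here)."""
    return [
        row["COLUMN_NAME"]
        for row in rows
        if row["DATA_TYPE"] == "VARCHAR" and row["COLUMN_NAME"].endswith("_")
    ]


# Illustrative sample only; column names are taken from this report.
rows = [
    {"COLUMN_NAME": "__time", "DATA_TYPE": "TIMESTAMP"},
    {"COLUMN_NAME": "createdNanos", "DATA_TYPE": "BIGINT"},
    {"COLUMN_NAME": "items_0_price_", "DATA_TYPE": "VARCHAR"},
    {"COLUMN_NAME": "items_0_quantity_", "DATA_TYPE": "VARCHAR"},
]

if __name__ == "__main__":
    print(dict(tally_column_types(rows)))
    print(flattened_varchar_columns(rows))
```

The trailing-underscore filter is a heuristic based on the column names observed in this datasource; it may not hold for other schemas.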

Impact:
All arithmetic on nested numeric fields requires explicit CAST:
CAST("items_0_quantity_" AS DOUBLE) * CAST("items_0_price_" AS DOUBLE) AS notional

Supervisor config (relevant parts):

  {
    "inputFormat": {
      "type": "kafka",
      "headerFormat": { "type": "string" },
      "keyFormat": { "type": "json" },
      "valueFormat": {
        "type": "avro_stream",
        "avroBytesDecoder": {
          "type": "schema_registry",
          "url": "http://schema-registry:8081/"
        }
      }
    },
    "dimensionsSpec": {
      "useSchemaDiscovery": true
    }
  }

Workaround:
Define an explicit flattenSpec with useFieldDiscovery: true for top-level fields and manually list every nested path. This is impractical for wide schemas with 100+ nested fields.

Questions:

  1. Is this expected — does auto-flattening serialise nested values to strings before schema discovery runs?
  2. Is there a configuration to preserve Avro types for auto-flattened nested fields without a manual flattenSpec?
  3. Would a PR to propagate Avro schema types through the flattening step be welcome?
