Nested Avro fields flattened as VARCHAR instead of native types when using useSchemaDiscovery

Nested Avro fields flattened as VARCHAR instead of native types when using useSchemaDiscovery with kafka input format wrapping avro_stream

### Affected Version
31.0.0 

### Description

  When ingesting Kafka data through the kafka input format wrapping avro_stream with a Schema Registry decoder, nested Avro fields lose their type information during auto-flattening and are ingested as VARCHAR, even with useSchemaDiscovery: true.
                                                                                                                                                                                                
  **Setup:**                                                                                           
  - Single-node Druid 31.0.0 cluster (Helm chart, MiddleManager mode)
  - Kafka supervisor with kafka input format + avro_stream value format + Confluent Schema Registry
  - dimensionsSpec: { useSchemaDiscovery: true }                                              
  - No flattenSpec defined                      

  **Avro schema (simplified):**       
```json
  {"name": "items", "type": {"type": "array", "items": {"type": "record", "fields": [
    {"name": "price", "type": ["null", "double"]},
    {"name": "quantity", "type": ["null", "long"]}
  ]}}}
```
   
  **Expected behaviour:**                                                                                                                                                                            
  With useSchemaDiscovery: true, auto-flattened fields like items_0_price_ should be detected as DOUBLE, and items_0_quantity_ as BIGINT.
                                                                                                                                                                                                
  **Actual behaviour:**
  - Top-level Avro fields are typed correctly (e.g. createdNanos → BIGINT)                                                                                                                      
  - Nested/array fields are auto-flattened with underscore notation (e.g. items_0_price_, items_0_quantity_, metadata_someField_) but typed as VARCHAR                                          
  - Values are string representations: "217000.0" instead of 217000.0
  - This affects ~110 of the 118 columns in our table; only 4 are BIGINT, and all of those are top-level
                                                                                                   
  **Verified with:**      
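```sql
  -- List the column types that schema discovery produced
  SELECT COLUMN_NAME, DATA_TYPE
  FROM INFORMATION_SCHEMA.COLUMNS
  WHERE TABLE_NAME = 'my_datasource';

  -- Inspect one auto-flattened nested value
  SELECT "items_0_price_" FROM "my_datasource" LIMIT 1;
  -- Returns: "217000.0" (string, not numeric)
```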
  **Impact:**                                                                                                                                                                                       
  All arithmetic on nested numeric fields requires explicit CAST:                                  
  `CAST("items_0_quantity_" AS DOUBLE) * CAST("items_0_price_" AS DOUBLE) AS notional`
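
  As an interim mitigation on the query side, the casts can be generated rather than hand-written. The following is a minimal sketch, not part of Druid: `cast_projection` is a hypothetical helper, and the column-to-type mapping is assumed to be supplied by the user (for example, derived from the Avro schema).

```python
def cast_projection(column_types):
    """Build a SQL SELECT list that casts each VARCHAR column to its intended type.

    column_types maps a Druid column name to the SQL type it should have had.
    The alias drops the trailing underscore that auto-flattening appends.
    """
    parts = []
    for column, sql_type in column_types.items():
        alias = column.rstrip("_")
        parts.append(f'CAST("{column}" AS {sql_type}) AS "{alias}"')
    return ",\n  ".join(parts)


if __name__ == "__main__":
    select_list = cast_projection({
        "items_0_price_": "DOUBLE",
        "items_0_quantity_": "BIGINT",
    })
    print(f'SELECT\n  {select_list}\nFROM "my_datasource"')
```

  This only hides the problem at query time; the stored columns remain VARCHAR, so the cost of casting is paid on every query.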

  **Supervisor config (relevant parts):**
```json
  {
    "inputFormat": {
      "type": "kafka",
      "headerFormat": { "type": "string" },
      "keyFormat": { "type": "json" },
      "valueFormat": {
        "type": "avro_stream",
        "avroBytesDecoder": {
          "type": "schema_registry",
          "url": "http://schema-registry:8081/"
        }
      }
    },
    "dimensionsSpec": {
      "useSchemaDiscovery": true
    }
  }
```
  **Workaround:**
  Define an explicit flattenSpec with useFieldDiscovery: true for top-level fields and manually list every nested path. This is impractical for wide schemas with 100+ nested fields.
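
  For wide schemas, the manual listing could be generated from the Avro schema instead of being written by hand. The sketch below is one possible approach under several assumptions: the helper names (`walk`, `build_spec`) and the Avro-to-Druid type table are hypothetical, arrays are expanded only up to a fixed `max_array_items`, multi-branch unions, maps, and named type references are skipped, and the generated column names omit the trailing underscore that auto-flattening produces. It emits `path` entries for a flattenSpec plus matching typed dimension entries.

```python
import json

# Assumed mapping from Avro primitive types to Druid dimension types.
AVRO_TO_DRUID = {"int": "long", "long": "long", "float": "float",
                 "double": "double", "string": "string"}


def unwrap_nullable(avro_type):
    """Reduce a ["null", T] union to T; anything else is returned unchanged."""
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) == 1:
            return non_null[0]
    return avro_type


def walk(avro_type, json_path, name, max_array_items, out):
    """Append (column name, JSONPath, Druid type) for each primitive leaf."""
    avro_type = unwrap_nullable(avro_type)
    if isinstance(avro_type, str):
        if avro_type in AVRO_TO_DRUID:
            out.append((name, json_path, AVRO_TO_DRUID[avro_type]))
    elif isinstance(avro_type, dict):
        kind = avro_type.get("type")
        if kind == "record":
            for field in avro_type["fields"]:
                walk(field["type"], f"{json_path}.{field['name']}",
                     f"{name}_{field['name']}", max_array_items, out)
        elif kind == "array":
            # Array length is unknown at schema time, so expand a fixed prefix.
            for i in range(max_array_items):
                walk(avro_type["items"], f"{json_path}[{i}]",
                     f"{name}_{i}", max_array_items, out)
        else:
            # e.g. {"type": "long", "logicalType": "timestamp-millis"}
            walk(kind, json_path, name, max_array_items, out)


def build_spec(schema, max_array_items):
    """Return flattenSpec and typed dimensions for nested leaves only."""
    fields, dimensions = [], []
    for field in schema["fields"]:
        top_path = f"$.{field['name']}"
        leaves = []
        walk(field["type"], top_path, field["name"], max_array_items, leaves)
        for name, path, druid_type in leaves:
            if path == top_path:
                continue  # top-level primitive: left to field discovery
            fields.append({"type": "path", "name": name, "expr": path})
            dimensions.append({"type": druid_type, "name": name})
    return {"flattenSpec": {"useFieldDiscovery": True, "fields": fields},
            "dimensions": dimensions}


# Example schema built around the simplified "items" field from this report;
# the record names and the enclosing record are placeholders.
EXAMPLE_SCHEMA = {"type": "record", "name": "Order", "fields": [
    {"name": "createdNanos", "type": "long"},
    {"name": "items", "type": {"type": "array", "items": {
        "type": "record", "name": "Item", "fields": [
            {"name": "price", "type": ["null", "double"]},
            {"name": "quantity", "type": ["null", "long"]}]}}}]}

if __name__ == "__main__":
    print(json.dumps(build_spec(EXAMPLE_SCHEMA, max_array_items=2), indent=2))
```

  The generated `flattenSpec` would presumably go inside the avro_stream valueFormat and the `dimensions` list inside dimensionsSpec; whether explicit dimensions combine cleanly with useSchemaDiscovery: true in this setup has not been verified here.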
                                                                                                                                                                                                
  **Questions:**
  1. Is this expected? Does auto-flattening serialise nested values to strings before schema discovery runs?
  2. Is there a configuration to preserve Avro types for auto-flattened nested fields without a manual flattenSpec?                                                                             
  3. Would a PR to propagate Avro schema types through the flattening step be welcome?


Nested Avro fields flattened as VARCHAR instead of native types when using useSchemaDiscovery #19332

Affected Version

Description

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Nested Avro fields flattened as VARCHAR instead of native types when using useSchemaDiscovery #19332

Description

Affected Version

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions