
Improve validation in structured generation columns #182

@andreatgretel


Is your feature request related to a problem? Please describe.

We would like to store a Pydantic model inside the LLMStructuredColumnConfig, since that would let us validate the output more thoroughly. However, column configs need to be serializable, so we end up converting the model to a JSON schema and using jsonschema to do the validation.

Unfortunately, validating against the JSON schema is weaker than validating with the Pydantic model. Recently, for instance, we had an issue where models would generate either "price": "12.34" (a string) or "price": 12.34 (a float) for a price field typed as Decimal. Both passed validation, and both later caused issues when writing to Parquet. See #171 for more details.
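The gap can be reproduced in a few lines. This is a minimal sketch, assuming Pydantic v2 and the jsonschema package; the `Product` model is a hypothetical stand-in for a structured column's output format, not the actual config code.

```python
from decimal import Decimal

import jsonschema
from pydantic import BaseModel


class Product(BaseModel):
    # Hypothetical model used only for illustration.
    name: str
    price: Decimal


# The JSON schema is what survives serialization of the column config.
schema = Product.model_json_schema()

as_string = {"name": "widget", "price": "12.34"}
as_float = {"name": "widget", "price": 12.34}

# Schema validation accepts both payloads and leaves the values untouched,
# so the column ends up with mixed str/float values.
for payload in (as_string, as_float):
    jsonschema.validate(payload, schema)

raw_types = {type(p["price"]).__name__ for p in (as_string, as_float)}

# Validating with the model itself coerces both payloads to Decimal.
validated_types = {
    type(Product.model_validate(p).price).__name__ for p in (as_string, as_float)
}

print(raw_types, validated_types)
```

The schema only checks the shape of the data, while `model_validate` also converts it to the declared type, which is the behavior that gets lost once the model is reduced to a JSON schema.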

Describe the solution you'd like

It would be nice to be able to serialize the Pydantic model while keeping features such as complex types and validators. One possibility would be to create our own serialization.
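As a rough illustration of what a custom serialization could look like, the sketch below round-trips a flat model through JSON using an allow-list of type names. It assumes Pydantic v2; `TYPE_REGISTRY`, `model_to_spec`, and `spec_to_model` are hypothetical names invented here. It is not the approach from the attached notebook, and it does not preserve validators or nested types.

```python
import json
from decimal import Decimal
from typing import Any

from pydantic import BaseModel, create_model

# Hypothetical allow-list of types that may appear in a serialized spec.
TYPE_REGISTRY: dict[str, type] = {
    "str": str,
    "int": int,
    "float": float,
    "bool": bool,
    "Decimal": Decimal,
}
NAME_BY_TYPE = {tp: name for name, tp in TYPE_REGISTRY.items()}


def model_to_spec(model: type[BaseModel]) -> dict[str, Any]:
    """Describe a flat model (required fields only) as JSON-friendly data."""
    fields = {
        name: NAME_BY_TYPE[info.annotation]
        for name, info in model.model_fields.items()
    }
    return {"name": model.__name__, "fields": fields}


def spec_to_model(spec: dict[str, Any]) -> type[BaseModel]:
    """Rebuild a Pydantic model from a spec produced by model_to_spec."""
    fields = {
        name: (TYPE_REGISTRY[type_name], ...)  # ... marks the field required
        for name, type_name in spec["fields"].items()
    }
    return create_model(spec["name"], **fields)


class Product(BaseModel):
    name: str
    price: Decimal


# Round-trip through a JSON string, as a serializable config would require.
payload = json.dumps(model_to_spec(Product))
Rebuilt = spec_to_model(json.loads(payload))

item = Rebuilt.model_validate({"name": "widget", "price": "12.34"})
print(type(item.price).__name__, item.price)
```

Because the rebuilt class is a real Pydantic model, the Decimal coercion is retained after deserialization. Supporting custom validators would need more than a type-name mapping.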

Describe alternatives you've considered

Alternative solutions include adding type-specific information to the prompt, and standardizing types later, before writing to Parquet.
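The second alternative could look roughly like the sketch below, which coerces mixed string/float values to Decimal before any write. It uses only the standard library; `to_decimal` is a hypothetical helper, and the Parquet write itself is omitted.

```python
from decimal import Decimal


def to_decimal(value):
    """Coerce a generated value to Decimal (hypothetical helper)."""
    if isinstance(value, Decimal):
        return value
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("bool is not a valid price")
    if isinstance(value, (int, float, str)):
        # Going through str() avoids binary float artifacts in the Decimal.
        return Decimal(str(value))
    raise TypeError(f"cannot convert {type(value).__name__} to Decimal")


records = [{"price": "12.34"}, {"price": 12.34}]

# Standardize the column so every row has the same type before writing.
normalized = [{**row, "price": to_decimal(row["price"])} for row in records]
print(normalized)
```

This keeps the column type consistent, but it repairs the data after generation instead of rejecting bad output at validation time.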

Additional context

See the attached notebook (by @nabinchha) for one possible way to serialize and deserialize Pydantic models.

test_serialization_pydantic_core_schema (1).ipynb
