
SchemaTransformProcessor json serde can be flaky #227

@nabinchha


Priority Level

Medium (Annoying but has workaround)

Describe the bug

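SchemaTransformProcessor intermittently hits a JSON serialization/deserialization (serde) error, depending on the LLM output in the upstream columns referenced by the processor template. The underlying exception is a JSONDecodeError, which surfaces wrapped in a DatasetProcessingError raised from ColumnWiseDatasetBuilder._run_processors: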

ColumnWiseDatasetBuilder._run_processors(self, stage, dataframe, current_batch_number)
    298     except Exception as e:
--> 299         raise DatasetProcessingError(
    300             f"🛑 Failed to process dataset with processor {processor.name} in stage {stage}: {e}"
    301         ) from e
    302 return dataframe

DatasetProcessingError: 🛑 Failed to process dataset with processor SchemaTransformProcessor in stage BuildStage.POST_BATCH: Expecting ',' delimiter: line 1 column 313 (char 312)
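
A possible mechanism, stated as an assumption rather than a confirmed reading of the processor's code: if the template is serialized to a JSON string, the placeholders are filled with raw column text, and the result is parsed back with json.loads, then any LLM output containing a double quote breaks the JSON. The sketch below illustrates that failure mode. render_naive is a hypothetical helper, and plain string replacement stands in for the real template engine.

```python
import json

# Hypothetical, trimmed-down version of the template from this issue.
TEMPLATE = {"messages": [{"role": "assistant", "content": "{{greetings}}"}]}


def render_naive(template: dict, record: dict) -> dict:
    """Serialize the template, substitute raw values, then parse it back.

    This is an assumed reconstruction of the failure mode, not the
    library's actual implementation.
    """
    text = json.dumps(template)
    for name, value in record.items():
        # Raw substitution: quotes inside `value` are not JSON-escaped.
        text = text.replace("{{" + name + "}}", str(value))
    return json.loads(text)


# Output without special characters round-trips fine.
ok = render_naive(TEMPLATE, {"greetings": "Bonjour"})

# Output with embedded double quotes produces invalid JSON.
try:
    render_naive(TEMPLATE, {"greetings": 'Casual: "Salut!" Formal: "Bonjour."'})
    error = None
except json.JSONDecodeError as exc:
    error = exc
```

Under this assumption the failure depends entirely on whether the model happens to emit a quote, which would explain why it only shows up on some batches.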

Steps/Code to reproduce bug

Here's an example SDG config where I was hitting this intermittently:

config_builder.add_column(
    SamplerColumnConfig(
        name="language",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["English", "French", "Spanish", "German"],
        ),
        drop=True
    )
)
config_builder.add_column(
    LLMTextColumnConfig(
        name="greetings",
        model_alias="nvidia-text",
        prompt="""
        Write a casual and formal response greeting in '{{language}}' language.
        """,
    )
)
config_builder.add_column(
    LLMTextColumnConfig(
        name="greetings_response",
        model_alias="nvidia-text",
        prompt="""
        Write a follow up natural response to the greeting in '{{greetings}}' said in '{{language}}' language.
        """,
    )
)

# preview_results = data_designer.preview(config_builder=config_builder)

config_builder.add_processor(
    SchemaTransformProcessorConfig(
        name="chat_format",
        template={
            "messages": [
                {
                    "role": "user",
                    "content": "Say hello in {{language}}"
                },
                {
                    "role": "assistant",
                    "content": "{{greetings}}"
                },
                {
                    "role": "user",
                    "content": "{{greetings_response}}"
                }
            ]
        }
    )
)

Expected behavior

The processor shouldn't be flaky: it should produce valid output regardless of what text the upstream LLM columns contain.
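
One way to get that behavior, sketched under assumptions and not taken from the library: render each string leaf of the template separately, so the rendered text never has to survive a JSON round trip. render_leaves is a hypothetical helper, and plain string replacement again stands in for the real template engine.

```python
import json
from typing import Any


def render_leaves(node: Any, record: dict) -> Any:
    """Recursively fill placeholders in every string leaf of a template.

    Hypothetical sketch: the structure (dicts and lists) is copied as-is,
    and only string values are rendered, so no JSON parsing is involved.
    """
    if isinstance(node, dict):
        return {key: render_leaves(value, record) for key, value in node.items()}
    if isinstance(node, list):
        return [render_leaves(item, record) for item in node]
    if isinstance(node, str):
        for name, value in record.items():
            node = node.replace("{{" + name + "}}", str(value))
        return node
    # Numbers, booleans, and None pass through unchanged.
    return node


template = {"messages": [{"role": "assistant", "content": "{{greetings}}"}]}
tricky = 'Casual: "Salut!"\nFormal: "Bonjour."'
rendered = render_leaves(template, {"greetings": tricky})

# Serialization happens once, at the end, and json.dumps handles escaping.
serialized = json.dumps(rendered)
```

The design choice here is to treat the template as data rather than as text: quotes, backslashes, and newlines in the LLM output stay ordinary characters inside a Python string until the final json.dumps call escapes them.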

Additional context

No response

Labels

bug (Something isn't working)
