
How to set token budget for Gemma4 with Ollama #651

@somthing3000

Description


The Gemma4 documentation from Google suggests adjusting the token budget to control the resolution at which images are processed, for more accurate OCR:

https://ai.google.dev/gemma/docs/capabilities/vision/image#variable_resolution_token_budget

However, I do not see any configuration option to set it when calling AsyncClient.chat(). Adding max_soft_tokens to options does not seem to do anything:

response = await AsyncClient(host=OLLAMA_HOST).chat(
    model=MODEL,
    messages=messages,
    format=self.result_model.model_json_schema(),
    options={
        "num_ctx": 32 * 1024,
        "temperature": 0.0,
        "max_soft_tokens": 560,  # has no effect
    },
)
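As far as I can tell, the client forwards the options dict as-is and the server simply drops keys it does not recognize. A hypothetical pre-flight helper (not part of the ollama client; the KNOWN_OPTIONS set is a partial list of the runtime parameters documented for Ollama Modelfiles) would flag the ignored key:

```python
# Hypothetical helper: report option keys that Ollama would not recognize.
# KNOWN_OPTIONS is a subset of Ollama's documented runtime parameters;
# "max_soft_tokens" is a transformers-side setting and is not among them.
KNOWN_OPTIONS = {
    "num_ctx", "num_predict", "temperature", "top_k", "top_p",
    "min_p", "repeat_penalty", "repeat_last_n", "seed", "stop",
}

def check_options(options: dict) -> list[str]:
    """Return the option keys the Ollama server would likely ignore."""
    return sorted(k for k in options if k not in KNOWN_OPTIONS)

unknown = check_options({
    "num_ctx": 32 * 1024,
    "temperature": 0.0,
    "max_soft_tokens": 560,
})
print(unknown)  # ['max_soft_tokens']
```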

By default, Gemma4 models run with a 280-token budget, which is not enough for my OCR task; I am testing with Gemma4-e4b.

To verify, I uploaded a vehicle license plate image to my Ollama endpoint; the result was missing the last letter. I then used the plain transformers library to load an unquantized gemma4-e4b, which produced the same result. However, the transformers AutoProcessor accepts a max_soft_tokens option; setting it to 560 produced the correct, expected result.

// ollama default (q4_k_m) and unquantized gemma4
{
    "license_plate_number": "YRSGNB",    // expected YRSGNBY
    "license_plate_state": "California"
}

// unquantized gemma4 with max_soft_tokens=560
{
    "license_plate_number": "YRSGNBY",    // expected YRSGNBY
    "license_plate_state": "California"
}
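For completeness, the snippet above omits how the image was attached to the request. With the ollama Python client an image can be passed via the "images" field of a message; the file name and prompt text here are hypothetical placeholders, not from my actual code:

```python
from pathlib import Path

# Hypothetical reconstruction of the request payload: the ollama Python
# client accepts image file paths (or raw bytes) in a message's "images" list.
image_path = Path("plate.jpg")  # placeholder filename

messages = [
    {
        "role": "user",
        "content": "Read the license plate number and state from this image.",
        "images": [image_path],
    }
]
```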
