Fix: Resolve multimodal BOA/EOA tokens dynamically from config.json#104
Conversation
Thanks for the PR! I've added a missing test file.
Pull request overview
Updates SwiftLM’s multimodal audio token handling to avoid hardcoded BOA/EOA token IDs by resolving them from a model’s config.json (with audio_config fallback), improving compatibility with newer multimodal models.
Changes:
- Replace hardcoded BOA/EOA token IDs in model factory wiring with values dynamically extracted from config.json.
- Expand config parsing from “num audio embeddings only” to “num audio + BOA/EOA tokens” via a new helper.
- Add unit tests covering defaults, top-level config extraction, and audio_config fallback extraction.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| Sources/SwiftLM/Server.swift | Uses dynamically extracted BOA/EOA + audio embedding counts when constructing ALM/Omni processors; introduces `extractMultimodalTokens`. |
| tests/SwiftLMTests/MultimodalTokenExtractionTests.swift | Adds coverage for token extraction defaults and config-driven overrides. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Description
Currently, the `boaToken` and `eoaToken` values are hardcoded to `255010` and `255011` respectively. This causes compatibility issues with newer multimodal models (like Qwen 2-VL) that use different vocab IDs for their vision encoders.

This PR fixes this by dynamically extracting `boa_token_id` and `eoa_token_id` from the model's `config.json` or its `audio_config` fallback, gracefully defaulting to the old values if not present.

Motivation
Improves compatibility with dynamic and diverse vision-language models without requiring hardcoded token updates.
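The fallback chain described above (top-level `config.json` keys, then `audio_config`, then the legacy hardcoded values) can be sketched as follows. `extractMultimodalTokens` is the helper named in Server.swift, but the signature, the `MultimodalTokens` struct, and the default constants shown here are illustrative assumptions, not the PR's exact code:

```swift
import Foundation

// Legacy defaults, matching the previously hardcoded token IDs.
let defaultBOAToken = 255010
let defaultEOAToken = 255011

struct MultimodalTokens {
    let boaToken: Int
    let eoaToken: Int
}

// Resolve BOA/EOA token IDs from a decoded config.json dictionary.
// Checks the top level first, then falls back to the nested
// "audio_config" object, and finally to the legacy defaults.
func extractMultimodalTokens(from config: [String: Any]) -> MultimodalTokens {
    func lookup(_ key: String) -> Int? {
        if let value = config[key] as? Int { return value }
        if let audioConfig = config["audio_config"] as? [String: Any],
           let value = audioConfig[key] as? Int {
            return value
        }
        return nil
    }
    return MultimodalTokens(
        boaToken: lookup("boa_token_id") ?? defaultBOAToken,
        eoaToken: lookup("eoa_token_id") ?? defaultEOAToken
    )
}
```

With an empty dictionary this resolves to the legacy `255010`/`255011` pair, which corresponds to the "defaults" case the unit tests cover.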
Testing
Verified with `swift build -c release`.