fix: normalise Kleborate column names in import_kleborate() to prevent silent data loss #108
Closed
efosternyarko wants to merge 2 commits into
Different Kleborate versions (and the same version with different --strain vs default output) produce inconsistently capitalised column names, e.g. 'Agly_acquired' vs 'AGly_acquired' and 'Bla_Chr' vs 'Bla_chr'. The downstream select(any_of(...)) call silently drops any column whose name does not exactly match kleborate_classes$Kleborate_Class, causing entire drug classes (e.g. aminoglycosides, beta-lactam chromosomal) to be absent from the returned genotype table with no warning. Add a case-insensitive rename step immediately after the sample-column rename so that any column whose name differs only in capitalisation is corrected before the any_of() selection. Emit an informative message() when a rename occurs so users are aware of the mismatch.
R CMD check treats non-ASCII characters in source files as a WARNING, which fails CI. Replace the UTF-8 right-arrow (U+2192) with ASCII '->'.
Closing: the capitalisation mismatch (Agly_acquired / Bla_Chr) was in a manually prepared Excel file used in our analysis, not from Kleborate or Pathogenwatch output. The kleborate_classes lookup is correct. Will fix the column names in our data file locally. Sorry for the noise!
Problem
Different versions of Kleborate (and the same version depending on whether `--strain` is used) produce inconsistently capitalised column names. For example:

| Older capitalisation | Expected capitalisation |
| --- | --- |
| `Agly_acquired` | `AGly_acquired` |
| `Bla_Chr` | `Bla_chr` |

Because `import_kleborate()` uses `select(any_of(kleborate_class_table$Kleborate_Class))`, columns whose names differ only in capitalisation are silently dropped, with no warning and no error. Entire drug classes disappear from the returned genotype table.

Discovered while running the function on real Klebsiella pneumoniae data from The Gambia: Aminoglycosides showed 0 markers and 0% sensitivity, tracing back to `Agly_acquired` being silently skipped.

Fix
Add a case-insensitive rename step immediately after the `sample_col` rename (line 400) and before the `select(any_of(...))` call. Any input column whose name matches an expected column name case-insensitively, but not exactly, is renamed to the expected capitalisation. An informative `message()` is emitted when this happens so users are aware of the discrepancy in their Kleborate output.

Testing
Verified against:

- `Agly_acquired`/`Bla_Chr` (older capitalisation): aminoglycosides now correctly returned with 15 markers across 87 isolates
- `AGly_acquired`/`Bla_chr` (expected capitalisation): no rename, no message, behaviour unchanged
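
For reference, the rename step described under Fix could look roughly like the base-R sketch below. This is an illustration only, not the code from this pull request: the helper name `normalise_colnames()`, its arguments, and the toy input values are all hypothetical, and the `expected` vector stands in for the `Kleborate_Class` lookup column.

```r
# Hypothetical helper (not part of the package): rename columns whose names
# match an expected name when case is ignored, but not exactly.
normalise_colnames <- function(df, expected) {
  current <- names(df)
  # Position of the case-insensitive match in `expected` (NA if none).
  idx <- match(tolower(current), tolower(expected))
  # TRUE only where a match exists and the exact spelling differs.
  needs_fix <- !is.na(idx) & current != expected[idx]
  if (any(needs_fix)) {
    message(
      "Renaming columns to expected capitalisation: ",
      paste0(current[needs_fix], " -> ", expected[idx[needs_fix]],
             collapse = ", ")
    )
    names(df)[needs_fix] <- expected[idx[needs_fix]]
  }
  df
}

# Toy input with made-up values, using the older capitalisation.
expected <- c("AGly_acquired", "Bla_chr")
input <- data.frame(
  strain = "sample1",
  Agly_acquired = "geneA",
  Bla_Chr = "geneB",
  stringsAsFactors = FALSE
)

fixed <- normalise_colnames(input, expected)
unchanged <- normalise_colnames(fixed, expected)  # exact names: no rename
```

A sketch like this does not guard against an input that contains both the exact name and a differently capitalised variant; in that case the rename would produce duplicate column names, so a real implementation would likely need to decide how to handle that.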