Generate a 500-concept SNOMED-based dataset with humanized (LM-adapted) terms #2472

@filiperochalopes

User story

As a researcher, I want a dataset of 500 randomly sampled active SNOMED CT concepts, each rephrased in more natural, human-friendly language, so that we can test mapping and matching workflows with less technical input terms.

Use case

Sample 500 SNOMED concepts (proportionally by semantic tag), then use an LM to rewrite the terms into a more natural phrasing while preserving meaning.

Requirements

  • Generate a reproducible random sample of 500 active SNOMED CT concepts from the RF2 release ZIP:
    • Use a fixed seed (42)
    • Proportional sampling by semantic tag (default: disorder + finding)
    • Output CSV containing concept_id, fsn, preferred_term, semantic_tag
  • Produce a second dataset version where each term is adapted by an LM to be more humanized/natural and less technical.
  • Keep both versions available for comparison (original vs humanized) using the same 500 concept_ids.
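The sampling requirement could be sketched roughly as follows. This is a minimal illustration, not the sampling logic the issue refers to: it assumes the active concepts have already been read from the RF2 ZIP into a list of dicts with `concept_id` and `fsn` keys, that the semantic tag is the final parenthesized suffix of the FSN, and that quotas are rounded with a largest-remainder rule. The names `semantic_tag` and `proportional_sample` are hypothetical.

```python
import random
import re
from collections import defaultdict

# Assumption: the semantic tag is the last parenthesized group of the FSN,
# e.g. "Asthma (disorder)" -> "disorder".
TAG_RE = re.compile(r"\(([^()]+)\)\s*$")


def semantic_tag(fsn):
    """Return the trailing semantic tag of an FSN, or None if absent."""
    match = TAG_RE.search(fsn)
    return match.group(1) if match else None


def proportional_sample(concepts, n=500, seed=42, tags=("disorder", "finding")):
    """Draw n concepts, allocating the sample across tags by pool size."""
    by_tag = defaultdict(list)
    for concept in concepts:
        tag = semantic_tag(concept["fsn"])
        if tag in tags:
            by_tag[tag].append({**concept, "semantic_tag": tag})

    total = sum(len(pool) for pool in by_tag.values())
    if total < n:
        raise ValueError(f"only {total} eligible concepts, need {n}")

    # Largest-remainder allocation: floor each quota, then hand the
    # leftover slots to the tags with the largest fractional parts.
    quotas = {tag: n * len(pool) / total for tag, pool in by_tag.items()}
    alloc = {tag: int(quota) for tag, quota in quotas.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda t: (-(quotas[t] - alloc[t]), t))
    for tag in by_remainder[:leftover]:
        alloc[tag] += 1

    # Sort pools and tags so the result depends only on the seed,
    # not on the order in which concepts were read.
    rng = random.Random(seed)
    sample = []
    for tag in sorted(by_tag):
        pool = sorted(by_tag[tag], key=lambda c: c["concept_id"])
        sample.extend(rng.sample(pool, alloc[tag]))
    return sample
```

Sorting each pool by `concept_id` before sampling is a design choice that keeps the output stable even if the release file is read in a different row order.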

Acceptance criteria

  • A CSV exists with 500 sampled active SNOMED CT concepts (concept_id, fsn, preferred_term, semantic_tag) generated reproducibly from the RF2 ZIP using the provided sampling logic.
  • A second CSV exists for the same 500 concept_ids containing LM-humanized terms aligned to the original entries.
  • The humanized output is less technical while preserving the original clinical meaning.
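The first two criteria can be checked mechanically; the third (meaning preserved) still needs human or clinical review. Below is a rough sketch of such a check, assuming both CSVs share a `concept_id` column and that the humanized file stores its rewritten text in a column named `humanized_term`. That column name and the helper names are assumptions, not something the issue specifies.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def alignment_problems(original, humanized, expected=500):
    """Return a list of human-readable problems; empty means aligned."""
    problems = []
    orig_ids = [row["concept_id"] for row in original]
    hum_ids = [row["concept_id"] for row in humanized]

    if len(orig_ids) != expected:
        problems.append(f"original has {len(orig_ids)} rows, expected {expected}")
    if len(set(orig_ids)) != len(orig_ids):
        problems.append("original contains duplicate concept_ids")
    if len(set(hum_ids)) != len(hum_ids):
        problems.append("humanized contains duplicate concept_ids")

    missing = sorted(set(orig_ids) - set(hum_ids))
    extra = sorted(set(hum_ids) - set(orig_ids))
    if missing:
        problems.append(f"missing from humanized: {missing}")
    if extra:
        problems.append(f"not in original: {extra}")

    # "humanized_term" is a hypothetical column name.
    empty = [
        row["concept_id"]
        for row in humanized
        if not (row.get("humanized_term") or "").strip()
    ]
    if empty:
        problems.append(f"empty humanized_term for: {empty}")
    return problems
```

In practice the two files would be read from disk with `csv.DictReader` and passed to `alignment_problems`; an empty result indicates the same 500 `concept_id`s appear in both versions.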

Metadata

Labels

  • signal/low-risk: Safe to execute with minimal downside
  • signal/small-scope: Limited to a small part of the codebase
  • signal/well-specified: Clear requirements and acceptance criteria
  • stage/triaged: AI triage complete, scored and classified
  • type/feature: New or improved functionality

Project status: UAT
