Generate a 500-concept SNOMED-based dataset with humanized (LM-adapted) terms #2472

## User story
As a researcher, I want a dataset of 500 randomly sampled, active SNOMED CT concepts, each with a more natural, human-friendly phrasing, so that we can test mapping and matching workflows on less technical input terms.

## Use case
Sample 500 active SNOMED CT concepts (proportionally by semantic tag), then use an LM to rewrite each term into more natural phrasing while preserving its meaning.
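
A minimal sketch of how the rewriting step could be wired up, assuming the sampled rows are already in memory as dicts. The `rewrite` callable stands in for whichever LM client is chosen (none is specified here), and the `humanized_term` column name and the fall-back to the original term on an empty reply are assumptions, not decisions from this issue.

```python
from typing import Callable, Dict, List


def humanize_rows(
    rows: List[Dict[str, str]],
    rewrite: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return copies of `rows` with an added `humanized_term` column.

    `rewrite` is a hypothetical hook: any function that takes a clinical
    term and returns a plainer phrasing (for example a thin wrapper around
    an LM call).
    """
    humanized = []
    for row in rows:
        candidate = rewrite(row["preferred_term"]).strip()
        # Assumption: keep the original term if the rewriter returns nothing,
        # so every concept_id still has a usable value.
        humanized.append({**row, "humanized_term": candidate or row["preferred_term"]})
    return humanized
```

Keeping the LM behind a plain callable means the same loop can be exercised with a stub in tests and with a real model when generating the dataset.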

## Requirements
- [ ] Generate a reproducible random sample of 500 active SNOMED CT concepts from the RF2 release ZIP:
  - [ ] Use a fixed seed (42)
  - [ ] Sample proportionally by semantic tag (default tags: disorder and finding)
  - [ ] Output CSV containing concept_id, fsn, preferred_term, semantic_tag
- [ ] Produce a second dataset version where each term is adapted by an LM to be more humanized/natural and less technical.
- [ ] Keep both versions available for comparison (original vs humanized) using the same 500 concept_ids.
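
For discussion, here is one way the seeded, proportional sampling could look. This is a sketch under assumptions, not the sampling logic referenced in the acceptance criteria: it expects rows that were already loaded from the RF2 release and filtered to active concepts, reads the semantic tag from the trailing parentheses of the FSN, and allocates per-tag quotas with a largest-remainder rule. The function names and the allocation rule are illustrative.

```python
import random
import re
from collections import defaultdict
from typing import Dict, List, Optional, Sequence

# The semantic tag is the last parenthesised group of an FSN,
# e.g. "Myocardial infarction (disorder)" -> "disorder".
_TAG_RE = re.compile(r"\(([^()]+)\)\s*$")


def semantic_tag(fsn: str) -> Optional[str]:
    match = _TAG_RE.search(fsn)
    return match.group(1) if match else None


def proportional_sample(
    rows: Sequence[Dict[str, str]],
    n: int = 500,
    tags: Sequence[str] = ("disorder", "finding"),
    seed: int = 42,
) -> List[Dict[str, str]]:
    """Draw `n` rows, split across `tags` in proportion to their pool sizes."""
    strata = defaultdict(list)
    for row in rows:
        tag = semantic_tag(row["fsn"])
        if tag in tags:
            strata[tag].append({**row, "semantic_tag": tag})

    total = sum(len(pool) for pool in strata.values())
    if n > total:
        raise ValueError(f"requested {n} rows but only {total} are eligible")

    # Largest-remainder allocation (integer arithmetic) so quotas sum to n.
    quota = {tag: n * len(pool) // total for tag, pool in strata.items()}
    remainder = {tag: n * len(pool) % total for tag, pool in strata.items()}
    leftover = n - sum(quota.values())
    for tag in sorted(strata, key=lambda t: (-remainder[t], t))[:leftover]:
        quota[tag] += 1

    rng = random.Random(seed)
    sample = []
    for tag in sorted(strata):
        # Sort each pool first so the draw does not depend on input order.
        pool = sorted(strata[tag], key=lambda r: r["concept_id"])
        sample.extend(rng.sample(pool, quota[tag]))
    return sample
```

Sorting the tags and each pool before drawing is what makes the seed meaningful: the same release and seed then give the same 500 concept_ids regardless of how the rows were read.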

## Acceptance criteria
- [ ] A CSV exists with 500 sampled active SNOMED CT concepts (concept_id, fsn, preferred_term, semantic_tag) generated reproducibly from the RF2 ZIP using the provided sampling logic.
- [ ] A second CSV exists for the same 500 concept_ids containing LM-humanized terms aligned to the original entries.
- [ ] The humanized output is less technical while preserving the original clinical meaning.
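
The first two criteria can be checked mechanically. A rough sketch, assuming both CSVs share a `concept_id` column and the second one carries a `humanized_term` column (that column name is an assumption); the function takes CSV text rather than paths so it is easy to try out:

```python
import csv
import io
from typing import List


def alignment_problems(original_csv: str, humanized_csv: str, expected: int = 500) -> List[str]:
    """Return a list of human-readable problems; an empty list means aligned."""
    original = list(csv.DictReader(io.StringIO(original_csv)))
    humanized = list(csv.DictReader(io.StringIO(humanized_csv)))
    problems = []

    if len(original) != expected:
        problems.append(f"original has {len(original)} rows, expected {expected}")

    original_ids = [row["concept_id"] for row in original]
    humanized_ids = [row["concept_id"] for row in humanized]
    if len(set(original_ids)) != len(original_ids):
        problems.append("original contains duplicate concept_ids")
    if original_ids != humanized_ids:
        problems.append("concept_ids differ between files (values or order)")

    # `humanized_term` is an assumed column name for the LM output.
    empty = [row["concept_id"] for row in humanized
             if not (row.get("humanized_term") or "").strip()]
    if empty:
        problems.append(f"{len(empty)} rows have no humanized term")
    return problems
```

The third criterion (less technical, same clinical meaning) is not covered by a check like this and would still need review.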
