LinkML conversion tooling #387
Draft
yarikoptic wants to merge 47 commits into master from
Conversation
Specify Hatch-managed env for auto converting `dandischema.models` to LinkML schema and back to Pydantic models
Provide a script to translate `dandischema.models` into a LinkML schema and overlay it with definitions provided by an overlay file.
Provide a script to translate `dandischema/models.yaml` back to Pydantic models and store them in `dandischema/models.py`.
These prefixes are copied from https://github.com/dandi/schema/blob/master/releases/0.7.0/context.json
The previous BRE pattern used `\+` (a GNU sed extension), which silently fails on macOS BSD sed. Switch to `-E` (extended regex) with the POSIX character class `[^[:space:]]` instead of `\S` (also unsupported by BSD sed), making the normalization work on both macOS and Linux. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Expand comment for linkml-auto-converted hatch env with usage instructions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There is no prefix defined as `dandi_default`; the intended default prefix is `dandi`.
…ed and some symbols from _orig. For now we do it so it does not overwrite models.py, since then git is unable to track renames.
…into linkml-auto-converted
We had to maintain the original filename for models.py to apply patches easily.
yarikoptic commented Mar 20, 2026
```shell
# Poor man patch queue implementation
# Edit this list if you want to merge or drop PRs branches to be patched with.
# Order matters
branches_to_merge=( remove-discriminated-unions )
```
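The shell array above drives the patching. The same in-order merging can be sketched in Python (the git invocation is illustrative, and the runner is injectable so the logic can be exercised without a repository):

```python
import subprocess

# Mirrors the branches_to_merge shell array; order matters.
BRANCHES_TO_MERGE = ["remove-discriminated-unions"]


def merge_patch_branches(branches, run=subprocess.run):
    """Merge each listed PR branch into the current branch, in order."""
    for branch in branches:
        run(["git", "merge", "--no-edit", branch], check=True)
```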
That is where we define branches from PRs to merge!
Codecov Report ❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
- Coverage   97.92%   97.83%   -0.09%
==========================================
  Files          18       19       +1
  Lines        2405     2407       +2
==========================================
  Hits         2355     2355
- Misses         50       52       +2
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Provide a partial schema to be merged with the generated schema
59b0587 to c0fbd02 (Compare)
…nator. `dandischema.models` uses `schemaKey` in each Pydantic model as a de facto type designator in LinkML. However, direct translation to LinkML based on an individual model's definition is not possible; this override provided in the merge file completes the translation.
yarikoptic commented Mar 31, 2026
Introduce `tools/sanitize-yaml` as an extensible stdin→stdout sanitization aggregator, and `tools/remove_impossible_slot_usage_notes.py` to strip pydantic2linkml-generated notes about impossible `schemaKey` slot usage entries. The `2linkml` script now pipes through `tools/sanitize-yaml` so future sanitization steps can be added there without modifying `pyproject.toml`.

Both `kislyuk/yq` and `mikefarah/yq` were tried, but neither could preserve the original YAML sequence indentation. The Python script uses `ruamel.yaml` instead, which supports round-trip YAML editing that preserves the original formatting. It is invoked via `hatch run linkml-auto-converted:python` to ensure `ruamel.yaml` from the hatch env is used.

Note: ruamel.yaml currently introduces some unwanted line-wrapping changes. Setting `yaml.width` was attempted numerous times but did not improve the situation, so it is left unset for now.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
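A minimal sketch of such a stdin→stdout aggregator, where each sanitization step is a text→text function appended to a list (the step shown is a placeholder, not one of the actual sanitization steps):

```python
import sys


def strip_trailing_whitespace(text: str) -> str:
    """Placeholder step: drop trailing spaces on each line."""
    return "\n".join(line.rstrip() for line in text.splitlines()) + "\n"


# New sanitization steps get appended here, applied in order.
STEPS = [strip_trailing_whitespace]


def sanitize(text: str) -> str:
    for step in STEPS:
        text = step(text)
    return text


if __name__ == "__main__":
    sys.stdout.write(sanitize(sys.stdin.read()))
```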
This ensures each data record provides a value for `schemaKey`. This is a solution to remedy the fact that `schemaKey` is a type designator and doesn't have a default value, unlike the setup in the Pydantic models.
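The override described above might look roughly like the following LinkML `slot_usage` fragment (the class name and placement are assumptions for illustration; the actual entries live in the merge file):

```yaml
classes:
  Dandiset:
    slot_usage:
      schemaKey:
        required: true            # every record must carry schemaKey
        equals_string: Dandiset   # type designator pinned to the class name
```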
- Generalize the hardcoded `dciteCOLON` pattern to `[a-z]+COLON` to strip any lowercase-prefixed COLON artifact, not just `dciteCOLON`
- Use the `-E` flag with a plain `+` instead of the GNU-specific `\+` so the pattern works on both macOS (BSD sed) and Linux (GNU sed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
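The generalized pattern behaves like this Python equivalent (what replaces a match depends on the actual sed script; plain removal is assumed here):

```python
import re

# Matches any lowercase-prefixed COLON artifact, e.g. "dciteCOLON",
# not just the previously hardcoded "dciteCOLON".
ARTIFACT = re.compile(r"[a-z]+COLON")


def strip_colon_artifacts(text: str) -> str:
    """Remove all lowercase-prefixed COLON artifacts from the text."""
    return ARTIFACT.sub("", text)
```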
Adds a diagnostic tool that identifies Pydantic models where the schemaKey field default differs from the containing class name. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
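The diagnostic's core idea can be sketched without depending on pydantic, by scanning a namespace of classes for a `schemaKey` default that differs from the class name (the real tool would read the field default from the Pydantic model):

```python
import inspect


def mismatched_schema_keys(namespace: dict) -> list:
    """Return names of classes whose schemaKey default != class name."""
    bad = []
    for name, obj in namespace.items():
        if inspect.isclass(obj):
            default = getattr(obj, "schemaKey", None)
            if default is not None and default != name:
                bad.append(name)
    return bad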
This note is specified in the `models_merge.yaml` file to be added to the resulting LinkML translation of `dandischema.models`.
…tools Move `remove_impossible_slot_usage_notes.py` and `sanitize-yaml` into `tools/linkml_conversion_tools/` alongside the other linkml conversion helpers, and update the `2linkml` script in `pyproject.toml` accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tterns Rename remove_impossible_slot_usage_notes.py to remove_notes_by_pattern.py and switch from a single substring match to a list of regex patterns matched via re.search, so additional note families can be stripped in the same pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
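The matching rule — a list of regexes applied via `re.search`, so additional note families can be stripped in the same pass — can be sketched as follows (the pattern shown is an assumption, not the script's actual list):

```python
import re

# Assumed example pattern; real entries target pydantic2linkml-generated notes.
NOTE_PATTERNS = [re.compile(r"impossible .* slot usage")]


def keep_note(note: str) -> bool:
    """Keep a note only if no pattern matches it via re.search."""
    return not any(p.search(note) for p in NOTE_PATTERNS)


def filter_notes(notes: list) -> list:
    return [n for n in notes if keep_note(n)]
```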
…rings. Translating the max-length constraint on strings by specifying the constraint in the string's pattern is the best expression available in LinkML; there is no direct expression in LinkML for a max-length constraint on a string range.
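The max-length-via-pattern idea can be checked with Python's `re` (the 150-character limit is an illustrative assumption, not a value from the schema):

```python
import re

# A length limit encoded as a regex pattern, as LinkML would carry it.
# re.DOTALL lets "." also count newline characters toward the length.
MAX_LEN_PATTERN = re.compile(r"^.{0,150}$", re.DOTALL)


def fits(value: str) -> bool:
    """True iff the value satisfies the pattern-encoded max length."""
    return MAX_LEN_PATTERN.fullmatch(value) is not None
```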
… definitions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ensures `enums.LicenseType.permissible_values` entries are sorted by key for stable, readable diffs in the auto-converted LinkML YAML output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
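A minimal sketch of the reordering (the real change operates on the ruamel.yaml document for `enums.LicenseType.permissible_values`; a plain dict stands in here):

```python
def sort_mapping(mapping: dict) -> dict:
    """Rebuild a mapping with its entries sorted by key for stable diffs."""
    return {key: mapping[key] for key in sorted(mapping)}
```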
ATM needed by the genschemata helper; potentially we could just patch there.
Lock to the versions currently resolved via pydantic2linkml to prevent unintended changes when new linkml versions are released. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
These three Typer scripts under `.claude/skills/dandi-linkml-validation-report/scripts/` make up a reproducible pipeline for validating dandiset metadata against `dandischema/models.yaml` using the `linkml-validate` CLI:

- fetch_metadata.py: download raw metadata for the draft and every published version of every dandiset on a DANDI instance, plus a small info.json per version capturing schemaVersion / status / modified.
- validate_metadata.py: shell out to linkml-validate per version with the right target class (Dandiset for drafts, PublishedDandiset otherwise), writing validation.txt, validation.json and SUMMARY.md alongside each metadata.json.
- generate_report.py: aggregate the per-version JSON records into a top-level REPORT.md grouped by class × schemaVersion, with most-common problem patterns and links to per-version summaries.

The dandi client is added as a dependency of the linkml-auto-converted hatch env so the fetch script can run inside it.
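The target-class selection used by validate_metadata.py (Dandiset for drafts, PublishedDandiset otherwise) can be sketched as:

```python
def target_class(version_id: str) -> str:
    """Pick the LinkML target class for a dandiset version's metadata."""
    return "Dandiset" if version_id == "draft" else "PublishedDandiset"
```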
Previously a partial run could leave a corrupted metadata.json or info.json behind, and a subsequent run with the default refresh=False would happily skip the version because both files looked present. Now the function does all network work first, pre-renders both JSON payloads, and then writes them via temp files plus os.replace() so the destination paths only appear once both have been written successfully -- even under SIGKILL or other abrupt termination.

Drive-by simplifications addressed inline review notes:

- drop the parent.mkdir() side effect from the JSON writer; the caller now creates version_dir explicitly,
- drop json.dumps(default=str) since we already isoformat() datetimes before stashing them into info,
- use VersionStatus.value directly and skip the modified-is-None branch (per the dandi client's Version model both fields are non-optional),
- use PEP 604 `int | None` for the --limit option type.
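The render-first, temp-file-plus-os.replace() scheme can be sketched as follows (function name and payload are illustrative, not the actual writer):

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Render first, then write to a temp file in the same directory and
    os.replace() into place, so the destination never exists half-written."""
    rendered = json.dumps(payload, indent=2)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(rendered)
        os.replace(tmp, path)  # atomic on POSIX: path appears fully written
    except BaseException:
        os.unlink(tmp)
        raise
```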
Renames _normalise_problem -> _normalize_problem (and updates the caller), and replaces 'organised' with 'organized' in a comment.
This is an extract with amends from
which (branch `linkml-auto-converted`) would keep merging this branch into itself while reflecting changes in the branch, which could be rebased or gain merges from master, and can also accumulate or drop "patch branches" from within its script defining what to patch with. This way `linkml-auto-converted` would represent a reflection of the current state of the conversion.

TODO/PLAN:

- hatch ... TODO
- `models.py` into `dandischema/models.yaml`, overlaid with an [dandischema/models_overlay.yaml] overlay file.
- `tools/linkml_conversion` to convert into `linkml-auto-converted`
- `model_instances.yaml` (or alike) which would define pre-populated records such as standards (bids, nwb, ...); aim for potentially multiple classes there.