fix(seq): return spliced cDNA for transcripts instead of genomic (#187) by Elarwei001 · Pull Request #227 · scverse/gget

Elarwei001 · 2026-06-24T14:03:00Z

Resolves #187

Summary

Ensembl's sequence/id endpoint returns the genomic span by default, which for a transcript ID includes introns rather than the spliced transcript sequence. As a result, gget seq returned the wrong sequence for any ENST/transcript query.

What it does

Requests type=cdna from the Ensembl sequence/id endpoint for transcript IDs so transcript queries return the spliced cDNA. Gene IDs are unaffected and still return the genomic sequence.
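
As a rough illustration of the request shape (not gget's actual code), the sketch below builds the URL and query parameters for a single-ID GET against the Ensembl REST sequence/id endpoint. The helper name `build_sequence_request` and the `is_transcript` flag are hypothetical; only the endpoint path and the `type` values `cdna` and `genomic` are taken from the description above.

```python
from typing import Dict, Tuple

ENSEMBL_REST = "https://rest.ensembl.org"


def build_sequence_request(ensembl_id: str, is_transcript: bool) -> Tuple[str, Dict[str, str]]:
    """Return (url, params) for a GET sequence/id request.

    Hypothetical helper: transcripts ask for the spliced cDNA,
    everything else keeps the endpoint's genomic default.
    """
    seq_type = "cdna" if is_transcript else "genomic"
    url = f"{ENSEMBL_REST}/sequence/id/{ensembl_id}"
    params = {"type": seq_type, "content-type": "application/json"}
    return url, params


url, params = build_sequence_request("ENST00000392653", is_transcript=True)
print(url, params)
```

The returned pair could be handed to an HTTP client such as `requests.get(url, params=params)`; the network call is left out here so the sketch stays offline.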

Changes

Non-isoform bulk request: classify IDs via lookup/id and split into a cDNA batch (transcripts) and a genomic batch (genes/other), falling back to the previous genomic-only behaviour if the lookup fails.
Isoform gene branch: fetch each transcript as cDNA.
Isoform non-gene branch: cDNA for transcripts, genomic otherwise.
The cDNA response carries no desc field, so a missing desc is coerced to an empty string when building the FASTA header.
Regenerated the test7/test8 fixtures to the spliced-transcript output.
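
The batching and header changes listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in gget/gget_seq.py: `split_by_object_type` and `fasta_header` are hypothetical names, the `lookup` argument is assumed to be a dict shaped like a bulk lookup/id response (ID mapped to a record with an `object_type` field, or to `None` when the ID is unknown), and the header layout is illustrative.

```python
from typing import Dict, List, Optional, Tuple


def split_by_object_type(
    ids: List[str], lookup: Dict[str, Optional[dict]]
) -> Tuple[List[str], List[str]]:
    """Split IDs into a cDNA batch (transcripts) and a genomic batch (genes/other)."""
    cdna_ids: List[str] = []
    genomic_ids: List[str] = []
    for ens_id in ids:
        record = lookup.get(ens_id) or {}  # unknown or null entries count as "other"
        if record.get("object_type") == "Transcript":
            cdna_ids.append(ens_id)
        else:
            genomic_ids.append(ens_id)
    return cdna_ids, genomic_ids


def fasta_header(entry: dict) -> str:
    """Build a FASTA header, treating a missing or null desc as an empty string."""
    desc = entry.get("desc") or ""
    return f">{entry['id']} {desc}".rstrip()


lookup = {
    "ENST00000392653": {"object_type": "Transcript"},
    "ENSG00000130234": {"object_type": "Gene"},
}
print(split_by_object_type(list(lookup), lookup))
print(fasta_header({"id": "ENST00000392653"}))
```

Routing anything that is not clearly a transcript into the genomic batch mirrors the stated fallback: when classification is unavailable, the request degrades to the previous genomic-only behaviour.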

Testing

Verified against the live Ensembl API: ENST00000392653 now returns cDNA (len 677, no chromosome desc) instead of the genomic span (len 1393). Mixed gene+transcript batches correctly split (transcript→cDNA, gene→genomic).
pytest tests/test_seq.py — 10 passed.

…erse#187)

Ensembl's sequence/id endpoint returns the genomic span by default, which for a transcript ID includes introns rather than the spliced transcript sequence. gget seq therefore returned the wrong sequence for any ENST/transcript query.

Request type=cdna for transcript IDs across all three code paths:

- non-isoform bulk request: classify IDs via lookup/id and split into a cDNA batch (transcripts) and a genomic batch (genes/other), falling back to the previous genomic-only behaviour if the lookup fails
- isoform gene branch: fetch each transcript as cDNA
- isoform non-gene branch: cDNA for transcripts, genomic otherwise

The cDNA response carries no "desc" field, so coerce a missing desc to an empty string when building the FASTA header.

Regenerate the test7/test8 fixtures to the spliced-transcript output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-06-24T14:25:33Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.63%. Comparing base (5cf607f) to head (38d2e7b).
⚠️ Report is 1 commit behind head on dev.

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #227      +/-   ##
==========================================
+ Coverage   56.14%   56.63%   +0.48%     
==========================================
  Files          29       29              
  Lines        9244     9260      +16     
==========================================
+ Hits         5190     5244      +54     
+ Misses       4054     4016      -38

Add network-free, mocked tests covering the scverse#187 fix (gget seq returns the spliced cDNA for transcript/ENST IDs instead of the genomic span):

- versioned ENST id (.N) is version-stripped and fetched as type=cdna
- non-coding/ncRNA transcript uses the same cDNA path
- mixed gene+transcript batch splits into a genomic and a cDNA request
- isoforms=True for both a gene (cDNA per transcript) and a transcript
- translate=True for a transcript queries UniProt with the transcript ID
- graceful handling when an entry has no desc, and when an ID is absent from the Ensembl response

All Ensembl/UniProt calls are mocked so the new tests are deterministic and offline; existing live tests are left unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
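
To give a feel for what a network-free test of the version-stripping case might look like, here is a small self-contained sketch using `unittest.mock`. It does not reproduce the tests in tests/test_seq.py: `fetch_transcript_cdna`, its injected `post` callable, and the response shape are all assumptions made for illustration.

```python
from unittest import mock


def fetch_transcript_cdna(ens_id, post):
    """Hypothetical fetcher: strip the .N version, then request cDNA via `post`."""
    base_id = ens_id.split(".")[0]
    response = post("sequence/id", {"ids": [base_id]}, params={"type": "cdna"})
    return response[0]["seq"]


def test_versioned_enst_is_stripped_and_fetched_as_cdna():
    # The mock stands in for the HTTP layer, so no Ensembl call is made.
    fake_post = mock.Mock(return_value=[{"id": "ENST00000392653", "seq": "ATGC"}])

    seq = fetch_transcript_cdna("ENST00000392653.3", fake_post)

    assert seq == "ATGC"
    fake_post.assert_called_once_with(
        "sequence/id", {"ids": ["ENST00000392653"]}, params={"type": "cdna"}
    )


test_versioned_enst_is_stripped_and_fetched_as_cdna()
```

Injecting the HTTP callable is only one way to keep such a test offline; patching the module-level request function with `mock.patch` would serve the same purpose.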


Add targeted, network-free tests for the previously-uncovered branches of the scverse#187 fix (patch coverage was 82.6%, 4 lines in gget/gget_seq.py):

- lookup/id classification failure -> falls back to a genomic request
- isoforms=True on a gene where one transcript's cDNA fetch fails -> the error is logged and the remaining transcripts are still returned
- isoforms=True on a transcript whose cDNA fetch fails -> error logged, no sequence returned

These exercise the three `except RuntimeError` handlers that were not hit before, bringing patch coverage of the fix to ~100%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
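
The "log the error and keep going" behaviour described for the isoform gene branch can be sketched as below. This is an assumed shape, not the handler code in gget/gget_seq.py: `fetch_isoforms`, the injected `fetch_cdna` callable, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("seq_sketch")  # hypothetical logger name


def fetch_isoforms(transcript_ids, fetch_cdna):
    """Fetch cDNA per transcript; a failed fetch is logged and skipped."""
    results = []
    for tid in transcript_ids:
        try:
            results.append((tid, fetch_cdna(tid)))
        except RuntimeError as err:
            # One bad transcript should not discard the others.
            logger.error("cDNA fetch failed for %s: %s", tid, err)
    return results


def flaky_fetch(tid):
    """Stand-in fetcher that fails for one transcript ID."""
    if tid == "ENST_BAD":
        raise RuntimeError("simulated Ensembl error")
    return "ATGC"


print(fetch_isoforms(["ENST_A", "ENST_BAD", "ENST_B"], flaky_fetch))
```

With a single failing transcript as input, the same loop returns an empty list, which corresponds to the "error logged, no sequence returned" case.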

Elarwei001 and others added 3 commits June 24, 2026 23:19

[pre-commit.ci] auto fixes from pre-commit.com hooks

f82382c

for more information, see https://pre-commit.ci

Elarwei001 marked this pull request as draft June 25, 2026 03:44

Elarwei001 wants to merge 4 commits into scverse:dev from Elarwei001:feature/seq-transcript-187