fix(seq): return spliced cDNA for transcripts instead of genomic (#187) #227

Draft
Elarwei001 wants to merge 4 commits into scverse:dev from Elarwei001:feature/seq-transcript-187

@Elarwei001

Contributor

Resolves #187

Summary

Ensembl's sequence/id endpoint returns the genomic span by default. For a transcript ID that span includes introns, so it is not the spliced transcript sequence. As a result, gget seq returned the wrong sequence for any ENST/transcript query.

What it does

Requests type=cdna from the Ensembl sequence/id endpoint for transcript IDs so transcript queries return the spliced cDNA. Gene IDs are unaffected and still return the genomic sequence.
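As a rough illustration of the request change, the sketch below builds the sequence/id URL with and without type=cdna. The helper name `sequence_url` and the `is_transcript` flag are hypothetical and are not taken from gget's code. The base URL is the public Ensembl REST server, and the gene ID in the usage note is only an example value.

```python
from urllib.parse import urlencode

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST base URL


def sequence_url(ensembl_id: str, is_transcript: bool) -> str:
    """Build a sequence/id URL (hypothetical helper, not gget's API)."""
    url = f"{ENSEMBL_REST}/sequence/id/{ensembl_id}"
    if is_transcript:
        # Without type=cdna the endpoint returns the genomic span,
        # introns included, even for a transcript ID.
        url += "?" + urlencode({"type": "cdna"})
    return url
```

Calling `sequence_url("ENST00000392653", True)` yields a URL ending in `?type=cdna`, while `sequence_url("ENSG00000130234", False)` leaves the query string off, so gene requests keep the default genomic behaviour.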

Changes

  • Non-isoform bulk request: classify IDs via lookup/id and split into a cDNA batch (transcripts) and a genomic batch (genes/other), falling back to the previous genomic-only behaviour if the lookup fails.
  • Isoform gene branch: fetch each transcript as cDNA.
  • Isoform non-gene branch: cDNA for transcripts, genomic otherwise.
  • The cDNA response carries no desc field, so a missing desc is coerced to an empty string when building the FASTA header.
  • Regenerated the test7/test8 fixtures to the spliced-transcript output.
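The classify-and-split step in the first bullet can be sketched as follows. This is a minimal sketch rather than the PR's implementation: `split_ids` is a hypothetical name, and it assumes the lookup/id results have already been collected into a dict keyed by ID, with each entry carrying Ensembl's `object_type` field. A failed lookup is represented here as `None`.

```python
from typing import Dict, List, Optional, Tuple


def split_ids(
    ids: List[str], lookup: Optional[Dict[str, dict]]
) -> Tuple[List[str], List[str]]:
    """Return (cdna_ids, genomic_ids) for a batch of Ensembl IDs."""
    if lookup is None:
        # Classification failed: fall back to the previous
        # genomic-only behaviour for the whole batch.
        return [], list(ids)

    cdna_ids: List[str] = []
    genomic_ids: List[str] = []
    for ens_id in ids:
        entry = lookup.get(ens_id) or {}
        if entry.get("object_type") == "Transcript":
            cdna_ids.append(ens_id)  # transcripts -> spliced cDNA
        else:
            genomic_ids.append(ens_id)  # genes and anything else -> genomic
    return cdna_ids, genomic_ids
```

IDs missing from the lookup result are treated as "other" and land in the genomic batch, which matches the fallback direction described in the bullet.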

Testing

  • Verified against the live Ensembl API: ENST00000392653 now returns cDNA (len 677, no chromosome desc) instead of the genomic span (len 1393). Mixed gene+transcript batches are split correctly (transcript→cDNA, gene→genomic).
  • pytest tests/test_seq.py — 10 passed.
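Because the cDNA response has no desc, header construction needs a default. The sketch below shows one way to coerce a missing desc to an empty string. The function name `fasta_header` and the `>id desc` layout are assumptions made for illustration; the exact header format gget emits is not shown in this PR.

```python
def fasta_header(entry: dict) -> str:
    """Build a FASTA header line from an Ensembl sequence entry (assumed layout)."""
    # cDNA responses omit "desc"; treat a missing or None value as "".
    desc = entry.get("desc") or ""
    # rstrip() avoids a trailing space when there is no description.
    return f">{entry['id']} {desc}".rstrip()
```

With a genomic entry the description is appended after the ID, and with a cDNA entry the header is just the ID.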

…erse#187)

Ensembl's sequence/id endpoint returns the genomic span by default,
which for a transcript ID includes introns rather than the spliced
transcript sequence. gget seq therefore returned the wrong sequence for
any ENST/transcript query.

Request type=cdna for transcript IDs across all three code paths:
- non-isoform bulk request: classify IDs via lookup/id and split into a
  cDNA batch (transcripts) and a genomic batch (genes/other), falling
  back to the previous genomic-only behaviour if the lookup fails
- isoform gene branch: fetch each transcript as cDNA
- isoform non-gene branch: cDNA for transcripts, genomic otherwise

The cDNA response carries no "desc" field, so coerce a missing desc to
an empty string when building the FASTA header. Regenerate the test7/
test8 fixtures to the spliced-transcript output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@codecov-commenter

codecov-commenter commented Jun 24, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.63%. Comparing base (5cf607f) to head (38d2e7b).
⚠️ Report is 1 commit behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #227      +/-   ##
==========================================
+ Coverage   56.14%   56.63%   +0.48%     
==========================================
  Files          29       29              
  Lines        9244     9260      +16     
==========================================
+ Hits         5190     5244      +54     
+ Misses       4054     4016      -38     

Elarwei001 and others added 3 commits June 24, 2026 23:19
Add network-free, mocked tests covering the scverse#187 fix (gget seq returns the
spliced cDNA for transcript/ENST IDs instead of the genomic span):

- versioned ENST id (.N) is version-stripped and fetched as type=cdna
- non-coding/ncRNA transcript uses the same cDNA path
- mixed gene+transcript batch splits into a genomic and a cDNA request
- isoforms=True for both a gene (cDNA per transcript) and a transcript
- translate=True for a transcript queries UniProt with the transcript ID
- graceful handling when an entry has no desc, and when an ID is absent
  from the Ensembl response

All Ensembl/UniProt calls are mocked so the new tests are deterministic
and offline; existing live tests are left unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
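
A network-free test of the versioned-ID case might look roughly like the sketch below. It does not patch gget internals, since those are not shown here. Instead it uses two hypothetical helpers, `strip_version` and `fetch_transcript_cdna`, with the HTTP getter injected so that `unittest.mock.Mock` can stand in for Ensembl. The `.3` version suffix is an illustrative value.

```python
from unittest.mock import Mock


def strip_version(ens_id: str) -> str:
    """Drop a trailing .N version suffix from an Ensembl ID."""
    return ens_id.split(".", 1)[0]


def fetch_transcript_cdna(ens_id: str, http_get) -> str:
    """Fetch spliced cDNA through an injected getter (hypothetical helper)."""
    response = http_get(
        f"sequence/id/{strip_version(ens_id)}", params={"type": "cdna"}
    )
    return response["seq"]


def test_versioned_transcript_uses_cdna():
    # The mock replaces the live Ensembl call, so the test runs offline.
    http_get = Mock(return_value={"id": "ENST00000392653", "seq": "ACGT"})

    assert fetch_transcript_cdna("ENST00000392653.3", http_get) == "ACGT"
    # The version suffix is stripped and type=cdna is requested.
    http_get.assert_called_once_with(
        "sequence/id/ENST00000392653", params={"type": "cdna"}
    )
```

Injecting the getter keeps the example self-contained; a test against the real module would more likely patch the request function where gget imports it.
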
Add targeted, network-free tests for the previously-uncovered branches of
the scverse#187 fix (patch coverage was 82.6%, 4 lines in gget/gget_seq.py):

- lookup/id classification failure -> falls back to a genomic request
- isoforms=True on a gene where one transcript's cDNA fetch fails -> the
  error is logged and the remaining transcripts are still returned
- isoforms=True on a transcript whose cDNA fetch fails -> error logged,
  no sequence returned

These exercise the three `except RuntimeError` handlers that were not hit
before, bringing patch coverage of the fix to ~100%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
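
The per-transcript error handling described above follows a common log-and-continue pattern, sketched below. `fetch_all_cdna` and the `fetch_one` callable are hypothetical names; the only detail taken from the commit message is that failures surface as `RuntimeError` and are logged rather than aborting the remaining transcripts.

```python
import logging
from typing import Callable, Dict, Iterable

logger = logging.getLogger(__name__)


def fetch_all_cdna(
    transcript_ids: Iterable[str], fetch_one: Callable[[str], str]
) -> Dict[str, str]:
    """Fetch cDNA for each transcript, skipping the ones that fail."""
    sequences: Dict[str, str] = {}
    for transcript_id in transcript_ids:
        try:
            sequences[transcript_id] = fetch_one(transcript_id)
        except RuntimeError as err:
            # Log the failure and keep going so the other
            # transcripts are still returned.
            logger.error("cDNA fetch failed for %s: %s", transcript_id, err)
    return sequences
```

If every fetch fails, the result is an empty dict, which corresponds to the "error logged, no sequence returned" case for a single transcript.
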
Elarwei001 marked this pull request as draft June 25, 2026 03:44