Merge master and fix lint for fix_metatraits#535
Open
turbomam wants to merge 9 commits into fix_metatraits from
Conversation
Replace sequential per-medium API loops with a 20-worker thread pool, reducing ~3,333 serial requests per phase (~30 min each) to ~1-2 min. Applies to both detailed recipe and medium-strain association downloads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per Mark's comments:

- Lower DEFAULT_MAX_WORKERS from 20 → 5 (polite for the small academic API at DSMZ)
- Add USER_AGENT constant identifying kg-microbe so the API operator can reach out
- Respect the Retry-After header on 429 responses instead of a fixed retry_delay
- Switch from a futures dict to executor.map in both download functions
- Tighten return type annotations on _fetch_* helpers to tuple[str, dict]

Per Marcin's comments:

- Make max_workers, retry_count, and retry_delay parameters on download_detailed_media, download_medium_strains, and download_mediadive_bulk so callers can tune them
- Pass retry_count and retry_delay through the _fetch_* helpers into get_json_from_api (previously the helpers always used defaults, ignoring caller overrides)
- Add tests/test_mediadive_bulk_download.py: 8 tests covering default values, Retry-After behaviour, retry parameter propagation, and concurrency bounds

136 passed, 25 skipped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consolidate chemical mappings into unified utility
The bulk downloader creates mediadive_bulk_cache.sqlite (50MB) but only mediadive_cache.sqlite was gitignored. Also ignore the generated output JSONs in data/raw/mediadive/. See #533 Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ownload Parallelize MediaDive bulk download (~20x faster)
Accept master's version of mediadive_bulk_download.py which includes _make_session() helper and expanded get_json_from_api signature from the parallelization PR. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Fix 3x S110 try-except-pass with noqa suppressions (cleanup code)
- Fix E501 line too long in parallel processing message
- Fix D203 blank line before class docstring in test file (ruff --fix)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
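The E501 fix above is a standard pattern: splitting one over-long f-string into adjacent literals that Python concatenates implicitly. The message text and numbers below are illustrative, not the actual line from the parallel-processing code.

```python
# Before: one long f-string line that trips E501 (line too long).
# After: implicit concatenation of adjacent f-string literals keeps each
# physical line short while producing the identical runtime string.
total_requests = 3333  # per-phase request count cited in the commit above
max_workers = 5        # DEFAULT_MAX_WORKERS after the review change

message = (
    f"Downloading {total_requests} MediaDive records "
    f"with a pool of {max_workers} workers..."
)
print(message)
```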
Pull request overview
This PR merges recent master updates into the fix_metatraits branch and resolves CI lint failures, including adopting the updated parallelized MediaDive bulk download utilities and adding/adjusting lint suppressions and formatting fixes.
Changes:
- Merge conflict resolution bringing in the newer `mediadive_bulk_download.py` API (session helper, retry/header handling, parallel fetching).
- Lint fixes in MetaTraits cleanup code (`# noqa: S110`) and a long f-string split.
- Minor lint-related adjustments (`zip` strict arg, `.gitignore` updates) and a new MediaDive bulk download test file.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| kg_microbe/utils/mediadive_bulk_download.py | Introduces parallel download, session helper, retry behavior (incl. Retry-After), and new function parameters. |
| tests/test_mediadive_bulk_download.py | Adds unit tests for defaults, retry behavior, and concurrency bounds. |
| kg_microbe/utils/robot_utils.py | Lint-driven update to zip(..., strict=...) in ROBOT CLI call construction. |
| kg_microbe/transform_utils/metatraits/metatraits.py | Adds noqa suppressions for intentional cleanup exception silencing and splits a long print f-string. |
| .gitignore | Ignores MediaDive bulk cache DB and raw output directory. |
- Fix 429 exhaustion bug: all-429 retry loop now returns {} instead of falling off the end and returning None
- Guard Retry-After parsing: handle HTTP-date values that can't be cast to float, fall back to retry_delay
- Fix docstring: session param creates a new session per call, not module-level
- Remove unused requests_per_second parameter from both download functions (was documented but never implemented)
- Fix isinstance(terms, List) -> isinstance(terms, list) in robot_utils.py for Python 3.10+ compatibility

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
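The last bullet swaps typing.List for the builtin in an isinstance check. A short sketch of why the builtin form is the safe one; the GO term IDs below are hypothetical stand-ins for whatever robot_utils.py actually receives.

```python
terms = ["GO:0008150", "GO:0003674"]  # hypothetical ROBOT term IDs

# The builtin is the idiomatic isinstance target (PEP 585 style):
assert isinstance(terms, list)

# Parameterized generics can never be used with isinstance, which is one
# reason lint steers code away from the typing aliases entirely:
try:
    isinstance(terms, list[str])
except TypeError as exc:
    print(f"parameterized generics are rejected: {exc}")
```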
Summary

- Merge master into fix_metatraits to resolve the conflict from PR #527 (Parallelize MediaDive bulk download, ~20x faster)

Changes

- Accept master's mediadive_bulk_download.py with the _make_session() helper and expanded get_json_from_api signature
- Add noqa: S110 suppressions for cleanup code where silencing exceptions is intentional

Context

CI on fix_metatraits has been failing (lint) and the branch has a merge conflict with master after #527 and #530 were merged. This PR brings the branch up to date so it can pass CI and be evaluated for merge.