Skip to content

Merge master and fix lint for fix_metatraits#535

Open
turbomam wants to merge 9 commits intofix_metatraitsfrom
fix_metatraits-rebase-attempt
Open

Merge master and fix lint for fix_metatraits#535
turbomam wants to merge 9 commits intofix_metatraitsfrom
fix_metatraits-rebase-attempt

Conversation

@turbomam
Copy link
Copy Markdown
Collaborator

Summary

Changes

  • Merge conflict resolution: Accept master's mediadive_bulk_download.py with _make_session() helper and expanded get_json_from_api signature
  • 3x S110 try-except-pass: Added noqa: S110 suppressions for cleanup code where silencing exceptions is intentional
  • E501 line too long: Split f-string in parallel processing message
  • D203 blank line before class docstring: Auto-fixed in test file

Context

CI on fix_metatraits has been failing (lint) and the branch has a merge conflict with master after #527 and #530 were merged. This PR brings the branch up to date so it can pass CI and be evaluated for merge.

crocodile27 and others added 8 commits March 19, 2026 12:18
Replace sequential per-medium API loops with a 20-worker thread pool,
reducing ~3,333 serial requests per phase (~30 min each) to ~1-2 min.
Applies to both detailed recipe and medium-strain association downloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per Mark's comments:
- Lower DEFAULT_MAX_WORKERS from 20 → 5 (polite for small academic API at DSMZ)
- Add USER_AGENT constant identifying kg-microbe so API operator can reach out
- Respect Retry-After header on 429 responses instead of fixed retry_delay
- Switch from futures-dict to executor.map in both download functions
- Tighten return type annotations on _fetch_* helpers to tuple[str, dict]

Per Marcin's comments:
- Make max_workers, retry_count, retry_delay parameters on download_detailed_media,
  download_medium_strains, and download_mediadive_bulk so callers can tune them
- Pass retry_count and retry_delay through _fetch_* helpers into get_json_from_api
  (previously helpers always used defaults, ignoring caller overrides)
- Add tests/test_mediadive_bulk_download.py: 8 tests covering default values,
  Retry-After behaviour, retry parameter propagation, and concurrency bounds

136 passed, 25 skipped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consolidate chemical mappings into unified utility
The bulk downloader creates mediadive_bulk_cache.sqlite (50MB) but only
mediadive_cache.sqlite was gitignored. Also ignore the generated output
JSONs in data/raw/mediadive/.

See #533

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ownload

Parallelize MediaDive bulk download (~20x faster)
Accept master's version of mediadive_bulk_download.py which includes
_make_session() helper and expanded get_json_from_api signature from
the parallelization PR.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Fix 3x S110 try-except-pass with noqa suppressions (cleanup code)
- Fix E501 line too long in parallel processing message
- Fix D203 blank line before class docstring in test file (ruff --fix)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 27, 2026 19:01
@turbomam turbomam requested a review from realmarcin March 27, 2026 19:02
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR merges recent master updates into the fix_metatraits branch and resolves CI lint failures, including adopting the updated parallelized MediaDive bulk download utilities and adding/adjusting lint suppressions and formatting fixes.

Changes:

  • Merge conflict resolution bringing in the newer mediadive_bulk_download.py API (session helper, retry/header handling, parallel fetching).
  • Lint fixes in MetaTraits cleanup code (# noqa: S110) and a long f-string split.
  • Minor lint-related adjustments (zip strict arg, .gitignore updates) and a new MediaDive bulk download test file.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
kg_microbe/utils/mediadive_bulk_download.py Introduces parallel download, session helper, retry behavior (incl. Retry-After), and new function parameters.
tests/test_mediadive_bulk_download.py Adds unit tests for defaults, retry behavior, and concurrency bounds.
kg_microbe/utils/robot_utils.py Lint-driven update to zip(..., strict=...) in ROBOT CLI call construction.
kg_microbe/transform_utils/metatraits/metatraits.py Adds noqa suppressions for intentional cleanup exception silencing and splits a long print f-string.
.gitignore Ignores MediaDive bulk cache DB and raw output directory.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Fix 429 exhaustion bug: all-429 retry loop now returns {} instead of
  falling off the end and returning None
- Guard Retry-After parsing: handle HTTP-date values that can't be
  cast to float, fall back to retry_delay
- Fix docstring: session param creates new session per call, not
  module-level
- Remove unused requests_per_second parameter from both download
  functions (was documented but never implemented)
- Fix isinstance(terms, List) -> isinstance(terms, list) in
  robot_utils.py for Python 3.10+ compatibility

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants