Harden deployment readiness #9
Open
majdabd wants to merge 34 commits into
Pull request overview
This PR hardens TrialMatchAI’s deployment/readiness story by centralizing Elasticsearch configuration, adding preflight validation, tightening JSON extraction from model outputs, and introducing a containerized (Docker Compose) runtime path while removing committed secrets/generated artifacts.
Changes:
- Added preflight checks + CLI entrypoints (healthcheck/run/bootstrap/index) and wired them into setup/CI.
- Centralized Elasticsearch indexer config (`utils/Indexer/es_config.py`) with env overrides and safer cert path handling.
- Removed committed secrets/artifacts (Elasticsearch certs/config, Parser input/output samples, tracked IDs) and added secret scanning + ignore rules.
Reviewed changes
Copilot reviewed 136 out of 159 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| utils/Preprocessor/utils.py | Fixes primary completion date extraction and removes unused NER overlap variables. |
| utils/Indexer/index_trials.py | Switches to shared ES config helpers for consistent client/config creation. |
| utils/Indexer/index_criteria.py | Switches to shared ES config helpers and updates default index name. |
| utils/Indexer/es_config.py | New shared ES config loader/client builder with env overrides and CA path resolution. |
| utils/Indexer/config.json | Removes real password and updates CA cert path to new layout. |
| utils/finetuning/finetune_instruct/evaluate_gemma2.py | Makes incorrect-prediction output persist to disk and tidies imports. |
| utils/DataLoader/nct_ids.txt | Removes tracked generated ID file. |
| tests/test_settings.py | Removes manual sys.path manipulation for imports. |
| tests/test_search_queries.py | Removes manual sys.path manipulation for imports. |
| tests/test_schemas.py | Removes manual sys.path manipulation for imports. |
| tests/test_preflight_and_indexer.py | Adds unit tests covering preflight checks and indexer config env overrides. |
| tests/test_logging.py | Removes manual sys.path manipulation for imports. |
| tests/test_file_utils_pytest.py | Removes unused import. |
| tests/test_deployment_readiness.py | Adds tests for config resolution/env overrides + prompt/JSON extraction behavior. |
| source/Parser/output/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.gner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.biomedner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.gner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.biomedner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.gner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.biomedner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/6db4a18d11e3899c21b4cc11489cf3c8b457a40273c48ecd39ab4377.PubTator.biomedner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.gner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.biomedner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.gner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.biomedner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/611019227d9ea65a714df4e8c4498bfc13e21c67879025c3b03d8637.PubTator.biomedner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.gner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.biomedner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.gner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.biomedner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.gner.json | Removes committed generated Parser output artifact. |
| source/Parser/output/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.biomedner.json | Removes committed generated Parser output artifact. |
| source/Parser/input/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.gner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.biomedner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.gner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.biomedner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.gner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.biomedner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/6db4a18d11e3899c21b4cc11489cf3c8b457a40273c48ecd39ab4377.PubTator.gner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/6db4a18d11e3899c21b4cc11489cf3c8b457a40273c48ecd39ab4377.PubTator.biomedner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.gner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.biomedner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.gner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.biomedner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/611019227d9ea65a714df4e8c4498bfc13e21c67879025c3b03d8637.PubTator.gner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/611019227d9ea65a714df4e8c4498bfc13e21c67879025c3b03d8637.PubTator.biomedner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.gner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.biomedner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.gner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.biomedner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.gner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/input/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.biomedner.PubTator | Removes committed generated Parser input artifact. |
| source/Parser/biomedner_init.py | Renames locals and removes unused variables to reduce lint noise. |
| source/Parser/biomedner_engine.py | Removes stray print(os.getcwd()) debug output. |
| source/Matcher/utils/json_utils.py | Adds balanced-brace JSON extraction helper for model output parsing. |
| source/Matcher/services/preflight.py | Adds preflight validation for paths, ES reachability, models, and optional vLLM requirements. |
| source/Matcher/services/elasticsearch_service.py | Adds helper to build an ES client from config + certs. |
| source/Matcher/services/biomedner_service.py | Gates BioMedNER auto-start behind services.auto_start. |
| source/Matcher/pipeline/phenopacket_processor.py | Switches LLM JSON parsing to balanced-brace extraction helper. |
| source/Matcher/pipeline/cot_reasoning.py | Removes consent injection text and uses robust JSON extraction + error output persistence. |
| source/Matcher/pipeline/cot_reasoning_vllm.py | Removes consent injection text and uses robust JSON extraction + error output persistence. |
| source/Matcher/models/llm/vllm_loader.py | Disables remote code by default and supports explicit trust/revision configuration. |
| source/Matcher/models/llm/llm_reranker.py | Adds revision + trust_remote_code plumbing to reranker loading. |
| source/Matcher/models/llm/llm_loader.py | Adds revision + trust_remote_code plumbing to base model loading. |
| source/Matcher/models/embedding/text_embedder.py | Adds revision + trust_remote_code plumbing to embedder loading. |
| source/Matcher/main.py | Adds preflight execution and better exit codes; defers BioMedNER import until after readiness checks. |
| source/Matcher/config/settings.py | Expands env overrides, adds new settings fields, and refactors nested override handling. |
| source/Matcher/config/config.json | Updates default paths, disables auto-start defaults, standardizes index names, adds revision/trust flags. |
| source/Matcher/config/config_loader.py | Adds robust config path resolution + repo-root .env loading + path normalization. |
| source/Matcher/cli/run.py | New CLI entrypoint for running the batch pipeline. |
| source/Matcher/cli/index_data.py | New CLI entrypoint that runs the indexing script from repo root. |
| source/Matcher/cli/healthcheck.py | Switches to preflight checks + unified ES client builder; adds optional index requirement. |
| source/Matcher/cli/bootstrap_data.py | New CLI entrypoint that runs the bootstrap script from repo root. |
| setup.sh | Uses packaged CLI commands (via uv run when available) for bootstrap and indexing. |
| scripts/start_es.sh | Supports both docker compose and legacy docker-compose. |
| scripts/scan_secrets.py | Adds tracked-file secret scanning for CI/local checks. |
| scripts/bootstrap_data.sh | Adds checksum verification and safer archive extraction helpers. |
| requirements.txt | Restructures dependencies and documents extras (gpu/llm/training). |
| README.md | Updates deployment guidance, security posture, CLI usage, and checks. |
| pyproject.toml | Updates Python requirement, dependencies/extras, script entrypoints, and package data. |
| Makefile | Adds targets for GPU sync, auditing, and running new CLIs. |
| elasticsearch/tmp-config/roles.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/tmp-config/role_mapping.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/tmp-config/log4j2.properties | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/tmp-config/log4j2.file.properties | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/tmp-config/jvm.options | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/tmp-config/elasticsearch.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/tmp-config/elasticsearch-plugins.example.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/docker-compose.yml | Adjusts cert volume mount to allow setup container to write certs. |
| elasticsearch/config/es03/roles.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es03/role_mapping.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es03/log4j2.properties | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es03/jvm.options | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es03/es03.key | Removes tracked TLS private key material. |
| elasticsearch/config/es03/es03.crt | Removes tracked TLS cert material. |
| elasticsearch/config/es03/elasticsearch.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es03/elasticsearch-plugins.example.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es03/ca.crt | Removes tracked CA cert material. |
| elasticsearch/config/es02/roles.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es02/role_mapping.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es02/log4j2.properties | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es02/jvm.options | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es02/es02.key | Removes tracked TLS private key material. |
| elasticsearch/config/es02/es02.crt | Removes tracked TLS cert material. |
| elasticsearch/config/es02/elasticsearch.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es02/elasticsearch-plugins.example.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es02/ca.crt | Removes tracked CA cert material. |
| elasticsearch/config/es01/roles.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es01/role_mapping.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es01/log4j2.properties | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es01/jvm.options | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es01/es01.key | Removes tracked TLS private key material. |
| elasticsearch/config/es01/es01.crt | Removes tracked TLS cert material. |
| elasticsearch/config/es01/elasticsearch.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es01/elasticsearch-plugins.example.yml | Removes tracked Elasticsearch generated/config scaffolding. |
| elasticsearch/config/es01/ca.crt | Removes tracked CA cert material. |
| elasticsearch/certs/instances.yml | Removes tracked certutil instance configuration. |
| elasticsearch/certs/es03/es03.key | Removes tracked TLS private key material. |
| elasticsearch/certs/es03/es03.crt | Removes tracked TLS cert material. |
| elasticsearch/certs/es02/es02.key | Removes tracked TLS private key material. |
| elasticsearch/certs/es02/es02.crt | Removes tracked TLS cert material. |
| elasticsearch/certs/es01/es01.key | Removes tracked TLS private key material. |
| elasticsearch/certs/es01/es01.crt | Removes tracked TLS cert material. |
| elasticsearch/certs/ca/ca.key | Removes tracked CA private key material. |
| elasticsearch/certs/ca/ca.crt | Removes tracked CA cert material. |
| elasticsearch/certs/ca.crt | Removes tracked CA cert material. |
| elasticsearch/apptainer-run-es.sh | Adds checks to fail fast if TLS certs are missing for Apptainer ES startup. |
| elasticsearch/.env.example | Adds example env template for local ES stack. |
| elasticsearch/.env | Removes committed real credentials. |
| Dockerfile | Adds a container build for TrialMatchAI worker using uv sync + GPU extra. |
| docker-compose.yml | Adds root-level Compose stack for local ES + worker (+ optional Kibana). |
| .gitignore | Expands ignore rules to prevent committing env/certs/parser outputs/indexing state. |
| .github/workflows/ci.yml | Updates CI to include linting, secret scan, dependency audit, compose validation, and docker build. |
| .env.example | Adds root env template for runtime configuration. |
| .dockerignore | Prevents shipping secrets/artifacts into Docker build context. |
Comment on lines +28 to +33

    if char == "{":
        depth += 1
    elif char == "}":
        depth -= 1
        if depth == 0:
            return json.loads(text[start : index + 1])
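The hunk above counts braces to find the end of a JSON object in model output. As a minimal sketch of that idea (not the PR's actual helper), the version below also skips braces that occur inside JSON strings and moves on to the next `{` when a balanced span does not parse. The function name `extract_first_json_object` and its return-`None` behaviour are assumptions made for illustration.

```python
import json
from typing import Any, Optional


def extract_first_json_object(text: str) -> Optional[Any]:
    """Return the first balanced {...} span in `text` that parses as JSON, else None.

    Hypothetical helper: the name and the None fallback are assumptions.
    """
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for index in range(start, len(text)):
            char = text[index]
            if in_string:
                # Inside a JSON string, braces are data, not structure.
                if escaped:
                    escaped = False
                elif char == "\\":
                    escaped = True
                elif char == '"':
                    in_string = False
                continue
            if char == '"':
                in_string = True
            elif char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : index + 1])
                    except json.JSONDecodeError:
                        break  # Balanced but not valid JSON; try the next "{".
        start = text.find("{", start + 1)
    return None


sample = 'Reasoning {not json} then {"a": {"b": "}"}, "c": 1} trailing text'
parsed = extract_first_json_object(sample)
```

Tracking string state matters because a criterion text such as `"}"` would otherwise close the object early.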
Comment on lines +75 to +77

    issues.append(
        f"Elasticsearch is not reachable at {config['elasticsearch']['host']}."
    )
Comment on lines +126 to +129

    ELASTICSEARCH_HOSTS: https://elasticsearch:9200
    ELASTICSEARCH_USERNAME: kibana_system
    ELASTICSEARCH_PASSWORD: ${KIBANA_PASSWORD:-}
    ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES: config/certs/ca/ca.crt
Safety net before the audit-driven refactor (see REFACTOR_PLAN.md).
- CI: add an ml-extras-smoke job that runs `uv sync --extra entity` and imports
the heavy libs (torch/transformers/gliner/gliner2) and the six local-model
modules the default job never exercises, so a broken import or API drift
is caught instead of shipping silently.
- CI: keep the installed-smoke wheel path absolute ("$PWD"/dist) so $WHEEL
survives the cd into /tmp.
- tests: characterization tests pinning the current (buggy) score_trial
behavior, plus a strict xfail encoding the PR1 contract (a Violated
exclusion must hard-disqualify).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Previously score_trial averaged the inclusion ratio with the exclusion ratio, so a trial whose exclusion criteria were Violated could still outrank an eligible trial (audit finding C1). The classification lists also counted impossible labels (inclusion can never be "Not Violated", exclusion can never be "Met").
score_trial now:
- returns DISQUALIFIED_SCORE (-1.0) if any exclusion is "Violated", so it ranks strictly below every eligible trial;
- otherwise scores eligible trials in [0, 1] by the fraction of decided inclusion criteria (Met or Not Met) that are Met.
The prompts already constrain the label sets correctly, so only the scorer changed; the duplicated prompt text is addressed in PR7. Contract tests (disqualification, met-fraction, ranking order) replace the prior xfail.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
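The scoring contract described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the repository's `score_trial`: the function name, the flat label-list signature, and the `0.0` returned when no inclusion criterion is decided are all hypothetical; only the `-1.0` disqualification value and the Met fraction come from the commit message.

```python
from typing import Iterable

DISQUALIFIED_SCORE = -1.0  # value stated in the commit message


def score_trial_sketch(
    inclusion_labels: Iterable[str], exclusion_labels: Iterable[str]
) -> float:
    """Hypothetical scorer: hard-disqualify on any Violated exclusion."""
    if any(label == "Violated" for label in exclusion_labels):
        return DISQUALIFIED_SCORE
    # Only decided inclusion criteria count; other labels are ignored.
    decided = [label for label in inclusion_labels if label in ("Met", "Not Met")]
    if not decided:
        return 0.0  # Assumption: the commit does not say what happens here.
    return sum(label == "Met" for label in decided) / len(decided)


eligible = score_trial_sketch(["Met", "Not Met", "Unknown"], ["Not Violated"])
disqualified = score_trial_sketch(["Met", "Met"], ["Violated"])
```

Because eligible scores stay in [0, 1] and the disqualified score is negative, sorting by score in descending order always places disqualified trials last.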
The reranker re-derived device/dtype/attention setup independently of llm_loader and got several wrong. Extract the shared logic into models/llm/_common.py and rebuild both loaders on it.
Reranker fixes (audit C2 + highs/mediums):
- device_map pinned to the resolved device instead of "auto", so the model's first layer lives where inputs are moved (fixes multi-GPU "tensors on different devices" crash);
- left padding + pad token via configure_decoder_tokenizer, so logits[:, -1, :] reads the last real token, not a PAD position;
- FlashAttention-2 -> SDPA fallback instead of hardcoding flash_attn;
- compute dtype defaults to bf16/fp16 by capability instead of fp16;
- device accepts int or "auto" (no silent coercion of non-ints to GPU 0);
- dropped the ThreadPoolExecutor + model_lock that serialized anyway.
llm_loader now reuses the same helpers. Stub-based unit tests cover the tokenizer/device/dtype/attention logic without the llm extra.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
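The padding point is easy to see without a model. The toy sketch below uses plain Python lists instead of tensors, and the `pad_batch` helper and `PAD` id are made up for illustration. It shows which token sits in the last position of each row, which is the position `logits[:, -1, :]` would read.

```python
from typing import List

PAD = 0  # hypothetical pad token id


def pad_batch(sequences: List[List[int]], side: str) -> List[List[int]]:
    """Pad every sequence to the batch maximum on the given side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [PAD] * (width - len(seq))
        padded.append(filler + list(seq) if side == "left" else list(seq) + filler)
    return padded


batch = [[11, 12, 13], [21]]
last_with_right_padding = [row[-1] for row in pad_batch(batch, "right")]
last_with_left_padding = [row[-1] for row in pad_batch(batch, "left")]
```

With right padding the shorter row ends in `PAD`, so reading the final position scores a padding token; with left padding every row ends in its last real token.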
Re-grepped each target to confirm zero live references (only auto-generated egg-info/SOURCES.txt and the now-removed test/whitelist referenced them):
- utils/evaluation.py: orphaned TREC eval, no caller, no entry point.
- models/embedding/query_embedder.py, sentence_embedder.py: vestigial TextEmbedder subclasses nothing constructs.
- preprocessing/regex/ tree: regex resource files with no Python consumer (leftovers from the deleted src/Matcher); also drop the package-data globs in pyproject and the "preprocessing" entry in the config_loader resource whitelist.
- matching/phenopacket_processor.py (+ its test): superseded by the canonical interop/importers/phenopacket.py + interop/narrative.py path. This deletion also resolves the dead ClinicalSummarizer, the ontology label bug, the always-true temporal guards, the truncate typo, and the duplicate phenopacket pipeline.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Verified-safe in-file dead code (re-grepped each before removal):
- recognizers.with_schema_threshold (+ now-unused dataclasses.replace import)
- types.EntityAnnotation.to_index_entity
- utils/retry.py (with_retries) + its test: only the test used it; production retries go through tenacity
- interop EvidenceSpan model, Provenance.raw_text_span field, and the PatientProfile.all_facts helper: none populated, read, or referenced (Provenance keeps extra="allow", so dropping the field is safe)
- narrative.render_patient_narrative: dead style="audit" branch and the unused style parameter (only caller passed "rag")
- annotator.annotate_texts_in_parallel: dead retries/delay params (accepted then immediately del'd; no caller supplies them)
- criteria_retrieval.rerank_criteria: unused `queries` parameter (body keys off criterion["query"]); updated the call site
Legacy shim removed:
- CompatibilityEntityAnnotator was an empty SchemaEntityAnnotator subclass; build_entity_annotator now returns SchemaEntityAnnotator directly, and the export/test were repointed.
Deferred: max_text_score -> PR6 (folded into the create_query rework); dead config settings (cot/LLM_reranker/TokenizerSettings/entity_extraction.threshold) -> PR9, since they are entangled with config.json + the env-override map and carry config-validation risk that does not belong in a deletion PR.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The join compared each child table's raw person_id against the patient's
*sanitized* profile id via astype(str). Two real failures:
1. a null person_id in a child table promotes the column to float64, so
person_id 1 serialized as "1.0" and "1.0" != "1" dropped every
condition/measurement/drug/procedure/observation/note for that patient;
2. a person_id needing sanitization ("pat 01" -> "pat-01") never matched
the raw child value.
Now join on the raw person_id via _normalize_join_id (handles float promotion
and string/int mismatch) and pre-group each child table once with
_group_by_person, so each patient is an O(1) lookup instead of a full astype
scan per table (removes the O(P*R) N+1). _concept_lookup also switches off
iterrows. Profile ids are still sanitized for display/provenance.
Two regression tests (null-person_id float promotion; sanitized id) fail on
the old code and pass now.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
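The float-promotion failure and the pre-grouping fix can be illustrated without pandas. The sketch below is an assumption-laden stand-in for `_normalize_join_id` and `_group_by_person`: the function names, the dict-per-row representation, and the choice to drop rows with a missing id are all hypothetical.

```python
import math
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional


def normalize_join_id(value: Any) -> Optional[str]:
    """Map 1, 1.0 and "1" to the same key; missing ids become None."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None
        if value.is_integer():
            return str(int(value))  # 1.0 -> "1", undoing float promotion
    return str(value).strip()


def group_by_person(rows: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group child rows once so each patient lookup is a dict access."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        key = normalize_join_id(row.get("person_id"))
        if key is not None:  # Assumption: rows without an id are skipped.
            groups[key].append(row)
    return dict(groups)


conditions = [
    {"person_id": 1.0, "concept": "asthma"},
    {"person_id": float("nan"), "concept": "orphan row"},
    {"person_id": "pat 01", "concept": "diabetes"},
]
by_person = group_by_person(conditions)
rows_for_patient_1 = by_person.get(normalize_join_id(1), [])
```

The lookup uses the raw id on both sides, so a display id sanitized to `pat-01` never takes part in the join.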
1. prepare_trial_document now writes detailed_description and official_title. The backend weights both in TRIAL_TEXT_WEIGHTS (1.5 / 1.0) but they were never indexed, so the weights were dead and trials were under-indexed. (Requires rebuilding the LanceDB trial index to take effect.)
2. trial_ranker.load_trial_data only loads NCT-named files. It previously globbed every *.json in the output folder, scoring run sidecars (keywords/patient_profile/first_level_scores/rag_output) as bogus 0.0 trials.
3. _scan_rows fallback now applies the nct_id WHERE filter (table.search().where(...)). When FTS and vector both returned nothing it scanned an unfiltered head slice that could exclude the requested trials. Verified table.search().where() against lancedb 0.25.3.
4. create_query drops the 7 keys the backend never reads (age/sex/overall_status/pre_selected_nct_ids/vector_score_threshold/max_text_score/search_mode) and the misleading age=0-vs-None contract; search_trials already passes those filters to the backend directly. Removes the deferred max_text_score param.
Regression tests: scan-fallback nct filter, NCT-only loading (+ sidecars skipped), minimal create_query contract, and prepared-doc field preservation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
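Item 2 amounts to filtering the glob by file name. A minimal sketch, assuming trial outputs are named `<NCT id>.json` with the usual `NCT` plus eight digits; the helper name `list_trial_files` and the exact pattern are assumptions, not the code in `trial_ranker`.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Assumption: one output file per trial, named after its NCT identifier.
NCT_FILE = re.compile(r"^NCT\d{8}\.json$")


def list_trial_files(folder: Path) -> List[Path]:
    """Return only NCT-named JSON files, skipping run sidecars."""
    return sorted(p for p in folder.glob("*.json") if NCT_FILE.match(p.name))


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("NCT01234567.json", "keywords.json", "rag_output.json"):
        (folder / name).write_text("{}", encoding="utf-8")
    trial_names = [p.name for p in list_trial_files(folder)]
```

Sidecars such as `keywords.json` no longer reach the scorer, so they cannot appear as zero-score trials.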
Behavior-preserving consolidation of copy-pasted logic:
- models/embedding.build_embedder(config): single embedder factory replacing the identical TextEmbedder(TextEmbedderConfig(...)) block in main.py and the index/build-concepts/update-registry CLIs.
- matching/retrieval/synonyms.disease_synonyms(): one disease-synonym extractor that ClinicalTrialSearch and SecondStageRetriever now delegate to.
- utils/text.flatten_text(): single whitespace-normalizing flatten used by both registry.preparation and search.lancedb_backend (the two _flatten_text copies). Behavior-preserving for preparation since _preprocess_text re-normalizes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The HF (eligibility_reasoning) and vLLM (eligibility_reasoning_vllm) CoT processors duplicated the ~90-line prompt template, _load_trial_data, _save_outputs, and the worklist/length-bucketing orchestration verbatim: the largest duplication in the repo and a drift risk for the scoring contract.
New matching/eligibility_base.BaseTrialProcessor holds the shared prompt, trial I/O, output persistence, and process_trials skeleton (parameterized by the _token_length and _progress_desc hooks). Each backend now subclasses it and implements only __init__ and _process_batch (plus vLLM's LoRA/token-length helpers). Prompt output is byte-identical; behavior preserved.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
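The refactor described here is a template-method layout: the base class owns the prompt and the batching skeleton, and each backend supplies `_process_batch`. The sketch below is a toy illustration of that shape only. The class names, the one-line prompt, the word-count `_token_length`, and the fixed batch size are assumptions and do not reproduce `BaseTrialProcessor`.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseTrialProcessorSketch(ABC):
    """Hypothetical base: shared prompt and orchestration, backend-specific batch call."""

    batch_size = 2

    def build_prompt(self, criteria_text: str) -> str:
        # Shared by every backend, so the prompt cannot drift between them.
        return f"Criteria: {criteria_text}"

    def _token_length(self, prompt: str) -> int:
        # Hook: a real backend would ask its tokenizer; word count is a stand-in.
        return len(prompt.split())

    @abstractmethod
    def _process_batch(self, prompts: List[str]) -> List[str]:
        """Run one batch of prompts through the backend."""

    def process_trials(self, trials: Dict[str, str]) -> Dict[str, str]:
        # Sort by length so each batch holds prompts of similar size.
        worklist = sorted(
            ((trial_id, self.build_prompt(text)) for trial_id, text in trials.items()),
            key=lambda item: self._token_length(item[1]),
        )
        results: Dict[str, str] = {}
        for start in range(0, len(worklist), self.batch_size):
            chunk = worklist[start : start + self.batch_size]
            outputs = self._process_batch([prompt for _, prompt in chunk])
            for (trial_id, _), output in zip(chunk, outputs):
                results[trial_id] = output
        return results


class UppercaseBackend(BaseTrialProcessorSketch):
    """Stand-in backend that only implements the batch hook."""

    def _process_batch(self, prompts: List[str]) -> List[str]:
        return [prompt.upper() for prompt in prompts]


outputs = UppercaseBackend().process_trials(
    {"NCT00000001": "age over 18 years", "NCT00000002": "adults"}
)
```

Keeping the prompt in one place is what makes a "byte-identical" guarantee checkable with a single test.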
…records
- main_pipeline only loads (and half()s) the HuggingFace CoT model when cot_backend != "vllm". Under the vLLM backend run_rag_processing loads its own engine and ignored the HF model, so it was wasting GPU memory and load time (and could OOM alongside the vLLM engine).
- _rank_trial_rows / _rank_criteria_rows skip rebuilding the trial/criteria record when the candidate row read from the index already has the derived fields (search_text); raw in-memory docs still get built. Avoids recomputing search_text/search_vector for every candidate on every query.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- logging_config: configure the root handler exactly once instead of building a new StreamHandler on every setup_logging() call (~all modules call it at import). Per-logger request-id ContextFilter behavior is unchanged.
- import_patient: build and pass an embedder to the entity annotator so concept linking can use semantic search instead of silently degrading to lexical-only (still degrades gracefully to no-entities when ML extras are absent).
- pyproject: drop 5 dependencies never imported in src (regex, python-dotenv, rich, bioregistry, rapidfuzz); regenerated uv.lock. numpy/pyarrow kept as real transitive runtime needs of pandas/lancedb.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Lets users train their own models and plug them straight into the pipeline,
instead of only relying on the vanilla checkpoints. The integration path already
existed via config (entity_extraction.model_name, model.reranker_adapter_path,
model.cot_adapter_path); this adds the missing training side, modernized for the
current architecture (GLiNER for NER, LoRA for reranker/CoT).
New trialmatchai.finetuning package (heavy deps imported lazily):
- config.FinetuneConfig: shared LoRA SFT hyper-parameters.
- data: JSONL loaders + prompt builders that REUSE the runtime prompts
(LLMReranker.create_messages, chat templates) so train == inference; plus a
char-span -> GLiNER token-span converter.
- _sft.run_sft: LoRA SFT loop with prompt-masked labels (loss on completion).
- cot/reranker/ner: thin task fine-tuners producing a LoRA adapter (cot,
reranker) or a GLiNER checkpoint (ner).
- cli: `trialmatchai-finetune {cot,reranker,ner}` console command.
pyproject: new `finetune` optional extra (torch/transformers/peft/accelerate/datasets/gliner[2]/bitsandbytes) + the console entry point.
LLMReranker.create_messages is now a @staticmethod so finetuning reuses it without loading a model.
CI imports the finetuning modules and smoke-tests the new CLI.
Tests cover data conversion, prompt reuse, and CLI parsing (CPU-only).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
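"Prompt-masked labels (loss on completion)" usually means the label sequence copies the completion tokens and marks every prompt position with the ignore index, which is `-100` by default in PyTorch's `CrossEntropyLoss`. A minimal sketch on plain token-id lists; the helper name `build_masked_labels` is hypothetical and is not taken from `_sft.run_sft`.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_masked_labels(
    prompt_ids: List[int], completion_ids: List[int]
) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and completion; only completion tokens carry loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


input_ids, labels = build_masked_labels([5, 6, 7], [8, 9])
```

Both lists have the same length, so they can be padded and batched together.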
- README: add an architecture/"how it works" overview, a "bring your own models" section (custom checkpoints/adapters via config), the new trialmatchai-finetune command, an extras table, and a navigable layout.
- docs/finetuning.md: data formats, commands, and plug-back-in steps for the NER (GLiNER), reranker (LoRA Yes/No), and CoT (LoRA SFT) fine-tuners.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The audit-driven refactor (PR0-PR9), fine-tuning integration, and README/docs updates are complete and merged; the tracking plan is no longer needed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eted)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
It exercises heavy third-party deps (torch/gliner) on a Linux runner that we don't gate releases on; keep it as a signal without failing the workflow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Research-driven corrections to the fine-tuning integration:
- NER now uses the real GLiNER2 stack (GLiNER2Trainer / TrainingConfig /
InputExample with entities={label: [surface forms]}), not the gliner-v1
token-classification API. entity_descriptions are back-filled from the entity
schema so a fine-tuned model keeps the runtime label semantics. Saves a LoRA
adapter (output_dir/final) or full checkpoint (output_dir/best); both load via
entity_extraction.model_name. CLI exposes encoder-lr/task-lr/lora/schema-path.
- data: NER converter emits GLiNER2 surface-form schema (accepts {entities},
char spans, or native {input,output}) instead of token spans.
- _sft: add prepare_model_for_kbit_training + gradient checkpointing + a paged
8-bit optimizer + cosine schedule — required for stable, memory-efficient
4-bit (QLoRA) training of the reranker and CoT adapters.
- docs/finetuning.md updated to the GLiNER2 data format and flags.
Sources: GLiNER2 training + LoRA tutorials (fastino-ai/GLiNER2); QLoRA/PEFT
fine-tuning best practices.
Per direction: every LLM runs on vLLM (no HuggingFace backend), and fine-tuned LoRA adapters are served everywhere via vLLM's LoRARequest.
Removed the HuggingFace LLM path entirely:
- delete matching/eligibility_reasoning.py (HF CoT processor), models/llm/llm_loader.py, models/llm/_common.py, and tests/test_llm_common.py.
- run_rag_processing always builds a vLLM engine; main_pipeline no longer loads or half()s an HF model. Drop the cot_backend config switch (settings, config.json, env map) and make preflight's vLLM/GPU check unconditional under require_models.
Reranker now runs on vLLM:
- rewrite LLMReranker to score (patient, criterion) pairs by constraining generation to the Yes/No tokens (allowed_token_ids) and reading their logprobs; a configured reranker_adapter_path is served via LoRARequest. create_messages is a @staticmethod (reused by the fine-tuner without loading a model).
Adapters end to end:
- CoT and reranker both load their LoRA adapters through vLLM (cot_adapter_path / reranker_adapter_path); NER LoRA loads via GLiNER2.load_adapter.
- new `trialmatchai-finetune merge` to merge a LoRA adapter into the base model for users who prefer a standalone checkpoint over base+adapter.
The embedder stays on transformers (it is an embedding model, not a generative LLM). Docs/README updated; CI import smoke adjusted to the vLLM modules.
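The commit says the reranker reads the Yes and No logprobs but does not say how they become a score. One common choice, shown here purely as an assumption, is a two-way softmax that yields P(Yes) in [0, 1]. The function name is hypothetical and no vLLM call is made; the two logprobs are taken as given.

```python
import math


def yes_probability(logprob_yes: float, logprob_no: float) -> float:
    """Normalize two log-probabilities into P(Yes) with a stable softmax."""
    peak = max(logprob_yes, logprob_no)  # subtract the max to avoid overflow
    weight_yes = math.exp(logprob_yes - peak)
    weight_no = math.exp(logprob_no - peak)
    return weight_yes / (weight_yes + weight_no)


confident_yes = yes_probability(math.log(0.9), math.log(0.1))
undecided = yes_probability(-2.0, -2.0)
```

Renormalizing over just the two allowed tokens keeps scores comparable across (patient, criterion) pairs even when the raw logprobs differ in scale.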
Reintegrates domain knowledge dropped in the migration, folded into the current architecture rather than restored verbatim.
Criteria chunking (registry/criteria_chunking.py):
- replaces the generic line-splitter in normalization.split_eligibility_criteria with a single pass that understands multi-level enumeration hierarchies (1, 1.2, 1.2.3, (a), roman), varied inclusion/exclusion headers (incl. inline "Exclusion Criteria: 1. ... 2. ..."), parenthetical protection, decimal/abbreviation split-exceptions, and continuation-line joining. 8 new tests.
Genetic-variant recognizer (entities/recognizers.py):
- restore the curated variant pattern table as entities/resources/variant_patterns.tsv (HGVS mutations, fusions, chromosome arms, ...).
- RegexVariantRecognizer matches these deterministically (e.g. p.V600E, c.1799T>A) and CompositeRecognizer runs it alongside the GLiNER model, merging by confidence/length so precise variant spans the model would miss are still captured. On by default (entity_extraction.variant_regex). 5 tests.
Packaging: variant_patterns.tsv added to package-data; both load correctly from the built wheel.
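Deterministic variant matching of the kind described can be sketched with two simplified patterns covering the commit's own examples (p.V600E, c.1799T>A). These regexes are illustrative assumptions and far narrower than a curated table; the label names and the `find_variants` helper are hypothetical.

```python
import re
from typing import List, Tuple

# Simplified, hypothetical patterns; the real table is variant_patterns.tsv.
VARIANT_PATTERNS = [
    # Protein change, one- or three-letter residues: p.V600E, p.Val600Glu
    ("protein_change", re.compile(r"\bp\.[A-Z][a-z]{0,2}\d+[A-Z][a-z]{0,2}\b")),
    # cDNA substitution: c.1799T>A
    ("cdna_substitution", re.compile(r"\bc\.\d+[ACGT]>[ACGT]\b")),
]


def find_variants(text: str) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, label, surface) for every pattern hit, in text order."""
    hits = []
    for label, pattern in VARIANT_PATTERNS:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group()))
    return sorted(hits)


hits = find_variants("BRAF p.V600E (c.1799T>A) mutation required")
surfaces = [surface for _, _, _, surface in hits]
```

Returning character offsets lets a composite recognizer compare these spans with model spans when it merges overlapping results.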
Replace the arXiv reference with an explicit "please cite" message pointing to the Nature Communications paper (with a BibTeX entry); keep the Zenodo DOI as the software archive.
…ctions

Adds the multi-channel first-level query planner and the eligibility-constraint extraction/evaluation subsystem, and corrects the bugs found in review:

Constraints
- exclusion criteria the patient cleanly passes now reward instead of scoring neutral (relation "absent": exclusion -> not-violated, inclusion -> neutral)
- biomarker evaluation is wild-type-safe (a negative/wild-type patient no longer satisfies a "present"/"mutated" requirement)
- age regexes require an explicit year unit or age cue (no more "100 to 200 mg" -> age); comparator-less labs are skipped (no silent ge default)
- unknown_is_neutral config is now honored (penalizes unconfirmable inclusions)

First-level planner
- hard_filters config is actually applied (and [] truly disables filters); age parsing degrades gracefully instead of aborting the search; a guard logs when all channels return nothing; disease_synonyms output is deterministic

Other
- restore the entity_extraction.variant_regex knob broken by extra="forbid"
- remove the vestigial GLiNER-v1 backend

Regression tests added for each fix.
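The age-regex fix can be pictured with a minimal sketch in which a numeric range only counts as an age when a year unit follows it or an age cue precedes it. The patterns and the `parse_age_range` name are assumptions made for this example, not the regexes shipped in the PR.

```python
import re

# A bare numeric range such as "18 to 65" or "18-65".
_RANGE = r"(?P<lo>\d{1,3})\s*(?:-|to|and)\s*(?P<hi>\d{1,3})"

# Accept the range only with a trailing year unit ...
WITH_UNIT = re.compile(_RANGE + r"\s*(?:years?|yrs?)\b", re.IGNORECASE)
# ... or with a leading age cue ("age", "aged", "ages") shortly before it.
WITH_CUE = re.compile(r"\bage[ds]?\b[^.\d]{0,20}?" + _RANGE, re.IGNORECASE)


def parse_age_range(text: str):
    """Return (low, high) in years, or None when no age evidence is present."""
    for pattern in (WITH_UNIT, WITH_CUE):
        match = pattern.search(text)
        if match:
            return int(match.group("lo")), int(match.group("hi"))
    return None
```

With this shape a dosing range has neither a unit nor a cue, so it falls through to `None` rather than being read as an age constraint.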
Adds geographic location as an opt-in first-level hard filter, done site-aware and recall-safe rather than as a blunt cut.
- interop: PatientProfile gains a Location (country/state/city); the FHIR importer populates it from Patient.address.
- matching/retrieval/location.py: country-level, match-ANY-site filtering. A trial is kept when the patient's country is unknown, when the trial has no indexed site countries, or when any of its sites is in the patient's country, so trials with unknown locations are never dropped.
- run_first_level_search applies it only when "location" is in search.first_level.hard_filters (default keeps age/sex/overall_status, so the behavior is unchanged unless opted in).
- settings: hard_filters Literal accepts "location".

Tests cover recall-safety, the end-to-end hard-filter on/off behavior, and FHIR address extraction. README updated.
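The three keep conditions for the location filter translate directly into a small predicate. This is a sketch under the assumption that countries are compared as case-insensitive strings; `keep_trial_for_location` is a hypothetical name, and mapping between spellings such as "USA" and "United States" is out of scope here.

```python
def keep_trial_for_location(patient_country, trial_site_countries) -> bool:
    """Recall-safe, match-ANY-site country filter."""
    # Unknown patient country: never drop the trial.
    if not patient_country or not patient_country.strip():
        return True
    sites = {
        country.strip().lower()
        for country in (trial_site_countries or [])
        if country and country.strip()
    }
    # No indexed site countries: never drop the trial.
    if not sites:
        return True
    # Otherwise keep the trial when any site is in the patient's country.
    return patient_country.strip().lower() in sites
```

Only the last branch can return False, which is why trials with unknown locations survive the filter.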
The OMOP importer now resolves a patient's location via the LOCATION table (PERSON.location_id -> LOCATION), filling country (country_source_value, or country_concept_id via concept lookup), state, and city. NaN-safe field extraction; degrades to no location when the table or link is absent. This extends the optional country-level location hard filter to OMOP patients (previously FHIR-only). Test added; README updated.
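A minimal sketch of the lookup chain, using plain dictionaries in place of the OMOP tables. The function names, the dictionary-based table shape, and the concept id in the usage note are assumptions for illustration; only the column names come from the description above.

```python
import math


def _is_missing(value) -> bool:
    """True for None and float NaN, the usual 'empty cell' markers."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_value(value):
    """Return a stripped string, or None for None/NaN/blank values."""
    if _is_missing(value):
        return None
    text = str(value).strip()
    return text or None


def resolve_location(person: dict, locations: dict, concept_names: dict):
    """Follow PERSON.location_id -> LOCATION; return None if the link is absent.

    `locations` maps location_id to a row dict; `concept_names` maps a
    concept id to its name. Both stand in for real table lookups.
    """
    location_id = person.get("location_id")
    if _is_missing(location_id):
        return None
    row = locations.get(int(location_id))
    if row is None:
        return None
    # Prefer the source value; fall back to the concept lookup.
    country = clean_value(row.get("country_source_value"))
    if country is None and not _is_missing(row.get("country_concept_id")):
        country = concept_names.get(int(row["country_concept_id"]))
    return {
        "country": country,
        "state": clean_value(row.get("state")),
        "city": clean_value(row.get("city")),
    }
```

For example, a row whose `country_source_value` is NaN but whose `country_concept_id` is a made-up id `1001` mapped to "France" resolves to country "France", while a person with a NaN `location_id` resolves to no location at all.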
Addresses the correctness/coverage gaps found in adversarial review so the importer is trustworthy on real Epic/Cerner R4 output, not just Synthea.

Correctness (incl. one patient-safety bug):
- honor clinicalStatus/verificationStatus/status: resolved/inactive/refuted conditions are now NEGATED (no longer matched as active); entered-in-error, cancelled, and not-done resources are dropped (recorded in `unsupported`). Completed/stopped medications stay un-negated (real prior exposure).
- CodeableConcept: extract ALL codings, ordering known vocabularies (SNOMED/LOINC/RxNorm/ICD) first, so concept linking uses the standard code instead of a proprietary one; secondary codings retained.
- parse_date handles partial FHIR dates (YYYY, YYYY-MM) -> age no longer lost.

Coverage:
- medications via medicationReference / contained Medication / R5 wrapper.
- Observation.component panels, interpretation flags, and value[x] variants (comparator, Range, Ratio, CodeableConcept, string/bool/int/datetime).
- onset/effective/performed Period + recordedDate temporality; cleaner note and dosageInstruction text; broader genomic-observation detection.

Robustness:
- NDJSON loader tolerates malformed lines (skips + logs) instead of aborting.
- reference resolution handles urn:uuid, absolute URLs, and contained refs; orphan resources attributed once, with a logged mapping-failure reason.

9 regression tests added.
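To make the partial-date point concrete, here is a sketch of a parser that accepts YYYY, YYYY-MM, and full dates (with or without a time part). Defaulting a missing month or day to 1 is an assumption of this example, and `parse_partial_date` and `age_in_years` are illustrative names rather than the importer's actual helpers.

```python
import re
from datetime import date

# Year is required; month and day are optional. A trailing time part is ignored.
_FHIR_DATE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?")


def parse_partial_date(value):
    """Parse a FHIR date/dateTime string; missing month/day default to 1."""
    if not value:
        return None
    match = _FHIR_DATE.match(value.strip())
    if not match:
        return None
    year, month, day = match.groups()
    try:
        return date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        # Out-of-range month or day, e.g. "1980-13".
        return None


def age_in_years(birth: date, today: date) -> int:
    """Whole years elapsed between birth and today."""
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - int(before_birthday)
```

With a year-only birth date the computed age can be off by up to a year, which is usually preferable to discarding the age entirely.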
…g-model eligibility fixes

- Add CPU-capable Transformers reranker (TransformersReranker) and eligibility processor (BatchTrialProcessorTransformers) so the pipeline runs without vLLM
- Add no_think config option (rag.no_think) to disable <think> blocks on reasoning models like Qwen3; passes enable_thinking=False to apply_chat_template and falls back to /no_think prefix for plain-prompt paths
- Strip <think>...</think> blocks from model output before JSON extraction so eligibility responses from thinking models are parsed correctly
- Separate reranker_model_path from base_model so a lightweight model can handle second-level reranking while a larger model handles RAG eligibility reasoning
- Add hf/hashing backends to EmbedderSettings, RagSettings.enabled/backend, LLMRerankerSettings.enabled, and ConceptLinkerSettings to settings schema
- Finetuning: multi-GPU support, LoRA config improvements, data pipeline hardening, NER and reranker trainer updates, extended CLI flags
- Registry preparation: hardened FHIR/OMOP importer, improved field normalisation
- LanceDB backend: expose candidate_limit, add FTS fallback path
- Preflight: extended checks for concept index, embedding model, and registry
- Update default entity model to fastino/gliner2-base-v1 across config and settings
- Add tests for embedding, deployment readiness, finetuning, LanceDB backend, patient runtime loading, preflight, and registry updater

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
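The think-block stripping step might look roughly like the following. This is a sketch, not the PR's implementation: `strip_think` and `extract_json` are hypothetical names, and an unterminated `<think>` tag is deliberately left unhandled.

```python
import json
import re

# Non-greedy so that several separate think blocks are each removed.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks from raw model output."""
    return _THINK_BLOCK.sub("", text).strip()


def extract_json(text: str):
    """Return the first JSON object found after stripping think blocks."""
    cleaned = strip_think(text)
    start = cleaned.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores trailing text.
        obj, _end = json.JSONDecoder().raw_decode(cleaned[start:])
    except json.JSONDecodeError:
        return None
    return obj
```

Stripping first matters because the reasoning text can itself contain braces or draft JSON, which a naive "first `{`" search would otherwise pick up.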