Harden deployment readiness by majdabd · Pull Request #9 · cbib/TrialMatchAI

majdabd · 2026-06-19T13:06:12Z

No description provided.

Copilot

Pull request overview

This PR hardens TrialMatchAI’s deployment/readiness story by centralizing Elasticsearch configuration, adding preflight validation, tightening JSON extraction from model outputs, and introducing a containerized (Docker Compose) runtime path while removing committed secrets/generated artifacts.

Changes:

Added preflight checks + CLI entrypoints (healthcheck/run/bootstrap/index) and wired them into setup/CI.
Centralized Elasticsearch indexer config (utils/Indexer/es_config.py) with env overrides and safer cert path handling.
Removed committed secrets/artifacts (Elasticsearch certs/config, Parser input/output samples, tracked IDs) and added secret scanning + ignore rules.

Reviewed changes

Copilot reviewed 136 out of 159 changed files in this pull request and generated 3 comments.


File	Description
utils/Preprocessor/utils.py	Fixes primary completion date extraction and removes unused NER overlap variables.
utils/Indexer/index_trials.py	Switches to shared ES config helpers for consistent client/config creation.
utils/Indexer/index_criteria.py	Switches to shared ES config helpers and updates default index name.
utils/Indexer/es_config.py	New shared ES config loader/client builder with env overrides and CA path resolution.
utils/Indexer/config.json	Removes real password and updates CA cert path to new layout.
utils/finetuning/finetune_instruct/evaluate_gemma2.py	Makes incorrect-prediction output persist to disk and tidies imports.
utils/DataLoader/nct_ids.txt	Removes tracked generated ID file.
tests/test_settings.py	Removes manual `sys.path` manipulation for imports.
tests/test_search_queries.py	Removes manual `sys.path` manipulation for imports.
tests/test_schemas.py	Removes manual `sys.path` manipulation for imports.
tests/test_preflight_and_indexer.py	Adds unit tests covering preflight checks and indexer config env overrides.
tests/test_logging.py	Removes manual `sys.path` manipulation for imports.
tests/test_file_utils_pytest.py	Removes unused import.
tests/test_deployment_readiness.py	Adds tests for config resolution/env overrides + prompt/JSON extraction behavior.
source/Parser/output/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.gner.json	Removes committed generated Parser output artifact.
source/Parser/output/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.biomedner.json	Removes committed generated Parser output artifact.
source/Parser/output/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.gner.json	Removes committed generated Parser output artifact.
source/Parser/output/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.biomedner.json	Removes committed generated Parser output artifact.
source/Parser/output/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.gner.json	Removes committed generated Parser output artifact.
source/Parser/output/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.biomedner.json	Removes committed generated Parser output artifact.
source/Parser/output/6db4a18d11e3899c21b4cc11489cf3c8b457a40273c48ecd39ab4377.PubTator.biomedner.json	Removes committed generated Parser output artifact.
source/Parser/output/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.gner.json	Removes committed generated Parser output artifact.
source/Parser/output/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.biomedner.json	Removes committed generated Parser output artifact.
source/Parser/output/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.gner.json	Removes committed generated Parser output artifact.
source/Parser/output/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.biomedner.json	Removes committed generated Parser output artifact.
source/Parser/output/611019227d9ea65a714df4e8c4498bfc13e21c67879025c3b03d8637.PubTator.biomedner.json	Removes committed generated Parser output artifact.
source/Parser/output/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.gner.json	Removes committed generated Parser output artifact.
source/Parser/output/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.biomedner.json	Removes committed generated Parser output artifact.
source/Parser/output/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.gner.json	Removes committed generated Parser output artifact.
source/Parser/output/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.biomedner.json	Removes committed generated Parser output artifact.
source/Parser/output/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.gner.json	Removes committed generated Parser output artifact.
source/Parser/output/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.biomedner.json	Removes committed generated Parser output artifact.
source/Parser/input/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.gner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/b5d32e3a13ff4d519235aa93bf6faeaa8e80439b77a08925ae9c617c.PubTator.biomedner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.gner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/89e92a6963f2179038668173907fc2a79ae48e7712fe3dddfe49dace.PubTator.biomedner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.gner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/7aed0d619cfb7fb7edf932fc5aeff01e489c3be8482e4c08c26f4de3.PubTator.biomedner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/6db4a18d11e3899c21b4cc11489cf3c8b457a40273c48ecd39ab4377.PubTator.gner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/6db4a18d11e3899c21b4cc11489cf3c8b457a40273c48ecd39ab4377.PubTator.biomedner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.gner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/6d60d5573378fb2ca71a90099fe304ec4d5532cafebec4676f42fe28.PubTator.biomedner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.gner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/61d5392634e0658327dab1a020c904645ec31ef6fbb49b058ee86cdc.PubTator.biomedner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/611019227d9ea65a714df4e8c4498bfc13e21c67879025c3b03d8637.PubTator.gner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/611019227d9ea65a714df4e8c4498bfc13e21c67879025c3b03d8637.PubTator.biomedner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.gner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/525bbc5e481bdaf825bf80725e1ac63c15786fa13120fe734d703c0e.PubTator.biomedner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.gner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/427736011957cbb0ce549f492b0330b9d77ba984a2beb7b2abca8453.PubTator.biomedner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.gner.PubTator	Removes committed generated Parser input artifact.
source/Parser/input/1cd9760fe682423e9e3c37d70f3e932d45259c7c9d4b6e911fbf9b42.PubTator.biomedner.PubTator	Removes committed generated Parser input artifact.
source/Parser/biomedner_init.py	Renames locals and removes unused variables to reduce lint noise.
source/Parser/biomedner_engine.py	Removes stray `print(os.getcwd())` debug output.
source/Matcher/utils/json_utils.py	Adds balanced-brace JSON extraction helper for model output parsing.
source/Matcher/services/preflight.py	Adds preflight validation for paths, ES reachability, models, and optional vLLM requirements.
source/Matcher/services/elasticsearch_service.py	Adds helper to build an ES client from config + certs.
source/Matcher/services/biomedner_service.py	Gates BioMedNER auto-start behind `services.auto_start`.
source/Matcher/pipeline/phenopacket_processor.py	Switches LLM JSON parsing to balanced-brace extraction helper.
source/Matcher/pipeline/cot_reasoning.py	Removes consent injection text and uses robust JSON extraction + error output persistence.
source/Matcher/pipeline/cot_reasoning_vllm.py	Removes consent injection text and uses robust JSON extraction + error output persistence.
source/Matcher/models/llm/vllm_loader.py	Disables remote code by default and supports explicit trust/revision configuration.
source/Matcher/models/llm/llm_reranker.py	Adds revision + trust_remote_code plumbing to reranker loading.
source/Matcher/models/llm/llm_loader.py	Adds revision + trust_remote_code plumbing to base model loading.
source/Matcher/models/embedding/text_embedder.py	Adds revision + trust_remote_code plumbing to embedder loading.
source/Matcher/main.py	Adds preflight execution and better exit codes; defers BioMedNER import until after readiness checks.
source/Matcher/config/settings.py	Expands env overrides, adds new settings fields, and refactors nested override handling.
source/Matcher/config/config.json	Updates default paths, disables auto-start defaults, standardizes index names, adds revision/trust flags.
source/Matcher/config/config_loader.py	Adds robust config path resolution + repo-root `.env` loading + path normalization.
source/Matcher/cli/run.py	New CLI entrypoint for running the batch pipeline.
source/Matcher/cli/index_data.py	New CLI entrypoint that runs the indexing script from repo root.
source/Matcher/cli/healthcheck.py	Switches to preflight checks + unified ES client builder; adds optional index requirement.
source/Matcher/cli/bootstrap_data.py	New CLI entrypoint that runs the bootstrap script from repo root.
setup.sh	Uses packaged CLI commands (via `uv run` when available) for bootstrap and indexing.
scripts/start_es.sh	Supports both `docker compose` and legacy `docker-compose`.
scripts/scan_secrets.py	Adds tracked-file secret scanning for CI/local checks.
scripts/bootstrap_data.sh	Adds checksum verification and safer archive extraction helpers.
requirements.txt	Restructures dependencies and documents extras (gpu/llm/training).
README.md	Updates deployment guidance, security posture, CLI usage, and checks.
pyproject.toml	Updates Python requirement, dependencies/extras, script entrypoints, and package data.
Makefile	Adds targets for GPU sync, auditing, and running new CLIs.
elasticsearch/tmp-config/roles.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/role_mapping.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/log4j2.properties	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/log4j2.file.properties	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/jvm.options	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/elasticsearch.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/tmp-config/elasticsearch-plugins.example.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/docker-compose.yml	Adjusts cert volume mount to allow setup container to write certs.
elasticsearch/config/es03/roles.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/role_mapping.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/log4j2.properties	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/jvm.options	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/es03.key	Removes tracked TLS private key material.
elasticsearch/config/es03/es03.crt	Removes tracked TLS cert material.
elasticsearch/config/es03/elasticsearch.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/elasticsearch-plugins.example.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es03/ca.crt	Removes tracked CA cert material.
elasticsearch/config/es02/roles.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/role_mapping.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/log4j2.properties	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/jvm.options	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/es02.key	Removes tracked TLS private key material.
elasticsearch/config/es02/es02.crt	Removes tracked TLS cert material.
elasticsearch/config/es02/elasticsearch.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/elasticsearch-plugins.example.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es02/ca.crt	Removes tracked CA cert material.
elasticsearch/config/es01/roles.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/role_mapping.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/log4j2.properties	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/jvm.options	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/es01.key	Removes tracked TLS private key material.
elasticsearch/config/es01/es01.crt	Removes tracked TLS cert material.
elasticsearch/config/es01/elasticsearch.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/elasticsearch-plugins.example.yml	Removes tracked Elasticsearch generated/config scaffolding.
elasticsearch/config/es01/ca.crt	Removes tracked CA cert material.
elasticsearch/certs/instances.yml	Removes tracked certutil instance configuration.
elasticsearch/certs/es03/es03.key	Removes tracked TLS private key material.
elasticsearch/certs/es03/es03.crt	Removes tracked TLS cert material.
elasticsearch/certs/es02/es02.key	Removes tracked TLS private key material.
elasticsearch/certs/es02/es02.crt	Removes tracked TLS cert material.
elasticsearch/certs/es01/es01.key	Removes tracked TLS private key material.
elasticsearch/certs/es01/es01.crt	Removes tracked TLS cert material.
elasticsearch/certs/ca/ca.key	Removes tracked CA private key material.
elasticsearch/certs/ca/ca.crt	Removes tracked CA cert material.
elasticsearch/certs/ca.crt	Removes tracked CA cert material.
elasticsearch/apptainer-run-es.sh	Adds checks to fail fast if TLS certs are missing for Apptainer ES startup.
elasticsearch/.env.example	Adds example env template for local ES stack.
elasticsearch/.env	Removes committed real credentials.
Dockerfile	Adds a container build for TrialMatchAI worker using `uv sync` + GPU extra.
docker-compose.yml	Adds root-level Compose stack for local ES + worker (+ optional Kibana).
.gitignore	Expands ignore rules to prevent committing env/certs/parser outputs/indexing state.
.github/workflows/ci.yml	Updates CI to include linting, secret scan, dependency audit, compose validation, and docker build.
.env.example	Adds root env template for runtime configuration.
.dockerignore	Prevents shipping secrets/artifacts into Docker build context.


+        if char == "{":
+            depth += 1
+        elif char == "}":
+            depth -= 1
+            if depth == 0:
+                return json.loads(text[start : index + 1])


+            issues.append(
+                f"Elasticsearch is not reachable at {config['elasticsearch']['host']}."
+            )


+      ELASTICSEARCH_HOSTS: https://elasticsearch:9200
+      ELASTICSEARCH_USERNAME: kibana_system
+      ELASTICSEARCH_PASSWORD: ${KIBANA_PASSWORD:-}
+      ELASTICSEARCH_SSL_CERTIFICATEAUTHORITIES: config/certs/ca/ca.crt


Safety net before the audit-driven refactor (see REFACTOR_PLAN.md).

- CI: add ml-extras-smoke job that runs `uv sync --extra entity`, then imports the heavy libs (torch/transformers/gliner/gliner2) and the six local-model modules the default job never exercises, so a broken import or API drift is caught instead of shipping silently.
- CI: keep the installed-smoke wheel path absolute ("$PWD"/dist) so $WHEEL survives the cd into /tmp.
- tests: characterization tests pinning the current (buggy) score_trial behavior, plus a strict xfail encoding the PR1 contract (a Violated exclusion must hard-disqualify).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Previously score_trial averaged the inclusion ratio with the exclusion ratio, so a trial whose exclusion criteria were Violated could still outrank an eligible trial (audit finding C1). The classification lists also counted impossible labels (inclusion can never be "Not Violated", exclusion can never be "Met").

score_trial now:

- returns DISQUALIFIED_SCORE (-1.0) if any exclusion is "Violated", so it ranks strictly below every eligible trial;
- otherwise scores eligible trials in [0, 1] by the fraction of decided inclusion criteria (Met or Not Met) that are Met.

The prompts already constrain the label sets correctly, so only the scorer changed; the duplicated prompt text is addressed in PR7. Contract tests (disqualification, met-fraction, ranking order) replace the prior xfail.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
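
The scoring contract in this commit message can be sketched roughly as below. The signature (two lists of label strings) and the 0.0 fallback when no inclusion criterion is decided are assumptions for illustration; only `DISQUALIFIED_SCORE`, its value, and the label names come from the commit message.

```python
from typing import Sequence

DISQUALIFIED_SCORE = -1.0  # Value stated in the commit message.


def score_trial(
    inclusion_labels: Sequence[str], exclusion_labels: Sequence[str]
) -> float:
    """Hypothetical signature: score a trial from per-criterion labels."""
    # Any violated exclusion hard-disqualifies the trial.
    if any(label == "Violated" for label in exclusion_labels):
        return DISQUALIFIED_SCORE

    # Only decided inclusion criteria count toward the fraction.
    decided = [label for label in inclusion_labels if label in ("Met", "Not Met")]
    if not decided:
        return 0.0  # Assumption: the source does not say what happens here.
    met = sum(1 for label in decided if label == "Met")
    return met / len(decided)
```

Because eligible trials score in [0, 1] and disqualified ones score -1.0, a plain descending sort places every disqualified trial below every eligible one.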

The reranker re-derived device/dtype/attention setup independently of llm_loader and got several wrong. Extract the shared logic into models/llm/_common.py and rebuild both loaders on it.

Reranker fixes (audit C2 + highs/mediums):

- device_map pinned to the resolved device instead of "auto", so the model's first layer lives where inputs are moved (fixes multi-GPU "tensors on different devices" crash);
- left padding + pad token via configure_decoder_tokenizer, so logits[:, -1, :] reads the last real token, not a PAD position;
- FlashAttention-2 -> SDPA fallback instead of hardcoding flash_attn;
- compute dtype defaults to bf16/fp16 by capability instead of fp16;
- device accepts int or "auto" (no silent coercion of non-ints to GPU 0);
- dropped the ThreadPoolExecutor + model_lock that serialized anyway.

llm_loader now reuses the same helpers. Stub-based unit tests cover the tokenizer/device/dtype/attention logic without the llm extra.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Re-grepped each target to confirm zero live references (only auto-generated egg-info/SOURCES.txt and the now-removed test/whitelist referenced them):

- utils/evaluation.py: orphaned TREC eval, no caller, no entry point.
- models/embedding/query_embedder.py, sentence_embedder.py: vestigial TextEmbedder subclasses nothing constructs.
- preprocessing/regex/ tree: regex resource files with no Python consumer (leftovers from the deleted src/Matcher); also drop the package-data globs in pyproject and the "preprocessing" entry in the config_loader resource whitelist.
- matching/phenopacket_processor.py (+ its test): superseded by the canonical interop/importers/phenopacket.py + interop/narrative.py path. This deletion also resolves the dead ClinicalSummarizer, the ontology label bug, the always-true temporal guards, the truncate typo, and the duplicate phenopacket pipeline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Verified-safe in-file dead code (re-grepped each before removal):

- recognizers.with_schema_threshold (+ now-unused dataclasses.replace import)
- types.EntityAnnotation.to_index_entity
- utils/retry.py (with_retries) + its test: only the test used it; production retries go through tenacity
- interop EvidenceSpan model, Provenance.raw_text_span field, and the PatientProfile.all_facts helper: none populated, read, or referenced (Provenance keeps extra="allow", so dropping the field is safe)
- narrative.render_patient_narrative: dead style="audit" branch and the unused style parameter (only caller passed "rag")
- annotator.annotate_texts_in_parallel: dead retries/delay params (accepted then immediately del'd; no caller supplies them)
- criteria_retrieval.rerank_criteria: unused `queries` parameter (body keys off criterion["query"]); updated the call site

Legacy shim removed:

- CompatibilityEntityAnnotator was an empty SchemaEntityAnnotator subclass; build_entity_annotator now returns SchemaEntityAnnotator directly, and the export/test were repointed.

Deferred: max_text_score -> PR6 (folded into the create_query rework); dead config settings (cot/LLM_reranker/TokenizerSettings/entity_extraction.threshold) -> PR9, since they are entangled with config.json + the env-override map and carry config-validation risk that does not belong in a deletion PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The join compared each child table's raw person_id against the patient's *sanitized* profile id via astype(str). Two real failures:

1. a null person_id in a child table promotes the column to float64, so person_id 1 serialized as "1.0" and "1.0" != "1" dropped every condition/measurement/drug/procedure/observation/note for that patient;
2. a person_id needing sanitization ("pat 01" -> "pat-01") never matched the raw child value.

Now join on the raw person_id via _normalize_join_id (handles float promotion and string/int mismatch) and pre-group each child table once with _group_by_person, so each patient is an O(1) lookup instead of a full astype scan per table (removes the O(P*R) N+1). _concept_lookup also switches off iterrows. Profile ids are still sanitized for display/provenance.

Two regression tests (null-person_id float promotion; sanitized id) fail on the old code and pass now.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
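
A rough sketch of the two ideas in this commit message, normalizing the join key and grouping each child table once, is shown below. It works on plain dict rows rather than pandas frames so it stays dependency-free, and `normalize_join_id` and `group_by_person` are illustrative stand-ins, not the PR's `_normalize_join_id` and `_group_by_person`.

```python
import math
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping, Optional


def normalize_join_id(value: Any) -> Optional[str]:
    """Map 1, 1.0 and "1" to the same key; treat None/NaN as missing."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None
        if value.is_integer():
            # Undo float promotion caused by nulls in the column: 1.0 -> "1".
            return str(int(value))
        return str(value)
    text = str(value).strip()
    return text or None


def group_by_person(
    rows: Iterable[Mapping[str, Any]],
) -> Dict[str, List[Mapping[str, Any]]]:
    """Group child-table rows once so each patient is a dict lookup."""
    grouped: Dict[str, List[Mapping[str, Any]]] = defaultdict(list)
    for row in rows:
        key = normalize_join_id(row.get("person_id"))
        if key is not None:
            grouped[key].append(row)
    return dict(grouped)
```

Usage would look like `conditions = group_by_person(condition_rows)` once per table, followed by `conditions.get(normalize_join_id(raw_person_id), [])` per patient.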

1. prepare_trial_document now writes detailed_description and official_title. The backend weights both in TRIAL_TEXT_WEIGHTS (1.5 / 1.0) but they were never indexed, so the weights were dead and trials were under-indexed. (Requires rebuilding the LanceDB trial index to take effect.)
2. trial_ranker.load_trial_data only loads NCT-named files. It previously globbed every *.json in the output folder, scoring run sidecars (keywords/patient_profile/first_level_scores/rag_output) as bogus 0.0 trials.
3. _scan_rows fallback now applies the nct_id WHERE filter (table.search().where(...)). When FTS and vector both returned nothing it scanned an unfiltered head slice that could exclude the requested trials. Verified table.search().where() against lancedb 0.25.3.
4. create_query drops the 7 keys the backend never reads (age/sex/overall_status/pre_selected_nct_ids/vector_score_threshold/max_text_score/search_mode) and the misleading age=0-vs-None contract; search_trials already passes those filters to the backend directly. Removes the deferred max_text_score param.

Regression tests: scan-fallback nct filter, NCT-only loading (+ sidecars skipped), minimal create_query contract, and prepared-doc field preservation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
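
Item 2 (NCT-only loading) might be sketched as follows. The exact filename convention is an assumption: the sketch takes per-trial files to be named `NCT` plus eight digits plus `.json`, and `is_trial_file` and `list_trial_files` are hypothetical names rather than the PR's code.

```python
import re
from pathlib import Path
from typing import List

# Assumption: per-trial outputs are named like NCT01234567.json.
NCT_FILE_PATTERN = re.compile(r"^NCT\d{8}\.json$")


def is_trial_file(filename: str) -> bool:
    """True for NCT-named JSON files, False for run sidecars."""
    return NCT_FILE_PATTERN.match(filename) is not None


def list_trial_files(output_dir: Path) -> List[Path]:
    """Return only the NCT-named JSON files in output_dir, sorted by name."""
    return sorted(p for p in output_dir.glob("*.json") if is_trial_file(p.name))
```

Filtering by name before loading keeps sidecars such as `keywords.json` or `rag_output.json` from ever reaching the scorer.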

Behavior-preserving consolidation of copy-pasted logic:

- models/embedding.build_embedder(config): single embedder factory replacing the identical TextEmbedder(TextEmbedderConfig(...)) block in main.py and the index/build-concepts/update-registry CLIs.
- matching/retrieval/synonyms.disease_synonyms(): one disease-synonym extractor that ClinicalTrialSearch and SecondStageRetriever now delegate to.
- utils/text.flatten_text(): single whitespace-normalizing flatten used by both registry.preparation and search.lancedb_backend (the two _flatten_text copies). Behavior-preserving for preparation since _preprocess_text re-normalizes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The HF (eligibility_reasoning) and vLLM (eligibility_reasoning_vllm) CoT processors duplicated the ~90-line prompt template, _load_trial_data, _save_outputs, and the worklist/length-bucketing orchestration verbatim: the largest duplication in the repo and a drift risk for the scoring contract.

New matching/eligibility_base.BaseTrialProcessor holds the shared prompt, trial I/O, output persistence, and process_trials skeleton (parameterized by the _token_length and _progress_desc hooks). Each backend now subclasses it and implements only __init__ and _process_batch (plus vLLM's LoRA/token-length helpers). Prompt output is byte-identical; behavior preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…records

- main_pipeline only loads (and half()s) the HuggingFace CoT model when cot_backend != "vllm". Under the vLLM backend run_rag_processing loads its own engine and ignored the HF model, so it was wasting GPU memory and load time (and could OOM alongside the vLLM engine).
- _rank_trial_rows / _rank_criteria_rows skip rebuilding the trial/criteria record when the candidate row read from the index already has the derived fields (search_text); raw in-memory docs still get built. Avoids recomputing search_text/search_vector for every candidate on every query.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- logging_config: configure the root handler exactly once instead of building a new StreamHandler on every setup_logging() call (~all modules call it at import). Per-logger request-id ContextFilter behavior is unchanged.
- import_patient: build and pass an embedder to the entity annotator so concept linking can use semantic search instead of silently degrading to lexical-only (still degrades gracefully to no-entities when ML extras are absent).
- pyproject: drop 5 dependencies never imported in src (regex, python-dotenv, rich, bioregistry, rapidfuzz); regenerated uv.lock. numpy/pyarrow kept as real transitive runtime needs of pandas/lancedb.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@staticmethod

Lets users train their own models and plug them straight into the pipeline, instead of only relying on the vanilla checkpoints. The integration path already existed via config (entity_extraction.model_name, model.reranker_adapter_path, model.cot_adapter_path); this adds the missing training side, modernized for the current architecture (GLiNER for NER, LoRA for reranker/CoT).

New trialmatchai.finetuning package (heavy deps imported lazily):

- config.FinetuneConfig: shared LoRA SFT hyper-parameters.
- data: JSONL loaders + prompt builders that REUSE the runtime prompts (LLMReranker.create_messages, chat templates) so train == inference; plus a char-span -> GLiNER token-span converter.
- _sft.run_sft: LoRA SFT loop with prompt-masked labels (loss on completion).
- cot/reranker/ner: thin task fine-tuners producing a LoRA adapter (cot, reranker) or a GLiNER checkpoint (ner).
- cli: `trialmatchai-finetune {cot,reranker,ner}` console command.

pyproject: new `finetune` optional extra (torch/transformers/peft/accelerate/datasets/gliner[2]/bitsandbytes) + the console entry point. LLMReranker.create_messages is now a @staticmethod so finetuning reuses it without loading a model. CI imports the finetuning modules and smoke-tests the new CLI. Tests cover data conversion, prompt reuse, and CLI parsing (CPU-only).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- README: add an architecture/"how it works" overview, a "bring your own models" section (custom checkpoints/adapters via config), the new trialmatchai-finetune command, an extras table, and a navigable layout. - docs/finetuning.md: data formats, commands, and plug-back-in steps for the NER (GLiNER), reranker (LoRA Yes/No), and CoT (LoRA SFT) fine-tuners. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The audit-driven refactor (PR0-PR9), fine-tuning integration, and README/docs updates are complete and merged; the tracking plan is no longer needed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…eted) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

It exercises heavy third-party deps (torch/gliner) on a Linux runner that we don't gate releases on; keep it as a signal without failing the workflow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Research-driven corrections to the fine-tuning integration:

- NER now uses the real GLiNER2 stack (GLiNER2Trainer / TrainingConfig / InputExample with entities={label: [surface forms]}), not the gliner-v1 token-classification API. entity_descriptions are back-filled from the entity schema so a fine-tuned model keeps the runtime label semantics. Saves a LoRA adapter (output_dir/final) or full checkpoint (output_dir/best); both load via entity_extraction.model_name. CLI exposes encoder-lr/task-lr/lora/schema-path.
- data: NER converter emits GLiNER2 surface-form schema (accepts {entities}, char spans, or native {input,output}) instead of token spans.
- _sft: add prepare_model_for_kbit_training + gradient checkpointing + a paged 8-bit optimizer + cosine schedule, required for stable, memory-efficient 4-bit (QLoRA) training of the reranker and CoT adapters.
- docs/finetuning.md updated to the GLiNER2 data format and flags.

Sources: GLiNER2 training + LoRA tutorials (fastino-ai/GLiNER2); QLoRA/PEFT fine-tuning best practices.

@staticmethod

Per direction: every LLM runs on vLLM (no HuggingFace backend), and fine-tuned LoRA adapters are served everywhere via vLLM's LoRARequest.

Removed the HuggingFace LLM path entirely:

- delete matching/eligibility_reasoning.py (HF CoT processor), models/llm/llm_loader.py, models/llm/_common.py, and tests/test_llm_common.py.
- run_rag_processing always builds a vLLM engine; main_pipeline no longer loads or half()s an HF model. Drop the cot_backend config switch (settings, config.json, env map) and make preflight's vLLM/GPU check unconditional under require_models.

Reranker now runs on vLLM:

- rewrite LLMReranker to score (patient, criterion) pairs by constraining generation to the Yes/No tokens (allowed_token_ids) and reading their logprobs; a configured reranker_adapter_path is served via LoRARequest. create_messages is a @staticmethod (reused by the fine-tuner without loading a model).

Adapters end to end:

- CoT and reranker both load their LoRA adapters through vLLM (cot_adapter_path / reranker_adapter_path); NER LoRA loads via GLiNER2.load_adapter.
- new `trialmatchai-finetune merge` to merge a LoRA adapter into the base model for users who prefer a standalone checkpoint over base+adapter.

The embedder stays on transformers (it is an embedding model, not a generative LLM). Docs/README updated; CI import smoke adjusted to the vLLM modules.

Reintegrates domain knowledge dropped in the migration, folded into the current architecture rather than restored verbatim.

Criteria chunking (registry/criteria_chunking.py):

- replaces the generic line-splitter in normalization.split_eligibility_criteria with a single pass that understands multi-level enumeration hierarchies (1, 1.2, 1.2.3, (a), roman), varied inclusion/exclusion headers (incl. inline "Exclusion Criteria: 1. ... 2. ..."), parenthetical protection, decimal/abbreviation split-exceptions, and continuation-line joining. 8 new tests.

Genetic-variant recognizer (entities/recognizers.py):

- restore the curated variant pattern table as entities/resources/variant_patterns.tsv (HGVS mutations, fusions, chromosome arms, ...).
- RegexVariantRecognizer matches these deterministically (e.g. p.V600E, c.1799T>A) and CompositeRecognizer runs it alongside the GLiNER model, merging by confidence/length so precise variant spans the model would miss are still captured. On by default (entity_extraction.variant_regex). 5 tests.

Packaging: variant_patterns.tsv added to package-data; both load correctly from the built wheel.
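
To make the deterministic variant matching concrete, here is a small sketch. The two patterns are simplified illustrations chosen to match the examples named above (p.V600E, c.1799T>A); they are not taken from variant_patterns.tsv, and `find_variants` is a hypothetical function, not RegexVariantRecognizer.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; the curated table in the PR is much broader.
VARIANT_PATTERNS = [
    ("protein_change", re.compile(r"\bp\.[A-Z]\d+[A-Z]\b")),
    ("cdna_substitution", re.compile(r"\bc\.\d+[ACGT]>[ACGT]\b")),
]


def find_variants(text: str) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, label, matched_text) tuples in document order."""
    found = []
    for label, pattern in VARIANT_PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group(0)))
    return sorted(found)
```

Returning character offsets alongside the label is what would let a composite recognizer merge these spans with model predictions by position.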

Replace the arXiv reference with an explicit "please cite" message pointing to the Nature Communications paper (with a BibTeX entry); keep the Zenodo DOI as the software archive.

…ctions

Adds the multi-channel first-level query planner and the eligibility-constraint extraction/evaluation subsystem, and corrects the bugs found in review:

Constraints
- exclusion criteria the patient cleanly passes now reward instead of scoring neutral (relation "absent": exclusion -> not-violated, inclusion -> neutral)
- biomarker evaluation is wild-type-safe (a negative/wild-type patient no longer satisfies a "present"/"mutated" requirement)
- age regexes require an explicit year unit or age cue (no more "100 to 200 mg" -> age); comparator-less labs are skipped (no silent ge default)
- unknown_is_neutral config is now honored (penalizes unconfirmable inclusions)

First-level planner
- hard_filters config is actually applied (and [] truly disables filters); age parsing degrades gracefully instead of aborting the search; a guard logs when all channels return nothing; disease_synonyms output is deterministic

Other
- restore the entity_extraction.variant_regex knob broken by extra="forbid"
- remove the vestigial GLiNER-v1 backend

Regression tests added for each fix.
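The age-regex fix can be sketched as follows. This is an assumed reconstruction of the idea (a numeric range counts as an age only with a year unit or an explicit age cue), not the project's actual patterns; `parse_age_range` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

# A range followed by a year unit, e.g. "18 to 65 years".
_WITH_UNIT = re.compile(
    r"\b(\d{1,3})\s*(?:-|to|and)\s*(\d{1,3})\s*(?:years?|yrs?)\b", re.IGNORECASE
)
# A range preceded by an age cue, e.g. "Aged 18-75".
_WITH_CUE = re.compile(
    r"\bage[sd]?\b\D{0,15}?(\d{1,3})\s*(?:-|to|and)\s*(\d{1,3})\b", re.IGNORECASE
)


def parse_age_range(text: str) -> Optional[Tuple[int, int]]:
    """Return (min_age, max_age) only when the text clearly refers to age."""
    for pattern in (_WITH_UNIT, _WITH_CUE):
        match = pattern.search(text)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None  # e.g. a dose range such as "100 to 200 mg"
```

The word boundary before `age` keeps words like "dosage" from acting as an age cue.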

Adds geographic location as an opt-in first-level hard filter, done site-aware and recall-safe rather than as a blunt cut.
- interop: PatientProfile gains a Location (country/state/city); the FHIR importer populates it from Patient.address.
- matching/retrieval/location.py: country-level, match-ANY-site filtering. A trial is kept when the patient's country is unknown, when the trial has no indexed site countries, or when any of its sites is in the patient's country, so trials with unknown locations are never dropped.
- run_first_level_search applies it only when "location" is in search.first_level.hard_filters (default keeps age/sex/overall_status, so the behavior is unchanged unless opted in).
- settings: hard_filters Literal accepts "location".

Tests cover recall-safety, the end-to-end hard-filter on/off behavior, and FHIR address extraction. README updated.
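The three keep-conditions above reduce to a small predicate. The sketch below is an assumption about shape only: `keep_trial_for_location` is a hypothetical name, and the case-insensitive comparison is an added assumption rather than something the commit states.

```python
from typing import Iterable, Optional


def keep_trial_for_location(
    patient_country: Optional[str],
    site_countries: Iterable[Optional[str]],
) -> bool:
    """Recall-safe, match-ANY-site country filter."""
    # Unknown patient country: never filter.
    if not patient_country or not patient_country.strip():
        return True
    # Normalize site countries, ignoring missing or blank values.
    sites = {c.strip().casefold() for c in site_countries if c and c.strip()}
    # Trial with no indexed site countries: never filter.
    if not sites:
        return True
    # Otherwise keep the trial if any site is in the patient's country.
    return patient_country.strip().casefold() in sites
```

Only a trial whose sites are all known and all elsewhere is dropped, which is what keeps the filter from reducing recall on sparsely indexed trials.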

The OMOP importer now resolves a patient's location via the LOCATION table (PERSON.location_id -> LOCATION), filling country (country_source_value, or country_concept_id via concept lookup), state, and city. NaN-safe field extraction; degrades to no location when the table or link is absent. This extends the optional country-level location hard filter to OMOP patients (previously FHIR-only). Test added; README updated.
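A minimal sketch of the PERSON.location_id -> LOCATION lookup, using plain dictionaries in place of the real tables. The helper names (`clean`, `resolve_location`) and the in-memory representation are assumptions for illustration; only the column names come from the commit message.

```python
import math
from typing import Any, Dict, Mapping, Optional


def clean(value: Any) -> Optional[str]:
    """NaN-safe field extraction: None, NaN, and blank strings become None."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    text = str(value).strip()
    return text or None


def resolve_location(
    person: Mapping[str, Any],
    locations: Mapping[Any, Mapping[str, Any]],
    concept_names: Optional[Mapping[Any, str]] = None,
) -> Optional[Dict[str, Optional[str]]]:
    """Follow PERSON.location_id into LOCATION; return None when the link is absent."""
    location_id = person.get("location_id")
    if location_id is None or (isinstance(location_id, float) and math.isnan(location_id)):
        return None
    row = locations.get(location_id)
    if row is None:
        return None
    # Prefer the source value; fall back to a concept-name lookup.
    country = clean(row.get("country_source_value"))
    if country is None and concept_names is not None:
        country = clean(concept_names.get(row.get("country_concept_id")))
    result = {"country": country, "state": clean(row.get("state")), "city": clean(row.get("city"))}
    return result if any(result.values()) else None
```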

Addresses the correctness/coverage gaps found in adversarial review so the importer is trustworthy on real Epic/Cerner R4 output, not just Synthea.

Correctness (incl. one patient-safety bug):
- honor clinicalStatus/verificationStatus/status: resolved/inactive/refuted conditions are now NEGATED (no longer matched as active); entered-in-error, cancelled, and not-done resources are dropped (recorded in `unsupported`). Completed/stopped medications stay un-negated (real prior exposure).
- CodeableConcept: extract ALL codings, ordering known vocabularies (SNOMED/LOINC/RxNorm/ICD) first, so concept linking uses the standard code instead of a proprietary one; secondary codings retained.
- parse_date handles partial FHIR dates (YYYY, YYYY-MM) -> age no longer lost.

Coverage:
- medications via medicationReference / contained Medication / R5 wrapper.
- Observation.component panels, interpretation flags, and value[x] variants (comparator, Range, Ratio, CodeableConcept, string/bool/int/datetime).
- onset/effective/performed Period + recordedDate temporality; cleaner note and dosageInstruction text; broader genomic-observation detection.

Robustness:
- NDJSON loader tolerates malformed lines (skips + logs) instead of aborting.
- reference resolution handles urn:uuid, absolute URLs, and contained refs; orphan resources attributed once, with a logged mapping-failure reason.

9 regression tests added.
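To make the partial-date point concrete, here is a small sketch of parsing YYYY, YYYY-MM, and full FHIR dates so that an age can still be computed. It is not the importer's parse_date: the function names are hypothetical, and defaulting a missing month or day to 1 is an assumption made for illustration.

```python
import re
from datetime import date
from typing import Optional

# YYYY, YYYY-MM, YYYY-MM-DD, optionally followed by a time part.
_FHIR_DATE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?(?:T.*)?$")


def parse_partial_date(value: Optional[str]) -> Optional[date]:
    """Parse a possibly partial FHIR date; missing month/day default to 1 (assumption)."""
    if not value:
        return None
    match = _FHIR_DATE.match(value.strip())
    if not match:
        return None
    year, month, day = (int(g) if g else 1 for g in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13
        return None


def age_in_years(birth: date, today: date) -> int:
    """Whole years elapsed between birth and today."""
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - (1 if before_birthday else 0)
```

A birthDate of `"1980"` then yields an approximate age rather than none at all, which is the behavior the fix is after.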

…g-model eligibility fixes

- Add CPU-capable Transformers reranker (TransformersReranker) and eligibility processor (BatchTrialProcessorTransformers) so the pipeline runs without vLLM
- Add no_think config option (rag.no_think) to disable <think> blocks on reasoning models like Qwen3; passes enable_thinking=False to apply_chat_template and falls back to /no_think prefix for plain-prompt paths
- Strip <think>...</think> blocks from model output before JSON extraction so eligibility responses from thinking models are parsed correctly
- Separate reranker_model_path from base_model so a lightweight model can handle second-level reranking while a larger model handles RAG eligibility reasoning
- Add hf/hashing backends to EmbedderSettings, RagSettings.enabled/backend, LLMRerankerSettings.enabled, and ConceptLinkerSettings to settings schema
- Finetuning: multi-GPU support, LoRA config improvements, data pipeline hardening, NER and reranker trainer updates, extended CLI flags
- Registry preparation: hardened FHIR/OMOP importer, improved field normalisation
- LanceDB backend: expose candidate_limit, add FTS fallback path
- Preflight: extended checks for concept index, embedding model, and registry
- Update default entity model to fastino/gliner2-base-v1 across config and settings
- Add tests for embedding, deployment readiness, finetuning, LanceDB backend, patient runtime loading, preflight, and registry updater

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
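The think-stripping step can be sketched as below. This is an illustrative version under assumptions, not the project's extraction code: `strip_think` and `extract_json` are hypothetical names, and the first-decodable-object strategy is one possible choice.

```python
import json
import re
from typing import Any, Optional

_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from model output."""
    return _THINK_BLOCK.sub("", text).strip()


def extract_json(text: str) -> Optional[Any]:
    """Return the first JSON object found after stripping <think> blocks."""
    cleaned = strip_think(text)
    decoder = json.JSONDecoder()
    for index, char in enumerate(cleaned):
        if char != "{":
            continue
        try:
            obj, _end = decoder.raw_decode(cleaned, index)
            return obj
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
    return None
```

Stripping first matters because a reasoning block may itself contain draft JSON, which a first-match extractor would otherwise return instead of the final answer.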

Harden deployment readiness

6bebe5e

Copilot AI review requested due to automatic review settings June 19, 2026 13:06

Copilot started reviewing on behalf of majdabd June 19, 2026 13:06

Copilot AI reviewed Jun 19, 2026

majdabd and others added 26 commits June 21, 2026 16:12

Replace BioNER services with schema entity linker

9e8e9da

Replace search service with LanceDB backend

3746f00

Professionalize package and add registry updater

d51239d

Remove Docker deployment artifacts

e20a3dc

Clean up package workflows and bootstrap CLI

40b2a09

Add patient interoperability profiles

3f737b2

Fix lean CI optional ML imports

c4b753e

Harden package release and remove legacy shims

463bfe0

chore: remove REFACTOR_PLAN.md

01b30e8

The audit-driven refactor (PR0-PR9), fine-tuning integration, and README/docs updates are complete and merged; the tracking plan is no longer needed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ci: drop refactor/audit-fixes from push triggers (branch merged & del…

a337868

…eted) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ci: make ml-extras-smoke non-blocking

6b197b8

It exercises heavy third-party deps (torch/gliner) on a Linux runner that we don't gate releases on; keep it as a signal without failing the workflow. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

majdabd and others added 7 commits June 22, 2026 11:26

docs: cite the Nature Communications paper

b1f344a

Replace the arXiv reference with an explicit "please cite" message pointing to the Nature Communications paper (with a BibTeX entry); keep the Zenodo DOI as the software archive.

Harden deployment readiness #9

majdabd wants to merge 34 commits into main from deployment-readiness-audit

2 participants
