feat(medcat):CU-869cy3xa0 Improve training by mart-r · Pull Request #414 · CogStack/cogstack-nlp

mart-r · 2026-04-21T15:34:22Z

This PR overhauls the training setup of MedCAT:

- It modifies the existing TrainableComponent protocol to also include a train_unsupervised method
  - This is used in place of the previous "check config for train and run inference" approach to unsupervised training
- It allows all components that follow the TrainableComponent protocol to be trained in a supervised manner
  - Previously only the linker could be trained in a supervised manner
- It provides a few utility methods to allow training and evaluating components individually
  - I.e. dataset-aware components that make it possible to train or evaluate only the NER or only the linker if/when required
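To make the protocol change concrete, here is a minimal sketch of what such a protocol could look like. It is not MedCAT's actual definition: the `train` keyword arguments are copied from the call shown in the review diff, while the `train_unsupervised` signature, the toy classes and the `trainable` helper are assumptions made for illustration.

```python
from typing import Any, Iterable, Protocol, runtime_checkable


@runtime_checkable
class TrainableComponent(Protocol):
    """Hypothetical stand-in for the protocol described above."""

    def train(self, cui: str, entity: Any, doc: Any,
              negative: bool = False, names: Any = None) -> None:
        """Supervised update for a single annotated entity."""
        ...

    def train_unsupervised(self, docs: Iterable[Any]) -> None:
        """Unsupervised update over documents (assumed signature)."""
        ...


class CountingLinker:
    """Toy component that only counts how often it is trained."""

    def __init__(self) -> None:
        self.supervised_calls = 0
        self.unsupervised_docs = 0

    def train(self, cui, entity, doc, negative=False, names=None) -> None:
        self.supervised_calls += 1

    def train_unsupervised(self, docs) -> None:
        for _ in docs:
            self.unsupervised_docs += 1


class InferenceOnlyNER:
    """Toy component without any training methods."""


def trainable(components):
    """Return only the components that structurally satisfy the protocol."""
    return [c for c in components if isinstance(c, TrainableComponent)]
```

Because the check is structural, any component exposing both methods is picked up for training without having to inherit from a common base class.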

Example code snippets:

When only training linker

with dataset_aware_component(cat, CoreComponentType.ner, DATASET):
    trainer.train_supervised_raw(DATASET, nepochs=1)

When only training NER

with dataset_aware_component(cat, CoreComponentType.linking, self.DATASET):
    trainer.train_unsupervised([doc['text'] for proj in self.DATASET['projects'] for doc in proj['documents']], nepochs=1)

When doing evaluation / stats one component at a time

with dataset_aware_component(cat, CoreComponentType.ner, self.DATASET):
    fps, fns, tps, cui_prec, cui_rec, cui_f1, cui_counts, examples = get_stats(
        cat, self.DATASET, do_print=False)
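As a rough illustration of how the per-concept metrics relate to the raw counts, the sketch below derives precision, recall and F1 from true positive, false positive and false negative counts. It assumes the counts are dictionaries keyed by CUI; `per_cui_scores` is a hypothetical helper and is not claimed to match how `get_stats` computes its values.

```python
def per_cui_scores(tps, fps, fns):
    """Compute (precision, recall, f1) per CUI from count dictionaries.

    Assumption: each argument maps a CUI string to an integer count.
    """
    scores = {}
    for cui in set(tps) | set(fps) | set(fns):
        tp = tps.get(cui, 0)
        fp = fps.get(cui, 0)
        fn = fns.get(cui, 0)
        # Guard against division by zero when a CUI has no predictions
        # (tp + fp == 0) or no gold annotations (tp + fn == 0).
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[cui] = (prec, rec, f1)
    return scores
```

With the NER swapped for a dataset-aware component, any remaining errors in such counts come from the linker alone, which is what makes evaluating one component at a time informative.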

adam-sutton-1992 · 2026-04-24T21:54:16Z

+            component.train(cui=cui, entity=mut_entity, doc=mut_doc,
+                            negative=negative, names=names)


Does this mean we're still doing unsupervised training on a per-entity basis? I can't think of a case where you would need the entity when training in an unsupervised manner.

This is supervised training still. This is within the add_and_train_concept method.

adam-sutton-1992 · 2026-04-24T21:54:27Z

A few queries but I think it looks good. I might've missed these within the commits:

If you have two trainable components, is it possible to turn off training for one of them when running training methods? Do the dataset-aware components serve that purpose?

And one more above^^^

mart-r · 2026-04-25T05:33:00Z

If you have two trainable components, is it possible to turn off training for one of them when running training methods? Do the dataset-aware components serve that purpose?

The description already had 2 examples for this :)

The dataset-aware implementation can serve that purpose, because it replaces the specific component with another one for the duration of the context manager. (The replacement isn't trainable, but that's kind of irrelevant since it's a different component.)

But I think what makes it unclear is that in the example I've given it a dataset, whereas realistically you could provide an empty dataset for it, like this:

# supervised
with dataset_aware_component(cat, CoreComponentType.ner, {"projects" : []}):
    trainer.train_supervised_raw(DATASET, nepochs=1)
# unsupervised
with dataset_aware_component(cat, CoreComponentType.ner, {"projects" : []}):
    trainer.train_unsupervised(["list", "of", "texts"], nepochs=1)
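The swap-and-restore behaviour described above can be sketched roughly as follows. This is not the PR's implementation: the pipeline is modelled as a plain dict, `swapped_component` and `FixedAnnotationsNER` are hypothetical names, and the annotation keys (`annotations`, `start`, `end`, `cui`) are assumed rather than taken from the PR.

```python
from contextlib import contextmanager


class FixedAnnotationsNER:
    """Hypothetical stand-in that returns dataset spans instead of predicting."""

    def __init__(self, dataset):
        # Map document text -> list of (start, end, cui) from the dataset.
        # An empty dataset ({"projects": []}) yields an empty mapping.
        self.annotations = {
            doc["text"]: [(a["start"], a["end"], a["cui"])
                          for a in doc.get("annotations", [])]
            for proj in dataset["projects"]
            for doc in proj["documents"]
        }

    def __call__(self, text):
        # Unknown texts simply produce no entities.
        return self.annotations.get(text, [])


@contextmanager
def swapped_component(pipeline, name, replacement):
    """Temporarily replace pipeline[name], restoring the original on exit."""
    original = pipeline[name]
    pipeline[name] = replacement
    try:
        yield replacement
    finally:
        # Restore even if training or evaluation raised an exception.
        pipeline[name] = original
```

Since the stand-in has no training methods, it is skipped by any training loop that only updates trainable components, and the original component comes back untouched once the `with` block exits.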

github-actions Bot added 15 commits April 16, 2026 15:24

CU-869cy3yz9: Add unsupervised training method to trainable component…

d307dab

… protocol

CU-869cy3yz9: Follow the intercace for unsupervised trianing in conte…

acf623b

…xt based linker

CU-869cy3xa0: Use new interface for self-supervised training

f204535

CU-869cy3yz9: Fix fake pipe in tests

b1cc70a

CU-869cy3yz9: Fix issue with unrealised generator

a76fde4

CU-869cy3z45: Allow any component to be trained in an unsupervised ma…

7d9adf5

…nner

CU-869cy3z45: Remove unused import

4a71c1c

CU-869cy3yz9: Add a few more tests for trainable components

4c25292

CU-869cy3zb0: Add utilities to create a dataset-aware NER or linker c…

9620d9b

…omponent

CU-869cy3zb0: Fix minor issues with new utilities

7463586

CU-869cy3zb0: Fix minor order of operations issue

d782e0e

CU-869cy3zb0: Add a few tests for training utilities

3e3ae16

CU-869cy3zb0: Add a few missing doc strings

1e7e8cd

CU-869cy3zb0: Add a few supervised training based tests

9185ae7

CU-869cy3zb0: Fix import of Self (from typing extensions)

5df2870

mart-r wants to merge 15 commits into main from feat/medcat/CU-869cy3xa0-specify-unsupervised-training-in-trainable-component-protocol
