feat(medcat):CU-869cy3xa0 Improve training#414

Open
mart-r wants to merge 15 commits into main from
feat/medcat/CU-869cy3xa0-specify-unsupervised-training-in-trainable-component-protocol

mart-r (Collaborator) commented Apr 21, 2026

This PR overhauls the training setup of MedCAT:

  • It extends the existing TrainableComponent protocol with a train_unsupervised method
    • That method is now used instead of the "check config for train and run inference" approach to unsupervised training
  • It allows every component that follows the TrainableComponent protocol to be trained in a supervised manner
    • Previously only the linker could be trained in a supervised manner
  • It provides a few utility methods for training and evaluating components individually
    • I.e. dataset-aware components that make it possible to train or evaluate only the NER or only the linker if/when required

Example code snippets:

  1. When only training the linker
# NER is swapped for a dataset-aware component, so only the linker is trained
with dataset_aware_component(cat, CoreComponentType.ner, DATASET):
    trainer.train_supervised_raw(DATASET, nepochs=1)
  2. When only training NER
# the linker is swapped for a dataset-aware component, so only NER is trained
with dataset_aware_component(cat, CoreComponentType.linking, DATASET):
    texts = [doc['text'] for proj in DATASET['projects'] for doc in proj['documents']]
    trainer.train_unsupervised(texts, nepochs=1)
  3. When doing evaluation / stats one component at a time
with dataset_aware_component(cat, CoreComponentType.ner, DATASET):
    fps, fns, tps, cui_prec, cui_rec, cui_f1, cui_counts, examples = get_stats(
        cat, DATASET, do_print=False)
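
For readers unfamiliar with the protocol-based approach listed in the description, the following is a minimal, self-contained sketch and not code from this PR. The `train` parameters mirror the call shown in the review thread (`cui`, `entity`, `doc`, `negative`, `names`); the `train_unsupervised(doc)` signature, the toy classes, and the `train_unsupervised_all` helper are all assumptions made up for illustration.

```python
from typing import Any, Iterable, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class TrainableComponent(Protocol):
    """Hypothetical stand-in for the protocol; the real signatures may differ."""

    def train(self, cui: str, entity: Any, doc: Any,
              negative: bool = False,
              names: Optional[List[str]] = None) -> None:
        ...

    def train_unsupervised(self, doc: Any) -> None:  # assumed signature
        ...


class ToyLinker:
    """Toy component that only counts how often it was trained."""

    def __init__(self) -> None:
        self.supervised_calls = 0
        self.unsupervised_calls = 0

    def train(self, cui, entity, doc, negative=False, names=None) -> None:
        self.supervised_calls += 1

    def train_unsupervised(self, doc) -> None:
        self.unsupervised_calls += 1


class ToyTokenizer:
    """Has no training methods, so it does not satisfy the protocol."""

    def __call__(self, text: str) -> List[str]:
        return text.split()


def train_unsupervised_all(components: Iterable[Any], docs: List[Any]) -> int:
    """Call train_unsupervised on every protocol-conforming component.

    Returns the number of components that were trained.
    """
    trained = 0
    for comp in components:
        # runtime_checkable protocols only check that the methods exist
        if isinstance(comp, TrainableComponent):
            for doc in docs:
                comp.train_unsupervised(doc)
            trained += 1
    return trained
```

In this sketch the trainer no longer needs to know which component is "the linker"; anything that exposes both methods is trained, and anything else is skipped.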

Comment on lines +647 to +648
component.train(cui=cui, entity=mut_entity, doc=mut_doc,
negative=negative, names=names)
Contributor

Does this mean we're still doing unsupervised on a per entity basis? I can't think of a case where in an unsupervised manner you would need the entity.

Collaborator Author

This is supervised training still. This is within the add_and_train_concept method.

@adam-sutton-1992
Contributor

A few queries but I think it looks good. I might've missed these within the commits:

If you have two trainable components, is it possible to turn off training for one of them when running training methods? Do the dataset-aware components serve that purpose?

And one more above^^^

@mart-r
Collaborator Author

mart-r commented Apr 25, 2026

> If you have two trainable components, is it possible to turn off training for one of them when running training methods? Do the dataset-aware components serve that purpose?

The description already had 2 examples for this :)

The dataset-aware implementation can serve that purpose, because it replaces the specific component with another one (which isn't trainable, but that's kind of irrelevant since it's a different component) for the duration of the context manager.

I think what makes it unclear is that in the examples I've given it a dataset. Realistically, you could provide an empty dataset for it, i.e. like this:

# supervised
with dataset_aware_component(cat, CoreComponentType.ner, {"projects": []}):
    trainer.train_supervised_raw(DATASET, nepochs=1)
# unsupervised
with dataset_aware_component(cat, CoreComponentType.ner, {"projects": []}):
    trainer.train_unsupervised(["list", "of", "texts"], nepochs=1)
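
The swap-and-restore behaviour described in this reply can be sketched with a plain context manager. This is an illustration under stated assumptions, not the PR's implementation: `ToyPipeline` and `swapped_component` are hypothetical names, and the real `dataset_aware_component` presumably also builds the replacement component from the dataset, which is omitted here.

```python
from contextlib import contextmanager
from typing import Any, Dict, Iterator


class ToyPipeline:
    """Hypothetical pipeline that stores its components by name."""

    def __init__(self, components: Dict[str, Any]) -> None:
        self.components = dict(components)


@contextmanager
def swapped_component(pipeline: ToyPipeline, name: str,
                      replacement: Any) -> Iterator[Any]:
    """Temporarily replace one component; restore it on exit."""
    original = pipeline.components[name]
    pipeline.components[name] = replacement
    try:
        yield replacement
    finally:
        # restore the original even if the body raised an exception
        pipeline.components[name] = original
```

Because the original component is never touched while the replacement is installed, any training that runs inside the `with` block cannot update it, which is why whether the replacement itself is trainable does not matter.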
