Abletobetable/document-extractor

document-extractor 📄

Extract the part of a document that corresponds to a specified label.

Content 📖

Stack of technologies 🏗

  • Python 🐍
  • Transformers 🤗
  • Wandb 🪄

Task description 📋

Given the raw text of a document, the task is to extract the part that corresponds to a specified label. There are several label types (the dataset explored below has two), and an empty answer ("") is also possible.

This looks like an extractive question answering task. Many models pretrained on the SQuAD v2 dataset are available, so I fine-tune some of them.
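To make the question answering framing concrete, here is a minimal sketch of how one labeled document could be mapped to a SQuAD v2-style record, with the label acting as the question. The function name and the input arguments (`text`, `label`, `extracted_part`) are assumptions made for illustration, not necessarily what this repository uses.

```python
def to_squad_example(doc_id, text, label, extracted_part):
    """Map one labeled document to a SQuAD v2-style record (hypothetical helper)."""
    if extracted_part:
        start = text.find(extracted_part)
        if start == -1:
            raise ValueError("extracted part does not occur in the document text")
        answers = {"text": [extracted_part], "answer_start": [start]}
    else:
        # SQuAD v2 represents "no answer" with empty lists.
        answers = {"text": [], "answer_start": []}
    return {
        "id": str(doc_id),
        "question": label,   # the label plays the role of the question
        "context": text,     # the raw document text is the context
        "answers": answers,
    }


# Made-up document, for illustration only.
doc_text = "The supplier shall provide contract security of 5% of the price."
with_answer = to_squad_example(1, doc_text, "contract security", "contract security of 5%")
no_answer = to_squad_example(2, doc_text, "warranty security", "")
```

An empty extracted part becomes an unanswerable question, which is the case SQuAD v2 models are already pretrained to handle.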

EDA 🔎

  • Documents with "обеспечение исполнения контракта" (contract performance security): 988
  • Documents with "обеспечение гарантийных обязательств" (warranty obligations security): 811
  • "обеспечение исполнения контракта" documents with an empty extracted part: 4
  • "обеспечение гарантийных обязательств" documents with an empty extracted part: 303

The label classes look roughly balanced, but 37% (303 of 811) of the "обеспечение гарантийных обязательств" documents have an empty extracted part.

So for the train/test split I use a stratified strategy rather than random_split.
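A stratified split of this kind can be sketched in plain Python as below. Several things here are assumptions: the field names `label` and `extracted_part`, the choice to stratify on the pair (label, empty or not), the 20% test fraction, and the synthetic records, which only reproduce the counts from the EDA above.

```python
import random
from collections import defaultdict


def stratified_split(records, key, test_size=0.2, seed=42):
    """Split records so each stratum keeps roughly the same train/test ratio."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    train, test = [], []
    for members in groups.values():
        members = members[:]          # do not mutate the caller's data
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def stratum(record):
    # Assumed stratification key: label plus whether the answer is empty.
    return (record["label"], record["extracted_part"] == "")


# Synthetic records matching the EDA counts (988 / 811 documents, 4 / 303 empty).
records = (
    [{"label": "A", "extracted_part": "x"}] * 984
    + [{"label": "A", "extracted_part": ""}] * 4
    + [{"label": "B", "extracted_part": "x"}] * 508
    + [{"label": "B", "extracted_part": ""}] * 303
)
train, test = stratified_split(records, key=stratum)
```

With this key, the share of empty answers for each label is about the same in both splits, which a purely random split does not guarantee for the small group of 4 documents.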

Lengths of documents: [figure]

Lengths of tokenized text: [figure]

Lengths of tokenized answer: [figure]

Proposed solution 🚳

To tokenize the text, I trained a custom tokenizer.

I chose three models for fine-tuning: deberta, distilbert, and mdeberta.

I set up configs for hyperparameter optimization over:

  • max_length of tokenized text input
  • stride for tokenizing text
  • dropout layers in model
  • learning rate
  • warmup_ratio
  • weight_decay strategy
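The first two parameters interact: documents longer than `max_length` tokens are cut into overlapping windows, and `stride` controls how many tokens consecutive windows share. The sketch below shows that idea on a plain list of token ids. It is a simplification that ignores the question tokens and special tokens a real question answering tokenizer also fits into each window, and the function name is hypothetical.

```python
def sliding_windows(token_ids, max_length, stride):
    """Cut a token sequence into windows of at most max_length tokens.

    Consecutive windows overlap by `stride` tokens, so an answer that is
    split at one window boundary appears whole in a neighbouring window.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


windows = sliding_windows(list(range(10)), max_length=4, stride=2)
```

A larger `stride` yields more windows per document (and more compute) but lowers the chance that an answer is truncated in every window that contains it.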

Hyperparameter optimization plots on the validation set:

  • deberta: [figure]
  • distilbert: [figure]
  • mdeberta: [figure]

Then I retrained each model with its best set of parameters and evaluated it on the test set:

  • deberta:

    exact_match: 83.9

    f1_score: 96.6

  • distilbert:

    exact_match: 84.7

    f1_score: 96.8

  • mdeberta:

    exact_match: 83.6

    f1_score: 97.14
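For readers unfamiliar with these two metrics, the sketch below shows a simplified SQuAD-style computation for a single prediction. It is an illustration only: the normalization here is just lowercasing and whitespace collapsing, which may differ from the metric implementation used to produce the numbers above.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (simplified normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the predicted and reference answers."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Empty answers count as correct only if both are empty.
        return float(pred_tokens == true_tokens)
    common = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example: the prediction misses the last two words of the reference.
truth = "security of 5% of the contract price is required"
prediction = "security of 5% of the contract price"
em = exact_match(prediction, truth)
f1 = f1_score(prediction, truth)
```

A prediction that is off by a couple of tokens scores 0 on exact match but still gets a high F1, which is consistent with the gap between the two metrics in the results above.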

I also uploaded the models to the model hub:

deberta, distilbert, mdeberta

How to improve 🔨

What else could be done:

  1. I trained a new tokenizer for this task, so it would be useful to compare the old and new tokenizers on it.
  2. For hyperparameter optimization I ran only a few iterations of random search, so these results could be used as a starting point for Bayesian optimization.
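As an illustration of the second point, a random search over the hyperparameters listed earlier can be sketched as follows. All value ranges are assumptions chosen for the example, and `toy_objective` is a hypothetical stand-in for "fine-tune the model and return the validation score".

```python
import math
import random


def sample_config(rng):
    """Draw one configuration; every range below is an assumed example."""
    return {
        "max_length": rng.choice([256, 384, 512]),
        "stride": rng.choice([64, 128]),
        "dropout": rng.uniform(0.0, 0.3),
        "learning_rate": 10 ** rng.uniform(-5.5, -4.0),  # log-uniform
        "warmup_ratio": rng.uniform(0.0, 0.2),
        "weight_decay": rng.choice([0.0, 0.01, 0.1]),
    }


def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials random configs; return (score, config) pairs, best first."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = sample_config(rng)
        trials.append((objective(config), config))
    return sorted(trials, key=lambda trial: trial[0], reverse=True)


def toy_objective(config):
    # Hypothetical placeholder: rewards learning rates near 10 ** -4.5.
    return -abs(math.log10(config["learning_rate"]) + 4.5)


trials = random_search(toy_objective, n_trials=5)
best_score, best_config = trials[0]
```

The resulting list of (score, config) pairs is the kind of history that could be handed to a Bayesian optimizer as initial observations, instead of starting it from scratch.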

Feel free to contact me 📞

https://t.me/abletobetable

abletobetable@mail.ru
