Extract the part of a document that corresponds to a specified label
- Python 🐍
- Transformers 🤗
- Wandb 🪄
Given the raw text of a document, the task is to extract the part that corresponds to a specified label. There are two types of labels, and an empty answer ("") is also possible.
This looks like an extractive question answering task. Many models pretrained on the SQuAD v2 dataset are available, so I fine-tune several of them.
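To make the question answering framing concrete, here is a minimal sketch of how one labelled document could be turned into a SQuAD v2-style record. The function name and the argument names (`doc_id`, `text`, `label`, `extracted_part`) are assumptions for illustration, not the field names of the actual dataset.

```python
def to_squad_v2(doc_id, text, label, extracted_part):
    """Convert one labelled document into a SQuAD v2-style record.

    The label plays the role of the question and the document text is the
    context. An empty extracted part becomes an unanswerable example.
    """
    if extracted_part:
        start = text.find(extracted_part)
        if start == -1:
            raise ValueError(f"answer not found in document {doc_id}")
        answers = {"text": [extracted_part], "answer_start": [start]}
    else:
        # SQuAD v2 marks "no answer" with empty answer lists.
        answers = {"text": [], "answer_start": []}
    return {
        "id": str(doc_id),
        "question": label,
        "context": text,
        "answers": answers,
    }


record = to_squad_v2(1, "abc def ghi", "some label", "def")
print(record["answers"])
```

With this layout, the 307 documents with an empty extracted part map naturally onto the unanswerable examples that SQuAD v2 models are already trained to handle.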
- Number of documents with "обеспечение исполнения контракта" (contract performance security): 988
- Number of documents with "обеспечение гарантийных обязательств" (warranty obligations security): 811
- Number of "обеспечение исполнения контракта" documents with an empty extracted part: 4
- Number of "обеспечение гарантийных обязательств" documents with an empty extracted part: 303
The two label classes are roughly balanced, but 37% of the "обеспечение гарантийных обязательств" documents (303 of 811) have an empty extracted part.
For the train/test split I therefore use a stratified strategy rather than random_split.
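One way to stratify is to group documents by the pair (label, answer is empty) and split each group separately, so that the rare empty cases appear in both parts. The sketch below is a plain-Python illustration under that assumption; the record keys `label` and `extracted_part` and the 20% test size are hypothetical, and it is not necessarily the split used in this project.

```python
import random
from collections import defaultdict


def stratum(rec):
    """Stratification key: the label plus whether the answer is empty."""
    return (rec["label"], rec["extracted_part"] == "")


def stratified_split(records, key=stratum, test_size=0.2, seed=42):
    """Split records so every stratum keeps roughly the same test share."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    rng = random.Random(seed)
    train, test = [], []
    # Sort the strata so the result does not depend on input order of groups.
    for k in sorted(groups, key=repr):
        items = groups[k]
        rng.shuffle(items)
        n_test = round(len(items) * test_size)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


docs = (
    [{"label": "A", "extracted_part": "x"}] * 96
    + [{"label": "A", "extracted_part": ""}] * 4
    + [{"label": "B", "extracted_part": "y"}] * 50
    + [{"label": "B", "extracted_part": ""}] * 30
)
train_docs, test_docs = stratified_split(docs)
print(len(train_docs), len(test_docs))
```

If scikit-learn is available, passing the same key values to the `stratify` argument of `train_test_split` should give an equivalent result with less code.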
Lengths of documents:
Lengths of tokenized text:
Lengths of tokenized answer:
To tokenize the text, I trained a custom tokenizer.
Models chosen for fine-tuning:
Search space for hyperparameter optimization:
- max_length of tokenized text input
- stride for tokenizing text
- dropout layers in model
- learning rate
- warmup_ratio
- weight_decay strategy
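A random search over the parameters listed above can be sketched as follows. Every range and candidate value here is an assumption chosen for illustration (the actual search space is not given in this write-up), and `weight_decay` is treated as a plain numeric value rather than a strategy.

```python
import math
import random

# Hypothetical search space: all values below are illustrative assumptions.
SEARCH_SPACE = {
    "max_length": [256, 384, 512],   # tokens per input window
    "stride": [64, 128],             # overlap between consecutive windows
    "dropout": (0.0, 0.3),
    "learning_rate": (1e-5, 1e-4),   # sampled on a log scale
    "warmup_ratio": (0.0, 0.2),
    "weight_decay": (0.0, 0.1),
}


def sample_config(rng):
    """Draw one random configuration from SEARCH_SPACE."""
    lr_low, lr_high = SEARCH_SPACE["learning_rate"]
    return {
        "max_length": rng.choice(SEARCH_SPACE["max_length"]),
        "stride": rng.choice(SEARCH_SPACE["stride"]),
        "dropout": rng.uniform(*SEARCH_SPACE["dropout"]),
        # Log-uniform sampling spreads trials evenly across magnitudes.
        "learning_rate": 10 ** rng.uniform(math.log10(lr_low), math.log10(lr_high)),
        "warmup_ratio": rng.uniform(*SEARCH_SPACE["warmup_ratio"]),
        "weight_decay": rng.uniform(*SEARCH_SPACE["weight_decay"]),
    }


rng = random.Random(0)
trials = [sample_config(rng) for _ in range(5)]
for cfg in trials:
    print(cfg)
```

Each sampled configuration would then be used for one fine-tuning run, with the validation metric logged per trial.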
Hyperparameter optimization plots on the validation set:
deberta:
distilbert:
mdeberta:
Then I retrained the models with the best set of parameters and evaluated them on the test set:
- deberta:
  - exact_match: 83.9
  - f1_score: 96.6
- distilbert:
  - exact_match: 84.7
  - f1_score: 96.8
- mdeberta:
  - exact_match: 83.6
  - f1_score: 97.14
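For readers unfamiliar with the two metrics, the sketch below shows how SQuAD-style exact match and token-level F1 can be computed for a single prediction. It is a simplified illustration that only splits on whitespace; the official SQuAD evaluation also normalizes case, punctuation, and articles, so it is an assumption that these numbers would match the scores reported above.

```python
from collections import Counter


def exact_match(prediction, truth):
    """1.0 if the predicted span equals the reference span, else 0.0."""
    return float(prediction.strip() == truth.strip())


def f1_score(prediction, truth):
    """Token-level F1 between the predicted and the reference span."""
    pred_tokens = prediction.split()
    true_tokens = truth.split()
    # If either side is empty, the score is 1.0 only when both are empty.
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    common = Counter(pred_tokens) & Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction with one extra token fails exact match but keeps a high F1.
print(exact_match("a b c", "a b"), f1_score("a b c", "a b"))
```

This is one plausible reason why F1 sits well above exact match for all three models: a span that is off by a few tokens counts as a miss for exact match but still earns most of its F1 credit.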
I also uploaded the models to the model hub:
What else could be done:
- I trained a new tokenizer for this task, so it would be useful to compare the old and new tokenizers on this task.
- For hyperparameter optimization I ran only a few iterations of random search; these results could be used to seed Bayesian optimization.





