document-extractor 📄

Extract the part of a document that corresponds to a specified label.

Content 📖

Stack of technologies
Task description
EDA (exploratory data analysis)
Proposed solution
How to improve?

Stack of technologies 🏗

Python 🐍
Transformers 🤗
Wandb 🪄

Task description 📋

Given the raw text of a document, the task is to extract the part that corresponds to a specified label. There are two types of labels. An empty answer ("") is also possible.

This looks like an extractive question answering task. Many models pretrained on the SQuAD v2 dataset are available, so I fine-tune some of them.
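To make the question answering framing concrete, here is a minimal sketch of how one document could be turned into a SQuAD v2 style record: the label plays the role of the question, the document text is the context, and an empty answer becomes an empty answer list. The function name and its arguments are hypothetical, and the sample text is made up for illustration.

```python
def to_squad_v2_example(doc_id, text, label, answer_text, answer_start):
    """Build one SQuAD v2 style record: label -> question, document -> context."""
    has_answer = answer_text != ""
    if has_answer:
        # The answer must be a literal span of the document.
        span = text[answer_start:answer_start + len(answer_text)]
        if span != answer_text:
            raise ValueError("answer_start does not point at answer_text")
    return {
        "id": str(doc_id),
        "question": label,
        "context": text,
        # SQuAD v2 marks "no answer" with empty lists.
        "answers": {
            "text": [answer_text] if has_answer else [],
            "answer_start": [answer_start] if has_answer else [],
        },
    }


# Made-up document, for illustration only.
sample_text = "The contract security amounts to 5% of the price."
sample_answer = "5% of the price"
example = to_squad_v2_example(
    doc_id=1,
    text=sample_text,
    label="contract security",
    answer_text=sample_answer,
    answer_start=sample_text.find(sample_answer),
)
no_answer = to_squad_v2_example(2, sample_text, "warranty security", "", 0)
```

With records in this shape, a model fine-tuned on SQuAD v2 can be trained to predict either a span or "no answer".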

EDA 🔎

Number of documents with "обеспечение исполнения контракта" (contract performance security): 988
Number of documents with "обеспечение гарантийных обязательств" (warranty obligations security): 811
Number of "обеспечение исполнения контракта" documents with an empty extracted part: 4
Number of "обеспечение гарантийных обязательств" documents with an empty extracted part: 303

The label classes look roughly balanced, but 37% (303 of 811) of the "обеспечение гарантийных обязательств" documents have an empty extracted part.

So for the train/test split I use a stratified strategy instead of random_split.
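A stratified split can be sketched in plain Python as below. This is an illustration, not the code used in the repository: it assumes each record is a dict with hypothetical `label` and `extracted_part` fields, and it stratifies on the pair (label, answer is non-empty) so that the rare empty-answer cases are spread evenly over both parts.

```python
import random
from collections import defaultdict


def stratified_split(records, test_fraction=0.2, seed=42):
    """Split records so every (label, has_answer) stratum keeps its share."""
    strata = defaultdict(list)
    for rec in records:
        # Hypothetical field names: "label" and "extracted_part".
        key = (rec["label"], rec["extracted_part"] != "")
        strata[key].append(rec)

    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(strata):
        group = list(strata[key])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

If scikit-learn is available, `train_test_split` with its `stratify` argument set to the same combined key gives an equivalent result with less code.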

Lengths of documents:

Lengths of tokenized text:

Lengths of tokenized answer:

Proposed solution 🚳

To tokenize the text, I trained a custom tokenizer.

I chose three models for fine-tuning: deberta, distilbert, and mdeberta.

I set up configs for hyperparameter optimization over:

max_length of tokenized text input
stride for tokenizing text
dropout layers in model
learning rate
warmup_ratio
weight_decay strategy
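The search over these hyperparameters can be pictured as random sampling from a search space. The sketch below is not the actual sweep config from the repository: the parameter names follow the list above, but every range and value is a made-up placeholder, and weight decay is treated as a plain numeric value for simplicity.

```python
import math
import random

# Hypothetical search space: names follow the list above, ranges are placeholders.
SEARCH_SPACE = {
    "max_length": {"values": [256, 384, 512]},
    "stride": {"values": [64, 128]},
    "dropout": {"min": 0.0, "max": 0.3},
    "learning_rate": {"min": 1e-5, "max": 1e-4, "log": True},
    "warmup_ratio": {"min": 0.0, "max": 0.2},
    "weight_decay": {"values": [0.0, 0.01, 0.1]},
}


def sample_config(space, rng):
    """Draw one random configuration from the search space."""
    config = {}
    for name, spec in space.items():
        if "values" in spec:
            # Categorical parameter: pick one of the listed values.
            config[name] = rng.choice(spec["values"])
        elif spec.get("log"):
            # Sample uniformly in log space, the usual choice for learning rates.
            lo, hi = math.log(spec["min"]), math.log(spec["max"])
            config[name] = math.exp(rng.uniform(lo, hi))
        else:
            config[name] = rng.uniform(spec["min"], spec["max"])
    return config


rng = random.Random(0)
trials = [sample_config(SEARCH_SPACE, rng) for _ in range(5)]
```

Each sampled configuration would then be used for one fine-tuning run, with the validation metric logged per run.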

Hyperparameter optimization plots on the validation set:

deberta:

distilbert:

mdeberta:

Then I retrained each model with its best set of parameters and evaluated it on the test set:

deberta:
exact_match: 83.9
f1_score: 96.6

distilbert:
exact_match: 84.7
f1_score: 96.8

mdeberta:
exact_match: 83.6
f1_score: 97.14
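The gap between exact_match and f1_score is easier to read with the metric definitions in hand. The sketch below assumes the scores follow the usual SQuAD conventions (the source does not state this): exact match is 1 only when the normalized strings are identical, while F1 rewards token overlap. The normalization here is a simplified version that lowercases and strips punctuation.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the predicted and the reference span."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Empty answers: full credit only if both sides are empty.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that includes one extra word around the correct span scores 0 on exact match but still gets a high F1, which is consistent with F1 being much higher than exact match for all three models.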

I also uploaded the models to the model hub:

deberta, distilbert, mdeberta

How to improve 🔨

What else could be done:

I trained a new tokenizer for this task; it would be worth comparing the original and the new tokenizer on it.
For hyperparameter optimization I ran only a few iterations of random search; these results could be used as a starting point for Bayesian optimization.

Feel free to contact me 📞

https://t.me/abletobetable

abletobetable@mail.ru
