A language-model-based classifier that extracts URLs from scientific papers and classifies them into two categories: URLs linking to open access datasets/software (OADS) and all other URLs (Not-OADS). The repository contains the training dataset, the test dataset, and the model.
- language_model.py: A script for loading the transformer language model. This module is called from the script "classification_report.py".
- pub_url_cleaner.py: A script for removing publisher URLs from the dataset. This module is called from the script "classification_report.py".
- model_weight.sav: This file contains the model weights, which are loaded by the "OADS.py" script.
- sentence_with_url_extractor.py: A script that extracts sentences containing URLs from PDF files.
- OADS.py: A script that implements OADSClassifier.
- dataset.csv: This CSV file contains all the ground truth used for training the model. It is not used by any script and is kept here only for the record.
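To illustrate the kind of processing that sentence_with_url_extractor.py describes, here is a minimal sketch of pulling URL-bearing sentences out of already-extracted text. It is not the repository's implementation: the function name `sentences_with_urls`, the URL pattern, and the naive sentence-splitting rule are all assumptions made for this example, and PDF-to-text conversion is left out.

```python
import re

# Assumed URL pattern: anything starting with http(s):// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# Naive sentence boundary (assumption): ., ! or ? followed by whitespace and an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def sentences_with_urls(text):
    """Return (sentence, urls) pairs for every sentence containing at least one URL."""
    results = []
    for sentence in SENTENCE_BOUNDARY.split(text):
        # Strip trailing punctuation that belongs to the sentence, not the URL.
        urls = [u.rstrip(".,;:)") for u in URL_PATTERN.findall(sentence)]
        if urls:
            results.append((sentence.strip(), urls))
    return results


# Made-up sample text, for illustration only.
sample = (
    "The code is available at https://example.org/tool. "
    "We thank the reviewers. "
    "Data can be downloaded from http://example.org/data.csv for free."
)

result = sentences_with_urls(sample)
for sentence, urls in result:
    print(urls, "<-", sentence)
```

A regex-based splitter like this will mis-split on abbreviations such as "e.g. The"; a proper sentence tokenizer would likely be more robust on real paper text.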
Install the dependencies:
pip install --upgrade transformers==4.2
pip install pickle5
Then run the classification report script:
python3 classification_report.py
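The pickle5 package is a backport of pickle protocol 5 for Python 3.5 to 3.7; from Python 3.8 onward the standard library `pickle` module supports that protocol itself. Assuming that "model_weight.sav" is a pickled object (the README does not state the file format), a loader that works on both old and new interpreters might look like the sketch below. The helper name `load_pickled_object` and the stand-in data are hypothetical and are not taken from "OADS.py".

```python
import os
import tempfile

try:
    import pickle5 as pickle  # backport for Python 3.5 to 3.7
except ImportError:
    import pickle  # standard library; supports protocol 5 from Python 3.8


def load_pickled_object(path):
    """Load and return a pickled object from the given file path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip with a stand-in object, since the real weights file is not assumed here.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_weight.sav")
    with open(demo_path, "wb") as handle:
        pickle.dump({"coef": [0.1, 0.2]}, handle)
    restored = load_pickled_object(demo_path)

print(restored)
```

Only unpickle files from a source you trust, because loading a pickle can execute arbitrary code.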
The "OADS.py" script takes a text file as input. In "OADS.py", replace the text "filename" on line 89 with the name of the input file.
The output of this script is the list of sentences that contain URLs, together with the predicted class of each. The classes are "OADS" and "Not-OADS". The output is also saved in a CSV file named "Output.csv".
- "OADS" means Open access dataset/software
- "Not-OADS" means not open access dataset/software
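As a rough picture of what a sentence-and-class output file could look like, the sketch below writes (sentence, predicted class) pairs as CSV. The column names `sentence` and `predicted_class`, the helper `write_predictions`, and the example rows are assumptions for illustration; the actual layout of "Output.csv" is defined by "OADS.py".

```python
import csv
import io


def write_predictions(rows, handle):
    """Write (sentence, predicted_class) pairs as CSV to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["sentence", "predicted_class"])  # assumed header
    writer.writerows(rows)


# Made-up predictions, for illustration only.
predictions = [
    ("The code is available at https://example.org/tool.", "OADS"),
    ("See the publisher page at https://example.org/journal.", "Not-OADS"),
]

# Write to an in-memory buffer so the example leaves no files behind.
buffer = io.StringIO()
write_predictions(predictions, buffer)
print(buffer.getvalue())

# To write a real file instead:
# with open("Output.csv", "w", newline="", encoding="utf-8") as f:
#     write_predictions(predictions, f)
```

Using the `csv` module rather than joining strings by hand keeps sentences that contain commas or quotes correctly escaped.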
The file "41367_2016_Mechanical-Engineering_41367.txt" in the "oadsclassifier-Data" directory is a sample text file that can be used as input to test the classifier.