oadsclassifier

A language-model-based classifier that extracts URLs from scientific papers and classifies them into two categories: URLs that link to open access datasets/software (OADS) and non-OADS URLs. The repository contains the training dataset, the test dataset, and the model.

Scripts and sav file

language_model.py: A script that loads the transformer language model. This module is called from "classification_report.py".
pub_url_cleaner.py: A script that removes publisher URLs from the dataset. This module is called from "classification_report.py".
model_weight.sav: Contains the model weights, which are loaded in "OADS.py".
sentence_with_url_extractor.py: A script that extracts sentences containing URLs from PDF files.
OADS.py: The script that implements OADSClassifier.
dataset.csv: A CSV file containing all the ground truth used to train the model. It is not used by any script and is kept only for the record.
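
To illustrate the two preprocessing steps named above (keeping sentences that contain URLs, then dropping publisher URLs), here is a minimal sketch. It is not the code in "sentence_with_url_extractor.py" or "pub_url_cleaner.py". It assumes the PDF has already been converted to plain text, uses a simple regular-expression sentence splitter, and the names PUBLISHER_DOMAINS, sentences_with_urls, is_publisher_url and drop_publisher_urls are hypothetical. The two publisher domains are placeholders, not the repository's list.

```python
import re
from urllib.parse import urlparse

# Hypothetical placeholder list; the repository keeps its own publisher domains.
PUBLISHER_DOMAINS = {"springer.com", "elsevier.com"}

# A URL runs until whitespace or a character that usually closes a citation.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")
# Split on whitespace that follows ., ! or ? and precedes an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def sentences_with_urls(text):
    """Return (sentence, url) pairs for every URL found in the text."""
    pairs = []
    for sentence in SENTENCE_BOUNDARY.split(text):
        for url in URL_PATTERN.findall(sentence):
            # Strip punctuation that belongs to the sentence, not the URL.
            pairs.append((sentence.strip(), url.rstrip(".,;")))
    return pairs


def is_publisher_url(url):
    """True if the host is a listed publisher domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in PUBLISHER_DOMAINS)


def drop_publisher_urls(pairs):
    """Keep only the pairs whose URL does not point to a publisher site."""
    return [(s, u) for s, u in pairs if not is_publisher_url(u)]
```

Calling drop_publisher_urls(sentences_with_urls(text)) on the text of a paper would leave only the sentence/URL pairs that are candidates for classification.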

Requirements

pip install --upgrade transformers==4.2
pip install pickle5
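
The pickle5 requirement suggests that "model_weight.sav" is a pickle file, although this README does not say so. Under that assumption, loading it could look like the sketch below. The function name load_weights and the stand-in dictionary are hypothetical, and the round trip uses a temporary file rather than the real weights. On Python 3.8 and later the standard pickle module already supports protocol 5, so the sketch imports pickle directly. Only unpickle files from a source you trust, since unpickling can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path


def load_weights(path):
    """Load a pickled object from a .sav file (assumed to be a pickle)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip with a stand-in object, since the real file layout is not documented here.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_weight.sav"
    with open(demo_path, "wb") as handle:
        pickle.dump({"coef": [0.1, 0.2]}, handle)
    restored = load_weights(demo_path)
```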

How to use the script

python3 classification_report.py

Input

The "OADS.py" script takes a text file as input. In "OADS.py", replace the text "filename" on line 89 with the name of the input file.

Output

The script outputs the list of sentences that contain URLs, together with the predicted class of each. The classes are "OADS" and "Not-OADS". The output is also saved to a CSV file named "Output.csv".

"OADS" means open access dataset/software.
"Not-OADS" means not an open access dataset/software.
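
As a rough sketch of the output format described above, the following writes sentence/class pairs to a CSV file with Python's csv module. The column names "sentence" and "predicted_class" and the function name save_predictions are assumptions, since this README does not list the actual columns of "Output.csv".

```python
import csv


def save_predictions(rows, path="Output.csv"):
    """Write (sentence, predicted_class) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Hypothetical column names; the real file may use different ones.
        writer.writerow(["sentence", "predicted_class"])
        writer.writerows(rows)
```

Each row passed in would pair one extracted sentence with either "OADS" or "Not-OADS".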

Sample input and output

The file "41367_2016_Mechanical-Engineering_41367.txt" in the "oadsclassifier-Data" directory is a sample text file that can be used as input to test the classifier.

