Inferring the source of official texts: can SVM beat ULMFiT?

This page holds the dataset and source code described in the paper below:

Pedro H. Luz de Araujo, Teófilo E. de Campos, Marcelo M. Silva de Sousa.
Inferring the source of official texts: can SVM beat ULMFiT?
International Conference on the Computational Processing of Portuguese (PROPOR), March 2-4, Évora, Portugal, 2020.
Download: [ paper | bib | code and dataset ]

We kindly request that users cite the paper above in any publication that results from the use of our code or our dataset.

Presentations

Long version, presented in internal meeting, February 2020
PROPOR presentation, March 2020
Workshop KnEDLe release 1, July 2020, in Portuguese (video)

Resources

We provide the data splits as CSV files, the data after the preprocessing described in the paper as pickle files, the SentencePiece tokenizer model and vocabulary, and the Jupyter notebooks used. We describe the directory structure and individual files below.

Unfortunately, we are not able to upload the trained model files used in the paper, but the "update" section at the end of this page has the resources needed to generate our latest results.

Note: this project is also listed at Papers With Code, where you can access any future updates to the resources of this project and see the latest results of our benchmark.

Requirements

Directory Structure

The code and dataset can be obtained by clicking here.

data:
- data_clas_bwd.pkl: preprocessed data for backward classification
- data_clas_export.pkl: preprocessed data for forward classification
- data_lm_back.pkl: preprocessed data for backward language model
- data_lm_export.pkl: preprocessed data for forward language model
- test_data.pkl: preprocessed data for forward classification evaluation
- test_data_bwd.pkl: preprocessed data for backward classification evaluation
- clean:
  - train.csv: raw training + validation data (unsplit)
  - train_val.csv: raw training + validation data (split)
  - unsup:
    - unsup.csv: raw language model data
tmp:
- spm.model: SentencePiece tokenizer model
- spm.vocab: SentencePiece vocabulary
train_ulmfit.ipynb: preprocesses and saves the data, and trains and evaluates ULMFiT models
train_baseline.ipynb: trains and evaluates the BOW models
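
As a quick sanity check after unpacking the archive, the listing above can be turned into a small script that reports which files are absent. This is only a sketch: the helper name missing_files is hypothetical, and the relative paths assume that the nesting on disk mirrors the indentation of the listing (for example, that unsup.csv sits under data/clean/unsup).

```python
from pathlib import Path

# Relative paths transcribed from the directory listing above.
# Assumption: the on-disk nesting mirrors the indentation of that listing.
EXPECTED_FILES = [
    "data/data_clas_bwd.pkl",
    "data/data_clas_export.pkl",
    "data/data_lm_back.pkl",
    "data/data_lm_export.pkl",
    "data/test_data.pkl",
    "data/test_data_bwd.pkl",
    "data/clean/train.csv",
    "data/clean/train_val.csv",
    "data/clean/unsup/unsup.csv",
    "tmp/spm.model",
    "tmp/spm.vocab",
    "train_ulmfit.ipynb",
    "train_baseline.ipynb",
]


def missing_files(root):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the directory where the archive was unpacked.
    missing = missing_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected files are present.")
```

An empty result only means the files exist; it does not validate their contents.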

Reproducing Results

Download the pretrained language model and place it in a model directory at the root
Run train_ulmfit.ipynb
Run train_baseline.ipynb
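
The two notebook steps above can also be run without opening the Jupyter interface. The sketch below is one possible way to do so, not the procedure we used for the paper: it assumes that the jupyter command with nbconvert is installed and that the pretrained language model is already in place. The helper names nbconvert_command and run_notebooks are hypothetical.

```python
import subprocess

# Notebooks in the order given in the steps above.
NOTEBOOKS = ["train_ulmfit.ipynb", "train_baseline.ipynb"]


def nbconvert_command(notebook):
    """Build a command that executes a notebook in place with nbconvert."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        # Training can take long, so disable the per-cell timeout.
        "--ExecutePreprocessor.timeout=-1",
        notebook,
    ]


def run_notebooks(run=False):
    """Print each command; execute it only when run=True. Returns the commands."""
    commands = [nbconvert_command(nb) for nb in NOTEBOOKS]
    for cmd in commands:
        print(" ".join(cmd))
        if run:
            # check=True stops at the first notebook that fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default; call run_notebooks(run=True) to actually execute.
    run_notebooks(run=False)
```

The dry-run default is deliberate: executing train_ulmfit.ipynb trains the ULMFiT models, so it is worth checking the printed commands first.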

Update (27/05/20)

The pre-trained language model used in this work was not originally released with its tokenizer model and vocabulary data, so our fine-tuned model and classifier were not able to leverage subword embeddings trained on general-domain Portuguese data.
This has since been amended, so we re-ran all experiments using the pre-trained vocabulary data.
We present the new results and resources in the links below:

new results;
updated source code to train ULMFiT, in two formats: Jupyter notebook and plain Python;
updated trained models (1.7GB);
new pre-processed data.

teodecampos

Last modified: Friday July 24 10:58 -03 2020