Inferring the source of official texts: can SVM beat ULMFiT?

This page holds the dataset and source code described in the paper below:

Pedro H. Luz de Araujo, Teófilo E. de Campos, Marcelo M. Silva de Sousa.
Inferring the source of official texts: can SVM beat ULMFiT?
International Conference on the Computational Processing of Portuguese (PROPOR), March 2-4, Évora, Portugal, 2020.
Download: [ paper | bib | code and dataset ]

See also the new results and the pre-trained model at the bottom of this page.

We kindly request that users cite the paper above in any publication generated as a result of the use of our code or dataset.

Presentations

Resources

We provide the data splits as CSV files, the preprocessed data described in the paper as pickle files, the SentencePiece tokenizer model and vocabulary, and the Jupyter notebooks used. The directory structure and individual files are described below.
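For convenience, the sketch below shows one way these files might be loaded in Python. The file names used here (train.csv, valid.csv, test.csv, preprocessed.pkl, tokenizer.model) are placeholders for illustration; the actual names are given in the directory structure section below.

import pickle

import pandas as pd
import sentencepiece as spm

# Data splits released as CSV files (file names assumed for illustration)
train_df = pd.read_csv("train.csv")
valid_df = pd.read_csv("valid.csv")
test_df = pd.read_csv("test.csv")

# Preprocessed data released as a pickle file (file name assumed)
with open("preprocessed.pkl", "rb") as f:
    preprocessed = pickle.load(f)

# SentencePiece tokenizer model; the vocabulary file is released alongside it
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")
print(sp.encode_as_pieces("Exemplo de texto oficial."))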

Unfortunately, we are not able to upload the trained model files used in the paper, but the "Update" section at the end of this page provides the resources needed to reproduce our latest results.

Note: this project is also listed at Papers With Code, where you can access any future updates to this project's resources and see the latest results on our benchmark.

Requirements

Directory Structure

The code and dataset can be obtained by clicking here.

Reproducing Results

Update (27/05/20)

The pre-trained language model used in this work was not originally released with its tokenizer model and vocabulary data, so our fine-tuned model and classifier could not leverage subword embeddings trained on general-domain Portuguese data.
This has since been amended, so we re-ran all experiments using the pre-trained vocabulary.
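As an illustration, the released SentencePiece model can be inspected with the sentencepiece library to confirm that the same general-domain subword vocabulary is reused during fine-tuning; the file name below is an assumption, not the actual released name.

import sentencepiece as spm

# Assumed file name of the SentencePiece model released with the pre-trained LM
sp = spm.SentencePieceProcessor()
sp.load("pretrained_lm.model")

# The vocabulary size and the subword pieces should match those used when
# fine-tuning the language model and training the classifier.
print("vocabulary size:", sp.get_piece_size())
print(sp.encode_as_pieces("Texto oficial do Distrito Federal."))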
We present the new results and resources in the links below:


teodecampos
Last modified: Friday July 24 10:58 -03 2020