This page holds the dataset and source code described in the paper below:
Pedro H. Luz de Araujo, Teófilo E. de Campos, Marcelo M. Silva de Sousa.
Inferring the source of official texts: can SVM beat ULMFiT?
International Conference on the Computational Processing of Portuguese (PROPOR), March 2-4, Évora, Portugal, 2020.
Download: [ paper | bib | code and dataset ]
See also the new results and pre-trained model at the bottom of this page.
We kindly request that users cite the paper above in any publication that results from the use of our code or our dataset.
We provide the data splits as CSV files, the data after the preprocessing described in the paper as pickle files, the SentencePiece tokenizer model and vocabulary, and the Jupyter notebooks used. We describe the directory structure and individual files below.
Unfortunately, we are not able to upload the trained model files used in the paper, but the "update" section at the end of this page has the resources needed to generate our latest results.
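As a rough illustration of how the three kinds of released files (CSV splits, pickled preprocessed data, and the SentencePiece model) might be read in Python, here is a minimal sketch. The function names are hypothetical and any paths passed to them are placeholders, not the actual file names in the release; the sketch also assumes that each CSV file has a header row and that the `sentencepiece` package is installed if the tokenizer is needed.

```python
import csv
import pickle


def load_split(csv_path):
    """Read one data split into a list of dicts keyed by the CSV header.

    Assumption: the file has a header row; column names are taken as found.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def load_preprocessed(pickle_path):
    """Load a preprocessed object from a pickle file.

    Only unpickle files obtained from a source you trust, since
    unpickling can execute arbitrary code.
    """
    with open(pickle_path, "rb") as f:
        return pickle.load(f)


def load_tokenizer(model_path):
    """Load a SentencePiece model file.

    The import is local so the CSV and pickle helpers work even when
    the sentencepiece package is not installed.
    """
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=str(model_path))
```

For example, `load_split("train.csv")` would return one dictionary per row, where `"train.csv"` stands in for whichever split file you downloaded.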
Note: this project is also listed at Papers With Code, where you can access any future updates to the resources of this project and see the latest results on our benchmark.
The pre-trained language model used in this work was not originally released with its tokenizer model and vocabulary data, so our fine-tuned model and classifier were not able to leverage subword embeddings trained on general-domain Portuguese data. This has since been amended, so we re-ran all experiments using the pre-trained vocabulary data.
We present the new results and resources at the links below: