Victor: a dataset for Brazilian legal documents classification

This page holds the source code and data described in the papers below:

Pedro H. Luz de Araujo, Ana Paula G. S. de Almeida, Fabricio Ataides Braz, Nilton Correia da Silva, Flavio de Barros Vidal, Teófilo E. de Campos
Sequence-aware multimodal page classification of Brazilian legal documents
International Journal on Document Analysis and Recognition (IJDAR), July 2022.
[ view it at Springer-Nature | DOI 10.1007/s10032-022-00406-7 | arXiv preprint ]

Pedro H. Luz de Araujo, Teófilo E. de Campos, Fabricio Ataides Braz, Nilton Correia da Silva
Victor: a dataset for Brazilian legal documents classification
Language Resources and Evaluation Conference (LREC), Marseille, France, May 2020.
[ pdf | bib ]

We kindly ask users to cite our papers in any publication that results from the use of our code or dataset.

Relevant links:

Dataset

Please follow this link to fill in a consent form and get our dataset: http://ailab.unb.br/victor/lrec2020

Once you fill in the form, you will first get a confirmation email with the information you provided. Up to 48 hours later, you will get another email with the URL from which you can download the data.

We make available the Medium (MVic) and Small (SVic) versions of Victor. Unfortunately, we are currently unable to distribute Big Victor (BVic).

These are the sizes of each part of the dataset (uncompressed):

SmallVictor text (CSV files)
- train: 216.7 MB
- val: 137.8 MB
- test: 139.0 MB
SmallVictor images (PDF files)
- train: 5.4 GB
- val: 3.6 GB
- test: 3.5 GB
MediumVictor text (CSV files)
- train: 2.0 GB
- val: 422.6 MB
- test: 418.7 MB
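
Since some of the CSV files above are in the gigabyte range, it may be convenient to stream them row by row instead of loading them into memory at once. The following is a minimal sketch using only the Python standard library. The column name passed to `count_labels` is an assumption made for illustration; check the header of the downloaded files for the actual column names.

```python
import csv
from collections import Counter

# Text fields holding full page contents can exceed the default csv field limit.
csv.field_size_limit(10**9)


def count_labels(csv_path, label_column):
    """Stream a CSV file row by row and count the values found in one column."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            counts[row[label_column]] += 1
    return counts
```

For example, `count_labels("train.csv", "document_type")` would return the number of rows per label, where both the file name and the column name are hypothetical placeholders.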

Source code

Our code was developed in Python 3, using TensorFlow and Keras. Below are resources that should enable users to reproduce our results.

Requirements

Source files

shallow_clf_docType.ipynb: notebook to train the shallow classifiers for document type prediction
baseline_clf_themes.ipynb: notebook to train baseline classifiers for theme prediction
dataset_statistics.ipynb: notebook to compute dataset statistics
get_preds.py: script to compute and save model predictions (to use in the CRF experiments)
crf_experiments.ipynb: notebook for CRF post-processing for document type classification
train_cnn.py: script to train CNN for document type classification
train_lstm.py: script to train LSTM for document type classification
train_xgboost_themes.py: script to train XGBoost for theme classification
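
To give an idea of how sequence-aware post-processing of per-page predictions can work, the sketch below applies Viterbi decoding, the inference step of a linear-chain model, to per-page label scores. It is not the code in crf_experiments.ipynb: a CRF learns its parameters from data, whereas here the transition scores are fixed by hand, and all names and values are assumptions made for illustration.

```python
import math


def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence for the pages of one document.

    emissions[t][k]: log-score of label k for page t (e.g. log of a classifier probability).
    transitions[j][k]: log-score of label j on one page being followed by label k on the next.
    """
    n_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for page in emissions[1:]:
        new_scores, pointers = [], []
        for k in range(n_labels):
            # Best previous label for ending at label k on the current page.
            best_prev = max(range(n_labels), key=lambda j: scores[j] + transitions[j][k])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][k] + page[k])
        scores = new_scores
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    label = max(range(n_labels), key=lambda k: scores[k])
    path = [label]
    for pointers in reversed(backpointers):
        label = pointers[label]
        path.append(label)
    return path[::-1]


# Hypothetical two-label example: the middle page is weakly predicted as label 1.
page_probs = [[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]
stay, switch = math.log(0.9), math.log(0.1)
emissions = [[math.log(p) for p in page] for page in page_probs]
transitions = [[stay, switch], [switch, stay]]
smoothed = viterbi(emissions, transitions)
```

In this toy example, taking the most probable label for each page independently gives the sequence 0, 1, 0, while the decoded sequence keeps label 0 for all three pages, because switching labels between consecutive pages is penalised.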

teodecampos

Last modified: 11 November 2022