Victor: a dataset for Brazilian legal documents classification

This page holds the source code and data described in the papers below:

Pedro H. Luz de Araujo, Ana Paula G. S. de Almeida, Fabricio Ataides Braz, Nilton Correia da Silva, Flavio de Barros Vidal, Teófilo E. de Campos
Sequence-aware multimodal page classification of Brazilian legal documents
International Journal on Document Analysis and Recognition (IJDAR), July 2022.
[ view it at Springer-Nature | DOI 10.1007/s10032-022-00406-7 | arXiv preprint ]

Pedro H. Luz de Araujo, Teófilo E. de Campos, Fabricio Ataides Braz, Nilton Correia da Silva
Victor: a dataset for Brazilian legal documents classification
Language Resources and Evaluation Conference (LREC), Marseille, France, May 2020.
[ pdf | bib ]

We kindly ask users to cite our papers in any publication that results from the use of our code or dataset.

Relevant links:

Dataset

Please follow this link to fill in a consent form and get our dataset: http://ailab.unb.br/victor/lrec2020

Once you fill in the form, you will first receive a confirmation email with the information you provided. Up to 48 hours later, you will receive a second email with the URL from which you can download the data.

We make available the Medium (MVic) and Small (SVic) versions of Victor. Unfortunately, we are currently unable to distribute Big Victor (BVic).

These are the sizes of each part of the dataset (uncompressed):

Source code

Our code was developed in Python 3, using TensorFlow and Keras. Below are resources that should enable users to reproduce our results.
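Before running the code, it may help to confirm that the interpreter and libraries named above are available. The snippet below is a minimal sketch, not part of the released code: the function name check_environment is hypothetical, and it assumes the libraries are importable under the package names tensorflow and keras. It only checks presence and does not check versions.

```python
import importlib.util
import sys


def check_environment(packages=("tensorflow", "keras")):
    """Return a dict mapping each requirement to True (found) or False (missing).

    Hypothetical helper for illustration; the package names are assumptions.
    """
    # The code is stated to target Python 3, so check the major version only.
    status = {"python3": sys.version_info.major == 3}
    for name in packages:
        # find_spec returns None when a top-level package is not installed,
        # so nothing is actually imported here.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Running the script prints one line per requirement; any entry marked MISSING would need to be installed before the experiments can be reproduced.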

Requirements

Source files


teodecampos
Last modified: 11 November 2022