The Chars74K dataset

Character Recognition in Natural Images

Note: this page branched off from the original page at the University of Surrey, which is on an old server behind robot blockers, so I have copied it here.

Character recognition is a classic pattern recognition problem on which researchers have worked since the early days of computer vision. With today's omnipresence of cameras, the applications of automatic character recognition are broader than ever. For Latin script, this is largely considered a solved problem in constrained situations, such as images of scanned documents containing common character fonts and uniform background. However, images obtained with popular cameras and handheld devices still pose a formidable challenge for character recognition. The challenging aspects of this problem are evident in this dataset.

In this dataset, symbols used in both English and Kannada are available.

In the English language, Latin script (excluding accents) and Hindu-Arabic numerals are used. For simplicity we call this the "English" character set. Our dataset consists of:

62 classes (0-9, A-Z, a-z)
7705 characters obtained from natural images
3410 hand drawn characters using a tablet PC
62992 synthesised characters from computer fonts

This gives a total of over 74K images (which explains the name of the dataset).
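
As a quick sanity check of the figures above, the following minimal Python sketch adds up the three quoted counts and counts the classes implied by the ranges 0-9, A-Z and a-z. The variable names are only illustrative.

```python
import string

# Counts as quoted in the list above.
natural = 7705
hand_drawn = 3410
synthesised = 62992

total = natural + hand_drawn + synthesised

# Classes implied by the ranges 0-9, A-Z, a-z.
classes = string.digits + string.ascii_uppercase + string.ascii_lowercase
num_classes = len(classes)

# Hand-drawn samples per class, if they are spread evenly over the classes.
hand_drawn_per_class = hand_drawn // num_classes

print(total)                 # 74107
print(num_classes)           # 62
print(hand_drawn_per_class)  # 55
```

The total of 74107 is where "74K" comes from, and 55 agrees with the number of samples per class quoted for the hand-drawn set in the download section.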

The compound symbols of Kannada were treated as individual classes, meaning that a combination of a consonant and a vowel leads to a third class in our dataset. Clearly this is not the ideal representation for this type of script, as it leads to a very large number of classes. However, we decided to use this representation for the baseline evaluations presented in [deCampos et al] as a way to evaluate a generic recognition method for this problem.

Reference

The following paper gives further descriptions of this dataset and baseline evaluations using a bag-of-visual-words approach with several feature extraction methods and their combination using multiple kernel learning:

T. E. de Campos, B. R. Babu and M. Varma. Character recognition in natural images. In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, February 2009.

Follow this link for a list of publications that have cited the above paper and this link for papers that mention this dataset.

Download

Disclaimer: by downloading and using the datasets below (or part of them) you agree to acknowledge their source and cite the above paper in related publications. We will be grateful if you contact us to let us know about your usage of our datasets.

Images of individual characters: the files below contain directory trees of each dataset of individual characters. In these trees, there is one directory per class of character. Each character sample appears in an individual PNG image. There is a large variation in scale, as we kept the original resolution of the characters as they appear in the original images.

English, 62 classes (0-9, A-Z, a-z)

EnglishImg.tgz (127.9 MB) [sample characters]: segmented characters from natural scenes. For each character, a binary segmentation mask file is also provided.
EnglishHnd.tgz (13.0 MB) [sample characters]: hand-drawn characters. 55 samples per class. The pen stroke trajectories are also provided, so this dataset can also be used to evaluate on-line handwritten character recognition methods.
EnglishFnt.tgz (51.1 MB): characters from computer fonts with 4 variations (combinations of italic, bold and normal).

Kannada (657+ classes)

KannadaImg.tgz (105.8 MB) [sample characters]: segmented characters from natural scenes. A copy of this dataset is available at the mldata.org repository, doi:10.5881/CHARS74K-KANNADA-IMG
KannadaHnd.tgz (125.8 MB) [sample characters]: hand-drawn characters. A copy of this dataset is available at the mldata.org repository, doi:10.5881/CHARS74K-KANNADA-HND
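
The layout described above (one directory per class, one PNG image per sample) can be indexed with a few lines of Python. This is only a sketch under assumptions: the function name index_character_tree is made up for illustration, the directory and file names in the demo are placeholders rather than the names used inside the archives, and root is expected to be the directory that directly contains the class directories.

```python
import tempfile
from collections import Counter
from pathlib import Path


def index_character_tree(root):
    """Return (class_directory_name, png_path) pairs found under root.

    Assumes root directly contains one sub-directory per class and that
    each sub-directory holds one PNG file per character sample.
    """
    samples = []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for png in sorted(class_dir.glob("*.png")):
            samples.append((class_dir.name, png))
    return samples


# Demo on a throwaway tree with placeholder names (not the real archive).
with tempfile.TemporaryDirectory() as tmp:
    for class_name, n_samples in (("class_a", 2), ("class_b", 3)):
        class_dir = Path(tmp) / class_name
        class_dir.mkdir()
        for i in range(n_samples):
            (class_dir / f"sample_{i}.png").touch()
    demo_samples = index_character_tree(tmp)
    demo_counts = Counter(name for name, _ in demo_samples)

print(dict(demo_counts))  # {'class_a': 2, 'class_b': 3}
```

Counting samples per class directory in this way is a cheap check that an archive was unpacked completely before running any experiment.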

Lists.tgz (7.3 MB): lists of files used for training and testing in our experiments (in MatLab data file format ".MAT"). Please use these splits in order to make fair comparisons with the results published in the paper above.
Each file has a data structure "list" with these elements:
- ALLlabels: class label for each sample
- ALLnames: sub-directory and name of the image for each sample
- classlabels: set of labels (classes) in this dataset, coded numerically, e.g. 10=A, 11=B, ..., 64=z
- classnames: strings of the directories where samples of each class are stored
- NUMclasses: total number of classes in this dataset
- TRNind: indexes of the training samples. If 20 splits are used, this is a matrix of N_train_samples X 20
- TSTind: indexes of the test samples. If 20 splits are used, this is a matrix of N_test_samples X 20
- VALind: indexes of the validation samples. If 20 splits are used, this is a matrix of N_validation_samples X 20
- TXNind: indexes of the texton samples, i.e., samples used to build the vocabulary with the bag-of-visual-words method. If 20 splits are used, this is a matrix of N_texton_samples X 20
ListsTXT.tgz (2.8 MB): Same as above, but in "M" format, i.e., you can load all the data by running the M-files in this TGZ package (in MatLab). For those who don't have MatLab: these files are human-readable ASCII files with all the lists.
FullImagesAndAnnotations_Frontal.tgz (739.7 MB) [sample images]: original images and TXT files which give the coordinates, bounding polygons and labels of characters that appear in each image. Many whole words have also been annotated. These annotations may be used to evaluate character/word detection methods, but not all the words that appear in the images have been annotated.
Maps.tgz (1.3 MB): inverse maps, which point each sample character in EnglishImg.tgz (or KannadaImg.tgz) to the original full image in FullImagesAndAnnotations_frontal.tgz. This file also contains maps from each character to its class number. For instance, the file Maps/Kannada/Img/map.mat contains a MatLab cell array of 990 elements. Each cell contains a Kannada character (in Unicode) and its position in the array is the class number. So, given a Kannada character in a variable called input, you can find its class number with class_number = find(strcmp(input, map),1);
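
For readers working outside MatLab, the class-number lookup described for map.mat can be sketched in Python as follows. This assumes the cell array has already been read into a plain Python list of strings (reading the .mat file itself is not shown). The function name class_number and the three-character demo list are placeholders, not the real 990-element map or its order.

```python
def class_number(character, class_map):
    """Return the 1-based position of character in class_map, or None.

    Mirrors find(strcmp(input, map), 1): the class number is the position
    of the first matching cell, counted from 1 as in MatLab.
    """
    try:
        return class_map.index(character) + 1
    except ValueError:
        return None


# Placeholder map with three Kannada vowels; the real map order may differ.
demo_map = ["ಅ", "ಆ", "ಇ"]

found = class_number("ಆ", demo_map)
missing = class_number("x", demo_map)

print(found)    # 2
print(missing)  # None
```

Returning None for an unknown character plays the role of the empty result that find gives in MatLab when nothing matches.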

Sample usage

For experiments with Chars74K-15, i.e., train with 15 samples per class and test with another 15 samples per class for the images of "English" characters in the wild:

Download and unpack EnglishImg.tgz and ListsTXT.tgz
In Octave (or MatLab), run list_English_Img to load the lists into memory.
The training set for this particular experiment will be indexed by list.TRNind(:,end) (so please copy the result of this to a variable), i.e., it is defined by the last split of training data. Note that sum(list.TRNind(:,end)>0)/list.NUMclasses results in 15, confirming that there are 15 training samples per class.
The test set will be defined by list.TSTind(:,end) (Note that here the index end can be replaced by any valid number because the columns of list.TSTind are repetitions of the same list. This is because we have fixed the test set for all the experiments with 15 samples per class.)
The class labels (ground truth) are obtained with list.ALLlabels(list.TRNind(:,end)) for the training set and list.ALLlabels(list.TSTind(:,end)) for the test set. Again, note that for each class label, there should be 15 samples, i.e., sum(list.ALLlabels(list.TSTind(:,end))==x) results in 15 for x=1:list.NUMclasses
You can then proceed with your experiments by selecting training files following this: list.ALLnames(list.TRNind(c,end),:) for training and list.ALLnames(list.TSTind(c,end),:) for testing, where c is the index that iterates through the samples (from 1 to 930 in the case of this experiment, but it can be bound by class labels, depending on how you implement your experiment). Note that list.ALLnames includes neither the file extension (png) nor the absolute path of the images; you need to add these to the string yourself.

Note that the above can also be done for the validation and vocabulary lists if they are used in the same experiment (list.VALind and list.TXNind)
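
The steps above can also be sketched in Python on stand-in data. The following is a minimal sketch, not the code used in the paper. It assumes the lists have already been loaded into a dictionary with the same field names, that index matrices are stored as lists of rows holding 1-based sample indexes, and that 0 marks unused padding (inferred from the >0 test in the steps above). The helper names select_split and image_path, the toy values and the root path /data/chars are all hypothetical.

```python
def select_split(ind_matrix, split=-1):
    """Return 0-based sample indexes for one column of an index matrix.

    ind_matrix mirrors list.TRNind or list.TSTind: one row per sample slot,
    one column per split. Entries are assumed to be 1-based indexes, with 0
    used as padding. split=-1 picks the last column, like (:,end).
    """
    column = [row[split] for row in ind_matrix]
    return [i - 1 for i in column if i > 0]


def image_path(root, name, extension=".png"):
    """Build a full file name; ALLnames lacks both the path and the extension."""
    return f"{root.rstrip('/')}/{name.strip()}{extension}"


# Toy stand-in for the real lists: 2 classes, 6 samples, 2 splits.
toy = {
    "ALLlabels": [1, 1, 1, 2, 2, 2],
    "ALLnames": ["dirA/img1", "dirA/img2", "dirA/img3",
                 "dirB/img4", "dirB/img5", "dirB/img6"],
    "TRNind": [[1, 2], [4, 5], [0, 0]],  # last row is padding
    "TSTind": [[3, 3], [6, 6]],          # same test set in every column
    "NUMclasses": 2,
}

train_idx = select_split(toy["TRNind"])
test_idx = select_split(toy["TSTind"])

train_labels = [toy["ALLlabels"][i] for i in train_idx]
test_labels = [toy["ALLlabels"][i] for i in test_idx]
train_files = [image_path("/data/chars", toy["ALLnames"][i]) for i in train_idx]
train_per_class = len(train_idx) // toy["NUMclasses"]

print(train_labels)     # [1, 2]
print(train_files)      # ['/data/chars/dirA/img2.png', '/data/chars/dirB/img5.png']
print(train_per_class)  # 1
```

The main thing to watch when leaving MatLab is the shift from 1-based to 0-based indexing, which select_split handles in one place. With the real Chars74K-15 lists, the per-class count computed in the last line should come out as 15 rather than 1.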

Sample characters

(In addition to the ones shown following the links above)

Credits and Acknowledgements

This dataset was created, and the experiments presented in the paper were carried out, at Microsoft Research India by T de Campos, with mentoring support from M Varma. Additional SVM and MKL experiments were performed by BR Babu.

We would like to acknowledge the help of several volunteers who annotated this dataset. In particular, we would like to thank Arun, Kavya, Ranjeetha, Riaz and Yuvraj. We would also like to thank Richa Singh and Gopal Srinivasa for developing some of the tools for annotation (one of the tools used is described here). We are grateful to CV Jawahar for helpful discussions.

We thank CVSSP/Surrey for hosting this web page. T de Campos also thanks Xerox RCE for its support while he finalized the paper.

Main contact: T.deCampos
Last modified: Mon Oct 15 10:47:12 BST 2012