Tesseract-Indic-OCR: My old training methodology

The principle behind this is simple. Tesseract needs two things to train itself: 1) an image of the character and 2) the name of the character. This information is supplied through "box files". A box file contains the co-ordinates of the bounding boxes around characters, with labels saying what those characters are. The traditional method of training the engine is to take a scanned image, meticulously create a box file using a tool such as tesseractrainer.py, edit the box file by hand, and then repeat the same steps for several other images and fonts. This process was tedious enough to force me to seek new methods.
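To make the box file idea concrete, here is a minimal sketch of reading one in Python. It assumes the usual layout of one character per line: the label followed by the left, bottom, right and top co-ordinates, measured from the bottom-left corner of the image. Newer Tesseract versions append a page number, which this sketch simply ignores. The names `Box` and `parse_box_file` and the sample co-ordinates are made up for illustration and are not part of any tool mentioned here.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One labelled bounding box from a box file (hypothetical helper type)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_file(text: str) -> List[Box]:
    """Parse box-file text into Box records, one per usable line."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        # Fields after the fifth (e.g. a page number) are ignored.
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


# Invented sample data: two characters with made-up co-ordinates.
sample = "ক 12 20 48 61\nখ 55 20 92 63\n"
for box in parse_box_file(sample):
    # Print the label with the width and height of its box.
    print(box.char, box.right - box.left, box.top - box.bottom)
```

Because the origin is at the bottom-left, `top` is always the larger of the two vertical values, which is the opposite of how most image libraries count rows.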

Now let's do a little reverse engineering. What if we could take a list of characters in a text file, "generate" an image out of those characters, store the co-ordinates of the bounding boxes of the generated character images in a file, and then feed both the image and the co-ordinates to the OCR engine? It would work, right?
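The bookkeeping side of that idea can be sketched as follows. This is not the code from the tool linked below. It only shows how box lines could be produced if the size of each rendered glyph is already known. The function `layout_boxes`, the `sizes` table and the pixel values are all hypothetical; a real tool would get the glyph sizes from whatever library it uses to draw the text.

```python
from typing import Dict, List, Tuple


def layout_boxes(chars: str,
                 glyph_sizes: Dict[str, Tuple[int, int]],
                 image_height: int,
                 padding: int = 10) -> List[str]:
    """Place glyphs left to right, top-aligned, and return box-file lines.

    glyph_sizes maps each character to its (width, height) in pixels.
    """
    lines = []
    x = padding
    for ch in chars:
        width, height = glyph_sizes[ch]
        left, right = x, x + width
        # Image rows are counted from the top of the image...
        top_row = padding
        bottom_row = padding + height
        # ...but box files count from the bottom, so flip the vertical axis.
        box_bottom = image_height - bottom_row
        box_top = image_height - top_row
        lines.append(f"{ch} {left} {box_bottom} {right} {box_top}")
        x = right + padding  # leave a gap before the next glyph
    return lines


# Made-up glyph sizes for two characters on a 60-pixel-high strip.
sizes = {"ক": (36, 40), "খ": (37, 42)}
for line in layout_boxes("কখ", sizes, image_height=60):
    print(line)
# prints:
# ক 10 10 46 50
# খ 56 8 93 50
```

Since the program decides where every glyph goes, the box co-ordinates come for free and nobody has to draw or correct boxes by hand.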

Links:

http://tesseractindic.googlecode.com/files/tesseract_trainer.beta.tar.gz - The tar ball itself

http://code.google.com/p/tesseractindic/source/browse/trunk/tesseract_trainer/readme - The readme file

http://www.youtube.com/watch?v=vuuVwm5ZjkI - YouTube video of the tool working for Bengali

But there are problems. Tesseract-OCR has its quirks.

Tesseract wants each bounding box to enclose a single "blob" only. A blob is a wholly connected component. So ক is one blob, and ক খ are two blobs. There are cases where a consonant + vowel sign generates two blobs; for example, the 3 images below have multiple blobs:

And hence Tesseract throws a "FATALITY" error during training.
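A rough way to check a generated glyph before handing it to Tesseract is to count its connected components yourself. The sketch below does this for a small black-and-white bitmap given as rows of 0s and 1s. It assumes 8-connectivity (diagonal neighbours count as touching), which is my assumption and not necessarily how Tesseract finds its blobs. The function name `count_blobs` and the tiny bitmaps are invented for illustration.

```python
from collections import deque
from typing import List


def count_blobs(bitmap: List[List[int]]) -> int:
    """Count 8-connected groups of ink pixels (1s) in a list of rows."""
    rows = len(bitmap)
    cols = len(bitmap[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            # Found an unvisited ink pixel: flood-fill everything joined to it.
            blobs += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and bitmap[nr][nc] == 1 and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
    return blobs


# Toy bitmaps: the first is one connected shape, the second has a detached stroke.
one_blob = [
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]
two_blobs = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
]
print(count_blobs(one_blob), count_blobs(two_blobs))  # prints: 1 2
```

A glyph image that reports more than one blob is a candidate for the kind of training error described above.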

So I had to change my approach a little. Obviously there has to be some feedback mechanism, where I parse the output of Tesseract during training to see whether a particular set of characters threw errors. Once I know which ones they are, I can separate them and train them later. To accomplish this, I switched from generating a strip of character images to generating just one image per character, so that I can pinpoint the problems better.
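The feedback step could look something like this sketch. It assumes the training output for each single-character image has already been captured as text, and it only checks for the word "FATALITY" quoted above. The exact wording of Tesseract's messages is not reproduced here, so the log strings are placeholders, and the names `split_by_training_result`, `logs` and `glyph_001` and so on are hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_training_result(
        logs: Dict[str, str],
        error_marker: str = "FATALITY") -> Tuple[List[str], List[str]]:
    """Split image names into (clean, failed) based on captured trainer output.

    logs maps each single-character image name to the text that was printed
    while training on that image.
    """
    clean, failed = [], []
    for name, output in logs.items():
        if error_marker in output:
            failed.append(name)  # set aside to be trained separately later
        else:
            clean.append(name)
    return clean, failed


# Placeholder log text, not real Tesseract output.
logs = {
    "glyph_001": "",
    "glyph_002": "FATALITY (placeholder message)",
    "glyph_003": "",
}
clean, failed = split_by_training_result(logs)
print(clean)   # prints: ['glyph_001', 'glyph_003']
print(failed)  # prints: ['glyph_002']
```

With one image per character, a failure maps back to exactly one character, which is the point of giving up the strip layout.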

The downside: too many images get generated. To train a simple font, it generates 405 images + 405 box files + 405 tr files, and all this before I have even included conjuncts. It is not much of a problem though, since the generated images are no longer required once the training files have been produced.

Well, this leads me to new challenges. I remember Prof B.B. Choudhury saying that training all the conjuncts will kill any recogniser, i.e., it will work very slowly while recognising. He also told me some cool ways to get past that. I may have to implement them. Let's see.

Friday, April 17, 2009