Friday, April 17, 2009

My training methodology does not work :(

As much as I hate to admit it, my training methodology of generating one image per akshar does not work. I hate to say it because I put some effort into writing the Python code that does this.
The reason is probably that the Tesseract OCR training code expects characters laid out on a single text line, since it also extracts baseline metrics during training (for rare/strange characters like numerals). With one isolated character per image, it may not be able to extract all the information it needs.
Or maybe the Tesseract OCR training code accepts only a small number of .tr files, and since my code generates thousands of .tr files, the training becomes useless.
Let me show you an example of how miserably it failed.
I decided to test the training on the string " ভারত মাতা " (Bharat Mata, which means Mother India). I generated the TIFF image using Pango rendering.
Then I generated 7 sample images for each of ভ র ত ম and used the resulting training files for OCR.
The result was this: " মভতভ ভভভভ "
Yes, I know. The result is absolutely outrageous.
However, what if I still autogenerate the character images, but this time place them side by side on a single line? Will it work?
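
To make the single-line idea concrete, here is a rough sketch of the layout step only: given the pixel size of each rendered character, it works out where each one would sit on a shared line and prints box-file style entries. Everything here is an assumption for illustration. The function names layout_line and to_box_lines, the padding and margin values, and the glyph sizes are all made up, and the "char left bottom right top" entry format with a bottom-left origin is my understanding of the Tesseract box file layout, not something I have verified against the training tools. It also puts every glyph on the same bottom edge, which ignores descenders.

```python
def layout_line(glyphs, padding=4, margin=2):
    """Place glyphs left to right on one line.

    glyphs: list of (label, width, height) tuples in pixels (hypothetical inputs,
    e.g. measured from images rendered elsewhere).
    Returns ((image_width, image_height), boxes) where each box is
    (label, left, bottom, right, top) with the origin at the bottom-left.
    """
    line_height = max(h for _, _, h in glyphs)
    image_height = line_height + 2 * margin
    x = margin
    boxes = []
    for label, w, h in glyphs:
        bottom = margin  # simplification: all glyphs rest on the same bottom edge
        boxes.append((label, x, bottom, x + w, bottom + h))
        x += w + padding  # leave a gap before the next glyph
    image_width = x - padding + margin  # drop the trailing gap, add right margin
    return (image_width, image_height), boxes


def to_box_lines(boxes):
    """Format boxes as 'char left bottom right top' lines (assumed format)."""
    return ["%s %d %d %d %d" % box for box in boxes]


if __name__ == "__main__":
    # Made-up glyph sizes, purely for demonstration.
    size, boxes = layout_line([("ভ", 18, 22), ("র", 14, 22), ("ত", 16, 22), ("ম", 17, 22)])
    print(size)
    print("\n".join(to_box_lines(boxes)))
```

The actual pasting of the rendered glyph images into one TIFF at those positions is left out on purpose, since that depends on the rendering library being used.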
