The reason is probably that the Tesseract OCR training code looks for characters on a single line during training, because it also extracts baseline metrics for rare or unusual characters such as numerals. As a result, it may not be able to extract all the information it needs from my images.
Or maybe the Tesseract OCR training code accepts only a small number of .tr files, and since my code generates thousands of them, the training becomes useless.
Let me show you an example of how miserably it failed.
I decided to test the training on the string "ভারত মাতা" (Bharat Mata, which means Mother India). I generated the TIFF image using Pango rendering.
Then I generated 7 images per sample of ভ, র, ত and ম.
The result was this: "মভতভ ভভভভ"
Yes, I know. The result is absolutely outrageous.
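To put a rough number on how far off that result is, here is a minimal sketch in plain Python that computes the edit distance between the expected and the recognized string. The function name `levenshtein` and the idea of dividing by the length of the expected string to get a character error rate are my own choices for illustration and are not part of Tesseract. Note that it counts Unicode code points, so a vowel sign like া counts as its own character.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, counted in Unicode code points."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


expected = "ভারত মাতা"     # the string that was rendered
recognized = "মভতভ ভভভভ"  # what the trained engine returned

distance = levenshtein(expected, recognized)
error_rate = distance / len(expected)
print(f"edit distance: {distance}, character error rate: {error_rate:.2f}")
```

The same function can be reused to compare later training attempts against this one.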
However, what if I still autogenerate the character images, but this time place the characters next to each other on a single line? Will it work?
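As a starting point for that experiment, the text for such a line could be assembled like this before it is handed to the renderer. This is only a sketch: `build_training_line` is a hypothetical helper, and the separator and the count of 7 samples per character are assumptions carried over from the earlier attempt. Whether the training code is happier with the result is exactly the open question.

```python
def build_training_line(chars, samples_per_char=7, sep=" "):
    """Return one line of text with each character repeated side by side.

    Hypothetical helper: the resulting string is meant to be rendered
    as a single line instead of one image per character.
    """
    return sep.join(ch for ch in chars for _ in range(samples_per_char))


# The four characters used in the earlier attempt.
line = build_training_line(["ভ", "র", "ত", "ম"])
print(line)
```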