One of the problems in Indic script character classification is the huge number of glyphs. This is mostly due to conjuncts. But a major component of the huge number of glyphs are also symbols formed by consonant+vowel signs. Once a combination of consonant and vowel sign overlaps on a vertical axis, Tesseract has to be trained with that entire symbol. This is because Tesseract does a left-to-right scan of the image and can only box a wholly connected component. Then it proceeds to sub-divide the box, again on a vertical axis, in case it fails to recognise the entire word. For example:
For the image below, the OCR may first box the 2 characters together.

At the next iteration, it will split the box into 2 so that it has a better chance of identifying the characters.

Hence, for a symbol like কু the OCR can not segment ক and ু separately.
There is a hack for this though. What if we rotate the image by 90 degress counter-clockwise?


As you can see, rotating the symbol allows Tesseract to box the vowel separately. We can train the rotated symbols to stand for a particular character.
This will significantly reduce the number of character classes to be trained for Tesseract OCR. I am working on the Python script that does this transformation of the image.
No comments:
Post a Comment