For the image below, the OCR may first box the two characters together.
At the next iteration, it will split the box in two so that it has a better chance of identifying each character.
Hence, for a symbol like কু, the OCR cannot segment ক and ু separately, because the vowel sign sits below the consonant rather than beside it. There is a hack for this, though. What if we rotate the image by 90 degrees counter-clockwise?


As you can see, rotating the symbol allows Tesseract to box the vowel separately. We can then train Tesseract on the rotated symbols so that each one stands for a particular character.
This should significantly reduce the number of character classes to be trained for Tesseract OCR, since consonants and vowel signs can be trained on their own instead of as every consonant-vowel combination. I am working on the Python script that does this transformation of the image.
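
A rough sketch of the idea, not the actual script: the toy example below uses plain Python lists as a stand-in for a binarized image, and a simple blank-column splitter as a stand-in for Tesseract's boxing, which is more involved in practice. The names `rotate_ccw`, `column_segments`, and `glyph` are made up for illustration. It shows that a mark stacked below a body cannot be split by columns, but can be once the grid is rotated 90 degrees counter-clockwise.

```python
def rotate_ccw(grid):
    """Rotate a 2D pixel grid 90 degrees counter-clockwise."""
    # Transpose the grid, then reverse the row order.
    return [list(row) for row in zip(*grid)][::-1]


def column_segments(grid):
    """Return (start, end) column ranges that are separated by blank columns.

    This is a toy splitter, assumed here only to illustrate the effect;
    it is not how Tesseract segments characters internally.
    """
    width = len(grid[0])
    has_ink = [any(row[x] for row in grid) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                        # a run of inked columns begins
        elif not ink and start is not None:
            segments.append((start, x - 1))  # the run ended at the previous column
            start = None
    if start is not None:
        segments.append((start, width - 1))
    return segments


# Hypothetical glyph: a "consonant" body on top, a blank row,
# and a "vowel sign" underneath (1 = ink, 0 = background).
glyph = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]

print(column_segments(glyph))              # [(0, 3)]: one box, parts not separated
print(column_segments(rotate_ccw(glyph)))  # [(0, 2), (4, 4)]: two separate boxes
```

On a real scan the same rotation would be applied to the image file with an imaging library before handing it to Tesseract; the pure-Python grid is used here only to keep the example self-contained.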