One of the problems in Indic script character classification is the huge number of glyphs. Most of these are conjuncts, but a large share are also symbols formed by a consonant plus a vowel sign. When a consonant and a vowel sign overlap along the vertical axis, Tesseract has to be trained on that entire symbol. This is because Tesseract scans the image from left to right and can only box a wholly connected component. If it fails to recognise the entire word, it then sub-divides the box, again along the vertical axis. For example:
For the image below, the OCR may first box the 2 characters together.
[Image: the two characters enclosed in a single box]
At the next iteration, it will split the box into 2 so that it has a better chance of identifying the characters.
[Image: the box split into two, one per character]
Hence, for a symbol like কু, the OCR cannot segment ক and ু separately: the vowel sign sits below the consonant, so no vertical cut can separate the two.
There is a hack for this, though. What if we rotate the image by 90 degrees counter-clockwise?
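To make the limitation concrete, here is a minimal sketch of splitting along the vertical axis using a column projection. It is a simplified stand-in, not Tesseract's actual segmentation code; the function name `column_runs` and the toy bitmaps are hypothetical and exist only for illustration.

```python
def column_runs(bitmap):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(bitmap[0])
    # A column "has ink" if any row has a 1 in that column.
    has_ink = [any(row[x] for row in bitmap) for x in range(width)]
    runs, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                 # a new run of inked columns begins
        elif not ink and start is not None:
            runs.append((start, x))   # a blank column closes the run
            start = None
    if start is not None:
        runs.append((start, width))
    return runs


# Toy bitmaps (assumed data): 1 = ink, 0 = background.
SIDE_BY_SIDE = [          # two characters next to each other
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
STACKED = [               # a "consonant" with a "vowel sign" underneath
    [1, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 1],
]

print(column_runs(SIDE_BY_SIDE))  # [(0, 2), (3, 5)]
print(column_runs(STACKED))       # [(0, 3)]
```

The side-by-side pair yields two column ranges, while the stacked pair yields only one, because every column of the vowel sign is also occupied by the consonant above it.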
[Images: the symbol rotated 90 degrees counter-clockwise, with the vowel sign boxed separately]
As you can see, rotating the symbol allows Tesseract to box the vowel separately. We can train the rotated symbols to stand for a particular character.
This will significantly reduce the number of character classes that Tesseract OCR has to be trained on. I am working on a Python script that does this transformation of the image.
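The idea can be sketched on a toy bitmap using only the standard library. This is an illustration under stated assumptions, not the script mentioned above: `rotate_ccw`, `column_runs`, and the `STACKED` bitmap are hypothetical names and data, and the column projection is a simplified stand-in for Tesseract's real segmentation. The toy also assumes a blank row between the consonant and the vowel sign.

```python
def rotate_ccw(bitmap):
    """Rotate a 2D bitmap 90 degrees counter-clockwise."""
    # Transpose, then reverse the row order: the rightmost column
    # of the input becomes the top row of the output.
    return [list(col) for col in zip(*bitmap)][::-1]


def column_runs(bitmap):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(bitmap[0])
    has_ink = [any(row[x] for row in bitmap) for x in range(width)]
    runs, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, width))
    return runs


# Toy bitmap (assumed data): a "consonant" with a "vowel sign" below it.
STACKED = [
    [1, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 1],
]

rotated = rotate_ccw(STACKED)
print(column_runs(STACKED))  # [(0, 3)]          one box before rotation
print(column_runs(rotated))  # [(0, 2), (3, 4)]  two boxes after rotation
```

After the rotation, the part that was on top ends up on the left and the part that was underneath ends up on the right, so the blank row between them becomes a blank column that a vertical cut can use. On a real image, Pillow's `Image.rotate(90, expand=True)` performs the same counter-clockwise rotation.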