Saturday, November 28, 2009

Conversation with Sayamindu regarding ambiguities

Here is a mail I sent to Sayamindu:
"By the way, one difficult problem I am facing is that all the া are
being mistakenly recognised as । . The dictionary should help in
resolving this, and also there is a file where we can specify
ambiguities like these. But nothing seems to work.
One way to solve the problem is to add the following rule in the
reorder script: We make a pass and replace all instances of । with া .
Then we make another pass and see whether there are any leftover া
with the dotted circle. These should be replaced by । .
Is the logic ok? How to find out if an া has a dotted circle?"

He has not replied yet.

Here is what I think. The change cannot be made in the reorder script alone, since that script runs only in the post-OCR stage. The problem is that the OCR engine itself misrecognises the character, and that throws off the rest of the recognition.
One solution is to not train । (the Bengali equivalent of a full stop) at all. We can always add the । back in the post-OCR script, using the method described in the mail to Sayamindu.
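
A rough sketch of that two-pass idea in Python, as I imagine it could look. This is not the actual reorder script: the function name restore_danda and the rule for deciding when an া "has a dotted circle" are my own assumptions. I treat an া as orphaned when the character before it is not a Bengali consonant (U+0995 to U+09B9, the nukta U+09BC, or the letters U+09DC, U+09DD, U+09DF), since that is when it has no base to attach to.

```python
import re

DANDA = "\u0964"   # । Bengali full stop (danda)
AA_KAR = "\u09BE"  # া Bengali vowel sign AA

# Assumption: these are the characters an aa-kar can attach to.
# Consonants U+0995-U+09B9, the nukta U+09BC, and U+09DC, U+09DD, U+09DF.
BASE = "\u0995-\u09B9\u09BC\u09DC\u09DD\u09DF"

# An aa-kar that is NOT preceded by a base character.
ORPHAN_AA_KAR = re.compile(f"(?<![{BASE}]){AA_KAR}")


def restore_danda(text):
    """Hypothetical post-OCR fix-up for the danda / aa-kar confusion."""
    # Pass 1: treat every danda as a misread aa-kar.
    text = text.replace(DANDA, AA_KAR)
    # Pass 2: an aa-kar with no base before it would be drawn on a
    # dotted circle, so it must really have been a danda.
    return ORPHAN_AA_KAR.sub(DANDA, text)


if __name__ == "__main__":
    # The final aa-kar follows a vowel, so it is turned back into a danda.
    print(restore_danda("\u0986\u09AE\u09BF \u09AD\u09BE\u09A4 \u0996\u09BE\u0987" + AA_KAR))
```

One limitation I can already see: if a sentence ends in a bare consonant, the া that follows it is a perfectly valid combination, so this check cannot tell it apart from a । and leaves it alone. The dictionary would still be needed for those cases.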

