Friday, November 6, 2009

The Problem of Dotted Circles

Certain vowel signs in in Indic scripts have a dotted circle in them. For example : ৈ, ে , া . When these are used in conjunction with consonants however, the dotted circles vanish. For example: কৈ, কে, কা .
This is a problem doing automated training. The python script draws ে and trains the engine to recognise the shape, along with the dotted circle. However, when we OCR a document, the dotted circle is no longer there.
Hence we somehow need a method of automatically eliminating the dotted circles from vowel signs while generating training images. Any ideas?

1 comment:

  1. Well - as per the official specs you can use a ZWJ+vowelsign to get the standalone form, but it does not work with Pango.

    One option would be to use a special version of pango where you turn off the thing altogether. http://git.gnome.org/cgit/pango/tree/modules/indic/indic-ot.c#n273 is the code chunk you need to modify.

    ReplyDelete