Thursday, November 19, 2009

utf-8 = ok?

I have been trying to add wide-character support to the Tesseract code base by converting most char* data types to wchar_t*. However, today I read in depth about UTF-8 encoding here. It says UTF-8 handles Unicode well. Tesseract already supports UTF-8, or so it says.
However, when I print out the dawg file contents, I see garbage for Indic scripts but proper characters for English. Why is this happening?
This makes me think that maybe I am on the wrong track. I did ask the Tesseract list whether I am on the right track or not, but got no useful replies.
In fact, now that I think about it, when we create the dictionaries out of word lists, we are forgetting that we need to introduce the vowel 'de-reordering' rules. Only then will the OCR be able to match words at run time.
Refer to my earlier post, where I mentioned that we need to do vowel reordering after OCR. If you reverse that logic, we need to intentionally include the same anomalies in the dictionary, so that the OCR can match what it sees against the dictionary. Hence the OCR may recognise েক by looking at the dictionary, and we can then use the vowel reordering code to correct this.
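To make the 'de-reordering' idea concrete, here is a rough sketch of how a word list entry could be rewritten before it goes into the dictionary. This is only an illustration under my own assumptions: the function names are hypothetical and not part of Tesseract, the word is assumed to be already decoded into Unicode code points, only the three Bengali vowel signs that are drawn to the left of the consonant are handled, and the two-part vowel signs are ignored.

```cpp
#include <cstddef>
#include <string>

// Bengali vowel signs that are stored AFTER the consonant in Unicode
// (logical order) but drawn to its LEFT (visual order): i, e, ai.
bool isPreBaseVowelSign(char32_t c) {
    return c == 0x09BF || c == 0x09C7 || c == 0x09C8;
}

// Rough consonant test for the Bengali block (assumption: good enough
// for a sketch, it also accepts a few unassigned code points).
bool isBengaliConsonant(char32_t c) {
    return (c >= 0x0995 && c <= 0x09B9) ||
           c == 0x09DC || c == 0x09DD || c == 0x09DF;
}

const char32_t kVirama = 0x09CD;  // hasanta, joins consonants into a conjunct

// Hypothetical helper: convert a word from logical order to the visual
// order in which the glyphs appear on the page, left to right.
std::u32string toVisualOrder(const std::u32string& logical) {
    std::u32string out;
    for (char32_t c : logical) {
        if (isPreBaseVowelSign(c) && !out.empty() &&
            isBengaliConsonant(out.back())) {
            // Find the start of the consonant cluster: C (virama C)*
            std::size_t start = out.size() - 1;
            while (start >= 2 && out[start - 1] == kVirama &&
                   isBengaliConsonant(out[start - 2])) {
                start -= 2;
            }
            // Put the vowel sign in front of the whole cluster.
            out.insert(out.begin() + start, c);
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

With this, a dictionary entry such as কে (ক followed by ে in the file) would be stored as েক, which is the order the OCR sees on the page. The post-OCR vowel reordering code would then apply the inverse mapping.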

What a realisation!!

2 comments:

  1. You do not always need wchar_t, but there are certain things to consider when you are dealing with UTF-8 stuff. For example, you need specialized functions to iterate over characters, and the standard string functions do not work.
    http://library.gnome.org/devel/glib/2.22/glib-Unicode-Manipulation.html is a set of example functions. http://www.joelonsoftware.com/articles/Unicode.html is another article that is good :-)
    No idea if Tesseract actually handles all this stuff - never looked at its code.

  2. We have similar considerations.
    For example, in Telugu,
    you might write swati and see sawti
    as in a symbol for sa, then vowel-less w, then a symbol for ti.
    This is a serious problem for us, and we might have to tweak the dictionary to fix it.

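The first comment's point, that UTF-8 needs specialized functions to iterate over characters, can be illustrated with a minimal decoder. This is a sketch rather than anything taken from Tesseract or GLib: the function names are hypothetical, and it does not reject overlong encodings or surrogate code points. It shows why code that walks a char* one byte at a time works for English (one byte per character) but may show garbage for Indic scripts (three bytes per character).

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with lead byte b,
// or 0 if b cannot start a sequence (e.g. it is a continuation byte).
std::size_t utf8SequenceLength(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx (Bengali, Telugu, ...)
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Decode UTF-8 bytes into code points. Each malformed or truncated byte
// is replaced by U+FFFD. No overlong or surrogate checks (sketch only).
std::u32string decodeUtf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = utf8SequenceLength(lead);
        if (len == 0 || i + len > bytes.size()) {
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (cont & 0x3F);  // append 6 payload bits
        }
        if (!ok) {
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For instance, the four bytes 'a', 0xE0, 0xA6, 0x95 decode to two code points, U+0061 and U+0995 (ক), whereas strlen on the same data would report four "characters".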