I have been trying to add wide-character support to the Tesseract code base by converting most char* to wchar_t* data types. However, I read in depth about UTF-8 encoding today here. It says UTF-8 handles Unicode well, and Tesseract already supports UTF-8, or so it claims.
However, when I print out the dawg file contents I see garbage for Indic scripts, but proper characters for English. Why is this happening?
This makes me think that maybe I am on the wrong track. I did ask the Tesseract mailing list whether I am on the right track or not, but found no useful replies.
In fact, now that I think about it: we are creating the dictionaries out of word lists, but we are forgetting that we need to introduce the vowel 'de-reordering' rules. Only then will the OCR be able to match words at run time.
Refer to my earlier post, where I mentioned that we need to do vowel reordering after OCR. Reversing that logic, we need to intentionally include those anomalies in the dictionary so the OCR can match against it. The OCR may then produce েক by looking at the dictionary, and we can use the vowel-reordering code to correct it to কে.
What a realisation!!