Wednesday, November 25, 2009

How was the dictionary fixed?

Well, it was a single line!

I added the following line near line 1077 of dict/permute.cpp:

any_alpha=1;

Here is the diff against 2.04 release:

--- tesseract-2.04/dict/permute.cpp 2008-11-14 23:07:17.000000000 +0530
+++ tessmod/dict/permute.cpp 2009-11-26 00:34:50.660737699 +0530
@@ -1077,6 +1077,7 @@
 return (NULL);
 if (permute_only_top)
 return result_1;
+ any_alpha=1;
 if (any_alpha && array_count (char_choices) <= MAX_WERD_LENGTH) {
 result_2 = permute_words (char_choices, rating_limit);
 if (class_probability (result_1) < class_probability (result_2)


For non-English scripts, the if condition was never satisfied, so
permute_words was never called and the DAWG files were not being
scanned properly. Explicitly setting any_alpha=1 just before the check
solves this problem for the time being. There is probably a more
elegant solution, though.
By the way, I do not see this particular if condition anywhere in this
file in the trunk. Perhaps the developers have already fixed it there.
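
The effect of the patch can be illustrated with a small standalone
sketch. It is not Tesseract code: the function names are hypothetical,
and it assumes (without having confirmed it in the 2.04 source) that
any_alpha ends up 0 because the per-character alphabetic test only
recognises ASCII letters. The MAX_WERD_LENGTH part of the condition is
left out.

```cpp
#include <string>

// Hypothetical stand-in for a per-character "is alphabetic" test.
// ASSUMPTION: only ASCII letters count as alphabetic.
static bool is_alpha_ascii(unsigned char c) {
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Hypothetical model of the gate in front of permute_words.
// Returns true if the dictionary pass would run for this word.
// force_any_alpha models the one-line patch "any_alpha=1;".
bool dictionary_pass_runs(const std::string& word_bytes,
                          bool force_any_alpha) {
  int any_alpha = 0;
  for (unsigned char c : word_bytes) {
    if (is_alpha_ascii(c)) any_alpha = 1;  // set by any ASCII letter
  }
  if (force_any_alpha) any_alpha = 1;      // the patch
  return any_alpha != 0;
}
```

Under that assumption, a word made only of UTF-8 encoded Devanagari
bytes never sets the flag, so the dictionary pass is skipped unless the
flag is forced.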

1 comment:

  1. Hi,

    I discovered Tesseract today while looking for Marathi/Sanskrit OCR. It looks good, and as your posts indicate, Indic support is not built in. Thanks for taking the lead here. I could follow your lead and help a bit with Devanagari.

    -ashish mahabal
