Tesseract-Indic-OCR: ocr

Total character classes required to be trained:

ক
খ
গ
ঘ
ঙ
চ
ছ
জ
ঝ
ঞ
ট
ঠ
ড
ঢ
ণ
ত
থ
দ
ধ
ন
প
ফ
ব
ভ
ম
য
র
ল
শ
ষ
স
হ
য
য়
ৰ
ৱ

অ
আ
ই
ঈ
উ
ঊ
ঋ
এ
ঐ
ও
ঔ

০
১
২
৩
৪
৫
৬
৭
৮
৯

া
ে
ৈ
ৌ (can't get the last symbol to render independently :()
ং

 ঃ

৷
!
"
#
$
%
&
'
(
)
*
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
;
<>
?
@
=
[
\
]
^
_
`

{
|
}
~
ৢ
ৣ
‘
’
“

Here are the semi-vowels that need to be trained in combination with consonants/conjuncts:

ি
ী
ু
ূ
ৃ
ৄ
্
়
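As a rough sketch of what "trained combined" means in practice, each combined class can be treated as a base (a consonant or a conjunct) followed by one dependent sign. The helper name `combined_classes` and the three sample bases are hypothetical, and limiting the signs to the six vowel signs (leaving out ্ and ়) is an assumption made for this illustration.

```python
import unicodedata

# The six vowel signs from the list above, written as code points so they
# survive fonts that cannot render them in isolation (assumed subset).
DEPENDENT_SIGNS = ["\u09BF", "\u09C0", "\u09C1", "\u09C2", "\u09C3", "\u09C4"]


def combined_classes(bases, signs=DEPENDENT_SIGNS):
    """Return every base+sign string, one per combined class to train."""
    return [base + sign for base in bases for sign in signs]


# Hypothetical sample: two consonants and one conjunct (KA + virama + SSA).
bases = ["\u0995", "\u0996", "\u0995\u09CD\u09B7"]
classes = combined_classes(bases)

# Show the code points that make up the first few combined classes.
for text in classes[:3]:
    print(text, [unicodedata.name(ch) for ch in text])
```

With the full lists, `bases` would hold every consonant and conjunct, so the number of combined classes grows as the number of bases times the number of signs.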

Here are the conjuncts:

ক্ক
ক্ট
ক্ত
ক্ন
ক্ম
ক্র
ক্ল
ক্ব
ক্ষ
ক্স
ক্ষ্ণ
ক্ষ্ম
ক্ট্র
খ্র
গ্গ
গ্ধ
গ্ন
গ্ম
গ্ল
গ্ব
গ্র
ঘ্ন
ঘ্র
ঙ্ক
ঙ্খ
ঙ্গ
ঙ্ঘ
ঙ্ম
ঙ্ক্ষ
চ্চ
চ্ছ
চ্ঞ
চ্ছ্র
চ্ছ্ব
ছ্ব
ছ্র
জ্জ
জ্ঝ
জ্ঞ
জ্র
জ্ব
জ্জ্ব
ঞ্চ
ঞ্ছ
ঞ্জ
ঞ্ঝ
ট্ট
ট্র
ঠ্র
ড্ড
ড্র
ড়্গ
ণ্ট
ণ্ঠ
ণ্ড
ণ্ঢ
ণ্ণ
ণ্ম
ণ্ব
ণ্র
ণ্ড্র
ত্ত
ত্থ
ত্ন
ত্ম
ত্র
ত্ব
ত্ত্ব
থ্র
থ্ব
দ্গ
দ্ঘ
দ্দ
দ্ধ
দ্ভ
দ্ম
দ্র
দ্ব
দ্দ্ব
দ্ধ্ব
ধ্ন
ধ্র
ধ্ব
ন্ত
ন্থ
ন্দ
ন্ধ
ন্ন
ন্য
ন্ব
ন্ম
ন্স
ন্ত্ব
ন্ত্র
ন্দ্ব
ন্দ্র
ন্ধ্র
প্ট
প্প
প্ন
প্ত
প্ল
প্স
প্র
ফ্র
ফ্ল
ব্জ
ব্দ
ব্ধ
ব্ব
ব্ল
ব্র
ব্দ্র
ভ্র
ম্ন
ম্প
ম্ফ
ম্ব
ম্ভ
ম্ম
ম্র
ম্ল
ম্ভ্র
ম্প্র
ল্ক
ল্গ
ল্ট
ল্ড
ল্প
ল্ফ
ল্ব
ল্ম
ল্ল
শ্চ
শ্ছ
শ্ন
শ্ম
শ্ব
শ্র
শ্ল
শ্য
ষ্ক
ষ্ট
ষ্ঠ
ষ্ণ
ষ্প
ষ্ফ
ষ্ম
ষ্ক্র
ষ্ট্র
ষ্য
স্ক
স্খ
স্ট
স্ত
স্থ
স্ন
স্প
স্ফ
স্ম
স্র
স্ল
স্ব
স্ত্র
স্ক্র
স্ট্র
স্য
হ্ণ
হ্ন
হ্ম
হ্র
হ্ল
হ্ব
হ্য
গু
ন্তু
নু
সু
রু
রূ
দু
শু
হৃ
হু
গ্রু
গ্রূ
ব্রু
ভ্রু
ভ্রূ
শ্রু
শ্রূ
স্তু
ন্দু
ত্রু
থ্রু
থ্রূ
দ্রু
দ্রূ
ধ্রু
ধ্রূ
ল্গু
ন্ড
ন্ট
ন্ঠ
চ্ন
ট্ম
ট্ব
ড্ম
ভ্ল
ম্ত
ম্থ
ম্দ
ল্ত
ল্ধ
শ্ত

Total number of character classes to be trained:

36 (number of consonants) + 11 (number of vowels) + 10 (digits) +
6 (vowel signs that can be rendered separately) + 49 (punctuation marks and symbols) + 215 (conjuncts)
+ (215+36)×6 (for semi-vowels that cannot be trained individually) = 1833

Hence the character classifier for an Indic OCR has to choose among 1833 character classes
to identify a single character. For an English OCR, on the other hand, this number is below 50.
Hence the difficulty of Indic OCR.
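The total can be checked with a few lines of Python. The dictionary keys are labels chosen for this sketch; the numbers are the ones quoted in the post.

```python
# Counts quoted in the post for classes that can be trained on their own.
counts = {
    "consonants": 36,
    "vowels": 11,
    "digits": 10,
    "standalone_vowel_signs": 6,
    "punctuation_and_symbols": 49,
    "conjuncts": 215,
}
# Signs that can only be trained attached to a consonant or conjunct.
dependent_signs = 6

standalone = sum(counts.values())
combined = (counts["conjuncts"] + counts["consonants"]) * dependent_signs
total = standalone + combined

print(standalone, combined, total)  # 327 1506 1833
```

The combined classes account for 1506 of the 1833, which is why splitting the dependent signs off from their bases would shrink the total so much.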

How to reduce number of character classes to be trained?

In my conversation with Prof. B.B. Chaudhuri I learnt techniques to reduce the number of character classes.
First we need to separate a word image into three parts: top, middle and bottom. The top part will have
the rising parts of vowel signs like ি and ী; the middle part will have consonants, conjuncts, vowels, digits etc.
The bottom part will have the descending parts of vowel signs like ু.
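One common way to get such a three-way split is a horizontal projection profile. The sketch below is only an illustration and not necessarily the method Prof. Chaudhuri described. It assumes a binarized, deskewed image of a single word given as rows of 0/1 values (1 = ink), takes the densest row as the headline (matra), and uses a hypothetical `baseline_ratio` heuristic to decide where the bottom zone starts.

```python
def split_zones(image, baseline_ratio=0.5):
    """Split a binary word image (list of 0/1 rows) into top, middle, bottom."""
    profile = [sum(row) for row in image]      # ink pixels per row
    headline = profile.index(max(profile))     # densest row = matra (assumed)

    # Baseline heuristic (an assumption): the last row below the headline
    # that still carries at least baseline_ratio of the densest body row.
    below = profile[headline + 1:]
    baseline = headline
    if below:
        threshold = baseline_ratio * max(below)
        for row_index, ink in enumerate(below, start=headline + 1):
            if ink > 0 and ink >= threshold:
                baseline = row_index

    top = image[:headline]
    middle = image[headline:baseline + 1]
    bottom = image[baseline + 1:]
    return top, middle, bottom


# Tiny hand-made example: 2 rows of rising strokes, a headline,
# 3 body rows and 2 rows of descending strokes.
word = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
]
top, middle, bottom = split_zones(word)
print(len(top), len(middle), len(bottom))  # 2 4 2
```

Real scans are noisier than this toy input, so the headline and baseline estimates would likely need smoothing or per-line statistics before they could be relied on.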

Ocropus already seems capable of achieving this. See this. The image below shows the rising parts
of a few vowel signs segmented separately:



If we can successfully adopt this segmentation approach, we can reduce the number of trainable character
classes to around 350.
Now, once we have segmented the image, how does the present Tesseract-OCR engine classify the new
character classes? For example, how do we train the engine so that it understands that the rising part of
ি is part of a vowel sign? In any case, Tesseract only understands characters with Unicode
values during training. Hence I don't think Tesseract-OCR will understand this segmentation.
So what do we do? There are 2 possibilities:

1) We use a different OCR engine. I will have to dig deeper into Ocropus.
2) We use the Tesseract-OCR classifier and the 1800-odd character classes, augmented with a strong
spell-checker-based correction mechanism.

I am working on the second method right now.
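To give an idea of what a spell-checker-based correction step could look like, here is a minimal sketch using Python's standard `difflib` module. It is not the mechanism being built for Tesseract: the three-word `LEXICON`, the function name `correct_word` and the 0.6 similarity cutoff are all assumptions for illustration, and a real system would need a large Bengali word list.

```python
import difflib

# Hypothetical stand-in for a real Bengali word list.
LEXICON = ["বাংলা", "ভারত", "কলকাতা"]


def correct_word(word, lexicon=LEXICON, cutoff=0.6):
    """Return the closest lexicon entry, or the word unchanged if none is close."""
    if word in lexicon:
        return word
    # get_close_matches ranks candidates by SequenceMatcher similarity.
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# OCR output with one misread character (ন instead of ল).
print(correct_word("বাংনা"))  # বাংলা
```

The comparison here runs over Unicode code points, so a misread conjunct counts as several errors; a production corrector would probably compare at the level of the OCR character classes instead.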

I had tried some time last year to push my matra clipping code to Tesseract-OCR upstream, but Ray Smith, the lead developer of the project, asked about the accuracy of the code and I never got around to calculating it. Well, actually I still haven't calculated it, but I did something new.
Check the set of pictures I uploaded at . The first picture is the normal picture to be OCRed. The second picture is the clipped+thresholded image. The third picture is the difference between the clipped+thresholded and thresholded images.

Here is the Python code that creates a new image out of two input images:

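#!/usr/bin/env python3
# Build a difference image from the thresholded and clipped+thresholded images.

from PIL import Image, ImageChops  # Pillow; the old bare "import Image" no longer works

# Convert both inputs to 8-bit greyscale so the channel operations apply.
th = Image.open("benth.tif").convert("L")
clip = Image.open("bentest.tif").convert("L")

# Pixels that differ between the two images come out bright;
# inverting makes them dark on a white background.
new = ImageChops.difference(th, clip)
new = ImageChops.invert(new)

new.save("diff.tif", "TIFF")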

I will now show this to Ray Smith. Let's see if he likes it.
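The difference image can also be reduced to a single number. The sketch below is not an accuracy measure for the clipping code, only the fraction of pixels that clipping changed. It assumes both images are already loaded as equal-sized rows of 0/1 values; the function name `changed_fraction` and the sample data are hypothetical.

```python
def changed_fraction(thresholded, clipped):
    """Fraction of pixel positions whose value differs between two images."""
    if len(thresholded) != len(clipped):
        raise ValueError("images must have the same height")
    total = 0
    changed = 0
    for row_a, row_b in zip(thresholded, clipped):
        if len(row_a) != len(row_b):
            raise ValueError("images must have the same width")
        total += len(row_a)
        changed += sum(1 for a, b in zip(row_a, row_b) if a != b)
    return changed / total if total else 0.0


thresholded = [
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]
clipped = [
    [1, 0, 1, 0],  # hypothetical: the headline has been cut at two columns
    [0, 1, 0, 1],
]
print(changed_fraction(thresholded, clipped))  # 0.25
```

Turning this into a real accuracy figure would still need ground truth, for example hand-marked clip positions to compare against.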

Friday, May 8, 2009

Bengali Stats

Friday, April 17, 2009

Clipping accuracy

Saturday, March 21, 2009

Bengali Conjuncts
