Tesseract-Indic-OCR

Friday, November 9, 2012

Why FOSS Indic OCR is now feasible

I was supposed to attend mediawiki.org/wiki/Pune_LanguageSummit_November_2012, but ended up not going. I did not have too much to share since I have not worked on OCR for almost 2 years now. I wanted to share the few things I did have to say and hence I decided to write this post.

Tesseract-OCR today has several new features that make it more suitable for Indic OCR now.

1) They have now moved to a new classifier called "cube" which can handle many more character classes than the older neural net engine. This is important because Indic scripts have hundreds of different glyphs once you count conjuncts and overlapping vowel signs.

2) They have added a significant amount of code for "shirorekha clipping" to the code base, which will help Hindi and Bengali OCR: http://code.google.com/p/tesseract-ocr/source/search?q=shirorekha&origq=shirorekha&btnG=Search+Trunk . The approach is similar to http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf

3) Generous amounts of Indic script training data are now hosted on the official Tesseract-OCR website, and also on other community-supported projects like http://code.google.com/p/parichit/

For someone who wants to start working on Indic OCR now, the barriers have been lowered somewhat. Much work still needs to be done because in OCR the last 2% is what matters most, and that requires a lot of fine tuning and testing. Getting your hands on ground truth test data is hence of vital importance.
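
As a rough illustration of why ground truth matters, here is a minimal sketch of how OCR output could be scored against it. It assumes accuracy is measured as one minus the character-level edit distance divided by the ground-truth length; the function names are hypothetical and this is not a Tesseract tool.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth, ocr_output):
    """Share of the ground truth that was recognised correctly (0.0 to 1.0)."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


print(character_accuracy("abcd", "abxd"))
```

Note that this counts Unicode code points, so for Indic text a wrong vowel sign and a wrong consonant each count as one error.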

Sunday, April 10, 2011

What next?

Step one was to get the maatraa clipping code into Tesseract, which has happened. We still have the following issues to resolve before we can have excellent recognition rates:

We need to split the following glyphs into separate consonant and vowel signs.

1) Consonant + descending vowel sign

Example:

2) Consonant + ascending vowel sign

Example:

In summary we need to be able to do the following transformation before sending the image to Tesseract:

FROM

Monday, April 4, 2011

Horizontal histogram profiles of consonants with descending vowel signs

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Requires Python 3 and Pillow.
import os
from PIL import Image, ImageDraw

def horizontalHistogram(input_image, image_name):
    # Convert to 8-bit greyscale so that a black pixel always reads as 0.
    input_image = input_image.convert("L")
    width, height = input_image.size
    print("Width = %d and height = %d" % (width, height))
    # Count the black pixels in every row of the image.
    histoGram = []
    for i in range(height):
        blackPixelCount = 0
        for j in range(width):
            if input_image.getpixel((j, i)) == 0:
                blackPixelCount += 1
        histoGram.append(blackPixelCount)
    print(histoGram)
    # Draw one grey bar per row, as long as that row's black pixel count.
    histogramImage = Image.new("L", (width, height), 255)
    pen = ImageDraw.Draw(histogramImage)
    for y, count in enumerate(histoGram):
        pen.line((0, y, count, y), fill=128)
    # Place the glyph and its histogram side by side in one image.
    cumulativeImage = Image.new("L", (width * 2, height), 255)
    cumulativeImage.paste(input_image, (0, 0, width, height))
    cumulativeImage.paste(histogramImage, (width, 0, width * 2, height))
    cumulativeImage.save(image_name + "_" + ".png", "PNG")

def verticalHistogram():
    pass

path = "/home/debayan/code/lower_descender_images/"
for image_name in os.listdir(path):
    input_image = Image.open(os.path.join(path, image_name))
    horizontalHistogram(input_image, image_name)

The code above takes a set of images (consonant + vowel sign) from the folder /home/debayan/code/lower_descender_images/ generated by http://code.google.com/p/tesseractindic/source/browse/trunk/tesseract_trainer/generate.py and produces small images that show each glyph next to its horizontal histogram profile. The generated set can be found at https://picasaweb.google.com/debayanin/OCRStuff .
Observing this set gives us insight into where the descending vowel sign begins. If we know this, we can separate the consonant from the vowel sign. One general observation is that the row where the vowel sign begins shows up as a local minimum in the histogram profile.
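
The observation about local minima can be sketched as follows. This is only an illustration that works on the list of per-row black pixel counts; the function name local_minima is hypothetical, and choosing which minimum is the real split point (for example, the deepest one in the lower half of the glyph) is left open.

```python
def local_minima(histogram):
    """Return row indices where the black pixel count dips and then rises.

    A flat valley floor is reported once, at its middle row.
    """
    minima = []
    n = len(histogram)
    i = 1
    while i < n - 1:
        if histogram[i] < histogram[i - 1]:
            # Walk across a possible flat valley floor.
            j = i
            while j < n - 1 and histogram[j + 1] == histogram[i]:
                j += 1
            # It is a minimum only if the profile rises again afterwards.
            if j < n - 1 and histogram[j + 1] > histogram[i]:
                minima.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return minima


# Made-up profile: consonant body, a narrow neck at row 4, then the vowel sign.
print(local_minima([2, 8, 9, 7, 3, 6, 5, 1]))
```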

Here are some examples. Note the red horizontal lines which mark the minima in the histogram:

Tthri in Bengali

Mu in Bengali

These are the exceptions:

Chhu in Bengali

Du in Bengali

With some fonts the above glyph might not show a local minimum in the histogram

Jri in Bengali

Hri in Bengali

Hu in Bengali

Gu in Bengali

Wednesday, December 16, 2009

Review of current status and vision document

Details of work done till now in the Tesseract-Indic project
-------------------------------------------------------------------------

1) Maatraa Clipping

Maatraa here refers to shironaam, or the headline in Devanagari and Bengali scripts.

The first step in adapting Tesseract-OCR to recognise Indic scripts like Devanagari and Bengali was to clip (remove) the shironaam at points between successive characters so that Tesseract's connected component analysis does not mistake the entire word for a character.

The algorithm is described here, and the code is available as a patch in the May 27th, 2008 entry on http://sites.google.com/site/debayanin/hackingtesseract .

Ray Smith, the project owner of Tesseract-OCR, commented on the code here, and Thomas Breuel mentions "Matraa Clipping" in the morphological operations wiki in the OCRopus project.
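
To make the idea of clipping concrete, here is a simplified sketch. It is not the patch mentioned above. It assumes a binarised word image given as a list of rows of 0/1 values (1 = ink), a headline that is one pixel thick and is the densest row, and it cuts the headline wherever a column has no ink below it. The function name clip_headline is hypothetical.

```python
def clip_headline(word):
    """Return a copy of the word image with the headline cut between characters."""
    row_counts = [sum(row) for row in word]
    headline = row_counts.index(max(row_counts))  # densest row = headline
    clipped = [row[:] for row in word]
    for x in range(len(word[0])):
        # A column with nothing below the headline lies between two characters.
        ink_below = any(word[y][x] for y in range(headline + 1, len(word)))
        if not ink_below:
            clipped[headline][x] = 0
    return clipped


word = [
    [1, 1, 1, 1, 1, 1, 1],   # headline joining two characters
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
]
for row in clip_headline(word):
    print(row)
```

After clipping, the two characters in the example are no longer joined, so connected component analysis sees them as separate blobs. Real pages need a thicker headline band and some tolerance for noise.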

2) De-skewing

For the above clipping algorithm to work, the page should be perfectly aligned. There should be no skew/tilt during the OCR process. For this purpose a de-skewing algorithm was required. I wrote an ad-hoc algorithm for that purpose, which has been disabled by default in recent releases of tesseract-indic. Better deskewing methods are available elsewhere. The code can be found in the October 28 entry on http://sites.google.com/site/debayanin/hackingtesseract .
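
For readers who want a starting point, here is a sketch of one common way to estimate skew, the projection profile method. It is not the ad-hoc algorithm mentioned above. It assumes the black pixels are given as (x, y) pairs in image coordinates (y grows downwards) and tries a range of candidate angles, keeping the one for which the pixels pile up most sharply into rows. The function name and its parameters are hypothetical.

```python
import math


def estimate_skew(black_pixels, max_angle=5.0, step=0.5):
    """Return the skew in degrees; positive means lines drop towards the right."""
    pixels = list(black_pixels)
    steps = int(round(max_angle / step))
    best_angle, best_score = 0.0, -1
    for k in range(-steps, steps + 1):
        angle = k * step
        slope = math.tan(math.radians(angle))
        # Shear every pixel back by the candidate angle and count pixels per row.
        profile = {}
        for x, y in pixels:
            row = int(round(y - x * slope))
            profile[row] = profile.get(row, 0) + 1
        # Sharp text lines give a few very full rows, i.e. a large sum of squares.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Two synthetic "text lines" tilted by 2 degrees.
slope = math.tan(math.radians(2.0))
lines = [(x, int(round(base + x * slope))) for base in (10, 40) for x in range(200)]
print(estimate_skew(lines))
```

The estimated angle can then be used to rotate the page back before clipping.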

3) Training Data Auto Generation

I was initially working alone. One of the biggest problems of working alone on an OCR project is generating training data for different scripts. I tried to solve the problem by rendering all possible glyphs for a script onto an image, recording corresponding bounding boxes to a text file and then feeding the pair to the Tesseract-OCR training mechanism.
Instructions on how to use it can be found here and you may download the latest version at http://code.google.com/p/tesseractindic/downloads/list . The latest version at the time of writing is TesseractIndic-Trainer-GUI-0.1.3 .
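
The bookkeeping side of this approach can be sketched as below. This is not the code of the trainer itself. It assumes the glyph sizes have already been measured by whatever renderer is used, lays the glyphs out in rows, and writes box file lines in the "glyph left bottom right top" layout with a bottom-left origin. The function names and the gap parameter are hypothetical.

```python
def layout_glyphs(glyph_sizes, page_width, gap=10):
    """Place (glyph, width, height) items in rows; return image-coordinate boxes."""
    boxes = []
    x, y, row_height = gap, gap, 0
    for glyph, w, h in glyph_sizes:
        # Start a new row when the glyph would run past the right margin.
        if x + w + gap > page_width and x > gap:
            x = gap
            y += row_height + gap
            row_height = 0
        boxes.append((glyph, x, y, x + w, y + h))
        x += w + gap
        row_height = max(row_height, h)
    return boxes


def box_line(glyph, left, top, right, bottom, image_height):
    """Convert a top-left-origin box into a bottom-left-origin box file line."""
    return "%s %d %d %d %d" % (glyph, left, image_height - bottom,
                               right, image_height - top)


boxes = layout_glyphs([("ক", 20, 30), ("খ", 20, 30), ("গ", 20, 30)], page_width=70)
for glyph, left, top, right, bottom in boxes:
    print(box_line(glyph, left, top, right, bottom, image_height=100))
```

Newer Tesseract versions also expect a page number at the end of each line; that field is left out here.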

4) Getting the dictionary to work

One of the big blockers for this project was a non-working dictionary for Indic scripts. It turned out to be a single missing line of code, without which the dictionary subroutine was never called.
Here is how the problem was located.

5) OCRFeeder

I was working on creating a desktop GUI from scratch in PyGtk. Sayamindu suggested that I look at OCRFeeder instead. The code is very nice and the author has even taken care of surrounding all printable strings with suitable modifiers so gettext can process them for i18n requirements. I am modifying the GUI to support other scripts suitably. I am yet to upload it to a public space, but will do so soon. Sayamindu and I fixed a few problems with it during FOSS.IN 2009.

6) Tilt method

http://hacking-tesseract.blogspot.com/2009/12/tilt-method-for-character-segmentation.html

http://hacking-tesseract.blogspot.com/2009/12/preliminary-results-for-tilt-method.html

7) Community Building

At FOSS.IN I saw a strong urge in people to work on OCR-related problems. I felt responsible for creating a community and a framework for the OCR project that make community contribution an easy process.
For a technology-intensive project, the traditional FOSS model does not work in the same way. You generally won't expect people to tweak core algorithms in pattern matching or machine learning components. This is something that Prof. C.V. Jawahar said, and I find it true for Tesseract-OCR too. In the case of Tesseract, a lot of people work on training data, fixing bugs, tweaking parameters and creating UIs, but very rarely does someone decide to touch the core algorithms.
The fact is (as said by Prof. Anoop), core algorithms and the training data/UI share a 50/50 ratio in importance in OCR development.
It is my intention to create a feedback-based learning system for the OCR, which makes it trivially easy for the user to send erroneous recognitions back to a maintainer, and trivially easy for the maintainer to incorporate that data into a newer, better training set.

http://hacking-tesseract.blogspot.com/2009/11/crowd-sourcing-ocr-development.html

ToDo
------

1) Documentation on how different language teams can help

2) Integrating OCRFeeder with Training and Testing frameworks. Create feedback module.

3) Web based OCR. Feedback based learning mechanism

4) Can the dictionary be improved?

5) OCRFeeder page layout analysis is a little off

Monday, December 7, 2009

Preliminary results for tilt method

I wrote this Python code that reads in a box file and performs the rotation operation on the corresponding image:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Usage: pass the box file as the only argument; the matching .tif
# image is expected next to it. Requires Python 3 and Pillow.

import sys
from PIL import Image

box_path = sys.argv[1]
with open(box_path, encoding="utf-8") as box_file:
    lines = box_file.readlines()
image_name = box_path.split('.')[0] + '.tif'

input_image = Image.open(image_name)

wt = input_image.size[0]
ht = input_image.size[1]
# The output canvas is twice as wide to leave room for the rotated characters.
new_image = Image.new("L", (wt * 2, ht), 255)

offset = 0
prevtlx = 0
for line in lines:
    fields = line.split(' ')
    # Box files use a bottom-left origin: fields 1-4 are left, bottom, right, top.
    delta_y = int(fields[4].strip()) - int(fields[2])
    top_left_x = int(fields[1])
    # Flip the y coordinates because image coordinates have a top-left origin.
    top_left_y = ht - int(fields[2]) - delta_y
    bot_right_x = int(fields[3])
    bot_right_y = ht - int(fields[4].strip()) + delta_y
    box = (top_left_x, top_left_y, bot_right_x, bot_right_y)
    char = input_image.crop(box)
    # expand=True makes width and height swap instead of cropping the result.
    char = char.rotate(90, expand=True)
    # A smaller x than the previous box means a new text line has started.
    if top_left_x < prevtlx:
        offset = 0

    newwt = char.size[0]
    newht = char.size[1]

    newbox = (top_left_x + offset, top_left_y, top_left_x + offset + newwt, top_left_y + newht)
    print(newbox)
    # Shift later characters right by the width the rotation added, plus 2 pixels.
    offset = offset + (newwt - newht + 2)
    prevtlx = top_left_x

    new_image.paste(char, newbox)
new_image.save('mod.tif', "TIFF")

To generate the box file, I take an image and run the following command on it:

tesseract bengali2.tif bengali2 -l ban batch.nochop makebox

On running the script one finds the images below:

Original Image

Transformed Image

The experiment has been somewhat disappointing. The quality of the character images degrades after rotation. Also, since the boxing is not perfect, wrong groups have been rotated. This does not mean the technique cannot be used. I need to make the same modifications in the Tesseract C++ code. The idea is to rotate the character images and compare the classifier confidence between the original and the modified character image. The one with the higher value will be chosen.
Also, I need a version of the Pango renderer that can render the vowel signs without the dotted circles. I probably need to change a few lines and rebuild Pango, as Sayamindu said.
So here I dive into the code base again.

Sunday, December 6, 2009

The tilt method for character segmentation in Indic Scripts

One of the problems in Indic script character classification is the huge number of glyphs. This is mostly due to conjuncts, but symbols formed by consonant + vowel sign combinations are also a major component. Once a consonant and a vowel sign overlap on the vertical axis, Tesseract has to be trained with that entire symbol. This is because Tesseract does a left-to-right scan of the image and can only box a wholly connected component. It then proceeds to sub-divide the box, again along a vertical axis, in case it fails to recognise the entire word. For example:

For the image below, the OCR may first box the 2 characters together.

At the next iteration, it will split the box into 2 so that it has a better chance of identifying the characters.

Hence, for a symbol like কু the OCR cannot segment ক and ু separately.
There is a hack for this though. What if we rotate the image by 90 degrees counter-clockwise?

As you can see, rotating the symbol allows Tesseract to box the vowel separately. We can train the rotated symbols to stand for a particular character.
This will significantly reduce the number of character classes to be trained for Tesseract OCR. I am working on the Python script that does this transformation of the image.
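
The effect of the rotation can be shown with a toy model. This is only a sketch: it assumes a binarised symbol given as a list of rows of 0/1 values, models segmentation as cutting at empty columns, and assumes there is at least one blank row between the consonant and the vowel sign. The function names are hypothetical and this is not how Tesseract is implemented internally.

```python
def rotate_ccw(grid):
    """Rotate a grid (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def column_segments(grid):
    """Cut the grid at empty columns; return (start, end) ranges, end exclusive."""
    width = len(grid[0])
    has_ink = [any(row[x] for row in grid) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


symbol = [
    [1, 1, 1, 1],   # consonant
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],   # blank row
    [0, 1, 1, 0],   # vowel sign below the consonant
]
print(column_segments(symbol))              # one segment: cannot be split
print(column_segments(rotate_ccw(symbol)))  # two segments after rotation
```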

Saturday, November 28, 2009

Initial tests

Initial test results are pretty good.

Test Condition:

Image: A deskewed image with Bengali text.

Training data:
Word List: Superset of all the words contained in the image
Shape Information: Using CRBLP's data

Output Text:

বৈঠকী মেজাজের এক উপেক্ষিত সা হিতি্যক
পত্র।স্তরে অনুজপ্রতিম লেখক ভ্রী শৈবাল মিত্র কিছুদিন অাগে বা-ঙা লি
লেখবঢ়ুঘুদ্ধিজীবীদের খুব একচেটি বকূনি দিয়েছো। তার অভিযোড়ৈগ এই যে, -যখনই
এই সব ব্যক্তিদের প্রশ্ল কর। হয়, অাপনার। গত এক বছরে উল্লেখযোগ্য কীড়ী কীা বই
পড়েছো, তখন অবধা রিতভাবে প্রায় সকলেই কিছুইংরিজি বই বা বিদেশি সাহিভ্যের
সূখ্য।তি ভরঢ়ু করেন। তার। কি ব।ংল। বই পড়্গে না ? নিজেরা বাংলা ভাষার লেখক
হয়েও অপর কে।নও বাঙালি লেখকের রচন।কে ণ্ডরঢ়ুত্বপুর্ণ মনে করেন নাড়ৈ? না কি
ব।ংল। ভ।ষায় উল্লেখযে।গ্য কিছু লেখাই হয় না।
এই অভিযে।গে সত্যতা অাছে। প।ণ্ডিত্য প্রমাণ করার জন্য অনেকেই বিদেশি
স।হিত্য সম্পর্কে জান জ। হির কর।র জন্য ব্যস্ত হয়ে পড়্গে; পা ণ্ডিত্য কিংবানূরবারি
য।ই হে।ক, ব।ংল। বই-টই এর মধ্যে অ।সে না। কফি হাউসের বুদ্ধিজীবীদের হাভে
বাংলা বই র।খ।র রেওয়।জ নেই। ঢেউয়ের মতন কখনও ম।র্কেজ, কখনও-
দেরিদ।-গ্র।মসি, কখনও টনি মরিসন ব।ঙ।লি বুদ্ধিজীবীদের ওণ্ঠের ওপর খেলাকরে
যান।
বিদেশি স।হিত্য ও তত্ত্বগ্রছ প।ঠ করা অবশ্যই জকরি, কিত বাংলা ভাষায়
অালোচন।যে।গ্য কে।নও গ্রছ লেখ। হয় ন।, এমন য দি মনে কর। হয় তা হলে বাংলা
-ভাষানিয়ে এত গর্ব কর।রইবা কী অ।ছে ? বিদেশি স।হিত্য প।ঠ করলেই বরং বে।ঝা
-যায়-, সম্প্রতি অন্যান্য ভ।ষ।য় রচিত গল্প-উপন্য।সবঢ়ুবিত। বাংলার তূলনায় এমন কিছু
-অাহাম রি উচচ।ঙ্গের নয়। সৈয়দ মুস্ত।ফ। সির।জের ড়ৃঅলীক মানুষ,এর মতন উপন্যাস
বিঢ়ুংব। -জয় গে।স্ব।মীর ড়ৈপ।গলী, ভে।মার সঙ্গে-ব তুল্য কাব্যগ্রছইদানীং কোন ভ।ষ।য়
প্রকাশিত হয়েছে?
যাই হ্রোক, অামার পক্ষে এরকম পণ্ডি তিপনা কিংবা স্নবারি দেখ।বার কোনও
সুযে।গই নেই; কারণ গত এব৪ বৎসরে অামি বিদেশি সাহিত্য কিছুই প ড়িনি ! এমনকী
ইংরিজি অক্ষরে লেখ। নিতাস্ত কয়েবঢ়ুখান। পুরনো ইতিহাসখজীবনীগ্রস্হু ছাড়া কোনও
গল্পঞ্ঝউপন্যাস চোখেও দেখি নি! বস্কু-ব।ন্ধবরা কেউ যখন সা।ভ.ঘতিক কোনও
সাড়।-জাগ।নো বইয়ের প্রসঙ্গ তুলে জিজ্ঞেস করে, তূ মি পড়েছ নিশ্চয়ই? অামারে৪
সসংকে।চে স্বীক।র করতেই হয়ড়ৈ, না ভ।ই পড়িনি! কিংব। বিদেশ খেকে ফিরে এলে
যখন কেউ জিজেস করে, ও দেশের হালফিল সাহিত্যের ধারা কী দেখলে, অামি
মাথা চুলকোই। জা নি না, খবর নেব।র সময় পাইনি! লভনে গিয়ে ইন্ডিয়া অ ফিস
লাইরেরিভে অামি পুরনো গুথিপত্র ঘেঁটেছি, এবচটাও নজৃব ইংরিজি কবিতার বই
কিনি নি, এটা স্বীকার করভে অামার লজ্জ। হয়। ত্রটা অামার এবচঁটা অধঃপতনের চিহ্ন

Accuracy: ~93%

One major source of errors is । vs া ambiguity. That can be fixed.
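
One possible fix, as a sketch only: a danda (।) that sits between two Bengali characters cannot be sentence-final punctuation, so it can be replaced by the vowel sign া. The rule and the function name below are assumptions of this example, not something Tesseract does. A danda at the end of a word stays ambiguous and would need a dictionary lookup.

```python
import re

DANDA = "\u0964"     # । punctuation mark
AA_SIGN = "\u09be"   # া Bengali vowel sign AA

# A danda with a Bengali character directly before and directly after it.
_MID_WORD_DANDA = re.compile(
    "(?<=[\u0980-\u09ff])" + DANDA + "(?=[\u0980-\u09ff])"
)


def fix_mid_word_danda(text):
    """Replace a danda inside a word with the vowel sign AA."""
    return _MID_WORD_DANDA.sub(AA_SIGN, text)


# The first danda is inside the word and is replaced; the last one is kept.
print(fix_mid_word_danda("\u09ac\u0964\u0982\u09b2\u0964"))
```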

This is pretty good news. The OCR is working well.