Wednesday, December 16, 2009

Review of current status and vision document

Details of work done till now in the Tesseract-Indic project

1) Maatraa Clipping

Maatraa here refers to the shironaam, or headline, in the Devanagari and Bengali scripts.

The first step in adapting Tesseract-OCR to recognise Indic scripts like Devanagari and Bengali was to clip (remove) the shironaam at points between successive characters, so that Tesseract's connected component analysis does not mistake the entire word for a single character.
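The clipping idea can be sketched on a toy binary image. This is a simplified illustration, not the project's actual patch: find the headline row (the row with the most ink), then erase it over columns that carry no ink below it, so the inter-character stretches of the headline disappear and the characters separate.

```python
def clip_maatraa(img):
    """Remove the headline (shironaam) over inter-character gaps.

    img: binary image as a list of rows, 1 = ink, 0 = background.
    A toy sketch of the idea, not the actual Tesseract-Indic patch.
    """
    h, w = len(img), len(img[0])
    # 1. Take the row with the highest ink count as the headline.
    row_ink = [sum(row) for row in img]
    head = max(range(h), key=lambda r: row_ink[r])

    # 2. A column is a gap if it has no ink below the headline.
    def has_body_ink(c):
        return any(img[r][c] for r in range(head + 1, h))

    # 3. Clear the headline over gap columns so the connected
    #    components come apart.
    out = [row[:] for row in img]
    for c in range(w):
        if not has_body_ink(c):
            out[head][c] = 0
    return out
```

On a 5-column toy word with two vertical strokes joined only by the headline, the middle three headline pixels get cleared and the strokes become separate components.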

The algorithm and the code, in the form of a patch, are in the May 27th, 2008 entry on .

Ray Smith, the project owner of Tesseract-OCR, commented on the code here, and Thomas Breuel mentions "Matraa Clipping" in the morphological operations wiki of the OCRopus project.

2) De-skewing

For the above clipping algorithm to work, the page should be perfectly aligned. There should be no skew/tilt during the OCR process, so a de-skewing algorithm was required. I wrote an ad-hoc algorithm for that purpose, which has been disabled by default in recent releases of tesseract-indic; better deskewing methods are available elsewhere. Code can be found in the October 28 entry in .
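One common way to estimate skew, shown here as a generic sketch and not the ad-hoc algorithm from the patch, is to trace the top ink edge of each column and fit a straight line through those points; the slope of the fitted line gives the tilt angle.

```python
import math

def estimate_skew(img):
    """Estimate page tilt in degrees from the top ink edge.

    img: binary image as a list of rows, 1 = ink. Fits a
    least-squares line through the topmost ink pixel of each
    column (y grows downward, so a positive angle means the
    text drifts down to the right). A sketch, not the
    tesseract-indic deskew code.
    """
    pts = []
    for c in range(len(img[0])):
        for r in range(len(img)):
            if img[r][c]:
                pts.append((c, r))
                break
    n = len(pts)
    if n < 2:
        return 0.0
    sx = sum(p[0] for p in pts)
    sy = sum(p[1] for p in pts)
    sxx = sum(p[0] * p[0] for p in pts)
    sxy = sum(p[0] * p[1] for p in pts)
    denom = n * sxx - sx * sx
    if denom == 0:
        return 0.0
    slope = (n * sxy - sx * sy) / denom
    return math.degrees(math.atan(slope))
```

The image can then be counter-rotated by the returned angle before clipping.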

3) Training Data Auto Generation

I was initially working alone, and one of the biggest problems of doing so on an OCR project is generating training data for different scripts. I tried to solve the problem by rendering all possible glyphs for a script onto an image, recording the corresponding bounding boxes to a text file, and then feeding the pair to the Tesseract-OCR training mechanism.
Instructions on how to use it can be found here, and you may download the latest version at . The latest version at the time of writing is TesseractIndic-Trainer-GUI-0.1.3 .
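The image/box-file pairing can be illustrated with a small sketch that lays glyphs out left to right and emits the matching box-file lines. The sizes here are placeholder assumptions; the real trainer measures each rendered glyph.

```python
def make_box_lines(glyphs, glyph_w=32, glyph_h=48, img_h=64, pad=8):
    """Emit Tesseract box-file lines for glyphs laid out left to right.

    Each line has the form `char left bottom right top`, with the
    origin at the image's bottom-left corner. The fixed glyph_w,
    glyph_h, img_h and pad values are illustrative placeholders.
    """
    lines = []
    x = pad
    for g in glyphs:
        left, right = x, x + glyph_w
        top_y = pad + glyph_h           # glyph bottom, measured from the top
        bottom = img_h - top_y          # flip to bottom-left origin
        top = img_h - pad
        lines.append("%s %d %d %d %d" % (g, left, bottom, right, top))
        x = right + pad
    return lines
```

Writing these lines to `fontname.box` alongside the rendered `fontname.tif` gives the pair the Tesseract training tools expect.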

4) Getting the dictionary to work

One of the big blockers for this project was a non-working dictionary for Indic scripts. It turned out to be one missing line of code, because of which the dictionary subroutine was never called.
Here is how the problem was located.

5) OCRFeeder

I was working on creating a desktop GUI from scratch in PyGtk. Sayamindu suggested that I look at OCRFeeder instead. The code is very nice, and the author has even taken care of surrounding all printable strings with suitable modifiers so gettext can process them for i18n requirements. I am modifying the GUI to support other scripts suitably. I am yet to upload it to a public space, but will do so soon. Sayamindu and I fixed a few problems with it during FOSS.IN 2009.

6) Tilt method

Preliminary results are in the December 7 entry below.

7) Community Building

At FOSS.IN I saw a strong urge in people to work on OCR-related problems. I felt responsible for creating a community, and a framework for the OCR project that makes community contribution easy.
For a technology-intensive project, the traditional FOSS model does not work in the same way. You generally won't expect people to tweak the core algorithms in the pattern matching or machine learning components. This is something Prof. C.V. Jawahar said, and I find it true for Tesseract-OCR too. In the case of Tesseract, a lot of people work on training data, fix bugs, tweak parameters and build UIs, but very rarely does someone decide to touch the core algorithms.
The fact is (as said by Prof. Anoop), core algorithms and the training data/UI share a 50/50 ratio of importance in OCR development.
It is my intention to create a feedback-based learning system for the OCR, which makes it trivially easy for the user to send erroneous recognitions back to a maintainer, and trivially easy for the maintainer to incorporate that data into a newer, better training set.
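The feedback loop could be as simple as a small, self-describing record per correction. The JSON shape below is a hypothetical wire format, not an existing Tesseract facility: the user's tool emits records, and the maintainer's tool turns each one back into a box-file line for the next training round.

```python
import json

def feedback_record(page_id, box, recognised, corrected):
    """Bundle one erroneous recognition for sending to a maintainer.

    box is (left, bottom, right, top) in box-file coordinates.
    The record format is a hypothetical sketch.
    """
    return json.dumps({
        "page": page_id,
        "box": box,
        "got": recognised,
        "expected": corrected,
    }, ensure_ascii=False)

def to_box_line(record):
    """Convert a feedback record into a Tesseract box-file line."""
    d = json.loads(record)
    return "%s %d %d %d %d" % ((d["expected"],) + tuple(d["box"]))
```

The maintainer appends the emitted lines (plus the referenced image crops) to the training set and retrains.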


Things that remain to be done:

1) Documentation on how different language teams can help

2) Integrating OCRFeeder with Training and Testing frameworks. Create feedback module.

3) Web based OCR. Feedback based learning mechanism

4) Can the dictionary be improved?

5) OCRFeeder page layout analysis is a little off

Monday, December 7, 2009

Preliminary results for tilt method

I wrote this Python code that reads in a box file and performs the rotation operation on the corresponding image:

#-*- coding:utf8 -*-

import Image, ImageDraw
import sys

box = open(sys.argv[1], 'r')

lines = box.readlines()
image_name = sys.argv[1].split('.')[0] + '.tif'

input_image = Image.open(image_name)

wt = input_image.size[0]
ht = input_image.size[1]
#print wt, " ", ht
new_image = Image.new("L", (wt*2, ht), 255)

offset = 0
prevtlx = 0
for line in lines:
    # Box format: char left bottom right top, origin at bottom-left;
    # flip the y-coordinates for PIL's top-left origin.
    fields = line.split(' ')
    delta_y = int(fields[4].strip()) - int(fields[2])
    delta_x = int(fields[3]) - int(fields[1])
    top_left_x = int(fields[1])
    top_left_y = ht - int(fields[2]) - delta_y
    bot_right_x = int(fields[3])
    bot_right_y = ht - int(fields[4].strip()) + delta_y
    box = (top_left_x, top_left_y, bot_right_x, bot_right_y)
    char = input_image.crop(box)
    char = char.rotate(90)
    # A new text line has started; reset the running offset.
    if top_left_x < prevtlx:
        offset = 0

    newwt = char.size[0]
    newht = char.size[1]

    newbox = (top_left_x + offset, top_left_y,
              top_left_x + offset + newwt, top_left_y + newht)
    print newbox
    offset = offset + (newwt - newht + 2)
    prevtlx = top_left_x

    new_image.paste(char, newbox)

# Write out the rotated layout (output filename assumed).
new_image.save(image_name.split('.')[0] + '_rotated.tif')

Then I take an image and run the following command on it to generate the box file:
tesseract bengali2.tif bengali2 -l ban batch.nochop makebox
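For reference, each line of the generated box file has the form `char left bottom right top`, with y measured upward from the bottom of the image, while PIL crops use (left, upper, right, lower) measured from the top. The conversion the script above performs can be isolated into a small helper:

```python
def box_to_crop(line, img_height):
    """Convert one Tesseract box-file line to a PIL-style crop box.

    Box files store `char left bottom right top` with a bottom-left
    origin; PIL's crop() wants (left, upper, right, lower) from the
    top-left, so the y-coordinates are flipped against img_height.
    """
    fields = line.strip().split(' ')
    char = fields[0]
    left, bottom, right, top = (int(f) for f in fields[1:5])
    upper = img_height - top
    lower = img_height - bottom
    return char, (left, upper, right, lower)
```

The returned tuple can be passed straight to `image.crop()`.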

Running the script gives the images below:

Transformed Image

The experiment has been somewhat disappointing. The quality of the character images degrades after rotation, and since the boxing is not perfect, the wrong groups have been rotated. That does not mean the technique cannot be used. I need to make the same modifications in the Tesseract C++ code. The idea is to rotate the character images and compare the classifier confidence between the original and the modified character image; the higher value will be chosen.
Also, I need a version of the Pango renderer that can render the vowel signs without the dotted circles. As Sayamindu said, I probably need to make a few lines of changes and rebuild Pango.
So here I dive into the code base again.
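The confidence-comparison idea mentioned above can be sketched in a few lines. The `classify` and `rotate` callables here are stand-ins for the Tesseract C++ internals the change would live in:

```python
def best_orientation(classify, img, rotate):
    """Keep whichever orientation the classifier is more confident on.

    classify(img) returns a (label, confidence) pair; rotate(img)
    turns the image 90 degrees. Both are hypothetical stand-ins for
    Tesseract's internal classifier and image ops.
    """
    label, conf = classify(img)
    rot_label, rot_conf = classify(rotate(img))
    return (rot_label, rot_conf) if rot_conf > conf else (label, conf)
```

If the rotated glyph segments cleanly while the original does not, its confidence should win and the rotated reading is kept.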

Sunday, December 6, 2009

The tilt method for character segmentation in Indic Scripts

One of the problems in Indic script character classification is the huge number of glyphs. This is mostly due to conjuncts, but a major component of the glyph count is also symbols formed by consonant + vowel sign combinations. Once a consonant and a vowel sign overlap on a vertical axis, Tesseract has to be trained with that entire symbol. This is because Tesseract does a left-to-right scan of the image and can only box a wholly connected component. It then sub-divides the box, again on a vertical axis, in case it fails to recognise the entire word. For example:

For the image below, the OCR may first box the 2 characters together.

At the next iteration, it will split the box into 2 so that it has a better chance of identifying the characters.

Hence, for a symbol like কু the OCR cannot segment ক and ু separately.
There is a hack for this, though: what if we rotate the image by 90 degrees counter-clockwise?

As you can see, rotating the symbol allows Tesseract to box the vowel separately. We can train the rotated symbols to stand for a particular character.
This will significantly reduce the number of character classes to be trained for Tesseract OCR. I am working on the Python script that does this transformation of the image.
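The effect can be demonstrated on a toy binary glyph: a blank horizontal gap between the consonant and the vowel sign below it becomes a blank vertical gap after rotation, which is exactly the kind of cut Tesseract's vertical-axis splitting can make. A minimal sketch:

```python
def rotate_ccw(img):
    """Rotate a binary image 90 degrees counter-clockwise.

    img is a list of rows; new row c is built from old column w-1-c.
    """
    h, w = len(img), len(img[0])
    return [[img[r][w - 1 - c] for r in range(h)] for c in range(w)]

# A toy consonant (top rows) with a vowel sign below it (bottom
# rows): they overlap on every vertical axis, so no vertical cut
# can separate them in this orientation.
glyph = [[1, 1, 1],
         [1, 1, 1],
         [0, 0, 0],
         [0, 1, 0],
         [0, 1, 0]]

rotated = rotate_ccw(glyph)
# After rotation the blank row has become a blank column, so a
# vertical cut between the two components is now possible.
```

Column 2 of `rotated` is entirely blank, with ink on both sides of it, which is what lets the boxer treat consonant and vowel sign as separate components.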