<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-6518118086872671696</id><updated>2011-07-30T20:48:13.440-07:00</updated><category term='linux'/><category term='sayamindu ocr ambiguity'/><category term='picasa'/><category term='wchar'/><category term='sayamindu'/><category term='dawg'/><category term='ocr tilt'/><category term='bengali conjuncts'/><category term='indic meet'/><category term='utf8'/><category term='tesseract ocr trainer'/><category term='pango'/><category term='tilt'/><category term='imagechops'/><category term='degradation'/><category term='tesseractindic'/><category term='ocr'/><category term='cairo'/><category term='mbstowcs'/><category term='ocr histogram minima'/><title type='text'>Tesseract-Indic-OCR</title><subtitle type='html'>http://code.google.com/p/tesseractindic | http://groups.google.com/group/indic-ocr</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>31</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-3039343151999442890</id><published>2011-04-10T10:07:00.000-07:00</published><updated>2011-04-10T11:38:03.191-07:00</updated><title type='text'>What next?</title><content type='html'>Step one was to get the maatraa clipping code into Tesseract, which has happened. We still have the following issues to resolve before we can have excellent recognition rates:&lt;br /&gt;&lt;br /&gt;We need to split the following glyphs into separate consonant and vowel signs.&lt;br /&gt;&lt;br /&gt;1) Consonant + descending vowel sign&lt;br /&gt;&lt;br /&gt;Example:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}  catch(e) {}" href="http://3.bp.blogspot.com/-_7K3yLOFz78/TaH0Id0C7bI/AAAAAAAAH08/wb64lIj13zY/s1600/Screenshot-1.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 222px; height: 99px;" src="http://3.bp.blogspot.com/-_7K3yLOFz78/TaH0Id0C7bI/AAAAAAAAH08/wb64lIj13zY/s400/Screenshot-1.png" alt="" id="BLOGGER_PHOTO_ID_5594020638449921458" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;2) Consonant +  ascending vowel sign&lt;br /&gt;&lt;br /&gt;Example:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}  catch(e) {}" href="http://3.bp.blogspot.com/-s9LiQQPBuOY/TaH0ja_uRqI/AAAAAAAAH1E/kmL26DJsbsI/s1600/Screenshot-2.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 265px; height: 89px;" src="http://3.bp.blogspot.com/-s9LiQQPBuOY/TaH0ja_uRqI/AAAAAAAAH1E/kmL26DJsbsI/s400/Screenshot-2.png" alt="" id="BLOGGER_PHOTO_ID_5594021101550061218" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;In summary we need to be able to do 
the following transformation before sending the image to Tesseract:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;                                                                                                                                                   FROM&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}  catch(e) {}" href="http://4.bp.blogspot.com/-4_jXKSwTr54/TaH3QPL0U0I/AAAAAAAAH1Q/5jGbfxLztXc/s1600/Screenshot-3.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 134px;" src="http://4.bp.blogspot.com/-4_jXKSwTr54/TaH3QPL0U0I/AAAAAAAAH1Q/5jGbfxLztXc/s400/Screenshot-3.png" alt="" id="BLOGGER_PHOTO_ID_5594024070496932674" border="0" /&gt;&lt;/a&gt;                                                                                                                                                            TO&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}  catch(e) {}" href="http://3.bp.blogspot.com/-Mb_nA5L4g0g/TaH44gmpXoI/AAAAAAAAH1s/O8KddFaW5lc/s1600/transformed.png"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 400px; height: 55px;" src="http://3.bp.blogspot.com/-Mb_nA5L4g0g/TaH44gmpXoI/AAAAAAAAH1s/O8KddFaW5lc/s400/transformed.png" alt="" id="BLOGGER_PHOTO_ID_5594025861879258754" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-3039343151999442890?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/3039343151999442890/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2011/04/what-next.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/6518118086872671696/posts/default/3039343151999442890'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/3039343151999442890'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2011/04/what-next.html' title='What next?'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-_7K3yLOFz78/TaH0Id0C7bI/AAAAAAAAH08/wb64lIj13zY/s72-c/Screenshot-1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-8698872075408910958</id><published>2011-04-04T13:47:00.000-07:00</published><updated>2011-04-07T13:15:04.870-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ocr histogram minima'/><title type='text'>Horizontal histogram profiles of consonants with descending vowel signs</title><content type='html'>&lt;pre  style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;;background-image:URL(http://2.bp.blogspot.com/_z5ltvMQPaa8/SjJXr_U2YBI/AAAAAAAAAAM/46OqEP32CJ8/s320/codebg.gif);padding:0px;color:#000000;text-align:left;line-height:20px;"&gt;&lt;code style="color:#000000;word-wrap:normal;"&gt; #!/usr/bin/python  &lt;br /&gt; #-*- coding:utf8 -*-  &lt;br /&gt; import Image,ImageDraw  &lt;br /&gt; import sys, os  &lt;br /&gt; def horizontalHistogram (input_image, image_name):  &lt;br /&gt;   width = input_image.size[0]  &lt;br /&gt;   height = input_image.size[1]  &lt;br /&gt;   print "Width = %d and height = %d" %(width,height)  &lt;br /&gt;   histoGram = []  &lt;br /&gt;   for i in range (height):  
&lt;br /&gt;     blackPixelCount = 0  &lt;br /&gt;     for j in range (width):  &lt;br /&gt;       pixel = input_image.getpixel ((j, i))  &lt;br /&gt;       if (pixel == 0):  &lt;br /&gt;         blackPixelCount += 1  &lt;br /&gt;     histoGram.append (blackPixelCount)  &lt;br /&gt;   print histoGram  &lt;br /&gt;   histogramImage = Image.new("L",(width,height),255)  &lt;br /&gt;   pen = ImageDraw.Draw(histogramImage)  &lt;br /&gt;   y = 0  &lt;br /&gt;   for count in histoGram:  &lt;br /&gt;     pen.line((0, y) + (count, y), fill=128)  &lt;br /&gt;     y += 1  &lt;br /&gt;   cumulativeImage = Image.new("L",(width * 2, height),255)  &lt;br /&gt;   cumulativeImage.paste (input_image, (0, 0, width, height))  &lt;br /&gt;   cumulativeImage.paste (histogramImage, (width, 0, width*2, height))  &lt;br /&gt;   cumulativeImage.save(image_name+"_"+".png","PNG")  &lt;br /&gt; def verticalHistogram ():  &lt;br /&gt;   pass  &lt;br /&gt; path = "/home/debayan/code/lower_descender_images/"  &lt;br /&gt; dirName = os.listdir (path)  &lt;br /&gt; for image_name in dirName:  &lt;br /&gt;   input_image = Image.open(path + image_name)  &lt;br /&gt;   horizontalHistogram (input_image, image_name)  &lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The piece of code above takes a set of images (consonant + vowel) in the folder /home/debayan/code/lower_descender_images/ generated by &lt;a href="http://code.google.com/p/tesseractindic/source/browse/trunk/tesseract_trainer/generate.py"&gt;http://code.google.com/p/tesseractindic/source/browse/trunk/tesseract_trainer/generate.py&lt;/a&gt; and generates tiny images with horizontal histogram profiles. The generated set can be found at &lt;a href="https://picasaweb.google.com/debayanin/OCRStuff"&gt;https://picasaweb.google.com/debayanin/OCRStuff&lt;/a&gt; .&lt;br /&gt;Observing the set thus generated can give us insights as to where the descender vowel sign begins. 
If we know this, we can separate the consonant and vowel sign. One general observation in this case is that the point where the vowel sign begins produces a local minimum in the histogram profile.&lt;br /&gt;&lt;br /&gt;Here are some examples. Note the red horizontal lines which mark the minima in the histogram:&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}  catch(e) {}" href="http://1.bp.blogspot.com/-Y7CaiQH_iZ4/TZ4UVtTeJzI/AAAAAAAAH0k/7c6DMj-zlhY/s1600/46.png_.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 86px; height: 103px;" src="http://1.bp.blogspot.com/-Y7CaiQH_iZ4/TZ4UVtTeJzI/AAAAAAAAH0k/7c6DMj-zlhY/s400/46.png_.jpg" alt="" id="BLOGGER_PHOTO_ID_5592930150411806514" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Tthri in Bengali&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}  catch(e) {}" href="http://2.bp.blogspot.com/-QcV9tS8LTS8/TZ4U6v5Hi4I/AAAAAAAAH0s/yW8xBjAVViw/s1600/97.png_.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 104px; height: 65px;" src="http://2.bp.blogspot.com/-QcV9tS8LTS8/TZ4U6v5Hi4I/AAAAAAAAH0s/yW8xBjAVViw/s400/97.png_.jpg" alt="" id="BLOGGER_PHOTO_ID_5592930786761739138" border="0" /&gt;&lt;/a&gt;Mu in Bengali&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;These are the exceptions:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}  catch(e) {}" href="http://4.bp.blogspot.com/-YI-OAnx9KvM/TZ4QfF2DruI/AAAAAAAAHzc/90km8uwvQd0/s1600/25.png_.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 108px; height: 66px;" src="http://4.bp.blogspot.com/-YI-OAnx9KvM/TZ4QfF2DruI/AAAAAAAAHzc/90km8uwvQd0/s400/25.png_.jpg" alt="" id="BLOGGER_PHOTO_ID_5592925913571634914" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Chhu 
in Bengali&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}  catch(e) {}" href="http://4.bp.blogspot.com/-qJYfht7Ilfs/TZ4Qz2UYYfI/AAAAAAAAHzk/bhJi4umzR3s/s1600/69.png_.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 84px; height: 65px;" src="http://4.bp.blogspot.com/-qJYfht7Ilfs/TZ4Qz2UYYfI/AAAAAAAAHzk/bhJi4umzR3s/s400/69.png_.jpg" alt="" id="BLOGGER_PHOTO_ID_5592926270181106162" border="0" /&gt;&lt;/a&gt;Du in Bengali&lt;br /&gt;&lt;br /&gt;With some fonts the above glyph might not show a local minimum in the histogram.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}  catch(e) {}" href="http://1.bp.blogspot.com/-V4uLGpt2UOA/TZ4RURv7HhI/AAAAAAAAHzs/8PgUCPQjRJw/s1600/30.png_.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 138px; height: 75px;" src="http://1.bp.blogspot.com/-V4uLGpt2UOA/TZ4RURv7HhI/AAAAAAAAHzs/8PgUCPQjRJw/s400/30.png_.jpg" alt="" id="BLOGGER_PHOTO_ID_5592926827300199954" border="0" /&gt;&lt;/a&gt;Jri in Bengali&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}  catch(e) {}" href="http://3.bp.blogspot.com/-9GBVzMxTjR0/TZ4SNm9wiJI/AAAAAAAAH0I/VKOzx0Xp2Rg/s1600/126.png_.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 104px; height: 58px;" src="http://3.bp.blogspot.com/-9GBVzMxTjR0/TZ4SNm9wiJI/AAAAAAAAH0I/VKOzx0Xp2Rg/s400/126.png_.jpg" alt="" id="BLOGGER_PHOTO_ID_5592927812247914642" border="0" /&gt;&lt;/a&gt;Hri in Bengali&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}  catch(e) {}" href="http://1.bp.blogspot.com/-MpaIDlgarD4/TZ4SJXUjveI/AAAAAAAAH0A/qkQ3TaDbMlU/s1600/125.png_.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 94px; height: 
60px;" src="http://1.bp.blogspot.com/-MpaIDlgarD4/TZ4SJXUjveI/AAAAAAAAH0A/qkQ3TaDbMlU/s400/125.png_.jpg" alt="" id="BLOGGER_PHOTO_ID_5592927739329101282" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Hu in Bengali&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();}   catch(e) {}" href="http://4.bp.blogspot.com/-s_NXfvjHeeQ/TZ4SFffGxFI/AAAAAAAAHz4/8J6xlo7-IbU/s1600/9.png_.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 112px; height: 51px;" src="http://4.bp.blogspot.com/-s_NXfvjHeeQ/TZ4SFffGxFI/AAAAAAAAHz4/8J6xlo7-IbU/s400/9.png_.jpg" alt="" id="BLOGGER_PHOTO_ID_5592927672801346642" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Gu in Bengali&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-8698872075408910958?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/8698872075408910958/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2011/04/horizontal-histogram-profiles-of.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/8698872075408910958'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/8698872075408910958'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2011/04/horizontal-histogram-profiles-of.html' title='Horizontal histogram profiles of consonants with descending vowel signs'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-Y7CaiQH_iZ4/TZ4UVtTeJzI/AAAAAAAAH0k/7c6DMj-zlhY/s72-c/46.png_.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-2135198976857488721</id><published>2009-12-16T10:07:00.000-08:00</published><updated>2009-12-16T11:17:34.893-08:00</updated><title type='text'>Review of current status and vision document</title><content type='html'>Details of work done till now in the Tesseract-Indic project&lt;br /&gt;-------------------------------------------------------------------------&lt;br /&gt;&lt;br /&gt;1) &lt;span style="font-weight: bold;"&gt;Maatraa Clipping&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Maatraa here refers to shironaam, or the headline in Devanagari and Bengali script.&lt;br /&gt;&lt;br /&gt;The first step in adapting Tesseract-OCR to recognise Indic scripts like Devanagari and Bengali was to clip (remove) the shironaam at points between successive characters so that Tesseract's connected component analysis does not mistake the entire word for a character.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf"&gt;Here&lt;/a&gt; is the algorithm, and the code is in the form of a patch in the May 27th, 2008 entry on &lt;a href="http://sites.google.com/site/debayanin/hackingtesseract"&gt;http://sites.google.com/site/debayanin/hackingtesseract&lt;/a&gt; .&lt;br /&gt;&lt;br /&gt;&lt;a href="http://research.google.com/pubs/author4479.html"&gt;Ray Smith&lt;/a&gt;, the project owner of Tesseract-OCR, commented on the code &lt;a href="http://groups.google.com/group/tesseract-ocr/browse_thread/thread/f68c8bfbbb9964f8/e6b41d0b697bc997?lnk=gst&amp;amp;q=false+positives#e6b41d0b697bc997"&gt;here&lt;/a&gt; and &lt;a href="http://www.iupr.com/"&gt;Thomas Breuel&lt;/a&gt; makes a &lt;a 
href="http://sites.google.com/site/ocropus/old-documentation/morphological-operations"&gt;mention&lt;/a&gt; of "Matraa Clipping" in the morphological operations wiki in the OCRopus project.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;2) &lt;span style="font-weight: bold;"&gt;De-skewing&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For the above clipping algorithm to work, the page should be perfectly aligned. There should be no skew/tilt during the OCR process. For this purpose a &lt;a href="http://tesseractindic.googlecode.com/files/skew_deskew.pdf"&gt;de-skewing algorithm&lt;/a&gt; was required. I wrote an ad-hoc algorithm for that purpose, which has been disabled by default in recent releases of tesseract-indic. Better deskewing methods are available elsewhere. The code can be found in the October 28 entry at &lt;a href="http://sites.google.com/site/debayanin/hackingtesseract"&gt;http://sites.google.com/site/debayanin/hackingtesseract&lt;/a&gt; .&lt;br /&gt;&lt;br /&gt;3) &lt;span style="font-weight: bold;"&gt;Training Data Auto Generation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I was initially working alone. One of the biggest problems of working alone on an OCR project is generating training data for different scripts. I tried to solve the problem by rendering all possible glyphs for a script onto an image, recording corresponding bounding boxes to a text file and then feeding the pair to the Tesseract-OCR training mechanism.&lt;br /&gt;Instructions on how to use it can be found here and you may download the latest version at http://code.google.com/p/tesseractindic/downloads/list . The latest version at the time of writing is TesseractIndic-Trainer-GUI-0.1.3 .&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;4) &lt;span style="font-weight: bold;"&gt;Getting the dictionary to work&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;One of the big blockers for this project was a non-working dictionary for Indic scripts. 
It turned out that one missing line of code meant the dictionary subroutine was never called.&lt;br /&gt;&lt;a href="http://hacking-tesseract.blogspot.com/2009/11/tesseract-dictionary-finally-works-for.html"&gt;Here&lt;/a&gt; &lt;a href="http://hacking-tesseract.blogspot.com/2009/11/how-dictionary-was-fixed.html"&gt;is&lt;/a&gt; how the problem was located.&lt;br /&gt;&lt;br /&gt;5) &lt;span style="font-weight: bold;"&gt;OCRFeeder&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I was working on creating a desktop GUI from scratch in PyGtk. Sayamindu suggested that I look at &lt;a href="http://live.gnome.org/OCRFeeder"&gt;OCRFeeder&lt;/a&gt; instead. The code is very nice and the author has even taken care of surrounding all printable strings with suitable modifiers so gettext can process them for i18n requirements. I am modifying the GUI to support other scripts suitably. I am yet to upload it to a public space, but will do so soon. Sayamindu and I fixed a few problems with it during FOSS.IN 2009.&lt;br /&gt;&lt;br /&gt;6) &lt;span style="font-weight: bold;"&gt;Tilt method&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://hacking-tesseract.blogspot.com/2009/12/tilt-method-for-character-segmentation.html"&gt;http://hacking-tesseract.blogspot.com/2009/12/tilt-method-for-character-segmentation.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a 
href="http://hacking-tesseract.blogspot.com/2009/12/preliminary-results-for-tilt-method.html"&gt;http://hacking-tesseract.blogspot.com/2009/12/preliminary-results-for-tilt-method.html&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;7) &lt;span style="font-weight: bold;"&gt;Community Building&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At FOSS.IN I saw a strong urge in people to work on OCR-related problems. I felt responsible for creating a community and a framework for the OCR project that makes community contribution an easy process.&lt;br /&gt;For a technology intensive project, the traditional FOSS model does not work in the same way. You generally won't expect people to tweak core algorithms in pattern matching or machine learning components. This is something that &lt;a href="http://mail.iiit.ac.in/%7Ejawahar/"&gt;Prof. C.V. Jawahar&lt;/a&gt; said, and I find it true for Tesseract-OCR too. In the case of Tesseract, a lot of people work on training data, fixing bugs, tweaking parameters, creating UIs but very rarely does someone decide to touch the core algorithms.&lt;br /&gt;The fact is (as said by &lt;a href="http://cvit.iiit.ac.in/index.php?page=people"&gt;Prof. Anoop&lt;/a&gt;), core algorithms and the training data/UI share a 50/50 ratio in importance in OCR development.&lt;br /&gt;It is my intention to create a feedback based learning system for the OCR, which makes it trivially easy for the user to send back erroneous recognitions to a maintainer, and trivially easy for the maintainer to incorporate that data into a newer, better training set.&lt;br /&gt;&lt;br /&gt;http://hacking-tesseract.blogspot.com/2009/11/crowd-sourcing-ocr-development.html&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;ToDo&lt;br /&gt;------&lt;br /&gt;&lt;br /&gt;1) Documentation on how different language teams can help&lt;br /&gt;&lt;br /&gt;2) Integrating OCRFeeder with Training and Testing frameworks. Create feedback module.&lt;br /&gt;&lt;br /&gt;3) Web based OCR. 
Feedback based learning mechanism&lt;br /&gt;&lt;br /&gt;4) Can the dictionary be improved?&lt;br /&gt;&lt;br /&gt;5) OCRFeeder page layout analysis is a little off&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-2135198976857488721?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/2135198976857488721/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/12/review-of-current-status-and-vision.html#comment-form' title='13 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2135198976857488721'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2135198976857488721'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/12/review-of-current-status-and-vision.html' title='Review of current status and vision document'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>13</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-5084603468632410341</id><published>2009-12-07T00:22:00.000-08:00</published><updated>2009-12-07T00:58:02.573-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ocr tilt'/><title type='text'>Preliminary results for tilt method</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="http://4.bp.blogspot.com/_j0a3i0txX5Q/SxzBXHFdV7I/AAAAAAAAGVU/Oztf81a39wE/s1600-h/bengali2.png"&gt;&lt;/a&gt;&lt;br /&gt;I wrote this python code that reads in a box file and performs the rotation operation on the corresponding image&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"&gt;&lt;code&gt;#!/usr/bin/python&lt;br /&gt;#-*- coding:utf8 -*-&lt;br /&gt;&lt;br /&gt;import Image,ImageDraw&lt;br /&gt;import sys&lt;br /&gt;&lt;br /&gt;box = open(sys.argv[1],'r')&lt;br /&gt;print type(sys.argv[1])&lt;br /&gt;&lt;br /&gt;lines = box.readlines()&lt;br /&gt;image_name = sys.argv[1].split('.')[0]+'.tif'&lt;br /&gt;&lt;br /&gt;input_image = Image.open(image_name)&lt;br /&gt;&lt;br /&gt;wt = input_image.size[0]&lt;br /&gt;ht = input_image.size[1]&lt;br /&gt;#print wt,&amp;quot; &amp;quot;,ht&lt;br /&gt;new_image=Image.new(&amp;quot;L&amp;quot;,(wt*2,ht),255)&lt;br /&gt;pen=ImageDraw.Draw(new_image)&lt;br /&gt;&lt;br /&gt;offset = 0&lt;br /&gt;prevtlx = 0&lt;br /&gt;for line in lines:&lt;br /&gt;    fields = line.split(' ')&lt;br /&gt;    delta_y = int(int(fields[4].strip())) - int(fields[2])&lt;br /&gt;    delta_x = int(fields[3]) - int(fields[1])&lt;br /&gt;    top_left_x = int(fields[1])&lt;br /&gt;    top_left_y = ht - int(fields[2]) - delta_y&lt;br /&gt;    bot_right_x = int(fields[3])&lt;br /&gt;    bot_right_y = ht - int(fields[4].strip()) + delta_y&lt;br /&gt;    box = (top_left_x,top_left_y,bot_right_x,bot_right_y)&lt;br /&gt;    char = input_image.crop(box)&lt;br /&gt;    char = char.rotate(90)&lt;br /&gt;    if top_left_x&amp;lt;prevtlx:&lt;br /&gt;        offset = 0&lt;br /&gt;    &lt;br /&gt;    newwt = char.size[0]&lt;br /&gt;    newht = char.size[1]&lt;br /&gt;&lt;br /&gt;    newbox = (top_left_x+offset , top_left_y , top_left_x+offset+newwt 
,top_left_y+newht)&lt;br /&gt;    print newbox&lt;br /&gt;    offset = offset+ (newwt - newht + 2)&lt;br /&gt;    prevtlx = top_left_x&lt;br /&gt;&lt;br /&gt;    new_image.paste(char, newbox)&lt;br /&gt;    #raw_input('&amp;gt;')&lt;br /&gt;new_image.save('mod.tif',&amp;quot;TIFF&amp;quot;)&lt;br /&gt;&lt;/code&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Then I take an image and run the following command on it to generate the box file:&lt;br /&gt;&lt;blockquote&gt;tesseract bengali2.tif bengali2 -l ban batch.nochop makebox&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/blockquote&gt;On running the script one finds the images below:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j0a3i0txX5Q/SxzBXHFdV7I/AAAAAAAAGVU/Oztf81a39wE/s1600-h/bengali2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 308px;" src="http://4.bp.blogspot.com/_j0a3i0txX5Q/SxzBXHFdV7I/AAAAAAAAGVU/Oztf81a39wE/s400/bengali2.png" alt="" id="BLOGGER_PHOTO_ID_5412413454975588274" border="0" /&gt;Original Image&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div style="text-align: right;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j0a3i0txX5Q/SxzBXiV9nAI/AAAAAAAAGVc/TjiV6xs8bqc/s1600-h/mod.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 154px;" src="http://2.bp.blogspot.com/_j0a3i0txX5Q/SxzBXiV9nAI/AAAAAAAAGVc/TjiV6xs8bqc/s400/mod.png" alt="" id="BLOGGER_PHOTO_ID_5412413462292569090" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;Transformed Image&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;The experiment has been somewhat disappointing. The quality of the character images degrades after rotation. 
Also, since the boxing is not perfect, the wrong groups have been rotated. This does not mean the technique cannot be used. I need to make the same modifications in the Tesseract C++ code. The idea is to rotate the character images and compare the classifier confidence between the original and the modified character image. The result with the higher confidence will be chosen.&lt;br /&gt;Also, I need a version of the Pango renderer that can render the vowel signs without the dotted circles. I probably need to make a few lines of changes and rebuild Pango, as Sayamindu said.&lt;br /&gt;So here I dive into the code base again.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-5084603468632410341?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/5084603468632410341/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/12/preliminary-results-for-tilt-method.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/5084603468632410341'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/5084603468632410341'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/12/preliminary-results-for-tilt-method.html' title='Preliminary results for tilt method'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://4.bp.blogspot.com/_j0a3i0txX5Q/SxzBXHFdV7I/AAAAAAAAGVU/Oztf81a39wE/s72-c/bengali2.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-8921184679868469350</id><published>2009-12-06T00:58:00.000-08:00</published><updated>2009-12-06T01:17:50.359-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='tilt'/><title type='text'>The tilt method for character segmentation in Indic Scripts</title><content type='html'>One of the problems in Indic script character classification is the huge number of glyphs. This is mostly due to conjuncts. But symbols formed by consonant + vowel sign combinations also account for a large share of the glyphs. Once a combination of consonant and vowel sign overlaps on a vertical axis, Tesseract has to be trained with that entire symbol. This is because Tesseract does a left-to-right scan of the image and can only box a wholly connected component. Then it proceeds to sub-divide the box, again on a vertical axis, in case it fails to recognise the entire word. 
For example:&lt;br /&gt;&lt;br /&gt;For the image below, the OCR may first box the 2 characters together.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j0a3i0txX5Q/Sxt1CR-807I/AAAAAAAAGUM/NxRDJQRqI1I/s1600-h/Screenshot-1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 90px; height: 71px;" src="http://3.bp.blogspot.com/_j0a3i0txX5Q/Sxt1CR-807I/AAAAAAAAGUM/NxRDJQRqI1I/s400/Screenshot-1.png" alt="" id="BLOGGER_PHOTO_ID_5412048059263407026" border="0" /&gt;&lt;/a&gt;At the next iteration, it will split the box into 2 so that it has a better chance of identifying the characters.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_j0a3i0txX5Q/Sxt1UP_zyVI/AAAAAAAAGUU/XmlppXKn39U/s1600-h/kaga.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 90px; height: 71px;" src="http://1.bp.blogspot.com/_j0a3i0txX5Q/Sxt1UP_zyVI/AAAAAAAAGUU/XmlppXKn39U/s400/kaga.png" alt="" id="BLOGGER_PHOTO_ID_5412048367967783250" border="0" /&gt;&lt;/a&gt;Hence, for a symbol like কু the OCR can not segment ক and ু separately.&lt;br /&gt;There is a hack for this though. 
What if we rotate the image by 90 degrees counter-clockwise?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j0a3i0txX5Q/Sxt11-0TTeI/AAAAAAAAGUc/AXGooj_Y9HM/s1600-h/ku.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 97px; height: 90px;" src="http://4.bp.blogspot.com/_j0a3i0txX5Q/Sxt11-0TTeI/AAAAAAAAGUc/AXGooj_Y9HM/s400/ku.png" alt="" id="BLOGGER_PHOTO_ID_5412048947471666658" border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j0a3i0txX5Q/Sxt2FPic_gI/AAAAAAAAGUk/Y6pvUj0aRdQ/s1600-h/ku90.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 90px; height: 97px;" src="http://2.bp.blogspot.com/_j0a3i0txX5Q/Sxt2FPic_gI/AAAAAAAAGUk/Y6pvUj0aRdQ/s400/ku90.png" alt="" id="BLOGGER_PHOTO_ID_5412049209658244610" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;As you can see, rotating the symbol allows Tesseract to box the vowel sign separately. We can then train each rotated symbol to stand for a particular character.&lt;br /&gt;This will significantly reduce the number of character classes to be trained for Tesseract OCR. 
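&lt;br /&gt;&lt;br /&gt;Here is a rough sketch of the idea on a toy bitmap. It is not the script mentioned in this post: the glyph and the helpers rotate_ccw and column_segments are made up for illustration, and splitting on blank columns is only a stand-in for the way Tesseract actually boxes components.

```python
# Hypothetical toy glyph: 1 = ink, 0 = background.
# Rows 0-2 stand in for a consonant, row 4 for a vowel sign drawn below it.
GLYPH = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]


def rotate_ccw(image):
    """Rotate a 2D list of pixels by 90 degrees counter-clockwise."""
    # Transposing turns columns into rows; reversing puts the last column on top.
    return [list(row) for row in zip(*image)][::-1]


def column_segments(image):
    """Return (start, end) column ranges separated by fully blank columns."""
    segments = []
    start = None
    width = len(image[0])
    for x in range(width):
        has_ink = any(row[x] for row in image)
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            segments.append((start, x - 1))
            start = None
    if start is not None:
        segments.append((start, width - 1))
    return segments


upright = column_segments(GLYPH)
rotated = column_segments(rotate_ccw(GLYPH))
print(len(upright), len(rotated))
```

In the upright glyph every column contains ink, so a vertical cut cannot separate the two parts. After the rotation, the blank row between them becomes a blank column and the two parts fall into separate ranges.&lt;br /&gt;&lt;br /&gt;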
I am working on the Python script that does this transformation of the image.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-8921184679868469350?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/8921184679868469350/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/12/tilt-method-for-character-segmentation.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/8921184679868469350'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/8921184679868469350'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/12/tilt-method-for-character-segmentation.html' title='The tilt method for character segmentation in Indic Scripts'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_j0a3i0txX5Q/Sxt1CR-807I/AAAAAAAAGUM/NxRDJQRqI1I/s72-c/Screenshot-1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-6056892163501708398</id><published>2009-11-28T14:33:00.001-08:00</published><updated>2009-11-28T14:40:43.645-08:00</updated><title type='text'>Initial tests</title><content type='html'>Initial test results are pretty good.&lt;br /&gt;&lt;br /&gt;Test Condition:&lt;br /&gt;&lt;br /&gt;Image: A deskewed image with Bengali text.&lt;br 
/&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j0a3i0txX5Q/SxGmSdIZlTI/AAAAAAAAGKA/MD1JyD_iYG4/s1600/amar158.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 240px; height: 400px;" src="http://2.bp.blogspot.com/_j0a3i0txX5Q/SxGmSdIZlTI/AAAAAAAAGKA/MD1JyD_iYG4/s400/amar158.jpg" alt="" id="BLOGGER_PHOTO_ID_5409287463436391730" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Training data:&lt;br /&gt;     Word List: Superset of all the words contained in the image&lt;br /&gt;     Shape Information: Using CRBLP's data&lt;br /&gt;&lt;br /&gt;Output Text:&lt;br /&gt;&lt;br /&gt;বৈঠকী মেজাজের এক উপেক্ষিত সা হিতি্যক&lt;br /&gt;পত্র।স্তরে অনুজপ্রতিম লেখক ভ্রী শৈবাল  মিত্র কিছুদিন অাগে বা-ঙা লি&lt;br /&gt;লেখবঢ়ুঘুদ্ধিজীবীদের খুব একচেটি বকূনি দিয়েছো। তার অভিযোড়ৈগ এই যে, -যখনই&lt;br /&gt;এই সব ব্যক্তিদের প্রশ্ল কর। হয়, অাপনার। গত এক বছরে উল্লেখযোগ্য কীড়ী কীা বই&lt;br /&gt;পড়েছো, তখন অবধা রিতভাবে প্রায় সকলেই  কিছুইংরিজি বই বা বিদেশি সাহিভ্যের&lt;br /&gt;সূখ্য।তি ভরঢ়ু করেন। তার। কি ব।ংল। বই পড়্গে না ? নিজেরা বাংলা ভাষার লেখক&lt;br /&gt;হয়েও অপর কে।নও বাঙালি লেখকের রচন।কে ণ্ডরঢ়ুত্বপুর্ণ মনে করেন নাড়ৈ? না  কি&lt;br /&gt;ব।ংল। ভ।ষায় উল্লেখযে।গ্য কিছু লেখাই হয় না।&lt;br /&gt;এই অভিযে।গে সত্যতা অাছে। প।ণ্ডিত্য প্রমাণ করার জন্য অনেকেই বিদেশি&lt;br /&gt;স।হিত্য সম্পর্কে জান জ। হির কর।র জন্য ব্যস্ত হয়ে পড়্গে; পা ণ্ডিত্য  কিংবানূরবারি&lt;br /&gt;য।ই হে।ক, ব।ংল। বই-টই এর মধ্যে  অ।সে না। কফি হাউসের বুদ্ধিজীবীদের  হাভে&lt;br /&gt;বাংলা বই র।খ।র রেওয়।জ নেই। ঢেউয়ের মতন কখনও ম।র্কেজ, কখনও-&lt;br /&gt;দেরিদ।-গ্র।মসি, কখনও টনি মরিসন ব।ঙ।লি বুদ্ধিজীবীদের ওণ্ঠের ওপর খেলাকরে&lt;br /&gt;যান।&lt;br /&gt;বিদেশি স।হিত্য ও তত্ত্বগ্রছ প।ঠ করা অবশ্যই জকরি,  কিত বাংলা ভাষায়&lt;br /&gt;অালোচন।যে।গ্য কে।নও গ্রছ লেখ। হয় ন।, এমন য দি মনে কর। হয় তা হলে বাংলা&lt;br /&gt;-ভাষানিয়ে এত গর্ব কর।রইবা কী অ।ছে ? 
বিদেশি স।হিত্য প।ঠ করলেই বরং বে।ঝা&lt;br /&gt;-যায়-, সম্প্রতি অন্যান্য ভ।ষ।য় রচিত গল্প-উপন্য।সবঢ়ুবিত। বাংলার তূলনায় এমন কিছু&lt;br /&gt;-অাহাম রি উচচ।ঙ্গের নয়। সৈয়দ মুস্ত।ফ। সির।জের ড়ৃঅলীক মানুষ,এর মতন উপন্যাস&lt;br /&gt;বিঢ়ুংব। -জয় গে।স্ব।মীর ড়ৈপ।গলী, ভে।মার সঙ্গে-ব তুল্য কাব্যগ্রছইদানীং কোন ভ।ষ।য়&lt;br /&gt;প্রকাশিত হয়েছে?&lt;br /&gt;যাই হ্রোক, অামার পক্ষে এরকম পণ্ডি তিপনা কিংবা স্নবারি দেখ।বার কোনও&lt;br /&gt;সুযে।গই নেই; কারণ গত এব৪ বৎসরে অামি বিদেশি সাহিত্য কিছুই প ড়িনি ! এমনকী&lt;br /&gt;ইংরিজি অক্ষরে লেখ। নিতাস্ত কয়েবঢ়ুখান। পুরনো ইতিহাসখজীবনীগ্রস্হু ছাড়া কোনও&lt;br /&gt;গল্পঞ্ঝউপন্যাস চোখেও দেখি নি! বস্কু-ব।ন্ধবরা কেউ যখন সা।ভ.ঘতিক কোনও&lt;br /&gt;সাড়।-জাগ।নো বইয়ের প্রসঙ্গ তুলে জিজ্ঞেস করে, তূ মি পড়েছ নিশ্চয়ই? অামারে৪&lt;br /&gt; সসংকে।চে স্বীক।র করতেই হয়ড়ৈ, না  ভ।ই পড়িনি! কিংব। বিদেশ খেকে ফিরে এলে&lt;br /&gt;যখন কেউ জিজেস করে, ও দেশের হালফিল সাহিত্যের ধারা কী দেখলে, অামি&lt;br /&gt;মাথা চুলকোই। জা নি না, খবর নেব।র সময় পাইনি! লভনে গিয়ে  ইন্ডিয়া অ ফিস&lt;br /&gt;লাইরেরিভে অামি পুরনো গুথিপত্র ঘেঁটেছি, এবচটাও নজৃব ইংরিজি কবিতার বই&lt;br /&gt;কিনি নি, এটা স্বীকার করভে অামার লজ্জ। হয়। ত্রটা অামার এবচঁটা অধঃপতনের চিহ্ন&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Accuracy: 93% ~&lt;br /&gt;&lt;br /&gt;One major source of errors is । vs া ambiguity. That &lt;a href="http://hacking-tesseract.blogspot.com/2009/11/conversation-with-sayamindu-regarding.html"&gt;can be fixed&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;This is pretty good news. 
The OCR is working well.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-6056892163501708398?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/6056892163501708398/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/initial-tests.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/6056892163501708398'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/6056892163501708398'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/initial-tests.html' title='Initial tests'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_j0a3i0txX5Q/SxGmSdIZlTI/AAAAAAAAGKA/MD1JyD_iYG4/s72-c/amar158.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-6727491424848423371</id><published>2009-11-28T14:25:00.000-08:00</published><updated>2009-11-28T14:28:33.645-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sayamindu ocr ambiguity'/><title type='text'>Conversation with Sayamindu regarding ambiguities</title><content type='html'>Here is a mail i sent to Sayamindu:&lt;br /&gt;"By the way, one difficult problem I am facing is that all the া are&lt;br /&gt;being mistakenly recognised as । . 
The dictionary should help in&lt;br /&gt;resolving this, and also there is a file where we can specify&lt;br /&gt;ambiguities like these. But nothing seems to work.&lt;br /&gt;One way to solve the problem is to add the following rule in the&lt;br /&gt;reorder script: We make a pass and replace all instances of । with া .&lt;br /&gt;Then we make another pass and see whether there are any leftover া&lt;br /&gt;with the dotted circle. These should be replaced by । .&lt;br /&gt;Is the logic ok? How to find out if an া has a dotted circle?"&lt;br /&gt;&lt;br /&gt;He has not replied yet.&lt;br /&gt;&lt;br /&gt;Here is what I think. The change cannot be made simply in the reorder script, which gets executed only in the post-OCR stage. The problem is that the OCR engine itself recognises this wrongly, and that throws off the rest of the recognition.&lt;br /&gt;One solution is to not train । (the Bengali equivalent of a full stop) at all. We can always add the । in the post-OCR script using the method in the mail addressed to Sayamindu.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-6727491424848423371?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/6727491424848423371/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/conversation-with-sayamindu-regarding.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/6727491424848423371'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/6727491424848423371'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/conversation-with-sayamindu-regarding.html' 
title='Conversation with Sayamindu regarding ambiguities'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-1087558713633098813</id><published>2009-11-28T12:16:00.000-08:00</published><updated>2009-11-28T12:17:19.131-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='tesseract ocr trainer'/><title type='text'>TesseractIndic Trainer GUI</title><content type='html'>I just uploaded the TesseractIndic Trainer GUI Version 0.1 to&lt;br /&gt;&lt;a href="http://tesseractindic.googlecode.com/files/TesseracIindic-Trainer-GUI-0.1.tar.gz" target="_blank"&gt;http://tesseractindic.&lt;wbr&gt;googlecode.com/files/&lt;wbr&gt;TesseracIindic-Trainer-GUI-0.&lt;wbr&gt;1.tar.gz&lt;/a&gt;&lt;br /&gt;. 
This application allows a person to generate custom/application&lt;br /&gt;specific training data quickly.&lt;br /&gt;To see how to use it, read&lt;br /&gt;&lt;a href="http://code.google.com/p/tesseractindic/wiki/TrainerGUI" target="_blank"&gt;http://code.google.com/p/&lt;wbr&gt;tesseractindic/wiki/TrainerGUI&lt;/a&gt; or watch&lt;br /&gt;&lt;a href="http://www.youtube.com/watch?v=xuBlfN6Va4k" target="_blank"&gt;http://www.youtube.com/watch?&lt;wbr&gt;v=xuBlfN6Va4k&lt;/a&gt; .&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-1087558713633098813?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/1087558713633098813/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/tesseractindic-trainer-gui.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/1087558713633098813'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/1087558713633098813'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/tesseractindic-trainer-gui.html' title='TesseractIndic Trainer GUI'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-2215315327006319417</id><published>2009-11-25T11:13:00.000-08:00</published><updated>2009-11-25T11:15:23.365-08:00</updated><title type='text'>How the dictionary was 
fixed?</title><content type='html'>Well, it was a single line!&lt;br /&gt;&lt;br /&gt;Added the following line to dict/permute.cpp (line 1080 in the patched file; the diff hunk below starts at line 1077):&lt;br /&gt;&lt;br /&gt;any_alpha=1;&lt;br /&gt;&lt;br /&gt;Here is the diff against the 2.04 release:&lt;br /&gt;&lt;br /&gt;--- tesseract-2.04/dict/permute.cpp    2008-11-14 23:07:17.000000000 +0530&lt;br /&gt;+++ tessmod/dict/permute.cpp    2009-11-26 00:34:50.660737699 +0530&lt;br /&gt;@@ -1077,6 +1077,7 @@&lt;br /&gt;     return (NULL);&lt;br /&gt;   if (permute_only_top)&lt;br /&gt;     return result_1;&lt;br /&gt;+  any_alpha=1;&lt;br /&gt;   if (any_alpha &amp;amp;&amp;amp; array_count (char_choices) &lt;= MAX_WERD_LENGTH) {&lt;br /&gt;     result_2 = permute_words (char_choices, rating_limit);&lt;br /&gt;     if (class_probability (result_1) &lt; class_probability (result_2)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;For non-English scripts the if condition was never satisfied, and&lt;br /&gt;hence the DAWG files were not being scanned properly. Explicitly setting&lt;br /&gt;any_alpha=1 just above the condition solves this problem for&lt;br /&gt;the time being. There is probably a more elegant solution though.&lt;br /&gt;By the way, I do not see this particular if condition in the trunk&lt;br /&gt;anywhere in the file. 
Perhaps the developers have fixed it in the trunk&lt;br /&gt;already.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-2215315327006319417?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/2215315327006319417/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/how-dictionary-was-fixed.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2215315327006319417'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2215315327006319417'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/how-dictionary-was-fixed.html' title='How the dictionary was fixed?'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-4523699182316506467</id><published>2009-11-25T05:38:00.000-08:00</published><updated>2009-11-25T10:45:26.652-08:00</updated><title type='text'>Tesseract Dictionary (finally) works for Indic</title><content type='html'>This is going to be one long post.&lt;br /&gt;&lt;br /&gt;For the past few weeks I have been experimenting with the Indic script support in the Tesseract dictionary. 
I will first record the observations/results of my experiments and then elaborate on the logic involved.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;We start with a pristine copy of tesseract-2.04 downloaded from &lt;a href="http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz"&gt;here&lt;/a&gt;. Then we add some code to enable the maatraa clipping support (March 27 entry &lt;a href="http://hacking-tesseract.blogspot.com/2009/03/may-27-2008.html"&gt;here&lt;/a&gt;).&lt;br /&gt;Our aim is to see whether the dictionary works for Indic. Here is the methodology:&lt;br /&gt;&lt;br /&gt;1) Take an image with a single word.&lt;br /&gt;2) Create empty DAWG files.&lt;br /&gt;3) OCR and see the result.&lt;br /&gt;4) Now create  DAWG files with a single word. The word is the same as the one in the image.&lt;br /&gt;5) Now OCR again and see if the result improves.&lt;br /&gt;&lt;br /&gt;I chose this image:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_j0a3i0txX5Q/Sw1tvRDqBRI/AAAAAAAAF7M/DOfXo75Kpfw/s1600/wed.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 88px; height: 45px;" src="http://3.bp.blogspot.com/_j0a3i0txX5Q/Sw1tvRDqBRI/AAAAAAAAF7M/DOfXo75Kpfw/s400/wed.jpg" alt="" id="BLOGGER_PHOTO_ID_5408099386341852434" border="0" /&gt;&lt;/a&gt;In text form it reads: পূনরায় (punoraaye which means 'again' in Bengali)&lt;br /&gt;&lt;br /&gt;On OCRing this image with empty dawg files I received this result: পূনরুায়&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The result is wrong. The third character is রু instead of র . Also the vowel sign া is not joined to the previous consonant.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Now I generate the 2 DAWG files: freq-dawg and word-dawg with a word list containing just this word : পূনরায় . 
Here is the process:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;span style="font-style: italic;"&gt;debayan@deep-blur:/tmp/orig/tesseract-2.04$&lt;/span&gt; cat list&lt;br /&gt;পূনরায়&lt;br /&gt;&lt;span style="font-style: italic;"&gt;debayan@deep-blur:/tmp/orig/tesseract-2.04$&lt;/span&gt; wordlist2dawg list dawg&lt;br /&gt;Building DAWG from word list in file, 'list'&lt;br /&gt;Compacting the DAWG&lt;br /&gt;Compacting node from 9570029 to 1000034  (2)&lt;br /&gt;Writing squished DAWG file, 'dawg'&lt;br /&gt;18 nodes in DAWG&lt;br /&gt;18 edges in DAWG&lt;br /&gt;&lt;/blockquote&gt;Each symbol takes three bytes in UTF-8 (as every code point in the Bengali block does). There are 6 symbols in all: প ূ ন র া য় ; hence 6x3 = 18 nodes in the DAWG. Makes sense!&lt;br /&gt;&lt;br /&gt;Now I copy these DAWG files to the appropriate location (/usr/local/share/tessdata/) and OCR again, and get the same result. This shows that the DAWG files are currently ineffective.&lt;br /&gt;&lt;br /&gt;Now let's look at how to solve the problem. Of course, the first step is to find out what is going on in the DAWG creation/reading process. This involves inserting &lt;a href="http://code.google.com/p/tesseractindic/source/detail?r=82#"&gt;several cprintf statements&lt;/a&gt; throughout the code. This gives us an &lt;a href="http://sites.google.com/site/debayanin/temp.txt"&gt;insight&lt;/a&gt; (600 KB download) into how the DAWG file is being used. I intend to analyse the output and pinpoint the problem in the next post. In this post, let's concentrate on the results.&lt;br /&gt;&lt;br /&gt;After I made the changes, I followed the same 5 steps as above. 
Here is the output:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ vim space&lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ wordlist2dawg space dawg&lt;br /&gt;Building DAWG from word list in file, 'space'&lt;br /&gt;Compacting the DAWG&lt;br /&gt;Compacting node from 0 to 1000000  (2)&lt;br /&gt;Writing squished DAWG file, 'dawg'&lt;br /&gt;1 nodes in DAWG&lt;br /&gt;1 edges in DAWG&lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.&lt;br /&gt;ban.DangAmbigs      ban.freq-dawg       ban.inttemp         ban.pffmtable       ban.user-words      ban.word-dawg     &lt;br /&gt;ban.DangAmbigs~     ban.freq-dawg.old   ban.normproto       ban.unicharset      ban.user-words.old  ban.word-dawg.old &lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.freq-dawg&lt;br /&gt;[sudo] password for debayan:&lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.word-dawg&lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ tesseract wed.tif wed -l ban 2&gt;temp&lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat wed.txt&lt;br /&gt;পূনরুায়&lt;br /&gt;&lt;br /&gt;========================================================================&lt;br /&gt;&lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ echo 'পূনরায়'&gt;list&lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat list&lt;br /&gt;পূনরায়&lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ wordlist2dawg list dawg&lt;br /&gt;Building DAWG from word list in file, 'list'&lt;br /&gt;Compacting the DAWG&lt;br /&gt;Compacting node from 9570029 to 1000034  (2)&lt;br /&gt;Writing squished DAWG file, 'dawg'&lt;br /&gt;18 nodes in DAWG&lt;br /&gt;18 edges in DAWG&lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.freq-dawg&lt;br 
/&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ sudo cp dawg /usr/local/share/tessdata/ban.word-dawg&lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ tesseract wed.tif wed -l ban 2&gt;temp&lt;br /&gt;debayan@deep-blur:~/ocr/branches/tesseract-2.04$ cat wed.txt&lt;br /&gt;পূনরায়&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;If you follow the output above closely, you will find that adding the word to the DAWG affected the output constructively. If you have the patience follow the 600KB text file to see how it did that. Wait for my next post for a detailed analysis of the process.&lt;br /&gt;&lt;br /&gt;For now the conclusion is: The dictionary works for Indic. I need to send a patch to ray and team.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-4523699182316506467?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/4523699182316506467/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/tesseract-dictionary-finally-works-for.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/4523699182316506467'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/4523699182316506467'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/tesseract-dictionary-finally-works-for.html' title='Tesseract Dictionary (finally) works for Indic'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_j0a3i0txX5Q/Sw1tvRDqBRI/AAAAAAAAF7M/DOfXo75Kpfw/s72-c/wed.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-9201854591852170080</id><published>2009-11-19T12:51:00.000-08:00</published><updated>2009-11-19T12:58:28.195-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='utf8'/><title type='text'>utf-8 = ok?</title><content type='html'>I have been trying to add wide character support to the Tesseract code base by converting most char* to wchar_t* data types. However, I read in depth about UTF-8 encoding today &lt;a href="http://www.amk.ca/python/howto/unicode"&gt;here &lt;/a&gt;. It says UTF-8 handles Unicode well. Tesseract already supports UTF-8, or so it says.&lt;br /&gt;However, when I print out the DAWG file contents I see garbage for Indic scripts, but proper characters for English. Why is this happening?&lt;br /&gt;This makes me think that maybe I am on the wrong track. I did ask the Tesseract list whether I am on the right track or not, but got no useful replies.&lt;br /&gt;In fact, now that I think about it: while creating the dictionaries out of word lists, we are forgetting that we need to apply the vowel 'de-reordering' rules. Only then will the OCR be able to match words at run time.&lt;br /&gt;Refer to my earlier &lt;a href="http://hacking-tesseract.blogspot.com/2009/11/why-is-vowel-reordering-required.html"&gt;post&lt;/a&gt; where I have mentioned that we need to do vowel reordering post OCR. If you reverse that logic, we need to intentionally include the same anomalies in the dictionary so that the OCR can match against it. 
Hence the OCR may think েক by looking at the dictionary, and we can use the vowel reordering code to correct this.&lt;br /&gt;&lt;br /&gt;What a realisation!!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-9201854591852170080?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/9201854591852170080/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/utf-8-ok.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/9201854591852170080'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/9201854591852170080'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/utf-8-ok.html' title='utf-8 = ok?'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-1180339952950699461</id><published>2009-11-19T02:49:00.000-08:00</published><updated>2009-11-19T02:54:26.759-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='linux'/><category scheme='http://www.blogger.com/atom/ns#' term='mbstowcs'/><category scheme='http://www.blogger.com/atom/ns#' term='wchar'/><title type='text'>char* to wchar_t* conversion</title><content type='html'>Say you have a const char* string and you need to convert it to wchar_t type so that it can be stored in wide character format, here is the piece of code that 
takes the const char* and returns the wchar string for you.&lt;br /&gt;Note that it does not work without the setlocale call.&lt;br /&gt;&lt;br /&gt;You need to include the locale.h, stdlib.h and wchar.h header files for this to work. The returned buffer is static, so copy it before calling the function again.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;wchar_t* utf2wchar(const char *str) {&lt;br /&gt; setlocale(LC_ALL, "en_US.UTF-8");&lt;br /&gt; static wchar_t uni[100]; //static so the returned pointer stays valid; assumes no word is longer than 99 characters&lt;br /&gt; size_t ret = mbstowcs(uni,str,99);&lt;br /&gt; if(ret==(size_t)-1){cprintf("mbstowcs failed");ret=0;}&lt;br /&gt; uni[ret]=L'\0'; //mbstowcs does not terminate the string when it fills the buffer&lt;br /&gt; return uni;&lt;br /&gt;}&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-1180339952950699461?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/1180339952950699461/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/char-to-wchart-conversion.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/1180339952950699461'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/1180339952950699461'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/char-to-wchart-conversion.html' title='char* to wchar_t* conversion'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-733516119991816676</id><published>2009-11-16T13:51:00.000-08:00</published><updated>2009-11-16T14:03:45.789-08:00</updated><title type='text'>No unicode support in Tesseract-OCR?</title><content type='html'>If I were to point out one single issue on which this project's success depends, it would be the dictionary. The dictionary for this OCR system is not just a text file full of words, but a data structure called a directed acyclic word graph (DAWG).&lt;br /&gt;I decided to finally solve this blocker of a problem and delved into the &lt;a href="http://groups.google.com/group/tesseract-ocr/browse_thread/thread/5495c4e348a4b272/6a14c25cafb84a5f?lnk=gst&amp;amp;q=dawg+#6a14c25cafb84a5f"&gt;mailing lists&lt;/a&gt; once again. I did not find any new information there and hence decided to look at the source code itself.&lt;br /&gt;I soon noticed that while building the dictionary, the code treats each word as a stream of bytes and stores one byte per node. This means that the code &lt;a href="http://code.google.com/p/tesseract-ocr/source/browse/trunk/dict/trie.cpp"&gt;does not support wide characters&lt;/a&gt;. Wide character support requires the wchar_t type instead of char.&lt;br /&gt;This is a major problem. One could try to make the code wide character compatible, but it might require considerable labour. 
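&lt;br /&gt;&lt;br /&gt;To make the byte-per-node point concrete, here is a toy illustration. It assumes the word list is UTF-8 encoded; the nested dicts and the helpers insert and count_nodes are made up for this sketch and do not reproduce the real trie in dict/trie.cpp.

```python
# Example word: the consonant U+0995 followed by the vowel sign U+09C1.
WORD = "\u0995\u09c1"


def insert(trie, symbols):
    """Insert a sequence of symbols into a nested-dict trie, one symbol per node."""
    node = trie
    for symbol in symbols:
        node = node.setdefault(symbol, {})


def count_nodes(trie):
    """Count every node below the root."""
    return sum(1 + count_nodes(child) for child in trie.values())


byte_trie = {}
insert(byte_trie, WORD.encode("utf-8"))  # one node per UTF-8 byte

char_trie = {}
insert(char_trie, WORD)  # one node per code point

print(count_nodes(byte_trie), count_nodes(char_trie))
```

Each of the two Bengali code points takes three bytes in UTF-8, so the byte-wise trie needs six nodes for this word where a code-point trie needs two.&lt;br /&gt;&lt;br /&gt;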
Reading the contents back from the dictionary would also need wide character support.&lt;br /&gt;The alternative is to shift to a different OCR engine such as OCRopus, which the&lt;span style="font-style: italic;"&gt;&lt;span style="font-style: italic;"&gt; &lt;a href="http://crblpocr.blogspot.com/2008/08/how-to-install-ocropus-for-newbies.html"&gt;CRBLP&lt;/a&gt; &lt;/span&gt;&lt;/span&gt;folks seem to have done already.&lt;br /&gt;&lt;cite&gt;&lt;br /&gt;&lt;/cite&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-733516119991816676?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/733516119991816676/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/no-unicode-support-in-tesseract-ocr.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/733516119991816676'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/733516119991816676'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/no-unicode-support-in-tesseract-ocr.html' title='No unicode support in Tesseract-OCR?'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-2873860705277918079</id><published>2009-11-06T12:11:00.000-08:00</published><updated>2009-11-06T12:40:48.743-08:00</updated><title type='text'>Is a document suitable for 
OCR?</title><content type='html'>This is an important question in certain contexts.&lt;br /&gt;1) There may be an online web service that allows people to upload images to be OCRed. Some pranksters or bots may start uploading images with little or no text. The OCR engine then tries to make sense of such an image and wastes a large number of CPU cycles.&lt;br /&gt;&lt;br /&gt;2) The visually challenged may want to use the computer in this manner: whenever they have an image with text in front of them, the software automatically recognises the areas of text and OCRs them. After OCR, a TTS system then reads out the text for them.&lt;br /&gt;&lt;br /&gt;Now how do we achieve this?&lt;br /&gt;&lt;br /&gt;There is a good method. The algorithm is called &lt;a href="http://crblpocr.blogspot.com/2007/07/constrained-run-length-algorithm-crla.html"&gt;Run Length Smearing Algorithm (RLSA)&lt;/a&gt;. It smears each line of text into a solid black line, and then looks for parallel black lines as a sign that the image contains lines of text.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-2873860705277918079?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/2873860705277918079/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/is-document-suitable-for-ocr.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2873860705277918079'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2873860705277918079'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/is-document-suitable-for-ocr.html' title='Is a document suitable for 
OCR?'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-1526919410216956720</id><published>2009-11-06T11:48:00.000-08:00</published><updated>2009-11-06T12:09:34.889-08:00</updated><title type='text'>The Problem of Dotted Circles</title><content type='html'>Certain vowel signs in Indic scripts are displayed with a dotted circle when they stand alone. For example:  ৈ, ে , া . When these are used in conjunction with consonants, however, the dotted circles vanish. For example: কৈ, কে, কা .&lt;br /&gt;This is a problem when doing &lt;a href="http://hacking-tesseract.blogspot.com/2009/04/my-old-training-methodology.html"&gt;automated training&lt;/a&gt;. The python script draws  ে and trains the engine to recognise the shape, along with the dotted circle. However, when we OCR a document, the dotted circle is no longer there.&lt;br /&gt;Hence we somehow need a method of automatically eliminating the dotted circles from vowel signs while generating training images. 
Any ideas?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-1526919410216956720?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/1526919410216956720/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/problem-of-dotted-circles.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/1526919410216956720'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/1526919410216956720'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/problem-of-dotted-circles.html' title='The Problem of Dotted Circles'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-4924027735733848110</id><published>2009-11-06T11:39:00.000-08:00</published><updated>2009-11-06T11:47:59.207-08:00</updated><title type='text'>Why is Vowel Reordering required?</title><content type='html'>Indic scripts have the concept of vowel signs. 
The peculiarity of these vowel signs with respect to OCR is that sometimes consonant + vowel sign renders as a glyph in which the vowel sign is drawn first and the consonant after it.&lt;br /&gt;Here I present just one simple example.&lt;br /&gt;That is (in Bengali):    ক + ে = কে&lt;br /&gt;&lt;br /&gt;Now when we OCR কে , the OCR engine first encounters the vowel sign (ে without the dotted circle) and then the consonant ক. It then concatenates the two characters in the order seen, and ends up producing this as the output: েক .&lt;br /&gt;Since the OCR engine makes the same mistake every time, it's easy to write scripts that move every such vowel sign to its proper place after the consonant. This improves the OCR accuracy drastically.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-4924027735733848110?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/4924027735733848110/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/why-is-vowel-reordering-required.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/4924027735733848110'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/4924027735733848110'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/why-is-vowel-reordering-required.html' title='Why is Vowel Reordering required?'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-1829207040418747469</id><published>2009-11-06T11:38:00.001-08:00</published><updated>2009-11-06T12:14:58.446-08:00</updated><title type='text'>OCRFeeder</title><content type='html'>I have been working on creating a complete OCR solution suite for Gnome. It turns out that &lt;a href="http://live.gnome.org/OCRFeeder"&gt;OCRFeeder&lt;/a&gt; is already a pretty good solution.&lt;br /&gt;It's a good thing that this exists, because I can now shift focus to adding Indic related code to OCRFeeder itself.&lt;br /&gt;When I say Indic related code, I mean the modified Tesseract shironaam clipper, vowel reordering, automated training and the crowd-sourced data feedback learning mechanism.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-1829207040418747469?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/1829207040418747469/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/ocrfeeder.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/1829207040418747469'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/1829207040418747469'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/ocrfeeder.html' title='OCRFeeder'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-341583332465601337</id><published>2009-11-06T11:24:00.000-08:00</published><updated>2009-11-06T11:37:40.662-08:00</updated><title type='text'>Crowd Sourcing OCR development</title><content type='html'>One of the biggest challenges in OCR development is gathering training data and then feeding it to the OCR engine. The data is generally carefully chosen and some emphasis is laid on the quality of scans too. This often requires a team of people working in close proximity, and hence has traditionally been a blocker for the distributed development model.&lt;br /&gt;However, with proper planning in software development, frameworks can be set up which allow end users to contribute to OCR training data.&lt;br /&gt;The interface to the OCR system may be command line based, GUI based or web based. Say a user OCRs a particular document. After OCR, the interface gives the user an opportunity to correct any errors and send the corrections back to a centralised server where volunteers/contributors verify the data. Once the data has been verified, it is fed to the engine for incremental training.&lt;br /&gt;To check whether the added data is improving the performance of the OCR or not, we may run an automated nightly OCR on a set of test image/text pairs and post the accuracy percentage daily.&lt;br /&gt;The challenge is that most OCR training systems are not incrementally trainable out of the box. Tesseract-OCR is one example. 
However, one may write some code and implement it.&lt;br /&gt;Crowd Sourcing training data is critical to align OCR development to a FOSS based model and hence free it from the clutches of research teams at big institutes.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-341583332465601337?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/341583332465601337/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/crowd-sourcing-ocr-development.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/341583332465601337'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/341583332465601337'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/11/crowd-sourcing-ocr-development.html' title='Crowd Sourcing OCR development'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-8976590627258046041</id><published>2009-06-05T23:20:00.000-07:00</published><updated>2009-06-05T23:23:36.216-07:00</updated><title type='text'>How to train Tesseract-OCR</title><content type='html'>&lt;pre class="prettyprint"&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr id="sl_svn46_1"&gt;&lt;td class="source"&gt;Check out the source code of tesseractindic from http://code.google.com/p/tesseractindic. 
Then cd to tesseract_trainer and follow the directions below:&lt;br /&gt;&lt;br /&gt;Here is a demonstration of how you can create training data files for an arbitrary language for Tesseract-OCR and subsequently use them to perform OCR.&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_2"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_3"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_4"&gt;&lt;td class="source"&gt;To create data files for, say, Bengali:&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_5"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_6"&gt;&lt;td class="source"&gt;1) Create a directory in tesseract_trainer/ and name it arbitrarily. This contains the symbols of the alphabet. I name it 'beng.alphabet'. In the directory you may create a maximum of 4 files:&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_7"&gt;&lt;td class="source"&gt;  a) consonants: Put all the consonants in your script/language in the file, e.g. ক , খ (ka, kha) etc.&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_8"&gt;&lt;td class="source"&gt;  b) pre_semivowels: Put all the semivowels (if any in your script) that come before a consonant, e.g.  ি, ে, ৈ  (e kaar, a kaar, oi kaar)&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_9"&gt;&lt;td class="source"&gt;  c) post_semivowels: Put all the semivowels (if any in your script) that come after a consonant, e.g.  া, ী (aa kaar, ee kaar)&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_10"&gt;&lt;td class="source"&gt;  d) rest: Put everything else here, like digits, punctuation, conjuncts, special characters, vowels. 
You could also choose not to create the 3 files above and put all the symbols in this file.&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_11"&gt;&lt;td class="source"&gt;  &lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_12"&gt;&lt;td class="source"&gt;You need to have some fonts particular to your script installed on your system. On an Ubuntu system you will find them in /usr/share/fonts/truetype/ttf-bengali-fonts/. You will require the name(s) of the fonts later.&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_13"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_14"&gt;&lt;td class="source"&gt;Now change directory to tesseract_trainer/ and execute the following on the shell (for Bengali, for example):  python generate.py -font Mitra -l Bengali -s 15 -a beng.alphabet/&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_15"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_16"&gt;&lt;td class="source"&gt;-font takes the name of the ttf font you are trying to train&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_17"&gt;&lt;td class="source"&gt;-l takes the name of the script to be trained&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_18"&gt;&lt;td class="source"&gt;-s sets the size of the characters drawn in the images in Bengali.images/&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_19"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_20"&gt;&lt;td class="source"&gt;This command will generate many images and corresponding box files in Bengali.images/. 
In the end it generates 5 files in Bengali.training_data/.&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_21"&gt;&lt;td class="source"&gt;1)Bengali.unicharset&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_22"&gt;&lt;td class="source"&gt;2)Bengali.Microfeat&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_23"&gt;&lt;td class="source"&gt;3)Bengali.normproto&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_24"&gt;&lt;td class="source"&gt;4)Bengali.pffmtable&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_25"&gt;&lt;td class="source"&gt;5)Bengali.inttemp&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_26"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_27"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_28"&gt;&lt;td class="source"&gt;These 5 files are needed by the Tesseract-OCR engine to add support for a new script. In addition, 3 more files are required, which you must create separately. These are:&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_29"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_30"&gt;&lt;td class="source"&gt;    * Bengali.freq-dawg&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_31"&gt;&lt;td class="source"&gt;    * Bengali.word-dawg&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_32"&gt;&lt;td class="source"&gt;    * Bengali.user-words&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_33"&gt;&lt;td class="source"&gt;You do not require the tesseract_trainer tool to create the files above. 
They can be created by following the appropriate instructions at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract.&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_34"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_35"&gt;&lt;td class="source"&gt;Copy these files to /usr/local/share/tessdata and to the tessdata/ folder of your tesseract-ocr source code.&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_36"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_37"&gt;&lt;td class="source"&gt;Now let's OCR an image. Let's say the image is image.tif. Here is what you must execute at the terminal:&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_38"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_39"&gt;&lt;td class="source"&gt;tesseract image.tif ocr -l Bengali&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_40"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_41"&gt;&lt;td class="source"&gt;You get ocr.txt as the output.&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_42"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_43"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_44"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_45"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_46"&gt;&lt;td class="source"&gt;Contact debayanin@gmail.com for further clarifications.&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_47"&gt;&lt;td class="source"&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr id="sl_svn46_48"&gt;&lt;td class="source"&gt;And yeah, this is work in progress.&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/pre&gt;  &lt;pre class="prettyprint"&gt;&lt;table width="100%"&gt;&lt;tbody&gt;&lt;tr class="cursor_stop 
cursor_hidden"&gt;&lt;td&gt;&lt;br /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-8976590627258046041?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/8976590627258046041/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/06/how-to-train-tesseract-ocr.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/8976590627258046041'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/8976590627258046041'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/06/how-to-train-tesseract-ocr.html' title='How to train Tesseract-OCR'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-6817636810986592034</id><published>2009-05-11T08:36:00.000-07:00</published><updated>2009-05-11T08:52:23.511-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='indic meet'/><title type='text'>Issues for Indic Meet</title><content type='html'>1) Find out what others know&lt;br /&gt;2) Discuss the problems with OCR&lt;br /&gt;3) Discuss work done&lt;br /&gt;4) Discuss plans&lt;br /&gt;5) Discuss available tools&lt;br /&gt;6) Discuss tools to be developed&lt;br /&gt;7) Discuss application&lt;br /&gt;&lt;br /&gt;Objectives/Deliverables for 
Indic Meet&lt;br /&gt;&lt;br /&gt;I shall first demonstrate the working of the OCR on some sample images. Then I plan to explain the working of the OCR system at a high level. This shall be followed by a demonstration of the problems that exist in the present system and the potential solutions that I have in mind. I shall also demonstrate how to train this OCR for a particular language. This should be over in 75 minutes.&lt;br /&gt;Then we move on to the problems I am facing and have a discussion on possible solutions. Here are a few problems to tackle:&lt;br /&gt;&lt;br /&gt;1) Learning about the various efforts made in the past: BOCRA, Aksharbodh etc.&lt;br /&gt;2) Dealing with the post-OCR spell-checker problem&lt;br /&gt;3) A better segmentation algorithm. The OCRopus curved cut segmenter: merits/demerits&lt;br /&gt;4) Reducing the number of character classes to be trained, as explained at http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html&lt;br /&gt;5) Talk to Santhosh Thottingal about integrating the service into Silpa&lt;br /&gt;6) How to build a web interface that can train the OCR engine from user input.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-6817636810986592034?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/6817636810986592034/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/05/issues-for-indic-meet.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/6817636810986592034'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/6817636810986592034'/><link rel='alternate' type='text/html' 
href='http://hacking-tesseract.blogspot.com/2009/05/issues-for-indic-meet.html' title='Issues for Indic Meet'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-2146496319284836363</id><published>2009-05-08T10:50:00.000-07:00</published><updated>2009-05-09T05:45:46.639-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ocr'/><title type='text'>Bengali Stats</title><content type='html'>Total character classes required to be trained:&lt;br /&gt;&lt;br /&gt;ক&lt;br /&gt;খ&lt;br /&gt;গ&lt;br /&gt;ঘ&lt;br /&gt;ঙ&lt;br /&gt;চ&lt;br /&gt;ছ&lt;br /&gt;জ&lt;br /&gt;ঝ&lt;br /&gt;ঞ&lt;br /&gt;ট&lt;br /&gt;ঠ&lt;br /&gt;ড&lt;br /&gt;ঢ&lt;br /&gt;ণ&lt;br /&gt;ত&lt;br /&gt;থ&lt;br /&gt;দ&lt;br /&gt;ধ&lt;br /&gt;ন&lt;br /&gt;প&lt;br /&gt;ফ&lt;br /&gt;ব&lt;br /&gt;ভ&lt;br /&gt;ম&lt;br /&gt;য&lt;br /&gt;র&lt;br /&gt;ল&lt;br /&gt;শ&lt;br /&gt;ষ&lt;br /&gt;স&lt;br /&gt;হ&lt;br /&gt;য&lt;br /&gt;য়&lt;br /&gt;ৰ&lt;br /&gt;ৱ&lt;br /&gt;&lt;br /&gt;অ&lt;br /&gt;আ&lt;br /&gt;ই&lt;br /&gt;ঈ&lt;br /&gt;উ&lt;br /&gt;ঊ&lt;br /&gt;ঋ&lt;br /&gt;এ&lt;br /&gt;ঐ&lt;br /&gt;ও&lt;br /&gt;ঔ&lt;br /&gt;&lt;br /&gt;০&lt;br /&gt;১&lt;br /&gt;২&lt;br /&gt;৩&lt;br /&gt;৪&lt;br /&gt;৫&lt;br /&gt;৬&lt;br /&gt;৭&lt;br /&gt;৮&lt;br /&gt;৯&lt;br /&gt;&lt;br /&gt;া&lt;br /&gt;ে&lt;br /&gt;ৈ&lt;br /&gt;ৌ (cant get to render the last symbol independently :()&lt;br /&gt;ং&lt;br /&gt;&lt;pre&gt; ঃ&lt;br /&gt;&lt;br /&gt;৷&lt;br /&gt;!&lt;br /&gt;"&lt;br /&gt;#&lt;br /&gt;$&lt;br /&gt;%&lt;br /&gt;&amp;amp;&lt;br /&gt;'&lt;br /&gt;(&lt;br /&gt;)&lt;br /&gt;*&lt;br /&gt;+&lt;br /&gt;,&lt;br /&gt;-&lt;br /&gt;.&lt;br /&gt;/&lt;br /&gt;0&lt;br /&gt;1&lt;br /&gt;2&lt;br /&gt;3&lt;br 
/&gt;4&lt;br /&gt;5&lt;br /&gt;6&lt;br /&gt;7&lt;br /&gt;8&lt;br /&gt;9&lt;br /&gt;:&lt;br /&gt;;&lt;br /&gt;&lt;&gt;&lt;br /&gt;?&lt;br /&gt;@&lt;br /&gt;=&lt;br /&gt;[&lt;br /&gt;\&lt;br /&gt;]&lt;br /&gt;^&lt;br /&gt;_&lt;br /&gt;`&lt;br /&gt;&lt;br /&gt;{&lt;br /&gt;|&lt;br /&gt;}&lt;br /&gt;~&lt;br /&gt;ৢ&lt;br /&gt;ৣ&lt;br /&gt;‘&lt;br /&gt;’&lt;br /&gt;“&lt;br /&gt;&lt;br /&gt;Here are semivowels that need to be trained combined with consonants/conjuncts:&lt;br /&gt;&lt;br /&gt;ি&lt;br /&gt;ী&lt;br /&gt;ু&lt;br /&gt;ূ&lt;br /&gt;ৃ&lt;br /&gt;ৄ&lt;br /&gt;্&lt;br /&gt;়&lt;br /&gt;&lt;br /&gt;Here are the conjuncts:&lt;br /&gt;&lt;br /&gt;ক্ক&lt;br /&gt;ক্ট&lt;br /&gt;ক্ত&lt;br /&gt;ক্ন&lt;br /&gt;ক্ম&lt;br /&gt;ক্র&lt;br /&gt;ক্ল&lt;br /&gt;ক্ব&lt;br /&gt;ক্ষ&lt;br /&gt;ক্স&lt;br /&gt;ক্ষ্ণ&lt;br /&gt;ক্ষ্ম&lt;br /&gt;ক্ট্র&lt;br /&gt;খ্র&lt;br /&gt;গ্গ&lt;br /&gt;গ্ধ&lt;br /&gt;গ্ন&lt;br /&gt;গ্ম&lt;br /&gt;গ্ল&lt;br /&gt;গ্ব&lt;br /&gt;গ্র&lt;br /&gt;ঘ্ন&lt;br /&gt;ঘ্র&lt;br /&gt;ঙ্ক&lt;br /&gt;ঙ্খ&lt;br /&gt;ঙ্গ&lt;br /&gt;ঙ্ঘ&lt;br /&gt;ঙ্ম&lt;br /&gt;ঙ্ক্ষ&lt;br /&gt;চ্চ&lt;br /&gt;চ্ছ&lt;br /&gt;চ্ঞ&lt;br /&gt;চ্ছ্র&lt;br /&gt;চ্ছ্ব&lt;br /&gt;ছ্ব&lt;br /&gt;ছ্র&lt;br /&gt;জ্জ&lt;br /&gt;জ্ঝ&lt;br /&gt;জ্ঞ&lt;br /&gt;জ্র&lt;br /&gt;জ্ব&lt;br /&gt;জ্জ্ব&lt;br /&gt;ঞ্চ&lt;br /&gt;ঞ্ছ&lt;br /&gt;ঞ্জ&lt;br /&gt;ঞ্ঝ&lt;br /&gt;ট্ট&lt;br /&gt;ট্র&lt;br /&gt;ঠ্র&lt;br /&gt;ড্ড&lt;br /&gt;ড্র&lt;br /&gt;ড়্গ&lt;br /&gt;ণ্ট&lt;br /&gt;ণ্ঠ&lt;br /&gt;ণ্ড&lt;br /&gt;ণ্ঢ&lt;br /&gt;ণ্ণ&lt;br /&gt;ণ্ম&lt;br /&gt;ণ্ব&lt;br /&gt;ণ্র&lt;br /&gt;ণ্ড্র&lt;br /&gt;ত্ত&lt;br /&gt;ত্থ&lt;br /&gt;ত্ন&lt;br /&gt;ত্ম&lt;br /&gt;ত্র&lt;br /&gt;ত্ব&lt;br /&gt;ত্ত্ব&lt;br /&gt;থ্র&lt;br /&gt;থ্ব&lt;br /&gt;দ্গ&lt;br /&gt;দ্ঘ&lt;br /&gt;দ্দ&lt;br /&gt;দ্ধ&lt;br /&gt;দ্ভ&lt;br /&gt;দ্ম&lt;br /&gt;দ্র&lt;br /&gt;দ্ব&lt;br /&gt;দ্দ্ব&lt;br /&gt;দ্ধ্ব&lt;br /&gt;ধ্ন&lt;br /&gt;ধ্র&lt;br /&gt;ধ্ব&lt;br /&gt;ন্ত&lt;br /&gt;ন্থ&lt;br /&gt;ন্দ&lt;br /&gt;ন্ধ&lt;br /&gt;ন্ন&lt;br 
/&gt;ন্য&lt;br /&gt;ন্ব&lt;br /&gt;ন্ম&lt;br /&gt;ন্স&lt;br /&gt;ন্ত্ব&lt;br /&gt;ন্ত্র&lt;br /&gt;ন্দ্ব&lt;br /&gt;ন্দ্র&lt;br /&gt;ন্ধ্র&lt;br /&gt;প্ট&lt;br /&gt;প্প&lt;br /&gt;প্ন&lt;br /&gt;প্ত&lt;br /&gt;প্ল&lt;br /&gt;প্স&lt;br /&gt;প্র&lt;br /&gt;ফ্র&lt;br /&gt;ফ্ল&lt;br /&gt;ব্জ&lt;br /&gt;ব্দ&lt;br /&gt;ব্ধ&lt;br /&gt;ব্ব&lt;br /&gt;ব্ল&lt;br /&gt;ব্র&lt;br /&gt;ব্দ্র&lt;br /&gt;ভ্র&lt;br /&gt;ম্ন&lt;br /&gt;ম্প&lt;br /&gt;ম্ফ&lt;br /&gt;ম্ব&lt;br /&gt;ম্ভ&lt;br /&gt;ম্ম&lt;br /&gt;ম্র&lt;br /&gt;ম্ল&lt;br /&gt;ম্ভ্র&lt;br /&gt;ম্প্র&lt;br /&gt;ল্ক&lt;br /&gt;ল্গ&lt;br /&gt;ল্ট&lt;br /&gt;ল্ড&lt;br /&gt;ল্প&lt;br /&gt;ল্ফ&lt;br /&gt;ল্ব&lt;br /&gt;ল্ম&lt;br /&gt;ল্ল&lt;br /&gt;শ্চ&lt;br /&gt;শ্ছ&lt;br /&gt;শ্ন&lt;br /&gt;শ্ম&lt;br /&gt;শ্ব&lt;br /&gt;শ্র&lt;br /&gt;শ্ল&lt;br /&gt;শ্য&lt;br /&gt;ষ্ক&lt;br /&gt;ষ্ট&lt;br /&gt;ষ্ঠ&lt;br /&gt;ষ্ণ&lt;br /&gt;ষ্প&lt;br /&gt;ষ্ফ&lt;br /&gt;ষ্ম&lt;br /&gt;ষ্ক্র&lt;br /&gt;ষ্ট্র&lt;br /&gt;ষ্য&lt;br /&gt;স্ক&lt;br /&gt;স্খ&lt;br /&gt;স্ট&lt;br /&gt;স্ত&lt;br /&gt;স্থ&lt;br /&gt;স্ন&lt;br /&gt;স্প&lt;br /&gt;স্ফ&lt;br /&gt;স্ম&lt;br /&gt;স্র&lt;br /&gt;স্ল&lt;br /&gt;স্ব&lt;br /&gt;স্ত্র&lt;br /&gt;স্ক্র&lt;br /&gt;স্ট্র&lt;br /&gt;স্য&lt;br /&gt;হ্ণ&lt;br /&gt;হ্ন&lt;br /&gt;হ্ম&lt;br /&gt;হ্র&lt;br /&gt;হ্ল&lt;br /&gt;হ্ব&lt;br /&gt;হ্য&lt;br /&gt;গু&lt;br /&gt;ন্তু&lt;br /&gt;নু&lt;br /&gt;সু&lt;br /&gt;রু&lt;br /&gt;রূ&lt;br /&gt;দু&lt;br /&gt;শু&lt;br /&gt;হৃ&lt;br /&gt;হু&lt;br /&gt;গ্রু&lt;br /&gt;গ্রূ&lt;br /&gt;ব্রু&lt;br /&gt;ভ্রু&lt;br /&gt;ভ্রূ&lt;br /&gt;শ্রু&lt;br /&gt;শ্রূ&lt;br /&gt;স্তু&lt;br /&gt;ন্দু&lt;br /&gt;ত্রু&lt;br /&gt;থ্রু&lt;br /&gt;থ্রূ&lt;br /&gt;দ্রু&lt;br /&gt;দ্রূ&lt;br /&gt;ধ্রু&lt;br /&gt;ধ্রূ&lt;br /&gt;ল্গু&lt;br /&gt;ন্ড&lt;br /&gt;ন্ট&lt;br /&gt;ন্ঠ&lt;br /&gt;চ্ন&lt;br /&gt;ট্ম&lt;br /&gt;ট্ব&lt;br /&gt;ড্ম&lt;br /&gt;ভ্ল&lt;br /&gt;ম্ত&lt;br /&gt;ম্থ&lt;br /&gt;ম্দ&lt;br /&gt;ল্ত&lt;br /&gt;ল্ধ&lt;br /&gt;শ্ত&lt;br /&gt;&lt;br /&gt;Total number of character classes to be 
trained:&lt;br /&gt;&lt;br /&gt;36 (number of consonants) + 11 (number of vowels) + 10 (digits) +&lt;br /&gt;6 (vowel-signs that can be rendered separately) + 49 (punctuation marks and symbols) + 215 (conjuncts)&lt;br /&gt;+ (215+36)x6 (for semi-vowels that can not be trained individually) = &lt;span style="color: rgb(255, 0, 0);"&gt;1833&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Hence the character classifier for an Indic OCR needs to comb through 1833 character classes&lt;br /&gt;to find a character. For an English OCR, on the other hand, this number is roughly a hundred&lt;br /&gt;(upper and lower case letters, digits and punctuation). Hence the difficulties in Indic OCR.&lt;br /&gt;&lt;br /&gt;How to reduce the number of character classes to be trained?&lt;br /&gt;&lt;br /&gt;In my conversation with &lt;a href="http://www.isical.ac.in/%7Ebbc/"&gt;Prof. B.B. Chaudhuri&lt;/a&gt; I learnt techniques to reduce the number of character classes.&lt;br /&gt;First we need to separate a word image into three parts: top, middle and bottom. The top part will have&lt;br /&gt;the rising part of vowel signs like ি ী , the middle part will have consonants, conjuncts, vowels, digits etc.&lt;br /&gt;The bottom part will have the descending part of vowel-signs like ু.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j0a3i0txX5Q/SgUn5Xtb7sI/AAAAAAAAEQo/6d8bgHZQAvQ/s1600-h/Screenshot-2.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 304px; height: 91px;" src="http://4.bp.blogspot.com/_j0a3i0txX5Q/SgUn5Xtb7sI/AAAAAAAAEQo/6d8bgHZQAvQ/s400/Screenshot-2.png" alt="" id="BLOGGER_PHOTO_ID_5333713200260837058" border="0" /&gt;&lt;/a&gt;Ocropus already seems capable of achieving this. See &lt;a href="http://sites.google.com/site/ocropus/languages/devanagari-hindi-sanskrit"&gt;this&lt;/a&gt;. 
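
The class count above can be checked with a short Python sketch. The dictionary keys are illustrative names only, not identifiers from Tesseract; the numbers are the ones quoted in this post:

```python
# Counts taken from the post; the key names are illustrative only.
counts = {
    "consonants": 36,
    "vowels": 11,
    "digits": 10,
    "standalone_vowel_signs": 6,
    "punctuation_and_symbols": 49,
    "conjuncts": 215,
}
dependent_signs = 6  # semi-vowels that cannot be trained on their own

# Classes that exist on their own, without any attached semi-vowel.
base_classes = sum(counts.values())

# Every consonant and conjunct combined with each dependent sign.
combined_classes = (counts["conjuncts"] + counts["consonants"]) * dependent_signs

total_classes = base_classes + combined_classes
print(base_classes, combined_classes, total_classes)  # 327 1506 1833
```

Most of the total comes from the combined classes, which is why separating the vowel signs from the consonants pays off.
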
The image below shows the rising parts&lt;br /&gt;of a few vowel signs segmented separately:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j0a3i0txX5Q/SgUo7tn6JFI/AAAAAAAAEQ4/8C6uF2DdGzI/s1600-h/line-cut.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 653px; height: 49px;" src="http://4.bp.blogspot.com/_j0a3i0txX5Q/SgUo7tn6JFI/AAAAAAAAEQ4/8C6uF2DdGzI/s400/line-cut.png" alt="" id="BLOGGER_PHOTO_ID_5333714340014597202" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;If we can successfully adopt this segmentation approach, we can reduce the number of trainable character&lt;br /&gt;classes to around 350.&lt;br /&gt;Now once we have segmented the image, how does the present Tesseract-OCR engine classify the new&lt;br /&gt;character classes? For example, how do we train the engine so that it understands that the rising part of&lt;br /&gt;ি is part of another vowel-sign? In any case, Tesseract only understands characters with Unicode&lt;br /&gt;values during training. Hence I don't think Tesseract-OCR will understand this segmentation.&lt;br /&gt;So what do we do? There are 2 possibilities:&lt;br /&gt;&lt;br /&gt;1) We use a different OCR engine. 
Will have to dig deeper into ocropus.&lt;br /&gt;2) We use the Tesseract-OCR classifier and the 1800 odd character classes augmented with a strong&lt;br /&gt;spell checker based correction mechanism.&lt;br /&gt;&lt;br /&gt;The 2nd method I am working on right now.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-2146496319284836363?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/2146496319284836363/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2146496319284836363'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2146496319284836363'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html' title='Bengali Stats'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_j0a3i0txX5Q/SgUn5Xtb7sI/AAAAAAAAEQo/6d8bgHZQAvQ/s72-c/Screenshot-2.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-1321730405448845584</id><published>2009-04-19T09:55:00.000-07:00</published><updated>2009-04-19T09:57:58.140-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='dawg'/><title 
type='text'>What is a DAWG file?</title><content type='html'>&lt;h1&gt;&lt;span style="font-size:100%;"&gt;&lt;a class="anchor" name="allaboutdawg"&gt;DAWG = Directed Acyclic WordGraph&lt;/a&gt;&lt;/span&gt;&lt;/h1&gt;First, we'll define DAWG (skip if you know already) and cover the specifics of tesseract below.&lt;p&gt; =========== this definition by Mark &amp;amp; Ceal Wutka, link below ============&lt;/p&gt;&lt;p&gt; A Directed Acyclic Word Graph, or DAWG, is a data structure that permits extremely fast word searches. The entry point into the graph represents the starting letter in the search. Each node represents a letter, and you can travel from the node to two other nodes, depending on whether the letter matches the one you are searching for.&lt;/p&gt;&lt;p&gt; It's a &lt;em&gt;Directed&lt;/em&gt; graph because you can only move in a specific direction between two nodes. In other words, you can move from A to B, but you can't move from B to A. It's &lt;em&gt;Acyclic&lt;/em&gt; because there are no cycles. You cannot have a path from A to B to C and then back to A. The link back to A would create a cycle, and probably an endless loop in your search program.&lt;/p&gt;&lt;p&gt; The description is a little confusing without an example, so imagine we have a DAWG containing the words CAT, CAN, DO, and DOG. The graph would look like this:&lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;div class="fragment"&gt;&lt;pre class="fragment"&gt;     C --Child--&gt; A --Child--&gt; N (EOW)&lt;br /&gt;  |                         |&lt;br /&gt;  |                       Next&lt;br /&gt;Next                        |&lt;br /&gt;  |                         v&lt;br /&gt;  |                         T (EOW)&lt;br /&gt;  v&lt;br /&gt;  D--Child--&gt; O (EOW) --Child --&gt; G (EOW)&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt; Now, imagine that we want to see if CAT is in the DAWG. We start at the entry point (the C) in this case. 
Since C is also the letter we are looking for, we go to the child node of C. Now we are looking for the next letter in CAT, which is A. Again, the node we are on (A) has the letter we are looking for, so we again pick the child node which is now N. Since we are looking for T and the current node is not T, we take the Next node instead of the child. The Next node of N is T. T has the letter we want. Now, since we have processed all the letters in the word we are searching for, we need to make sure that the current node has an End-of-word flag (EOW) which it does, so CAT is stored in the graph.&lt;/p&gt;&lt;p&gt; One of the tricks with making a DAWG is trimming it down so that words with common endings all end at the same node. For example, suppose we want to store DOG and LOG in a DAWG. The ideal would be something like this:&lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;div class="fragment"&gt;&lt;pre class="fragment"&gt;   D --Child--&gt; O --Child--&gt; G(EOW)&lt;br /&gt;|            ^&lt;br /&gt;Next          |&lt;br /&gt;|            |&lt;br /&gt;v            |&lt;br /&gt;L --Child----&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt; In other words, the OG in DOG and LOG is defined by the &lt;em&gt;same&lt;/em&gt; pair of nodes.&lt;/p&gt;&lt;p&gt; =========== Creating a DAWG ============&lt;/p&gt;&lt;p&gt; [...] The idea is to first create a tree, where a leaf would represent the end of a word and there can be multiple leaves that are identical. For example, DOG and LOG would be stored like this:&lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;div class="fragment"&gt;&lt;pre class="fragment"&gt;  D --Child--&gt; O --Child--&gt; G (EOW)&lt;br /&gt;|&lt;br /&gt;Next&lt;br /&gt;|&lt;br /&gt;v&lt;br /&gt;L --Child-&gt; O --Child--&gt; G (EOW)&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt; Now, suppose you want to add DOGMA to the tree. You'd proceed as if you were doing a search. 
Once you get to G, you find it has no children, so you add a child M, and then add a child A to the M, making the graph look like:&lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;div class="fragment"&gt;&lt;pre class="fragment"&gt;  D --Child--&gt; O --Child--&gt; G (EOW) --Child--&gt; M --Child--&gt; A (EOW)&lt;br /&gt;|&lt;br /&gt;Next&lt;br /&gt;|&lt;br /&gt;v&lt;br /&gt;L --Child-&gt; O --Child--&gt; G (EOW)&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt; As you can see, by adding nodes to the tree this way, you share common beginnings, but the endings are still separated. To shrink the size of the DAWG, you need to find common endings and combine them. To do this, you start at the leaf nodes (the nodes that have no children). If two leaf nodes are identical, you combine them, moving all the references from one node to the other. For two nodes to be identical, they not only must have the same letter, but if they have Next nodes, the Next nodes must also be identical (if they have child nodes, the child nodes must also be identical).&lt;/p&gt;&lt;p&gt; Take the following tree of CITIES, CITY, PITIES and PITY:&lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;div class="fragment"&gt;&lt;pre class="fragment"&gt; C --Child--&gt; I --Child--&gt; T --Child--&gt; I --Child--&gt; E --Child--&gt; S (EOW)&lt;br /&gt;|                                      |&lt;br /&gt;|                                     Next&lt;br /&gt;Next                                    |&lt;br /&gt;|                                      v&lt;br /&gt;|                                      Y (EOW)&lt;br /&gt;P --Child--&gt; I --Child--&gt; T --Child--&gt; I --Child--&gt; E --Child--&gt; S (EOW)&lt;br /&gt;                                     |&lt;br /&gt;                                    Next&lt;br /&gt;                                     |&lt;br /&gt;                                     v&lt;br /&gt;                                     Y (EOW)&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt; Continue reading this explanation at: 
&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.wutka.com/dawg.html"&gt;http://www.wutka.com/dawg.html&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt; See also: &lt;ul&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Directed_acyclic_word_graph"&gt;http://en.wikipedia.org/wiki/Directed_acyclic_word_graph&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Trie"&gt;http://en.wikipedia.org/wiki/Trie&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt; &lt;h2&gt;&lt;a class="anchor" name="tessdawguses"&gt; What does Tesseract use DAWG for?&lt;/a&gt;&lt;/h2&gt; Tesseract uses the Directed Acyclic Word Graphs to very compactly store and efficiently search several list(s) of words. There are four DAWGs in tesseract: (right?)&lt;p&gt; &lt;/p&gt;&lt;ul&gt;&lt;li&gt;1. word_dawg (pre-set/fixed list read in from "tessdata/word-dawg")&lt;br /&gt;(this one is read in raw/directly for speed, user can't change this right now) &lt;/li&gt;&lt;li&gt;2. document_words (document-words that have already been recognized)&lt;br /&gt;(built during execution; FIX: is/isn't cleared per-document/baseapi call) &lt;/li&gt;&lt;li&gt;3. pending_words (words tess is working on, at the moment, before they are added to document_word) &lt;/li&gt;&lt;li&gt;4. user_words (user-adjustable list read in from "tessdata/user-words")&lt;br /&gt;(add here custom words that tesseract tends to corrupt)&lt;/li&gt;&lt;/ul&gt; Disclosure: I don't know the order of preference - which DAWG does tesseract check first AND which DAWG over-rides the others. ex. "thls" is not in #1 but, say, is in #4 - will tesseract NOT jiggle the 'l' into an 'i' (which then matches in #1) or will it go with #4? Ray?&lt;p&gt; Let's say that tesseract thinks it found a word with four letters, "thls". Before this word is output, tesseract will: &lt;/p&gt;&lt;ul&gt;&lt;li&gt;look-up "thls" in DAWG #1 (see above) &lt;/li&gt;&lt;li&gt;(when does it check user-words?) 
&lt;/li&gt;&lt;li&gt;By looking through the sorted list for each of the classes, tesseract will note that the third character had a second-best choice to be an 'i' so it changes that letter and &lt;/li&gt;&lt;li&gt;look-up "this" in DAWG #1 and this time it DOES match. &lt;/li&gt;&lt;li&gt;(fmg has seen tess KEEP ON permuting even after a match in both #1 and #4 so is not sure what the ending conditions are - maybe someone who knows better can explain) which can only mean that: &lt;/li&gt;&lt;li&gt;until the certainty of the word isn't moved beyond some threshold, permuting of other letters continues...&lt;/li&gt;&lt;/ul&gt; So, the answer to "Why does tesseract bother with DAWGs" is that when a typical English word has one or two letters that have permutations possible, WITHOUT using the compact and fast DAWG's this lookup task would quickly become a huge bottle-neck.&lt;p&gt; =========== DAWG-related ToDo's ============&lt;/p&gt;&lt;p&gt; &lt;/p&gt;&lt;dl compact="compact"&gt;&lt;dt&gt;&lt;b&gt;&lt;a class="el" href="http://tesseract-ocr.repairfaq.org/todo.html#_todo000014"&gt;Todo:&lt;/a&gt;&lt;/b&gt;&lt;/dt&gt;&lt;dd&gt;Need to add info here on: &lt;ul&gt;&lt;li&gt;how to view/list words ALREADY IN "tessdata/word-dawg" &lt;/li&gt;&lt;li&gt;how to CREATE A NEW "tessdata/word-dawg" &lt;/li&gt;&lt;li&gt;which constants need to be tweaked when adding words to "tessdata/word-dawg" &lt;/li&gt;&lt;li&gt;which constants need to be tweaked when adding words to "tessdata/user-words" (because a poster on the forums said that after about 5000 words are added guano happens) &lt;/li&gt;&lt;li&gt;why/what for is rand() used in &lt;a class="el" href="http://tesseract-ocr.repairfaq.org/trie_8cpp.html#f75172b9118305084654fcae28cd57bb"&gt;add_word_to_dawg()&lt;/a&gt; &lt;/li&gt;&lt;li&gt;what to do when the dreaded "DAWG Table is too full" error occurs AFTER Ray Smith's patch is already applied...&lt;/li&gt;&lt;/ul&gt; &lt;/dd&gt;&lt;/dl&gt; &lt;hr size="1"&gt;&lt;address 
style=""&gt;&lt;small&gt;Generated on Wed Feb 28 19:49:30 2007 for Tesseract by  &lt;a href="http://www.doxygen.org/index.html"&gt; &lt;img src="http://tesseract-ocr.repairfaq.org/doxygen.png" alt="doxygen" align="middle" border="0" /&gt;&lt;/a&gt; 1.5.1&lt;br /&gt;&lt;br /&gt;COPIED VERBATIM FROM http://tesseract-ocr.repairfaq.org/&lt;br /&gt;&lt;/small&gt;&lt;/address&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-1321730405448845584?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/1321730405448845584/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/04/what-is-dawg-file.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/1321730405448845584'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/1321730405448845584'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/04/what-is-dawg-file.html' title='What is a DAWG file?'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-7274779217278548270</id><published>2009-04-17T16:20:00.000-07:00</published><updated>2009-04-17T16:47:16.267-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='picasa'/><category scheme='http://www.blogger.com/atom/ns#' term='imagechops'/><category scheme='http://www.blogger.com/atom/ns#' 
term='ocr'/><title type='text'>Clipping accuracy</title><content type='html'>Some time last year I tried to push my matra clipping code to Tesseract-OCR upstream, but Ray Smith, the lead developer of the project, asked about the accuracy of the code and I never got around to calculating it. Well, actually I still haven't calculated it, but I did something new.&lt;br /&gt;Check the set of pictures I uploaded at &lt;http: com="" debayanin="" 5325782929614608690=""&gt; &lt;http: com="" debayanin="" 5325782929614608690=""&gt;. The first picture is the normal picture to be OCRed. The second picture is the clipped+thresholded image. The third image is the difference of the clipped+thresholded and thresholded images.&lt;br /&gt;&lt;br /&gt;Here is the Python code that creates a new image out of two input images:&lt;br /&gt;&lt;br /&gt;#!/usr/local/bin/python&lt;br /&gt;&lt;br /&gt;# Needs PIL (or Pillow) installed&lt;br /&gt;from PIL import Image, ImageChops&lt;br /&gt;&lt;br /&gt;# th: thresholded image, clip: clipped+thresholded image&lt;br /&gt;th=Image.open("benth.tif")&lt;br /&gt;clip=Image.open("bentest.tif")&lt;br /&gt;&lt;br /&gt;# Pixel-wise difference of the two images, then inverted&lt;br /&gt;new=ImageChops.difference(th,clip)&lt;br /&gt;new=ImageChops.invert(new)&lt;br /&gt;&lt;br /&gt;new.save("diff.tif","TIFF")&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I will now show this to Ray Smith. 
Let's see if he likes it.&lt;/http:&gt;&lt;/http:&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-7274779217278548270?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/7274779217278548270/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/04/clipping-accuracy.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/7274779217278548270'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/7274779217278548270'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/04/clipping-accuracy.html' title='Clipping accuracy'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-2807603093173696043</id><published>2009-04-17T12:15:00.001-07:00</published><updated>2009-04-17T12:15:52.543-07:00</updated><title type='text'>My old training methodology</title><content type='html'>&lt;p&gt;The principle on which this works is this: Tesseract needs two things to train itself: 1) an image of the character and 2) the name of the character. This information is provided with the help of "box files". A box file contains the co-ordinates of the bounding boxes around characters, together with labels saying what those characters are. 
The traditional method of training the engine is to take a scanned image, meticulously create a box file using some tool such as tesseractrainer.py, edit the box file, and keep doing the same for several other images and fonts. This process was tedious enough to force me to seek new methods.&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Now let's do a little reverse engineering. What if we could take a list of characters in a text file, "generate" an image out of those characters, store the co-ordinates of the bounding boxes of those generated images in a file and then feed these to the OCR engine? It would work, right?&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Links:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://tesseractindic.googlecode.com/files/tesseract_trainer.beta.tar.gz" target="_blank"&gt;http://tesseractindic.&lt;wbr&gt;googlecode.com/files/&lt;wbr&gt;tesseract_trainer.beta.tar.gz&lt;/a&gt;  - The tarball itself&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/tesseractindic/source/browse/trunk/tesseract_trainer/readme" target="_blank"&gt;http://code.google.com/p/&lt;wbr&gt;tesseractindic/source/browse/&lt;wbr&gt;trunk/tesseract_trainer/readme&lt;/a&gt;&lt;wbr&gt;  - The readme file&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.youtube.com/watch?v=vuuVwm5ZjkI" target="_blank"&gt;http://www.youtube.com/watch?&lt;wbr&gt;v=vuuVwm5ZjkI&lt;/a&gt; - YouTube video of the tool working for Bengali&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;But there are problems. Tesseract-OCR has its quirks.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Tesseract wants one bounding box to enclose a single "blob" only. A blob is a wholly connected component. So ক is a blob, and ক খ are two blobs. 
There are cases where a consonant+vowel sign generates two blobs, for example the 3 images below have multiple blobs:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="http://hacking-tesseract.blogspot.com/imageani36.tif/imageani36-full;init:.tif" imageanchor="1"&gt;&lt;img src="http://hacking-tesseract.blogspot.com/imageani36.tif/imageani36-medium;init:.jpg" style="border: 0pt none ;" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="http://hacking-tesseract.blogspot.com/imageani37.tif/imageani37-full;init:.tif" imageanchor="1"&gt;&lt;img src="http://hacking-tesseract.blogspot.com/imageani37.tif/imageani37-medium;init:.jpg" style="border: 0pt none ;" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="http://hacking-tesseract.blogspot.com/imageani43.tif/imageani43-full;init:.tif" imageanchor="1"&gt;&lt;img src="http://hacking-tesseract.blogspot.com/imageani43.tif/imageani43-medium;init:.jpg" style="border: 0pt none ;" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;And hence Tesseract throws a "FATALITY" error during training.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;So I had to change my approach a little bit. Obviously there has to be some feedback mechanism where I parse the output of Tesseract during training to see if a particular set of characters threw errors. Once I know what they are, I can separate them and train them later. 
To accomplish this, I changed my approach from generating a strip of character images to generating just one image per character, so I can pinpoint the problems better.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The downside: too many images get generated. To train a simple font it generates 405 images+405 box files+405 tr files. And all this when I have not included conjuncts yet. It is not much of a problem though, since the images generated are not required once the training files have been generated. &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Well, it leads me to new challenges. I remember Prof. B.B. Chaudhuri saying that training all the conjuncts will kill any recogniser, i.e., it will work very slowly while recognising. He also told me some cool ways to get past that. May have to implement that. Let's see.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-2807603093173696043?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/2807603093173696043/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/04/my-old-training-methodology.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2807603093173696043'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2807603093173696043'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/04/my-old-training-methodology.html' title='My old training methodology'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-3511624565105384108</id><published>2009-04-17T12:01:00.000-07:00</published><updated>2009-04-17T12:13:29.091-07:00</updated><title type='text'>My training methodology does not work :(</title><content type='html'>As much as I hate to admit it, my training methodology of generating one image per akshar &lt;http: com="" v="vuuvwm5zjki"&gt; does not work. I hate to say it since I put some effort into writing the Python code that does this &lt;http: com="" p="" tesseractindic="" source="" browse="" svn="" trunk="" tesseract_trainer=""&gt;.&lt;br /&gt;Well, the reason is probably that the Tesseract OCR training code looks for characters on a single line during training, as it also extracts baseline metrics for rare/strange characters like numerals. As such it may not be able to extract all the information it needs for its training.&lt;br /&gt;Or maybe the Tesseract OCR training code accepts only a small number of .tr files, and since my code generates thousands of .tr files, it becomes useless.&lt;br /&gt;Let me show you an example of how miserably it failed.&lt;br /&gt;I decided to test the training on the string  " ভারত মাতা " (Bharat Mata, which means Mother India). I generated the tiff image using Pango rendering.&lt;br /&gt;Then I generated 7 images per sample of ভ র ত ম &lt;http: com="" debayanin=""&gt; and used the subsequently generated training files for OCR.&lt;br /&gt;The result was this: " মভতভ ভভভভ "&lt;br /&gt;Yes, I know. The result is absolutely outrageous.&lt;br /&gt;However, what if I still autogenerate images of characters, but this time placed side by side on a single line? 
Will it work?&lt;/http:&gt;&lt;/http:&gt;&lt;/http:&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-3511624565105384108?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/3511624565105384108/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/04/my-training-methodology-does-not-work.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/3511624565105384108'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/3511624565105384108'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/04/my-training-methodology-does-not-work.html' title='My training methodology does not work :('/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-3564623573838317044</id><published>2009-04-14T17:03:00.000-07:00</published><updated>2009-04-14T17:06:14.187-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='degradation'/><title type='text'>Image degradation</title><content type='html'>I added some code. The pango rendering works perfectly now. Also 1 pure image and 4 partially erased images are created per character.&lt;br /&gt;The degradation has been chosen to suit the code that clips matraas. 
The only degradation seen is a vertical white strip overlapping certain characters.&lt;br /&gt;Hence the same is done while generating training images.&lt;br /&gt;&lt;br /&gt;[1] http://code.google.com/p/tesseractindic/source/detail?r=41&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-3564623573838317044?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/3564623573838317044/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/04/image-degradation.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/3564623573838317044'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/3564623573838317044'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/04/image-degradation.html' title='Image degradation'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-879003916997084874</id><published>2009-03-21T17:48:00.003-07:00</published><updated>2009-03-21T17:51:58.752-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sayamindu'/><category scheme='http://www.blogger.com/atom/ns#' term='pango'/><category scheme='http://www.blogger.com/atom/ns#' term='cairo'/><title type='text'>Pango Magic</title><content type='html'>I was getting crappy rendering through normal PIL for bengali 
conjuncts.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_j0a3i0txX5Q/ScWLAvNcybI/AAAAAAAADK8/sSA18HjGufU/s1600-h/Screenshot-1.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 250px;" src="http://2.bp.blogspot.com/_j0a3i0txX5Q/ScWLAvNcybI/AAAAAAAADK8/sSA18HjGufU/s400/Screenshot-1.png" alt="" id="BLOGGER_PHOTO_ID_5315807779969878450" border="0" /&gt;&lt;/a&gt;I then remembered that Sayamindu had sent me some Pango/Cairo code to help me out with rendering. I modified it somewhat and I am getting this:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_j0a3i0txX5Q/ScWLZwp9izI/AAAAAAAADLE/EkAKGwxvlrU/s1600-h/Screenshot.png"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 250px;" src="http://4.bp.blogspot.com/_j0a3i0txX5Q/ScWLZwp9izI/AAAAAAAADLE/EkAKGwxvlrU/s400/Screenshot.png" alt="" id="BLOGGER_PHOTO_ID_5315808209854630706" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;It's 6:21 AM now. Need to go to sleep. 
Shall work later today.&lt;br /&gt;&lt;br /&gt;&lt;http: com="" 2009="" 01="" 25="" automation="" comments=""&gt;&lt;br /&gt;&lt;/http:&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-879003916997084874?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/879003916997084874/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/03/pango-magic.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/879003916997084874'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/879003916997084874'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/03/pango-magic.html' title='Pango Magic'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_j0a3i0txX5Q/ScWLAvNcybI/AAAAAAAADK8/sSA18HjGufU/s72-c/Screenshot-1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-5187987650903994260</id><published>2009-03-21T15:00:00.000-07:00</published><updated>2009-03-21T15:05:03.993-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='bengali conjuncts'/><category scheme='http://www.blogger.com/atom/ns#' term='ocr'/><title type='text'>Bengali Conjuncts</title><content type='html'>I suddenly started looking for all the Bengali conjuncts 
in one single place on the web. After a lot of searching I found http://www.stat.wisc.edu/~deepayan/Bengali/FreeBangTemplate/juktolist.txt.&lt;br /&gt;I downloaded the file, removed all the comments and used this Python code to get a list of all conjuncts:&lt;br /&gt;&lt;br /&gt;f=open("juktolistocr.txt",'r')&lt;br /&gt;fout=open("conjuncts.txt",'w')&lt;br /&gt;&lt;br /&gt;for line in f.readlines():&lt;br /&gt;    # keep only the text after the first '=' on each line&lt;br /&gt;    conjunct=line.split('=')[1]&lt;br /&gt;    conjunct=conjunct.strip()&lt;br /&gt;    fout.write(conjunct+"\n")&lt;br /&gt;    print(conjunct)&lt;br /&gt;&lt;br /&gt;f.close()&lt;br /&gt;fout.close()&lt;br /&gt;&lt;br /&gt;In case you want a list of Bengali conjuncts too, download it from http://debayanin.googlepages.com/bengali_conjuncts.txt&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-5187987650903994260?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/5187987650903994260/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/03/bengali-conjuncts.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/5187987650903994260'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/5187987650903994260'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/03/bengali-conjuncts.html' title='Bengali Conjuncts'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-4621015612097018098</id><published>2009-03-15T04:03:00.000-07:00</published><updated>2009-03-15T04:05:07.343-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='tesseractindic'/><title type='text'>Recent Commits</title><content type='html'>Check Out  &lt;a href="http://code.google.com/p/tesseractindic/source/"&gt;http://code.google.com/p/tesseractindic/source/&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-4621015612097018098?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/4621015612097018098/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/03/recent-commits.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/4621015612097018098'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/4621015612097018098'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/03/recent-commits.html' title='Recent Commits'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-4842256146522764022</id><published>2009-03-15T03:49:00.000-07:00</published><updated>2009-03-15T03:50:06.148-07:00</updated><title 
type='text'>Observations</title><content type='html'>1) Creating per-font training file sets seems to make more sense.&lt;br /&gt;2) Omni-font training file sets make no sense. The same symbols may look remarkably different in different fonts. Also, the size of such a training set would be huge. The recogniser would "go to sleep", in the words of Prof. B.B. Chaudhuri.&lt;br /&gt;3) Must reduce training fatalities to zero. Strange errors are creeping in. Yet to figure out why.&lt;br /&gt;4) How to generate test images? Create a few text files. Use PIL draw() to generate images? Use a text editor manually?&lt;br /&gt;&lt;br /&gt;What after all this?&lt;br /&gt;&lt;br /&gt;1) A nice how-to. Points to cover in the how-to:&lt;br /&gt;&lt;br /&gt;    * How to run using existing training files.&lt;br /&gt;    * How to create new training files.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6518118086872671696-4842256146522764022?l=hacking-tesseract.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/4842256146522764022/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/03/observations.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/4842256146522764022'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/4842256146522764022'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/03/observations.html' title='Observations'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6518118086872671696.post-2374597159903690225</id><published>2009-03-15T03:34:00.000-07:00</published><updated>2009-03-15T03:42:24.148-07:00</updated><title type='text'>Transferred Entries</title><content type='html'>&lt;p&gt;&lt;u&gt;&lt;b&gt;March 5, 4:58 AM&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;I am dealing with the problems mentioned in the last post. There are several fatalities in training, but I have been successful in weeding them out using the following piece of code:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="highlight"&gt;&lt;br /&gt;&lt;pre&gt; qpipe &lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt; os&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;popen4(exec_string1) &lt;br&gt; o&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;qpipe[&lt;span style="color: rgb(64, 160, 112);"&gt;1&lt;/span&gt;]&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;readlines() &lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;#returning your output.&lt;/span&gt;&lt;br&gt; &lt;br&gt; pos&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;&lt;span style="color: rgb(0, 112, 32);"&gt;str&lt;/span&gt;(o)&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;find(&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;#39;FAILURE&amp;#39;&lt;/span&gt;)&lt;br&gt; &lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;#print len(str(o))&lt;/span&gt;&lt;br&gt; &lt;span style="color: rgb(0, 112, 32); font-weight: bold;"&gt;print&lt;/span&gt; pos&lt;br&gt; &lt;span style="color: rgb(0, 112, 32); font-weight: bold;"&gt;if&lt;/span&gt;(pos &lt;span style="color: rgb(102, 102, 102);"&gt;&amp;gt;&lt;/span&gt;&lt;span style="color: rgb(64, 160, 112);"&gt;0&lt;/span&gt;):&lt;br&gt; &lt;span style="color: rgb(96, 160, 176); font-style: 
italic;"&gt;#os.chdir(&amp;quot;images&amp;quot;)&lt;/span&gt;&lt;br&gt; fileout&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;&lt;span style="color: rgb(0, 112, 32);"&gt;open&lt;/span&gt;(&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;quot;error&amp;quot;&lt;/span&gt;,&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;#39;a&amp;#39;&lt;/span&gt;)&lt;br&gt; filein&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;&lt;span style="color: rgb(0, 112, 32);"&gt;open&lt;/span&gt;(&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;quot;beng.images/&amp;quot;&lt;/span&gt;&lt;span style="color: rgb(102, 102, 102);"&gt;+&lt;/span&gt;box,&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;#39;r&amp;#39;&lt;/span&gt;)&lt;br&gt; linein&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;filein&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;readline()&lt;br&gt; fileout&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;write(linein&lt;span style="color: rgb(102, 102, 102);"&gt;+&lt;/span&gt;&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;quot;&lt;/span&gt;&lt;span style="color: rgb(64, 112, 160); font-weight: bold;"&gt;\n&lt;/span&gt;&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;quot;&lt;/span&gt;)&lt;br&gt; filein&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;close()&lt;br&gt; fileout&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;close()&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt; &lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;It parses the output of the string executed by popen4() and looks for characters&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;that failed while training. 
It writes those characters to a separate file.&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;I just need to work on generating one set of really good and flawless training data.&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;March 4, 5:07 AM&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;It's been more than a month since I recorded any of my work here, but I *have* been working and there are lots of updates.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;We now have 3 project members. Jinesh from IIIT Hyderabad wishes to add Malayalam support to TesseractIndic. Baali (Shantanu) from Sarai, Delhi wishes to add Devanagari support. So finally I am not working alone.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Also, around a month back I went to ISI Kolkata to consult Prof. B.B. Chaudhuri. I had mailed him asking for help with training data and ground truth data for testing, and thanks to Mr. Gora’s recommendation, he agreed to meet me in Kolkata. I met him at ISI and spoke to him for about 40 minutes regarding different issues in Indic OCR. He discussed some really good ways to significantly reduce recognition time etc., and rued the lack of good research assistants.&lt;br&gt;&lt;br /&gt;He could not share data with me at that moment because of lack of manpower and copyright issues. I was returning dejected, but I met Dr. Mandar Mitra at the gates. Mr. Sankarshan, my mentor who originally started me down this path, had introduced me to him in Kolkata about a month back. It really helped and he took me to his lab. 
He mined the last 7-8 years of his work and gave me everything useful he had, including a lot of ground truth data and some images with bounding box information.&lt;br&gt;&lt;br /&gt;But that is not the most important thing I acquired there. While talking to Mandar Mitra, it dawned on us that the entire training process can be automated using Python scripts, and there is no need to feed data in manually using scanners and all.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The principle on which this works is this: Tesseract needs two things to train itself: 1) an image of the character and 2) the name of the character. This information is provided with the help of &amp;quot;box files&amp;quot;. A box file contains the co-ordinates of the bounding boxes around characters, with labels saying what those characters are. The traditional method of training the engine is to take a scanned image, meticulously create a box file using some tool such as tesseractrainer.py, edit the box file, and keep doing the same for several other images and fonts. This process was tedious enough to force me to seek new methods.&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Now let's do a little reverse engineering. What if we could take a list of characters in a text file, &amp;quot;generate&amp;quot; an image out of those characters, store the co-ordinates of the bounding boxes of those generated images in a file and then feed these to the OCR engine? 
It would work, right?&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Links:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://tesseractindic.googlecode.com/files/tesseract_trainer.beta.tar.gz" target="_blank"&gt;http://tesseractindic.&lt;wbr&gt;googlecode.com/files/&lt;wbr&gt;tesseract_trainer.beta.tar.gz&lt;/a&gt;  - The tar ball itself&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/tesseractindic/source/browse/trunk/tesseract_trainer/readme" target="_blank"&gt;http://code.google.com/p/&lt;wbr&gt;tesseractindic/source/browse/&lt;wbr&gt;trunk/tesseract_trainer/readme&lt;/a&gt;&lt;wbr&gt;  - The readme file&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;a href="http://www.youtube.com/watch?v=vuuVwm5ZjkI" target="_blank"&gt;http://www.youtube.com/watch?&lt;wbr&gt;v=vuuVwm5ZjkI&lt;/a&gt; - YouTube video of the tool working for Bengali&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;But there are problems. Tesseract-OCR has its quirks.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Tesseract wants one bounding box to enclose a single &amp;quot;blob&amp;quot; only. A blob is a wholly connected component. So ক is a blob, and ক খ are two blobs. 
There are cases where a consonant+vowel sign generates two blobs; for example, the 3 images below have multiple blobs:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="imageani36.tif/imageani36-full;init:.tif" imageanchor="1"&gt;&lt;img src="imageani36.tif/imageani36-medium;init:.jpg" style="border: 0pt none ;"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="imageani37.tif/imageani37-full;init:.tif" imageanchor="1"&gt;&lt;img src="imageani37.tif/imageani37-medium;init:.jpg" style="border: 0pt none ;"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="imageani43.tif/imageani43-full;init:.tif" imageanchor="1"&gt;&lt;img src="imageani43.tif/imageani43-medium;init:.jpg" style="border: 0pt none ;"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;And hence Tesseract throws a &amp;quot;FATALITY&amp;quot; error during training.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;So I had to change my approach a little bit. Obviously there has to be some feedback mechanism where I parse the output of Tesseract during training to see if a particular set of characters threw errors. Once I know what they are, I can separate them and train them later. To accomplish this, I changed my approach from generating a strip of character images to generating just one image per character, so I can pinpoint the problems better.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The downside: too many files get generated. Training a simple font generates 405 images + 405 box files + 405 tr files. 
And all this when I have not included conjuncts yet. It is not much of a problem though, since the generated images are not required once the training files have been produced.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Well, it leads me to new challenges. I remember Prof. B.B. Chaudhuri saying that training all the conjuncts will kill any recogniser, i.e., it will work very slowly while recognising. He also told me some cool ways to get past that. May have to implement that. Let's see.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;And yeah, Sarai cheque arrived. :)&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;January 29, 3:16 AM&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="highlight"&gt;&lt;br /&gt;&lt;pre&gt;&lt;span style="color: rgb(0, 112, 32); font-weight: bold;"&gt;for&lt;/span&gt; akshar &lt;span style="color: rgb(0, 112, 32); font-weight: bold;"&gt;in&lt;/span&gt; alphabets:&lt;br&gt;&lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;#print akshar&lt;/span&gt;&lt;br&gt;&lt;br&gt;  draw&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;text((x, y), &lt;span style="color: rgb(0, 112, 32);"&gt;unicode&lt;/span&gt;(akshar,&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;#39;UTF-8&amp;#39;&lt;/span&gt;), font&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;font)&lt;br&gt;  leftx&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;x&lt;span style="color: rgb(102, 102, 102);"&gt;-&lt;/span&gt;&lt;span style="color: rgb(64, 160, 112);"&gt;20&lt;/span&gt;&lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;#the left end of the small bounding box&lt;/span&gt;&lt;br&gt;  box&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;(leftx,y,x&lt;span style="color: rgb(102, 102, 102);"&gt;+&lt;/span&gt;&lt;span style="color: rgb(64, 160, 112);"&gt;100&lt;/span&gt;,y&lt;span style="color: rgb(102, 102, 
102);"&gt;+&lt;/span&gt;&lt;span style="color: rgb(64, 160, 112);"&gt;60&lt;/span&gt;) &lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;#the box in the big image within which the small image of interest lies&lt;/span&gt;&lt;br&gt;  sub_im&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;im&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;crop(box)&lt;br&gt;&lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;#sub_im.show()&lt;/span&gt;&lt;br&gt;  bbox_sub&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;sub_im&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;getbbox() &lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;#get the bounding box of the black pixels in the sub image, not the big image&lt;/span&gt;&lt;br&gt;  bbox_im&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;(leftx&lt;span style="color: rgb(102, 102, 102);"&gt;+&lt;/span&gt;bbox_sub[&lt;span style="color: rgb(64, 160, 112);"&gt;0&lt;/span&gt;],y&lt;span style="color: rgb(102, 102, 102);"&gt;+&lt;/span&gt;bbox_sub[&lt;span style="color: rgb(64, 160, 112);"&gt;1&lt;/span&gt;],leftx&lt;span style="color: rgb(102, 102, 102);"&gt;+&lt;/span&gt;bbox_sub[&lt;span style="color: rgb(64, 160, 112);"&gt;2&lt;/span&gt;],y&lt;span style="color: rgb(102, 102, 102);"&gt;+&lt;/span&gt;bbox_sub[&lt;span style="color: rgb(64, 160, 112);"&gt;3&lt;/span&gt;]) &lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;#calculate relative to the big image&lt;/span&gt;&lt;br&gt;  draw&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;rectangle(bbox_im)&lt;br&gt;&lt;span style="color: rgb(0, 112, 32); font-weight: bold;"&gt;print&lt;/span&gt; bbox_im&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;This block of code was instrumental in giving this:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; 
background-color: transparent; margin-left: 1em; margin-right: 1em;" href="haha.jpg/haha-full;init:.jpg" imageanchor="1"&gt;&lt;img src="haha.jpg/haha-medium;init:.jpg" style="border: 0pt none ;"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;January 25, 4:54 PM&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;I am so excited.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="highlight"&gt;&lt;br /&gt;&lt;pre&gt;&lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;#!/usr/local/bin/python&lt;/span&gt;&lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;#-*- coding:utf8 -*-&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;&lt;/span&gt;&lt;span style="color: rgb(0, 112, 32); font-weight: bold;"&gt;import&lt;/span&gt; &lt;span style="color: rgb(14, 132, 181); font-weight: bold;"&gt;ImageFont&lt;/span&gt;&lt;span style="color: rgb(102, 102, 102);"&gt;,&lt;/span&gt; &lt;span style="color: rgb(14, 132, 181); font-weight: bold;"&gt;ImageDraw&lt;/span&gt;&lt;span style="color: rgb(0, 112, 32); font-weight: bold;"&gt;&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;span style="color: rgb(0, 112, 32); font-weight: bold;"&gt;from&lt;/span&gt; &lt;span style="color: rgb(14, 132, 181); font-weight: bold;"&gt;PIL&lt;/span&gt; &lt;span style="color: rgb(0, 112, 32); font-weight: bold;"&gt;import&lt;/span&gt; Image&lt;br&gt;&lt;br&gt;im &lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt; Image&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;new(&lt;span style="color: rgb(64, 112, 
160);"&gt;&amp;quot;RGB&amp;quot;&lt;/span&gt;,(&lt;span style="color: rgb(64, 160, 112);"&gt;400&lt;/span&gt;,&lt;span style="color: rgb(64, 160, 112);"&gt;400&lt;/span&gt;))&lt;br&gt;&lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;#im.show()&lt;/span&gt;&lt;br&gt;&lt;br&gt;draw &lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt; ImageDraw&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;Draw(im)&lt;br&gt;&lt;br&gt;&lt;span style="color: rgb(96, 160, 176); font-style: italic;"&gt;# use a truetype font&lt;/span&gt;&lt;br&gt;font&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt; ImageFont&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;truetype(&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;quot;/usr/share/fonts/truetype/ttf-bengali-fonts/lohit_bn.ttf&amp;quot;&lt;/span&gt;,&lt;span style="color: rgb(64, 160, 112);"&gt;50&lt;/span&gt;)&lt;br&gt;&lt;br&gt;txt1&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;quot;ক&amp;quot;&lt;/span&gt;&lt;br&gt;txt2&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;quot; ি&amp;quot;&lt;/span&gt;&lt;br&gt;txt&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;txt2&lt;span style="color: rgb(102, 102, 102);"&gt;+&lt;/span&gt;txt1&lt;br&gt;&lt;br&gt;&lt;br&gt;draw&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;text((&lt;span style="color: rgb(64, 160, 112);"&gt;10&lt;/span&gt;, &lt;span style="color: rgb(64, 160, 112);"&gt;10&lt;/span&gt;), &lt;span style="color: rgb(0, 112, 32);"&gt;unicode&lt;/span&gt;(txt,&lt;span style="color: rgb(64, 112, 160);"&gt;&amp;#39;UTF-8&amp;#39;&lt;/span&gt;), font&lt;span style="color: rgb(102, 102, 102);"&gt;=&lt;/span&gt;font)&lt;br&gt;im&lt;span style="color: rgb(102, 102, 102);"&gt;.&lt;/span&gt;show()&lt;br&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br 
/&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;generated:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="Screenshot.png/Screenshot-full;init:.png" imageanchor="1"&gt;&lt;img src="Screenshot.png/Screenshot-medium;init:.jpg" style="border: 0pt none ;"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;That means the *entire* training+testing process can be automated :) :) :)&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Damn, I didn't even have to go to ISI. Of course, going to ISI was quite an experience in itself.&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;January 10, 2009&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;08:57 PM&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;&lt;/b&gt;&lt;/u&gt;&lt;br&gt;&lt;span style="font-size: large; font-weight: bold;"&gt;Forwarded conversation&lt;/span&gt;&lt;br&gt;Subject: &lt;b class="gmail_sendername"&gt;Regarding Bangla training data&lt;/b&gt;&lt;br&gt;------------------------&lt;br&gt;&lt;br&gt;&lt;span class="undefined"&gt;&lt;font color="#000000"&gt;From: &lt;b class="undefined"&gt;Debayan Banerjee&lt;/b&gt; &lt;span dir="ltr"&gt;&amp;lt;debayanin@gmail.com&amp;gt;&lt;/span&gt;&lt;br&gt;Date: 2009/1/9&lt;br&gt;To: mhasnat@gmail.com&lt;br&gt;&lt;/font&gt;&lt;br&gt;&lt;/span&gt;&lt;br&gt;Hi,&lt;br&gt;I was going through your work on ocropus and your training data. 
I have a few questions:&lt;br&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;Can you share with me your training data?&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Have you trained only for Solaiman lipi font?&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;The maatraa-clipping code in lua, what is the logic/pseudocode?&lt;br&gt;&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;What is the performance of the lua scripts on the 18 test images?&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Eagerly waiting for your reply.&lt;br&gt;&lt;br&gt;~Debayan&lt;br&gt;&lt;font color="#888888"&gt;&lt;br&gt;-- &lt;br&gt;BE INTELLIGENT, USE GNU/LINUX&lt;br&gt;&lt;a href="http://lug.nitdgp.ac.in/" target="_blank"&gt;http://lug.nitdgp.ac.in&lt;/a&gt;&lt;br&gt;&lt;a href="http://mukti09.in/" target="_blank"&gt;http://mukti09.in&lt;/a&gt;&lt;br&gt;&lt;a href="http://planet-india.randomink.org/" target="_blank"&gt;http://planet-india.randomink.&lt;wbr&gt;org&lt;/a&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;/font&gt;&lt;br&gt;----------&lt;br&gt;&lt;span class="undefined"&gt;&lt;font color="#000000"&gt;From: &lt;b class="undefined"&gt;Hasnat&lt;/b&gt;&lt;span dir="ltr"&gt;&amp;lt;mhasnat@gmail.com&amp;gt;&lt;/span&gt;&lt;br&gt;Date: 2009/1/10&lt;br&gt;To: Debayan Banerjee &amp;lt;debayanin@gmail.com&amp;gt;&lt;br&gt;&lt;/font&gt;&lt;br&gt;&lt;/span&gt;&lt;br&gt;Dear Debayan,&lt;br&gt;                     &lt;br /&gt;sorry for my late reply because of my traveling from Bangladesh to&lt;br /&gt;outside. By training data do you mean it for OCROpus or tesseract? For&lt;br /&gt;OCROpus we have created training data which was a complex process. I&lt;br /&gt;worked on that few months ago to test the training procedure working&lt;br /&gt;for Bangla script. I observed that this is quite complex process to&lt;br /&gt;prepare training data. 
Few more work need to be done to complete this.&lt;br /&gt;However, we (me and shouro) have tested basic training procedure and&lt;br /&gt;observed the performance which seems satisfactory to me but very&lt;br /&gt;sensitive. To make that training data very effective we had to collect&lt;br /&gt;a large amount of data and train. As the segmentation algorithm was not&lt;br /&gt;completed at that time as well as OCROpus was changing its procedures&lt;br /&gt;continuously, so I left that task until the next stable version of&lt;br /&gt;OCROpus (0.3). Now we again start looking at that and just finished the&lt;br /&gt;basic compilation and checking other procedure to integration. So,&lt;br /&gt;honestly there is no training data for version 0.3.&lt;br&gt;&lt;br&gt;We are considering SuttunyMJ font for training which is the most widely used font in the Bangla documents.&lt;br&gt;&lt;br&gt;I&lt;br /&gt;didn&amp;#39;t integrate any Matraa clipping code in Lua script. Rather I was&lt;br /&gt;focusing on embedding our own procedures with C++ files which is not&lt;br /&gt;completed yet. Matraa clipping is a big problem what I observed if you&lt;br /&gt;follow the general procedures. I have tested three different methods&lt;br /&gt;andobserved that nothing is giving 100% accuracy for all type of&lt;br /&gt;documents. I think its a big deal to solve yet.&lt;br&gt;&lt;br&gt;I have tested the images for tesseract. The test images is not&lt;br /&gt;following the training document font size and type. Hence for different&lt;br /&gt;images we are getting different results. From the feedback of different&lt;br /&gt;people at the end user level I have the realization that we have to&lt;br /&gt;work more for a market place standard OCR.&lt;br&gt;&lt;br&gt;I will return back to my country at the end of this month and start&lt;br /&gt;working on OCR. Then can concentrate on these issues and hopeful to&lt;br /&gt;find out solutions. 
As we have implemented the complete framework so it&lt;br /&gt;will be easier for us to solve the particular problems. Please do share&lt;br /&gt;your work with us and you find our work on the web link.Regards,&lt;br&gt;&lt;font color="#888888"&gt;-- &lt;br&gt;Hasnat&lt;br&gt;Center for Research on Bangla Language Processing (CRBLP)&lt;br&gt;&lt;a href="http://mhasnat.googlepages.com/" target="_blank"&gt;http://mhasnat.googlepages.&lt;wbr&gt;com/&lt;/a&gt;&lt;br&gt;&lt;/font&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;&lt;/b&gt;&lt;/u&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;December 22&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;05:07 hrs. &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;a mce="" href="http://code.google.com/p/ocropus/source/browse/trunk/ocroscript/scripts/deskew.lua"&gt;This&lt;/a&gt; is the lua script in ocropus 0.3 release that deskews a page image. It did not work for me. Kept giving this error:&lt;/p&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;br /&gt;&lt;p&gt;ocroscript: ocroscript/scripts/deskew.lua:9: attempt to call global &amp;#39;make_DeskewPageByRAST&amp;#39; (a nil value)&lt;br&gt;stack traceback:&lt;br&gt;    ocroscript/scripts/deskew.lua:9: in main chunk&lt;br&gt;    [C]: ?&lt;/p&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;p&gt;I used google, and found &lt;a mce="" href="http://markmail.org/message/zb7riovavg6meuis"&gt;this&lt;/a&gt;. It worked well. 
The code is:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;-proc = make_DeskewPageByRAST()&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;+proc = ocr.make_DeskewPageByRAST()&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt; input = bytearray:new()&lt;br /&gt; output = bytearray:new()&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br /&gt;-read_image_gray(input,arg[1])&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;+iulib.read_image_gray(input,arg[1])&lt;br /&gt; &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;proc:cleanup(output,input)&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;-write_png(arg[2],output)&lt;br /&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Result is: &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;img mce="" src="http://debayan.wordpress.com/files/2008/12/tilt.png?w=227" alt="tilt" title="tilt" class="aligncenter size-medium wp-image-234" height="300" width="227"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;img mce="" src="http://debayan.wordpress.com/files/2008/12/tilt1.png?w=227" alt="tilt1" title="tilt1" class="aligncenter size-medium wp-image-235" height="300" width="227"&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;October 28&lt;/b&gt;&lt;/u&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;My work till date:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Author: debayanin&lt;br&gt;&lt;br /&gt;Date: Mon Oct 27 16:41:10 2008&lt;br&gt;&lt;br /&gt;New Revision: 8&lt;br&gt;&lt;br&gt;&lt;br /&gt;Modified:&lt;br&gt;&lt;br /&gt;   trunk/ccmain/baseapi.cpp&lt;br&gt;&lt;br&gt;&lt;br /&gt;Log:&lt;br&gt;&lt;br /&gt;auto-indented baseapi.cpp&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br /&gt;Modified: trunk/ccmain/baseapi.cpp&lt;br&gt;&lt;br /&gt;==============================&lt;/p&gt;&lt;br 
/&gt;&lt;br /&gt;&lt;div id=":zg" class="ArwC7c ckChnd"&gt;&lt;wbr&gt;==============================&lt;wbr&gt;==================&lt;br&gt;&lt;br /&gt;--- trunk/ccmain/baseapi.cpp    (original)&lt;br&gt;&lt;br /&gt;+++ trunk/ccmain/baseapi.cpp    Mon Oct 27 16:41:10 2008&lt;br&gt;&lt;br /&gt;@@ -409,161 +409,161 @@&lt;br&gt;&lt;br /&gt; ////////////DEBAYAN//Deskew begins//////////////////////&lt;br&gt;&lt;br /&gt; void deskew(float angle,int srcheight, int srcwidth)&lt;br&gt;&lt;br /&gt; {&lt;br&gt;&lt;br /&gt;-//angle=4;        //45° for example&lt;br&gt;&lt;br /&gt;-IMAGE tempimage;&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-IMAGELINE line;&lt;br&gt;&lt;br /&gt;-//Convert degrees to radians&lt;br&gt;&lt;br /&gt;-float radians=(2*3.1416*angle)/360;&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-float cosine=(float)cos(radians);&lt;br&gt;&lt;br /&gt;-float sine=(float)sin(radians);&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-float Point1x=(srcheight*sine);&lt;br&gt;&lt;br /&gt;-float Point1y=(srcheight*cosine);&lt;br&gt;&lt;br /&gt;-float Point2x=(srcwidth*cosine-&lt;wbr&gt;srcheight*sine);&lt;br&gt;&lt;br /&gt;-float Point2y=(srcheight*cosine+&lt;wbr&gt;srcwidth*sine);&lt;br&gt;&lt;br /&gt;-float Point3x=(srcwidth*cosine);&lt;br&gt;&lt;br /&gt;-float Point3y=(srcwidth*sine);&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-float minx=min(0,min(Point1x,min(&lt;wbr&gt;Point2x,Point3x)));&lt;br&gt;&lt;br /&gt;-float miny=min(0,min(Point1y,min(&lt;wbr&gt;Point2y,Point3y)));&lt;br&gt;&lt;br /&gt;-float maxx=max(Point1x,max(Point2x,&lt;wbr&gt;Point3x));&lt;br&gt;&lt;br /&gt;-float maxy=max(Point1y,max(Point2y,&lt;wbr&gt;Point3y));&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-int DestWidth=(int)ceil(fabs(maxx)&lt;wbr&gt;-minx);&lt;br&gt;&lt;br /&gt;-int DestHeight=(int)ceil(fabs(&lt;wbr&gt;maxy)-miny);&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-tempimage.create(DestWidth,&lt;wbr&gt;DestHeight,1);&lt;br&gt;&lt;br 
/&gt;-line.init(DestWidth);&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-for(int i=0;i&amp;lt;DestWidth;i++){ //A white line of length=DestWidth&lt;br&gt;&lt;br /&gt;-line.pixels[i]=1;&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-for(int y=0;y&amp;lt;DestHeight;y++){ //Fill the Destination image with white, else clipmatra wont work&lt;br&gt;&lt;br /&gt;-tempimage.put_line(0,y,&lt;wbr&gt;DestWidth,&amp;amp;line,0);&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-line.init(DestWidth);&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-for(int y=0;y&amp;lt;DestHeight;y++) //Start filling the destination image pixels with corresponding source image pixels&lt;br&gt;&lt;br /&gt;-{&lt;br&gt;&lt;br /&gt;-  for(int x=0;x&amp;lt;DestWidth;x++)&lt;br&gt;&lt;br /&gt;-  {&lt;br&gt;&lt;br /&gt;-    int Srcx=(int)((x+minx)*cosine+(y+&lt;wbr&gt;miny)*sine);&lt;br&gt;&lt;br /&gt;-    int Srcy=(int)((y+miny)*cosine-(x+&lt;wbr&gt;minx)*sine);&lt;br&gt;&lt;br /&gt;-    if(Srcx&amp;gt;=0&amp;amp;&amp;amp;Srcx&amp;lt;srcwidth&amp;amp;&amp;amp;&lt;wbr&gt;Srcy&amp;gt;=0&amp;amp;&amp;amp;&lt;br&gt;&lt;br /&gt;-         Srcy&amp;lt;srcheight)&lt;br&gt;&lt;br /&gt;-    {&lt;br&gt;&lt;br /&gt;-      line.pixels[x]=&lt;br&gt;&lt;br /&gt;-          page_image.pixel(Srcx,Srcy);&lt;br&gt;&lt;br /&gt;-    }&lt;br&gt;&lt;br /&gt;-  }&lt;br&gt;&lt;br /&gt;-   tempimage.put_line(0,y,&lt;wbr&gt;DestWidth,&amp;amp;line,0);  &lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-//tempimage.write(&amp;quot;tempimage.&lt;wbr&gt;tif&amp;quot;);&lt;br&gt;&lt;br /&gt;-page_image=tempimage;//Copy deskewed image to global page image, so it can be worked on further&lt;br&gt;&lt;br /&gt;-tempimage.destroy();&lt;br&gt;&lt;br /&gt;-//page_image.write(&amp;quot;page_&lt;wbr&gt;image.tif&amp;quot;);&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;+       //angle=4;        //45° for example&lt;br&gt;&lt;br /&gt;+       IMAGE 
tempimage;&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       IMAGELINE line;&lt;br&gt;&lt;br /&gt;+       //Convert degrees to radians&lt;br&gt;&lt;br /&gt;+       float radians=(2*3.1416*angle)/360;&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       float cosine=(float)cos(radians);&lt;br&gt;&lt;br /&gt;+       float sine=(float)sin(radians);&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       float Point1x=(srcheight*sine);&lt;br&gt;&lt;br /&gt;+       float Point1y=(srcheight*cosine);&lt;br&gt;&lt;br /&gt;+       float Point2x=(srcwidth*cosine-&lt;wbr&gt;srcheight*sine);&lt;br&gt;&lt;br /&gt;+       float Point2y=(srcheight*cosine+&lt;wbr&gt;srcwidth*sine);&lt;br&gt;&lt;br /&gt;+       float Point3x=(srcwidth*cosine);&lt;br&gt;&lt;br /&gt;+       float Point3y=(srcwidth*sine);&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       float minx=min(0,min(Point1x,min(&lt;wbr&gt;Point2x,Point3x)));&lt;br&gt;&lt;br /&gt;+       float miny=min(0,min(Point1y,min(&lt;wbr&gt;Point2y,Point3y)));&lt;br&gt;&lt;br /&gt;+       float maxx=max(Point1x,max(Point2x,&lt;wbr&gt;Point3x));&lt;br&gt;&lt;br /&gt;+       float maxy=max(Point1y,max(Point2y,&lt;wbr&gt;Point3y));&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       int DestWidth=(int)ceil(fabs(maxx)&lt;wbr&gt;-minx);&lt;br&gt;&lt;br /&gt;+       int DestHeight=(int)ceil(fabs(&lt;wbr&gt;maxy)-miny);&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       tempimage.create(DestWidth,&lt;wbr&gt;DestHeight,1);&lt;br&gt;&lt;br /&gt;+       line.init(DestWidth);&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       for(int i=0;i&amp;lt;DestWidth;i++){ //A white line of length=DestWidth&lt;br&gt;&lt;br /&gt;+               line.pixels[i]=1;&lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       for(int y=0;y&amp;lt;DestHeight;y++){ //Fill the Destination image with white, else clipmatra wont work&lt;br&gt;&lt;br /&gt;+       
        tempimage.put_line(0,y,&lt;wbr&gt;DestWidth,&amp;amp;line,0);&lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       line.init(DestWidth);&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       for(int y=0;y&amp;lt;DestHeight;y++) //Start filling the destination image pixels with corresponding source image pixels&lt;br&gt;&lt;br /&gt;+       {&lt;br&gt;&lt;br /&gt;+               for(int x=0;x&amp;lt;DestWidth;x++)&lt;br&gt;&lt;br /&gt;+               {&lt;br&gt;&lt;br /&gt;+                       int Srcx=(int)((x+minx)*cosine+(y+&lt;wbr&gt;miny)*sine);&lt;br&gt;&lt;br /&gt;+                       int Srcy=(int)((y+miny)*cosine-(x+&lt;wbr&gt;minx)*sine);&lt;br&gt;&lt;br /&gt;+                       if(Srcx&amp;gt;=0&amp;amp;&amp;amp;Srcx&amp;lt;srcwidth&amp;amp;&amp;amp;&lt;wbr&gt;Srcy&amp;gt;=0&amp;amp;&amp;amp;&lt;br&gt;&lt;br /&gt;+                          Srcy&amp;lt;srcheight)&lt;br&gt;&lt;br /&gt;+                       {&lt;br&gt;&lt;br /&gt;+                               line.pixels[x]=&lt;br&gt;&lt;br /&gt;+                                       page_image.pixel(Srcx,Srcy);&lt;br&gt;&lt;br /&gt;+                       }&lt;br&gt;&lt;br /&gt;+               }&lt;br&gt;&lt;br /&gt;+               tempimage.put_line(0,y,&lt;wbr&gt;DestWidth,&amp;amp;line,0);      &lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       //tempimage.write(&amp;quot;tempimage.&lt;wbr&gt;tif&amp;quot;);&lt;br&gt;&lt;br /&gt;+       page_image=tempimage;//Copy deskewed image to global page image, so it can be worked on further&lt;br&gt;&lt;br /&gt;+               tempimage.destroy();&lt;br&gt;&lt;br /&gt;+       //page_image.write(&amp;quot;page_&lt;wbr&gt;image.tif&amp;quot;);&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt; }&lt;br&gt;&lt;br /&gt; /////////////DEBAYAN//Deskew ends/////////////////////&lt;br&gt;&lt;br&gt;&lt;br /&gt; ////////////DEBAYAN//Find skew 
begins/////////////////&lt;br&gt;&lt;br /&gt; float findskew(int height, int width)&lt;br&gt;&lt;br /&gt; {&lt;br&gt;&lt;br /&gt;-int topx=0,topy=0,sign,count=0,&lt;wbr&gt;offset=1,ifcounter=0;&lt;br&gt;&lt;br /&gt;-float slope=-999,avg=0;&lt;br&gt;&lt;br /&gt;-IMAGELINE line;&lt;br&gt;&lt;br /&gt;-line.init(1);&lt;br&gt;&lt;br /&gt;-line.pixels[0]=0;&lt;br&gt;&lt;br /&gt;-///////Find the top most point of the page: begins///////////&lt;br&gt;&lt;br /&gt;-for(int y=height-1;y&amp;gt;0;y--){&lt;br&gt;&lt;br /&gt;-  for(int x=width-1;x&amp;gt;0;x--){&lt;br&gt;&lt;br /&gt;-    if(page_image.pixel(x,y)==0){&lt;br&gt;&lt;br /&gt;-      topx=x;topy=y;&lt;br&gt;&lt;br /&gt;-      break;&lt;br&gt;&lt;br /&gt;-    }&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-  }&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-  if(topx&amp;gt;0){break;};&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-///////Find the top most point of the page: ends///////////&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-///////To find pages with no skew: begins//////////////&lt;br&gt;&lt;br /&gt;-int c1,c2=0;&lt;br&gt;&lt;br /&gt;-for(int x=1;x&amp;lt;.25*width;x++){&lt;br&gt;&lt;br /&gt;-  while(page_image.pixel((&lt;wbr&gt;width/2)+x,c1++)==1){ }&lt;br&gt;&lt;br /&gt;-  while(page_image.pixel((&lt;wbr&gt;width/2)-x,c2++)==1){ }&lt;br&gt;&lt;br /&gt;-  if(c1==c2){cout&amp;lt;&amp;lt;&amp;quot;0 ANGLE\n&amp;quot;;return (0);}&lt;br&gt;&lt;br /&gt;-  c1=c2=0;&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-///////To find pages with no skew: ends//////////////&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-cout&amp;lt;&amp;lt;&amp;quot;width=&amp;quot;&amp;lt;&amp;lt;width;&lt;br&gt;&lt;br /&gt;-if(topx&amp;gt;0 &amp;amp;&amp;amp; topx&amp;lt;.5*width){&lt;br&gt;&lt;br /&gt;-  sign=1;&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-if(topx&amp;gt;0 &amp;amp;&amp;amp; topx&amp;gt;.5*width){&lt;br&gt;&lt;br /&gt;-  sign=-1;&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br 
/&gt;-&lt;br&gt;&lt;br /&gt;-if(sign==-1){&lt;br&gt;&lt;br /&gt;-  while((topx-offset)&amp;gt;width/2){&lt;br&gt;&lt;br /&gt;-    while(page_image.pixel(topx-&lt;wbr&gt;offset,topy-count)==1){&lt;br&gt;&lt;br /&gt;-    //page_image.put_line(topx-&lt;wbr&gt;offset,topy-count,1,&amp;amp;line,0);&lt;br&gt;&lt;br /&gt;-    count++;&lt;br&gt;&lt;br /&gt;-    }&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-    if((180/3.142)*atan((float)&lt;wbr&gt;count/offset)&amp;lt;10){&lt;br&gt;&lt;br /&gt;-    slope=(float)count/offset;&lt;br&gt;&lt;br /&gt;-    ifcounter++;&lt;br&gt;&lt;br /&gt;-    avg=(avg+slope);&lt;br&gt;&lt;br /&gt;-    }&lt;br&gt;&lt;br /&gt;-    count=0;&lt;br&gt;&lt;br /&gt;-    offset++;&lt;br&gt;&lt;br /&gt;-  }&lt;br&gt;&lt;br /&gt;-    avg=(float)avg/ifcounter;&lt;br&gt;&lt;br /&gt;-    //cout&amp;lt;&amp;lt;&amp;quot;avg=&amp;quot;&amp;lt;&amp;lt;avg&amp;lt;&amp;lt;&amp;quot;\n&amp;quot;;&lt;br&gt;&lt;br /&gt;-    page_image.write(&amp;quot;findskew.&lt;wbr&gt;tif&amp;quot;);&lt;br&gt;&lt;br /&gt;-    //cout&amp;lt;&amp;lt;&amp;quot;(180/3.142)*atan((&lt;wbr&gt;float)(count/offset)=&amp;quot;&amp;lt;&amp;lt;(180/&lt;wbr&gt;3.142)*atan(avg)&amp;lt;&amp;lt;&amp;quot;\n&amp;quot;;&lt;br&gt;&lt;br /&gt;-    return (sign*(180/3.142)*atan(avg));&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-if(sign==1){&lt;br&gt;&lt;br /&gt;-  while((topx+offset)&amp;lt;width/2){&lt;br&gt;&lt;br /&gt;-    while(page_image.pixel(topx+&lt;wbr&gt;offset,topy-count)==1){&lt;br&gt;&lt;br /&gt;-    //page_image.put_line(topx+&lt;wbr&gt;offset,topy-count,1,&amp;amp;line,0);&lt;br&gt;&lt;br /&gt;-    count++;&lt;br&gt;&lt;br /&gt;-    }&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-    if((180/3.142)*atan((float)&lt;wbr&gt;count/offset)&amp;lt;10){&lt;br&gt;&lt;br /&gt;-    slope=(float)count/offset;&lt;br&gt;&lt;br /&gt;-    ifcounter++;&lt;br&gt;&lt;br /&gt;-    avg=(avg+slope);&lt;br&gt;&lt;br /&gt;-    }&lt;br&gt;&lt;br /&gt;-    count=0;&lt;br&gt;&lt;br 
/&gt;-    offset++;&lt;br&gt;&lt;br /&gt;-  }&lt;br&gt;&lt;br /&gt;-    avg=(float)avg/ifcounter;&lt;br&gt;&lt;br /&gt;-    //cout&amp;lt;&amp;lt;&amp;quot;avg=&amp;quot;&amp;lt;&amp;lt;avg&amp;lt;&amp;lt;&amp;quot;\n&amp;quot;;&lt;br&gt;&lt;br /&gt;-    page_image.write(&amp;quot;findskew.&lt;wbr&gt;tif&amp;quot;);&lt;br&gt;&lt;br /&gt;-    //cout&amp;lt;&amp;lt;&amp;quot;(180/3.142)*atan((&lt;wbr&gt;float)(count/offset)=&amp;quot;&amp;lt;&amp;lt;(180/&lt;wbr&gt;3.142)*atan(avg)&amp;lt;&amp;lt;&amp;quot;\n&amp;quot;;&lt;br&gt;&lt;br /&gt;-    return (sign*(180/3.142)*atan(avg));&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-if(sign==0)&lt;br&gt;&lt;br /&gt;-{return 0;}&lt;br&gt;&lt;br /&gt;-cout&amp;lt;&amp;lt;&amp;quot;SHIT&amp;quot;;&lt;br&gt;&lt;br /&gt;-return (0);&lt;br&gt;&lt;br /&gt;+       int topx=0,topy=0,sign,count=0,&lt;wbr&gt;offset=1,ifcounter=0;&lt;br&gt;&lt;br /&gt;+       float slope=-999,avg=0;&lt;br&gt;&lt;br /&gt;+       IMAGELINE line;&lt;br&gt;&lt;br /&gt;+       line.init(1);&lt;br&gt;&lt;br /&gt;+       line.pixels[0]=0;&lt;br&gt;&lt;br /&gt;+       ///////Find the top most point of the page: begins///////////&lt;br&gt;&lt;br /&gt;+               for(int y=height-1;y&amp;gt;0;y--){&lt;br&gt;&lt;br /&gt;+                       for(int x=width-1;x&amp;gt;0;x--){&lt;br&gt;&lt;br /&gt;+                               if(page_image.pixel(x,y)==0){&lt;br&gt;&lt;br /&gt;+                                       topx=x;topy=y;&lt;br&gt;&lt;br /&gt;+                                       break;&lt;br&gt;&lt;br /&gt;+                               }&lt;br&gt;&lt;br /&gt;+                               &lt;br&gt;&lt;br /&gt;+                       }&lt;br&gt;&lt;br /&gt;+                       &lt;br&gt;&lt;br /&gt;+                       if(topx&amp;gt;0){break;};&lt;br&gt;&lt;br /&gt;+               }&lt;br&gt;&lt;br /&gt;+       ///////Find the top most point of the page: ends///////////&lt;br&gt;&lt;br 
/&gt;+               &lt;br&gt;&lt;br /&gt;+               &lt;br&gt;&lt;br /&gt;+               ///////To find pages with no skew: begins//////////////&lt;br&gt;&lt;br /&gt;+               int c1,c2=0;&lt;br&gt;&lt;br /&gt;+       for(int x=1;x&amp;lt;.25*width;x++){&lt;br&gt;&lt;br /&gt;+               while(page_image.pixel((width/&lt;wbr&gt;2)+x,c1++)==1){ }&lt;br&gt;&lt;br /&gt;+               while(page_image.pixel((width/&lt;wbr&gt;2)-x,c2++)==1){ }&lt;br&gt;&lt;br /&gt;+               if(c1==c2){cout&amp;lt;&amp;lt;&amp;quot;0 ANGLE\n&amp;quot;;return (0);}&lt;br&gt;&lt;br /&gt;+               c1=c2=0;&lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       ///////To find pages with no skew: ends//////////////&lt;br&gt;&lt;br /&gt;+               &lt;br&gt;&lt;br /&gt;+               cout&amp;lt;&amp;lt;&amp;quot;width=&amp;quot;&amp;lt;&amp;lt;width;&lt;br&gt;&lt;br /&gt;+       if(topx&amp;gt;0 &amp;amp;&amp;amp; topx&amp;lt;.5*width){&lt;br&gt;&lt;br /&gt;+               sign=1;&lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       if(topx&amp;gt;0 &amp;amp;&amp;amp; topx&amp;gt;.5*width){&lt;br&gt;&lt;br /&gt;+               sign=-1;&lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       if(sign==-1){&lt;br&gt;&lt;br /&gt;+               while((topx-offset)&amp;gt;width/2){&lt;br&gt;&lt;br /&gt;+                       while(page_image.pixel(topx-&lt;wbr&gt;offset,topy-count)==1){&lt;br&gt;&lt;br /&gt;+                               //page_image.put_line(topx-&lt;wbr&gt;offset,topy-count,1,&amp;amp;line,0);&lt;br&gt;&lt;br /&gt;+                               count++;&lt;br&gt;&lt;br /&gt;+                       }&lt;br&gt;&lt;br /&gt;+                       &lt;br&gt;&lt;br /&gt;+                       if((180/3.142)*atan((float)&lt;wbr&gt;count/offset)&amp;lt;10){&lt;br&gt;&lt;br /&gt;+                               slope=(float)count/offset;&lt;br&gt;&lt;br /&gt;+               
                ifcounter++;&lt;br&gt;&lt;br /&gt;+                               avg=(avg+slope);&lt;br&gt;&lt;br /&gt;+                       }&lt;br&gt;&lt;br /&gt;+                       count=0;&lt;br&gt;&lt;br /&gt;+                       offset++;&lt;br&gt;&lt;br /&gt;+               }&lt;br&gt;&lt;br /&gt;+               avg=(float)avg/ifcounter;&lt;br&gt;&lt;br /&gt;+               //cout&amp;lt;&amp;lt;&amp;quot;avg=&amp;quot;&amp;lt;&amp;lt;avg&amp;lt;&amp;lt;&amp;quot;\n&amp;quot;;&lt;br&gt;&lt;br /&gt;+               page_image.write(&amp;quot;findskew.&lt;wbr&gt;tif&amp;quot;);&lt;br&gt;&lt;br /&gt;+               //cout&amp;lt;&amp;lt;&amp;quot;(180/3.142)*atan((&lt;wbr&gt;float)(count/offset)=&amp;quot;&amp;lt;&amp;lt;(180/&lt;wbr&gt;3.142)*atan(avg)&amp;lt;&amp;lt;&amp;quot;\n&amp;quot;;&lt;br&gt;&lt;br /&gt;+               return (sign*(180/3.142)*atan(avg));&lt;br&gt;&lt;br /&gt;+               &lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       if(sign==1){&lt;br&gt;&lt;br /&gt;+               while((topx+offset)&amp;lt;width/2){&lt;br&gt;&lt;br /&gt;+                       while(page_image.pixel(topx+&lt;wbr&gt;offset,topy-count)==1){&lt;br&gt;&lt;br /&gt;+                               //page_image.put_line(topx+&lt;wbr&gt;offset,topy-count,1,&amp;amp;line,0);&lt;br&gt;&lt;br /&gt;+                               count++;&lt;br&gt;&lt;br /&gt;+                       }&lt;br&gt;&lt;br /&gt;+                       &lt;br&gt;&lt;br /&gt;+                       if((180/3.142)*atan((float)&lt;wbr&gt;count/offset)&amp;lt;10){&lt;br&gt;&lt;br /&gt;+                               slope=(float)count/offset;&lt;br&gt;&lt;br /&gt;+                               ifcounter++;&lt;br&gt;&lt;br /&gt;+                               avg=(avg+slope);&lt;br&gt;&lt;br /&gt;+                       }&lt;br&gt;&lt;br /&gt;+                       count=0;&lt;br&gt;&lt;br /&gt;+                       offset++;&lt;br&gt;&lt;br /&gt;+               
}&lt;br&gt;&lt;br /&gt;+               avg=(float)avg/ifcounter;&lt;br&gt;&lt;br /&gt;+               //cout&amp;lt;&amp;lt;&amp;quot;avg=&amp;quot;&amp;lt;&amp;lt;avg&amp;lt;&amp;lt;&amp;quot;\n&amp;quot;;&lt;br&gt;&lt;br /&gt;+               page_image.write(&amp;quot;findskew.&lt;wbr&gt;tif&amp;quot;);&lt;br&gt;&lt;br /&gt;+               //cout&amp;lt;&amp;lt;&amp;quot;(180/3.142)*atan((&lt;wbr&gt;float)(count/offset)=&amp;quot;&amp;lt;&amp;lt;(180/&lt;wbr&gt;3.142)*atan(avg)&amp;lt;&amp;lt;&amp;quot;\n&amp;quot;;&lt;br&gt;&lt;br /&gt;+               return (sign*(180/3.142)*atan(avg));&lt;br&gt;&lt;br /&gt;+               &lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       if(sign==0)&lt;br&gt;&lt;br /&gt;+       {return 0;}&lt;br&gt;&lt;br /&gt;+       cout&amp;lt;&amp;lt;&amp;quot;SHIT&amp;quot;;&lt;br&gt;&lt;br /&gt;+       return (0);&lt;br&gt;&lt;br /&gt; }&lt;br&gt;&lt;br /&gt; ////////////DEBAYAN//Find skew ends///////////////////&lt;br&gt;&lt;br&gt;&lt;br /&gt;@@ -573,101 +573,101 @@&lt;br&gt;&lt;br /&gt; //used only if the language belongs to devnagri, eg, ben, hin etc.&lt;br&gt;&lt;br /&gt; void TessBaseAPI::ClipMaatraa(int height, int width)&lt;br&gt;&lt;br /&gt; {&lt;br&gt;&lt;br /&gt;-IMAGELINE line;&lt;br&gt;&lt;br /&gt;-line.init(width);&lt;br&gt;&lt;br /&gt;-int count,count1=0,blackpixels[&lt;wbr&gt;height-1][2],arr_row=0,maxbp=&lt;wbr&gt;0,maxy=0,matras[100][3],char_&lt;wbr&gt;height;&lt;br&gt;&lt;br /&gt;-//cout&amp;lt;&amp;lt;&amp;quot;Connected Script=&amp;quot;&amp;lt;&amp;lt;connected_script&amp;lt;&amp;lt;&amp;quot;\&lt;wbr&gt;n&amp;quot;;&lt;br&gt;&lt;br /&gt;-       &lt;br&gt;&lt;br /&gt;-for(int y=0; y&amp;lt;height-1;y++){&lt;br&gt;&lt;br /&gt;-  count=0;     &lt;br&gt;&lt;br /&gt;-  for(int x=0;x&amp;lt;width-1;x++){&lt;br&gt;&lt;br /&gt;-   if(page_image.pixel(x,y)==0)&lt;br&gt;&lt;br /&gt;-     {count++;}&lt;br&gt;&lt;br /&gt;-  }&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-  
if(count&amp;gt;=.05*width){&lt;br&gt;&lt;br /&gt;-    blackpixels[arr_row][0]=y;&lt;br&gt;&lt;br /&gt;-    blackpixels[arr_row][1]=&lt;wbr&gt;count;&lt;br&gt;&lt;br /&gt;-    arr_row++;&lt;br&gt;&lt;br /&gt;-  }&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-blackpixels[arr_row][0]=&lt;wbr&gt;blackpixels[arr_row][1]=&amp;#39;\0&amp;#39;;&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-for(int x=0;x&amp;lt;width-1;x++){  //Black Line&lt;br&gt;&lt;br /&gt;-  line.pixels[x]=0;&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-////////////line_through_&lt;wbr&gt;matra() begins//////////////////////&lt;br&gt;&lt;br /&gt;-count=1;&lt;br&gt;&lt;br /&gt;-//cout&amp;lt;&amp;lt;&amp;quot;\nHeight=&amp;quot;&amp;lt;&amp;lt;height&amp;lt;&amp;lt;&lt;wbr&gt;&amp;quot; arr_row=&amp;quot;&amp;lt;&amp;lt;arr_row&amp;lt;&amp;lt;&amp;quot;\n&amp;quot;;&lt;br&gt;&lt;br /&gt;-char_height=blackpixels[0][0]&lt;wbr&gt;; //max character height per sentence&lt;br&gt;&lt;br /&gt;-//cout&amp;lt;&amp;lt;&amp;quot;Char Height Init=&amp;quot;&amp;lt;&amp;lt;char_height;&lt;br&gt;&lt;br /&gt;-while(count&amp;lt;=arr_row){&lt;br&gt;&lt;br /&gt;-         //if(count==0){max=&lt;wbr&gt;blackpixels[count][0];}&lt;br&gt;&lt;br /&gt;-  if((blackpixels[count][0]-&lt;wbr&gt;blackpixels[count-1][0]==1) &amp;amp;&amp;amp; (blackpixels[count][1]&amp;gt;=maxbp)&lt;wbr&gt;){&lt;br&gt;&lt;br /&gt;-           maxbp=blackpixels[count][1];&lt;br&gt;&lt;br /&gt;-    maxy=blackpixels[count][0];&lt;br&gt;&lt;br /&gt;-    //cout&amp;lt;&amp;lt;&amp;quot;\nMax=&amp;quot;&amp;lt;&amp;lt;maxy&amp;lt;&amp;lt;&amp;quot; bpc=&amp;quot;&amp;lt;&amp;lt;maxbp;&lt;br&gt;&lt;br /&gt;-  }&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-         if((blackpixels[count][0]-&lt;wbr&gt;blackpixels[count-1][0])!=1){&lt;br&gt;&lt;br /&gt;-           /////////////drawline(max)////&lt;wbr&gt;//////////////////&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-       //      
cout&amp;lt;&amp;lt;&amp;quot;\nmax=&amp;quot;&amp;lt;&amp;lt;maxy&amp;lt;&amp;lt;&amp;quot; bpc=&amp;quot;&amp;lt;&amp;lt;maxbp;&lt;br&gt;&lt;br /&gt;-//      page_image.put_line(0,maxy,&lt;wbr&gt;width,&amp;amp;line,0);&lt;br&gt;&lt;br /&gt;-             char_height=blackpixels[count-&lt;wbr&gt;1][0]-char_height;&lt;br&gt;&lt;br /&gt;-             matras[count1][0]=maxy; matras[count1][1]=maxbp; matras[count1][2]=char_height; count1++;&lt;br&gt;&lt;br /&gt;-      char_height=blackpixels[&lt;wbr&gt;count][0];&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-           //////////// drawline(max)/////////////////&lt;wbr&gt;////&lt;br&gt;&lt;br /&gt;-           maxbp=blackpixels[count][1];&lt;br&gt;&lt;br /&gt;-         }&lt;br&gt;&lt;br /&gt;-  count++;&lt;br&gt;&lt;br /&gt;-       }&lt;br&gt;&lt;br /&gt;-matras[count1][0]=matras[&lt;wbr&gt;count1][1]=matras[count1][2]=&amp;#39;&lt;wbr&gt;\0&amp;#39;;&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-//delete blackpixels;  &lt;br&gt;&lt;br /&gt;-////////////line_through_&lt;wbr&gt;matra() ends//////////////////////&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-       ////////////clip_matras() begins////////////////////////&lt;wbr&gt;///&lt;br&gt;&lt;br /&gt;-       for(int i=0;i&amp;lt;100;i++){ //where 100=max number of sentences per page&lt;br&gt;&lt;br /&gt;-  if(matras[i][0]==&amp;#39;\0&amp;#39;){break;&lt;wbr&gt;}&lt;br&gt;&lt;br /&gt;-  //cout&amp;lt;&amp;lt;&amp;quot;\nY=&amp;quot;&amp;lt;&amp;lt;matras[i][0]&amp;lt;&lt;wbr&gt;&amp;lt;&amp;quot; bpc=&amp;quot;&amp;lt;&amp;lt;matras[i][1]&amp;lt;&amp;lt;&amp;quot; chheight=&amp;quot;&amp;lt;&amp;lt;matras[i][2];&lt;br&gt;&lt;br /&gt;-  count=i;&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-       &lt;br&gt;&lt;br /&gt;-for(int i=0;i&amp;lt;=count;i++){&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-  for(int x=0;x&amp;lt;width-1;x++){&lt;br&gt;&lt;br /&gt;-    if(page_image.pixel(x,matras[&lt;wbr&gt;i][0])==0){&lt;br&gt;&lt;br /&gt;-             count1=0;&lt;br&gt;&lt;br 
/&gt;-      for(int y=0;y&amp;lt;matras[i][2] &amp;amp;&amp;amp; count1==0;y++){&lt;br&gt;&lt;br /&gt;-               if(page_image.pixel(x,matras[&lt;wbr&gt;i][0]-y)==1){count1++;&lt;br&gt;&lt;br /&gt;-                 for(int z=y+1;z&amp;lt;matras[i][2];z++){&lt;br&gt;&lt;br /&gt;-                   if(page_image.pixel(x,matras[&lt;wbr&gt;i][0]-z)==1){count1++;}//black pixel encountered... stop counting.&lt;br&gt;&lt;br /&gt;-                   else&lt;br&gt;&lt;br /&gt;-                   {break;}&lt;br&gt;&lt;br /&gt;-                 }&lt;br&gt;&lt;br /&gt;-            }&lt;br&gt;&lt;br /&gt;-       }&lt;br&gt;&lt;br /&gt;-      //cout&amp;lt;&amp;lt;&amp;quot;\nWPR @ &amp;quot;&amp;lt;&amp;lt;x&amp;lt;&amp;lt;&amp;quot;,&amp;quot;&amp;lt;&amp;lt;matras[i][0]&amp;lt;&amp;lt;&amp;quot;=&amp;quot;&amp;lt;&amp;lt;&lt;wbr&gt;count1;&lt;br&gt;&lt;br /&gt;-      if(count1&amp;gt;.8*matras[i][2]){&lt;br&gt;&lt;br /&gt;-        line.init(matras[i][2]+5);&lt;br&gt;&lt;br /&gt;-        for(int j=0;j&amp;lt;matras[i][2]+5;j++){&lt;wbr&gt;line.pixels[j]=1;}&lt;br&gt;&lt;br /&gt;-        page_image.put_column(x,&lt;wbr&gt;matras[i][0]-matras[i][2],&lt;wbr&gt;matras[i][2]+5,&amp;amp;line,0);&lt;br&gt;&lt;br /&gt;-      }&lt;br&gt;&lt;br /&gt;-    }&lt;br&gt;&lt;br /&gt;-         }&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-}&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-page_image.write(&amp;quot;bentest.&lt;wbr&gt;tif&amp;quot;);&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;+       IMAGELINE line;&lt;br&gt;&lt;br /&gt;+       line.init(width);&lt;br&gt;&lt;br /&gt;+       int count,count1=0,blackpixels[&lt;wbr&gt;height-1][2],arr_row=0,maxbp=&lt;wbr&gt;0,maxy=0,matras[100][3],char_&lt;wbr&gt;height;&lt;br&gt;&lt;br /&gt;+       //cout&amp;lt;&amp;lt;&amp;quot;Connected Script=&amp;quot;&amp;lt;&amp;lt;connected_script&amp;lt;&amp;lt;&amp;quot;\&lt;wbr&gt;n&amp;quot;;&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       for(int y=0; 
y&amp;lt;height-1;y++){&lt;br&gt;&lt;br /&gt;+               count=0;        &lt;br&gt;&lt;br /&gt;+               for(int x=0;x&amp;lt;width-1;x++){&lt;br&gt;&lt;br /&gt;+                       if(page_image.pixel(x,y)==0)&lt;br&gt;&lt;br /&gt;+                       {count++;}&lt;br&gt;&lt;br /&gt;+               }&lt;br&gt;&lt;br /&gt;+               &lt;br&gt;&lt;br /&gt;+               if(count&amp;gt;=.05*width){&lt;br&gt;&lt;br /&gt;+                       blackpixels[arr_row][0]=y;&lt;br&gt;&lt;br /&gt;+                       blackpixels[arr_row][1]=count;&lt;br&gt;&lt;br /&gt;+                       arr_row++;&lt;br&gt;&lt;br /&gt;+               }&lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       blackpixels[arr_row][0]=&lt;wbr&gt;blackpixels[arr_row][1]=&amp;#39;\0&amp;#39;;&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       for(int x=0;x&amp;lt;width-1;x++){  //Black Line&lt;br&gt;&lt;br /&gt;+               line.pixels[x]=0;&lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       ////////////line_through_&lt;wbr&gt;matra() begins//////////////////////&lt;br&gt;&lt;br /&gt;+       count=1;&lt;br&gt;&lt;br /&gt;+       //cout&amp;lt;&amp;lt;&amp;quot;\nHeight=&amp;quot;&amp;lt;&amp;lt;height&amp;lt;&amp;lt;&amp;quot; arr_row=&amp;quot;&amp;lt;&amp;lt;arr_row&amp;lt;&amp;lt;&amp;quot;\n&amp;quot;;&lt;br&gt;&lt;br /&gt;+       char_height=blackpixels[0][0]; //max character height per sentence&lt;br&gt;&lt;br /&gt;+       //cout&amp;lt;&amp;lt;&amp;quot;Char Height Init=&amp;quot;&amp;lt;&amp;lt;char_height;&lt;br&gt;&lt;br /&gt;+       while(count&amp;lt;=arr_row){&lt;br&gt;&lt;br /&gt;+               //if(count==0){max=&lt;wbr&gt;blackpixels[count][0];}&lt;br&gt;&lt;br /&gt;+               if((blackpixels[count][0]-&lt;wbr&gt;blackpixels[count-1][0]==1) &amp;amp;&amp;amp; (blackpixels[count][1]&amp;gt;=maxbp)&lt;wbr&gt;){&lt;br&gt;&lt;br /&gt;+                       
maxbp=blackpixels[count][1];&lt;br&gt;&lt;br /&gt;+                       maxy=blackpixels[count][0];&lt;br&gt;&lt;br /&gt;+                       //cout&amp;lt;&amp;lt;&amp;quot;\nMax=&amp;quot;&amp;lt;&amp;lt;maxy&amp;lt;&amp;lt;&amp;quot; bpc=&amp;quot;&amp;lt;&amp;lt;maxbp;&lt;br&gt;&lt;br /&gt;+               }&lt;br&gt;&lt;br /&gt;+               &lt;br&gt;&lt;br /&gt;+               if((blackpixels[count][0]-&lt;wbr&gt;blackpixels[count-1][0])!=1){&lt;br&gt;&lt;br /&gt;+                       /////////////drawline(max)////&lt;wbr&gt;//////////////////&lt;br&gt;&lt;br /&gt;+                               &lt;br&gt;&lt;br /&gt;+                               //      cout&amp;lt;&amp;lt;&amp;quot;\nmax=&amp;quot;&amp;lt;&amp;lt;maxy&amp;lt;&amp;lt;&amp;quot; bpc=&amp;quot;&amp;lt;&amp;lt;maxbp;&lt;br&gt;&lt;br /&gt;+                               //      page_image.put_line(0,maxy,&lt;wbr&gt;width,&amp;amp;line,0);&lt;br&gt;&lt;br /&gt;+                               char_height=blackpixels[count-&lt;wbr&gt;1][0]-char_height;&lt;br&gt;&lt;br /&gt;+                       matras[count1][0]=maxy; matras[count1][1]=maxbp; matras[count1][2]=char_height; count1++;&lt;br&gt;&lt;br /&gt;+                       char_height=blackpixels[count]&lt;wbr&gt;[0];&lt;br&gt;&lt;br /&gt;+                       &lt;br&gt;&lt;br /&gt;+                       //////////// drawline(max)/////////////////&lt;wbr&gt;////&lt;br&gt;&lt;br /&gt;+                       maxbp=blackpixels[count][1];&lt;br&gt;&lt;br /&gt;+               }&lt;br&gt;&lt;br /&gt;+               count++;&lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       matras[count1][0]=matras[&lt;wbr&gt;count1][1]=matras[count1][2]=&amp;#39;&lt;wbr&gt;\0&amp;#39;;&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       //delete blackpixels;   &lt;br&gt;&lt;br /&gt;+       ////////////line_through_&lt;wbr&gt;matra() ends//////////////////////&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       
////////////clip_matras() begins////////////////////////&lt;wbr&gt;///&lt;br&gt;&lt;br /&gt;+       for(int i=0;i&amp;lt;100;i++){ //where 100=max number of sentences per page&lt;br&gt;&lt;br /&gt;+               if(matras[i][0]==&amp;#39;\0&amp;#39;){break;}&lt;br&gt;&lt;br /&gt;+               //cout&amp;lt;&amp;lt;&amp;quot;\nY=&amp;quot;&amp;lt;&amp;lt;matras[i][0]&amp;lt;&amp;lt;&lt;wbr&gt;&amp;quot; bpc=&amp;quot;&amp;lt;&amp;lt;matras[i][1]&amp;lt;&amp;lt;&amp;quot; chheight=&amp;quot;&amp;lt;&amp;lt;matras[i][2];&lt;br&gt;&lt;br /&gt;+               count=i;&lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       for(int i=0;i&amp;lt;=count;i++){&lt;br&gt;&lt;br /&gt;+               &lt;br&gt;&lt;br /&gt;+               for(int x=0;x&amp;lt;width-1;x++){&lt;br&gt;&lt;br /&gt;+                       if(page_image.pixel(x,matras[&lt;wbr&gt;i][0])==0){&lt;br&gt;&lt;br /&gt;+                               count1=0;&lt;br&gt;&lt;br /&gt;+                               for(int y=0;y&amp;lt;matras[i][2] &amp;amp;&amp;amp; count1==0;y++){&lt;br&gt;&lt;br /&gt;+                                       if(page_image.pixel(x,matras[&lt;wbr&gt;i][0]-y)==1){count1++;&lt;br&gt;&lt;br /&gt;+                                               for(int z=y+1;z&amp;lt;matras[i][2];z++){&lt;br&gt;&lt;br /&gt;+                                                       if(page_image.pixel(x,matras[&lt;wbr&gt;i][0]-z)==1){count1++;}//black pixel encountered... 
stop counting.&lt;br&gt;&lt;br /&gt;+                                                               else&lt;br&gt;&lt;br /&gt;+                                                               {break;}&lt;br&gt;&lt;br /&gt;+                                               }&lt;br&gt;&lt;br /&gt;+                                       }&lt;br&gt;&lt;br /&gt;+                               }&lt;br&gt;&lt;br /&gt;+                               //cout&amp;lt;&amp;lt;&amp;quot;\nWPR @ &amp;quot;&amp;lt;&amp;lt;x&amp;lt;&amp;lt;&amp;quot;,&amp;quot;&amp;lt;&amp;lt;matras[i][0]&amp;lt;&amp;lt;&amp;quot;=&amp;quot;&amp;lt;&amp;lt;&lt;wbr&gt;count1;&lt;br&gt;&lt;br /&gt;+                               if(count1&amp;gt;.8*matras[i][2]){&lt;br&gt;&lt;br /&gt;+                                       line.init(matras[i][2]+5);&lt;br&gt;&lt;br /&gt;+                                       for(int j=0;j&amp;lt;matras[i][2]+5;j++){&lt;wbr&gt;line.pixels[j]=1;}&lt;br&gt;&lt;br /&gt;+                                       page_image.put_column(x,&lt;wbr&gt;matras[i][0]-matras[i][2],&lt;wbr&gt;matras[i][2]+5,&amp;amp;line,0);&lt;br&gt;&lt;br /&gt;+                               }&lt;br&gt;&lt;br /&gt;+                       }&lt;br&gt;&lt;br /&gt;+               }&lt;br&gt;&lt;br /&gt;+               &lt;br&gt;&lt;br /&gt;+       }&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       page_image.write(&amp;quot;bentest.tif&amp;quot;&lt;wbr&gt;);&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;        ////////////clip_matras() ends//////////////////////////&lt;wbr&gt;///&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-/////////DEBAYAN/////////////&lt;wbr&gt;////&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;-&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       /////////DEBAYAN//////////////&lt;wbr&gt;///&lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt;+       &lt;br&gt;&lt;br /&gt; }&lt;br&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;br 
/&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;October 22&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;It's 4:45 AM. &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;Task number one for Indic OCR workout participants at foss.in 2008&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Implement &lt;a href="http://www.scanhelp.com/ScanEdu/deskew/deskew.html"&gt;deskewing&lt;/a&gt; code (that is, code that straightens a tilted image) in any language you like. Any good standard algorithm will do. Test it on the image below. &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="tilt.tif/tilt-full;init:.tif" imageanchor="1"&gt;&lt;img src="tilt.tif/tilt-medium;init:.jpg" style="border: 0pt none ;"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center;"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Then mail the result to me at debayanin AT gmail DOT com, or post it on any of the mailing lists.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;I think &lt;a href="http://danthorpe.me.uk/blog/2005/02/24/Implementing_the_Hough_Transform"&gt;Hough transforms&lt;/a&gt; would be the best approach. I have had some difficulty implementing this in Python for the test images, but the theory is sound and should ultimately give good results.&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;October 14 2008&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;It's 5:05 AM. It feels so nice to revisit this page after 4 long months. I have seen a lot in these 4 months. 
Hmm...&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;So everything added to this page will act as a reference for the &lt;a href="http://www.indlinux.org/wiki/index.php/FOSS.IN2008"&gt;workout&lt;/a&gt; proposed at &lt;a href="http://foss.in/"&gt;foss.in 2008&lt;/a&gt;, and also for the work on the &lt;a href="http://www.sarai.net/fellowships/floss"&gt;Sarai FLOSS fellowship&lt;/a&gt; (which I am not sure about yet).&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Note: TesseractIndic is &lt;a href="http://code.google.com/p/tesseract-ocr/"&gt;Tesseract-OCR&lt;/a&gt; with Indic script support. It will remain a separate project until Tesseract-OCR decides to accept the patches and merge Indic script support. TesseractIndic can be found &lt;a href="http://code.google.com/p/tesseractindic/"&gt;here&lt;/a&gt;. &lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;So let's see where we stand. We have Tesseract-OCR, which works great for English. I managed to apply &amp;quot;&lt;a href="http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf"&gt;maatraa&lt;/a&gt; &lt;a href="http://sites.google.com/site/ocropus/morphological-operations"&gt;clipping&lt;/a&gt;&amp;quot; (which I think is a new term/approach in the world of OCR!) successfully, as a proof of concept, to the image being fed to the Tesseract OCR engine. Accuracy obtained by this method, along with some really crappy training, stands at about 85%.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;A standard OCR process contains the following steps:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;(1) Pre-processing, involving skew removal, etc. 
Pretty much&lt;br&gt;&lt;br /&gt;    language-independent, though features like the shirorekha&lt;br&gt;&lt;br /&gt;    might help here.&lt;br&gt;&lt;br /&gt;(2) Character extraction: Again, largely language-independent,&lt;br&gt;&lt;br /&gt;    though language dependency might come in because of&lt;br&gt;&lt;br /&gt;    features like shirorekha.&lt;br&gt;&lt;br /&gt;(3) Character identification: Language independent, maybe with&lt;br&gt;&lt;br /&gt;    specialised plugins to take advantage of language features,&lt;br&gt;&lt;br /&gt;    or items like known fonts.&lt;br&gt;&lt;br /&gt;(4) Post-processing, which involves things like spell-checking to&lt;br&gt;&lt;br /&gt;    improve accuracy.&lt;br /&gt;  &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The currently available version of Tesseract OCR does steps 3 and 4 above for any language, but only if step 2 works properly, which it does not for connected scripts such as Hindi and Bengali. So the approach is to take the scanned image, apply some pre-processing to it, and then perform the &amp;quot;maatraa clipping&amp;quot; operation on it. The resulting image is then fed to the Tesseract-OCR engine.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;In detail, the things to do are:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;(1) Pre-processing: &lt;a href="http://tesseractindic.googlecode.com/files/skew_deskew.pdf"&gt;Skew removal&lt;/a&gt; and noise removal. Skew removal in particular is key for the &amp;quot;maatraa clipping&amp;quot; code to work, because that code looks for the maatraa along horizontal rows of pixels.&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;(2) &amp;quot;Maatraa clipping&amp;quot;: This enables the Tesseract-OCR engine to treat connected Indic scripts such as Devanagari and Bengali like any other script.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;(3) &lt;a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract"&gt;Training&lt;/a&gt;: Very important for getting good results, but well documented. 
Good tools exist for training Tesseract-OCR.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;(4) Web Interface: We need to create a web interface so people can freely OCR their documents online. No big deal. &lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Now my intention is to implement skew removal using &lt;a href="http://yaniv.leviathanonline.com/blog/math/understanding-soccer/"&gt;Hough transforms&lt;/a&gt;. &lt;a href="http://danthorpe.me.uk/blog/2005/02/24/Implementing_the_Hough_Transform"&gt;Hough transforms&lt;/a&gt; are really good at finding straight lines (among other shapes) in images. So all we need to do is find the &amp;quot;maatraas&amp;quot; and calculate their slope. That slope gives us the skew angle, and we just rotate the page by it to correct the skew.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;I had implemented &amp;quot;maatraa clipping&amp;quot; using projection-based methods. It seems there is a digital image processing technique called &amp;quot;morphological operations&amp;quot; that may be a better way of doing it. Well, actually I am not that sure about it yet. Still researching and trying out stuff.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Now, I had done all this work in C++, as the Tesseract-OCR code is also in C++. But, of late, I have been mesmerised by the simplicity and power of Python and the Python Imaging Library. All the work I am doing now, including Hough transforms, is in Python. So now we have 2 options:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;(1) Do the pre-processing and &amp;quot;maatraa clipping&amp;quot; in Python and feed the page to Tesseract-OCR (will be easier and quicker to implement)&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;(2) Do the entire thing in C++ (will execute much faster)&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Again, we will probably end up doing both. At foss.in, I will probably bring along Python code that already works, and ask people to port it to C++ and merge it upstream into TesseractIndic. 
Or we could ask people to implement algorithms of their choice, in the language of their choice, on a common set of test images, and then convert that code to C++ and add it.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Will go to sleep now. This page will keep growing on a daily basis from now on, so keep checking it.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;PS: Special thanks to Mr. Sankarshan and Mr. Gora Mohanty for supporting me throughout.&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;June 24 2008&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;It's 01:45 AM. Here is a mail I received from Mr. Gora Mohanty in reply to my last post below, dated 23 June, and also to a mail I sent to the Aksharbodh mailing list:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;i&gt;&lt;br&gt;&lt;br /&gt;1. What were the issues with displaying Bengali fonts on Linux? Sample&lt;br&gt;&lt;br /&gt;   images would help, as you do not give enough details to go on. People&lt;br&gt;&lt;br /&gt;   on the indlinux-group mailing list (go to &lt;a href="http://indlinux.org/" target="_blank"&gt;http://indlinux.org&lt;/a&gt; and the&lt;br&gt;&lt;br /&gt;   Mailing Lists link towards the bottom to subscribe), and the Ankur&lt;br&gt;&lt;br /&gt;   Bangla folk ought to be able to help you out here.&lt;br&gt;&lt;br /&gt;   What GUI toolkit is tesseractTrainer.py using? Both gtk, and QT should&lt;br&gt;&lt;br /&gt;   work fine for Bengali text, at least in any UTF-8 locale.&lt;br&gt;&lt;br&gt;&lt;br /&gt;2. Could you give figures for % accuracy? Not all of us can read Bangla.&lt;br&gt;&lt;br&gt;&lt;br /&gt;3. Is there any documentation on what training involves, and what kind&lt;br&gt;&lt;br /&gt;   of training text you need? 
Could you use the copious amount of Bangla text&lt;br&gt;&lt;br /&gt;   in the GNOME/KDE .po files for Bengali?&lt;br&gt;&lt;br&gt;&lt;br /&gt;Regards,&lt;br&gt;&lt;br /&gt;Gora&lt;/i&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;i&gt;Ans 1) &lt;/i&gt;The issues with displaying Bengali fonts in tesseractTrainer.py have mysteriously resolved themselves!! It does display Bengali text now. Here is a screenshot:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="Screenshot-3.png/Screenshot-3-full;init:.png" imageanchor="1"&gt;&lt;img src="Screenshot-3.png/Screenshot-3-large.png" style="border: 0pt none ;" height="262" width="420"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;It shows that I can now type and read Bengali anywhere on my Linux installation, including gedit, Mozilla, the terminal and tesseractTrainer.py.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;i&gt;Ans 2) &lt;/i&gt;The two test images below gave an accuracy of 89% and 85% respectively. These figures are not exact; I just did a quick one-time count of the errors. Some of the errors occurred because I haven't trained the engine with the particular character yet, and some because I fed it the wrong character.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;i&gt;Ans 3) &lt;/i&gt;Well, the entire training process is described at &lt;a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract"&gt;http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract&lt;/a&gt;, but what I really need is an image, either scanned or electronically rendered from some text editor, that contains as many samples of characters as possible, including conjuncts for every character, and then its box file. 
So the steps are:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;1) Type all the possible characters+conjuncts in an editor (for example, type ক কি কী কা কু ক then খ খা খী খি  etc). Increase the font size a little bit.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;2) Use a screen-capture tool, and crop the resulting image to the part that contains the characters and nothing else.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;3) Convert the image to tif format. Then run it through tesseract downloaded from &lt;a href="http://tesseractindic.googlecode.com/files/tesseract_indic_alpha%200.1.tar.gz"&gt;here&lt;/a&gt; using this command:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre class="prettyprint"&gt;&lt;span class="pln"&gt;&lt;span class="pln"&gt;tesseract fontfile&lt;/span&gt;&lt;/span&gt;&lt;span class="pun"&gt;&lt;span class="pun"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;span class="pln"&gt;&lt;span class="pln"&gt;tif fontfile batch&lt;/span&gt;&lt;/span&gt;&lt;span class="pun"&gt;&lt;span class="pun"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;span class="pln"&gt;&lt;span class="pln"&gt;nochop makebox&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;where fontfile is the name of the image without its extension. This will create a new file named fontfile.txt. Change its extension to .box.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;4) Now download tesseractTrainer.py or bbtesseract, open this image in it, and edit the box file.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;5) Mail the image and box file to me.&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;June 23 2008&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;It's 11:54 AM. Have been training Tesseract_indic with images of Bengali text. 
A lot of time was spent fixing segfaults, memory allocation errors etc. It did slow me down. I also have my college placements to prepare for, so I could not devote a lot of time.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Training was made difficult by the fact that &lt;a href="http://tesseract-ocr.googlegroups.com/web/tesseractTrainer.py?gda=35MVt0QAAAAvxi9ZSUJhM_2igjg08MDnyfRYBGjZguMPIdJdItjJS2G1qiJ7UbTIup-M2XPURDRcZZ0vhkHhli_lSDaLjUiAnB_jHy27AG4zjqSr60EL7w"&gt;tesseractTrainer.py&lt;/a&gt; did not display Bengali fonts properly on Linux, although I have the locales set properly and I can type in Bengali anywhere on my Linux installation. Initially I had to swap frequently between Windows and Linux, since I was using bbtesseract, a utility that edits box files for Tesseract training images. Both utilities are very useful, and I can imagine how hard it would have been without them.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The test results are poor as of now. I haven't trained it properly, and the maatraa clipping code has to be improved for the results to be of acceptable quality.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Here are the three images I trained it with:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="bentest1.tif/bentest1-full;init:.tif" imageanchor="1"&gt;&lt;img src="bentest1.tif/bentest1-medium;init:.jpg" style="border: 0pt none ;"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="train1.tif/train1-full;init:.tif" imageanchor="1"&gt;&lt;img src="train1.tif/train1-medium;init:.jpg" style="border: 0pt none ;"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p 
style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="train2.jpg/train2-full;init:.jpg" imageanchor="1"&gt;&lt;img src="train2.jpg/train2-medium;init:.jpg" style="border: 0pt none ;"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;The corresponding box files are at http://tesseractindic.googlecode.com/files/imges_boxes.tar.gz . &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;If you want to help out by training it, download &lt;a href="http://tesseractindic.googlecode.com/files/tesseract_indic_alpha.tar.gz"&gt;this archive&lt;/a&gt;, and then follow &lt;a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract"&gt;the training instructions&lt;/a&gt;.&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;The patches &lt;a href="http://www.win.tue.nl/%7Eaeb/linux/ocr/tesseract.html"&gt;here&lt;/a&gt; helped a lot. I do not know why these haven't been integrated into Tesseract. Also, I managed to get a 1,52,000-word (152,000) wordlist from the &lt;a href="http://www.bengalinux.org/projects/dictionary/"&gt;Ankur Bangla project&lt;/a&gt;. It should improve the accuracy a lot. Initially it had some strange characters, but I cleaned it up with sed and merged all the words into one big file.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;I need people to start submitting training data, either to me or to the Tesseract group. I will make a few changes to the maatraa clipping code and mail the patch to Ray Smith. 
Lets see what happens.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;Initial results are here:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="bengali5.jpg/bengali5-full;init:.jpg" imageanchor="1"&gt;&lt;img src="bengali5.jpg/bengali5-medium;init:.jpg" style="border: 0pt none ;"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center;"&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;OCRed text:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;খযাব শনবীর খরওজায়&lt;br&gt;মাওলানা েমাহাম্‌মদ অাব্‌দুল েমাতালিব খশহীদ&lt;br&gt;েক অাছ ভাইখ নবীর েপ্রমিক চল নিডয় অামাডক&lt;br&gt;সাত সমুদ্র েপড়িেয় যাব নূর নবীজির রওজােত া ঐ&lt;br&gt;পাপি অামি কি েয করি রওজায় য৮ব েনই েকা কড়ি (২ব৮রহ&lt;br&gt;পাখা যদি থাকত অামার ভর করিতাম পাখােত া ঐ&lt;br&gt;দয়াল নবী দয়ায় ভরা েপালােমের দাওেপা ধরা (২বারহ&lt;br&gt;মাওক নিনা অাডশক হডয় েকমডন থযাব জারIডত া ঐ&lt;br&gt;ওেগা নবী গেলর মালা দ্রহর কের খদাও দ্রমদ্রেনদ্রর  খজাত াা (২বা রদ্রহ&lt;br&gt;ঠভাইদ্র দিডয় চরন তডল বনর কর অামাডক া ঐ&lt;br&gt;পবীর রাডত কনদি বডস নবীজিডক পাইব৮র ত াাডশ (২বারহঠ&lt;br&gt;নবীজির দানি রিডন চাইখনা েযডত জায়াডত া ঐ&lt;br&gt;অবীন শহীেদ ভােক েক যাও েতারা রওজা পােক (২বারহ&lt;br&gt;অামায় সােথ নিেয় চল যাব নবীর খজার া েত I ঐ&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;Accuracy: 89% approx&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="bengali1.tif/bengali1-full;init:.tif" imageanchor="1"&gt;&lt;img src="bengali1.tif/bengali1-medium;init:.jpg" style="border: 0pt none ;"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center;"&gt;&lt;br&gt;&lt;/p&gt;&lt;br 
/&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;OCRed text:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;জন গণ মন অধি নায়ক জয় েহ&lt;br&gt;ভ\রত ভIগ্য বিধাত\I&lt;br&gt;পঞ্জ\ব   গুজর\ত মর\ঠ\&lt;br&gt;দ্রারিড় উহকল বতমা া&lt;br&gt;বিংধ্য হিম\চল য়মুIা গংা&lt;br&gt;উচ্ছল জদ্রনরি তরহপা া&lt;br&gt;তব মুভ নাহম জােগ মু&lt;br&gt;তব গুভ অ\াদ্রাধ ম\াগ ৮&lt;br&gt;গাাহ তব ড়াব গ\থাI&lt;br&gt;জন গণ মংলদ\য়ক জয় েহ&lt;br&gt;ভ\রত ভদ্রগ্য বিধাতাI&lt;br&gt;ভহয় ৮ঠহ া জয় চহ ই ভহয় াঠহা&lt;br&gt;জয় জহা ড়ায় জয় েহ ভ &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;Accuracy: 85 % approx&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;Ya ya i know. Its not that great. But its only going to get better, And i dint train it properly, so be cheerful!!&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Go to http://code.google.com/p/tesseractindic for all the downloadable stuff.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;(PS: Some of my friends seem to think i made this software. Well, for them, i would like to say this; i am trying to add a pinch of salt to an already cooked and fine meal. Nothing more and nothing less :)  )&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;June 12 2008&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Its 11:50 AM. Latest work done is &lt;a href="http://tesseractindic.googlecode.com/files/skew_deskew.pdf"&gt;here&lt;/a&gt;. &lt;a href="http://tesseractindic.googlecode.com/files/tesseract_indic_alpha%200.1.tar.gz"&gt;Download&lt;/a&gt; is here. Patch &lt;a href="http://tesseractindic.googlecode.com/files/indic_patch0.1.diff"&gt;here&lt;/a&gt;.&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Next is training. The most important part. 
It will finally make it usable.&lt;br&gt;&lt;a href="http://code.google.com/u/tmbdev/"&gt;Tom from OCRopus/Tesseract&lt;/a&gt; has been &lt;a href="http://sites.google.com/site/ocropus/morphological-operations"&gt;kind enough&lt;/a&gt; to help me out with maatraa clipping after going through some of my work.&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;June 8, 2008&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;It's 11:40 PM. This is my first release of the Tesseract OCR engine with Indic script support. &lt;a href="tesseract_indic_alpha.tar.gz"&gt;Download&lt;/a&gt; the gzipped tar file, or the &lt;a href="indic_patch.diff"&gt;patch&lt;/a&gt; if you already have Tesseract 2.03.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The release is very much in its alpha stage right now. In fact, after downloading the archive, you will have to train the engine with your language of choice. &lt;a href="http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract"&gt;Here is how to train it&lt;/a&gt;. I will soon add the complete archive, with training data for Bengali, and later for Hindi.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;You *must* download this &lt;a href="http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz"&gt;English training data&lt;/a&gt; for the engine to recognise English text. My advice is to wait a few more days until I release a fully working version. I have also applied for SourceForge hosting space.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;I will mail the patches to the Tesseract maintainers only after I have the training data ready. &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;And I have decided I will go to sleep by 4:30 AM every day.&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;June 7, 2008&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;It's 7:54 AM. From now on I shall document everything in a pretty formal manner. 
&lt;a href="clipmatra_pseudocode.pdf"&gt;Here&lt;/a&gt; is the algorithm for the &lt;i&gt;maatraa &lt;/i&gt;clipping code.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;In a few days I shall provide the diff file, i.e. the code itself. &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;u&gt;&lt;b&gt;June 3, 2008&lt;/b&gt;&lt;/u&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Spent the entire night experimenting with the code. It's 05:28 hrs now. No big deal though. I usually sleep at 6:30 in the morning.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Made my first box file for the &amp;quot;national anthem&amp;quot; image. I first made it with the normal Tesseract engine. As expected, it classified whole words as blobs. Then I made box files after adding my code, and I was delighted by the results. Here are the related screenshots.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The first two images show boxes drawn by bbtesseract from box files generated by the generic Tesseract engine. 
&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="Capture_003.jpg/Capture_003-full;init:.jpg" imageanchor="1"&gt;&lt;img src="Capture_003.jpg/Capture_003-large.jpg" style="border: 0pt none ;" height="262" width="420"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="Capture_004.jpg/Capture_004-full;init:.jpg" imageanchor="1"&gt;&lt;img src="Capture_004.jpg/Capture_004-large.jpg" style="border: 0pt none ;" height="262" width="420"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;For all the boxes generated below, bbtesseract used the boxfiles generated by the modified Tesseract engine, which included the &lt;i&gt;maatraa &lt;/i&gt;clipping code. The results are quite good. Character segmentation is good. 
&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="Capture_006.jpg/Capture_006-full;init:.jpg" imageanchor="1"&gt;&lt;img src="Capture_006.jpg/Capture_006-large.jpg" style="border: 0pt none ;" height="262" width="420"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="Capture_007.jpg/Capture_007-full;init:.jpg" imageanchor="1"&gt;&lt;img src="Capture_007.jpg/Capture_007-large.jpg" style="border: 0pt none ;" height="262" width="420"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center;"&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="Capture_008.jpg/Capture_008-full;init:.jpg" imageanchor="1"&gt;&lt;img src="Capture_008.jpg/Capture_008-large.jpg" style="border: 0pt none ;" height="262" width="420"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="Capture_007.jpg/Capture_007-full;init:.jpg" imageanchor="1"&gt;&lt;img src="Capture_007.jpg/Capture_007-large.jpg" style="border: 0pt none ;" height="262" width="420"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p 
style="text-align: left; clear: both;" class="separator"&gt;Below are the incorrect segmentations, which show flaws in the algorithm i came up with (which i am sure any other guy would also come up with, hence i am not boasting).  &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center;"&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="Capture_010.jpg/Capture_010-full;init:.jpg" imageanchor="1"&gt;&lt;img src="Capture_010.jpg/Capture_010-large.jpg" style="border: 0pt none ;" height="262" width="420"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;The image below suffers the same boxing problem. The solution to this problem is still not apparent, because i do not know how i can box the &lt;i&gt;hossoi, &lt;/i&gt;since it overlaps the ordinate of the second letter. I will figure something out though :)  . 
&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="cap1.JPG/cap1-full;init:.JPG" imageanchor="1"&gt;&lt;img src="cap1.JPG/cap1-large.JPG" style="border: 0pt none ;" height="262" width="420"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;Here is another problem, but the opposite of the other. Here the loop below the &amp;quot;ga&amp;quot; overlaps the ordinate of the 2nd letter, but in the bottom. &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="cap2.JPG/cap2-full;init:.JPG" imageanchor="1"&gt;&lt;img src="cap2.JPG/cap2-large.JPG" style="border: 0pt none ;" height="262" width="420"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;Same problem. &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="cap3.JPG/cap3-full;init:.JPG" imageanchor="1"&gt;&lt;img src="cap3.JPG/cap3-large.JPG" style="border: 0pt none ;" height="262" width="420"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left;"&gt;What really remains is training the engine, which is simple, and all of a sudden we have a working, free, Indic OCR engine!! 
&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center;"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center;"&gt; Will sleep now, and update this page later.&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center;"&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;b&gt;&lt;u&gt;May 27, 2008&lt;/u&gt;&lt;/b&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;I have wanted to work on digital image processing/OCR related projects since my 2nd year of college. That was when some professors from ISI Kolkata came down to our college for a workshop on DIP. Now, after 18 months or so, I have finally done something in that direction.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;a href="http://code.google.com/p/tesseract-ocr/"&gt;Tesseract&lt;/a&gt; was developed at HP Labs, released as open source by HP in 2005, and has been sponsored by Google since 2006. As far as I know, some parts of the original HP system were not included in the open source release. It is covered by the Apache License 2.0, which is a permissive license, and it was great for me to work on. The two main developers for Tesseract are Ray Smith and Luc Vincent, both legends, and engineers at Google.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Well, let's get down to the problem. Tesseract currently does not support connected scripts or handwritten text. Indic scripts such as Bengali and Devanagari (used for Hindi) have a &lt;i&gt;matra (মাত্রা)&lt;/i&gt;, which is like an underline, but in this case on top of the word, not under it. The Tesseract engine recognises machine-printed English text, but it relies on the gaps between successive characters in English to classify each character into a &lt;i&gt;blob. 
&lt;/i&gt;Hence, theoretically, every isolated character is a blob.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The problem with recognising these scripts is that the &lt;i&gt;matra (মাত্রা) &lt;/i&gt;connects the entire word, so the blob-based character segmentation fails. A solution is to clip the &lt;i&gt;matras  &lt;/i&gt;between successive characters. That way, the same engine can look at each character as an isolated blob.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;The steps involved were:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;Go through the Tesseract source code and identify the place where the clipping could be added.&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Once the function(s) have been identified, work out an algorithm that would allow us to clip matras.&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;The algorithm itself is as follows:&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Threshold the image. Tesseract already takes care of this.&lt;br&gt;&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Read each row of the image, starting from the bottom (y=0). Note the black pixel count of each row.&lt;br&gt;&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt; Find the row with the maximum black pixel count between 2 successive zones of rows with 0 black pixels. 
This row is the matra.&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="matraline.tif/matraline-full;init:.tif" imageanchor="1"&gt;&lt;img src="matraline.tif/matraline-full.tif" style="border: 0pt none ;" height="379" width="287"&gt;&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Do the same for the entire page and store the Y co-ordinate of each &lt;i&gt;matra &lt;/i&gt;found.&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Now, take one such &lt;i&gt;matra&lt;/i&gt; Y co-ordinate. Iterate over each X co-ordinate and note the &lt;i&gt;length of the continuous run of white pixels &lt;/i&gt;directly below the matra in that column. If this length is greater than 80% of the height of the text line (the threshold used in the code below), the column is a region between two characters. Clip this column and the matra above it.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="clippable_positions.tif/clippable_positions-full;init:.tif" imageanchor="1"&gt;&lt;img src="clippable_positions.tif/clippable_positions-full.tif" style="border: 0pt none ;" height="379" width="287"&gt;&lt;/a&gt;&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;u&gt;Black lines between successive characters signify that these spaces have been marked for clipping&lt;/u&gt;&lt;br&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Proceed in the same manner for every other matra. &lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Here is a TIFF image. 
It contains India&amp;#39;s national anthem in Bengali.&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="bengali1.tif/bengali1-full;init:.tif" imageanchor="1"&gt;&lt;img src="bengali1.tif/bengali1-full.tif" style="border: 0pt none ;" height="379" width="287"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;Here is the image after clipping the matras:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center; clear: both;" class="separator"&gt;&lt;a style="border: 0pt none ; background-color: transparent; margin-left: 1em; margin-right: 1em;" href="final.tif/final-full;init:.tif" imageanchor="1"&gt;&lt;img src="final.tif/final-full.tif" style="border: 0pt none ;" height="379" width="287"&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;If you look carefully, you will see that the &lt;i&gt;matras &lt;/i&gt;have been clipped between successive characters. The image is now more or less ready to be fed to the Tesseract engine. &lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;Now for the all-important code. This is the only function I altered. It is in tesseract-2.03/ccmain/baseapi.cpp. I did not provide a diff file because I made some more changes in other parts that I am not sure of. 
&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;&lt;span style="font-family: times new roman,serif;"&gt;// Threshold the given grey or color image into the tesseract global&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;// image ready for recognition. Requires thresholds and hi_value&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;// produced by OtsuThreshold above.&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;void TessBaseAPI::ThresholdRect(const unsigned char* imagedata,&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;                                int bytes_per_pixel,&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;                                int bytes_per_line,&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;                                int left, int top,&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;                                int width, int height,&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;                                const int* thresholds,&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;                                const int* hi_values) {&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;br 
style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;  IMAGELINE line;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;  page_image.create(width, height, 1);&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;  line.init(width);&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;  // blackpixels holds up to height-1 scanned rows plus one end sentinel, hence height entries.&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;  int count,count1=0,blackpixels[height][2],arr_row=0,maxbp=0,maxy=0,matras[100][3],char_height;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;  // For each line in the image, fill the IMAGELINE class and put it into the&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;  // Tesseract global page_image. Note that Tesseract stores images with the&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;  // bottom at y=0 and 0 is black, so we need 2 kinds of inversion.&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;  const unsigned char* data = imagedata + top*bytes_per_line +&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;                              left*bytes_per_pixel;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;  for (int y = height - 1 ; y &amp;gt;= 0; --y) {&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    const unsigned char* pix = data;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span 
style="font-family: times new roman,serif;"&gt;    for (int x = 0; x &amp;lt; width; ++x, pix += bytes_per_pixel) {&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      line.pixels[x] = 1;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      for (int ch = 0; ch &amp;lt; bytes_per_pixel; ++ch) {&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        if (hi_values[ch] &amp;gt;= 0 &amp;amp;&amp;amp;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;            (pix[ch] &amp;gt; thresholds[ch]) == (hi_values[ch] == 0)) {&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;          line.pixels[x] = 0;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;          break;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    page_image.put_line(0, y, width, &amp;amp;line, 0);&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    data += bytes_per_line;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;  }&lt;/span&gt;&lt;br style="font-family: times 
new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;/////////DEBAYAN//////////////////&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    for(int y=0; y&amp;lt;height-1;y++){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      count=0;      &lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      for(int x=0;x&amp;lt;width-1;x++){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;            if(page_image.pixel(x,y)==0)&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;          {count++;}&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;          }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      if(count&amp;gt;0){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;            blackpixels[arr_row][0]=y;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;            
blackpixels[arr_row][1]=count;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        arr_row++;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    blackpixels[arr_row][0]=blackpixels[arr_row][1]=0; // end-of-list sentinel&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    for(int x=0;x&amp;lt;width-1;x++){  // all-black line, used only by the commented-out debug put_line below&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;              line.pixels[x]=0;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    ////////////line_through_matra() begins//////////////////////&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    count=1; cout&amp;lt;&amp;lt;&amp;quot;\nHeight=&amp;quot;&amp;lt;&amp;lt;height&amp;lt;&amp;lt;&amp;quot; arr_row=&amp;quot;&amp;lt;&amp;lt;arr_row&amp;lt;&amp;lt;&amp;quot;\n&amp;quot;;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    char_height=blackpixels[0][0]; // y of the bottom row of the current text line; used to derive its height&lt;/span&gt;&lt;br style="font-family: 
times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    while(count&amp;lt;=arr_row){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;          //if(count==0){max=blackpixels[count][0];}&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      if((blackpixels[count][0]-blackpixels[count-1][0]==1) &amp;amp;&amp;amp; (blackpixels[count][1]&amp;gt;=maxbp)){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;            maxbp=blackpixels[count][1];&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        maxy=blackpixels[count][0];&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        cout&amp;lt;&amp;lt;&amp;quot;\nMax=&amp;quot;&amp;lt;&amp;lt;maxy&amp;lt;&amp;lt;&amp;quot; bpc=&amp;quot;&amp;lt;&amp;lt;maxbp;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;          if((blackpixels[count][0]-blackpixels[count-1][0])!=1){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;            /////////////drawline(max)//////////////////////&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;              
cout&amp;lt;&amp;lt;&amp;quot;\nmax=&amp;quot;&amp;lt;&amp;lt;maxy&amp;lt;&amp;lt;&amp;quot; bpc=&amp;quot;&amp;lt;&amp;lt;maxbp;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    //      page_image.put_line(0,maxy,width,&amp;amp;line,0);&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;          char_height=blackpixels[count-1][0]-char_height;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;              matras[count1][0]=maxy; matras[count1][1]=maxbp; matras[count1][2]=char_height; count1++;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;          char_height=blackpixels[count][0];&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;            //////////// drawline(max)/////////////////////&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;            maxbp=blackpixels[count][1];&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;          } &lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      count++;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    
matras[count1][0]=matras[count1][1]=matras[count1][2]=&amp;#39;\0&amp;#39;;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    //delete blackpixels;    &lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    ////////////line_through_matra() ends//////////////////////&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        ////////////clip_matras() begins///////////////////////////&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        for(int i=0;i&amp;lt;100;i++){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      if(matras[i][0]==&amp;#39;\0&amp;#39;){break;}&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      cout&amp;lt;&amp;lt;&amp;quot;\nY=&amp;quot;&amp;lt;&amp;lt;matras[i][0]&amp;lt;&amp;lt;&amp;quot; bpc=&amp;quot;&amp;lt;&amp;lt;matras[i][1]&amp;lt;&amp;lt;&amp;quot; chheight=&amp;quot;&amp;lt;&amp;lt;matras[i][2];&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      count=i;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span 
style="font-family: times new roman,serif;"&gt;    for(int i=0;i&amp;lt;=count;i++){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;      for(int x=0;x&amp;lt;width-1;x++){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;         count1=0;        &lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        for(int y=0;y&amp;lt;matras[i][2];y++){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;          if(page_image.pixel(x,matras[i][0]-y)==1){count1++;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;                for(int k=y+1;k&amp;lt;matras[i][2];k++){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;                   if(page_image.pixel(x,matras[i][0]-k)==1){count1++;}&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;               else{break;}&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;            }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        break;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;              }&lt;/span&gt;&lt;br style="font-family: times new 
roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        cout&amp;lt;&amp;lt;&amp;quot;\nWPR @ &amp;quot;&amp;lt;&amp;lt;x&amp;lt;&amp;lt;&amp;quot;,&amp;quot;&amp;lt;&amp;lt;matras[i][0]&amp;lt;&amp;lt;&amp;quot;=&amp;quot;&amp;lt;&amp;lt;count1;  &lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        if(count1&amp;gt;.8*matras[i][2] &amp;amp;&amp;amp; count1&amp;lt;matras[i][2]){&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        line.init(matras[i][2]+5);&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        for(int j=0;j&amp;lt;matras[i][2]+5;j++)&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;              {line.pixels[j]=1;}&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        cout&amp;lt;&amp;lt;&amp;quot;GA&amp;quot;;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;            page_image.put_column(x,matras[i][0]-matras[i][2],matras[i][2]+5,&amp;amp;line,0);&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;        }&lt;/span&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;       }&lt;/span&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new 
roman,serif;"&gt;    }&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;page_image.write(&amp;quot;bentest.tif&amp;quot;);&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;    ////////////clip_matras() ends/////////////////////////////&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;/////////DEBAYAN/////////////////&lt;/span&gt;&lt;br style="font-family: times new roman,serif;"&gt;&lt;span style="font-family: times new roman,serif;"&gt;}&lt;/span&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;Problems with the code:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;It assumes the image is perfectly straight. This assumption is obviously wrong, but Tesseract already has an inbuilt function to correct this. In any case, because these scripts have a matra, finding the angle of tilt is pretty simple.&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;It runs this &amp;quot;matra-clipping&amp;quot; code on all languages, which is totally wrong. One just has to add an &amp;quot;if-then-else&amp;quot; statement to make it run only for scripts that have a matra, such as hin_in, ben_in etc.&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;I implemented the entire thing in one block of code. Will break it into a few functions.&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;The algorithm has gaping flaws. 
Will plug them up.&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Many more; some I know, most I don&amp;#39;t :)&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Long term goal:&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Adding support for Indic scripts with a matra, such as Devanagari and Bengali, to Tesseract, which includes making the above code error free and then training Tesseract. As far as I understand, the Tesseract developers are not concentrating on connected script support yet, so I hope my work does not overlap with Google&amp;#39;s work.&lt;/li&gt;&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;And I solemnly swear that I did not plagiarise/steal the code from anywhere, and I came up with the algo myself (which is why it is full of bugs).&lt;br&gt;&lt;br&gt;&lt;br /&gt;&lt;p&gt;Honestly, &lt;i&gt;kaam kar ke mazaa aa gaya yaar!!&lt;/i&gt; (working on this was great fun!)&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;i&gt;(ps: &lt;/i&gt;What happened to &lt;i&gt;aksharbodh? &lt;/i&gt;If anybody knows, please tell me.)&lt;i&gt;&lt;br&gt;&lt;/i&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;i&gt;&lt;/i&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: left; clear: both;" class="separator"&gt;&lt;br&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p style="text-align: center;"&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;br&gt;&lt;/p&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacking-tesseract.blogspot.com/feeds/2374597159903690225/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/03/may-27-2008.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2374597159903690225'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6518118086872671696/posts/default/2374597159903690225'/><link rel='alternate' type='text/html' href='http://hacking-tesseract.blogspot.com/2009/03/may-27-2008.html' title='Transferred Entries'/><author><name>Debayan</name><uri>http://www.blogger.com/profile/01662690894816727096</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
