Check out the source code of tesseractindic from http://code.google.com/p/tesseractindic. Then cd to tesseract_trainer and follow the directions below: Here is a demonstration of how you can create training data files for an arbitrary language for Tesseract-OCR and subsequently use it to perform OCR. |
To create data files for , say, Bengali: |
1) Create a directory in tesseract_trainer/ and name it arbitrarily. This contains the symbols of the alphabet. I name it 'beng.alphabet'. In the directory you may create a maximum of 4 files: |
a)consonants- Put all the consonants in your script/language in the file. eg, ক , খ (ka, kha) etc |
b)pre_semivowels- Put all the semivowels (if any in your script) that come before a consonant. eg, ি, ে, ৈ (e kaar, a kaar, oi kaar ) |
c)post_semivowels- Put all the semivowels (if any in your script) that come after a consonant, eg, া, ী (aa kaar, ee kaar) |
d)rest- Put everything else here, like digits, punctuation, conjuncts, special characters, vowels. You could also choose not to create the 3 files above and put all the symbols in this file. |
|
You need to have some fonts particular to your script installed on your system. On an Ubuntu system you will find them in /usr/share/fonts/truetype/ttf-bengali-fonts/. You will require the name(s) of the fonts later. |
Now change directory to tesseract_trainer/ and execute the following on the shell(for bengali for example): python generate.py -font Mitra -l Bengali -s 15 -a beng.alphabet/ |
-font takes the ttf font name you are trying to train |
-l takes the script name to be trained as input |
-s size of the characters generated in images in Bengali.images/ |
This command will generate many images and corresponding box files in Bengali.images/. In the end it generates 5 files in Bengali.training_data/. |
1)Bengali.unicharset |
2)Bengali.Microfeat |
3)Bengali.normproto |
4)Bengali.pffmtable |
5)Bengali.inttemp |
These 5 files are needed by Tesseract-OCR engine to add a new script support. In addition there are 3 more files required that ae to be created by you separately. These are : |
* Bengali.freq-dawg |
* Bengali.word-dawg |
* Bengali.user-words |
You do not require the tesseract_trainer tool to create the files above. They can be created by following appropriate instructions at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract. |
Copy these files to /usr/local/share/tessdata and to tessdata/ folder of you tesseract-ocr source code. |
Now lets OCR some image. Lets say the image is image.tif. Here is what you must execute at the terminal: |
tesseract image.tif ocr -l Bengali |
You get ocr.txt as the output. |
Contact debayanin@gmail.com for clarifications. |
Contact debayanin@gmail.com for further clarifications. |
And yeah, this is work in progress. |
Friday, June 5, 2009
How to train Tesseract-OCR
Subscribe to:
Posts (Atom)