Friday, November 6, 2009

Crowd Sourcing OCR development

One of the biggest challenges in OCR development is gathering training data and then feeding it to the OCR engine. The data is generally carefully chosen and some emphasis is laid on the quality of scans too. This often requires a team of people working in close proximity, and hence has traditionally been a blocker for the distributed development model.
However, with proper planning in software development, frameworks can be set up that allow end users to contribute OCR training data.
The interface to the OCR system may be command-line based, GUI based or web based. Say a user OCRs a particular document. After OCR, the interface offers the user an opportunity to correct any errors and send the corrected text back to a centralised server, where volunteers/contributors verify the data. Once the data has been verified, it is fed to the engine for incremental training.
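The submit, verify and feed workflow described above could be sketched roughly as follows. This is only an illustration: the names `Correction`, `CorrectionQueue` and `required_approvals`, and the rule that two distinct reviewers must approve a correction, are assumptions made for the example and not part of any existing OCR tool.

```python
from dataclasses import dataclass, field


@dataclass
class Correction:
    """One user-submitted correction (all field names are hypothetical)."""
    image_id: str            # identifier of the scanned page
    ocr_text: str            # what the OCR engine produced
    corrected_text: str      # what the user says it should be
    approvals: set = field(default_factory=set)  # reviewers who agreed


class CorrectionQueue:
    """Holds corrections on the central server until volunteers verify them."""

    def __init__(self, required_approvals=2):
        # Assumption: a correction counts as verified once this many
        # distinct reviewers have approved it.
        self.required_approvals = required_approvals
        self.pending = []

    def submit(self, image_id, ocr_text, corrected_text):
        correction = Correction(image_id, ocr_text, corrected_text)
        self.pending.append(correction)
        return correction

    def approve(self, correction, reviewer):
        # A set is used so the same reviewer cannot approve twice.
        correction.approvals.add(reviewer)

    def drain_verified(self):
        """Remove verified corrections and return them as training pairs."""
        verified = [c for c in self.pending
                    if len(c.approvals) >= self.required_approvals]
        self.pending = [c for c in self.pending
                        if len(c.approvals) < self.required_approvals]
        return [(c.image_id, c.corrected_text) for c in verified]
```

The pairs returned by `drain_verified()` are what would be handed to the training step; how the engine consumes them depends entirely on the OCR system in use.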
To check whether the added data is actually improving OCR performance, we may run an automated nightly OCR on a fixed set of test images with matching ground-truth text and post the accuracy percentage daily.
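One possible way to compute the daily percentage is character-level accuracy based on edit distance. The sketch below assumes the nightly job has already produced a list of (OCR output, ground truth) string pairs; the function names and the choice of metric are illustrative assumptions, and running the OCR engine itself is left out.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(pairs):
    """Percentage accuracy over (ocr_output, ground_truth) pairs.

    Assumed metric: 100 * (chars - errors) / chars, where chars is the
    total ground-truth length and errors is the summed edit distance.
    The result is clamped at 0 for very poor output.
    """
    total_chars = sum(len(truth) for _, truth in pairs)
    if total_chars == 0:
        return 100.0
    total_errors = sum(edit_distance(ocr, truth) for ocr, truth in pairs)
    return max(0.0, 100.0 * (total_chars - total_errors) / total_chars)
```

Because the test set stays fixed from night to night, a falling number after new training data has been added points to a bad batch of contributions.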
The challenge is that most OCR training systems are not incrementally trainable out of the box. Tesseract-OCR is one example. However, one may write some code to implement incremental training.
Crowd Sourcing training data is critical to aligning OCR development with a FOSS-based model and hence freeing it from the clutches of research teams at big institutes.

1 comment:

  1. Debayan,

    I suggest you take a look at the Distributed Proofreaders project. The source code for its website is supposedly available as a SourceForge project. It may be possible to adapt it for this purpose and then host it on a server, which would make it easier for people to help out by providing ground truth data.