C# – OCR library candidates

Although I can read German and English, I always prefer reading computer magazines in my mother tongue. Reading in your mother tongue is always faster than reading in a foreign language, no argument. So I often visit http://pctipsvn.org/ to read Vietnamese computer magazines, just to see if there is something interesting or something new to learn. The articles contain many links to external websites, and it drives me really crazy when I want to go to a specific link in an article. I must type each character by hand and check that I entered it correctly. So I would like to write a small tool that takes a snapshot of the link and opens it in my web browser.

1. Test

Taking a snapshot of a given screen area should not be a big problem. There are plenty of C# examples on the internet, so I can “steal” one and use it for my tool. What I didn’t know yet is which OCR library I should use to extract the link from the snapshot so that it can be opened in the web browser. I don’t intend to write a library myself because I have neither the time nor the talent to do it. Therefore I started to search for an open source OCR library that I can quickly integrate into my tool. For anyone who doesn’t know what OCR is: Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. (from Wikipedia)
If you want to dig deeper into OCR, you can read the book Character Recognition Systems: A Guide for Students and Practitioners.

After some searching I found that Google’s Tesseract library is warmly recommended: http://code.google.com/p/tesseract-ocr/.
I installed it and used SnagIt to take a snapshot of a link in a computer magazine.


Then I ran Tesseract to convert this image to text with the command below:

C:\Program Files (x86)\Tesseract-OCR>tesseract.exe e:\Temp\snipsnip.png e:\temp\output

And this is what I received in the output.txt file:


Oh, that is not bad. Only 4 of the 18 characters were recognized incorrectly, so it is 77.78% correct. I think it may do better with less noise, because the image above has too much of it. So I took another snapshot, this time of the banner on Tesseract’s homepage, hoping it would be recognized correctly.

Tesseract banner

And here is what I had in output.txt:

An OCR Engme thalwas developed al HP Labs between |935 and 1995 and naw al Gung\e

It’s about 90% correct, but although the accuracy is higher, it really disappoints me, because the image is clean, with hardly any noise. Anyone can read it correctly without the slightest problem. I expected the output to be 100% correct, but it’s not. So Tesseract is good and free, but its accuracy is not 100% even in some simple cases. Because of this inaccuracy I can’t firmly decide to use it. Therefore I would like to test one more library so that I can decide which one to use.
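Since the tool itself will be written in C#, the command line used above could be driven from code. This is only a minimal sketch, not a finished integration: the class and method names (TesseractRunner, BuildArguments, Recognize) are made up for illustration, and the paths are simply the ones from the command above. It relies on Tesseract appending ".txt" to the output base name, as seen with output.txt.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class TesseractRunner
{
    // Builds the argument string: input image first, then the output base name.
    // Both are quoted so that paths containing spaces survive.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\"";
    }

    // Runs tesseract.exe, waits for it to finish and returns the recognized text.
    public static string Recognize(string tesseractExe, string imagePath, string outputBase)
    {
        var psi = new ProcessStartInfo(tesseractExe, BuildArguments(imagePath, outputBase))
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process p = Process.Start(psi))
        {
            p.WaitForExit();
            if (p.ExitCode != 0)
                throw new InvalidOperationException("tesseract exited with code " + p.ExitCode);
        }
        // Tesseract writes its result to <outputBase>.txt
        return File.ReadAllText(outputBase + ".txt").Trim();
    }

    static void Main()
    {
        // Paths taken from the command line shown earlier; adjust them to your machine.
        string text = Recognize(
            @"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
            @"e:\Temp\snipsnip.png",
            @"e:\temp\output");
        Console.WriteLine(text);
    }
}
```

Calling the executable keeps the tool independent of any particular .NET wrapper, at the price of starting one process per image.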

This time it is a closed-source library from Microsoft. Strictly speaking it is not a library itself but a component of the Microsoft Office packages. If you have Microsoft Office installed with the Document Imaging component (it shipped with Office 2003 and 2007), you’ll have it too. In Visual Studio, you can add a reference to “Microsoft Office Document Imaging x.0 Type Library” (x is the version of Microsoft Office) to use this component.

Microsoft Office Document Imaging 11.0 Type Library

Then call it like this

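static void Main(string[] args)
{
	string strText = "";
	MODI.Document md = new MODI.Document();
	md.Create(@"e:\Temp\snipsnip.png"); // path to the snapshot (same file as in the Tesseract test)
	// Text language is English; the two flags enable auto-orientation and straightening
	md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
	MODI.Image image = (MODI.Image)md.Images[0];
	MODI.Layout layout = image.Layout;
	for (int i = 0; i < layout.Words.Count; i++)
	{
		MODI.Word word = (MODI.Word)layout.Words[i];
		if (strText.Length > 0)
			strText += " ";
		strText += word.Text;
	}
	md.Close(false); // close the document without saving changes
	Console.WriteLine(strText);
}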

I create a MODI document and give it the path to the image as input. English is set as the language of the text in the image, and I process the image with the .OCR function. After processing, a for-loop runs through all recognized words and appends them to a string variable. At the end of the program I close the MODI document and print the result. Here is what I received in the console output:


For the magazine link it’s about 77.78% correct, and for the banner I got

tesse ract-ocr
An OCR Engine that was developed at HP Labs between 1935 and 1995 and now at Google

which is about 98.90% correct.
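The percentages quoted in this post are simply the share of correctly recognized characters, counted by hand. A rough sketch of how that could be automated follows. The Levenshtein edit distance used to count wrong characters is my own choice here, not something either OCR engine provides, and the two sample strings are hypothetical.

```csharp
using System;
using System.Globalization;

static class OcrAccuracy
{
    // Classic dynamic-programming edit distance (insert, delete, substitute).
    public static int Levenshtein(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Share of correct characters, in percent.
    public static double Percent(int totalChars, int wrongChars)
    {
        return 100.0 * (totalChars - wrongChars) / totalChars;
    }

    static void Main()
    {
        // 4 wrong characters out of 18, as in the first Tesseract test.
        Console.WriteLine(Percent(18, 4).ToString("F2", CultureInfo.InvariantCulture)); // 77.78

        // Hypothetical strings, only to show the automatic count.
        string expected = "pctipsvn.org";
        string actual = "pctipsvn.0rg";
        Console.WriteLine(Levenshtein(expected, actual)); // 1
    }
}
```

An edit distance also counts missing and extra characters, such as a dropped dot or an inserted space, which a position-by-position comparison would overcount.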

As you see, Microsoft’s OCR component works better than Google’s OCR library. In the hard case the component recognizes one more character correctly but misses the dot “.” character. In the simple case it recognizes all words correctly but puts a wrong space into the word “tesseract”. However, using this component makes us depend on Microsoft Office. Requiring Microsoft Office to be installed for our software to work may or may not fit the situation. But if our clients can guarantee that the machines our software runs on have Office 2007 installed, we’re golden. Of course we can also support both: check whether the computer has Office and, if so, use Microsoft’s OCR component, otherwise use Tesseract. Then we have a super-mega set of OCR libraries at hand.
The tool I want is still in the draft stage at the time of writing. I hope I’ll find time to implement it, but if you write a similar one, please share it with me. Then I won’t have to write it myself anymore. ^_^
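The idea of using both engines could look roughly like this. It is a sketch under one loud assumption: that MODI registers the COM ProgID "MODI.Document", which I have not verified on every Office version. The names OcrEngineSelector, Choose and IsModiAvailable are invented for this example.

```csharp
using System;

static class OcrEngineSelector
{
    public enum Engine { Modi, Tesseract }

    // Pure decision, kept separate so it can be checked without Office installed.
    public static Engine Choose(bool modiAvailable)
    {
        return modiAvailable ? Engine.Modi : Engine.Tesseract;
    }

    // Assumption: MODI registers the COM ProgID "MODI.Document".
    // Type.GetTypeFromProgID returns null when the ProgID is not registered.
    public static bool IsModiAvailable()
    {
        try
        {
            return Type.GetTypeFromProgID("MODI.Document") != null;
        }
        catch (PlatformNotSupportedException)
        {
            return false; // COM lookups are not available on non-Windows systems
        }
    }

    static void Main()
    {
        Engine engine = Choose(IsModiAvailable());
        Console.WriteLine("Using OCR engine: " + engine);
    }
}
```

Checking at runtime means the MODI reference must be loaded late (for example through the ProgID) rather than linked directly, otherwise the program would still fail to start on machines without Office.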
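The last step of the planned tool, taking the link out of the recognized text and opening it, might be sketched as follows. The regular expression is deliberately simple and is an assumption about what the links look like; LinkOpener, ExtractUrl and OpenInBrowser are hypothetical names, and the sample text is not real OCR output.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class LinkOpener
{
    // Returns the first http/https URL found in the text, or null if there is none.
    public static string ExtractUrl(string ocrText)
    {
        Match m = Regex.Match(ocrText, @"https?://\S+", RegexOptions.IgnoreCase);
        return m.Success ? m.Value : null;
    }

    public static void OpenInBrowser(string url)
    {
        // UseShellExecute lets the operating system pick the default browser.
        Process.Start(new ProcessStartInfo(url) { UseShellExecute = true });
    }

    static void Main()
    {
        string ocrText = "visit http://pctipsvn.org/ today"; // hypothetical OCR result
        string url = ExtractUrl(ocrText);
        if (url != null)
            OpenInBrowser(url);
        else
            Console.WriteLine("No link found in the recognized text.");
    }
}
```

Because OCR output is not always 100% correct, it would be wise to show the extracted link to the user for confirmation before opening it.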

2. Updates

2.1 29.12.2014

Maybe the best one: http://www.abbyy.com/ocr_sdk/

