C# – An example of OCR web service

Years ago I wrote a short post, C# – OCR library candidates, comparing two OCR libraries: Tesseract and Microsoft Office Document Imaging. Tesseract is an open source OCR framework. Unfortunately, its error rate is still too high for use in commercial products. Last week I wanted to build a small OCR web service, both to practice and to test Tesseract again. The result was as bad as last time (I guess ABBYY FineReader may be the best OCR SDK, but I have no full version for testing). Although Tesseract can't recognize complex documents, I still used it for this example because I had no better candidate. The sample OCR web service is simple: it receives a file uploaded by the client, runs OCR on it and returns the text. No big deal.

1. Prerequisites

The web service is built with ASP.NET Web API. If you follow my blog, you are probably already familiar with this framework. If not, these articles are a good place to start:

How to consume ASP.NET Web API RC with HttpClient?

Android – Upload files to ASP.NET Web API service

2. Code review

In this demo, I support 3 file types for OCR: Pdf, Tiff and Zip.
– Tiff: Tesseract will be used to extract text from this file type.
– Pdf: iTextSharp will extract text from the pdf file. You can read more about iTextSharp at C#,iTextSharp – PDF file – Insert/extract image,text,font, text highlighting and auto fillin.
– Zip: DotNetZip will decompress the .zip file to a temporary folder, and the service then loops through all extracted files. If a file is .tiff or .pdf, the corresponding OCR engine is run to get the text from that file.

2.1 Tiff

The TiffController, which derives from BaseController, has only one Post action to receive files from clients.

public class TiffController : BaseController
{
	public Task<IEnumerable<HDFile>> Post()
	{
		return Handle(new List<string>() { ".tif", ".tiff" });
	}
}

When this Post action is called, it calls the Handle function of the base class with a parameter that defines which file types I want to handle. For example, TiffController only handles “.tif” and “.tiff” files.

protected virtual Task<IEnumerable<HDFile>> Handle(IEnumerable<string> fileExtensions)
{
	try
	{
		var uploadFolderPath = HostingEnvironment.MapPath("~/App_Data/" + UploadFolder);
		log.Debug(uploadFolderPath);

		if (Request.Content.IsMimeMultipartContent())
		{
			var streamProvider = new WithExtensionMultipartFormDataStreamProvider(uploadFolderPath);
			var task = Request.Content.ReadAsMultipartAsync(streamProvider).ContinueWith<IEnumerable<HDFile>>(t =>
			{
				if (t.IsFaulted || t.IsCanceled)
				{
					throw new HttpResponseException(HttpStatusCode.InternalServerError);
				}

				return Handle(streamProvider.FileData.Select(x => new HDFile(x.Headers.ContentDisposition.FileName, null, x.LocalFileName)), fileExtensions);
			});

			return task;
		}
		else
		{
			throw new HttpResponseException(Request.CreateResponse(HttpStatusCode.NotAcceptable, "This request is not properly formatted"));
		}
	}
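	catch (HttpResponseException)
	{
		// Let deliberate HTTP errors (such as the NotAcceptable response above) pass through unchanged.
		throw;
	}
	catch (Exception ex)
	{
		log.Error(ex);
		throw new HttpResponseException(Request.CreateResponse(HttpStatusCode.BadRequest, ex.Message));
	}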
}

protected IEnumerable<HDFile> Handle(IEnumerable<HDFile> files, IEnumerable<string> fileExtensions)
{
	files = files.Where(x => fileExtensions.Contains(Path.GetExtension(x.Name), StringComparer.OrdinalIgnoreCase)).ToList();

	foreach (var item in files)
	{
		foreach (var engine in OCREngines.GetDefaultInstance().AllRegisteredEngines)
		{
			if (engine.CanHandle(Path.GetExtension(item.Name)))
			{
				item.Text = engine.GetText(item.Tag);
				break;
			}
		}
	}
	return files;
}
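Handle relies on WithExtensionMultipartFormDataStreamProvider, which this post does not list. Below is a minimal sketch of what such a provider might look like, assuming its purpose is to keep the original file extension on the saved upload (the base MultipartFormDataStreamProvider stores uploads under generated names). BuildLocalFileName is a hypothetical helper name, and the real class in the repository may differ.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;

// Hypothetical reconstruction: the post uses this type but never shows it.
public class WithExtensionMultipartFormDataStreamProvider : MultipartFormDataStreamProvider
{
    public WithExtensionMultipartFormDataStreamProvider(string rootPath)
        : base(rootPath)
    {
    }

    // Web API calls this to choose the name of the file written under rootPath.
    public override string GetLocalFileName(HttpContentHeaders headers)
    {
        string clientFileName = headers.ContentDisposition != null
            ? headers.ContentDisposition.FileName
            : null;
        return BuildLocalFileName(clientFileName);
    }

    // Assumption: a random name plus the client's extension is enough for this demo.
    public static string BuildLocalFileName(string clientFileName)
    {
        // The Content-Disposition file name usually arrives wrapped in quotes.
        string cleaned = (clientFileName ?? string.Empty).Trim().Trim('"');
        string extension = Path.GetExtension(Path.GetFileName(cleaned));
        return Guid.NewGuid().ToString("N") + extension;
    }
}
```

Using a generated name instead of the client's file name avoids collisions between uploads and keeps client-supplied path fragments out of the upload folder.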

The Handle function saves the uploaded files to the App_Data/uploads folder and runs OCR on them. For each file type I define an OCR engine. All of these engines implement IOCREngine, which has the following functions:

public interface IOCREngine
{
	bool CanHandle(string fileExtensions);

	string GetText(string filePath);
}

public class TiffOCREngine : IOCREngine
{
	public bool CanHandle(string fileExtensions)
	{
		return fileExtensions.Equals(".tif", System.StringComparison.OrdinalIgnoreCase) || fileExtensions.Equals(".tiff", System.StringComparison.OrdinalIgnoreCase);
	}

	public string GetText(string filePath)
	{
		return TesseractUtil.GetText(filePath);
	}
}

The engine is only a wrapper around the third-party OCR library that I use for testing. In this case, TiffOCREngine simply calls a function of the Tesseract framework.
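The controllers look up engines through OCREngines.GetDefaultInstance().AllRegisteredEngines, but the registry itself is not listed in this post. Here is a minimal sketch of how such a registry could be written, assuming a lazily created singleton. The Register and FindEngine members are hypothetical additions for illustration, and the interface is repeated only so that the sketch compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface IOCREngine
{
    bool CanHandle(string fileExtensions);

    string GetText(string filePath);
}

// Hypothetical sketch; the actual OCREngines class lives in the linked repository.
public sealed class OCREngines
{
    private static readonly Lazy<OCREngines> defaultInstance =
        new Lazy<OCREngines>(() => new OCREngines());

    private readonly List<IOCREngine> engines = new List<IOCREngine>();

    private OCREngines()
    {
    }

    public static OCREngines GetDefaultInstance()
    {
        return defaultInstance.Value;
    }

    // Returns a snapshot so callers can enumerate while other threads register engines.
    public IEnumerable<IOCREngine> AllRegisteredEngines
    {
        get { lock (engines) { return engines.ToArray(); } }
    }

    public void Register(IOCREngine engine)
    {
        if (engine == null) throw new ArgumentNullException("engine");
        lock (engines) { engines.Add(engine); }
    }

    // First registered engine that accepts the extension, or null if none does.
    public IOCREngine FindEngine(string fileExtension)
    {
        return AllRegisteredEngines.FirstOrDefault(e => e.CanHandle(fileExtension));
    }
}
```

With a registry like this, the engines would be registered once at application start (for example `OCREngines.GetDefaultInstance().Register(new TiffOCREngine());`), and the inner foreach loops in the controllers could be replaced by a single FindEngine call.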

2.2 Pdf

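The PdfController works exactly like TiffController, except that it only accepts the “.pdf” file extension. For this controller, PdfOCREngine takes over the job of extracting text from the PDF file. Note that iTextSharp reads the text already embedded in the PDF rather than performing OCR, so a PDF that contains only scanned images yields little or no text.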

public class PdfController : BaseController
{
	public Task<IEnumerable<HDFile>> Post()
	{
		return Handle(new List<string>() { ".pdf" });
	}
}

public class PdfOCREngine : IOCREngine
{
	private ILog log = log4net.LogManager.GetLogger(typeof(PdfOCREngine));

	public bool CanHandle(string fileExtensions)
	{
		return fileExtensions.Equals(".pdf", StringComparison.OrdinalIgnoreCase);
	}

	public string GetText(string filePath)
	{
		// Read access is enough for text extraction and also works on read-only files.
		using (Stream newpdfStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
		{
			StringBuilder result = new StringBuilder();
			PdfReader pdfReader = new PdfReader(newpdfStream);

			for (int index = 1; index < pdfReader.NumberOfPages + 1; index++)
			{
				try
				{
					result.Append(PdfTextExtractor.GetTextFromPage(pdfReader, index, new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy()));
				}
				catch (Exception ex)
				{
					log.Error(ex);
				}
			}

			return result.ToString();
		}
	}
}

2.3 Zip

The ZipController doesn’t work like the Tiff and Pdf controllers because the uploaded files can’t be handled directly. I have to override the Handle function to extract the files first and then let each extracted file in the temporary folder be processed by the engine that matches its extension.

public class ZipController : BaseController
{
	private ILog log = log4net.LogManager.GetLogger(typeof(ZipController));

	public Task<IEnumerable<HDFile>> Post()
	{
		return Handle(new List<string>() { ".zip" });
	}

	protected override Task<IEnumerable<HDFile>> Handle(IEnumerable<string> fileExtensions)
	{
		try
		{
			var uploadFolderPath = HostingEnvironment.MapPath("~/App_Data/" + UploadFolder);
			log.Debug(uploadFolderPath);

			if (Request.Content.IsMimeMultipartContent())
			{
				var streamProvider = new WithExtensionMultipartFormDataStreamProvider(uploadFolderPath);
				var task = Request.Content.ReadAsMultipartAsync(streamProvider).ContinueWith<IEnumerable<HDFile>>(t =>
				{
					if (t.IsFaulted || t.IsCanceled)
					{
						throw new HttpResponseException(HttpStatusCode.InternalServerError);
					}

					IEnumerable<string> zipFilePaths = streamProvider.FileData.Where(x => fileExtensions.Contains(Path.GetExtension(x.LocalFileName), StringComparer.OrdinalIgnoreCase)).Select(x => x.LocalFileName);

					List<string> files = new List<string>();
					foreach (var zipFilePath in zipFilePaths)
					{
						string tempFolder = Path.Combine(Path.GetTempPath(), CryptoUtil.MD5(DateTime.Now.Ticks.ToString() + zipFilePath));
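						Directory.CreateDirectory(tempFolder); // no-op if the folder already exists
						// Dispose the archive so the handle on the uploaded .zip is released.
						using (ZipFile zipFile = new ZipFile(zipFilePath))
						{
							zipFile.ExtractAll(tempFolder);
						}
						// Also pick up files that the archive stores in subfolders.
						files.AddRange(Directory.GetFiles(tempFolder, "*", SearchOption.AllDirectories));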
					}

					IList<HDFile> result = new List<HDFile>();
					foreach (var item in files)
					{
						foreach (var engine in OCREngines.GetDefaultInstance().AllRegisteredEngines)
						{
							if (engine.CanHandle(Path.GetExtension(item)))
							{
								result.Add(new HDFile(Path.GetFileName(item), engine.GetText(item)));
								break;
							}
						}
					}
					return result;
				});

				return task;
			}
			else
			{
				throw new HttpResponseException(Request.CreateResponse(HttpStatusCode.NotAcceptable, "This request is not properly formatted"));
			}
		}
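		catch (HttpResponseException)
		{
			// Let deliberate HTTP errors (such as the NotAcceptable response above) pass through unchanged.
			throw;
		}
		catch (Exception ex)
		{
			log.Error(ex);
			throw new HttpResponseException(Request.CreateResponse(HttpStatusCode.BadRequest, ex.Message));
		}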
	}
}

You can extend this function to delete the temp folder once it is no longer in use.
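One possible way to do that cleanup is a small disposable wrapper around the temporary folder. This is only a sketch: TempFolder is a hypothetical helper that is not part of the original project, and it names the folder with a GUID rather than the MD5 of the current ticks used above.

```csharp
using System;
using System.IO;

// Hypothetical helper: creates a unique folder under the system temp path
// and deletes it, including its contents, when disposed.
public sealed class TempFolder : IDisposable
{
    public string FullPath { get; private set; }

    public TempFolder()
    {
        FullPath = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(FullPath);
    }

    public void Dispose()
    {
        try
        {
            if (Directory.Exists(FullPath))
            {
                Directory.Delete(FullPath, true); // true = delete recursively
            }
        }
        catch (IOException)
        {
            // A file is still in use; leave the folder rather than fail the request.
        }
        catch (UnauthorizedAccessException)
        {
            // Same reasoning for read-only or locked files.
        }
    }
}
```

In the ZipController, the extraction and the OCR loop for one archive would then both run inside a `using (var temp = new TempFolder())` block, extracting to `temp.FullPath`, so that the folder is removed only after the engines have read the extracted files.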

3. Client

Once the web service is ready, we can easily consume it from a client. Below is an example of a .NET console client:

private static void Main(string[] args)
{
	Post("http://localhost:49912/api/tiff", "Tif", "HintDesk.tif");
	Post("http://localhost:49912/api/pdf", "Pdf", "HintDesk.pdf");
	Post("http://localhost:49912/api/zip", "Zip", "HintDesk.zip");

	Console.ReadLine();
}

private static void Post(string url, string name, string fileName)
{
	Uri server = new Uri(url);

	// Dispose the client and the request content; disposing the content also closes the file stream.
	using (HttpClient httpClient = new HttpClient())
	using (MultipartFormDataContent multipartFormDataContent = new MultipartFormDataContent())
	{
		// OCR can be slow, so allow five times the default timeout.
		httpClient.Timeout = new TimeSpan(httpClient.Timeout.Ticks * 5);
		StreamContent streamContent = new StreamContent(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read));
		multipartFormDataContent.Add(streamContent, name, fileName);

		HttpResponseMessage responseMessage = httpClient.PostAsync(server, multipartFormDataContent).Result;

		if (responseMessage.IsSuccessStatusCode)
		{
			IList<HDFile> hdFiles = responseMessage.Content.ReadAsAsync<IList<HDFile>>().Result;

			foreach (var item in hdFiles)
			{
				Console.WriteLine(item.Text);
			}
		}
		else
		{
			// Report failures instead of silently printing nothing.
			Console.WriteLine("Request to {0} failed: {1}", url, responseMessage.StatusCode);
		}
	}
}
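Both the service and the client exchange HDFile objects, a type this post never lists. Judging only from how it is used in the code above (a name, the extracted text, and a Tag that holds the server-side file path), it could look roughly like the sketch below. The property types and the parameterless constructor are assumptions; the real class is in the linked repository.

```csharp
// Hypothetical reconstruction inferred from the constructor calls and
// property accesses in this post; the real class may differ.
public class HDFile
{
    // Parameterless constructor so that serializers can create instances.
    public HDFile()
    {
    }

    public HDFile(string name, string text)
        : this(name, text, null)
    {
    }

    public HDFile(string name, string text, string tag)
    {
        Name = name;
        Text = text;
        Tag = tag;
    }

    // File name as sent by the client (or the entry name inside a zip).
    public string Name { get; set; }

    // Text returned by the OCR engine; null until an engine has run.
    public string Text { get; set; }

    // In the controllers above this carries the local path of the uploaded file.
    public string Tag { get; set; }
}
```

If Tag really is serialized along with the rest of the object, the response exposes server-side paths to the client, so a production service would probably want to exclude it from the response.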

The client posts the files one after another to the corresponding controllers, gets the results and prints the extracted text to the console. The image below shows the result from Tesseract. The text is really incomprehensible.

Result from Tesseract

4. Conclusion

It’s pretty simple to write a web service for OCR. However, the accuracy depends heavily on the OCR library. Building an OCR library takes years of work and a lot of investment. There are some commercial products on the market but, unfortunately, I don’t have a full version of any of them to test. So far, Microsoft Office Document Imaging is still better than Tesseract, but Office has to be installed on the server for this library to work. I will update this post when I have a chance to test the others.

Source code: https://bitbucket.org/hintdesk/dotnet-an-example-of-ocr-web-service
