C#, iTextSharp – PDF file – Insert/extract image, text, font, text highlighting and auto fill-in

Nowadays the Portable Document Format (PDF) is one of the most popular standards for document exchange. Created by Adobe Systems in 1993, this platform-independent format represents content including text, fonts, images and other information. For a long time, however, PDF files were typically created with Adobe Acrobat Professional, and the format was not designed for editing the content of a file afterwards. Over time more and more people wanted to create PDFs without Adobe Acrobat Professional, or to edit existing PDF files. These wishes led to the birth of many open source libraries for PDF. One of them is iText, a library for creating and manipulating PDF documents. It lets developers enhance web and other applications with dynamic PDF document generation and/or manipulation. In this small blog post I would like to illustrate some features of iTextSharp (http://sourceforge.net/projects/itextsharp/ – a port of iText to the .NET platform) through small examples.

1. Insert image and text to PDF

Imagine we work in a big company that has a lot of documentation in PDF format. Our task is to put the company’s logo at the top of all of these documents. We only have the PDF files; the original editable documents are not available. It would be a nightmare to open each one in Adobe Acrobat Professional and insert the logo by hand, but thanks to iTextSharp we can do this easily.

private static void InsertImageToPdf(string sourceFileName, string imageFileName, string newFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	using (Stream imageStream = new FileStream(imageFileName, FileMode.Open))
	using (Stream newpdfStream = new FileStream(newFileName, FileMode.Create, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		PdfStamper pdfStamper = new PdfStamper(pdfReader, newpdfStream);
		PdfContentByte pdfContentByte = pdfStamper.GetOverContent(1);
		iTextSharp.text.Image image = iTextSharp.text.Image.GetInstance(imageStream);
		image.SetAbsolutePosition(300, 600);
		pdfContentByte.AddImage(image);
		pdfStamper.Close();
	}
}

Whenever we want to insert an object into a PDF file or edit it with iTextSharp, we use PdfStamper together with PdfContentByte, as in the code above. Through GetOverContent(), PdfStamper gives us the layer that lies over a page’s existing content, and we add objects to that layer through functions such as AddImage(). Closing the PdfStamper writes all changes to the new PDF file. Inserting an image is nothing more than that: create an iTextSharp image instance from a normal image and follow the routine above. Inserting text works the same way. However, we can set some more attributes for text, such as font, size, color and rotation, before “pasting” it at a specific location in the PDF file.

private static void InsertTextToPdf(string sourceFileName, string newFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	using (Stream newpdfStream = new FileStream(newFileName, FileMode.Create, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		PdfStamper pdfStamper = new PdfStamper(pdfReader, newpdfStream);
		PdfContentByte pdfContentByte = pdfStamper.GetOverContent(1);
		BaseFont baseFont = BaseFont.CreateFont(BaseFont.TIMES_ROMAN, BaseFont.CP1250, BaseFont.NOT_EMBEDDED);
		pdfContentByte.SetColorFill(BaseColor.BLUE);
		pdfContentByte.SetFontAndSize(baseFont, 8);
		pdfContentByte.BeginText();
		pdfContentByte.ShowTextAligned(PdfContentByte.ALIGN_CENTER, "Kevin Cheng - A Hong Kong actor", 400, 600, 0);
		pdfContentByte.EndText();
		pdfStamper.Close();
	}
}
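Both listings hard-code a position such as (300, 600). As a minimal sketch of how the "logo at the top of every page" position could be computed instead, the helper below works on plain numbers only. The class and method names are my own, and it assumes that coordinates are in points, that the lower-left corner of the page is (0, 0), and that the position handed to SetAbsolutePosition() is the lower-left corner of the image.

```csharp
using System;

public static class LogoPlacement
{
    // Scales (width, height) so that it fits inside (maxWidth, maxHeight)
    // while keeping the aspect ratio. Returns { scaledWidth, scaledHeight }.
    public static float[] ScaleToFit(float width, float height, float maxWidth, float maxHeight)
    {
        float factor = Math.Min(maxWidth / width, maxHeight / height);
        return new[] { width * factor, height * factor };
    }

    // Returns { x, y }: the lower-left corner that centres the logo horizontally
    // and keeps it `margin` points below the top edge of the page.
    public static float[] TopCenter(float pageWidth, float pageHeight,
                                    float logoWidth, float logoHeight, float margin)
    {
        float x = (pageWidth - logoWidth) / 2f;
        float y = pageHeight - margin - logoHeight;
        return new[] { x, y };
    }
}
```

With iTextSharp, the two results could be fed to image.ScaleAbsolute() and image.SetAbsolutePosition() inside a loop over pages 1 to pdfReader.NumberOfPages, taking the page size from pdfReader.GetPageSize(page) and the layer from pdfStamper.GetOverContent(page).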

Before / After: iTextSharp insert image and text

2. Extract text from PDF

Extracting text from a PDF with iTextSharp is also pretty simple. We initialize a PdfReader and call PdfTextExtractor.GetTextFromPage() with an appropriate strategy to get the text of a page.

private static void ExtractTextFromPdf(string newFileNameWithImageAndText, string extractedTextFileName)
{
	using (Stream newpdfStream = new FileStream(newFileNameWithImageAndText, FileMode.Open, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(newpdfStream);
		string text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy());
		File.WriteAllText(extractedTextFileName, text);
	}
}

Interestingly, we don’t have to take all the text of a page or of the complete PDF file; we can restrict the output to the text of a specific region. This is very useful if, for example, we just want the address from a letter in PDF format. Instead of post-processing the text of the whole letter, we define the region where the address is and read only that. For example, we would like to extract the text from the region marked in the image below.

iTextSharp extract text from region

private static void ExtractTextFromRegionOfPdf(string sourceFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		System.util.RectangleJ rect = new System.util.RectangleJ(50, 650, 250, 140);
		RenderFilter[] renderFilter = new RenderFilter[1];
		renderFilter[0] = new RegionTextRenderFilter(rect);
		ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
		Console.WriteLine(PdfTextExtractor.GetTextFromPage(pdfReader, 1, textExtractionStrategy));
	}
}

The code for reading from a region uses the same GetTextFromPage() function, but with a different strategy: a LocationTextExtractionStrategy is wrapped in a FilteredTextRenderListener together with a RegionTextRenderFilter. This filter holds the region from which we want to extract text, in this case a rectangle given as x, y, width and height.
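Regions are often measured from the top-left corner of a page (for example with a ruler on a printed letter), while PDF coordinates start at the bottom-left corner. The following sketch converts between the two. The helper names are hypothetical, and it assumes that the rectangle passed to RegionTextRenderFilter is given as x, y, width, height with y being the lower edge, and that the page’s lower-left corner is (0, 0).

```csharp
public static class RegionHelper
{
    // Converts millimetres to PDF points (1 point = 1/72 inch, 1 inch = 25.4 mm).
    public static float MmToPoints(float mm)
    {
        return mm / 25.4f * 72f;
    }

    // Takes a region measured from the top-left corner of the page (all values in points)
    // and returns { x, y, width, height } measured from the bottom-left corner.
    public static float[] FromTopLeft(float pageHeight, float left, float top, float width, float height)
    {
        float y = pageHeight - top - height;
        return new[] { left, y, width, height };
    }
}
```

On a page that is 842 points high, a region that starts 52 points below the top edge with a height of 140 points ends up at y = 650, which matches the rectangle used in the listing above.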

3. Auto Fill-in PDF form

Now suppose we have an interactive form in PDF format. We would like to send this template to many users with some fields pre-filled, for example their user names and addresses. This is handy when we want to run a survey or prepare a new contract from existing data. iTextSharp can do this too: we get the AcroFields of the PDF, set their values and save the result. As mentioned before, we use PdfStamper for any editing of a PDF file. The image below shows an interactive PDF form as an example.

Interactive PDF form

private static void AutoFillInFormOfPdf(string fillableFormFileName, string newfillableFormFileName)
{
	using (Stream pdfStream = new FileStream(fillableFormFileName, FileMode.Open))
	using (Stream newpdfStream = new FileStream(newfillableFormFileName, FileMode.Create, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		PdfStamper pdfStamper = new PdfStamper(pdfReader, newpdfStream);
		foreach (KeyValuePair<string, iTextSharp.text.pdf.AcroFields.Item> pair in pdfReader.AcroFields.Fields)
		{
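			// pair.Value is an AcroFields.Item; GetField() returns the field's current value as a string.
			Console.WriteLine(pair.Key + " - " + pdfReader.AcroFields.GetField(pair.Key));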
		}
	
		AcroFields acroFields = pdfStamper.AcroFields;
		acroFields.SetField("Text_01", "ServusKevin");
		acroFields.SetField("Radio Button_01", acroFields.GetAppearanceStates("Radio Button_01")[0]);
		acroFields.SetField("Radio Button_02", acroFields.GetAppearanceStates("Radio Button_02")[1]);
		acroFields.SetField("Radio Button_03", acroFields.GetAppearanceStates("Radio Button_03")[2]);
		acroFields.SetField("Check Box_03", acroFields.GetAppearanceStates("Check Box_03")[0]);
		acroFields.SetField("Combo Box_01", pdfReader.AcroFields.GetListOptionDisplay("Combo Box_01")[4]);

		pdfStamper.Close();
	}
}

First I use a loop to list all AcroFields with their names and current values, and then I set the values I want. “Text_01”, “Radio Button_01”, “Radio Button_02”, “Radio Button_03”… are the names of the controls in the form. Setting the text of a text box is easy, but the other components need more care. Radio buttons and check boxes have custom-defined values, and only a correct value makes the control display properly (checked or unchecked). If we set an invalid value, the control is displayed in its default state (normally unchecked, but that depends on the author of the form). The valid values can be enumerated with GetAppearanceStates(), which takes the field name as its argument. This function does not work for a combo box, which is yet another case. To enumerate a combo box’s values we use GetListOptionDisplay(), which returns all available choices. It is a little confusing that each component behaves differently, but once you know the functions it is not complicated anymore.
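The listing above picks appearance states by index, which only works if we already know the order in which the form reports them. As a small sketch, the hypothetical helper below filters out the unchecked state instead. It assumes that the unchecked state is named “Off”, which is the usual convention for check boxes and radio buttons, and that the array comes from GetAppearanceStates().

```csharp
using System;
using System.Collections.Generic;

public static class AppearanceStates
{
    // Returns every state except the conventional "Off" state, in the original order.
    public static List<string> OnStates(string[] states)
    {
        List<string> result = new List<string>();
        if (states == null)
            return result;
        foreach (string state in states)
        {
            if (!string.Equals(state, "Off", StringComparison.Ordinal))
                result.Add(state);
        }
        return result;
    }
}
```

For a check box, the first (and usually only) entry of the result is the value that ticks the box; for a radio group, each entry corresponds to one button of the group.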

Source code : https://bitbucket.org/hintdesk/dotnet-itextsharp-pdf-file-insertextract-imagetext-and-auto

4. Updates

4.1 Extract list of fonts used in PDF file – 27.04.2012

Try the listing below.

private static void ExtractFontNameOfPdf(string sourceFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		List<BaseFont> set = new List<BaseFont>();
		PdfDictionary resources;

		for (int index = 1; index <= pdfReader.NumberOfPages; index++)
		{
			resources = pdfReader.GetPageN(index).GetAsDict(PdfName.RESOURCES);
			ProcessResource(set, resources);
		}

		foreach (BaseFont item in set)
			Console.WriteLine(item.PostscriptFontName + " " + item.FontType.ToString());
	}
}

private static void ProcessResource(List<BaseFont> set, PdfDictionary resources)
{
	if (resources == null)
		return;
	PdfDictionary xObjects = resources.GetAsDict(PdfName.XOBJECT);
	if (xObjects != null)
	{
		foreach (PdfName key in xObjects.Keys)
		{
			ProcessResource(set, xObjects.GetAsDict(key));
		}
	}

	PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);

	if (fonts == null)
		return;
	foreach (PdfName key in fonts.Keys)
	{
		// A hard cast would throw for fonts stored as direct dictionaries; "as" lets the null check skip them.
		PRIndirectReference iRef = fonts.Get(key) as PRIndirectReference;
		if (iRef != null)
			set.Add(BaseFont.CreateFont(iRef));
	}
}

4.2 Highlighting text in existing PDF file – 30.07.2012

private static void ChangeTextColorOfPdf(string sourceFileName)
{
	// Read the whole file into memory first: the stamper must not write into the stream the reader came from.
	PdfReader pdfReader = new PdfReader(File.ReadAllBytes(sourceFileName));
	using (MemoryStream output = new MemoryStream())
	{
		using (PdfStamper stamper = new PdfStamper(pdfReader, output))
		{
			iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(130, 635, 230, 650);
			float[] quadPoints = { rect.Left, rect.Bottom, rect.Right, rect.Bottom, rect.Left, rect.Top, rect.Right, rect.Top };
			PdfAnnotation highlight = PdfAnnotation.CreateMarkup(stamper.Writer, rect, null, PdfAnnotation.MARKUP_HIGHLIGHT, quadPoints);
			highlight.Color = BaseColor.GREEN;
			stamper.AddAnnotation(highlight, 1);
		}
		// Disposing the stamper has finished the document; now overwrite the original file.
		File.WriteAllBytes(sourceFileName, output.ToArray());
	}
	pdfReader.Close();
	Console.WriteLine("Text was highlighted");
}

4.3 Extract all text from .pdf – 18.08.2013

Use parallel tasks to extract the text of a .pdf file. Each task handles one range of pages and writes its text to its own .txt file.

using (Stream newpdfStream = new FileStream(newFileNameWithImageAndText, FileMode.Open, FileAccess.ReadWrite))
{
	PdfReader pdfReader = new PdfReader(newpdfStream);

	int pageSize = (int)Math.Ceiling((double)pdfReader.NumberOfPages / (double)(Environment.ProcessorCount * 2));
	int numberOfThreads = (int)Math.Ceiling((double)pdfReader.NumberOfPages / (double)pageSize);
	IList<Task> tasks = new List<Task>();
	for (int index = 0; index < numberOfThreads; index++)
	{
		int currentIndex = index;
		int page = Math.Min((index + 1) * pageSize, pdfReader.NumberOfPages);
		tasks.Add(Task.Factory.StartNew<string>(() =>
			{
				StringBuilder taskResult = new StringBuilder();
				for (int subIndex = currentIndex * pageSize + 1; subIndex <= page; subIndex++)
					taskResult.Append(PdfTextExtractor.GetTextFromPage(pdfReader, subIndex, new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy()));
				return taskResult.ToString();
			})
			.ContinueWith((t) => File.WriteAllText(currentIndex.ToString() + ".txt", t.Result)));
	}

	Task.WaitAll(tasks.ToArray());
	Console.WriteLine("Finish");
}
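The page arithmetic in the listing above is easier to check in isolation. Here is a minimal sketch of the same split as a standalone function; the names are my own and it does not touch iTextSharp at all. It treats an empty document explicitly and uses integer arithmetic instead of Math.Ceiling on doubles.

```csharp
using System;
using System.Collections.Generic;

public static class PageChunks
{
    // Splits pages 1..pageCount into at most maxChunks contiguous ranges.
    // Each int[] holds { firstPage, lastPage }, both inclusive and 1-based.
    public static List<int[]> Split(int pageCount, int maxChunks)
    {
        List<int[]> ranges = new List<int[]>();
        if (pageCount <= 0 || maxChunks <= 0)
            return ranges;

        // Integer ceiling of pageCount / maxChunks.
        int chunkSize = (pageCount + maxChunks - 1) / maxChunks;
        for (int first = 1; first <= pageCount; first += chunkSize)
        {
            int last = Math.Min(first + chunkSize - 1, pageCount);
            ranges.Add(new[] { first, last });
        }
        return ranges;
    }
}
```

Each range could then be handed to one task, with maxChunks set to Environment.ProcessorCount * 2 as in the listing. The listing shares a single PdfReader between all tasks; I have not verified that this is safe, so opening one reader per task may be the more cautious choice.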

4.4 Get roman page numbers – 21.03.2015

Get the page labels of the first pages, such as the cover, back cover and table of contents, which are often numbered with roman numerals.

private static void ExtractRomanPageNumbers(string sourceFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		
		foreach (string s in GetRomanPageNumbers(pdfReader))
			Console.WriteLine(s);

	}
}

private static IEnumerable<string> GetRomanPageNumbers(PdfReader pdfReader)
{
	int n = pdfReader.NumberOfPages;

	PdfDictionary dict = pdfReader.Catalog;
	PdfDictionary labels = (PdfDictionary)PdfReader.GetPdfObjectRelease(dict.Get(PdfName.PAGELABELS));
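	// No page labels defined: return an empty list so that callers can still iterate over the result.
	if (labels == null)
		return new string[0];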

	String[] labelstrings = new String[n];
	Dictionary<int, PdfObject> numberTree = PdfNumberTree.ReadTree(labels);

	int pagecount = 1;
	String prefix = "";
	char type = 'D';
	for (int i = 0; i < n; i++)
	{
		if (numberTree.ContainsKey(i))
		{
			PdfDictionary d = (PdfDictionary)PdfReader.GetPdfObjectRelease(numberTree[i]);
			if (d.Contains(PdfName.ST))
			{
				pagecount = ((PdfNumber)d.Get(PdfName.ST)).IntValue;
			}
			else
			{
				pagecount = 1;
			}
			// Each label range stands on its own: a range without a prefix entry has no prefix.
			prefix = d.Contains(PdfName.P) ? ((PdfString)d.Get(PdfName.P)).ToUnicodeString() : "";
			if (d.Contains(PdfName.S))
			{
				type = ((PdfName)d.Get(PdfName.S)).ToString()[1];
			}
		}
		// The label of a page is the prefix followed by the number in the style of its range.
		switch (type)
		{
			default:
				labelstrings[i] = prefix + pagecount.ToString();
				break;
			case 'R':
				labelstrings[i] = prefix + RomanNumberFactory.GetUpperCaseString(pagecount);
				break;
			case 'r':
				labelstrings[i] = prefix + RomanNumberFactory.GetLowerCaseString(pagecount);
				break;
			case 'A':
				labelstrings[i] = prefix + RomanAlphabetFactory.GetUpperCaseString(pagecount);
				break;
			case 'a':
				labelstrings[i] = prefix + RomanAlphabetFactory.GetLowerCaseString(pagecount);
				break;
		}
		pagecount++;
	}
	return labelstrings;
}
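To make the 'r' case above a little less magical, here is a small sketch of how a page count can be turned into lower-case roman numerals. It is a hypothetical stand-in written for illustration, not iTextSharp’s RomanNumberFactory, and it may behave differently for very large numbers.

```csharp
using System;
using System.Text;

public static class RomanNumerals
{
    private static readonly int[] Values = { 1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1 };
    private static readonly string[] Symbols = { "m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i" };

    // Greedy conversion: repeatedly append the largest symbol that still fits into the remainder.
    public static string ToLowerRoman(int number)
    {
        if (number < 1)
            throw new ArgumentOutOfRangeException("number", "Roman numerals start at 1.");

        StringBuilder result = new StringBuilder();
        for (int i = 0; i < Values.Length; i++)
        {
            while (number >= Values[i])
            {
                result.Append(Symbols[i]);
                number -= Values[i];
            }
        }
        return result.ToString();
    }
}
```

For example, RomanNumerals.ToLowerRoman(14) returns "xiv"; calling ToUpperInvariant() on the result gives the style of the 'R' case.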

4.5 Add annotation – 13.03.2017

private static void AddAnnotation(string fileName)
{
	var result = StampPDFDocument(File.ReadAllBytes(fileName), "hintdesk annotation");
	File.WriteAllBytes("stampedTest.pdf",result);
	Console.WriteLine("Add annotations successfully");
}

private static byte[] StampPDFDocument(byte[] pdf, string stampString)
{
	using (var ms = new MemoryStream())
	{
		var reader = new iTextSharp.text.pdf.PdfReader(pdf);
		var stamper = new iTextSharp.text.pdf.PdfStamper(reader, ms);

		int rotation = reader.GetPageRotation(1);

		var box = reader.GetPageSizeWithRotation(1);
		var cropbox = reader.GetCropBox(1);

		float left = cropbox.Left;
		float top = cropbox.Top;

		if (rotation == 90)
		{
			left = cropbox.Bottom;
			top = box.Height - cropbox.Left;
			cropbox = new iTextSharp.text.Rectangle(left, top, left + cropbox.Height, top - cropbox.Width);
		}
		else if (rotation == 180)
		{
			left = box.Width - cropbox.Left - cropbox.Width;
			top = box.Height - cropbox.Bottom;
			cropbox = new iTextSharp.text.Rectangle(left, top, left + cropbox.Width, top - cropbox.Height);
		}
		else if (rotation == 270)
		{
			left = box.Width - cropbox.Top;
			top = cropbox.Right;
			cropbox = new iTextSharp.text.Rectangle(left, top, left + cropbox.Height, top - cropbox.Width);
		}

		iTextSharp.text.Rectangle newRectangle = new iTextSharp.text.Rectangle(left + 20, top - 20, left + 250, top - 40);

		var pcb = new iTextSharp.text.pdf.PdfContentByte(stamper.Writer);
		pcb.SetColorFill(iTextSharp.text.BaseColor.RED);

		var annot = iTextSharp.text.pdf.PdfAnnotation.CreateFreeText(stamper.Writer, newRectangle, stampString, pcb);
		annot.Flags = iTextSharp.text.pdf.PdfAnnotation.FLAGS_PRINT;
		annot.Rotate = reader.GetPageRotation(1);

		annot.BorderStyle = new iTextSharp.text.pdf.PdfBorderDictionary(0, 0);
		stamper.AddAnnotation(annot, 1);
		stamper.Close();
		return ms.ToArray();
	}
}

C#, AForge.Net – Examples for average color and motion detection

When I was a student at the chair of Data and Signal Processing at TUM (http://tum.de), I took a course on Computer Vision, which covers image processing and its uses in real applications. Matlab is often used there for calculations, for evaluating algorithms and for plotting data. For .NET, however, I would like to introduce another library which is also powerful for image processing: AForge.NET (http://aforgenet.com).

AForge.NET is a C# framework designed for developers and researchers in the fields of Computer Vision and Artificial Intelligence – image processing, neural networks, genetic algorithms, machine learning, robotics, etc.


Tor – How to install for specific country and reset identity with C#?

For a private tool of mine, I need to change my IP address on demand and use only IPs from a specific country. I don’t have much money to buy a private VPN or a premium service for HTTP/SOCKS proxies. Even if I bought such a service, the pool of IPs would still be very limited; I think we can’t get more than 10 new IPs per day, which is not enough for me. Besides, I have no desire to extend my current source code to log in to a service, fetch a list of proxies and apply them to my web browser. I need an integrated solution which is configured once, keeps working, and does not take too much time to implement. Therefore I thought of Tor (https://www.torproject.org/).

Tor is free software and an open network that helps you defend against a form of network surveillance that threatens personal freedom and privacy, confidential business activities and relationships, and state security known as traffic analysis.

Using Tor, I can hide completely behind a proxy, and thanks to its powerful and flexible configuration I can choose which exit nodes my traffic leaves through. In this small blog post I will show you how to install Tor, restrict the exit nodes to a specific country and request a new identity with C#.

1. Go to the Tor homepage (https://www.torproject.org/) and click on “Download Tor”; you will see a direct link to download the “Tor Browser Bundle”. Don’t download it, because it is meant for normal users who need an anonymous identity when surfing the internet through an integrated browser. I need something more powerful, because I would like to run a proxy and use it with any web browser I like. Therefore, click on “View All Downloads” as in the image below.

Tor View all downloads

2. Choose “Vidalia Bundle”, download and install it with all default settings.

Vidalia Bundle

3. After the installation has finished, start Vidalia (a cross-platform graphical controller for the Tor software, built using the Qt framework) and wait until the authentication process is done. If it is not successful, check whether your firewall allows Vidalia to access the internet. If it is successful, you’ll see “Connected to the Tor network”.

4. Now we have a local SOCKS proxy running on our computer, and we would like to check that it works. Start Internet Explorer –> Internet Options –> Connections –> LAN settings –> check “Use a proxy server for your LAN…” and click on Advanced. In the Socks text box, enter “127.0.0.1” as the address and 9050 as the port. Click OK to finish. Then go to any IP-checking website to verify that we are now anonymous.

Internet Explorer Tor Settings

5. This integrated solution is much better than adding proxies manually or programmatically to the web browser. If we add proxies ourselves, we must first check whether each proxy works and whether it is too slow to use; that requires a lot of programming and costs a lot of time at run time. With Tor this is not the case, because we don’t have to change the proxy, and Tor itself checks the exit nodes and picks working ones that are as fast as possible. Moreover, with Tor we get a new identity every time we start a new instance of Internet Explorer.

6. Now the proxy runs smoothly, but it will pick exit nodes anywhere in the world. Again, that’s not what we want. We want, for example, to use only exit nodes in Germany. Thanks to Tor’s flexible configuration, we can exclude the exit nodes of other countries which we don’t trust or don’t want to go through. In the Vidalia Control Panel: Settings –> Advanced –> Edit current torrc –> insert the ExcludeNodes line below into the configuration file –> click OK.

ExcludeNodes {be},{pl},{ca},{za},{vn},{uz},{ua},{tw},{tr},{th},{sk},{sg},{se},{sd},{sa},{ru},{ro},{pt},{ph},{pa},{nz},{np},{no},{my},{mx},{md},{lv},{lu},{kr},{jp},{it},{ir},{il},{ie},{id},{hr},{hk},{gr},{gi},{gb},{fi},{es},{ee},{dk},{cz},{cy},{cr},{co},{cn},{cl},{ci},{ch},{by},{br},{bg},{au},{at},{ar},{aq},{ao},{ae},{nl},{us},{fr},{lt}

7. Let’s start some instances of Internet Explorer; we’ll see that we now always exit through German nodes. On the internet you can find other ways to exclude nodes, for example by name or by fingerprint. Choose the method that fits how you would like to hide yourself. Now for the last requirement. It’s fine to start a new instance with a new identity, but what should I do if I want a new identity within a running instance? Suppose you are a programmer and have already started an instance of the web browser control. You now want a new identity, but you can’t dispose of the old control and initialize a new one, because that costs a lot of time and resources. We must find out how to force Tor to change its identity. Vidalia has a built-in function for getting a new identity.

New Identity
Of course it’s entirely possible to call this function from our program, but again I don’t want to make things so complex. I would rather talk to Tor directly and order it to reset the identity by sending a message over a socket. This is often described as a telnet-style interface: Tor lets us “telnet” to its control port for remote control. But first we need to configure Tor to allow this.

8. Go to Settings –> Advanced –> untick “Randomly Generate” and enter your password. You have to use this password to connect to the control port.

Tor Control Host Password Settings

9. In your program, connect to the control port and send the message that resets the identity.

private bool RequestNewIdentityFromTor()
{
	IPEndPoint ip = new IPEndPoint(IPAddress.Parse("127.0.0.1"), 9051);
	Socket client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
	try
	{
		client.Connect(ip);
	}
	catch (SocketException)
	{
		MessageBox.Show("Unable to connect to server of Tor.");
		return false;
	}

	// Control commands end with CRLF, the same terminator the SIGNAL command below uses.
	client.Send(Encoding.ASCII.GetBytes("AUTHENTICATE \"YourPassword\"\r\n"));
	byte[] data = new byte[1024];
	int receivedDataLength = client.Receive(data);
	string stringData = Encoding.ASCII.GetString(data, 0, receivedDataLength);

	if (stringData.Contains("250"))
	{
		client.Send(Encoding.ASCII.GetBytes("SIGNAL NEWNYM\r\n"));
		data = new byte[1024];
		receivedDataLength = client.Receive(data);
		stringData = Encoding.ASCII.GetString(data, 0, receivedDataLength);
		if (!stringData.Contains("250"))
		{
			MessageBox.Show("Unable to signal new user to server of Tor.");
			client.Shutdown(SocketShutdown.Both);
			client.Close();
			return false;
		}
	}
	else
	{
		MessageBox.Show("Unable to authenticate to server of Tor.");
		client.Shutdown(SocketShutdown.Both);
		client.Close();
		return false;
	}
	client.Shutdown(SocketShutdown.Both);
	client.Close();
	return true;
}

10. We have now fulfilled all requirements. But there are still some considerations about the security of Tor that we should discuss. Our traffic reaches the destination website through several nodes, and whoever operates the exit node can observe or tamper with any traffic that is not encrypted end to end. If a hacker operates a node in the network, he can follow the network traffic or do who knows what to get our data, for example mail passwords, FTP passwords, bank passwords… So I recommend using Tor only for transmitting non-sensitive data; don’t use Tor for checking email or online banking. You know that nothing is free. If you’re interested in Tor security, here are some helpful links:
https://www.torproject.org/about/overview.html.en
http://en.wikipedia.org/wiki/Tor_%28anonymity_network%29
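One more note on the control-port code from step 9: it checks replies with Contains("250"), which would also match an error message that merely mentions 250, and it pastes the password into the command unescaped. The sketch below shows a slightly stricter variant. The helper names are my own, and it assumes, as I understand the control protocol, that the password is sent as a quoted string with backslash escapes and that the final line of a successful reply starts with “250 ”.

```csharp
using System;

public static class TorControl
{
    // Builds an AUTHENTICATE command with the password as a quoted string,
    // escaping backslashes and double quotes inside it.
    public static string BuildAuthenticateCommand(string password)
    {
        if (password == null)
            throw new ArgumentNullException("password");
        string escaped = password.Replace("\\", "\\\\").Replace("\"", "\\\"");
        return "AUTHENTICATE \"" + escaped + "\"\r\n";
    }

    // Returns true when the last non-empty line of the reply starts with "250 ".
    public static bool IsSuccess(string reply)
    {
        if (string.IsNullOrEmpty(reply))
            return false;
        string[] lines = reply.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0)
            return false;
        return lines[lines.Length - 1].StartsWith("250 ", StringComparison.Ordinal);
    }
}
```

In RequestNewIdentityFromTor(), the command string would be passed to Encoding.ASCII.GetBytes() as before, and each stringData.Contains("250") check would become TorControl.IsSuccess(stringData).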