iTextSharp – Frequently asked questions – Part 2

In the previous part, C#,iTextSharp – PDF file – Insert/extract image,text,font, text highlighting and auto fillin, I collected code listings for typical iTextSharp features used when editing PDF files. That post kept growing and became too long, so I would like to continue the work in this second one.


C#,iTextSharp – PDF file – Insert/extract image,text,font, text highlighting and auto fillin

Nowadays, Portable Document Format (PDF) is one of the most popular standards for document exchange. Created by Adobe Systems in 1993, this platform-independent format represents content including text, fonts, images and other information. In the beginning, however, PDF files could only be created with Adobe Acrobat Professional, and the format does not let the user easily edit the content of a file. Over time, more and more people wanted to create PDF files without Adobe Acrobat Professional, or to edit an existing PDF file. These wishes led to the birth of many open source libraries for PDF. One of them is iText, a library for creating and manipulating PDF documents. It lets developers enhance web and other applications with dynamic PDF document generation and/or manipulation. In this small blog post I would like to illustrate some features of iTextSharp (http://sourceforge.net/projects/itextsharp/ – a port of iText to the .NET platform) through small examples.

1. Insert image and text to PDF

Imagine that we work in a big company that has a lot of documentation in PDF format. Our task is to put the company's logo at the top of all of these documents. We only have the PDF files; the original editable documents are not available. It would be a nightmare to open each file with Adobe Acrobat Professional and insert the logo by hand, but thanks to iTextSharp we can do this easily:

private static void InsertImageToPdf(string sourceFileName, string imageFileName, string newFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	using (Stream imageStream = new FileStream(imageFileName, FileMode.Open))
	using (Stream newpdfStream = new FileStream(newFileName, FileMode.Create, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		PdfStamper pdfStamper = new PdfStamper(pdfReader, newpdfStream);
		PdfContentByte pdfContentByte = pdfStamper.GetOverContent(1);
		iTextSharp.text.Image image = iTextSharp.text.Image.GetInstance(imageStream);
		image.SetAbsolutePosition(300, 600);
		pdfContentByte.AddImage(image);
		pdfStamper.Close();
	}
}

Whenever we want to insert an object into a PDF file or edit it with iTextSharp, we use PdfStamper plus PdfContentByte as in the code above. PdfStamper gives us the content layer on top of a page through GetOverContent(), and we add objects to that layer through its Addxxx() functions. Closing the pdfStamper writes all changes to the new PDF file. Inserting an image is nothing more than that: create an iTextSharp image instance from a normal image and follow the routine above. Inserting text works the same way. However, we can set some more attributes for text, such as font, size, color and rotation, before “pasting” it at a specific location in the PDF file.

private static void InsertTextToPdf(string sourceFileName, string newFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	using (Stream newpdfStream = new FileStream(newFileName, FileMode.Create, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		PdfStamper pdfStamper = new PdfStamper(pdfReader, newpdfStream);
		PdfContentByte pdfContentByte = pdfStamper.GetOverContent(1);
		BaseFont baseFont = BaseFont.CreateFont(BaseFont.TIMES_ROMAN, BaseFont.CP1250, BaseFont.NOT_EMBEDDED);
		pdfContentByte.SetColorFill(BaseColor.BLUE);
		pdfContentByte.SetFontAndSize(baseFont, 8);
		pdfContentByte.BeginText();
		pdfContentByte.ShowTextAligned(PdfContentByte.ALIGN_CENTER, "Kevin Cheng - A Hong Kong actor", 400, 600, 0);
		pdfContentByte.EndText();
		pdfStamper.Close();
	}
}

Before

iTextSharp insert image and text

After

iTextSharp insert image and text
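The task described at the start of this section was to put the logo on all pages, while the listing above only stamps page 1. Below is a minimal sketch of how the same routine could be repeated for every page. The class and method names, the logo size and the margin are my own assumptions and not part of the original source code; it relies on the iTextSharp 5 calls PdfReader.GetPageSize(), Image.ScaleToFit() and Image.ScaledHeight.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class LogoStamper
{
    // Hypothetical helper: stamps the same image near the top-left corner of every page.
    public static void InsertImageOnAllPages(string sourceFileName, string imageFileName, string newFileName)
    {
        using (Stream newPdfStream = new FileStream(newFileName, FileMode.Create, FileAccess.Write))
        {
            PdfReader pdfReader = new PdfReader(sourceFileName);
            PdfStamper pdfStamper = new PdfStamper(pdfReader, newPdfStream);

            Image image = Image.GetInstance(imageFileName);
            image.ScaleToFit(100f, 50f);   // assumed maximum logo size in points
            float margin = 20f;            // assumed distance from the page border

            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // PDF coordinates start at the bottom-left corner, so the top edge is pageSize.Top.
                Rectangle pageSize = pdfReader.GetPageSize(page);
                image.SetAbsolutePosition(pageSize.Left + margin,
                                          pageSize.Top - image.ScaledHeight - margin);
                pdfStamper.GetOverContent(page).AddImage(image);
            }

            pdfStamper.Close();   // writes the stamped document to newPdfStream
            pdfReader.Close();
        }
    }
}
```

Computing the position from each page's size, instead of using fixed coordinates like (300, 600), should keep the logo near the top even when pages have different formats.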

2. Extract text from PDF

Extracting text from a PDF with iTextSharp is also pretty simple. We initialize a PdfReader and call GetTextFromPage() of PdfTextExtractor with an appropriate strategy to get all the text we need.

private static void ExtractTextFromPdf(string newFileNameWithImageAndText, string extractedTextFileName)
{
	using (Stream newpdfStream = new FileStream(newFileNameWithImageAndText, FileMode.Open, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(newpdfStream);
		string text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy());
		File.WriteAllText(extractedTextFileName, text);
	}
}

Interestingly, we don't have to read all the text of a page or of the complete PDF file; we can choose to read only the text in a specific region. This is very useful if we just want to read the address of a letter in a PDF. We don't need to read the whole letter, which costs time and resources. We just define the region where the address is and read it out. For example, we would like to extract the text from the region shown in the image below:

iTextSharp extract text from region

private static void ExtractTextFromRegionOfPdf(string sourceFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		System.util.RectangleJ rect = new System.util.RectangleJ(50, 650, 250, 140);
		RenderFilter[] renderFilter = new RenderFilter[1];
		renderFilter[0] = new RegionTextRenderFilter(rect);
		ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
		Console.WriteLine(PdfTextExtractor.GetTextFromPage(pdfReader, 1, textExtractionStrategy));
	}
}

The code for reading from a region uses the same function GetTextFromPage(), but with another strategy, LocationTextExtractionStrategy. This strategy is passed to an instance of FilteredTextRenderListener together with a RegionTextRenderFilter. This filter contains the region from which we want to extract the text, in this case a rectangle.
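If the region is the same on every page (for example a header area), the routine above can be wrapped in a loop. The following is only a sketch under that assumption; the class and method names are hypothetical. As far as I understand, the RectangleJ arguments are x, y, width and height in PDF coordinates, with the origin at the bottom-left corner of the page.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class RegionTextReader
{
    // Hypothetical helper: returns the text found inside the same rectangle on every page.
    public static IList<string> ExtractRegionFromAllPages(string sourceFileName,
        float x, float y, float width, float height)
    {
        List<string> result = new List<string>();
        PdfReader pdfReader = new PdfReader(sourceFileName);
        try
        {
            System.util.RectangleJ rect = new System.util.RectangleJ(x, y, width, height);
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // A new strategy per page, because the strategy collects text while parsing.
                RenderFilter[] filters = { new RegionTextRenderFilter(rect) };
                ITextExtractionStrategy strategy =
                    new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filters);
                result.Add(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            pdfReader.Close();
        }
        return result;
    }
}
```

Calling ExtractRegionFromAllPages(fileName, 50, 650, 250, 140) would use the same rectangle as the listing above.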

3. Auto Fill-in PDF form

Imagine that we have an interactive form in PDF format. We would like to send this template to many users with some fields already filled in; for example, the user name and address are filled in automatically. This is handy when we want to run a survey or create a new contract from existing data. We can accomplish this with the help of iTextSharp: get the AcroFields of the PDF, set their values and save them back. Of course, as I said before, we should use PdfStamper for any editing action on a PDF file. The image below shows an interactive PDF form as an example:

Interactive PDF form

private static void AutoFillInFormOfPdf(string fillableFormFileName, string newfillableFormFileName)
{
	using (Stream pdfStream = new FileStream(fillableFormFileName, FileMode.Open))
	using (Stream newpdfStream = new FileStream(newfillableFormFileName, FileMode.Create, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		PdfStamper pdfStamper = new PdfStamper(pdfReader, newpdfStream);
		foreach (KeyValuePair<string, iTextSharp.text.pdf.AcroFields.Item> pair in pdfReader.AcroFields.Fields)
		{
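			// pair.Value is an AcroFields.Item; GetField() returns the current value of the field.
			Console.WriteLine(pair.Key + " - " + pdfReader.AcroFields.GetField(pair.Key));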
		}
	
		AcroFields acroFields = pdfStamper.AcroFields;
		acroFields.SetField("Text_01", "ServusKevin");
		acroFields.SetField("Radio Button_01", acroFields.GetAppearanceStates("Radio Button_01")[0]);
		acroFields.SetField("Radio Button_02", acroFields.GetAppearanceStates("Radio Button_02")[1]);
		acroFields.SetField("Radio Button_03", acroFields.GetAppearanceStates("Radio Button_03")[2]);
		acroFields.SetField("Check Box_03", acroFields.GetAppearanceStates("Check Box_03")[0]);
		acroFields.SetField("Combo Box_01", pdfReader.AcroFields.GetListOptionDisplay("Combo Box_01")[4]);

		pdfStamper.Close();
	}
}

First I use a loop to list all AcroFields with their names and current values, and then I set them to what I want. “Text_01”, “Radio Button_01”, “Radio Button_02”, “Radio Button_03”… are the names of the controls in the form. Although we can easily set the text of a text box, the other components are a different matter. Radio buttons and check boxes have custom-defined values. Only setting a correct value displays the control correctly (checked or not checked). If we set a wrong value, the control is displayed in its default state (normally unchecked; it depends on the author of the form). These values can be enumerated with the GetAppearanceStates() function, with the name of the field as argument. However, this function does not work for a combo box, which is yet another case. To enumerate a combo box's values, we should use GetListOptionDisplay(). This function returns all available choices of the combo box. It's a little confusing that each component has its own behavior, but once you know the functions it's not complicated anymore.
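Before setting values, it helps to see which values each control accepts. The sketch below prints them per field, choosing the enumeration function by field type. The class and method names are hypothetical; GetFieldType() and the FIELD_TYPE_* constants are taken from the iTextSharp 5 AcroFields class.

```csharp
using System;
using iTextSharp.text.pdf;

public static class FormInspector
{
    // Hypothetical helper: prints the accepted values (or the current value) of every form field.
    public static void PrintFieldOptions(string fillableFormFileName)
    {
        PdfReader pdfReader = new PdfReader(fillableFormFileName);
        try
        {
            AcroFields acroFields = pdfReader.AcroFields;
            foreach (string name in acroFields.Fields.Keys)
            {
                string[] options;
                switch (acroFields.GetFieldType(name))
                {
                    case AcroFields.FIELD_TYPE_CHECKBOX:
                    case AcroFields.FIELD_TYPE_RADIOBUTTON:
                        // Check boxes and radio buttons: custom-defined appearance states.
                        options = acroFields.GetAppearanceStates(name) ?? new string[0];
                        Console.WriteLine(name + " accepts: " + string.Join(", ", options));
                        break;
                    case AcroFields.FIELD_TYPE_COMBO:
                    case AcroFields.FIELD_TYPE_LIST:
                        // Combo and list boxes: the displayed choices.
                        options = acroFields.GetListOptionDisplay(name) ?? new string[0];
                        Console.WriteLine(name + " offers: " + string.Join(", ", options));
                        break;
                    default:
                        // Text boxes and other fields: just show the current value.
                        Console.WriteLine(name + " = " + acroFields.GetField(name));
                        break;
                }
            }
        }
        finally
        {
            pdfReader.Close();
        }
    }
}
```

With this output at hand, it should be clearer which index to pick in calls such as GetAppearanceStates("Radio Button_01")[0] instead of guessing.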

Source code : https://bitbucket.org/hintdesk/dotnet-itextsharp-pdf-file-insertextract-imagetext-and-auto

4. Updates

4.1 Extract list of fonts used in PDF file – 27.04.2012

Try the listing below.

private static void ExtractFontNameOfPdf(string sourceFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		List<BaseFont> set = new List<BaseFont>();
		PdfDictionary resources;

		for (int index = 1; index <= pdfReader.NumberOfPages; index++)
		{
			resources = pdfReader.GetPageN(index).GetAsDict(PdfName.RESOURCES);
			ProcessResource(set, resources);
		}

		foreach (BaseFont item in set)
			Console.WriteLine(item.PostscriptFontName + " " + item.FontType.ToString());
	}
}

private static void ProcessResource(List<BaseFont> set, PdfDictionary resources)
{
	if (resources == null)
		return;
	PdfDictionary xObjects = resources.GetAsDict(PdfName.XOBJECT);
	if (xObjects != null)
	{
		foreach (PdfName key in xObjects.Keys)
		{
			ProcessResource(set, xObjects.GetAsDict(key));
		}
	}

	PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);

	if (fonts == null)
		return;
	foreach (PdfName key in fonts.Keys)
	{
		// Only fonts stored as indirect references can be loaded with BaseFont.CreateFont();
		// "as" avoids an InvalidCastException for fonts that are embedded directly.
		PRIndirectReference iRef = fonts.Get(key) as PRIndirectReference;
		if (iRef != null)
			set.Add(BaseFont.CreateFont(iRef));
	}
}

4.2 Highlighting text in existing PDF file – 30.07.2012

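private static void ChangeTextColorOfPdf(string sourceFileName, string newFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	using (Stream newpdfStream = new FileStream(newFileName, FileMode.Create, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		// Write to a separate output stream; stamping into the stream that is being read would corrupt the file.
		using (PdfStamper stamper = new PdfStamper(pdfReader, newpdfStream))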
		{
			iTextSharp.text.Rectangle rect = new Rectangle(130, 635, 230, 650);
			float[] quadPoints = { rect.Left, rect.Bottom, rect.Right, rect.Bottom, rect.Left, rect.Top, rect.Right, rect.Top };
			PdfAnnotation highlight = PdfAnnotation.CreateMarkup(stamper.Writer, rect, null, PdfAnnotation.MARKUP_HIGHLIGHT, quadPoints);
			highlight.Color = BaseColor.GREEN;
			stamper.AddAnnotation(highlight, 1);
		}
		Console.WriteLine("Text was highlighted");
	}
}

4.3 Extract all text from .pdf – 18.08.2013

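Use parallel tasks to extract the text of a .pdf file. The pages are split into chunks, and each task writes the text of its chunk to its own .txt file.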

using (Stream newpdfStream = new FileStream(newFileNameWithImageAndText, FileMode.Open, FileAccess.ReadWrite))
{
	PdfReader pdfReader = new PdfReader(newpdfStream);

	int pageSize = (int)Math.Ceiling((double)pdfReader.NumberOfPages / (double)(Environment.ProcessorCount * 2));
	int numberOfThreads = (int)Math.Ceiling((double)pdfReader.NumberOfPages / (double)pageSize);
	IList<Task> tasks = new List<Task>();
	for (int index = 0; index < numberOfThreads; index++)
	{
		int currentIndex = index;
		int page = Math.Min((index + 1) * pageSize, pdfReader.NumberOfPages);
		tasks.Add(Task.Factory.StartNew<string>(() =>
			{
				StringBuilder taskResult = new StringBuilder();
				for (int subIndex = currentIndex * pageSize + 1; subIndex <= page; subIndex++)
					taskResult.Append(PdfTextExtractor.GetTextFromPage(pdfReader, subIndex, new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy()));
				return taskResult.ToString();
			})
			.ContinueWith((t) => File.WriteAllText(currentIndex.ToString() + ".txt", t.Result)));
	}

	Task.WaitAll(tasks.ToArray());
	Console.WriteLine("Finish");
}
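The page arithmetic in the listing above is the part that is easiest to get wrong, so here it is isolated as a small sketch without any iTextSharp dependency. The class and method names are hypothetical; the calculation is intended to produce the same inclusive, 1-based page ranges as the pageSize and numberOfThreads variables above.

```csharp
using System;
using System.Collections.Generic;

public static class PageChunks
{
    // Splits pages 1..numberOfPages into at most desiredChunks ranges.
    // Each entry holds the first page (Key) and the last page (Value), both inclusive.
    public static List<KeyValuePair<int, int>> Split(int numberOfPages, int desiredChunks)
    {
        List<KeyValuePair<int, int>> ranges = new List<KeyValuePair<int, int>>();
        if (numberOfPages <= 0 || desiredChunks <= 0)
            return ranges;

        // Pages per chunk, rounded up so that no page is left over.
        int pageSize = (int)Math.Ceiling((double)numberOfPages / desiredChunks);

        for (int first = 1; first <= numberOfPages; first += pageSize)
        {
            int last = Math.Min(first + pageSize - 1, numberOfPages);
            ranges.Add(new KeyValuePair<int, int>(first, last));
        }
        return ranges;
    }
}
```

For example, Split(10, 4) gives the ranges 1 to 3, 4 to 6, 7 to 9 and 10 to 10. The early return also covers an empty document, where the division in the listing above would otherwise fail.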

4.4 Get roman page numbers – 21.03.2015

Get the Roman page numbers of the first pages, such as the cover, back cover, table of contents…

private static void ExtractRomanPageNumbers(string sourceFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		
		foreach (string s in GetRomanPageNumbers(pdfReader))
			Console.WriteLine(s);

	}
}

private static IEnumerable<string> GetRomanPageNumbers(PdfReader pdfReader)
{
	int n = pdfReader.NumberOfPages;

	PdfDictionary dict = pdfReader.Catalog;
	PdfDictionary labels = (PdfDictionary)PdfReader.GetPdfObjectRelease(dict.Get(PdfName.PAGELABELS));
	if (labels == null)
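		// No page labels defined: return an empty result so that the caller's foreach does not fail on null.
		return new string[0];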

	String[] labelstrings = new String[n];
	Dictionary<int, PdfObject> numberTree = PdfNumberTree.ReadTree(labels);

	int pagecount = 1;
	String prefix = "";
	char type = 'D';
	for (int i = 0; i < n; i++)
	{
		if (numberTree.ContainsKey(i))
		{
			PdfDictionary d = (PdfDictionary)PdfReader.GetPdfObjectRelease(numberTree[i]);
			if (d.Contains(PdfName.ST))
			{
				pagecount = ((PdfNumber)d.Get(PdfName.ST)).IntValue;
			}
			else
			{
				pagecount = 1;
			}
			if (d.Contains(PdfName.P))
			{
				prefix = ((PdfString)d.Get(PdfName.P)).ToUnicodeString();
			}
			if (d.Contains(PdfName.S))
			{
				type = ((PdfName)d.Get(PdfName.S)).ToString()[1];
			}
		}
		switch (type)
		{
			default:
				labelstrings[i] = pagecount.ToString();
				break;
			case 'R':
				labelstrings[i] = RomanNumberFactory.GetUpperCaseString(pagecount);
				break;
			case 'r':
				labelstrings[i] = RomanNumberFactory.GetLowerCaseString(pagecount);
				break;
			case 'A':
				labelstrings[i] =  RomanAlphabetFactory.GetUpperCaseString(pagecount);
				break;
			case 'a':
				labelstrings[i] = RomanAlphabetFactory.GetLowerCaseString(pagecount);
				break;
		}
		pagecount++;
	}
	return labelstrings;
}
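The listing above relies on iTextSharp's RomanNumberFactory for the 'R' and 'r' label styles. To illustrate what such a conversion does, here is a standalone sketch of the usual subtractive-notation algorithm. It is my own illustration with hypothetical names, not the library's implementation, and it only handles 1 to 3999.

```csharp
using System;
using System.Text;

public static class RomanNumerals
{
    // Values and symbols in descending order, including the subtractive pairs (CM, CD, XC, ...).
    private static readonly int[] Values = { 1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1 };
    private static readonly string[] Symbols = { "M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I" };

    public static string ToUpperRoman(int number)
    {
        if (number < 1 || number > 3999)
            throw new ArgumentOutOfRangeException("number");

        StringBuilder result = new StringBuilder();
        for (int i = 0; i < Values.Length; i++)
        {
            // Greedily take the largest symbol that still fits.
            while (number >= Values[i])
            {
                result.Append(Symbols[i]);
                number -= Values[i];
            }
        }
        return result.ToString();
    }

    public static string ToLowerRoman(int number)
    {
        return ToUpperRoman(number).ToLowerInvariant();
    }
}
```

For example, ToUpperRoman(14) returns "XIV" and ToLowerRoman(9) returns "ix", which is the kind of label a PDF viewer shows for front-matter pages.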