C#,iTextSharp – PDF file – Insert/extract image,text,font, text highlighting and auto fillin

Portable Document Format (PDF) is one of the most popular standards for document exchange. Created by Adobe Systems in 1993, this platform-independent format represents content including text, fonts, images and other information. Originally, however, creating a PDF file required Adobe Acrobat Professional, and the format does not let users edit the content of a file easily. Over time, more and more people wanted to create PDF files without Adobe Acrobat Professional, or to edit existing ones, and this demand led to many open source PDF libraries. One of them is iText, a library for creating and manipulating PDF documents. It lets developers enhance web and other applications with dynamic PDF generation and manipulation. In this short post I would like to illustrate some features of iTextSharp (http://sourceforge.net/projects/itextsharp/, a port of iText to the .NET platform) through small examples.

1. Insert image and text to PDF

Imagine we work at a big company that keeps a lot of documentation in PDF format, and our task is to put the company’s logo at the top of every document. We only have the PDF files; the original editable documents are not available. Opening each file in Adobe Acrobat Professional and inserting the logo by hand would be a nightmare, but thanks to iTextSharp we can do this easily:

private static void InsertImageToPdf(string sourceFileName, string imageFileName, string newFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	using (Stream imageStream = new FileStream(imageFileName, FileMode.Open))
	using (Stream newpdfStream = new FileStream(newFileName, FileMode.Create, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		PdfStamper pdfStamper = new PdfStamper(pdfReader, newpdfStream);
		PdfContentByte pdfContentByte = pdfStamper.GetOverContent(1);
		iTextSharp.text.Image image = iTextSharp.text.Image.GetInstance(imageStream);
		image.SetAbsolutePosition(300, 600);
		pdfContentByte.AddImage(image);
		pdfStamper.Close();
	}
}

Whenever we want to insert an object into a PDF file or edit it with iTextSharp, we use PdfStamper together with PdfContentByte, as in the code above. PdfStamper gives us the content layer drawn on top of an existing page through GetOverContent(), and we add objects to that layer through its AddXxx() functions. Closing the PdfStamper writes all changes to the new PDF file. Inserting an image is nothing more than that: create an iTextSharp image from a normal image and follow the routine above. Inserting text works the same way, except that we can set more attributes such as font, size, color and rotation before “pasting” the text at a specific location in the PDF file.

private static void InsertTextToPdf(string sourceFileName, string newFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	using (Stream newpdfStream = new FileStream(newFileName, FileMode.Create, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		PdfStamper pdfStamper = new PdfStamper(pdfReader, newpdfStream);
		PdfContentByte pdfContentByte = pdfStamper.GetOverContent(1);
		BaseFont baseFont = BaseFont.CreateFont(BaseFont.TIMES_ROMAN, BaseFont.CP1250, BaseFont.NOT_EMBEDDED);
		pdfContentByte.SetColorFill(BaseColor.BLUE);
		pdfContentByte.SetFontAndSize(baseFont, 8);
		pdfContentByte.BeginText();
		pdfContentByte.ShowTextAligned(PdfContentByte.ALIGN_CENTER, "Kevin Cheng - A Hong Kong actor", 400, 600, 0);
		pdfContentByte.EndText();
		pdfStamper.Close();
	}
}

Figure: the first page before and after inserting the image and text.
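Both samples place content at fixed coordinates, so it helps to know how those coordinates are measured. PDF coordinates are in points with the origin at the bottom-left corner of the page, and the position given to SetAbsolutePosition() is the lower-left corner of the image. That is why (300, 600) lands somewhere in the page body rather than at the top. Below is a minimal sketch of the arithmetic for a top-left placement. The class LogoPlacement, its method and the margin parameter are hypothetical helpers, not part of iTextSharp, and the sketch assumes the page's lower-left corner is (0, 0).

```csharp
using System;

public static class LogoPlacement
{
    // Returns the lower-left corner (x, y), in points, at which an image of the
    // given height must be placed so that it sits in the top-left corner of the
    // page, inset by 'margin'. Assumes the page's lower-left corner is (0, 0).
    public static Tuple<float, float> TopLeft(float pageHeight, float imageHeight, float margin)
    {
        float x = margin;
        float y = pageHeight - margin - imageHeight;
        return Tuple.Create(x, y);
    }
}
```

For an A4 page (595 x 842 points), a logo 50 points high and a margin of 20 points, LogoPlacement.TopLeft(842f, 50f, 20f) gives (20, 772), which could then be passed to image.SetAbsolutePosition(). The page height itself can be read with pdfReader.GetPageSize(1).Height.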

2. Extract text from PDF

Extracting text from a PDF with iTextSharp is also pretty simple. We initialize a PdfReader and call PdfTextExtractor.GetTextFromPage() with an appropriate strategy to get all the text we need.

private static void ExtractTextFromPdf(string newFileNameWithImageAndText, string extractedTextFileName)
{
	using (Stream newpdfStream = new FileStream(newFileNameWithImageAndText, FileMode.Open, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(newpdfStream);
		string text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy());
		File.WriteAllText(extractedTextFileName, text);
	}
}

Interestingly, we don’t have to read all the text of a page or of the complete PDF file; we can restrict the extraction to a specific region. That is very useful if, for example, we only want the address from a letter in PDF format. There is no need to read the whole letter, which costs time and resources. We just define the region where the address is and read it out. For example, we would like to extract the text from the region marked in the image below:

iTextSharp extract text from region

private static void ExtractTextFromRegionOfPdf(string sourceFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		System.util.RectangleJ rect = new System.util.RectangleJ(50, 650, 250, 140);
		RenderFilter[] renderFilter = new RenderFilter[1];
		renderFilter[0] = new RegionTextRenderFilter(rect);
		ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
		Console.WriteLine(PdfTextExtractor.GetTextFromPage(pdfReader, 1, textExtractionStrategy));
	}
}

The code for reading from a region uses the same function GetTextFromPage(), but with another strategy, LocationTextExtractionStrategy. This strategy is passed to an instance of FilteredTextRenderListener together with a RegionTextRenderFilter. The filter holds the region from which we want to extract the text, in this case a rectangle.
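The four numbers passed to RectangleJ are easy to misread. As far as I understand the class, they are x, y, width and height, with (x, y) being the lower-left corner of the region in PDF coordinates, and not two corner points as in iTextSharp.text.Rectangle. Here is a small sketch of the conversion, assuming that reading is correct. The class RegionMath and its method are hypothetical helpers that do not exist in iTextSharp.

```csharp
using System;

public static class RegionMath
{
    // Converts a region given by two opposite corners, (llx, lly) and (urx, ury),
    // into the { x, y, width, height } form. The corners may be given in any order.
    public static float[] ToXYWidthHeight(float llx, float lly, float urx, float ury)
    {
        float x = Math.Min(llx, urx);
        float y = Math.Min(lly, ury);
        float width = Math.Abs(urx - llx);
        float height = Math.Abs(ury - lly);
        return new[] { x, y, width, height };
    }
}
```

Under that assumption, the region used above, RectangleJ(50, 650, 250, 140), spans the corners (50, 650) and (300, 790): RegionMath.ToXYWidthHeight(50, 650, 300, 790) returns { 50, 650, 250, 140 }.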

3. Auto Fill-in PDF form

Now suppose we have an interactive form in PDF format and would like to send this template to many users with some fields pre-filled, for example the user’s name and address. This is handy when we want to run a survey or prepare a new contract from existing data. iTextSharp can do this too: get the AcroFields of the PDF, set their values and save the result. As I said before, any edit of a PDF file goes through PdfStamper. The image below shows an example of an interactive PDF form:

Interactive PDF form

private static void AutoFillInFormOfPdf(string fillableFormFileName, string newfillableFormFileName)
{
	using (Stream pdfStream = new FileStream(fillableFormFileName, FileMode.Open))
	using (Stream newpdfStream = new FileStream(newfillableFormFileName, FileMode.Create, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		PdfStamper pdfStamper = new PdfStamper(pdfReader, newpdfStream);
		foreach (KeyValuePair<string, iTextSharp.text.pdf.AcroFields.Item> pair in pdfReader.AcroFields.Fields)
		{
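			// pair.Value is an AcroFields.Item, so ask AcroFields for the field's current value instead
			Console.WriteLine(pair.Key + " - " + pdfReader.AcroFields.GetField(pair.Key));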
		}
	
		AcroFields acroFields = pdfStamper.AcroFields;
		acroFields.SetField("Text_01", "ServusKevin");
		acroFields.SetField("Radio Button_01", acroFields.GetAppearanceStates("Radio Button_01")[0]);
		acroFields.SetField("Radio Button_02", acroFields.GetAppearanceStates("Radio Button_02")[1]);
		acroFields.SetField("Radio Button_03", acroFields.GetAppearanceStates("Radio Button_03")[2]);
		acroFields.SetField("Check Box_03", acroFields.GetAppearanceStates("Check Box_03")[0]);
		acroFields.SetField("Combo Box_01", pdfReader.AcroFields.GetListOptionDisplay("Combo Box_01")[4]);

		pdfStamper.Close();
	}
}

First I use a loop to list all AcroFields with their names and current values, and then I set them to what I want. “Text_01”, “Radio Button_01”, “Radio Button_02”, “Radio Button_03” and so on are the names of the controls in the form. Setting the text of a text box is easy, but the other components are different. Radio buttons and check boxes have custom-defined values, and the control is displayed correctly (checked or not checked) only if we set a valid value. If we set an invalid value, the control is displayed in its default state (normally unchecked, depending on the author of the form). The valid values can be enumerated with the GetAppearanceStates() function, which takes the field name as its argument. This function is not available for combo boxes, which are yet another case: to enumerate a combo box’s values, we use GetListOptionDisplay(), which returns all available choices. It is a little confusing that each component has its own behavior, but once you know the functions it is not complicated anymore.

Source code : https://bitbucket.org/hintdesk/dotnet-itextsharp-pdf-file-insertextract-imagetext-and-auto

4. Updates

4.1 Extract list of fonts used in PDF file – 27.04.2012

Try the listing below.

private static void ExtractFontNameOfPdf(string sourceFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		List<BaseFont> set = new List<BaseFont>();
		PdfDictionary resources;

		for (int index = 1; index <= pdfReader.NumberOfPages; index++)
		{
			resources = pdfReader.GetPageN(index).GetAsDict(PdfName.RESOURCES);
			ProcessResource(set, resources);
		}

		foreach (BaseFont item in set)
			Console.WriteLine(item.PostscriptFontName + " " + item.FontType.ToString());
	}
}

private static void ProcessResource(List<BaseFont> set, PdfDictionary resources)
{
	if (resources == null)
		return;
	PdfDictionary xObjects = resources.GetAsDict(PdfName.XOBJECT);
	if (xObjects != null)
	{
		foreach (PdfName key in xObjects.Keys)
		{
			ProcessResource(set, xObjects.GetAsDict(key));
		}
	}

	PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);

	if (fonts == null)
		return;
	foreach (PdfName key in fonts.Keys)
	{
		PdfDictionary fontDict = (PdfDictionary)PdfReader.GetPdfObject(fonts.Get(key));
		PdfName baseFontName = (PdfName)PdfReader.GetPdfObject(fontDict.Get(PdfName.BASEFONT));
		PRIndirectReference iRef = (PRIndirectReference)fonts.Get(key);
		if (iRef != null)
			set.Add(BaseFont.CreateFont(iRef));
	}
}

4.2 Highlighting text in existing PDF file – 30.07.2012

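private static void ChangeTextColorOfPdf(string sourceFileName, string newFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open, FileAccess.Read))
	using (Stream newpdfStream = new FileStream(newFileName, FileMode.Create, FileAccess.ReadWrite))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		// Write to a separate output file, as in the other samples; stamping into
		// the same stream that is being read is not reliable.
		using (PdfStamper stamper = new PdfStamper(pdfReader, newpdfStream))
		{
			// Region to highlight: lower-left (130, 635), upper-right (230, 650)
			iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(130, 635, 230, 650);
			float[] quadPoints = { rect.Left, rect.Bottom, rect.Right, rect.Bottom, rect.Left, rect.Top, rect.Right, rect.Top };
			PdfAnnotation highlight = PdfAnnotation.CreateMarkup(stamper.Writer, rect, null, PdfAnnotation.MARKUP_HIGHLIGHT, quadPoints);
			highlight.Color = BaseColor.GREEN;
			stamper.AddAnnotation(highlight, 1); // add the annotation to page 1
		}
		Console.WriteLine("Text was highlighted");
	}
}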

4.3 Extract all text from .pdf – 18.08.2013

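Use the Task Parallel Library to extract the text of a .pdf file in several tasks, each handling a range of pages and writing its result to its own .txt file.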

using (Stream newpdfStream = new FileStream(newFileNameWithImageAndText, FileMode.Open, FileAccess.ReadWrite))
{
	PdfReader pdfReader = new PdfReader(newpdfStream);

	int pageSize = (int)Math.Ceiling((double)pdfReader.NumberOfPages / (double)(Environment.ProcessorCount * 2));
	int numberOfThreads = (int)Math.Ceiling((double)pdfReader.NumberOfPages / (double)pageSize);
	IList<Task> tasks = new List<Task>();
	for (int index = 0; index < numberOfThreads; index++)
	{
		int currentIndex = index;
		int page = Math.Min((index + 1) * pageSize, pdfReader.NumberOfPages);
		tasks.Add(Task.Factory.StartNew<string>(() =>
			{
				StringBuilder taskResult = new StringBuilder();
				for (int subIndex = currentIndex * pageSize + 1; subIndex <= page; subIndex++)
					taskResult.Append(PdfTextExtractor.GetTextFromPage(pdfReader, subIndex, new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy()));
				return taskResult.ToString();
			})
			.ContinueWith((t) => File.WriteAllText(currentIndex.ToString() + ".txt", t.Result)));
	}

	Task.WaitAll(tasks.ToArray());
	Console.WriteLine("Finish");
}
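The chunking arithmetic in the listing above is the part that is easiest to get wrong, so here it is isolated as a sketch without any iTextSharp dependency. PageChunks and Split are hypothetical names; the method only reproduces the page ranges that each task in the listing would process, and it additionally guards against an empty document.

```csharp
using System;
using System.Collections.Generic;

public static class PageChunks
{
    // Splits pages 1..numberOfPages into consecutive ranges of at most
    // ceil(numberOfPages / maxChunks) pages each. Each tuple holds the first
    // and last page (1-based, inclusive) of one range.
    public static List<Tuple<int, int>> Split(int numberOfPages, int maxChunks)
    {
        var ranges = new List<Tuple<int, int>>();
        if (numberOfPages <= 0 || maxChunks <= 0)
            return ranges; // nothing to do; also avoids a division by zero

        int pageSize = (numberOfPages + maxChunks - 1) / maxChunks; // integer ceiling
        for (int first = 1; first <= numberOfPages; first += pageSize)
        {
            int last = Math.Min(first + pageSize - 1, numberOfPages);
            ranges.Add(Tuple.Create(first, last));
        }
        return ranges;
    }
}
```

With 10 pages and at most 4 chunks, PageChunks.Split(10, 4) yields the ranges 1-3, 4-6, 7-9 and 10-10. In the listing, maxChunks corresponds to Environment.ProcessorCount * 2, and each range is handled by its own task.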

4.4 Get roman page numbers – 21.03.2015

Get the Roman page numbers of the first pages of a document, such as the cover, back cover and table of contents.

private static void ExtractRomanPageNumbers(string sourceFileName)
{
	using (Stream pdfStream = new FileStream(sourceFileName, FileMode.Open))
	{
		PdfReader pdfReader = new PdfReader(pdfStream);
		
		foreach (string s in GetRomanPageNumbers(pdfReader))
			Console.WriteLine(s);

	}
}

private static IEnumerable<string> GetRomanPageNumbers(PdfReader pdfReader)
{
	int n = pdfReader.NumberOfPages;

	PdfDictionary dict = pdfReader.Catalog;
	PdfDictionary labels = (PdfDictionary)PdfReader.GetPdfObjectRelease(dict.Get(PdfName.PAGELABELS));
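	if (labels == null)
		return new string[0]; // no page labels defined; an empty result keeps the caller's foreach from failing on null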

	String[] labelstrings = new String[n];
	Dictionary<int, PdfObject> numberTree = PdfNumberTree.ReadTree(labels);

	int pagecount = 1;
	String prefix = "";
	char type = 'D';
	for (int i = 0; i < n; i++)
	{
		if (numberTree.ContainsKey(i))
		{
			PdfDictionary d = (PdfDictionary)PdfReader.GetPdfObjectRelease(numberTree[i]);
			if (d.Contains(PdfName.ST))
			{
				pagecount = ((PdfNumber)d.Get(PdfName.ST)).IntValue;
			}
			else
			{
				pagecount = 1;
			}
			if (d.Contains(PdfName.P))
			{
				prefix = ((PdfString)d.Get(PdfName.P)).ToUnicodeString();
			}
			if (d.Contains(PdfName.S))
			{
				type = ((PdfName)d.Get(PdfName.S)).ToString()[1];
			}
		}
		switch (type)
		{
			default:
				labelstrings[i] = pagecount.ToString();
				break;
			case 'R':
				labelstrings[i] = RomanNumberFactory.GetUpperCaseString(pagecount);
				break;
			case 'r':
				labelstrings[i] = RomanNumberFactory.GetLowerCaseString(pagecount);
				break;
			case 'A':
				labelstrings[i] =  RomanAlphabetFactory.GetUpperCaseString(pagecount);
				break;
			case 'a':
				labelstrings[i] = RomanAlphabetFactory.GetLowerCaseString(pagecount);
				break;
		}
		pagecount++;
	}
	return labelstrings;
}
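For readers who want to see what the 'R' and 'r' cases boil down to, the conversion from a page count to a Roman numeral can be approximated in a few lines. This is a sketch of the standard subtractive algorithm, not the library's own code; RomanNumerals and ToRoman are hypothetical names.

```csharp
using System;
using System.Text;

public static class RomanNumerals
{
    private static readonly int[] Values = { 1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1 };
    private static readonly string[] Symbols = { "M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I" };

    // Converts a positive page number to a Roman numeral.
    // Pass upperCase = false for lowercase labels.
    public static string ToRoman(int number, bool upperCase)
    {
        if (number <= 0)
            throw new ArgumentOutOfRangeException("number", "Roman numerals need a positive number.");

        var result = new StringBuilder();
        for (int i = 0; i < Values.Length; i++)
        {
            // Take the largest remaining value as often as it fits.
            while (number >= Values[i])
            {
                result.Append(Symbols[i]);
                number -= Values[i];
            }
        }
        return upperCase ? result.ToString() : result.ToString().ToLowerInvariant();
    }
}
```

RomanNumerals.ToRoman(4, false) returns "iv" and RomanNumerals.ToRoman(14, true) returns "XIV".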

49 thoughts on “C#,iTextSharp – PDF file – Insert/extract image,text,font, text highlighting and auto fillin”

  1. Hi Daten,
    Thank you so much. From your article I figured out the quickest way to get PDF hyperlink (annotation) text using iTextSharp.

    Thanks
    Chenna Basappa C

  2. @Chenna: I’m very glad that my post helped you. If it’s not secret, it would be great if you shared your code for getting annotations, just for anyone else who needs it like you did.

  3. Hi,

    Any idea how to extract the font embedded status (or a list of non-embedded fonts) using iTextSharp?

    Thanks

    Basappa

  4. @Basappa: I updated code for extracting font names from pdf. Hope it helps. If you find a better one, then please share it here.

  5. Great article
    I am trying to change the color of text from an existing PDF. I need to select some part of the text from a given page using co-ordinates and make it blue.
    Is this possible using itextsharp?
    Thanks

  6. @apex: Generally it’s hard to edit components in an existing PDF. I don’t know how to change the color of text in an existing PDF file, but I updated the post with code that highlights text at given coordinates in green. I know it doesn’t fulfill the requirement, but it’s an alternative solution.

  7. Hi Friends,
    Many thanks in advance. I am working on PDF text extraction and I already managed to extract the text of a particular PDF file. But I need to extract the PDF contents by using PDF objects. I am working in C#. Please explain if there is any way. Thanks

  8. @Arunprakash R: I don’t understand what you mean by “extract the pdf contents by using pdf objects”. Could you explain in more detail what you’re trying to do?

  9. Hi, I have a problem with extracting the list of the font. I’m using iTextSharp for windows phone.

    The problem is on this line:
    set.Add(BaseFont.CreateFont(iRef));

    It gives null for the baseFontName. It’s because my fontDict doesn’t have an entry with key of “/BaseFont”.

    I’m new at this thing, so I don’t know what kind of font I am really dealing with.

    Do you have any ideas/solutions for my problem? Thank you for your help.

  10. Using the itextsharp dll one can extract the text of a PDF file page by page, i.e. the contents of a specific page or of every page from the first to the last, but I want to read/extract all the contents of a pdf file at once. In your example you used a for loop from the first page to the last page to get the contents of a pdf file, but my requirement is a bit different: I want all of the content at once. Is this possible? If yes, then how?

  11. @Siddhartha Mishra: I don’t know if it’s possible. However, looping is also pretty fast, so you can use it until you find another solution. You can also try multithreading to extract the text and then append the results together, like in my update above. It takes seconds with Pro.Android.4.2012.eBook.pdf (more than 1000 pages).

  12. @Admin : I replaced my previous code (without threads) with your updated code, and on executing the application I found that it takes exactly the same time as the previous code to return the results. What could be the issue with this?

  13. @Siddhartha Mishra: My code uses parallel tasks for a long-running job that should run in the background without freezing your UI. When the job is not that long, you won’t get better performance (in some cases it is even worse). In your case, if your old code already performs well, you don’t have to replace it with mine.
    Back to my code: if you want better performance, try moving the PdfReader into the parallel loop. Maybe sharing the PdfReader between tasks creates a bottleneck. I’m not sure my suggestion makes any difference, because the PdfReader initialization time counts too. So just try it, measure the time and share your best solution with us.
    Regards,

  14. Hi All,
    I want to retrieve all the text from a PDF file, but not as a single string. I want to retrieve the text line by line and convert it into a text document. I want a copy of the PDF document as a text file. Please help me.

  15. @Amit Yadav: I don’t understand your point. Why do you have to read the text out line by line? You can read all the text of the pages into a string and then split it by Environment.NewLine.

  16. Hi, I have this problem that is driving me mad. What I need is to copy only the text from one PDF to another, with the same design and fonts, but only the text, without any of the images in the PDF. How can I achieve this?

  17. OK, done, but my issue persists. Let me explain: I have a program that extracts a big PDF (from 500 to 1500 pages or more) page by page, and the bigger the mother file is, the bigger the children are. My first guess was that the image on every page was increasing the size. I did what you do, but the PDFs still follow the same pattern.

    So, when I extract the pages from a PDF with 2 sheets, each extracted PDF is about 80KB. But if I extract from one with 800 sheets, the children are 5MB each, and from a file with 1500 sheets they are 9MB (as if data were being added, although each file contains only 1 page).


    while (contPag <= numPag)
    {
        Document docIn = new Document(readerpages.GetPageSizeWithRotation(contPag));
        using (FileStream outputPdfStream = new FileStream(@carpeta + "\\" + nombreSinExtencion + "\\recibo" + contPag + ".pdf", FileMode.Create, FileAccess.Write, FileShare.None))
        {
            PdfWriter pdfOut = PdfWriter.GetInstance(docIn, outputPdfStream);
            pdfOut.SetPdfVersion(PdfWriter.PDF_VERSION_1_6);
            pdfOut.CompressionLevel = PdfStream.BEST_COMPRESSION;
            docIn.Open();
            PdfContentByte content = pdfOut.DirectContent;
            //Same pages size
            docIn.SetPageSize(readerpages.GetPageSizeWithRotation(1));
            docIn.NewPage();
            PdfImportedPage page = pdfOut.GetImportedPage(readerpages, contPag);
            int rotacion = readerpages.GetPageRotation(contPag);
            if (rotacion == 90 || rotacion == 270)
            {
                content.AddTemplate(page, 0, -1f, 1f, 0, 0, 0);
            }
            else
            {
                content.AddTemplate(page, 1f, 0, 0, 1f, 0, 0);
            }
            docIn.Close();
            docIn.Dispose();
            pdfOut.Close();
            pdfOut.Dispose();
            outputPdfStream.Close();
            outputPdfStream.Dispose();
        }
        contPag++;
    }

  18. @Robk: Sorry, but I don’t understand what you mean, even though I read your comments many times. Can you give a short description of what you really want to achieve?

  19. How to extract text of the PDF column wise.

    Structure of of PDF is as follows :

    Category :Demo
    Name : abc
    M :123

    Category :Demo
    Name : xyz
    M :9090

    And this information is repeated in two column format.

    So how to extract the data column wise ?

  20. @swamini: You can read out the text and parse it as you want. I can’t tell you exactly how it works because it depends on your data structure.

  21. Great Article! I mostly use PDFfiller to create or edit PDF forms. Its not the same thing, but maybe you know someone that needs it. It also allows you to erase in a pdf, esign, efax, add logos, pics to pdfs, etc.
    Its pretty easy to use and its pretty cheap. I think you can get a free week if you and a friend both register. http://goo.gl/fXBVaO

  22. Update a barcode with dynamically generated barcode.

    I have multiple QR barcodes on multiple PDFs. I have these QR barcodes as placeholders for the dynamically created QR barcodes. I am having problems trying to get the replacepushbuttonfield to work.

    Thanks,

  23. @Michael J Clinton : You can post a sample file and your sample code so that I can help you. It’s difficult to figure out where your problem is without sample file and code.

  24. Great article. I have a pdf file which was generated by CutePDF writer. When I extract the text from the PDF File, it does not come out in the same sequence as the form. Any reason for this?

  25. Hi everyone, do you know if through ITextSharp there is a way to tell if a Pdf page comes from the introduction/contents (those with roman numbers) or from the actual text (like the first page of chapter 1, which is , let’s say, page 34 of the book)? I’d like to fill a list of all pages using books conventions.

  26. Hello, I am trying to merge multiple pdf files into 1 pdf file using itextsharp library. I am having an issue with page numbers in my merged output file. Since each input file has its own page numbers, the merged output file is not having proper page numbers. For instance, Page 1 of 2 is getting repeated. Is there a way to remove existing page numbers and re-add the custom page numbers back?

  27. @Sujith V S: I think you have to remove the footer yourself and number the pages again. I can help you if you give me some small PDF files as examples.

  28. Hey, good tips! Thanks for sharing!

    I have a question: is it possible to remove lines from a PDF using ITextSharp?

    Thank you

  29. Perhaps you can help me. I have a PDF-Form with a barcode-field inside. If I try to prefill the form – like your example with setfield – I see no barcodes only the prefilled number. After changing the number manually and pressing Enter I get a barcode …?

  30. Really nice article. Just a question: there is a way to extract text with font properties? Or at least the font applied to the text?

    Assuming that in the source pdf there are some words in bold, I would like to extract text preserving formatting.

    Thanks.

    Giuseppe

  31. Hi, Amazing post. Great Code.
    Maybe you could help me with a problem I have. I’m working on a project where I send a PDF hash to an external provider to be signed, and that provider sends me back the signed hash, which I have to “reinsert” into the pdf file.

    I can’t find a way to insert/update/modify the hash table of the pdf.

    Can you give me a clue?

    Regards

  32. @afzaal: The section “1. Insert image and text to PDF” inserts the image into the first page. You can just modify the code to insert the image into the last page. I think it wouldn’t be complicated.

  33. Hi,

    I have used your first example code, all the things goes fine but the problem is that, my logo is not appearing on top of the document, instead it is appearing in the middle of the existing content of the pdf.
    Can you please tell me where i am making the mistake.
    I have exactly used the same code.
