Optimize PDF extraction

How do I use my PDF documents?

Whether used for corporate activity reports, press releases, documentation or various regulations, PDF has long been one of the most widely used formats for disseminating information on the Internet. In addition, the ever-increasing amount of online data, combined with recent advances in artificial intelligence and natural language processing, today offers a multitude of opportunities and new applications: scraping, automation of reading tasks, training of NLP models, etc.

Thus, PDF documents represent a vast and almost unavoidable source of data. However, they can be complex to use due to their particular structure.

In this article, we will review the right reflexes to have when dealing with this type of document, and how to make the most of them.

I. The PDF format: where do the complications come from?

a. Why do we use it so much?

 

The Portable Document Format (PDF) is a page description format introduced by Adobe Systems in 1992, which became an ISO standard in 2008.

This format was born out of the Camelot project, whose goal was to create "a universal means of transmitting documents across a wide variety of configurations of machines, operating systems and communication networks." The goal was to make these documents visible on any screen and printable on any modern printer.

This technological feat was achieved thanks to the invention of the PostScript page description language, making it possible to encapsulate in a single file all the elements of a page, using vector representations of those elements (text, fonts, graphics, embedded images, etc.).

Thus, the PDF format has become the "international standard", and is used in a wide and varied set of software, ranging from export functions in mainstream office suites to handling by specialized programs in the creative industries, or the generation of electronic invoices and official documents over the Internet.

b. The resulting difficulties

 

However, this harmonization and unity of presentation has its downside. Indeed, the encapsulation in the PostScript language prevents these documents from being parsed the way one would parse a simple text file. If you do not want to use a paid service, recovering even a simple piece of text from a PDF document can be quite tedious: the operation requires some development skills, and relies on libraries that are often difficult to install, slow, or limited in the type of data they can extract.

Besides, the relevant information in a document is very often contained in tables. Unlike a format like Excel, PDF does not have a table data structure that can be easily extracted. On the contrary, the PostScript language simply defines statements placing each character at x and y coordinates on a plane. Thus, spaces are simulated simply by moving characters away from each other. Likewise, tables are simulated by placing characters (constituting words) in two-dimensional grids. A PDF viewer simply takes these instructions and draws everything for the user to view.
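To make this concrete, the following minimal sketch (assuming the pdfminer.six library is installed and that a file named report.pdf exists, both purely illustrative) prints each character of the first page together with the coordinates at which the PDF places it:

```python
# Minimal sketch: inspect how a PDF positions individual characters.
# Assumes pdfminer.six ("pip install pdfminer.six"); report.pdf is a
# placeholder file name.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

for page_layout in extract_pages("report.pdf", maxpages=1):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for line in element:
                for char in line:
                    if isinstance(char, LTChar):
                        # Each character only carries its own bounding box:
                        # there is no notion of word, cell or table.
                        print(char.get_text(), char.x0, char.y0)
```

Nothing in this output distinguishes a table cell from ordinary text, which is precisely what makes table extraction so difficult.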

This method of encapsulating tables makes their extraction extremely difficult, and even though great progress has been made in recent years, there is currently no library capable of recovering 100% of the tables of a PDF in an exploitable format.

II. Extracting information: what strategy to use?

There are two main categories of PDF documents: digitally created files, or “normal” PDFs, and “image-only” PDFs, i.e. scanned documents. Each type of PDF calls for its own extraction method.

 

To determine which type of document you are dealing with, nothing could be simpler: just open the document and try to select a piece of text. If you succeed, you can already rejoice, because your document belongs to the first category, the easiest to extract. Otherwise, retrieving the content of the document will require Optical Character Recognition (OCR) techniques, which are more difficult to implement and less reliable.
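When documents arrive in large numbers, the same test can be automated. Here is a minimal sketch, again assuming pdfminer.six and a placeholder file name: if almost no text comes back, the document is most likely a scanned PDF that will require OCR.

```python
# Rough heuristic: a "normal" PDF exposes a text layer, a scanned PDF does not.
# Assumes pdfminer.six; document.pdf is a placeholder file name.
from pdfminer.high_level import extract_text

text = extract_text("document.pdf", maxpages=3)  # the first pages are enough
if len(text.strip()) > 50:   # threshold chosen arbitrarily for this example
    print("Text-based PDF: direct parsing should work.")
else:
    print("Little or no text layer: OCR will probably be needed.")
```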

 

1. The “normal” PDF

 

  • Parse the document and get some text  

    Currently, there are many open-source libraries for parsing PDF documents, regardless of the programming language used (pdftools for R, PDFBox for Java, etc.). However, a large share of the most widely used ones are written in Python. Below is a non-exhaustive list of the main Python PDF parsing libraries and their main characteristics:

    • PDFMiner, the building block of many wrappers like Slate or PDFQuery, which specializes in text recovery and analysis. It provides the exact location of text on a page, as well as other information such as fonts or lines. It includes a converter that can transform PDF files into other text formats (such as HTML), and also has an extensible PDF parser that can be used for purposes other than text analysis. The original project is only compatible with Python 2, but community forks such as pdfminer.six support Python 3.
    • PyPDF2, capable of splitting, merging, cropping and transforming the pages of PDF files. It also provides access to certain parameters such as display options or document metadata, as well as password management (a short sketch follows this list).
    • pdfrw, which performs the aforementioned operations very quickly, and can also be coupled with other libraries to perform additional operations (rst2pdf to faithfully reproduce the vector images of a document, or reportlab to reuse existing PDFs in new ones).
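As an illustration of the page-level operations mentioned above for PyPDF2, here is a minimal sketch (the import style assumes a recent PyPDF2 release; file names are placeholders) that reads a document's metadata and saves its first page as a new file:

```python
# Minimal sketch of page-level manipulation with PyPDF2 (recent releases
# expose PdfReader/PdfWriter; older ones use PdfFileReader/PdfFileWriter).
# report.pdf and first_page.pdf are placeholder file names.
from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("report.pdf")
print("Pages:", len(reader.pages))
print("Metadata:", reader.metadata)   # title, author, producer, etc.

writer = PdfWriter()
writer.add_page(reader.pages[0])      # keep only the first page
with open("first_page.pdf", "wb") as output:
    writer.write(output)
```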

Once the document has been parsed using these libraries, it can be treated as a simple text file and searched with conventional information retrieval scripts, for example using regular expressions.
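For instance, the following minimal sketch (using pdfminer.six; the file name and the regular expression are purely illustrative) dumps the text layer of an annual report and looks for a revenue line:

```python
# Minimal sketch: extract the text layer, then search it with a regex.
# Assumes pdfminer.six; annual_report.pdf and the pattern are illustrative.
import re
from pdfminer.high_level import extract_text

text = extract_text("annual_report.pdf")

# Hypothetical pattern for a line such as "Revenue: 1 234 567 EUR".
match = re.search(r"Revenue\s*:\s*([\d\s.,]+)\s*EUR", text)
if match:
    print("Revenue found:", match.group(1).strip())
else:
    print("Revenue not found in the text layer.")
```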

 

  • The special case of tables

As we saw in the introduction, a very large amount of relevant information is usually contained in tables, and the PDF format greatly complicates their extraction.

Here again, open-source (Tabula, pdf-table-extract) and proprietary (smallpdf, PDFTables) tools have been available for several years to perform this kind of task. However, in most cases the result is either very satisfactory (and usable in a pandas DataFrame, for example), or the library fails completely. There is no intermediate result.

In the case of a one-time extraction, these libraries can be handy, because you can tell them which area of the page to examine. However, this method cannot be industrialized, as it is very rare for the tables in a PDF to follow a standardized format: layouts vary considerably from one document, and even from one table, to the next.

That is why table extraction usually leads to writing ad hoc scripts for each type of PDF table. However, a recent library named Camelot was created to overcome these difficulties and offer users much greater control over the extraction of tables, thanks to multiple tools and adjustable parameters. Among the various innovations that Camelot brings, we can list:

  • automatic detection of several tables on the same page;
  • stream and lattice extraction modes, allowing the user to indicate whether the table has well-defined dividing lines, like an Excel table (lattice), or whether Camelot should infer the structure of the table from the arrangement of its various elements, and in particular of the white areas (stream);
  • a visual debugger, enabling users to grasp what the library "sees", and thus understand which parameters to adjust so that the table is better detected (illustrated in the sketch below).
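To give an idea of the workflow, here is a minimal sketch of both extraction modes and of the visual debugger (the file name and page number are placeholders, and the lattice flavor assumes the table has ruling lines):

```python
# Minimal sketch of table extraction with Camelot; table_report.pdf is a
# placeholder file name.
import camelot

tables = camelot.read_pdf("table_report.pdf", pages="1", flavor="lattice")
print(tables.n, "table(s) detected")

if tables.n:
    print(tables[0].parsing_report)   # accuracy / whitespace diagnostics
    df = tables[0].df                 # the table as a pandas DataFrame
    print(df.head())

    # Visual debugger: shows what Camelot "sees" on the page
    # (requires matplotlib).
    camelot.plot(tables[0], kind="grid")

# For tables without dividing lines, switch to the stream flavor:
# tables = camelot.read_pdf("table_report.pdf", pages="1", flavor="stream")
```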

 

2. The “image-only” PDF

 

For the second type of PDF, the solution is to use OCR tools, with open-source libraries like Pytesseract, Textract or Pyocr.

Here again, these libraries offer the user a more or less extensive set of functions to recognize text and tables, retrieve the extraction zones associated with the different characters, and even configure the type of data to be extracted (letter, word, number, etc.).
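As an illustration, here is a minimal sketch built around Pytesseract (it additionally assumes the pdf2image package, the poppler utilities and the Tesseract binary are installed, and uses a placeholder file name): it rasterizes each page of a scanned PDF and runs OCR on the resulting images.

```python
# Minimal OCR sketch: rasterize a scanned PDF, then recognize the text.
# Assumes pytesseract, pdf2image, poppler and the Tesseract binary are
# installed; scanned.pdf is a placeholder file name.
import pytesseract
from pdf2image import convert_from_path

pages = convert_from_path("scanned.pdf", dpi=300)   # one PIL image per page

for number, image in enumerate(pages, start=1):
    text = pytesseract.image_to_string(image, lang="eng")
    print(f"--- Page {number} ---")
    print(text)
```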

III. Feasibility of an extraction project, and integration into existing processes

In a project involving the extraction of PDF documents, there are two sets of questions that must first be answered.

1. Is the information always present in the document, and where is it located? What is my minimum expected level of service? How should I deal with irreducible errors?

As we have seen, it is almost impossible to recover 100% of the information in a PDF document. That is why it is essential, for each new project involving the extraction of PDFs, to carry out a genuine feasibility study comparing the objectives of the project with the actual extraction capacity, and to determine a priori a threshold below which the project will not be viable.

Let's take an example, and imagine that we want to retrieve financial data from the activity reports of French SMEs. Based on the (daring) assumption that we have access to a sufficient number of financial and accounting documents from these companies, we are very likely to be faced with the following issues:

  • the information is not present in all documents. Indeed, some companies do not provide a detailed income statement, and limit themselves to giving only the main fields;
  • the information is there, but in different formats. A first company will give its bottom line in a paragraph of text. A second will present it in a table, with the information laid out in rows. A third will present it in columns, using the rows to compare year N with year N-1, etc.

In this example, the feasibility study will estimate the percentage of documents containing the information sought, as well as the scripts to be coded to extract it (one script per presentation format). This will clarify the stakes and challenges of the project for the various stakeholders, anticipating both its limits and the necessary development time.

2. What tools should I use? Are they consistent with the scale of the project (size, performance) and the expected level of service?

 

We have indeed seen above that there is a myriad of tools, more or less difficult to install, more or less efficient, and above all more or less costly in resources. Thus, in order to make the best use of these various tools, it is essential to determine upstream the type of PDF to be processed (image or text), the type of information to be extracted (text, tables, images, etc.), and finally the volume of documents. Indeed, between extracting a hundred documents of a few KB every month and a thousand documents of several MB per day, the resources needed are not the same, especially as extracting a PDF can take anywhere from a few seconds to a few minutes depending on the size of the document.

Conclusion

We have reviewed the different reflexes to have when working with PDF documents: determine the type of document, locate the information, analyze the diversity of formats, deduce the technologies to use, estimate the volume of documents to process and establish a performance threshold. Although these tasks may seem tedious, they are nonetheless essential to ensure the feasibility of a project. They make it easier to manage, and make it possible to anticipate a large number of potentially fatal issues, which generally only surface after several weeks of development.