How do I use my PDF documents?
Whether used for corporate activity reports, press releases, documentation or various regulations, PDF has long been one of the most widely used formats for disseminating information on the Internet. In addition, the ever-increasing amount of online data, combined with recent advances in artificial intelligence and natural language processing, today offers a multitude of opportunities and new applications: scraping, automation of reading tasks, training of NLP models, etc.
Thus, PDF documents represent a vast and almost unavoidable source of data. However, they can be complex to use due to their particular structure. In this article, we will review the right reflexes to have when dealing with this type of document, and how to get the most out of it.
The Portable Document Format (PDF) is a page description language introduced by Adobe Systems in 1992 that became an ISO standard in 2008.
This format was born out of the Camelot project, whose goal was to create "a universal means of transmitting documents across a wide variety of configurations of machines, operating systems and communication networks." The aim was to make these documents viewable on any screen and printable on any modern printer.
This technological feat was achieved thanks to the PostScript page description language, which makes it possible to encapsulate in a single file all the elements of a page as vector representations (text, fonts, graphics, encapsulated images, etc.).
Thus, the PDF format has become the "international standard", and is used in a wide and varied range of software: export functions in mainstream office suites, specialized programs in the creative industry, and the generation of electronic invoices or official documents over the Internet.
However, this harmonization and uniformity of presentation has its downside. The encapsulation in the PostScript language prevents these documents from being parsed as one would parse a simple text file. If you do not want to use a paid service, recovering even a simple piece of text from a PDF can be quite tedious: the operation requires a minimum of development skills, and relies on libraries that are often difficult to install, slow, or limited in the types of data they can extract.
Besides, the relevant information in a document is very often contained in tables. Unlike a format such as an Excel spreadsheet, PDF has no table data structure that can be easily extracted. Instead, the PostScript language simply defines statements that place each character at x and y coordinates on a plane. Spaces are simulated by moving characters away from each other; likewise, tables are simulated by placing characters (constituting words) in two-dimensional grids. A PDF viewer simply takes these instructions and draws everything for the user to view.
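To make this concrete, here is a toy illustration (not a real parser, and the coordinates are invented) of what an extractor has to do: rebuild words from characters placed at x positions, treating a large horizontal gap as a space.

```python
# Toy illustration: PDF text is just characters placed at coordinates;
# extractors rebuild words by clustering characters that sit close together
# and treating a wide horizontal gap as a word boundary.
chars = [          # (x position, character) for one line of a page
    (10, "P"), (16, "D"), (22, "F"),
    (40, "t"), (45, "e"), (50, "x"), (55, "t"),
]

def rebuild_line(chars, gap=8):
    """Join characters into words; a gap wider than `gap` starts a new word."""
    words, current = [], chars[0][1]
    for (x_prev, _), (x, c) in zip(chars, chars[1:]):
        if x - x_prev > gap:
            words.append(current)
            current = c
        else:
            current += c
    words.append(current)
    return " ".join(words)

print(rebuild_line(chars))  # → "PDF text"
```

Real extractors apply the same idea in two dimensions, with font metrics thrown in, which is why table reconstruction is so brittle.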
This method of encapsulating tables makes their extraction extremely difficult, and even though great progress has been made in recent years, there is currently no library capable of recovering 100% of the tables in a PDF in an exploitable format.
There are two main categories of PDF documents: digitally created files, or "normal" PDFs, and "image-only" PDFs, i.e. scanned PDFs. Each type calls for its own extraction method.
To determine which type of document you are dealing with, nothing could be simpler: open the document and try to select a piece of text. If you succeed, you can rejoice, because your document belongs to the first category, the easiest to extract. Otherwise, retrieving its content will require Optical Character Recognition (OCR) techniques, which are harder to implement and less reliable.
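This check can also be automated with a simple heuristic: if text extraction yields almost no characters, the document is probably image-only. A minimal sketch (the file name in the usage comment is a placeholder, and the 20-character threshold is an arbitrary assumption):

```python
def looks_scanned(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a 'normal' PDF yields selectable text, a scanned one almost none."""
    return len(extracted_text.strip()) < min_chars

# Hedged usage with pdfminer.six (pip install pdfminer.six);
# "report.pdf" stands for any file of yours:
#
#   from pdfminer.high_level import extract_text
#   if looks_scanned(extract_text("report.pdf")):
#       print("Image-only PDF: OCR needed")
```

Such a pre-check lets a pipeline route documents to the cheap text-extraction path or the expensive OCR path automatically.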
Currently, there are many open-source libraries for parsing PDF documents, whatever the programming language (pdftools for R, PDFBox for Java, etc.). However, the vast majority of them are written in Python; pdfminer.six, PyPDF2, and PyMuPDF are among the best known.
Once the document has been parsed with one of these libraries, it can be treated as a simple text file and searched with conventional information-retrieval scripts, for example using regular expressions.
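For instance, once the raw text has been extracted, a few regular expressions are often enough to pull out specific figures. A minimal sketch, where the text snippet and its labels are invented for illustration:

```python
import re

# Hypothetical snippet, as it might come out of a parsed PDF page.
page_text = """Net revenue for 2019: 1,204,500 EUR
Operating income: 342,100 EUR"""

# Capture "<label>: <amount> EUR" pairs, one per line.
AMOUNT_RE = re.compile(r"(.+?):\s*([\d,]+)\s*EUR")

figures = {
    label.strip(): int(value.replace(",", ""))
    for label, value in AMOUNT_RE.findall(page_text)
}
print(figures)  # → {'Net revenue for 2019': 1204500, 'Operating income': 342100}
```

This approach works well as long as the wording around the target values is stable across documents; it degrades quickly when layouts vary.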
As we saw in the introduction, a very large amount of relevant information is usually contained in tables, and the PDF format greatly complicates their extraction.
Here again, open-source tools (Tabula, pdf-table-extract) and commercial ones (smallpdf, PDFTables) have existed for several years to perform this kind of task. However, in most cases the result is all-or-nothing: either the extraction is very satisfactory (and usable in a pandas DataFrame, for example), or the library fails completely, with no intermediate result.
In the case of a one-off extraction, these libraries can be handy because you can tell them which area of the page to examine. However, this method cannot be industrialized, as it is very rare for the tables in a PDF to have a standardized format. On the contrary, table layouts vary widely from one document, and even one page, to the next.
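A fixed-area extraction of this kind might look as follows with tabula-py (which wraps Tabula and requires a Java runtime); the file name and coordinates are placeholders, and `area` is given as [top, left, bottom, right] in PDF points:

```python
# Sketch of a one-off, fixed-area table extraction with tabula-py
# (pip install tabula-py; a Java runtime must be available).
TABLE_AREA = [120, 30, 420, 560]  # [top, left, bottom, right], placeholder values

def extract_fixed_area(path):
    import tabula
    # guess=False disables automatic table detection, so only TABLE_AREA is read.
    return tabula.read_pdf(path, pages=1, area=TABLE_AREA, guess=False)
```

This is exactly the part that does not generalize: the hard-coded coordinates must be re-measured for every new layout.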
That is why table extraction usually leads to writing ad hoc scripts for each type of PDF table. However, a recent library named Camelot was created to overcome these difficulties and give users much greater control over table extraction, thanks to multiple tools and adjustable parameters. Among the innovations Camelot brings: two detection modes ("lattice" for tables drawn with ruling lines, "stream" for tables aligned with whitespace), visual debugging of the detected table areas, and direct export to pandas DataFrames.
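In practice, a Camelot-based extraction with a fallback between its two modes could be sketched as follows; the file name, page range, and parameter values are placeholders to tune per corpus:

```python
# Sketch of a two-pass table extraction with Camelot
# (pip install "camelot-py[cv]"); all parameter values are assumptions.
LATTICE_KWARGS = dict(pages="1-3", flavor="lattice", line_scale=40)
STREAM_KWARGS = dict(pages="1-3", flavor="stream", edge_tol=200)

def extract_tables(path):
    import camelot
    # First pass: "lattice" mode, for tables drawn with ruling lines.
    tables = camelot.read_pdf(path, **LATTICE_KWARGS)
    if tables.n == 0:
        # Fallback: "stream" mode, for tables aligned with whitespace only.
        tables = camelot.read_pdf(path, **STREAM_KWARGS)
    return [t.df for t in tables]  # one pandas DataFrame per detected table
```

Each returned table also carries a `parsing_report` with an accuracy score, which is useful for flagging documents that need manual review.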
Here again, these libraries offer the user a more or less extensive set of functions to recognize text, tables, and the extraction zones associated with the different characters, and even to configure the type of data to be extracted (letter, word, number, etc.).
In a project involving the extraction of PDF documents, there are two sets of questions that must first be answered.
As we have seen, it is almost impossible to recover 100% of the information in a PDF document. That is why it is essential, for each new project involving the extraction of PDFs, to carry out a real feasibility study relating the project's objectives to actual extraction capacity, and to determine a priori a threshold below which the project will not be viable.
Let's take an example, and imagine that we want to retrieve financial data from activity reports of French SMEs. Even under the (daring) assumption that we have access to a sufficient number of financial and accounting documents from these companies, we are very likely to face the following issues: not all documents will contain the information sought, and those that do will present it in a variety of formats.
In this example, the feasibility study will estimate the percentage of documents containing the information sought, as well as the number of scripts to be coded to extract it (one script per presentation format). This will clarify the stakes and challenges of the project for the various stakeholders, by anticipating its limits as well as the necessary development time.
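The threshold itself can be formalized very simply. A toy sketch, where both the sample outcomes and the 80% bar are pure assumptions for illustration:

```python
# Feasibility check: did extraction succeed for each sampled document?
# (Invented outcomes: 8 successes out of 10.)
outcomes = [True, True, False, True, True, True, False, True, True, True]

FEASIBILITY_THRESHOLD = 0.80  # agreed with stakeholders before development

success_rate = sum(outcomes) / len(outcomes)
feasible = success_rate >= FEASIBILITY_THRESHOLD
print(f"{success_rate:.0%} extractable -> feasible: {feasible}")  # → 80% extractable -> feasible: True
```

The point is less the arithmetic than the discipline: the bar is fixed before development starts, so a failing pilot ends the project early rather than late.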
We have indeed seen above that there exists a myriad of tools, more or less difficult to install, more or less efficient, and above all more or less costly in resources. Thus, in order to optimize the use of the various tools, it is essential to determine upstream the type of PDF to be processed (image or text), the type of information to be extracted (text, tables, images, etc.), and finally the volume of documents. Between extracting a hundred documents of a few KB every month and 1,000 documents of several MB per day, the resources needed will not be the same, knowing that extracting a PDF can take from a few seconds to a few minutes depending on the size of the document.
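An order-of-magnitude estimate is enough at this stage. In the sketch below, every figure is an assumption chosen for illustration:

```python
# Back-of-the-envelope throughput estimate for a daily extraction batch.
docs_per_day = 1000     # assumed daily volume
seconds_per_doc = 30    # assumed mid-range: extraction takes seconds to minutes
workers = 4             # assumed number of parallel extraction processes

hours_needed = docs_per_day * seconds_per_doc / workers / 3600
print(f"{hours_needed:.2f} h of wall-clock time per day")  # → 2.08 h of wall-clock time per day
```

If the result leaves no slack against the daily window, that argues for more workers, faster libraries, or a smaller scope, and it is far cheaper to discover this on paper than in production.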
We have reviewed the different reflexes to have when working with PDF documents: define the type of document, locate the information, analyze the diversity of formats, deduce the technologies to use, estimate the volume of documents to process, and establish a performance threshold. Although these tasks may seem tedious, they are nonetheless essential to ensure the feasibility of a project. They make it easier to manage, and make it possible to anticipate a large number of potentially fatal errors, which generally only surface after several weeks of development.