Intelligent Information Extraction from PDFs using AI/DL
Going beyond simple OCR
by Kalaiselvan Panneerselvam
In the era of digitization, most companies are digitizing their data to stay competitive. Many of them hold massive amounts of potentially untapped information in the form of unstructured data: text in documents such as PDF or Word files, images, audio, or social media content. Extracting information from such data is possible, but it is challenging and tricky. This article briefly discusses the process and challenges of extracting information from PDF documents.
“Since most of the world’s data is unstructured, an ability to analyze and act on it presents a big opportunity.”
— Michael Shulman, Head of Machine Learning, Kensho.
Optical Character Recognition (OCR) is the process of extracting text from physical or digital documents such as scanned pages, PDFs and images. Put simply, OCR lets a computer read and interpret the content of a document.
Many companies rely on manual labour to read through documents for data entry. For example, they may need to extract information from invoices, tax forms, ID cards, financial contracts and agreements. Doing this by hand is a daunting task for both the employees and the company. Automating it can be a huge advantage for the organisation, saving time and improving profitability.
Why is simple OCR not enough anymore?
Documents such as invoices, forms and agreements are generally stored as PDFs. OCR extracts information easily when a PDF contains only printed text. But PDF files can hold information in many formats: images, form fields, tables, printed or handwritten text, and even multi-column layouts like those in research papers. Running a simple open-source OCR engine such as Tesseract OCR on these files often produces gibberish output. Images, tables and multi-column text create a bottleneck in the workflow, and plain OCR fails in such scenarios. More sophisticated solutions are therefore required to handle text extraction from PDFs.
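To make the multi-column problem concrete, here is a minimal Python sketch. It assumes an OCR engine has already returned word boxes with page coordinates; the `Word` class, the `min_gap` threshold and the sample coordinates are all hypothetical and chosen for illustration. It is not how Tesseract or any particular product orders text. It only shows why reading a page strictly line by line interleaves two columns, and how splitting on a wide horizontal gap can restore the reading order.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word box
    y: float  # top edge of the word box (larger means lower on the page)


def naive_order(words):
    """Read top-to-bottom, left-to-right across the whole page, ignoring columns."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def column_aware_order(words, min_gap=100.0):
    """Split words into columns at wide horizontal gaps, then read column by column.

    Simplification: columns are detected from gaps between left edges only,
    and min_gap is an assumed threshold that would need tuning per layout.
    """
    if not words:
        return []
    by_x = sorted(words, key=lambda w: w.x)
    columns = [[by_x[0]]]
    for prev, cur in zip(by_x, by_x[1:]):
        if cur.x - prev.x > min_gap:
            columns.append([])  # a wide gap starts a new column
        columns[-1].append(cur)
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda w: (w.y, w.x)))
    return [w.text for w in ordered]


# Hypothetical word boxes from a two-column page.
words = [
    Word("OCR", 50, 10), Word("reads", 110, 10),
    Word("Tables", 400, 10), Word("need", 480, 10),
    Word("plain", 50, 30), Word("text.", 120, 30),
    Word("more", 400, 30), Word("care.", 470, 30),
]

print(" ".join(naive_order(words)))         # OCR reads Tables need plain text. more care.
print(" ".join(column_aware_order(words)))  # OCR reads plain text. Tables need more care.
```

Real documents need far more than a fixed gap threshold (skewed scans, headers that span columns, tables inside columns), which is where learned layout models come in.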
Many tools on the market provide sophisticated cloud OCR solutions to tackle the problems of PDF content extraction. Notable examples include Amazon Textract, Google Cloud Vision, IBM Datacap and ABBYY FineReader. These solutions provide pre-trained models to extract the contents of a PDF.
These tools work quite well in most cases, but they are generic, universal solutions that offer limited options for customization to specific business needs. For example, the pre-trained models can extract most form field information from a PDF, but they can fail on details such as the value of a radio button or checkbox, or the contents of a tabular field. Moreover, these services are sold as monthly or yearly subscriptions with the attractive tagline “No prior AI/ML experience needed”, which can be costly as well as non-essential for most companies.
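As an illustration of the kind of customization a business might want, the sketch below decides whether a checkbox is ticked by measuring how much of its interior is dark. Everything here is an assumption made for the example: the page is a binarized grid of 0s and 1s, the checkbox location is already known, and the function name and the 0.15 threshold are hypothetical. It does not describe how any of the services above, or Genie Enterprise, handle checkboxes.

```python
def checkbox_is_ticked(pixels, box, threshold=0.15):
    """Return True if the dark-pixel ratio inside the box reaches the threshold.

    pixels: 2D list of 0/1 values, where 1 means a dark pixel.
    box: (top, left, bottom, right) of the checkbox interior, end-exclusive.
    threshold: assumed cut-off; it would need tuning on real scans.
    """
    top, left, bottom, right = box
    region = [row[left:right] for row in pixels[top:bottom]]
    total = sum(len(row) for row in region)
    if total == 0:
        raise ValueError("empty checkbox region")
    dark = sum(sum(row) for row in region)
    return dark / total >= threshold


# Hypothetical 5x5 checkbox interiors: one blank, one marked with an X.
empty = [[0] * 5 for _ in range(5)]
ticked = [[1 if c == r or c == 4 - r else 0 for c in range(5)] for r in range(5)]
box = (0, 0, 5, 5)

print(checkbox_is_ticked(empty, box))   # False
print(checkbox_is_ticked(ticked, box))  # True
```

A rule like this is easy to adapt to a specific form template, which is the kind of flexibility a fixed pre-trained service may not expose.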
How to perform Intelligent Information Extraction with the Genie Enterprise Text Extraction service?
Genie Enterprise Inc enables B2B and B2C companies to automate the information extraction process and improve their business pipeline in terms of speed, accuracy, reliability and consistency. In this way, we help our clients ease the pressure of staying competitive in today's global market by providing smarter and more profitable AI-based business solutions. Our solutions are highly customizable to business needs, highly scalable and cost-effective, and they can be extended to other types of unstructured data as well. Any company or firm that needs information from unstructured data stored in a database can benefit from our solutions. The most common use cases are listed below:
Semantic text extraction
Tabular content extraction
Extraction of signatures, logos or similar objects from documents
Customized form fields extraction
Extraction of text laid out in multiple columns
Post-extraction services such as text summarization and question answering over the extracted content
For further queries or a demo, please feel free to contact us.
Kalaiselvan Panneerselvam (MSc Data Analytics) is a Data Scientist at Genie Enterprise.
He specializes in computer vision tasks such as image recognition and semantic text extraction (OCR).