How to Extract Key-Value Pairs Using Deep Learning


You encounter key-value pairs (KVPs) more often than you might realize. Remember the last time you flipped through a dictionary? Each word (the key) is paired with its definition (the value). Or consider the forms you've filled out: the questions are the keys, and your answers are the values. Even in the business world, invoices use this structure: items purchased are the keys, with prices as their corresponding values.

Forms are a common real-world example of key-value pair information display

But here's the challenge: unlike neatly structured tables, KVPs often hide in unstructured data or unfamiliar formats. Sometimes, they're even partially handwritten. Imagine manually extracting data from thousands of forms or scanned invoices and transcribing the handwriting into digital text. It's a recipe for errors and frustration.

This is where automated key-value pair extraction helps. By leveraging deep learning techniques, we can teach machines to understand document structures and extract valuable information accurately and efficiently.

In this guide, we'll look at key-value pair extraction, from its wide-ranging applications to cutting-edge techniques. We will provide an overview of KVP extraction use cases, help you understand traditional methods and their limitations, explore how deep learning is revolutionizing the field, and guide you through building your own extraction system.

What is a Key-Value Pair (KVP)?

Imagine you're organizing your closet. You might label each shelf: "Shirts," "Pants," "Shoes." These labels are your keys, and the items on each shelf are the values. That's the essence of a key-value pair!

In the world of data, a key-value pair (KVP) is a set of two linked data elements: a unique identifier (the key) and its associated data (the value). It's like a digital labeling system that allows for efficient storage and retrieval of information.
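As a minimal sketch, the fields of a form can be modeled as a Python dictionary. The field names and values below are made up purely for illustration.

```python
# Hypothetical form fields modeled as key-value pairs.
form = {
    "Name": "Jane Doe",
    "Date of Birth": "1990-04-12",
    "City": "Lisbon",
}

# Retrieval by key is a direct lookup; no scan over the values is needed.
print(form["Name"])

# Keys are unique: assigning to an existing key overwrites its value.
form["City"] = "Porto"

# A missing key can be handled with a default instead of raising an error.
phone = form.get("Phone", "not provided")
print(phone)
```

The same shape (a label plus the data attached to it) is what a KVP extractor tries to recover from a scanned document.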

An example of key-value pair extraction

KVPs are the building blocks of many data structures and databases. The beauty of KVPs lies in their simplicity and flexibility. They can handle structured data (like spreadsheets) and unstructured data (like text in documents) equally well. This makes them a powerful tool for key information extraction from diverse sources.

Who will find KVP extraction useful?

Key-value pair extraction is not just for tech wizards. This powerful technique has applications that stretch far beyond the realm of coding and data science.

Let's explore how KVP extraction can be a game-changer for both personal and business use.

Personal use cases

Invoice OCR in Nanonets: key-value pair extraction from email

While automation is mostly associated with large-scale operations, fast and accurate key-value extraction can also benefit small teams and individuals by improving the organization and efficiency of daily routines.

1. ID-scanning and data conversion:

Personal IDs are typical examples of documents that contain various KVPs, from the given name to the date of birth. When this information is needed for online applications, we often have to find it and type it in manually, which can be tedious and repetitive.

Extracting KVPs from an image of the ID lets us quickly convert the data into machine-readable text. Matching values to the right fields then becomes a trivial task for a program, and the only manual effort required is a quick scan to double-check the result.

2. Invoice data extraction for budgeting:

Budgeting is an important part of our personal routine. While spreadsheets such as Excel have already made such tedious tasks simpler, extracting the items purchased and their corresponding prices from just an image of the invoice can speed up the process even further. Structured data and numbers let us quickly perform analysis and watch out for purchases that are beyond our budget.

3. Email organization and prioritization:

Drowning in a sea of emails? KVP extraction can help you stay afloat. By identifying key information like sender, subject, and important dates within emails, it can automatically sort and prioritize your inbox. Imagine never missing an important deadline or follow-up again!

Business use cases

Businesses across industries deal with thousands of similarly formatted documents every day. From applications to asset management, these document information retrieval processes are often labor-intensive.

Hence, automating the initial step of extracting key-value pairs from unformatted data can significantly reduce repetitive manual work while improving the reliability of the data retrieved.

1. Automation of document scanning:

Governments and large businesses such as banks process many handwritten forms with identical formats for various purposes (e.g., visa applications and bank transfers). Manually retrieving the handwritten information from these forms and converting it into digital documents is extremely repetitive and tedious, which leads to frequent minor errors.

A proper KVP extraction pipeline that converts handwritten data into values for the corresponding keys and feeds them into large-scale systems can reduce such errors and save labor costs.

2. Survey collection and statistical analysis:

Companies and non-governmental organizations (NGOs) often need feedback from customers or citizens to improve their current products or promotional plans, and they need statistical analysis to evaluate that feedback comprehensively.

Yet the same problem remains: unstructured data and handwritten surveys must be converted into numerical figures that can be used for calculations. Hence, KVP extraction plays a crucial role in converting images of these surveys into analyzable data.

3. Supply chain management:

In the complex world of logistics, KVP extraction can be a lifesaver. Extract key information from shipping manifests, invoices, and customs documents to streamline your supply chain processes. This can lead to faster shipments, reduced errors, and happier customers.

4. Healthcare record management:

For healthcare providers, managing patient records efficiently is crucial. KVP extraction can help digitize and organize patient information from various sources – intake forms, lab reports, and doctor's notes. This not only saves time but can also improve patient care by making critical information easily accessible.

5. Legal document analysis:

Law firms deal with mountains of documents daily. KVP extraction can help lawyers quickly identify key information in contracts, court documents, and case files. This can significantly speed up case preparation and contract review processes, allowing legal professionals to focus on strategy rather than drowning in paperwork.

6. Customer service optimization:

By extracting key information from customer emails, chat logs, and support tickets, businesses can quickly categorize and prioritize customer issues. This leads to faster response times, more personalized service, and ultimately, higher customer satisfaction.

So, how exactly does KVP extraction work? And how can you implement it in your own projects or business processes? In the next section, we'll look at the traditional approaches to KVP extraction and their limitations.

The traditional approach to Key Value Pair extraction

The most important element of KVP extraction is the Optical Character Recognition (OCR) process, which uncovers the underlying data. In simple terms, OCR is the electronic conversion of scanned images and photos into machine-encoded text for further computation.

Before deep learning became accurate enough to meet market needs for such tasks, OCR was performed with the following procedure:

Database creation: First, we build a vast library of known characters and symbols. It's like creating a digital alphabet book.
Feature detection: When an image comes in, OCR uses a photosensor to identify key points and features. Imagine tracing the lines of each letter with your finger.
Pattern matching: The system then compares the detected features with its database of known characters.
Text conversion: Based on the highest similarity attributes, it transforms the matched patterns into machine-readable text, making your scanned image or document digitally accessible.
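The four steps above can be sketched with a toy example. This is only an illustration, not how any particular OCR product works: the 3x3 glyph shapes, the TEMPLATES "database", and the pixel-agreement similarity measure are all invented for this sketch.

```python
# Toy pattern-matching OCR on 3x3 binary glyphs (1 = ink, 0 = background).
# Step 1, database creation: a tiny, made-up library of known characters.
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}


def similarity(glyph, template):
    """Fraction of pixels on which the glyph and the template agree."""
    pairs = [(g, t) for g_row, t_row in zip(glyph, template)
             for g, t in zip(g_row, t_row)]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph):
    """Steps 3 and 4: compare against every template and keep the best match."""
    best = max(TEMPLATES, key=lambda ch: similarity(glyph, TEMPLATES[ch]))
    return best, similarity(glyph, TEMPLATES[best])


# Step 2 stand-in: a detected glyph, here a "T" with one flipped pixel.
noisy_t = ((1, 1, 1),
           (0, 1, 0),
           (0, 1, 1))
char, score = recognize(noisy_t)
print(char, round(score, 2))
```

A single flipped pixel still matches "T", but the score drops below 1.0. Handwriting deviates from the stored templates far more than this, which hints at why the approach struggles outside clean printed text.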

Limitations of traditional Key Value Extraction

For years, this approach has been the go-to method for extracting key-value pairs from documents. But as with any technology, it has its limitations.

Template dependence: Traditional methods often require predefined templates or rules for different document types.
Handwriting detection: While great with printed text, these systems often stumble when faced with the wild world of human handwriting.
Lack of context: Traditional OCR focuses on individual characters, sometimes missing the bigger picture of how information is structured on the page.
Inflexibility: Adapting to new document formats or layouts can be time-consuming and require manual updates to the system.

Despite these limitations, traditional methods still play a crucial role in many key value extraction scenarios. However, as our data needs have grown more complex – think of the vast array of document types a large corporation deals with daily – so too have our extraction methods.

Thankfully, the recent advancements in deep learning have breathed new life into OCR and key-value pair extraction techniques. Deep learning models, particularly convolutional neural networks (CNNs), have revolutionized the field of image recognition and text extraction.


Deep learning in action

Deep learning is a major branch of machine learning that has gained popularity in recent decades. In traditional computer science and engineering, we design the system that turns an input into an output. Deep learning instead uses example inputs and outputs to learn that intermediate system, a so-called neural network, so that it can generalize to unseen inputs.

At the heart of deep learning lies the neural network - a complex web of interconnected nodes.

A neural network is an architecture that is inspired by the biological function of the human brain. The network consists of multiple layers:

Input layer: This is where your document enters the system. Whether it's a scanned invoice, a handwritten form, or a digital PDF, the input layer processes the raw data.
Hidden layers: These are the network's powerhouse. Multiple layers work together to identify features, recognize patterns, and make sense of the document's structure.
Output layer: This is where the magic happens. The system produces the extracted key-value pairs, neatly organized and ready for use.

As GPU compute and memory capacity advanced drastically, deep learning became a favored strategy, which sparked creative variations of neural networks. One of the most widely used today, especially in computer vision, is the convolutional neural network (CNN). CNNs use convolutional kernels that slide across the image to extract features, often followed by traditional (fully connected) network layers to perform tasks such as image classification or object detection.

A CNN-based extraction model doesn't just look at individual words or characters; it examines the entire document, considering layout, font sizes, and even subtle visual cues. This holistic approach allows it to understand the document's structure and extract key-value pairs with remarkable accuracy.
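The sliding-kernel operation at the core of a CNN can be sketched in plain Python. This is a minimal illustration under simplifying assumptions (no padding, stride 1, a single channel), and the kernel weights are hand-picked here, whereas a trained CNN learns them from data.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch currently under the kernel.
            acc = sum(image[i + di][j + dj] * kernel[di][dj]
                      for di in range(kh) for dj in range(kw))
            row.append(acc)
        output.append(row)
    return output


# A 4x4 image with a vertical edge between a dark (0) and a bright (1) half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel that responds to left-to-right brightness increases.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map is non-zero only where the kernel straddles the edge. Stacking many such learned kernels is what lets a network pick up on strokes, ruling lines, and box borders in a document image.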

For instance, in healthcare record management, a CNN can distinguish between patient information, doctor's notes, and test results, even when the layout varies between documents. This level of understanding was simply not possible with traditional methods.

The most exciting bit is that the more representative documents a deep learning system is trained on, the more accurate it tends to become.

Now that you have some basic understanding of deep learning, let's go through several deep learning approaches for KVP extraction.

Tesseract OCR Engine

Recent OCR engines have also incorporated deep learning models to achieve higher accuracy. The open-source Tesseract OCR engine, whose development Google sponsored for many years, is a prime example. It utilizes a specific type of neural network called Long Short-Term Memory (LSTM).

What is LSTM?

LSTMs are a family of recurrent neural networks that are applied mainly to sequence inputs. Here's why they are a game-changer for key-value pair extraction:

Sequential Data Processing: LSTMs excel at handling sequential data. Think of it as reading a document the way a human would – understanding context and predicting what might come next.

Context matters: In OCR, previously detected letters can help predict the next ones. For example, if "D" and "o" are detected, "g" is more likely to follow than "y".
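To make the role of context concrete, here is a deliberately simplified stand-in. It is not an LSTM: it merely counts which character follows each two-character context in a tiny made-up word list, whereas an LSTM learns a much richer notion of context from data. It does show why earlier characters change the odds of the next one.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real system would learn from far more text.
corpus = ["dog", "dog", "dot", "door", "done", "day", "dog"]

# Count which character follows each two-character context.
next_char_counts = defaultdict(Counter)
for word in corpus:
    for i in range(len(word) - 2):
        context, nxt = word[i:i + 2], word[i + 2]
        next_char_counts[context][nxt] += 1


def next_char_probability(context, candidate):
    """Estimated probability that `candidate` follows `context`."""
    counts = next_char_counts[context]
    total = sum(counts.values())
    return counts[candidate] / total if total else 0.0


print(next_char_probability("do", "g"))  # 0.5
print(next_char_probability("do", "y"))  # 0.0
```

An OCR engine can use this kind of signal to prefer "dog" over "doy" when the image evidence for the last character is ambiguous.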

Tesseract Architecture

The figure above shows the detailed architecture of Tesseract v4.

A small bounding box is moved forward pixel by pixel, one position per time step. The image region inside the box is passed through both a forward and a backward LSTM, followed by a convolution layer for the final output.

How does it help with KVP Extraction?

The improved architecture increases the accuracy and robustness of the OCR, making it easier to convert many different types of text into one structured, electronic document. These electronic documents, with their machine-readable strings, are much easier to organize for KVP extraction.

Deep Reader

Besides driving advancements in OCR, deep learning has also created opportunities for exploration. Deep Reader, a workshop paper from the top CS conference ACCV*, is one example that uses neural networks to recognize shapes and formats beyond just the words and symbols of a scanned document. Such techniques can be particularly helpful in tasks such as KVP extraction.

*Side Note: The best research papers in computer science are usually published in top-tier conferences. Acceptance into such conferences signals approval and recognition by experts within the field. The Asian Conference on Computer Vision (ACCV) is one of the recognized conferences in the domain of computer vision.

What is Deep Reader?

While Tesseract focuses on text, Deep Reader takes key value pair extraction to the next level by understanding the entire document structure.

Deep Reader tackles the problem that extracting words and text alone retrieves too little information. It also finds visual entities such as lines, tables, and boxes within the scanned documents.

For every image, Deep Reader denoises the image, identifies the document, and processes the handwritten text with a deep-learning approach before detecting and extracting meaningful texts and shapes. These features are then used to retrieve tables, boxes, and, most importantly, KVPs.

Pre-processing

Before extracting textual entities, Deep Reader performs several pre-processing steps to ensure the best retrieval quality in later stages:

Image de-noising: Deep Reader adopts a generative adversarial network (GAN) to generate a de-noised version of an input. The GAN, first introduced by Ian Goodfellow et al. in 2014, is a neural network that comprises two sub-networks: a generator and a discriminator. Given an input, the generator produces an image, and the discriminator tries to distinguish between the ground truth and the generated image. Once training is complete, the generator can produce an image from the input that is close to the actual ground truth. In this case, the GAN is given pairs of images (one clean and one noisy) and learns to generate the de-noised version of an image from the perturbed one.
Document identification: To retrieve visual entities accurately, Deep Reader also classifies each scanned document into one of the known templates via a convolutional Siamese network. The Siamese network consists of two identical convolutional branches that accept the scanned document image and a template image as inputs, respectively, and then computes the similarity between the two. The template with the highest similarity across all comparisons is taken to be the one the document is based on.
Processing handwritten text: To tackle the problem of recognizing handwriting, Deep Reader also adopts an encoder-decoder model for handwritten text recognition, which maps handwritten text to sequences of characters.
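The document identification step can be sketched as a nearest-template search. The three-dimensional embeddings below are hypothetical stand-ins for what the shared convolutional branches would output, the template names are invented, and cosine similarity is an assumed choice of similarity measure rather than a detail confirmed by the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings of known templates (normally produced by the network).
template_embeddings = {
    "visa_application": [0.9, 0.1, 0.0],
    "bank_transfer": [0.1, 0.8, 0.3],
    "invoice": [0.0, 0.2, 0.9],
}


def identify_template(doc_embedding):
    """Compare the document against every template and return the closest one."""
    scores = {name: cosine_similarity(doc_embedding, emb)
              for name, emb in template_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical embedding of a newly scanned document.
scanned_doc = [0.2, 0.7, 0.4]
name, score = identify_template(scanned_doc)
print(name, round(score, 3))
```

Because both inputs pass through identical branches, documents built on the same template should land close together in the embedding space, which is what makes a simple highest-score rule workable.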

Deep Reader Architecture

After pre-processing, Deep Reader detects a set of entities in the image, including page lines, text blocks, lines of text blocks, and boxes. The detection follows the schema shown in the figure above to retrieve a comprehensive set of data from the scanned document.

Rule-based methods provided by domain experts are also adopted to aid the extraction process. For example, Deep Reader uses abstract universal data types such as city, country, and date to ensure that the fields retrieved are relevant.
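A rule-based check of this kind might look like the sketch below. It is not Deep Reader's actual implementation: the country list, the accepted date formats, and the field names are all assumptions made for illustration.

```python
from datetime import datetime

# Illustrative subset only; a real rule set would be supplied by domain experts.
KNOWN_COUNTRIES = {"india", "japan", "germany", "brazil"}
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")


def is_date(text):
    """True if the text parses as a real calendar date in an accepted format."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text.strip(), fmt)
            return True
        except ValueError:
            continue
    return False


def is_country(text):
    return text.strip().lower() in KNOWN_COUNTRIES


VALIDATORS = {"date": is_date, "country": is_country}


def filter_fields(candidates, expected_types):
    """Keep only the fields whose value passes the check for its expected type."""
    kept = {}
    for key, value in candidates.items():
        validator = VALIDATORS.get(expected_types.get(key))
        if validator is None or validator(value):
            kept[key] = value
    return kept


candidates = {"Invoice Date": "31/02/2021", "Country": "Japan", "Due Date": "2021-03-15"}
expected_types = {"Invoice Date": "date", "Country": "country", "Due Date": "date"}
print(filter_fields(candidates, expected_types))
```

The impossible date (31 February) is dropped while the other two fields pass, which shows how typed rules can catch OCR or matching errors that look plausible as raw text.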

Code implementation: Bringing KVP Extraction to life

Let's apply our theoretical knowledge to a practical problem. We'll focus on a common yet challenging scenario: extracting company, address, and price fields from invoices. Whether you're a small business owner tracking expenses or a data scientist automating document processing, this implementation will give you a solid foundation.

A sample invoice image of the type we will be extracting from. Imagine you have a stack of invoices that look something like this.

The figure above is a standard invoice template saved in an image format. We have many of these invoices with similar formats, but manually finding the KVPs, such as the company name, address, and total price, is a tiring job. Thus, the aim is to design a KVP extractor such that with a given format (or similar formats), we can automatically retrieve and present the KVPs.

To perform KVP extraction, we will need an OCR library and an image processing library. We will use the well-known OpenCV library for image reading and processing and the PyTesseract library for OCR. PyTesseract is a Python wrapper around the aforementioned Tesseract engine, which will be sufficient for our task.

*Side Note: The program is based on a solution to the ICDAR Robust Reading Challenge

Part I — Libraries

You can use pip to install the two libraries via the following commands:

https://gist.github.com/ttchengab/c040ab7ce44114d76c63ecef226d5d09

After installation, we can then import the libraries as the following:

https://gist.github.com/ttchengab/cd32bcd502e99c3e3cc9c73f693927c7

We will also have to import some external libraries:

https://gist.github.com/ttchengab/01280236448e4fc4a03505f6f0baea3f

Part II — Image Preprocessing

https://gist.github.com/ttchengab/293fc3ca782b20cf9b05c33f13583338

The function above is our image preprocessing for text retrieval. We follow a two-stage approach:

First, we use the cv2.imread() function to load the image for processing. To increase the clarity of the text in the image, we perform image dilation followed by noise removal using cv2 functions. Some additional image processing functions are also listed in the comments. Then, we find contours in the image and compute their bounding rectangles.

Second, we iterate over the bounding boxes and use the pytesseract engine to retrieve all the text in each one, which we then feed into a network for KVP extraction.

Part III – LSTM KVP Extraction

https://gist.github.com/ttchengab/b81ea8bb1c21121237845d65d15aa3a0

The model above is a simple LSTM that takes the texts as inputs and outputs the KVPs of company name, date, address, and total. We adopted the pre-trained model from the solution for testing.
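One common way for a sequence model to output fields is to assign a class to every input character and then group the characters by class. The sketch below shows only that grouping step. The label scheme (0 = none, 1 = company, 2 = date, 3 = address, 4 = total), the decode_fields helper, and the hard-coded labels are assumptions for illustration; in practice the labels would come from the trained network, and the linked model may work differently.

```python
from itertools import groupby

# Assumed label scheme: 0 means "not part of any field".
CLASS_NAMES = {1: "company", 2: "date", 3: "address", 4: "total"}


def decode_fields(text, labels):
    """Group consecutive characters that share a label into field values."""
    fields = {}
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label in CLASS_NAMES:
            chunk = text[position:position + length].strip()
            name = CLASS_NAMES[label]
            # Separate runs of the same field are joined with a space.
            fields[name] = f"{fields[name]} {chunk}" if name in fields else chunk
        position += length
    return fields


text = "ACME LTD 12/03/2019 TOTAL 45.50"
# Hand-written per-character labels standing in for model predictions.
labels = [1] * 8 + [0] + [2] * 10 + [0] * 7 + [4] * 5

print(decode_fields(text, labels))
```

Here the word "TOTAL" itself is labeled 0, so only the amount ends up in the total field.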

The following are the evaluation functions for the LSTM network with a given set of texts:

https://gist.github.com/ttchengab/9f31568ef1b916ab0ee74ac1b8b482e5

Part IV – Entire Pipeline

https://gist.github.com/ttchengab/c2f7614cbeaa8cd14883d4ebbcd36ba6

With all the functions and libraries implemented, the entire pipeline of KVP extraction can be achieved with the above code. Using the invoice above, we could successfully retrieve the company name and the address as the following:

company details extracted from the invoice

To test the robustness of our model, we can also test on invoices with unseen formats, such as the following:

By using the same pipeline, without further training, we could obtain the following:

Even though we couldn't retrieve other information such as company name or address, we were still able to obtain the total correctly without ever seeing any similar invoice formats before!

With an understanding of the model architecture and pipeline, you can now continue training the model on invoice formats that are more relevant to your use case, so that it works with higher confidence and accuracy.

Best practices and optimization techniques for Key-Value Extraction

Implementing an effective key value pair extraction system isn't just about writing code; it's about optimizing your approach for accuracy, efficiency, and scalability. Here are some best practices to supercharge your extraction process:

Clean your images: Remove noise, correct skew, and enhance contrast.
Standardize formats: Convert all documents to a consistent format before processing.
Create custom dictionaries: Build lists of expected keys for specific document types.
Use regular expressions: Design patterns to catch common value formats (e.g., dates, currency).
Validate extracted data: Set up checks to ensure extracted values make sense.
Handle exceptions: Plan for unexpected document formats or OCR errors.
Use parallel processing: Distribute extraction tasks across multiple cores or machines.
Implement caching: Store frequently accessed data to reduce processing time.
Implement feedback loops: Allow users to correct errors, feeding this data back into your system.
Regularly update your models: Retrain on new data to improve accuracy over time.
Encrypt sensitive data: Protect extracted information, especially when dealing with personal or financial details.
Implement access controls: Ensure only authorized personnel can access extracted data.
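Several of these practices (custom dictionaries, regular expressions, validation, and exception handling) can be combined in a few lines. The sketch below assumes OCR output arrives as "Key: Value" text lines; the expected keys, the patterns, and the sample lines are illustrative choices, not a fixed standard.

```python
import re

# Custom dictionary: the keys we expect on this (hypothetical) invoice type.
EXPECTED_KEYS = {"invoice no", "date", "total"}

# Generic "Key: Value" line, plus per-key patterns for common value formats.
LINE_PATTERN = re.compile(r"^\s*(?P<key>[^:]+?)\s*:\s*(?P<value>.+?)\s*$")
VALUE_PATTERNS = {
    "date": re.compile(r"^\d{2}/\d{2}/\d{4}$"),
    "total": re.compile(r"^\$?\d{1,3}(,\d{3})*(\.\d{2})?$"),
}


def extract_pairs(ocr_lines):
    """Return validated pairs plus the (key, value) pairs that failed validation."""
    pairs, rejected = {}, []
    for line in ocr_lines:
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # not a "Key: Value" line
        key = match.group("key").lower()
        value = match.group("value")
        if key not in EXPECTED_KEYS:
            continue  # not a key we care about for this document type
        pattern = VALUE_PATTERNS.get(key)
        if pattern and not pattern.match(value):
            rejected.append((key, value))  # keep for human review
            continue
        pairs[key] = value
    return pairs, rejected


ocr_lines = [
    "Invoice No: INV-0042",
    "Date: 12/03/2019",
    "Total: $1,250.00",
    "Total: 1Z50.OO",  # simulated OCR misread
    "Thank you for your business!",
]
pairs, rejected = extract_pairs(ocr_lines)
print(pairs)
print(rejected)
```

Returning the rejected values instead of silently dropping them gives a feedback loop something to work with: a reviewer can correct them, and the corrections can later be used for retraining.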

What is a Key-Value Database?

While we've explored the intricacies of key-value pair extraction, it's useful to understand where this data often ends up: key-value databases. These systems power many modern applications, from e-commerce platforms to social media networks.

A key-value database, also known as a key-value store, is a type of non-relational database that uses a simple key-value method to store data. Each item in the database is stored as an attribute name (or "key") together with its value.

Key-Value vs. Relational Databases

Traditional relational databases organize data into tables with predefined schemas. In contrast, key-value databases offer more flexibility:

Schema-less: Key-value databases don't require a fixed schema, allowing for easy modifications.
Scalability: They can handle vast amounts of data and traffic more efficiently.
Performance: For simple queries, key-value databases often outperform relational databases.
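The schema-less idea can be illustrated with a minimal in-memory store. This is a toy sketch rather than a real database: the KeyValueStore class, its method names, and the "invoice:1001" key convention are hypothetical, and production stores add persistence, replication, and expiry on top of the same interface.

```python
class KeyValueStore:
    """Toy in-memory key-value store with put, get, and delete operations."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key; return True if it existed."""
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()

# Schema-less: the two records do not need to share the same fields.
store.put("invoice:1001", {"company": "ACME LTD", "total": "45.50"})
store.put("invoice:1002", {"company": "Globex", "total": "120.00", "due_date": "2021-03-15"})

print(store.get("invoice:1002")["due_date"])
print(store.get("invoice:9999", "not found"))
```

Adding the due_date field to the second record required no migration, which is the flexibility the comparison above refers to. The trade-off is that the store itself cannot answer queries such as "all invoices over 100" without scanning every value.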

Nanonets OCR API for Key Value Extraction

As we've explored the complexities of key-value pair extraction, it's clear that implementing a robust solution requires significant expertise. This is where platforms like Nanonets shine, offering a powerful OCR API that simplifies the extraction process.

Nanonets leverages cutting-edge AI to provide:

Pre-trained models for common documents like invoices, receipts, and ID cards
Custom training capabilities for your unique document formats
High accuracy on both printed and handwritten text
Seamless integration through a RESTful API
Flexible post-processing rules to refine extracted data

For organizations looking to quickly implement key value extraction without compromising on quality, Nanonets offers a compelling solution. By handling the complexities of AI model development and maintenance, Nanonets allows businesses to focus on what really matters – deriving value from their document data.

Whether you're a startup processing your first batch of invoices or an enterprise handling millions of documents, platforms like Nanonets are making advanced key value extraction accessible and efficient.

Final thoughts

We've covered a lot of ground on key-value pair extraction. We've explored the concept of KVPs, their use cases, and various extraction methods - from traditional OCR to cutting-edge deep learning approaches. But remember, there's still a long way to go.

This field is constantly evolving, with AI and machine learning pushing the boundaries of what's possible. As we wrap up, consider how you can apply these insights to your own document processing challenges.