How to OCR Purchase Orders For Automation

A typical automated procurement process at any company has multiple documents associated with it like Invoices, Purchase Orders, Delivery Notes, etc. This process has been a consistent focus of technology-based improvements to reduce the overhead. A major optimization has been through the digitization of these documents leading to lower costs, faster turnaround times, and a reduction in error rates. This post will outline the current state-of-the-art technology in OCR-based data capture from these documents specifically focusing on Purchase Orders.

Without going into too much detail a typical procurement workflow looks like this:

  1. Buyer shares a Purchase Order
  2. The vendor generates an Invoice
  3. Buyer generates a GRN / Order Receipt Note

There are some subtle differences between the data capture process and requirements for each of these documents because of the difference in the information contained and the structure of these documents. A major difference is also because of who is preparing the document and as a result who has the need to digitize the document.

3 Way Matching

A major reason for the digitization is because these documents need to all corroborate and tell a consistent story of the transaction. The process of corroboration of these 3 documents is referred to as 3-way matching. The need for and process of 3-way matching vastly differ depending on who is conducting the match, the buyer or the seller.

Buyer’s Perspective:

The Buyer generates the PO and the Receipt and has this information which can be easily reconciled in their software. Need to match the Invoice to the Purchase Order and Receipt. The Buyer needs to digitize the Invoice the other documents are already inside their ERP system.

There are a variety of reasons why a Buyer needs to perform 3-way matching:

  1. Ensuring that the purchase is authorized by matching an invoice and GRN with a Purchase Order
  2. Ensuring the correct product is purchased by matching across documents
  3. Ensuring the correct quantity authorized has been procured and delivered.
  4. Ensuring the price paid for each product was authorized
  5. Ensuring the correct vendor was selected and the correct vendor will eventually get paid since the same products might be procured from different vendors
  6. Matching inventory with quantities in the GRN for downstream quality of data

Seller’s Perspective:

The Seller generates the Invoice and needs to make sure the PO and Receipt match the information in the Invoice. The Seller needs to digitise the Purchase Order and Receipt the Invoice is generated from his ERP.

Seller’s need for 3 Way Matching

  1. Checking if a Purchase order can be fulfilled given inventory in the system
  2. Ensuring goods shipped to match the product that was requested
  3. Ensuring the correct customer was sent the requested product
  4. Ensuring the products requested are the ones that are delivered to the customer
  5. Ensuring the correct quantities are given to the customer
  6. Ensuring price in PO has a gross margin that can be fulfilled

Looking for an AI based OCR solution to automate Purchase Orders? Give Nanonets™ Intelligent Automation Platform a spin and put your Purchase Orders on autopilot!


Problems with Manual 3 Way Matching

Since the 3 way matching problem is fairly critical for both sides to meet their end of the contract efficiency and accuracy are paramount. However this is a fairly high process cost from a variety of perspectives:

Human Cost of Document Tracking and Errors

  1. It is a messy process especially when the PO is revised multiple times. Maintaining the correct version of the PO can be difficult. If not done correctly, it can lead to multiple payments, delivery of extra items, etc.
  2. There are multiple similar documents and transactions between a frequent supplier, buyer. These transactions can be consumed.
  3. The process cannot scale. Maintaining optimal human resources is hard when the volume of processing changes rapidly. Most companies have these departments overstaffed to compensate for spikes in volume

Payment or Procurement Delays

  1. Data from the documents is entered manually. This process becomes a bottleneck when the volume of documents processed increases
  2. Delays can lead to a delay in delivery/payment / procurement. Leading to a high cost of working capital or loss of revenue due to delays in procuring raw materials etc.

Inventory Errors

  1. If Inventory systems are not correctly integrated with this process there can be a high cost of miscalculating inventory. Resulting in overstocking, duplicate orders, understocking and loss of revenue.

Errors in 3 Way Matching

There are a variety of different errors that occur in this process. Below are a few examples

Vendor Matching Errors

Vendor matching is typically done based on two things, Vendor Name and Address. Since the same company could have various subsidiaries and different business units issuing similar invoices.

If the address from the Purchase Order and Invoice don’t have the correct address and Vendor name identified there could be issues in matching. Also as can be clearly seen just direct text matching doesn't work in matching the Invoice and PO.

Purchase Order Invoice Status
Acme Inc. Acme Inc. Works
Acme Inc. Acme Inc. Africa Fails
Acme Acme LLC. Fails
Acme LLC. Acme LLC. Artillery Division Fails

Product Matching Errors

Products are the most difficult item to match since they rarely follow the same name in the purchase order and invoice and receipt. This is most likely the biggest cause for errors.

Causes for errors could be because there are different versions of the same product, different sizes, specifications and prices. The product might have had a recent update, a substitute for an unavailable item was provided etc.

Purchase Order Invoice Status
TYLENOL Sinus Pressure and Pain TYLENOL Sinus Pressure and Pain Works
Tylenol TYLENOL Sinus Pressure and Pain Fails
TYLENOL Sinus Pressure and Pain TYLENOL® Sinus Pressure and Pain Fails
TYLENOL Extra Strength TYLENOL Extra Strength Pain Reliever & Fever Reducer 500 mg Caplets Fails
TYLENOL Extra Strength Pain Reliever & Fever Reducer 500 mg Caplets TYLENOL Extra Strength Pain Reliever & Fever Reducer 250 mg Caplets Fails

Quantity Matching Errors

Even if the product is matched correctly there might be errors in matching the quantity if the product in a specified quantity is unavailable etc. This is typically true between the purchase order and GRN as well since there is usually a time lag between these two documents.

Price Matching Errors

If there was a change in price during the entire product purchase lifecycle or if the product was updated or substituted this error might occur.

Duplicate Documents

If there are frequent purchases of similar products from the same vendor there might be multiple documents that look very similar. If the reference number to a PO is not mentioned in the Invoice and on the GRN for either the Invoice or the PO then there is a significant scope for document mismatch.


Digitising Purchase Orders

To get gather all of the relevant data from Purchase Orders the following fields need to be extracted:

Common list of fields to be extracted (across Purchase Orders they might have different names):

Billing Address Payment Terms Sub Total
Buyer Name PO Number Subject
Contact Name Product Total
Currency Purchase Order Date Unit Price
Deliver To Quantity Vendor Name
Due Date Requisition No


Current Solutions and their Problems

Template + Text Matching

This involves defining the exact region to look for a particular piece of information. So let’s say if you are looking to extract the date and the format is exactly the same across documents and the date occurs at the exact same place in the document. We define the area to look for in the document for the date.
Here is the process:

  1. Convert the document to an image
  2. We give a sample document
  3. Mark the region where the date is found (the document is viewed as a coordinate system where the top left corner is (0,0)) we might mark (200,300) to (350,450) which is the region of interest for the date
  4. When there is a new document we go and check (200,300) to (350,450) and extract the text there

Historically this has been one of the most common approaches to solving this problem. This is due to a variety of reasons:

  1. Simplicity of software and to implement. A competent programmer could create this solution in less than a day
  2. There is no uncertainty how the solution will work, if the document is exactly the same format it will work perfectly
  3. It requires very limited compute resources to extract the data

There are however some very obvious challenges given how rudimentary this method is:

  1. It doesn't work if there even a slight difference in documents
  2. It doesn't work if the Buyer updates their format
  3. For each buyer there is a new format that needs to be setup
  4. It doesn't work for scanned documents

OCR + NLP + Text Matching

OCR + NLP is a newer technique to extract data from documents. OCR is a fairly well studied problem and extracting text from documents works most of the time. The next step is to take all of the raw text that is extracted from the document and then try to parse each of the individual pieces of text in the document

Within NLP there a variety of techniques that can be employed to solve this problem

  1. Regex (Regular Expressions)

To extract a date a regular expression would look like:

^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]|(?:Jan|Mar|May|Jul|Aug|Oct|Dec)))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2]|(?:Jan|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)(?:0?2|(?:Feb))\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9]|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep))|(?:1[0-2]|(?:Oct|Nov|Dec)))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

Extracts 01/02/2020

Source: https://stackoverflow.com/questions/15491894/regex-to-validate-date-format-dd-mm-yyyy

The downside of the Regex based solution is that each new format needs to be programmed separately. If there is a new format it needs to be added to the regular expression. It does not identify if a particular date is the delivery date, due by date or date of the purchase order.

2. NER (Named Entity Recognition)

To extract field types using NER

results = stanford_ner_tagger.tag(article.split()) print('Original Sentence: %s' % (article)) for result in results: tag_value = result[0] tag_type = result[1] if tag_type != 'O': print('Type: %s, Value: %s' % (tag_type, tag_value))

Prints
Type: DATE, Value: 01/02/2020

Source: https://towardsdatascience.com/named-entity-recognition-3fad3f53c91e

The downside of NER based extraction is that it works well for well defined types but fails for a variety of types like addresses and things that are non standard formats. It works well for Dates, Currency, Phone numbers etc. It works occasionally for Names and Entities. It suffers from a similar problem of not knowing which date it is the delivery date, due by date or date of the purchase order.

3. Force Vendor / Buyer to use your software to submit document

If a Vendor or a Buyer has significant leverage in a transaction he can force the other party to use their software to submit their document. This removes most problems and offloads the entire responsibility to the vendor and no document digitisation is required. However this suffers from the obvious issue of not being ubiquitous and requires that the other party interact with your software. Even if a few 3 parties don't follow this protocol it requires a significant effort to add it to your existing system.


Deep Learning

Deep Learning technology has become fairly advanced in recent times in extracting data and more importantly extracting features that can lead to better predictions.

Data Extraction using Graph Convolutional Neural Networks

Graph Convolutional Neural Networks (GCN) can be used to extract data from these documents. We feed the GCN a variety of different features, each of which is used to be able to extract the correct information.

It can be broken down into 2 steps:
1. Feature Extraction

We extract various features from each text block.

a) Text Features

b) Visual Features

c) Location Features

d) Size Features

e) Font Features

2. Graph Creation

These features are created for each block of text and then a graph is created. For each text block it is passed the features of its neighbour. With the graph features and other features created for each text block it is then classified as one of the fields of interest or as None.

Document Matching

To match the documents Deep Learning is also a great solution where extracted fields from each of the document types can be parsed to give a final prediction if the documents match.

Problems with Deep Learning

There are two classes of information to be extracted

1. Problems with Key values (PO number, date, etc)

  • Identifying key value pairs. They are not uniformly positioned across formats and unclear how many neighbours to look through.
  • Taking multiple languages into account.
  • Not enough data to train for a particular key (class imbalance)

2. Problems with Table values

  • Classifying any box as table
  • Missing tables in pages with multiple tables
  • Merging two close columns
  • Misinterpreting page borders as table borders

3. Other Problems

  • Rotation and Cropping
  • Poor quality images
  • Data drift

Using Nanonets

Solving the Problems in using Deep Learning for OCR on Purchase Orders

Using the Nanonets API You can automatically extract all the required keys required for matching documents. Just upload a document and get all the extracted fields returned in the format of your choosing.

We tackle most of the problems listed above so you don't have to spend time reinventing the wheel.

Key Value Pairs:

1.Identifying key value pairs. They are not uniformly positioned across formats.

Using our GCN implementation we are able to parse keys across documents. Our GCN implementation includes optimisations for finding the right neighbourhood search to get the best tread-off between feature explosion and lack of context for the model to correctly interpret which key each belongs to

2. Taking multiple languages into account.

Our models have been trained with text embeddings that are language agnostic. This is achieved by creating a feature space such that the word embeddings for 'Invoice' and 'Faktura' (Invoice in German) and चलाना (Invoice in Hindi) all map to the same feature space. So the text features become language independent and the model need not be trained per language. Nanonets can be used as OCR software in 200+ more languages.

3. Not enough data to train for a particular key (class imbalance)

We have a large corpus of financial documents our models are trained on which mitigates this issue.

Tables

1.Classifying any box as table

We have built a robust classifier which can identify a variety of different tables in different formats and settings. This uses a mixture of visual features and layout features along with text based features to identify the structure of the table.

2. Missing tables in pages with multiple tables

Having identified each of the different table structures across pages we have a merge logic that decides if the structure is similar enough to merge and if a table on a previous page was incomplete to determine if two tables should be merged.

3. Merging two close columns

If we rely solely on visual features this becomes a problem as it is difficult to identifying space based separation. However with the inclusion of visual text and position based features it becomes possible to identify where the kind of data contained in a column is different enough to be called a new column.

4. Misinterpreting page borders as table borders

Similar to the above this problem is solved by using a variety of features to determine if a particular structure is a table or a page border.

Other Problems

1.Rotation and Cropping

We've implemented a rotation and cropping model as a part of our pre processing step to identify the edges of a document and then correctly orient the document. This uses a model similar to an object detection model with the objective function modified to identify 4 corners as opposed to 2 points that is standard in an object detection problem. This solves for both rotation and cropping

2. Blur and Poor document quality

We have a quality model placed as a part of our preprocessing pipeline that only accepts documents above a certain quality threshold. This is a binary classifier which is a simple image classification model trained on a number of documents that are both good and bad quality. In the document capture pipeline documents can be rejected early if they don't meet the quality standards required and can be sent for recapture or manual processing.

2. Data drift

Data drift is a problem when the model is exposed only to data from a single vendor or single region. If the model has been historically trained on a variety of different vendors, geographies industries etc the possibility of data drift greatly reduces since it has already been exposed to these variances.

Start Digitising POs and Invoices Now with Nanonets - 1 Click PO Digitisation and streamline sales order automation:

Further Reading

Update 1:
Added further reading material around extracting information from Purchase Orders and using OCR for Automation

Update 2: We have significantly improved the accuracies of our models to extract data from POs.

Setup a Demo

Setup a demo to learn about how Nanonets can help you solve this problem