Want to parse documents and extract information/data? Check out Nanonets™ IDP to automate parsing of information from any document type and export them in any format or integrate with external tools!
Introduction
Do you want to learn one of the secrets of building a successful business? It doesn’t require a huge investment or a lot of work. In fact, it's so simple that it's often overlooked. The secret is automation. Read on to learn how your company can use document parsing to automate its business workflows.
- Introduction
- What is Document Parsing?
- Why do You Need Document Parsing?
- Some Case Studies
- How does Document Parsing Work?
- Using Programming Languages for Document Parsing
- Workflow Automation Using Document Parsing
- Commonly Faced Problems
- Online Document Parsers
- Document Parser Integrations
- Why Nanonets™?
- Conclusion
What is Document Parsing?
Document parsing is the process of examining the data in a document and extracting useful information from it. For example, data from PDFs, CSV files and Word documents can be extracted using a document parser and stored as a JSON file. The extracted data can then be used for activities such as data analytics or digitizing your company’s records.
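To make the idea concrete, here is a minimal sketch of the simplest case mentioned above: reading CSV data and storing it as JSON. The column names and sample rows are made up for illustration, and only the Python standard library is used.

```python
import csv
import io
import json

# Hypothetical CSV export of two invoices; the column names are illustrative.
CSV_TEXT = """invoice_id,supplier,total
INV-001,Acme Corp,1200.50
INV-002,Globex,310.00
"""

def csv_to_json(csv_text: str) -> str:
    """Parse CSV text and return the rows as a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        # CSV values are always strings, so convert the numeric column explicitly.
        row["total"] = float(row["total"])
    return json.dumps(rows, indent=2)

if __name__ == "__main__":
    print(csv_to_json(CSV_TEXT))
```

PDFs and scanned images need an extraction step before they reach this point, but the goal is the same: structured records that other tools can consume.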
Benefits of Document Parsing
1. Elimination of Manual Data Entry
Many companies face the same problem: their efficiency is severely limited by manual processes such as data entry. A good document parsing solution can automate this process, reducing dependence on manual input for tedious tasks and freeing employees' time for more creative work, which increases the company’s throughput.
2. Digitization of Data
If your company has a lot of data stored as paper copies, document parsing can help digitize it. Paperwork not only takes up a large amount of space, it also makes searching for information a nightmare. With an end-to-end document parsing pipeline, you could simply scan all of your paper copies and the data would automatically be stored on your company's central server.
3. Improves Reliability
An automated document parsing solution removes manual steps from the process, and with them many data entry errors, which makes it more reliable. You need look no further than your Accounts Payable section to see this at work. An automated data extraction solution would make your invoice processing faster and more efficient, leading to happy suppliers and customers!
How does it Work?
Let’s take a look at a general pipeline that can be used for parsing data from any document.
Let’s briefly look at each step of the process:
1. Data Extraction using Optical Character Recognition
Data inside a PDF or a Word document is often no more accessible than data written on a piece of paper. You would have to go through the document manually and re-enter the relevant information in an Excel sheet. This might work for a couple of documents, but the approach simply does not scale.
The solution to this rather difficult problem is to use Optical Character Recognition (OCR).
OCR is the process of converting text within scanned documents into a machine-readable format. Modern OCR tools are fairly advanced and use steps such as document preprocessing and feature extraction, followed by character/word/document classification and postprocessing. Nanonets™ does a deep dive on performing OCR using Tesseract in this blog.
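The postprocessing step is the easiest one to illustrate without an OCR engine at hand. The sketch below cleans up text that has already been recognized. The table of lookalike characters is an assumption made for this example and is not taken from Tesseract or any other tool.

```python
# Illustrative table of letters that OCR output may contain in place of digits.
# The mapping is an assumption for this sketch, not taken from any specific tool.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

def clean_numeric_field(raw: str) -> str:
    """Normalize an OCR'd value that is expected to contain only a number."""
    # Drop surrounding whitespace and any spaces inside the number.
    compact = raw.strip().replace(" ", "")
    # Replace letter lookalikes with the digits they most likely represent.
    return compact.translate(DIGIT_LOOKALIKES)

def collapse_whitespace(raw: str) -> str:
    """Collapse the runs of spaces and line breaks that OCR output often contains."""
    return " ".join(raw.split())

if __name__ == "__main__":
    print(clean_numeric_field(" 1O5.5O "))
    print(collapse_whitespace("Total   due:\n 25.50"))
```

A correction like this is only safe on fields that are known to be numeric. Applied to a name or an address, it would corrupt perfectly good text.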
2. Data Parsing
Data parsing involves examining the raw data and extracting the relevant information from the document. It is normally performed using one of two main approaches.
Rule-based Approaches:
This is suitable for structured documents such as loan applications, tax invoices, proforma invoices etc. The user normally defines a template of the document. This template is used as a reference to extract data from the document.
The major disadvantage of rule-based approaches to data parsing is their strict reliance on pre-defined templates. If a document uses a slightly different format from the one defined in the template, rule-based matching will fail.
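A small sketch shows both the appeal and the weakness of templates. Here the "template" is simply one regular expression per field. The field names, labels and sample text are hypothetical and stand in for whatever your OCR step produces.

```python
import re

# Hypothetical template for a structured application form: one regular
# expression per field, each with a single capture group for the value.
TEMPLATE = {
    "first_name": re.compile(r"First Name:\s*([A-Za-z]+)"),
    "last_name": re.compile(r"Last Name:\s*([A-Za-z]+)"),
    "date_of_birth": re.compile(r"Date of Birth:\s*(\d{2}/\d{2}/\d{4})"),
}

def parse_with_template(text: str) -> dict:
    """Return the value of every template field, or None when a rule does not match."""
    result = {}
    for field, pattern in TEMPLATE.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result

# Text as it might come out of an OCR step (made up for this example).
matching_form = "First Name: Jane\nLast Name: Doe\nDate of Birth: 01/02/1990"
# The same data, but with labels that differ slightly from the template.
reworded_form = "Given Name: Jane\nSurname: Doe\nDOB: 01/02/1990"

if __name__ == "__main__":
    print(parse_with_template(matching_form))
    print(parse_with_template(reworded_form))
```

The first form is parsed completely, while the second yields None for every field even though it carries the same information. That is the template problem in miniature.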
Model-based or Learning-based Approaches:
Model-based approaches are generally used to extract data from unstructured documents. They rely heavily on Machine Learning (ML) and Natural Language Processing (NLP).
The models are usually trained on a diverse set of unstructured documents. This improves their ability to recognize important fields and extract data from them.
In practice, a combination of rule-based and model-based approaches is used to perform data parsing.
Well that’s enough explanation, let's get coding!
Using Programming Languages for Document Parsing
This section illustrates how programming languages such as Python and JavaScript can be used to parse different types of documents (PDFs, XML files etc.).
Parsing PDFs Using Python
Let’s take a look at a simple rule based parser. Assume that we are parsing the structured document shown below.
A simple pipeline that you could follow is: Scan the document, extract data using an open source OCR software (like Tesseract) and parse the data using regular expressions in Python.
If you're looking to extract data from a scanned document using Tesseract, you can refer to the OCR with Tesseract blog by Nanonets.
Once the data has been extracted, we can perform additional checks using regular expressions to ensure data integrity. The following code snippet shows a simple regular expression that could be used to parse the First Name field in the application form.
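import re

# Accept names made up of letters only.
p = re.compile(r'[A-Za-z]+')
name = "Varghese"
# fullmatch checks the whole string; match would also accept "Varghese123".
match_result = p.fullmatch(name)
print(match_result)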
Parsing XML Files Using JavaScript
While parsing XML files using JavaScript, we access XML elements through the XML Document Object Model (DOM). The DOM provides a standard way of accessing data within XML documents.
Assume that we have an XML file that contains the information present in the following receipt.
The code below illustrates one possible method of parsing the XML file.
<html>
<body>
<p id="item_name"></p>
<p id="item_amount"></p>
<script>
var text, parser, xmlDoc;
text = "<storename>" +
"<item>" +
"<name>T-Shirt</name>" +
"<qty>1</qty>" +
"<amount>25.50</amount>" +
"</item>" +
"<item>" +
"<name>Watches</name>" +
"<qty>1</qty>" +
"<amount>299</amount>" +
"</item>" +
"</storename>";
parser = new DOMParser();
xmlDoc = parser.parseFromString(text,"text/xml");
document.getElementById("item_name").innerHTML = xmlDoc.getElementsByTagName("name")[0].childNodes[0].nodeValue;
document.getElementById("item_amount").innerHTML = xmlDoc.getElementsByTagName("amount")[0].childNodes[0].nodeValue;
</script>
</body>
</html>
The following link contains a few examples of parsing data using the DOM object.
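For comparison, here is a hedged Python sketch that parses the same receipt with the standard library's xml.etree.ElementTree module. The function name and the choice to convert qty and amount to numbers are my own additions for illustration.

```python
import xml.etree.ElementTree as ET

# The same receipt that the JavaScript example above builds as a string.
RECEIPT_XML = """<storename>
  <item><name>T-Shirt</name><qty>1</qty><amount>25.50</amount></item>
  <item><name>Watches</name><qty>1</qty><amount>299</amount></item>
</storename>"""

def parse_receipt(xml_text: str) -> list:
    """Return one dict per <item>, with qty and amount converted to numbers."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall("item"):
        items.append({
            "name": item.findtext("name"),
            "qty": int(item.findtext("qty")),
            "amount": float(item.findtext("amount")),
        })
    return items

items = parse_receipt(RECEIPT_XML)
# Sum the line totals to get the amount payable for the whole receipt.
total = sum(entry["qty"] * entry["amount"] for entry in items)

if __name__ == "__main__":
    print(items)
    print(total)
```

Unlike the JavaScript snippet, which reads only the first item, this version walks over every item in the receipt, which is usually what you want before exporting the data.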
Workflow Automation Using Document Parsing
Let’s take the example of invoice processing in your company. Your Accounts Payable (AP) section usually receives the PDF of an invoice from a supplier. An employee in the AP team is responsible for manually going through the PDF, extracting important details such as the total amount to be paid and the due date, and entering them into a spreadsheet. This spreadsheet is forwarded to the Finance section for approval. Once the payment is completed, the company’s ledger is updated.
The above pipeline is highly inefficient and can be largely automated using document parsing. Let’s take a look at the steps that can make invoice payment as easy as pie.
1. Data Capture and Entry
This is the most important step in the entire pipeline. Data from the invoice is automatically extracted by the document parsing software. A robust document parser (or email parser) should be able to handle different document types such as PDFs, Word documents and scanned images.
The software should also take into account the various synonyms for a particular field. For example, Total, Amount due and Aggregate could all refer to the same field, i.e. the sum to be paid to the supplier.
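One simple way to handle synonyms is a lookup table that maps every known label to a canonical field name. The sketch below assumes a hand-written table. The canonical names and the extra variants are illustrative, and a commercial parser would more likely learn such mappings than hard-code them.

```python
from typing import Optional

# Illustrative synonym table; the canonical names and variants are assumptions.
FIELD_SYNONYMS = {
    "total_due": {"total", "amount due", "aggregate"},
    "due_date": {"due date", "payment due", "pay by"},
}

def canonical_field(label: str) -> Optional[str]:
    """Map a label found on a document to a canonical field name, if known."""
    # Lowercase, drop colons and squeeze whitespace so "Amount Due:" matches.
    normalized = " ".join(label.lower().replace(":", " ").split())
    for canonical, variants in FIELD_SYNONYMS.items():
        if normalized in variants:
            return canonical
    return None

if __name__ == "__main__":
    for label in ("Total", "Amount Due:", "Aggregate", "Subtotal"):
        print(label, "->", canonical_field(label))
```

Labels that are missing from the table come back as None, which is a natural point at which to ask the user for help.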
If the software has some trouble recognizing a particular field, it normally asks for assistance from the user. For example, if the parser has trouble recognizing the Amount due field, it asks the user to manually select the text corresponding to that particular field. What's interesting is that, since most document parsers use machine learning under the hood, they learn to identify similar fields in other documents.
2. Matching Invoices to Purchase Orders
A three-way match is automatically performed between the invoice, the purchase order, and the receiving report. This step is used to reduce errors such as data duplication and helps in fraud prevention. The accuracy of three-way matching links back to the importance of accurate data extraction by the document parsing software. Find out more about how Nanonets can help with your two-way or three-way matching requirements.
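As a rough sketch of what a three-way match checks, the function below compares a single line item across the three documents. The record layout (item, qty, unit_price), the price tolerance and the sample values are all assumptions. Real systems match many lines per document and also look for duplicate invoices, which this example leaves out.

```python
def three_way_match(po: dict, receipt: dict, invoice: dict, tolerance: float = 0.01) -> list:
    """Compare one line item across the three documents.

    Returns a list of discrepancies; an empty list means the documents agree.
    """
    problems = []
    if not (po["item"] == receipt["item"] == invoice["item"]):
        problems.append("item mismatch")
    if invoice["qty"] > receipt["qty"]:
        problems.append("invoiced quantity exceeds received quantity")
    if invoice["qty"] > po["qty"]:
        problems.append("invoiced quantity exceeds ordered quantity")
    if abs(invoice["unit_price"] - po["unit_price"]) > tolerance:
        problems.append("invoice price differs from purchase order price")
    return problems

# Made-up records, as they might look after the data capture step.
purchase_order = {"item": "T-Shirt", "qty": 10, "unit_price": 25.50}
receiving_report = {"item": "T-Shirt", "qty": 10}
good_invoice = {"item": "T-Shirt", "qty": 10, "unit_price": 25.50}
bad_invoice = {"item": "T-Shirt", "qty": 12, "unit_price": 27.00}

if __name__ == "__main__":
    print(three_way_match(purchase_order, receiving_report, good_invoice))
    print(three_way_match(purchase_order, receiving_report, bad_invoice))
```

Every comparison depends on the extracted values being right, which is why a misread quantity or price during data capture shows up here as a false mismatch.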
3. Notifying Managers and the Finance Section
Notifications can automatically be sent to the Finance section and managers who have to approve the payment. Deadlines and reminders can be added to the notifications to ensure timely response.
4. Updating the Company Ledger
After the invoice has been paid, the company ledger can be automatically updated with the details of the payment.
Commonly Faced Problems
1. Inability to Parse Data Correctly
Parsing data from documents involves solving problems related to both computer vision and natural language processing. Data could be presented in a variety of tabular formats which might be mutually inconsistent. Even after leveraging the power of machine learning, most document parsers are bound to run into difficulties from time to time.
2. Debugging
This is a problem that is inherent to almost all AI-based applications. While building large networks seems to solve a variety of problems, only a handful of people understand what goes on under the hood. Your document parser could be spitting out a whole lot of mumbo jumbo, and it is possible that no one has a solution to the problem.
3. Handling Multiple Languages
Many document parsers don’t support multiple languages. This might be because of the unavailability of good quality training data. However, supporting a variety of native languages is a necessity. For example, a company in India is highly likely to receive invoices in more than one language.
Online Document Parsers
After all that explanation, developing a document parser from scratch seems like a tough job. The good news is that there are several tools available online that can be used off the shelf. Here are a few of the popular tools that your company should consider for workflow automation.
1. Amazon Textract
- Uses AI to extract data from documents. It doesn’t require any configuration or custom code to be written by the client.
- Provides Amazon Virtual Private Cloud (VPC) endpoints that enable customers to encrypt their data.
- It is integrated with Amazon Augmented AI. This allows for a human in the loop approach in case of sensitive workflows that require a high accuracy.
- Their website features comprehensive documentation and tutorials regarding their product.
2. Google Cloud Vision
- Data extraction based on state-of-the-art OCR and Natural Language Processing (NLP).
- Follows a Human-in-the-Loop approach. This ensures that a higher document processing accuracy can be achieved by using feedback from a user.
- The extracted data can be validated by making use of Google’s knowledge graph.
3. Nanonets™
- Data extraction based on cutting edge OCR using AI and ML algorithms.
- The models can easily be trained with custom data. This ensures easy customization to your specific use case.
- Their model can handle different font sizes, image noise, blurred images etc.
- A single model can be used to extract data from documents written in multiple languages.
Document Parser Integrations
When you are on the prowl for a document parser, the following integrations would prove to be extremely useful in improving your workflow.
1. Application Programming Interfaces (APIs)
A good document parser should provide easy-to-use APIs that are compatible with multiple programming languages. Basic APIs to import documents into the software and to obtain the parsed output would ensure easy integration with your company’s existing ecosystem.
2. Cloud Storage
It is highly likely that your company uses one of the popular cloud storage solutions such as Google Drive, OneDrive etc. The software should be capable of directly reading and uploading data to the cloud.
3. Webhooks
Webhooks enable you to send data to a pre-specified URL. Ideally, the document parser should trigger the webhook automatically each time a new document is parsed.
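To show what a webhook call amounts to, here is a minimal sketch of the sending side using only Python's standard library. The URL, the payload fields and the function names are placeholders invented for this example. An actual document parser will define its own payload format.

```python
import json
import urllib.request

def build_webhook_request(url: str, parsed_document: dict) -> urllib.request.Request:
    """Build a POST request carrying the parsed document as a JSON body."""
    body = json.dumps(parsed_document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_webhook(url: str, parsed_document: dict, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code of the receiver."""
    request = build_webhook_request(url, parsed_document)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

# Placeholder URL and payload, made up for this example. Nothing is sent here;
# call send_webhook() with a real endpoint to deliver the data.
request = build_webhook_request(
    "https://example.com/hooks/parsed-document",
    {"invoice_id": "INV-001", "total_due": 1200.50},
)
```

The receiving side is simply an HTTP endpoint that accepts this POST request and passes the JSON on to the rest of your workflow.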
4. Accounting Integration
Chances are your company will end up using the document parser to perform some form of invoice automation. It is greatly advantageous if the parser integrates easily with accounting software such as SAP or QuickBooks.
Why Nanonets?
Here are a few reasons why you should consider using Nanonets™ over the other document parsing tools in the market.
- High Accuracy: Provides high data extraction accuracy of 95%+. The model also employs state of the art AI that improves with every document it extracts.
- Seamless Integrations: Nanonets’ document extraction software directly integrates with a wide range of tools such as CMS and Zapier. Your company can treat the Nanonets™ document extractor as a plug and play module that leaves the rest of your pipeline undisturbed.
- Competitive Pricing: Nanonets™ is reasonably priced and offers greater value for money when compared to other solutions in the market. You can head to their webpage (https://nanonets.com/) and take a look at the pricing (there’s no need to “request for a demo”).
If you still have some reservations about using Nanonets, just take a look at their customer base. Some of the companies that use Nanonets to automate their workflow are:
- Deloitte
- Sherwin Williams
- Afni
- Procter & Gamble and many more
Here is a customer review by WeWork Labs: “My overall experience with Nanonets™ has been delightful to say the least. The ease of implementation, administration, and use makes our jobs easier when it comes to digitizing large volumes of agreements, invoices, and other partnership related documents.”
Automate manual processes using Nanonets AI-based OCR software. Capture data from documents instantly. Reduce turnaround times and eliminate manual effort.
Conclusion
In this blog post we took a look at the following:
- What document parsing is and why your company requires it.
- How document parsing works.
- How document extraction can be performed with popular programming languages like Python and JavaScript.
- Commercial software for document extraction and why Nanonets™ is your best bet.
Let’s conclude with this quote by Federico Garcia Lorca: “Besides black art, there is only automation and mechanization”. Since your company doesn’t focus on black magic, your best bet is to automate your processes and Nanonets™ can help you achieve this.