What is document classification?

In our hunter-gatherer days, we had to classify objects and beings as food, foe, or friend in order to survive. Today our need for classification is less about survival and more about clarity. In this era of information overload, document classification is of considerable importance for the efficient management and use of information and knowledge.

In this article, we will look at the types of document classification and how ML techniques are increasingly being used for this purpose. We also provide a few examples to show the relevance of document classification in today’s data-intensive life.

What is document classification?

Document classification is the slotting of documents and their elements into various types (or classes) depending on their content, context, and intent. The process involves analyzing the textual and visual entities of documents and categorizing them into pre-defined types or classes. This enables easy organization, retrieval and management of data.

Document classification is usually of two types: visual classification and text classification. We shall see them in more detail in the following section.

Types of document classification

The most basic type of classification is based on what is being classified - the visual image or the text itself. Let us see what each of those entails.

Visual Classification

The assignment of labels or category names to visual (non-text) content is image classification. It is a fundamental computer-vision task, wherein an input image is identified and classified. For example, an image classification algorithm meant for a construction site could identify pieces of equipment and categorize them as excavators, forklifts, etc. Traditional approaches to document image classification relied on handcrafted features, image segmentation, and classical machine learning algorithms like SVM and k-NN.

Visual classification entails capturing information about the texture, color, and shape of objects. Image segmentation isolates key areas for analysis. In recent years, Computer Vision and Deep Learning methods such as convolutional neural networks (CNNs) have been used extensively in document image classification. A digital image is typically composed of hundreds of thousands of tiny pixels. Image classification analyses a given image at the pixel level by treating it as an array of matrices, for example one matrix of intensity values per color channel. Based on what it learned from pixel-level analysis during training, the computer-vision model assigns a label or tag to the entire image.

Deep Learning methods like CNNs are designed to process structured grid data and can learn hierarchical representations, which makes them adept at capturing intricate features within images. Through complex non-linear learning, these models can capture local patterns, discern spatial relationships, and consolidate information for a complete understanding of the image. They are being increasingly used in biomedical diagnostic imaging, facial recognition, surveillance cameras and environmental monitoring.
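To make the idea of "an image as a matrix" and "capturing local patterns" concrete, here is a minimal sketch in plain Python. The 5x5 image and the hand-written 3x3 filter are made-up illustrations; in a real CNN the filter weights are learned during training rather than written by hand, and many filters are stacked in layers.

```python
# A tiny 5x5 grayscale "image": 0 = dark, 9 = bright.
# The two left columns are dark and the rest is bright, so there is one vertical edge.
image = [
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
]

# Illustrative 3x3 filter that responds to dark-to-bright transitions (left to right).
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def convolve2d(img, k):
    """Slide the filter over the image (no padding, stride 1) and sum the products.

    As in most CNN libraries, the filter is not flipped, so strictly speaking
    this is a cross-correlation.
    """
    kh, kw = len(k), len(k[0])
    out = []
    for r in range(len(img) - kh + 1):
        row = []
        for c in range(len(img[0]) - kw + 1):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += img[r + i][c + j] * k[i][j]
            row.append(total)
        out.append(row)
    return out

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
# Each row prints [27, 27, 0]: a strong response where the window straddles
# the edge, and zero where the window covers a uniformly bright area.
```

The output is a smaller matrix (a "feature map") that is large only where the local pattern occurs, which is the building block that deeper layers combine into higher-level features.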

Text Classification

As the name suggests, text classification deals only with textual entities in a document. The text may be a word, sentence, paragraph, or even the entire content of a document. Some common methods used for text classification are rule-based OCR, Machine Learning approaches that use labelled training datasets, and unsupervised learning using NLP.

Rule-based OCR:

Optical Character Recognition in its most basic form is a combination of hardware and software that converts physical, printed documents into machine-readable and editable text. The hardware includes an optical scanner that converts a physical document into an image and it is associated with software that extracts editable text from the scanned image.

Legacy OCR systems don’t perform contextual classification; they simply extract all text from images indiscriminately. Most modern OCR systems, however, incorporate rule-based classification. The scripts that classify the extracted text run on human-crafted rules. These rules are domain-specific and are programmed into the system by a human. For example, to classify research papers that are in the area of materials science using OCR, the user inputs a set of keywords related to the topic, such as “ceramics”, “composites”, “nanomaterials” and so on. The rule-based OCR engine then scans the documents and scores each research paper by the number of keywords found. These types of OCR are easy to implement and can be used for classifying standard documents such as financial and transactional ones. Simply checking for keywords such as “invoice”, “receipt”, etc., for example, can enable the OCR engine to classify the document automatically.

Rule-based OCR is, however, not very useful when the documents to be classified are non-standard or when too many keywords must be input as rules for checking. For example, rule-based OCR would not perform very well in the classification of emails as spam because “spam” can encompass a range of sentiments and content that have no underlying commonality other than being annoying.
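The keyword-scoring step can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the text has already been extracted by OCR; the rule set `RULES` and the function name `classify_by_keywords` are hypothetical and not taken from any particular OCR product.

```python
import re

# Hypothetical, hand-written rule set: each class is defined by a few keywords.
RULES = {
    "invoice": {"invoice", "amount due", "bill to"},
    "receipt": {"receipt", "paid", "change"},
    "materials science": {"ceramics", "composites", "nanomaterials"},
}

def classify_by_keywords(text, rules=RULES):
    """Score each class by the number of keyword occurrences in the text.

    Returns (best_label, scores); best_label is None when nothing matches.
    """
    lowered = text.lower()
    scores = {}
    for label, keywords in rules.items():
        # Count whole-word (or whole-phrase) occurrences of every keyword.
        scores[label] = sum(
            len(re.findall(r"\b" + re.escape(kw) + r"\b", lowered))
            for kw in keywords
        )
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores

ocr_text = "INVOICE #123\nBill to: Acme Ltd\nAmount due: 40.00"
label, scores = classify_by_keywords(ocr_text)
print(label)  # invoice
```

Note that every new document type means another hand-written entry in the rule set, and a document that uses none of the expected words is left unclassified.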

ML-based classification

Advanced document classification tools use ML techniques for contextual classification of the text. The most common ML technique is one that uses a training dataset. The training dataset is the largest subset of the sample to be classified and is introduced into the system so that the ML model can learn. The training dataset typically includes data and their labels, which are usually annotated by humans. After cleaning and normalisation of this data, the machine learning algorithm is trained to identify the features and associate them with the labels. Once trained, the model’s performance is tested using a testing dataset, which is a smaller subset of the document database. After necessary adjustments and corrections are made, the algorithm is used to classify documents.

SVM, Decision Trees and Neural Network models like CNNs fall under this category. The model’s performance is periodically checked using a validation dataset (which is different from the training dataset). Although supervised classification is time-consuming because the training data must be labelled, its performance generally improves as more labelled examples become available.
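The train-then-test workflow described above can be sketched as follows. Please note the assumptions: to keep the example short and dependency-free, it uses a small Naive Bayes classifier, which is not one of the models named in this article, and the labelled documents are made up. A production system would more likely use a library implementation of an SVM, a decision tree, or a neural network.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(labelled_docs):
    """Learn word counts per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in labelled_docs:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up, human-labelled training set (the larger subset).
train = [
    ("invoice number total amount due", "invoice"),
    ("please pay the invoice amount by friday", "invoice"),
    ("curriculum vitae work experience education", "resume"),
    ("skills education and work history", "resume"),
]
# Held-out testing set (the smaller subset), never shown during training.
test = [
    ("amount due on this invoice", "invoice"),
    ("education and skills summary", "resume"),
]

model = train_naive_bayes(train)
predictions = [predict(model, text) for text, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

The important point is the separation of roles: the model only ever learns from `train`, and its accuracy is measured on documents it has not seen.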

Unsupervised Learning using NLP

In this approach, there is no training dataset, and there are no labelled data. The algorithm compares documents and picks out the similarities and differences for classification. NLP uses several techniques from linguistics, statistics, and computer science to understand the context of the text. NLP-based document classifiers can not only define patterns in texts but also ‘understand’ the meaning of words, and use these for classification.

The unsupervised NLP process begins by transforming text data into word embeddings or TF-IDF vectors that represent the content of each document numerically. Clustering algorithms like K-means or hierarchical clustering then use these vectors to group similar documents. The resulting clusters reveal underlying patterns or topics within the text, allowing for the automatic organization of documents based on their content.
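The two steps, vectorizing and then clustering, can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the four documents are made up, the TF-IDF weighting is the basic textbook form, and the K-means initialization (start from the first document, then pick the document farthest from the centroids chosen so far) is a choice made here to keep the result deterministic. Libraries typically use randomized initialization instead.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each document into a TF-IDF vector over the shared vocabulary."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            [(tf[w] / len(toks)) * math.log(n / df[w]) for w in vocab]
        )
    return vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Basic K-means; returns one cluster index per input vector."""
    # Deterministic farthest-first initialization (an assumption of this sketch).
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda i: sq_dist(v, centroids[i]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

docs = [
    "invoice payment amount due",
    "payment invoice total amount",
    "football match goal score",
    "goal scored in football match",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)  # the two finance documents share one label, the two sports documents the other
```

No labels were supplied at any point; the grouping emerges purely from the overlap in vocabulary between documents.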

There is no need to label data in unsupervised classification, and thus it is useful when not much training data is available. It is often used in topic classification where there is a need to identify themes within a large collection.

Where is document classification used?

With many operations now shifting to the digital realm, document classification is ubiquitous.

Perhaps the most common place we encounter document classification, even without realising it, is customer support. Not too long ago, customer service operations for many companies were outsourced to countries with relatively cheaper operational overheads. Today, we are increasingly finding the first line of online customer service to be automated. NLP is used to automatically pick out words and phrases from customer queries and interactions and categorize them so that appropriate responses can be provided. This helps in the fast identification of the issue or topic being discussed, which enhances customer experience and overall satisfaction.

Automatic document categorization can help derive insights from any kind of written customer interaction including reviews, feedback and social media posts about products and trends. This can help organizations understand the reception of their product among customers and identify trends to cater to.

Document classification is also used extensively in topical classification, e.g., in news aggregator sites, research journal sites and any such repository containing a variety of documents and information. Search engines and digital cataloguing are other examples of topic categorization. The words and phrases input by the user are matched with categories and metadata and the appropriate output is generated. Topical categorization is an integral part of information storage, retrieval and knowledge management.

With this being the era of extensive social media communication, it is next to impossible to manually check interactions among social media users across the globe. Content surveillance and moderation are now automated, and highly sophisticated document classification tools are used for the purpose. These tools constantly crawl interactive platforms and classify words or phrases contextually to flag inappropriate content.

The most rapidly emerging application of document classification is in the accounting sector. The accounting department of businesses deals with a range of finance-related documents such as bank statements, accounting ledgers, invoices, bills, receipts, purchase orders, payment records and so on. Automated document classification tools can help not only sort these documents and slot them into types but also extract relevant data from them, cross-match data across different documents and manipulate and use data for deriving insights and reports.

Much like Accounting operations, Human Resources deals with a plethora of documents, from resumes and CVs to payrolls and payslips. As a company grows, it is virtually impossible to classify these documents physically in various files and folders, no matter how many Miss Lemons (of the Agatha Christie Poirot series, who dreamed of the “perfect filing system beside which all other filing systems will sink under oblivion”) work in HR. Document classification tools have become an indispensable part of the HR department.

Conclusion

Document classification enhances data management, information retrieval and insight access, in addition to affording time and cost savings to organizations. There are various types and degrees of document classification possible, and the choice of tool depends upon the application’s needs. Whether the document classification is unsupervised or supervised depends upon the type of documents to be categorized and the quantum of data available for categorization. Often a combination of approaches is used. For example, in healthcare, a rule-based classification could categorize documents into diagnosis or treatment and a subsequent ML-based classification can further categorize them into blood tests, sonograms, etc. Such combinations are particularly useful for categorizing complex data sets.

To conclude, document classification is just as important in today’s data-intensive world as the mental classification of objects was to our cave-dwelling forefathers. It must, however, not be forgotten that document classification, no matter how efficient the tool, is only as accurate as the original document it works on.
