Portable Document Format (PDF) files are commonly used for sharing documents electronically, by individuals and businesses alike. Often we need to extract information from PDF files for further processing. However, extracting text from a PDF file can be challenging, especially if the document contains complex formatting and layout. Fortunately, there are several ways to do this.
Here, we cover the most commonly used methods for extracting text from PDFs with Python. Python has several libraries, such as PyPDF2 and PyMuPDF, that enable efficient PDF text extraction.
This article explores these popular Python libraries and provides a step-by-step tutorial on extracting text from PDFs.
Python Libraries for PDF Text Extraction
Python has several well-integrated libraries that effectively handle unstructured data sources such as PDF files. Here is a list of a few Python libraries for PDF processing.
- PyPDF2: PyPDF2 can split, merge, crop, and transform the pages of PDF files. It also lets you extract text from PDF files for further processing.
- PyMuPDF: PyMuPDF is a Python wrapper for the MuPDF C library. It allows you to read, write, and manipulate PDF files in Python. You can also access document metadata, extract text and images, and decrypt a PDF document, which makes PyMuPDF a powerful tool for converting complex PDFs to text.
- ReportLab: It is an open-source Python library that can be used to create and manipulate PDF files. It provides a high-level API for creating PDF documents from scratch and supports embedding images and fonts.
- PDFMiner: Known for its advanced capabilities, PDFMiner allows for precise text extraction from PDF files in Python. It excels in extracting structured data, handling complex layouts, and providing robust support for detailed analysis of PDF documents.
Setting up the Development Environment
Before we discuss the steps for text extraction from PDF, it is essential to set up a development environment and install the required Python libraries to extract text.
- Install Python: If you haven't already, you'll need to install Python on your system. You can download the latest version of Python from here.
- Install pip: To make sure pip is installed for your Python interpreter, run
py -m ensurepip --default-pip
If pip is still not available, download get-pip.py here and run the following command to install or upgrade pip.
python get-pip.py
- Install the required libraries: Install the Python libraries you will use to work with PDF files. Here, we will install two commonly used libraries, PyPDF2 and PyMuPDF. To install them, run the following commands.
pip install PyPDF2
pip install PyMuPDF
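To confirm that the installation worked, a quick check like the one below can help. It is a minimal sketch: the helper name missing_modules is our own, and it assumes the libraries are imported as PyPDF2 and fitz, which matches the imports used later in this article.

```python
import importlib.util


def missing_modules(names):
    """Return the module names in `names` that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used in this article: PyPDF2 and fitz (the PyMuPDF module).
missing = missing_modules(["PyPDF2", "fitz"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

If a module is reported as missing, re-run the matching pip command in the same Python environment you use to run your scripts.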
Once you've installed Python and the required libraries, your development environment is set. You can use any text editor or IDE to write Python code, such as Visual Studio Code, PyCharm, or Sublime Text.
Extracting Text from PDF Using PyPDF2
We will use the PyPDF2 Python library to extract text from a PDF file.
Below is the code to extract text from a PDF using the PyPDF2 library.
# importing required modules
from PyPDF2 import PdfReader
# creating a pdf reader object
reader = PdfReader('nanonet.pdf')
# printing number of pages in pdf file
print(len(reader.pages))
# getting a specific page from the pdf file
page = reader.pages[0]
# extracting text from page
text = page.extract_text()
print(text)
Now, let's understand each line of code separately.
1. Import the PyPDF2 Python library
from PyPDF2 import PdfReader
2. Read PDF file
From the PyPDF2 module, we create an object of the PdfReader class. It takes one required positional argument: the path to the PDF file.
reader = PdfReader('nanonet.pdf')
3. Print number of pages in PDF
print(len(reader.pages))
The pages property provides a list of PageObject instances. Here, we use Python's built-in len() function to get the number of pages in the PDF file.
page = reader.pages[0]
We can also get a specific page of the PDF file by its index. List indexing starts from 0 in Python, so this command gives us the file's first page.
4. Extract text from PDF
text = page.extract_text()
print(text)
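The example above reads only the first page. To collect text from every page, the same calls can be wrapped in a loop. The sketch below is one way to do it; extract_pages and extract_pdf are illustrative names, not part of PyPDF2, and the fallback to an empty string is a defensive assumption for pages that yield no text.

```python
def extract_pages(reader):
    """Return a list of (page_number, text) pairs for a PdfReader-like object."""
    results = []
    # Page numbers start at 1 for readability; list indexes start at 0.
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""  # assumed fallback for empty pages
        results.append((number, text))
    return results


def extract_pdf(path):
    """Open the PDF at `path` with PyPDF2 and extract text from all pages."""
    from PyPDF2 import PdfReader  # imported here so extract_pages has no dependency

    return extract_pages(PdfReader(path))


# Example usage, assuming 'nanonet.pdf' exists in the working directory:
# for number, text in extract_pdf('nanonet.pdf'):
#     print(number, text[:80])
```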
Extraction from Complex PDFs with PyMuPDF
PyMuPDF (imported as fitz) provides advanced features for handling complex PDF documents, including those with images and annotations.
Below is the code to extract text from each page of a PDF using PyMuPDF.
# Importing required modules
import fitz
# Opening the PDF file
doc = fitz.open('nanonet.pdf')
# Extracting text from all pages
all_text = []
for page_num in range(len(doc)):
    page = doc[page_num]
    text = page.get_text()
    all_text.append([page_num + 1, text])
Let's understand the code:
1. Import the fitz Python module
import fitz
2. Read PDF file
doc = fitz.open('nanonet.pdf')
3. Extract text from PDF
text = page.get_text()
The call above extracts all the text on a single page, and the loop applies it to every page of the document. With the PyMuPDF library, you can extract text from PDFs efficiently, even with complex layouts.
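When text comes out in an unexpected order, PyMuPDF can also return it as positioned blocks instead of one string. The sketch below assumes the tuple layout returned by page.get_text("blocks"), namely (x0, y0, x1, y1, text, block_no, block_type), and sorts the text blocks top to bottom, then left to right. The helper names are our own, and this simple ordering is only a rough approximation of reading order; multi-column pages would need a column-aware ordering.

```python
def blocks_to_text(blocks):
    """Join text blocks in top-to-bottom, left-to-right order."""
    # block_type 0 marks a text block; image blocks use 1 and are skipped.
    text_blocks = [b for b in blocks if b[6] == 0]
    # Sort by the top edge (y0) first, then by the left edge (x0).
    text_blocks.sort(key=lambda b: (round(b[1]), round(b[0])))
    return "\n".join(b[4].strip() for b in text_blocks if b[4].strip())


def extract_ordered_text(path):
    """Return one ordered text string per page of the PDF at `path`."""
    import fitz  # PyMuPDF, imported here so blocks_to_text has no dependency

    with fitz.open(path) as doc:
        return [blocks_to_text(page.get_text("blocks")) for page in doc]


# Example usage, assuming 'nanonet.pdf' exists in the working directory:
# for page_text in extract_ordered_text('nanonet.pdf'):
#     print(page_text)
```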
Advanced Techniques for Improving Text Extraction Accuracy
While basic libraries like PyPDF2 and PyMuPDF offer straightforward methods for extracting text from PDF files, they can sometimes fall short when dealing with complex documents. PDFs can have intricate layouts, varied fonts, embedded images, and multi-column formats that basic extraction tools might not handle effectively. This can lead to inaccurate text extraction, with issues such as misplaced text, missing characters, or incorrect formatting.
To address these challenges and improve the accuracy of text extraction, advanced techniques and tools become necessary. These methods can help in handling complex layouts, preserving formatting, and extracting structured data more accurately. Here are some advanced techniques to consider:
- Optical Character Recognition (OCR): OCR is a process that converts scanned images into machine-encoded text. OCR can be used to extract text from PDF files that contain images or scanned pages. Several PDF OCR engines are available, including Tesseract, Google Cloud Vision, and Amazon Textract.
- Pre-processing Techniques: Pre-processing techniques involve manipulating the PDF file before the actual extraction process. This includes techniques like de-skewing, de-noising, and thresholding to remove noise, skew and other distortions that may affect the accuracy of the extraction process.
- Layout analysis: It involves identifying and classifying the different elements of a PDF file, such as text blocks, tables, and images. This information can improve text extraction accuracy by identifying the document's structure.
- Machine learning tools: Libraries like DeepText and LayoutParser provide advanced capabilities for this purpose. Several text extraction tools, such as Nanonets, use machine learning techniques to extract text from PDF files accurately.
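As a rough illustration of the OCR technique above, one common pattern is to try regular extraction first and fall back to OCR only for pages that yield little or no text. The sketch below assumes PyMuPDF, Pillow, and pytesseract are installed along with the Tesseract engine; the function names and the 20-character threshold are our own choices, not values from any library.

```python
import io


def needs_ocr(text, min_chars=20):
    """Guess whether a page is scanned: True if extraction gave almost no text."""
    return len(text.strip()) < min_chars


def extract_with_ocr_fallback(path, min_chars=20):
    """Return one string per page, using OCR for pages with little extractable text."""
    # Third-party imports are kept local so needs_ocr() works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    results = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text, min_chars):
                # Render the page to a PNG image and run Tesseract on it.
                pix = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image)
            results.append(text)
    return results
```

Running OCR only where needed keeps extraction fast for digitally created PDFs, since rendering and recognizing a page is much slower than reading its embedded text.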
Summary
Extracting text from PDFs using Python is a vital capability for many data processing and analysis tasks. PyPDF2 covers straightforward, page-by-page extraction, while PyMuPDF is better suited to complex documents. When these libraries fall short, techniques such as OCR, pre-processing, layout analysis, and machine learning tools can improve accuracy.
While Python libraries offer a convenient way to extract text from PDF files, it's worth considering other automated tools for text extraction, such as Nanonets.