Python Programming

How to extract data from a PDF file?

Data is sometimes stored in PDF files, so the text must first be extracted from the PDF before it can be used for further analysis.

PyPDF2 is the required library for this recipe. Installing it on your computer is simple: use pip.

pip install PyPDF2
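
PyPDF2 renamed several classes and methods across releases, so it can help to confirm which version pip actually installed. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name installed_version is hypothetical and chosen for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        # The package is not installed in the current environment.
        return None


print("PyPDF2 version:", installed_version("PyPDF2"))
```

If this prints None, the install step did not target the interpreter that is running the script.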

PDF Data Collection for Analysis

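# PdfReader is the reader class in PyPDF2 2.x and later.
from PyPDF2 import PdfReader

# Open the PDF in binary mode; the with block closes the file automatically.
with open("test.pdf", "rb") as pdf:
    # Create a PDF reader object.
    pdf_reader = PdfReader(pdf)

    # Check the total number of pages in the PDF file.
    num_pages = len(pdf_reader.pages)
    print("Total number of Pages:", num_pages)

    # Page indices are zero-based, so valid values run from 0 to num_pages - 1.
    page_index = 200
    if page_index < num_pages:
        # Create a page object and extract the text from that page.
        page = pdf_reader.pages[page_index]
        print(page.extract_text())
    else:
        print("Page index", page_index, "is out of range.")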
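
For analysis it is often more convenient to collect the text of every page rather than a single one. The sketch below shows one way to do that, assuming PyPDF2 2.x or later, where the reader class is PdfReader and each page exposes extract_text(). The function names join_page_text and read_pdf_text are hypothetical helpers, not part of PyPDF2.

```python
def join_page_text(pages, separator="\n"):
    """Concatenate the text of every page, skipping pages with no text."""
    chunks = []
    for page in pages:
        # Extraction can yield nothing, e.g. for scanned (image-only) pages.
        text = page.extract_text() or ""
        if text.strip():
            chunks.append(text.strip())
    return separator.join(chunks)


def read_pdf_text(path):
    """Open the PDF at `path` and return the text of all its pages."""
    # Imported here so join_page_text stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf:
        reader = PdfReader(pdf)
        # Pages are read lazily, so extract the text while the file is open.
        return join_page_text(reader.pages)
```

Calling read_pdf_text("test.pdf") would then return a single string that can be passed on to further analysis. Keeping the joining logic separate from the file handling makes it easy to check with simple stand-in page objects.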