29.06.2018

Searching for text in PDF files with pypdf2

Portable Document Format (PDF) is wonderful as long as you only have to read it, not work with it. The format was not really meant to be modified, which is why editing PDFs is normally hard. It is a de facto worldwide standard, so you will most likely come across it when coding. Read along to see how to tackle the PDF format and how to search for information contained in PDF files.

The code below is taken from Al Sweigart's book Automate the Boring Stuff with Python (no affiliation; it is a great book that you can read for free, and I also have the hard copy at home). I have added some error handling to his code: strict=False to get past the PdfReadError, and UTF-8 encoding to get past the UnicodeEncodeError.

In an earlier post, we covered how to search for files on your hard drive. We are now going to search inside PDF files instead. For this we need the pypdf2 package, which you can install from your command line: py -m pip install pypdf2

I used the PDF document SHIP-ICE INTERACTION IN A CHANNEL, found on trafi.fi, as an example. According to my PDF reader, the word "ship" appears 83 times. Let's see if we can arrive at the same number with pypdf2. The code works as follows: first, we open the PDF file and read it with the PdfFileReader class.

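We loop through the pages and get each page with the getPage method. For each page we extract the text with extractText, lowercase it, split it into words, and count every word that contains the search term. The count for the word "ship" is 82, so we do not find all of the occurrences. The word "ice" should appear 158 times, but pypdf2 only finds "ice" 153 times. Small differences like this are to be expected, since the document may contain tables and similar layouts that pypdf2 does not extract cleanly.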
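
To see how this counting rule behaves, here is a minimal sketch on a plain string rather than a PDF. The helper names count_matching_words and count_occurrences are made up for illustration, and the sample sentence is invented, not taken from the report. It only shows that counting words that contain the search term can come out lower than counting every single occurrence, which may be one more reason a PDF reader reports a slightly different number.

```python
def count_matching_words(text, search_word):
    """Count whitespace-separated words that contain search_word (the rule used in this post)."""
    return sum(1 for word in text.lower().split() if search_word in word)


def count_occurrences(text, search_word):
    """Count every non-overlapping occurrence of search_word, even several inside one word."""
    return text.lower().count(search_word)


# Invented sample text, not taken from the report
sample = "Ship-to-ship contact: the ship and the ships met in the channel."

print(count_matching_words(sample, "ship"))  # 3 ("ship-to-ship" is counted once)
print(count_occurrences(sample, "ship"))     # 4
```

Whether your PDF reader counts per word or per occurrence is something you would have to check for the reader you use.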

We can conclude that the search still works well enough. Searching through a couple of hundred PDFs would yield good enough results if you are looking for something specific. Do you have a better way to search? Please let me know in the comments. Happy coding!

The Code:

import PyPDF2

# Note: PdfFileReader, numPages, getPage and extractText are the PyPDF2 1.x/2.x names.
# PyPDF2 3.0 replaced them with PdfReader, len(reader.pages), reader.pages[i]
# and extract_text().

search_word = "ship"
search_word_count = 0

# Open the file in binary mode; the with block closes it when we are done
with open('22897-WNRB_research_report_93.pdf', 'rb') as pdfFileObj:
    # If you get the PdfReadError: Multiple definitions in dictionary at byte, add strict=False
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)

    for pageNum in range(pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        # If you get the UnicodeEncodeError: 'charmap' codec can't encode characters,
        # add .encode("utf-8") to your text
        text = pageObj.extractText().encode('utf-8')
        search_text = text.lower().split()
        for word in search_text:
            # The words are bytes after encoding, so decode them before comparing
            if search_word in word.decode("utf-8"):
                search_word_count += 1

print("The word {} was found {} times".format(search_word, search_word_count))
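
If you want to run the same search over a couple of hundred PDFs, the loop above can be wrapped in a function and pointed at a folder. The following is only a sketch under a few assumptions: the names extract_text_pypdf2 and count_word_in_folder and the folder name "reports" are hypothetical, the PyPDF2 calls are the same 1.x/2.x ones used above, and any file that fails to open is simply skipped.

```python
from pathlib import Path


def extract_text_pypdf2(pdf_path):
    """Return the text of all pages, using the same PyPDF2 1.x/2.x calls as the code above."""
    import PyPDF2  # imported here so the rest of the sketch can be tried without the package

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file, strict=False)
        return " ".join(reader.getPage(n).extractText() for n in range(reader.numPages))


def count_word_in_folder(folder, search_word, extract_text=extract_text_pypdf2):
    """Map each PDF path under folder to the number of words containing search_word."""
    counts = {}
    for pdf_path in sorted(Path(folder).rglob("*.pdf")):
        try:
            text = extract_text(pdf_path)
        except Exception as error:
            # Skip unreadable files rather than stopping the whole search
            print("Skipping {}: {}".format(pdf_path, error))
            continue
        counts[str(pdf_path)] = sum(
            1 for word in text.lower().split() if search_word in word
        )
    return counts


# Example use ("reports" is a placeholder folder name):
# for path, count in count_word_in_folder("reports", "ship").items():
#     print("{}: {}".format(path, count))
```

Passing the text extraction in as a parameter is a design choice for this sketch: it lets you swap in another PDF library, or a stub for testing, without touching the counting logic.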
