My new course, Control PDF with Python and PyPDF2, is out! Check it out on Udemy.

The course also received a 9.7/10 rating on Coursemarks.com!


Do you have a bunch of PDF files that you need to format or merge? Do you need to overlay images on each page? Maybe you just want to show off at the office! Whatever your needs, you can expect a comprehensive guide to the nuts and bolts of automating PDFs. The course dives straight into PyPDF2, so you will be up and running, creating and manipulating your PDFs, in no time.

I have gotten a lot of use out of PyPDF2 when dealing with PDFs. You can split and merge documents, add metadata, fill in form fields, add JavaScript, and much more. I am sure you will benefit from learning PyPDF2 if you have even the slightest interest in making your everyday life easier.

After taking this course you will know how to:


Counting the number of PDF pages

Do you need to count the total number of pages in your pdf files?

Look no further, as PyPDF2 is here to help you!

Here is the documentation: https://pythonhosted.org/PyPDF2/

To count the number of pages in a PDF file, you need only four lines of code.

You can install PyPDF2 from PyPI with pip:

py -m pip install PyPDF2

The code

Ok ok enough installing, what do we need to do to count the pages?

First, we want to import the PdfFileReader class from PyPDF2

from PyPDF2 import PdfFileReader

After that, we need to open our PDF file in binary reading mode.

with open("your_pdf_file.pdf", "rb") as pdf_file:

We then instantiate our PdfFileReader object inside the with block:

    pdf_reader = PdfFileReader(pdf_file)

Finally, we get the number of pages from the numPages property:

    print(f"The total number of pages in the pdf document is {pdf_reader.numPages}")

That's it! We have now counted the number of pages in a PDF file with Python!

The complete code:

#!/usr/bin/env python3

"""
Extracting number of pages in the document

getNumPages()
Calculates the number of pages in this PDF file.

Returns:    number of pages
Return type:    int
Raises PdfReadError:
    if file is encrypted and restrictions prevent this action.
    
numPages
Read-only property that accesses the getNumPages() function.
"""

from PyPDF2 import PdfFileReader

# Load the pdf to the PdfFileReader object with default settings
with open("your_pdf_file.pdf", "rb") as pdf_file:
    pdf_reader = PdfFileReader(pdf_file)
    print(f"The total number of pages in the pdf document is {pdf_reader.numPages}")
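
The class names above match the PyPDF2 1.x releases. Newer PyPDF2 releases (and the successor package pypdf) name the reader class PdfReader and expose the page count as len(reader.pages). Below is a minimal sketch that tries the newer name first and falls back to the older one. count_pages is a hypothetical helper name, and which names your installation provides is an assumption you should check against the release you have.

```python
def count_pages(path):
    """Return the number of pages in the PDF file at `path`."""
    # Imported here so the helper can be defined even before PyPDF2 is installed.
    import PyPDF2

    # Assumption: newer releases provide PdfReader, older ones only PdfFileReader.
    reader_cls = getattr(PyPDF2, "PdfReader", None) or PyPDF2.PdfFileReader

    with open(path, "rb") as pdf_file:
        # `pages` behaves like a list of the document's pages.
        return len(reader_cls(pdf_file).pages)
```

Calling count_pages("your_pdf_file.pdf") should then return the same number that numPages reports in the script above.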

Rotating PDF files with PyPDF2 and Tkinter

Introduction

Sometimes we need simple and basic tools to get the job done. At work, we have people who use pdf files daily and need to perform certain manual operations on them. One of these operations is rotating pages. Programming a pdf rotator can look like a massive task at first, but is it really?

To build our tool, we need to be able to rotate the pdf in three ways: clockwise, counterclockwise and 180 degrees. For simplicity, we want to rotate all pdf files in our current working directory. The user should also be able to use the finished script on Windows without installing Python or any dependencies. Let us walk through the steps in creating our tool.

Step 1 - PyPDF2 for rotating pages

Rotating a pdf with PyPDF2 can be done with the PageObject class's rotateClockwise method. The method takes one int parameter, angle, which defines the rotation in degrees. Note that the angle has to be specified in increments of 90°. It is not possible to rotate a PDF page by, for example, 55°.
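
The 90° restriction can be checked in plain Python before a page is ever touched. The following is a minimal sketch, not part of PyPDF2: normalize_rotation is a hypothetical helper that rejects angles such as 55 and maps any valid angle to the equivalent clockwise rotation of 0, 90, 180 or 270 degrees.

```python
def normalize_rotation(degrees):
    """Return `degrees` as an equivalent clockwise angle in {0, 90, 180, 270}."""
    if degrees % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {degrees}")
    # Python's % always returns a non-negative result for a positive divisor,
    # so a counterclockwise -90 becomes a clockwise 270.
    return degrees % 360
```

With this helper, a "Left 90 degrees" choice stored as -90 and a clockwise rotation of 270 end up as the same value.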

Ok, we know how to rotate our page; now we need to load our PDF file into memory. After that, we need to initialize our PdfFileReader and PdfFileWriter objects. We can then loop through our pages by using the reader's numPages property. We get each page, rotate it, add it to our new PDF and then save the result to disk.

import PyPDF2

degrees = 90  # clockwise rotation angle, must be a multiple of 90

with open("test.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    pdf_writer = PyPDF2.PdfFileWriter()

    print("Rotating", degrees)

    # Rotate every page and add it to the writer
    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        pdf_page.rotateClockwise(degrees)
        pdf_writer.addPage(pdf_page)

    # Write the result while the source file is still open
    with open("test_rotated.pdf", "wb") as pdf_file_rotated:
        pdf_writer.write(pdf_file_rotated)

Step 2 - Giving the user an interface

A command line interface might work for many users, but believe me, a Graphical User Interface (GUI) beats a Command Line Interface (CLI) by light years for the average user. To easily create our interface, we use Tkinter. We need three Radiobuttons for specifying the rotation: Left, Right and 180 degrees. We also need some descriptive text to guide the user, as well as a Button to start the rotation. See the code comments for further descriptions.

import tkinter as tk
# Create our root widget, set title and size
master = tk.Tk()
master.title("PDF rotator")
master.geometry("400x100")

# Create an IntVar for getting our rotate values
master.degrees = tk.IntVar()

# Create a description label and a couple radiobuttons, add them to the widget
tk.Label(master, text="Rotates all pdf in the current folder the selected degrees.").grid(row=0,columnspan=4)
tk.Radiobutton(master, text="Right 90 degrees", variable=master.degrees, value=90).grid(row=1,column=1)
tk.Radiobutton(master, text="Left 90 degrees", variable=master.degrees, value=-90).grid(row=1,column=2)
tk.Radiobutton(master, text="180 degrees", variable=master.degrees, value=180).grid(row=1,column=3)

# Create a button for calling our function
master.ok_button = tk.Button(master, command=rotate_pdf, text="Rotate pdf files")
master.ok_button.grid(row=2,column=1)

# Run
tk.mainloop()

Step 3 - Getting the files

We want to rotate all pdf files in the current working directory, that is, the folder from which the script is run.

import os
# Get all the files in current folder from where we are running the script
files = [f for f in os.listdir('.') if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith(('.pdf')), files))
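
The same selection can also be written with pathlib from the standard library. This is a minimal sketch of an alternative, not the approach used in the rest of this post. find_pdfs is a hypothetical helper name, and it returns Path objects rather than plain strings.

```python
from pathlib import Path


def find_pdfs(folder="."):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        entry
        for entry in Path(folder).iterdir()
        # Skip folders and compare the extension case-insensitively,
        # so both "a.pdf" and "B.PDF" are picked up.
        if entry.is_file() and entry.suffix.lower() == ".pdf"
    )
```

Since open() accepts Path objects, the result can be passed to the same kind of loop that the os.listdir version feeds.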

Step 4 - Putting it all together

Here is the complete code for rotating our pdf files. Enjoy!

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import PyPDF2
import tkinter as tk
import os
import sys

# Get all the files in current folder from where we are running the script
files = [f for f in os.listdir('.') if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith(('.pdf')), files))

# main rotate pdf function
def rotate_pdf(*args):
    degrees = master.degrees.get()
    pdf_rotator(files, degrees)

# The pdf rotator
def pdf_rotator(files, degrees):
    
    for filename in files:
        if degrees != 0 and degrees != "":
            with open(filename, "rb") as pdf_file:
                pdf_reader = PyPDF2.PdfFileReader(pdf_file)
                pdf_writer = PyPDF2.PdfFileWriter()

                print("Rotating", degrees)

                for page_num in range(pdf_reader.numPages):
                    pdf_page = pdf_reader.getPage(page_num)
                    pdf_page.rotateClockwise(degrees)
                    
                    pdf_writer.addPage(pdf_page)

                with open(filename[:-4]+"rotated_"+str(degrees)+".pdf", "wb") as pdf_file_rotated:
                    pdf_writer.write(pdf_file_rotated)
    sys.exit()

# Create our root widget, set title and size
master = tk.Tk()
master.title("PDF rotator")
master.geometry("400x100")

# Create an IntVar for getting our rotate values
master.degrees = tk.IntVar()

# Create a description label and a couple radiobuttons, add them to the widget
tk.Label(master, text="Rotates all pdf in the current folder the selected degrees.").grid(row=0,columnspan=4)
tk.Radiobutton(master, text="Right 90 degrees", variable=master.degrees, value=90).grid(row=1,column=1)
tk.Radiobutton(master, text="Left 90 degrees", variable=master.degrees, value=-90).grid(row=1,column=2)
tk.Radiobutton(master, text="180 degrees", variable=master.degrees, value=180).grid(row=1,column=3)

# Create a button for calling our function
master.ok_button = tk.Button(master, command=rotate_pdf, text="Rotate pdf files")
master.ok_button.grid(row=2,column=1)

# Run
tk.mainloop()
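
One side effect of the script above is that the rotated copies land in the same folder as the originals, so a second run picks them up as input too. Below is a minimal sketch of one way to handle that, under the assumption that you are free to change the output naming. rotated_name and is_rotated_output are hypothetical helpers, and the _rotated_ suffix is a naming convention chosen for this example.

```python
import os
import re

# Matches names produced by rotated_name(), e.g. "report_rotated_90.pdf".
_ROTATED_SUFFIX = re.compile(r"_rotated_-?\d+\.pdf$", re.IGNORECASE)


def rotated_name(filename, degrees):
    """Build the output name by inserting the rotation before the extension."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}_rotated_{degrees}{ext}"


def is_rotated_output(filename):
    """Return True if `filename` looks like the output of an earlier run."""
    return _ROTATED_SUFFIX.search(filename) is not None
```

Filtering the file list with is_rotated_output before calling pdf_rotator would leave earlier results alone.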

 

 

Splitting PDFs with Python

In this post, we are going to have a look at how to split all pages from a single pdf into one-page pdf files. Splitting a pdf into several pages can easily be done with almost any pdf tool worth its salt. However, splitting a pdf into single pages is a manual operation, and if you have to do it on several pdfs an automated tool makes sense. This is where PyPDF2 comes in handy. If you just want the complete code without all the fancy explanations, you can find it at the end.

Preparations

If you haven't done so already, fire up your command prompt, PowerShell or terminal and install PyPDF2 with pip. 

pip install pypdf2

Currently I am running 32-bit Python 3.8 with PyPDF2 version 1.26.0 on Windows 10. The code works on this setup, and probably on other operating systems as well.

Code line by line

Imports

We start by importing PdfFileWriter and PdfFileReader so that we can read the existing pdf and later write new pdfs. We also need to import os so that we can check what files we have in our working directory.

from PyPDF2 import PdfFileWriter, PdfFileReader
import os

Getting the pdf files to split

First, we use a list comprehension over os.listdir(".") and keep only the entries for which os.path.isfile(f) is true, which leaves out folders. After that, we keep only the pdf files in the list files with files = list(filter(lambda f: f.lower().endswith((".pdf")), files)).

files = [f for f in os.listdir(".") if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith((".pdf")), files))

Splitting and creating new pdf

Now it is time to process all our pdf files. We go through each pdf in files with a for loop: for pdf in files:. We then open each pdf using with open(pdf, "rb") as f: and load it into a PdfFileReader object with inputpdf = PdfFileReader(f).

Now it is time to start the splitting. With another for loop, we loop through all pages in the pdf. You can get the number of pages with numPages. For each page, we create a PdfFileWriter object named output and add the current page with getPage(i). We name the output pdf after the original, adding -Page and the page index: name = pdf[:-4]+"-Page "+str(i)+".pdf". Note that i starts at 0, so the first page is saved as -Page 0. Finally, we save the output.

with open(name, "wb") as outputStream:
    output.write(outputStream)

Complete code

from PyPDF2 import PdfFileWriter, PdfFileReader
import os

files = [f for f in os.listdir(".") if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith((".pdf")), files))

for pdf in files:
    with open(pdf, "rb") as f:
        inputpdf = PdfFileReader(f)

        for i in range(inputpdf.numPages):
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            name = pdf[:-4]+"-Page "+str(i)+".pdf"
            with open(name, "wb") as outputStream:
                output.write(outputStream)
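
Because range() starts at 0, the code above names the first page "-Page 0". If you prefer names that start at 1 and sort correctly in a file browser, the name can be built by a small helper. This is a minimal sketch. page_filename is a hypothetical helper, and the zero-padding is a choice made for this example, not something PyPDF2 requires.

```python
import os


def page_filename(pdf_name, page_index, total_pages):
    """Build a one-based, zero-padded output name for a single page."""
    stem, _ = os.path.splitext(pdf_name)
    # Pad to the width of the highest page number, e.g. 01..12 for 12 pages.
    width = len(str(total_pages))
    return f"{stem}-Page {page_index + 1:0{width}d}.pdf"
```

Inside the page loop, name = page_filename(pdf, i, inputpdf.numPages) would replace the string concatenation.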

 

Searching for text in PDF files with PyPDF2

Portable Document Format (PDF) is wonderful as long as you only have to read it, not work with it. The format is not really meant to be tampered with, which is why editing PDFs is normally hard. It is a de facto worldwide standard, so you will most likely come across it when coding. Read along to see how to tackle the PDF format and how to search for the information contained in PDF files.
