Counting the number of PDF pages

Do you need to count the total number of pages in your pdf files?

Look no further, as PyPDF2 is here to help you!

Here is the documentation: https://pythonhosted.org/PyPDF2/

To count the number of pages in a PDF file, you need only four lines of code.

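You can install PyPDF2 with pip. The code in this post uses the PyPDF2 1.x names (PdfFileReader, numPages), which were removed in PyPDF2 3.0, so install a release older than 3.0 to run it unchanged:

py -m pip install "PyPDF2<3.0"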

The code

Ok ok enough installing, what do we need to do to count the pages?

First, we want to import the PdfFileReader class from PyPDF2

from PyPDF2 import PdfFileReader

After that, we need to open our PDF file in binary reading mode.

with open("your_pdf_file.pdf", "rb") as pdf_file:

We then want to instantiate our PdfFileReader object

pdf_reader = PdfFileReader(pdf_file)

We then get the number of pages with the numPages property

print(f"The total number of pages in the pdf document is {pdf_reader.numPages}")

That's it! We have now counted the number of pages in a PDF file with Python!

The complete code:

#!/usr/bin/env python3

"""
Extracting number of pages in the document

getNumPages()
Calculates the number of pages in this PDF file.

Returns:    number of pages
Return type:    int
Raises PdfReadError:
    if file is encrypted and restrictions prevent this action.
    
numPages
Read-only property that accesses the getNumPages() function.
"""

from PyPDF2 import PdfFileReader

# Load the pdf to the PdfFileReader object with default settings
with open("your_pdf_file.pdf", "rb") as pdf_file:
    pdf_reader = PdfFileReader(pdf_file)
    print(f"The total number of pages in the pdf document is {pdf_reader.numPages}")

Rotating PDF files with PyPDF2 and Tkinter

Introduction

Sometimes we need simple and basic tools to get the job done. At work, we have people that use pdf files daily, on which they need to perform certain manual operations. One of these operations is rotating pages. Thinking of programming a pdf rotator can look quite massive at first, but is it really?

To build our tool, we need to be able to rotate the pdf in three ways, clockwise, counterclockwise and 180 degrees. For simplicity, we want to rotate all pdf files in our current working directory. The user shall also be able to use the finished script, without installing Python or any dependencies on Windows. Let us walk through the steps in creating our tool.

Step 1 - PyPDF2 for rotating pages

Rotating a pdf with PyPDF2 can be done with the PageObject class's method rotateClockwise. The method takes one int parameter, angle, which defines the rotation in degrees. Note that the angle has to be specified in increments of 90°. There is no way to rotate a PDF page by, for example, 55°.
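
The 90° rule can be checked before any page is touched. The helper below is a minimal sketch and not part of PyPDF2; the name normalize_rotation is made up for this example. It rejects angles that are not a multiple of 90 and reduces the rest to the equivalent clockwise rotation, which shows why "Left 90 degrees" can be passed as -90.

```python
def normalize_rotation(angle):
    """Return the equivalent clockwise rotation: 0, 90, 180 or 270."""
    if angle % 90 != 0:
        raise ValueError(f"Rotation must be a multiple of 90 degrees, got {angle}")
    # Python's modulo is never negative for a positive divisor,
    # so -90 becomes 270
    return angle % 360


for requested in (90, -90, 180, 450):
    print(requested, "->", normalize_rotation(requested))
```

A left turn of 90 degrees and a right turn of 270 degrees give the same page orientation, so both map to 270 here.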

Ok, we know how to rotate our page; now we need to load our PDF file into memory. After that, we initialize our PdfFileReader and PdfFileWriter objects. We can then loop through the pages by using the reader's numPages property. We get each page, rotate it, add it to our new PDF and then save the result to disk.

import PyPDF2

# Clockwise rotation in degrees; must be a multiple of 90
degrees = 90

with open("test.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    pdf_writer = PyPDF2.PdfFileWriter()

    print("Rotating", degrees)
    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        pdf_page.rotateClockwise(degrees)
        pdf_writer.addPage(pdf_page)

    # Write while the source file is still open, as the pages are read from it
    with open("test_rotated.pdf", "wb") as pdf_file_rotated:
        pdf_writer.write(pdf_file_rotated)

Step 2 - Giving the user an interface

A command line interface might work for many users, but believe me, a Graphical User Interface (GUI) beats a Command Line Interface (CLI) by light-years for the average user. To easily create our interface, we use Tkinter. We need three Radiobuttons for specifying the rotation: Left, Right and 180 degrees. We also need some descriptive text to guide the user, as well as a Button to start the rotation. See the code comments for further descriptions.

import tkinter as tk
# Create our root widget, set title and size
master = tk.Tk()
master.title("PDF rotator")
master.geometry("400x100")

# Create a IntVar for getting our rotate values
master.degrees = tk.IntVar()

# Create a description label and a couple radiobuttons, add them to the widget
tk.Label(master, text="Rotates all pdf in the current folder the selected degrees.").grid(row=0,columnspan=4)
tk.Radiobutton(master, text="Right 90 degrees", variable=master.degrees, value=90).grid(row=1,column=1)
tk.Radiobutton(master, text="Left 90 degrees", variable=master.degrees, value=-90).grid(row=1,column=2)
tk.Radiobutton(master, text="180 degrees", variable=master.degrees, value=180).grid(row=1,column=3)

# Create a button for calling our function
master.ok_button = tk.Button(master, command=rotate_pdf, text="Rotate pdf files")
master.ok_button.grid(row=2,column=1)

# Run
tk.mainloop()

Step 3 - Getting the files

We want to rotate all pdf files in the folder where our script is contained. 

import os
# Get all the files in current folder from where we are running the script
files = [f for f in os.listdir('.') if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith(('.pdf')), files))
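
The same selection can also be written with pathlib from the standard library. This is only a sketch of an alternative, not the code used in the rest of this post, and the function name find_pdfs is made up for the example. For ordinary file names it should pick the same files as the two lines above.

```python
from pathlib import Path


def find_pdfs(folder="."):
    """Return the names of all pdf files directly inside folder, sorted."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        # Skip folders and compare the extension case-insensitively
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


print(find_pdfs())
```

Sorting is an addition here; os.listdir returns the names in arbitrary order.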

Step 4 - Putting it all together

Here is the complete code for rotating our pdf files. Enjoy!

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import PyPDF2
import tkinter as tk
import os
import sys

# Get all the files in current folder from where we are running the script
files = [f for f in os.listdir('.') if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith(('.pdf')), files))

# main rotate pdf function
def rotate_pdf(*args):
    degrees = master.degrees.get()
    pdf_rotator(files, degrees)

# The pdf rotator
def pdf_rotator(files, degrees):
    
    for filename in files:
        if degrees != 0 and degrees != "":
            with open(filename, "rb") as pdf_file:
                pdf_reader = PyPDF2.PdfFileReader(pdf_file)
                pdf_writer = PyPDF2.PdfFileWriter()

                print("Rotating", degrees)

                for page_num in range(pdf_reader.numPages):
                    pdf_page = pdf_reader.getPage(page_num)
                    pdf_page.rotateClockwise(degrees)
                    
                    pdf_writer.addPage(pdf_page)

                with open(filename[:-4]+"rotated_"+str(degrees)+".pdf", "wb") as pdf_file_rotated:
                    pdf_writer.write(pdf_file_rotated)
    sys.exit()

# Create our root widget, set title and size
master = tk.Tk()
master.title("PDF rotator")
master.geometry("400x100")

# Create a IntVar for getting our rotate values
master.degrees = tk.IntVar()

# Create a description label and a couple radiobuttons, add them to the widget
tk.Label(master, text="Rotates all pdf in the current folder the selected degrees.").grid(row=0,columnspan=4)
tk.Radiobutton(master, text="Right 90 degrees", variable=master.degrees, value=90).grid(row=1,column=1)
tk.Radiobutton(master, text="Left 90 degrees", variable=master.degrees, value=-90).grid(row=1,column=2)
tk.Radiobutton(master, text="180 degrees", variable=master.degrees, value=180).grid(row=1,column=3)

# Create a button for calling our function
master.ok_button = tk.Button(master, command=rotate_pdf, text="Rotate pdf files")
master.ok_button.grid(row=2,column=1)

# Run
tk.mainloop()

 

 

Introduction

In this post, we will take a look into how we can generate Extensible Markup Language (XML) files from an Excel file with Python. We will be using the Yattag package to generate our XML file and the OpenPyXL package for reading our Excel data.

Packages

Yattag

Yattag is described in its documentation as following:

Yattag is a Python library for generating HTML or XML in a pythonic way.

That pretty much sums Yattag up. I find it a simple, easy-to-use library that just works. I had been searching for this kind of library in order to more easily generate different XML files.

To install Yattag with pip: pip install yattag

Using Yattag

Adding a tag with Yattag is as easy as using the with keyword:

with tag('h1'):
    text('Hello world!')

Tags are automatically closed. To start using Yattag we need to import Doc from Yattag and create our Doc, tag and text with Doc().tagtext().

from yattag import Doc
doc, tag, text = Doc().tagtext()
with tag('h1'):
    text('Hello world!')
doc.getvalue()

Output:
'<h1>Hello world!</h1>'

OpenPyXL

OpenPyXL is a library for interacting with Excel 2010 files. OpenPyXL can read and write to .xlsx and .xlsm files.

To install OpenPyXL with pip: pip install openpyxl

Using OpenPyXL

To load an existing workbook in OpenPyXl we need to use the load_workbook method. We also need to select the sheet we are reading the data from. In our example, we are using popular baby names in New York City. You can access the dataset from the link at the bottom of this post.

I have created a workbook named NY_baby_names.xlsx with one sheet of data, Sheet1. The worksheet has the following headers: Year of Birth, Gender, Child's First Name, Count, Rank. You can download the Excel file from my website here.

To access the data with OpenPyXL, do the following:

from openpyxl import load_workbook
wb = load_workbook("NY_baby_names.xlsx")
ws = wb.worksheets[0]
for row in ws.iter_rows(min_row=1, max_row=2, min_col=1, max_col=4):
    print([cell.value for cell in row])

Output: 
['Year of Birth', 'Gender', "Child's First Name", 'Count']
[2011, 'FEMALE', 'GERALDINE', 13]

First, we load the workbook with load_workbook, and then select the first worksheet. We then iterate through the first two rows with the iter_rows method.

Generating the XML from Excel

After the imports, we load the workbook and the worksheet. We then create our Yattag document. We fill in the headers with Yattag's asis() method. The asis method lets us insert any string into the document exactly as it is, without escaping.

We then create our main tag, Babies. We start looping through our sheet with the iter_rows method. The iter_rows method returns a generator with all the cells. We use a list comprehension to get all the values from the cells.

Next, we add the babies. Notice the use of the with keyword together with tag and text. When we are finished, we indent our result with Yattag's indent function.

Finally, we save our file. The output should look like below. Notice that I only included two babies in the output.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"></xs:schema>
<Babies>
    <Baby>
        <Name>
            GERALDINE
        </Name>
        <Gender>
            FEMALE
        </Gender>
        <year>
            2011
        </year>
        <count>
            13
        </count>
        <rank>
            75
        </rank>
    </Baby>
    <Baby>
        <Name>
            GIA
        </Name>
        <Gender>
            FEMALE
        </Gender>
        <year>
            2011
        </year>
        <count>
            21
        </count>
        <rank>
            67
        </rank>
    </Baby>
</Babies>

Complete code

from openpyxl import load_workbook
from yattag import Doc, indent

wb = load_workbook("NY_baby_names.xlsx")
ws = wb.worksheets[0]

# Create Yattag doc, tag and text objects
doc, tag, text = Doc().tagtext()

xml_header = '<?xml version="1.0" encoding="UTF-8"?>'
xml_schema = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"></xs:schema>'

doc.asis(xml_header)
doc.asis(xml_schema)

with tag('Babies'):
    # Use ws.max_row for all rows
    for row in ws.iter_rows(min_row=2, max_row=100, min_col=1, max_col=5):
        row = [cell.value for cell in row]
        with tag("Baby"):
            with tag("Name"):
                text(row[2])
            with tag("Gender"):
                text(row[1])
            with tag("year"):
                text(row[0])
            with tag("count"):
                text(row[3])
            with tag("rank"):
                text(row[4])

result = indent(
    doc.getvalue(),
    indentation = '    ',
    indent_text = True
)

with open("baby_names.xml", "w") as f:
    f.write(result)
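
For comparison, the same Babies structure can be built with xml.etree.ElementTree from the standard library. This is a sketch of an alternative and not the Yattag approach used in this post. The two rows are hard-coded from the sample output above instead of being read from the workbook, the function name build_babies is made up for the example, and ET.indent needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

# Sample rows in the sheet's column order:
# Year of Birth, Gender, Child's First Name, Count, Rank
rows = [
    (2011, "FEMALE", "GERALDINE", 13, 75),
    (2011, "FEMALE", "GIA", 21, 67),
]


def build_babies(rows):
    """Build a Babies element with one Baby child per row."""
    root = ET.Element("Babies")
    for year, gender, name, count, rank in rows:
        baby = ET.SubElement(root, "Baby")
        # ElementTree requires text to be a string, so convert the numbers
        ET.SubElement(baby, "Name").text = str(name)
        ET.SubElement(baby, "Gender").text = str(gender)
        ET.SubElement(baby, "year").text = str(year)
        ET.SubElement(baby, "count").text = str(count)
        ET.SubElement(baby, "rank").text = str(rank)
    return root


root = build_babies(rows)
ET.indent(root, space="    ")  # Python 3.9+
print(ET.tostring(root, encoding="unicode"))
```

Unlike indent_text=True in Yattag, this keeps each value on the same line as its tag.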

Data sets

Popular Baby Names dataset: https://catalog.data.gov/dataset/most-popular-baby-names-by-sex-and-mothers-ethnic-group-new-york-city-8c742

I got an excellent question in my OpenPyXL course. The question was how to transfer data from one Excel workbook to another with OpenPyXL. Transferring data only is easier, as there is no need to worry about the formatting of cells and sheets. The complete code is available at the bottom of this post.

Mission statement

The question was how to copy data from one sheet in a workbook to several other sheets in a newly created workbook. The source workbook is WB1 with the source sheet WS1. The source sheet has 1000 rows of data. The data shall be copied to the new workbook WB2 and the sheets WS1 to WS10.

Imports and loading workbook

To copy the values we need to import Workbook and load_workbook from the OpenPyXL library. We can then load our existing workbook with WB1 = load_workbook("Source.xlsx", data_only=True). Next, we set the sheet we are going to copy the data from and name it WB1_WS1: WB1_WS1 = WB1["WS1"]. After that we are ready to create a new workbook with WB2 = Workbook(). Notice the parentheses: Workbook is a class, and calling it creates the workbook object.

Creating sheets

The question stated that 10 sheets should be created, with the names WS1 to WS10. We can create the sheets with a for loop. Each sheet is created with create_sheet(f"WS{i}"). Notice the usage of Python's f-string. We then remove the sheet that a new workbook gets by default. We could of course also have renamed it.

# Create WB2 sheets WS1-WS10
for i in range(1, 11):
    WB2.create_sheet(f"WS{i}")

# delete first sheet
WB2.remove(WB2.worksheets[0])

Copy preparations

The next thing is to create a list for holding our copy ranges, and also which sheets we want to copy the data to. The copy_ranges list holds how many rows we need to copy from the source sheet to the sheets defined in copy_to_sheets.

# Define the copy ranges and sheets
copy_ranges = [100, 200, 50, 300, 350]
copy_to_sheets = ["WS1", "WS2", "WS3", "WS4", "WS4"]

Copying the data

We start with a for loop that iterates through the copy_ranges list: for i in range(len(copy_ranges)):. We then select the sheet whose turn it is with ws = WB2[copy_to_sheets[i]]. Notice how we look up the sheet name in the copy_to_sheets list: when i is 0 we select WS1, and so on. We also initialize offset to 1 so that we can keep track of which source row to copy next. We then advance the offset with yet another for loop, adding the sizes of the i ranges that have already been copied.

for s in range(i):
    offset += copy_ranges[s]
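
To see which source rows each range ends up covering, the offsets can be computed on their own without any workbook. This is a sketch using itertools.accumulate instead of the nested loop; the names rows_before and start_rows are made up for the example.

```python
from itertools import accumulate

copy_ranges = [100, 200, 50, 300, 350]

# Number of rows already copied before each range starts
rows_before = [0] + list(accumulate(copy_ranges))[:-1]

# openpyxl rows are 1-based, so each range starts one past the rows before it
start_rows = [1 + n for n in rows_before]

for start, count in zip(start_rows, copy_ranges):
    print(f"rows {start} to {start + count - 1}")
```

The last range ends at row 1000, which matches the 1000 rows in the source sheet.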

Now it is time to fill our sheets with data! We traverse the rows of the current range with a for loop and append them to the corresponding sheet. First we get the row number with for j in range(offset, offset + copy_ranges[i]):. Next up are the cells in each row:

for row in WB1_WS1.iter_rows(min_row=j, max_row=j, min_col=1, max_col=WB1_WS1.max_column):

We get the values for values_row with a list comprehension [cell.value for cell in row]. Finally, we append the row to the sheet with ws.append(values_row).

# Copy the row with the help of iter_rows, append the row
for j in range(offset,  offset + copy_ranges[i]):
    #if j == 0:
    #    continue
    for row in WB1_WS1.iter_rows(min_row=j, max_row=j, min_col=1, max_col=WB1_WS1.max_column):
        values_row = [cell.value for cell in row]
    ws.append(values_row)

To wrap up, we save the workbook: WB2.save("WB2.xlsx")

That's it! Please comment below if you have questions or any feedback. See you later!

Use this link, Control Excel with Python & OpenPyXL or this code SAVEOPENPYXL to get the course for a discount on Udemy.com


The code

#!/usr/bin/python
# -*- coding: utf-8 -*-

"""
Could you please suggest how to copy the data from one workbook to another workbook with specified rows
Source: Excel work book "WB1" having work sheet "WS1", This sheet  having 1000 rows of data
Destination: New work book 'WB2' and  work sheets WS1,WS2...WS10
Could you please suggest the code for following condition:
Copy the first 100 rows data and paste it WS1 sheet
Copy the next 200 rows data and paste it WS2 sheet
Copy the next 50 rows data and paste it WS3 sheet
Copy the next 300 rows data and paste it WS4 sheet
Copy the next 350 rows data and paste it WS4 sheet
"""

from openpyxl import Workbook, load_workbook

WB1 = load_workbook("Source.xlsx", data_only=True)
WB1_WS1 = WB1["WS1"]
WB2 = Workbook()

# Create WB2 sheets WS1-WS10
for i in range(1, 11):
    WB2.create_sheet(f"WS{i}")

# delete first sheet
WB2.remove(WB2.worksheets[0])

# Define the copy ranges and sheets
copy_ranges = [100, 200, 50, 300, 350]
copy_to_sheets = ["WS1", "WS2", "WS3", "WS4", "WS4"]

# Copy the values from the rows in WB1 to WB2.
for i in range( len(copy_ranges)):
    # Set the sheet to copy to
    ws = WB2[ copy_to_sheets[i] ]
    # Initialize row offset
    offset = 1
    # Set the row offset
    for s in range(i):
        offset += copy_ranges[s]

    # Copy the row with the help of iter_rows, append the row
    for j in range(offset,  offset + copy_ranges[i]):
        #if j == 0:
        #    continue
        for row in WB1_WS1.iter_rows(min_row=j, max_row=j, min_col=1, max_col=WB1_WS1.max_column):
            values_row = [cell.value for cell in row]
        ws.append(values_row)

# Save the workbook
WB2.save("WB2.xlsx")

Saving an Excel sheet to Pdf with Python

Saving a finished report or table in Excel is easy. You choose SaveAs and save the sheet as Pdf. Doing this automatically with Python is a bit trickier though. In this post, we will take a closer look at how to do this with the win32 library. The full code is available at the bottom of the post. Note that you need Excel installed in order to run this script successfully.

Installing dependencies

Install the win32 library first with: pip install pypiwin32. This installs the Win32 API library, which according to PyPI contains: "Python extensions for Microsoft Windows. Provides access to much of the Win32 API, the ability to create and use COM objects, and the Pythonwin environment."

File paths

To get the file paths we use pathlib. Pathlib was introduced in Python 3.4 (I am using Python 3.8 while writing this article). We specify the name of the Excel workbook we want to make a pdf of, and also the output pdf's name.

excel_file = "pdf_me.xlsx"
pdf_file = "pdf_me.pdf"

We then create paths from our current working directory (cwd) with pathlib's cwd() method.

excel_path = str(pathlib.Path.cwd() / excel_file)
pdf_path = str(pathlib.Path.cwd() / pdf_file)
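
Since Excel runs as a separate program, it is safest to hand it absolute paths, which is what Path.cwd() gives us. As a small variation, the pdf path can be derived from the Excel path instead of being typed twice. This is a sketch and not the code used in the rest of this post.

```python
from pathlib import Path

excel_path = Path.cwd() / "pdf_me.xlsx"

# Same folder and file name, only the extension changes
pdf_path = excel_path.with_suffix(".pdf")

print(excel_path.name, "->", pdf_path.name)
print("absolute:", pdf_path.is_absolute())
```

Wrap the results in str() before passing them to Excel, as in the lines above.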

Firing up Excel

Excel is next up. We start the Excel application and hide it.

excel = client.DispatchEx("Excel.Application")
excel.Visible = 0

We then open our workbook with wb = excel.Workbooks.Open(excel_path) and load our first sheet with ws = wb.Worksheets[1].

Now it is time to use SaveAs to save our sheet as a pdf: wb.SaveAs(pdf_path, FileFormat=57). File format 57 is the pdf file format.

We then close our workbook and quit our Excel application. Our pdf is now saved in our working directory.

The code

from win32com import client
import win32api
import pathlib

### pip install pypiwin32 if module not found

excel_file = "pdf_me.xlsx"
pdf_file = "pdf_me.pdf"
excel_path = str(pathlib.Path.cwd() / excel_file)
pdf_path = str(pathlib.Path.cwd() / pdf_file)

excel = client.DispatchEx("Excel.Application")
excel.Visible = 0

wb = excel.Workbooks.Open(excel_path)
ws = wb.Worksheets[1]

try:
    wb.SaveAs(pdf_path, FileFormat=57)
except Exception as e:
    print("Failed to convert")
    print(str(e))
finally:
    wb.Close()
    excel.Quit()

 

Splitting Pdfs with Python

In this post, we are going to have a look at how to split all pages from a single pdf into one-page pdf files. Splitting a pdf into several pages can easily be done with almost any pdf tool worth its salt. However, splitting a pdf into single pages is a manual operation, and if you have to do it on several pdfs an automated tool makes sense. This is where PyPDF2 comes in handy. If you just want the complete code without all the fancy explanations, you can find it at the end.

Preparations

If you haven't done so already, fire up your command prompt, PowerShell or terminal and install PyPDF2 with pip. 

pip install pypdf2

Currently I am running 32-bit Python 3.8 with PyPDF2 version 1.26.0 on Windows 10. The code works on this setup, and probably also on other operating systems.

Code line by line

Imports

We start by importing PdfFileWriter and PdfFileReader so that we can read the existing pdf and later write new pdfs. We also need to import os so that we can check what files we have in our working directory.

from PyPDF2 import PdfFileWriter, PdfFileReader
import os

Getting the pdf files to split

First we use a list comprehension over os.listdir(".") that keeps only the entries for which os.path.isfile(f) is true, so folders are skipped. After that we keep only the pdf files in the list files with files = list(filter(lambda f: f.lower().endswith((".pdf")), files)).

files = [f for f in os.listdir(".") if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith((".pdf")), files))

Splitting and creating new pdf

Now it is time to process all our pdf files. We go through each of our pdf in files with a for loop for pdf in files:. We then open the pdf with open(pdf, "rb") as f: and load each pdf into a PdfFileReader object with inputpdf = PdfFileReader(f).

Now it is time to start the splitting. With another for loop, we loop through all pages in the pdf. You can get the number of pages with numPages. For each page we create a PdfFileWriter object named output and add the current page with getPage(i). We name the output pdf after the original file, adding -Page and the page number: name = pdf[:-4]+"-Page "+str(i)+".pdf". Since i starts at 0, the first output file is numbered 0. Finally, we save the output.

with open(name, "wb") as outputStream:
    output.write(outputStream)
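
If you would rather have the file names start at page 1, the naming can be moved into a small helper. This is a sketch of a variation and not what the complete code below does; the function name page_filename is made up for the example.

```python
from pathlib import Path


def page_filename(pdf_name, page_index):
    """Build the output name for one page; page_index is zero-based."""
    # Path.stem drops the extension whatever its case, so no [:-4] slicing
    stem = Path(pdf_name).stem
    return f"{stem}-Page {page_index + 1}.pdf"


print(page_filename("report.pdf", 0))
```

Inside the loop this would replace the name = ... line with name = page_filename(pdf, i).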

Complete code

from PyPDF2 import PdfFileWriter, PdfFileReader
import os

files = [f for f in os.listdir(".") if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith((".pdf")), files))

for pdf in files:
    with open(pdf, "rb") as f:
        inputpdf = PdfFileReader(f)

        for i in range(inputpdf.numPages):
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            name = pdf[:-4]+"-Page "+str(i)+".pdf"
            with open(name, "wb") as outputStream:
                output.write(outputStream)

 

I have finally published my first course, Control Excel with Python & OpenPyXL!

The course covers the ins and outs of how to control and automate Excel with the OpenPyXL library.

I am offering this course for a discount for you who are reading this post.

Use this link, Control Excel with Python & OpenPyXL or this code SAVEOPENPYXL to get the course for a discount on Udemy.com

Introduction

Using Python is wonderful. In fact, it is so wonderful that you sometimes want other, non-Python users to try out your scripts. You can tell them all day long to go to python.org and download the latest version. However, most people will ask you why you cannot just make an installer or a .exe for them instead.

I have made a few executable files myself and will share some of the ways I have done it. Note that I am using Python 3.6; results can differ if you are using Python 2 or Python 3.7. Occasionally, the packaging does not work without some modification when you are using the latest Python version. Try using an older Python version if the packaging does not work out of the box.


Introduction

How to run multiple threads in Python: while coding, we sometimes want the program to do multiple things at once. You may have a script where you run a background function, checking for changes as you go along. It can also be as simple as disabling an Entry so that the user cannot input more text for a while. In this post, we will look at a simple way to use multiple threads in Python.

Creating a button in Tkinter

In this article, we will explore how to create a simple button in Tkinter. The button is used for a lot of user interaction and is often the most basic way to interact with a user. We will also cover some of the basics of how a Tkinter application is launched.
