Do you need to count the total number of pages in your pdf files?
Look no further, as PyPDF2 is here to help you!
Here is the documentation: https://pythonhosted.org/PyPDF2/
To count the number of pages in a PDF file, you need only four lines of code.
You can install PyPDF2 from PyPI with pip:
py -m pip install PyPDF2
Ok ok enough installing, what do we need to do to count the pages?
First, we want to import the PdfFileReader class from PyPDF2
from PyPDF2 import PdfFileReader
After that, we need to open our PDF file in binary reading mode.
with open("your_pdf_file.pdf", "rb") as pdf_file:
We then want to instantiate our PdfFileReader object
    pdf_reader = PdfFileReader(pdf_file)
We then get the number of pages with the numPages property
    print(f"The total number of pages in the pdf document is {pdf_reader.numPages}")
That's it! We have now counted the number of pages in a PDF file with Python!
#!/usr/bin/env python3
"""
Extracting number of pages in the document
getNumPages()
Calculates the number of pages in this PDF file.
Returns: number of pages
Return type: int
Raises PdfReadError:
if file is encrypted and restrictions prevent this action.
numPages
Read-only property that accesses the getNumPages() function.
"""
from PyPDF2 import PdfFileReader
# Load the pdf to the PdfFileReader object with default settings
with open("your_pdf_file.pdf", "rb") as pdf_file:
    pdf_reader = PdfFileReader(pdf_file)
    print(f"The total number of pages in the pdf document is {pdf_reader.numPages}")
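A note if you are on a newer release: in PyPDF2 3.x and its successor pypdf, the PdfFileReader class and the numPages property were replaced by PdfReader and len(reader.pages). Below is a minimal sketch of the same page count, assuming pypdf is installed (pip install pypdf). The helper name count_pages is only for illustration, and the import sits inside the function so the snippet still loads when the package is missing.

```python
def count_pages(pdf_path):
    """Return the number of pages in the PDF at pdf_path."""
    # Assumption: pypdf (the successor of PyPDF2) is installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # reader.pages behaves like a list of page objects
    return len(reader.pages)
```

You would call it with count_pages("your_pdf_file.pdf") and print the result as before.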
Sometimes we need simple and basic tools to get the job done. At work, we have people that use pdf files daily, on which they need to perform certain manual operations. One of these operations is rotating pages. Thinking of programming a pdf rotator can look quite massive at first, but is it really?
To build our tool, we need to be able to rotate the pdf in three ways, clockwise, counterclockwise and 180 degrees. For simplicity, we want to rotate all pdf files in our current working directory. The user shall also be able to use the finished script, without installing Python or any dependencies on Windows. Let us walk through the steps in creating our tool.
Rotating a pdf page with PyPDF2 is done with the PageObject class's rotateClockwise method. The method takes one int parameter, angle, which defines the rotation in degrees. Note that the angle has to be specified in increments of 90°. It is not possible to rotate a PDF page by, for example, 55°.
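Since only multiples of 90° are allowed, it can help to check the angle before handing it to PyPDF2. Here is a small sketch of such a check. The function name normalize_rotation is my own and not part of PyPDF2.

```python
def normalize_rotation(degrees):
    """Map a rotation to 0, 90, 180 or 270 degrees clockwise."""
    if degrees % 90 != 0:
        raise ValueError(f"Rotation must be a multiple of 90, got {degrees}")
    # Python's modulo is never negative for a positive divisor,
    # so a left turn of -90 becomes a right turn of 270
    return degrees % 360


print(normalize_rotation(-90))  # 270
print(normalize_rotation(450))  # 90
```

A counterclockwise rotation of 90° and a clockwise rotation of 270° end up as the same page orientation, which is why the helper returns a single clockwise value.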
Ok, we know how to rotate our page, now we need to open our PDF file. After that, we need to initialize our PdfFileReader and PdfFileWriter objects. We can then loop through our pages by using the reader's numPages property. We get each page, rotate it, add it to our new PDF and then save it to disk.
import PyPDF2
degrees = 90  # rotation angle, must be a multiple of 90

with open("test.pdf", "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    pdf_writer = PyPDF2.PdfFileWriter()
    print("Rotating", degrees)
    for page_num in range(pdf_reader.numPages):
        pdf_page = pdf_reader.getPage(page_num)
        pdf_page.rotateClockwise(degrees)
        pdf_writer.addPage(pdf_page)
    # Write the new file while the source file is still open
    with open("test_rotated.pdf", "wb") as pdf_file_rotated:
        pdf_writer.write(pdf_file_rotated)
A command line interface might work for many users, but believe me, a Graphical User Interface (GUI) beats a Command Line Interface (CLI) by light years for the average user. To easily create our interface, we use Tkinter. We need three Radiobuttons for specifying the rotations: Left, Right and 180 degrees. We also need some kind of descriptive text to guide the user, as well as a Button for starting the rotation. See the code comments for further descriptions.
import tkinter as tk

# Placeholder callback so this snippet runs on its own,
# the real rotate_pdf is defined in the complete code
def rotate_pdf(*args):
    print("Rotating", master.degrees.get())

# Create our root widget, set title and size
master = tk.Tk()
master.title("PDF rotator")
master.geometry("400x100")

# Create a IntVar for getting our rotate values
master.degrees = tk.IntVar()

# Create a description label and a couple radiobuttons, add them to the widget
tk.Label(master, text="Rotates all pdf in the current folder the selected degrees.").grid(row=0,columnspan=4)
tk.Radiobutton(master, text="Right 90 degrees", variable=master.degrees, value=90).grid(row=1,column=1)
tk.Radiobutton(master, text="Left 90 degrees", variable=master.degrees, value=-90).grid(row=1,column=2)
tk.Radiobutton(master, text="180 degrees", variable=master.degrees, value=180).grid(row=1,column=3)

# Create a button for calling our function
master.ok_button = tk.Button(master, command=rotate_pdf, text="Rotate pdf files")
master.ok_button.grid(row=2,column=1)

# Run
tk.mainloop()
We want to rotate all pdf files in the folder from which our script is run, that is, the current working directory.
import os

# Get all the files in current folder from where we are running the script
files = [f for f in os.listdir('.') if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith('.pdf'), files))
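If you prefer pathlib over os, the same filtering can be sketched as below. The function name list_pdf_files and its folder parameter are my own choices for illustration, the finished tool keeps using the os version.

```python
from pathlib import Path


def list_pdf_files(folder="."):
    """Return the names of all pdf files directly inside folder, sorted."""
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        # Skip sub-folders and compare the extension case-insensitively
        if entry.is_file() and entry.suffix.lower() == ".pdf"
    )
```

Calling list_pdf_files() without arguments looks in the current working directory, just like os.listdir('.') does.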
Here is the complete code for rotating our pdf files. Enjoy!
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import PyPDF2
import tkinter as tk
import os
import sys

# Get all the files in current folder from where we are running the script
files = [f for f in os.listdir('.') if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith('.pdf'), files))

# main rotate pdf function
def rotate_pdf(*args):
    degrees = master.degrees.get()
    pdf_rotator(files, degrees)

# The pdf rotator
def pdf_rotator(files, degrees):
    for filename in files:
        if degrees != 0 and degrees != "":
            with open(filename, "rb") as pdf_file:
                pdf_reader = PyPDF2.PdfFileReader(pdf_file)
                pdf_writer = PyPDF2.PdfFileWriter()
                print("Rotating", degrees)
                for page_num in range(pdf_reader.numPages):
                    pdf_page = pdf_reader.getPage(page_num)
                    pdf_page.rotateClockwise(degrees)
                    pdf_writer.addPage(pdf_page)
                # Write the new file while the source file is still open
                with open(filename[:-4]+"rotated_"+str(degrees)+".pdf", "wb") as pdf_file_rotated:
                    pdf_writer.write(pdf_file_rotated)
    # Close the tool when all files are rotated
    sys.exit()

# Create our root widget, set title and size
master = tk.Tk()
master.title("PDF rotator")
master.geometry("400x100")

# Create a IntVar for getting our rotate values
master.degrees = tk.IntVar()

# Create a description label and a couple radiobuttons, add them to the widget
tk.Label(master, text="Rotates all pdf in the current folder the selected degrees.").grid(row=0,columnspan=4)
tk.Radiobutton(master, text="Right 90 degrees", variable=master.degrees, value=90).grid(row=1,column=1)
tk.Radiobutton(master, text="Left 90 degrees", variable=master.degrees, value=-90).grid(row=1,column=2)
tk.Radiobutton(master, text="180 degrees", variable=master.degrees, value=180).grid(row=1,column=3)

# Create a button for calling our function
master.ok_button = tk.Button(master, command=rotate_pdf, text="Rotate pdf files")
master.ok_button.grid(row=2,column=1)

# Run
tk.mainloop()
In this post, we will take a look into how we can generate Extensible Markup Language (XML) files from an Excel file with Python. We will be using the Yattag package to generate our XML file and the OpenPyXL package for reading our Excel data.
Yattag is described in its documentation as follows:
Yattag is a Python library for generating HTML or XML in a pythonic way.
That pretty much sums Yattag up. I find it a simple, easy-to-use library that just works. I had been searching for this kind of library in order to more easily generate different XML files.
To install Yattag with pip: pip install yattag
Adding a tag with Yattag is as easy as using the with keyword:
with tag('h1'):
    text('Hello world!')
Tags are automatically closed. To start using Yattag we need to import Doc from yattag and create our doc, tag and text objects with Doc().tagtext().
from yattag import Doc

doc, tag, text = Doc().tagtext()
with tag('h1'):
    text('Hello world!')
doc.getvalue()

Output:
'<h1>Hello world!</h1>'
OpenPyXL is a library for interacting with Excel 2010 files. OpenPyXL can read and write to .xlsx and .xlsm files.
To install OpenPyXL with pip: pip install openpyxl
To load an existing workbook in OpenPyXL we need to use the load_workbook function. We also need to select the sheet we are reading the data from. In our example, we are using popular baby names in New York City. You can access the dataset from the link at the bottom of this post.
I have created a workbook named NY_baby_names.xlsx with one sheet of data, Sheet1. The worksheet has the following headers: Year of Birth, Gender, Child's First Name, Count, Rank. You can download the Excel file from my website here.
To access the data with OpenPyXL, do the following:
from openpyxl import load_workbook

wb = load_workbook("NY_baby_names.xlsx")
ws = wb.worksheets[0]
for row in ws.iter_rows(min_row=1, max_row=2, min_col=1, max_col=4):
    print([cell.value for cell in row])

Output:
['Year of Birth', 'Gender', "Child's First Name", 'Count']
[2011, 'FEMALE', 'GERALDINE', 13]
First, we load the workbook with load_workbook, and then select the first worksheet. We then iterate through the first two rows with the iter_rows method.
After the imports, we load the workbook and the worksheet. We then create our Yattag document. We fill the headers with Yattag's asis() method. The asis method lets us insert any string into the document as it is, without escaping.
We then create our main tag, Babies. We start looping through our sheet with the iter_rows method. The iter_rows method returns a generator that yields the rows of cells. We use a list comprehension to get all the values from the cells in each row.
Next, we are adding the babies. Notice the use of the with statement together with tag and text. When we are finished we indent our result with Yattag's indent function.
Finally, we save our file. The output should look like below. Notice that I only included two babies in the output.
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"></xs:schema>
<Babies>
    <Baby>
        <Name>
            GERALDINE
        </Name>
        <Gender>
            FEMALE
        </Gender>
        <year>
            2011
        </year>
        <count>
            13
        </count>
        <rank>
            75
        </rank>
    </Baby>
    <Baby>
        <Name>
            GIA
        </Name>
        <Gender>
            FEMALE
        </Gender>
        <year>
            2011
        </year>
        <count>
            21
        </count>
        <rank>
            67
        </rank>
    </Baby>
</Babies>
from openpyxl import load_workbook
from yattag import Doc, indent

wb = load_workbook("NY_baby_names.xlsx")
ws = wb.worksheets[0]

# Create Yattag doc, tag and text objects
doc, tag, text = Doc().tagtext()

xml_header = '<?xml version="1.0" encoding="UTF-8"?>'
xml_schema = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"></xs:schema>'

doc.asis(xml_header)
doc.asis(xml_schema)

with tag('Babies'):
    # Use ws.max_row for all rows
    for row in ws.iter_rows(min_row=2, max_row=100, min_col=1, max_col=5):
        row = [cell.value for cell in row]
        with tag("Baby"):
            with tag("Name"):
                text(row[2])
            with tag("Gender"):
                text(row[1])
            with tag("year"):
                text(row[0])
            with tag("count"):
                text(row[3])
            with tag("rank"):
                text(row[4])

result = indent(
    doc.getvalue(),
    indentation = '    ',
    indent_text = True
)

with open("baby_names.xml", "w") as f:
    f.write(result)
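If you cannot install Yattag, the same row-to-element mapping can be sketched with xml.etree.ElementTree from the standard library. Note that this is a different technique than the Yattag code in this post, and the function name rows_to_xml is my own. The sketch only emits the Babies element and takes plain tuples in the same column order as the worksheet, so it runs without an Excel file.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows):
    """Build a Babies document from (year, gender, name, count, rank) rows."""
    root = ET.Element("Babies")
    for year, gender, name, count, rank in rows:
        baby = ET.SubElement(root, "Baby")
        # Element text has to be a string, so numbers are converted
        ET.SubElement(baby, "Name").text = str(name)
        ET.SubElement(baby, "Gender").text = str(gender)
        ET.SubElement(baby, "year").text = str(year)
        ET.SubElement(baby, "count").text = str(count)
        ET.SubElement(baby, "rank").text = str(rank)
    return ET.tostring(root, encoding="unicode")


# The two babies from the output shown in this post
sample_rows = [
    (2011, "FEMALE", "GERALDINE", 13, 75),
    (2011, "FEMALE", "GIA", 21, 67),
]
xml_text = rows_to_xml(sample_rows)
print(xml_text)
```

With OpenPyXL you could feed it the values of each row from iter_rows instead of the sample tuples.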
Popular Baby Names dataset: https://catalog.data.gov/dataset/most-popular-baby-names-by-sex-and-mothers-ethnic-group-new-york-city-8c742
I got an excellent question in my OpenPyXL course. The question was how to transfer data from one Excel workbook to another with OpenPyXL. Transferring data only is easier, as there is no need to worry about the formatting of cells and sheets. The complete code is available at the bottom of this post.
The question was how to copy data from one sheet in a workbook to several other sheets in a newly created workbook. The source workbook is WB1 with the source sheet WS1. The source sheet has 1000 rows of data. The data shall be copied to the new workbook WB2 and the sheets WS1 to WS10.
To copy the values we need to import Workbook and load_workbook from the OpenPyXL library. We can now load our existing workbook, WB1, with WB1 = load_workbook("Source.xlsx", data_only=True). The next thing we need to do is set which sheet we are going to copy the data from. We name the sheet WB1_WS1 and select it with WB1_WS1 = WB1["WS1"]. After that we are ready to create a new workbook with WB2 = Workbook(). Notice the parentheses, they are needed to actually create the workbook object.
The question stated that 10 sheets should be created, with the names WS1 to WS10. We can create the sheets with a for loop. Each sheet is created with create_sheet(f"WS{i}"). Notice the usage of Python's f-string. We then remove the sheet that is created by default, Sheet. We could of course also have renamed it.
# Create WB2 sheets WS1-WS10
for i in range(1, 11):
    WB2.create_sheet(f"WS{i}")

# delete first sheet
WB2.remove(WB2.worksheets[0])
The next thing is to create a list for holding our copy ranges, and also which sheets we want to copy the data to. The copy_ranges list holds how many rows we need to copy from the source sheet to each of the sheets defined in copy_to_sheets.
# Define the copy ranges and sheets
copy_ranges = [100, 200, 50, 300, 350]
copy_to_sheets = ["WS1", "WS2", "WS3", "WS4", "WS4"]
We start with a for loop that iterates through the copy_ranges list: for i in range(len(copy_ranges)):
We then specify which sheet is next in turn for copying with ws = WB2[copy_to_sheets[i]]. Notice how we specify the sheet with the copy_to_sheets list. When i is 0 we select WS1 and so on. We also initialize our row offset, offset, to 1 so that we can keep track of which rows to copy next. We then set the offset with yet another for loop, which adds the first i values from copy_ranges to it.
for s in range(i):
    offset += copy_ranges[s]
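To see what the offset loop produces, here is a small sketch that computes all the start rows in one go with itertools.accumulate. It is only an alternative way to check the numbers, the variable names totals and start_rows are my own.

```python
from itertools import accumulate

copy_ranges = [100, 200, 50, 300, 350]

# Running totals of the rows copied so far: [100, 300, 350, 650, 1000]
totals = list(accumulate(copy_ranges))

# Each block starts one row after the previous block ends,
# and the first block starts at row 1
start_rows = [1] + [total + 1 for total in totals[:-1]]

for start, count in zip(start_rows, copy_ranges):
    print(f"rows {start} to {start + count - 1}")
```

The last total is 1000, which matches the 1000 rows of data in the source sheet.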
Now it is time to fill our sheets with data! We traverse our offset range with a for loop and set the values of the corresponding sheet. First we get the row number with for j in range(offset, offset + copy_ranges[i]):
Next up are the cells in each row:
for row in WB1_WS1.iter_rows(min_row=j, max_row=j, min_col=1, max_col=WB1_WS1.max_column):
We get the values for values_row with the list comprehension [cell.value for cell in row]. Finally, we append the row to the sheet with ws.append(values_row).
# Copy the row with the help of iter_rows, append the row
for j in range(offset, offset + copy_ranges[i]):
    #if j == 0:
    #    continue
    for row in WB1_WS1.iter_rows(min_row=j, max_row=j, min_col=1, max_col=WB1_WS1.max_column):
        values_row = [cell.value for cell in row]
        ws.append(values_row)
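The two nested loops can also be collapsed into one call to iter_rows that covers the whole block of rows. The sketch below assumes a reasonably recent OpenPyXL release where iter_rows accepts the values_only argument. The helper name copy_rows is my own and not part of OpenPyXL.

```python
def copy_rows(source_ws, target_ws, first_row, row_count):
    """Append row_count rows of values from source_ws, starting at first_row."""
    last_row = first_row + row_count - 1
    # values_only=True makes iter_rows yield plain tuples of cell values
    # instead of cell objects, so no list comprehension is needed
    for values in source_ws.iter_rows(min_row=first_row, max_row=last_row, values_only=True):
        target_ws.append(values)
```

Inside the outer loop it would be called as copy_rows(WB1_WS1, ws, offset, copy_ranges[i]).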
To wrap up, we save the workbook: WB2.save("WB2.xlsx")
That's it! Please comment below if you have questions or any feedback. See you later!
Use this link, Control Excel with Python & OpenPyXL or this code SAVEOPENPYXL to get the course for a discount on Udemy.com
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Could you please suggest how to copy the data from on work book to other book with specified rows

Source:
Excel work book "WB1" having work sheet "WS1", This sheet having 1000 rows of data

Destination:
New work book 'WB2' and work sheets WS1,WS2...WS10

Could you please suggest the code for following condition:
Copy the first 100 rows data and paste it WS1 sheet
Copy the next 200 rows data and paste it WS2 sheet
Copy the next 50 rows data and paste it WS3 sheet
Copy the next 300 rows data and paste it WS4 sheet
Copy the next 350 rows data and paste it WS4 sheet
"""
from openpyxl import Workbook, load_workbook

WB1 = load_workbook("Source.xlsx", data_only=True)
WB1_WS1 = WB1["WS1"]
WB2 = Workbook()

# Create WB2 sheets WS1-WS10
for i in range(1, 11):
    WB2.create_sheet(f"WS{i}")

# delete first sheet
WB2.remove(WB2.worksheets[0])

# Define the copy ranges and sheets
copy_ranges = [100, 200, 50, 300, 350]
copy_to_sheets = ["WS1", "WS2", "WS3", "WS4", "WS4"]

# Copy the values from the rows in WB1 to WB2.
for i in range(len(copy_ranges)):
    # Set the sheet to copy to
    ws = WB2[copy_to_sheets[i]]
    # Initialize row offset
    offset = 1
    # Set the row offset
    for s in range(i):
        offset += copy_ranges[s]
    # Copy the row with the help of iter_rows, append the row
    for j in range(offset, offset + copy_ranges[i]):
        #if j == 0:
        #    continue
        for row in WB1_WS1.iter_rows(min_row=j, max_row=j, min_col=1, max_col=WB1_WS1.max_column):
            values_row = [cell.value for cell in row]
            ws.append(values_row)

# Save the workbook
WB2.save("WB2.xlsx")
Saving a finished report or table in Excel is easy. You choose Save As and save the sheet as a PDF. Doing this automatically with Python is a bit trickier though. In this post, we will take a closer look at how to do this with the win32 library. The full code is available at the bottom of the post. Note that you need Excel installed in order to run this script successfully.
Install the win32 library first with: pip install pypiwin32. This will install the Win32 API library, which according to PyPI contains: "Python extensions for Microsoft Windows. Provides access to much of the Win32 API, the ability to create and use COM objects, and the Pythonwin environment."
To get the file paths we use pathlib. Pathlib was introduced in Python 3.4 so it is quite new (Using Python 3.8 during the writing of this article). We specify the name of the Excel workbook we want to make a pdf of, and also the output pdf's name.
excel_file = "pdf_me.xlsx"
pdf_file = "pdf_me.pdf"
We then create paths from our current working directory (cwd) with pathlib's Path.cwd() method.
excel_path = str(pathlib.Path.cwd() / excel_file)
pdf_path = str(pathlib.Path.cwd() / pdf_file)
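As a small variation, the pdf path can be derived from the Excel path instead of typing the name twice. This is only a sketch of that idea using pathlib's with_suffix method, the post itself keeps the two separate names.

```python
import pathlib

excel_file = "pdf_me.xlsx"

# Absolute path to the workbook in the current working directory
excel_path = pathlib.Path.cwd() / excel_file
# Same folder and base name, but with a .pdf extension
pdf_path = excel_path.with_suffix(".pdf")

# Excel expects plain strings, so convert before passing the paths on
print(str(excel_path))
print(str(pdf_path))
```

Absolute paths matter here because Excel does not necessarily share the working directory of our script.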
Excel is next up. We start the Excel application and hide it.
excel = client.DispatchEx("Excel.Application")
excel.Visible = 0
We then open our workbook with wb = excel.Workbooks.Open(excel_path) and load our first sheet with ws = wb.Worksheets[1]
Now it is time to use SaveAs to save our sheet as a pdf: wb.SaveAs(pdf_path, FileFormat=57)
File format 57 is the pdf file format.
We then close our workbook and quit our Excel application. Our pdf is now saved in our working directory.
from win32com import client
import win32api
import pathlib

### pip install pypiwin32 if module not found

excel_file = "pdf_me.xlsx"
pdf_file = "pdf_me.pdf"
excel_path = str(pathlib.Path.cwd() / excel_file)
pdf_path = str(pathlib.Path.cwd() / pdf_file)

excel = client.DispatchEx("Excel.Application")
excel.Visible = 0

wb = excel.Workbooks.Open(excel_path)
ws = wb.Worksheets[1]

try:
    wb.SaveAs(pdf_path, FileFormat=57)
except Exception as e:
    print("Failed to convert")
    print(str(e))
finally:
    wb.Close()
    excel.Quit()
In this post, we are going to have a look at how to split all pages from a single pdf into one-page pdf files. Splitting a pdf into several pages can easily be done with almost any pdf tool worth its salt. However, splitting a pdf into single pages is a manual operation, and if you have to do it on several pdfs an automated tool makes sense. This is where PyPDF2 comes in handy. If you just want the complete code without all the fancy explanations, you can find it at the end.
If you haven't done so already, fire up your command prompt, PowerShell or terminal and install PyPDF2 with pip.
pip install pypdf2
Currently I am running 32-bit Python 3.8 with PyPDF2 version 1.26.0 on Windows 10. The code works on this setup, and probably also on other operating systems.
We start with importing PdfFileWriter and PdfFileReader so that we can read the existing pdf and later write new pdfs. We also need to import os so that we can check what files we have in our working directory.
from PyPDF2 import PdfFileWriter, PdfFileReader
import os
First we use a list comprehension over os.listdir(".") that keeps an entry only if it is a file, which we check with os.path.isfile(f). After that we keep only the pdf files from the list files with files = list(filter(lambda f: f.lower().endswith((".pdf")), files)).
files = [f for f in os.listdir(".") if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith((".pdf")), files))
Now it is time to process all our pdf files. We go through each pdf in files with the for loop for pdf in files:
We then open the pdf with open(pdf, "rb") as f: and load each pdf into a PdfFileReader object with inputpdf = PdfFileReader(f).
Now it is time to start the splitting. With another for loop, we loop through all pages in the pdf. You can get the number of pages with numPages. For each page we create a PdfFileWriter object named output and add the current page with getPage(i). We name the output pdf with the original name, add -Page and the page number: name = pdf[:-4]+"-Page "+str(i)+".pdf". Note that i starts at 0, so the first file ends with -Page 0.pdf. Finally, we save the output.
with open(name, "wb") as outputStream:
    output.write(outputStream)
from PyPDF2 import PdfFileWriter, PdfFileReader
import os

files = [f for f in os.listdir(".") if os.path.isfile(f)]
files = list(filter(lambda f: f.lower().endswith((".pdf")), files))

for pdf in files:
    with open(pdf, "rb") as f:
        inputpdf = PdfFileReader(f)
        for i in range(inputpdf.numPages):
            output = PdfFileWriter()
            output.addPage(inputpdf.getPage(i))
            name = pdf[:-4]+"-Page "+str(i)+".pdf"
            with open(name, "wb") as outputStream:
                output.write(outputStream)
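The script numbers the output files from 0, since that is how Python counts. If you would rather have page numbers that start at 1 and sort nicely in a file browser, the naming could be sketched like this. The helper page_filename and its zero padding are my own variation, not what the script above does.

```python
import os


def page_filename(pdf_name, page_index, total_pages):
    """Build an output name such as 'report-Page 03.pdf' for a zero-based page index."""
    # splitext also handles extensions that are not exactly four characters long
    base, _extension = os.path.splitext(pdf_name)
    # Pad with zeros so that page 2 sorts before page 10
    width = len(str(total_pages))
    return f"{base}-Page {page_index + 1:0{width}d}.pdf"


print(page_filename("report.pdf", 2, 12))  # report-Page 03.pdf
```

In the loop it would replace the line that builds name, called as page_filename(pdf, i, inputpdf.numPages).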
I have finally published my first course, Control Excel with Python & OpenPyXL!
The course covers the ins and outs of how to control and automate Excel with the OpenPyXL library.
I am offering this course for a discount for you who are reading this post.
Using Python is wonderful. In fact, it is so wonderful that you sometimes want other, non-Python users to try out your scripts. You can tell them all day long to go to python.org and download the latest version. However, most people will ask you why you cannot just make an installer or a .exe for them instead.
I have made a few executable files myself and will share some of the ways I have done it. Note that I am using Python 3.6, so results can differ if you are using Python 2 or Python 3.7. Occasionally, the packaging does not work without some modification when you are using the latest Python version. Try using an older Python version if the packaging does not work out of the box.
How to run multiple threads in Python: While coding, we sometimes want the program to do multiple things at once. You may have a script where you run a background function, checking for changes as you go along. It can also be as simple as disabling an Entry to not allow the user to input more text for a while. In this post, we will look at a simple way to use multiple threads in Python.
In this article, we will explore how to create a simple button in Tkinter. The button is used for a lot of user interaction and is often the most basic way to interact with a user. We will also cover some of the basics of how a Tkinter application is launched.