22.12.2019

How to generate an XML file from Excel with Python

Introduction

In this post, we will take a look into how we can generate Extensible Markup Language (XML) files from an Excel file with Python. We will be using the Yattag package to generate our XML file and the OpenPyXL package for reading our Excel data.

Packages

Yattag

Yattag is described in its documentation as following:

Yattag is a Python library for generating HTML or XML in a pythonic way.

That pretty much sums Yatttag up, I find it as a simple, easy to use library that just works. I had been searching for this kind of library in order to more easily generate different XML files.

To install Yattag with pip: pip install yattag

Using Yattag

Adding a tag with Yattag is as easy as using the With keyword:

with tag('h1'):
    text('Hello world!')

Tags are automatically closed. To start using Yattag we need to import Doc from Yattag and create our Doc, tag and text with Doc().tagtext().

from yattag import Doc
doc, tag, text = Doc().tagtext()
with tag('h1'):
    text('Hello world!')
doc.getvalue()

Output:
'<h1>Hello world!</h1>'

OpenPyXL

OpenPyXL is a library for interacting with Excel 2010 files. OpenPyXL can read and write to .xlsx and .xlsm files.

To install OpenPyXL with pip: pip install openpyxl

Using OpenPyXL

To load an existing workbook in OpenPyXl we need to use the load_workbook method. We also need to select the sheet we are reading the data from. In our example, we are using popular baby names in New York City. You can access the dataset from the link at the bottom of this post.

I have created a workbook named NY_baby_names.xlsx with one sheet of data, Sheet1. The worksheet has the following headers: Year of Birth, Gender, Child's First Name, Count, Rank. You can download the Excel file from my website here.

To access the data with OpenPyXL, do the following:

from openpyxl import load_workbook
wb = load_workbook("NY_baby_names.xlsx")
ws = wb.worksheets[0]
for row in ws.iter_rows(min_row=1, max_row=2, min_col=1, max_col=4):
    print([cell.value for cell in row])

Output: 
['Year of Birth', 'Gender', "Child's First Name", 'Count']
[2011, 'FEMALE', 'GERALDINE', 13]

First, we load the workbook with load_workbook, and then select the first worksheet. We then iterate through the first two rows with the iter_rows method.

Generating the XML from Excel

After the imports, we load the workbook and the worksheet. We then create our Yattag document. We fill the headers with Yattags asis() method. The asis method enables us to input any string as the next line.

We then create our main tag, Babies. We start looping through our sheet with the iter_rows method. The iter_rows method returns a generator with all the cells. We use a list comprehension to get all the values from the cells.

Next, we are adding the babies. Notice the use of the With tag and text. When we are finished we indent our result with Yattags indent method.

Finally, we save our file. The output should look like below. Notice that I only included two babies in the output.

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"></xs:schema>
<Babies>
    <Baby>
        <Name>
            GERALDINE
        </Name>
        <Gender>
            FEMALE
        </Gender>
        <year>
            2011
        </year>
        <count>
            13
        </count>
        <rank>
            75
        </rank>
    </Baby>
    <Baby>
        <Name>
            GIA
        </Name>
        <Gender>
            FEMALE
        </Gender>
        <year>
            2011
        </year>
        <count>
            21
        </count>
        <rank>
            67
        </rank>
    </Baby>
</Babies>

Complete code

from openpyxl import load_workbook
from yattag import Doc, indent

wb = load_workbook("NY_baby_names.xlsx")
ws = wb.worksheets[0]

# Create Yattag doc, tag and text objects
doc, tag, text = Doc().tagtext()

xml_header = '<?xml version="1.0" encoding="UTF-8"?>'
xml_schema = '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"></xs:schema>'

doc.asis(xml_header)
doc.asis(xml_schema)

with tag('Babies'):
    # Use ws.max_row for all rows
    for row in ws.iter_rows(min_row=2, max_row=100, min_col=1, max_col=5):
        row = [cell.value for cell in row]
        with tag("Baby"):
            with tag("Name"):
                text(row[2])
            with tag("Gender"):
                text(row[1])
            with tag("year"):
                text(row[0])
            with tag("count"):
                text(row[3])
            with tag("rank"):
                text(row[4])

result = indent(
    doc.getvalue(),
    indentation = '    ',
    indent_text = True
)

with open("baby_names.xml", "w") as f:
    f.write(result)

Data sets

Popular Baby Names dataset: https://catalog.data.gov/dataset/most-popular-baby-names-by-sex-and-mothers-ethnic-group-new-york-city-8c742

Leave a Reply

Your email address will not be published. Required fields are marked *

linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram