Patrice's Blog

Reading an ODF document with odfpy

A few things I learned about reading an ODF document using the Python odfpy library.

Odfpy is quite a powerful library for reading and writing ODF documents, but its documentation is incredibly mediocre.   If you look around, you’ll find an appalling hundred-page ODT specification that is of no use whatsoever.   It was probably written by some contractor who was committed to writing specifications and made up a program for generating loads of bullshit.   Which is quite appropriate in a way, since odfpy seems to be primarily meant for having a program write an ODF file, not so much read it.   At least he was eating his own dog food, but to no avail.

I searched and searched, but all Google could tell me was that many others were looking for the same thing:  decent information about how to use odfpy.   Apart from that, it seems to be a good project, and still alive, though the team doesn’t seem to care about developers being able to use it.

That’s how I spent a few days groping around, trying to find a way to navigate and read an ODT file and extract information from it: not just text, but styles and formatting as well.   My real need was to get some information into a web site, but HTML is generally not the recommended way to handle formatted text in a web application, so my goal was to extract some formatted text from an ODT file and convert it into Markdown.

Markdown is a very simple text formatting syntax developed by John Gruber and Aaron Swartz.   Some Python templating systems, among others, have a good Markdown-to-HTML filter, which lets you manage content as Markdown text, including for editing, and insert it into pages transparently.

The Odf document structure

There are very good sources explaining the Open Document Format.  I will just go through the minimum stuff, which may be sufficient to browse your odf doc with odfpy, and I will only describe the ODF document structure as you see it through the odfpy classes.

The document object is a big hierarchy of node elements, starting at doc.text.  The most important attribute for navigating the tree is node.childNodes, the list of a node’s children, as in:

for n in startNode.childNodes:
    # do something with each n node

Now nodes have a number of attributes, but the most important are:

  • nodeType:  the type of node, usually either 1 for ‘element nodes’ or 3 for ‘text nodes’.   Text nodes do not have the attributes that element nodes have, so before doing anything with a node, you should test whether it’s an element node or a text node.
  • attributes:  a dictionary of attributes (only for nodeType==1).
  • qname:  an (a,b) tuple where a is the XML namespace and b is a meaningful name for the element, so most of the time you will be looking at node.qname[1].
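Those three attributes are enough to tell nodes apart.   Here is a minimal sketch that relies only on what is described above (nodeType is 1 or 3, and qname is a tuple whose second item is the element’s name).   The helper name node_label is my own, not something odfpy provides:

```python
ELEMENT_NODE = 1  # numeric node types, as listed above
TEXT_NODE = 3

def node_label(node):
    """Return a short label: the element's local name, or '#text'."""
    if node.nodeType == TEXT_NODE:
        return "#text"
    if node.nodeType == ELEMENT_NODE:
        return node.qname[1]
    # Anything else is unexpected here; label it rather than fail.
    return "#other"
```

Testing nodeType first matters because text nodes have no qname to look at.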

The most interesting qnames are:

  • p:  A paragraph, equivalent to html’s <p> element.
  • span:  Equivalent to html’s <span> element.   It’s a piece of text with identical style.
  • s: An element that represents a certain number of spaces.
  • table, table-row, table-cell:  The pieces of a table.

There are many others, but they are less frequent, such as soft-page-break, which is what the name says.
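To show how these element types fit together, here is a rough sketch that rebuilds the plain text of a paragraph.   It assumes that text nodes expose their content as .data, that the ‘s’ element stores its number of spaces in an attribute named ‘c’ (defaulting to one space), and that tabs and line breaks appear as ‘tab’ and ‘line-break’ elements.   The function name plain_text is mine:

```python
def plain_text(node):
    """Collect the text under a node, expanding 's' elements into spaces."""
    if node.nodeType == 3:
        # Text node (assumption: its content is available as .data).
        return node.data
    name = node.qname[1]
    if name == "s":
        # Assumption: the repeat count is in the 'c' attribute, default 1.
        count = 1
        for key, value in node.attributes.items():
            if key[1] == "c":
                count = int(value)
        return " " * count
    if name == "tab":
        return "\t"
    if name == "line-break":
        return "\n"
    # p, span and everything else: concatenate the children's text.
    return "".join(plain_text(child) for child in node.childNodes)
```

Calling plain_text() on a ‘p’ element should give you the paragraph as one string, whatever spans it is split into.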

The ‘attributes’ dictionary of an element node is not straightforward to deal with.   You can access its keys with the usual keys() call, but each key is an (a,b) tuple, where only b can easily be understood and managed.

The one attribute that’s interesting to read is the stylename.  ‘p’ elements and ‘span’ elements have a stylename attribute, which is a reference to a style, either an automatic style or a user-defined style.   An automatic style is created automatically when a user applies bold, italic, font, … formatting directly, without defining a style.
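Since only the second item of each key is useful, it can help to flatten the attributes into an ordinary dictionary first.   The sketch below does that; both helper names are mine, and it assumes that the style reference is stored under the name ‘style-name’ in the raw dictionary, which is worth checking against a dump of your own document:

```python
def local_attrs(node):
    """Flatten an element's attributes to {name: value}, dropping the first key item."""
    if node.nodeType != 1:
        return {}  # text nodes have no attributes
    return {key[1]: value for key, value in node.attributes.items()}

def style_name_of(node):
    """Return the name of the style a node refers to, or None if it has none."""
    # Assumption: the reference is stored under 'style-name'.
    return local_attrs(node).get("style-name")
```

If two attributes of the same element ever shared the same second item, the later one would overwrite the earlier one in this flattened dictionary.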

With this in mind, here is a simple recursive function that will print a formatted dump of an ODF document structure and attributes:

def odf_dump_nodes(start_node, level=0):
    indent = "  " * level
    if start_node.nodeType == 3:
        # text node
        print(indent, "NODE:", start_node.nodeType, ":(text):", str(start_node))
    else:
        # element node: collect "name:value" for each attribute
        attrs = []
        for k in start_node.attributes.keys():
            attrs.append(k[1] + ':' + start_node.attributes[k])
        print(indent, "NODE:", start_node.nodeType, ":", start_node.qname[1],
              " ATTR:(", ",".join(attrs), ") ", str(start_node))

        # recurse into the children, one indentation level deeper
        for n in start_node.childNodes:
            odf_dump_nodes(n, level + 1)

The function prints a short description of the current node, and all of its attributes, then calls itself recursively for each of its child nodes, with depth level used for indentation.   It’s a short and easy to understand function but it will give you a good understanding of your document’s structure.    Note that we only consider the second element of the key’s tuple, k[1].

You can apply it to any sample ODF document:

from odf.opendocument import load

doc= load(filepath_and_filename)
odf_dump_nodes( doc.text )

If you are looking for a given kind of element within your document, you can use the node.getElementsByType() method, passing it one of the element constructors that odfpy exposes (Table, TableRow, …).   It returns a list of the matching elements, as in:

from odf.opendocument import load
from odf.table import Table, TableRow, TableCell

doc= load(filepath_and_filename)
tables= doc.text.getElementsByType(Table)

So if you were looking for something within the third column of the fifth row of the first table, you might do:

cell= doc.text.getElementsByType(Table)[0].getElementsByType(TableRow)[4].getElementsByType(TableCell)[2]

cell is an element node, which would generally have ‘p’ elements as its childNodes.
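If you want a whole table rather than a single cell, here is a rough sketch that turns a table element into a list of rows of text.   As far as I can tell, getElementsByType() searches the whole subtree, so this version walks direct children only, to avoid picking up the rows of a table nested inside a cell.   The helper names are mine, and it assumes that rows and cells are named ‘table-row’ and ‘table-cell’ and that text nodes expose .data:

```python
def children_named(node, local_name):
    """Direct child elements whose qname[1] equals local_name."""
    return [c for c in node.childNodes
            if c.nodeType == 1 and c.qname[1] == local_name]

def cell_text(node):
    """Concatenate all the text found under a node."""
    if node.nodeType == 3:
        return node.data  # assumption: text nodes expose .data
    return "".join(cell_text(c) for c in node.childNodes)

def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for row in children_named(table, "table-row"):
        rows.append([cell_text(cell)
                     for cell in children_named(row, "table-cell")])
    return rows
```

With this, table_to_rows(doc.text.getElementsByType(Table)[0])[4][2] would be the text of the same cell as above.   Rows that are wrapped in another element (header rows, for instance) are not direct children of the table, so this sketch will miss them, and a cell holding several paragraphs comes back as one string with no separator.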

Understanding styles

The styles are described in doc.automaticstyles, and automaticstyles, like the elements described above, have childNodes, and the childNodes have attributes.

So you can dump automatic styles by:

for ast in doc.automaticstyles.childNodes:
    # attributes of the style itself (name, family, ...)
    for k in ast.attributes.keys():
        print(k[1], ":", ast.attributes[k])
    # attributes of each first-level child node
    for n in ast.childNodes:
        for k in n.attributes.keys():
            print(n.qname[1], "/", k[1], ":", n.attributes[k])

We print the attributes of each ast style, then the attributes of each of its child nodes.   It might be necessary to dive into the child nodes’ children, but from what I’ve seen, the relevant attributes are in the first-level nodes.

Now, in order to be able to get style information quickly while you are scanning through the document, you need to put this style data into something easy to query, such as a Python dict object.   Let’s have a dict object for describing a style, with (attribute_name : attribute_value) pairs, and a global styles dictionary with (style_name : style_attributes_dict) pairs.   Once this is set up, we will be able to access the font name of the ‘P13’ style with styles['P13']['text-properties/font-name'].

The following code builds these dictionaries:

def get_styles(doc):
    # maps style name -> dict of that style's attributes
    styles= {}
    for ast in doc.automaticstyles.childNodes:

        name= ast.getAttribute('name')
        style= {}
        styles[name]= style

        # attributes of the style node itself
        for k in ast.attributes.keys():
            style[k[1]]= ast.attributes[k]
        # attributes of its child nodes, prefixed with the child's name
        for n in ast.childNodes:
            for k in n.attributes.keys():
                style[n.qname[1] + "/" + k[1]]= n.attributes[k]

    return styles

Note that within the dictionary that represents a given style, we have concatenated the child node’s name (n.qname[1]) with the attribute’s name (k[1]).

For example, the node that holds text formatting is named ‘text-properties’, so the attribute ‘font-name’ is found in our dictionary at style['text-properties/font-name'].
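This is the piece I needed for the Markdown conversion.   Here is a minimal sketch of how a style dictionary like the one above could drive it.   It assumes that bold and italic show up as ‘text-properties/font-weight’ equal to ‘bold’ and ‘text-properties/font-style’ equal to ‘italic’, which you should confirm against a dump of your own styles.   The function name span_to_markdown is mine:

```python
def span_to_markdown(text, style):
    """Wrap text in Markdown emphasis markers according to a style dict."""
    # Assumption: these are the keys and values used for bold and italic.
    bold = style.get("text-properties/font-weight") == "bold"
    italic = style.get("text-properties/font-style") == "italic"

    core = text.strip()
    if not core:
        return text  # nothing to emphasize in whitespace-only text

    # Keep surrounding whitespace outside the markers, since Markdown
    # does not treat "** text **" as emphasis.
    lead = text[:len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    marker = "**" * bold + "*" * italic
    return lead + marker + core + marker + trail
```

While scanning the document, you would call it as span_to_markdown(text, styles.get(style_name, {})), so that a span referring to a style missing from the dictionary falls back to plain text.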

 
