4 Jan 2010

A closer look at OPF

Let's take a closer look at the Open Packaging Format.

Recall from the previous post that an electronic publication conforming to the OPF standard must provide a package document. This must be an XML document with a root element of <package>, which includes elements called <metadata>, <manifest>, and <spine>. Figure 1 shows an overview of the package document that is delivered with our sample epub ebook.

Figure 1. Package Overview

You can see this document has the correct set of package elements. Before looking at each in detail, there are two things I'd like to point out in this screenshot. First, the unique-identifier attribute in the <package> element:
unique-identifier="EPB-UUID"
This attribute tells the software of the reading device to look out for a metadata item of type <dc:identifier> that has an 'id' attribute with the value 'EPB-UUID'. The value of this element is an identifier that is globally unique. Towards the bottom of the metadata you will find the following element:
<dc:identifier   id="EPB-UUID">
     urn:uuid:CBC56AFC-6C29-1014-8672-92A1DF1F0AF1
</dc:identifier>
That 32-digit hexadecimal value is the GUID (Globally Unique Identifier) generated by epubBooks to identify this particular publication. Of course, an ISBN is another globally unique identifier, and if you are in the publishing business and routinely buy ISBNs by the dozen, you would probably insert the ISBN here.
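
To make that lookup concrete, here is a minimal Python sketch of how reading software might resolve the unique identifier. It is only an illustration: the trimmed-down package document is my own, I'm assuming the <package> element declares the OPF namespace as its default, and the function name unique_identifier is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down package document (assumes the OPF namespace
# is the default namespace on <package>).
OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0"
         unique-identifier="EPB-UUID">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier id="EPB-UUID">
      urn:uuid:CBC56AFC-6C29-1014-8672-92A1DF1F0AF1
    </dc:identifier>
  </metadata>
</package>"""

# The Dublin Core namespace URI ends with a trailing slash.
NS = {"opf": "http://www.idpf.org/2007/opf",
      "dc": "http://purl.org/dc/elements/1.1/"}

def unique_identifier(opf_xml):
    """Return the text of the dc:identifier named by unique-identifier."""
    package = ET.fromstring(opf_xml)
    wanted_id = package.get("unique-identifier")
    for ident in package.findall("opf:metadata/dc:identifier", NS):
        if ident.get("id") == wanted_id:
            return (ident.text or "").strip()
    return None  # no matching identifier element

uid = unique_identifier(OPF)
print(uid)
```

If a package carried several <dc:identifier> elements, only the one whose 'id' matches the attribute on <package> would be returned.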

Second, the attributes in the <metadata> element:
xmlns:opf="http://www.idpf.org/2007/opf"
xmlns:dc="http://purl.org/dc/elements/1.1/"
indicate that the prefix 'dc' refers to the Dublin Core element definitions (see below for more detail), and that the 'opf' prefix refers to OPF extensions to the Dublin Core. For instance, look at the two date elements in the metadata:

<dc:date opf:event="original-publication">1922</dc:date>
<dc:date opf:event="epub-publication">2009-09-24</dc:date>
The 'dc' prefix on the 'date' elements identifies them as publication dates that follow the Dublin Core specification. The 'opf' prefix on the 'event' attribute identifies 'event' as belonging to the OPF specification.
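
As a rough sketch of how the two namespaces show up in code, the Python below reads each date together with its event. It assumes the standard Dublin Core namespace URI (which ends in a slash); the function name dates_by_event is hypothetical.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The two date elements from the sample, wrapped in a bare <metadata>.
METADATA = """<metadata xmlns:opf="http://www.idpf.org/2007/opf"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:date opf:event="original-publication">1922</dc:date>
  <dc:date opf:event="epub-publication">2009-09-24</dc:date>
</metadata>"""

def dates_by_event(metadata_xml):
    """Map each opf:event value to the text of its dc:date element."""
    root = ET.fromstring(metadata_xml)
    dates = {}
    # ElementTree spells a namespaced name as {namespace-uri}localname.
    for date in root.findall(f"{{{DC_NS}}}date"):
        event = date.get(f"{{{OPF_NS}}}event")
        dates[event] = (date.text or "").strip()
    return dates

dates = dates_by_event(METADATA)
print(dates["original-publication"])  # prints 1922
```

The element is looked up in the 'dc' namespace and the attribute in the 'opf' namespace, which mirrors the split between the two specifications.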

Unfortunately, looking at the OPF specification, it seems the publisher is free to give the event attribute whatever value they like:
"The set of values for event are not defined by this specification; possible values may include: creation, publication, and modification."
Now, let's look in more detail at the package metadata.

Package <metadata>
The <metadata> element of the package can contain wide ranging information about the publication. To keep OPF as open as possible, the metadata of an OPF package makes use of another open standard, namely the Dublin Core Metadata standard.

The Dublin Core is an initiative working towards standard ways of describing resources. It actively promotes the standardised sharing of information, thereby increasing interoperability between organisations: let's all agree to call a spade a spade and not a shovel or a digger.

The Dublin Core has a wider scope than just ebooks. However, there is a rich set of attributes that can be applied to electronic publications. Figure 2 shows the metadata that epubBooks placed in the package of The Curious Case of Benjamin Button.

Figure 2. Package Metadata

For convenience, I'll list the metadata elements again here:
  • title
  • language
  • identifier
  • ---------
  • creator
  • date(s)
  • publisher
  • subject
  • source
  • rights
The elements above the dashed line are mandatory. The OPF Package Schema says that there must be at least one title, at least one language element, and at least one identifier element. All other elements are optional.
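
A check for the three mandatory elements could be sketched in Python along these lines. This is an illustration rather than a real validator: the sample metadata is hypothetical (the 'en' language value is my assumption) and deliberately leaves out the identifier.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
REQUIRED = ("title", "language", "identifier")

# Hypothetical metadata with no <dc:identifier>, to trigger the check.
METADATA = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>The Curious Case of Benjamin Button</dc:title>
  <dc:language>en</dc:language>
</metadata>"""

def missing_required(metadata_xml):
    """List the mandatory Dublin Core elements that do not appear at all."""
    root = ET.fromstring(metadata_xml)
    return [name for name in REQUIRED
            if root.find(f"{{{DC_NS}}}{name}") is None]

missing = missing_required(METADATA)
print(missing)  # prints ['identifier']
```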

Title
In fact the schema says there must be 'One Or More' of the mandatory elements. In other words, there can be more than one title, more than one language, and more than one identifier. The standard does not specify which title should be displayed, only that a reading device should choose 'the most appropriate title' for display, perhaps based on available fonts or language.

Identifier
There can also be more than one identifier element in the metadata. We've seen above how the unique identifier is handled. If you wanted, you could publish an ebook with several identifiers: your internally generated identifier, a GUID, and your ISBN. You then have to say which is to be considered the globally unique identifier.

Language
The specification says there must be at least one <language> metadata element, but there may be more than one. I suppose if you were publishing an English-Mandarin dictionary or were writing a learned text about the Rosetta Stone you might have a reason to specify more than one language.

Full list of metadata elements
The following table summarises the full set of metadata elements that can appear in the <metadata> section of a <package>:

| Element | Number | Description |
| --- | --- | --- |
| title | One or more | The title of the publication. As we've seen, there can be more than one, but there must be at least one title. |
| creator | Zero or more | The primary creators or authors of the publication. It is recommended that each element hold one name, in the form in which it should be presented to the reader. When there's more than one creator, it's expected they would be displayed in the order in which the elements appear in the metadata. Other contributors should be identified in <contributor> elements. |
| subject | Zero or more | The subject matter of the publication. There is no standardisation here. The optional text could be a sentence, a list of keywords, or one keyword per element. |
| description | Zero or more | The description(s) of the publication. |
| publisher | Zero or more | The publisher(s) of the publication. |
| contributor | Zero or more | The person(s) making contributions to the publication in a manner that is secondary to the role of creator. OPF defines nearly 30 different contributor roles and specifies the syntax for identifying them. |
| date | Zero or more | The publication date(s) for the publication. We've already seen that OPF extends the Dublin Core definition of this element, allowing different 'event' dates to be recognised. |
| type | Zero or more | The type(s) that describe the publication. This is relatively free-form, although the specification recommends using words from controlled vocabularies, i.e. selecting from a restricted set of words. Terms relating to genre, e.g. Young Adult, Fantasy, Literary, might be used, as well as terms like Fiction, Non-Fiction, etc. |
| format | Zero or more | The media-types of the publication. The recommendation is to use a MIME type. |
| identifier | One or more | One or more identifiers for the publication, one of which must be defined as a unique identifier. See the discussion above. |
| source | Zero or more | Identification of any other documents or publications from which the current publication is derived. |
| language | One or more | One or more language identifiers. |
| relation | Zero or more | Identifier(s) of resources to which the current publication is related. |
| coverage | Zero or more | Identifier(s) of the scope of the publication. OPF recommends following the Dublin Core specification of coverage and using a controlled vocabulary for geographical, temporal, and jurisdictional descriptions. |
| rights | Zero or more | An assertion of the rights of the publisher/creator with respect to this publication. |
Table 1. Package metadata elements

Package <manifest>
Figure 3 shows the <manifest> element of the OPF package for our sample epub ebook, The Curious Case of Benjamin Button.

Figure 3. Package Manifest

The package manifest identifies all of the resources that are needed to display the ebook fully and correctly. Each entry in the manifest consists of an <item> element, as in:
<item id="chapter-001" href="chapter-001.xml" media-type="application/xhtml+xml"/>
Each <item> element has an 'id' attribute which identifies this resource uniquely within the publication. It has an 'href' attribute which points to the content document; in the example above it's an XML document called 'chapter-001.xml'. The 'media-type' attribute in this example shows that the resource should be handled as an XHTML document.
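
For readers who prefer code, here is a small Python sketch that turns a manifest into a lookup table keyed by 'id'. The three items are taken from the sample manifest; declaring the OPF namespace directly on <manifest> and the function name read_manifest are my own assumptions for the sake of a self-contained example.

```python
import xml.etree.ElementTree as ET

NS = {"opf": "http://www.idpf.org/2007/opf"}

# Three items from the sample manifest; the namespace declaration on
# <manifest> is an assumption so the fragment stands alone.
MANIFEST = """<manifest xmlns="http://www.idpf.org/2007/opf">
  <item id="chapter-001" href="chapter-001.xml" media-type="application/xhtml+xml"/>
  <item id="main-css" href="css/book.css" media-type="text/css"/>
  <item id="epubbooks-logo" href="images/epubbooks-logo.png" media-type="image/png"/>
</manifest>"""

def read_manifest(manifest_xml):
    """Map each item's id to a (href, media-type) pair."""
    root = ET.fromstring(manifest_xml)
    return {item.get("id"): (item.get("href"), item.get("media-type"))
            for item in root.findall("opf:item", NS)}

items = read_manifest(MANIFEST)
print(items["chapter-001"])  # prints ('chapter-001.xml', 'application/xhtml+xml')
```

Keying the table by 'id' is convenient because other parts of the package refer to manifest items by that value.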

You can see that the manifest lists 13 content documents: a title page, a page of information about the publisher (epubBooks), and the 11 chapters of the story.

Each content document includes a 'link' element that refers to the CSS stylesheet 'book.css'. Therefore, the manifest includes an <item> for it:

<item id="main-css" href="css/book.css" media-type="text/css"/>
The publisher information document, epubbooksinfo.xml, includes an image which is the company logo. Therefore, the manifest includes an <item> for it:

<item id="epubbooks-logo" href="images/epubbooks-logo.png" media-type="image/png"/>

The rule is: if the content documents use it, it must be in the manifest. There are some aspects of the manifest that will be reserved for a future post so they don't clutter up this presentation. These cover media-types that are not part of the OPS Core, Out-Of-Line XML Islands, and the use of fallback documents to support these non-standard documents.
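
The "if it's used, it must be listed" rule can be checked mechanically. The Python below is a hedged sketch: the content document is hypothetical, 'images/cover.png' is an invented file that is deliberately left out of the manifest, and I'm assuming the content document sits in the same folder as the package document so that hrefs can be compared directly.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical content document; 'images/cover.png' is deliberately unlisted.
CONTENT = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><link rel="stylesheet" type="text/css" href="css/book.css"/></head>
  <body>
    <img src="images/epubbooks-logo.png" alt="logo"/>
    <img src="images/cover.png" alt="cover"/>
  </body>
</html>"""

# hrefs that the manifest declares (taken from the sample items).
MANIFEST_HREFS = {"chapter-001.xml", "css/book.css", "images/epubbooks-logo.png"}

def referenced_resources(xhtml):
    """Collect stylesheet and image references from one content document."""
    root = ET.fromstring(xhtml)
    refs = set()
    for link in root.iter(f"{{{XHTML_NS}}}link"):
        refs.add(link.get("href"))
    for img in root.iter(f"{{{XHTML_NS}}}img"):
        refs.add(img.get("src"))
    refs.discard(None)  # ignore elements lacking the attribute
    return refs

missing = sorted(referenced_resources(CONTENT) - MANIFEST_HREFS)
print(missing)  # prints ['images/cover.png']
```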

There's one more item in the manifest, and it's quite an important one:
<item id="ncx" href="epb.ncx" media-type="application/x-dtbncx+xml"/>
The 'id' attribute is set to 'ncx', the 'href' points to a file called 'epb.ncx', and the 'media-type' indicates that the resource should be handled as an NCX document. NCX is a standard way of declaring a Table of Contents. It's another open standard, this time maintained by the DAISY consortium.

This leads us nicely into the description of the third mandatory element of an OPF package - the <spine> element.

Package <spine>
Figure 4 shows the expanded <spine> element of our sample package.

Figure 4. Package Spine

The <spine> starts off like this:
<spine toc="ncx">
   <itemref idref="titlepage" linear="yes"/>
   <itemref idref="epubbooksinfo" linear="yes"/>
   <itemref idref="chapter-001" linear="yes"/>
   <itemref idref="chapter-002" linear="yes"/>
The element has a 'toc' attribute with the value 'ncx'. This is the value of the 'id' attribute of the item in the manifest that points to the mandatory table of contents document. In other words, it's a way to identify the table of contents. We saw in the last section that the manifest item with an id of 'ncx' points to a document called 'epb.ncx'. We'll look at NCX documents in more detail in a future article.
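
Following the 'toc' attribute to the table of contents file might look like this in Python. The package below is a cut-down, hypothetical version of the sample, and toc_href is a name I've made up for the sketch.

```python
import xml.etree.ElementTree as ET

NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical, cut-down package: just enough manifest and spine to resolve 'toc'.
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0"
         unique-identifier="EPB-UUID">
  <manifest>
    <item id="ncx" href="epb.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="titlepage" href="titlepage.xml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="titlepage" linear="yes"/>
  </spine>
</package>"""

def toc_href(package_xml):
    """Return the href of the manifest item named by the spine's toc attribute."""
    package = ET.fromstring(package_xml)
    toc_id = package.find("opf:spine", NS).get("toc")
    for item in package.findall("opf:manifest/opf:item", NS):
        if item.get("id") == toc_id:
            return item.get("href")
    return None  # toc attribute does not match any manifest item

toc = toc_href(PACKAGE)
print(toc)  # prints epb.ncx
```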

The next thing to notice about the <spine> is that it contains a list of <itemref> elements. Each <itemref> has an attribute called 'idref', and the value of an idref is the id of an item in the manifest.

For example, the first idref has value 'titlepage'. Look back at the manifest screenshot and you'll see that the first content document in the manifest has id="titlepage", and that item points to the content document itself (titlepage.xml).

The spine is a list of content documents and the important thing about the list is that it specifies the linear order in which the content documents should be displayed: title page, followed by the publisher's information page, followed by chapter 1, etc.
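
Here is a short Python sketch of how the reading order could be derived by joining the spine to the manifest. The package is a hypothetical, trimmed version of the sample (the 'toc' attribute is left out for brevity), and the manifest items are listed out of order on purpose to show that only the spine decides the sequence.

```python
import xml.etree.ElementTree as ET

NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical, trimmed package; manifest order differs from spine order.
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <manifest>
    <item id="chapter-001" href="chapter-001.xml" media-type="application/xhtml+xml"/>
    <item id="epubbooksinfo" href="epubbooksinfo.xml" media-type="application/xhtml+xml"/>
    <item id="titlepage" href="titlepage.xml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="titlepage" linear="yes"/>
    <itemref idref="epubbooksinfo" linear="yes"/>
    <itemref idref="chapter-001" linear="yes"/>
  </spine>
</package>"""

def reading_order(package_xml):
    """Return content document hrefs in the order given by the spine."""
    package = ET.fromstring(package_xml)
    href_by_id = {item.get("id"): item.get("href")
                  for item in package.findall("opf:manifest/opf:item", NS)}
    return [href_by_id[ref.get("idref")]
            for ref in package.findall("opf:spine/opf:itemref", NS)]

order = reading_order(PACKAGE)
print(order)  # prints ['titlepage.xml', 'epubbooksinfo.xml', 'chapter-001.xml']
```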

The <itemref> element has an optional attribute called 'linear'. This attribute takes a yes/no value and is used to indicate whether the referenced document is primary or auxiliary. This can be used by reading devices to show auxiliary information in a different way from the main flow of the primary information. In our case, the values are all set to 'yes', which is the default.
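
A reading system could separate primary from auxiliary documents roughly as in the Python sketch below. The spine is hypothetical: the 'notes' entry with linear="no" is invented to give the function something auxiliary to find, and the middle entry omits the attribute to show the default.

```python
import xml.etree.ElementTree as ET

NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical spine: 'notes' is an invented auxiliary document.
SPINE = """<spine xmlns="http://www.idpf.org/2007/opf" toc="ncx">
  <itemref idref="titlepage" linear="yes"/>
  <itemref idref="chapter-001"/>
  <itemref idref="notes" linear="no"/>
</spine>"""

def split_by_linear(spine_xml):
    """Return (primary, auxiliary) lists of idrefs, keeping spine order."""
    spine = ET.fromstring(spine_xml)
    primary, auxiliary = [], []
    for ref in spine.findall("opf:itemref", NS):
        # A missing 'linear' attribute is treated as 'yes', the default.
        if ref.get("linear", "yes") == "no":
            auxiliary.append(ref.get("idref"))
        else:
            primary.append(ref.get("idref"))
    return primary, auxiliary

primary, auxiliary = split_by_linear(SPINE)
print(primary)    # prints ['titlepage', 'chapter-001']
print(auxiliary)  # prints ['notes']
```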