1 Jan 2010

First look inside an epub ebook

In this article we will download an ebook from the web and examine its structure using common browsing tools. For this exercise you will need to be able to view the contents of Zip files and to display XML files.

I have a fully paid-for and licensed copy of WinZip, so that's what I'll be using in the instructions and screenshots. If you don't have WinZip, there are many other ways of opening Zip files; take a look at How to open a Zip file without WinZip for some examples.

For browsing XML documents I use Internet Explorer and Visual Web Developer Express Edition from Microsoft. If you don't use Internet Explorer, most other browsers will also display them; or take a look at Viewing XML Files at W3Schools.

The purpose of this exercise is to get our hands dirty on a real ebook without looking too closely at the standards documents referred to in the Introduction to epub post.

Download an ebook in the epub format
Go to the epubBooks website and download the free ebook: The Curious Case of Benjamin Button. This is a short story by F. Scott Fitzgerald and, as an ebook, has a fairly simple structure. This makes it suitable for illustrating some key ideas about how epub books are put together. For convenience, save the download to a new folder.


Figure 1. Downloaded epub ebook


Figure 1 shows how the saved ebook looks on my machine. Notice that the ebook download consists of a single file with the file extension .epub.

Open the ebook using your Zip file viewer
We'll look at the Open Publication Structure shortly, but for now open the file using WinZip, or whichever program you are using to view Zip files. Figure 2 shows the contents of the file when the folder view of WinZip is switched on.


Figure 2. epub document opened in WinZip

Figure 2 demonstrates first of all that the epub format uses Zip compression technology to package the different parts of an ebook into a single file. Next, notice that, at the top level of the folder hierarchy, there is a file called 'mimetype' and there are two folders called 'META-INF' and 'OPS'. The OPS folder has two sub-folders called 'css' and 'images'.
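
Because an epub file is an ordinary Zip archive, the same top-level view can be reproduced in a few lines of code. The following is a minimal Python sketch using only the standard library; the function name top_level_entries is an illustrative assumption, not part of the ebook or of any epub tool.

```python
import zipfile


def top_level_entries(epub_path):
    """Return the sorted names of the top-level files and folders in an epub.

    epub_path may be a file name or an open binary file object.
    """
    with zipfile.ZipFile(epub_path) as archive:
        # Names inside a Zip archive use "/" as the separator, so the first
        # path component is the top-level file or folder.
        return sorted({name.split("/", 1)[0] for name in archive.namelist()})
```

Called with the path of the downloaded file, this should return the three names seen in Figure 2: META-INF, OPS and mimetype.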

Figure 3 shows the full contents of the file when the folder view is switched off. This view is sorted by the Path column.



Figure 3. Alternate view of the book contents

In this view, you can see that the META-INF folder holds a file called 'container.xml'. This file holds information that conforms with the Open Container Format. We'll look at what that means shortly.
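
As a preview of what container.xml is for, the sketch below reads it and returns the path it points to. It assumes the standard container layout, in which a rootfile element carries a full-path attribute naming the package file; the function name find_package_path is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by container.xml in the standard container format.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_package_path(epub_path):
    """Return the 'full-path' of the first rootfile listed in container.xml."""
    with zipfile.ZipFile(epub_path) as archive:
        root = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no rootfile element")
    return rootfile.get("full-path")
```

For the book examined here, the expectation is that the result names the .opf file in the OPS folder, but check it against your own download rather than taking that on trust.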

The OPS folder contains a set of files with names like 'chapter-nnn.xml'. You might guess correctly that these files hold the text of each chapter of the book in XML format. There are a few other files with the .xml extension: 'title.xml' and 'epubbooksinfo.xml'. These are also parts of the book that will be displayed when you read it, namely its title page and a page of information about epubBooks.
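
To check that guess without a viewer, the chapter files can be picked out by name and their text extracted. This is a rough Python sketch: it assumes the 'OPS/chapter-nnn.xml' naming seen in this particular book, the function names are illustrative, and the simple parser used here will reject files that rely on named entities such as &nbsp;.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Matches the chapter naming used in this particular ebook.
CHAPTER_PATTERN = re.compile(r"^OPS/chapter-(\d+)\.xml$")


def chapter_files(epub_path):
    """Return the chapter file names in the archive, in numeric order."""
    found = []
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            match = CHAPTER_PATTERN.match(name)
            if match:
                found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]


def chapter_text(epub_path, member):
    """Return the text of one content file with the markup stripped out."""
    with zipfile.ZipFile(epub_path) as archive:
        root = ET.fromstring(archive.read(member))
    # Join all text nodes, then collapse runs of whitespace to single spaces.
    return " ".join(" ".join(root.itertext()).split())
```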

Also in the OPS folder are two important files: 'epb.opf' and 'epb.ncx'. These files contain metadata, or 'data about data', and together they describe the content of the book and things like the order in which the content files should be displayed by the reading device. A file with the extension .opf holds information that conforms to the Open Packaging Format, including the list of content files and their reading order. A file with the extension .ncx holds navigation information, i.e. the table of contents that the reading device uses to let the reader move around the book.
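
As a rough illustration of the kind of metadata these two files carry, the sketch below reads the spine of the .opf file and the navPoint entries of the .ncx file. It assumes the standard epub 2 element names and namespaces and the file locations seen in this particular book ('OPS/epb.opf' and 'OPS/epb.ncx'); the function names are illustrative only.

```python
import zipfile
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def _read_xml(epub_path, member):
    """Parse one XML member of the archive and return its root element."""
    with zipfile.ZipFile(epub_path) as archive:
        return ET.fromstring(archive.read(member))


def spine_hrefs(epub_path, opf_member="OPS/epb.opf"):
    """Return content file names in the order given by the spine."""
    root = _read_xml(epub_path, opf_member)
    # The manifest maps each item id to a file name (href).
    hrefs = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    # The spine lists item ids in reading order.
    return [
        hrefs[ref.get("idref")]
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
    ]


def navigation_entries(epub_path, ncx_member="OPS/epb.ncx"):
    """Return (label, target) pairs for every navPoint in the navigation file."""
    root = _read_xml(epub_path, ncx_member)
    entries = []
    for point in root.findall(".//ncx:navPoint", NCX_NS):
        label = point.findtext("ncx:navLabel/ncx:text", default="", namespaces=NCX_NS)
        content = point.find("ncx:content", NCX_NS)
        target = content.get("src") if content is not None else None
        entries.append((label.strip(), target))
    return entries
```

The href values are relative to the folder that holds the .opf file, so 'chapter-001.xml' here would mean 'OPS/chapter-001.xml' inside the archive.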

The folder called 'css' contains the files that apply some initial styling to the document. Aspects like relative font size, text decoration (underline etc.), and margins are included, but it's common for the reading device to leave to the reader the choice of such aspects as font, font size, and text and background colours.

Folder 'images' contains the images that are displayed in the book. In our example the images folder holds only a logo for epubBooks which is displayed on the 'epubbooksinfo' page.

So far, you've seen words like Container and Package without any detailed explanation of what they mean, and the Open Publication Structure has had no more than a passing mention. That has been deliberate. Each of these concepts will be the subject of separate posts; indeed, each topic will span several posts as we relate what we see in the example epub document both to the OPS specifications and to the development of a class model in C#, which we can use to read and manipulate epub documents.

Desperate to read Benjamin Button?
By the way, if you're absolutely desperate to view the book before we examine it technically, you will need an epub reading device. That means any combination of hardware and software that allows you to open and display an epub document. This could be a smartphone, a dedicated reader, or your PC. On your PC you could download and try Adobe Digital Editions, which allows you to hold and view a library of ebooks.

Copyright © Colin Hazlehurst, 2010