21 Jan 2010

Online Epub Editor Project: Technical Notes

This post adds some technical notes to the Inside Epub articles, which explore the epub standards and show how to write an online WYSIWYG epub editor. The following topics are covered:
  • Using the DotNetZip library to open and save an ebook
  • Checking the validity of your .epub files
  • Using the Sandcastle Help File Builder to document code
  • Sorting an NCX navMap by playOrder
Using the DotNetZip library to open and save an ebook
I chose the DotNetZip library for this project because it looked well documented and seemed to offer the functionality I wanted. All was fine until I submitted an epub for validation using the facility provided by Threepress (see Checking the validity of your .epub files, below).

There were a few simple problems to resolve but the most persistent was a report that the mimetype file did not contain the expected information (application/epub+zip). The file clearly did have the correct information in it so I suspected the zipping wasn't working correctly.

The rules for the ZIP Container in the Open Container Format Specification state:
The first file in the ZIP Container MUST be a file by the ASCII name of ‘mimetype’ which holds the MIME type for the ZIP Container (i.e., “application/epub+zip” as an ASCII string; no padding, white-space or case change). The file MUST be neither compressed nor encrypted and there MUST NOT be an extra field in its ZIP header. If this is done, then the ZIP Container offers convenient “magic number” support as described in RFC 2048 and the following will hold true:


  • The bytes “PK” will be at the beginning of the file
  • The bytes “mimetype” will be at position 30
  • The actual MIME type (i.e., the ASCII string “application/epub+zip”) will begin at position 38 
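The three magic-number conditions listed above can be checked directly against the first 58 bytes of an .epub file. The following is a minimal sketch rather than part of the editor project: the class and method names (EpubMagicCheck, HasEpubMagic) are hypothetical, and it tests only the three byte positions quoted from the specification, not the compression method or the extra-field length.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the Inside Epub code base.
public static class EpubMagicCheck
{
    private const string MimeType = "application/epub+zip";

    // Checks the three positions named in the OCF rules:
    // "PK" at 0, "mimetype" at 30, and the MIME type string at 38.
    public static bool HasEpubMagic(byte[] header)
    {
        if (header == null || header.Length < 38 + MimeType.Length)
        {
            return false;
        }
        return Encoding.ASCII.GetString(header, 0, 2) == "PK"
            && Encoding.ASCII.GetString(header, 30, 8) == "mimetype"
            && Encoding.ASCII.GetString(header, 38, MimeType.Length) == MimeType;
    }

    // Reads just enough of the file to run the check above.
    public static bool HasEpubMagic(string path)
    {
        byte[] header = new byte[38 + MimeType.Length];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = 0;
            while (read < header.Length)
            {
                int n = fs.Read(header, read, header.Length - read);
                if (n == 0)
                {
                    return false; // file is shorter than the expected header
                }
                read += n;
            }
        }
        return HasEpubMagic(header);
    }
}
```

A check like this only reports whether the layout is right; it does not say why it is wrong, so viewing the raw bytes is still useful when it fails.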
I made a couple of mistakes when building the epub the first few (many) times. Figure 1 shows one of my earlier attempts, which I opened in Notepad to view the raw bytes of the zipped file (it is in Zip format, so you can't expect to see normal text). You can see that, although mimetype is the first file in the archive, its name carries the long folder path 'eBookTemp/Inside epub/', which it shouldn't, and 'application/epub+zip' therefore doesn't start at position 38.

Figure 1. Badly constructed epub

It was clear that I couldn't simply write:
using (ZipFile zippedBook = new ZipFile())
{
     zippedBook.AddDirectory(ebookPath);
}
where ebookPath was the folder holding the epub files. I had to make sure that the mimetype file was added first, without a folder path, was uncompressed, and started at the right position in the archive. I arrived at the following code to achieve this:
// Add mimetype first, uncompressed and at the root of the archive
zippedBook.ForceNoCompression = true;
zippedBook.AddEntry("mimetype","","application/epub+zip");

// Compress everything else
zippedBook.ForceNoCompression = false;
zippedBook.AddDirectory(_fileSystemPath + "META-INF","META-INF");
zippedBook.AddDirectory(
    _fileSystemPath + _package.PackagePath,_package.PackagePath);

zippedBook.Save(_ebookPath);
The 'ForceNoCompression = true;' ensures that mimetype is not compressed. This is followed by 'ForceNoCompression = false;' to compress the rest of the epub. With this code, the application produced an epub file that looked like Figure 2 when opened in Notepad.

Figure 2. Well constructed epub archive

This file passed the epub validation check discussed below.

The other side of the coin with respect to epub files is how to expand an epub archive so you can work with its files. I developed the following code to do this:

foreach (ZipEntry ze in zippedBook.Entries)
{
    ze.ExtractExistingFile = ExtractExistingFileAction.OverwriteSilently;
    if (ze.IsDirectory)
    {
        Directory.CreateDirectory(fileSystemPath + Path.GetDirectoryName(ze.FileName));
    }
    else
    {
        ze.Extract(fileSystemPath);
    }
}
Checking the validity of your .epub files
There are some important tests to perform to be sure the online epub editor is working correctly. Perhaps the most obvious is whether its output can be loaded into a reading device. That could be deceptive, though, because the reading software on one reading device might be more rigorous and compliant than on another.

Another way is to use a tool that validates epub publications. I found one of these at Threepress, as shown in Figure 3. The underlying tool is a software project called epubcheck.

Figure 3. ePub validation of Inside Epub.epub

You could argue that this is also software and it may have bugs in it, and you'd be right. Given the wide and rapidly growing range of Zip tools, epub conversion tools, and reading software around, some confirmation of compliance with the standards is required, and tools like epubcheck will become an important way of ensuring readers are not disappointed.

It was a great relief, therefore, that I overcame my initial difficulties with building a conforming OCF epub file and received the big green tick from epubcheck.

Using the Sandcastle Help File Builder to document code
Microsoft and others provide several tools and techniques for documenting code in the MSDN style, as in this example of the System Class Library.
  • When using Visual Studio, you enable XML comments on the Build tab of Project | Properties. Then you can use the triple slash (///) notation to add XML-style documentation to your source code. See Figure 4 for a sample. The comments are needed for every namespace and for every public class, including all its public members (methods, properties, etc.). As soon as you type /// before one of these items, the Visual Studio IDE creates the comment structure with an initial <summary> and, if it's a method, with <param> elements appropriate to the method's signature.
  • Tools are available to extract the comments by reflection on the binary files of your application and to build navigable documentation like the MSDN example above. I've used NDoc before, so I thought I would try Sandcastle Help File Builder.
I've made a preliminary pass through the code adding XML decoration, as it's called. Figure 4 shows an example of this. There are guidelines for XML documentation comments which list and explain the appropriate XML grammar.

Figure 4. XML decoration of source code
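
For readers who cannot view the figure, the triple-slash notation looks roughly like the following. This is an illustrative sketch only: the namespace, class, and method (InsideEpub.Samples, PlayOrderHelper, Next) are hypothetical and are not taken from the project's source.

```csharp
using System;

namespace InsideEpub.Samples
{
    /// <summary>
    /// Hypothetical helper used only to illustrate XML documentation comments.
    /// </summary>
    public static class PlayOrderHelper
    {
        /// <summary>
        /// Returns the playOrder value that follows the one supplied.
        /// </summary>
        /// <param name="current">The playOrder of the preceding navPoint; must be zero or greater.</param>
        /// <returns>The value of <paramref name="current"/> plus one.</returns>
        /// <exception cref="ArgumentOutOfRangeException">
        /// Thrown when <paramref name="current"/> is negative.
        /// </exception>
        public static int Next(int current)
        {
            if (current < 0)
            {
                throw new ArgumentOutOfRangeException("current");
            }
            return current + 1;
        }
    }
}
```

The <summary> and <param> elements are the ones Visual Studio generates automatically; <returns>, <paramref>, and <exception> are added by hand where they are useful.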

The Sandcastle Help File Builder needed a little configuration. I set values for the following in the Project Properties:
  • HelpFileFormat, set to 'Website'
  • OutputPath, set to the folder where the output should be written.
  • WorkingPath, set to a folder where work files can be created.
You click on the Build icon and wait five minutes while Sandcastle works its magic. You can see the results in Figure 5 and also at Inside Epub Class Library.

Figure 5. Sandcastle 'Website' output

Sorting an NCX navMap by playOrder
The NCX document of an epub publication holds information, in XML format, about the structure of the publication. Its most important feature is a hierarchical list of 'navigation points' showing how the publication is subdivided. This list can be used by software in a reading device to provide access points allowing the reader to select directly what they want to view.

The root of the navigation data is the <navMap> element. This contains a hierarchical set of <navPoint> elements. Figure 6 shows a sample <navMap>.

Figure 6. Sample NCX showing its navMap

A <navPoint> element holds the text to be displayed to the reader in the <navLabel> element. It also tells the reading software where to go for the content document; this is held in the src attribute of the <content> element.

The topic of this Technical Note is the playOrder attribute of the <navPoint>. The playOrder attribute holds a number specifying the sequence in which the <navPoint> details should be presented to the reader.

It's important to distinguish the playOrder sequence from the sequence in which the <navPoint> nodes are held within the <navMap>; they don't have to be the same. In general, you can't assume anything about the way a content provider has built the <navMap>.

For instance, suppose a publisher is working on a 'flat' document, i.e. one with only one level in the <navPoint> hierarchy. If a new content document needs to be inserted between two documents with playOrder values of 5 and 6, the publisher might add the new <navPoint> as the last node in the <navMap>; they would give the new <navPoint> a playOrder value of 6 and increment the playOrder of every <navPoint> from the old 6 onwards, so the document order and the playOrder sequence no longer match.

What this means is that before presenting the navigation details to the reader, the <navPoint> nodes need to be sorted by their playOrder attribute. There are several approaches that could be taken, but the one described here is to use an XSL transform.

The following listing shows one possible transform.

1   <?xml version="1.0" encoding="utf-8"?>
2     <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
3        <xsl:output method="xml" indent="yes" omit-xml-declaration="no" encoding="utf-8"/>
4        <xsl:template match="node()[local-name()='navMap']">
5           <xsl:element name="navMap">
6               <xsl:apply-templates select="node()[local-name()='navPoint']">
7                 <xsl:sort select="number(@playOrder)" data-type="number"/>
8               </xsl:apply-templates>
9           </xsl:element>
10      </xsl:template>
11      <xsl:template match="node()[local-name()='navPoint']">
12         <xsl:copy-of select="."/>
13     </xsl:template>
14 </xsl:stylesheet>
At line 3, the <xsl:output> statement declares that the output is to be an XML document which includes an <?xml ?> declaration and is encoded in UTF-8.

Lines 4-10 of the transform find and process the <navMap> element.

At line 5 a new <navMap> is generated in the output.

At line 6, the transform handles the <navPoint> elements. The corresponding template at lines 11-13 returns a copy of each <navPoint>.

Line 7 is the key to the whole transform. It sorts the returned <navPoint> elements using the playOrder attribute.

The end result is a <navMap> in which the <navPoint> elements are in the order in which they should be presented to the reader.
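
An XSL transform is only one of the possible approaches. For comparison, the same top-level sort can be sketched with LINQ to XML. This is an alternative technique, not the method used in the project, and the class and method names (NavMapLinqSort, SortByPlayOrder) are hypothetical. The sketch assumes every <navPoint> has an integer playOrder attribute and, like the transform, it reorders only the direct children of <navMap>.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical alternative to the XSL transform, using LINQ to XML.
public static class NavMapLinqSort
{
    // Returns a new navMap element whose direct navPoint children are
    // ordered by playOrder. The input element is left unchanged because
    // XElement clones nodes that already have a parent.
    public static XElement SortByPlayOrder(XElement navMap)
    {
        XNamespace ns = navMap.Name.Namespace;
        return new XElement(
            navMap.Name,
            navMap.Attributes(),
            navMap.Elements(ns + "navPoint")
                  // The cast throws if playOrder is missing or not an integer.
                  .OrderBy(np => (int)np.Attribute("playOrder")));
    }
}
```

Unlike the transform, this version keeps the new <navMap> in the same namespace as the original element.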

Calling the transform in C#
The calling sequence for using this transform in C# can be something like the following:
// load the transform into an XslCompiledTransform instance
XslCompiledTransform xslt = new XslCompiledTransform();
xslt.Load(MapPath(".") + "/TransformNavMap.xsl");

// prepare an XmlWriter to write the transform output
StringBuilder newNavMap = new StringBuilder();
XmlWriter writer = XmlWriter.Create(newNavMap, xslt.OutputSettings);

// sort the book's navMap using the transform
xslt.Transform(book.container.package.ncx.NavMap, writer);
writer.Close();
Now the new <navMap> can be retrieved as a string using newNavMap.ToString(); if an XmlDocument is needed, that string can be loaded with XmlDocument.LoadXml().
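
The calling sequence can also be tried outside the editor. The sketch below is a self-contained variant under several assumptions: the stylesheet is inlined as a string instead of being loaded from TransformNavMap.xsl, the input is a standalone document whose root element is <navMap>, and the names NavMapSorter and Sort are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical wrapper around the transform shown earlier.
public static class NavMapSorter
{
    // The stylesheet from the listing, inlined so the sketch needs no files.
    private const string Xsl = @"<?xml version=""1.0"" encoding=""utf-8""?>
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" indent=""yes"" omit-xml-declaration=""no"" encoding=""utf-8""/>
  <xsl:template match=""node()[local-name()='navMap']"">
    <xsl:element name=""navMap"">
      <xsl:apply-templates select=""node()[local-name()='navPoint']"">
        <xsl:sort select=""number(@playOrder)"" data-type=""number""/>
      </xsl:apply-templates>
    </xsl:element>
  </xsl:template>
  <xsl:template match=""node()[local-name()='navPoint']"">
    <xsl:copy-of select="".""/>
  </xsl:template>
</xsl:stylesheet>";

    // Sorts a standalone <navMap> document and returns the result
    // as an XmlDocument.
    public static XmlDocument Sort(string navMapXml)
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(Xsl)))
        {
            xslt.Load(xslReader);
        }

        XmlDocument input = new XmlDocument();
        input.LoadXml(navMapXml);

        // Write the transform output into a string, as in the snippet above.
        StringBuilder sorted = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sorted, xslt.OutputSettings))
        {
            xslt.Transform(input, writer);
        }

        // Turn the string back into an XmlDocument for further processing.
        XmlDocument result = new XmlDocument();
        result.LoadXml(sorted.ToString());
        return result;
    }
}
```

Note that the transform creates the new <navMap> without a namespace while the copied <navPoint> elements keep theirs, so code that reads the result should match on local names or allow for both.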