18 Feb 2010

The purpose of this post is to review progress on the project to develop an online wysiwyg epub editor. For a number of reasons it's a good time to pause and reflect on what's been achieved and to identify what remains to be done.

I've built up a basic library of epub classes to handle the functionality of epub editing - classes to handle the container, the package with its metadata, manifest, and spine, and the NCX.

At the front end, the user can perform the following:
  • Create a new epublication (a term I prefer to 'epub publication').
  • Insert a basic set of metadata items for the new publication.
  • Limited editing of metadata for an existing publication.
  • Create and manipulate content documents.
  • Add images to the manifest and insert them (sort of) in a content document.
  • Edit the CSS files of the publication.
The screenshots shown below illustrate this functionality.

However, there are aspects of an online system that can no longer be ignored. For instance, the application needs to be multi-user. To date I've worked with a single server folder for the epub library, whereas each user needs a folder to hold their own library and to act as a work area for writing. The need to handle many users immediately suggests some kind of database would be useful; so far I've held data in constants, in the web.config and, embarrassingly, as hard-coded values. It's time to organise this better.

This review will step through the existing functionality and will identify where and how the application should be enhanced.

Library (Books View)
The screenshot in figure 1. shows the opening page of the application. It presents a list of books and a button for creating a new epub.


Figure 1. Books View

The user clicks on the book they want to open or clicks on the 'New epub' button to start a new publication. Currently, the list of books is fetched by extracting a folder path from web.config. All files with the extension .epub are shown in the list.

Design Improvements
The following enhancements are deemed to be either essential or desirable:
  1. The application should be multi-user. User registration and login should be handled by the Membership Service using the range of built-in Login controls. This should use a SQL-Express database in the App_Data folder.
  2. Additional user information should be held; in particular a folder on the web server should be assigned for each user's work. This root folder would be the place to hold the user's library and it is from this folder that the Books list would be populated.
  3. It would be friendlier to display the title of the book rather than the filename for each book, for instance 'The Curious Case of Benjamin Button' rather than 'fitzgerald-curious-case-of-benjamin-button.epub'.
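
As a rough illustration of the third improvement, the sketch below reads the first <dc:title> from the text of a package document and falls back to the filename when there is none. It assumes the package XML has already been extracted from the .epub; the class name BookTitleSketch and the method GetTitle are hypothetical and not part of the project's library.

```csharp
using System.Xml;

public static class BookTitleSketch {
  // Dublin Core namespace used by <dc:title> in an OPF package document.
  private const string DcNamespace = "http://purl.org/dc/elements/1.1/";

  // Returns the first <dc:title> in the package XML, or the fallback
  // (for example the filename) when no non-empty title is found.
  public static string GetTitle(string packageXml, string fallback) {
    XmlDocument package = new XmlDocument();
    package.LoadXml(packageXml);

    XmlNamespaceManager ns = new XmlNamespaceManager(package.NameTable);
    ns.AddNamespace("dc", DcNamespace);

    XmlNode title = package.SelectSingleNode("//dc:title", ns);
    if (title == null || title.InnerText.Trim().Length == 0) {
      return fallback;
    }
    return title.InnerText.Trim();
  }
}
```

The Books list could then show GetTitle(packageXml, fileName) for each book, so a publication with no title still appears under its filename.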
Metadata (Book Information View)
In figure 2. the Book Information view is shown. This displays the <metadata> from the publication's package document.


Figure 2. Book Information View

The user can modify any items on the screen and click the Save button to store the changes in the epub.
A review of the Open Packaging Format Schema shows, however, that the handling of metadata needs to be more sophisticated. Figure 3. is an extract from the schema showing the definition of metadata-content.

<define name="OPF20.metadata-content">
 <choice>
  <interleave>
   <ref name="OPF20.dc-metadata-element"/>
    <optional>
     <ref name="OPF20.x-metadata-element"/>
    </optional>
  </interleave>
  <interleave>
   <oneOrMore>
    <ref name="DC.title-element"/>
   </oneOrMore>
   <oneOrMore>
    <ref name="DC.language-element"/>
   </oneOrMore>
   <oneOrMore>
    <ref name="DC.identifier-element"/>
   </oneOrMore>
   <zeroOrMore>
    <ref name="DC.optional-metadata-element"/>
   </zeroOrMore>
   <zeroOrMore>
    <ref name="OPF20.meta-element"/>
   </zeroOrMore>
   <zeroOrMore>
    <ref name="OPF20.any-other-element"/>
   </zeroOrMore>
  </interleave>
 </choice>
</define>
Figure 3. OPF20.metadata-content

The <oneOrMore> wrapped around the title, language, and identifier elements indicates that there must be at least one of each of these elements in the metadata, but there may be more. The <zeroOrMore> identifies elements that are optional. However, the OrMore part of the definition means there can be many of these elements too. We've already seen in earlier posts that there can be a range of dates, creators, contributors, and descriptions. In fact the schema says there can be any number of these items.

Design Improvements
The following enhancements are essential for handling the metadata of an epublication.
  1. Handle multiple instances of any metadata element.
  2. It should be possible to add, modify, and delete metadata elements.
  3. All metadata elements can be deleted, except that at least one each of title, language, and identifier must remain.
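
The deletion rule in the list above can be sketched as a simple check. This is only an illustration: it assumes the Dublin Core elements are direct children of <metadata> (the second branch of the schema extract), and the names MetadataRulesSketch and CanDelete are hypothetical.

```csharp
using System.Xml;

public static class MetadataRulesSketch {
  private const string DcNamespace = "http://purl.org/dc/elements/1.1/";

  // True when the element may be removed without leaving the metadata
  // short of its one required title, language, or identifier.
  public static bool CanDelete(XmlNode metadata, XmlNode element) {
    bool required = element.NamespaceURI == DcNamespace &&
      (element.LocalName == "title" ||
       element.LocalName == "language" ||
       element.LocalName == "identifier");
    if (!required) {
      return true;
    }

    // count the elements of the same kind, including the element itself
    int count = 0;
    foreach (XmlNode child in metadata.ChildNodes) {
      if (child.NamespaceURI == DcNamespace &&
          child.LocalName == element.LocalName) {
        count++;
      }
    }
    return count > 1;
  }
}
```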
Book Content
Figure 4. shows the latest incarnation of the Book Content view.


Figure 4. Book Content View

The functionality of this screen has been changed since it was last presented. The most significant differences are:
  • Drag and drop functionality to move content documents within the publication. This activity is enabled/disabled by the 'Organise' checkbox. The Move Up/Move Down options were removed from the Action dropdown as they are no longer needed.
  • The tinyMCE editor, which is where the content documents are displayed, has been configured with a default font size that's easier to read and a drag handle has been provided to allow the user to change the height of the text area.
  • The new document details - Contents Entry and Document Heading - were moved above the editing area to allow the resize facility just mentioned.
  • The ability to put a heading at the top of a new content document was made optional.
  • Not shown in the screenshot, a 'boiler-plate' copyright document is inserted after the title page when a new epublication is created. It uses text like the following, where names and dates are taken from the metadata and inserted into the text at fixed locations. An author or publisher could change the text to that of their choice.

Copyright © Colin Hazlehurst, 2010

Colin Hazlehurst asserts the moral right to be identified as the author of this work.

No part of this publication may be reproduced, stored or introduced into a retrieval system, or transmitted, in any form or by any means, without the prior written permission of both the copyright owner and the publisher of this work.

Design Improvements
There are still useful enhancements that could be made to this view:
  • The ability to promote and demote content documents and reflect the changes in the <navMap> of the NCX document.
  • The code should start with the Save button disabled. It should detect when either the Table of Contents or the text of the currently displayed content document is changed, and then enable the Save button. The Save button should be disabled again after any changes have been saved.
  • There is a particular challenge with respect to handling images which is the subject of a separate note below. The problem is that the href of an image in the manifest is relative to the document which references it. When displaying the image in a browser, the URL in the image's src attribute needs to specify a path on the server relative to the root of the application.
Media View
Figure 5. shows the view when the user clicks on the Media tab. The application reads the manifest and finds all files that have a media-type beginning with the text 'image/'. For each file it finds, an Image control is added to the view and the source is set to the href for the manifest item.
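
The manifest search described above might look something like the following sketch, which uses an XPath starts-with() test on the media-type attribute. The class and method names (MediaViewSketch, ImageHrefs) are hypothetical; the OPF namespace URI is the one used by OPF 2.0 package documents.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class MediaViewSketch {
  private const string OpfNamespace = "http://www.idpf.org/2007/opf";

  // Returns the href of every manifest item whose media-type
  // begins with "image/".
  public static List<string> ImageHrefs(XmlDocument package) {
    XmlNamespaceManager ns = new XmlNamespaceManager(package.NameTable);
    ns.AddNamespace("opf", OpfNamespace);

    List<string> hrefs = new List<string>();
    XmlNodeList items = package.SelectNodes(
      "//opf:manifest/opf:item[starts-with(@media-type, 'image/')]", ns);
    foreach (XmlNode item in items) {
      XmlNode href = item.Attributes.GetNamedItem("href");
      if (href != null) {
        hrefs.Add(href.Value);
      }
    }
    return hrefs;
  }
}
```

Changing the XPath test to @media-type = 'text/css' would give the equivalent list of stylesheets.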


Figure 5. Media View

A FileUpload control works in conjunction with an Upload button to allow the user to upload a new image for inclusion in the publication.

Design Improvements
The following enhancements would greatly improve the handling of media by the application.
  • Media types other than images should be handled.
  • Some ability to generate thumbnails should be included which would keep the correct aspect ratio for each image.
  • Currently images cannot be selected for deletion.
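
For the thumbnail enhancement, the aspect ratio is preserved by applying one scale factor to both dimensions. The sketch below only computes the target size; it assumes thumbnails should fit inside a square of maxSide pixels, and the names ThumbnailSketch and Fit are hypothetical. Resizing the pixels themselves is left out.

```csharp
using System;

public static class ThumbnailSketch {
  // Scales (width, height) to fit inside a square of maxSide pixels
  // while keeping the aspect ratio. Images that already fit are unchanged.
  public static void Fit(int width, int height, int maxSide,
      out int thumbWidth, out int thumbHeight) {
    if (width <= 0 || height <= 0 || maxSide <= 0) {
      throw new ArgumentOutOfRangeException();
    }
    if (width <= maxSide && height <= maxSide) {
      thumbWidth = width;
      thumbHeight = height;
      return;
    }
    // the smaller of the two ratios makes the longer side equal maxSide
    double scale = Math.Min((double)maxSide / width, (double)maxSide / height);
    thumbWidth = Math.Max(1, (int)Math.Round(width * scale));
    thumbHeight = Math.Max(1, (int)Math.Round(height * scale));
  }
}
```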
Styles View
Remember that the value of XHTML is that it gives structure to the content of a document while separating that content from its presentation. The most widely used tool for presenting content is CSS. An epublication can incorporate any number of CSS stylesheets to help present the content.

Figure 6. shows the Styles tab in the epub editor project.


Figure 6. Styles View

When the user clicks on the Styles tab, the application reads the manifest and finds all files that have the media-type 'text/css'. It constructs a list of these files and allows the user to select one by clicking on it. In the example shown, main.css was selected and the stylesheet is displayed.

The 'Action' dropdown list gives the user the ability to add and remove CSS files. The Save button is used to save any changes the user makes to the currently displayed stylesheet.

Image Handling Issue
It was mentioned above that there is an issue with the handling of images that is particular to the online environment. The href of an image in the manifest of an epub is set relative to the content document that references it. On the web, the src attribute of the image control is a URL relative to the root of the web application.

In the Benjamin Button example, the ePubBooks logo is referenced on the epubbooksinfo page in the OPS folder. The href is set to 'images/epubbooks-logo.png', and the logo is held in folder OPS/images. A web page with a root folder called epub displaying the epubbooksinfo page in a tinyMCE editor would expect to find the logo in folder .../epub/images.

In a multi-user system it would not be possible to put the images for all users' publications in one folder. Each user must have their own work area, which means they must have a separate folder on the server. If F. Scott Fitzgerald were using this application (and who, given the weirdness of Benjamin Button, can say that he can't?), then he might be saving his content in a folder like:
epub/fsfitzgerald/benjaminbutton/OPS
Therefore, to view the image in the browser, its src would need to be something like:
epub/fsfitzgerald/benjaminbutton/OPS/images/myimage.svg
The epub must reference the image simply as: images/myimage.svg.

This difference in addressing must be handled by the application. The obvious choice is to use a pair of XSL transforms, one of which builds a web-relative URL and replaces the src attributes of all images in a content document with this value. This transform runs when the user selects a document from the table of contents to display it in the editing area.

The other transform runs when the user clicks the Save button after making changes to a content document. It strips out the web path and replaces it with a document relative path. The content documents saved to the filesystem and thus in the .epub must always use the document relative path - that's the only way the epub can be shipped.
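
The mapping that the two transforms have to perform can be sketched in plain C# string handling. This is not the XSL approach proposed above, just an illustration of the two conversions that would be applied to each src attribute; the names ImagePathSketch, ToWebPath, and ToDocumentRelative are hypothetical, and workArea stands for the web path of the folder that holds the content document.

```csharp
using System;

public static class ImagePathSketch {
  // Builds the src the browser needs from the document-relative href.
  // workArea is the web path of the folder holding the content document,
  // for example "epub/fsfitzgerald/benjaminbutton/OPS".
  public static string ToWebPath(string workArea, string documentRelative) {
    return workArea.TrimEnd('/') + "/" + documentRelative.TrimStart('/');
  }

  // Strips the work area prefix again before the document is saved.
  // Paths outside the work area are returned unchanged.
  public static string ToDocumentRelative(string workArea, string webPath) {
    string prefix = workArea.TrimEnd('/') + "/";
    if (webPath.StartsWith(prefix, StringComparison.Ordinal)) {
      return webPath.Substring(prefix.Length);
    }
    return webPath;
  }
}
```

With the Fitzgerald example, ToWebPath turns images/myimage.svg into epub/fsfitzgerald/benjaminbutton/OPS/images/myimage.svg, and ToDocumentRelative reverses it on Save.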

How do you get to Carnegie Hall?
So that's the job facing me on this project. It's interesting, but time consuming. To sum up, the way I feel is like this:

Tourist: How do you get to Carnegie Hall?
Yokel: Well, you wouldn't want to start from here.

16 Feb 2010

Manifest and spine management in C#

This post presents the requirements for C# classes that manage the manifest and the spine - both of which are elements in an epub package. The manifest identifies all of the files that are part of a publication while the spine specifies the linear reading order of its content documents.

A reading system needs only to read and parse these elements; it doesn't modify them in any way, with one exception. However, an online wysiwyg epub editor needs the ability to insert, remove, and rearrange files in both the manifest and the spine.

Manifest Items
The <manifest> element of an epub <package> contains <item> elements, one item for each file that is referenced from anywhere in the publication. A manifest item has the attributes shown in Table 1.

Attribute Name | Attribute Description
id | Mandatory, unique identifier of the file within the manifest.
href | Mandatory, URI of the file for this item.
media-type | Mandatory, MIME media-type for this item.
fallback | id of the manifest item to which a reading system should fall back if it is unable to process the namespace of the current item. Mandatory when the current document is an Out-Of-Line XML Island.
fallback-style | id of the manifest item which holds a CSS stylesheet that may be used to render the contents of the current item.
required-namespace | When the current document is an Out-Of-Line XML Island, this attribute must be present and it should be set to the namespace of the document.
required-modules | A comma-separated list of Extended Modules, which might belong to XHTML or to the namespace of an Out-Of-Line XML Island. This list of modules helps the reading system decide whether it has the capabilities to process the current item.
Table 1. manifest item attributes

In the context of a C# class designed to read and write manifest items, these attributes are simply strings to be accessed through the methods and properties of the class.

Attribute Handling Methods
Extracting the attributes of an XML node is a common activity in epub code. The most succinct code to access an attribute value is:
XmlNode targetAttribute = node.Attributes.GetNamedItem(attributeName);
However, many attributes are optional, and the variable targetAttribute will be set to null if the attribute is not present. Therefore, I prefer to wrap this statement up with some defensive programming which checks for a missing attribute and also distinguishes the case where the attribute is present but is set to an empty string. I use an overloaded TryGetAttribute method which offers a few ways of handling these situations. One example of the method is shown below.


public static bool TryGetAttribute(XmlNode node
   ,string attributeName
   ,out string attributeValue) {

  // initialise the results
  bool result = false;
  attributeValue = string.Empty;

  // try to get the named attribute
  XmlNode targetAttribute = node.Attributes.GetNamedItem(attributeName);

  // if the attribute was found
  if (targetAttribute != null) {
    // extract the value and set the result to true
    attributeValue = targetAttribute.InnerText;
    result = true;
  }
  return result;
}//TryGetAttribute
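
To show how a caller can tell a missing attribute from an empty one, here is a small usage sketch. It repeats the logic of TryGetAttribute so that it compiles on its own; the wrapper class AttributeDemo and the Describe method are hypothetical and exist only for illustration.

```csharp
using System.Xml;

public static class AttributeDemo {
  // Same logic as the TryGetAttribute listing, repeated here so the
  // sketch compiles on its own.
  public static bool TryGetAttribute(XmlNode node,
      string attributeName, out string attributeValue) {
    attributeValue = string.Empty;
    XmlNode targetAttribute = node.Attributes.GetNamedItem(attributeName);
    if (targetAttribute == null) {
      return false;
    }
    attributeValue = targetAttribute.InnerText;
    return true;
  }

  // Describes the three cases a caller can tell apart.
  public static string Describe(XmlNode node, string attributeName) {
    string value;
    if (!TryGetAttribute(node, attributeName, out value)) {
      return "missing";
    }
    return value.Length == 0 ? "empty" : "value: " + value;
  }
}
```

The return value separates "missing" from "present", and the out parameter separates "present but empty" from "present with a value".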

The converse of reading a potentially missing attribute occurs when we want to set the value of an attribute that may or may not be present in the target XmlNode. Again, this happens often enough to make it worth creating a method to handle it. I call this SetOrAddAttribute and a listing is shown below.


public static void SetOrAddAttribute(XmlNode node,
   string attributeName, string attributeValue){
  // try to get the attribute; the indexer returns null if it is absent
  XmlAttribute targetAttribute = node.Attributes[attributeName];
  // if the attribute is not present in the given node
  if (targetAttribute == null){
    // create and add an empty attribute
    targetAttribute = node.OwnerDocument.CreateAttribute(attributeName);
    node.Attributes.Append(targetAttribute);
  }
  // set the attribute value
  targetAttribute.InnerText = attributeValue;
}//SetOrAddAttribute

The manifestitem class
With attribute handling in place, it's straightforward to create a manifestitem class in C#. The constructor is given a reference to an XmlNode which it stores in a private variable:


private XmlNode _node;

public manifestitem(XmlNode node){
  _node = node;
}

Each attribute of the manifest item is then provided with a property which can be used to get and set the attribute value. For example, look at the following snippet which handles the href attribute:


public string href {
  get {
    string _href;
    utilities.TryGetAttribute(_node,"href", out _href);
    return _href;
  }
  set {
    utilities.SetOrAddAttribute(_node, "href", value);
  }
}

The get method returns the attribute value, if it is present, or an empty string. The set method replaces the value of any existing href attribute or adds an href attribute with the given value if the attribute is not present in the item node.

This pattern is repeated for each attribute.

The manifest class
A C# class to handle an epub's manifest is concerned with the manifest's <item> elements. It needs to find them, add them, and remove them. To that end the methods in Table 2. make up the manifest class which is part of the project to develop an online wysiwyg epub editor.

Method | Description
manifest(XmlDocument package) | Constructor which receives the epub package as an XmlDocument.
ManifestNode() | A method which returns the <manifest> as an XmlNode.
ManifestItems() | Method returning the manifest item elements as an XmlNodeList.
Add(manifestitem item) | Method to add the given instance of a manifestitem to the manifest.
Add(string id, string href, string media_type) | Method to add an item to the manifest, assigning it the given mandatory values for id, href, and media-type.
Remove(string id, string packagePath) | Remove the item with the given id from the manifest. Also, delete the file from the file system using the physical path in the packagePath argument.
GetManifestItemById(string id) | Return the item element with the given id as a manifestitem instance.
CreateManifestItem() | Create a new manifestitem instance which can be adorned with attribute values and inserted in the manifest using the Add method.
Table 2. properties and methods of the manifest class

Note that the node order in the manifest is not important, unlike in the spine. Therefore, the Add methods simply append new items at the end of the manifest.
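
A minimal sketch of the three-argument Add, assuming it simply creates an <item> in the OPF namespace and appends it to the <manifest>. The standalone class ManifestAddSketch is hypothetical; in the project this logic would sit inside the manifest class, and a fuller version would probably also check that the id is not already in use.

```csharp
using System;
using System.Xml;

public static class ManifestAddSketch {
  private const string OpfNamespace = "http://www.idpf.org/2007/opf";

  // Appends a new <item> to the end of the <manifest> and returns it.
  public static XmlElement Add(XmlDocument package,
      string id, string href, string mediaType) {
    XmlNamespaceManager ns = new XmlNamespaceManager(package.NameTable);
    ns.AddNamespace("opf", OpfNamespace);

    XmlNode manifest = package.SelectSingleNode("//opf:manifest", ns);
    if (manifest == null) {
      throw new InvalidOperationException("The package has no manifest element.");
    }

    // create the item in the OPF namespace so it matches its siblings
    XmlElement item = package.CreateElement("item", OpfNamespace);
    item.SetAttribute("id", id);
    item.SetAttribute("href", href);
    item.SetAttribute("media-type", mediaType);
    manifest.AppendChild(item);
    return item;
  }
}
```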

Spine
In some ways the <spine> is easier to handle than the <manifest>: there are fewer attributes to work with, but it does have a few complications. Firstly, the <spine> element includes the toc attribute which holds the id of the manifest item that holds the NCX document for the publication. That attribute has to be accessible so the reading software can find and open the NCX.

Secondly, the spine provides the reading system with the linear reading order of the content documents. Therefore, the order of the nodes in the spine is important.

Spine nodes are called <itemref> because they refer to items in the manifest; the idref attribute of each itemref element is the id of a manifest <item>. Each item id must only appear once in the spine.

The only other attribute that the Open Packaging Format schema allows is the linear attribute. This distinguishes primary content documents (value="yes") from auxiliary content documents (value="no"). "yes" is the default, so this attribute can be omitted.

Useful Enumerations
Before presenting the spine class, it's worth introducing two enumerations that support the code. The first of these describes the position where a new spine itemref should be inserted. The InsertPosition enumeration is shown below.


public enum InsertPosition {
   after
   ,before
   ,bottom
   ,top
}

This provides options to insert a new itemref at the top or bottom of the reading order, or to insert it before or after a given other itemref node.

The second enumeration allows the code to specify the value of the linear attribute without passing a string. The Linear enumeration is shown below.


public enum Linear {
   yes
   ,no
}

The spine class
A C# class to provide basic handling for the <spine> element could have the methods and properties shown in Table 3.

Method | Description
spine(XmlDocument package) | Constructor which receives the epub package as an XmlDocument.
tocId | Return the id of the NCX manifest item from the toc attribute of the spine element.
itemrefs | Method returning the itemref elements as an XmlNodeList.
Add(string idref, InsertPosition ip, string refNodeId, Linear linear) | Add an itemref instance to the spine. The new itemref will have the given idref and linear values, and the position will be determined by the InsertPosition value relative to the itemref element which has the idref value in the refNodeId argument.
Remove(string id) | Remove the itemref with the given id from the spine.
Table 3. properties and methods of the spine class
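
One way the Add method in Table 3 might use the InsertPosition value is sketched below. The two enumerations are repeated so that the sketch compiles on its own; the standalone class SpineAddSketch is hypothetical, and omitting the linear attribute when the value is "yes" is an assumption based on "yes" being the default.

```csharp
using System;
using System.Xml;

public enum InsertPosition { after, before, bottom, top }
public enum Linear { yes, no }

public static class SpineAddSketch {
  private const string OpfNamespace = "http://www.idpf.org/2007/opf";

  public static XmlElement Add(XmlDocument package, string idref,
      InsertPosition ip, string refNodeId, Linear linear) {
    XmlNamespaceManager ns = new XmlNamespaceManager(package.NameTable);
    ns.AddNamespace("opf", OpfNamespace);
    XmlNode spine = package.SelectSingleNode("//opf:spine", ns);
    if (spine == null) {
      throw new InvalidOperationException("The package has no spine element.");
    }

    XmlElement itemref = package.CreateElement("itemref", OpfNamespace);
    itemref.SetAttribute("idref", idref);
    // "yes" is the default, so only write the attribute for "no"
    if (linear == Linear.no) {
      itemref.SetAttribute("linear", "no");
    }

    if (ip == InsertPosition.top) { spine.PrependChild(itemref); return itemref; }
    if (ip == InsertPosition.bottom) { spine.AppendChild(itemref); return itemref; }

    // before and after need the itemref whose idref matches refNodeId
    XmlNode refNode = null;
    foreach (XmlNode child in spine.ChildNodes) {
      XmlElement candidate = child as XmlElement;
      if (candidate != null && candidate.GetAttribute("idref") == refNodeId) {
        refNode = candidate;
        break;
      }
    }
    if (refNode == null) {
      throw new ArgumentException("No itemref with idref '" + refNodeId + "'.");
    }
    if (ip == InsertPosition.before) {
      spine.InsertBefore(itemref, refNode);
    } else {
      spine.InsertAfter(itemref, refNode);
    }
    return itemref;
  }
}
```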

Earlier I mentioned that, with one exception, a reading system does not modify the manifest or the spine. The Open Packaging Format says that any part of the publication that can be referenced during processing of an epub must be included in the spine. However, if the reading system encounters content that is not present in the spine:
the Reading System should add it to the spine (the placement at the discretion of the Reading System) and assign a value of 'no' to the linear attribute.
So, a reading system can add itemrefs to the spine. I interpret this to mean that the in-memory representation of the spine is modified and not the package file in the file system nor the compressed version of the package held in the .epub file. Please contradict me if you know this to be false.

15 Feb 2010

XML Islands in epub publications

Preferred Vocabulary
An earlier post - How the standards work together - showed that the Open Publication Structure (OPS) specifies the XHTML tags, grouped into modules, that should be used to create an epub publication. These modules constitute what is called a Preferred Vocabulary. In other words, the OPS specifies the tags that should be used to define the structure of a work - a <div> here, a <table> there, and so on.

All conforming reading systems must recognise and be able to render documents written using the preferred vocabulary. The term 'baseline' reading system indicates this minimal ability.

Beyond the Preferred Vocabulary
You can achieve a great deal using only the preferred vocabulary. A very high proportion of existing printed matter, including its illustrations, could be transferred to epub format using only the standard OPS modules - with layout and formatting support from CSS. However, that adjective 'preferred' hints that epub productions do not have to be written exclusively in that vocabulary. OPS recognises that there are other tags in XHTML 1.1 that an author might want to use and that there are other XML vocabularies that a publisher might need to include.

XML is a widely used technology. If you enter into Google the search term: 'XML vocabulary for', and follow this with any topic in which you have an interest, there's a good chance someone has designed, or is actively designing, an XML vocabulary for that topic. I found the following; you will find many more:
  • ceXML - civil engineering XML.
  • genXML - for the exchange of genealogical data.
  • mathML - XML for sharing mathematical expressions.
  • MusicXML - XML for capturing musical notation.
The users of these diverse vocabularies are very likely to want to include content written using them in their epub publications. It might be a fragment of non-preferred XML embedded, as an example, within an otherwise conforming XHTML content document, or it might be an entire document conveying the essential content of the publication.

The Open Publication Structure offers an approach that allows non-preferred content to be included in a publication while ensuring that the content is available, in some form, to all consumers. The trick is to allow reading systems that have been designed to handle the non-standard content to exploit it to the full while insisting that the publisher provide the information in a form that is accessible to baseline readers.

XML Islands
If I were to embed snippets of foreign languages into this post by suggesting per se that it is de rigueur to introduce de novo some concept of the deus ex machina, you would rightly accuse me of poor writing style. However, those French and Latin phrases are examples of content taken from another language, or non-preferred vocabulary, embedded in a stream of preferred vocabulary; in this case it's embedded in English but it could be Greek embedded in Spanish or Mandarin embedded in Tagalog.

It is poor writing style to sprinkle foreign phrases about like this because the reader whose preferred vocabulary is English will not necessarily understand Latin. Likewise, a chunk of XML written using a non-preferred vocabulary will make no sense to a baseline reading system because it is not required to process such content. I'm not saying it's a bad idea to insert 'foreign' XML into a content document, simply that special handling is required when it is used.

Chunks of XML, written in a foreign language and embedded in a stream of preferred vocabulary, are called Inline XML Islands. Islands in the stream, that is what they are.

It is possible for entire content documents to be written in a non-preferred vocabulary. In this case, and because they are not embedded in a stream of preferred vocabulary, these documents are called Out-Of-Line XML Islands, though they are more like continents than islands - entirely inaccessible and incomprehensible to a baseline reading system.

The Open Publication Structure and Open Packaging Format standards define between them what XML Islands are and then state the requirements to be met by publishers when creating them and the guidelines to be followed by reading systems when encountering them in a publication.

Publisher's Responsibilities: Out-Of-Line XML Islands
If a publisher wants to include Out-Of-Line XML Islands in a work, they must meet the following requirements.
  • The XML Island must be a complete XML document that conforms to its own schema (the schema defines the vocabulary).
  • The manifest item for an Out-Of-Line XML Island must identify the namespace of the document using the required-namespace attribute.
  • For each Out-Of-Line XML Island, the publisher must provide a fallback document which can be processed directly. The manifest must include fallback documents as well as the XML Islands they support.
  • The manifest item for the XML Island must include a fallback attribute, and that attribute should give the id of the fallback document.
  • If necessary, a fallback item may itself have a specified fallback, creating a fallback chain.
  • Fallback chains must not form a loop.
  • As an alternative to a fallback item, the publisher may provide a stylesheet which can be used for the presentation of the non-standard content. In this case the fallback-style attribute should be specified and the target stylesheet should be identified.
  • An Out-Of-Line XML Island may specify both a fallback item and a fallback-style.
Reading System Guidelines: Out-Of-Line XML Islands
A reading system is some combination of hardware and software. In an open market, reading systems will have a range of abilities, including the ability to handle one or more content types that fall outside the OPS preferred vocabulary.

When a reading system processes an item in the manifest which specifies a fallback item it should follow these guidelines:
  • Starting from an initial content document, identified in the spine or NCX, the reading system must follow the fallback chain until it finds a document it knows how to display. At the end of every fallback chain the reading system should find a document that it can render.
  • A reading system may display any item that it is capable of processing; it doesn't have to be the first one it finds.
  • If an Out-Of-Line XML Island specifies both a fallback item and a fallback stylesheet, a reading system may choose which one to use.
  • When a reading system is designed to have special capabilities, it may do more than the minimum with the content of an XML Island.
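
Following a fallback chain, as described in the guidelines above, can be sketched as a short loop. This is an illustration only: the fallbacks dictionary (item id to the id in its fallback attribute) and the canRender callback are assumed inputs, and the names FallbackSketch and Resolve are hypothetical. The visited set guards against a chain that loops even though publishers are required not to create one. The sketch always picks the first renderable item, although a reading system is free to choose a later one.

```csharp
using System;
using System.Collections.Generic;

public static class FallbackSketch {
  // fallbacks maps a manifest item id to the id in its fallback attribute.
  // canRender reports whether this reading system can display an item.
  // Returns the id of the first renderable item on the chain, or null
  // when the chain ends or loops without reaching one.
  public static string Resolve(string startId,
      IDictionary<string, string> fallbacks,
      Predicate<string> canRender) {
    HashSet<string> visited = new HashSet<string>();
    string current = startId;
    // visited.Add returns false when an id is seen a second time
    while (current != null && visited.Add(current)) {
      if (canRender(current)) {
        return current;
      }
      string next;
      current = fallbacks.TryGetValue(current, out next) ? next : null;
    }
    return null;
  }
}
```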
Inline XML Islands
When a fragment of 'foreign' XML is to be embedded in a stream of content which is written using the preferred vocabulary, the publisher should provide an inline mechanism for handling it.

We saw that fallback documents were used for Out-Of-Line XML Islands. The equivalent inline technique is the switch element, which presents zero or more case elements, each of which wraps XML markup inside a required-namespace declaration. The syntax takes the form:
<ops:switch id="switch_id">
   <ops:case required-namespace="namespace">
      ... XML content in the named vocabulary
   </ops:case>
   <ops:default>
      ... fallback OPS-compliant content
   </ops:default>
</ops:switch>
A reading system should examine the required-namespace of each case element and determine whether it can handle that namespace. It should process the first such case that it finds, although it doesn't have to. If the reading system either cannot or chooses not to process any of the cases, it must process the default element. The default must always contain content that would be valid in any OPS content document.
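
The selection rule just described could be sketched as follows. This is a hedged illustration rather than any reading system's actual code: the names SwitchSketch and Select are hypothetical, supportedNamespaces stands for whatever vocabularies the reading system claims to handle, and the ops prefix is assumed to be bound to the OPS 2.0 namespace.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class SwitchSketch {
  private const string OpsNamespace = "http://www.idpf.org/2007/ops";

  // Returns the first <ops:case> whose required-namespace is supported,
  // or the <ops:default> element when no case qualifies.
  public static XmlNode Select(XmlNode switchElement,
      ICollection<string> supportedNamespaces) {
    XmlNode fallback = null;
    foreach (XmlNode child in switchElement.ChildNodes) {
      // skip whitespace, comments, and elements from other namespaces
      if (child.NamespaceURI != OpsNamespace) {
        continue;
      }
      if (child.LocalName == "case") {
        XmlNode required = child.Attributes.GetNamedItem("required-namespace");
        if (required != null && supportedNamespaces.Contains(required.Value)) {
          return child;
        }
      } else if (child.LocalName == "default") {
        fallback = child;
      }
    }
    return fallback;
  }
}
```

A baseline reading system would pass an empty collection and so always receive the default element.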

The example below shows how a fragment of MusicXML might be presented in a content document.
<ops:switch id="musicXML_Example">
   <ops:case required-namespace="http://www.recordare.com/">
      <score-partwise version="2.0">
         <part-list>
            <score-part id="P1">
              <part-name>Music</part-name>
            </score-part>
         </part-list>
         <part id="P1">
            <measure number="1">
               <attributes>
                  <divisions>1</divisions>
                  <key>
                     <fifths>0</fifths>
                  </key>
                  <time>
                     <beats>4</beats>
                     <beat-type>4</beat-type>
                  </time>
                  <clef>
                     <sign>G</sign>
                     <line>2</line>
                  </clef>
               </attributes>
               <note>
                 <pitch>
                    <step>C</step>
                    <octave>4</octave>
                 </pitch>
                 <duration>4</duration>
                 <type>whole</type>
               </note>
            </measure>
         </part>
      </score-partwise>
   </ops:case>
   <ops:default>
      <img src="images/Cnatural.png" alt="CNatural" />
   </ops:default>
</ops:switch>
A reading system that understands MusicXML would probably choose to process the XML contained within the case element. A baseline reader would process the default case and render the image shown here:
CNatural