31 August 2009

Images As Objects

Images are not part of the main DITA specification. The <image/> element exists, but it only references an external image file via an href attribute. DITA itself neither manages images nor treats them as a class of object.[1]

While this is a reasonable decision for the DITA specification—trying to comprehensively handle images would make the specification both more complicated and less general—it's not a reasonable decision for your content management system.

Images in the Content Management System


Single-sourcing and multi-channel publishing both require some sort of image control.

In the single-sourcing case, it does you little good to know precisely which version of a textual content object you are using[2] if you don't know which versions of the images are going to appear in it. Images loaded by reference thus have to be controlled just as the textual content objects are. Since DITA does everything by reference, this means your CMS needs to manage images as objects with unique identifiers before you will be able to implement single-sourcing using DITA.

For multi-channel publishing, you run into the case where you want different versions of the image depending on output type. PDF benefits greatly from vector images, such as SVG or WMF; the resulting PDF will be able to print at essentially arbitrary resolution and will scale smoothly when zoomed. HTML content often has maximum image size constraints, such as a 550 or 800 pixel maximum width for any image. HTML output formats don't benefit from, and often cannot render, vector images. Since you're using the same topics to provide each of the different output types, handling this by referencing different image versions directly doesn't work. If you want to keep the single-sourcing capability, you have to be able to reference the unique ID of an image object smart enough to provide the correct image based on output type.

Image Objects


An image object is the thing being pointed to by the unique ID used to reference an image from the href attribute of an <image/> element.

Image objects should contain:
  • a unique ID, used in references to the image object
  • a human-intelligible name, returned in search results
  • meta-data, such as:
    • image good for SpiffyProduct versions up to 3.5; 4.0 or later, DO NOT USE
    • visually awful colour scheme follows industry labelling standards; don't redraw
    • usage labels, such as "consumer product", "non-specialist version", "in-house only", "SpiffyProduct", etc.
  • image files
    • source version in whatever binary format the drawing program uses
    • versions for each output type
    • an optional original image; the scan of the scribble on a napkin, etc.
At least one processable image file (one the output processing knows how to put into at least one output type) needs to be present in each image object, or the reference checking for image references needs to be smart enough to return a list of "this image object reference unique ID, image object contains nothing that can go in the output" warnings.
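
The list above, and the "nothing that can go in the output" check, can be sketched roughly as follows. This is a minimal illustration rather than how any particular CMS does it: the class names, the `role` values, and the `PROCESSABLE_FORMATS` set are all assumptions of mine.

```python
from dataclasses import dataclass, field

# Hypothetical set of formats the output processing knows how to place in
# at least one output type; the real list depends on your toolchain.
PROCESSABLE_FORMATS = {"svg", "png", "jpg", "gif"}


@dataclass
class ImageFile:
    role: str   # assumed roles: "source", "output" or "original"
    fmt: str    # file format, e.g. "svg", "png", "psd"
    path: str   # location of the file inside the repository


@dataclass
class ImageObject:
    unique_id: str                                 # used in references
    name: str                                      # shown in search results
    metadata: dict = field(default_factory=dict)   # usage labels, notes
    files: list = field(default_factory=list)      # ImageFile entries

    def has_processable_file(self) -> bool:
        return any(f.fmt in PROCESSABLE_FORMATS for f in self.files)


def unprocessable_warnings(image_objects):
    """Return one warning per image object that holds no processable file."""
    return [
        f"{obj.unique_id}: image object contains nothing that can go in the output"
        for obj in image_objects
        if not obj.has_processable_file()
    ]
```

Run over every image object referenced by a map, `unprocessable_warnings` gives the writer a list to fix instead of a document to proofread.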

If neither of those things is true, you get the problem where, somewhere in a long document with hundreds of image references, an image is either being replaced with a default image or just quietly vanishing, and a human being has to find it. Sometimes, the human is going to fail. Making the human try is both poor process design and unnecessarily hard on the human; this kind of detailed link checking is precisely the sort of task which should be performed automatically.

The output processing has to be able to extract the correct image file when a content delivery is produced. The CMS needs to be able to check that an image object exists for a given unique identifier. When content is released to the CMS, CMS acceptability is checked after XML validity, and that check can then guarantee that every referenced image object exists. You might also want to institute business rules about what kinds of image files need to be present in the image object.
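
As a rough sketch of that existence check, assuming the href attribute holds the image object's unique ID directly and that the CMS can hand over the set of known IDs (both are assumptions; the function name is mine), something like this would run after XML validation:

```python
import xml.etree.ElementTree as ET


def missing_image_references(topic_xml, known_ids):
    """Return the href values of <image> elements that match no known image object ID."""
    root = ET.fromstring(topic_xml)
    missing = []
    for image in root.iter("image"):
        href = image.get("href")
        # Images without an href are skipped here; a real check would need
        # a rule for those as well.
        if href is not None and href not in known_ids:
            missing.append(href)
    return missing
```

An empty list means every reference resolves; anything else is a reason to refuse the content at release time.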

Image Output Processing and Error Handling


Output processing has to be deterministic. It can't guess, or, rather, if it has to guess, you're not going to like the results.

As such, however you decide to set up image objects, the output processing either must be able to make a 1:1 mapping between an output type and an image file stored in the image object, or it must be able to convert an image file stored in the image object to the appropriate format for the output type.
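
One way to picture the "1:1 mapping or convert" rule is the sketch below. The two lookup tables are hypothetical, as are the format choices in them; the point is only that the order of preference is fixed in advance, so nothing is guessed.

```python
# Hypothetical tables: the format each output type takes directly, and the
# stored formats the processing can convert into that format, in a fixed
# order of preference.
PREFERRED_FORMAT = {"pdf": "svg", "html": "png"}
CONVERTIBLE_TO = {"png": ["svg", "jpg"], "svg": []}


def resolve_image_file(files_by_format, output_type):
    """Return (path, needs_conversion) for an output type, or raise LookupError."""
    wanted = PREFERRED_FORMAT[output_type]
    if wanted in files_by_format:                  # the 1:1 mapping case
        return files_by_format[wanted], False
    for source_format in CONVERTIBLE_TO[wanted]:   # fixed order, no guessing
        if source_format in files_by_format:
            return files_by_format[source_format], True
    raise LookupError(f"no usable image file for output type {output_type!r}")
```

Given an image object holding only an SVG, this returns the SVG as-is for PDF and flags it for conversion for HTML; given only a drawing-program source file, it raises rather than picking something.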

One way to meet this requirement is to have the output processing check for the necessary image file, or for an image file it can convert, and if it finds neither, insert a placeholder error image. This works, and adds to robustness in the sense that it guarantees that the output processing will complete.

The disadvantage of this approach is that a human has to check the entire deliverable document or documents for error messages, which reduces robustness in an information quality sense. It's remarkably easy to miss a single error message image in a hundred-page document, but you can count on your customers to find it. For that reason, I prefer an "image can't process, output fails with a message about which image" approach to handling errors in processing image objects.
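
The two error-handling approaches differ by only a few lines, as this sketch suggests. Everything named here (`process_images`, `ImageProcessingError`, the `strict` flag, the placeholder file name) is hypothetical, and `available` stands in for whatever lookup your output processing really does.

```python
class ImageProcessingError(Exception):
    """Raised when one or more image references cannot be processed."""


def process_images(references, available, strict=True, placeholder="error-image.png"):
    """Map each referenced image ID to a file path.

    `available` maps image object IDs to a usable file. In strict mode every
    failure is collected and reported by ID; otherwise a placeholder error
    image is substituted and processing carries on.
    """
    resolved, failed = {}, []
    for ref in references:
        if ref in available:
            resolved[ref] = available[ref]
        elif strict:
            failed.append(ref)
        else:
            resolved[ref] = placeholder
    if failed:
        raise ImageProcessingError("cannot process image(s): " + ", ".join(failed))
    return resolved
```

Collecting all the failures before raising means one failed run names every problem image, rather than one image per attempt.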

The downside to the "can't process, fail" approach is that it requires your CMS to have some way of passing error messages back to the user from the output processing, and in the case of images quite possibly from an ancillary part of the output processing, outside of the primary XSL transformations. This can be a surprisingly large technical headache, and it's something you want to be careful to specify up front in your CMS selection process.

Even when the image processes correctly, it might not be what you want. In one environment with a 550 pixel maximum width for images, the provided WEEE standard compliance graphic started off at approximately 2000 by 7000 pixels. Automatic down-scaling of images wider than 550 pixels to 550 pixels left that graphic roughly 550 by 1925 pixels; it did not produce the 75 x 250 WEEE compliance logo that was wanted in the HTML output. Cases like this require either a willingness to forcibly re-scale the source image or a way for the writer to provide image scaling information to the processor.
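
The arithmetic behind that example is short enough to write down. The function names are mine, and the source dimensions are the approximate ones quoted above, so the results are approximate too.

```python
def scale_to_max_width(width, height, max_width=550):
    """Down-scale proportionally so the width does not exceed max_width."""
    if width <= max_width:
        return width, height
    return max_width, round(height * max_width / width)


def scale_to_height(width, height, target_height):
    """Scale proportionally to an explicit target height."""
    return round(width * target_height / height), target_height


print(scale_to_max_width(2000, 7000))    # (550, 1925): far too tall for a logo
print(scale_to_height(2000, 7000, 250))  # (71, 250): close to the wanted size
```

A width cap alone cannot get to the second result; the processor needs to be told the intended size somehow.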

Due to multi-channel publishing and the unpredictability of output types, I would strongly recommend that if you provide user control of image scaling in your CMS, you do it in terms of percentages of the available space. Otherwise, even just the switch between US letter and A4 paper in PDF output will cause problems.
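
To see why percentages travel better than absolute sizes, here is a small sketch. The one-inch (72 point) margins are an assumption for illustration; the page widths are the standard ones, 8.5 inches for US letter and 210 mm for A4.

```python
# Page widths in points (1 pt = 1/72 inch).
PAGE_WIDTH_PT = {"letter": 8.5 * 72, "a4": 210 / 25.4 * 72}


def image_width_pt(percent, page, margin_pt=72):
    """Width of an image sized as a percentage of the space between the margins."""
    available = PAGE_WIDTH_PT[page] - 2 * margin_pt
    return available * percent / 100
```

With these margins the space between them is 468 points on US letter and a little over 451 points on A4, so an image fixed at 460 points wide fits one page and overflows the other, while an image specified as 100 percent fits both.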

Not All Images Are Content


Some images are properly part of the delivered content, and some images, such as corporate logos or promotional graphics for the cover of the delivered document, are not. Since getting a content image wrong is bad but more forgivable than getting the corporate branding wrong, it's a good idea to think about a parallel mechanism for the non-content images.

Ideally, the non-content images are provided in an automatic way by the output generation, and there is no interaction between the writing team and the non-content images.

You might not be able to do this; if you have delivered documents with distinct individual cover graphics, for instance, there will need to be some mechanism to identify which cover graphic goes with which map. Even in this case, it's preferable if the image reference is a map property rather than a direct href via an image element. Making the non-content images distinct in terms of how they are referenced allows for special checking in the output processing; where you might accept an error image for regular content images in case of a processing error, you would prefer that an error with the cover graphic result in a failure of output processing. You may also have the option of making a map property reference a different content repository with restricted access, so there is less concern about accidental modification of the non-content images associated with the corporate brand.


[1] The <object/> element is a straight pass-through reference, equivalent to the HTML object element, that points to some kind of binary content to be rendered: animated images, plugins, ActiveX controls, and so on. It's not a reference-this-image-object element.

[2] In DITA, the textual content object is a topic, but here I'm talking about the general single-sourcing case rather than strictly DITA.
