31 August 2009

One Wild Rose

More properly, I suppose it is a feral rose, but either way, I liked it.

Goldfinch on a post

A goldfinch in a setting that does not involve a finch feeder! (Though it was an environment with a lot of seeded out thistles.)

He took a look at me, but was reluctant to let me get any closer. So that's the most goldfinch I could get in the frame.

Basic Beach

There are certain inescapable temptations involved in being on a section of lake shore with onshore winds, and Tommy Thompson, being made out of fill, has as a result such interesting weathered "rocks" that the temptation is rather stronger from my point of view.

The two images were taken two seconds apart.

Images As Objects

Images are not part of the main DITA specification. The <image/> element exists, but it references an external image file via an href attribute. There is no inherent management of images, or consideration of images as a class of object, involved.[1]

While this is a reasonable decision for the DITA specification—trying to comprehensively handle images would make the specification both more complicated and less general—it's not a reasonable decision for your content management system.

Images in the Content Management System


Single-sourcing and multi-channel publishing both require some sort of image control.

In the single-sourcing case, it does you little good to know precisely which version of a textual content object you are using[2] if you don't know which versions of the images are going to appear in it. Images loaded by reference thus have to be controlled as well as the textual content objects. Since DITA does everything by reference, this means your CMS needs to manage images as objects with unique identifiers before you will be able to implement single-sourcing using DITA.

For multi-channel publishing, you run into the case where you want different versions of the image depending on output type. PDF benefits greatly from vector images, such as SVG or WMF; the resulting PDF will be able to print at essentially arbitrary resolution and will scale smoothly when zoomed. HTML content often has maximum image size constraints, such as a 550 or 800 pixel maximum width for any image. HTML output formats don't benefit from, and often cannot render, vector images. Since you're using the same topics to provide each of the different output types, handling this by referencing different image versions directly doesn't work. If you want to keep the single-sourcing capability, you have to be able to reference the unique ID of an image object smart enough to provide the correct image based on output type.
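As a sketch of how this looks from the topic's side (the URN-style identifier is a hypothetical CMS convention; DITA itself only sees an href value), the topic references the image object's unique ID and the output processing asks the CMS for the right file:

<!-- In the topic: reference the image object, not a file on disk. -->
<image href="urn:example-cms:image:00042"/>
<!-- At output time, a CMS along these lines would resolve the ID to, say,
     wiring-diagram.svg for PDF output and wiring-diagram.png for HTML output. -->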

Image Objects


An image object is the thing being pointed to by the unique ID used to reference an image from the href attribute of an <image/> element.

Image objects should contain:
  • A unique ID, used in references to the image object
  • a human-intelligible name, returned in search results
  • meta-data, such as:
    • image good for SpiffyProduct versions up to 3.5; 4.0 or later, DO NOT USE
    • visually awful colour scheme follows industry labelling standards; don't redraw
    • usage labels, such as "consumer product", "non-specialist version", "in-house only", "SpiffyProduct", etc.
  • image files
    • source version in whatever binary format the drawing program uses
    • versions for each output type
    • an optional original image; the scan of the scribble on a napkin, etc.
At least one processable image file—one the output processing knows how to put into at least one output type—needs to be present in each image object, or the reference checking for image references needs to be smart enough to return a list of "this reference points to an image object that exists, but the image object contains nothing that can go in the output" warnings.

If neither of those things is true, you get the problem where, somewhere in a long document with hundreds of image references, an image is either being replaced with a default image or just quietly vanishing, and a human being has to find it. Sometimes, the human is going to fail. Making the human try is both poor process design and unnecessarily hard on the human; this kind of detailed link checking is precisely the sort of task which should be performed automatically.
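Pulling those requirements together, here is a minimal sketch of what an image object might look like inside the CMS; the representation is entirely hypothetical, since DITA says nothing about how a CMS stores image objects:

<!-- Hypothetical CMS-internal record, not DITA markup. -->
<image-object id="urn:example-cms:image:00042" name="SpiffyProduct wiring diagram">
  <metadata>
    <usage-label>in-house only</usage-label>
    <note>Good for SpiffyProduct versions up to 3.5; DO NOT USE for 4.0 or later.</note>
  </metadata>
  <files>
    <file role="source" format="vsd" href="wiring-diagram.vsd"/>
    <file role="pdf-output" format="svg" href="wiring-diagram.svg"/>
    <file role="html-output" format="png" href="wiring-diagram.png"/>
  </files>
</image-object>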

The output processing has to be able to extract the correct image file when a content delivery is produced. The CMS needs to be able to check for existence of the image object on the basis of the unique identifier, so image references can be guaranteed to exist when CMS acceptability is checked after XML validity in the process of releasing content to the CMS. You might want to institute business rules about what kind of image files need to be present in the image object.

Image Output Processing and Error Handling


Output processing has to be deterministic. It can't guess, or, rather, if it has to guess, you're not going to like the results.

As such, however you decide to set up image objects, the output processing either must be able to make a 1:1 mapping between an output type and an image file stored in the image object, or it must be able to convert an image file stored in the image object to the appropriate format for the output type.

One approach to the "must" requirement is to have the output processing check for the necessary image file, or for an image file it can convert, and if it finds neither, insert a placeholder error image. This works, and adds to robustness in the sense that it guarantees that the output processing will produce output.

The disadvantage of this approach is that a human has to check the entire deliverable document or documents for error messages, which reduces robustness in an information quality sense. It's remarkably easy to miss a single error message image in a hundred page document, but you can count on your customers to find it. For that reason, I prefer an "image can't process, output fails with a message about which image" approach to handling errors in processing image objects.

The downside to the "can't process, fail" approach is that it requires your CMS to have some way of passing error messages back to the user from the output processing, and in the case of images quite possibly from an ancillary part of the output processing, outside of the primary XSL transformations. This can be a surprisingly large technical headache, and it's something you want to be careful to specify up front in your CMS selection process.

Even when the image processes correctly, it might not be what you want. In an environment with a 550 pixel maximum width for images, and a provided WEEE standard compliance graphic that started off at approximately 2000 by 7000 pixels, automatically down-scaling any image wider than 550 pixels to a 550 pixel width did not do what was wanted, which was a roughly 75 by 250 pixel WEEE compliance logo in the HTML output. Cases like this require either a willingness to forcibly re-scale the source image or a way for the writer to provide image scaling information to the processor.

Due to multi-channel publishing and the unpredictability of output types, I would strongly recommend that if you provide user control of image scaling in your CMS, you do it in terms of percentages of the available space. Otherwise, even just the switch between US letter and A4 paper in PDF output will cause problems.
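As a sketch of what that can look like in the markup: base DITA's scale attribute on <image/> is a percentage of the image's own pixel dimensions, not of the available space, so a percentage-of-available-space control generally means a local convention your output processing understands; the outputclass value below is a hypothetical example of such a convention.

<!-- Base DITA: scale is a percentage of the image's own dimensions. -->
<image href="urn:example-cms:image:00107" scale="50"/>
<!-- Hypothetical local convention: the output processing interprets
     width-40pct as 40% of the available column width, whatever the
     output type and paper size happen to be. -->
<image href="urn:example-cms:image:00107" outputclass="width-40pct"/>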

Not All Images Are Content


Some images are properly part of the delivered content, and some images, such as corporate logos or promotional graphics for the cover of the delivered document, are not properly part of content. Since getting a content image wrong is bad but more forgivable than getting the corporate branding wrong, it's a good idea to think about a parallel mechanism for the non-content images.

Ideally, the non-content images are provided in an automatic way by the output generation, and there is no interaction between the writing team and the non-content images.

You might not be able to do this; if you have delivered documents with distinct individual cover graphics, for instance, there will need to be some mechanism to identify which cover graphic goes with which map. Even in this case, it's preferable if the image reference is a map property rather than a direct href via an image element. Making the non-content images distinct in terms of how they are referenced allows for special checking in the output processing; where you might accept an error image for regular content images in case of a processing error, you would prefer that an error with the cover graphic result in a failure of output processing. You may also have the option of making a map property reference a different content repository with restricted access, so there is less concern about accidental modification of the non-content images associated with the corporate brand.
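A sketch of what the map-property approach could look like; the property name and the idea of resolving it against a restricted repository are local conventions you would have to define, not anything in the DITA specification:

<map title="SpiffyProduct User Guide">
  <topicmeta>
    <!-- Hypothetical local convention: the output processing looks for
         this property and fetches the cover graphic from a restricted
         repository, failing the build if it cannot. -->
    <othermeta name="cover-graphic" content="urn:example-cms:brand-image:spiffy-cover"/>
  </topicmeta>
  <topicref href="introduction.dita"/>
</map>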


[1] The <object/> element is a straight pass-through reference, equivalent to the HTML object element, that provides a reference to binary content which requires some kind of rendering support; animated images, plugins, ActiveX controls, and so on. It's not a reference-this-image-object element.

[2] In DITA, the textual content object is a topic, but I'm talking about the single source general case, here, rather than strictly DITA.

30 August 2009

Windy thistles

The whole thing:
100% crop of the low right of the previous exposure, so not exactly as seen above:
And a 100% crop of the low left of the first image:
It was a windy day, and this is one of the few times I have managed to take a picture that looks windy to me, observing it later.

Noodly

Tommy Thompson is pretty much entirely fill of one kind or another, and there are various places where one sees rebar weathering out of concrete. This particular collection strikes me as rather large to have weathered out so completely, or to be quite so twisted to have needed much weathering, but then again I can't imagine that it wouldn't have gone to recycling instead if it had been clear of concrete when shipped off as fill, either. So I have no idea how this got into this particular state, other than that a lot of force was involved at least once.

Indian Paintbrush


Having asked more than one person, more than once, what this particular flower is called, I have actually remembered. There's a couple places at Tommy Thompson where there's quite a lot of it, and it seems to be having a good year with the amount of rain we've been having.

Where yellow grows the something-or-other

No idea what it is, but I thought it was pretty. If all those buds manage to open more or less at once, it ought to peak fairly spectacularly, too.

29 August 2009

Horseshoes and hand grenades

As in "close only counts with".


There's hope, or something like hope, for these, but I am going to have to get better at this. (Not shooting into a strong wind with spindrift might admittedly improve things, but still. More work required.)

Architectural detail

A corner of the Old City Hall clock tower.
This is definitely a photographic opportunity that benefits from a cloudy day; not only is the light much more diffuse, there are fewer people on Queen St. to run into you while you're trying to compose the shot.

Surface Feeding

Shutter speed 1/125 s, most unfortunately. (Shifting cloud, sharp drop in light levels, didn't adjust ISO off 100.) Hopefully I shan't do that again.
Whatever it was, this particular immature gull seems to have caught it.

28 August 2009

Purple Bokeh

Now with bonus spider web!
No idea what it is; very small (as you can probably deduce from the apparent size of the spider web) and blowing back and forth in the breeze, so I'm officially pleased any of it is in focus and actually somewhat pleased with the way the bokeh works.

Blue, blue, blue, a colour and a surge...

... everything that rises must converge. (Yes, I'm inflicting Shriekback lyrics on you.)
This shouldn't really get the camera gloating tag; I had to adjust the white balance. Still, I'm inexplicably happy with it, happy enough to not try to crop the flower a bit more evenly into the frame.

There's a sparrow in there

There might, in all good sooth and honesty, have been five or six sparrows in there, chirpling away in all good glad humour.
Which is not to say I could see them; it's been a wet summer, and the forb is full well grown. Since it was also windy, for the stiff breeze values of windy, all the wee birds flashed across the sky just as quick as they could go, and down into the diverse herbage. So I had a number of "well, I think that was a bird" moments today.
The plus side is that I got a fair number of wildflower pictures; the less plus side, at least going by the limited knowledge of my readership that I have, is that there are a bunch of pictures of seagulls, too.

The "System" in CMS

Since I am having trouble untangling what I would like to say about maps and the way the delivered document hierarchy is produced to the point where the resulting post might be short enough that merely contemplating reading it would not crush the spirit of anyone so brave as to entertain the notion, I'm going to produce something of a meta-issues post instead.

A content management system is a system; that is:

  1. A system is an assembly of components connected together in an organized way.
  2. The components are affected by being in the system and the behaviour of the system is changed if they leave it.
  3. This organized assembly of components does something.
  4. This assembly as a whole has been identified by someone who is interested in it.

This understanding of system has a bunch of implications for both what you decide to implement, and how you decide to do that. I'm going to cover three of them.

The Purpose Of A System Is What It Does


This becomes the acronym POSIWID. Initially, the idea seems tautological; of course the system does what it does, it exists to do that, doesn't it?

There's a sort of mental trick involved, similar to looking at a clear night sky (somewhere with little or no light pollution) and seeing it as something with depth, rather than little lights stuck to a firmament.

In the system case, the trick is to see all the results and side effects of the system, including the bad results, as what the system is designed to do. You should not presume malice, but equally you should not accept "unintended side effect" as an explanation. If the CMS you're using causes suffering, causing suffering is what the CMS is for. (Hopefully, causing suffering is not all of what it is for.)

Since a DITA CMS will be new, and people will have to learn it, and change how they do things, and take a chance on blown deadlines the first time they do something deliverable with the CMS, there's always going to be some initial suffering, or at least anxiety, involved. That's not the suffering I mean; that, hopefully minor, suffering is the price of change.

The suffering I mean is the suffering that arises when you know how the system is supposed to work, what to do with it, and wind up suffering anyway. That kind of suffering is an indication that you need a different system.

Maybe "a little different"—changing the way the search dialogue works, or altering how the output processing handles notes—or maybe "a lot different"—changing XML editors, or the way XML validation rules are applied—or maybe completely different, and you start over.

Any implementation is going to need changes at the "a little different" level to correct suffering in use. A well-specified, well-planned implementation should not ever wander into the "a lot different" or "completely different" categories unless some radical technological change occurs.[1]

You Get What You Reward


People are not the whole of the system, but people are, must be, part of the system. People are also both self-adjusting and markedly difficult to adjust in the sense that you can adjust the user interface colour scheme.

One of the traps with setting up a DITA CMS (or any other technical documentation CMS) is to make it systemically, permanently new, and thus hard, so that it becomes and stays laudable merely to use the thing at all.

Other areas of endeavour with content management treat the content management systems as essential and normative; you can't do the job without one. (Consider the reaction of the people setting up a corporate finance system being asked to do it without a database, too.) Because so much technical documentation is still done as what is effectively piecework by individual craftsmen, the expectation that of course you're going to be doing this as part of an industrial, distributed, staged production process with formal divisions of labour and as much automation support as practicable to get productivity up simply isn't there. That expectation of piecework is going to conflict, badly, with a DITA CMS; the better the CMS is, from the point of view of potential productivity improvement and automatic support of those functions (like output, managing work order, and distributing effort by defined role) that can be automated, the worse the conflict will be.

So it's important to decide what you want—I could say effectiveness, but that's palming the card; what constitutes effective for your organization is something you know and I don't—and make sure to reward that. I caution only that you don't want to reward the presumption that DITA, XML, or your CMS are, individually or severally, difficult to use.[2]

A Viable System Is Built From Viable Systems


A clever fellow named Stafford Beer invented the Viable System Model to describe how organizations work.

The whole of the viable system model is something you can easily search for; the article under the link is a reasonable overview from the viewpoint of knowledge management. The point I want to make here is that systemic viability is recursive, in the systems sense of recursive; what looks like a single unit of production ("Implementation", Level 1 and the most basic level of the model) can contain an entire complete system itself. (In much the same way that your body, a system, is made out of cells, which are systems.)

The example usually given is how a subsidiary manufacturing plant is a whole viable organization itself but is seen from head office as simply that place where they make pop rivets and other fasteners. The example I would like to give is the individual writer; they're going to be doing Implementation, Co-ordination, and Control[3] with respect to their unit of production, the DITA topic. If the system does not work for them, it won't work, period, because the broader levels of organization (assembly of topics into maps to meet deliverables, discovering the appropriate persona and scenarios to meet your customer's information needs via scenario based authoring, etc.) are all dependent on the basic production layer working.

Software design might refer to this as getting the primitives right, where primitives are primitive operations; I might refer to it in terms of the meta-data information stratigraphy necessary to get from version control to scenario-based authoring, where it does no good to have a higher layer if you don't have the layer it rests on.

The whole system is made up of a group of smaller systems. The members of the writing team must understand topic-based authoring as an objective before they can write topics. Topic production must work before maps will work. Maps must work before you can attempt scenario-based authoring.

The implication of the necessary systemic recursion, in the pleasantly theoretical general case, is that when you implement your actual, concrete Content Management System, you need to build it, or, if you're buying an existing system that has all the pieces, start using it in the appropriate order. This also means that you can't go on to the next, increased level of complexity until the current level works. If the current level of system doesn't work, the next level can't.


[1] For "someone creates an XML editor that's completely intuitive to the naive user", or "someone starts selling a quantum tree transformation processor for cheap" values of radical technological change.

[2] There had better not be any factual support for "this CMS is difficult to use", either.

[3] Levels 1, 2, and 3 of the Viable System Model. All five layers are Implementation, Co-ordination, Control, Intelligence, and Policy.

26 August 2009

Linty Beast



Sometimes, I get decent shots of the backlit stealth cat. These benefited greatly from a combination of the FA31's f1.8 maximum aperture and Aoife being in a reasonably tractable mood, so far as photography is concerned.

25 August 2009

Validation as a Lifestyle Choice

Quick Overview of XML Validation


A well-formed XML document follows the syntax rules for an XML document. (document strictly in the XML sense of that term!)

An XML document is valid "if it has an associated document type declaration and if the document complies with the constraints expressed in it."

A non-validating XML parser can tell if an XML document (still that strict, restricted XML sense of document) is well formed, but cannot validate. A validating XML parser can validate an XML document against a specific Document Type Definition (DTD).

In practice, validation is somewhat more complicated, due to the various kinds of schemas. Schemas are XML documents (DTDs are not XML, and have their own syntax rules) that provide a description of the constraints for a particular XML vocabulary. DITA has an associated set of schemas, equivalent[1] to the DTDs, but the DTDs continue to be considered canonical for DITA.

Validation is an effective way of catching structural mistakes in DITA content. Valid content might still not be correct in terms of a writing objective, but invalid content certainly isn't correct.

Do You Really Need to Bother?


While it is certainly possible to set up a content management system for DITA authoring to accept invalid XML content, or even XML content that isn't well-formed, this is not a good idea.

There are three main reasons for doing validation with your CMS whenever content is submitted to it. ("checked in", "released", etc. Terminology will depend on the specific CMS you're using; I mean the case where you tell the CMS to "store this as the next version number of this object".)

Firstly, at some point you must process your XML content into a delivery format. No matter what format that is, at that point it will become important that your XML content is valid for relatively inescapable technical reasons.[2] Invalid content means either no output at all or output that you can't safely ship.

Secondly, people are neither good XML validators nor good managers of large groups of references to unique identifiers. Computers, on the other hand, are excellent XML validators and don't forget things. Manual maintenance of a valid body of XML content by human beings is a hard, horrible, and unending job, which is precisely the kind of thing that makes DITA authoring projects fail. Without automated validation support for your XML content, you don't get improved productivity and quality, you get decreased productivity and a writing team looking for other work.

Thirdly, validation in the XML editor, a potential substitute for validation in the CMS, isn't enough to support production use. The editor doesn't know what things are stored in the CMS, and can therefore only check for validity, not whether or not referenced content actually exists.

Validation Concerns in Practice


Division of labour and staged production means, in part, that there are a bunch of levels and kinds of validation you'll need to be concerned with for your DITA CMS.

The XML Editor


If you are authoring your content in DITA, you should be using an XML editor.[3] That editor should not only validate your content, it should provide visual and textual indication of any problems.

It's important that whatever editor you choose can be pointed at specific DTDs or schemas; built-in DITA DTDs are no good to you if you've specialized something and validation based on the built-in DITA DTDs stops being applicable to your content.

At the level of the topic, the most basic reason for validation is that it catches a lot of stupid mistakes—tags with no matching close tag, forgetting to specify the number of columns in a table via the cols attribute of <tgroup/>, use of a <tm/> element in a <navtitle/> where it is not permitted, mis-typing an entity name so that you've introduced an undefined entity, etc. At a less basic level, validation reinforces the use of standard patterns in the organization of the content. This is where the good error messages from the XML editor's validation function are so important; making XML authoring a frustrating exercise in getting your content past an inscrutable and apparently arbitrary guardian function will not lead to productivity increases.
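For instance, a deliberately broken fragment like the following is flagged immediately by a validating editor, rather than surfacing later as a processing failure:

<table>
  <tgroup> <!-- invalid: the cols attribute is required on tgroup -->
    <tbody>
      <row>
        <entry>Maximum operating temperature</entry>
        <entry>40 degrees C</entry>
      </row>
    </tbody>
  </tgroup>
</table>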

Managing References


Any DITA authoring environment involves a lot of references:
  • maps assemble topics into deliverables through references
  • topics reference other topics, images, and external resources such as web sites
  • maps may contain secondary content ordering structures, like a <reltable/> used to provide information traversals other than the one the map's main table-of-contents ordering provides
  • topics may reference part of their content through the DITA <conref/> mechanism, XML entities, or from the map via a per-map variable mechanism.
One consequence of managing references is that you need a specific tool for managing maps, preferably one which does not compel direct entry of unique identifiers. (The more reliably unique an identifier is, the less reliably a human being is going to type it.)

The primary consequence of all these references is that simple XML validation is necessary, but not sufficient.

Consider a topic referencing another topic with <xref/>, for example. The xref element will be valid with no attributes or contents, an empty href attribute, or an href attribute which references XML content which does not exist. It's most unlikely that you want anything other than the case where the referenced content actually exists.
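A quick illustration; all three of the following are valid against the DITA DTDs, but only the last is likely to be what you meant, and only the CMS can tell you whether the file it points at actually exists (the filename is made up):

<xref/>
<xref href=""/>
<xref href="configuring_the_widget.dita"/>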

Similarly, consider the case where you have a properly defined XML parsed general entity being used to substitute XML content for a trademarked name. (This is one of the ways to get a single, central list of trademarked names where changes to the central list are automatically reflected in the content.)

Normalization turns the entity &OurProduct; into <tm trademark="OurProduct" tmtype="reg">Our Splendid Product</tm>. However, XML validation in the editor is checking solely for whether or not the referenced entity is properly defined; it's not doing any normalization so it can't be checking that the normalized form is still valid. (Since the <tm/> can not be put just anywhere in a DITA document and the entity can, this is a real issue.)
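For reference, the declaration side of that arrangement looks like the following; the file name is illustrative, and the point is that the editor's validation only checks that a declaration exists, not what the replacement text turns into:

<!-- In a shared entity file, say trademarks.ent, referenced from the DTD: -->
<!ENTITY OurProduct '<tm trademark="OurProduct" tmtype="reg">Our Splendid Product</tm>'>

<!-- In the topic: -->
<p>&OurProduct; requires 256 MB of memory.</p>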

Similar issues arise with the DITA conref facility, where the conref attribute is valid so long as it is syntactically correct in XML terms; there's no way for XML validation to test that the reference conforms to DITA syntax or that the object referenced by the conref attribute is available.
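For example (the file name and IDs are made up), the following is perfectly valid XML whether or not the referenced warning actually exists anywhere in the CMS:

<note conref="shared_warnings.dita#shared_warnings/laser_warning"/>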

You want to make sure that your CMS has some way to detect and flag content that references other content that doesn't exist, rather than accepting what is otherwise a completely valid XML object.

Validation in the Content Management System


When a content object is submitted to the CMS, a sequence of events should take place:
  1. XML normalization (expands or resolves any entities)
  2. check the reference values of
    1. conrefs
    2. topics
    3. images
    for existence in the CMS.
  3. perform XML validation of the normalized version
If any of the existence checks for references fail, or if the normalized version of the XML is no longer valid, the CMS should not accept the object and should give an error message instead. This is an area in which you want[4] informative error messages.

Business Rules with Schematron


Schematron is an unusual schema language, in that it does not require you to specify all the rules under which the document is constructed. Instead, it provides a way to make and test assertions about the content of an XML document.

So if you want to make business rules for your DITA content, such as ordered lists must have at least 2 list items or every topic must have a <navtitle/> with text content, secondary validation with an appropriate Schematron schema is the way to do it.
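A minimal sketch of such a schema, using ISO Schematron; the two rules are examples of local business rules, not anything DITA itself requires:

<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="ol">
      <!-- Local business rule: ordered lists need at least two items. -->
      <assert test="count(li) &gt;= 2">An ordered list must contain at least two list items.</assert>
    </rule>
    <rule context="title">
      <!-- Local business rule: no empty title elements. -->
      <assert test="normalize-space(.) != ''">Title elements must not be empty.</assert>
    </rule>
  </pattern>
</schema>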

This approach is enormously simpler, less expensive, easier to change, at less risk due to changes in the DITA specification, and faster to implement than undertaking DITA specialization to reflect your business rules. It is also faster and more reliable than having editors check tagging style visually. Because of the way Schematron is structured, it's also easy to add additional rules in an incremental way; you can start with a very short list of rules (no empty topic title elements) and add those you discover you need later.


[1] Equivalent, but not equal; DTDs and schemas do not specify everything in exactly the same way. So, for instance, a DTD can specify 0 or 1, 1 or more, or 0 to many of something; a schema can specify specific numeric ranges using minOccurs and maxOccurs attributes. This is one of the reasons that there has to be a defined canonical specification when a DTD and schema implementation both exist.

[2] The intermediate XML stages of generating output certainly don't need to be valid, but the first stage, the one that resolves all the entities and supplies the class attributes, must. Otherwise the validating parser, conformant to the XML specification, will hit the first error and stop.

[3] XML documents are plain text documents, and you can edit them with any text editor, even Notepad. But it's very unlikely your sins are so great that you deserve to be editing XML in a text editor, and it's even more unlikely that your writing team will achieve the full expected productivity gains with a text editor. I highly recommend taking a look at oXygen for XML editing purposes.

[4] For values of want that closely correspond to intense need; you don't want a writing team collectively trying to figure out why the editor says it's valid but the CMS won't take it without at least a strong hint.

24 August 2009

Rise Up

More down town condos for Toronto.

A Failed Experiment

But not horribly failed.
[large composite building picture]
This was stitched together automagically by hugin from 36 exposures taken manually, without any kind of support. A couple of them weren't as low as they ought to have been, either; I tracked the wrong layer of windows a couple of times, leading to that blank wedge, which is not a stitching artefact.

So the front of the building isn't actually curved, the sky certainly didn't have rainbow streaks in it, and all the suspended cables do actually connect. The original is immense (59 MB as a fully-compressed PNG), and you can actually see people working behind some of the windows.
[detail of workers through the window]
The full thing would never fit (and you wouldn't want to download it, either), but 1600 pixels seems to be the maximum width for photos on Picasa, and therefore also for photos posted directly to Blogger. So this is a 100% crop of the folks working, just as an example.

I now understand why there were such pleased noises that the K-7 has the focal plane marked on the outside of the camera body, and why people bother with specialized pano heads for tripods.

I have one more stitching experiment to go (a high window in Union Station) and then I should probably either stop this or start trying to get it right.

23 August 2009

Has the Hot Stopped?

The heat wave pretty much has stopped, but I'm not sure Aoife has really internalized this as a belief just yet.

Using Semantic Tagging

Semantic Tagging Isn't Formatting Instructions


Years of having to worry about formatting can make it particularly difficult to view semantic tagging as a label for the kind of meaning the content of the element should have, rather than "it comes out bold".

Let's consider a (relatively simple) example of the DITA <fig/> (figure) element.

<fig>
<title>Example Placement Diagram</title>
<desc>The Example Placement Diagram shows a common arrangement of component parts.</desc>
<image href="diagram.png"/>
</fig>

This XML content can be rendered in a bunch of different ways.

In HTML output, we might have:

<div class="figure">
<img src="diagram.png" alt="The Example Placement Diagram shows a
common arrangement of component parts." />
<p class="caption">Example Placement Diagram</p>
</div>

In PDF output, we might have parts of the figure element appearing in different places, as part of the List of Figures at the front of the document:


and as part of the main flow of content within the body of the PDF document:


Note that while the text child of the <desc/> element winds up variously as the contents of an alt attribute, in front of the title of the figure in the main text, and after the title of the figure in the List of Figures, its semantic purpose—to be a more informative description than the title—is being respected in all cases.

It's important to remember that what you're doing is arranging the words to match the semantic purpose in both directions; out towards the expected audience, and in towards the definitions for the XML elements that provide you with the tag set.

This is especially important in the cases where the output processing doesn't use the entire XML tree of the topic, or where the output processing renders the delivered document in some order other than the XML document order of the DITA source.

Examples of not using the whole XML tree include assembling a quick "cheat sheet" style of instruction using only the <cmd/> elements from the <step/> elements of a task topic, or building an overview page by hierarchically assembling the topic titles and short descriptions from all the topics in a map.

Examples of rendering the delivered document in some other order than the XML document order of the DITA source include rendering the topics of a map into multiple sequences of topics to satisfy a list of scenarios the map is expected to meet in scenario-based authoring ("the content referenced by this map allows these personae to perform this list of scenarios", and one output per persona/scenario pair), or, more simply, having a house style that prefers rendering task topics so that the contents of the <context/> element renders before the contents of the <prereq/> element.

Output Types


DITA supports multi-channel publishing; a single source XML content representation can be processed into multiple types of output. The most common types of output are HTML and PDF, but other types are possible, and even within the broad categories of HTML and PDF, it's quite likely that you will have multiple different output types for each. So a "F1 Help" HTML output type may co-exist with a "User Guide" HTML output type, or a "white paper" PDF output type may co-exist with a "technical documentation" PDF output type.

This is a significant system advantage for a DITA content management system, but the advantage comes at the cost of a significant writing challenge. It is necessary to think of the job as getting the semantic representation correct, because you cannot be certain what type of output processing will be used on the topics or maps you produce as an individual. So while you know what output type will be used—the HTML user guide, or the white paper PDF, etc.—to produce the shipped document from your current work assignment, you don't know what else will happen to that content, either in parallel or in the future. Perhaps the HTML user guides shall be combined and processed into a PDF format to be presented to a potential customer for your software by someone in your company's customer relations department; perhaps some of the concept topics from the white paper will wind up in the engineering documentation, introducing the subject before the task topics and reference topics with quantified values specific to a particular shipping product.

This mix of different and unknown uses for the content makes it imperative not to let the semantic tagging collapse into a sort of awkward attempt at a formatter. If that collapse happens, you'll eventually meet some unexpected, suitably generic output processing that won't work on your content, because the content has been quietly customized to a particular output type's processing at a particular point in time.

Getting the delivered document is inherently a two-stage process with DITA, and keeping the formatting step separate from the semantic tagging step is vital for maintaining the writing advantages—arbitrary re-arrangement, no content edits required to change the type of delivered documents—of topic-based authoring in DITA.

Agreeing on Meaning


DITA is a general XML vocabulary for technical authoring, with little inherent structural constraint. This is in part because DITA also supports specialization, but specialization does not solve the problems of semantic tagging. Specialization is a way to reflect consistent and frequently recurring patterns in your content, so that you might, for instance, wish to specialize the <prereq/> element of a task into <tools/> and <precautions/>. This is very helpful if you want every task (or every task for a particular audience) to include content about the required tools and the necessary precautions, but it does not allow you to agree on the local value of the semantic meaning of an element. It might, at best, reduce the scope of the argument about what that local semantic meaning should be.

Consider the <info/> element, to which the specification assigns the semantics: "The information element (<info>) occurs inside a <step> element to provide additional information about the step."

You can put a lot of block level elements in an information element; paragraphs, lists, simple tables, full tables, figures, and objects are all permissible. There are also a large number of legal inline markup elements, such as filepath and menucascade. This means that you can sensibly use <info/> as a text container, like a specialized paragraph, or you can use it as a small section, which contains a sequence of block level elements such as paragraphs and lists. You can't sensibly use a single info block as both of those things at the same time, though; aside from output processing issues, you're putting the same kind of content at different levels in the XML tree, which is bad semantic tagging. Since you can put as many information elements as you like in a single task step, you also need to consider if you want to use one or many <info/> elements.
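To make the two usages concrete, here is a hedged sketch of both; the content is invented, both forms are valid, and the point is to pick one and apply it consistently:

<!-- info as a text container, one element per point -->
<step>
  <cmd>Connect the power supply.</cmd>
  <info>The connector is keyed and only fits one way.</info>
  <info>Use only the supplied power supply.</info>
</step>

<!-- info as a small section, one element per step -->
<step>
  <cmd>Connect the power supply.</cmd>
  <info>
    <p>The connector is keyed and only fits one way.</p>
    <ul>
      <li>Use only the supplied power supply.</li>
      <li>Do not force the connector.</li>
    </ul>
  </info>
</step>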

There is no single right answer for this; it depends on the kind of information you want a task to impart, your overall style decisions about how information is to be presented, and, probably, on how your output processing works. (Information elements that contain note elements with marginal icons of danger symbols, for example, may require you to go one-info-element-per-note to keep the icons from landing on top of each other.)

Someone on the writing team has to own the process of agreement on semantic meaning, and be able to make decisions, break ties, and otherwise ensure that there is a single definitive style guide for semantic tagging of content.

20 August 2009

Sorta Mauve In Colour

Walked by it, and took a picture. Not entirely sure where, but somewhere in the public spaces of Toronto. Slight crop, to get rid of a busy left edge.

Ceiling Vault


A skylight in Union Station in Toronto. A single exposure, and perhaps the only time I have so far noticed where a moderate wide lens would have been really useful. This is an un-cropped image, and I'm happy with the amount of detail in the corners.

Content Conversion Issues

Traditional, narrative-authoring content is not likely to be fully structured. Converting existing content to the concept, task, and reference information typing of the structured writing model used by DITA requires forethought and planning to work well.

It's Not Supposed To Look The Same


Not looking the same is in part a formatting issue. The automated formatting simply will not look exactly the same as the formatting from the incumbent DTP program, whatever that program happens to be. There are high-level issues such as fonts, since the fonts DTP programs or the Windows operating system come with are licensed to forbid embedding in PDFs produced with other software; there are various low-level issues like "you know, that's not the same kerning algorithm". And of course there's the fundamental issue that the automated formatting does not allow for hand-tweaking. No putting in an extra blank line to force a table on to the next page, and so on.

Not looking the same is also a content structuring issue. During your content conversion process from whatever you are using to DITA XML, you will need to carefully divide existing content into good topics following the concept/task/reference convention for information typing. This process inevitably involves changing the location of sentences and paragraphs of existing content. If you take, as you should, the conversion exercise as an opportunity to look at minimalism as a writing approach for your content, you will also remove existing words, sentences, and paragraphs. If your existing narrative or single-source structure has been set up to remove information redundancy, so that a single fact appears in one and only one place, you may even find yourself adding sentences and paragraphs, because in topic-based authoring, the unit is the topic, and each topic must be meaningful by itself. [1]

As a result of these two reasons for the DITA content not looking like the narrative content, it's very important that the objective for content conversion to DITA XML be understood as good semantic tagging and good information encapsulation, rather than replicating the look and feel of the existing content.

Grouping Content By Topic Type


Narrative content generally does not follow the DITA information typing convention of grouping all conceptual information in a concept topic, putting procedural information by itself without other information types, or keeping all quantified facts in a reference topic. Instead, paragraph-sized or smaller instances of each type of information will be scattered through the narrative.

Obtaining that grouping in your content requires agreement on what the local definitions of the DITA information types are. It will also require discussion to get the whole writing team using those definitions in the same way. It is a good idea to have one senior member of the team own the information typing definitions, and to be responsible for answering questions and settling disputes.

Generally, this process takes a certain amount of time until the light bulb goes off and the members of the writing team see how it is possible to restructure the content into the DITA information types. At that point, the information typing associated with conversion to a DITA authoring environment tends to go smoothly.

Restructure First, Then Tag


Someone who has extensive DITA experience, is comfortable with multiple levels of abstraction, and who has a reliable deep grasp of the local information typing and semantic tagging usage conventions can simultaneously restructure and tag content, but this level of ability is not something to plan on finding in your writing team. It is especially not something to plan on while you are doing your initial content conversion from DTP software to DITA XML.

While you can, in principle, do the conversion in either order, it's better to restructure first. Restructuring requires discussion and collaboration—are we all using the same liability disclaimer? which of these user documents use the same topics? who owns the local definition of a concept topic?—across the entire writing team, and this benefits from a certain amount of overt planning and organization on the part of the person in charge of the writing team.

Once the re-organization of the content has been agreed on, XML tagging can be done by individuals working alone. However, it is very important to agree on the list of DITA elements you're going to use before anyone starts tagging content. Not only is there often more than one way to do something with the DITA element set, it is not necessarily the case that every element will process, or process in the way that you expect. It is much better to have planned which elements are to be used ahead of time than to realize that the element set being used by the writing team to tag content is not the element set that will process, and to have to go back and do the tagging for substantial amounts of content over again.

It is certainly possible to restructure your content in the incumbent DTP application and deliver it that way, and this may be a way to manage the possible rate of change. (Very few writing teams have the uncommitted time or resources to proceed with a complete content conversion as a single step.) There are drawbacks to this, particularly the difficulty with heading levels ("is the topic title an H2 or an H3 style in this document?") which tends to force topic duplication. If you need to proceed through your entire body of content at one time while converting, this might be your only option. If you have the option of fully converting a small portion of your delivered content to DITA, and then another portion, and so on, until everything has been converted, take that approach instead.

Automation Mostly Unhelpful


Content conversion is an exercise in semantic tagging; content that wasn't semantically tagged at the start needs to end up semantically tagged at the end. Semantic tagging as an activity requires answering the questions "what is the function of these words in the topic?" and "what kind of meaning do these words have?" Both questions are questions of meaning and require a human being to answer.

There are software products available which can enclose the text content of Word or FrameMaker files in DITA (or other XML) element tags, and which are smart enough to do this on the basis of existing styles in the Word or FrameMaker file. This can function as a modest time-saver for a human, but cannot be used without complete human reworking of the results. There are three primary reasons for this:

  1. the semantic value of a style is rarely a 1:1 match with a DITA element
  2. the program does not, cannot, and should not attempt to re-arrange content to better conform to DITA information typing conventions
  3. Deciding on the structure of the content delivery—the DITA map—is a separate task from tagging content as XML
    • multiple instances of one topic's content will appear in the incumbent narrative content
    • multiple maps will reference one topic once content has been converted
    • it requires a human being to sort these issues out!

Since convert-to-DITA-tagging software products are relatively expensive, it's important to consider the cost-benefit ratio carefully when considering purchase of conversion software.

Summary


Content conversion is necessarily a manual step, with minimal opportunities to exploit automated support. It presents an opportunity to fully restructure your existing content into conformity with DITA information typing, full form/content separation via semantic tagging, and possibly to apply minimalist writing principles at the same time. Since all subsequent content delivery is constrained by the existing content you have available to build on in a high-reuse environment, it is particularly important to have a successful content conversion step when switching to a DITA CMS as your primary means of content delivery if you expect to derive substantial benefit from content re-use.


[1] Topics can and should reference other topics for context and completeness, but the intended audience should not be required to read anything else to extract the unit of information from the stand-alone contents of the topic.

Concept, Task, and Reference

Structured Authoring With DITA


Structured Authoring is, fundamentally, about consistent organization of content. When authoring in DITA, it's the writing use (as distinct from the output generation use) of the general semantic tagging capability of XML markup.

With DITA, there are at least three levels of structure; the organization of topics, the organization of the content of topics, and the external objects imported into topics, such as images. I'm addressing only the "content of topics" part of structured authoring with DITA in this post.

The point to having topic types, and thus the point to DITA including topic types as part of the core specification of the XML vocabulary, is to be able to separate different information objectives into different topic types. Different types of topics enable their audience in different ways.

Concept, task, and reference are specialized and, in terms of internal structure, much more specific than the generic DITA topic type. The problem is that they remain general, and in any group of writers larger than one, the reflexive understanding of what these topic types obviously mean will vary.

So as you implement your DITA solution, you will have to decide precisely how you want the individual topic types to be used, make sure the whole writing team knows the local topic information type definitions, and be prepared to reconsider those definitions if you don't like how they work out in practice.

Generic Topic


DITA includes a completely generic topic type; the root element is <topic/>, with class[1] "- topic/topic".

There are two problems with saying "well, simple is good; generic means we should be able to handle future surprise easily" and going with the generic DITA topic for your writing.

The first problem is that the generic topic is a little too generic; writing within the full scope of the generic topic is not obviously structured writing. This gets rid of the "simple" part of the advantage, and much of the "generic" part as well; you would have to develop specific local business rules and processes about how you would use the generic topic type in order to practise structured writing by using it.[2]

The second problem is that if all your topics are just "topic", you can't do information typing other than by metadata. In the unlikely event that it was designed to handle this case, a DITA CMS might make this straightforward, but DITA was designed on the assumption that you didn't necessarily have a CMS, and so are the content management systems meant to support it. DITA allows for information typing by providing three specialized topic types.

These types are the concept topic, task topic, and reference topic.

Concept


Concept topics are the least structured and most general of the three default topic types.

From a writing perspective, concept topics contain both paragraph-level elements, such as paragraphs and lists, and section-level elements: <section/>, <example/>, and <table/>.[3] Concept topics are general enough to present pretty much any kind of information.

Concept topics are rare in most technical writing infosets. They are concerned with theory, abstract information, and ideas; with the general rather than the specific.

I would recommend considering a rule that concept topics are used for those cases where no quantified information or instructions are being provided, and that those responsible for tagging style watch very carefully for those cases where there ought to be quantified information. Pure theory, in other words; mathematical proofs, definitions of capacitance, SI units, and economic value are all examples of the type of thing that goes in a concept topic.

Task


Task topics are the most structured and least general of the three default topic types.

From a writing perspective, tasks contain a number of specialized section-level elements: <prereq/> ("pre-requisites"), <context/>, <steps/>, <result/>, <example/>, and <postreq/> ("post-requisites"). Task topics are specialized for presenting sequential instructions. Task topics are not suitable for presenting anything else.
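A minimal sketch of the shape of a task topic (the content is invented for illustration):

<task id="brushing_the_cat">
  <title>Brushing the cat</title>
  <taskbody>
    <prereq>You need a soft-bristled cat brush.</prereq>
    <context>You brush your cat to cut down on hairballs and shedding.</context>
    <steps>
      <step><cmd>Let the cat settle somewhere comfortable.</cmd></step>
      <step><cmd>Brush from the head toward the tail, with the lie of the fur.</cmd></step>
      <step><cmd>Stop before the cat loses patience.</cmd></step>
    </steps>
    <result>The cat is sleeker and the furniture is less furry.</result>
  </taskbody>
</task>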

Task topics are common in most technical writing infosets. Everywhere you have a procedure or instructions, you use one or more task topics.

I would recommend considering a rule that all instructions are a task; this is the easy part. The hard part is agreeing on how much content, and what kind, goes in the <context/> element, and how task steps are to be broken down.

Context is intended to provide context for the task ("you brush your cat to cut down on hairballs and shedding") or a small amount of conceptual information ("This experiment allows you to measure the acceleration due to gravity. Remember that acceleration is defined as the rate of change in velocity, measured in meters per second per second.")

Task step (or sub-step) breakdown depends very much on the house style, as well as the type of information to be presented. There are arguments for and against presenting complex instructions as a single large task with sub-steps or as a sequence of individual tasks.

I would recommend that the number of steps in a task, or sub-steps in a step, not go above five. Whether this is best handled with sub-steps (and treating "log on to the service" as a step with sub-steps for the details) or breaking the procedure into multiple task topics is a function of the complexity of the material. More complex instructions benefit from being chunked into discrete topics more than relatively simple material with a large number of specific operations.

While keeping everything to five or fewer steps is not always achievable, it's well-attested that people can keep track of seven things at once, plus or minus two. Since it's unlikely documentation will get all of someone's available attention—remembering to pick the kids up from daycare is, or ought to be, difficult to displace—five is sometimes too large a number.

Reference


Reference topics are only slightly more structured and less general than concept topics.

From a writing perspective, references (in the <refbody/> element) contain section-level elements: <section/>, <example/>, and <table/>.[3] They do not directly contain paragraph-level elements, which is the primary structural difference between a reference topic and a concept topic.

Reference topics are common in most technical writing infosets. Reference topics present information intended for quick look-up of specific facts. While you would generally have to read all of a concept topic or task topic to make use of it, it is common for someone to read a reference topic solely to find a single fact, and to stop once they have done so.

I would recommend a rule that reference topics must contain content that is specific and quantified; always "five apples" rather than "some apples". Aside from making the distinction between the information function of the concept and reference topic types more clear to both the writing team and the audience, there is value in terms of content quality in insisting on being specific in a reference topic. Removing, as much as possible, qualifiers, and presenting quantified facts makes the technical reviewers' jobs easier, pushes the writer toward precision, and makes it easier for the audience to find that single fact they might be looking for.
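A minimal sketch of a reference topic following that rule (the content is invented for illustration):

<reference id="spiffyproduct_power_requirements">
  <title>SpiffyProduct power requirements</title>
  <refbody>
    <simpletable>
      <sthead>
        <stentry>Property</stentry>
        <stentry>Value</stentry>
      </sthead>
      <strow>
        <stentry>Input voltage</stentry>
        <stentry>12 V DC</stentry>
      </strow>
      <strow>
        <stentry>Maximum current</stentry>
        <stentry>2 A</stentry>
      </strow>
    </simpletable>
  </refbody>
</reference>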



[1] The class attribute, often abbreviated "@class" after XPath syntax, is how processing figures out what to do with a particular element. Values of @class are provided by the DITA DTDs during XML normalization; users need not, and should not, ever be entering the class attribute manually. Which means that you might spend years writing documentation with DITA using raw XML, and never see the class attribute.

[2] You can, of course, specialize your own topic types starting from the generic topic type. Specialization is another subject for another time.

[3] This is a simplification, leaving out special-purpose elements like <data/> and <foreign/>. By all means, see the DITA language specification for details.

19 August 2009

A slight failure of timing

Train tracks and the (briefly!) parallel bicycle path in the Don Valley, taken from a moving subway car crossing the Prince Edward Viaduct over the Don Valley. The intruding bit of out-of-focus shadow to the upper left is part of the bridge structure, so I'm going to have to try this again to see if I can better the timing.

Overall, though, I'm mostly pleased; this is something I have seen a lot, and was not at all sure I could photograph in a recognizable way between motion issues, focus issues, and subway car window grime.

18 August 2009

Orange Flower

I've been seeing a surprising amount of this particular flower this year. Could be that it's been a good year for water-loving plants; could be that photography is encouraging me to pay attention. Could be some of each, too.
I especially like the orange colour layering in these; it looks a lot like it's been built up out of layers of enamel.

17 August 2009

Photographic Feedback Request

So, the Metro Zoo has sent me email—I have a membership, they're allowed to do that—announcing a photo contest.

So I thought I would ask, out of the zoo photos I've posted in 2009, if anyone had any particular favourites.

16 August 2009

OFO bird walk: Durham Region and Lake Ontario Marshes

So I got up at 03h30 (the alarm was set for 03h40, but my brain has this fear of being late that is difficult to adequately express) and I got on the 04h30 bus south, negotiated through some unexpected streetcar outages on King and Queen Streets, got on a GO bus to Whitby, and eventually wound up at the Lynde Shores conservation area for 07h40 or so, ten minutes after the nominal start time but actually in time for the walk to start.

The day list for the group as a whole was some 70 birds; I saw 26 well enough to be able to identify them myself.

American coot—family group! possibly the same family groups as Thurs. 13th
American robin
bald eagle—a wandering juvenile and the last bird with the group
belted kingfisher—repeated high hovering dip-and-swoop
black-crowned night heron—three, all flying, all at disparate times
blue jay
cedar waxwing
chickadee—multiple separate supervising flocks
common grackle
double-crested cormorant
downy woodpecker
eastern kingbird
goldfinch
great blue heron—multiple individuals at different times
green-winged teal
grey catbird
lesser yellowlegs
mallard duck
marsh wren—from a minimum distance of about five feet
northern harrier—from the train coming home!
northern mockingbird
osprey—three, in two trees, two devouring fish
red-winged blackbird
ruby-throated hummingbird—perfect profile hover
spotted sandpiper—emotional belief in how small these are is difficult
yellow warbler

Special thanks to Ian, Jerry, and John, for being so kind as to ferry me about when movement went vehicular.

I keep learning things at these; one of the things I've been learning is that the expensive binoculars were worth it. Getting a sharp version of the—quite distant—kingfisher flight was amazing, white wing patterns and all.

14 August 2009

All the DITA posts in order

It's quite possible someone wants to read these in publication order, and I want to have a single place to send people who want to read them, so here they all are:

  1. Content Management Meta-Data Stratigraphy, Topic-based Authoring, and DITA
  2. DITA: What's a Topic?
  3. Example: Topic Breakdown Using the "Information Causes Change" Heuristic
  4. Narrative authoring vs. topic based authoring—Productivity gains and their causes
  5. Narrative Authoring Vs. Topic-Based Authoring, Part 2: Everything is Trade-offs
  6. Metadata and Decision-making
  7. Concept, Task, and Reference
  8. Content Conversion Issues
  9. Using Semantic Tagging
  10. Validation as a Lifestyle Choice
  11. The System in CMS
  12. Images As Objects
  13. Maps in DITA
  14. It's Titles All The Way Down

Metadata and Decision-making

Keeping track of hundreds of topics in your head isn't practical, so any production use of DITA requires metadata—readily available, searchable metadata—to make it usable.

So What Is Metadata?


Generally, metadata is data about data.

In the context of DITA or a DITA CMS system, metadata is in, or associated with, an XML object, and describes that object, but is not part of the delivered content of the object.

You can think of it this way: the data—the content properly part of the object—answers "what does this mean?", while the metadata answers everything else: "who had this last?", "how good is it?", "where is this being used?", and so on.

Alternatively, you can think of metadata as everything that isn't about what the object means. Meaning is a human question, and has to be handled by people. Everything else benefits greatly from automation support, and can sometimes be done completely automatically. Metadata is the information necessary to that automatic support.

Why Should I Care?


Traditional technical writing workflows don't need much metadata; an individual writer is responsible for producing all of a specific document, and in the normal course of events is also the only person who needs to know the answers to the questions that metadata for content objects in a DITA content management system is designed to answer. In this situation, an individual writer can handle their metadata requirements with some combination of a good memory and taking notes.

In a DITA workflow, there are inherently two levels of organization. There is the level of organization inside content objects—topics and images—which function as units of meaning, and there is the level of organization created by the structure of a map referencing topics[1] to create a content deliverable. Whether you use DITA at the level of single-sourcing, topic-based authoring, or scenario-based authoring, once you use maps you have to handle a number of cases that do not arise with narrative authoring:

  • I didn't write any of these topics, but I'm building a map that uses them

  • I didn't write that topic, but I want to reference it from my map

  • That topic would be perfect for my map if I could change it a little...

  • If you change that topic, you will break all my maps

Since efficient content re-use and efficient division of labour within the writing team depend on being able to handle these cases quickly and directly, you need metadata to support your DITA workflow.
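
To make the two levels of organization concrete, here is a minimal map sketch; the file names and titles are invented for the example, but the <map/> and <topicref/> structure is standard DITA:

    <map>
      <title>Rocket Skates User Guide</title>
      <topicref href="c_rocket_skates_overview.dita" type="concept">
        <topicref href="t_installing_rocket_skates.dita" type="task"/>
        <topicref href="t_tuning_rocket_skates.dita" type="task"/>
        <topicref href="r_fastener_torque_values.dita" type="reference"/>
      </topicref>
    </map>

Any of those referenced topics may have been written, and may still be owned, by someone else entirely, which is exactly where the cases listed above come from.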

Metadata Objectives


Per-topic metadata should support immediate answers to three questions:
  1. Can I ship this?

  2. Who do I need to talk to?

  3. If I change this, what do I break?


Can I Ship This?


In other words, where is this topic in the information quality progression?
That might be as simple as having "draft" and "done" states; it might be a full implementation of an information quality state machine with formal role divisions. Either way, you have to be able to tell whether it is done enough to ship to a customer.

This also means that the state mechanism your CMS uses should be smart enough to downgrade referencing objects when the referenced object is marked as having lost information quality; if your map is at "done", and someone changes one of the referenced topics to "draft", the map should change to "draft", too.

Maps get large; it's quite easy to have a map that references hundreds of topics. Manual visual tracking of information quality states isn't entirely reliable, especially under any kind of deadline pressure, so this is a place where CMS support for managing information quality through the object metadata can give you a large win. (Especially if it keeps you from discovering, only after the current version ships, that no one was assigned to the topic that will become Chapter 3, Section 2 on this pass through the content...)

Who Do I Need To Talk To?


Who has changed this topic, and when? You want a full stored change history, as is common in version control systems, because the last person who changed the content object may have been doing a pure language edit; the person you actually need to ask may be whoever provided the technical review, and you need to be able to find out who that was.

This is hopefully a rare thing to need to do, but when you do need to do it (17 Farad capacitor? on a graphics card? really?) it's going to be important that you can find the correct contact person quickly. So your CMS should make the information about who to ask immediately available.

If I Change This, What Do I Break?


The converse, and just as important, version of this question is "if I reference this, and it changes, does my map break?" You can't predict the future, but you can look at where else the topic is used and either talk to the people who own those content deliverables or make a risk assessment based on content commonality.

A topic referenced by no maps is safe to change[2] or delete[3]; a topic referenced by fifty maps (very easy to achieve in the case of a standard liability disclaimer in user documentation, for instance) is not safe to change, at least not without the exercise of forethought and planning. So grabbing the existing disclaimer, or the existing "the names of the different kinds of screwdriver" topic, is pretty safe; it's not likely to change and many things depend on it, so it won't be changed capriciously. A topic referenced by two maps going to very specific audiences—instructions for tuning one specific model of rocket skates for a single uphill mountain sprint course in Switzerland, say—is much more likely to change suddenly.

The general principle is that topics with fewer references are less safe with respect to future change. Whatever writing process you use when doing topic- or scenario-based authoring, you're going to need people-level process in place as well as the metadata, but without the metadata the people-level process fails under the sheer mass of detail it must track.

Distributed Authoring As A State Of Mind


Everybody on the writing team needs to think of topics as referenced objects. It stops being a case of "this is my document" and starts being a case of "I am working on part of a large information set, along with a bunch of other people, all of whom may be linking to any part of that information set, including the parts of the information set for which I am responsible".

This requires some effort to manage the overlap in content deliveries. Overlap management can be done via an object assignment mechanism in the CMS ("Fred has this one, Felicia has that one" assignment), some other assignment mechanism, or through convention ("never change a referenced topic without asking", in which case there needs to be map metadata that indicates who to ask).

Generally, the awareness that all the content is connected now seems to take an occasion of panic to really sink in, but rarely more than one, or at least not more than one per writer.

Progress At A Glance


Managers of writing effort have, when using DTP tools and narrative authoring, limited and subjective awareness of how work is progressing. Writers will report things like "70% done" or "three days left", and familiarity and practice will allow these assertions to be used as the basis for planning.

Once you've got a combination of topic-based authoring and per-object state labels, though, you have an immediately visible indication of progress. You can open a map and just look at it; if all the referenced topics are in "draft" (or "requires technical review", say), the map and thus the content delivery has not reached a significant degree of completion.

If almost all the referenced topics are in "done" with two left in "needs technical review", you can at least ask the writer what's going on with those, and maybe do something about it. If your CMS provides information like "assigned reviewer" and "date of assignment" and "due date per state" (so "complete technical review" has a due date; "complete editorial review" has another, later, due date, both of which are before the "ship" due date on the map), you can just directly call up the technical reviewer's manager and ask politely why they're late, with no need to interrupt the writer.

Good searching on the metadata will give you things like "show me all the topics that are a day or more late for review", too.

This capability is difficult to quantify as a productivity gain; it is likely that there's some productivity improvement through both the responsible writer and the manager having a better view of what's going on with any given deliverable, but convincingly measuring the benefit is not straightforward. On the other hand, the benefit to the manager's stomach lining, however subjective, is observed to be worthwhile.

DITA's <prolog/> Element


DITA provides the <prolog/> element and its diverse children as a container to store topic metadata in a topic. Use of the prolog as a directly-editable metadata location requires thought and planning.

For instance, the <keywords/> element lives in the prolog (as a child of its <metadata/> element), and this is where the topic-level keywords ("widget installation", "ACME rocket skates", etc.) go. If you use topic-level keywords, which are intended for processing into HTML <meta/> elements to support search in HTML deliverables, you almost certainly want those keywords to be managed by a human being, since which topic gets which keyword is a question of meaning and very difficult to automate. However, because the DITA prolog element and its children are XML content, structurally indistinguishable from the <title/> element or the other non-metadata elements of the topic, allowing direct editing of the prolog element allows direct editing of all the topic metadata. This may not be what you want—mistakes in hand-edited metadata can be especially difficult to find or fix—and your CMS may need to provide a special interface for entering metadata in a controlled way.
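
As a sketch, using the example keywords above, topic-level keywords in the prolog look something like this:

    <prolog>
      <metadata>
        <keywords>
          <keyword>widget installation</keyword>
          <keyword>ACME rocket skates</keyword>
        </keywords>
      </metadata>
    </prolog>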

If you are using a CMS for your DITA content, you may not want to use the prolog at all. Having the metadata stored in the objects means the CMS will have to open each object, read the metadata it is interested in (or check that this metadata has not changed), close that object, and continue, in order to perform searches that require or involve the metadata. The resulting time cost, compared to having the metadata for all the objects stored collectively in some efficient data structure, can be considerable, especially after you accumulate a significant (thousands of topics) amount of content in your CMS. There are also possibilities for hybrid approaches, where some but not all of the prolog element's children are compared to the version before editing, and if changed are checked for correctness—can't set the due date to March 1, 604 BCE, etc.—and loaded into the efficient collective data structure.

Use of metadata generally, and use of the prolog element specifically, is an area where "everything is trade-offs" really makes itself felt, and where it is especially important to know what you want going into your CMS selection process. Creating your use cases and working out a formal functional specification for how you need metadata to behave is a significant effort, but it is also the only way to be sure you can use your CMS in the way you intend.


[1] Possibly through the intermediary of referencing maps; eventually, there will be a topic reference.

[2] Assuming someone isn't working on it and just hasn't got it referenced by the intended map yet. Paying attention to the history can help with this.

[3] Your CMS shouldn't allow deletion of objects that are referenced by other objects. (Since maps can reference maps in DITA, this needs to be a general-case prohibition; it's not just topic references that are a concern.) If it does allow deletion without consideration of references, a mistaken delete can break lots of maps.

13 August 2009

Getting Out of the House

Went for a walk today.
Of course, I had to take a train to where I went for the walk, but one can't have everything.


I saw 21 birds I could identify, plus two small songbirds that were more or less "Theropoda incertae sedis" and some eclipse ducks I may attempt to opine about later.

The identified birds were:
american coot — family group! distant and dubious but not too distant
blue jay
canada geese
caspian tern
cedar waxwing — bunches
chickadee — came within arm's reach; there were sunflower seeds on a small platform at one of the observation stands, and food is important when you're a chickadee
common grackle
common moorhen — family group! sprinting over the vegetation and kek-kek-keking!
crow — post-West Nile, actually rather scarce birds
double-crested cormorant
eastern kingbird
goldfinch — a great many, to go with the copious provision of seeding thistles
great blue heron
herring gull
mallard ducks
mourning dove
red-winged blackbird
ring-billed gull
song sparrow
tree swallow
tricolored heron — from the road on the way back. Sitting on the marsh boardwalk railing, which provided an excellent scale reference.
turkey vulture — sitting on an abandoned ferro-concrete silo and glowering a mighty glower.

Total time was 13h30 leaving the GO station to 18h15 returning. About 10 km of distance traversed. Not utterly sluggish but I should definitely go for this sort of walk more often.