14 August 2009

Metadata and Decision-making

Keeping track of hundreds of topics in your head isn't practical, so any production use of DITA requires readily available, searchable metadata.

So What Is Metadata?


Generally, metadata is data about data.

In the context of DITA or a DITA CMS system, metadata is in, or associated with, an XML object, and describes that object, but is not part of the delivered content of the object.

You can think of it this way: the data—the content properly part of the object—answers "what does this mean?", while the metadata answers everything else: "who had this last?", "how good is it?", "where is this being used?", and so on.

Alternatively, you can think of metadata as everything that isn't about what the object means. Meaning is a human question, and has to be handled by people. Everything else benefits greatly from automation support, and can sometimes be done completely automatically. Metadata is the information necessary to that automatic support.

Why Should I Care?


Traditional technical writing work flows don't need much metadata; an individual writer is responsible for producing all of a specific document, and in the normal course of events is also the only person who needs answers to the questions that metadata for content objects in a DITA content management system is designed to answer. In this situation, an individual writer can meet their metadata requirements with some combination of a good memory and taking notes.

In a DITA work flow, there are inherently two levels of organization. There is the level of organization inside content objects—topics and images—which function as units of meaning, and there is the level of organization created by the structure of a map referencing topics[1] to create a content deliverable. Whether you use DITA at the level of single-sourcing, topic-based authoring, or scenario-based authoring, once you use maps you have to handle a number of cases that do not arise with narrative authoring:

  • I didn't write any of these topics, but I'm building a map that uses them

  • I didn't write that topic, but I want to reference it from my map

  • That topic would be perfect for my map if I could change it a little...

  • If you change that topic, you will break all my maps

Since efficient content re-use and efficient division of labour within the writing team depend on being able to resolve these situations quickly and directly, you need metadata to support your DITA workflow.

Metadata Objectives


Per-topic metadata should support immediate answers to three questions:
  1. Can I ship this?

  2. Who do I need to talk to?

  3. If I change this, what do I break?


Can I Ship This?


In other words, where is this topic in the information quality progression?
That might be as simple as having "draft" and "done" states; it might be a full implementation of an information quality state machine with formal role divisions. Either way, you have to be able to tell whether a topic is done enough to ship to a customer.

This also means that the state mechanism your CMS uses should be smart enough to downgrade referencing objects when the referenced object is marked as having lost information quality; if your map is at "done", and someone changes one of the referenced topics to "draft", the map should change to "draft", too.
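That downgrade rule can be sketched in a few lines. This is a toy in-memory model, not any particular CMS's behaviour; the state names, ids, and data structures are all illustrative assumptions:

```python
# Quality-state downgrade propagation: when a referenced object loses
# information quality, every map referencing it (directly or through
# other maps) is downgraded to match.

STATES = ["draft", "needs technical review", "done"]  # ascending quality

# Reverse reference metadata: object id -> ids of maps referencing it.
REFERENCED_BY = {
    "topic-install": ["map-user-guide"],
    "map-user-guide": [],
}

current_state = {"topic-install": "done", "map-user-guide": "done"}

def set_state(obj_id, new_state):
    """Set an object's state; on a quality drop, downgrade referencing
    maps, and transitively any maps referencing those maps."""
    old = current_state[obj_id]
    current_state[obj_id] = new_state
    if STATES.index(new_state) < STATES.index(old):
        for parent in REFERENCED_BY[obj_id]:
            if STATES.index(current_state[parent]) > STATES.index(new_state):
                set_state(parent, new_state)

set_state("topic-install", "draft")
print(current_state["map-user-guide"])  # the map drops back to "draft" too
```

The key ingredient is the reverse reference table: the CMS has to know who references whom, which is exactly the metadata a raw file system of topics doesn't give you.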

Maps get large; it's quite easy to have a map that references hundreds of topics. Manual visual tracking of information quality states isn't entirely reliable, especially under any kind of deadline pressure, so this is a place where CMS support for managing information quality through the object metadata can give you a large win. (Especially if it keeps you from discovering, only after the current version ships, that no one was assigned to the topic that will become Chapter 3, Section 2 on this pass through the content...)

Who Do I Need To Talk To?


Who has changed this topic, and when? You want a full stored change history, as is common in version control systems, because the last person who changed the content object may have been doing a pure language edit; the person you actually need to ask may be whoever provided the technical review, so you need to be able to find out who that was.

This is hopefully a rare need, but when it does come up (a 17-farad capacitor? on a graphics card? really?) it's going to be important that you can find the correct contact person quickly. So your CMS should make the information about who to ask immediately available.

If I Change This, What Do I Break?


The converse, and just as important, version of this question is "if I reference this, and it changes, does my map break?" You can't predict the future, but you can look at where else the topic is used and either talk to the people who own those content deliverables or make a risk assessment based on content commonality.

A topic referenced by no maps is safe to change[2] or delete[3]; a topic referenced by fifty maps (very easy to achieve in the case of a standard liability disclaimer in user documentation, for instance) is not safe to change, at least not without the exercise of forethought and planning. So grabbing the existing disclaimer, or the existing "the names of the different kinds of screwdriver" topic, is pretty safe; it's not likely to change and many things depend on it, so it won't be changed capriciously. A topic referenced by two maps going to very specific audiences—instructions for tuning one specific model of rocket skates for a single uphill mountain sprint course in Switzerland, say—is much more likely to change suddenly.
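A where-used check of this kind is, at bottom, a reverse lookup over the reference metadata. A minimal sketch, assuming the CMS exposes map-to-topic references as plain data; the ids are made up:

```python
# Where-used lookup: which maps reference a given topic? The answer is
# the set of content deliverables a change to that topic could break.

MAP_REFERENCES = {
    "map-user-guide": ["topic-disclaimer", "topic-screwdrivers"],
    "map-service-manual": ["topic-disclaimer", "topic-rocket-skates"],
    "map-quick-start": ["topic-disclaimer"],
}

def where_used(topic_id):
    """Return the maps that reference topic_id."""
    return [m for m, topics in MAP_REFERENCES.items() if topic_id in topics]

print(where_used("topic-disclaimer"))     # heavily shared: change with care
print(where_used("topic-rocket-skates"))  # one audience: likely to churn
```

A real CMS would answer this with an indexed query rather than a scan, but the shape of the question is the same.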

The general principle is "less-referenced topics are less safe with respect to future change". Whatever writing process you use for topic- or scenario-based authoring, you're going to need people-level process in place as well as the metadata; without the metadata, the people-level process fails under the sheer mass of detail it must track.

Distributed Authoring As A State Of Mind


Everybody on the writing team needs to think of topics as referenced objects. It stops being a case of "this is my document" and starts being a case of "I am working on part of a large information set, along with a bunch of other people, all of whom may be linking to any part of that information set, including the parts of the information set for which I am responsible".

This requires some effort to manage the overlap in content deliveries. Overlap management can be done via an object assignment mechanism in the CMS ("Fred has this one, Felicia has that one" assignment), some other assignment mechanism, or through convention ("never change a referenced topic without asking", in which case there needs to be map metadata that indicates who to ask).

Generally, the awareness that all the content is now connected seems to take an occasion of panic to really sink in, but rarely more than one, or at least not more than one per writer.

Progress At A Glance


Managers of writing effort have, when using DTP tools and narrative authoring, limited and subjective awareness of how work is progressing. Writers will report things like "70% done" or "three days left", and familiarity and practice will allow these assertions to be used as the basis for planning.

Once you've got a combination of topic-based authoring and per-object state labels, though, you have an immediately visible indication of progress. You can open a map and just look at it; if all the referenced topics are in "draft" (or "requires technical review", say), the map and thus the content delivery has not reached a significant degree of completion.

If almost all the referenced topics are in "done" with two left in "needs technical review", you can at least ask the writer what's going on with those, and maybe do something about it. If your CMS provides information like "assigned reviewer" and "date of assignment" and "due date per state" (so "complete technical review" has a due date; "complete editorial review" has another, later, due date, both of which are before the "ship" due date on the map), you can just directly call up the technical reviewer's manager and ask politely why they're late, with no need to interrupt the writer.

Good searching on the metadata will give you things like "show me all the topics that are a day or more late for review", too.

This capability is difficult to quantify as a productivity gain; it is likely that there's some productivity improvement through both the responsible writer and the manager having a better view of what's going on with any given deliverable, but convincingly measuring the benefit is not straightforward. On the other hand, the benefit to the manager's stomach lining, however subjective, is observed to be worthwhile.

DITA's <prolog/> Element


DITA provides the <prolog/> element and its diverse children as a container to store topic metadata in a topic. Use of the prolog as a directly-editable metadata location requires thought and planning.

For instance, the <keywords/> element is a child of the prolog, and this is where the topic-level keywords ("widget installation", "ACME rocket skates", etc.) go. If you use topic-level keywords, which are intended for processing into HTML <meta/> elements to support search in HTML deliverables, you almost certainly want those keywords to be managed by a human being, since which topic gets which keyword is an issue of meaning and very difficult to automate. However, because the DITA prolog element and its children are XML content, structurally indistinguishable from the <title/> element or the other non-metadata elements of the topic, allowing direct editing of the prolog element allows direct editing of all the topic metadata. This may not be what you want—mistakes in hand-edited metadata can be especially difficult to find or fix—and your CMS may need to provide a special interface for entering metadata in a controlled way.
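For concreteness, here is a minimal topic showing where the prolog and its keywords sit relative to the rest of the topic's content. The element names are standard DITA; the ids, dates, and text are invented:

```xml
<topic id="rocket-skate-tuning">
  <title>Tuning ACME Rocket Skates</title>
  <prolog>
    <author>fred</author>
    <critdates>
      <created date="2009-08-14"/>
      <revised modified="2009-08-14"/>
    </critdates>
    <metadata>
      <keywords>
        <keyword>ACME rocket skates</keyword>
        <keyword>widget installation</keyword>
      </keywords>
    </metadata>
  </prolog>
  <body>
    <p>Tuning procedure goes here.</p>
  </body>
</topic>
```

Note that the prolog is an ordinary XML sibling of <title/> and <body/>: nothing in the markup itself distinguishes "metadata a human should curate" (the keywords) from "metadata the CMS should own" (the dates), which is why the editing interface has to make that distinction for you.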

If you are using a CMS for your DITA content, you may not want to use the prolog at all. With the metadata stored in the objects, the CMS has to open each object, read the metadata of interest (or check that it has not changed), close the object, and continue, in order to perform any search that involves the metadata. The resulting time cost, compared to storing the metadata for all the objects collectively in some efficient data structure, can be considerable, especially once you accumulate a significant amount of content (thousands of topics) in your CMS. There are also possibilities for hybrid approaches, where some but not all of the prolog element's children are compared to the version before editing, and any that changed are checked for correctness—can't set the due date to March 1, 604 BCE, etc.—and loaded into the efficient collective data structure.
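The hybrid approach might look like this in outline; the field names, the validation rule, and the index structure are all assumptions for illustration:

```python
# Collective metadata index: prolog-derived fields are validated on
# check-in and stored once, so searches never open the topic files.

from datetime import date

metadata_index = {}  # topic id -> {"due": date, "state": str, ...}

def check_in(topic_id, fields):
    """Validate prolog-derived fields, then update the shared index."""
    due = fields.get("due")
    if due is not None and due < date(1970, 1, 1):
        # Reject implausible values before they poison the index.
        raise ValueError(f"implausible due date for {topic_id}: {due}")
    metadata_index[topic_id] = fields

check_in("topic-install", {"due": date(2009, 9, 1), "state": "draft"})

# A metadata search now touches only the index, not the topic files:
late = [t for t, f in metadata_index.items() if f["due"] < date(2009, 10, 1)]
print(late)
```

Queries like "show me everything a day or more late for review" become a filter over one data structure instead of a pass over thousands of XML files.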

Use of metadata generally, and use of the prolog element specifically, is an area where "everything is trade-offs" really makes itself felt, and where it is especially important to know what you want going into your CMS selection process. Creating your use cases and working out a formal functional specification for how you need metadata to behave is a significant effort, but it is also the only way to be sure you can use your CMS in the way you intend.


[1] Possibly through the intermediary of referencing maps; eventually, there will be a topic reference.

[2] Assuming someone isn't working on it and just hasn't got it referenced by the intended map yet. Paying attention to the history can help with this.

[3] Your CMS shouldn't allow deletion of objects that are referenced by other objects. (Since maps can reference maps in DITA, this needs to be a general-case prohibition; it's not just topic references that are a concern.) If it does allow deletion without consideration of references, a mistaken delete can break lots of maps.
