05 August 2009

DITA: What's a Topic?

Bringing up topic-based authoring immediately raises the question: what, precisely, is a topic?
There are a bunch of answers to this question; I'm going to give you mine.

Very Short Answer

My first, very short, answer is that a topic is the unit of information delivery you write in when you move away from document delivery to information delivery as your organizational goal. While this answer is true, almost no one finds it obviously helpful. It's the pure "what" answer with no obvious connection to how you would do such a thing.

A Heuristic for Information

The second, rather longer, answer starts with the formal definition of information.
That formal definition says information reduces the probability of uncertainty. The problem with this definition is that it's almost impossible to apply; you start having to think about ways to measure the probability of uncertainty in your customers, some of whom you may never know anything about whatsoever, in the future, before and after they've used your information delivery to try to do something unspecified. Thinking about the future accurately is very difficult; thinking about probability clearly is very difficult. The combination is nigh-impossible.
So instead of using the formal definition, fall back on a useful heuristic for determining what is information—information causes change.
This remains difficult, because information is inherently contextual: it depends on your audience. Instructions to an experienced mechanic about changing a tire need to contain little other than a torque specification for tightening the nuts and a list of relevant safety standards, whereas instructions to a complete novice need to start with how to identify the appropriate wrench. So human judgement about the intended audience has to be used to determine what constitutes information, that is, what will cause a member of the intended audience to do something differently than they otherwise would have done. But "what would a member of a specific audience do differently if they knew?" is a much easier question than "will this reduce the probability of an unknown individual's future uncertainty?"

DITA Topic As Audience-Specific Unit of Information

So, from this information perspective, a topic is enough information to cause one change, and enough context to understand that information.
Which information, and how much context, remain completely dependent on the intended audience. Relating information to audience is where the exercise of judgement by a writer first comes into the topic-based authoring process: who am I addressing, what do they already know, what do they need to know to correctly decide what they should do? (Do, in a great many senses; deciding the answer to a question, deciding how to proceed with a job, deciding between alternatives...)
It's one change because topics are independent of each other, which includes their ordering with respect to each other. Once you have enough information to cause multiple changes in a single topic, you start to develop natural language dependencies in the content and lose the topic property of independence. The right way to handle "you have to know this other thing before you try to do this thing" is with explicit references, and DITA provides several mechanisms for making explicit topic-to-topic references.
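As a rough illustration of what an explicit reference buys you, here is a minimal sketch in Python. It assumes a task topic that uses two of DITA's reference elements, an inline xref in the prerequisites and a link under related-links. The topic content, ids, and file names are invented for this example, and the explicit_references helper is hypothetical, not part of any DITA toolkit.

```python
import xml.etree.ElementTree as ET

# Hypothetical task topic; the ids and file names are invented for illustration.
TOPIC = """\
<task id="change-tire">
  <title>Changing a tire</title>
  <taskbody>
    <prereq>Know how to <xref href="identify-wrench.dita">identify the
    appropriate wrench</xref>.</prereq>
    <steps>
      <step><cmd>Swap the wheel and tighten the nuts to the specified torque.</cmd></step>
    </steps>
  </taskbody>
  <related-links>
    <link href="torque-specifications.dita"/>
  </related-links>
</task>
"""


def explicit_references(topic_xml):
    """Return the href of every xref and link element, in document order."""
    root = ET.fromstring(topic_xml)
    return [
        element.get("href")
        for element in root.iter()
        if element.tag in ("xref", "link") and element.get("href")
    ]


print(explicit_references(TOPIC))
```

Because the dependency lives in markup rather than in a phrase like "as described in the previous section", it survives when the topic is reordered or reused somewhere else, and a tool can find it.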

Topics and Information Delivery

Using this definition, a topic is your unit of information delivery; it's enough information to cause one change in a member of your expected audience, plus enough context to understand that information. Because topics are organized in arbitrary groups, they require independence from one another, so all information dependencies should be expressed through explicit references to other topics, rather than through document order.
If you use topics in this way, there are a bunch of advantages; the most immediate advantage to a writer is that everything is in small pieces, relatively easy to keep track of and to have reviewed. Topics to this standard are also much easier to write, because the context is inherently limited; the context of organizations of topics making up an entire information delivery can be very large, and very complicated, but the context of an individual topic will be small and (by definition!) as simple as possible. This is a big help to a writer.


Keith said...

I appreciate this definition, and yet as someone "working in the field" I only find this marginally useful when it comes to actually working with DITA topics.

A "significant unit of change" can be something as simple as changing "one teaspoon of sarsaparilla" to "two teaspoons of sarsaparilla". This is a significant change in a recipe (especially if you like or dislike sarsaparilla), but in most if not all cases would reflect a minor change in a much larger "ingredients" reference topic. So the change (and its implications) does not delimit the size of the topic in a useful perspective.

I think it is more to the point that in a publication cycle any form of change should be reflected in an increment in the publication version, but its use as a determinant of topic boundaries is a bit like asking whether a particular flamingo is pink enough or not.

What I am finding is that DITA topics lend themselves to a particular size dependent on the content they contain. And given that you need to start with an initial topic before you can figure out what its delta is, writers are more apt to be constrained by capturing like information in a single topic than by determining beforehand what its delta is likely to be.

In the end I think your argument relies too heavily on determinants which cannot be resolved prior to writing the topic. Good information architecture and a content audit of existing material (if it exists) are to my mind a far better way to determine the size and boundaries of a given topic, and provide better practical guidance to the writer.



jennie said...

So a recipe for lobster soufflé is going to a lot of people:

a) on a recipe card for Chef Marcel, an experienced professional gastronome

b) for the many and varied readers of a cooking weblog

c) in a leveled cookbook for novice cooks

So we have three different readers, and three different output formats for the information "How to Make a Lobster Souffle"

We could have an editor modify the recipe three times, essentially writing three different recipe documents, each customized for the publication and the reader(s).

We could have one source file for the best lobster soufflé recipe, and update it, then grab that file for the three different output formats, and have an editor edit it for the three different readers.

We could even tag the content elements in the source file so that each text element displayed differently depending on the output format: the heading "Ingredients" could be in 14-pt bold Harrington over 16-pt leading and the ingredients list could be boxed for the cookbook, while for the blog post the heading might be in 12-pt bold Helvetica over 14-pt leading. Whatever.

But, as I understand it, we'd still have to go in and edit each document for its designated reader, because while we can change the output format, we can't change the output content, short of creating three different recipes.

What we can do is break each type of content up. So we have a topic for "Ingredients" and one for "Preheating the oven," and one for "Preparing the lobster," "Making the custard," "Beating the egg whites," "Combining the custard and the egg whites," "Baking," and "Serving suggestions."

For the recipe card, we're going to skip "Preparing the lobster," and in the final document, our editor is going to edit the Ingredients list to "Meat from four lobsters, boiled and chopped finely." Marcel will know how to choose, kill, and extract the meat from the lobsters. So the notation on the ingredients list results in no uncertainty for Marcel: it is a complete unit of information. For our blog readers, however, "Meat from four lobsters, boiled and chopped finely," raises questions: "Where do I buy lobsters? Do I have to kill them myself? Do I boil the meat before or after taking it out of the lobster? et cetera" It is an incomplete unit of information, because it lacks sufficient context to address the reader's uncertainty. Otherwise, though, we're going to use all of the content in approximately the order I've given it.

Our cookbook reader? Is deeply uncertain about buying, killing, or dealing with lobsters at all, and may need to be told not to buy lobster in a can. For our basic cookbook, we're going to have to explain everything, which necessitates either cross references to information on, for example, how to separate eggs, and how to beat egg whites, or maybe including that information in the document.

So, then, if I'm grokking this at all, DITA would allow me to assign the tag "Marcel" to some types of information, and "Blog reader" to other types, and "novice cookbook reader" to other types. I could work up personae for "blog reader," and "novice cookbook reader." I could create output formats for "recipe card," "blog post," "cookbook recipe." I could then create a lobster souffle recipe document that was, for example, a cookbook-format entry that worked for Marcel-type chefs, consisting of the instructions that a Marcel-type chef would want, in order to create a lobster souffle. Or I could create a recipe card, with the same information (form != content). Or I could create a website for novice cooks, which would go into the repository of topics tagged with "Lobster Souffle," "Novice cookbook reader," and contain, instead of page-catch xrefs, hyperlinks to relevant support topics.

Plus I could track changes to the recipe, so that if we eventually decided that the souffle worked better with chipotle peppers than, say, paprika, and changed the ingredients, that change would propagate throughout the various output formats (except for those committed to paper already).

Graydon said...

Keith --

I think perhaps you have misunderstood the definition I gave.

It's one change, plus enough context to understand, _in the context of the scenario_. I shall hopefully be putting an example up sometime Friday. (Not going to assume I've caught all the typeaux now!)

So your "one or two teaspoons" isn't going to be information unless the scenario is "serve food to the Sultan of Faroffistan, who beheads cooks who serve him dishes he doesn't like, and who has really specific opinions about sarsaparilla as a flavour". In a normal recipe, it's context for a "now I know how to make this" change that's persona-specific.

You're right that you can't figure any of this out without having both the persona and the scenario. On the other hand, if you don't have those, you can't get to scenario-based authoring from topic-based, either.

Jennie --

That's about got it, except that I don't think you'd want to do recipes as chained topics so much as top-level tasks with lots of references to generic tasks like "make custard" from them. The recipe task becomes "now I know what I need to know how to do to make this" as its information-causes-change change.