20 August 2009

Content Conversion Issues

Traditional, narrative-authoring content is not likely to be fully structured. Converting existing content to the concept, task, and reference information typing of the structured writing model used by DITA requires forethought and planning to work well.

It's Not Supposed To Look The Same

Not looking the same is in part a formatting issue. The automated formatting simply will not look exactly the same as the formatting from the incumbent DTP program, whatever that program happens to be. There are high-level issues such as fonts, since the fonts that come with DTP programs or the Windows operating system are licensed to forbid embedding in PDFs produced with other software; there are various low-level issues like "you know, that's not the same kerning algorithm". And of course there's the fundamental issue that the automated formatting does not allow for hand-tweaking. No putting in an extra blank line to force a table onto the next page, and so on.

Not looking the same is also a content structuring issue. During your content conversion process from whatever you are using to DITA XML, you will need to carefully divide existing content into good topics following the concept/task/reference convention for information typing. This process inevitably involves changing the location of sentences and paragraphs of existing content. If you take, as you should, the conversion exercise as an opportunity to look at minimalism as a writing approach for your content, you will also remove existing words, sentences, and paragraphs. If your existing narrative or single-source structure has been set up to remove information redundancy, so that a single fact appears in one and only one place, you may even find yourself adding sentences and paragraphs, because in topic-based authoring, the unit is the topic, and each topic must be meaningful by itself. [1]

As a result of these two reasons for the DITA content not looking like the narrative content, it's very important that the objective for content conversion to DITA XML be understood as good semantic tagging and good information encapsulation, rather than replicating the look and feel of the existing content.

Grouping Content By Topic Type

Narrative content generally does not follow the DITA information typing convention of grouping all conceptual information in a concept topic, putting procedural information by itself without other information types, or keeping all quantified facts in a reference topic. Instead, paragraph-sized or smaller instances of each type of information will be scattered through the narrative.

Obtaining that grouping in your content requires agreement on what the local definitions of the DITA information types are. It will also require discussion to get the whole writing team using those definitions in the same way. It is a good idea to have one senior member of the team own the information typing definitions, and to be responsible for answering questions and settling disputes.

Generally, this process takes a certain amount of time until the light bulb goes off and the members of the writing team see how it is possible to restructure the content into the DITA information types. At that point, the information typing associated with conversion to a DITA authoring environment tends to go smoothly.

Restructure First, Then Tag

Someone who has extensive DITA experience, is comfortable with multiple levels of abstraction, and who has a reliable deep grasp of the local information typing and semantic tagging usage conventions can simultaneously restructure and tag content, but this level of ability is not something to plan on finding in your writing team. It is especially not something to plan on while you are doing your initial content conversion from DTP software to DITA XML.

While you can, in principle, do the conversion in either order, it's better to restructure first. Restructuring requires discussion and collaboration—are we all using the same liability disclaimer? which of these user documents use the same topics? who owns the local definition of a concept topic?—across the entire writing team, and this benefits from a certain amount of overt planning and organization on the part of the person in charge of the writing team.

Once the re-organization of the content has been agreed on, XML tagging can be done by individuals working alone. However, it is very important to agree on the list of DITA elements you're going to use before anyone starts tagging content. Not only is there often more than one way to do something with the DITA element set, it is not necessarily the case that every element will process, or necessarily in the way that you expect. It is much better to have planned which elements are to be used ahead of time than to realize that the element set being used by the writing team to tag content is not the element set that will process, and to have to go back and do the tagging for substantial amounts of content over again.
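
One way to keep the agreed element list from drifting is to check tagged topics against it. The following is a minimal sketch, not a recommended element set: the contents of AGREED_ELEMENTS, the function name, and the sample topic are all assumptions made up for illustration, and a real check would run over your actual topic files.

```python
import xml.etree.ElementTree as ET

# Hypothetical agreed-on element list; a real team would settle its own.
AGREED_ELEMENTS = {
    "task", "title", "shortdesc", "taskbody",
    "prereq", "steps", "step", "cmd", "info", "result",
}


def unapproved_elements(topic_xml: str) -> list[str]:
    """Return the sorted names of elements that are not on the agreed list."""
    root = ET.fromstring(topic_xml)
    found = {el.tag for el in root.iter()}
    return sorted(found - AGREED_ELEMENTS)


# Invented sample topic: valid DITA elements, but two were never agreed on.
SAMPLE_TOPIC = """\
<task id="replace-filter">
  <title>Replacing the filter</title>
  <taskbody>
    <steps>
      <step><cmd>Open the <uicontrol>Service</uicontrol> panel.</cmd></step>
      <step>
        <cmd>Remove the old filter.</cmd>
        <stepresult>The housing is empty.</stepresult>
      </step>
    </steps>
  </taskbody>
</task>
"""

print(unapproved_elements(SAMPLE_TOPIC))  # prints ['stepresult', 'uicontrol']
```

A check like this only reports which elements were used. Whether the agreed elements will process the way you expect is still something to establish before tagging starts.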

It is certainly possible to restructure your content in the incumbent DTP application and deliver it that way, and this may be a way to manage the possible rate of change. (Very few writing teams have the uncommitted time or resources to proceed with a complete content conversion as a single step.) There are drawbacks to this, particularly the difficulty with heading levels ("is the topic title an H2 or an H3 style in this document?") which tends to force topic duplication. If you need to proceed through your entire body of content at one time while converting, this might be your only option. If you have the option of fully converting a small portion of your delivered content to DITA, and then another portion, and so on, until everything has been converted, take that approach instead.

Automation Mostly Unhelpful

Content conversion is an exercise in semantic tagging; content that wasn't semantically tagged at the start needs to end up semantically tagged at the end. Semantic tagging as an activity requires answering two questions: what is the function of these words in the topic, and what kind of meaning do these words have? Both are questions of meaning and require a human being to answer.

There are software products available which can enclose the text content of Word or FrameMaker files in DITA (or other XML) element tags, and which are smart enough to do this on the basis of existing styles in the Word or FrameMaker file. This can function as a modest time-saver for a human, but cannot be used without complete human reworking of the results. There are three primary reasons for this:

  1. The semantic value of a style is rarely a 1:1 match with a DITA element.
  2. The program does not, cannot, and should not attempt to re-arrange content to better conform to DITA information typing conventions.
  3. Deciding on the structure of the content delivery (the DITA map) is a separate task from tagging content as XML:
    • the incumbent narrative content will contain multiple instances of one topic's content
    • once content has been converted, multiple maps will reference one topic
    • it requires a human being to sort these issues out!
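
The first reason can be made concrete with a small sketch. The style names and candidate mappings below are hypothetical, invented for illustration rather than taken from any conversion product; the point is only that a style with more than one plausible DITA element, or with none, has to go to a person.

```python
# Hypothetical style names and candidate DITA elements, for illustration only.
STYLE_CANDIDATES = {
    "Heading2": ["title"],
    "Body": ["p", "shortdesc", "context", "info"],
    "NumberedStep": ["cmd"],
    "Note": ["note", "prereq"],
    "Bold": ["uicontrol", "b", "term"],
}


def plan_conversion(styles):
    """Split styles into those a tool could map alone and those a person must decide."""
    automatic, needs_review = {}, {}
    for style in styles:
        candidates = STYLE_CANDIDATES.get(style, [])
        if len(candidates) == 1:
            # Exactly one plausible element: safe to map mechanically.
            automatic[style] = candidates[0]
        else:
            # Several plausible elements, or an unknown style: a human decides.
            needs_review[style] = candidates
    return automatic, needs_review


automatic, needs_review = plan_conversion(
    ["Heading2", "Body", "NumberedStep", "Bold", "Sidebar"]
)
print(automatic)
print(needs_review)
```

In this made-up example only two of the five styles map mechanically. The other three, including the very common body and bold styles, need a decision about meaning, which is the work the software cannot do.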

Since convert-to-DITA-tagging software products are relatively expensive, it's important to weigh the cost-benefit ratio carefully before purchasing one.


Content conversion is necessarily a manual step, with minimal opportunities to exploit automated support. It presents an opportunity to fully restructure your existing content into conformity with DITA information typing, full form/content separation via semantic tagging, and possibly to apply minimalist writing principles at the same time. Since all subsequent content delivery is constrained by the existing content you have available to build on in a high-reuse environment, it is particularly important to have a successful content conversion step when switching to a DITA CMS as your primary means of content delivery if you expect to derive substantial benefit from content re-use.

[1] Topics can and should reference other topics for context and completeness, but the intended audience should not be required to read anything else to extract the unit of information from the stand-alone contents of the topic.
