My last post highlighted the reasons why it is important to maintain clean content. This post will focus on what we call "HTML Addressing", which is a way for us to control the HTML (and other markup) that appears in our text fields, making the content more portable. HTML Addressing has several components to it, some of which are very typical of content management systems, while others are not. These components are described below.
Keep Content Separated into Distinct Fields
It is very important for us to make sure all content is populated into very distinct fields in the database. Some platforms, like blogging systems, have a big text block where images and other asset references are embedded within the text. This not only restricts our ability to modify the display of the image relative to the text, it also prevents us from doing something distinct with the image itself because it is now tightly bound to the text.
Limit HTML to be Allowed in Specific Fields
There is a range of free-text fields in our CMS. Most of these fields do not require any inline markup because the templates take care of all of the display rules. In others, however, such as the teaser and full story text, markup does add to the editorial meaning of the content. As a result, we limited the fields that allow markup to only those that actually need it. To enforce this limitation, we developed a series of JavaScript functions that apply to each field in the CMS to prevent markup from being entered into fields that do not allow HTML.
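As a rough illustration of the per-field rule, here is a minimal Python sketch (the CMS itself does this in JavaScript). The field names, the `FIELD_ALLOWS_HTML` table and the function name are all hypothetical, and the tag detection is deliberately simple.

```python
import re

# Hypothetical field configuration: which free-text fields may contain markup.
FIELD_ALLOWS_HTML = {
    "title": False,
    "byline": False,
    "teaser": True,
    "full_text": True,
}

# Anything that looks like an opening or closing tag.
TAG_LIKE = re.compile(r"</?[a-zA-Z][^<>]*>")


def fields_with_forbidden_markup(story):
    """Return the names of fields that contain markup but must not."""
    return sorted(
        name
        for name, value in story.items()
        # Fields missing from the table are treated as markup-free.
        if not FIELD_ALLOWS_HTML.get(name, False) and TAG_LIKE.search(value)
    )
```

A form handler could refuse to save the story while this list is non-empty.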
Limit HTML Tags in Allowable Fields
For those fields that allow HTML, we only allow very specific tags (and different fields can allow different tags). For example, in some fields, we may allow tags such as <strong> and <em>, but not <b>, <div> or <img> (<b> is deprecated, <div> could introduce too many variables within the context of the pages, and <img> is not allowed because images should be added in their appropriate fields). Finally, some allowable tags, such as <a>, allow parameters to be applied to them, although we do restrict some of the event-based and style-based parameters. Again, our JavaScript functions ensure that only allowable tags are entered into the appropriate fields, that the parameters applied to those tags are viable, and that all tags are closed and nested correctly.
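The checks described above (allowed tags, allowed parameters, correct closing and nesting) can be sketched with Python's standard `html.parser` module. This is an illustration only, not the JavaScript used in the CMS; the `TEASER_RULES` table, the class and the error messages are assumptions.

```python
from html.parser import HTMLParser

# Hypothetical rules for one field: tag name -> allowed parameter names.
# Event-based (onclick) and style-based (style) parameters are simply absent.
TEASER_RULES = {"strong": set(), "em": set(), "a": {"href", "title"}}


class FieldMarkupChecker(HTMLParser):
    """Collects rule violations while parsing one field's text."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.open_tags = []  # stack of tags currently open
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.rules:
            self.errors.append(f"tag <{tag}> is not allowed")
            return
        for name, _value in attrs:
            if name not in self.rules[tag]:
                self.errors.append(f"parameter {name!r} is not allowed on <{tag}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.rules:
            return  # disallowed tags are reported once, at their start tag
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append(f"</{tag}> is unexpected or badly nested")


def check_field(text, rules):
    """Return a list of human-readable problems; empty means the text passes."""
    checker = FieldMarkupChecker(rules)
    checker.feed(text)
    checker.close()
    errors = list(checker.errors)
    errors.extend(f"<{tag}> is never closed" for tag in checker.open_tags)
    return errors
```

Because the rules are an allow-list, any parameter that is not named for a tag is rejected, which covers event and style parameters without listing them.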
Storage of HTML in Database
Upon saving a story in the CMS, the server-side code identifies all markup in the content and pulls all tags out. In pulling the tags out, we capture the "address" of the open and close tags. By "address", we mean the character position in the field at which the tag appears. For example, in the string "this is a <strong>string</strong> of content", the open <strong> tag starts at character 11 and the close tag starts at character 25. So, when saving, we store only "this is a string of content" in the database field, but also put into a relational table the information necessary to reconstruct it with the tags. That information includes the story's unique ID, the unique ID for the field in the database where the tag was found, the character positions of the opening and closing tags (stored as separate records in the database), the unique ID for the tag, and any parameters and values attached to the tag. When I say that we store the unique ID for the tag, it is because we don't actually store the tag itself in this table (for reasons I will describe below in the Benefits section).
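To make the addressing step concrete, here is a minimal Python sketch of pulling tags out of a field and recording their addresses. It is not the server-side code itself. The `TAG_IDS` table, the record layout and the function name are hypothetical. `position` follows the example above (1-based, counted in the marked-up string); `plain_offset` is an extra value assumed here for convenience, giving the 0-based offset in the stripped text.

```python
import re

# Hypothetical tag lookup: tag name -> unique tag ID.
TAG_IDS = {"strong": 1, "em": 2, "a": 3}

# Groups: optional slash, tag name, optional parameter text.
TAG_PATTERN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)((?:\s[^<>]*)?)>")


def extract_addresses(story_id, field_id, marked_up):
    """Return (plain_text, records) for one field of one story."""
    plain_parts = []
    records = []
    last_end = 0   # end of the previous recognized tag in marked_up
    plain_len = 0  # length of the stripped text built so far
    for match in TAG_PATTERN.finditer(marked_up):
        closing, name, params = match.groups()
        name = name.lower()
        if name not in TAG_IDS:
            continue  # unknown markup stays in the text in this sketch
        chunk = marked_up[last_end:match.start()]
        plain_parts.append(chunk)
        plain_len += len(chunk)
        records.append({
            "story_id": story_id,
            "field_id": field_id,
            "tag_id": TAG_IDS[name],
            "kind": "close" if closing else "open",
            "position": match.start() + 1,  # 1-based, in the marked-up string
            "plain_offset": plain_len,      # 0-based, in the stripped text
            "params": params.strip(),
        })
        last_end = match.end()
    plain_parts.append(marked_up[last_end:])
    return "".join(plain_parts), records
```

For the example string, this yields the stripped text "this is a string of content" and two records, one for the open tag at position 11 and one for the close tag at position 25.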
Other Characters
HTML Addressing generally refers to the handling of HTML markup in our content. The functions, however, do more than just HTML. There is a range of characters that also create problems throughout the system, including smart quotes and em dashes. The HTML Addressing functionality does not store the locations of these characters (which are typically added to the content by copying and pasting from Microsoft Word). Rather, upon saving to the database, it replaces them with comparable characters that are more standard. For example, the smart quotes become regular quotes. This list of replacement characters is extensible.
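An extensible replacement list like this can be sketched in Python as a simple lookup table. The smart-quote entries follow the description above; replacing the em dash with two hyphens is an assumption made for the example, as are the names.

```python
# Replacement list: extend by adding entries.
REPLACEMENTS = {
    "\u2018": "'",   # left single smart quote
    "\u2019": "'",   # right single smart quote
    "\u201c": '"',   # left double smart quote
    "\u201d": '"',   # right double smart quote
    "\u2014": "--",  # em dash (assumed replacement)
}

_TABLE = str.maketrans(REPLACEMENTS)


def normalize_characters(text):
    """Replace problem characters with more standard equivalents."""
    return text.translate(_TABLE)
```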
Apply HTML Addressing to the Archive
The functionality described above applies to new stories created in our CMS (or old stories that get resaved). Because our older stories contained a wide array of tags, we also needed to be able to run this functionality against our archive. In doing so, we found tags like <font> that needed treatment. Generally speaking, the rules for the script were to move any tags recognized as valid for the system into the model described above, while all other tags were removed completely. There were some exceptions to this, but that is generally what happened.
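The general archive rule can be sketched as follows in Python. This assumes that "removed" means the tag itself is dropped while the text it enclosed is kept, and it does not model the exceptions mentioned above. The `VALID_TAGS` set and the function name are hypothetical.

```python
import re

# Hypothetical set of tags recognized as valid for the system.
VALID_TAGS = {"strong", "em", "a"}

TAG_PATTERN = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")


def drop_unrecognized_tags(text, valid=VALID_TAGS):
    """Keep recognized tags for later addressing; delete all other tags."""
    def keep_or_drop(match):
        return match.group(0) if match.group(1).lower() in valid else ""
    return TAG_PATTERN.sub(keep_or_drop, text)
```

The tags that survive this pass would then go through the same addressing step as a newly saved story.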
Benefits to HTML Addressing
Because we store the content without the tags, we can then present the content with or without them, depending on the output destination. If the content is getting output to NPR.org, we then recompile the markup into the content based on that relational table. If the content is getting output to podcasts, however, we simply print the raw content without the markup. In the NPR API, you can see an example of the difference between the two in our NPRML format (see the <text> and <textWithHtml> elements).
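Recompiling the markup can be sketched like this in Python. The sketch assumes each address record carries a 0-based offset into the stripped text (`plain_offset`) and that records arrive in document order. Both are assumptions for the example, as are the `TAG_NAMES` table and the record layout.

```python
# Hypothetical tag lookup: unique tag ID -> tag name.
TAG_NAMES = {1: "strong", 2: "em", 3: "a"}


def rebuild_markup(plain_text, records, tag_names=TAG_NAMES):
    """Weave tags back into the stripped text using their stored addresses."""
    parts = []
    cursor = 0
    for rec in records:  # assumed to be in document order
        parts.append(plain_text[cursor:rec["plain_offset"]])
        name = tag_names[rec["tag_id"]]
        if rec["kind"] == "open":
            params = rec.get("params", "")
            parts.append(f"<{name} {params}>" if params else f"<{name}>")
        else:
            parts.append(f"</{name}>")
        cursor = rec["plain_offset"]
    parts.append(plain_text[cursor:])
    return "".join(parts)


plain = "this is a string of content"
records = [
    {"tag_id": 1, "kind": "open", "plain_offset": 10},
    {"tag_id": 1, "kind": "close", "plain_offset": 16},
]
with_html = rebuild_markup(plain, records)
without_html = plain  # destinations such as podcasts get the raw text
```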
Obviously, removing the HTML from the content enables our content to be portable and to be distributed to virtually any platform. But similar goals could be met by storing the tags with the content in the database and running stripping scripts against the data when the output destination requires no HTML. So why go through the trouble of stripping out the HTML when storing it?
For us, the primary differentiating factor in stripping them out upon storage instead of upon presentation is in tag management. If tags need to change (for example, the <b> tag was deprecated in favor of <strong>), we need to make that change to only one field in the entire database. If the tags are in the content, we would need to run scripts that cycle through all fields and all records to make the change. Additionally, if we introduce micro-formats or other markup, we can be sure that they are all handled the same way.
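The single-field change can be illustrated with an in-memory SQLite database in Python. The table and column names are hypothetical, not our actual schema. The point is only that address records refer to a tag by ID, so renaming the tag touches one row no matter how many stories use it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per tag, one row per tag address.
    CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tag_address (
        story_id INTEGER,
        field_id INTEGER,
        tag_id INTEGER REFERENCES tag(tag_id),
        kind TEXT,
        plain_offset INTEGER
    );
    INSERT INTO tag VALUES (1, 'b');
    INSERT INTO tag_address VALUES
        (100, 7, 1, 'open', 10), (100, 7, 1, 'close', 16),
        (101, 7, 1, 'open', 0),  (101, 7, 1, 'close', 4);
""")

# Swap <b> for <strong> everywhere by updating a single row.
cursor = conn.execute("UPDATE tag SET name = 'strong' WHERE name = 'b'")
rows_changed = cursor.rowcount

# Every stored address now resolves to the new tag name.
names_in_use = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT tag.name FROM tag_address JOIN tag USING (tag_id)"
    )
]
```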
Another interesting benefit to this approach is that when we output the content to different platforms, we could actually transform the tags on the fly. For example, since an iPod cannot parse <em> tags, we could print single-quotes in their place for that particular output destination. As different platforms develop their own markup for presentation, we simply need to maintain mappings between HTML and the other presentation markup tags.
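A per-destination mapping of this kind could look like the following Python sketch. The destination names, the `RENDERINGS` table and the record layout (a 0-based `plain_offset` into the stripped text, records in document order) are assumptions made for the example.

```python
# Hypothetical mappings: destination -> tag ID -> (opening text, closing text).
RENDERINGS = {
    "web": {1: ("<strong>", "</strong>"), 2: ("<em>", "</em>")},
    "ipod": {1: ("", ""), 2: ("'", "'")},  # no markup; quotes stand in for <em>
}


def render_for(destination, plain_text, records):
    """Insert each tag's destination-specific text at its stored address."""
    mapping = RENDERINGS[destination]
    parts = []
    cursor = 0
    for rec in records:  # assumed to be in document order
        parts.append(plain_text[cursor:rec["plain_offset"]])
        opening, closing = mapping[rec["tag_id"]]
        parts.append(opening if rec["kind"] == "open" else closing)
        cursor = rec["plain_offset"]
    parts.append(plain_text[cursor:])
    return "".join(parts)


plain = "this is a string of content"
records = [
    {"tag_id": 2, "kind": "open", "plain_offset": 10},
    {"tag_id": 2, "kind": "close", "plain_offset": 16},
]
web_output = render_for("web", plain, records)
ipod_output = render_for("ipod", plain, records)
```

Supporting a new platform then means adding one more entry to the mapping table rather than touching stored content.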
Detriments to HTML Addressing
The main problem with HTML Addressing is that the default action for building pages on NPR.org (currently the primary destination for the content) requires the most processing. That is, when we render pages for NPR.org, we have to do all the work of adding the HTML back into the content. We mitigate this by burning the content into our XML repository and serving the pages to users through our caching layers. The XML repository contains the compiled output for both the marked-up and unmarked-up content, so when the templates render the output, they simply need to pull from the appropriate fields. Moreover, the caching layer ensures that performance is optimized regardless of the output format.
There are obviously many different ways to skin this cat, but this approach has provided us with cleaner and more portable content. This is not to suggest that our system's content is flawless in its portability. There are more challenges than just markup in content that make portability difficult, many of which we deal with on a daily basis. But eliminating markup from the content is a big step in achieving the goal.
--Daniel Jacobson
categories: Technology

