Inside NPR.org


Friday, February 13, 2009

As a follow-up to my post back in November, a number of upcoming events offer a chance to hear from and meet folks from NPR Digital Media firsthand.

Feb. 19-21 - IMA Public Media 09

For those in the public broadcasting family, we hope to see you next week at IMA Public Media 09. It is scheduled to be five idea-filled days with 55 sessions on everything from coding widgets and measuring impact to mastering the subtle techniques of building online communities.

  • As part of the WebTech Summit on Wednesday 2/18, I will be appearing on a panel to discuss mobile site development. The Mobile Site Development panel includes me; Ann Breckbill and Melinda Driscoll, American Public Media; Keith Hopper, NPR/Public Interactive; Matt MacDonald, PRX; and the inimitable Doc Searls.

  • Later on Wednesday I will be discussing the development and use of widgets and APIs. John Tynan and I will join Andrew Kuklewicz to cover topics such as how to develop widgets, and how and when to pull data from other sites' APIs or place widgets on your own sites.

  • As part of the core conference, our Social Media Guru Andy Carvin will participate in several presentations. On Thursday 2/19 he will be on a panel discussing "A Social Media How To: Choosing and Using the Right Tools."

  • On Friday 2/20 you will have a chance to meet our new boss here at NPR: CEO Vivian Schiller will be addressing the conference at 2 p.m. Also on Friday, Andy will be on a panel discussing "Social Media: What's Worked and Lessons Learned."

Upcoming:

On Wednesday 2/26 I will be presenting NPR's revolutionary work in digital media as part of the We Media game changers conference.

On March 15th Dan Jacobson will take part in a discussion at the SXSW Interactive festival about APIs and the changing face of news online.

Finally, on April 3rd I will be at the O'Reilly Web 2.0 Expo, joining Robin Sloan, the VP of strategy at Current, a media company co-founded by Al Gore and Joel Hyatt. We'll be discussing the future of content distribution from both TV and radio perspectives.

We hope to see you soon!

-- Zach Brand

categories: Administrative Stuff

11:34 - February 13, 2009

 
Monday, February 9, 2009

My last post highlighted the reasons why it is important to maintain clean content. This post will focus on what we call "HTML Addressing", which is a way for us to control the HTML (and other markup) that appears in our text fields, making the content more portable. HTML Addressing has several components to it, some of which are very typical of content management systems, while others are not. These components are described below.

Keep Content Separated into Distinct Fields
It is very important for us to make sure all content is populated into very distinct fields in the database. Some platforms, like blogging systems, have a big text block where images and other asset references are embedded within the text. This not only restricts our ability to modify the display of the image relative to the text, it also prevents us from doing something distinct with the image itself because it is now tightly bound to the text.
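
As a rough illustration of why distinct fields help, here is a minimal Python sketch. The story record and its field names (`title`, `teaser`, `image_url`) are hypothetical and are not taken from the NPR CMS; the point is only that two outputs can treat the image differently because it is not embedded in the text.

```python
from html import escape

# Hypothetical story record; the field names are illustrative only.
story = {
    "title": "Example headline",
    "teaser": "A short teaser.",
    "image_url": "https://example.org/photo.jpg",
}

def render_image_first(story):
    """One possible page layout: the image sits above the teaser."""
    return (
        f'<h1>{escape(story["title"])}</h1>'
        f'<img src="{escape(story["image_url"])}" alt="">'
        f'<p>{escape(story["teaser"])}</p>'
    )

def render_text_only(story):
    """A text-only destination simply ignores the image field."""
    return f'{story["title"]}\n{story["teaser"]}'
```

If the image reference lived inside the teaser text, the second function would have to find and cut it out before it could produce clean output.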

Limit HTML to be Allowed in Specific Fields
There is a range of free-text fields in our CMS. Most of these fields do not require any inline markup because the templates take care of all of the display rules. Others, like the teaser and full story text, however, do have instances where markup adds to the editorial meaning of the content. As a result, we limited the fields that allow markup to only those that actually need it. To enforce this limitation, we developed a series of JavaScript functions that apply to each field in the CMS to prevent markup from being entered into fields that do not allow HTML.
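
The field-level check can be sketched as follows. The real functions are JavaScript running in the CMS; this sketch uses Python purely to show the logic, and the field names and the tag-detection pattern are assumptions.

```python
import re

# Hypothetical field names; the post does not list which fields allow markup.
FIELDS_ALLOWING_HTML = {"teaser", "full_text"}

# Matches anything that looks like an opening, closing, or self-closing tag.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^>]*>")

def field_accepts(field_name, value):
    """Return True if the value may be saved in the given field."""
    if field_name in FIELDS_ALLOWING_HTML:
        return True  # tag-level rules for these fields are checked separately
    return TAG_PATTERN.search(value) is None
```

A plain "less than" sign followed by a space or digit is not treated as markup, so a title such as "3 < 5" would still be accepted.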

Limit HTML Tags in Allowable Fields
For those fields that allow HTML, we allow only very specific tags (and different fields can allow different tags). For example, in some fields, we may allow tags such as <strong> and <em>, but not <b>, <div> or <img> (<b> is presentational rather than semantic, <div> could introduce too many variables within the context of the pages, and <img> is not allowed because images should be added in their appropriate fields). Finally, some allowable tags, such as <a>, allow parameters to be applied to them, although we do restrict some of the event-based and style-based parameters. Again, our JavaScript functions ensure that only allowable tags are entered into the appropriate fields, that the parameters applied to those tags are valid, and that all tags are closed and nested correctly.
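
A minimal sketch of these three checks (allowed tags, allowed parameters, correct nesting) might look like this. It is written in Python for illustration rather than in the JavaScript the CMS uses, and the whitelist contents, field names and message wording are all assumptions.

```python
import re

# Hypothetical per-field whitelist: tag name -> allowed parameter names.
ALLOWED = {
    "teaser": {"strong": set(), "em": set()},
    "full_text": {"strong": set(), "em": set(), "a": {"href", "title"}},
}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")
# Consumes each parameter together with its value, so an "=" inside a
# quoted value (for example in a URL) is not mistaken for a parameter.
ATTR = re.compile(r"""([A-Za-z-]+)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""")

def validate(field, text):
    """Return a list of problems; an empty list means the markup is acceptable."""
    allowed = ALLOWED.get(field, {})
    problems, open_tags = [], []
    for closing, name, attrs in TAG.findall(text):
        name = name.lower()
        if name not in allowed:
            problems.append(f"<{name}> is not allowed in {field}")
            continue
        if closing:
            if not open_tags or open_tags[-1] != name:
                problems.append(f"</{name}> is not nested correctly")
            else:
                open_tags.pop()
            continue
        for attr in ATTR.findall(attrs):
            if attr.lower() not in allowed[name]:
                problems.append(f"attribute {attr} is not allowed on <{name}>")
        open_tags.append(name)
    problems.extend(f"<{name}> is never closed" for name in open_tags)
    return problems
```

Event-based parameters such as `onclick` are rejected here simply because they are absent from the whitelist, which is one way to implement the restriction described above.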

Storage of HTML in Database
Upon saving a story in the CMS, the server-side code identifies all markup in the content and pulls all tags out. In pulling the tags out, we capture the "address" of the open and close tags. By "address", we mean the character position in the field at which the tag appears. For example, in the string "this is a <strong>string</strong> of content", the open <strong> tag starts at character 11 and the close tag starts at character 25. So, when saving, we store only "this is a string of content" in the database field, but also put into a relational table the information necessary to reconstruct it with the tags. That information includes the story's unique ID, the unique ID for the field in the database where the tag was found, the character positions of the opening and closing tags (stored as separate records in the database), the unique ID for the tag, and any parameters and values attached to the tag. We store the unique ID for the tag rather than the tag itself, for reasons I describe below in the Benefits section.
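
The save step can be sketched in Python as below. The record layout, the `TAG_IDS` lookup and the function name are hypothetical; positions are 1-based offsets into the marked-up text, which is the convention implied by the example in the paragraph above.

```python
import re
from dataclasses import dataclass

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")

# Hypothetical tag lookup table: tag name -> unique tag ID.
TAG_IDS = {"strong": 1, "em": 2, "a": 3}

@dataclass
class TagAddress:
    story_id: int
    field_id: int
    position: int    # 1-based character position in the marked-up text
    tag_id: int      # ID of the tag, not the tag itself
    is_close: bool
    parameters: str  # raw parameter text, empty if none

def strip_and_address(story_id, field_id, text):
    """Return (plain_text, addresses) for one field of one story.

    Assumes the text has already been validated, so every tag is known.
    """
    addresses = []
    for match in TAG.finditer(text):
        closing, name, params = match.groups()
        addresses.append(TagAddress(
            story_id, field_id, match.start() + 1,
            TAG_IDS[name.lower()], bool(closing), params.strip(),
        ))
    return TAG.sub("", text), addresses
```

Running it on the example string yields the plain text "this is a string of content" and two records, at positions 11 and 25.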

Other Characters
HTML Addressing generally refers to the handling of HTML markup in our content. The functions, however, handle more than just HTML. A range of other characters also creates problems throughout the system, including smart quotes and em dashes (typically introduced by copying and pasting from Microsoft Word). The HTML Addressing functionality does not store the locations of these characters. Rather, upon saving to the database, it replaces them with comparable, more standard characters. For example, smart quotes become regular quotes. The list of replacement characters is extensible.
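
An extensible replacement list of this kind could be sketched as a simple lookup table. The quote mappings follow the example above; the replacement chosen for the em dash is an assumption, since the post does not say what it becomes.

```python
# Replacement table; extend it by adding entries.
REPLACEMENTS = {
    "\u2018": "'",   # left single smart quote
    "\u2019": "'",   # right single smart quote
    "\u201c": '"',   # left double smart quote
    "\u201d": '"',   # right double smart quote
    "\u2014": "--",  # em dash (replacement is an assumption)
}

_TABLE = str.maketrans(REPLACEMENTS)

def normalize_characters(text):
    """Replace word-processor characters with plainer equivalents."""
    return text.translate(_TABLE)
```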

Apply HTML Addressing to the Archive
The functionality described above applies to new stories created in our CMS (or old stories that get resaved). Because our older stories contained a wide array of tags, we also needed to be able to run this functionality against our archive. In doing so, we found tags like <font> that needed treatment. Generally speaking, the rules for the script were to move any tags recognized as valid for the system into the model described above, while all other tags were removed completely. There were some exceptions to this, but that is generally what happened.
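
The general rule for the archive (keep recognized tags, drop everything else) can be sketched like this. The set of valid tags is hypothetical, and the exceptions mentioned above are not modeled.

```python
import re

# Hypothetical set of tags the system recognizes as valid.
VALID_TAGS = {"strong", "em", "a"}

TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")

def drop_unrecognized_tags(text):
    """Remove every tag whose name is not in VALID_TAGS, keeping its inner text."""
    def keep_or_drop(match):
        return match.group(0) if match.group(1).lower() in VALID_TAGS else ""
    return TAG.sub(keep_or_drop, text)
```

The surviving tags would then go through the same strip-and-address step as newly saved stories.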


Benefits to HTML Addressing
Because we store the content without the tags, we can then present the content with or without them, depending on the output destination. If the content is getting output to NPR.org, we then recompile the markup into the content based on that relational table. If the content is getting output to podcasts, however, we simply print the raw content without the markup. In the NPR API, you can see an example of the difference between the two in our NPRML format (see the <text> and <textWithHtml> elements).
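
Recompiling the markup is the reverse of the save step. The sketch below assumes each stored record has already been resolved to a `(position, tag_string)` pair, with 1-based positions in the marked-up text; the function names and the destination labels are illustrative.

```python
def rebuild(plain, tags):
    """Reinsert tags into plain text.

    tags: iterable of (position, tag_string), where position is the
    1-based character position of the tag in the marked-up text.
    """
    out = []
    consumed = 0    # characters of plain text copied so far
    rebuilt = 0     # length of the marked-up text produced so far
    for position, tag in sorted(tags):
        take = (position - 1) - rebuilt  # plain characters before this tag
        out.append(plain[consumed:consumed + take])
        out.append(tag)
        consumed += take
        rebuilt += take + len(tag)
    out.append(plain[consumed:])
    return "".join(out)

def render_for(destination, plain, tags):
    """Web output gets the markup back; other destinations get raw text."""
    return rebuild(plain, tags) if destination == "npr.org" else plain
```

For example, `rebuild("this is a string of content", [(11, "<strong>"), (25, "</strong>")])` returns the original marked-up string.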

Obviously, removing the HTML from the content enables our content to be portable and to be distributed to virtually any platform. But similar goals could be met by storing the tags with the content in the database and running stripping scripts against the data when the output destination requires no HTML. So why go through the trouble of stripping out the HTML when storing it?

For us, the primary difference between stripping the tags out upon storage and stripping them out upon presentation is in tag management. If a tag needs to change (for example, replacing <b> with <strong>), we need to make that change to only one field in the entire database. If the tags were in the content, we would need to run scripts that cycle through all fields and all records to make the change. Additionally, if we introduce microformats or other markup, we can be sure that they are all handled the same way.
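
To make the "one field" point concrete, here is a small sketch using an in-memory SQLite database. The table and column names are hypothetical; the post does not describe the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per tag, one row per tag address.
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tag_addresses (
        story_id INTEGER, field_id INTEGER, position INTEGER,
        tag_id INTEGER REFERENCES tags(id), is_close INTEGER
    );
    INSERT INTO tags VALUES (1, 'b');
    INSERT INTO tag_addresses VALUES
        (100, 7, 11, 1, 0), (100, 7, 20, 1, 1),
        (101, 7, 5, 1, 0), (101, 7, 14, 1, 1);
""")

def tag_names_in_use(conn):
    """Names that would be emitted when stories are recompiled."""
    rows = conn.execute(
        "SELECT DISTINCT t.name FROM tag_addresses a "
        "JOIN tags t ON t.id = a.tag_id"
    )
    return {name for (name,) in rows}

before = tag_names_in_use(conn)
# Changing the tag for every story touches exactly one row.
conn.execute("UPDATE tags SET name = 'strong' WHERE id = 1")
after = tag_names_in_use(conn)
```

Because the address rows reference the tag by ID, none of them needs to be rewritten when the tag name changes.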

Another interesting benefit to this approach is that when we output the content to different platforms, we could actually transform the tags on the fly. For example, since an iPod cannot parse <em> tags, we could print single-quotes in their place for that particular output destination. As different platforms develop their own markup for presentation, we simply need to maintain mappings between HTML and the other presentation markup tags.
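
A sketch of such a mapping follows. The destination names, the choice to drop <strong> on the iPod, and the simplification that tags carry no parameters are all assumptions; positions are again 1-based offsets in the original HTML text.

```python
# Hypothetical per-destination markup: tag name -> (open, close) strings.
MARKUP = {
    "npr.org": {"em": ("<em>", "</em>"), "strong": ("<strong>", "</strong>")},
    "ipod": {"em": ("'", "'"), "strong": ("", "")},
}

def render(plain, tags, destination):
    """Rebuild text with destination-specific markup.

    tags: iterable of (position, name, is_close); position is the 1-based
    character position of the tag in the original HTML text.
    """
    mapping = MARKUP[destination]
    out, consumed, html_len = [], 0, 0
    for position, name, is_close in sorted(tags):
        take = (position - 1) - html_len  # plain characters before this tag
        out.append(plain[consumed:consumed + take])
        out.append(mapping[name][1 if is_close else 0])
        consumed += take
        # Track offsets in terms of the original HTML tag length,
        # not the length of whatever was substituted for it.
        html_len += take + len(f"</{name}>" if is_close else f"<{name}>")
    out.append(plain[consumed:])
    return "".join(out)
```

Supporting another platform then means adding one more entry to the mapping rather than touching the stored content.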

Detriments to HTML Addressing
The main problem with HTML Addressing is that the default action for building pages on NPR.org (currently, the primary destination for the content) requires the most work. That is, when we render pages for NPR.org, we have to do all the processing to add the HTML back into the content. We mitigate this by burning the content into our XML repository and serving the pages to users through our caching layers. The XML repository contains the compiled output for both the marked-up and unmarked-up content, so when the templates render the output, they simply need to pull from the appropriate fields. Moreover, the caching layer ensures that the performance is optimized regardless of the output format.
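
The idea of precomputing both renditions can be sketched as follows. The element names `text` and `textWithHtml` are borrowed from the NPRML elements mentioned earlier; the `story` element, its `id` attribute and the function names are hypothetical, since the post does not describe the repository's actual format.

```python
import xml.etree.ElementTree as ET

def burn_story(story_id, plain_text, text_with_html):
    """Serialize both precomputed renditions of a field into one XML record."""
    story = ET.Element("story", {"id": str(story_id)})
    ET.SubElement(story, "text").text = plain_text
    ET.SubElement(story, "textWithHtml").text = text_with_html
    return ET.tostring(story, encoding="unicode")

def read_field(xml_record, want_html):
    """What a template would do at render time: pick a field, no recompiling."""
    story = ET.fromstring(xml_record)
    return story.findtext("textWithHtml" if want_html else "text")
```

The expensive recompilation happens once, when the record is written, and every later read is a plain field lookup.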


There are obviously many different ways to skin this cat, but this approach has provided us with cleaner and more portable content. This is not to suggest that our system's content is flawless in its portability. There are more challenges than just markup in content that make portability difficult, many of which we deal with on a daily basis. But eliminating markup from the content is a big step in achieving the goal.
--Daniel Jacobson

categories: Technology

1:26 - February 9, 2009

 
Wednesday, February 4, 2009

As discussed in my earlier post about the strategy of building the API, one of the most important things for content producers is to remain relevant to their users. With content becoming more readily available to these users through distribution channels, it is up to these content producers to make sure the content is where the users are. That does not diminish the need for continued development and maintenance of the Web site; it just means that it is equally important to distribute the content to other places. In order for that to happen, the content has to be portable. So, what does it mean for content to be portable? For NPR Digital Media, one of the primary philosophies driving our systems is Create Once, Publish Everywhere (COPE). To achieve COPE, here are some key principles that we adhere to:

Develop Content Management Tools, not Web Publishing Tools
Most content management systems for the online world are used to create Web pages. That said, the Web page is just one possible output for the content (albeit, an important one). In building our CMS at NPR, our goal was to make sure the tool could publish to anything, including NPR.org. If our focus did not consider other platforms, we could have ended up with a Web publishing system that binds the content too closely to the Web site itself.

Separate Content from Display
As mentioned above, if the content is too closely tied to a specific display, it cannot easily be pushed out to other platforms. Good separation, in addition to facilitating content portability, also makes redesigns of the Web site or alternate presentations of the site's content easier. For example, because our content is separate from our display, we were able to launch the NPR Music site without refactoring the system architecture or our presentation layer code.

Eliminate markup from content
Because we do not know where the content will ultimately end up, it is important to not have platform-specific markup embedded in the content. For example, iPods cannot parse HTML, so we need to make sure our content gets distributed to iPods without tags in it, while the same content must contain the tags for NPR.org.

Like many other content management systems, ours captures and stores content in a central database that is completely independent from any presentation layer (I discuss our architecture in this earlier post). For the content to really be portable, however, it needs to work on any platform, including browsers, RSS readers, iPods, radio displays, mobile devices, TVs, etc., which means we must eliminate markup from content, as described above. To solve this problem, we introduced some functionality to our system that we call "HTML Addressing", which will be the topic of my next post.
--Daniel Jacobson

categories: Technology

11:10 - February 4, 2009

 

About Inside NPR.org

Ever wanted to peer under the hood and learn about the inner workings of the NPR website? Have we got a blog for you, then. Here at Inside NPR.org, the NPR Digital Media team will keep you up-to-date on digital products and services we're developing, including social networking tools and our media player. For more info, please see our FAQ and our discussion rules.

Contact us

Got a question or comment you want to send to us privately? Use our contact form.