Inside NPR.org

API

API Decisions : Why Did We Create It?

As promised, I wanted to give some history about how we ended up creating the NPR API. The first major decision we faced was whether to open up our API. The decision was not whether to build it, as we'd already done that. Back in November 2007, we built the foundation of the API to launch with NPR Music. It is basically an XML file repository (essentially in an extended NPRML format) that contains all of the data needed to build pages on NPR.org. In addition to the XML repository, it includes a PHP framework that renders the XML files to the appropriate presentation layer (these layers include NPR.org as well as RSS feeds, podcast feeds, mobile sites and other outputs that we serve). Here is a diagram of the architecture, which also includes all of the caching layers, some of which were added with the actual release of the public API:



[Diagram: NPR API architecture, including the caching layers]
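To make the repository-plus-renderer idea concrete, here is a minimal sketch in Python (the actual framework is PHP, and none of this is NPR's code). The element names, the `STORY_XML` sample, and the `render_html` / `render_rss_item` functions are all hypothetical; real NPRML is richer than this. The point is only that one flat XML document can feed several presentation layers through small, similar renderers.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical story document; element names are illustrative, not real NPRML.
STORY_XML = """
<story id="123">
  <title>Example Story</title>
  <teaser>A short summary.</teaser>
  <link>http://example.org/story/123</link>
</story>
"""

def parse_story(xml_text):
    """Read the flat XML document into a plain dict of fields."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "title": root.findtext("title", default=""),
        "teaser": root.findtext("teaser", default=""),
        "link": root.findtext("link", default=""),
    }

def render_html(story):
    """Presentation layer 1: a fragment of a web page."""
    return "<h1>{}</h1>\n<p>{}</p>".format(
        escape(story["title"]), escape(story["teaser"]))

def render_rss_item(story):
    """Presentation layer 2: a single RSS item."""
    return (
        "<item><title>{}</title><link>{}</link>"
        "<description>{}</description></item>"
    ).format(escape(story["title"]), escape(story["link"]),
             escape(story["teaser"]))

# Adding an output format means adding one entry here (assumed design).
RENDERERS = {"html": render_html, "rss": render_rss_item}

def render(xml_text, output):
    """Render the same XML source to the requested presentation layer."""
    return RENDERERS[output](parse_story(xml_text))
```

Calling `render(STORY_XML, "html")` and `render(STORY_XML, "rss")` reads the same source document both times; only the final formatting step differs.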

There are several reasons for this architectural approach:

1. PERFORMANCE : Requests first go through the Memcache and file cache layers, which are always the most efficient way to serve a document. If the requested document is not cached, PHP renders the output from the XML files. If the XML file cannot be obtained, PHP accesses the database for the data. If PHP hits the database, however, a version of the result is stored back in Memcache to speed up delivery of the next request. This ultimately takes strain off of the database, since querying it is the most expensive operation in serving documents.

2. ABSTRACTION : Creating a separate layer between the various presentations and the actual database allows the presentation layers to be agnostic with respect to the data repository. Currently, our database is Oracle, but if we want to move to MySQL, the presentation layers don't really care because they are served primarily off of the XML repository (although the final fail-over to the database would require changes).

3. SIMPLIFICATION : The database itself is a complicated relational system. The schema is largely normalized for scalability and efficiency in our write operations. Building pages, as a result, requires expensive table joins across very tall tables. These queries, although tuned, add up when you consider how many queries there are throughout a story page, for example. Executing these queries once and storing the data in a flatter file system enables the pages to be built more efficiently (both because of the flatter model as well as not having to access the database).

4. SCALABILITY : Because of the rendering framework, we are able to easily add new transformation and presentation layers without having to write a lot of extra code or customized database queries. The rendering engine knows how to handle the XML files in a cohesive way because they are relatively flat, so the transformation layers really aren't that different from each other. The framework also allows for reuse of code in the presentation layers because most of the presentations deal with the same content and display that content in similar ways. New presentations for NPR.org are the hardest because of all of the design nuances, but adding Atom and MediaRSS is pretty quick and painless. The difficult part is figuring out how to map our fields to those structures, not coding it.
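The lookup order described under PERFORMANCE can be sketched roughly as follows. This is an illustration in Python, not the PHP framework itself: the `TieredStore` class and its parameters are hypothetical, a plain dict stands in for Memcache, the file cache layer is left out for brevity, and caching the output after every render (not only after a database hit) is an assumption.

```python
class TieredStore:
    """Serve documents from the cheapest layer that has them."""

    def __init__(self, xml_repo, database, render):
        self.memcache = {}        # stand-in for Memcache: key -> rendered output
        self.xml_repo = xml_repo  # stand-in for the XML repository: key -> XML text
        self.database = database  # callable: key -> XML text (the expensive path)
        self.render = render      # callable: XML text -> rendered output

    def get(self, key):
        # 1. Cheapest path: the rendered output is already cached.
        if key in self.memcache:
            return self.memcache[key]

        # 2. Otherwise render from the flat XML file, if there is one.
        xml_text = self.xml_repo.get(key)

        # 3. Last resort: fall over to the database.
        if xml_text is None:
            xml_text = self.database(key)

        output = self.render(xml_text)

        # Store the result so the next request never reaches the database.
        # (Assumption: we cache after every render, not only after a DB hit.)
        self.memcache[key] = output
        return output
```

With this shape, a second request for the same key is answered from the in-memory cache, so the database callable runs at most once per uncached key.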

So, the system was largely in place almost a year ago, which removed many of the technical hurdles in building an API. We knew that if we wanted to open the API up to the world we would still have some technical challenges left, including the filtering engines, the registration engine, the query generator, etc. Before getting to those tasks, however, we needed to determine whether a public API fit with the overall NPR strategy.

Comments

 


Hi Daniel! Nice post.

I like how you're right up front about filtering content to protect the owner's / author's rights.

Tell me, what was it about the NPR Music site that made you first think about creating these layers - was it a scalability thing? Was it that their metadata was so different from news stories?

I like how the diagram is open for transforming NPR content into as yet undefined formats; PBCore being a likely candidate, I'm sure. Along those lines, how accommodating will NPR.org be of microformats? (I say as I visit NPR.org with a keen eye on the Operator plugin for Firefox.)

I'm interested to hear at what point the NPR API enabled people's thinking towards the ideas outlined in NPR's Community Building Initiative (CBI):
http://technology360.typepad.com/technology360/2008/09/nprs-digital-di.html

Or, did the NPR CBI come before the API? Was there an imperative to create the API as a result of the NPR CBI?

Sent by John Tynan | 7:01 PM | 9-19-2008

John, the reason that we did the architecture changes when we were developing NPR Music was that the music site gave us a good opportunity to step back and re-evaluate our existing code base. There were many new pages and much new functionality required for music, so we were able to do extensive refactoring. It was easier to do the architecture changes with a "clean slate" for the music pages than it would have been if we had tried to adapt the existing site. Once we got it right for music, it was easier to port the existing news pages into the new architecture.

Sent by Harold Neal | 2:54 PM | 9-23-2008
