As promised, I wanted to give some history about how we ended up creating the NPR API. The first major decision we faced was whether to open up our API. The question was not whether to build it, as we'd already done that. Back in November 2007, we built the foundation of the API to launch with NPR Music. That foundation is basically an XML file repository (essentially in an extended NPRML format) that contains all of the data needed to build pages on NPR.org. In addition to the XML repository, it includes a PHP framework used to render the XML files to the appropriate presentation layer (these layers include NPR.org as well as RSS feeds, podcast feeds, mobile sites and other outputs that we serve). Here is a diagram of the architecture, which includes all of the caching layers as well, some of which were incorporated with the actual release of the public API:
There are several reasons for this architectural approach:
1. PERFORMANCE : Requests first go through the Memcache and file cache layers, which are always the most efficient path. If the requested document is not in Memcache, PHP renders the output from the XML files. If the XML file cannot be obtained, PHP falls back to the database for the data. When PHP does hit the database, however, a version of the response is stored back in Memcache to speed up the delivery of the next request. This ultimately takes strain off of the database, since querying it is the most expensive operation in serving documents.
2. ABSTRACTION : Creating a separate layer between the various presentations and the actual database allows the presentation layers to be agnostic with respect to the data repository. Currently, our database is Oracle, but if we want to move to MySQL, the presentation layers don't really care because they are served primarily off of the XML repository (although the final fail-over to the database would require changes).
3. SIMPLIFICATION : The database itself is a complicated relational system. The schema is largely normalized for scalability and efficiency in our write operations. Building pages, as a result, requires expensive table joins across very tall tables. These queries, although tuned, add up when you consider how many queries there are throughout a story page, for example. Executing these queries once and storing the data in a flatter file system enables the pages to be built more efficiently (both because of the flatter model as well as not having to access the database).
4. SCALABILITY : Because of the rendering framework, we are able to easily add new transformation and presentation layers without having to write a lot of extra code or customized database queries. The rendering engine knows how to handle the XML files in a cohesive way because they are relatively flat, so the transformation layers really aren't that different from each other. The framework also allows for reuse of code in the presentation layers because most of the presentations deal with the same content and display that content in similar ways. New presentations for NPR.org are the hardest because of all of the design nuances, but adding Atom and MediaRSS is pretty quick and painless. The difficult part is figuring out how to map our fields to those structures, not writing the code.
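To make the lookup order from the PERFORMANCE point concrete, here is a minimal sketch in Python (our framework is PHP; this is only an illustration, not our actual code). Every name in it is hypothetical: `TieredStore`, `render`, and `query_database` are stand-ins, a plain dictionary plays the role of Memcache, and another dictionary plays the role of the XML repository. The sketch also assumes that output rendered from an XML file is cached the same way as output built from a database fallback.

```python
from typing import Callable, Optional


class TieredStore:
    """Hypothetical cache -> XML repository -> database lookup chain."""

    def __init__(
        self,
        xml_repo: dict[str, str],
        query_database: Callable[[str], Optional[str]],
    ) -> None:
        self.cache: dict[str, str] = {}  # stand-in for Memcache
        self.xml_repo = xml_repo  # stand-in for the XML file repository
        self.query_database = query_database  # stand-in for the database query
        self.database_hits = 0  # counts how often the fallback was needed

    def render(self, xml: str) -> str:
        # Placeholder for a presentation layer (NPR.org page, RSS, and so on).
        return f"<html>{xml}</html>"

    def get(self, story_id: str) -> Optional[str]:
        # 1. Cheapest path: the rendered document is already cached.
        cached = self.cache.get(story_id)
        if cached is not None:
            return cached

        # 2. Next: render from the flat XML file if one exists.
        xml = self.xml_repo.get(story_id)

        # 3. Last resort: go to the database, the most expensive step.
        if xml is None:
            self.database_hits += 1
            xml = self.query_database(story_id)
            if xml is None:
                return None  # the story does not exist anywhere

        # Store the rendered output so the next request stays in the cache.
        output = self.render(xml)
        self.cache[story_id] = output
        return output
```

Calling `get` twice for the same story only reaches the database (or the XML repository) the first time; the second call is answered from the cache, which is the behavior that takes strain off of the database.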
So, the system was largely in place almost a year ago, alleviating many of the technical hurdles in building an API. We knew that if we wanted to open the API up to the world we would still have some technical challenges left, including filtering engines, the registration engine, the query generator, etc. Before getting to those tasks, however, we needed to determine whether a public API fit with the overall NPR strategy.
-- Daniel Jacobson