Building the Ingest System

Last week we added KQED as our third partner to the Ingest System Pilot. The Ingest System allows our partner stations to publish stories to our content repository so that they can be distributed through our Story API. In this blog post, I will explain some of the technical details about how the Ingest System works.

The Ingest System currently accepts stories in two XML data formats. We accept a limited subset of NPRML, our own home-grown way of expressing story content in XML. We also accept RSS with MediaRSS and NPRML extensions for audio, images, and other resources. Currently, we are focused on audio, images, text, and related links, but we plan to add the ability to ingest video and other resource types later. Ingest partners post story documents in one of the two accepted formats via HTTP to the Ingest System's URL. In addition to creating new stories, partners can update stories they have previously posted or, if necessary, delete them. The Ingest System is designed to be RESTful, so we determine what action to take based on the HTTP method (GET, PUT, POST, or DELETE) used in the request.
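As a rough illustration of choosing an action from the HTTP method, here is a minimal sketch. It is written in Python for brevity rather than the PHP the Ingest System uses, and the method-to-action mapping and the names `ACTIONS` and `action_for` are assumptions, not the system's actual routing.

```python
# Hypothetical mapping from HTTP method to ingest action (an assumption,
# not the Ingest System's actual routing table).
ACTIONS = {
    "GET": "retrieve",
    "POST": "create",
    "PUT": "update",
    "DELETE": "delete",
}


def action_for(method: str) -> str:
    """Return the action implied by an HTTP method, or raise ValueError."""
    try:
        return ACTIONS[method.upper()]
    except KeyError:
        raise ValueError(f"unsupported HTTP method: {method}") from None
```

Keeping the mapping in one table means the URL stays the same for every operation and only the method changes, which is the point of a RESTful design.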

While implementing the Ingest System, we faced an interesting technical challenge. Our APIs, including the components we use to authenticate API access, are written in PHP, while our content management system (CMS) is written in Java. All persistence of story content to the database is done through well-tested Java components that are part of the CMS. Having a single path of entry for this data makes it much easier to manage and debug data issues. Normally, editors create content in the CMS, which persists the data to the database through these Java components. From the database, we create a file with an XML version of the Story Model, a data structure that contains all the information needed to represent the story. Later, when you view a page on our website or make a Story API query, the data is assembled from the XML files or, if necessary, read from read-only versions of the database by the PHP code, which then renders the content.
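The read path described here, XML file first and database only if necessary, can be sketched as a simple fallback. This is Python rather than the PHP that does the real work, and `load_story`, `read_xml_file`, and `read_database` are hypothetical names used only for illustration.

```python
def load_story(story_id, read_xml_file, read_database):
    """Prefer the pre-generated XML file; fall back to the read-only database.

    Both readers are passed in as callables so the sketch stays self-contained.
    """
    try:
        return read_xml_file(story_id)
    except FileNotFoundError:
        # No XML file for this story, so read it from the database instead.
        return read_database(story_id)
```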

Ingest Process Flow. JSON is used to communicate across systems and programming languages. Harold Neal/NPR

For the Ingest System, we had to take a different path. Control of the ingest process starts in PHP, where we perform authentication of the request. Next we transform the partner's input document into one or more Story Models. While we currently only support this transformation for two document formats, we could easily add other document formats in the future. I found that doing these transformations in PHP was much easier than in Java due to the excellent SimpleXML library that is available in PHP.
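To give a feel for this transformation step, the sketch below turns RSS items with MediaRSS extensions into simplified story dictionaries. It uses Python's standard `xml.etree.ElementTree` in place of PHP's SimpleXML, and the function name and the dictionary fields (`title`, `text`, `link`, `media`) are assumptions rather than the real Story Model.

```python
import xml.etree.ElementTree as ET

# Namespace used by MediaRSS elements such as <media:content>.
MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}


def rss_to_story_models(rss_text: str) -> list:
    """Turn each RSS <item> into a simplified, hypothetical Story Model dict."""
    root = ET.fromstring(rss_text)
    stories = []
    for item in root.iter("item"):
        stories.append({
            "title": item.findtext("title", default="").strip(),
            "text": item.findtext("description", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # <media:content url="..."> elements carry audio and images.
            "media": [m.get("url") for m in item.findall("media:content", MEDIA_NS)],
        })
    return stories
```

One input document can yield several stories, which is why the function returns a list.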

Once the partner's document has been transformed into the Story Model, PHP serializes it into JSON (JavaScript Object Notation) and posts it over HTTP to the CMS (Java), which deserializes it. JSON is the bridge we use to work across systems and programming languages. PHP5 has built-in support for JSON, and we use a Java JSON library to interact with JSON data in Java. Managing the conversions to and from JSON takes much less code than it would have with a data exchange format such as XML.
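The round trip through JSON is short enough to sketch in full. Both halves are shown in Python here, whereas the real system serializes in PHP and deserializes in Java; the function names and the sample fields are hypothetical.

```python
import json


def serialize_story(story: dict) -> str:
    """Sending side (PHP in the real system): Story Model -> JSON text."""
    return json.dumps(story, ensure_ascii=False)


def deserialize_story(payload: str) -> dict:
    """Receiving side (Java in the real system): JSON text -> Story Model."""
    story = json.loads(payload)
    if not isinstance(story, dict):
        raise ValueError("expected a JSON object for the Story Model")
    return story
```

For example, `deserialize_story(serialize_story({"title": "Local story", "audio": []}))` gives back an equal dictionary, with no schema or custom parser needed on either side.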

After the CMS deserializes the Story Model, it validates the data and then persists the data to the database using Java components. The validation may find errors, such as a missing title, that will cause the system to reject the story completely. The validation can also find warnings, such as a bad image URL. Warnings allow the rest of the story to be ingested. Both errors and warnings are noted in the document returned to the partner.
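The difference between errors and warnings can be sketched as follows. This is a Python illustration rather than the Java validation code, and the field names, message text, and URL check are assumptions; only the two example conditions (a missing title and a bad image URL) come from the description above.

```python
from urllib.parse import urlparse


def validate_story(story: dict) -> dict:
    """Collect errors (story rejected) and warnings (story still ingested)."""
    errors, warnings = [], []

    # A missing title is an error: the whole story is rejected.
    if not str(story.get("title") or "").strip():
        errors.append("missing title")

    # A bad image URL is only a warning: that image is skipped,
    # and the rest of the story goes through.
    good_images = []
    for url in story.get("images", []):
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc:
            good_images.append(url)
        else:
            warnings.append(f"bad image URL: {url}")

    return {
        "accepted": not errors,
        "errors": errors,
        "warnings": warnings,
        "images": good_images,
    }
```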

During the Java processing, we make a note of any audio files we need to download from the partner. We want the Ingest System to respond to the partner quickly (within a few seconds at most), but downloading audio can take much longer, so we download the audio files asynchronously. If we encounter a problem downloading the audio, the system generates an email informing the partner of the problem.
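One way to picture the asynchronous download is a background worker pool that the request handler hands work to before it responds. The sketch below is Python, not the Java the CMS uses, and every name in it (`ingest`, `download_in_background`, `fetch`, `notify`) is hypothetical; `notify` stands in for the email sent to the partner.

```python
from concurrent.futures import ThreadPoolExecutor

# Small shared pool so downloads run outside the request/response cycle.
_executor = ThreadPoolExecutor(max_workers=2)


def download_in_background(url, fetch, notify):
    """Fetch one audio file; call notify(url, reason) if the fetch fails."""
    try:
        fetch(url)
    except Exception as exc:
        notify(url, str(exc))


def ingest(story, fetch, notify):
    """Queue the story's audio downloads and return without waiting for them."""
    futures = [
        _executor.submit(download_in_background, url, fetch, notify)
        for url in story.get("audio", [])
    ]
    response = {"status": "accepted", "audio_queued": len(futures)}
    return response, futures
```

The response is built as soon as the downloads are queued, so a slow or failing audio server does not delay the reply to the partner. The futures are returned only so a caller can wait on them if it wants to.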

Once the Java system has finished processing the story data, a result code and any error or warning messages are sent back to the PHP code, once again using JSON.  The PHP code then uses the Story API to provide a response back to the partner that shows what the story will look like in the Story API along with any errors or warnings that occurred during processing.

The Ingest System is a closed API, meaning it can only be used by select partners. However, it will benefit anyone who uses the open Story API by bringing a local aspect to the national and global news already available from the Story API. As the Ingest System ramps up to include more stations, we hope the system will let you find news important to you no matter where you are.
