Inside NPR.org

Thursday, November 19, 2009

By Jason Grosman (Programmer, Digital Media)

Before we release new code, such as our recent redesign, we do our best to test as well as possible, but we can't possibly figure out everything that could go wrong before we put it out there in the world. Even under the best circumstances, in a web site this complex and busy, problems will occur. Some of them are avoidable, such as bugs in our code. And some of them are environmental issues, such as a network problem or database failure. Often, it's difficult to distinguish between the two. That's why it is essential to have a solid error handling system to track the errors as they come in, prioritize them, and give us enough information to be able to fix them.

Obviously, we have spent a lot of time building features of our system that users see. A large percentage of our time is also spent on error handling. The error handling system that we've developed has been through several rounds of incremental improvements, starting with writing to a log, then emailing a developer, and now updating our issue tracking system. These approaches are described below in greater detail.

Logging

Every time our PHP code encounters an error, it gets automatically logged by our Apache web servers. We have dozens of web servers and they generate thousands of lines of logging information every day. We could get lost in the torrential downpour of log messages. Parsing through the logs is a good way to track down a problem once we know about it, but it was a horrendously inefficient way to stay on top of new issues as they came in. Not only were they coming in too fast for any one person to keep track of, they also didn't provide any context about what a user was doing when the error occurred (and context is critical to the fundamental truth of debugging code: you must be able to consistently reproduce the problem before you can fix it or test the fix).

The other problem with this kind of logging was that we were missing out on client side error messages. In the past year or so, NPR.org has added more and more JavaScript to support a growing list of features. This resulted in more client side errors, which were very hard to track. All of the different combinations of web browsers, browser settings, operating systems, proxy servers, etc., each one potentially handling our JavaScript differently, led to problems that we might never have encountered while testing in our limited set of test environments. And since they only occurred on the client machine, we would never know about them unless a user sent us an email or complained in some other way.

As a result, we implemented something last year that we call jslog. With AJAX, we are able to send messages back to the server without reloading the page. Now, every time one of our pages has a JavaScript error, instead of making the user's client browser handle the error, leading to a less than stellar user experience, we catch the error and send it to the server using AJAX to be logged in the Apache web server logs. This is completely transparent to the user and allows the page to keep loading, even if it is not 100% perfect.
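
To make the idea concrete, here is a minimal sketch of what a jslog-style handler could look like. It is not the actual NPR.org code: the `/jslog` endpoint, the function names and the fields in the log entry are all assumptions made for illustration.

```javascript
// Hypothetical endpoint; the post does not name the real one.
const JSLOG_URL = "/jslog";

// Collect the context a developer would need to reproduce the problem.
function buildLogEntry(message, source, line, win) {
  return {
    message: String(message),
    source: source || "unknown",
    line: line || 0,
    page: win.location.href,
    userAgent: win.navigator.userAgent,
  };
}

// Install a global error handler that reports errors to the server.
function installJsLog(win) {
  win.onerror = function (message, source, line) {
    const entry = buildLogEntry(message, source, line, win);
    const request = new win.XMLHttpRequest();
    request.open("POST", JSLOG_URL, true); // asynchronous, so the page keeps loading
    request.setRequestHeader("Content-Type", "application/json");
    request.send(JSON.stringify(entry));
    return true; // tell the browser the error has been handled
  };
}

// Only wire it up when running in a browser.
if (typeof window !== "undefined") {
  installJsLog(window);
}
```

Passing the window object in as a parameter is a design choice made here so that the handler can be exercised with a stand-in object outside a browser.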

Continue reading "What Happens When Stuff Breaks On NPR.org" >

categories: Technology

11:34 - November 19, 2009

 
Friday, October 9, 2009

By Kinsey Wilson

The media landscape is changing at unprecedented speed. And news organizations everywhere are racing to keep up.

One way NPR tries to stay current is by connecting with people who are working at the forefront of technological change.

Usually, those are casual, behind-the-scenes encounters. But today, we've assembled an extraordinary group of technologists, entrepreneurs and innovators in San Francisco to spend the day thinking -- in public -- about NPR's future.

Continue reading "Thinking About NPR's Digital Future" >

categories: Technology

9:11 - October 9, 2009

 
Tuesday, August 4, 2009

By Adam J Martin

Along with the relaunch of our Web site last week, we've also made important changes to the NPR Media Player. The first thing you'll notice about the new player is its redesigned 'skin' that takes advantage of the cleaner layout found throughout the new npr.org. We hope this makes it easier for you to navigate the new features of the player and creates a more seamless experience with the website.

The next thing you'll notice is an enhanced listening and viewing experience. The new player was rebuilt to load faster, require less processing power and use less bandwidth than the previous version, which makes it faster to go from clicking a link to enjoying listening or watching a story.

We've added new features that make it easier to share your favorite stories as well. You've always been able to e-mail a story or send a link from the player, but now, for many stories, you can use the ↓ download button to save audio you haven't listened to or want to listen to again on your MP3 player. Also, we've added an <> embed button that allows you to copy the media player code and post an embeddable version of it on your blog or Web site. It's the easiest way ever to show off your favorite NPR stories. We hope you'll find the new embeddable player a fun way to enjoy and discover NPR across the Web wherever you are.

Here's an example of the embedded player, using a Fresh Air story from earlier this summer:



Continue reading "The NPR Media Player: Better, Stronger, Faster - And Embeddable" >

categories: Technology

9:21 - August 4, 2009

 
Wednesday, June 17, 2009

If you have used our search recently, you may have noticed that we just launched a 'new' search in beta. You can either follow the link from the search page or you can try it by clicking here.

So what's the big deal, you ask? Visually, we made changes to have a cleaner look while getting the results more prominently positioned on the page. Behind the scenes the technology is completely different. The new search is powered by the Google Search Appliance. While our previous search tool had similar potential to yield accurate results, it required a high degree of technical expertise to tune. One of the core philosophies in NPR Digital Media's technology team is that we want to be a partner, not a bottleneck to innovation. As part of this we embrace the idea we call 'Self Service' -- the antithesis to maintaining a technology fiefdom. The idea is that the more we can provide empowering tools to our colleagues, the more we can accomplish as a team. Prior to the invention of the lighter and matches, starting a fire was a cumbersome affair, typically involving a tinderbox, flint and a piece of steel. In modern day we usually take making fire for granted -- not because we all have become experts in the discipline, but rather because we have self service tools that work really well. So that is the root of our approach: implement smart, maintainable tools that are easy for people to use.

So it is this same spirit of self service that led to the selection of the search appliance. In this case there was no need to build it ourselves, as other companies had invested quite a lot in making solid search tools. While we think Google is really smart with its search algorithms, even more appealing was the ease of tuning via the appliance's interface. A colleague of ours, Javaun Moradi, is in charge of search (among many other things). Without making any changes to our code or critical configurations, he is able to easily make changes using the GSA interface to help ensure we are indexing and surfacing the results that are desired. Using the variety of information and meta-data available about our content, rules can be defined to bias towards more relevant pages and to exclude redundant or unnecessary items from the results. Even within the first 24 hours of the tool being up in beta, he informs me that he has already made several tuning changes to help surface information about our shows while also surfacing breaking news.

Another aspect we especially like is the ability of the appliance to render its results in XML. While a suggested implementation is to use XSLT directly on the box to yield result pages, we appreciate the flexibility to make service calls to the appliance and then work with the very clean, portable XML results. Currently we are rendering out the XML results as search result pages using PHP -- which mirrors the architecture we use for the rest of the site. We expect in the future we will be able to use this to better integrate other content and features with search, and search with other features.
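
As a rough illustration of what such a service call might look like, the sketch below builds a request URL that asks the appliance for XML rather than a rendered page. Our real calls are made from PHP; the host name is made up, the front end and collection names are the appliance defaults rather than our own settings, and the query parameters should be checked against the appliance's Search Protocol Reference before relying on them.

```javascript
// Hypothetical appliance host; the post does not give the real one.
const GSA_HOST = "http://gsa.example.org";

// Build a search request that asks the appliance for raw XML results.
function buildSearchUrl(query, options = {}) {
  const { start = 0, num = 10 } = options;
  const url = new URL("/search", GSA_HOST);
  url.search = new URLSearchParams({
    q: query,
    output: "xml_no_dtd",        // XML results instead of a rendered page
    client: "default_frontend",  // assumed front end name
    site: "default_collection",  // assumed collection name
    start: String(start),
    num: String(num),
  }).toString();
  return url.toString();
}

console.log(buildSearchUrl("tiny desk", { num: 5 }));
```

The XML that comes back can then be parsed and rendered by the same templating code that serves the rest of the site.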

This leads us to why we are launching this new search in 'beta'. While we have done some tuning and configuration, we are still working to get it right. By putting this tool out in a preliminary beta we can watch the queries it is getting and tune it to make it better. On the web, seeing how you all actually use our website is the most authoritative way to judge what is working and what isn't. Every day is an opportunity to improve upon what we did yesterday. One example of this is that we see many searches for shows that people mistakenly believe NPR produces, such as "This American Life" or "A Prairie Home Companion". By seeing that users are making these searches, we can make sure appropriate results are showing up. So whether you are searching for a story you heard this morning, or want to find those performances at Bob Boilen's tiny desk -- hopefully we get you what you want. We currently anticipate moving it out of 'beta' and making it our primary search tool later this summer as we announce some other changes to our digital media tools and products.

Please share any observations or feedback below, or via this comment tool we set up specifically for the new search. We anticipate numerous improvements in the months ahead.

Happy Searching.

-- Zach Brand

categories: Technology

12:20 - June 17, 2009

 
Monday, February 9, 2009

My last post highlighted the reasons why it is important to maintain clean content. This post will focus on what we call "HTML Addressing", which is a way for us to control the HTML (and other markup) that appears in our text fields, making the content more portable. HTML Addressing has several components to it, some of which are very typical of content management systems, while others are not. These components are described below.

Keep Content Separated into Distinct Fields
It is very important for us to make sure all content is populated into very distinct fields in the database. Some platforms, like blogging systems, have a big text block where images and other asset references are embedded within the text. This not only restricts our ability to modify the display of the image relative to the text, it also prevents us from doing something distinct with the image itself because it is now tightly bound to the text.

Limit HTML to be Allowed in Specific Fields
There is a range of free-text fields in our CMS. Most of these fields do not require any inline markup because the templates take care of all of the display rules. Others, like the teaser and full story text, however, do have instances where markup adds to the editorial meaning of the content. As a result, we limited the fields that allow markup to only those that actually need it. To enforce this limitation, we developed a series of JavaScript functions that apply to each field in the CMS to prevent markup from being entered into fields that do not allow HTML.

Limit HTML Tags in Allowable Fields
For those fields that allow HTML, we only allow very specific tags (and different fields can allow for different tags). For example, in some fields, we may allow tags such as <strong> and <em>, but not <b>, <div> or <img> (<b> is deprecated, <div> could introduce too many variables within the context of the pages, and <img> is not allowed because images should be added in their appropriate fields). Finally, some allowable tags, such as <a>, allow parameters to be applied to them, although we do restrict some of the event-based and style-based parameters. Again, our JavaScript functions ensure that only allowable tags are entered into the appropriate fields, that the parameters applied to those tags are viable, and that all tags are closed and nested correctly.
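
The checks described above can be sketched roughly as follows. This is an illustration rather than our production code: the field names, the per-field tag lists, the blocked-parameter pattern and the `validateMarkup` function are all hypothetical, and the regular expressions are deliberately simple.

```javascript
// Hypothetical per-field rules; the real lists are not given in this post.
const FIELD_RULES = {
  title: [],                     // no markup allowed at all
  teaser: ["strong", "em"],
  text: ["strong", "em", "a"],
};

// Event-based (onclick, onload, ...) and style-based parameters are rejected.
const BLOCKED_PARAMETER = /^(on\w+|style)$/i;

// Return a list of problems found in the value entered for a field.
function validateMarkup(fieldName, value) {
  const allowed = FIELD_RULES[fieldName] || [];
  const errors = [];
  const openTags = []; // stack used to check closing and nesting
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)((?:\s[^<>]*)?)>/g;

  for (const match of value.matchAll(tagPattern)) {
    const closing = match[1] === "/";
    const name = match[2].toLowerCase();

    if (!allowed.includes(name)) {
      errors.push(`<${name}> is not allowed in ${fieldName}`);
      continue;
    }
    if (closing) {
      if (openTags.pop() !== name) {
        errors.push(`</${name}> is not nested correctly`);
      }
      continue;
    }
    for (const parameter of match[3].matchAll(/([a-zA-Z-]+)\s*=/g)) {
      if (BLOCKED_PARAMETER.test(parameter[1])) {
        errors.push(`${parameter[1]} is not allowed on <${name}>`);
      }
    }
    openTags.push(name);
  }

  for (const name of openTags) {
    errors.push(`<${name}> is never closed`);
  }
  return errors;
}

console.log(validateMarkup("teaser", "a <strong>bold</strong> word"));
console.log(validateMarkup("text", '<a href="/x" onclick="y()">x</a>'));
```

An empty list means the field can be saved; anything else would be shown to the editor before the story is stored.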

Storage of HTML in Database
Upon saving a story in the CMS, the server-side code identifies all markup in the content and pulls all tags out. In pulling the tags out, we capture the "address" of the open and close tags. By "address", we mean the character position in the field at which the tag appears. For example, in the string "this is a <strong>string</strong> of content", the open <strong> tag starts at character 11 and the close tag starts at character 25 (counting from 1 in the marked-up string). So, when saving, we store only "this is a string of content" in the database field, but also put into a relational table the necessary information to reconstruct it with the tags. Included in that information are the story's unique ID, the unique ID for the field in the database where the tag was found, the character position for the opening and closing tags (stored as separate records in the database), the unique ID for the tag and any parameters and values attached to the tag. When I say that we store the unique ID for the tag, it is because we don't actually store the tag in this table (for reasons I will describe below in the Benefits section).
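
Here is a small sketch of the save and rebuild steps, using the example string above. Our real implementation is server-side and differs in detail; the tag ID table, the record shape and the function names below are assumptions for illustration, and the story and field IDs are left out to keep it short.

```javascript
// Hypothetical tag table: tags are referenced by ID rather than by name.
const TAG_IDS = { strong: 1, em: 2, a: 3 };
const TAG_NAMES = Object.fromEntries(
  Object.entries(TAG_IDS).map(([name, id]) => [id, name])
);

// Strip known tags and record where each one started (1-based, marked-up string).
function addressHtml(markedUp) {
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)((?:\s[^<>]*)?)>/g;
  const records = [];
  let plain = "";
  let last = 0;
  for (const match of markedUp.matchAll(tagPattern)) {
    const name = match[2].toLowerCase();
    if (!(name in TAG_IDS)) continue; // unknown tags stay in the text in this sketch
    plain += markedUp.slice(last, match.index);
    last = match.index + match[0].length;
    records.push({
      tagId: TAG_IDS[name],
      kind: match[1] === "/" ? "close" : "open",
      position: match.index + 1,
      parameters: match[3].trim(),
    });
  }
  plain += markedUp.slice(last);
  return { plain, records };
}

// Re-insert the tags in position order to get the marked-up string back.
function rebuildHtml(plain, records) {
  const ordered = [...records].sort((a, b) => a.position - b.position);
  let out = plain;
  for (const record of ordered) {
    const name = TAG_NAMES[record.tagId];
    const tag = record.kind === "open"
      ? `<${name}${record.parameters ? " " + record.parameters : ""}>`
      : `</${name}>`;
    const at = record.position - 1;
    out = out.slice(0, at) + tag + out.slice(at);
  }
  return out;
}

const saved = addressHtml("this is a <strong>string</strong> of content");
console.log(saved.plain);
console.log(saved.records.map((r) => r.position));
console.log(rebuildHtml(saved.plain, saved.records));
```

Because the positions refer to the marked-up string, the tags have to be re-inserted in ascending order so that each position is correct by the time it is used.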

Other Characters
HTML Addressing generally refers to the handling of HTML markup in our content. The functions, however, do more than just HTML. There are a range of characters that also create problems throughout the system, including smart quotes and mdashes. The HTML Addressing functionality does not store the locations of these characters (which are typically added to the content by copying/pasting from Microsoft Word). Rather, upon saving to the database, it replaces them with comparable characters that are more standard. For example, the smart quotes become regular quotes. This list of replacement characters is extensible.
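
A replacement list of this kind can be sketched in a few lines. The specific characters and their replacements below are assumptions chosen for illustration, not our actual list.

```javascript
// Hypothetical replacement list; extend it by adding more pairs.
const REPLACEMENTS = new Map([
  ["\u2018", "'"],   // left single smart quote
  ["\u2019", "'"],   // right single smart quote
  ["\u201C", '"'],   // left double smart quote
  ["\u201D", '"'],   // right double smart quote
  ["\u2014", "--"],  // mdash
]);

// Replace every listed character with its more standard equivalent.
function normalizeCharacters(text, replacements = REPLACEMENTS) {
  let out = text;
  for (const [from, to] of replacements) {
    out = out.split(from).join(to);
  }
  return out;
}

console.log(normalizeCharacters("\u201CFresh Air\u201D \u2014 it\u2019s on now"));
```

Keeping the list in one data structure is what makes it extensible: a new problem character only needs a new entry.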

Apply HTML Addressing to the Archive
The functionality described above applies to new stories getting created in our CMS (or old stories that get resaved). Because our older stories contained a wide array of tags, we also needed to be able to run this functionality against our archive. In doing so, we found tags like <font> that needed treatment. Generally speaking, the rules for the script were to move any tags recognized as valid for the system into the model described above while all other tags were removed completely. There were some exceptions to this, but that is generally what happened.


Benefits to HTML Addressing
Because we store the content without the tags, we can then present the content with or without them, depending on the output destination. If the content is getting output to NPR.org, we then recompile the markup into the content based on that relational table. If the content is getting output to podcasts, however, we simply print the raw content without the markup. In the NPR API, you can see an example of the difference between the two in our NPRML format (see the <text> and <textWithHtml> elements).

Obviously, removing the HTML from the content enables our content to be portable and to be distributed to virtually any platform. But similar goals could be met by storing the tags with the content in the database and running stripping scripts against the data when the output destination requires no HTML. So why go through the trouble of stripping out the HTML when storing it?

For us, the primary differentiating factor in stripping them out upon storage instead of upon presentation is in tag management. If tags need to change (for example, the <b> tag was deprecated in favor of <strong>), we need to make that change to only one field in the entire database. If the tags are in the content, we would need to run scripts that cycle through all fields and all records to make the change. Additionally, if we introduce micro-formats or other markup, we can be sure that they are all handled the same way.

Another interesting benefit to this approach is that when we output the content to different platforms, we could actually transform the tags on the fly. For example, since an iPod cannot parse <em> tags, we could print single-quotes in their place for that particular output destination. As different platforms develop their own markup for presentation, we simply need to maintain mappings between HTML and the other presentation markup tags.
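
As a sketch of what such a mapping could look like, the table below gives each output destination its own rendering of a tag. The destination names, the mappings and the `renderTag` function are hypothetical.

```javascript
// Hypothetical mappings: [opening markup, closing markup] per destination.
const TAG_MAPPINGS = {
  web: { em: ["<em>", "</em>"], strong: ["<strong>", "</strong>"] },
  ipod: { em: ["'", "'"], strong: ["", ""] },
};

// Look up how a stored tag should be printed for a given destination.
function renderTag(destination, tagName, kind) {
  const pair = (TAG_MAPPINGS[destination] || {})[tagName];
  if (!pair) return ""; // unknown destination or tag: print nothing
  return kind === "open" ? pair[0] : pair[1];
}

// Print the same emphasized phrase for two destinations.
function emphasize(destination, phrase) {
  return renderTag(destination, "em", "open") + phrase + renderTag(destination, "em", "close");
}

console.log(emphasize("web", "Fresh Air"));
console.log(emphasize("ipod", "Fresh Air"));
```

Supporting a new platform then comes down to adding one more entry to the mapping table.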

Detriments to HTML Addressing
The main problem with HTML Addressing is the fact that the default action for building pages on NPR.org (currently, the primary destination for the content) requires the toughest action. That is, when we render pages for NPR.org, we have to do all the processing to add the HTML back into the content. We mitigate this by burning the content into our XML repository and serving the pages to users through our caching layers. The XML repository contains the compiled output for both the marked-up and unmarked-up content, so when the templates render the output, it simply needs to pull from the appropriate fields. Moreover, the caching layer ensures that the performance is optimized regardless of the output format.


There are obviously many different ways to skin this cat, but this approach has provided us with cleaner and more portable content. This is not to suggest that our system's content is flawless in its portability. There are more challenges than just markup in content that make portability difficult, many of which we deal with on a daily basis. But eliminating markup from the content is a big step in achieving the goal.
--Daniel Jacobson

categories: Technology

1:26 - February 9, 2009

 
Wednesday, February 4, 2009

As discussed in my earlier post about the strategy of building the API, one of the most important things for content producers is to remain relevant to their users. With content becoming more readily available to these users through distribution channels, it is up to these content producers to make sure the content is where the users are. That does not diminish the need for continued development and maintenance of the Web site; it just means that it is equally important to distribute the content to other places. In order for that to happen, the content has to be portable. So, what does it mean for content to be portable? For NPR Digital Media, one of the primary philosophies driving our systems is Create Once, Publish Everywhere (COPE). To achieve COPE, here are some key principles that we adhere to:

Develop Content Management Tools, not Web Publishing Tools
Most content management systems for the online world are used to create Web pages. That said, the Web page is just one possible output for the content (albeit, an important one). In building our CMS at NPR, our goal was to make sure the tool could publish to anything, including NPR.org. If our focus did not consider other platforms, we could have ended up with a Web publishing system that binds the content too closely to the Web site itself.

Separate Content from Display
As mentioned above, if the content is too closely tied to a specific display, it cannot easily be pushed out to other platforms. Good separation, in addition to facilitating content portability, also makes redesigns of the Web site or alternate presentations of the content of the site easier. For example, because our content is separate from our display, we were able to launch the NPR Music site without refactoring the system architecture or our presentation layer code.

Eliminate markup from content
Because we do not know where the content will ultimately end up, it is important to not have platform-specific markup embedded in the content. For example, iPods cannot parse HTML, so we need to make sure our content gets distributed to iPods without tags in it, while the same content must contain the tags for NPR.org.

Like many other content management systems, ours captures and stores content in a central database that is completely independent from any presentation layer (I discuss our architecture in this earlier post). For the content to really be portable, however, it needs to work on any platform, including browsers, RSS readers, iPods, radio displays, mobile devices, TVs, etc., which means we must eliminate markup from content, as described above. To solve this problem, we introduced some functionality to our system that we call "HTML Addressing", which will be the topic of my next post.
--Daniel Jacobson

categories: Technology

11:10 - February 4, 2009

 
Monday, January 12, 2009

As my previous post mentioned, we recently re-launched our Station Finder Map. This post will discuss in more detail how the map works. Now, to the guts...

The Underlying Data
The system has several underlying database tables, including zip codes, cities and station data. The zip code and city tables, in addition to containing information about the locations, also include the latitude and longitude for the centroid of each location. These are pretty simple, flat tables that contain the approximately 41,000 zip codes and 150,000 cities in the United States.

The station tables, on the other hand, are much more complex. They contain all of the nearly 2000 stations that carry NPR programming (as well as their translators) along with a wide array of information about those stations, including licensee data and pertinent URLs associated with the station (e.g. their home page, schedule page, donation page, audio streams, RSS feeds and podcast feeds). These tables also include the latitude, longitude and broadcast power information for each antenna.

The broadcast power information tells us how far that antenna's broadcast signal can reach in each direction. Our data is broken up into 72 directions, starting with due north and shifting five degrees around the circle until we are back at due north. For each direction, our database contains five different ranges, detailing how far the antenna can reach in that given direction. The range itself determines two things. First, it tells us how far away you can be from the antenna and still hear its signal - this takes into account some impediments, such as mountains. The second thing it tells us is what the quality of the signal will be. The closer you are to the antenna, generally, the more clear the signal will be (although this is not always the case).

Finally, most of the data in these tables is publicly available in our recently launched Station Finder API (the coverage data is not available, but everything else is). The functionality of the map is driven off of the API.

How Does the Search Work?
At the core, the system works based on latitudes and longitudes. If you search the system by zip code or city/state, the system will convert the search term into a latitude and longitude before looking for stations. Similarly, when you look for NPR stations along a driving route, the system identifies a series of points along the route and converts those points into latitudes and longitudes. The waypoints for driving routes include any turn, crossing of a border, start and end points, and some artificially inserted points that we create. (Searches based on call letters bypass the geo-searches and hit the station tables directly.)

Once we have the latitude and longitude, we perform a series of calculations based on the Great Circle Calculation (GCC), which helps us to determine distances on a curved surface (i.e. the Earth - and we are assuming that it is not flat). Using the GCC, we look for stations near the latitude and longitude, based on a 100 mile radius from that point. From that list of stations, which is too inclusive, we start our process of narrowing down the results to the actual stations that can be heard.
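
For readers who want to see the arithmetic, here is a minimal sketch of a great-circle distance and the 100 mile pre-filter. It uses the haversine formula, which is one common way to compute great-circle distance and may not be the exact formula in our code; the station objects and function names are made up for the example.

```javascript
const EARTH_RADIUS_MILES = 3958.8; // mean Earth radius
const toRadians = (degrees) => (degrees * Math.PI) / 180;

// Great-circle distance between two points, using the haversine formula.
function greatCircleMiles(lat1, lon1, lat2, lon2) {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_MILES * Math.asin(Math.min(1, Math.sqrt(a)));
}

// First pass: keep every station whose antenna is within the radius.
function stationsWithin(stations, lat, lon, radiusMiles = 100) {
  return stations.filter(
    (station) => greatCircleMiles(lat, lon, station.lat, station.lon) <= radiusMiles
  );
}

// Hypothetical antennas, searched from a point in Washington, DC.
const stations = [
  { call: "AAAA", lat: 39.29, lon: -76.61 },
  { call: "BBBB", lat: 40.71, lon: -74.01 },
];
console.log(stationsWithin(stations, 38.9, -77.04).map((s) => s.call));
```

This first pass is intentionally generous; the narrowing by direction and broadcast range happens afterwards.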

For each station returned from our initial search, we first determine the direction (one of the 72 described earlier) from the antenna to the requested latitude/longitude. Then we find out the distance between the antenna and the latitude/longitude using the GCC. Once we have the distance and the direction, we simply need to do a lookup in our database to determine if the broadcast distance of the station is greater than the distance between the antenna and the latitude/longitude. If the broadcast distance is greater, then the station can be heard in the latitude/longitude. If it is not, then the station cannot be heard.

Now, when I say "check to see if the broadcast distance is greater", we are really checking five different broadcast distances in the database. We do this to find out what the quality of the broadcast signal will be for that latitude/longitude. The further the distance, assuming it is still within range, the more likely the signal will worsen. There are other variables, but that is the basic idea.
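
The direction and range lookup can be sketched as below. The data layout (an array of 72 entries, each holding five distances in ascending order), the choice to round to the nearest five-degree direction, and the function names are assumptions for illustration. The distance itself is taken as an input here, since it comes from the GCC.

```javascript
const toRadians = (degrees) => (degrees * Math.PI) / 180;
const toDegrees = (radians) => (radians * 180) / Math.PI;

// Initial compass bearing (0 = due north, 90 = due east) from antenna to listener.
function bearingDegrees(antennaLat, antennaLon, lat, lon) {
  const phi1 = toRadians(antennaLat);
  const phi2 = toRadians(lat);
  const dLon = toRadians(lon - antennaLon);
  const y = Math.sin(dLon) * Math.cos(phi2);
  const x =
    Math.cos(phi1) * Math.sin(phi2) -
    Math.sin(phi1) * Math.cos(phi2) * Math.cos(dLon);
  return (toDegrees(Math.atan2(y, x)) + 360) % 360;
}

// Map a bearing to the nearest of the 72 stored directions (assumed rounding rule).
function directionIndex(bearing) {
  return Math.round(bearing / 5) % 72;
}

// Return 1 (closest range) to 5 (furthest range), or null if the
// listener is beyond every stored range in that direction.
function signalTier(rangesByDirection, bearing, distanceMiles) {
  const ranges = rangesByDirection[directionIndex(bearing)];
  const tier = ranges.findIndex((range) => distanceMiles <= range);
  return tier === -1 ? null : tier + 1;
}

// Hypothetical antenna that reaches equally far in every direction.
const uniformRanges = Array.from({ length: 72 }, () => [10, 20, 30, 40, 50]);
const bearing = bearingDegrees(0, 0, 0, 1); // listener due east of the antenna
console.log(directionIndex(bearing), signalTier(uniformRanges, bearing, 25));
```

A null result means the station is dropped from the list; otherwise the tier gives a rough indication of signal quality.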

Displaying on the Map
The display of this information on the map is pretty straightforward. We simply drop an antenna icon at each latitude/longitude where a station's antenna is actually located. For that antenna, we use the polygon feature in Virtual Earth to draw and shade the coverage circles on the map. The contours of the coverage circles are drawn by taking the distance of the broadcast range in each of the 72 directions, drawing a line connecting the points, then shading in the circle. We do this for three of the five broadcast ranges in our database. The overlay of the shading for each of these three circles results in the inner circle being darker than the middle circle, which is darker than the outer circle.
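
The contour points themselves can be computed with the standard formula for a destination point given a start point, a bearing and a distance. The sketch below only produces the list of latitude/longitude points; handing them to the map's polygon feature is not shown, and the data layout and function names are assumptions.

```javascript
const EARTH_RADIUS_MILES = 3958.8; // mean Earth radius
const toRadians = (degrees) => (degrees * Math.PI) / 180;
const toDegrees = (radians) => (radians * 180) / Math.PI;

// Point reached by travelling a given distance from a start point on a bearing.
function destinationPoint(lat, lon, bearingDeg, distanceMiles) {
  const phi1 = toRadians(lat);
  const lambda1 = toRadians(lon);
  const theta = toRadians(bearingDeg);
  const delta = distanceMiles / EARTH_RADIUS_MILES; // angular distance
  const phi2 = Math.asin(
    Math.sin(phi1) * Math.cos(delta) +
      Math.cos(phi1) * Math.sin(delta) * Math.cos(theta)
  );
  const lambda2 =
    lambda1 +
    Math.atan2(
      Math.sin(theta) * Math.sin(delta) * Math.cos(phi1),
      Math.cos(delta) - Math.sin(phi1) * Math.sin(phi2)
    );
  return { lat: toDegrees(phi2), lon: toDegrees(lambda2) };
}

// One point per stored direction (every five degrees) for one range.
// rangeIndex is 0-based here: 0 is the closest of the five ranges.
function coverageContour(antenna, rangesByDirection, rangeIndex) {
  return rangesByDirection.map((ranges, i) =>
    destinationPoint(antenna.lat, antenna.lon, i * 5, ranges[rangeIndex])
  );
}

// Hypothetical antenna with the same five ranges in every direction.
const antenna = { lat: 40, lon: -100 };
const uniformRanges = Array.from({ length: 72 }, () => [10, 20, 30, 40, 50]);
const contour = coverageContour(antenna, uniformRanges, 2);
console.log(contour.length, contour[0]);
```

Connecting the 72 points in order gives the outline for one range; repeating this for three ranges gives the three overlapping shaded shapes.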

Other Notes
One other thing I should point out about this data is that it is great for the purposes of this type of application - a web-based service to inform our audience as to which NPR stations are available throughout the country. There are other more sophisticated, more precise ways to identify the station coverage maps which are really overkill for this type of service.

To see another representation of this same functionality, go to nprroadtrip.com. This is a map mashup produced by an NPR enthusiast (not affiliated with NPR).
--Daniel Jacobson

categories: Technology

3:57 - January 12, 2009

 
Monday, January 5, 2009

I am happy to announce the re-release of our Station Finder Map, including our Road Trip functionality. This version includes several features worth noting, as follows:

- It allows you to identify local NPR stations based on zip code, city/state, station call letters or broadcasting network.

- It allows you to identify local NPR stations along a driving route.

- It allows you to identify local NPR stations that can be heard at a specific address.

- For stations returned by the finder, you can view the station's coverage map, view more information about the station, and click through to the station's group page within the NPR Community.

- It is fully supported by our recently released Station Finder API.

We are very excited to have this feature back on the site and hope that it will help our listeners find NPR wherever they may go. In a later post, I will be providing a detailed technical explanation of how the Station Finder Map works.
--Daniel Jacobson

categories: Technology

11:03 - January 5, 2009

 
Wednesday, July 9, 2008

Zach Brand here -- I head up technology for NPR's Digital Media efforts. Our most recent addition to the codebase is our new registration engine / authentication tool. Initially, we're using the registration system for newsletter subscriptions, but in the coming months it will also allow users to participate in social networking features on the site. I realize that -- like a lot of technology -- as long as it works, you don't really notice it. That said, I think our new registration and log-in process is very easy, intuitive and pretty snappy. Check it out. The PHP development on this was the work of Joanne Garlow, Jason Grosman and Ivan Lazarte. The project process itself was managed by Jennifer Tuohy with help from K. Libner. Kudos to them and the rest of the team involved.

We are still looking to tune the authentication and SSL certs so it creates the fewest prompts in the various browser / OS combinations. Of course like all Web apps, I expect it will change and evolve as we go.

During this project a couple of questions arose. First, was there any open source tool that would do the job? We pained a bit over this one since we do try to be as open source friendly as possible. Despite a couple of valid contenders, none of them was well-suited to our current and future needs, so we did decide to build it ourselves. Which leads to the second question: do we integrate with OpenID? This time, our answer was yes. Unfortunately, to meet the timeline needed, we were not able to include OpenID on day one. Sooooo... the architecture of the system was built in such a way that we will be able to add OpenID compatibility into it down the road. How quickly it is incorporated will likely be impacted by how much demand we do or don't hear. So please, chime in with your thoughts, critiques or even compliments.

-- Zach Brand

categories: Technology

2:32 - July 9, 2008

 

About Inside NPR.org

Ever wanted to peer under the hood and learn about the inner workings of the NPR website? Have we got a blog for you, then. Here at Inside NPR.org, the NPR Digital Media team will keep you up-to-date on digital products and services we're developing, including social networking tools and our media player. For more info, please see our FAQ and our discussion rules.

Contact us

Got a question or comment you want to send to us privately? Use our contact form.