Creative Commons, State Library of New South Wales collection
Good production support is like the third girl from the left at the bottom of this human pyramid: rock-solid, handled with aplomb and grace (Look at her feet!)
Every day NPR sends hundreds of stories out to the Internet and millions of users. They read, listen to, comment on, and share those stories on NPR.org, and on partner sites that ingest our content via our APIs. Every day there are also behind-the-scenes system hiccups, publishing blips, and technical potholes that we fix before they can turn into site-eating sinkholes.
If the release of new code is the grinning pixie madly waving her shiny pom-poms at the pinnacle of a human pyramid, production support is the dependable, strong-backed, even-tempered folks who make up the base. Without them, the pixie wouldn't have a prayer of getting up there.
In this post, we'll give you an overview of our production-support process, some recent changes and plans to make it stronger.
Where Are The Errors Coming From?
- Seamus: NPR's content management system. Our writers, editors and producers use Seamus to build stories and blog posts, input rundowns and send out breaking news emails.
- API: Provides a structured way for other computer applications to get NPR stories that is predictable, flexible and powerful.
- Web pages: NPR public Web pages showing NPR stories, series, topics, programs and music events.
- Scripts: Run automatically by back-end systems on a predetermined schedule, or run manually as needed.
Who Asks The Tech Team For Support?
- Inside users: NPR staff who can build a story or blog post in Seamus and publish it to the NPR Web site and the API.
- Outside users: People who visit the NPR Web site or use the API. NPR's User Care team handles the majority of problems these users have, but sends us the odd or hard-to-solve issues.
- Invisible users: NPR's system-scheduled tasks. If a task fails, an email reporting the failure is automatically sent to the tech team.
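As a rough illustration of how an "invisible user" might report its own failures, here is a minimal Python sketch. It is not NPR's actual code: the addresses and the function names `build_failure_email` and `run_scheduled_task` are hypothetical, and the email is only built and handed to a `send` callable rather than delivered.

```python
import traceback
from email.message import EmailMessage

# Hypothetical addresses; the real ones are not given in this post.
SUPPORT_ADDRESS = "tech-support@example.org"
SENDER_ADDRESS = "scheduler@example.org"


def build_failure_email(task_name, error):
    """Build (but do not send) an email describing a failed task."""
    message = EmailMessage()
    message["From"] = SENDER_ADDRESS
    message["To"] = SUPPORT_ADDRESS
    message["Subject"] = f"Scheduled task failed: {task_name}"
    # Include the full traceback so the developer on support can start
    # debugging straight from the email.
    details = "".join(
        traceback.format_exception(type(error), error, error.__traceback__)
    )
    message.set_content(f"Task {task_name!r} raised an error.\n\n{details}")
    return message


def run_scheduled_task(task_name, task, send=print):
    """Run task(); on failure, pass a report email to send().

    Returns True if the task succeeded, False otherwise.
    """
    try:
        task()
    except Exception as error:
        send(build_failure_email(task_name, error))
        return False
    return True
```

In a real deployment, `send` could be the `send_message` method of an `smtplib.SMTP` connection; the default of `print` simply keeps the sketch runnable without a mail server.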
Current Production Support Model
Generally speaking, users ask for production support via email. When a user encounters an issue in the system, they send an email to a specific email address to ask for help. At least one developer is assigned to monitor and respond to these emails. This production support person usually responds to an email within 10-30 minutes and often solves the problem the same day. If it's a bigger, more complicated issue, the developer working on production support asks the user to file a ticket in Jira so the fix can be prioritized and scheduled for an upcoming release.
If something is really wrong – like npr.org slowing to a crawl or crashing – we usually know and respond within a couple of minutes. NPR staff bombard the support email address, call development team managers, and sometimes even jog over to our desks to warn us. In this kind of production emergency, one of the Digital Media managers makes sure the right people are working on the problem and sends status emails to the Digital Media team every 10 or 15 minutes until the problem is fixed.
The tech team isn't large enough to match the almost-around-the-clock staffing of the editorial team. For emergency help during hours when the tech team isn't working, an editor can call a support line staffed 24/7 by NPR's IT staff, who use a table of problem scenarios to determine the next steps. For example, if a user can't log in to Seamus, a system administrator gets a call. If it's a publishing error on a breaking news story, IT reaches out to the developer on call.
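The table of problem scenarios can be pictured as a simple lookup. The Python sketch below is only an illustration built from the two examples above; the scenario keys and the `next_contact` function are hypothetical names, and the real table certainly has more entries.

```python
# Hypothetical scenario keys, based on the two examples in the text.
ROUTING_TABLE = {
    "seamus_login_failure": "system administrator",
    "breaking_news_publishing_error": "developer on call",
}


def next_contact(scenario):
    """Return who should be called for a scenario.

    Returns None when the scenario is not in the table, since the post
    does not say what happens in that case.
    """
    return ROUTING_TABLE.get(scenario)


if __name__ == "__main__":
    for scenario in ("seamus_login_failure", "breaking_news_publishing_error"):
        print(f"{scenario}: call the {next_contact(scenario)}")
```

Keeping the scenarios in data rather than in code means the table can be updated without changing the lookup logic.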
In addition to reacting to support requests from our users, we have a variety of system monitors and dashboards that alert us to issues as they arise (or just before they do), well before we get any reports from outside the team. Our system administrators use several different monitoring packages that give us insight into the health of our Web servers, databases and API.
Future Production Support Model
Recently, we started using Splunk to index our log messages and help us search and analyze our systems. We're now catching a lot of issues before they become widespread, critical problems, just by looking at the reports on the Splunk dashboard each day. Following every release, we use Splunk to check that the responsiveness of our systems is in line with the baseline established before the release. We also try to note any patterns forming around warnings and errors that may be a result of code in the release, and address them before they become problematic.
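The post-release baseline check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our Splunk setup: it assumes response times (in milliseconds) have already been exported as plain lists, and the function name `is_in_line_with_baseline` and the 10 percent tolerance are hypothetical choices.

```python
from statistics import median


def is_in_line_with_baseline(baseline_ms, post_release_ms, tolerance=0.10):
    """Compare median response times before and after a release.

    Returns True when the post-release median is no more than
    `tolerance` (a fraction, 0.10 = 10 percent) above the baseline median.
    """
    baseline = median(baseline_ms)
    current = median(post_release_ms)
    return current <= baseline * (1 + tolerance)


if __name__ == "__main__":
    before = [100, 110, 120]   # made-up sample values
    after = [150, 160, 170]    # made-up sample values
    if not is_in_line_with_baseline(before, after):
        print("Responsiveness has drifted from the pre-release baseline")
```

The median is used here because a single slow request does not skew it the way it would skew an average; a percentile-based check would be a reasonable alternative.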
We're slowly shifting from a reactive, user-driven error-reporting model to proactive, system-driven maintenance checks. System stability will become increasingly important as NPR moves to more robust, round-the-clock reporting. Our goal is to reduce the number of live errors to minuscule levels and for editors to use the support email address less and less over time. There will always be behind-the-scenes system hiccups and publishing blips, but in the future we hope to fix more of them before the editors ever see the problem.