Inside NPR.org

Plumbing

What Happens When Stuff Breaks On NPR.org

Before we release new code, such as our recent redesign, we do our best to test as well as possible, but we can't possibly figure out everything that could go wrong before we put it out there in the world. Even under the best circumstances, in a web site this complex and busy, problems will occur. Some of them are are avoidable, such as bugs in our code. And some of them are environmental issues, such as a network problem or database failure. Often, it's difficult to distinguish between the two. That's why it is essential to have a solid error handling system to track the errors as them come in, prioritize them, and give us enough information to be able to fix them.

Obviously, we have spent a lot of time building features of our system that users see. A large percentage of our time is also spent on error handling. The error handling system that we've developed has been through several rounds of incremental improvements, starting with writing to a log, then emailing a developer, and now updating our issue tracking system. These approaches are described below in greater detail.

Logging

Every time our PHP code encounters an error, it gets automatically logged by our Apache web servers. We have dozens of web servers and they generate thousands of lines of logging information every day. We could get lost in the torrential downpour of log messages. Parsing through the logs is a good way to track down a problem once we know about it, but it was horrendously inefficient way to stay on top of new issues as they come in. Not only were they coming in too fast for any one person to keep track of, they also didn't provide any context about what a user was doing when the error occurred (and context is critical to the fundamental truth of debugging code that you must be able to consistently reproduce the problem before you can fix it or test the fix).

The other problem with this kind of logging was that we were missing out on client side error messages. In the past year or so, NPR.org has added more and more javascript to support a growing list of features. This resulted in more client side errors, which were very hard to track. All of the different combinations of web browsers, browser settings, operating systems, proxy servers, etc., each one potentially handling our javascript differently, led to problems that we might never have encountered while testing in our limited set of test environments. And since they only occurred on the client machine, we would never know about them unless a user sent us an email or complained in some other way.

As a result, we implemented something last year that we call jslog. With AJAX, we are able to send messages back to the server without reloading the page. Now, every time one of our pages has a javascript error, instead of making the user's client browser handle the error, leading to a less than stellar user experience, we catch the error and send it to the server using AJAX to be logged in the Apache web server logs. This is completely transparent to the user and allows the page to keep loading, even if it is not 100% perfect.

Code Examples


 
NPR.community.displayCommentError = function()
{
try {
var commentBlock = document.getElementById('comments');
commentBlock.innerHTML = "< h3>Community features and
content, including commenting and recommending stories, are
unavailable at this time. We apologize for the inconvenience.
Please try this page later.< /h3>";
}
catch (e) {
NPR.messaging.exception(e, 'NPR.community.storyId = ' +
NPR.community.storyId, 'displayCommentError',
NPR.messaging.constants.COMMUNITY_JS_ERROR);
}
};

The above javascript code snippet from our implementation of commenting has a try,catch block to handle an exception, it calls the following function which captures as much information as possible about what what wrong. We trim the exception information to a max of 255 characters to keep the message size reasonable. Some browsers include a stack trace inside the exception, so that could be very long.


 
NPR.messaging.exception = function(_e, _debugData, _function, _id)
{
if(_e.message.match('RequestBatch is undefined')===null) { _debugData += " {Exception: ";
var counter = 0;
for (var key in _e) {
try {
var value = _e[key];
if (!$.isFunction(value)) {
if (counter++ != 0) {
_debugData += ",";
} if (value.length > 255) {
value = value.substring(0, 255) + "...";
} _debugData += "{" + key + ":'" + value + "'}";
}
}
catch (propEx) {
// ignore
}
}
_debugData += "}"; var lineNumber = 'unknown';
if (_e.lineNumber != null) {
lineNumber = _e.lineNumber;
} NPR.messaging.logMessage(_debugData, _function, _id,
lineNumber);
}
return { };
};

For javascript errors that we didn't catch, we add a function handler to the window.onerror event. This will be called for every uncaught javascript exception on the page. The code is below.


 
window.onerror = function(message, url, lineNumber) {
message = "Window.OnError: " + message;
NPR.messaging.logMessage(message, "window.onerror",
NPR.messaging.constants.JAVASCRIPT_GENERAL_ISSUE, lineNumber);
window.onerror = function () {
return true;
};
return true; // prevents browser error messages
};

Both of these functions call a common logging function that creates a JSON string out of the message data and then uses the JQuery library to post a message to the server using AJAX.


 
NPR.messaging.logMessage = function (_msg, _function, _id, _line) {
try {
var message = {};
message.msg = _msg;
message.func = _function;
message.id = _id;
message.line = _line;
message.agent = navigator.userAgent
message.url = window.location.href; var debugData = '';
debugData = message.msg + ' | f: ' + message.func + ' | '
+ message.agent + " | in URL: " + message.url; message = JSON.stringify(message);
$.post("path/to/jslog.php", {type: 'Warning',
message: message});
}catch(e){}
};

Of course, we still have the problem of too many log messages, made even worse now that we're adding javascript errors to the mix. Which lead us to our next step, email.

Email

We use the PHP function set_error_handler to set up a special handler to intercept an error or warning before it is printed to the logs. We used this function to capture as much data about the problem as we can, then email it to developers. We can now see every error that occurs on the website as soon as it occurs, in our inbox. However, we still have the problem of the huge volume of logging that occurs. We tried to address that in two way.

One was to tag each type of error with an id number so that we could group similar errors. The other way was to keep track of how many of each type of error we were getting, only sending email messages once we exceeded a certain threshold. We could customize the threshold for each the error type, allowing us to focus on only the most frequent and serious errors. An intermittent network connection error is not something the developers need to see every time, but if it happens 5000 times in 10 minutes, that's a different ball game.

Still, receiving an email is a little too immediate. Most of the time, we can't drop everything we're working on to figure out what is causing a syntax error in the PHP. Most error messages that we get don't necessarily mean that a page or some feature is completely broken. It's still useful to eliminate as many trivial error messages as possible so that the big and important ones that do break the system don't get lost.

We also discovered that we had to push the thresholds up to some very large numbers in order to get some relief from the flood of emails. Since we received emails for only a small subset of errors, many issues got lost in between and we may never find out about the problem. We were not much better off than keeping everything in the logs.

That lead us to our last (and current) step, adding a ticket to our issue tracking system when a problem occurs.

Issue Tracking

Every time that our website encounters an error in either the PHP code or Javascript, we now post to a PHP page whose only job is log the issue in a MySQL database. This database keeps track of the filename, function name, line number, error message, and most usefully for debugging, a stack trace. We are very careful to make sure that this action doesn't take any longer than necessary and we've set a timeout on the post so that the public facing page continues to load in a timely manner, even if something is wrong in the backend error handling. Every few minutes, a different process scans the MySQL database and looks to see if any of the issues can be found in Jira, our issue tracking system.

The following diagram shows the path an error takes on its way through the NPR site.

Image i i
Image

(Click here for a larger version of this diagram)

Rather than grouping issues by error type, as in the email system, we use the filename and function name to create a unique key identifying this specific error. If the error already exists in Jira, we update the count so we know how many times this particular error has occurred. If the error has not been reported, we add it as a new issue in Jira. We can use these error frequencies to prioritize which issues need to be fixed when.

We've set up a separate project in Jira just for application errors and can assign specific issues to the development team to fix. For those issues that are not from coding problems or are innocuous enough to not require any attention, we close them, assuming that it will automatically be reopened and incremented if it becomes more critical. For certain critical error types, we are still able to email those issues directly to developers. This way we can still keep track of the state of the website without being overwhelmed by it. All of these backend errors can be triaged into our regular cycle of development work.

We've just added this issue tracking integration into the system, so we are still fine tuning it. Although the logs and error emails were too numerous to keep track of, it turns out that the number of individual issues added to Jira is a manageable number.

One interesting issue that we didn't anticipate was error messages coming from JavaScript running on web browsers in languages other than English. Although the error is the same, each time the problem is encountered in Russian, Spanish, or even Icelandic, a new ticket is added to Jira. Unfortunately, none of our developers speak Icelandic. Our next step in improving this process will be to group similar errors together, regardless of language.

As we move forward and add new features to the website, the quality of the website will consistently be improving, one error message at a time.

Comments

 

Please keep your community civil. All comments must follow the NPR.org Community rules and terms of use, and will be moderated prior to posting. NPR reserves the right to use the comments we receive, in whole or in part, and to use the commenter's name and location, in any medium. See also the Terms of Use, Privacy Policy and Community FAQ.

Inside NPR.org
Support comes from: