By Jason Grosman (Programmer, Digital Media)
Before we release new code, such as our recent redesign, we test it as thoroughly as we can, but we can't possibly anticipate everything that could go wrong once it is out in the world. Even under the best circumstances, on a web site this complex and busy, problems will occur. Some of them are avoidable, such as bugs in our code. Others are environmental, such as a network problem or database failure. Often, it's difficult to distinguish between the two. That's why it is essential to have a solid error handling system to track the errors as they come in, prioritize them, and give us enough information to fix them.
Obviously, we have spent a lot of time building features of our system that users see. A large percentage of our time is also spent on error handling. The error handling system that we've developed has been through several rounds of incremental improvements, starting with writing to a log, then emailing a developer, and now updating our issue tracking system. These approaches are described below in greater detail.
Logging
Every time our PHP code encounters an error, it is automatically logged by our Apache web servers. We have dozens of web servers, and they generate thousands of lines of logging information every day. We could get lost in the torrential downpour of log messages. Parsing through the logs is a good way to track down a problem once we know about it, but it was a horrendously inefficient way to stay on top of new issues as they came in. Not only were the messages coming in too fast for any one person to keep track of, they also didn't provide any context about what a user was doing when the error occurred. Context is critical because of a fundamental truth of debugging: you must be able to consistently reproduce a problem before you can fix it or test the fix.
The other problem with this kind of logging was that we were missing out on client-side error messages. In the past year or so, NPR.org has added more and more JavaScript to support a growing list of features. This resulted in more client-side errors, which were very hard to track. The many combinations of web browsers, browser settings, operating systems, proxy servers, etc., each potentially handling our JavaScript differently, led to problems that we might never have encountered while testing in our limited set of test environments. And since they occurred only on the client machine, we would never know about them unless a user sent us an email or complained in some other way.
As a result, we implemented something last year that we call jslog. With AJAX, we are able to send messages back to the server without reloading the page. Now, every time one of our pages has a JavaScript error, instead of leaving the user's browser to handle the error, which leads to a less than stellar user experience, we catch the error and send it to the server using AJAX to be logged in the Apache web server logs. This is completely transparent to the user and allows the page to keep loading, even if it is not 100% perfect.
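To make the idea concrete, here is a minimal sketch of how a jslog-style handler could work. It is not the actual NPR.org code: the `/jslog` endpoint, the function names, and the fields in the log record are all assumptions made for illustration. The only browser features it relies on are `window.onerror` and `XMLHttpRequest`.

```javascript
// Hypothetical endpoint; the real jslog URL is not given in this post.
const JSLOG_URL = "/jslog";

// Build the record to send for one error. The field names are assumptions.
function buildLogRecord(message, source, line, pageUrl, userAgent) {
  return {
    message: String(message),
    source: source || "",
    line: line || 0,
    page: pageUrl || "",
    userAgent: userAgent || ""
  };
}

// Default transport: an asynchronous POST, so the page is never reloaded
// or blocked while the error is reported. Browser only.
function sendWithXhr(record) {
  const xhr = new XMLHttpRequest();
  xhr.open("POST", JSLOG_URL, true); // third argument true = asynchronous
  xhr.setRequestHeader("Content-Type", "application/json");
  xhr.send(JSON.stringify(record));
}

// Install a window-level handler that reports uncaught errors to the server.
// `send` is optional and lets callers swap in a different transport.
function installJsLog(win, send) {
  const transport = send || sendWithXhr;
  win.onerror = function (message, source, line) {
    try {
      transport(buildLogRecord(
        message, source, line,
        win.location && win.location.href,
        win.navigator && win.navigator.userAgent
      ));
    } catch (e) {
      // The logger must never raise a new error of its own.
    }
    // Returning true tells the browser the error was handled, which
    // suppresses its default error reporting to the user.
    return true;
  };
}

// In a browser page: installJsLog(window);
```

Including the page URL and user agent in each record is one way to recover some of the context that plain server logs lack. On the server side, whatever answers the endpoint only needs to write the record to the error log, where it sits alongside the PHP errors.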
Continue reading "What Happens When Stuff Breaks On NPR.org" >
categories: Technology

