Imagine two pieces of software. One has “robust error handling” while the other prints a stacktrace and dies. Which one do you prefer?
I like the one that dies. Loudly. Depending on the type of application you’re building, the dead program might serve you better, too. Why? Because it’s obvious when something goes horrifically wrong. The message won’t be buried in a snowcrash of logging output.
What is “robust error handling” anyway? My team is currently building a sophisticated message bus and we’ve run into a few subtle concurrency issues. These are the hardest things to find when writing a distributed application with many threads on different physical nodes. Our error handling consisted of trapping the exception, logging it to Log4J at ERROR level, setting the correct state on our class (or so we thought), and going back to what the code was doing. It seemed like a good first pass, except that it didn’t work. We didn’t or couldn’t predicate every possible state in our system, so we made our best guesses, but naturally some corner cases bit us with an unexpected deadlock. Everything stopped and we didn’t know why.
Our logs, naturally, were huge with debugging turned on. Application logging is largely useless, anyway, without a plan to use it. And without a debug statement prefixing every line of code, you’re going to have a hard time finding deadlock situations across JVMs on the network.
So how’d we find our gremlin? We killed the process. Our “robust error handling” now looks like this:
try {
// attempt the work
} catch(Exception e){
e.printStackTrace();
System.exit(-1);
}
Once we made that change and deployed our software to all the nodes, we found our deadlock.
Our problem was an unsychronized getter (you think gets are reads and therefore thread safe? Ha!). It turns out that somewhere in the callstack, this getter called toArray() on an ArrayList, which internally uses an iterator to build the array. If you’ve done any multi-threaded programming, you probably know what happens when another thread tries to modify your collection/list concurrently while using an iterator.
Our problem arose in a parent message (those that divide work among many child messages for parallel execution) which would leave orphaned children in certain error scenarios. We didn’t discover this cornercase in the logs, but we found it quickly when we crashed the program and exited.
I understand that exiting a running program isn’t the correct solution for all problems, but it was for our’s and it was dramatically more revealing than looking through tons of debug gibberish in log files. So, if you have the kind of program that can safely exit, then I say…
Fail loudly and proudly.
I'm Mark Turansky, I'm the founder of
#1 by Georgi on February 5, 2008 - 1:49 pm
Quote
Hi Mark,
I had a deja vue straight ahead reading your article. We had exactly the same problem with toArray() in a server, it was a list that stored notification e-mails to be sent in case a fatal error occured. All program parts and Threads can access this interface. A customer called stating that no emails are sent any more so we analysed the problem, found the bug pretty fast and fixed it within minutes. It was a one in a million chance: Two parts of the program “went fatal” and wanted to send an email simultaneously.
Finding the erroneous part that quick only was possible because we have a kind of logging-guidelines and do improve the software from time to time. This way, we do not have that much “error” – logging entries (and merely never a “fatal” one) and can follow them up pretty fast. Will say: We try to avoid to log into the error-channel where possible. (Part)Recoverable actions that occur (i.e. connection errors where you can retry etc.) are logged into the info-channel more and more as the program evolves.
Thus I think it is not an option to exit a program (or more worse: To System.exit() in case you write a Framework or module for your program). Just imagine you go to productive level and forget a System.exit() – statement … .
It will help much more to improve where/what/how to log. If you want to, have a look at http://goit-postal.blogspot.com/2007/01/how-to-successful-log-in-programming.html .
Just my 0,02$, Georgi
#2 by Paul W. Homer on February 5, 2008 - 4:01 pm
Quote
The trick, I’ve always found is to reduce the time between the problem occurring, and the program stopping, when in development. Way back, we would use assert statements in C and set the debug flag to true while we were developing. Any problem, no matter how tiny would grind the whole system to a halt. You can’t move forward until you’ve fixed the base problems. That would force the developers to fix the problems instead of ignoring them.
Problems often cascade, so an earlier ignored problem might actually the part of the cause of a latter one. It is always best to deal with the problems in the order they occurred.
That strategy is great for development, but lousy for support. Once in ‘wild’ the code should try to work for as long as possible (producing lots of nasty log messages if needed), and there should be plenty of room for work-arounds. There should be a way to get a huge amount of information ‘easily’ from the users once the system has crashed. In that type of case, too much information is never a problem, too little is. Once a problem has been diagnosed, a work-around is the first (and cheapest) approach, while a patch is a fall back.
I’ve always found that mixing between the two strategies based on running environment is the best way to handle the two completely opposite problems.
Paul.
http://theprogrammersparadox.blogspot.com
#3 by Ram on February 5, 2008 - 6:41 pm
Quote
Thats why you have the thread dump ! Thread dumps are always more useful than exception stack trace. Tools like Lockness make the job of identifying a problem. If your whole application model is state based (Read NIO) with a lot of transitions – you are back to square one where depending on logs is the best bet – since a prior state affects the behavior of the next transitioned state. These are hard to replicate problems.
#4 by Dan on February 7, 2008 - 9:22 am
Quote
I am on Mark’s team.
Paul:
Good point about development-time assertions. They are a typically overlooked strategy in most Java-land projects I’ve seen (my own included!). I think I’m going to make it a point when we get back to do a sweep of the codebase and strengthen our pre/postcondition assertions into actual Assert assertions. I believe we got pretty focused on the “try to work for as long as possible” aspect prematurely. Thanks for the sanity check.
Ram:
Another good point. For some reason, Mark didn’t mention in his post that we relied heavily upon remote debugging and thread dumps to track down these problems. I don’t know what we’d do without the JMX console and jdb… I think it’d be nearly impossible to build a large, stable concurrent system like without them.
Pingback: » Code complete doesn’t mean you’re done
#5 by Roger Hayes on September 10, 2008 - 1:06 pm
Quote
See “fail-stop”, Schlichting 1983.