Imagine two pieces of software. One has “robust error handling” while the other prints a stack trace and dies. Which one do you prefer?

I like the one that dies. Loudly. Depending on the type of application you’re building, the dead program might serve you better, too. Why? Because it’s obvious when something goes horrifically wrong. The message won’t be buried in a snowcrash of logging output.

What is “robust error handling” anyway? My team is currently building a sophisticated message bus and we’ve run into a few subtle concurrency issues. These are the hardest things to find when writing a distributed application with many threads on different physical nodes. Our error handling consisted of trapping the exception, logging it to Log4J at ERROR level, setting the correct state on our class (or so we thought), and going back to what the code was doing. It seemed like a good first pass, except that it didn’t work. We didn’t or couldn’t predict every possible state in our system, so we made our best guesses, but naturally some corner cases bit us with an unexpected deadlock. Everything stopped and we didn’t know why.

Our logs, naturally, were huge with debugging turned on. Application logging is largely useless, anyway, without a plan to use it. And without a debug statement prefixing every line of code, you’re going to have a hard time finding deadlock situations across JVMs on the network.

So how’d we find our gremlin? We killed the process. Our “robust error handling” now looks like this:

try {
     // attempt the work
} catch (Exception e) {
     e.printStackTrace();  // say exactly what went wrong...
     System.exit(1);       // ...and die
}

Once we made that change and deployed our software to all the nodes, we found our deadlock.
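A try/catch only protects the code it wraps. In a program with many threads, an exception that escapes a worker normally ends just that thread while the rest of the process carries on. One way to get the same die-loudly behavior everywhere is an uncaught exception handler. The following is a minimal sketch, not our actual code: the class name `FailLoudly`, the thread name, and the injectable `exit` parameter are all made up for illustration (the parameter is only there so the handler can be exercised without really killing the JVM).

```java
import java.util.function.IntConsumer;

// Hypothetical helper for illustration; not taken from the message bus described above.
final class FailLoudly {

    // Builds a handler that prints the stack trace and then calls the given
    // exit action with status 1. In production the action would be System::exit.
    static Thread.UncaughtExceptionHandler handler(IntConsumer exit) {
        return (thread, error) -> {
            System.err.println("Fatal error in thread " + thread.getName());
            error.printStackTrace();
            exit.accept(1);
        };
    }

    public static void main(String[] args) throws InterruptedException {
        // Any thread without its own handler now takes the whole process down.
        Thread.setDefaultUncaughtExceptionHandler(handler(System::exit));

        Thread worker = new Thread(() -> {
            throw new IllegalStateException("unexpected state");
        }, "worker-1");
        worker.start();
        worker.join();
    }
}
```

Whether `System.exit` is appropriate depends on the application, since it still runs shutdown hooks; the point of the sketch is only that the failure is impossible to miss.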

Our problem was an unsynchronized getter (you think gets are reads and therefore thread safe? Ha!). It turns out that somewhere in the call stack, this getter called toArray() on an ArrayList, which internally uses an iterator to build the array. If you’ve done any multi-threaded programming, you probably know what happens when another thread tries to modify your collection/list concurrently while using an iterator.
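For anyone who hasn't been bitten by this yet, here is a small sketch of the failure mode and one common fix. It is an illustration under assumptions, not our production code: `ChildRegistry` and its methods are hypothetical names, and the modification happens on a single thread so the result is deterministic. With real threads the same `ConcurrentModificationException` is best-effort and timing-dependent, which is exactly why it is so hard to spot.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical class for illustration only.
final class ChildRegistry {
    private final List<String> children = new ArrayList<>();

    synchronized void add(String child) {
        children.add(child);
    }

    // Unsafe getter: hands out the live list with no lock held.
    List<String> liveView() {
        return children;
    }

    // Safer getter: copies the list while holding the same lock as add().
    synchronized List<String> snapshot() {
        return new ArrayList<>(children);
    }

    // Iterates the given view while adding to the registry, and reports
    // whether the iterator detected the modification.
    static boolean failsWhenModifiedDuringIteration(List<String> view, ChildRegistry registry) {
        try {
            for (String child : view) {
                registry.add(child + "-retry");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ChildRegistry live = new ChildRegistry();
        live.add("a");
        live.add("b");
        System.out.println("live view failed: "
                + failsWhenModifiedDuringIteration(live.liveView(), live));

        ChildRegistry safe = new ChildRegistry();
        safe.add("a");
        safe.add("b");
        System.out.println("snapshot failed: "
                + failsWhenModifiedDuringIteration(safe.snapshot(), safe));
    }
}
```

The snapshot getter trades a copy per call for an iteration that cannot be disturbed by writers. A concurrent collection such as `CopyOnWriteArrayList` is another option when reads far outnumber writes.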

Our problem arose in a parent message (those that divide work among many child messages for parallel execution) which would leave orphaned children in certain error scenarios. We didn’t discover this corner case in the logs, but we found it quickly when we crashed the program and exited.

I understand that exiting a running program isn’t the correct solution for all problems, but it was for ours, and it was dramatically more revealing than looking through tons of debug gibberish in log files. So, if you have the kind of program that can safely exit, then I say…

Fail loudly and proudly.