Archive for February, 2008

Linux is killing Solaris

This just in from the “duh, obvious” department… Linux is killing Solaris.

Search Google for “htop on Solaris.” You’ll probably find the very page you’re reading right now. There were plenty of hits for solaris, top, and htop, but none for solaris and htop together. (Editor’s note: not 5 minutes after publishing this article, Google has this very page as the first hit for ‘htop on solaris.’ See comments.)

We got used to htop’s color-coded bars in a console for our messaging system, and then we deployed to a datacenter on Sun hardware without htop. prstat just isn’t quite the same.

What’s htop? htop is a little command-line tool for Linux that’s similar to top but shows CPU and Memory usage visually in simple text format. It’s not flashy or whiz-bang. It’s a simple yet effective way of seeing what’s going on inside your OS at a quick glance.

So, what does this have to do with killing Solaris? We couldn’t find even a single person interested in running htop on Solaris (besides us). Other programs (like pound) have either been ported to Solaris or at least talked about somewhere else on the internet. We couldn’t even find a discussion about htop on Solaris. There’s just no interest.

I admit this is a specious argument at best, the thinnest of strawmen. But more subtly, an entire generation of Linux geeks is getting used to GNU tools that are similar to, but not quite like, their non-GNU Unix counterparts. For example, I’m frustrated that I can’t simply type “tail -n 200 <file>” on Solaris. The -n argument is not the same. Add up enough of these little differences and I find myself wanting to work in a Linux environment, where I’m more at home. Linux captured the low end of the market, revitalizing the old PCs people had lying around the house. The next generation is cutting its teeth on GNU/Linux, not Solaris. Sorry, BSD.

But like I said, this is from the “duh, obvious” department. It’s not particularly insightful. I’m merely a consumer reflecting on my choice of server OS. But if I’ve learned one thing in life, it’s that we’re not as unique as we think we are. There are probably other people thinking/feeling/experiencing the same thing you are. You might be part of a larger trend.

Happily, we’re only temporarily deploying to Solaris. Our company is in the middle of a move to a larger data center with a lot more capacity. We’ll have shiny, new blade servers to deploy to. They’ll be running Linux, naturally.

The Great Indian Outsourcing is over

The Great Indian Outsourcing movement will be over within two years.

That’s what an architect turned blogger who writes anonymously from Bangalore is predicting. The author is writing from the movement’s Ground Zero, so he may have better insight than the rest of us. But I’ve got good anecdotal evidence from a local outsourcing company that lends weight to his prediction.

I live and work in Charleston, S.C., an area known more for its beautiful beaches and gorgeous live oak trees than high technology (though we do have Robert X. Cringely). But Charleston’s location can attract businesses that don’t necessarily need high technology, just smart people. Outsourcing is one of those types of businesses.

I personally know a project manager at a local outsourcing company. Our daughters go to school together. We were talking at a recent birthday party about outsourcing, cost, and the availability of talent. Business is booming, but it has little to do with cost, she tells me. She says it’s the lack of local talent that drives most of their business. They deal largely with the marketing end of technology, making websites and fancy Flash applications. Madison Avenue marketing firms would rather hire local Flash experts, she says, but they’ve hired them all. They’d prefer the rapid turnaround that local talent can give them. There just aren’t enough talented people in NYC to fill the huge demand, so they outsource to Charleston, S.C. In turn, this local company sends the work to a development center they own in Costa Rica. Costa Rica is, I’m surprised to learn, a hot up-and-coming technology spot. And you don’t have to wait 12 hours for Costa Rican project managers and developers to reply to email or voice mail.

The Tired Architect – our Bangalorian blogger – talks about the availability of talent in Eastern Europe and China, and there’s obviously talent in Central America. Brazil is another up-and-coming technology hot spot.

I agree with The Tired Architect that India’s monopoly on the outsourcing market is over.

Some wheels need reinventing

Reinventing a square wheel is a common anti-pattern. The idea is a) we don’t need to reinvent the wheel because b) we’re likely to recreate it poorly compared to what is already available. But if we never reinvent any wheels, then we never progress beyond what we have. The real question, then, is when does it make sense to recreate a wheel? Some wheels need to be recreated.

I recently reinvented a wheel. A big one. The wheel is “Enterprise Messaging,” which must be complex because it has “enterprise” right in the name! I’d be a fool to reinvent that wheel, right? Maybe. Maybe not. We fit our “enterprise messaging system” into 92kb.


Some won’t consider 92kb to be “enterprisey” enough, but that’s ok with me. I know we were able to put 1.3 million real-world messages through our bus over a weekend. That’s enterprisey.

Jonas Bonér wrote an article about building a POJO datagrid using Terracotta Server, and I replied on his blog saying we did something similar by using Terracotta Server as a message bus. Another reader asked why I did this instead of using JMS.

I think there are several benefits to this reinvented wheel:


92kb contains the entire server framework. We have another jar containing common interfaces we share with client applications that weighs in at 18kb.

It works!

A single “consumer” in our framework is bootstrapped into an isolated classloader, which enables our framework to load applications (the various apps we need to integrate) into their own isolated classloaders. One consumer can process a message for any integrated application.

This is utility computing without expensive VMWare license fees.

We’re consolidating servers instead of giving each application dedicated hardware. The servers were mostly idle, anyway, which is why enterprises are looking to utility computing and virtualization to make more efficient use of those spare CPU cycles. In our framework, hardware becomes generic processing power without the need for virtualizing servers. Scaling out the server farm benefits all applications equally, whereas the prior deployments required separate capital expenditures for each new server.


Our framework runs inside an IDE without any server infrastructure at all. No ActiveMQ, no MySQL, and no Terracotta Server. Developers can stand up their own message bus in their IDE, post messages to it, and debug their message code right in the framework itself.

We introduce Terracotta Server as a configuration detail in a test environment. Configuration Management handles this part, leaving developers to focus on the business logic.
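
The isolated-classloader benefit in the list above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the framework’s actual code: the class and method names (IsolatedLoaders, forApplication, canSee) are hypothetical, and the sketch assumes each integrated application ships as a set of jar URLs. The one real mechanism it leans on is that a URLClassLoader created with a null parent delegates only to the JDK’s bootstrap loader, so it cannot see the framework’s classpath or any other application’s classes.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical helper: one isolated classloader per integrated application. */
final class IsolatedLoaders {

    /**
     * Builds a loader that sees only the given jars plus the JDK's bootstrap
     * classes. Passing null as the parent is what cuts the loader off from
     * the framework's own classpath and from every other application.
     */
    static URLClassLoader forApplication(URL... applicationJars) {
        return new URLClassLoader(applicationJars, null);
    }

    /** True when the named class can be loaded through the given loader. */
    static boolean canSee(ClassLoader loader, String className) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // No jars at all: the loader can reach JDK classes and nothing else.
        ClassLoader isolated = forApplication();
        System.out.println("sees java.lang.String: "
                + canSee(isolated, "java.lang.String"));
        System.out.println("sees this class: "
                + canSee(isolated, IsolatedLoaders.class.getName()));
    }
}
```

Shared interfaces (such as the 18kb common jar mentioned above) would have to be visible to both sides, which in a real setup means choosing a parent loader that holds just those interfaces instead of null.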

So, I might not be writing my own servlet container anytime soon (not when Tomcat and Jetty are open source and high quality), but I think it made a lot of sense to reinvent the “enterprise messaging” wheel. Terracotta Server allows me, in effect, to treat my network as one big JVM. My simple POJO model went wide as a configuration detail. That makes my bus (and TC) remarkably transparent.
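
To make “simple POJO model” concrete, here is a minimal sketch of what a bus that runs inside an IDE with no servers might look like. It is an assumption-laden illustration, not the framework described here: PojoMessageBus and its methods are hypothetical names, and the only library used is java.util.concurrent. How the queue would be shared across JVMs is left to configuration, as the post says, and is not shown.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical POJO bus: a plain in-memory queue with a post and a poll side. */
final class PojoMessageBus<M> {
    // Unbounded queue, so post() never blocks the producer.
    private final BlockingQueue<M> queue = new LinkedBlockingQueue<>();

    /** Producers hand a message to the bus. */
    void post(M message) {
        queue.add(message);
    }

    /** Consumers take the oldest message, or get null when the bus is empty. */
    M poll() {
        return queue.poll();
    }

    /** Number of messages waiting to be consumed. */
    int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        PojoMessageBus<String> bus = new PojoMessageBus<>();
        bus.post("invoice-1");
        bus.post("invoice-2");
        String message;
        while ((message = bus.poll()) != null) {
            System.out.println("handled " + message);
        }
    }
}
```

Because nothing in the class refers to a server, a developer can post messages and step through consumer code in a debugger with no infrastructure running.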

Beware the non-namespaced classpath resource

We spent too long hunting down a bug in a database method call that consistently returned “null” in our integration environment, but returned the correct value in a development unit test.

Do you see the problem here? Compare these two screenshots. Both screenshots show a look into two different jars on our classpath. The title of this article is a big hint.

[Screenshots: jar_two.JPG and jar_one.JPG]

Two points go to the first person to explain the problem. (-2 points for any of my teammates who answers first!)

The moral of the story is that namespaces are one heck of a good idea and that we should always be very careful when relying on classpath-based resources.  I’ll write more about the problem after giving others a chance to solve the puzzle.
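
One way to be careful is to ask the classloader for every copy of a resource name instead of only the first. The sketch below assumes, as the title hints, that the trouble comes from two jars carrying a resource with the same un-namespaced name; the class name ResourceCollisions and the file name config.properties are made up for illustration. ClassLoader.getResources and ClassLoader.getResource are standard JDK calls.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Hypothetical diagnostic: list every copy of a resource name on a classpath. */
final class ResourceCollisions {

    /** All locations that provide the resource, in classpath search order. */
    static List<URL> copiesOf(ClassLoader loader, String resourceName) {
        try {
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "config.properties" is a made-up, un-namespaced name. A namespaced
        // one would look like "com/example/billing/config.properties".
        String name = args.length > 0 ? args[0] : "config.properties";
        ClassLoader loader = ResourceCollisions.class.getClassLoader();
        List<URL> copies = copiesOf(loader, name);
        System.out.println(copies.size() + " copies of " + name);
        for (URL copy : copies) {
            System.out.println("  " + copy);
        }
        // getResource returns only the first match, so with more than one
        // copy the winner depends on classpath order.
        System.out.println("winner: " + loader.getResource(name));
    }
}
```

Run against an integration classpath, more than one copy of the same name would be a strong sign that development and integration are reading different files.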

Code complete doesn’t mean you’re done

Joe Coder runs through his feature in the UI. It works. The doodad renders beautifully on the screen, and when he clicks the button, all the right things happen on the server. He checks his code in, writes a quick comment in Jira, changes the issue status to “Completed & Checked-in”, and goes to his next task. Lo and behold, his To Do list is empty! Joe’s done coding!

Or is he?

Configuration Management cuts a branch off the trunk code. Joe’s code goes through QA.

Some QA departments try to break developers’ code, others just test the “go path” to ensure their test scripts work. In Joe’s case, the QA department has a test script that is a step-by-step use case for how the feature should work. QA signs off on his feature. The doodad worked exactly as specified, which is to say, the select/combo box was correctly filled with test data from the database.

Joe’s code goes to production … and takes down an entire server.

How can this be? It went through QA! Testers verified that his code did what it was supposed to do!

Most software organizations use the term robust wrongly. I generally hear it used in a context that implies the software has more features. Robustness has nothing to do with features or capabilities! Software products are robust because they are healthy, strong, and don’t fall over when the table that populates a select/combo box has a million rows in it.

Joe Coder never tested his feature with a large result set. Michael Nygard, in his excellent book Release It! from the Pragmatic Bookshelf, calls this problem an unbounded result set. It’s a common problem and an easy trap to fall into.

Writing code for a small result set enables rapid feedback for a developer. This is important, because the developer’s first task is to write his program correctly. It seems, though, that this first step is oftentimes the only step in the process! Without testing the program against a large result set, the new feature is a performance problem waiting to happen. The most common consequences are memory errors (have you ever tried filling a combo/select box with a million rows from the database?) or unscalable algorithms (see Schlemiel the Painter’s Algorithm).
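
A minimal sketch of one defence is to cap how many rows the code will ever read and to report when the cap was hit. The names here (BoundedResults, Page, readAtMost) are hypothetical, and an Iterator stands in for whatever actually supplies the rows; this is an illustration of the idea, not a prescription.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

/** Hypothetical guard against unbounded result sets. */
final class BoundedResults {

    /** A capped list of rows plus a flag saying whether the source had more. */
    static final class Page<T> {
        final List<T> rows;
        final boolean truncated;

        Page(List<T> rows, boolean truncated) {
            this.rows = rows;
            this.truncated = truncated;
        }
    }

    /** Reads at most maxRows from the source, however many it could supply. */
    static <T> Page<T> readAtMost(Iterator<T> source, int maxRows) {
        if (maxRows < 0) {
            throw new IllegalArgumentException("maxRows must be >= 0");
        }
        List<T> rows = new ArrayList<>();
        while (rows.size() < maxRows && source.hasNext()) {
            rows.add(source.next());
        }
        // Anything left in the source means the caller is seeing a subset.
        return new Page<>(rows, source.hasNext());
    }

    public static void main(String[] args) {
        // Stand-in for a table with a million rows; only 500 are ever held.
        Iterator<Integer> millionRows = IntStream.range(0, 1_000_000).iterator();
        Page<Integer> page = readAtMost(millionRows, 500);
        System.out.println(page.rows.size() + " rows, truncated=" + page.truncated);
    }
}
```

With JDBC the same limit can also be pushed toward the database, for example with Statement.setMaxRows or a row limit in the query itself. Either way, the cap only proves itself when the test data is large enough to reach it.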

In our case, testing with big numbers revealed concurrency issues that we did not and could not find when developing with simple, smaller tests. Our multi-threaded, multi-node messaging system would routinely deadlock whenever we slammed it with lots and lots of messages. It didn’t do this when we posted simple messages to the endpoint during development. Similarly, we noticed that we held on to objects for too long during big batch runs. They all got garbage collected when the batch completed, so it wasn’t exactly a memory leak, but there was no need to hold on to the references. After we fixed that, we noticed that our servers would stay within a tight and predictable memory band, as opposed to GC’ing all at once at the end of a batch. Terracotta Server expertly pages large heaps to disk, so we were never at risk of running out of memory. Still, it’s nice to see our JVMs running lean and mean.

We’re still stress testing our system today. This past weekend, we pumped over one million real-world messages through our message bus. Our concurrency issues are gone, memory usage is predictable, and we stopped our test only to let our QA department have at our servers. There are zero technical reasons why our test couldn’t have processed another million messages. Terracotta Server held up strong throughout.

But we’re still not done testing yet. We still have to see what happens if we pull the plug (literally) on our message bus. We still have to throw poorly written messages/jobs at our bus to see how they’re handled. We need to validate that we’ve got enough business visibility into our messaging system so that operations folks have enough info at runtime to do their jobs. We need canaries in the coal mine to inform us when things are going wrong, and for that we need better metrics and reporting built into our system that shows performance over time.

We’re code complete, but we’re not done yet.

Dead Programs Tell No Tales (or “We don’t need no stinkin’ error handling!”)

Imagine two pieces of software. One has “robust error handling” while the other prints a stacktrace and dies. Which one do you prefer?

I like the one that dies. Loudly. Depending on the type of application you’re building, the dead program might serve you better, too. Why? Because it’s obvious when something goes horrifically wrong. The message won’t be buried in a snowcrash of logging output.

What is “robust error handling” anyway? My team is currently building a sophisticated message bus and we’ve run into a few subtle concurrency issues. These are the hardest things to find when writing a distributed application with many threads on different physical nodes. Our error handling consisted of trapping the exception, logging it to Log4J at ERROR level, setting the correct state on our class (or so we thought), and going back to what the code was doing. It seemed like a good first pass, except that it didn’t work. We didn’t or couldn’t predict every possible state in our system, so we made our best guesses, but naturally some corner cases bit us with an unexpected deadlock. Everything stopped and we didn’t know why.

Our logs, naturally, were huge with debugging turned on. Application logging is largely useless, anyway, without a plan to use it. And without a debug statement prefixing every line of code, you’re going to have a hard time finding deadlock situations across JVMs on the network.

So how’d we find our gremlin? We killed the process. Our “robust error handling” now looks like this:

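try {
     // attempt the work
} catch(Exception e){
     e.printStackTrace();   // say what went wrong, loudly
     System.exit(1);        // then die; the exit status here is illustrative
}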

Once we made that change and deployed our software to all the nodes, we found our deadlock.

Our problem was an unsynchronized getter (you think gets are reads and therefore thread safe? Ha!). It turns out that somewhere in the callstack, this getter called toArray() on an ArrayList, which internally uses an iterator to build the array. If you’ve done any multi-threaded programming, you probably know what happens when another thread tries to modify your collection/list concurrently while using an iterator.
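
For readers who have not met this failure, the sketch below shows the general shape of it and of one possible fix. Everything in it is an assumption made for illustration: ParentMessage and its methods are hypothetical names, and returning a copy from the getter is one common remedy, not necessarily the one used here. The demonstration changes the list from the same thread that is iterating it, which triggers the same fail-fast check deterministically; across real threads the outcome depends on timing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

/** Hypothetical parent message that tracks its child messages. */
final class ParentMessage {
    private final List<String> children = new ArrayList<>();

    synchronized void addChild(String childId) {
        children.add(childId);
    }

    /**
     * Synchronized getter: takes the same lock as addChild and hands back a
     * copy, so callers never iterate the live list.
     */
    synchronized List<String> getChildren() {
        return new ArrayList<>(children);
    }

    /** Shows the fail-fast iterator tripping when the list changes under it. */
    static boolean iterationFailsWhenListChanges() {
        List<String> live = new ArrayList<>(Arrays.asList("a", "b", "c"));
        try {
            for (String id : live) {
                live.add(id + "-retry");   // structural change mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unsafe iteration failed: " + iterationFailsWhenListChanges());

        ParentMessage parent = new ParentMessage();
        parent.addChild("child-1");
        for (String id : parent.getChildren()) {
            parent.addChild(id + "-retry");   // safe: iterating a snapshot
        }
        System.out.println(parent.getChildren());
    }
}
```

The copy costs an allocation per call, which is usually a fair price for a getter that is safe to call from any thread.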

Our problem arose in a parent message (those that divide work among many child messages for parallel execution) which would leave orphaned children in certain error scenarios. We didn’t discover this corner case in the logs, but we found it quickly when we crashed the program and exited.

I understand that exiting a running program isn’t the correct solution for all problems, but it was for ours and it was dramatically more revealing than looking through tons of debug gibberish in log files. So, if you have the kind of program that can safely exit, then I say…

Fail loudly and proudly.

Caveat emptor

I think the claims made by the people hawking this book are some of the most disingenuous things I’ve ever read:

They are selling a rehash of the classic Gang of Four (GoF) Design Patterns book as a PDF, making preposterous claims which I’ll cut & paste here. You can’t make this stuff up.

The Whys:

Why should I read it?
When you finish reading this book, you can go to your boss and ask him for a promotion.

Why? Because using design patterns will allow you to get your tasks done twice faster, write bugless code and create an efficient and reliable software architecture.

How to become programming guru?
The main difference between a programming guru and a novice is the knowledge of secret coding tricks, as well as awareness of most cornerstones and the ability to avoid them.

Design patterns were created as a Bible of avoiding problems related to software design. Isn’t it a true guru’s handbook?

“Bugless code” after learning design patterns! I must be a poor learner, because I still have bugs in my code and I’ve lived with the GoF book for many years now. Unless you are writing code for the space shuttle, you’ve probably written your share of bugs, too. And bosses don’t give promotions because you read a book about patterns. Mine gave me promotions because of hard work, passion, and creativity in problem solving.

If you follow through to the order page, you see the publisher is based in Ukraine. That explains the broken English “testimonials”:

Daniel Sommers, UK

I have learned all design patterns about a 3 days with this book. Thank you very much!!!!


Jeremy Parkinson, USA
Four month ago I was just coder in Boeing corp. I had a lot things to learn to get a level up in my skills, and I started with this book. Now I am a software architect and I happy, because nobody in my department is so good with programming as me!

I’d be pissed off if I were Boeing. That “testimonial” makes me think the talent there must be terrible. Boeing does “aerospace engineering.” Rocket scientists. Literally. I suspect they are a smarter bunch than “Jeremy Parkinson.”

Free Book Offer:

Buy our book now and get a free gift! (limited offer)
It is classic “Design Patterns” book by “Gang of Four”.

Amazon is selling the classic GoF book for $38, but this publisher is going to sell you a PDF for $20 and give you a $38 book for free. If they gave a single GoF book away for free, would that be considered a “limited offer”?

The odd thing is that the quality of the rest of the site appears, at first blush, quite good. The writing on the pages describing each of the patterns is good and without any outlandish claims. It makes me wonder if they didn’t get that copy from somewhere else. Regardless, the content on the site is sufficiently good that one wouldn’t need to buy their PDF.

Save your money. Or better yet, buy the real GoF book. It’s a classic for a reason, and there are products that nicely complement the book, like a wall poster detailing the original patterns in a striking visual way.
