Home > Architecture, Engineering, Technology, Terracotta > Scalability & High Availability with Terracotta Server

Scalability & High Availability with Terracotta Server

Our message bus will be deployed to production this month. We’re currently sailing through QA. Whatever bugs we’ve found have been in the business logic of the messages themselves (and assorted processing classes). Our infrastructure — the message bus backed by Terracotta — is strong.

SCALABILITY

People are asking questions about scalability. Quite frankly, I’m not worried about it.

Scalability is a function of architecture. If you get it right, you can scale easily with new hardware. We got it right. I can say that with confidence because we’ve load tested the hell out of it. We put 1.3 million real world messages through our bus in a weekend. That may or may not be high throughput for you and your business, but I guarantee you it is for our’s.

The messages we put through our bus take a fair amount of processing power. That means they take more time to produce their result than they do to route through our bus. How does that affect our server load? Terracotta sat idle most of the time. The box hosting TC is the beefiest one in our cluster. Two dual-core hyperthreaded procs, which look like 8 CPUs in htop. We figured we would need the most powerful server to host the brains of the bus. Turns out we were wrong, so we put some message consumers on the TC box, widening our cluster for greater throughput. Now the box is hard at work, but only because we put four message consumers on it.

When we slam our bus with simple messages (e.g, messages that add 1+1), we see TC hard at work. The CPUs light up and the bus is running as fast as it can. 1+1 doesn’t carry much overhead. It’s the perfect test to stress the interlocking components of our bus. You can’t get any faster than 1+1 messages. But when we switched to real world messages, our consumers took all the time, their CPUs hit the ceiling, and our bus was largely idle. The whole bus, not just TC. We’ve got consumers that perform logging and callbacks and other sundry functions. All of these are mostly idle when our message consumers process real world workloads.

We’ve got our test farm on 4 physical nodes, each running between 4 and 8 Java processes (our various consumers) for a total of 24 separate JVMs. All of these JVMs are consumers of queues, half of them are consumers of our main request queue that performs all the real work. The other half are web service endpoints, batch processors, loggers, callback consumers, etc. and each are redundant on different phsyical nodes. Because our message processing carries greater overhead than bussing, I know we can add dozens more consumers for greater throughput without unduly taxing Terracotta. If we hit a ceiling, we can very easily create another cluster and load balance between them. That’s how Google scales. They’ve got thousands of clusters in a data center. This is perfectly acceptable for our requirements. It may or may not be suitable for your’s.

You might be thinking that dozens of nodes isn’t a massive cluster, but our database would beg to differ. Once we launch our messaging system and start processing with it, we’ll begin to adversely impact our database. Scaling out that tier (more cheaply than buying new RAC nodes) is coming next. I hope we can scale our database as cheaply and easily as our message bus. That’ll enable us to grow our bus to hundreds of processors.

Like I said, I’m not worried about scaling our bus.

HIGH AVAILABILITY

I might not be worried about scalability, but I am worried about high availability. My company is currently migrating to two new data centers. One will be used for our production servers while the other is slated for User Acceptance Test and Disaster Recovery. That’s right, an entire data center for failover. High availability is very important for our business and any business bound by Service Level Agreements.

Terracotta Server has an Active-Passive over Network solution for high availability. There is also a shared disk solution, but the network option fits our needs well. Our two data centers are connected by a big fat pipe, and Terracotta Server can have N number of passive servers. That means we can have a redundant server in our production data center and another one across the wire in our DR data center. We’ve also got a SAN that replicates disks between data centers. We might go with the shared disk solution if we find it performs better.

Overall, though, it is more important for our business to get back online quickly than it is to perform at the nth degree of efficiency. Messaging, after all, doesn’t guarantee when your stuff gets run, just that it eventually runs. And if everything is asynchronous, then performance, too, is a secondary consideration to high availability.

CONCLUSION

If there’s one lesson to be learned through this blog article, it’s that one size does not fit all. Not all requirements are created equal. Our message bus is the right solution for our needs. Your mileage may vary. Some factors may outweigh others. For example, having a tight and tiny message bus that any developer can run in their IDE without a server (even without TC) is a great feature. No APIs lets us do that with Terracotta. You might have very different requirements than we do and find yourself with a very different solution.

  1. March 8th, 2008 at 15:23 | #1

    Mark,

    Thanks for sharing, and good luck with Terracotta. I’ve been taking a look at their technology and find it interesting, but haven’t yet found the right fit for my environment. I like to see what kinds of problems people are solving with it….

    BTW, there is a real truth in your blog today– 1.5M messages a weekend is not super high for some lines of business. (Case in point: I’m currently working with an application that hopes to process maybe 1 BILLION database rows in a short number of hours.) We’re an oddball case, but it does demonstrate the point you made– ’scalable’ is relative to your business problem.

    Good luck, and thanks again for posting info on your environment. Please share as you go, we can all learn from each other.

    Rick

    http://www.rickwagner.blogspot.com/

  2. March 8th, 2008 at 21:15 | #2

    Rick,

    You are right, it’s all relative. 1.3m messages is a significant amount of business for what my company does. Very significant. Naturally, I can’t divulge the company secrets, but if 1.3m messages represents $X millions of dollars of business, then 1 billion messages would represent a hundred-fold increase in business for us. That’d bring us to the multi-billion dollar range. I think I can say confidently on behalf of my company that we’re ready for that challenge. ;)

    Mark

  3. Glen
    April 9th, 2008 at 16:10 | #3

    Mark,

    Interesting reading, agree completely on the relativity of ’scalable’. I am currently working on a system with traffic growing an order of magnitude every year, and the business wants a system to last 3 years …

    Terracotta is high on my list as a possible for simplifying parts of the system and its good to hear your experiences with it.

    Glen

  1. No trackbacks yet.

Switch to our mobile site