25th Jan 2008 by Mark Turansky

Terracotta Server as a Message Bus

Terracotta is excellent software to glue messaging components together. This article is a high-level view of how we used Terracotta (TC) to create our own messaging backbone.

Just a few weeks ago I made two predictions for 2008, both centered on Terracotta. Since that time, I’ve gone deeper into the server and used it to write a message bus for a non-trivial integration project.

I’m impressed.

Our first implementation used a MySQL database for a single queue. JTA transactions running “select for update” statements against InnoDB worked just fine, actually, but there were other clunky things about that implementation. All roads looked like they led to queuing and routing. In a nutshell: enterprise messaging with multiple queues, not just batch jobs on a single queue.

Our second implementation (I believe strongly in prototyping, à la Fred Brooks’s “Plan to throw one away”) used JMS. Early in our design process, we talked about implementing our own messaging system using TC. We talked ourselves out of it because a) no one else that we knew of had done it, and b) ActiveMQ is open source and mature, and Camel looked very cool insofar as it gives you a small domain-specific language for routing rules between queues. The Camel project claims to have implemented all the patterns in Enterprise Integration Patterns (EIP).

Well, we managed to deadlock ActiveMQ with multiple clients running with Spring’s JmsTemplate. Our request queue would just stop: we’d get an error saying our message couldn’t be found, and the queue would simply stall. We couldn’t restart it without bouncing ActiveMQ, and new clients all blocked on the queue. ActiveMQ did not survive our load test well. When we inquired, we were told about a known problem between Spring and ActiveMQ and that we should use the latest snapshot.

DISCLAIMER: I understand the preceding paragraph is entirely FUD unless I provide tangible evidence otherwise. We’ve since moved on from that implementation and removed all the JmsTemplates from our Spring apps. I won’t be providing screenshots or sample code to deadlock their server. To be fair, we did not choose to try again with another FOSS JMS queue, like JBoss. Our configuration of ActiveMQ and our Spring JmsTemplate clients may have been wrong. Feel free to take my criticism above with the proverbial grain of salt.

Happily, my team understands good design and the value of clean interfaces. All JMS-related code was hidden by handler/listener interfaces. Our consumer logic did not know where the messages (our own domain objects) came from. Implementations of the handlers and listeners were injected by Spring. As a result, it took just 90 minutes to swap in a crude but effective queueing and routing system using Terracotta. We’ve since cleaned it up, made it robust, added functionality for business visibility, and load tested the hell out of it. It all works beautifully.

Here are the main ingredients you need to roll your own message bus with Terracotta:

  1. Knowledge of Java 5’s Concurrent API for queueing

    Java’s Concurrent API expertly handles nearly all of your threading issues. Bounded LinkedBlockingQueues (also ArrayBlockingQueues) will neatly throttle your entire system for you. Consumers live in their own threads (and through the magic of Terracotta they can live in their own JVMs!) and can safely remove the next item from the queue, optionally waiting for a period of time for something to become available. Producers can add messages to a BlockingQueue in a thread-safe way, also optionally waiting for space to become available.

  2. Knowledge of Java threading for consumers and producers

    You’ll need to be able to start and stop your own threads in order to create producers and consumers.

  3. Daemon Runners

    Daemon Runners (my term for them; a better one may already exist) are long-running POJO Java processes that you can shut down cleanly later. Browsing Tomcat’s source code taught me a neat trick for hooking into a running JVM. Write a main program which spawns a thread that runs your actual application. Have the main thread open a ServerSocket and await a connection. When a token such as “stop” comes through, main stops its child thread and your application can exit gracefully. Anything else over the socket can be ignored, which lets your ServerSocket go right back to listening. We implemented a “gc” command, among others, to provide simple but effective hooks into our running processes anywhere on the network. You just need the IP and port. You can optionally put IP checks into your daemon runner to validate that the IP sending the token is a trusted one. Our runners only accept tokens from 127.0.0.1. SSH lets us run scripts from across the network.

  4. Named classloaders

    Named classloaders are a TC trick needed to run multiple stand-alone Spring applications yet have them share the same clustered data. TC ties applications together using specific names for classloaders. Modules they’ve built already know how to cluster Tomcat Spring applications, for example, because the classloaders are the same every time. In standalone apps, you’re not guaranteed that the system classloader even has a name, let alone the same name across JVMs. See this post on TC’s forums to make a named classloader. It wasn’t hard. There may be another way to cluster standalone Spring apps, but the named classloader did the trick for us. You will need to bootstrap your application to make this work. You should probably be doing this anyway.

  5. Spooler

    A Spooler lets your messaging system accept messages long after the rest of the queues get throttled by a bounded BlockingQueue. Your Spooler is an endpoint (maybe a web service endpoint) that will put everything it receives into an unbounded queue: your spool. A Spool consumer will read from the spool and forward to the next queue. Because the next queue is bounded, you’ve achieved throttling. You may have other components in your messaging system that require spooling. For example, we’ve got a consumer that performs callbacks and posts the results of the message to the callback URL. What happens if the callback endpoint is down? We don’t want our throttled message system to stop processing messages, so we spooled messages going into the callback queue.

  6. Consumer interface

    You’ll need to create a class or two around queue consumption. Our first crude implementation simply injected the queue itself into the listening thread. The listening thread blocks/waits on the blocking queue (hence the name!) until something is available. We’ve refined that a bit so that we now have listener classes that monitor the queues and pass the messages to consumer classes. The business logic is pure POJO Java logic, which is easily unit testable. This is, in essence, an event-driven system where your POJO class accepts events (messages) but doesn’t know or care where they came from. You want to decouple the business logic from the plumbing.

  7. Terracotta Server: messaging backbone and glue

    Last but not least, you need some queues, you need multi-JVM consumers, you need persistent data (a message store) that won’t get wiped out with a catastrophic failure, you need business visibility to track health and status of all queues and consumers, and you need to glue them all together. Terracotta Server handles these requirements very well.
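To make the queueing, spooling, and consumer ingredients above more concrete, here is a minimal single-JVM sketch in plain JDK Java. It is not our production code: the names MessageBusSketch, MessageConsumer, QueueListener, and runPipeline are hypothetical, the 100 ms poll interval and 5 second timeout are arbitrary, and no Terracotta configuration is shown (clustering the queues across JVMs is a separate step that this sketch does not cover).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class MessageBusSketch {

    /** Business logic: a POJO that handles a message without knowing where it came from. */
    interface MessageConsumer<T> {
        void onMessage(T message) throws InterruptedException;
    }

    /** Plumbing: watches one queue and hands each message to a MessageConsumer. */
    static class QueueListener<T> implements Runnable {
        private final BlockingQueue<T> queue;
        private final MessageConsumer<T> consumer;
        private volatile boolean running = true;

        QueueListener(BlockingQueue<T> queue, MessageConsumer<T> consumer) {
            this.queue = queue;
            this.consumer = consumer;
        }

        void stop() {
            running = false;
        }

        @Override
        public void run() {
            try {
                while (running) {
                    // Wait a bounded time so an idle listener still notices stop().
                    T message = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (message != null) {
                        consumer.onMessage(message);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag and exit
            }
        }
    }

    /** Sends messages through spool -> bounded queue -> consumer; returns what was consumed. */
    static List<String> runPipeline(List<String> messages, int capacity) {
        BlockingQueue<String> spool = new LinkedBlockingQueue<String>();        // unbounded
        BlockingQueue<String> work = new LinkedBlockingQueue<String>(capacity); // bounded
        List<String> consumed = Collections.synchronizedList(new ArrayList<String>());
        CountDownLatch done = new CountDownLatch(messages.size());

        // Spool consumer: put() blocks while the bounded queue is full, which is the throttle.
        QueueListener<String> forwarder = new QueueListener<String>(spool, work::put);
        // Business consumer: plain logic that can be unit tested without any queue.
        QueueListener<String> worker = new QueueListener<String>(work, message -> {
            consumed.add(message);
            done.countDown();
        });

        Thread forwarderThread = new Thread(forwarder, "spool-forwarder");
        Thread workerThread = new Thread(worker, "worker");
        forwarderThread.setDaemon(true);
        workerThread.setDaemon(true);
        forwarderThread.start();
        workerThread.start();
        try {
            for (String message : messages) {
                spool.put(message); // an unbounded queue never makes the producer wait
            }
            boolean finished = done.await(5, TimeUnit.SECONDS);
            forwarder.stop();
            worker.stop();
            forwarderThread.join(1000);
            workerThread.join(1000);
            if (!finished) {
                throw new IllegalStateException("pipeline did not drain in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<String>(consumed);
    }

    public static void main(String[] args) {
        System.out.println(runPipeline(Arrays.asList("a", "b", "c"), 2)); // prints [a, b, c]
    }
}
```

The design point is that the worker lambda is the only place business logic lives. Swapping the queue implementation, or where the queue is hosted, should not touch it.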
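The Daemon Runner idea above can be sketched roughly as follows. This is an illustration under assumptions, not Tomcat’s code and not our runner: the class name DaemonRunner and its methods are hypothetical, port 0 (let the OS choose) is used only so the example can run anywhere, and instead of checking the sender’s IP it simply binds the control socket to the loopback address, which has a similar effect of refusing remote senders.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class DaemonRunner {
    private final ServerSocket control;
    private final Thread application;
    private volatile boolean stoppedCleanly = false;

    DaemonRunner(Runnable app) throws IOException {
        // Port 0 lets the OS pick a free port; binding to loopback keeps remote hosts out.
        this.control = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        this.application = new Thread(app, "application");
    }

    int port() {
        return control.getLocalPort();
    }

    boolean stoppedCleanly() {
        return stoppedCleanly;
    }

    /** Starts the application thread, then listens for control tokens until "stop" arrives. */
    void runUntilStopped() throws IOException, InterruptedException {
        application.start();
        try {
            boolean stop = false;
            while (!stop) {
                try (Socket client = control.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8))) {
                    String token = in.readLine();
                    if ("stop".equals(token)) {
                        stop = true;
                    } else if ("gc".equals(token)) {
                        System.gc(); // example of an extra hook
                    }
                    // Any other token is ignored and the loop goes back to listening.
                }
            }
        } finally {
            control.close();
        }
        application.interrupt(); // ask the application to finish
        application.join();
        stoppedCleanly = true;
    }

    /** Client side: sends one token to a runner on this machine. */
    static void send(int port, String token) throws IOException {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true)) {
            out.println(token);
        }
    }

    /** Runs a stand-in application, sends an unknown token and then "stop". */
    static boolean demo() {
        Runnable app = () -> {
            try {
                while (true) {
                    Thread.sleep(50); // stand-in for real work
                }
            } catch (InterruptedException e) {
                // interrupted by the runner: fall through and exit
            }
        };
        try {
            DaemonRunner runner = new DaemonRunner(app);
            Thread main = new Thread(() -> {
                try {
                    runner.runUntilStopped();
                } catch (IOException | InterruptedException e) {
                    throw new IllegalStateException(e);
                }
            }, "runner-main");
            main.start();
            send(runner.port(), "ping"); // unknown token: ignored
            send(runner.port(), "stop"); // triggers shutdown
            main.join(5000);
            return runner.stoppedCleanly();
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real deployment the port would be fixed and documented so that a shell script, run over SSH, can send the token from the same machine.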

TC really came through for us. We were curious about some of its behavior in a clustered environment. We made some assumptions about its behavior based on what would be ideal for minimizing network chatter and limiting heap size. TC nailed every single one of our assumptions.

We made the following assumptions and were happy to find out that all held up under load testing:

  1. L1 clients that were write-only wouldn’t ever need to have the entire clustered/shared dataset faulted to their heaps. If you’re not going to read it, you don’t need it locally.
  2. Clustered ConcurrentMaps have their keys faulted to all L1 clients, but values are retrieved lazily.
  3. Reading from a BlockingQueue would fault just one object to the L1 client, instead of faulting in the entire queue, because the single object is retrieved in a TC transaction.
  4. TC and our unbounded spools wouldn’t run out of memory because TC pages object graphs to disk. Our unbounded L1 clients would work within an acceptable memory band.
  5. We can add/remove consumers to any point in our messaging system without affecting the entire system.

We’ve got our canaries in the coal mine, so we see what the entire system is doing in real time. We’re happy to see that our memory bands are predictable and that we’re entirely CPU bound. This is excellent for horizontal scalability. We can simply throw more processors at any part of our system to scale out. It doesn’t look like Terracotta Server will be a bottleneck because the messages we’re processing take significantly more time to crunch than it takes to route through our queues. We have enough juice on our TC box to handle dozens more consumers across the network, which would give us significant throughput gains. We can revisit this when we have the need for hundreds of consumers. I’ll assume TC server will scale up with us, but if it can’t for any reason, it is perfectly acceptable to have more than one messaging cluster. That’s how Google scales. There are lots and lots of clusters in Google’s datacenters. Bridging between two messaging systems is a solved problem. That’s what messaging is, after all: a connection between disparate systems.

What did we gain?

Initially, we had MySQL. Then we added ActiveMQ, which is backed by MySQL. We saw how TC server would be beneficial if only to cluster POJOs that gather runtime data, so we had TC server in the mix. That’s three different servers in our system all of which needed high availability and routine backups. All were configured in Spring, making our dependency injection a maze to follow through.

When we switched to a TC message bus, we got rid of two-thirds of the infrastructure and most of the Spring configurations. We now have just one piece of infrastructure to maintain in a highly available way.

But I’m a guy that really likes simple. TC lets us make an entirely POJO system that runs beautifully in IntelliJ. A single “container” type main program can run all our components in a single JVM simply by loading all our various Spring configs. Developers can run the entire messaging system on their desktop, in their IDE, and run their code against it. They can post messages to an endpoint listening on 127.0.0.1 and debug their message code inside the messaging system itself.

We replace our container main with Terracotta in our integration and test environments. TC seamlessly and invisibly wires together all the components of the system, irrespective of where they live on the network. The POJO model goes wide with Terracotta server. It’s elegant, simple, and Just Works™.


25 Responses to “Terracotta Server as a Message Bus”

  1. Lari H Says:

    ActiveMQ 4.1.1 is very stable. ActiveMQ 5.0 is not production quality; it has the problems you described (JIRA contains several bugs like AMQ-1544 https://issues.apache.org/activemq/browse/AMQ-1544). The problems will be fixed in the 5.1 release.

  2. Mark Turansky Says:

    Thanks, Lari, for explaining why we may have had problems. We were using 5.0 instead of 4.1.1. ActiveMQ seems like a cool project, which is why we used it first. Keep up the good work.

  3. Joe blow Says:

    How does Terracotta guarantee that messages are not lost? Where is the persistent storage?

    JB

  4. Mark Turansky Says:

    See http://www.terracotta.org/confluence/display/docs1/Configuring+a+Terracotta+Server+Cluster for more detail on how TC achieves high availability and failover. Briefly, saving the object graph to disk is part of the TC transaction, if you have it configured that way.

  5. Fred Flintstone Says:

    Terracotta should have a little project like this in their examples svn.

    Replacing ActiveMQ with something that scales. :)

  6. beza1e1 Says:

    Brooks later wrote that “Plan to throw one away” is wrong and incremental development is better. See The Mythical Man Month - Anniversary edition, page 270. ;)

  7. Mark Turansky Says:

    @beza1e1: True, and to be fair, we didn’t start from scratch every time. We did major refactoring at each step. We didn’t throw out the baby with the bathwater. I think the idea that you learn the real solution only after implementing it once is still a good one, though. We certainly did on this project, which led us to change our direction after understanding more deeply what it is we had to do.

  8. Adam Malter Says:

    I find this stuff very interesting, seeing people ‘doing more with less’. I am still confused about what kind of transactional reliability Terracotta provides. I see that someone mentioned ‘Terracotta Transactions’. Are these transactions guaranteed to be atomic, and can I configure how other nodes in the cluster see the tx (i.e. Read Uncommitted, Read Stable, etc.)?

    Also, can I integrate these transactions into a standard JTA tx, so that I can hook them into an XA database connection?

    MQ Series is probably our biggest cost per node right now, but it is dead reliable (both from a XA and HA perspective). The stuff we run through has high tx reliability concerns, is this an application that Terracotta is suitable for?

    Thanks for blogging on this stuff! Keep up the interesting work..

  9. Mark Turansky Says:

    Thanks for the interesting questions, Adam. I think the guys at Terracotta would have the best answers. I like their product a lot and we’re learning more about it every day, but the company that wrote it can best speak to the specific details.

  10. Taylor Gautier Says:

    @Adam,

    Terracotta is ACID. Transactions from Terracotta’s point of view follow the Java memory model - in fact that is exactly how Terracotta does what it does - it plugs in to the Memory Model so there is no behavioral difference between threads in one JVM and threads in multiple JVMs. This means that happens-before is honored across the cluster - stable views are managed using Java semantics - namely the synchronized keyword with a bit more power than standard synchronized coming from native support for read and write locks - read/write is also hooked in to java.util.concurrent.ReentrantReadWriteLock. I blogged about this a bit here: http://javathink.blogspot.com/2008/01/stupid-jvm-tricks-read-lock-from-just.html

    The Java memory model is not hooked in with JTA, and so Terracotta does not support JTA. If you are using JTA to write across more than one system, and that is a hard requirement, then you might need to look elsewhere for a while. However, we find that most people really don’t need full 2pc - it’s too slow and not really needed. What you do need is stable, restartable, durable memory, and Terracotta does give you that.

    Hope that helps.

    Taylor Gautier, Product Manager, Terracotta

  11. Tom Says:

    Actually, ActiveMQ is not mature. It just looks mature, since it has a nice version number (4.11).

    ActiveMQ operates in a bizarro world. ActiveMQ supports the Stomp protocol. But when you use Stomp, ActiveMQ 4.11 does not validate passwords. You can connect to any queue/topic you want, with any password. In any project, such a bug would involve a security advisory, and an immediate bug. ActiveMQ has never released a security advisory. It took more than four months for this to get fixed in 5.0.

    An even more bizarre bug is the fact that ActiveMQ does not know how many messages it has in the persistent store. Put 10 messages into a queue, restart ActiveMQ, and now ActiveMQ says you have 0 messages in the queue. Run a client to collect the messages, and now ActiveMQ says you have -10 messages in the queue! This bug has existed for more than a year, and is still unfixed.

    ActiveMQ’s development is too insular. ActiveMQ is stable, so there are no bugs to fix, right?

  12. » Dead Programs Tell No Tales (or “We don’t need no stinkin’ error handling!”) Says:

    […] is “robust error handling” anyway? My team is currently building a sophisticated message bus and we’ve run into a few subtle concurrency issues. These are the hardest things to find when […]

  13. Debasish Ghosh Says:

    @Taylor,

    In our application we currently have JMS and JDBC based resources participating in user transactions using XA. I would like to use Terracotta as the message bus and take off JMS. How do I manage the transactions in this case ?

    Any help will be appreciated.

  14. » Code complete doesn’t mean you’re done Says:

    […] our system today. This past weekend, we pumped over one million real-world messages through our message bus. Our concurrency issues are gone, memory usage is predictable, and we stopped our test only to let […]

  15. » htop on Solaris Says:

    […] got used to htop’s color-coded bars in a console for our messaging system, and then we deployed to a datacenter on Sun hardware without htop.  prstat just isn’t quite […]

  16. » Horses For Courses 2 (or Tools for Fools) Says:

    […] you’ve got a dozen nodes on the network, each one hosting components of your enterprise message bus. You and your operations folks need visibility into the entire system so that you can tell at a […]

  17. » Scalability & High Availability with Terracotta Server Says:

    […] message bus will be deployed to production this month. We’re currently sailing through QA. Whatever bugs […]

  18. Jordan Says:

    This looks very promising! btw - is there a known clean way to handle a topic or pub-sub model as opposed to a distributed queue using similar components.

  19. » No one should work alone. Ever. Says:

    […] pieces of software from the monolithic whole and deployment them as separate components on our message bus. All our components are deployed in isolated classloaders, which will solve their problem. The […]

  20. » Why Linux will never be the world’s primary desktop Says:

    […] I’ve installed Ubuntu on a work machine. Damned Small Linux is our OS of choice for our message bus.  I’m in the minority of users.  It takes one to know […]

  21. David Says:

    @Tom: I have to concur with your take on ActiveMQ stability. I had to pull back from using it, as members were randomly leaving the network of brokers, never joining it back again.

    FYI. I have switched to a clustered deployment of Sun’s Open MQ, which is sturdy and less fancy than other OSS JMS implementations out there, and, so far, I am more than happy with it.

  22. » Very Old School — Walking down memory lane Says:

    […] we are, a decade later, and I’m busy integrating legacy applications into our shiny new message bus. It’s highly concurrent, runs all our integrated applications in a single JVM but in isolated […]

  23. » You can’t keep a good idea down Says:

    […] message bus project was more than just replacing JMS with a POJO messaging system. It’s a whole piece of […]

  24. » More proof that you can’t keep a good idea down? Says:

    […] Nygard describes an architecture that echoes many of the features I implicitly spoke of in my first blog article about my big integration project / message bus. You may be asking yourself right now, why does he keep talking about this particular project? […]

  25. William K. Says:

    Any performance benchmarks ? Tibco Rendezvous does up to 6K messages per second, Server publishing, client consuming, both on the same box.

    Also it would be helpful to know across network performance for messaging consumers and publishers.

    thanks a bundle for the great article.
