We achieved a 3.5X increase in throughput by implementing “mini-batching” in our grid-enabled jobs.

We have a parent BatchService that creates child Services where each individual Service is a unit of work.  A Service implementation might perform some calculation for a single employee of a large employer group.  When the individual Services are very fast and the cost of bussing them around the network is greater than the cost of processing the Service, then adding more consumers makes the BatchService run slower!  It is slower because these fine grained units of work require more queue locks, more network traffic, and more handling calls when the child Service is returned back to the parent BatchService for accumulation.

The secret, then, is to give each consumer enough work to make the overhead of bussing negligible.  That is, give each consumer a “mini-batch” of Services to run instead of sending just one Service to a consumer.

Here’s a graph of some of our benchmarks:


Some of the data surprised us.  For example, we expected 3 big batches to run fairly slowly across 11 consumers because there would be 8 consumers sitting idle, but we were not expecting 11 batches to run more slowly than 43 batches.  We thought dividing the work equally across consumers in the exact number of batches would be the lowest point on the graph.  We were wrong.  We expected the U-shape, but we thought the trough would be at a different batch size.

Our test system can only support up to 11 consumers, so we haven’t yet tested batch sizes with more than 11, but the graph implies that we’ll have a deeper trough when we add consumers and tweak the batch size.  There should be, in theory, a point where we can’t process jobs any faster due without killing the database.  I’ve warned our DBAs that we’re looking to hit that point.

If you’re doing any kind of grid computing (by way of Terracotta’s Master-Worker project, GridGain, or rolling your own), check out the effects mini-batching can have on your throughput.  You might be surprised by your benchmarking metrics!