I once read a great quote:

Oracle may scale technologically, but it doesn’t scale financially

We run Oracle in our shop, and we’ve got a big database. Hundreds of tables. Millions and millions of rows of data. Extensive and detailed audit records of every database transaction. We recently deployed Oracle RAC to help our data layer scale, and it works! We can separate web traffic from internal operations. We can target specific RAC nodes with heavy hitting reporting queries while leaving other nodes available to crunch jobs.

There’s only one problem… new nodes cost a lot of $$$$$.

So, what do you do when you see your fancy clustered database running at 50% of capacity? Beg the CFO to budget for a few more nodes next year? What if you get hit with a sudden burst of web traffic or increased demand for reports or you have to crunch umpteen thousand jobs by next weekend?

Really, the question is, how do you scale your database simply, easily, and cheaply?

Some would say to dump Oracle for a cheaper database. Google famously runs MySQL for AdWords. If it’s good enough for a multi-billion dollar business, it’s probably good enough for my needs. While I might agree with that assessment, I can also say unequivocally that our business cannot migrate off of Oracle in the near term. That would require a rewrite of major pieces of our business.

So, the question still… what do you do when you see your fancy clustered database running at 50% of capacity?

I’ve got a theory. Oracle CPU cycles are expensive. CPU cycles on commodity x86 hardware running commodity software (Linux, HTTPD, and Tomcat) are cheap. If we can move the cycles from Oracle to commodity boxen, we’d be using our database less, the average cost per CPU cycle drops, and we can scale further for less money.

An Oracle RAC cluster is a chatty cluster. One node is kept in sync by constant updates from the other nodes. This is in contrast to the “Master/Slave” configuration of other databases where there is a lag between updates as a node waits for log file shipments.

Chatty cluster, eh? Real time updates? Hmmmm….. I believe I can recreate this at the application level using commodity hardware and software. I believe that we can hold lots and lots and lots of data in memory in an application cluster, thereby reducing the number of queries we run against Oracle. I believe the application cluster can request data from another node in the cluster that’s faster then getting it from Oracle. After all, both are a network call, but one (the db) might also require IO, while the app cluster has it in memory.

I also believe we can test this whole theory and let the numbers guide our decision. We can run benchmarks against Oracle by scaling the number of clients we have connected to it. We’ll know that X number of app servers brings a single RAC node to 100% capacity. If we test this with linear growth of app servers, then we can do some simple math. We need Y RAC nodes to handle X app servers.

And then we benchmark the commodity application cluster. In theory, Oracle should now be used for lazy reads and writes. Y should be smaller than in our previous test, and X should be able to grow to a very large number. The larger X is, the more memory you have in the cluster and the less you rely on lazy reads from Oracle, keeping Y small.

That’s the theory. Ok, it’s not much of a theory. It’s also known as caching. In this case, it’s distributed caching. The concept is simple, but designing and deploying an elegant solution that scales is harder. Retrofitting a 5 year old organically grown enterprise application with a database that’s pushing 300gb in size to use distributed caching while remaining 100% backwards compatible is harder still. Caching is a good idea, but good ideas aren’t worth much by themselves. The plan to implement the idea and the skills required to deliver it are worth a lot more.

I will be laying out the architecture and benchmark results in future posts.