Archive for December, 2007

Performance != Scalability

Performance and scalability are not the same.

It is insufficient to run through a QA environment and say, “Everything seems fast enough!” You may load test the same QA environment and conclude that performance is good, so you’ll be able to scale. You may be wrong! Performance and scalability are two very different things. You can often get more scalability out of applications that perform well — they aren’t mutually exclusive — but you don’t actually need one to get the other.

Performance is often about algorithms, which may be fast but not scalable. Building a select box in a server page might be very, very fast when you only have 10 items to put in it, but it’ll be a dog if you try 1,000. This happens all the time. In fact, it just happened where I work. The simplest solution for the search functionality at hand was to put all users of a specific type into a select box. It worked great for months while we didn’t have many users of that type. But the data grew… and one day it became a major performance issue for a major customer of ours. Oops! Joel Spolsky writes about Schlemiel the Painter’s Algorithm, an excellent anecdote about algorithms that seem fast but can’t scale.
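A minimal sketch of the same trap in Python (hypothetical option-building code, not our actual application): naive string concatenation copies the whole growing string on every pass, so it feels instant at 10 items and falls over at thousands, while a single join stays linear.

```python
def build_options_naive(users):
    # Schlemiel-style: every concatenation copies the entire growing
    # string, so total work grows quadratically with the item count.
    html = ""
    for name in users:
        html = html + "<option>" + name + "</option>"
    return html

def build_options_linear(users):
    # Linear: collect the pieces, then join once at the end.
    return "".join("<option>%s</option>" % name for name in users)

users = ["user%d" % i for i in range(1000)]
assert build_options_naive(users) == build_options_linear(users)
```

Both versions look fine in QA with a handful of users; only the second one survives growth.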

Scalability doesn’t require good performance. It helps, but it’s not a prerequisite. Consider a page or process that is slow, say 10 seconds response time. If that page is consistently 10 seconds for every box you throw at it, then you can say it is scalable, if inefficient. You’ve achieved linear growth with a known response time for average performance. You have a known, measured metric in hand with which to estimate hardware requirements to support some future concurrent usage.
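That capacity math is simple enough to sketch. Assuming each worker thread on a box handles one request at a time (all numbers below are hypothetical), a measured response time gives you requests per second per box, and from there a box count for any target load:

```python
import math

def boxes_needed(target_rps, response_time_s, workers_per_box):
    # One worker sustains 1/response_time requests per second, so a
    # box with W workers sustains W/response_time requests per second.
    rps_per_box = workers_per_box / response_time_s
    return math.ceil(target_rps / rps_per_box)

# Hypothetical numbers: a 10-second page, 50 worker threads per box,
# and a target of 100 requests per second:
print(boxes_needed(100, 10.0, 50))  # -> 20 boxes
```

Slow but consistent beats fast but unpredictable here: the estimate only works because the response time is known.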

Scaling is also a measure of how your application performs when spread across dozens or even hundreds of nodes! Does your application use a “shared nothing” architecture? Are the nodes chatty, multicasting to the entire cluster? If one node talks to every other node in the cluster to perform some kind of update in real time, it won’t scale. You’ll eventually hit a node count where you spend more time keeping the other nodes in sync than doing real work. It’s the same reason most OSes don’t try to keep time much more accurately than about 10ms: below that, the system would spend more time keeping time than doing real work.
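You can see the tipping point with back-of-the-envelope arithmetic (the costs below are made-up illustrative numbers): if every update must be pushed to all the other nodes, per-update sync cost grows with the node count, and the share of time spent syncing instead of working climbs toward 100%.

```python
def sync_fraction(nodes, work_ms_per_update, sync_ms_per_peer):
    # Every update is broadcast to the other (nodes - 1) members, so
    # per-node sync cost grows linearly and cluster-wide message
    # count grows quadratically with the node count.
    sync_ms = sync_ms_per_peer * (nodes - 1)
    return sync_ms / (work_ms_per_update + sync_ms)

# With 5ms of real work per update and 0.5ms per peer notification,
# the sync share rises from ~9% at 2 nodes to ~91% at 100 nodes.
for n in (2, 10, 50, 100):
    print(n, round(sync_fraction(n, work_ms_per_update=5.0, sync_ms_per_peer=0.5), 2))
```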

I’ll close with an excellent visual of performance vs. scalability, given by John Engates, CTO of Rackspace. He recently gave a talk at a MySQL conference about scaling web applications, and I think these two pictures are worth more than their two thousand words:

PERFORMANCE
[photo: a Ferrari]

SCALABILITY
[photo: a multi-lane highway]

Saving database cycles

I once read a great quote:

Oracle may scale technologically, but it doesn’t scale financially

We run Oracle in our shop, and we’ve got a big database. Hundreds of tables. Millions and millions of rows of data. Extensive and detailed audit records of every database transaction. We recently deployed Oracle RAC to help our data layer scale, and it works! We can separate web traffic from internal operations. We can target specific RAC nodes with heavy hitting reporting queries while leaving other nodes available to crunch jobs.

There’s only one problem… new nodes cost a lot of $$$$$.

So, what do you do when you see your fancy clustered database running at 50% of capacity? Beg the CFO to budget for a few more nodes next year? What if you get hit with a sudden burst of web traffic or increased demand for reports or you have to crunch umpteen thousand jobs by next weekend?

Really, the question is, how do you scale your database simply, easily, and cheaply?

Some would say to dump Oracle for a cheaper database. Google famously runs MySQL for AdWords. If it’s good enough for a multi-billion dollar business, it’s probably good enough for my needs. While I might agree with that assessment, I can also say unequivocally that our business cannot migrate off of Oracle in the near term. That would require a rewrite of major pieces of our business.

So, the question still… what do you do when you see your fancy clustered database running at 50% of capacity?

I’ve got a theory. Oracle CPU cycles are expensive. CPU cycles on commodity x86 hardware running commodity software (Linux, HTTPD, and Tomcat) are cheap. If we can move cycles from Oracle to commodity boxen, we’ll use our database less, the average cost per CPU cycle will drop, and we can scale further for less money.

An Oracle RAC cluster is a chatty cluster. One node is kept in sync by constant updates from the other nodes. This is in contrast to the “Master/Slave” configuration of other databases where there is a lag between updates as a node waits for log file shipments.

Chatty cluster, eh? Real-time updates? Hmmm… I believe I can recreate this at the application level using commodity hardware and software. I believe we can hold lots and lots of data in memory in an application cluster, thereby reducing the number of queries we run against Oracle. I believe the application cluster can request data from another node in the cluster faster than it can get it from Oracle. After all, both are network calls, but the database call might also require disk IO, while the app cluster already has the data in memory.
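The lookup order I have in mind can be sketched in a few lines. This is a toy look-aside cache, not our architecture: `peers` and `db` are stand-ins for real network and database clients, and cache invalidation is deliberately ignored here.

```python
class TieredCache:
    """Illustrative tiered lookup: local memory first, then a peer
    node in the app cluster, then the database as a last resort."""

    def __init__(self, peers, db):
        self.local = {}      # in-process memory: cheapest possible hit
        self.peers = peers   # hypothetical peer-node lookup clients
        self.db = db         # hypothetical database accessor

    def get(self, key):
        if key in self.local:        # no network, no IO
            return self.local[key]
        for peer in self.peers:      # one network hop, data already in RAM
            value = peer.get(key)
            if value is not None:
                self.local[key] = value
                return value
        value = self.db.get(key)     # network hop plus possible disk IO
        self.local[key] = value      # lazy read: cache for next time
        return value
```

The concept really is that simple; keeping every node’s copy correct when the data changes is where the hard work lives.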

I also believe we can test this whole theory and let the numbers guide our decision. We can run benchmarks against Oracle by scaling the number of clients we have connected to it. We’ll know that X number of app servers brings a single RAC node to 100% capacity. If we test this with linear growth of app servers, then we can do some simple math. We need Y RAC nodes to handle X app servers.

And then we benchmark the commodity application cluster. In theory, Oracle should now be used for lazy reads and writes. Y should be smaller than in our previous test, and X should be able to grow to a very large number. The larger X is, the more memory you have in the cluster and the less you rely on lazy reads from Oracle, keeping Y small.
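The math behind the benchmark is deliberately simple (the saturation numbers below are hypothetical placeholders, not our measurements):

```python
import math

def rac_nodes_needed(app_servers, app_servers_per_node):
    # From the benchmark: one RAC node saturates at a measured number
    # of app servers; assume roughly linear growth from there.
    return math.ceil(app_servers / app_servers_per_node)

# Hypothetical: baseline load saturates a RAC node at 8 app servers;
# with the cache absorbing reads, one node might carry 40.
print(rac_nodes_needed(80, 8))    # -> 10 nodes without caching
print(rac_nodes_needed(80, 40))   # -> 2 nodes with caching
```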

That’s the theory. Ok, it’s not much of a theory. It’s also known as caching. In this case, it’s distributed caching. The concept is simple, but designing and deploying an elegant solution that scales is harder. Retrofitting a 5 year old organically grown enterprise application with a database that’s pushing 300GB in size to use distributed caching while remaining 100% backwards compatible is harder still. Caching is a good idea, but good ideas aren’t worth much by themselves. The plan to implement the idea and the skills required to deliver it are worth a lot more.

I will be laying out the architecture and benchmark results in future posts.


Horses for courses

Pick the right tool for the job.  It can save you a lot of time and effort.  This article explains how I replaced dozens of Java class files, an XSD, myriad library dependencies, a build script and a whole lot of complexity with 40 lines of Python.

A friend of mine recently bemoaned my use of the term “pick the right tool for the job.” Apparently, he thinks it’s a tired metaphor and a boring cliche. And so in deference to my friend, I will instead talk about choosing the right horse for the course.

Some horses are mudders, some are good at short-distance sprints, and still others are the ones you want for the long race. If you’re eyeing the race’s purse, you choose the right horse for the course.

So, how do you choose the right horse? I’m guessing it has something to do with experience, breadth of knowledge, and understanding when a functional program makes more sense than an imperative one or knowing when a script is better than a fully engineered OO implementation. Here’s an example to fan the Java vs. Python flames.

Not too long ago, in my previous life working for a consulting company, I wrote a small Java application to monitor our many web applications. The requirements were simple: each server page should return “OK” (text/plain); otherwise, the contents of the entire page would be mailed to a list of interested recipients. This technique allows a developer to put whatever test they want in their server page (database connectivity, unit tests, whatever), knowing that any exception they write to the page will be mailed to them.

Not hard, right? Easy implementation? It was both times I wrote it, but one of them was a much better horse for the course.

The configuration file in Java was XML, natch, which required an XSD. The XSD required JAXB (or Castor, XMLBeans, etc.) to generate bindings for the XML. The Main program was well factored in that each class did one thing well. As a result, I had a class to poll a site, one to load the configuration, a class to send email, etc. Between Main and the classes generated by Castor, I was up to a dozen .java files. Main, of course, required libraries (Castor, mail.jar, activation.jar, etc.), and those libraries required a script (.cmd in our case) to load them all onto the classpath. Oh, and let’s not forget about building with Ant. Add a build file to the heap.

It worked, but damn, that’s a lot of files and jars and complexity for a simple monitoring program!

I rewrote it in Python in less than 40 lines of code. Two files. There are more comment lines than executable code.

Choose the right horse for the course. Occam’s Razor says the 40 line solution is the right one, but you can decide for yourself.

Here’s the Python program and configuration file.
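For readers without the linked files, here’s a minimal sketch of the same idea, modernized to Python 3 (the URLs, addresses, and SMTP host are hypothetical placeholders; the real script read them from its configuration file):

```python
import smtplib
import urllib.request
from email.message import EmailMessage

# Hypothetical settings; the real script loaded these from a config file.
SMTP_HOST = "localhost"
SENDER = "monitor@example.com"
RECIPIENTS = ["ops@example.com"]

def needs_alert(body):
    # Healthy pages return exactly "OK"; anything else is treated as
    # an error report to be mailed out verbatim.
    return body.strip() != "OK"

def check(url):
    body = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "replace")
    if needs_alert(body):
        msg = EmailMessage()
        msg["Subject"] = "Monitor alert: " + url
        msg["From"] = SENDER
        msg["To"] = ", ".join(RECIPIENTS)
        msg.set_content(body)  # mail the whole page so the failing test is visible
        with smtplib.SMTP(SMTP_HOST) as smtp:
            smtp.send_message(msg)
```

Loop `check()` over the configured URLs from cron and you’ve got the whole monitor: no XSD, no binding generator, no classpath script, no build file.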

Many developers learn one or two mainstream languages and always run their favorite horse, irrespective of course. The best developers will be those that love learning new languages and techniques. I’m a better Java programmer today because of what I’ve learned from Python. I’m a better web developer today because of what I’ve learned from Ruby on Rails.

Learn more. Broaden your horizons. Try new things and new styles of development. Learn to pick the best horse for the course. Always use the right tool for the job.

