Archive for the 'Architecture' Category
20th Dec 2007
Performance != Scalability
Performance and scalability are not the same.
It is insufficient to run through a QA environment and say, “Everything seems fast enough!” You may load test the same QA environment and conclude that performance is good, so you’ll be able to scale. You may be wrong! Performance and scalability are two very different things. You can often get more scalability out of applications that perform well — they aren’t mutually exclusive — but you don’t actually need one to get the other.
Performance is often about algorithms, which may be fast but not scalable. Building a select box in a server page might be very, very fast when you only have 10 items to put in it, but it’ll be a dog if you try 1000. This happens all the time. In fact, it just happened where I work. The simplest solution for the search functionality at hand was to put all users of a specific type into a select box. It worked great for months when we didn’t have many users of that type. But the data grew… and one day it’s a major performance issue for a major customer of our’s. Oops! Joel Spolsky writes about Schlemiel that Painter’s Algorithm, which is an excellent anecdote about algorithms that seem fast, but can’t scale.
Scalability doesn’t require good performance. It helps, but it’s not a prerequisite. Consider a page or process that is slow, say 10 seconds response time. If that page is consistently 10 seconds for every box you throw at it, then you can say it is scalable, if inefficient. You’ve achieved linear growth with a known response time for average performance. You have a known, measured metric in hand with which to estimate hardware requirements to support some future concurrent usage.
Scaling is also a measurement of how your application performs when spread across dozens or even hundreds of nodes! Does your application use a “shared nothing” architecture? Are the nodes chatty, doing multicasts to the entire cluster? If one node talks to every other node in the cluster to perform some kind of update in real time, it won’t scale. You’ll eventually hit a number of nodes where you spend more time keeping the other nodes in sync than doing real work. This is the same reason most OSes can’t keep accurate time below 10ms. It would spend more time keeping time than doing real work.
I’ll close with an excellent visual of performance vs. scalability, given by John Engates, CTO of Rackspace. He recently gave a talk at a MySQL conference about scaling web applications, and I think these two picture are worth more than their two thousand words:
PERFORMANCE
SCALABILITY
Performance and scalability are not the same.
It is insufficient to run through a QA environment and say, “Everything seems fast enough!” You may load test the same QA environment and conclude that performance is good, so you’ll be able to scale. You may be wrong! Performance and scalability are two very different things. You can often get more scalability out of applications that perform well — they aren’t mutually exclusive — but you don’t actually need one to get the other.
Performance is often about algorithms, which may be fast but not scalable. Building a select box in a server page might be very, very fast when you only have 10 items to put in it, but it’ll be a dog if you try 1000. This happens all the time. In fact, it just happened where I work. The simplest solution for the search functionality at hand was to put all users of a specific type into a select box. It worked great for months when we didn’t have many users of that type. But the data grew… and one day it’s a major performance issue for a major customer of our’s. Oops! Joel Spolsky writes about Schlemiel that Painter’s Algorithm, which is an excellent anecdote about algorithms that seem fast, but can’t scale.
Scalability doesn’t require good performance. It helps, but it’s not a prerequisite. Consider a page or process that is slow, say 10 seconds response time. If that page is consistently 10 seconds for every box you throw at it, then you can say it is scalable, if inefficient. You’ve achieved linear growth with a known response time for average performance. You have a known, measured metric in hand with which to estimate hardware requirements to support some future concurrent usage.
Scaling is also a measurement of how your application performs when spread across dozens or even hundreds of nodes! Does your application use a “shared nothing” architecture? Are the nodes chatty, doing multicasts to the entire cluster? If one node talks to every other node in the cluster to perform some kind of update in real time, it won’t scale. You’ll eventually hit a number of nodes where you spend more time keeping the other nodes in sync than doing real work. This is the same reason most OSes can’t keep accurate time below 10ms. It would spend more time keeping time than doing real work.
I’ll close with an excellent visual of performance vs. scalability, given by John Engates, CTO of Rackspace. He recently gave a talk at a MySQL conference about scaling web applications, and I think these two picture are worth more than their two thousand words:
PERFORMANCE
SCALABILITY
Posted by Mark Turansky under
Architecture
1 Comment »
15th Dec 2007
Saving database cycles
I once read a great quote:
Oracle may scale technologically, but it doesn’t scale financially
We run Oracle in our shop, and we’ve got a big database. Hundreds of tables. Millions and millions of rows of data. Extensive and detailed audit records of every database transaction. We recently deployed Oracle RAC to help our data layer scale, and it works! We can separate web traffic from internal operations. We can target specific RAC nodes with heavy hitting reporting queries while leaving other nodes available to crunch jobs.
There’s only one problem… new nodes cost a lot of $$$$$.
So, what do you do when you see your fancy clustered database running at 50% of capacity? Beg the CFO to budget for a few more nodes next year? What if you get hit with a sudden burst of web traffic or increased demand for reports or you have to crunch umpteen thousand jobs by next weekend?
Really, the question is, how do you scale your database simply, easily, and cheaply?
Some would say to dump Oracle for a cheaper database. Google famously runs MySQL for AdWords. If it’s good enough for a multi-billion dollar business, it’s probably good enough for my needs. While I might agree with that assessment, I can also say unequivocally that our business cannot migrate off of Oracle in the near term. That would require a rewrite of major pieces of our business.
So, the question still… what do you do when you see your fancy clustered database running at 50% of capacity?
I’ve got a theory. Oracle CPU cycles are expensive. CPU cycles on commodity x86 hardware running commodity software (Linux, HTTPD, and Tomcat) are cheap. If we can move the cycles from Oracle to commodity boxen, we’d be using our database less, the average cost per CPU cycle drops, and we can scale further for less money.
An Oracle RAC cluster is a chatty cluster. One node is kept in sync by constant updates from the other nodes. This is in contrast to the “Master/Slave” configuration of other databases where there is a lag between updates as a node waits for log file shipments.
Chatty cluster, eh? Real time updates? Hmmmm….. I believe I can recreate this at the application level using commodity hardware and software. I believe that we can hold lots and lots and lots of data in memory in an application cluster, thereby reducing the number of queries we run against Oracle. I believe the application cluster can request data from another node in the cluster that’s faster then getting it from Oracle. After all, both are a network call, but one (the db) might also require IO, while the app cluster has it in memory.
I also believe we can test this whole theory and let the numbers guide our decision. We can run benchmarks against Oracle by scaling the number of clients we have connected to it. We’ll know that X number of app servers brings a single RAC node to 100% capacity. If we test this with linear growth of app servers, then we can do some simple math. We need Y RAC nodes to handle X app servers.
And then we benchmark the commodity application cluster. In theory, Oracle should now be used for lazy reads and writes. Y should be smaller than in our previous test, and X should be able to grow to a very large number. The larger X is, the more memory you have in the cluster and the less you rely on lazy reads from Oracle, keeping Y small.
That’s the theory. Ok, it’s not much of a theory. It’s also known as caching. In this case, it’s distributed caching. The concept is simple, but designing and deploying an elegant solution that scales is harder. Retrofitting a 5 year old organically grown enterprise application with a database that’s pushing 300gb in size to use distributed caching while remaining 100% backwards compatible is harder still. Caching is a good idea, but good ideas aren’t worth much by themselves. The plan to implement the idea and the skills required to deliver it are worth a lot more.
I will be laying out the architecture and benchmark results in future posts.
I once read a great quote:
Oracle may scale technologically, but it doesn’t scale financially
We run Oracle in our shop, and we’ve got a big database. Hundreds of tables. Millions and millions of rows of data. Extensive and detailed audit records of every database transaction. We recently deployed Oracle RAC to help our data layer scale, and it works! We can separate web traffic from internal operations. We can target specific RAC nodes with heavy hitting reporting queries while leaving other nodes available to crunch jobs.
There’s only one problem… new nodes cost a lot of $$$$$.
So, what do you do when you see your fancy clustered database running at 50% of capacity? Beg the CFO to budget for a few more nodes next year? What if you get hit with a sudden burst of web traffic or increased demand for reports or you have to crunch umpteen thousand jobs by next weekend?
Really, the question is, how do you scale your database simply, easily, and cheaply?
Some would say to dump Oracle for a cheaper database. Google famously runs MySQL for AdWords. If it’s good enough for a multi-billion dollar business, it’s probably good enough for my needs. While I might agree with that assessment, I can also say unequivocally that our business cannot migrate off of Oracle in the near term. That would require a rewrite of major pieces of our business.
So, the question still… what do you do when you see your fancy clustered database running at 50% of capacity?
I’ve got a theory. Oracle CPU cycles are expensive. CPU cycles on commodity x86 hardware running commodity software (Linux, HTTPD, and Tomcat) are cheap. If we can move the cycles from Oracle to commodity boxen, we’d be using our database less, the average cost per CPU cycle drops, and we can scale further for less money.
An Oracle RAC cluster is a chatty cluster. One node is kept in sync by constant updates from the other nodes. This is in contrast to the “Master/Slave” configuration of other databases where there is a lag between updates as a node waits for log file shipments.
Chatty cluster, eh? Real time updates? Hmmmm….. I believe I can recreate this at the application level using commodity hardware and software. I believe that we can hold lots and lots and lots of data in memory in an application cluster, thereby reducing the number of queries we run against Oracle. I believe the application cluster can request data from another node in the cluster that’s faster then getting it from Oracle. After all, both are a network call, but one (the db) might also require IO, while the app cluster has it in memory.
I also believe we can test this whole theory and let the numbers guide our decision. We can run benchmarks against Oracle by scaling the number of clients we have connected to it. We’ll know that X number of app servers brings a single RAC node to 100% capacity. If we test this with linear growth of app servers, then we can do some simple math. We need Y RAC nodes to handle X app servers.
And then we benchmark the commodity application cluster. In theory, Oracle should now be used for lazy reads and writes. Y should be smaller than in our previous test, and X should be able to grow to a very large number. The larger X is, the more memory you have in the cluster and the less you rely on lazy reads from Oracle, keeping Y small.
That’s the theory. Ok, it’s not much of a theory. It’s also known as caching. In this case, it’s distributed caching. The concept is simple, but designing and deploying an elegant solution that scales is harder. Retrofitting a 5 year old organically grown enterprise application with a database that’s pushing 300gb in size to use distributed caching while remaining 100% backwards compatible is harder still. Caching is a good idea, but good ideas aren’t worth much by themselves. The plan to implement the idea and the skills required to deliver it are worth a lot more.
I will be laying out the architecture and benchmark results in future posts.
Posted by Mark Turansky under
Architecture
1 Comment »


