Canaries in the coal mine

Application logging is only as useful as your plan to actually use the logs. Without a plan to mine the data, collect metrics, and plot graphs, your logs are useless. It’s snowcrash in a console window. It’s gigs of spam in a file.

This reminds me of the Philosophy and Zen of Unix:

Rule of Silence: When a program has nothing surprising to say, it should say nothing.

But how do you know your program is running? You’ve got several options available to you, all good, and you should probably implement them all.

Canaries in the coal mine

In the good ol’ days, miners had a crude but effective way to test a mine shaft for adequate levels of oxygen: if the bird died, miners got out of the shaft.

Your program needs a canary in the coal mine. You need a way to smoke test your application when it first boots up and while it’s running. It either works or it doesn’t. The bird is dead, dying, or singing.

What kind of canary? One that tests some discrete bit of functionality of your application. You can use a simple site monitoring program with basic tests baked into a server page. You can run a load testing tool like JMeter to script mimicking what an end user would do. Just run a test of 1. In the messaging application I’m currently building, we send periodic test messages to the queues. The messages aren’t fancy, just tiny XML messages posted from a Python client requesting 2+2.

But 2+2 is important. It’s like your first Hello, World! program in a new language. Getting 2+2 running means you’ve successfully setup your environment, you understand the basics of compilation, packaging, deployment, and configuration management. You’ve also got your first benchmark of how quickly a simple message can pass through your system.

You want to log the data from the canaries in the coal mine. Ping your canaries every ten minutes. Keep those results and metrics. Create a plan to report on them, which puts you on the path of Statistical Process Control.

Statistical Process Control

You can spend a decade trying to attain CMM Level 5 accreditation or gaining your black belt in Six Sigma and probably still never completely grok the enormity of Statistical Process Control. You can, however, start improving your technical operations by using meaningful statistics to smooth out your Configuration Management practices.

So what is Statistical Process Control? Quoting Wikipedia:

Statistical Process Control (SPC) is an effective method of monitoring a process through the use of control charts. Much of its power lies in the ability to monitor both process centre and its variation about that centre. By collecting data from samples at various points within the process, variations in the process that may affect the quality of the end product or service can be detected and corrected, thus reducing waste and as well as the likelihood that problems will be passed on to the customer. With its emphasis on early detection and prevention of problems, SPC has a distinct advantage over quality methods, such as inspection, that apply resources to detecting and correcting problems in the end product or service.

In addition to reducing waste, SPC can lead to a reduction in the time required to produce the product or service from end to end. This is partially due to a diminished likelihood that the final product will have to be reworked, but it may also result from using SPC data to identify bottlenecks, wait times, and other sources of delays within the process. Process cycle time reductions coupled with improvements in yield have made SPC a valuable tool from both a cost reduction and a customer satisfaction standpoint.

In layman’s terms, you aggregate the data from your canaries into a graph. You watch the graph every day to eventually find your “center”, the normal singing voice of your canary. Your data tells you he sings at X decibels when healthy, and your graphs show you when there’s not enough oxygen in the mineshaft.

This entire process should be automated! If it’s not, you won’t do it. Your Ops center and CM folks should have at least one box set aside for automation and monitoring. Maybe it’s your build box. Put all your scripts there. Create cron jobs or Windows scheduled tasks to constantly parse your log files for data. Use Log4J’s JMS or JDBC Appender if you don’t want to parse text files. Get all your data in one place, mine it, and graph it.

Test Driven Deployment

Page 1 of 2 | Next page