I was called on to provide a way of alerting from within Nagios that was more active and direct than the usual email or SMS messages. So I came up with a simple way to have a Nagios notification place a phone call to our off-hours tier 3 support line to report certain very rare but serious problems.
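To give a feel for how little glue this kind of thing can need, here is a minimal sketch of one possible approach. It assumes the call is placed by dropping a call file into an Asterisk outgoing spool directory, which may not match the setup described in this post. The channel name `SIP/tier3-oncall`, the sound prompt `nagios-critical-alert`, and both directory paths are hypothetical placeholders; the host, service, and state would come from Nagios macros such as `$HOSTNAME$`, `$SERVICEDESC$`, and `$SERVICESTATE$` passed as arguments by the notification command.

```python
#!/usr/bin/env python3
"""Sketch: turn a Nagios notification into an Asterisk call file.

Assumptions (not from the original post): Asterisk places the call,
and the channel, prompt, and directories below are placeholders.
"""
import os
import sys
import tempfile


def build_call_file(host, service, state,
                    channel="SIP/tier3-oncall",        # hypothetical channel
                    prompt="nagios-critical-alert"):   # hypothetical sound file
    """Return the text of a call file that plays a prompt when answered."""
    lines = [
        f"Channel: {channel}",
        "MaxRetries: 2",      # retry twice if nobody answers
        "RetryTime: 60",      # seconds between attempts
        "WaitTime: 30",       # seconds to let the phone ring
        "Application: Playback",
        f"Data: {prompt}",
        # Pass the alert details along as channel variables.
        f"Setvar: NAGIOS_HOST={host}",
        f"Setvar: NAGIOS_SERVICE={service}",
        f"Setvar: NAGIOS_STATE={state}",
    ]
    return "\n".join(lines) + "\n"


def queue_call(content, staging_dir, spool_dir):
    """Write the call file in a staging directory, then move it into the spool.

    Moving a finished file in keeps Asterisk from reading a half-written
    one. Both directories must be on the same filesystem for os.replace.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".call", dir=staging_dir)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    final_path = os.path.join(spool_dir, os.path.basename(tmp_path))
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    # Expected arguments: host, service description, service state.
    if len(sys.argv) != 4:
        sys.exit("usage: notify_by_phone.py HOST SERVICE STATE")
    call = build_call_file(*sys.argv[1:4])
    # Placeholder paths; adjust to the local Asterisk installation.
    queue_call(call, "/var/spool/asterisk/tmp", "/var/spool/asterisk/outgoing")
```

A Nagios notification command would then run the script with the three macros as arguments. Restricting that command to a handful of critical services keeps the phone from ringing for anything less than the rare, serious problems it is meant for.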

I’m often asked what it is I do for a living, and being lazy I usually just say ‘computer stuff’. In an effort to provide a little more context to anyone who may be interested, this is one in a series of postings where I’ll cover some aspect of what it is I do.

In my current role I spend part of my time on development projects (aka programming). I’m not a hard-core developer, though; it’s not my full-time occupation, nor do I want it to be. I work mostly with Perl and, when necessary, PHP, along with MySQL and occasionally PostgreSQL or Oracle, all under various flavors of Linux (Debian is my favorite). Usually these development tasks are related to some sort of management automation for a global VoIP network, but sometimes they involve making complex things easier to understand. Part of that involves automating the collection of large amounts of data and then presenting it in a meaningful way so that problems and long-term trends can be identified. What follows are some examples of the sorts of things I mean.

We have had a PowerDNS caching recursor deployed at one of our busiest sites for a few months now, and I thought others might benefit from some real-world performance info. This is running on some older hardware, a dual Xeon 2.8 GHz system with 4 GB of RAM, and the only job it’s doing is running this recursor. These three graphs tell the tale. The first shows that the system is handling peaks of about 3800 queries per second and that about 99% of those are being answered in a fraction of a millisecond. The second shows that cache hits are averaging about 70-75%, and the third shows that it’s doing this work while using at most one quarter of the CPU. Add to those impressive performance levels that I’ve had zero issues since putting it in production six months ago.
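As a rough back-of-the-envelope check on what those numbers imply, the short sketch below estimates how many queries per second miss the cache and how many fall outside the sub-millisecond bucket. It assumes the average hit rate also holds at peak load, which the graphs may or may not bear out; the function names are just for illustration.

```python
def cache_miss_qps(peak_qps, hit_ratio):
    """Queries per second that miss the cache and need upstream resolution."""
    return round(peak_qps * (1 - hit_ratio))


def slow_answer_qps(peak_qps, fast_fraction):
    """Queries per second answered slower than the fast (sub-millisecond) bucket."""
    return round(peak_qps * (1 - fast_fraction))


PEAK_QPS = 3800  # approximate peak from the first graph

# Cache hit rate averages roughly 70-75% (second graph).
for hit_ratio in (0.70, 0.75):
    misses = cache_miss_qps(PEAK_QPS, hit_ratio)
    print(f"hit ratio {hit_ratio:.0%}: about {misses} cache misses per second")

# About 99% of answers arrive in a fraction of a millisecond.
print(f"about {slow_answer_qps(PEAK_QPS, 0.99)} slower answers per second")
```

Under that assumption, somewhere around 950 to 1140 queries per second still miss the cache at peak, and only about 38 per second take longer than a fraction of a millisecond.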