At Rails Machine, we provide 24×7 monitoring as part of our Managed Hosting, and Scout plays a huge part in that. Today, I'll be talking about what we monitor with Scout and our philosophy of monitoring.
When talking about monitoring, I like to separate it into two discernible pieces: metrics and alerts. Metrics are the things that we are monitoring: response times, disk usage, memory, etc. Alerts are the peaks, trends, etc. in those metrics that we want to get notifications for.
Our general philosophy is to measure all the metrics, and alert on metrics that are actionable.
Measure all the metrics
To get started with measuring all the metrics, we use Scout's Cloud Monitoring to create a server template with plugins that make sense on any Linux server. Here are the plugins we are using, and the questions we use them to answer:
Server Overview (Disk usage, Memory usage, and Load)
- Are we running out of disk or memory? These plugins are installed by default.
- Is the IO being saturated?
- Are processes being blocked by IO?
- Is there CPU steal from the hypervisor?
- Are we hitting limits on open connections?
- Are we being swamped on HTTP, SSH, etc.?
- Are we using an excessive amount of bandwidth?
- Are we running out of inodes?
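To make two of these questions concrete, here is a minimal Python sketch of how disk and inode usage can be read on a Linux server. This is not Scout's plugin code (Scout plugins are written in Ruby); the function name `filesystem_usage` and the keys it returns are made up for illustration.

```python
import os

def filesystem_usage(path="/"):
    """Return the percentage of disk blocks and inodes in use for the
    filesystem that contains `path` (hypothetical helper, Unix only)."""
    st = os.statvfs(path)
    # f_blocks / f_files are totals; f_bfree / f_ffree are free counts.
    blocks_used = st.f_blocks - st.f_bfree
    inodes_used = st.f_files - st.f_ffree

    def pct(used, total):
        # Some filesystems report 0 total inodes; avoid dividing by zero.
        return round(100.0 * used / total, 1) if total else 0.0

    return {
        "disk_used_pct": pct(blocks_used, st.f_blocks),
        "inodes_used_pct": pct(inodes_used, st.f_files),
    }

if __name__ == "__main__":
    print(filesystem_usage("/"))
```

Note that this counts root-reserved blocks as free space, so the disk figure can read a little lower than what `df` shows.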
We also include a selection of plugins that make sense for servers running Rails through Passenger. We disable these so they don't alert if it's a non-Rails server, and then review and enable or remove the plugins after a new server is created from the template.
- How much memory is Passenger using?
- Is there an increase in traffic?
- Is there an increase in response time?
- Are there idle workers to handle new requests?
Armed with these plugins, we have enough metrics to investigate nearly any type of outage.
Alert on metrics that are actionable
Now we can start thinking about which ones are important enough to pull someone out of bed to check out. We value our beauty sleep, so we are very selective in what we alert on.
The main considerations for alerts are that:
- something has gone wrong (or is about to)
- there is a clear path to resolution
Here are the things we alert on, from our default set of plugins:
Server Overview's Disk Capacity
- many processes flip out when there's no disk space to write logs (Apache, Rails, MySQL)
Server Overview's % Memory Used (%)
- if you run out of memory, you're going to have a bad time
Disk Inode Usage's Inodes Used (%)
- similar to Disk Capacity, most processes will flip out if they can't create new files
When configuring these triggers, put some thought into what values to trigger on. You want the threshold to be low enough that there's time to do something about it, but high enough that it's not noisy. For the three I described, we default to 90%, but adjust it depending on the server's capacity and its general usage.
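As a rough illustration of that default-plus-override idea, here is a small Python sketch. It is not how Scout evaluates triggers internally; the function `breached`, the metric names, and the choice of "at or above" as the comparison are all assumptions made for the example.

```python
DEFAULT_THRESHOLD_PCT = 90.0  # the default mentioned above

def breached(metrics, thresholds=None, default=DEFAULT_THRESHOLD_PCT):
    """Return the sorted names of metrics at or above their threshold.

    `metrics` maps a metric name to a percentage; `thresholds` holds
    optional per-metric overrides (hypothetical structure).
    """
    thresholds = thresholds or {}
    return sorted(
        name
        for name, value in metrics.items()
        if value >= thresholds.get(name, default)
    )

if __name__ == "__main__":
    sample = {"disk_capacity": 93.0, "memory_used": 71.5, "inodes_used": 90.0}
    print(breached(sample))                           # ['disk_capacity', 'inodes_used']
    # A server with a very large disk might tolerate a higher threshold.
    print(breached(sample, {"disk_capacity": 95.0}))  # ['inodes_used']
```

Keeping the override per server and per metric is what lets a roomy box stay quiet at 93% while a small one still pages at 90%.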
So, we have some reasonable alerts now. The expectation is that if you receive one of these, you drop what you are doing to work on it until resolution.
For memory, that means identifying what processes are consuming the most and killing or restarting them to free memory. For disk and inode, that means identifying what files and directories are the biggest culprits, and working to remove them to prevent the disk from filling up.
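For the disk case, the hunt for culprits can be sketched in a few lines of Python. This is only an illustration of the idea, not a tool we ship; `largest_dirs` is a hypothetical helper, and it ranks directories by the bytes of the files directly inside them.

```python
import os

def largest_dirs(root, top=5):
    """Return up to `top` (bytes, path) tuples for the directories under
    `root` holding the most file bytes directly inside them."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        total = 0
        for name in filenames:
            try:
                # lstat so a symlink is counted as itself, not its target
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
        sizes.append((total, dirpath))
    return sorted(sizes, reverse=True)[:top]

if __name__ == "__main__":
    for size, path in largest_dirs("/var/log"):
        print(f"{size:>12}  {path}")
```

For inode exhaustion, the same walk works if you count `len(filenames)` per directory instead of summing sizes.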
As an anti-example of a trigger, check out the Ruby on Rails plugin's default trigger. It alerts on the request rate going up 200% from the previous week. While this is probably noteworthy, it doesn't really mean there's a problem. If, on the other hand, you were monitoring your application's URL, and it became unresponsive, then it's useful to have metrics for the requests/second to determine if that's the cause.
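A quick Python sketch shows why a relative trigger like this can be noisy. It assumes "going up 200%" means the rate has tripled, and the function names are invented for the example; the real plugin's logic may differ.

```python
def percent_change(current, previous):
    """Percentage change from `previous` to `current`."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return 100.0 * (current - previous) / previous

def rate_trigger_fires(current_rps, last_week_rps, threshold_pct=200.0):
    """True when the request rate rose by at least `threshold_pct` percent
    compared with the same point last week (assumed semantics)."""
    return percent_change(current_rps, last_week_rps) >= threshold_pct

if __name__ == "__main__":
    # A quiet site going from 0.5 to 1.5 requests/second trips the trigger...
    print(rate_trigger_fires(1.5, 0.5))    # True
    # ...while a busy site absorbing 50 extra requests/second does not.
    print(rate_trigger_fires(150.0, 100.0))  # False
```

Neither result says anything about whether the server is actually in trouble, which is why this metric is better kept for diagnosis than for paging.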
Why not use Nagios / Pingdom / New Relic / Collectd / God / INSERT LATEST
We like to think of our monitoring stack as a tool belt rather than a single golden hammer. So, we actually use all of these things.
To me, the advantages of Scout are that:
- It's really easy to manage and update plugins across a large number of servers
- There is a large directory of ready-to-go plugins
- If you can write Ruby, you can write a plugin
We do sometimes run into limitations when using only Scout, but we're able to supplement it with other tools:
Plugins execute from the servers they are on, meaning things like URL Monitoring aren't useful for determining your own uptime, but they are useful for monitoring the availability of things you depend on from your server. We use a combination of Pingdom and Nagios for monitoring site availability.
Servers do at some point have to talk to Scout across the interwebs, which can be problematic for some network topologies. We use Ganglia and other tools to collect similar information on our internal network to avoid having to talk to the outside world.
Many metrics are relatively low level, like % IO Utilization, Apache response time, number of open connections, etc. This is usually a strength when debugging something specific, but it doesn't give a high-level view of where an application is spending its time. We use New Relic to provide more insight into where requests spend their time, augmented by details in Scout to find the specific cause for why time is spent in a particular spot.
Measure all the metrics and alert on metrics that are actionable. And that's all I have to say about that (for now). Other questions about how we use Scout, or monitoring in general, at Rails Machine? Let me know in the comments! *Thanks to Derek for the assist.