Puppet performance tuning

At BrightRoll we use Puppet for configuration management on all of our systems. We’re also experiencing rapid growth, which means we’re adding more systems all the time, and our Puppet infrastructure is starting to run a little bit “warm.” I was asked to put a plan together to improve the capacity of our installation. I decided to think in terms of indefinite ongoing growth, so that as we reach certain milestones in size we already know what we’ll have to do to support the next phase.

After doing some investigation, it looked like the roadmap to capacity expansion should, at a high level, look like this:

  1. Look for easy performance gains from tuning our existing install
  2. Upgrade to the latest version of Puppet on a newer version of Ruby
  3. Consider bigger architectural changes and then roadmap them separately

For this post I’ll just be focusing on the first task, tuning our current install. Our existing Puppet system runs Puppet 2.7, on Ruby 1.8.7, on Ubuntu Precise. We have four puppetmaster servers behind an haproxy load balancer, just handling catalog and file requests. We have another system dedicated to Puppet CA and puppetdb work, but we won’t be tuning those for now. This setup is supporting over 1,000 nodes with 30 minute run intervals today, and the node count is increasing regularly.
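
For context, a setup like this boils down to a short haproxy stanza. The sketch below is illustrative rather than our production config (the names, addresses, and balancing algorithm are placeholders), but it shows the shape of it: agents hit the load balancer on port 8140, and SSL stays end-to-end since agents authenticate to the masters with client certificates, so it’s a plain TCP balance.

    # Illustrative haproxy config: agents connect to the load balancer on 8140,
    # SSL is left end-to-end (client-cert auth happens on the masters),
    # so we balance raw TCP. Hostnames and addresses are placeholders.
    listen puppetmasters
        bind *:8140
        mode tcp
        balance leastconn
        option tcplog
        server pm01 10.0.0.11:8140 check
        server pm02 10.0.0.12:8140 check
        server pm03 10.0.0.13:8140 check
        server pm04 10.0.0.14:8140 check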

The puppetmasters are EC2 memory-optimized instances with 60GB of RAM and eight 2.4GHz cores. Those four systems were running at a steady 60% CPU load, with spikes higher than that. We understood that upgrading our versions of Puppet and Ruby would confer automatic performance improvements, but the version jump from Puppet 2.7 to 3.7 in particular is not something you can just flip a switch on. We have to invest some time validating all our Puppet code on the new version before the change can be finalized.

I suspected that we might be hurting ourselves with bad Apache or Passenger tuning parameters, so I spent some time profiling the puppetmasters to see where their load was coming from. The metric I wanted to optimize was Puppet catalog compilation time. When an agent run starts, it first syncs plugins, generates its facts locally, and sends those to the master with a request for a catalog. The catalog is compiled on the master and transmitted to the agent, and the time this takes is logged by Puppet. Looking at /var/log/syslog I could see at a glance that most compilations were taking about 15 seconds, but I asked one of our Ops engineers to graph the numbers so I could get a more concrete picture. With that done, I found that compilations were taking a minimum of 5 seconds, an average of 15 seconds, and a maximum of 66 seconds. Yikes, 66 seconds to compile a catalog.
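
If you just want to eyeball the same numbers on a single master without building a graph, a rough pipeline like this works against the stock log line format (adjust the path and pattern to your syslog setup):

    # Pull catalog compile times out of syslog and print min/avg/max.
    # Assumes the stock "Compiled catalog for <node> ... in N.NN seconds" log line.
    grep 'Compiled catalog' /var/log/syslog \
      | grep -o 'in [0-9.]* seconds' \
      | awk '{ t = $2; sum += t; n++;
               if (min == "" || t < min) min = t;
               if (t > max) max = t }
             END { printf "min=%.2fs avg=%.2fs max=%.2fs n=%d\n", min, sum/n, max, n }'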

Next, I jumped on one of the puppetmasters to start figuring out where Puppet’s performance was bottlenecked. iostat confirmed a negligible amount of iowait, so we did not appear to be IO-bound. free and top indicated that we were not memory-bound either; although some of the puppetmasters were using most of their 60GB of RAM, that was just the Linux kernel’s page caching at work. There was no swap used at all.
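
Nothing fancy was needed for that first pass; these are roughly the checks involved:

    # Quick first-pass checks for IO and memory pressure on a puppetmaster.
    iostat -x 5 3    # %iowait and per-device utilization over a few samples
    free -m          # how much of the "used" memory is really just page cache
    swapon -s        # confirm nothing is actually sitting in swap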

Running passenger-memory-stats confirmed that Puppet itself was using about 12GB of RAM. With top running at the same time as a tail -f /var/log/syslog, it became apparent that when Puppet was doing catalog compiles it would peg a CPU core at 100%. I noticed something else interesting in the output of passenger-memory-stats: we had 28 puppetmasterd processes running on a system with only 8 cores. Furthermore, clearly some of them were leaking memory. Most were using 200-300MB, but a couple were using over 1000MB. At one point I had a screen session going with a few split windows, running top in one, tailing syslog in another, and running passenger-status and passenger-memory-stats via watch, just to sort of take the pulse of the system.
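
That screen session looked approximately like the following, one command per window (the watch intervals are arbitrary):

    # Roughly the "pulse of the system" screen layout, one command per window.
    top                                     # catalog compiles pegging a core at 100%
    tail -f /var/log/syslog | grep puppet   # compile times as they are logged
    watch -n 10 passenger-status            # active puppetmasterd processes and queued requests
    watch -n 10 passenger-memory-stats      # per-process memory use, to spot the leakers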

My next step was to look at our Passenger configuration. What stood out was that we had PassengerMaxPoolSize set to 32 and PassengerMaxRequests set to 10,000. So we have a CPU-bound application running on memory-optimized boxes, and we’re allowing up to four times as many processes to run as we have cores to dedicate to them. And those processes leak memory, so having an excess of them only makes matters worse.

I decided to try tuning these values downward to see what the effect on performance would be. I set PassengerMaxPoolSize to 8 and PassengerMaxRequests to 2,000 (so puppetmasterd processes will be recycled after they handle 2,000 requests). Here is the delightful before and after result in graph form:
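
In Apache configuration terms, the whole change is just two Passenger directives; here is a minimal sketch with the rest of the vhost omitted:

    # Before: PassengerMaxPoolSize 32, PassengerMaxRequests 10000
    # After: one worker per core, recycled often enough to keep the memory leaks in check
    PassengerMaxPoolSize 8
    PassengerMaxRequests 2000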

[Graph: Puppet 2.7 catalog compile times (min/average/max) before and after the Passenger tuning change]

The bottom group of lines is minimum compile time, the middle group is average, and the top is maximum. The results speak for themselves: min time is unchanged, average fell from 15 seconds to 10, and max time fell from peaks over 60 seconds to a fairly smooth 20s. Variation there is basically down to variation in coding practices within our Puppet manifests themselves. CPU load was similarly reduced from a very spiky 60% average to a much smoother 40%. Puppet memory usage was reduced from about 12GB down to just under 2GB.

So how did we get here? In our case, we had applied Passenger tuning values that we found in a conference talk slide about scaling Puppet, without considering whether the speaker’s environment was the same as ours, and without really checking to see what the effect of the change was. At that time we also had fewer nodes, so the change may not have hurt us at first.

But there is also a more general caution that I’d like to share. If you research Passenger tuning, you will find a number of good, detailed articles about it. What I realized as I read through them is that they’re all written from the perspective of optimizing throughput for Rails apps that serve humans. Puppet is a different beast: you can afford to have your agent runs wait a few seconds longer to get their catalog if it means your masters aren’t on fire. Throughput is not that critical.

The result of lowering the MaxPoolSize value is that when more than 8 concurrent requests come in, Passenger queues the excess and processes them in order. These requests include catalog compilations as well as requests for flat files and other related stuff being served by Puppet, so not every request is waiting for 10 seconds of catalog compile time. But the important thing to note is that the Puppet agent’s timeout for talking to the master defaults to 120 seconds. I’d sooner increase that timeout than sacrifice performance on my masters, if I had to choose. As Eric Sorenson of Puppet Labs told me, “As your number of agents grows you’re basically just trying to stay on top of a massive DDoS against yourself.”
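
For reference, the agent-side knob in this generation of Puppet is configtimeout in puppet.conf, which is where that 120-second default lives. If I ever did choose to raise it, the change would look something like this (300 is purely an illustrative value, not a recommendation):

    # /etc/puppet/puppet.conf on the agent; configtimeout is in seconds (default 120).
    # 300 here is just an example value.
    [agent]
        configtimeout = 300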

We learned some good lessons here. First, consider the source when learning from anecdotal “how I did it” accounts. Make sure you understand the similarities and differences between your environment and theirs (note how much specific information about my environment I included at the beginning of this post). Second, make sure you have metrics in place that will show you the effect that your changes have, but don’t assume that because it looks good now it’ll be good forever. Third, make sure you understand what you’re tuning and how it impacts the overall system and services.

Now that we’re in a better place with our existing install, I’m focused on getting us upgraded to Puppet 3.7 with Ruby 1.9.3 on CPU-optimized instances. That should relieve all our immediate capacity concerns. After that I’ll be researching options like placing puppetmasters in each of our datacenters, caching with Varnish, and other more esoteric changes to help us scale smoothly. I’ll be sure to blog about what I learn along the way!