Puppet performance tuning

At BrightRoll we use Puppet for configuration management on all of our systems. We’re also experiencing rapid growth, which means we’re adding more systems all the time, and our Puppet infrastructure is starting to run a little bit “warm.” I was asked to put a plan together to improve the capacity of our installation. I decided to think in terms of indefinite ongoing growth, so that as we reach certain milestones in size we already know what we’ll have to do to support the next phase.

After doing some investigation, it looked like the roadmap to capacity expansion should, at a high level, look like this:

  1. Look for easy performance gains from tuning our existing install
  2. Upgrade to the latest version of Puppet on a newer version of Ruby
  3. Consider bigger architectural changes and then roadmap them separately

For this post I’ll just be focusing on the first task, tuning our current install. Our existing Puppet system runs Puppet 2.7, on Ruby 1.8.7, on Ubuntu Precise. We have four puppetmaster servers behind an HAProxy load balancer, just handling catalog and file requests. We have another system dedicated to Puppet CA and PuppetDB work, but we won’t be tuning those for now. This setup is supporting over 1,000 nodes with 30-minute run intervals today, and the node count is increasing regularly.

The puppetmasters are EC2 memory-optimized instances with 60GB of RAM and eight 2.4GHz cores. Those four systems were running at a steady 60% CPU load with spikes higher than that. Upgrading Puppet and Ruby would confer automatic performance improvements, but the version jump from 2.7 to 3.7 in particular is not something you can just flip a switch on. We have to invest some time to validate all our Puppet code on the new version before the change can be finalized.

I was suspicious that we might just be hurting ourselves with bad Apache or Passenger tuning parameters, so I spent some time profiling the puppetmasters to see where their load was coming from. The metric I wanted to optimize was Puppet catalog compilation time. When an agent run starts, it first syncs plugins, generates its facts locally, and sends those to the master with a request for a catalog. A catalog is compiled on the master and transmitted to the agent, and the time this takes is logged by Puppet. Looking at /var/log/syslog I could see at a glance that most compilations were taking about 15 seconds, but I asked one of our Ops engineers to create a graph of that so I could get a more concrete picture. With that done, I found that compilations were taking a minimum of 5 seconds, an average of 15s, and a maximum of 66s. Yikes, 66 seconds to compile a catalog.
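If you don’t have graphing handy, you can pull rough numbers straight out of syslog. Here’s a sketch; the sample file stands in for /var/log/syslog, and the log lines mimic the format the master emits (hostnames are made up):

```shell
# Compute the average catalog compile time from master logs. The sample
# file stands in for /var/log/syslog; the "Compiled catalog for ... in
# N.NN seconds" format matches what the masters log.
cat <<'EOF' > /tmp/puppet_sample.log
Nov  1 12:00:01 pm1 puppet-master[999]: Compiled catalog for web01.example.com in environment production in 5.12 seconds
Nov  1 12:00:31 pm1 puppet-master[999]: Compiled catalog for web02.example.com in environment production in 15.40 seconds
Nov  1 12:01:02 pm1 puppet-master[999]: Compiled catalog for db01.example.com in environment production in 66.02 seconds
EOF

grep 'Compiled catalog' /tmp/puppet_sample.log \
  | sed -E 's/.*in ([0-9.]+) seconds.*/\1/' \
  | awk '{ t += $1; n++ } END { printf "avg=%.1f over %d compiles\n", t/n, n }'
# -> avg=28.8 over 3 compiles
```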

Next, I jumped on one of the puppetmasters to start trying to figure out where Puppet’s performance was bottlenecked. iostat confirmed that there was a negligible amount of iowait, so we did not appear to be IO-bound. free and top indicated that we were not memory-bound; although some of the PMs were using most of their 60GB of RAM, this was just down to the Linux kernel’s caching memory management policy. There was no swap used at all.

Running passenger-memory-stats confirmed that Puppet itself was using about 12GB of RAM. With top running at the same time as a tail -f /var/log/syslog, it became apparent that when Puppet was doing catalog compiles it would peg a CPU core at 100%. I noticed something else interesting in the output of passenger-memory-stats: we had 28 puppetmasterd processes running on a system with only 8 cores. Furthermore, clearly some of them were leaking memory. Most were using 200-300MB, but a couple were using over 1000MB. At one point I had a screen session going with a few split windows, running top in one, tailing syslog in another, and running passenger-status and passenger-memory-stats via watch, just to sort of take the pulse of the system.

My next step was to look at our Passenger configuration. The things that interested me there were that we had PassengerMaxPoolSize set to 32 and PassengerMaxRequests set to 10,000. So we have a CPU-bound application running on memory-optimized boxes, and we’re allowing up to four times as many processes to run as we have cores to dedicate to them. And those processes leak memory, which just makes it worse that we have an excess of them.

I decided to try tuning these values downward to see what the effect on performance would be. I set PassengerMaxPoolSize to 8 and PassengerMaxRequests to 2,000 (so puppetmasterd processes will be recycled after they handle 2,000 requests). Here is the delightful before and after result in graph form:


The bottom group of lines is minimum compile time, the middle group is average, and the top is maximum. The results speak for themselves: min time is unchanged, average fell from 15 seconds to 10, and max time fell from peaks over 60 seconds to a fairly smooth 20s. Variation there is basically down to variation in coding practices within our Puppet manifests themselves. CPU load was similarly reduced from a very spiky 60% average to a much smoother 40%. Puppet memory usage was reduced from about 12GB down to just under 2GB.
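For reference, the change amounts to two directives in the puppetmaster vhost’s Passenger configuration (a sketch; the rest of the Apache config is omitted):

```apache
# Catalog compilation is CPU-bound, so cap the pool at the core count
# instead of over-subscribing it 4x:
PassengerMaxPoolSize 8

# Recycle each puppetmasterd process after 2,000 requests so the
# memory leaks never get far:
PassengerMaxRequests 2000
```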

So how did we get here? In our case, we had applied Passenger tuning values that we found in a conference talk slide about scaling Puppet, without considering whether the speaker’s environment was the same as ours, and without really checking to see what the effect of the change was. At that time we also had fewer nodes, so the change may not have hurt us at first.

But there is also a more general caution that I’d like to share. If you research Passenger tuning, you will find a number of good, detailed articles about it. But what I realized as I read through them is, they’re all written from the perspective of optimizing throughput in Rails apps that are interfacing with humans. Puppet is a different beast; you can afford to have your agent runs wait a few seconds longer to get their catalog if it means your masters aren’t on fire. Throughput is not that critical.

The result of lowering the MaxPoolSize value is that when more than 8 concurrent requests come in, Passenger queues the excess and processes them in order. These requests include catalog compilations as well as requests for flat files and other related stuff being served by Puppet, so not every request is waiting for 10 seconds of catalog compile time. But the important thing to note is that the Puppet agent’s timeout for talking to the master defaults to 120 seconds. I’d sooner increase that timeout than sacrifice performance on my masters, if I had to choose. As Eric Sorenson of Puppet Labs told me, “As your number of agents grows you’re basically just trying to stay on top of a massive DDoS against yourself.”
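If you did choose to raise that agent-side timeout, it’s a single setting in puppet.conf in the 2.x/3.x series (240 here is just an example value; 120 seconds is the default):

```ini
# /etc/puppet/puppet.conf on the agent
[agent]
    # Seconds the agent will wait on the master before giving up
    # (default: 120).
    configtimeout = 240
```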

We learned some good lessons here. First, consider the source when learning from anecdotal “how I did it” accounts. Make sure you understand the similarities and differences between your environment and theirs (note how much specific information about my environment I included at the beginning of this post). Second, make sure you have metrics in place that will show you the effect that your changes have, but don’t assume that because it looks good now it’ll be good forever. Third, make sure you understand what you’re tuning and how it impacts the overall system and services.

Now that we’re in a better place with our existing install, I’m focused on getting us upgraded to Puppet 3.7 with Ruby 1.9.3 on CPU-optimized instances. That should relieve all our immediate capacity concerns. After that I’ll be researching possibly geo-locating Puppet in each of our datacenters, Varnish caching, and other more esoteric changes to help us scale smoothly. I’ll be sure to blog about my learnings along the way!

Useful Jenkins plugins – Job Config History

The Job Config History plugin is another Jenkins plugin that is vitally important to me.

  • It tracks every change to job configurations
  • It shows you in a job’s build history if there was a config change before a given run
  • It tells you who made the change (if you have authentication configured – you DO, right?)
  • It can show you a side by side diff of any two change points
  • It will let you revert a job config back to any previous point in time
  • It can also track changes to the master Jenkins config, but this is disabled by default

If you’ve been using Jenkins seriously for any length of time and you haven’t heard of Job Config History, I hope you’re already heading off to install it. It has saved my skin more times than I can count. Without it, unless your configuration is really locked down, anyone can make changes silently and they aren’t tracked at all. It leads to a fair amount of, “WTF, this was working…” moments.

There isn’t a lot more that I can say about the utility of this plugin, but I’ll show some example screenshots. First, here’s what the Build History list looks like when there have been config changes. The little wrench icon indicates builds that happened after a change.

[Screenshot: Build History list, with wrench icons marking builds that followed a config change]

If you click on one of the wrench icons you can see a side by side diff of what was changed. It also gives you a button to revert to the previous version.

[Screenshot: side-by-side diff of a config change, with a button to revert to the previous version]

You’ll also notice that there is a Job Config History link above the Build History. Clicking this shows you a view of all known changes to the job. Obviously only changes made since the plugin was installed are shown.

[Screenshot: the Job Config History view, listing all known changes to the job]


Backing up your Jenkins configuration

After your team has been using Jenkins for a while you’ll want to start thinking about backing up your configuration regularly. It’s the first step toward making your Jenkins installation “highly available” (quotes intentional). One day I stopped and asked myself, “How painful would it be if I had to re-create all of our configuration data from scratch?” The answer was that it would probably cost us several days of reduced productivity, not a risk worth taking given the very rapid pace of development and deployment we’re trying to sustain. I looked at a few different backup solutions, and eventually ended up just extending a generic one that we already use in-house.

Before I get into the different ways you can do this, I highly recommend that you take a look at this white paper that Jenkins creator Kohsuke Kawaguchi wrote, 7 Ways to Optimize Jenkins [pdf]. One section is about backups, and gives a lot of good advice. This is what I used to determine which things were most important to back up.

There are a few Jenkins plugins which will implement backups for you. The Backup plugin is no longer maintained and hasn’t had a release since 2011. It also backs up everything in your JENKINS_HOME directory, which is unnecessary and will quickly become a massive amount of data. Not recommended.

The SCM Sync Configuration plugin will write your Jenkins and individual job configurations to an SCM system. It currently only supports Git and Subversion, and every time you change a config and save it the plugin will interrupt you to ask for a comment to go with the SCM commit. I would find that annoying because I make configuration changes regularly, and I already use the Job Config History plugin to track, diff, and revert changes when necessary with no SCM required.

The thinBackup plugin only backs up essential configuration and has a bunch of good configuration options so you can tune it to your needs (back up plugins too, for example). It can also restore your backup for you. This is probably the best of the plugin options, but it can only store the backup locally on your Jenkins master. You would still have to script a separate process to move the backups off to another system for safe keeping. That’s what keeps me from using it.

At BrightRoll we are heavy users of Amazon AWS, and we have a skeleton BASH script (!) that can be run from cron to back up data to S3, typically as a tar file. My Jenkins version of that runs this command to generate the tar file:

nice -n 19 tar --exclude-from jenkins_backup_exclusions -zcf $DUMP $JENKINS_HOME

Not rocket science, that. What I really need to show you is the contents of jenkins_backup_exclusions, because that’s what tells tar what to leave out of the backup. We used to back up our entire JENKINS_HOME, and then I needed to restore it and found out the backup was 600MB and was going to take a while to pull out of S3. Now they’re 138MB, which is not bad for 114 jobs. Here’s what we exclude:
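Roughly, it’s one tar exclusion pattern per line. The exact patterns below are my reconstruction from the directories discussed in this post (the plugins entries keep the .jpi files while dropping the unpacked contents and saved previous versions, per the update at the end); your own list will vary:

```text
config-history
jobs/*/workspace
jobs/*/builds/*/archive
plugins/*/
plugins/*.bak
war
cache
```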


All of those paths are relative to JENKINS_HOME. config-history contains all the configuration history managed by the Job Config History plugin that I mentioned earlier. No need to back that up. You don’t need to back up any job’s workspace, most likely, nor any archived build artifacts. You can always download your plugins again, but keeping these would probably make complete restoration faster. They bloat the backup by 224MB in my case though. [See update below.] The war and cache directories are 69MB and 79MB, respectively, in my case. Also not necessary for a restoration. You’ll want to look at the contents of your own JENKINS_HOME and possibly add other things to the exclusions list, depending on what plugins you use and so on.

So our BASH script just runs in crontab every night, creates this tar.gz file and uploads it to S3 using Amazon’s standard Linux command line tools. The storage costs there are low enough that we can justify keeping backups indefinitely, so I can restore from any point in time. This has come in handy on a few occasions where a job configuration was broken and no one realized it for a long time.
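The cron job boils down to something like the sketch below. The paths and the demo fixture are illustrative, and the S3 upload is shown as a comment because we use our in-house wrapper around Amazon’s tools for that step:

```shell
#!/bin/bash
# Sketch of a nightly Jenkins backup job. The demo fixture exists only
# so this runs end to end; in production JENKINS_HOME is the real thing.
set -e

JENKINS_HOME=/tmp/demo_jenkins_home
EXCLUSIONS=/tmp/jenkins_backup_exclusions
DUMP=/tmp/jenkins_backup.tar.gz

# Demo fixture: one job config to keep, one directory to exclude.
mkdir -p "$JENKINS_HOME/jobs/example" "$JENKINS_HOME/config-history"
echo '<project/>' > "$JENKINS_HOME/jobs/example/config.xml"
echo 'old config' > "$JENKINS_HOME/config-history/2013.xml"
printf 'config-history\n' > "$EXCLUSIONS"

# Low-priority tarball creation, honoring the exclusions list:
nice -n 19 tar --exclude-from "$EXCLUSIONS" -zcf "$DUMP" \
    -C "$(dirname "$JENKINS_HOME")" "$(basename "$JENKINS_HOME")"

# In production, ship it off-box, e.g.:
#   aws s3 cp "$DUMP" "s3://your-backup-bucket/jenkins/$(date +%F).tar.gz"

tar -tzf "$DUMP"
```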

So that’s it! Now you have no excuse to not perform backups of your Jenkins configs. Get to work! Let me know in the comments if you have any questions.

[Updated 11/18/2013]
I didn’t look closely enough at what I’m doing with the backup exclusions list, and it has been a while since I set this up. In the case of the plugins directory, I’m only excluding the uncompressed contents of each plugin and any previous plugin versions which Jenkins keeps (so you can roll back if you find a bug). The actual plugin .jpi files are backed up, and that’s all you need to do a restoration.


Useful Jenkins plugins – Conditional BuildStep

The Conditional BuildStep plugin fills a need that I’ve felt in Jenkins for a long time: in the core Jenkins, there is no easy way to do conditional (if/then) logic. For example, in my installation I create parameterized builds, where one parameter is a dropdown list of environments to deploy to if the build succeeds. Deployment is then handled by a different Jenkins job.

Before Conditional BuildStep came along I tried some other workarounds. At first I used a shell script build step to evaluate the $DEPLOY_ENV environment variable. If it wasn’t the default value of “none,” I would use curl to trigger the downstream job via HTTP and pass it the appropriate parameters in the URL. There are two problems with this solution: first, since you’re triggering the deploy from the “build” phase of your first job, it happens before, and with no knowledge of, any of the post-build actions. Second, with this setup the first build job has no official connection to the deploy job, as far as Jenkins is concerned, so you can’t do things like make the build job wait and fail if the deploy fails.
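That old curl workaround, reconstructed as an illustration (the Jenkins URL, job name, and parameter names are hypothetical), looked something like this as a shell build step:

```shell
# Old workaround: trigger the deploy job over HTTP from a build step.
# DEPLOY_ENV comes from the job's dropdown parameter; "none" is the
# default, meaning no deploy was requested.
DEPLOY_ENV=${DEPLOY_ENV:-none}

if [ "$DEPLOY_ENV" != "none" ]; then
  # Jenkins' remote-trigger endpoint for parameterized jobs:
  curl -fsS -X POST \
    "http://jenkins.example.com:8080/job/util_deploy/buildWithParameters?DEPLOY_ENV=${DEPLOY_ENV}"
else
  echo "No deploy requested; skipping trigger."
fi
```

Note that this fires during the build phase, before any post-build actions, which is exactly the first problem described above.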

Another approach that I used for a longer time was to use the Promoted Builds plugin in all my main build jobs. I created a sort of “shim” job that would evaluate $DEPLOY_ENV and fail if it was “none.” In my main build I set up Promoted Builds to trigger the shim job with the params from the main one, and if the shim was successful then trigger the deploy job. At least in this way the deploy happened after all the post-build steps in the main job, but I still had the problem of not being able to make the main job wait for the deploy to finish.

Conditional BuildStep was the answer to my prayers, and I’m now using it in two places in my builds. The first is in the build phase, to determine if we should create a package. Then, in the post-build phase, we use it to trigger the deploy job if appropriate. The configuration is not always that intuitive; here is a screenshot.

In this example, our build has a choice/dropdown menu parameter called DEPLOY_ENV where the user selects which environment they want to deploy to, if any. If they have chosen something then the first thing we need to do (after compiling and running tests, and assuming all that passes) is build a package.


Once a package has been built, in the post-build phase we want to trigger a deploy.

Due to a truly pernicious confluence of bugs that all interact with each other, there is no clean way to add a conditional build step as a post-build action. The resolution of JENKINS-14494 will help with that. I also use matrix/multi-configuration builds everywhere, which seem to be sort of the red-headed stepchild of Jenkins job types. Many plugins at best simply don’t support matrix jobs, and at worst blow up when you try to use them in that context. Our solution is to use the PostBuildScript plugin, which lets you use build steps in a post-build context, in matrix jobs. It’s a plugin to call another plugin. Gross, but it works.

I will talk about matrix jobs more in an upcoming post.


Useful Jenkins plugins – Parameterized Trigger

This will be the first in a series of posts about Jenkins plugins that I like and use heavily. A big part of the value in Jenkins is the great collection of plugins that exist for it, though you do have to watch out for ones that are buggy or have been abandoned.

The Parameterized Trigger plugin lets you trigger other builds and pass parameters to them which can be used in their configurations. This assumes that the downstream jobs are configured with parameters. You can do this as a build step or a post-build step. In our case, we use it in the post-build process of all our build jobs, to optionally deploy the built code to one of several test environments.

Our build jobs all have parameters of git branch name (defaults to master) and deploy environment (defaults to none). If you choose an environment when you build, it causes the job to create a package if the build (compilation and tests) is successful. During the package step, we write (echo from a shell) a Java properties file, which is just a text file of key/value pairs, one per line, in the form ‘key=value.’ We store the keys package_name, version_number, and deploy_environment in that. In the post-build phase, we add a ‘Trigger parameterized build on other projects’ step. It calls the ‘util_deploy’ job, only when the current build is stable, and sends the build properties file that I created earlier.
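The properties file really is just echo statements in the package build step. A sketch, with illustrative values (in the real job, the version number comes from our build tooling):

```shell
# Write a Java properties file (key=value, one per line) for the
# downstream deploy job. BUILD_NUMBER is provided by Jenkins; it
# defaults to 0 here so the sketch runs standalone.
PROPS=/tmp/build.properties
echo "package_name=myapp"                     >  "$PROPS"
echo "version_number=1.2.${BUILD_NUMBER:-0}"  >> "$PROPS"
echo "deploy_environment=${DEPLOY_ENV:-none}" >> "$PROPS"
cat "$PROPS"
```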


[Screenshot: the ‘Trigger parameterized build on other projects’ post-build step, configured to call util_deploy]

The util_deploy job has parameters of package_name, version_number and deploy_environment. It runs a shell script which takes those variables and kicks off a deploy as requested. The console output of the parent job will include a link to the deploy job, so users can quickly jump there and look at the log if something went wrong.

Parameterized Trigger is, by my informal estimate, one of the most widely-used Jenkins plugins. It was downloaded over 10,000 times in April of this year alone. It gives you a much more powerful way to chain Jenkins jobs together than the built-in ‘Build other projects’ step, because once you can actually pass data around between them you can do almost anything you want. You don’t have to create a ‘properties’ file, either. You can just choose to pass all the parameters/values your parent job has, the Subversion revision or Git commit that was used in the build, or arbitrary parameters that you define.


Why I love Jenkins

This will be the first of many posts about Jenkins, I’m sure. I’ve been using it for about four years, and it is an invaluable part of my team’s workflow. More than just a build system, I think of Jenkins as a really flexible tool for performing all kinds of on-demand jobs. We use it as a frontend for doing deploys, launching virtualized test environments, and managing integration test infrastructure. Herewith, a few of the reasons why I love Jenkins.

It’s easy to deploy
Getting started with Jenkins involves downloading the .war file from http://jenkins-ci.org and running ‘java -jar jenkins.war’ on the command line. In a minute or so you have a fully-functioning build system that you can access via HTTP on port 8080.

It’s easy to scale
Jenkins has a concept of slave nodes that can be used just for distributing the build workload, or for performing specialized tasks on dedicated systems. Setting one up is trivial; in the easiest scenario, make sure that the user Jenkins is running as has passwordless ssh access to the slave node, then add it through the Jenkins UI. Jenkins will ssh in, push over its own JRE if needed, then push over the slave agent (a .jar file), start it up and connect to it. Presto, you have a slave build node in about 15 seconds.

By default, any job will run on any available slave. The alternate configuration, which we use, is to reserve slaves for “tied jobs.” In this way you can target specific builds to specific slaves or groups of slaves. We need to build on a matrix of Debian Lenny, Ubuntu Oneiric, and Ubuntu Precise, with both 32 and 64 bit for Lenny. We have four instances in total, one master and three slaves. Each slave is tagged for its distro and architecture combination, and then all of our builds are “matrix builds” in Jenkins parlance. This means that some basic job setup tasks happen on the master, and then the meat of the build happens in parallel on each slave that is needed. The slaves are specified by tag in the build, so that in the future if we outgrow any particular one, we can spin up additional identical slaves, give them the same tag, and Jenkins will automatically use them where appropriate.

It’s highly extensible
Jenkins’s biggest strength, in my opinion, is its fantastic collection of plugins. The base installation can access a few popular SCMs out of the box, build using Maven or just an arbitrary shell script, send emails about build failures, and do most other common build tasks. But eventually you start to find yourself thinking, “I wish I could do THIS with Jenkins…” That’s when you go looking for a plugin, and you’re almost always rewarded.

Some of the plugins we use most heavily are:

  • Parameterized Trigger, to send arguments to downstream builds when a primary build succeeds or fails.
  • OpenID, to allow us to authenticate users using our Google Apps for Business credentials.
  • Environment Script, to set environment variables and propagate them to slaves during each build.
  • Conditional BuildStep, to trigger actions within the build job based on variables; a rudimentary if/then construct, basically.

There are lots of others that we use for little nice-to-have things as well.

The developer community is awesome
Jenkins is open source, and the developer community that surrounds it is one of the best I’ve ever interacted with. The users mailing list is really active, and by virtue of the fact that it’s on Google Groups, mining it for tips or solutions to a problem is easy. But even more awesome than that is the #jenkins IRC channel on Freenode. Not only do many of the core committers to the project, such as kohsuke (its creator), abayer, and rpetti hang out there, but they and many others are always generous with their time. Whether you’re a new user with a simple question, or you’re looking for a plugin that might help you accomplish something, or you have a possible bug to report, you’ll probably get a quick and helpful response.

In one instance I found a bug in a plugin I wanted to use which completely blocked me. I went in #jenkins and brought it up, and it turned out that abayer was the maintainer of it. Within a few minutes he had fixed the bug and sent me a custom build of the plugin to test. The fix even worked the first time! In another case, I went to the channel wondering if it would be possible to get the OpenID plugin to support authenticating with the Google Apps for Business API. This would let me configure Jenkins to allow access to anyone at work who was already logged in to their corporate Gmail account. Kohsuke turned out to be the maintainer of that plugin, and thought it would be a useful feature, so he coded it up, sent me a build to test, and it worked!

Those are just two of the many, many examples I have of people in the community helping me out, and in a very immediate way. More recently, I’ve been encouraged to get involved myself. I’ve had a pull request accepted to Jenkins core, I’ve tried to be an active support resource on IRC, and I convinced my employer, BrightRoll, to sponsor the 2013 Palo Alto Jenkins User Conference!

Take a great tool as your base, make it easy to use and easy to enhance, and foster a vibrant community around it; that’s the way to run a successful open source project!