Docker, The Art of Monitoring, and Kickstarter Engineering with James Turnbull
Interviewed by Christophe Limpalair on 08/17/2015
Monitoring is a recurring topic because of its importance. Meanwhile, Docker is making waves. James is an expert in both of these topics, and he's also the VP of Engineering at Kickstarter.
As you'll see, James explains the importance of collecting metrics. How do you track error metrics? I use Rollbar. Try it for free.
What does monitoring mean to you?
Monitoring is about understanding how systems and applications you own are providing business value.
Traditional monitoring is about disk, CPU, RAM, throughput, the number of 5xx errors, etc. Those are important, but what James really cares about is: "How is my application performing for my customers?" How is it performing in terms of the volume and rate of transactions and the quality of service?
Is it responsive? Are users seeing errors or are they having a good experience?
Did you decide to write a book to share this philosophy?
It's part of the reason why, but he certainly doesn't claim to have created this philosophy.
Instead of the traditional form of monitoring, which is a very large monolithic application with tools like Nagios, James wants to provide an alternative viewpoint and present people with another way to approach monitoring.
What does Kickstarter's monitoring look like?
Their primary focus is on looking at graphs, performance, and understanding their application's behavior.
Kickstarter is very customer facing and aimed at the community, so they focus a lot on metrics and application state.
They instrument a lot of their code so that they can see when an unexpected error or behavior occurs, and that in turn triggers alerts.
Having this sort of monitoring lets them see the throughput of transactions and make sure that people are able to pledge money. These are the important business components, and that's why they put so much emphasis on them.
They use tools like Graphite, statsd and Honeybadger. They also have tools like Sensu to monitor their infrastructure.
If you were starting Kickstarter from scratch, what monitoring process would you implement?
I asked this question because sirex007 from reddit said: "I know the [monitoring] tools, what I don't know is how to get started."
James says he would identify what the application does, and what business value it provides.
With a web app, for example, he would care about response time, the throughput of any sort of transaction that is generated by the system, and he would care about the user experience.
Kickstarter likes to track events and perform A/B tests. For example, when a backer goes to the site and sees a project and then backs that project, Kickstarter wants to know how that user's experience was. Where did they go? What did they do? How can we make their experience better? Can we get them to where they want to go faster next time?
Instead of having users click three times to get where they want to go, let's make it one click. These are the metrics that matter.
He wraps up that question by saying to worry less about the system-level metrics until you figure out these other metrics.
What are the simplest tools you'd start with? Or does it really depend on what you're trying to measure?
It is a bit dependent on what you're trying to measure, but a tool like statsd is a really good example. It's very simple, and it has client libraries for a huge selection of languages.
Add instrumentation throughout your code to generate events, have statsd aggregate them, and then send them on to Graphite or a hosted service like Librato.
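The statsd wire protocol itself is simple enough to sketch: a metric is just a name, value, and type sent as plain text over UDP. Here's a minimal, illustrative Python client; the host, port, and metric names are assumptions, and a real project would normally use an existing statsd client library.

```python
import socket

# Minimal statsd client sketch. The wire format is plain text over
# UDP: "name:value|type", where type is c (counter), ms (timer),
# or g (gauge). Host and port here are assumptions.
STATSD_HOST, STATSD_PORT = "localhost", 8125

def statsd_line(name, value, metric_type):
    """Format a single metric in the statsd wire format."""
    return f"{name}:{value}|{metric_type}".encode()

def send_metric(name, value, metric_type="c",
                host=STATSD_HOST, port=STATSD_PORT):
    """Fire-and-forget a metric over UDP; statsd never replies."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_line(name, value, metric_type), (host, port))
    finally:
        sock.close()

# e.g. count a pledge and record a response time
send_metric("pledges.created", 1, "c")
send_metric("web.response_time", 320, "ms")
```

Because it's UDP, instrumented code never blocks waiting on the metrics system, which is part of why this pattern is so cheap to sprinkle through an application.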
There are also tools like New Relic that let you monitor your apps, servers, etc. I wrote about that here.
A popular approach to metrics is to build an in-house solution rather than using some commercial or open source tools. Do you think this is because there is some functionality missing in the commercial or open source world, or just that people don't know about the stuff that exists already? Asked by: kendallm from reddit
There has been an evolutionary change in monitoring. When a lot of people started looking at monitoring, there weren't many options, especially good open source ones.
Additionally, a lot of open source tools just didn't scale. Netflix is a prime example of this. They built their own tools because they operate a very large and complicated environment. Jeremy Edberg joked in his interview that Netflix is a logging service that happens to serve video.
Once we have monitoring in place, do we just look for patterns in graphs? Or wait for alarms to go off?
They tend to do both at Kickstarter. They have a set of dashboards that show the state of the site as a whole, like the key metrics they care about (that we discussed earlier).
When they deploy code, which is about 5-6x a day, somebody keeps an eye on the dashboard to make sure everything goes smoothly. They also have individual alerts set up. Some of those alerts keep an eye on anomalies, others are set up to monitor code for stack traces and things like that.
In addition, they have services that keep an eye on uptime like Pingdom. That way they can tell if the Kickstarter site is down even if their monitoring system is down for whatever reason.
Are you covering Nagios in your book, The Art of Monitoring?
He is not. He wrote a book called Pro Nagios 2.0 about 10 years ago. He doesn't really see Nagios as having changed too much. Obviously there have been things added, but generally and architecturally speaking, Nagios is still the same tool it was.
James finds that problematic, especially since we live in a very dynamic cloud-based world where instances come and go and we have Docker containers that are far more ephemeral. He doesn't think Nagios is an ideal solution for those types of environments.
That's why he doesn't cover it in the book. It's not an "anti-Nagios" book, but it aims to provide alternatives for more dynamic environments.
You have a sample chapter available for download. It talks about Riemann. Why? And what is Riemann?
Riemann is a metrics and events routing engine.
It basically runs as a daemon and filters through events. You configure it to look for specific events and then take action on those events. For example, you can send them to statsd, Graphite, or other tools.
James looks at it like a central nervous system. All of your data comes in, you tell Riemann what you care about, and you route it off to a service to alert someone. Or you identify something you want to keep track of, and you convert it to a metric and route it off to Graphite (for example).
It's an incredibly powerful tool because it is configured in Clojure. "You can do amazing things with Riemann."
Since it runs in the JVM and is configured in Clojure, it runs very fast. There are people who process millions of events per second.
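As a rough sketch of what a riemann.config might look like (the hostnames, addresses, and TTL choices here are placeholders for illustration, not a vetted production configuration):

```clojure
; riemann.config sketch; all hosts and addresses are placeholders.
(tcp-server {:host "0.0.0.0"})
(udp-server {:host "0.0.0.0"})

; scan the index for expired events every 5 seconds
(periodically-expire 5)

(def email (mailer {:from "riemann@example.com"}))
(def graph (graphite {:host "graphite.example.com"}))

(streams
  (default :ttl 60
    ; keep a copy of each incoming event in the index
    (index)

    ; convert events you care about into graphed metrics
    (where (service #"^web\.")
      graph)

    ; route expired events (services that stopped reporting) to a person
    (expired
      (email "ops@example.com"))))
```

The shape of the file mirrors the "central nervous system" idea: events come in at the top, and the streams section decides what gets indexed, graphed, or escalated.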
So why doesn't every tool take this approach?
There are varying degrees of complexity in people's requirements. For a lot of people out there with a small number of hosts and not very complex systems, there are lots of very simple tools, like Nagios, that are very appealing. They're a little bit confusing and convoluted sometimes but there are many blog posts out there that explain how to set it up.
James says that Riemann isn't hard to set up, but it is a reasonably complex implementation. It takes a different approach than traditional tools: events are pushed to it, instead of being pulled the way Nagios does.
With a pull system, you connect to a machine and poll different services. Is the webserver running?
Riemann assumes that you have systems that are pushing metrics and events to it. It's not actively going out and polling anything. That is a different way of thinking about monitoring for many people, and can be confusing.
So Riemann listens and then indexes events?
Riemann is mostly stateless. It doesn't store data. Instead, it treats events with a time to live (TTL). Every event either has a TTL set by the system that emitted it, or one is set by default in Riemann. The index is basically the current pool of events whose TTL hasn't expired yet.
This allows you to look at events and keep track of them. Assuming everything is functioning properly, a new event will appear in the index.
For example, let's assume we're running Apache as a webserver. Some tool sends an event every 60 seconds to say 'The Apache service is running' and that gets indexed. If the event stops arriving, its TTL lapses and Riemann generates an expired event. An expired event means the service isn't reporting anymore, so an alert is triggered. This alert can be passed on to PagerDuty, for example, and it will report that the Apache service hasn't responded in 60 seconds, so it probably isn't working.
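The index-and-expiry behavior described above can be modeled in a few lines. This is an illustrative Python toy, not Riemann's actual implementation:

```python
import time

# Toy model of a Riemann-style index: events carry a TTL, and a
# periodic sweep turns stale entries into "expired" events that can
# trigger alerts. Illustrative only.
class EventIndex:
    def __init__(self, default_ttl=60):
        self.default_ttl = default_ttl
        self.events = {}  # (host, service) -> (event, deadline)

    def receive(self, event, now=None):
        """Index an event; its TTL sets a freshness deadline."""
        now = time.time() if now is None else now
        ttl = event.get("ttl", self.default_ttl)
        self.events[(event["host"], event["service"])] = (event, now + ttl)

    def sweep(self, now=None):
        """Drop stale entries and return them as expired events."""
        now = time.time() if now is None else now
        expired = []
        for key, (event, deadline) in list(self.events.items()):
            if deadline <= now:
                expired.append({**event, "state": "expired"})
                del self.events[key]
        return expired

idx = EventIndex()
idx.receive({"host": "web1", "service": "apache", "ttl": 60}, now=0)
idx.sweep(now=30)           # still fresh: nothing expires
alerts = idx.sweep(now=61)  # TTL lapsed: the apache event expires
```

In the Apache example, the check sending "Apache is running" every 60 seconds is what keeps the index entry fresh; when those events stop, the sweep is what notices.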
Is it a good tool for distributed environments?
With a tool that polls like Nagios, you need to specify what should be polled. So every time you add an environment, you have to add it in your Nagios configuration. This results in large configuration files, as you can imagine. In the Docker world for example, containers might only live for seconds or minutes. Having to constantly change your configuration is a nightmare.
With something like Riemann, the instance just starts sending events and Riemann will filter through it depending on what you configured it to report.
For example, a container or machine wakes up and starts sending events to a Riemann instance (or whatever other service). Then Riemann realizes there is something new and adds it to its reporting. When the machine or container goes away, Riemann checks if there are any alerts configured for that and reacts accordingly. This is nice because it gives you a lot of control over what gets reported.
Maybe you don't care when the containers go up and down because it happens so frequently, but you do care about other metrics while the container is running (like when a job starts/stops). Riemann can handle that.
So you can have one central "Command Center", and you can pass all the reporting upstream to it?
You also cover collectd. Is this something you use at Kickstarter?
Collectd is very popular, though. James ran a survey and it was the most popular metrics collection tool.
It's also a very lightweight and simple daemon written in C. It lets you monitor things and generate metrics from that.
It sits on the machine, doesn't take up much memory, and runs very fast. It cares about metrics and performance data: numbers that will be graphed or used for something else.
James adds: "I think that's a much more logical way of thinking about systems."
And so we can also use it with Riemann?
Yes, you can do Riemann, Graphite, etc.
Riemann has a graphical dashboard doesn't it?
There are pretty good systems out there for visualizing metrics. Grafana, for example, has a really nice interface. You can also have Grafana sitting on top of a number of different data stores, so you can send events from Riemann to one of those data stores and put Grafana on top of them.
A tool like this specializes in visualization. Riemann's graphical interface is more of a convenience, and James wouldn't recommend it as a solution for complex dashboards.
You also have a section on Logstash, what other kinds of technologies are you going to cover in the book?
The book is designed to take you through building a template architecture. James looks at setting up that central routing engine, at how you can monitor things on your hosts like metrics and events at the system level, at how you can pick up and monitor logs, and at how you can pick up and monitor business metrics.
His goal is to build a base infrastructure that people can build on top of.
He's going to lay out each of the pieces we talked about, and the last few chapters will take an example environment with a series of applications in it and show how you can use the infrastructure to monitor it.
What are the best practices for storing monitoring data long-term? Asked by AlexEatsKittins
It depends on the data and the volume. Assuming the question means metrics, a tool like Graphite writes them to disk, and that scales pretty well.
For really large scale, there are tools like OpenTSDB and InfluxDB. These tools are designed to consume metrics data and store it.
You really need to think about the lifespan of a metric and how long you want to care about it. For most people, is storing metrics for years and years really valuable? You can make a compromise by degrading resolution over time. Resolution is the frequency at which you record whatever it is you are monitoring. For example, maybe keep a 1-second resolution for 30 days, so the performance of the application is at a 1-second resolution for a month. Then you can degrade that to a 5-second resolution for 6 months, and a 30-second resolution for 2 years. Those trends become smoothed out over longer periods of time, and you probably don't need that 1-second resolution anymore.
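The resolution trade-off can be sketched as a simple downsampling step, in the spirit of what Graphite-style retention policies do when data ages out. This Python function averages fixed-size buckets of high-resolution samples into a coarser series:

```python
# Average every `factor` consecutive samples into one point,
# e.g. turning 1-second data into 5-second data.
def downsample(samples, factor):
    return [
        sum(bucket) / len(bucket)
        for bucket in (samples[i:i + factor]
                       for i in range(0, len(samples), factor))
    ]

# Ten 1-second response-time samples become two 5-second points.
one_second = [100, 102, 98, 101, 99, 250, 240, 260, 255, 245]
five_second = downsample(one_second, 5)  # [100.0, 250.0]
```

Each downgrade is lossy by design: the spike-by-spike detail disappears, but the long-term trend survives at a fraction of the storage cost.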
Thanks to Rollbar for sponsoring this episode!
I wrote a blog post on error tracking to help people get started with it, so they can stop wasting hours debugging in development and stop losing sales from buggy code in production.
But see, I don't have that problem anymore, because I'm using Rollbar. If a user runs into an error, Rollbar logs it and notifies you in any tool you use, just like we talked about with James. Go to try.rollbar.com/syc, sign up for free, use it with any language, and integrate it with any tool you need.
What about Docker monitoring? Are there things that work better than others? What are the best tools?
There are a few basic options out there. Docker actually comes with a built-in command: docker stats CONTAINER
This command will give you basic CPU, memory, network IO and a couple of other things.
The Docker Remote API also publishes metrics and statistics, so you could build some sort of interface.
Amazon's ECS also has a metrics module. New Relic has a Docker module, and so does DataDog. Most of the major monitoring platforms have Docker modules.
It depends on your specific requirements, but for very basic stuff, wrapping a simple script around a container is probably a good way to start. You can output metrics to statsd and collect them.
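A minimal wrapper script along those lines might parse docker stats output and emit statsd gauge lines. This Python sketch assumes the JSON --format template and its keys (Name, CPUPerc, MemPerc) available on newer Docker releases, so check your version; the metric names are made up for illustration:

```python
import json

# Parse `docker stats --no-stream --format "{{json .}}"` output
# (one JSON object per line) into statsd gauge lines.
def stats_to_statsd(stats_output):
    lines = []
    for raw in stats_output.strip().splitlines():
        stat = json.loads(raw)
        name = stat["Name"]
        cpu = float(stat["CPUPerc"].rstrip("%"))
        mem = float(stat["MemPerc"].rstrip("%"))
        lines.append(f"docker.{name}.cpu_percent:{cpu}|g")
        lines.append(f"docker.{name}.mem_percent:{mem}|g")
    return lines

sample = '{"Name":"web1","CPUPerc":"1.25%","MemPerc":"4.10%"}'
print(stats_to_statsd(sample))
# In a real script you would capture the docker CLI output, e.g.:
#   subprocess.check_output(["docker", "stats", "--no-stream",
#                            "--format", "{{json .}}"], text=True)
```

Run on a cron or loop, this is the "simple script wrapped around a container" approach: the gauges land in statsd, and Graphite takes it from there.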
If you want to look at those metrics in the context of other applications and services, then something like New Relic or DataDog is a good solution. You can see your Docker container performance metrics juxtaposed with application metrics or whatever is running in the containers.
What do you think about Amazon's ECS?
They've had a quick look at it at Kickstarter because they are a big user of AWS. James likes it but he thinks it's missing a few features. It's not quite ready for 'prime time'. Deployment with it is still a bit clumsy, and so is job scheduling.
Honestly it's like most of the AWS products...the first version is trying to sample a market.
Obviously it's not in the same sphere as something like Mesosphere. (They have a pretty epic site by the way...)
They are specifically designed for job scheduling orchestration, but AWS is on the way there and it will be interesting to see how it evolves.
Can you give a general overview of what Docker is?
Docker is a container-based virtualization system, and a very lightweight one.
Traditional VMs tend to take up a lot of resources and they take a while to launch because they are very separated and isolated machines.
Instead, Docker containers are built on top of the Linux kernel and are very lightweight. They use Linux kernel features, like namespaces and control groups, to construct isolated environments.
(Image from The Docker Book by James)
Most Docker containers launch in sub-seconds and are designed to be easy to use. More importantly, they're designed to work around your workflow.
Docker's original intent was to provide a system for a platform as a service called DotCloud, which was a product the Docker team used to run. It turned out to be really good at that. It's really good at the workflow where you have application code, you want to package it up, and you want to ship it and run it. This workflow is very important to developers.
Every line of code that sits on a machine, rather than being deployed to production, is not making you money and not delivering any business value. So the process of getting that code off of your machine and into your production environment needs to be frictionless and seamless. Docker is a way to try and enhance that experience.
What is a common mindset that hinders beginners from understanding Docker?
James thinks that some people do think about Docker like traditional virtual machines. There are a couple of reasons why that is problematic:
1) Docker is much more lightweight. From a security point of view, containers are not as secure as VMs.
2) Often VMs run for months or years. Docker containers are designed to be ephemeral.
In a scenario with a webserver, DB, and app, all on one instance, would you separate those services in different containers? Or one bigger container?
It depends on the complexity of the app, but he would tend to build separate containers for each service.
This will make scaling easier in the future because your services are already separated. So you could say "I want 10 webserver containers" and you could easily add a load balancer like HAProxy or Nginx.
You also get a separation of concerns. If your webserver goes away, you can just restart that one container instead of going into the container and manually figuring it out, or restarting the container that has everything else in it.
How do those containers communicate?
Docker can expose network ports.
You can also create links between containers. You could have your webservice container expose port 80, and have everyone connect to it. That webservice knows that it needs to get stuff from the database, and it has a link to that database. It uses the Docker daemon to find a route to that container and establishes a private connection. This means that no one but the webservice can communicate with the database. There are no ports exposed.
Some still view this as a security issue. If someone gains access to the webservice for example, they could get access to DB variables.
Security at the application level in Docker is no different than in a VM. If you let someone into your VM, James doesn't see that as being different.
Now, VMs do have thicker walls between the OS and the VM than Docker does between the containers, the daemon, and the underlying OS.
Because of this, James says it's a good idea to run things like SELinux. This can help ensure that only necessary processes can run. You can set alerts or just have the behavior prevented outright.
James also says that you should run services with similar security profiles together. Mixing your front-end services with your back-end services strikes him as poor co-location of security concerns.
We have running containers now; how do we deploy code to them?
There are a couple of different ways.
Docker containers are launched from images. An image is a prebuilt filesystem.
So, for example, if we wanted a WordPress site, we could install Apache and WordPress with PHP enabled and build an image from it. This bakes the image and makes it ready to go, so you can boot up new containers from the new image and replace the old ones.
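As a hedged illustration of baking such an image, a Dockerfile might look like the following; the base image tag and download URL are assumptions for illustration, not a vetted setup:

```dockerfile
# Sketch of baking a WordPress image: Apache + PHP base, plus the
# WordPress source unpacked into the web root. Pin versions you
# have actually verified.
FROM php:apache
RUN docker-php-ext-install mysqli
RUN curl -fsSL https://wordpress.org/latest.tar.gz \
    | tar -xz --strip-components=1 -C /var/www/html
EXPOSE 80
```

Building it (docker build -t my-wordpress .) produces an image, and every container booted from it starts with that exact filesystem.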
Since Docker has a good system for loading filesystems and mount points inside of containers, you can tell it to mount a certain directory, filesystem, or volume (from a third-party storage system), and that could contain the source code for your application. It could contain a git repo too.
If you don't want downtime, do you just use a load balancer?
Using a load balancer, take 10% of your containers down, replace them with your new image, and test. Once it's proven to work, replace the other 90%.
Do you use Docker at Kickstarter?
They do, primarily for local development. He says they will probably end up using it in production in the future. That tends to be how it starts with Docker.
You worked at Puppet Labs. Do you recommend using something like Puppet to manage Dockerfiles or images?
Using something like Puppet or Chef or Ansible is an ideal way to maintain a Docker infrastructure, because you still need to manage all your Docker hosts and bring up instances.
He strongly recommends this for consistency.
Many people also use it to build containers and images. That's an equally viable approach.
How do you recommend automating the management of orphaned volumes and orphaned images created during image development? Asked by: elykquent
"Interesting question, and I'm not sure there's an answer yet."
James says at this point he thinks it's still a pretty manual process. He doesn't think there's an easy solution here yet.
Some believe Docker is not yet ready for production. Shopify recently moved their infrastructure to Docker and Simon Eskildsen on their team wrote a post in July called "Why Docker is Not Yet Succeeding Widely in Production"
That post lists a number of issues they've run across at large and complex scale. One problem is that "image building for large applications is still a challenge." "Dockerfiles make this almost impossible for large apps." He also says that logging is all over the place and there is no great, generic solution. There are a few other things he points out, but the question is:
Do you think some of these issues, like the Dockerfile being too abstract to enable complex use-cases and logging having no generic solution, as solvable issues? Or are these inherent to the technology and mission that Docker is trying to accomplish?
James has read Simon's post and he says it's good. He strongly recommends reading it.
Shopify is a relatively early adopter of Docker. A few others also adopted it early, and James says "there were some pretty traumatic experiences." It's cutting edge technology. He would not have recommended running a pre-1.0 version of Docker in production. In fact, every piece of documentation that James wrote had a header stating that they did not recommend using it in production. A significant number of companies ignored this. (Goes to show you how needed this technology is.)
James goes on to say that's awesome but it did involve risk.
The latest versions of Docker (1.7 and 1.8) are significantly more stable. There are also more and better features, like logging drivers for syslog, fluentd, and a bunch of others. There are plugins for storage, better networking, and a host of other improvements.
Other than for really complex applications, Docker is ready for production. James also makes a good point: if you do have a really complex app, there are very few tools out there that will meet 100% of your needs. Most of the time, companies running complex apps have to roll their own custom changes.
Have some of these early adopters contributed back to help iron out some of these issues?
Yes. James gives a shout out to the team at Gilt in New York and the team at Spotify in New York, both of whom were early adopters of Docker and gave huge amounts of feedback.
He goes on to say that he sat down in many meetings with them saying "When are you guys going to fix this thing? It's terrible!" and providing solid feedback on how they implemented things, what they needed, and what they cared about.
There are a bunch of other companies out there who donated time and effort, and in many cases contributed code.
"They made the difference between Docker being successful and not."
What are some interesting tools and trends that you see in the Docker horizon? Asked by: jmreicha
There are two tools that Docker recently added:
Docker Compose, which is how you can build multi-container, multi-tier applications. It's a single definition file that allows you to spin up your application with a single command.
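As a hedged sketch, a 2015-era (v1 format) docker-compose.yml for a simple two-tier app might look like this; the image names and password are placeholders:

```yaml
# docker-compose.yml sketch: a web container built from the local
# Dockerfile, linked to a database container.
web:
  build: .
  ports:
    - "80:80"
  links:
    - db
db:
  image: mysql:5.6
  environment:
    MYSQL_ROOT_PASSWORD: example
```

With that file in place, docker-compose up brings up both containers, with the link wired, in one command.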
Docker Swarm is an orchestration tool that allows you to run Docker in multiple distributed environments. Previously, Docker hosts didn't know about one another: if you had containers running on one Docker host, they didn't know about containers running on another host. Docker Swarm changes that. You can deploy multi-tenant, multi-geography, distributed Docker environments. You can also transfer containers between hosts, and spin up certain containers in certain data centers.
The plugin system is going to let you integrate with a bunch of other tools. This is pretty big.
They also just came out with a toolbox that makes installing and setting up Docker on your environment much easier. It basically comes with all the tools you need to get started.
James has also written a book on Docker which he has updated to include all of the v1.8 and Docker toolbox updates. (I highly recommend this book)
How can people stay up to date with your book updates and the Art of Monitoring release?
www.dockerbook.com has a bunch of info including a sample chapter, and there is a mailing list you can sign up on.
www.artofmonitoring.com also has a sample chapter and a mailing list you can join. He will send out updates about the book.
What do you do as a VP of Engineering at Kickstarter?
His job is largely a leadership function, so he's there to ensure the team has everything it needs to build an awesome product. He makes sure they have the right number of people, the right tools, and the right process.
What has been one of the hardest challenges you worked on?
Kickstarter is an established service that has been around for 6 years or so, and they have a passionate community. Building software for a passionate community is hard because you want to build something that is going to delight them and make them engage strongly with the product. You really want to build things that give them a better user experience.
How much traffic does Kickstarter have?
He couldn't give me a number, but he put it into perspective for us:
The Pebble project alone raised $20M. Pebble had 78,000+ backers, but they have projects with hundreds of thousands or even millions of backers.
Follow James on Twitter
Check out his blog
Also check out his upcoming book, and sign up for updates.
He's written quite a few books, including The Docker Book which he just updated to include v1.8 and other changes we talked about in the interview.
How did this interview help you?
If you learned anything from this interview, please thank our guest for their time.
Oh, and help your followers learn by clicking the Tweet button below :)