Building epic open source tools the HashiCorp way, with Seth Vargo
Interviewed by Christophe Limpalair on 03/30/2016
In this episode, we talk with Seth Vargo from HashiCorp about many of their open source tools: Vagrant, Packer, Serf, Consul, Terraform, Vault, Nomad, and Otto. We also talk about their engineering culture to understand how they're able to innovate and build these high quality tools. What's their process? How do they test? How do they come up with new ideas? We answer these questions, and more.
Can you tell us about your background and how you got started with HashiCorp?
I joined HashiCorp in late 2014. Previously, I was an engineer and community person at Chef. I heavily used Vagrant and worked on a lot of open source tools that were either plugins to Vagrant or depended on Vagrant under the hood. I had met Mitchell (HashiCorp's founder) through the Vagrant community because I had been working on Vagrant as a core contributor for a while.
I left Chef and took a little time off and then Mitchell asked me to join HashiCorp. I was one of the first engineering hires. There were 4 or 5 people when I joined. There are more than 50 now which is kind of crazy to think about.
I've been there under 2 years now, and it has been amazing to see how quickly the company has grown while still staying true to our core pillars and open source principles, cranking out new projects left and right.
Fun fact: Part of the company name comes from Mitchell Hashimoto's last name. Hashi has different meanings in Japanese. One of them is "chopsticks," the other is "bridge." Part of the reason the company is named HashiCorp is that we bridge the gaps between people, process, and technology.
You worked at Chef and now you work on a lot of different tools at HashiCorp. What have you learned from working on so many different open source projects?
I think the biggest takeaway that I've learned over the past 9 years is that software is easy and humans are hard.
There are two kinds of open source software:
- Software that's only on GitHub because someone didn't want to pay for a private repo
- Software that is truly open source. The open source component involves people more than lines of code and curly braces.
A lot of times we forget that open source involves humans on both sides: the humans who write the code and the humans who have to use it. We get caught up in arguments or use cases where, at the end of the day, we have to step back and realize that "it's just code," and we should care more about empathy and the human side of things than curly braces and Unicode characters.
If someone is interested in taking a step forward and contributing to their favorite open source tool, what would you recommend to them?
If you are a non technical user or you don't feel like you're skilled enough to contribute, or even if you have imposter syndrome, I think the best thing to work on is documentation. Especially non-API documentation.
A lot of open source projects are really good about having developer-style documentation, but they don't have the beginning use case. They don't have, "Here's how I go from nothing to a real use case."
I think documentation is super important and I think open source maintainers often neglect it or leave it until the very end. For some non technical or less than technical users, documentation is a great place to get started. It's a great way to introduce yourself to the project and maintainers, and a great way to give back.
Can you tell us more about the 'imposter syndrome'?
What I've encountered is that people feel like they're not adequate enough to contribute to software. It's entirely untrue.
No matter how much experience you do or do not have, you are still a valuable member of the community. The fact that you open up GitHub and want to contribute makes you a valuable community member.
Sometimes people look at large companies like Apple and Google who have these open source tools and people feel they are not adequate enough to contribute to those projects because those companies have an aura about them and certain engineering standards. If you go through pull requests you'll find that those maintainers are super responsive. They're there to help you learn. They'll teach you if there's a better way to write that code you wrote, and you will learn a lot faster in return.
"It's actually a learning opportunity, so don't idolize maintainers and don't get imposter syndrome. Trust me, you're definitely good enough to contribute to open source, anybody is."
You wrote an interesting post about DevOps myths. What are those myths, and what is DevOps really?
It helps to have some background. I was one of the organizers of a DevOpsDays event in Pittsburgh two years in a row. I was involved in the DevOps movement early on, and I started seeing enterprise adoption of the term DevOps. It turned into a marketing facade.
It angered me a little and I decided that instead of ranting on Twitter, it would be better if I put all my thoughts together in a single blog post where I listed what I think are the top ten fallacies or myths. It has turned into a talk as well. I've given the blog post in talk form a few times.
Myth #1: You can hire a DevOp to fix your organization.
A lot of times you see these titles for DevOps engineers or Senior DevOps engineers.
The purist definition (which was talked about as far back as 2009) was: breaking down barriers within an organization. Basically, it's getting your developers to talk to your operations engineers.
Since then, it has evolved to span the entire organization: getting your marketing team to talk to your Ops team to talk to your development team. It's really taking those vertical silos, breaking them down, and improving communication.
When I hear, "I'm a DevOps engineer," or see a job posting for DevOps, in the pure sense of the definition, that would mean that your job is to get other people to communicate with each other. It wouldn't mean that you're working with EC2 or skilled with managing Chef or Ruby or Terraform, or whatever. It would literally mean that your job is to communicate and break down barriers.
It turns out that we already have people to do that in organizations. We call them managers. To me, it seems we have overloaded that definition a bit. I think what's happening is that the term "operations engineer" wasn't buzzwordy enough for LinkedIn posts and job titles, so we've overloaded DevOps to mean "somebody who manages systems." In reality, it's still an operations engineer, just a 21st century operations engineer focused more on cloud-based things than on bare metal servers and datacenter cabling.
How does HashiCorp tie in with the DevOps mentality?
We published this thing called the Tao of HashiCorp. It lays out the pillars on which our company stands. One of them is that we focus on workflows, not technologies. So yes, we make these really popular open source tools that are used by a lot of people, but those tools are designed to solve actual problems. They're not just tools that we created for their own sake.
We identify real problems that humans are experiencing and we build tools to solve the workflow, not the actual problem itself, because over time technologies change.
Look, for example, at the number of new languages that exist today compared to 5 or 10 years ago when basically everything was written in Java, C, Ruby or Python. Now you have Go, Rust, Crystal and all these new languages are coming up and they're solving different use cases.
We tend to focus on the processes and the workflows themselves. That's why we tend to get labeled like a DevOps company because we're enabling a lot of those DevOps processes. Terraform, for example, which is a tool for managing infrastructure, allows people like the marketing team to go in and make the changes to DNS entries, which previously they would have to go through an operations or engineering team. But it's such a simple syntax. It's so easy to use. We're breaking down a lot of those barriers and allowing people to self service.
Speaking of tools, let's walk through some of them just to understand what they do and what they have to offer. Let's start with Vagrant.
Vagrant actually predates HashiCorp. It was Mitchell's open source project when he was in college. It is a command line tool for driving virtual machines. It integrates with VirtualBox, Oracle's open source hypervisor. It allows you to use a single file, the Vagrantfile, to declare your machine. The file can dictate how much memory it should have, and then you can run provisioners to install everything. A good use case for Vagrant is development environments.
Everyone on your team needs the same development environment, and it should match production as closely as possible. You just run vagrant up and it mounts your folder, so you still get to use your editor (Sublime, Atom, Vim, or whatever) on your computer. You can save your files, refresh the window, and see the changes right away. Everything runs inside the VM (virtual machine), and if you screw something up, it doesn't touch your actual system. You can just destroy the VM and start over again.
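As a sketch of what that single file looks like (the box name, memory size, and provisioning command here are illustrative, not from the interview):

```ruby
# Vagrantfile: one file declares the whole team's development VM
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"      # base image to boot

  config.vm.provider "virtualbox" do |vb|
    vb.memory = 2048                     # RAM for the VM, in MB
  end

  # The project folder is mounted into the VM, so you keep using
  # your own editor on the host and see changes immediately.
  config.vm.synced_folder ".", "/vagrant"

  # Provisioner: install everything the app needs inside the VM
  config.vm.provision "shell",
    inline: "apt-get update && apt-get install -y build-essential"
end
```

From there, vagrant up boots the machine, and vagrant destroy throws it away if you break something.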
I mentioned production parity. The idea behind Packer was that we wanted to build a tool where you could create Vagrant boxes as well as production images. With Packer, you write a Packer template and you tell Packer, "Create me an Amazon AMI, or a Google Compute image, or a DigitalOcean Droplet," at the same time that you create the Vagrant box.
At the same time, you're still using the same Chef or Puppet scripts, but what is happening is you're getting two outputs and artifacts. That means that your developers are working in an environment that's as close to production as possible, and you don't have to do as much work creating the environments.
That, in turn, decreases the probability that system-level bugs will enter production, which ultimately saves the organization money. Packer is like golden machine images as a service, but in an automated fashion, so it defeats a lot of the arguments we typically hear against golden images, like that they're slow or difficult to maintain.
Packer automates all of that. It runs everything in parallel. It's very, very fast and has a lot of adoption.
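At the time, a Packer template was a JSON file, roughly like this minimal sketch, where the AMI ID, script path, and names are placeholders:

```json
{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "us-east-1",
      "source_ami": "ami-xxxxxxxx",
      "instance_type": "t2.micro",
      "ssh_username": "ubuntu",
      "ami_name": "myapp-{{timestamp}}"
    }
  ],
  "provisioners": [
    { "type": "shell", "script": "scripts/install.sh" }
  ],
  "post-processors": ["vagrant"]
}
```

Adding a second builder (for example, virtualbox-iso) to the same template is what produces the matching Vagrant box from the same provisioning scripts.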
Isn't that a use case already solved by Docker?
Kind of...Packer is almost a little bit higher so it can actually create Docker containers for you. You could create a container the same way you could create an Amazon AMI or Droplet and a Vagrant box. Packer can also do the Docker side of things. The problem that you'll run into with Docker is that once you have a container, where do you put it?
That's where a scheduler starts to come into play. You need something like Nomad, which is our open source scheduler (like Kubernetes or Mesos), to actually schedule those containers, put them in the right place, and make sure they're highly available.
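As a rough illustration of what you hand a scheduler, here is a hypothetical Nomad job file (the image name, counts, and resources are made up):

```hcl
# A Nomad job: "run three copies of this container somewhere sensible"
job "web" {
  datacenters = ["dc1"]

  group "frontend" {
    count = 3                  # keep three instances running

    task "server" {
      driver = "docker"

      config {
        image = "nginx:1.9"    # the container to schedule
      }

      resources {
        cpu    = 500           # MHz
        memory = 256           # MB
      }
    }
  }
}
```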
With something like AMIs, you just launch the AMI and let Amazon take care of it or if you've launched a Google image you let Google take care of it.
Not everybody is ready to move into a scheduled architecture. I think I see that as the direction our industry is moving towards, but a lot of people are still running on bare metal. For them to go from bare metal to a scheduled architecture is a huge jump. They might want to go from bare metal to cloud and then to a scheduled architecture. That's the kind of thing Packer would enable.
Otto is what you call the successor to Vagrant. Can you tell us a little bit about Otto?
Otto doesn't replace Vagrant. It was a little bit of a misnomer when we announced it. Otto uses Vagrant under the hood. Otto actually encapsulates all of the best practices for using our tools in a less customizable, more 'Heroku like' work flow.
Otto has 3 components:
The first is otto dev, which will spin up a Vagrant environment based off of your application.
It will introspect your application and say, "Hey, you're running a Ruby app, let me give you a Ruby environment," and it drops you right into the environment. Under the hood it uses Vagrant for that. You don't have to customize the Vagrantfile, and you don't have to understand how to install Ruby. It just does it for you.
There are two other parts to that. The second part is the actual infrastructure creation which, under the hood, uses Terraform.
If you think about a modern infrastructure, you need a VPC (Virtual Private Cloud), firewall, a Bastion Host, front-end load balancers, internal TLS. If you talk to any person who has done Ops for a long time, it's very difficult to get that right. So, Otto codifies all that knowledge, then it generates Terraform configs and runs Terraform for you.
You don't have to know or understand Terraform. You don't even have to understand operations. You just run otto infra. It inspects your application, finds the current best practices for deploying a Ruby app on AWS or Google Compute Engine, and then creates the infrastructure.
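The workflow described here boils down to a few commands run from the application's directory (a sketch based on the description above, not a full reference):

```shell
otto compile   # inspect the app and generate all the underlying configs
otto dev       # Vagrant-backed development environment, no Vagrantfile needed
otto infra     # generate and apply Terraform configs for the infrastructure
otto deploy    # push the application onto that infrastructure
```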
The third component is the deployment process. Part of this isn't finished but it will use Consul and Nomad to deploy the actual application onto the actual infrastructure that it created.
So, I like to describe Otto as being very akin to Heroku where, if you are a developer or you just want to deploy your application, you can just git push and it goes up to Heroku. Heroku makes a whole lot of decisions for you, and you don't have a lot of control over the knobs that you can tune.
The advantage with Otto is that we want to give people that same experience, but if they want to tune any of those knobs, they can dive down into the more advanced tool, like Terraform or Consul or Vagrant, and twist all those knobs. It allows you to migrate to that advanced mode as your organization grows.
Imagine you have a startup with 2 employees. It might make sense for them to run otto dev, otto infra, and otto deploy. But as that company grows and realizes it can cut costs by making optimizations, it can use Terraform directly, as opposed to letting Otto make all those decisions, and make its own decisions based on its own needs.
That's one of the reasons you haven't replaced Vagrant but instead made it a different tool.
Right. It's also being adopted right now because it's a lot faster than Vagrant since it's written in Go instead of Ruby.
So really with Otto, you are abstracting away the complexity but if you need more complexity or your infrastructure grows, you can dive into another tool and customize.
A good analogy is like the radio in your car. You can control the volume and what station you are listening to, but once you open up the hood, there is so much stuff. You can see the engine, how many pistons you have, you can change your wiper fluid, etc... Most people are not car fanatics. They'll just take it to the shop and have someone do it. That's Otto in this case.
If you are a car fanatic and want to make sure everything is okay, you go to your own garage, open the hood, twist the knobs and get exactly what you want.
You mentioned that it's faster than Vagrant because it's built in Go. Is there a specific reason that Vagrant is built in Ruby and not Go or is that something you plan on changing in the future?
Go didn't exist when Mitchell started writing Vagrant. Or if it did, it was so early that it wasn't well known. We don't have plans to rewrite Vagrant in Go at this point, but there are parts of Vagrant that are written in Go. There are things vendored inside the Vagrant binary that are Go, because they talk to APIs and are a little bit faster for certain things.
I wouldn't expect a massive rewrite of Vagrant in Go any time soon, but it is something we have considered. Getting a backwards compatibility layer and all that in place would probably take a year of engineering effort, and then we'd have to worry about all the plugins, which are written in Ruby. Vagrant has a really big plugin ecosystem, and we don't want to break that backwards compatibility.
When you have a tool like Terraform, it's similar in a sense that you are abstracting complexity in configuration as well. So you can say, "I want this Droplet on DigitalOcean. I want to name it this. I want it to have this much memory in this region, etc," and it will figure it out for you.
Yes, that's part of it; it's the first half of it. When I first introduce people to Terraform, I ask if they are familiar with Vagrant. If they say yes, I say that Terraform is Vagrant for cloud based infrastructure. Once they understand that, I tell them it was a lie because it's so much more than that. With Terraform you can create DigitalOcean Droplets but you can also manage DNS records.
We just merged into core (it hasn't been released yet, I don't think) the ability to manage GitHub users and organizations right from Terraform config. So truly, Terraform is infrastructure as code. It is a declarative syntax that allows you to define everything in your organization, whether it's firewall rules, Droplet sizes, DNS records, etc., all in a single text file.
It's done in a "human friendly" format, which is how we're breaking down one of those barriers. Previously, it would have been Bash and shell scripts and probably some Jenkins, but we put a really nice DSL (Domain-Specific Language) on top of Terraform, which allows even nontechnical users to contribute to the organization's infrastructure.
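A hypothetical Terraform config in that spirit: a DigitalOcean Droplet and the DNS record pointing at it, declared in one file (all names and values here are made up):

```hcl
resource "digitalocean_droplet" "web" {
  image  = "ubuntu-14-04-x64"
  name   = "web-1"
  region = "nyc3"
  size   = "512mb"
}

# A marketing person could edit this record without touching Ops
resource "digitalocean_record" "www" {
  domain = "example.com"
  type   = "A"
  name   = "www"
  value  = "${digitalocean_droplet.web.ipv4_address}"
}
```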
What is a DSL?
Domain-Specific Language. We have our own configuration language called HCL (HashiCorp Configuration Language). It's very similar to Nginx config, and it's also JSON compatible. One of our other pillars in the Tao of HashiCorp is automation and pragmatism: we believe everything in the world should be able to be automated.
All our tools, with the exception of Vagrant, accept and emit JSON, both on the CLI (command line interface) and in configuration files, so machines can generate and parse those files.
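For instance, the same hypothetical setting can be written by a human in HCL or generated by a machine as JSON:

```hcl
# HCL, written by a human:
region = "us-east-1"

# The JSON equivalent, trivially generated or parsed by a machine:
# { "region": "us-east-1" }
```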
Is Terraform another example of Otto in that it gets rid of some of those knobs, but eventually you may need to go inside the engine and go more manual? Is that the case with Terraform as well where it might take you only so far?
No, the goal is that you should never have to do that. In reality, Terraform is at version 0.6.13, so it does have bugs, but fortunately those bugs are, at this time, edge-casey. The mainstream things work very well, but I can't say Terraform is bug free. We do have full time engineers working on it. They're constantly fixing bugs and coming out with new providers.
Ideally, you would never have to drop down into that layer. At some point it might happen especially in the event that a cloud provider comes out with a new API that Terraform doesn't have support for yet.
Our turnaround time is pretty quick. When AWS announced a new API, we got it in within 4 days of the announcement, faster than Amazon's own competitor to Terraform, CloudFormation. It actually took them a couple more days to implement the new API.
We have a pretty fast turnaround time. We use Terraform to manage all our production infrastructure internally, and a lot of customers rely on it. It's not perfect, but it's 90-95% there. I'm hard pressed to say it's production ready, but it's production stable for sure.
Let's move on to Consul.
Let's not talk about Serf because it's really just an underlying layer for Consul.
It's our service discovery tool and distributed key/value store. In a large production infrastructure, how do you identify where your database is, especially when services are moving around frequently? Consul answers that question. It's also a distributed key/value store with built-in primitives like leader election and locking.
Consul is one of our more popular tools. Even midscale infrastructures really benefit from Consul. It's my favorite tool to play with. You get real time edge-triggering with the catalog and the agent. It allows you to dynamically configure infrastructure and then as soon as the service becomes unhealthy, you can remove it from a load balancer, for example.
It's super fun to play with. You can actually just download it and run consul agent -dev and it will bring up a cluster for you automatically on your laptop so it's really easy to play with and get started.
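A quick taste of that, assuming the consul binary is on your laptop (the service name and port are illustrative):

```shell
# Bring up a single-node dev cluster
consul agent -dev &

# Register a service with the local agent over the HTTP API
curl -X PUT -d '{"Name": "web", "Port": 80}' \
  http://127.0.0.1:8500/v1/agent/service/register

# Discover it through the catalog...
curl http://127.0.0.1:8500/v1/catalog/service/web

# ...or through Consul's built-in DNS interface
dig @127.0.0.1 -p 8600 web.service.consul
```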
We were making all these infrastructure tools, and then we came out with a security tool. The reason for that is, when you talk about pragmatism, security really spans the entire stack. Development, operations, production, runtime: they all need secrets. In development it might be staging or production credentials, it might be certificates. It can be a lot of different things.
So we wanted to build a tool that solved this problem. Vault is a secure secret storage and acquisition engine, and it's 'pluggable'. One of the pluggable back-ends is the generic secret back-end. You can think of it as encrypted Redis: you give it a key/value pair, it encrypts it and stores it. When you want it back, you get that value back from Vault, assuming you have the right permissions.
Where Vault really shines is that it is a secret acquisition mechanism as well. So you can hook Vault up to PostgreSQL and have Vault dynamically generate PostgreSQL credentials for you.
As a developer or as an application, you ask Vault for PostgreSQL credentials. You give Vault root access to manage that PostgreSQL instance, and then any time someone requests credentials, Vault will generate a user with the correct policy and set the expiration values as well as the leases on those secrets. That creates credentials that are only valid for a certain period of time.
It also gives you a break glass procedure in the event of a breach. You can revoke all the PostgreSQL credentials. Vault will revoke them in PostgreSQL and refuse to serve new ones.
You can generate Amazon IAM (Identity and Access Management) users and even certificates. Vault can even act as a certificate authority at this point. It takes a lot of power away from the one or two operators that you have and automates the whole process in a system.
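A sketch of the dynamic credentials flow with the Vault CLI as it existed around this time (the mount name, connection string, and role are all illustrative):

```shell
# One-time setup: mount the PostgreSQL backend and give Vault root access
vault mount postgresql
vault write postgresql/config/connection \
    connection_url="postgresql://root:rootpass@db.example.com:5432/postgres"

# A role defines the SQL Vault runs to create each short-lived user
vault write postgresql/roles/readonly \
    sql="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';"

# Any authorized app or developer now gets unique, expiring credentials
vault read postgresql/creds/readonly

# Break glass: revoke everything issued under this mount
vault revoke -prefix postgresql/
```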
Then we use a process called Shamir's Secret Sharing to make sure that no one person has complete access to everything in the Vault. The "terrible but good" analogy is a movie like "Olympus Has Fallen," or any movie where they kidnap the president and try to get the secret codes to launch the nuclear missiles, and five people have to enter their codes to unlock the launch button. That's actually how Vault works.
You have a threshold of key shares that combine to generate an encryption key, and that key unseals the master key, which is then used to unlock the Vault.
What that means is that if you have a rogue operator or an employee leaves the organization, you can revoke just their access, or you can seal the Vault. No single admin can unseal the Vault; a quorum of people has to come together to unseal it.
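With the CLI of that era, the movie scene looks roughly like this (the key arguments are placeholders for the shares handed out to each operator):

```shell
vault init -key-shares=5 -key-threshold=3   # five shares, any three unseal
vault unseal <key-share-1>
vault unseal <key-share-2>
vault unseal <key-share-3>                  # threshold reached: unsealed
vault seal                                  # break glass: seal it again
```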
So really, you're creating a lot of tools that can work together or by themselves, and that all help abstract away complexity.
Yes, and the other side of that is we believe very strongly in the Unix philosophy: you would never expect ls to create a file for you, and you would never expect your secret management solution to create infrastructure for you.
The reason we do that is twofold:
- It allows us to very clearly define the scope of our projects.
- It allows organizations to adopt "all or part." If you only want a secret management solution, you can use Vault which is open source and you don't have to touch Terraform or Consul. If you just want Vagrant you can just use that.
Our tools integrate really well with each other, but we're not trying to build a massive tool that does everything.
We're building individualized tools with individualized teams with very specific use cases.
Let's talk about the engineering culture that's really driving this kind of innovation. How do you build these tools? How do you find a team that can really get on board and build some of these high quality tools? How do you build a team that does that? Is it the culture, the leadership, the people you hire? How does that work?
I think it's all of those. Our two founders, Mitchell and Armon, are very technical and personable. They have very clear goals for the company. We do rolling interviews, so we're always hiring. We're hiring right now. We don't do your traditional "Google style" technical screen. We find that having someone code on a whiteboard is not useful to anybody.
The biggest thing is that we're remote friendly. In fact, we're almost remote only. To give you some context, we have one founder that lives in LA and one that lives in San Francisco. I live in Pittsburgh, on a completely different coast.
That gives us the advantage of being able to span multiple time zones. It also allows us to recruit some of the top talent because a lot of the really talented engineers don't want to move out to San Francisco, leave their families, friends, home town.
For a lot of our engineers, this is their first time working from home. They love it. It gives you a really good work/life balance. That's something that really shines as part of our recruitment process.
Because we're an open source company, part of the reason we don't need to do a "whiteboard challenge" is that a lot of the people we hire come directly out of the community. They're already contributors to our projects. We see their code. We've worked with them.
They can literally come onboard and we don't even have to train them because they've already been working on the tooling. They're already familiar with the codebase. They can just come in, and the only difference is that now they have a @hashicorp.com email address.
They already feel like part of the company because we try to do a really good job of making them feel welcome in the community. Then, if we do bring them into the company, they feel welcome there as well.
But working remotely has its own challenges doesn't it? How do you make sure the communication channels are always open and that everybody is onboard with the same plans?
We use Slack for communication for everything. We have rooms for each product, and that's where the discussion about those products happens. We also have committer rooms. We have a Terraform committer room, which includes all the external contributors, so they have access to the engineers if they want a review or have questions.
Anybody who is a core committer has access to our Slack. When it comes to big announcements, like when someone new joins the company, or a change in something HR-wise, that's email based.
We have an all company meeting every Friday, remotely. We use software called Zoom for that. We get to see everybody's face. Everybody gets to say what they've worked on during the week. We do demos. It's like a weekly recap. There's no alcohol, but it's like the remote equivalent of "Happy Hour."
You get to see people's faces. It's fun. That's how we stay on task.
We have a VP of product who manages a roadmap of things. Mitchell and Armon do a good job of dictating things like what the priorities are for this quarter and who should be working on what.
Can you describe a little bit more of what your development flow is? Say you want to build a new tool, how do you get started? How do you start working on the code and make sure it's shared?
All our open source tools start in a Google Doc. Anything we write, including closed source tools, starts in a Google Doc. We either define the architecture first or, and this might seem a little weird, we design and document the Command Line Interface (CLI) before we do anything else. We care a lot about the user experience, so sometimes designing the CLI drives the way we architect the actual system itself.
I can't share them, but we have design documentation for Vault and Nomad and Otto where we went through and said, "This is what the API will look like and feel like. This is how the user is going to feel when they use this tool."
Once we've all agreed on an API and everybody is on-board, we find out who's going to work on it and where. If it's a new open source project, we do it in a private repo (repository) first. When we're ready to announce it, we make the repo public.
We don't remove any of the GitHub history or anything. We have a certain aura about the company and we like to surprise people with our new projects, so we'll open source it and post on Hacker News and Twitter, or we'll make an announcement on hashicorp.com when we announce new products.
Do you have any idea how much time is spent in each of those stages (API stage, implementation stage...)? Do you have any idea of the breakdown of the percentages of those times?
It turns out that if you have a really well designed API doc, writing the code is really fast. The majority of time spent writing code is actually spent thinking. So if you separate the thinking into its own phase and build a design doc that clearly defines the architecture and the interactions between processes in the project, you can write the code really fast.
As a result, we spend a lot more time in the design process than in the actual development process. We actually architect the entire system from start to finish before we start writing code.
Once you do have the code and you have a working product, how do you test it and make sure it's working to specifications?
We have unit tests and integration tests. Mitchell just gave a really good talk called "Advanced Testing with Go." At a higher level, we believe in dogfooding. We use a lot of our own tools internally before they even get open sourced. If they are terrible, we fix them.
That's the reason that when you see a new product come out, it feels polished. Internally, we've been using it for a couple of weeks and we've found all the weird edge cases and crazy bugs that would have hit someone else, but we found them first and we get a chance to fix them. We deploy our commercial product with our commercial product so, for example, we use Atlas to deploy Atlas. The dogfooding helps with that "polished" look.
Is that how you're also constantly innovating, because you are constantly using that tool you can be thinking, "I wish I had this, or I wish this was faster" or is it also a mix of that and your customers coming to you and saying, "I wish you would implement this or that?"
I think it's both. A lot of the tools we create come out of real problems we experience. In particular with user experience, we identify things like, "This CLI doesn't make any sense," or "The way you do this doesn't make sense to humans," so we change it. In terms of big, high level features once your project exists, those are governed a lot by issues and customers. Bug fix priority is determined by how many users are affected and how severe the bug is.
Features are different, because we have to keep a balance between the features we develop and the bugs we fix. If you're adding features all the time, you're neglecting the bugs, and if you're fixing bugs all the time, the project itself will go stale. There are a lot of variables that go into that: who is using it, customers, the number of people putting a +1 on an issue saying "this affects me too," the severity of the bug, and lots of other things.
It's really a "feeling" thing and it's up to the engineers on that project.
You are a director of evangelism at HashiCorp. Can you tell us a little more about that experience and traveling around the country and helping people with HashiCorp and what that looks like?
I've worn a lot of hats at HashiCorp, being one of the early employees and this is where I've landed. I do a lot of training and public speaking, podcasts and screencasts. I love it. I live on an airplane which, for most people, might not be an ideal scenario, but I really enjoy it.
The most important part for me is that I get to meet people who use our tools, and they give me feedback. They say, "This is broken, or this needs to be fixed, or I want this to behave differently." I get to take that back to the engineering team and actually watch the progress on that issue or feature.
In a way, I act as a liaison between community requests and the engineering teams, but I also get to evangelize our products. I get to show and teach, which is something I've always been really passionate about.
How can people connect with you?
My Twitter handle is my full name, @sethvargo. It's the same on GitHub, Facebook, LinkedIn, and everywhere else. My email is seth at hashicorp dot com. Feel free to shoot me an email. My DMs are open on Twitter, so you can direct message me even if I don't follow you, if you want to have a private conversation. I'll get back to you as long as I'm not on a plane or asleep.
If you enjoyed this episode, please consider rating the show on iTunes. Thanks!