A look inside Etsy's scale and engineering culture

Interviewed by Christophe Limpalair on 03/16/2016

This episode looks at Etsy's current architecture as well as their engineering culture. This culture, over time, shifted Etsy from an application with issues to a stable and respected platform. Considering that their Ops team is now about 40% remote workers, how do they do it? Is that just possible because they are Etsy, or is it something we can reproduce? Peek inside with Jon and me in this exciting discussion.

Have fun while taking your skills to the next level with two special offers. Get a free month of hosting with DigitalOcean and get started in 55 seconds (promo code scaleyourcode), and get hands-on training at a discount with Linux Academy. They had 68 certification passes in February alone. Are you next?


Similar Interviews

How to Easily Setup and Manage Live Infrastructure with Chef's Own, Nathen Harvey

How Shopify handles 300M uniques a month running Rails, Docker, and MySQL



Interview Snippets

Tell us a little bit about how you got started with Etsy.


1:40


I've been at Etsy for about four and a half years now. I work on the production operations team. We're a team that encompasses everything from the data center techs in the data center, through the network admins and DBAs (database administrators), to the team I work on, which you might call the systems administration part. We do a lot of our own automation tooling and internal tooling, and everything up to the level of the actual code running the site.


You have open sourced a lot of the tools you use, like StatsD for monitoring, and a few others. We'll talk about the challenges behind that. You work remotely; can you tell us about that experience?


2:20


The thing with remote working, although a lot of companies do it, is that it's quite hard to do well. There's quite a large investment needed on the part of both the company and employee to make sure the communication works.


It also depends on your temperament. For me, I love working remotely. It works particularly in my favor because a lot of my colleagues are based on the East coast of the US, 5 hours ahead of me, which basically means my entire morning up until about 2pm (nobody is around) is my "productive time."


My afternoons are more meetings and collaborative work. That's a natural split in the day for me. I'm a morning person, so I'm more productive then. The "quality of life benefits" have been amazing (I'm available to take deliveries, take the car to the garage, etc). It's definitely not for everyone and it depends on where you are.


Chris: I actually work remotely as well for Linux Academy. We do have "wave" periods where a lot of people jump into the chat and start talking at the same time, taking a break from day-to-day things. Then we have periods when nobody talks and everyone focuses entirely on getting things done.


Jon: We use IRC internally as our default communications mechanism. You might call us "remote by default." Even in the office, we tend to default to main lines of communication like email and IRC just to make sure everyone stays in the loop.


The thing we lose out on as remotes is what we call the "water cooler" conversations in the office. In the Ops area of our Brooklyn office we have what we call the "Ops Cave," which is literally a wall screen with a camera on top so remote folks can dial in.


Do you have a lot of Ops people working remotely?


4:23


We're probably about 40% remote on the Ops team. It varies from year to year. We have quite a few people in our Brooklyn office at the moment. I'm currently the only one in Europe, although we are hiring in Dublin at the moment. We probably have just under 100 Ops people across the company.


Do you ever have to be on call?


5:00


Operations share on call responsibilities one week at a time by rotation, so at the moment, it's roughly one week in twelve. Being on call one week in three months isn't all that bad, although it is 24/7 for that week.


How stressful is that? I'm sure it depends if there is anything major going on, but, on average?


5:20


It does depend on the week. One thing I really like about the way we do on call is that we take things like alert fatigue and the normalization of deviance seriously: the situation where something is going off all the time and being broken becomes almost a normal condition. We tend to be proactive about fixing things like that.


The general theory being that the only things that should wake you up are those that actually require human intervention. Generally speaking, the sources of stress tend to be something wrong that requires you to use your brain at 3 am rather than, "Oh God, that thing that was broken yesterday has gone off again. Why is it paging me now?"


Stressful moments tend to be when something actually seriously has gone wrong that we need to fix rather than "pointless" alerts that didn't actually need to wake you up.


What would you say are some of the biggest challenges? Are they similar, or constantly changing?


6:15


More the latter because, although we are a fairly large and complex infrastructure, we're also a fairly mature one.


We tend not to, generally speaking, do that much "firefighting" so "keeping things running and, oh no, the site's down again," is not generally a thing for us.


New problems tend to come from introducing new systems: some kind of complex interaction between components that hadn't been seen before does something unexpected, and we have to figure out what it did and why.


It tends to err on the side of interesting technical problems rather than putting out fires with software crashes, server restarts, or things like that.


Can you give us a general overview of your current architecture?


7:15


Our infrastructure is all bare metal. We do virtualize developer environments: every developer gets a virtual machine that's basically a microcosm of the Etsy stack. Those VMs themselves run on physical hardware that we own.


Our actual application stack is very traditional. We run Linux, Apache, MySQL, and PHP. We have various caching layers.


One of Etsy's maxims in technology choices has always been what you might call "choose boring technology."


We tend to err on the side of mature, well-proven technology. The reason we don't tend to spend our time fighting fires is that we don't tend to use too much "bleeding edge" software.


We tend to pick mature things like PHP. It has been around for two decades at this point; most of the bleeding-edge bugs have been fixed, there's a lot of shared expertise, and it's easy to hire people who are really familiar with the tooling.


We choose well-proven technologies that have been around for quite a while rather than the latest and greatest JS framework.


Although you've kept the core PHP and MySQL, there have been a few changes over the years. Etsy has been around for quite a while now and sometimes things just didn't work out so you had to swap them out. How does that work? When someone says, "Hey, I think we should use this technology," what is the process?


8:35


We have a well-defined process for making decisions on using new technologies. Typically speaking, we'll do an "architecture review."


We have a tool for creating architecture reviews and attaching your supporting materials and so forth. The team proposing the change has to come up with a concrete case for why they think it will really work for us.


Recently, we introduced Kafka for our event pipelining stuff. The team that did that came up with a use case for why they felt we needed Kafka and what problem they were trying to solve.


Then, they set up an architecture review. A bunch of senior engineers and all the interested parties get together to discuss whether we feel the proposed architecture is a good fit for the problem. We look for issues they may not have thought of, such as ways the new components may interact with the rest of the system, who's going to support it going forward, and so forth.


Once they validate the underlying approach, they implement what has been proposed.


Before it goes live, it will go through another process called an operability review, which makes sure we've written a runbook for the on-call people, created all the monitoring and alerting, and communicated with other teams across the company about when it will go live.


The impact of going live has been well thought out.


Although our general position is what we call "preferring boring technology," there is a process for getting new technology in. But it's not as simple as saying, "I want to rewrite the site in angularAMD or something," because our site is large and complicated and enough people depend on it.
We take our uptime, reliability, and general operability of the site very seriously.


We don't take the introduction of new technology lightly, and we take care that we're introducing it for a specific reason: it's actually solving a problem that our current tooling can't.


We have also, in some cases, rejected a new technology because what we already have could be made to do the job with a little more work. You save time by not having to jump through all the hoops of a new technology that isn't quite as well proven.


When I saw you were still using Apache, I thought "I can't remember the last time I interviewed someone on this show who's not using NGINX". Is it because you haven't had a need to switch, and there hasn't been a huge difference?


11:30


Partly. The subject does come up periodically. It's worth bearing in mind that in our specific stack, we're running PHP as an Apache module with the prefork MPM and things like that.


We have done actual performance tests at various points over the years with NGINX as opposed to Apache.


We did actually try NGINX with a project last year, and for our specific stack and the performance profile of our site, the gains just weren't there to justify the switch.


We have a lot of tooling built around using Apache. We understand it very well. There isn't a compelling enough use case to make the switch.


Etsy has an engineering blog called Code as Craft where they talk about some of these decisions and performance testing.


Thanks to DigitalOcean for sponsoring this episode and for providing viewers with a discount


DigitalOcean is the fastest growing cloud infrastructure provider thanks to its laser focus on creating elegant and simple solutions for developers and teams.


If you've ever had projects you never started because of cost or time, grab this free month on a 1GB droplet and deploy it in 55 seconds. Their one-click deploys make it super easy to launch popular frameworks and languages without getting stuck on those time-sucking details. So now you've got a free option and a time-saving option. Go out there and have some fun!


Thanks, DigitalOcean!


Did you switch from PostgreSQL to MySQL because you saw a big difference?


12:30


That would be part of the reason. There is a lot of legacy around the decision to move away from PostgreSQL; it dates back to before my time, I guess 2008/2009. The database in use at Etsy at that time was very large.


A lot of the thinking at the time was around streamlining what was seen at the time as a very unwieldy legacy database system.


The actual reason we picked MySQL specifically predates me so I can't be absolutely sure. The thinking even back then when we made those technology decisions was that we wanted to pick well understood technologies that had been around for a long time.


That must be a really big data system with all that data: shops, user accounts, and so on. How do you scale something like that?


13:30


What you usually run into with MySQL is that your scaling hits a couple of checkpoints:


  1. The first is when you have to scale beyond a single box; you typically end up having multiple read replicas of the data.
  2. The second is when your write volume is too much for one box to handle. Replicating writes is particularly difficult (guaranteeing data integrity and so on), so typically you end up sharding the database.

In the MySQL context, what that effectively means is we have an index server that is itself a cluster. If you want to get the data for a particular shop, you go to the index server to see which shard that data lives on. Each shard is itself a master-master pair of servers, so the writes are actually split between, at this point, dozens and dozens of master-master pairs.
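
To make that lookup flow concrete, here is a minimal Ruby sketch of the idea: an index maps a shop to its shard, each shard is a master-master pair, and writes go to whichever side is currently active. It is purely illustrative, with in-memory hashes standing in for real MySQL clusters and hypothetical names throughout; Etsy's actual implementation lives in their PHP data-access layer and isn't shown in the interview.

```ruby
# Illustrative sketch of index-server based sharding
# (hypothetical names, in-memory stand-ins for MySQL clusters).

# The "index server": maps a shop_id to the shard its data lives on.
SHARD_INDEX = { 101 => :shard_1, 102 => :shard_2, 103 => :shard_1 }

# Each shard is a master-master pair; one side is currently active for writes.
SHARDS = {
  shard_1: { a: {}, b: {}, active: :a },
  shard_2: { a: {}, b: {}, active: :b }
}

def shard_for(shop_id)
  shard_id = SHARD_INDEX.fetch(shop_id) { raise "no shard mapping for shop #{shop_id}" }
  SHARDS.fetch(shard_id)
end

# Writes go to the active side; replication to the other side is assumed
# to happen asynchronously (simulated here by copying the row).
def write_shop(shop_id, row)
  shard  = shard_for(shop_id)
  active = shard[:active]
  shard[active][shop_id] = row
  replica = active == :a ? :b : :a
  shard[replica][shop_id] = row # stand-in for MySQL replication
end

# If one side of the pair dies, flip writes to the other side.
def fail_over!(shard_id)
  shard = SHARDS.fetch(shard_id)
  shard[:active] = shard[:active] == :a ? :b : :a
end

def read_shop(shop_id)
  shard = shard_for(shop_id)
  shard[shard[:active]][shop_id]
end

write_shop(101, name: "Handmade Mugs")
fail_over!(:shard_1)
puts read_shop(101) # => {:name=>"Handmade Mugs"}
```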


So you're almost sharding your data into multiple clusters.


So sharding is really taking all that information, splitting it up, and making the shards smaller and smaller?


14:53


Basically. As time has gone on and the performance characteristics of the hardware have gotten better, we've moved from having one shard per physical box to having multiple virtual shards on the same physical hardware.


Our scaling strategy has been very much around that. We've run into a few interesting problems, like when a 32-bit counter overflows and starts wrapping around, and all of a sudden you have to change everything to a 64-bit counter.


There are a bunch of interesting things that we've run into like that. I would say with MySQL it's sort of standard practice to have replication and sharding. It's a very well understood topology for large MySQL installations.


Sharding is not something where you can just flip a switch and it automatically shards for you. There's a lot to it, isn't there?


15:55


Sharding is quite complex because you're switching from thinking in sort of ORM terms, with one set of connections going to one database, to all of a sudden needing logic to go to your index server, find where the data you want lives, and then go and get it.


If you're dealing with master-master pairs, you have to start thinking about where you're writing the data.


Typically, writes go to one side and replicate to the other. We have mechanisms in place so that if one side of the master-master pair dies, we can flip writes to the other side.


There are a bunch of moving parts to consider when you switch to sharded data.


When we did it a number of years ago, it was a fairly significant undertaking.


When do you think it makes sense to shard your database?


16:42


That's very hard to give a good concrete answer to. It depends on what you're doing. If you're at the point where you can't just throw hardware at the problem and horizontally scale one database cluster anymore and you know that on the machine you're writing to you're starting to be bottlenecked on the actual underlying performance, it may be time to start thinking about it.


For example, say you're running out of disk space on the machine, but it's already a massive machine, then you have a case for "This isn't a problem we can solve with more hardware."


Thanks to Moore's Law (the hardware will generally get more performant), you can improve things somewhat by throwing more hardware at the problem. But if your database usage is growing faster than Moore's Law says the hardware is getting faster, you end up running into a wall where one box isn't giving you the performance you need or one destination for writes isn't giving you what you need to keep growing.


One good way to offload some of that work for the database is to add a caching layer or multiple caching layers. How have you done that?


17:40


We have a bunch of different caching layers. We use memcached fairly heavily for caching database objects. We have a bunch of others. We have the public facing API that our third party developers integrate with, but we also have a number of internal APIs too.


We have caching layers for API responses that aren't particularly short-lived; something that might live for an hour or longer, for example, we might cache. We cache images and static assets heavily as well. We pretty much cache everything we can feasibly cache.
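
As a rough illustration of that kind of response caching, here is a small cache-aside sketch in Ruby. It assumes the dalli gem and a memcached instance on localhost; the key format and the fetch_listing_from_db helper are made up for the example and are not Etsy's actual code.

```ruby
require 'dalli'

CACHE = Dalli::Client.new('127.0.0.1:11211')
ONE_HOUR = 3600

# Hypothetical expensive call we want to avoid repeating
# (e.g. a database query or an internal API request).
def fetch_listing_from_db(listing_id)
  sleep 0.2 # pretend this is expensive
  { id: listing_id, title: "Hand-knitted scarf" }
end

# Cache-aside: try the cache first, fall back to the source of truth,
# then populate the cache so the next request is fast.
def cached_listing(listing_id)
  key = "api:v1:listing:#{listing_id}"
  cached = CACHE.get(key)
  return cached if cached

  fresh = fetch_listing_from_db(listing_id)
  CACHE.set(key, fresh, ONE_HOUR)
  fresh
end

puts cached_listing(42) # first call hits the "database"
puts cached_listing(42) # second call is served from memcached
```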


Chris: That's a good idea but not always easy to do.


I would say caching is a quicker win than sharding because caching is easier to just throw into the mix.


Chris: Yes, but if you take it further and try to cache some of the more complex data, it can start getting complicated.


Cache invalidation is definitely one of the "hard" problems of computing. Making sure you're not serving stale content to users, especially when you start dealing with CDNs for image caching and so forth. Making sure the content on the edges nearest users is fresh. This has its own set of challenges.


Thanks to Linux Academy for sponsoring this episode


If you haven't heard, I work at Linux Academy now. I'd love to have you join us so we can help you learn more about Linux, System administration, and DevOps. We also have a private community where students and instructors help each other out. If you join, come say hi, I'd really enjoy connecting with you.


By the way, our training lets you go at your own pace. You can also train on our hands-on servers and environments, so you're not just learning theory. You're actually getting real-world experience. There's a reason why 68 of our students passed their certifications in February alone!


Here's a 24% monthly discount. See you there!


What load balancers do you use?


19:10


Internally, we use F5 load balancers. We've used them for years and been very happy with them.


Chris: I think Basecamp uses F5 as well.


At scale, there are only a relatively small number of big load-balancer vendors. Something a little bit different from large enterprise deployments is that we actually don't have a lot of business logic on the load balancer and CDN side of things.


Our load balancers are generally used for literally load balancing between pools of IPs. We don't have complicated redirect rules or any kind of business logic around the kind of things you sometimes see in large load balancer deployments. We try and keep that to a minimum.


Are you running on your own hardware primarily because of cost or is there another big reason?


20:10


It does largely come down to a couple of things:


  1. Commercial (capacity planning plays into this). Given the way we scale our site for our yearly traffic, it makes commercial sense to run it on our own hardware in our own data centers.
  2. Because we're running an e-commerce site, our traffic pattern is very predictable. We don't have to deal with the kind of huge traffic spikes Netflix sees for a film release, a sporting event, or something like that.

Our daily, weekly, monthly traffic patterns generally year after year follow similar shapes. Our busiest traffic period of the year is going to be Black Friday and Cyber Monday weekends. We know for example what our traffic does around Christmas. We have a very well understood model of what our traffic patterns do.


We tend to capacity plan based on what we know from our year-on-year traffic growth. We always have a pretty good idea what traffic we're going to do on Cyber Monday the next year, and we buy hardware accordingly to cope with that spike. We know our traffic will then taper off a little, start going up again around Christmas, and by the run-up to the next Cyber Monday be roughly where it was at Cyber Monday the year before. So we know that over the year we will use the hardware we've bought pretty fully.


We always build in a fair amount of capacity headroom as well because you can never tell with complete accuracy what traffic is going to do. Generally speaking, traffic spikes aren't something we feature heavily in our capacity planning. That's just not the way our users use the site.


If you want to spin up new environments to be able to handle those loads, is that where something like Chef can come in handy?


21:55


Since our capacity planning is done in smaller or larger blocks throughout the year, we're not doing what you might call autoscaling; we tend to provision a bunch of servers at one time. One thing Chef is particularly good at is spinning up a whole new Hadoop cluster. We have a lot of provisioning tooling internally that makes it almost as easy to provision a physical box as it would be to use the AWS console to launch a new cloud instance.


The combination of that with Chef means you can potentially build an entire Hadoop cluster in a "for loop."
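
As a sketch of what "a Hadoop cluster in a for loop" might look like, the script below loops over a batch of hosts and hands each one to Chef with knife bootstrap. The hostnames and the role[hadoop-worker] run list are hypothetical, and it assumes the boxes already have a base OS from the kind of internal provisioning tooling Jon mentions.

```ruby
#!/usr/bin/env ruby
# Bootstrap a batch of freshly provisioned machines into a Hadoop cluster
# by looping over hostnames and handing each one to Chef.
# (Hypothetical hostnames and role names; assumes knife is configured.)

nodes = (1..20).map { |i| format("hadoop-worker%02d.example.internal", i) }

nodes.each do |host|
  ok = system(
    "knife", "bootstrap", host,
    "--node-name", host,
    "--run-list", "role[hadoop-worker]",
    "--ssh-user", "provision"
  )
  abort("bootstrap failed for #{host}") unless ok
end

puts "Bootstrapped #{nodes.size} Hadoop workers"
```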


Can you give a general overview of what Hadoop does?


22:45


Big data is where it comes in. Hadoop is built around the idea of MapReduce jobs and came out of a project at Yahoo. It's essentially a distributed system for crunching huge amounts of data. I haven't done that much with Hadoop myself, and I'm not hugely familiar with all the internal pipelines it uses, but if you're doing large-scale analytics with hundreds and hundreds of gigabytes, Hadoop is the de facto solution.


Chris: Data analytics is an entirely different animal ... there's so much to it.


Yes, so don't ask me any questions about it.


Chris: That would be an interesting subject for an interview, although I haven't done much with it ... still, interesting and very, very useful.


We actually have our own internal data scientists and engineers. We involve them in our capacity planning too. Last year, one of our data scientists was helping us build mathematical capacity planning models ... being a bit more scientific than sticking your finger up to see which way the capacity winds were blowing.


We try to have a rigorous approach to how we model our traffic patterns. There's a lot of mileage in data science and data analytics. It's a phenomenally powerful tool that requires a very specialized skill set.


So you wrote a book called Customizing Chef. Let's talk more about Chef. Why are you using Chef, and when do you need to customize it?


24:20


For those of you who are not familiar with it, Chef is what you would call a configuration management solution. We use it to bring our servers from a bare operating system up to being ready to run code and serve the site.


Chef is what you call desired state configuration management, which means basically that you write (in Chef terminology) recipes that say, "This is what I would like my system to look like,"... these packages, these configuration files, these services running and so on.


Then Chef will look at the current state of your machine, figure out the delta between your desired state and the current state, and work out what it needs to do in order to converge your infrastructure to that state.
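
For readers who haven't seen Chef, here is a minimal recipe in that desired-state style: you declare what should be true and Chef works out the delta on each run. The package, template, and service names are generic examples, not Etsy's cookbooks.

```ruby
# recipes/default.rb -- a minimal desired-state example.
# Each resource declares *what* should be true; Chef figures out
# whether anything needs to change on this particular node.

package 'ntp' do
  action :install
end

template '/etc/ntp.conf' do
  source 'ntp.conf.erb'        # shipped in this cookbook's templates/ directory
  owner 'root'
  group 'root'
  mode  '0644'
  notifies :restart, 'service[ntp]', :delayed
end

service 'ntp' do
  action [:enable, :start]
end
```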


Because we're dealing with a large number of servers, having something that can do that and kind of guarantee that we have the same configuration on the same types of servers is very very important.


We've been using Chef since 2010, since Chef was still in the "0.something" versions. It's something we have a great deal of internal expertise in, and it has always worked very well for us.


At the time, the main contenders were Chef, Puppet, and CFEngine; Ansible and Salt hadn't really come out yet. Part of the reason we chose Chef is that it fit particularly well with the way we think about infrastructure and the way we think about writing code ("we" being the Ops team). Since then, we've never had a reason to switch away from it.


It comes down to the fact that we now have so much internal expertise that, for us to switch to another configuration management system, there would have to be a really big set of value-adds to justify the migration effort.


Honestly, I would say Chef works very well for us. I know people tend to rant about Enterprise software, but Chef is one of those enterprise pieces of software that does exactly what we want it to.


Customization:


Chef isn't designed to be a "one size fits all" solution. It's really designed to give you a toolkit to let you solve your own automation problems.


Chef has this idea that we (the users) are the experts in our systems. Chef can't tell us how we should configure our systems, but it gives us the tools so we can make those decisions for ourselves and then tell it what to do.


What that means is that, "out of the box," Chef doesn't do everything. There are a bunch of built-in primitives like installing packages, managing configuration files, restarting services, and so forth, but if you run into a particular sequence of behavior that Chef doesn't handle out of the box, you can find yourself writing a lot of copy-paste code. Chef lets you customize it so you can write your own resources (the building blocks of recipes) to encapsulate that behavior yourself.


One example you can think of is configuring a vhost in Apache. Chef gives you the ability to write a resource that lets you use a little block that tells Apache the name of your site and gives it a template, and then it goes off and does all the things like creating symlinks (symbolic links) from sites-available to sites-enabled, reloading the Apache service, loading modules, and so on.
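
Below is a minimal sketch of that kind of custom resource, written in the newer custom-resource syntax rather than the older LWRP style of the book's era; the Debian-style Apache paths and the resource name are illustrative, not Etsy's actual cookbook code.

```ruby
# resources/vhost.rb -- illustrative custom resource for an Apache vhost.
# Usage in a recipe:   vhost 'shop.example.com' do docroot '/var/www/shop' end

property :domain,  String, name_property: true
property :docroot, String, default: '/var/www/html'

action :create do
  # Apache itself is declared here (no-op by default) so the
  # notification below has a target to reload.
  service 'apache2' do
    action :nothing
  end

  # Render the vhost config from a template in this cookbook.
  template "/etc/apache2/sites-available/#{new_resource.domain}.conf" do
    source 'vhost.conf.erb'
    variables(domain: new_resource.domain, docroot: new_resource.docroot)
  end

  # Enable the site: symlink from sites-available into sites-enabled,
  # then reload Apache to pick it up.
  link "/etc/apache2/sites-enabled/#{new_resource.domain}.conf" do
    to "/etc/apache2/sites-available/#{new_resource.domain}.conf"
    notifies :reload, 'service[apache2]', :delayed
  end
end
```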


However you want to describe your infrastructure in this infrastructure-as-code way of thinking, you can write something that lets you do that.


Chris: So you tell it, "Hey, this is the state I want my machine to be in," and Chef will do all those mundane tasks that you would have to do manually otherwise or put in a script somehow.


Yes, so out of the box Chef comes with a bunch of things where you can say, "I want to do this thing," and then it goes and figures out how, but it also lets you write your own, so if you say, "I want to create a Hadoop cluster," you can write your own resources in Chef.


So if you say, "I want a Hadoop cluster called ScaleYourCode" and I then want to have this many machines, you can write your own underpinnings in Chef that will figure out the delta part.


It will look at what you already have and what you want to have and you can tell it how to figure out the difference essentially.


Can you give us an example from your book that ties back to Etsy in some way that describes what the book is about and what it teaches?


28:45


The book is designed to tell you how to make Chef do things it isn't designed to do "out of the box." There are a number of different places in Chef where you can customize things, from the actual building blocks of recipes to the tools you use to interface with the Chef API that runs on the server.


At the time I was learning Chef back in 2009, 2010, there wasn't a huge amount of documentation on this kind of thing and it was quite difficult to learn how to do those things.


"I was really trying to write the book I wish I'd had when I was learning."


As with Chef, the book isn't going to solve all your problems. But if you have an idea like, "I wish I could encapsulate this behavior in my own little block of code I can put in recipes," the book will tell you how to actually do that. Then all you have to write is the implementation details of your specific use case.


It will tell you how everything fits together, how everything links into Chef, how Chef actually runs and does all the magic you see on screen.


It was driven by a lot of the customizations we've done at Etsy, because we're very heavy on continuous deployment. A lot of our specific customizations actually relate to our workflow for using Chef: how we make sure all those people can make changes rapidly, often at the same time, without treading on each other's changes.


That was a specific challenge for us that your typical startup won't have. Every company has different challenges, but I was trying to give good reference material for when you start running into those "pain points" of what Chef doesn't do out of the box, so you can start solving those problems for yourself, for your own company.


You say Etsy is really big into continuous deployment. Can you describe what that process looks like?


30:35


Our "site" is comprised of both the code and configuration deploy, which is the master configuration file that has all our feature flags in it and so forth. We deploy our site code and our site configuration somewhere in the order of 60 to 70 times a day. These are very small, and rapid changes. We do the same with Chef.


We typically deploy Chef changes around 20 to 30 times a day. We're very much into small, rapid iterations. So if you want to make a change, you can push it out fairly quickly. There's a defined pipeline for doing so. You have all your testing built in.


To the extent that your testing is correct, you can push out a change with a fair degree of certainty that it will work when it gets to production. You have your preproduction stage and so on and then if anything goes wrong, you have all the tools in place to detect that something went wrong. You can roll forward, fix your change and so on.


It's basically trying to make deploying as natural a part of people's day as writing the code itself.


Everybody that joins Etsy deploys the site on their first day.


We also extend that program to non-engineers on a scheduled basis. For example, our CEO and our finance team have deployed the site. People who wouldn't usually be involved with day-to-day site deployments all get to deploy the site and see how easy it is.


Our deployment has two buttons: one for preproduction and one for production. We try to keep it that simple.


Chris: I had a chat with Adam Wiggins, one of the cofounders of Heroku. One of the philosophies he really believes in when onboarding people is having them be part of the project and deploy something, so they can feel like their work is part of it.


Absolutely. It is also one of the underpinnings of the DevOps movement: things like deployment shouldn't be the preserve of one specific team, and sysadmin work shouldn't be the preserve of just the operations team.


One thing Etsy, in my opinion, does very well is that our teams tend to be very cross-functional. There isn't too much siloing; teams collaborate and work with each other. We had a bootcamper come in from our front-end team to help us with front-end work on a bunch of our internal tooling.


Once you have been at the company for a certain amount of time, we also do rotations with other engineering teams for about a month or so. We sent an operations engineer to the performance team to do a rotation. We try to have that spirit of collaboration and make it easy for people to get their job done.


Tooling should be a force multiplier. It should help you do your job faster, better, stronger, not get in your way and stop you from doing things.


It's my understanding that it wasn't always like that at Etsy. In the beginning there was a bit more siloing going on, where you had very distinct teams. What happened? Do you know why that changed?


33:48


This is going back a long time, before 2009. The site was pretty unstable; it crashed a lot. They decided this needed to be fixed, so they hired in talent to try to get things better. Chad Dickerson was brought in at the time as CTO. John Allspaw and Kellan Elliott-McCrea were also brought in; a lot of them had shared backgrounds at Flickr and Yahoo. On the operations side, Mike Rembetsy was brought in. That was the start of things turning around into the Etsy people know today.


Chad, Kellan, and John were the engineering executive team. They started putting these processes into place, like continuous deployment, for example. Flickr was one of the first to do continuous deployment. There was a talk that John Allspaw and Paul Hammond did at Velocity in 2009 where they were deploying their code 10 times a day which blew people's minds.


No one realized you could deploy code that often. They brought a lot of that philosophy to Etsy and started basically including those processes.


I'm not particularly fond of the name, but Etsy is sometimes referred to as one of the DevOps "unicorns," where people say, "Etsy does this, but they're Etsy, so I can't do that."


Honestly, half of the reason that we are as well known for these things as we are is that we've been doing them for a long time (five or six years at this point).


It's very much part of our engineering DNA now to think in those ways about our tooling and technology.


It sounds like there's a lot of emphasis on the leadership having changed that and taken it from those silos to something more "blended in." Is that really something that has to come from the core of the leadership?


36:00


That kind of organizational transformation is extremely difficult to do without executive buy-in. A lot of the things companies tend to talk about these days with DevOps go alongside continuous deployment: things like blameless culture, "blameless post-mortems," and so forth.


If you're in a company trying to implement "blameless post-mortems" within a specific team, but you still have the VP of engineering yelling for someone's head because the site went down, it's very difficult to reconcile those two things.


We were particularly lucky that from the top down our leadership was very very invested in this process. Chad is now the CEO of Etsy, so from the highest level, Etsy is fully geared up and invested in "blameless culture", continuous deployment and all those things.


I would say that for companies out there trying to get these transformations into their culture, doing it without executive buy-in is extremely challenging.


Executives, especially in traditional companies, have the power. Even if your manager wants to have a "blameless post-mortem," if his manager wants to fire you and can, then you haven't really changed that much. I would say it's extremely crucial.


It's been hard to keep open source projects up to date because people would fork projects, change the code there, and pull that code into the Etsy codebase, but they wouldn't contribute it back to the main project, so things weren't getting updated. How did you fix that?


37:10


For quite a long time now, Etsy has been very open source friendly. We actually have defined engineering values, and what we call "generosity of spirit" is one of them. Part of that is that we contribute open source back to the community.


Where we could have done a better job in the past is we didn't have a particularly well-defined process. We had a defined process for when you can open source something, but not what we actually did with it afterwards. We have a number of very successful projects; StatsD being one of them, which is very widely used and actively maintained.


We also have a number of less successful cases: for example, our dashboards framework. We open sourced it, people kept working on it internally, and the changes weren't getting moved upstream. For a really long time the external project wasn't really being touched, because that's not where we were doing the work.


We had a number of projects like that we had open sourced and that we had stopped using internally or they hadn't proven to be as useful as we thought they might be. They were still out there and they weren't being maintained.


Last year, we set up the open source working group to try and systematically tackle how to make it better. One thing we started doing is that we archived a lot of projects. They're still there but we've made it explicitly clear that they are presented "as is."


We archived a bunch of projects that we clearly weren't maintaining anymore.


We also renovated our open source process, making clear the boxes you need to tick in order to be able to open source something.


We've changed a lot of the questions we ask in that process to make sure we capture the things we really care about.


For example:
How long have we been using this software in production?


Who is going to maintain it when we release it?


How are we going to manage contributions to the community?


"So the process we've put into place is much more geared to making sure that we only open source things we are ourselves actively using in production."


We also have a better process now for what happens if someone who is maintaining a project leaves the company: whether they can continue maintaining it, or whether we want to take over ownership and continue maintenance ourselves.


Going forward, we're being a lot more rigorous. We still err on the side of wanting to open source things, but we want to make sure that we release things we will be able to continue to support.


Putting things out there that the community is using that we aren't supporting is not doing them as many favors as we might like. We want to make sure that the things we open source are things we are committed to both using ourselves and continuing to support, and that we have a plan in place for if that ever changes, we don't just leave the project there with people not knowing what its status is.


Chris: I think I speak for a lot of people when I say that I thank Etsy and the team at Etsy for open sourcing so many of these different tools. I'm a really big believer in open source. We can all learn from this lesson, and hopefully a lot of people can use this as encouragement to go ahead and open source some of their tools that they are using in their companies.


Maybe they've been wanting to do it, but they see it as a very big challenge and they don't have the answers. Maybe they can learn from this lesson and go ahead and open source some of their tools. Thank you so much for doing it and for being so transparent about it.


Thank you so much for being on the show. This has been an incredible interview experience. We have spoken before and I was really interested in asking you more questions about Etsy, your job, and what you do there. I have learned a lot from this, and not just from the open sourcing, but from everything else you've taught me and us on Scale Your Code. Thank you so much, Jon, I really appreciate it.


If people want to reach out to you, how do you recommend they do that?


They can reach me at JonLives on Twitter. Same user name on Freenode IRC and also in the Hangops Slack chatroom.



How did this interview help you?

If you learned anything from this interview, please thank our guest for their time.

Oh, and help your followers learn by clicking the Tweet button below :)