How Shopify handles 300M uniques a month running Rails, Docker, and MySQL
Interviewed by Christophe Limpalair on 09/21/2015
Simon has incredible tips on building resilient apps. From scaling Shopify's database to failing over an entire data center, see how one of the largest Rails app running on thousands of Docker containers can handle intense flash sales and over 300M uniques/month.
How do you make apps more resilient? The easiest first step is tracking errors in development and production and my sponsor is offering viewers a free 90 days ($87 value) so you can get started in 5 minutes.
Links and Resources
Your online content is fueled with passion. Where does this passion come from?
Since he was 11 or 12 years old, Simon jumped at every opportunity to work with computers. He really loves this kind of work, and being able to do this full time is such a pleasure. He gets to work with so many smart people.
"I get paid to play, it's amazing."
How did you get started with Shopify?
He's been working at Shopify for 2 years now. He wrote a blog post around 2-3 years ago about not having a phone. It had a header like "Why I'm glad I broke my iPhone." This was before a lot of people started writing these kinds of posts, and New York Times and Lifehacker both picked up the article which gave him a lot of hits.
Someone at Shopify found him through that and scheduled an interview in Canada the next day.
I commented on how Simon's post motivated me to not check my phone as often. The result? More productive hours.
I introduced you as working on Site Reliability, Performance, and Infrastructure. Is that an accurate description?
He jumps around a lot in the infrastructure department, and is currently working on setting up new data centers and trying to completely isolate clusters of shops from one another.
What does the Shopify architecture look like?
Their architecture is still fairly simple. When a request comes in, it goes to Nginx and then to a cluster of servers running Docker with a Rails app.
They have a data tier that has things like:
These are the most important components of their stack. Another important note is that they are a multi-tenant platform: they host online stores where one server might serve shopA.com and shopB.com. This is really important to how they scale, because they can offer cheaper stores than a lot of the more traditional e-commerce platforms that host shops on a single node.
Shopify runs its own hardware in data centers but they do also run some stuff in AWS.
Although they run Docker in production, they have not broken their app up in Microservices. They are still running a large Monolithic app. There are smaller apps here and there but it is primarily a large app.
When we hear about containers we often think about Microservices. Shopify hasn't moved to Microservices. Why not?
For Shopify, it was really about getting into Docker at the right time and getting into the right mindset of using containers for their company.
Two years ago it was fairly obvious that this movement would gain momentum, and that Ops and infrastructure would look very different in 5-10 years. They really wanted to get in on that early to get as much experience as possible.
One of their main goals in switching to Docker was to have reproducible builds so that every server would look the same in production and in testing (CI, for example). Running Continuous Integration on a big app can be a slow process. Running Docker on CI drastically reduced the time it takes for them to deploy.
Simon goes on to say that their stack in the first year of running Docker was "a lot more terrible" than before switching to Docker. Only few people knew how to operate it and they ran into bugs in the kernel, had increased complexity, etc... But now it has paid off tremendously. CI runs with hundreds of thousands of lines of Ruby in about 5 minutes. Deploys to 300-400 nodes across 3 data centers take just 3 minutes.
Just a year ago the CI and deploy times were closer to 15 minutes.
So not only are they drawing benefits from Docker without running a Microservices architecture, but they are also setting themselves up for a switch if/when they would like to do it. Like many other companies, they do not take the extra overhead from Microservices lightly.
Did you start moving to Docker in 2014?
Yes, they started at the beginning of 2014, and it took about 6 months to get it to production. So they had Docker in production at the end of July 2014.
What is the main scaling challenge that you have at Shopify?
The biggest challenge that his team has is what they call flash sales.
They have stores with tremendous sales that happen at a specific time. For example, Kanye West sells shoes online which combined with Kim Kardashian, has a Twitter following of about 50 million people.
They also have customers who advertise on the Superbowl. They have no idea how much traffic that will give them but it could be 100x their normal traffic, and then it could go back to the normal throughput within a few hours.
This is the biggest challenge. How do they make sure that they can handle this scale? Even if they can't really handle that scale, how do they make sure that other customers don't suffer from it?
How do you prepare for these sales when a customer warns you?
Ideally they wouldn't have to do anything. They're still working on getting this down because they want to be the best platform at providing a scalable solution. They don't have that protocol down just yet, but they have a checklist for performance checks.
In Kanye's case, they sat down for about 2 weeks and did extensive passive load testing and performance optimization by combing through critical parts of the platform. They experimented by putting the entire storefront on Fastly instead of serving it from their own data center.
They also experimented by throttling the checkout proportional to the amount of inventory.
In all, this is a really good testing ground for Black Friday and Cyber Monday, for example. They have really improved with this over time, and eventually they'd like for it to be second nature.
You use a Resiliency Matrix to help you with these tests. Can you tell us more about the process?
About a year ago, preparing for Cyber Monday 2014, they had embarrassing outages where components that should not have affected the entire platform ended up causing issues.
For example, if you are having issues with sessions, don't take the whole site down. Just sign everyone out and allow them to still go through checkout without their customer account. Then, associate the email address with the customer account afterwards.
This is really where the Resiliency Matrix comes into play. At the top of the Matrix, you have sections of your application. Like customer storefronts, checkout flow, and maybe the administration, etc..
On the other axis, you have all the different services and data stores.
So if the service Redis goes down, we need to make sure the checkouts are still up. If it goes down, fill the box in red. If it slows down, fill the box orange, and green if it still works. The goal is to have every box filled in green.
(Here's the slideshow from his talk.)
Simon gave a fantastic talk about Resiliency at a conference that I highly recommend you check out.
This gives a really nice visual to see how your different components behave when certain services go down.
How do you run these tests?
They tried many different testing methods but ended up realizing that many of them uncovered more bugs than they solved. Because of this, they decided to build a tool called Toxiproxy.
Toxiproxy is an HTTP API that allows you to change the condition of your connections on the fly. You can literally have a closure where the service is up, but inside the closure there's another closure where the service is down. This enables you to write these really awesome resiliency tests that let you find when things go down or what happens when things slow down.
Since all of this is done at the TCP level, you can also test drivers instead of having these 100 line stubs for every single driver.
Here is an example of using Toxiproxy in a Rails app.
A lot of scaling issues come from databases. How do you handle this with your large databases?
They are running MySQL which they sharded about 2 years ago. A lot of it was shipped the week before Black Friday, so it was under very intense pressure.
So they shard pretty aggressively and they are aiming to have more, smaller, shards over time. That has been very helpful for them.
Scaling the database is hard, though, especially in their business. Everything is very transactional, and they can't loosen up on their guarantees unlike social media websites like Twitter and Facebook. Shopify is dealing with money, inventory, and other things that can't go out of sink.
Other things they have to deal with:
API clients that find some strange path or a customer doing something crazy, or a custom Shopify App doing something crazy. The good thing about shards is that when a customer in a single shard is doing something crazy, or a single shard is misbehaving, only a very small subset of the entire platform is suffering from it.
Sometimes it can even be one of your developers who shipped code that doesn't perform as well under load.
For the most part, they have gotten pretty good at handling database scale with automatic failovers, and a lot of resiliency work.
This is where something like the Resiliency Matrix becomes handy.
Exactly. They can go back to that matrix and pinpoint exactly what happens when one database shard suffers. How many stores are affected? What parts of the application go down? All these can be answered, and they can be transparent with their customers.
"The unknown is a really big enemy when you are handling these catastrophes." Having this sort of transparency with how systems affect each other gives engineers peace of mind. Instead of scrambling to solve an issue while customers are losing money, they are better equipped.
Can you think of anything you solved with your database that viewers/readers could benefit from knowing?
When you're optimizing these relational databases it often becomes very specific so this is a difficult question to answer.
Shopify has an easier sharding story than a lot of people because the shops are fairly isolated from each other. Sharding in general is really, really hard and only recommended as a last resort. You can probably cache a lot before you have to do that.
If you do shard, keep those shards really small. Isolate failures as much as possible.
There are companies who go to another extreme of not doing caching at all. Instead of running memcached in front of their MySQL tier (which Shopify does and Facebook also does), they have hundreds of READ replicas with sophisticated systems around that. Depending on your problem that might be the way to go.
GitHub for example, if you do a WRITE to a shard you will see a cookie on your browser that says "go to the master for the next 5 seconds" to make sure that the data is going to be up to date, which is a brilliant hack that a lot of companies are doing as well. Unfortunately if you are handling money, you can't really do that.
Can you walk us through what happens when someone opens up a new shop?
Not a lot to be honest. They create a row in the master table with a new shop, do additional provisioning for billing and that sort of thing, and there's really not much going on.
Because they are a multi-tenancy platform, they need to keep the cost of creating shops really low.
Since you are a multi-tenancy platform, how do you ensure that shops aren't affected by noisy neighbors?
This is something they are still working on. Right now they have one big app tier with the traditional architecture of having load balancers passing requests to Docker containers that have Rails processes.
Since they still share that tier, they need to make sure that even if a single shard is slow, it doesn't cause all workers to hang up. Through resiliency testing they can reject connections at early signs of a service slowing down.
As Simon said earlier, they also isolate shops more and more by making shards smaller. On top of that, they are looking at isolating tiers that serve the applications from each other. This is something they are working on this year and next year.
Another big thing they are working on is getting shops to run out of multiple different data centers on different continents, and at the same time. This is big for data locality and also to protect against unexpected events.
Preparing for unexpected events is also something Netflix has invested a substantial amount of resources in as Jeremy Edberg told me in his interview.
You're running on your own hardware. Why?
Shopify is over 10 years old now and they started running into scaling issues about 5 years ago. Back then, a lot of people were buying their own hardware and running things in data centers.
Shopify doesn't have its own physical data center. They co-locate. This was very common 5 years ago. It's probably just a relic of that and they haven't had a compelling reason to move.
There are a lot of challenges there too, but there are also benefits.
They are looking to use the cloud more aggressively now than they have in the past. Maybe this will be a good thing for Shopify in the future because it could allow their architecture to run on multiple different cloud platforms at the same time.
Whereas Netflix auto scales with AWS to handle spikes, how do you do it?
Shopify has a lot of hardware. They are over-provisioned.
Simon points out that one of the reasons you have to auto scale in the cloud is because it is expensive. If you own your own hardware, it's not that expensive. That has worked really well in the past.
He goes on to say that there are definitely a lot of benefits to auto scaling but it is very complex for a lot of smaller companies.
Moving to the cloud introduces a lot of potential issues you have no control over. How do you prepare for that?
Companies like Netflix that make extensive use of the cloud have spent a tremendous amount of resources on resiliency.
Traditional companies try to bond all of their interfaces on the network level and do all of these low level optimizations to make sure failures happen less frequently.
Netflix has to do all of this at the Layer 7, which is the only layer they can control.
If one of Amazon's routers fail, or if a box is killed, they have no control over that. This is the new reality of cloud -- you have to build on top of components that you cannot control.
Of course, cloud providers are going to do their best to not let components fail, but if they do fail you have no visibility into it. So the new reality is that you can have an entire region go down, but your app still needs to stay up.
Simon adds that he believes a lot of their efforts there are really going to pay off in the longterm. Shopify also does all this low level optimization previously mentioned, but it's a lot cheaper for your application to be able to cope with this and so having this mindset embedded in the organization early on has been really beneficial to them, Simon explains.
You recently opened up a second data center and performed a failover test. The only thing that went down was the checkout process. What happened?
There are a couple of steps to perform when moving a database. The first thing to do is to disable checkouts. This is because they have relational data stores and for them to be consistent, they have to be read only in one location and write only in the other location. There is a short amount of time when they don't really know which data center is going to receive the traffic because of the way routing works.
So they actually manually took checkouts down. That way they can serve customers a page that says to come back in a little bit because they are performing maintenance.
Here's how failovers work at Shopify:
After disabling checkouts, current jobs need to finish so they can be written to the database. Next, all the BGP blocks for IP addresses need to be moved from one data center to the other. Load balancers also have to be refreshed so they will route to the new data centers because some ISPs will still route to the old data center even when they have announced moving elsewhere.
Following that is a failover of all the shards -- trigger them to read only in one location and write only in the other. After these steps are done, they can restore checkouts again and restore ElasticSearch from backup, etc...
Simon says it is a fairly simple process and they are looking at turning this into something they could easily perform multiple times a day.
Simon then shared something they are working on which would solve the checkout issue:
When a request goes to a shard that is currently non-writable, they can queue the request at the load balancer. Once the shard is writable again, they can drain that queue. This means the customer just sees their browser spinning a little bit longer while maintenance is being performed, and then the request goes through. This literally means they could perform an entire failover test without losing a single checkout.
They are using Nginx and LUA to do this.
This will enable you to have data centers running independently around the world.
Yes. They are investing a lot in the routing tier and one of his coworkers is going to do a talk at Nginx Conf about this.
It all boils down to their custom routers that look at requests and tell you where in the world that request is going. All of this is done with Nginx LUA, which Simon says is amazing technology although the learning curve can be steep. Mainly because tutorials are hard to find.
What tools do you use to detect failures?
They use Icinga for a lot of the alerting and paging. They also use DataDog for a lot of their monitoring metrics, dashboards, and that sort of thing. DataDog does have ~20 seconds delay sometimes and in large scale systems this can be problematic. But overall DataDog has definitely changed the way they do monitoring.
What about Rails errors? How do you detect errors users run into that slipped through your testing?
They are transitioning from a self-hosted solution to BugSnag.
I use Rollbar at ScaleYourCode and really like it. They even set up a page for viewers to try it free.
It has saved me from errors more times than I care to admit. In fact, the morning of this interview I woke up to an email alerting me that users weren't able to play interview audio tracks. This was my fault because of code I had pushed the night before, and I could have integrated Rollbar with other tools to alert me as soon as the first user ran into the problem, but email is just fine in this case.
It's an incredible product, and I highly recommend you check it out. Go to try.rollbar.com/syc and set it up with your app in 5 minutes, no matter the language.
How do you keep this large Rails app running fast? Any scaling issues specific to Rails?
Yes and no. Rails is fairly slow by itself but it's an easy thing to scale... you just throw more application servers at it.
When you hear people struggling to scale Rails, it is often an issue of not having enough capital to buy more servers.
Rails has been extremely productive for their developers, whereas a lot of times faster languages have other opportunity costs. So they've been using a Rails codebase from the beginning since their CEO contributed the first lines of code to it and was a Rails core contributor.
Simon goes on to say that they can attribute a large part of their success to adopting the Ruby mentality to their development.
Is Docker ready for production? A lot of companies hesitate
This is something that he's been talking about for a little while now. If you watched James Turnbull's interview on the show, then you already know that we quoted and discussed his article called Why Docker is Not Yet Succeeding Widely in Production. While he says some people interpreted the article as bitter sweet and as him having too high of expectations for the short amount of time that Docker has been in existence. He says that's not really what he meant to do.
He is very proud of what Docker has done in the last few years. They've managed to develop an entirely new mental model for people developing infrastructure. "That's been really awesome."
Many things he mentioned in the article are getting better by the day, but they are things that make the pros not outweigh the cons for a lot of companies. There are still a lot of initial blocks to get over. Fixing these things just takes time.
One thing that he still sees little innovation in but believes to be very important is building containers quickly. Dockerfiles are just not the right thing for it (at least for larger scale). Shopify spent a lot of resources on building tools to fix this problem, and have managed to build one of the largest Rails app in the world in under a minute. That's quite incredible.
Another issue that some companies care about is managing secrets. This is not a major issue for Shopify because they don't expect to get extra security out of Docker. More security is nice, but they expect that if people can break into the container, they have access to the host.
More on managing secrets here.
But these things are getting better and companies should be looking at Docker for some sort of CI and development environment this year. Shopify took a very aggressive approach by running it in production first with their biggest app, whereas a lot of people try with their smaller apps, testing, or development environments. They took this approach because they believe that if they couldn't make it work for the app, there was no reason to switch. Obviously now it has been proven that you can run larger apps with Docker.
In fact Yelp has gone to a more Microservices architecture. One of the questions Simon asks his friends who have moved to Microservices is "how do you test across these apps?" Yelp has a great way of doing this. As part of CI, they pull down the latest container of another service and test against that. If you do this in development, you could also do this by spinning up all these containers.
With large monolithic applications your testing and CI gains will be much smaller, but Simon still believes it is worth looking into so you are not left behind. But it's probably not worth taking to production just yet.
Simon drew up a matrix of where your architecture is vs. how important it would be for your to invest in Docker.
(Taken from this slideshow )
Do you build containers with your code in it, or do you map your code into the container and lose some of the value of Docker as it starts to depend on external state. Asked by disclosure5
Great question. This was a huge question within the community in 2014. Should you only be changing your containers when you change system dependencies or should you do it on code base changes as well? Simon thinks that the community has mostly settled on putting everything inside of the container. If you need to deploy a container with a system update (like a package update for example), you need to replace the container.
Deploys need to be consistent whether you update a package or deploy code. You don't want two different procedures.
In development he would recommend mapping in the source code and overriding wherever you have it. The only reason he recommends this is for iteration speed.
How do you get experience working on large scale applications when you don't have access to a large scale app?
He got that experience from essentially working on it. He started at Shopify really wanting to work on this stuff and he was lucky enough to get this opportunity.
Obviously it is a very hard thing to pick up on your own. There is a lot of tribal knowledge in the field, and there are a lot of things that are not written down in blog posts, but organizations practice internally.
A good way of getting this knowledge is reading a blog like Highscalability and listening to podcasts like this one. Question how apps end up with a certain architecture. Write down the words you don't understand and spend some time researching.
This is still very hard unless you are immersed in it because you're not facing these problems.
If you want to get good at it, you have to practice it.
Hiring a Site Reliability Engineer or Software Engineer, what kind of skills and mindset would you look for?
The thing they look for in people they hire is endless curiosity. During the interview process, Simon likes to ask people about a project they are working on. He wants them to talk about something they are comfortable with because then he can get an idea of how much they care about a problem and the kinds of questions they ask themselves. If they are working on a small app, he will ask "what would you do if you had a 100 times more visitors?" How would your database react?
They might say, "oh it would be important to have an index." "Ok, so how do you make sure that this is indexed? How does an index work? How is it stored in the database?"
They might go all the way down discussing how a B-Tree is more effective on disk than a Red-Black Tree even though they have the same sort of complexity. Of course he doesn't expect every candidate to go all the way down there, but he can very easily tell from the answers to his questions whether they have thought about this sort of thing before. These are the kinds of people he wants. The people who look at Twitter and immediately start thinking "I wonder how this is built. I wonder how they scale this."
The best engineers that he sees are creating these distractions for themselves -- be it debugging something or creating some sort of a mental challenge and allowing these distractions to be an opportunity.
How can people reach out to you?
Simmon's Twitter feed.
How did this interview help you?
If you learned anything from this interview, please thank our guest for their time.
Oh, and help your followers learn by clicking the Tweet button below :)