Simplicity at scale with Simon Eskildsen of Shopify
Interviewed by Christophe Limpalair on 01/18/2018
Can you imagine migrating a large and heavily used platform to the cloud, when you know that each second of downtime costs customers real money? There is no room for error. If you're in this situation or if you're curious to know how Shopify is approaching this challenge, this episode is for you.
If you're curious about running Kubernetes for monolithic and microservice-oriented architectures, this is also for you.
Finally, if you're interested in scaling a team from a few engineers to hundreds, while maintaining a simple, clean, scalable codebase, and still delivering within deadlines, you'll want to hear what Simon has to say.
That's not even all of it - I know, there's a lot in this episode and if you're like me you won't be able to stop listening. Enjoy!
Shopify is an incredible platform in terms of scale. They handle a lot of load, and they were one of the first major platforms to run Docker in production. If you're interested in that, definitely check out the previous episode with Simon. While we don't focus as much on Docker in this episode, we do talk about Kubernetes and many other interesting topics. Enjoy!
Christophe: Simon, you have a blog post titled "Shitty First Software Drafts." In that, and I quote:
"Tobi, the CEO of Shopify, has mentioned on more than one occasion that git reset --hard, which blows away all of your work, is his favorite feature of Git. He said that if you can't blow away all of your work and write it again from scratch in one hour, you have not found the simplest solution yet."
How has this shaped your work, Simon, and the work of everyone at the organization?
Simon: Yeah, I think this story is one he told me in my first couple months at Shopify and I thought it was somewhat fascinating. He has this very deep appreciation for simplicity and I think this is a great way of talking about it. The origin of the story apparently was that him and the CTO at the time were at this conference and they were working on the CSV import feature. You have this massive CSV of products or whatever, and you import it into your database, and they didn't have the feature at the time. I think they stocked the hotel fridge with beer and then at the top of every hour, blew away all the code with this notion that if they couldn't implement it in an hour, they hadn't come up with the simplest possible way of constructing the software.
My favorite analogy to that, which I mention in the post, is that there are these famous Picasso paintings where you see a bull or a dog, and it's one line but you can recognize that it's a bull or a dog or a penguin. You look at that and say, "Well this Picasso guy clearly did that because he doesn't know how to draw." You're only seeing the tip of the iceberg. If you look at the whole process and all the drawings that came before it, you see that it's this extremely elaborate, detailed drawing of a bull or a dog and there's like sixteen revisions. So what I say sometimes, internally here, is that you need to draw the bull again. You need to home in on the core idea. I think that's really fundamental, because if you don't boil the idea down and really distill it, the same way you do with a shitty first draft, you're handing over this poor mental model and it can become very difficult to refactor something, take a step back. I think it's really important because you'd never hand someone a first draft, but you'd happily make a PR when your code is passing the tests.
I really think that our software is kind of cursed and blessed by the fact that we can easily prove whether it does its function. It's very easy to call it done. Writing is blessed and cursed by the fact that it's much more vague, it's whether people understand your idea, so you just keep honing it. I think programming needs to balance those two things more so than we do today.
Christophe: I like that you mention that simplicity is actually harder to get. So if you're writing, if you're an author, or a programmer, whatever it is, you start with a crappy draft at first and then you have to refine it. But then how do you know when it's done? You have deadlines you have to hit, and you have something that works, but you know it's not the simplest possible way. Do you compromise? Do you keep trying to refine it? How do you juggle that?
Simon: I think that's where experience and intuition have to come in, to some extent. What's difficult is that you might look at a codebase and say, "Hey, I think that's really complicated, I think we need to sit down and come up with a simpler solution," but you don't know what it is because that person has more context on it. So you have to trust your judgement, and there are several features that you could consider, like the deadline. How many people are going to be touching this code, how fundamental is it, how long is this code going to live? And you sort of have to balance all of those things. You may not do it consciously, but it all comes down to the sum of those things. Sometimes it really does make a difference to blow it away every hour until you have the simplest possible thing, if this is something fundamental to how the code is going to work, if it's something a lot of people are going to look at and form a mental model of how the code is going to work. Then it's really worth it. But if it's something you know is going to be killed a couple months from now and is pretty low-risk, an area that a lot of people understand, then maybe there's a little bit of room for tech debt.
Christophe: What kind of scale are you guys running at nowadays? I'm curious to compare it to two years ago. What kinds of stats can you share?
Simon: I wouldn't remember off the top of my head what we were at two years ago, but today we're running at about 80,000 requests per second on a Rails app when we hit peak. It's been quite a lot, it's been compounding every single year. We're running out of two datacenters now, starting to experiment a bit with the cloud, but having several thousand servers right now, I believe, without having the exact number or knowing what that is. There's a lot of workers as a facet of Rails, which needs a lot of CPU to scale to that extent, but it's working quite well. So yeah, about 80k RPS.
Christophe: You mentioned in the previous episode that you're running from your early datacenters - here you mentioned that you've got two different ones. But you recently told me that you're experimenting with a gradual cloud migration. And you also said that you should mold your app to your infrastructure, not the infrastructure to the app. So what do you mean by that?
Simon: There's a couple different things here. I'll talk about the molding first because I think that's really important, then we can dive into the cloud migration and why that's something we decided to do. Something I'm seeing when we're migrating to the cloud are two dangerous patterns. One is that people will be migrating to something like Kubernetes, and they'll be looking at all these shiny new APIs, all these new things they can do and all these problems they can solve at a much better level, and they get excited. They start engineering. This ends up stymying the actual platform because now we're introducing all of this new risk. There's the lift and shift approach, then there's the one where you rebuild the entire thing and migrate onto that. Neither of those are probably perfect. Lift and shift is a hard thing to get through in an organization because it can be hard to justify the value proposition over time. But doing too much at once can also be a problem.
The other thing that I think is important is that you need to mold your app to the infrastructure, not the infrastructure to the app. An example of this can be something like jobs. In Kubernetes, you configure a graceful deadline. When Kubernetes sends a signal, a SIGTERM or whatever, to your process, you need to be able to shut down within this interval before it forcefully kills it.
So in order to comply with this Kubernetes paradigm, that's something you have to do. Now, you can go two paths here. You can go the first path, which is, "Well, okay, I'm just going to configure Kubernetes to have a graceful deadline of two weeks because I have some workloads that can run for up to two weeks." But in that case, you're directly violating the paradigm Kubernetes is trying to set. People who run much bigger infrastructures than we have, have decided that any workload should be able to shut down in this amount of time. If you have a workload that you cannot interrupt for two weeks, the probability of that machine dying - and if you have a lot of those jobs, losing one of those jobs - becomes incredibly high. So in that case, you can build infrastructure that tries to work around the problem and you can increase complexity at the infrastructure level, or you can go to the app and fix it at the app level, molding the app to the infrastructure and its paradigm, and not have workloads that cannot be shut down in less than thirty seconds.
For example, if you have a web request and it's been two minutes, you should probably try to work on not having that. If you have jobs that run for two weeks, instead of molding the infrastructure to support an app that had these long-running workloads, what we did was invest heavily in building a new API for people to define jobs. You define, this is the enumerator, the collection that I want to perform an action on, and this is the action that I want to perform on every single element of the enumerator. If you have these two things, you have all of these areas where, at an infrastructure level, we can hook in. Like for every single iteration, let's check if there was a signal sent and then we'll shut down gracefully, save the progress, and shove it into another job. This has had all these compoundingly positive effects. We now know that any workload can be shut down gracefully. If we went with a cloud provider, we can use instances that can shut down in thirty seconds or less and save a lot of money that way. Another thing is that when we're moving between regions, we know we can interrupt all the workloads very fast without losing anything, which means we can move between regions very gracefully.
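Editor's sketch: the enumerator-plus-action idea Simon describes can be illustrated roughly as below. This is not Shopify's actual API (which is Ruby). The names ShutdownFlag and run_iteration_job, and the integer cursor, are assumptions made up for this illustration. The loop checks a shutdown flag between elements and returns the position to resume from, so a follow-up job can pick up where this one stopped.

```python
import signal
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class ShutdownFlag:
    """Records that the platform asked the process to stop (e.g. SIGTERM)."""

    def __init__(self) -> None:
        self.requested = False

    def install(self) -> None:
        # The handler only flips a flag; the job loop decides when to stop.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        self.requested = True


def run_iteration_job(
    items: Sequence[T],
    each: Callable[[T], None],
    flag: ShutdownFlag,
    cursor: int = 0,
) -> Optional[int]:
    """Run `each` on items[cursor:], checking for shutdown between elements.

    Returns None when every element was processed, or the index to resume
    from if a shutdown was requested. The caller (hypothetical here) would
    persist that index and enqueue a new job starting at it.
    """
    for index in range(cursor, len(items)):
        if flag.requested:
            return index
        each(items[index])
    return None
```

In a real worker, flag.install() would be called once at startup, and a non-None return value would be stored and passed as cursor to the next run. Each element's action still has to finish well inside the graceful deadline for this to work.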
Another thing we've done: between each iteration of a lower importance task, we can check how the database is doing and use backoff and throttle mechanisms automatically. There are all these compounding effects that we've found. In general, when you do platform level work, it's important to consider how these good decisions and changing these assumptions can have these cascading positive effects over time. I think that we talk a lot about knowns and unknowns in terms of risk, and how we may surface something that's dangerous or can stymie the project in whatever way. But what we don't talk about in platform work is that there are often unknown benefits to something. An analogy that I like to use is that if someone set out to build a twenty or thirty story concrete building, that seems impractical. Walking up thirty flights of stairs even though you can probably just stack thirty blocks of concrete on top of each other is just very impractical. It's much easier to build a bunch of five story homes and spread out over a bigger area of land. As you look at all the European cities, that's kind of what happened. Then someone invents the elevator. You had places like Manhattan and Chicago and all these cities in the New World shooting up. You had the elevator, so you had these buildings that you could build much, much bigger. I think it's the same with infrastructure. You had the elevator first, and then came the skyscraper. People don't set out to build the skyscraper; that's too many unknowns on the path to something you can see as great. So I think as infrastructure developers working on a platform, you need to identify the elevators in the stack.
Another example of this is something like the Silk Road. Someone says, "I want to build this massive Silk Road" or whatever. To do that, you need Tor and you need Bitcoins. You don't build all those three things, that's too much. But when Tor and Bitcoins exist, that becomes a logical next step. Of course, I don't endorse it at all, but it's another example of these compounding unknown positives of platform work.
Anthony: One of the things you were talking about is that Shopify works a little bit of a monolithic architecture on Kubernetes. My two questions would be this: What challenges have you faced with that monolith, from a Rails application standpoint, and then actually running that on Kubernetes? What does that look like? When we look at Kubernetes, a lot of times we think of service-oriented, microservice-oriented architectures because of the sheer nature of how it works. Then what challenges have you faced putting that inside of Kubernetes, and how has Kubernetes really enabled you to explore some of those cloud migrations?
Simon: I mean there's a lot in that question, so you may have to follow up on some of the subquestions. But let's start by talking about the monolith and where we are with that, then we'll follow up with the Kubernetes questions. We're still a fairly monolithic company, and I think that's started to change as we've grown to several hundred engineers. We are starting to solve problems with other apps and we're seeing some velocity increases there. There's also a lot of problems with building these apps that are not a part of what we call Shopify core. The problem is that these Rails apps, on day one, they're serving - because of the network effects of Shopify and all the customers that we have - on day one, these Rails apps are serving more traffic than a vast amount of Rails apps out there.
We'll get back to that in terms of Kubernetes and how we're leveraging it. The appeal of putting things in core is that some of those scaling challenges have already been solved. To get off the monolith: one, you have to make it really easy to create services, but at our scale, you also have to make it very easy for them to scale on day one. You also have to scale Heroku. The dyno slider only goes so far - it actually has a limit. So at some point, you reach the limit there, and you need to help your internal teams build other services if they're not putting them inside of core. What we've been doing for a long time is running the main parts of Shopify in the monolithic application. The monolithic application is a big Rails app and the core commerce principles are in there, such as how you do an order or how you charge customers. How you do a checkout. Those things are still in there.
What we are starting to see, and starting to endorse, is a pattern of building features outside of Shopify core. For example, a feature that's built outside of Shopify core is something we call Shopify Pay, or "Remember Me," where you go to a checkout on one Shopify store and then it remembers you on another Shopify store, trying to expedite that checkout. It's like something you're familiar with, like Amazon, where no matter what you're buying or from which vendor, it's the same thing. That's built as a separate app. The problem is that they need this massive scale on day one. What we're trying to do with Kubernetes is - on one part, we have a lot of operational challenges with the monolith. But we also are very passionate about trying to create this platform for solving commerce problems faster at Shopify than anyone can do anywhere else. We have these amazing APIs internally, and we need to make it faster to develop at Shopify than anywhere else in the world. That's the mandate of us building a platform on top of Kubernetes.
And another big thing is to start migrating to the cloud. Someone made this interesting analogy that perhaps cloud is like what happened to power plants. Back in the day, every neighborhood or factory had its own power plant. Today, you just trust the grid. You hook into the grid, it works, and you don't care very much about where your power comes from. I see that as very analogous to the future of compute, where there are people who are specialized in providing compute at scale, and they will be fantastic at this, but most people don't have to. They just tap into the cloud and let someone else do it, the same way we consume power. But there are problems we are very good at. I don't want Shopify to be a company that is really good at building datacenters; that's an opportunity cost for us. Shopify needs to solve contextual problems that make commerce better. On my part, one of those challenges is we like working with Ruby on Rails, but we need to scale it, and we have a lot of knowledge there. But building datacenters - Amazon, Google, and Microsoft are just world class at building datacenters. I think they should continue developing that expertise. I want the engineers here to focus on things that are more specific to how we run and operate software, and we'll build a platform that makes commerce better as a result.
Anthony: So you're running a lot on-premises right now and you're running a little bit in the cloud. Does Kubernetes make it easier to test those workloads? You're running Docker inside Kubernetes, as your container engine. Does it make testing and transferring those workloads easier?
Simon: Absolutely. Kubernetes is a massive part of this dream to make it easier to build applications inside. Kubernetes has these really simple APIs - "I just want to run this thing and then a team just goes and figures it out." It has these very carefully designed APIs about how you do things. They're very limiting, but in a relieving way. Even though the amount of parameters you have may be limited, they're very carefully considered. They come from people who have run this stuff for a very long time, and that's what speeds us up. We now spend time writing controllers and writing proxies. Kubernetes is the skeleton and all of these parts are helping us move faster - controllers are fantastic APIs for us to get the majority of our work done. It helps us move those workloads in that it provides these APIs in this skeleton for doing those things. Kubernetes is defining this sort of de facto API and if we want to go multi-cloud and use different providers, this is sort of the standard. The same way Docker has defined this carefully designed API for how we run containers, I see Kubernetes the same way. People who are coming to Shopify from other companies, starting to work on the infrastructure, may already be familiar with these APIs. There's already fantastic documentation out there so we don't have to duplicate that stuff, it's just of such high quality. We don't have hundreds of contributors that can contribute to that one project, but Kubernetes does. That's just reiterating the power of open source.
Christophe: Now, in terms of the actual codebase itself, how do you approach refactoring something like that? Moving forward, how does that work?
Simon: I think we talked a bit about this in the previous episode, but when you move a workload into a new type of infrastructure - in this case, something like Kubernetes - again, you have to mold the app to the infrastructure and these new paradigms. There are parts of that which require a lot of refactoring. For Kubernetes, we haven't had to change too much in the app. A lot of these things come as second order effects of just preparing your app for containers already. There are things here and there, where you can't rely on the host being there, and things like that.
But some of the areas where we've embarked on very large refactors have been the way we've been doing sharding over the past couple years. We started sharding about four years ago, but since we last talked, Shopify is now running multi-DC (across multiple datacenters), we're moving shards around, moving shops around between shards. We have this sort of very dynamic infrastructure. As part of that, we had to do some massive, massive refactoring in the codebase to support this paradigm. If you're sharded, it means that every single web request, every single job, should only be able to touch a single shard. If it touches multiple shards, it means it's trying to span connections across potentially multiple different regions, which may even be on different continents. That's unreliable. That's a massive refactor.
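Editor's sketch: one way to picture the "a request or job may only touch a single shard" rule is a guard that pins each unit of work to the first shard it uses and refuses any other. This is an illustration only, not Shopify's implementation. ShardContext, unit_of_work, use_shard, and CrossShardAccess are hypothetical names.

```python
from contextlib import contextmanager
from typing import Iterator, Optional


class CrossShardAccess(RuntimeError):
    """Raised when one unit of work tries to use more than one shard."""


class ShardContext:
    """Tracks which shard the current request or job is pinned to."""

    def __init__(self) -> None:
        self._pinned: Optional[int] = None

    @contextmanager
    def unit_of_work(self) -> Iterator[None]:
        # Wrap each web request or job in one of these blocks.
        self._pinned = None
        try:
            yield
        finally:
            self._pinned = None

    def use_shard(self, shard_id: int) -> int:
        """Call before each query; returns the shard that may be used."""
        if self._pinned is None:
            self._pinned = shard_id
        elif self._pinned != shard_id:
            raise CrossShardAccess(
                f"already pinned to shard {self._pinned}, "
                f"refusing to touch shard {shard_id}"
            )
        return shard_id
```

A guard like this turns an accidental cross-shard query into an immediate, explainable failure in tests instead of an unreliable cross-region connection in production.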
We're also seeing refactors in terms of trying to scale the monolith for development velocity. If you have a "vanilla" Rails app, and you have hundreds of models, you have callbacks that reach deep into the guts of very unrelated parts just because it was an easy shortcut. So we are doing very large refactoring on that front as well to make the codebase more maintainable. And when you have hundreds of developers working on the same codebase, there's a bit of an art to figuring out how to refactor these codebases at scale. Some of the things we started doing - it's not a new idea, but the name is new - we call them "shitlists." And what a shitlist is, is if you are working on some kind of large refactoring or trying to get something done and there's hundreds of other developers, it's not sufficient to send an email saying, "Hey, I'm doing this, please don't do that." It just doesn't work. By the time you're starting to make progress, a couple tens of new developers have joined, who never read that email and will never read it, and some people may have archived it. Maybe they were on vacation and did a rage inbox zero when they came back. You don't know, but it doesn't scale.
So what we've started doing, and what we're starting to develop a culture of doing, is that when you're making a big change - such as upgrading Rails - you deprecate a new pattern, but instead of logging to standard error, "Hey, please don't do this because of this reason," you raise an exception. Sometimes it's multiple paragraphs in the exception telling you why not to do this, what you need to do instead, which team's progress you're stifling because of this change, and so on. The other powerful thing about this is, for example, if you have a list of, say, classes that violate some principle. For example, maybe all of these classes are jobs and you cannot introduce more jobs, so when you add a new job, it raises an error message that tells you this is why you cannot introduce more jobs into the codebase. This is a problem and this is a problem and this is a problem, and long-term, this is to your benefit as well, even though in the short term, it might hurt you a little bit. Then, the other team that goes and refactors all those jobs, they know the bleeding has stopped. They know they're communicating effectively.
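Editor's sketch: the shitlist pattern, where existing offenders are listed and any new one fails loudly with an explanation, might look roughly like this. Everything here is hypothetical: the class names, the list contents, and the wording of the message are invented for illustration and are not taken from Shopify's codebase.

```python
class ShitlistViolation(Exception):
    """Raised when new code adopts a pattern that is being phased out."""


# Hypothetical: job classes that already use the old pattern. The list may
# only shrink; the owning team removes entries as it refactors them.
LEGACY_JOB_SHITLIST = frozenset({"ImportProductsJob", "SendReceiptJob"})

SHITLIST_MESSAGE = (
    "{name} inherits from LegacyJob, which is being phased out.\n"
    "Why: legacy jobs cannot be interrupted and resumed safely.\n"
    "Instead: define the job with the newer interruptible job API.\n"
    "Questions: talk to the team that owns this shitlist."
)


class LegacyJob:
    """Old base class. Only shitlisted subclasses may still exist."""

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Runs when a subclass is defined, so a violation fails at import
        # or test time, exactly when the developer introduces it.
        if cls.__name__ not in LEGACY_JOB_SHITLIST:
            raise ShitlistViolation(SHITLIST_MESSAGE.format(name=cls.__name__))


class ImportProductsJob(LegacyJob):
    """Existing offender: allowed until someone refactors it off the list."""
```

Defining a new subclass such as class ResizeImagesJob(LegacyJob) would raise ShitlistViolation with the full explanation. Refactoring then becomes a matter of deleting one name from the list and fixing whatever fails.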
I think this is a really important philosophy to adopt: when you're making large refactors across codebases, and with people, have a little empathy when someone is trying to do something that is stifling your progress by raising a really good error, or pointing people toward docs that say, "This is why you can't do this. This is how it benefits you. This is what you should do instead." You can't just tell people to not do something, you need to tell them what to do instead or who to talk to. In general, I think when you have so many engineers trying to work on different things, with different pressures, different incentives, you need to align around something. People will respect a failing test, right? So that works really well for us. We've seen this pattern where if you change this file, then this team will be tagged on GitHub. That's just some safeguarding that doesn't really work, so we've been able to make large refactoring - upgrading Rails major versions - without going two steps backwards by communicating with errors, communicating with developers just at the right moment so that they cannot do something.
Anthony: One of your posts - and I'd suggest reading it - it's specifically about shitlists. And I'm going to quote you here, it says:
"If you own a shitlist, you must empathize with the problems of everyone who’s running into problems. If you simply deprecate new behaviour and don’t offer an alternative, you will be the source of frustration."
So obviously the end result is amazing. You put this process in place that helps your process scale, which is always a struggle for companies. My question to you is based off this text in your post, were there any frustrations or challenges in actually implementing the shitlist? How did you get around those?
Simon: This comes out of experience, right? We've had people who reach the shitlist, where the errors were just, "Don't do this." You'd have people coming to you who are losing trust with you because you're just hindering their progress and they don't understand why. So we learned the hard way that if we have hundreds of developers and we're asking them to do extra work, we also need to tell them why. The errors would say, "If you're doing this, it will hinder the progress of us running in multiple regions at the same time. This is important because of reason X and Y." So that was really important, and it was a lesson we learned the hard way. The same thing happens if you just log something to standard error. People will just ignore it. It's not enough of a nudge for you to do something about it. You need to have some powerful primitives there.
Anthony: You also put in here:
"Delegating with shitlists is great. Due to the tight feedback loop, asking other teams or onboarding new team members becomes much easier."
One of my questions with that is as you bring on new team members or you roll out something like shitlists, how do you communicate or get all your engineers up to speed or maybe get a new hire onboarded, and really understand the value in that?
Simon: I think the value of the shitlist is expressed by whoever owns it. I'll give you an example. We have this database that we call "the master database." Every shop has a little bit of data in there, and these are all shared responsibilities between multiple shops. An example of this could be the representation of an app or currency rates or tax rates. These are things that are shared between multiple shops. Now it's very appealing to just add another table to this global database that is just magically available everywhere. But we can't scale that, so we have a shitlist that says "Do not add new tables and new behavior into this database. And this is why." Now we have the shitlist of all the tables in master and we go and find all the teams, and tell them, "Hey, this table that you have in the master database - as we continue to scale, we add more shards, more regions, more customers, the availability of this table will decrease. We cannot support this primitive. This is the end of life date and you need to remove this table from the shitlist; either move it to a different app or shard it by this date. The reason why is that you'll lose trust in your product because it will be less available." You need to create some incentive there. If the incentive is only for your team, that's a hint that this is maybe not important enough of a shitlist to erase, or you just have to do it as a team because you have the biggest incentive to do it. Or, the third thing, you can incentivize from different angles. Sometimes it's not quite that false dichotomy of they do it or you do it, it's somewhere in between. But you need to make it extremely clear why this shitlist exists.
For delegation, I think it's really great because it's quite satisfying spending time figuring out, "How do I define the shitlist, automate the process, and stop the bleeding?" And then you just start removing one entry at a time, converting them to new patterns and things like that. It's really easy to delegate and onboard someone to the team because you say, "Try to remove that job class from this array of things that are shitlisted and see what tests fail. Make the tests pass and merge your PR. One at a time." And you learn a lot like that. It's not usually that simple, usually these shitlists are quite high-level, so getting an element off of it involves some big technical challenges, but it stops the bleeding at a high level.
Christophe: This is a really good segue into it. So maybe we can cover it a little bit and then go back to the cloud migration and experimentation. In terms of growth - correct me if I'm wrong, Simon - but I believe you had 150 employees and now the company is at 1500+. Is that right?
Simon: In 2015, we were not 150, that was back in 2013. Today, we're at about 2000 or so. I think those are the latest reported numbers.
Christophe: Which is significant growth. Every company that grows like that, even at a much smaller scale, experiences these problems where they might onboard more developers, more engineers, or anyone, and the communication processes break down. In terms of development and what you're talking about here, what are some of the processes that you guys have been able to put in place that have helped you scale from that smaller size into a bigger size now?
Simon: I think one big thing was these shitlists and patterns that make sure you don't stymie other teams' progress on something.
A second big investment we're making is that we're really trying to make the code easier to change. As I mentioned earlier, if you have these hundreds of Rails models that just communicate willy-nilly with each other, they become really hard to understand. There's no one who understands the whole codebase, so you have to shrink it down. So what we're trying to do now is create components. A component is essentially - you can think of it like a Rails app within a Rails app, but they're sort of nested. So you have this component that's clearly defined, and if you're communicating with this component, these are the APIs.
So an example of an API could be maybe a billing component. A billing component has an API of: change the subscription, charge this much, charge this much on a recurring basis. The data model is completely opaque to you. The tables behind that component barrier are not doing any joins outside of it. There are no callbacks outside of it. All the callbacks have to be done with contracts to other components. This is, to some extent, what people do with microservices. If you're creating a service and you're putting this TCP/IP barrier, or whatever RPC protocol on top, you're making it very clear how to communicate with that other app. What we're trying to challenge is that maybe you don't need that TCP/IP barrier for everything. And even if you do need it, you can stop just before you're ready to move it out into a separate app and perhaps there's a lot of value there. So a lot of investment in containing complexity is really important, but I would challenge the notion that you absolutely have to have services.
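Editor's sketch: a component boundary inside a single process, as opposed to a network boundary, could be pictured like this. It loosely follows the billing example above, but the method names, the Subscription value object, and the in-memory storage are assumptions for illustration only. Callers see three operations and a read-only total, and never the storage behind them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Subscription:
    """Value object handed to callers; not a database row."""
    shop_id: int
    plan: str
    monthly_cents: int


class BillingComponent:
    """Public API of a hypothetical billing component.

    Storage is private (underscore-prefixed), standing in for tables that
    other components may not join against or hook callbacks into.
    """

    def __init__(self) -> None:
        self._subscriptions: Dict[int, Subscription] = {}
        self._ledger: List[Tuple[int, int]] = []

    def change_subscription(
        self, shop_id: int, plan: str, monthly_cents: int
    ) -> Subscription:
        subscription = Subscription(shop_id, plan, monthly_cents)
        self._subscriptions[shop_id] = subscription
        return subscription

    def charge(self, shop_id: int, amount_cents: int) -> None:
        if amount_cents <= 0:
            raise ValueError("amount_cents must be positive")
        self._ledger.append((shop_id, amount_cents))

    def charge_recurring(self, shop_id: int) -> None:
        """Bill one period of the shop's current subscription."""
        self.charge(shop_id, self._subscriptions[shop_id].monthly_cents)

    def total_charged(self, shop_id: int) -> int:
        return sum(amount for shop, amount in self._ledger if shop == shop_id)
```

Because callers depend only on these methods, the component could later move behind an RPC boundary without changing them, which is the "stop just before" option Simon mentions.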
And for a third one that I think is fun to mention, I think you have to, as you grow to many hundreds of developers, there are cultural things that you lose. If I go down to the water cooler, I don't necessarily know who's there, what they're working on, and I may not even know the question to ask them. So you definitely lose culture when you scale, but I think it's important to consider what culture you can gain. For example, we've had engineering talks for a long time, internally. But when you have hundreds of developers, there are so many people that show up. The leverage of doing an internal talk is really high when you have a lot of people. Another thing is, as a function of scale instead of in spite of scale, has been to run these internal podcasts that are only available to Shopify employees, where we talk about engineering, we talk about what people are working on. The recruiting team is doing this thing where they go around the company and they interview people: what is their life story? How did they end up here? There are a lot of things that you can do because of scale, and you can only do when you have a large scale company. We tend to focus too much on all the things we can't do because we're at a large scale company.
Anthony: I'm gonna ask a question that's maybe not directly related to the engineering standpoint. But you have hundreds of engineers, you have this amazing product that you're supporting, thousands of customers. What role, if any, does a product owner or a project manager play in your day-to-day lives for your engineering team?
Simon: I can't really speak to that because I'm not really on the product side. In production engineering, we don't really have project managers, and production engineering is the team that I'm a part of. We don't have product managers or project managers. That may change because we're now building this internal cloud product that we want people to work on internally. When we're building that, that's a product. The same way someone like Heroku or Amazon has a front end and a user experience. So that might change, but for the time being, it's not something I have huge insight into. The product teams, of course, have product managers and project managers, that look out for cross cutting concerns, talking with teams that have relationships with other teams. Like we just launched an integration with different shipping providers and you need relationships there instead of having a big hub across the company. But it's not something I can speak much to because it's not something I experience day-to-day.
Anthony: No, that was fantastic, I appreciate you diving into what you did.
Christophe: How do you structure those infrastructure teams or even more specifically, the SRE teams? What does that typically look like?
Simon: I'll give a tiny bit of history because I think it's useful and it's not something we've talked much about publicly. We used to have a classic operations team. These were people who were just on call for everything. No matter what broke, they were on call for it. But you get beyond a couple of engineers and things start to break down. Incentives don't quite line up because you have this team that, by design, is going to be very conservative about change. You have this team that is very risk-averse, and you have this other team that will take a lot of risks and put things into production because they know they're not going to be on call for it. They can just ship it on Friday and go. The incentives don't quite line up, which is why this whole DevOps thing happened. Another thing was that we had exponential growth. You can just hire for a while, but at a certain point you can't hire fast enough to support that exponential platform growth anymore, so you need to do a lot of software development to automate.
I think we hit that critical threshold in about 2015, where we restructured all of our operations and we had our performance engineers, and we sort of merged the two and they're part of this new team we call production engineering. In production engineering, we define a couple high level areas of responsibility.
One area is what we call developer acceleration. Our developer acceleration is a team that is very much advocating for our product developers, who are the developers that are creating commerce features. They provide a bridge between them and infrastructure - they build things like our CI/CD pipeline. They build our developer infrastructure, where we have these fantastic tools that build on top of Docker for Mac, they have super cool YAML definitions, they're blazing fast. They build our Docker containers. They do all this stuff, so we call them developer acceleration. Their mandate is to make development better at Shopify.
Another team we have is the traffic team - they're responsible for ingress and egress of traffic, so buying traffic from traffic providers and ISPs. They're responsible for DDoS mitigation, banning bots, all those sorts of things. So anything that is related to traffic, from the customer to the load balancers, is handled by that team. Then we have a team that's the data storage team. They're responsible for relational data stores primarily. We have a team that's responsible for our Redis and Memcache, our queuing and caching infrastructure. We have a team that's responsible for what we call "data delivery," which are things like Kafka and ElasticSearch and pipelines into various other facets of engineering.
Then we have the team that I'm on, which is called the pods team. We do a lot of cross cutting concerns and mandating the architecture of the Shopify core. We've worked a lot on scaling, multi-DC work, moving shops between shards and things like that. That's sort of the overall way that we're structured.
One last team that I think is worth mentioning and very interesting is what we call the operational excellence team. The operational excellence team is around 8 people who work full-time on creating tools and processes that make it easier to be on call and things like that. So for example, we have a chat bot and it has an extension to handle incidents. So when an incident starts, you tell the chat bot, "This incident started." It will start giving you checklists. After five minutes, it might say, "Hey, do you need to contact more on-calls?" After forty-five minutes, it may say, "If this is a large scope incident, you may need to notify senior leadership about this." After ten minutes, it might tell you, "Hey, have you reached out to support and told them very explicitly what the extent of this incident is?" Meanwhile, it also keeps a log of everything that's going on. When you have messages and graphs of people trying to debug a problem, you can react with emojis and that puts them into a timeline, which goes into a report at the end of what happened during the incident. This is just a brilliant tool that becomes an everyday thing and makes it easy to update the status page. They work on tools like that.
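The incident bot's behavior can be sketched as a schedule of timed prompts plus a timeline of flagged messages. This is only an illustration in Python: the `Incident` class, its method names, and the exact prompt wording are hypothetical, and the minute thresholds are the ones mentioned above.

```python
# Hypothetical prompt schedule: (minutes since incident start, message).
PROMPTS = [
    (5, "Do you need to contact more on-calls?"),
    (10, "Have you told support the exact extent of this incident?"),
    (45, "If this is a large-scope incident, consider notifying senior leadership."),
]


class Incident:
    """Tracks which prompts were sent and which messages were flagged."""

    def __init__(self):
        self._sent = set()
        self._timeline = []

    def tick(self, elapsed_minutes):
        """Return prompts that are now due and have not been sent before."""
        due = [msg for minutes, msg in PROMPTS
               if minutes <= elapsed_minutes and msg not in self._sent]
        self._sent.update(due)
        return due

    def flag(self, minute, text):
        """Stand-in for reacting to a chat message with an emoji."""
        self._timeline.append((minute, text))

    def report(self):
        """Flagged messages in chronological order, for the final report."""
        return [f"{minute} min: {text}" for minute, text in sorted(self._timeline)]
```

A real bot would be driven by chat events and a clock; here `tick` takes the elapsed minutes directly so the logic is easy to follow.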
Another tool that's worth plugging here is what we call ServiceDB. ServiceDB is this operational excellence initiative where we have this massive - you can imagine it as a spreadsheet, a massive spreadsheet of every single app that we have internally. The columns are things like, "Do you have an on-call? Do you have a CI/CD pipeline? Has this app been deployed within the past month?" All these different checks. And the more checks you pass, the higher tier you are. A tier one app has all of those things and way, way more. A tier four app that's just starting out and running some kind of experiment obviously needs much less of this and can optimize for speed. But this gives us a report where we can give each director a score based on how well they're performing and how well their apps are performing. And it gives us this fantastic cultural leverage tool. We can test things like: Chaos Monkey may, in the future, hook into ServiceDB and give us a Chaos Monkey report and say, "Hey, you're not living up to your requirements as a tier two application. You need to invest here or these other apps may start refusing to talk to you because you're not meeting your SLOs." So this is just a really powerful tool under operational excellence. I really like that team, and that's sort of a deep dive into what the different teams here do.
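The tiering idea behind ServiceDB can be sketched as a set of checks per tier. The sketch below is Python with invented check names and tier requirements. Only the on-call, CI/CD, and recent-deploy checks come from the interview, and `has_slo` is a made-up fourth check added so that tier one asks for more than tier two.

```python
# Hypothetical requirements per tier; lower tier numbers demand more checks.
TIER_REQUIREMENTS = {
    1: {"has_on_call", "has_ci_cd", "deployed_last_month", "has_slo"},
    2: {"has_on_call", "has_ci_cd", "deployed_last_month"},
    3: {"has_ci_cd"},
    4: set(),  # Experiments can optimize for speed.
}


def achieved_tier(passed_checks):
    """Return the most demanding tier whose requirements are all met."""
    for tier in sorted(TIER_REQUIREMENTS):
        if TIER_REQUIREMENTS[tier] <= passed_checks:
            return tier


def missing_for(declared_tier, passed_checks):
    """List the checks an app still fails for the tier it claims to be."""
    return sorted(TIER_REQUIREMENTS[declared_tier] - passed_checks)
```

A report per director could then aggregate `missing_for` across the apps they own, which is one way to picture the score Simon mentions.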
Anthony: I have a quick question. What I've heard is bits and pieces of your stack. You mentioned tools, but are there any other major tools that have been game changers for you, that you consider part of your stack? You mentioned Kubernetes, obviously, ElasticSearch, Redis, Rails - obviously that's your coding framework - Chaos Monkey, and internal custom tools. What other components or tools would you consider part of the Shopify stack?
Simon: I would say Nginx and OpenResty - I love those tools. OpenResty was something that we started making part of our infrastructure about 2-3 years ago. I know you want a bit of an overview but I just want to plug this because it's an amazing piece of technology. It allows you to write Lua scripts on your load balancer, and we've written an exponentially weighted moving average load balancing algorithm that's been phenomenal for us. We've written all of our multi-DC routing on top of this. When we do failovers between regions, we pause the requests while moving the shard to another datacenter to avoid downtime. We've banned bots with these Lua scripts as well.
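To make the load balancing idea concrete, here is a minimal Python sketch of choosing a backend by an exponentially weighted moving average (EWMA) of observed latency. It is not Shopify's Lua implementation. The class name, the smoothing factor `alpha`, and the rule of picking the lowest average are assumptions for illustration.

```python
class EwmaBalancer:
    """Picks the backend with the lowest EWMA of observed latency."""

    def __init__(self, backends, alpha=0.3):
        # alpha is the weight given to the newest sample (assumed value).
        self.alpha = alpha
        self.ewma = {backend: 0.0 for backend in backends}

    def pick(self):
        """Return the backend whose smoothed latency is currently lowest."""
        return min(self.ewma, key=self.ewma.get)

    def record(self, backend, latency_ms):
        """Fold a new latency sample into that backend's moving average."""
        previous = self.ewma[backend]
        self.ewma[backend] = self.alpha * latency_ms + (1 - self.alpha) * previous
```

Because recent samples weigh more than old ones, a backend that slows down loses traffic quickly and regains it as its latency recovers.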
We talked earlier about these positive known unknowns and building elevators, and OpenResty has been one of those elevators. We introduced it for one or two things and then the amount of useful things we've been able to build has just been absolutely incredible. So OpenResty and Nginx is the big one. Another one is we're using Zookeeper for a bunch of things. Like Nginx, it's one of those things you just kind of run and it never complains and it just works. It's a fantastic piece of technology as well. We're running Rails, I think we said MySQL earlier. We run ProxySQL in front of those and we're using Kubernetes. We've been using Docker for a number of years - we're using Docker for Mac locally. We're using Memcache for a lot of our caching needs, and that has a fantastic reputation for being a very stable infrastructure component. You can see Facebook, how they used it there. I'd say those are some of the main components that we use. Another thing is that we limit the number of languages that we use - we use primarily Ruby and Go. It keeps velocity high because it means people can easily move between projects and read the code. Go tends to be good at a lot of things that Ruby is not, and Ruby tends to be good at things that Go is not as good at. Those two complement each other really well. Is there any specific part of the stack that you're curious about? I'd be happy to talk about that.
Anthony: Not at all, in fact, I was just looking for that overview. Have you guys evaluated or thought about serverless technologies at all? And if so, what type of role do you see it playing, if any?
Simon: Yeah, I think it's super exciting. Serverless technology is really interesting because as I mentioned earlier, we have all these apps that are reaching a really big scale on day one. I think one of the big things I want to solve is making scale a non-issue so we can focus on solving problems for our customers and not solving these scaling challenges. Serverless architecture really falls in that bucket. It's clear that if you follow this set of paradigms, your thing will just scale endlessly. I really like that and I'm interested to see what sort of patterns we can start doing now to prepare for a future where this might be more relevant. It's not something I'm betting everything on just yet. Getting Shopify to run in that would be almost impossible just because it's a Ruby process and it takes, you know, 30, 40, 50 seconds to boot. But there's a lot of interesting things there. I see it as the other extreme of auto scaling. If you can't already auto scale, which we have problems with, then it might not work so well. But I'm very interested to see what kind of patterns can we adopt today that puts us in a good place, not just now, but in the future where this might be more relevant.
Christophe: Going back a little bit to cloud migration, as we start to wrap up the episode here, what's the strategy behind doing that? I know when we spoke two years ago, I don't think you had any plans - at least not publicly - to try a cloud migration of any kind. But now it sounds like you're experimenting with it. So what's the thought process behind that?
Simon: I alluded a bit to it earlier. We don't want to be a company that's really good at running datacenters, and that's the position we felt that we were falling into. We saw a pretty clear fork in the road where we could invest a lot more in these tools we already had, making it much easier to provision servers and use Chef and things like that. Or we could let someone else do that and let all these amazing engineers who built this fantastic datacenter tooling work on building a platform. We want to focus on the abstraction level up from hardware. So I think that's the biggest factor. We believe we can increase our engineering velocity by being on the cloud, and at the end of the day, that's how we win or lose: it's whether we can engineer quicker than everyone else.
Christophe: You've also mentioned that at your scale and with this company, you don't want to be specifically restricted to one cloud platform. So finding some of the tools that allow you to do that - Kubernetes, all of those - is very beneficial. If somebody else, let's say another mid to large scale application, is trying to do a cloud migration, what are maybe 3-5 tips you'd give them right now?
Simon: I think it's the mantra of mold the app to the infrastructure that just pays off so handsomely later. The second thing is don't add too many features when you're migrating. Focus on actually migrating. You can get into this very precarious state where you're maintaining two different architectures - one on Kubernetes and one not - and if you're in that for a long time, that can really slow down the amount of progress you can make because you have to do everything twice. I think those are really the two main ones. Invest in the things that will work over the long term, and don't spend the time migrating to all the new patterns just yet. Hold your horses - if you've survived so far without all these APIs, you can probably survive until you've migrated everything over. It's a bit more of the lift and shift than adorn everything and then move. I think either of those extremes is perhaps a bit naïve, but I would err more on the lift and shift side of that spectrum.
Christophe: You had a talk at GOTO where you talked about the fact that you're working toward performing very low downtime moves of shops between shards and dealing with those. Can you give us a little more information into that? We'll also put a link to the video for that GOTO talk so there's a little more information there. We also talked a little about shards in the previous episode, but what can you tell us about that?
Simon: The main thing I've been working on over the past two years is called the podding architecture, which is to take sharding to the next level - not just sharding MySQL, but sharding Redis, sharding Memcache, sharding everything. So instead of running one big Shopify with multiple data stores, we run a ton of very small Shopifys. When you do that, you can imagine that the Shopify over here has some stores that are much larger than the Shopify over here. If you imagine all these little Shopifys on a bar chart, you might get something like this. Every single bar is a different size - one is like this, one is like this, and so on. So what we're starting to see is that the Shopify that has the bar like this has a lot more outages than the one like that. And in this one, because you have so many of them, there might be some really big stores that take up maybe half of all the traffic there. Perhaps they should have their own little Shopify. Maybe there are a bunch of shops that can be moved around. The idea is to move the shops between all the mini Shopifys to flatten that graph. The size of every single pod, each mini Shopify, should be as close to the average as possible. That's been a really big goal in minimizing downtime. A shop might be many gigabytes of relational data, and that takes a long time to insert properly, building the indexes and so on. We've been really investing in a binlog based approach, where we just tail the binlog and move the shop, lock and unlock it, so we can do that in seconds. It's not quite done yet, but that's something we've invested a lot in. And just general architecture, building on top of OpenResty, and that's what I talk about in the talk. It's the pods architecture, and we have these mini Shopifys running around the world that we leverage. If you have a big sale, you can steal resources from some of the other Shopifys.
So a disproportionately large store, such as Kanye doing a sale, can steal some resources from some other pods while they're having a sale, then give it back to the platform afterwards and leave a little for Kylie Jenner, who can have a sale afterwards. That's something we've invested a lot in, horizontally sharding really hard and ensuring we do not have platform-wide outages.
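The goal of flattening the bar chart can be sketched as a greedy rebalancing loop: repeatedly take a shop from the most loaded pod and move it to the least loaded one, as long as that narrows the gap. This Python sketch is an assumption about how such a planner might look, not Shopify's tooling. The function name, the single "load" number per shop, and the greedy strategy are all illustrative.

```python
def rebalance(pods, max_moves=100):
    """Plan shop moves that bring pod loads closer together.

    pods maps pod name -> {shop name: load}; it is modified in place.
    Returns the planned moves as (shop, from_pod, to_pod) tuples.
    """
    moves = []
    for _ in range(max_moves):
        totals = {pod: sum(shops.values()) for pod, shops in pods.items()}
        heavy = max(totals, key=totals.get)
        light = min(totals, key=totals.get)
        gap = totals[heavy] - totals[light]
        # A move only narrows the gap if the shop is smaller than the gap.
        candidates = {s: load for s, load in pods[heavy].items() if 0 < load < gap}
        if not candidates:
            break  # Nothing left to move, e.g. one huge shop fills the pod.
        # Prefer the shop whose load is closest to half the gap.
        shop = min(candidates, key=lambda s: abs(candidates[s] - gap / 2))
        pods[light][shop] = pods[heavy].pop(shop)
        moves.append((shop, heavy, light))
    return moves


pods = {"pod1": {"big": 50, "s1": 8, "s2": 12}, "pod2": {"s3": 10}}
plan = rebalance(pods)
```

In this small example the planner moves the two small shops away and leaves the large shop alone on its pod, which mirrors the remark that a very big store should perhaps have its own little Shopify.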
Christophe: You also had another talk at SREcon where you talked about working on very low to zero downtime moves of shards between regions. You exercise that on a pretty regular basis, can you tell us about that as well?
Simon: When we have these little Shopifys, you can imagine that some of the little mini Shopifys (the pods) in one region need to be moved to another region. There may be some kind of disaster. We do drills for this now fairly frequently. When we move a pod, we want to minimize the disruption. There are some really important effects when you have a really stable infrastructure tool and you know that you can evacuate a region: even if you just have to do a little low-risk maintenance, you can, over time, deliver much less downtime. What we've been looking at is how we can have the minimal amount of impact on these moves so we can do them all the time. So when disaster hits, they're just regular and routine and everyone knows how to do it. That little bot that I was telling you about earlier, that tells you, "Hey, have you notified these people?" or "Have you updated the status page?" can tell you, "Hey, have you just considered evacuating the region?" That's where I want it to get to.
In order to get there, you have to build a lot of trust and you need to exercise it regularly. And in order to do that, you can only eat so much into your error budget doing resiliency drills. So to do that, you need to have a very small hit on the error budget when you do it. When we move it, the move boils down to us moving all the stateful resources over to the other region - letting the database catch up, moving the jobs and things like that. While we do some of this, nothing is writeable for that pod. MySQL is read-only in both regions for a very short time while the target region catches up. During that time, we can't serve checkouts and things like that. If you zoom out and look at this from your perspective of trying to do a checkout and you've just entered your credit card details, and you hit pay, and it was this expensive order, now you get a 500. There's nothing more dreaded than that - you have to call the store and ask them like, "Did you charge me? What's going on?" And they're like, "I don't know." You don't want that. That's a really bad experience for a customer and we really care about that.
So what we do is, when requests come in during a move, we pause them at the load balancer while we're doing the move and then unpause them. So the experience, instead of getting that 500, is that the browser just takes 5-10 seconds to load the page, which is not so bad. You may freak out a little bit, but you should not even get to that time. Confidently minimizing that disruption is really important.
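The pause-and-unpause behavior can be sketched with a simple gate that holds requests while a move is in progress. This is a Python illustration using threads, not the OpenResty code Shopify runs. The `PodGate` class, the ten second ceiling, and the 503 fallback when the ceiling is hit are assumptions.

```python
import threading


class PodGate:
    """Hypothetical gate that holds requests for one pod during a move."""

    def __init__(self, max_wait_seconds=10.0):
        self._open = threading.Event()
        self._open.set()  # Requests flow normally until a move starts.
        self.max_wait_seconds = max_wait_seconds

    def pause(self):
        """Start holding incoming requests because a move is beginning."""
        self._open.clear()

    def unpause(self):
        """Release held requests once the target region is writable."""
        self._open.set()

    def handle(self, serve):
        """Wait for the gate to open, then serve the request."""
        if not self._open.wait(timeout=self.max_wait_seconds):
            return 503  # Assumed fallback if the move takes too long.
        return serve()
```

A request that arrives mid-move simply blocks in `handle` until `unpause` is called, so the customer sees a slow response rather than an error.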
Christophe: Definitely not as bad as a 5xx error for sure. So I've got one more question to wrap up. Now that you do have a team and you're managing that team, what are some of the ways that, one, you motivate that team, and second, you measure their individual performance or the team overall? How do you manage that? How do you structure that?
Simon: I'll go into two ideas that I really like, in both facets. In terms of the team, I was reading this book recently called "The Power of Moments" and the authors of that have written some really brilliant books. You may remember some of these moments from your childhood, where your parents told you something at just the right moment, and you remember this disproportionately better than anything else. It's sort of your bee in the bonnet, one of these things that's just recurring. The book is about you creating those moments.
As part of this, they say it's very important to have purpose. There's these tests, and I haven't read the study, but if you have very high purpose, or - let me approach this from a different angle. If you have high passion and low purpose, you will only be in about the 20th percentile of performance. Passion actually doesn't matter as much as purpose because you can have high passion, but if there's no purpose, you may be completely working on the wrong thing and scattering your attention everywhere. But if you're high purpose and high passion, you're in the 80th percentile. If you're high purpose and low passion, you're in the 60th percentile. How is it that low passion, high purpose individuals can outperform the high passion, low purpose ones? I think that's the responsibility of the manager. That's a really big part of managing a team - managing a purpose and making sure everyone understands why what they're working on is important. How does it affect the end merchant? How does it affect their customers? Whether it's developer or merchant or whatever it is. Whether that means listening in on a support call while there's an incident so they understand why we don't have incidents and why uptime is important, it's really important to build purpose.
In terms of rewarding employees, I think there's a lot of ways of doing it. So I'll talk about something that I think is rare and that I'd like to see more of. I think that it's very easy to reward people for being heroes and doing things reactively. If there's an incident and someone does a great job of managing the incident for something they're an owner of, we celebrate them. We say, "Hey, that was a great job, you stayed up until 2AM and you fixed it. Amazing." Yeah, that's great, but why did that person end up there in the first place? I see all those opportunities, and it says you made some really bad decisions in the past and yeah, you had the context to fix it in the moment, but it shouldn't have happened in the first place.
What I'd like to see people do more, to reward teams and individuals, is to look at what proactive decisions they've made. It's very subtle. What you can do, that I've started experimenting with, is looking at individuals on my team and the key decisions they've made that year and trying to play it forward to now as if they'd made a different decision. For example, this decision someone on my team made of creating an API to make every job fast changed all these assumptions. It had these cascading, amazing effects over all of our pods - how we move shops, how we throttle jobs, how we design jobs now. All of these amazing effects. So we sat down and just entertained for ten minutes, what would the present look like if we hadn't done that? If we'd just taken the short road out? You reward people based on that - what would the present be if we'd made a different decision.
Of course it works the other way as well - a bad decision, what if we'd done this instead, where would we be today? Rewarding people for being proactive is so important. Black Friday and Cyber Monday are coming up for us (this was recorded right before BF/CM), and I'm really proud of the fact that my team is not scrambling to get things done. We're in a state now where it's Black Friday, Cyber Monday and we feel we have a pretty good handle on things. We're not stressed out of our minds. I think that's something to celebrate. It means that we've hopefully made some really good decisions about the things that we're responsible for on my team. Whereas last year, we were scrambling, we were stressed. That was an opportunity to learn - how do we not get in that situation again? What poor decisions did we make this year to end up in that situation?
Christophe: Simon, thank you so much for sharing your wisdom and for your time. I really enjoyed this episode, as I did with the previous episode as well. It's always a pleasure to have you on the show. Anthony, thank you for joining us today as well. If anyone has any follow up questions or they want to check out some of your content, where and how can they reach out to you?