Scaling Basecamp and making it insanely fast with CTO and Rails Founder, David H. Hansson
Interviewed by Christophe Limpalair on 10/19/2015
David has quite a few interviews online, so I did a lot of research to ask him questions you won't easily find anywhere else. He explains Basecamp's stack (which handles 2,000 reqs/second) and what makes it so fast. We also talk about the big DDoS attack that took them down, Rails 5 speed features, rendering HTML on the server side, and why 80% of applicants get automatically rejected.
What scale is Basecamp running at? Give us some stats
- 15 million users (keep in mind they are private and they choose which numbers to release and how)
- 2,000 requests per second at peak
- 30 app servers (For the current Basecamp version, not Basecamp 3)
- They store 700TB of user uploads on a distributed cluster of Cleversafe installations
- They have ~1.5TB in a single MySQL instance
- 375GB in a single MySQL table
- 48GB of RAM on each of 6 instances (288GB total) for caching
They are currently at about 2,000 requests per second against the Rails app during peak, with 30 app servers (for the current Basecamp version). They store about 700TB of user uploads on a distributed cluster of Cleversafe installations. On top of that, they have about 1.5TB in a single MySQL instance and 375GB+ in a single MySQL table. How much RAM? 48GB on each of 6 instances (288GB total) for caching.
Let's just take a moment and think about where Rails started in terms of scaling opposition. "Rails doesn't scale!" I think 2,000 requests a second is no joke. Not that impressed? Shopify is running a monolithic Rails app at 17,000 requests per second.
David adds that the algorithms and architecture you are using have much more of an effect on whether a system will scale or not. You can have the fastest programming language in the world running on the fastest server, but if your system doesn't have the right architecture, it won't scale.
He also mentions that their scale has come from an unsophisticated place. They still run a single write database for all of their applications. They have not gone to multiple write databases. They do have read replicas.
So all writes go to a primary, then reads come from replicas.
They're experimenting with primary-primary setups but those are not for performance, they are for resilience and cross data center setups.
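To make the primary/replica split concrete, here is a minimal sketch in Ruby. It is not Basecamp's code: the `ConnectionRouter` class, the server names, and the regular expression used to spot writes are all assumptions made for illustration.

```ruby
# Hypothetical connection router: one primary for writes, replicas for reads.
class ConnectionRouter
  WRITE_PATTERN = /\A\s*(INSERT|UPDATE|DELETE|REPLACE)\b/i

  def initialize(primary:, replicas:)
    @primary = primary
    @replicas = replicas
    @next_replica = 0
  end

  # Returns the name of the server that should run this statement.
  def route(sql)
    return @primary if sql.match?(WRITE_PATTERN) || @replicas.empty?

    # Round-robin across replicas for reads.
    replica = @replicas[@next_replica % @replicas.size]
    @next_replica += 1
    replica
  end
end

router = ConnectionRouter.new(primary: "db-primary", replicas: ["db-replica-1", "db-replica-2"])
puts router.route("INSERT INTO comments (body) VALUES ('hi')") # db-primary
puts router.route("SELECT * FROM comments")                    # db-replica-1
puts router.route("SELECT * FROM projects")                    # db-replica-2
```

A real setup also has to deal with replication lag (a read that immediately follows a write may need to go to the primary), which this sketch ignores.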
You're running a monolithic app. How big of an app are we talking about?
The current version of Basecamp running in production is about 20,000 lines of code. David is a big believer in monolithic software, and designing things as integrated systems.
They're currently working on a new version of Basecamp, called version 3. That's currently clocking in at around 18,000 lines of code.
What does your architecture look like?
They have BIG-IP load balancers in front, which David says much of the industry is using at this scale. Then they have Nginx web servers. For the app servers, they predominantly run Unicorn, whose worker processes are single-threaded.
For Basecamp 3, they are working with Action Cable which is a framework for dealing with websockets. They've been experimenting with both Puma and Thin, and either will work.
Their database runs on MySQL for the main records. They run Redis for a variety of optimizations and things that don't necessarily need to survive a crash. They also run memcached for caching.
Why do you run on your own hardware?
Mainly because of cost.
Over the years, they have run a lot of various tests to determine how much performance they would get from the cloud versus from their own hardware. As it stands, the benefits do not outweigh the costs.
For example, they compared the price of storing user uploads on AWS versus on Cleversafe and they could do it cheaper by a factor of 4. "Given the fact that we are not one of those VC backed companies that are all about hypergrowth, profit is actually something that matters," David adds.
The cloud does make certain things a lot easier (like spinning up machines: one click instead of 2 weeks for ordering and provisioning a new server). But the fact of the matter is that Basecamp has a very predictable growth pattern. Except for the very rare times when they launch a brand new version, they usually know what their load is going to look like 18 months in advance.
I'm curious to see when that changes, seeing that cloud prices are going down.
David says he's convinced that it won't make sense to run their own hardware in the very long term, but that could be a very long time. He actually thought they would be there sooner. He didn't think that running their own servers would make economic sense at their scale and with their own team for so long.
"It is an ever shrinking market of making sense."
David encourages medium-scale companies to take advantage of what cloud services have to offer. Then there might be a "wild growth stage" where it's hard to predict what traffic is going to look like, and the cloud still makes a lot of sense. Eventually, if you have predictability, you can rent some space somewhere and have full control over your entire stack.
When dealing with so much data in MySQL, what are some ways of making sure it doesn't slow down?
There really isn't a bag of tricks to figuring this out, David says. A lot of it is understanding the fundamentals like indexes, for example.
One of the deceiving things, especially when you run MySQL on SSDs and very fast hardware, is that things can look fine one minute and then not so fine the next. Those tipping points can come when your indexes no longer fit in whatever fast storage they fit in before, and suddenly you have a bottleneck.
Or, you start with something that works well when customers are querying datasets of hundreds, and all of a sudden they are querying datasets of thousands.
A lot of this scaling is about observing changes in performance, paying attention to datasets customers are using, and changing things once they hit certain tipping points.
David says that oftentimes the fix has to come from changes to the application. For example, they used to have a page on Basecamp that loaded all users on a Basecamp account. What about when someone had 5,000 users? That doesn't work well.
They've also had a few instances in the past where they had to rely on Percona and others to help them but they have been using MySQL for over a decade and they've been growing with it slowly and predictably. Most of these tricks have been things they have picked up along the way.
1) Pay attention to your query times
2) Make sure you can trace those requests
Noah, Basecamp's data expert, has built a very nice system for them to query all of their logs, trace them back to requests, and trace those requests back to all of the infrastructure pieces they touched along the way.
(I'll take one of those!)
Because knowing that things are slowing down is interesting, but what's more actionable is understanding why it's slow, what's causing it, and then figuring out how to fix those database calls by adding the right indexes or cutting back on query load.
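As a rough illustration of the two tips above (watch query times, and make them traceable to requests), here is a small Ruby sketch. It is not Basecamp's internal tooling: the `QueryLog` class, the 20ms threshold, and the `sleep` standing in for a database call are all assumptions.

```ruby
require "securerandom"

# Hypothetical query log: every entry carries the request id so a slow
# query can be traced back to the request that issued it.
class QueryLog
  Entry = Struct.new(:request_id, :sql, :duration_ms)

  attr_reader :entries

  def initialize(slow_threshold_ms: 100)
    @slow_threshold_ms = slow_threshold_ms
    @entries = []
  end

  # Times the block (standing in for a real database call) and records it.
  def measure(request_id, sql)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
    @entries << Entry.new(request_id, sql, elapsed_ms)
    result
  end

  def slow_queries
    @entries.select { |entry| entry.duration_ms >= @slow_threshold_ms }
  end
end

log = QueryLog.new(slow_threshold_ms: 20)
request_id = SecureRandom.uuid

log.measure(request_id, "SELECT * FROM users WHERE account_id = 1") { sleep 0.03 }
log.measure(request_id, "SELECT 1") { :ok }

log.slow_queries.each do |entry|
  puts "#{entry.request_id} #{entry.sql} took #{entry.duration_ms.round(1)}ms"
end
```

Once every slow query carries a request id, the "why is it slow" question becomes a lookup instead of a guess.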
Are your monitoring tools internal?
Yes, 100% internal.
They used New Relic back in the day.
Tools like New Relic, Skylight, Stackify, and other tools of that nature fit in the same discussion of whether you should use the cloud or not. When starting out, you should definitely use them, but building your own does make sense once you get to a certain scale.
Once you are building your own systems and you have teams to operate those systems and tailor to them, it can definitely make sense to build your own tools tailored around your exact needs.
Can you explain your failover process?
I mentioned using Nginx to queue requests during failovers so that the user's browser just spins a little bit longer than usual, and then lets the request go through instead of showing an error page. This is something Simon Eskildsen talked about in his interview.
David says that they have mainly used this technique to fail over the database. In the old days, when they didn't have this sort of setup, they would have to take Basecamp down with a notice.
Pausing requests to hot swap the database was a major improvement. So how do they do it?
They have a primary database which is replicated to a replica. They make the change on the replica, and they make sure it is caught up with all the data changes from the primary, and then they swap the replica and primary (here's a script they wrote 3 years ago to do that for MySQL).
What do you do while that swapping is going on? The tool they are using is called Intermission. It allows them to pause requests for up to 10 seconds while the swap is going on. Then, it lets the requests go through.
Sort of like a traffic cop, as David illustrates.
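The traffic cop idea can be sketched in a few lines of Ruby. To be clear, this is not how Intermission itself is implemented; the `RequestGate` class and its method names are hypothetical, and only the 10 second limit comes from the interview.

```ruby
# Hypothetical gate: while paused, callers wait (up to a limit) instead of failing.
class RequestGate
  def initialize(max_wait_seconds: 10)
    @max_wait_seconds = max_wait_seconds
    @paused = false
    @mutex = Mutex.new
    @resumed = ConditionVariable.new
  end

  def pause
    @mutex.synchronize { @paused = true }
  end

  def resume
    @mutex.synchronize do
      @paused = false
      @resumed.broadcast
    end
  end

  # Runs the block once the gate is open; returns :timed_out if the
  # pause outlasts the wait limit.
  def call
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @max_wait_seconds
    @mutex.synchronize do
      while @paused
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        return :timed_out if remaining <= 0
        @resumed.wait(@mutex, remaining)
      end
    end
    yield
  end
end

gate = RequestGate.new(max_wait_seconds: 10)
gate.pause                                    # the database swap starts
request = Thread.new { gate.call { "200 OK" } }
sleep 0.1                                     # stand-in for the swap itself
gate.resume                                   # swap done, let traffic through
puts request.value                            # 200 OK
```

From the user's point of view the request just takes a little longer, which is the whole point: a short pause instead of an error page.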
For failing over entire data centers, that's far more complicated. "It is very, very tough."
They've spent about 3 years to get to the point where they could completely do it. Maybe not full speed ahead for 3 years, but still.
It's certainly not something he would want to undertake in the early stages of a company's life. It sucks a lot of energy out of other things, and introduces a considerable amount of operational complexity. But David says he looks at it as an insurance plan once you get to that scale. The odds of their data center being wiped out entirely are very low, but if it happens it is absolutely catastrophic.
If, for whatever reason, Basecamp went down for a few days, the results could be devastating for the company. You might think that this sounds unrealistic in this day and age, but they've had enough close calls, David says. Back in 2008, a truck drove into a power station or something near the Rackspace data center they were in at the time. The backup generators failed, but Basecamp ended up going back up after 48 hours or so. "This was traumatic enough in itself." Some of the other people were not so lucky.
"Are we willing to risk the business on that, even though it's a remote chance? Our conclusion was no, we're not willing to risk the business on that." So, they got a backup data center. That has been a massive undertaking. Operating multiple data centers and building systems to fail over between them has probably sucked up more energy than any other project they have ever done on the operational side of the business.
David adds that this is a case where the cloud helps, because it is easier to distribute things. AWS can do some very clever things with failure migration.
It's funny to say that, though, because a few weeks ago AWS had failures on the East Coast. Nothing is 100%.
At this point, Basecamp is at 99.997% uptime which is something they are very proud of, but it doesn't come without blood, sweat, and tears. That level of reliability is absolutely not necessary in the beginning. When they first started with Basecamp, they had 99% uptime. While this sounds high, if you calculate that over the course of a year it is not that high.
99% uptime means 361.35 days out of 365. Almost 4 days down.
99.997% uptime means 364.99 days out of 365.
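The same arithmetic is easier to feel in minutes than in days. A quick Ruby sketch, assuming a 365 day year (the helper name is made up for this example):

```ruby
MINUTES_PER_YEAR = 365 * 24 * 60

# Downtime per year, in minutes, implied by an uptime percentage.
def downtime_minutes_per_year(uptime_percent)
  MINUTES_PER_YEAR * (100 - uptime_percent) / 100.0
end

[99.0, 99.997].each do |uptime|
  minutes = downtime_minutes_per_year(uptime)
  puts format("%.3f%% uptime: %.1f minutes (%.2f days) down per year", uptime, minutes, minutes / (24 * 60))
end
```

At 99% that works out to 5,256 minutes (about 3.65 days) of downtime a year; at 99.997% it is roughly 16 minutes.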
Big difference, but not that important when you are just starting out (unless your service is something critical, of course).
Instead, David says, spend more time figuring out your business model and your market. Over time, you can become more sophisticated.
One thing that does matter at every stage of the business is having a fast app. What optimizations have you been able to do for Basecamp 3?
Basecamp 3 is, in many ways, a continuation of the same techniques used in Basecamp 2.
One of these important optimizations is called Russian Doll caching. It had a pretty big performance impact on Basecamp.
What is it, and how does it work?
Each segment of the page is cached individually, and then you build larger and larger caches from there.
For example, say you have 99 comments and one of those comments gets edited. Instead of re-rendering all 99 of those comments, you only have to invalidate and re-render that one (plus the wrapper that contains it); the other 98 come straight from the cache. That's huge, especially when you do it to entire pages.
This does, however, pose some important restrictions on the way you design templates and the UI.
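Here is a toy version of the idea in plain Ruby, using the 99 comments example. It is only a sketch: `FragmentCache`, the structs, and the key format are assumptions for illustration, not the Rails cache API (in Rails you would use the `cache` view helper, with `touch: true` on the association to bump the parent).

```ruby
# Hypothetical in-memory fragment cache keyed by record id + updated_at,
# so editing a record changes its key and leaves the others untouched.
Comment = Struct.new(:id, :body, :updated_at)
Post = Struct.new(:id, :comments, :updated_at)

class FragmentCache
  attr_reader :renders

  def initialize
    @store = {}
    @renders = 0
  end

  # Returns the cached fragment, or renders it via the block and stores it.
  def fetch(key)
    @store.fetch(key) do
      @renders += 1
      @store[key] = yield
    end
  end
end

def render_comment(cache, comment)
  cache.fetch("comment/#{comment.id}-#{comment.updated_at}") do
    "<li>#{comment.body}</li>"
  end
end

def render_post(cache, post)
  # Outer doll: its key changes whenever the post is touched.
  cache.fetch("post/#{post.id}-#{post.updated_at}") do
    "<ul>#{post.comments.map { |c| render_comment(cache, c) }.join}</ul>"
  end
end

cache = FragmentCache.new
comments = (1..99).map { |i| Comment.new(i, "Comment #{i}", 1) }
post = Post.new(1, comments, 1)

render_post(cache, post)
puts cache.renders # 100 (99 comments + 1 post)

# Edit one comment and "touch" the post so the outer key changes too.
comments[41].body = "Edited"
comments[41].updated_at = 2
post.updated_at = 2

render_post(cache, post)
puts cache.renders # 102 (only the edited comment and the post wrapper)
```

The second render does two pieces of work instead of a hundred. The cost is the design restriction mentioned above: every fragment has to be cacheable on its own, so templates can't depend on things like the current user in arbitrary places.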
"One of the traps a lot of web developers fall into is that they tend to care about the things that are easiest to measure."
The things that are easiest to measure are things going on in the backend, such as how fast the server responds. David says that if your server is already responding in 200-300ms, how much more can you really squeeze out of it? If the page still takes longer than a second to load, what is causing the other 800ms? If your server takes longer than 300ms to respond, then yes, definitely look into what's causing it (maybe you need more caching?).
Otherwise you need to look at all the other parts that are going into serving a page:
You might be loading tons of assets, or images are being served in the wrong size and without the proper compression.
Then look at how your CSS is structured:
Look at the overall time it takes for your page to load, not necessarily just individual pieces. This is one of the areas that they really focused on with Basecamp 3, especially since they are being more ambitious on the mobile side of things. They are using webviews, which load parts of the real application inside of a native mobile application. That has to feel really fast, so a lot of effort went into that. Here's what they did:
It made a very measurable difference in the perceived performance of pages. A lot of that cutting down came from getting rid of unnecessary jQuery code from plugins.
David does say that jQuery plugins get a lot of hate, but he doesn't think it's appropriate. The thing with plugins is that they try to solve a lot of problems. When you take something like jQuery UI, which is a beast, and you just use it to do some drag & drop actions and a few other things, it's like taking a huge machine and using it for a tiny job. That's probably not going to be the fastest thing. They solved this problem by plucking out unused code.
When does it make sense to do this kind of optimization?
Very late in the game. As late as you possibly can.
"I love plugins [...] they quickly let you test out an idea."
This is a problem a lot of developers run into...they start working on a feature and optimize the crap out of it, only to find out that their team wants to scrap it. "What?!? I just spent 3 days on it!"
That's a quick way of falling in love with code. By spending so much time optimizing it, you become defensive of it and your application suffers because of it. So the later you can optimize with this sort of thing, the better.
David does clarify that there are certain times when this isn't the case. You could fall in a deep hole that makes it difficult to get out of. His advice? Keep an eye on this kind of optimization, but don't focus too much on it. This is mostly something you learn with experience and time after getting burned a few times.
Another way you can work around this issue is by setting a path you could follow to optimize whatever it is you are coding. Structure it in such a way that allows you to optimize down the road without breaking everything. Even if the app is a little slow to begin with, as long as you have a path to making it fast and you keep it in mind, you'll be OK.
What's something else that's made Basecamp 3 so fast?
Something that's made Basecamp 3 seem so fast is the use of websockets. The previous versions used a variety of polling mechanisms, AJAX updates, etc. instead of websockets.
Websockets are really fast.
Part of the reason it is so fast is that it uses a persistent connection. With regular requests, a lot of the overhead comes not from the work itself but from having to set up a new SSL connection. Sometimes you get to reuse that connection and sometimes you don't. "It's a very opaque machine to look into."
They have certain operations where establishing SSL connections is where the majority of time is spent, especially when they are further away from the data center. You can spend 500ms getting the handshake done, and then 100ms getting the update that you need. It's not proportionate, and it's not cool.
With websockets, you make that connection once. You pay for the handshake one time, and then all the data going back and forth over the wire is already encrypted. It makes a big difference.
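A back-of-the-envelope calculation shows why paying for the handshake once matters. This Ruby sketch uses the 500ms and 100ms figures quoted above and assumes the worst case for polling, where no connection is ever reused; the helper names are invented for the example.

```ruby
HANDSHAKE_MS = 500 # figure quoted above for a distant client
UPDATE_MS = 100    # figure quoted above for the update itself

# Worst case for polling: every update pays for a new handshake.
def new_connection_each_time_ms(updates)
  updates * (HANDSHAKE_MS + UPDATE_MS)
end

# Persistent connection: one handshake, then only the updates.
def persistent_connection_ms(updates)
  HANDSHAKE_MS + updates * UPDATE_MS
end

[1, 10, 100].each do |updates|
  puts "#{updates} updates: #{new_connection_each_time_ms(updates)}ms vs #{persistent_connection_ms(updates)}ms"
end
# 1 updates: 600ms vs 600ms
# 10 updates: 6000ms vs 1500ms
# 100 updates: 60000ms vs 10500ms
```

In practice browsers do reuse connections some of the time, so the real gap sits somewhere between these two columns.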
Rails 5 is going to ship with an entire framework dedicated to websockets called Action Cable.
They're using Action Cable in a bunch of the UI elements of how to load things dynamically and how to get updates. That has been a really nice benefit.
There was CGI and then there was FastCGI. FastCGI and any other persistent process basically compiles the system once, and then it keeps that instance around to respond to multiple requests. Turbolinks does the same thing on the client side.
What are your thoughts on the shift to thicker JS clients on the frontend served up by REST APIs, and how does Rails plan to evolve into that model, vs. rendering view logic predominantly on the server side. Asked by: djslaker
"People think that this is a clash of civilizations. That there can only be one idea that prevails. And that's just bullshit."
David says that generating HTML and any kind of view response on the server side is not only completely legitimate, it is his preferred way of creating web applications -- including brand new ones that are being created today.
It's a spectrum, and David says he likes to use the part of that spectrum that gives him the most torque and the most productivity. What he has found is that applications like Basecamp, Shopify, or GitHub don't win by moving all of their logic to the client side. "They become, in my opinion, more fragile and worse off."
On the flip side, it doesn't mean there aren't good use cases for it. But David goes on to say that the types of applications that he sees & uses on the web (and that most people are interested in building) are better off creating HTML on the server side and sprinkling dynamic elements into it with JS.
People who really believe in client side JS, David says, see JS sprinkles as a negative, but he sees it as something wonderful.
So what they'll do at Basecamp is sprinkle JS over server side generated HTML and it works great. They do SJR (server-generated JavaScript responses) too, which caused a bit of debate on Hacker News.
To illustrate with an example, say you are making an ajax call from the client side. Like 'delete this comment'. You click the delete link, what should the server side do?
Well, the server side could just respond with an HTTP status code, and then on the client side you can have logic to make the comment disappear. Or the server could respond with a bit of JavaScript that removes the comment from the page itself, which is the SJR approach.
David says the latter becomes preferable when you're not just responding to remove a DOM ID from the tree. It's super powerful when it comes to adding a bit of HTML to the DOM using the same template that was used to generate the page in the first place.
For example, say you are rendering a page with a bunch of comments, and those comments are rendered on the server side. Now you want to add a new comment to the page. You could do that on the client side, but if the rest of the HTML still comes from the server, that means re-implementing the comment template in JS, and now you have two templates. Or you could build a fully client side application, where all you get from the server is a JSON dump and everything has to be generated on the client. "I've not found that to be a good time either."
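The single-template idea can be sketched with nothing but Ruby's standard library. This is an illustration, not Rails or Basecamp code: the helper names and the `comments` element id are assumptions, and a real Rails app would use a partial plus a JavaScript view instead.

```ruby
require "erb"
require "json"

# One template for a comment, used for the full page and for the
# fragment sent back after a new comment is posted.
COMMENT_TEMPLATE = ERB.new('<li id="comment_<%= id %>"><%= ERB::Util.html_escape(body) %></li>')

def comment_html(id:, body:)
  COMMENT_TEMPLATE.result_with_hash(id: id, body: body)
end

def comments_page(comments)
  "<ul id=\"comments\">#{comments.map { |c| comment_html(**c) }.join}</ul>"
end

# Hypothetical SJR-style response: JavaScript that inserts the same
# server-rendered HTML into the existing page.
def create_comment_response(comment)
  html = comment_html(**comment)
  "document.getElementById('comments').insertAdjacentHTML('beforeend', #{html.to_json});"
end

puts comments_page([{ id: 1, body: "First!" }])
puts create_comment_response({ id: 2, body: "Nice <b>post</b>" })
```

Both the initial page and the later update go through `comment_html`, so there is exactly one place that decides what a comment looks like.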
Rails 5 is adding what's called API mode, enabled with the --api flag, which gives you the option to choose. If you don't want to generate HTML on the server side, you don't have to.
Rails is a "big tent". It's a very large framework that's trying to solve a lot of things, and not everyone has to use every part of it. So Rails 5 is building in a specific mode that you select when you generate a new application by passing --api (for example, rails new my_app --api). This gives you a structure that's targeted at generating just JSON on the server side. No view logic at all. No Asset Pipeline.
You're building a distributed system from the get-go, where the server is simply responsible for checking integrity, access control, and all that other logic you have to deal with in the model (which in a lot of apps is a big part of the logic).
All this just means you won't use a small part of the Rails framework. People do this all the time already. Think back to the jQuery plugins we talked about earlier...
David wraps up by saying: "We don't all have to agree on all things, all the time, to make progress together."
You were under DDoS attack in March 2014, what did you learn from the attack that could help others?
It sucks. Bad.
A DDoS of a sizable magnitude is pretty hard to completely shield yourself against. There is hardware you can buy that tries to mitigate these attacks, but the only real mitigation that works on a general level is having more bandwidth than your attackers do.
If you have 40 Mbps of bandwidth and you are getting attacked with 200 Mbps, it's going to be pretty tough. What makes it a little bit easier is if you have good relationships with your upstream providers. David has heard horror stories from people who couldn't get any help from upstream providers. They're just stuck and they are being bombarded without the ability to defend themselves.
After Basecamp was attacked, they started a DDoS survivors group with the Ops teams of other companies that have been attacked, and they heard heartbreaking stories of people (especially at smaller companies) who didn't have any power to get help at all. They were just down for a week. Basecamp was down for just under 2 hours.
Was this to try and extort money from you?
They got a letter telling them to wire a certain amount of Bitcoins to a certain account.
But a lot of times, these attacks are against sites that are hosting something on behalf of someone else, and the attackers clearly don't want that content to be there. If you're a GitHub or a publishing platform and somebody uses your systems to put up some code or writings that someone else doesn't like, they can attack you.
That's why David believes Basecamp has been shielded from this for such a long time. They don't have any public data.
What do you look for when hiring?
"I look for great software writers. [...] How do you actually figure that out?"
There are a bunch of indicators that you could use.
You could look at whether they attended an Ivy League school, or whether they worked at a prestigious firm, but David hasn't found these indicators to be helpful at all. "Not only do they produce duds, they also produce a ton of false negatives."
Many of the best programmers in the world have learned programming themselves and not through any accredited school. They just figured it out. How do you get those people?
For Basecamp, it's been a lot about looking at the code. David adds that a lot of programmers select themselves out of the pool.
For example they might get anywhere between 100 and 150 applications. "I'd say that at least 80% of those are rejected outright because people just spam them with a resume that lists places they've worked and the year they graduated."
"How is this actionable information? What am I going to use that for?" "I think resumes in the general sense are pretty worthless when they come to assessing the capabilities of a programmer."
On top of this, people also usually write cover letters that are pretty generic. They don't show an interest in the company itself.
So 80% of people fail these first two tests. That leaves 20%.
For the 20%, he looks at their code. He's found that he can quickly form an opinion about skill after seeing a substantial amount of work from someone. Things like how much the person cares about the presentation of their code...not so much about spelling mistakes or omitting commas... "I make spelling mistakes all day long."
However, there is a tipping point where it establishes a pattern, David goes on to say. "This person is just not diligent enough with the quality of their code, that this is not something that fits what we're looking for." "A lot of people fail that test."
"A lot of the code is poorly indented, poorly named, poorly scoped."
David goes on to say that they see files submitted to them with lines of code commented out! "They submit a piece of code without cleaning it up. It's kind of like inviting your prospective employer over to your house and you didn't fucking even clean up. You had some people over last night and there's all sorts of crap all over the floor."
Doesn't mean you're a bad person, but c'mon!
Beyond that, there can be clean code, but it's just not that great. That's the unfortunate thing where a lot of people wonder, "what do you mean by bad code?" "Can you tell me exactly what you mean?"
It's kind of like sending someone a short story and you don't really like the story, so the author asks you to point to the line or paragraph you don't like. It doesn't usually work like that. You can't really teach someone how to write an interesting story over an email. You can't teach someone how to be a good programmer in a reply to a job application.
It sucks to just reply back to a job application with "your code just isn't that good" because it's not actionable.
So what kinds of things has David run into that look like 'bad code'?
It's a lot of basic level stuff, like:
1) Having methods that are 15 lines long and they do 5 different things.
2) Tons of global variables.
To avoid falling into these traps, David recommends a book called Smalltalk Best Practice Patterns. It's kind of like a book that teaches you to write well. It goes into all the elements of proper naming and proper composition, for example.
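To show what the opposite of a method that does 5 different things can look like, here is a small Ruby sketch of the composed method style the book discusses. The `Signup` class is invented for this example and has nothing to do with Basecamp's code base.

```ruby
# Hypothetical signup flow split into small, intention-revealing methods,
# instead of one long method that validates, normalizes, and formats.
class Signup
  def initialize(email, name)
    @email = email
    @name = name
  end

  # Reads like a summary; each step lives in its own method.
  def welcome_message
    raise ArgumentError, "invalid email" unless valid_email?

    "Welcome, #{display_name}! We'll write to #{normalized_email}."
  end

  private

  def valid_email?
    normalized_email.match?(/\A[^@\s]+@[^@\s]+\z/)
  end

  def normalized_email
    @email.strip.downcase
  end

  def display_name
    @name.strip.split.map(&:capitalize).join(" ")
  end
end

puts Signup.new(" DHH@Example.com ", "david  hansson").welcome_message
# Welcome, David Hansson! We'll write to dhh@example.com.
```

Each method has one job and a name that says what it is, and all state lives in instance variables instead of globals.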
Finally, if you pass all of these tests, they make you write code for them. The only way to get around that is if you already have a large body of work in the open source community. They've hired a number of people straight from the Rails core group of contributors.
With that being said, you don't have to contribute to any open source code to qualify at Basecamp.
Anyway, they'll pay you to work on a side project so they can judge your skillset. Once you go through that process with a number of candidates, it's very clear who you need to hire. You get to see how they work, how they think, and how they solve problems.
So you do this instead of whiteboarding?
"Yeah that is the worst thing ever."
"Artificial coding sessions at the whiteboard are the devil, and companies that follow them deserve what they get."
"If you pulled me out and made me write Bubble sort or something on the whiteboard, I'd probably fail that miserably."
"I would fail the majority of these whiteboard tests."