Inside engineering: Behind CodePen's security and scale

Written by Christophe Limpalair on 11/18/2015

This is extracted from episode 9 of the ScaleYourCode show with Tim Sabat. (Info and stats are from late June 2015.)

CodePen is a playground for the front-end web. It is no exaggeration to say that CodePen has changed the way front-end developers build and share their work. From prototypes of professional designs to fun and engaging experiments, front-end developers can quickly and easily write and share their code as Pens.

Behind the scenes, many Pens need server-side processing, because CodePen gives them access to powerful preprocessors like HAML. To process those Pens, their code must be executed on CodePen's servers. I'll refer to the environments that process this code as sandboxes.

While sandboxes are powerful for users, they have created a number of challenges for Tim Sabat, who is in charge of their servers. I wanted to learn about these challenges, so I sat down with Tim and got to work. This post will cover security and technical issues that Tim and the rest of his team had to wrestle with. It will also cover the stack powering these Pens, and a few other things you might find interesting. I certainly did!

What scale are they running at?

CodePen serves about 32 million page views a month to 3.4 million monthly users (stats from late June 2015). Its servers handle roughly 12,000 requests per minute, or 200 per second.
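A quick back-of-the-envelope check on those numbers, assuming a 30-day month: 12,000 requests a minute is exactly 200 a second, but 32 million page views works out to only about 740 page views a minute. So the request figure must count a lot more than full page loads, roughly 16 requests per page view. The interview doesn't break down what those extra requests are, so the ratio below is our own derivation, not a figure CodePen reported.

```ruby
# Back-of-the-envelope rates from the late-June-2015 figures.
# Assumption: a 30-day month. The requests-per-page-view ratio is derived
# here; it is not a number CodePen reported.
SECONDS_PER_MINUTE = 60
MINUTES_PER_MONTH = 30 * 24 * 60 # 43,200 minutes

page_views_per_month = 32_000_000
requests_per_minute = 12_000

requests_per_second = requests_per_minute / SECONDS_PER_MINUTE
page_views_per_minute = page_views_per_month.fdiv(MINUTES_PER_MONTH)
requests_per_page_view = requests_per_minute / page_views_per_minute

puts "requests per second:    #{requests_per_second}"              # 200
puts "page views per minute:  #{page_views_per_minute.round}"       # 741
puts "requests per page view: #{requests_per_page_view.round(1)}"   # 16.2
```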



As you will see, CodePen has a set of unique challenges that make things very interesting. Before we dive into that, let’s take a look at their architecture.

Languages:
  • JavaScript and Ruby

Frameworks:
  • Rails, Sinatra (Padrino), Express, Node.js

Web servers:
  • 4 x m3.large: Web Apps
  • 3 x m3.large: Preprocessor servers, with Docker containers
  • 3 x t2.small: Specialized preprocessor servers

Data Store:
  • 3 x m3.xlarge SQL boxes: Master/Slave/Export
  • 2 x r3.large Solr: Master/Slave
  • 2 x m3.large: Redis Master/Slave
  • Redis stats: about 1,000 requests/second, with 11 billion calls in the last 150 days (late June 2015)

Load Balancers:
  • 2 x Nginx

Misc:
  • 1 x Deployment box
  • 1 x Sidekiq
  • 1 x t2.small GitLab
  • 1 x m1.micro: Hit counter
  • 1 x monitoring box (Icinga)
  • 1 x NAT

All of this runs in Amazon's VPC.
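The Redis numbers are easy to sanity-check. Spread evenly over 150 days, 11 billion calls averages out to roughly 850 calls a second, which is in the same ballpark as the quoted figure of about 1,000 requests a second. The interview doesn't say whether that 1,000 figure is a typical, recent, or peak rate, so treat the comparison as a rough consistency check only.

```ruby
# Average Redis throughput implied by "11 billion calls in the last 150 days".
calls = 11_000_000_000
seconds = 150 * 24 * 60 * 60 # 12,960,000 seconds

average_calls_per_second = calls.fdiv(seconds)
puts "average Redis calls per second: #{average_calls_per_second.round}" # 849
```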

What challenges do they have?

A regular web application request handled by Ruby is relatively quick. If you need to scale Ruby, you can usually just throw more hardware at it.

The slow part comes from sandboxes that run preprocessors. Preprocessors have a lot of overhead:
  • Setting up the preprocessor
  • Giving the user a response

For example:

There's a difference between a regular call to a server and a call to a preprocessor like HAML. Whereas the server should respond within a few milliseconds, the preprocessor alone could take 100ms. Add that to the overall response and you're keeping a user waiting. No bueno.

But why does it take so long?

Preprocessors have a startup penalty: each time one is invoked, it has to load and initialize before it can process any code. They weren't meant to run on a server--they usually run on a local desktop.

How do you fix this? Or at the bare minimum, how do you make sure fast calls aren't pipelined behind slow ones?

CodePen's answer was to split preprocessing out into separate services, route requests accordingly, and dedicate servers to running preprocessors.
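To see why routing matters, here is a minimal Ruby simulation of fast calls stuck behind slow ones (head-of-line blocking). It assumes a single worker that handles requests strictly in arrival order, with illustrative durations (5ms for a regular request, 100ms for a preprocessor call, in line with the figures above) rather than real CodePen measurements. The Request struct and the latencies helper are hypothetical names for this sketch.

```ruby
# Sketch: a slow preprocessor call delays fast requests that share a worker;
# routing preprocessor calls to their own worker removes that delay.
# Durations are illustrative, not measurements.
Request = Struct.new(:name, :kind, :duration_ms)

# One worker processes its queue in FIFO order. A request's latency is the
# time from t = 0 (everything enqueued) until that request finishes.
def latencies(queue)
  clock = 0
  queue.each_with_object({}) do |req, out|
    clock += req.duration_ms
    out[req.name] = clock
  end
end

requests = [
  Request.new("haml_1", :preprocessor, 100),
  Request.new("page_1", :web, 5),
  Request.new("page_2", :web, 5)
]

# Everything on one shared worker.
shared = latencies(requests)

# Fast and slow calls routed to separate workers.
web_queue = requests.select { |r| r.kind == :web }
preprocessor_queue = requests.select { |r| r.kind == :preprocessor }
routed = latencies(web_queue).merge(latencies(preprocessor_queue))

requests.each do |r|
  puts format("%-7s shared worker: %3dms  routed: %3dms",
              r.name, shared[r.name], routed[r.name])
end
```

With one shared worker, page_1 finishes at 105ms because it waits for the HAML call; once preprocessor calls get their own worker, the same request finishes at 5ms while the HAML call still takes its 100ms. Real servers run many workers in parallel, but the same effect shows up whenever slow calls can occupy the workers that fast calls need.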

What about security issues? What happened?

You probably saw this one coming. Running random people's code on servers is not usually something you want. Except in this case, that's exactly what they want.

So what security issues has CodePen run into?

Very early on, someone emailed the team a Pen. The Pen used HAML to spit out the contents of their htpasswd file.

I'm sure that was an "OH shit!" moment.

Thankfully, this vulnerability and a few other remote code execution bugs were found and reported by white-hat hackers.

How do you fix these issues?

It really all comes down to permissions. Don't run preprocessors in a context that can read important files like your htpasswd file, or really any file they don't need access to.

Ruby security issues

Backticks (`) in Ruby let you drop into a subshell and run commands on the server. CodePen fixed this by lowering the file descriptor limit to 0, which means the process can't open new files to write() to, or open the pipe that backticks need in order to talk to a shell.

Wait, how does that work?

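Every Linux process has its own table of open file descriptors, and every file, pipe, or socket it opens takes a new entry in that table. By capping how many descriptors the process may have open, you ensure the Ruby process can't open new files, pipes, or sockets: the OS refuses to create any more file descriptors for it. Descriptors that were already open before the limit was set keep working.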
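A minimal sketch of the file-descriptor trick in Ruby, assuming a POSIX system (Linux or macOS). The limit involved is RLIMIT_NOFILE, which Ruby exposes through Process.setrlimit; setting it inside a forked child restricts only the child. The helper name run_without_new_fds is hypothetical, and the interview doesn't say exactly how CodePen applied the limit, so treat this as an illustration of the idea rather than their code.

```ruby
# Sketch: run a block in a forked child whose open-file limit is 0.
# Assumes a POSIX system. run_without_new_fds is a hypothetical helper name.
def run_without_new_fds
  reader, writer = IO.pipe # opened BEFORE the limit, so the child can still report back
  pid = fork do
    reader.close
    # Lower the soft and hard RLIMIT_NOFILE limits to 0: no new descriptors.
    Process.setrlimit(Process::RLIMIT_NOFILE, 0)
    begin
      yield
      writer.write("allowed")
    rescue SystemCallError => e
      writer.write("blocked: #{e.class}")
    end
    writer.close
    exit!(0) # skip at_exit hooks inherited from the parent
  end
  writer.close
  result = reader.read
  reader.close
  Process.wait(pid)
  result
end

puts run_without_new_fds { `echo pwned` }             # backticks need a new pipe
puts run_without_new_fds { File.read("/etc/passwd") } # opening a file needs a new fd
puts run_without_new_fds { 2 + 2 }                    # pure computation still works
```

The pipe back to the parent is created before the limit is applied, which is why the child can still report its result: already-open descriptors keep working, while new ones (the pipe backticks need, or a handle on /etc/passwd) are refused, with Errno::EMFILE on Linux.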

How about Docker? Are there security issues there?

If you've read any literature on Docker, you've probably seen debates on security concerns. Naturally, I had to ask Tim about this.

First, they use one container per service to get process isolation. A single instance of HAML runs in a single container, so breaking it doesn't cause anything else to break.

Second, the application running inside each preprocessor container is really small (~40 lines of Sinatra code). All it does is call the preprocessor, HAML for example. Everything else happens before that call, in a separate router service: any security action, like stripping calls, happens there.

Docker also lets you mount filesystems as read-only, so the OS rejects any attempt to write to them.
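As an illustration of one container per preprocessor with a read-only root filesystem, here is a small Ruby helper that builds the matching docker run command. The image names and the sandbox_command helper are hypothetical placeholders; -d, --read-only, and --name are standard docker run flags. A real service container would likely also need a writable scratch area (for example via --tmpfs) and resource limits, which this sketch leaves out.

```ruby
# Sketch: build a `docker run` command for one preprocessor per container,
# with the container's root filesystem mounted read-only.
# The image names and sandbox_command are hypothetical placeholders.
require "shellwords"

PREPROCESSOR_IMAGES = {
  "haml" => "example/haml-sandbox",
  "sass" => "example/sass-sandbox"
}.freeze

def sandbox_command(preprocessor)
  image = PREPROCESSOR_IMAGES.fetch(preprocessor) # raises KeyError for unknown names
  [
    "docker", "run",
    "-d",                                # run as a long-lived background service
    "--read-only",                       # writes to the root filesystem are rejected
    "--name", "#{preprocessor}-sandbox", # one named container per preprocessor
    image
  ]
end

PREPROCESSOR_IMAGES.each_key do |name|
  puts Shellwords.join(sandbox_command(name))
end
```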

Finally, there's fork(). fork() creates a copy of the current process, and the child works on its own copy of memory, so nothing it does can modify the original. In CodePen's case, right before the preprocessor call, they fork the Ruby process, isolating the call from the web server context and keeping anything dangerous from leaking back. The untrusted code runs in its own memory space.
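Here is a minimal sketch of that fork-before-the-call pattern in Ruby, assuming a POSIX system. render_in_fork is a hypothetical helper, and the block stands in for the real preprocessor call. The child mutates a global, sends its output back over a pipe, and exits; the parent's copy of the global is untouched.

```ruby
# Sketch: do the untrusted work in a forked child so anything it changes in
# memory (globals, constants, monkey patches) dies with the child.
# render_in_fork is a hypothetical helper; the block stands in for a preprocessor.
def render_in_fork(source)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(yield(source)) # the block only ever runs in the child
    writer.close
    exit!(0)
  end
  writer.close
  output = reader.read # read until the child closes its end of the pipe
  reader.close
  Process.wait(pid)
  output
end

$site_name = "CodePen"

html = render_in_fork("<p>hi</p>") do |src|
  $site_name = "hijacked" # changes only the child's copy of memory
  "#{src} (#{$site_name})"
end

puts html       # <p>hi</p> (hijacked)
puts $site_name # CodePen
```

Using exit! in the child skips at_exit hooks, so the forked copy of the web process doesn't run shutdown code that belongs to the parent.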

Bonus security tip: Sass and SCSS can read files off the disk when generating data URLs, so keep that in mind.
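If a preprocessor feature can read files to inline them, one mitigation is to resolve every requested path against an allowed asset directory and refuse anything that escapes it. This is a hypothetical helper (safe_asset_path, with /srv/assets as an example root), not something from CodePen's codebase. It normalizes ".." segments but does not resolve symlinks, so a production version would also want to check the result with File.realpath.

```ruby
# Sketch: only allow file reads under an asset root.
# safe_asset_path and the /srv/assets root are hypothetical.
def safe_asset_path(root, requested)
  root = File.expand_path(root)
  # Join first so a leading "/" or "~" in the request stays inside root,
  # then normalize away any "." and ".." segments.
  full = File.expand_path(File.join(root, requested))
  full.start_with?(root + File::SEPARATOR) ? full : nil
end

p safe_asset_path("/srv/assets", "img/logo.png")     # "/srv/assets/img/logo.png"
p safe_asset_path("/srv/assets", "../../etc/passwd") # nil
```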

Bonus: Vagrant vs. Docker

Ahh, yes. When does it make sense to use one over the other?

Docker is still a very misunderstood technology. Since I knew Tim came from a background of using Vagrant, I was really curious to learn why he chose Docker for CodePen. Then, I wanted to take it a step further and ask how the two differ from one another. This way, you can walk away armed with knowledge that can help you decide which is best for you.

There are more advantages to Docker that I'm not covering here, but the two main ones that influenced Tim's decision are:

  1. Speed
  2. Convenience

Speed, speed, speed.

How long does it take to boot up your Vagrant environment? How about provisioning? Yeah, it takes a little while.

Docker boots containers in seconds. This makes it really easy to spin things up or tear them down.

Say you want to play around with a service's settings, or even Linux configurations. Even if you mess something up, all you have to do is tear down the container and boot another one up in no time. Pretty nice.

Convenience

This one has a lot to do with speed too, but there's more to it. Imagine you want to make a change in your infrastructure, like a configuration change in Nginx. Instead of changing it in Ansible, Chef, or whatever, you can just change the container image, destroy the old containers, and deploy new ones in their place. Again, this all happens very quickly. You don't have to wait for a recipe to be applied.

While Docker has its advantages over Vagrant, it's not the easiest technology to learn.

There can be a lot of things you have to understand up front before you can use it: linked containers, volumes, etc. This is one of Vagrant's strong points--you can usually just boot it up and get started.

Tim says "There's a high startup cost for learning Docker, but once you learn it and you kind of understand the nuance, I think it's pretty powerful."

What now?

My hope with this article, and really every piece of content on this site, is to leave you with actionable items. Knowledge is great, but it is easily forgotten if not applied.

How can you apply this article to improve your application(s)?

Here are a few lessons that stand out to me:

1) Take a good look at your system as a whole, then break it down into smaller parts. Do you have any fast calls pipelined behind slow ones? If so, is there any way you can change that?

How do you even know? Monitoring is key. If you aren't able to tell, the first step you should take is to improve your monitoring. Seriously, it's super important.
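Monitoring doesn't have to start out fancy. As a hypothetical starting point, grouping request timings by endpoint and comparing the median (p50) with the 95th percentile (p95) quickly shows which paths are slow. The sample timings below are invented for illustration, and percentile uses the nearest-rank method.

```ruby
# Sketch: per-endpoint p50/p95 latency so slow paths stand out.
# The sample timings are invented for illustration.
def percentile(sorted, pct)
  # Nearest-rank: rank = ceil(pct / 100 * n), computed with integers.
  rank = (pct * sorted.size + 99) / 100
  sorted[[rank, 1].max - 1]
end

def latency_report(samples)
  samples.group_by(&:first).transform_values do |pairs|
    ms = pairs.map(&:last).sort
    { p50: percentile(ms, 50), p95: percentile(ms, 95) }
  end
end

samples = [
  ["/pen/show", 8], ["/pen/show", 12], ["/pen/show", 9],
  ["/preprocess/haml", 95], ["/preprocess/haml", 140], ["/preprocess/haml", 110]
]

latency_report(samples).each do |endpoint, stats|
  puts "#{endpoint}: p50=#{stats[:p50]}ms p95=#{stats[:p95]}ms"
end
```

If fast endpoints show a p95 far above their p50, that can be a sign they are sometimes waiting behind slow work.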

Great, so now you have one or two actionable items. What else can you do?

2) While your application may not require running other people's code on your servers, security is obviously still an important concern. How did CodePen solve their security issues? It often came down to permission restrictions.

Processes and services should only have access to what they absolutely need in order to function properly. Anything more than that is asking for trouble.

Run through your infrastructure and take a good look at permissions. This can be quite dry and boring, but putting in the effort now can save you hours of stress, public embarrassment, or even your business.

3) The last actionable step I'll leave you with is to research container technology. Yes, it can be intimidating. Yes, it may take you some time to figure it out. No, you shouldn't just turn a blind eye to it.

I'm not talking about replacing your entire infrastructure with Docker, but it may benefit you to check it out. Even if you don't end up using it, you will get a better understanding of how containers work and what makes them important.

If this article has helped you in any way, I'd love to hear about it in the comments :)