How CodePen handles 32M page views a month
Interviewed by Christophe Limpalair on 06/26/2015
Tim came prepared and he didn't hold anything back. See every layer of CodePen's infrastructure from security to AWS and Docker. He teaches us how they're able to serve 3.4M uniques/month and 12,000 requests per minute.
How did you get started?
Tim has been involved with computers since an early age. His dad went to MIT and he always had computers around the house.
Tim worked at Wufoo for 3 years before they were sold to SurveyMonkey. Wufoo didn't use a framework at the time so it was all "raw" PHP and MySQL.
After the acquisition, he worked at SurveyMonkey for about 2 years and fell in love with Ops during that time. Then, the three cofounders (Alex, Chris, and Tim) decided to pursue CodePen.
Did you start CodePen on AWS?
They were on Rackspace for a short amount of time and then switched to AWS once in production.
Did you build the infrastructure with scale in mind?
It was just a weekend project when they first started, so no.
They built everything on the same machine and had a live version of it within a month. They certainly didn't anticipate it being a popular project.
Turns out that they had 1,000 signups within the first week. They started running into scaling problems within a few weeks after launching it.
Did you already have a network or did people stumble upon CodePen?
Chris Coyier represented 70% of the inbound traffic at the beginning. He already had a big following at the time, and a Tweet was all it took to get started.
What kinds of technical issues do your Pens create?
Initial performance issues came from the volume of requests.
The regular web application request coming in to a Ruby server is relatively quick. The issue comes from their sandboxes that run preprocessors.
Preprocessors tend to run slower than a normal web request would because there's a lot of overhead in setting up a preprocessor call, and there's a lot of time spent getting a user's call returned to them.
This is an interesting challenge because there is a regular call to the servers and then a call to something like HAML for example. That alone can take 100ms because there's a startup penalty in running a preprocessor that was never meant to be run on a server-- it's usually run locally on a developer's desktop.
The main challenge is: how do you make preprocessors fast, and how do you separate them from the rest of your application so you're not pipelining fast calls behind slow ones?
How did you speed up preprocessors?
Moving things to services, routing things properly, and having dedicated servers for preprocessors.
What kinds of security issues did you run into?
About 3 weeks after they started out with a single box, someone told them to check out a Pen that was running HAML and spitting out their htpasswd file.
All of their other security issues have been remote execution bugs that have been found by white hatters with good intentions poking around the application. Thank you, white hatters!
How did you fix these security issues?
It really all came down to permissions. Don't run preprocessors in a context that can read important files like your htpasswd file, or your database.yml.
Backticks in Ruby will allow you to drop into a subshell and run commands on the server.
SASS and SCSS allow data-url so you can read off the disk.
You really have to work to prevent this sort of thing.
At the time they didn't store anything privately so there were no real issues, but it definitely was a wake up call.
What's really important to remember in this section is that scaling isn't just about the requests/second. You will have users trying things you never thought of. Keep that in mind.
What are some of the Linux and Docker security features you use?
CodePen uses one container per service because it gives process isolation. Haml, for example, runs in its own container, so if you break Haml you're not also breaking Slim, Sass, or SCSS.
The beauty of Docker is that you have a very small footprint. The application itself inside Docker is minimal (~40 lines of code in Sinatra). All it really does is call Haml.
Everything that needs to happen before that call is done in a separate service they call "router". Anything they do security wise like stripping calls that shouldn't be made, all happens separately.
With Docker, you can also mount the filesystem as read-only. The operating system would reject any attempts to write.
They also fork() each call. Forking creates a copy of a process so that you don't modify the original one. In CodePen's case, right before the call is made, they ask Ruby to fork, which isolates the call from the web server context and prevents anything dangerous from happening. You could change an environment variable, for example, and it's not going to hurt anything because you're working in your own memory space. These calls look exactly the same to the software, but from a protection standpoint, you can't actually edit variables on the server itself.
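To make the fork idea concrete, here is a minimal sketch. CodePen does this in Ruby; the version below uses Python's standard `os.fork()` on a Unix system instead, and `run_in_fork`, `untrusted_call`, and the `SECRET_TOKEN` variable are hypothetical names invented for the illustration, not anything from CodePen's code.

```python
import os
import pickle


def run_in_fork(func, *args):
    """Run func(*args) in a forked child and return its result to the parent."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: works on its own copy of the parent's memory.
        status = 1
        try:
            os.close(read_fd)
            os.write(write_fd, pickle.dumps(func(*args)))
            status = 0
        finally:
            os._exit(status)  # never fall back into the parent's code path
    # Parent: read the child's answer, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    _, exit_status = os.waitpid(pid, 0)
    if exit_status != 0 or not data:
        raise RuntimeError("forked call failed")
    return pickle.loads(data)


def untrusted_call(source):
    # Hypothetical stand-in for a preprocessor call that misbehaves.
    os.environ["SECRET_TOKEN"] = "overwritten"
    return source.upper()


os.environ["SECRET_TOKEN"] = "original"
result = run_in_fork(untrusted_call, "h1 hello")
```

The child changes the environment variable, but only in its own copy of the process, so the parent still sees the original value after the call returns.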
Another recent discovery is that they can turn the file descriptor limit down to 0, which means that you cannot file.write() or backtick to the shell. This works because everything you do in Linux opens its own file descriptor, so by limiting them, CodePen ensures that you can't do anything outside of the Ruby process that you're in.
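The effect of a zero file descriptor limit can be seen in a short sketch. This is a Python illustration using the standard `resource` module on a Unix system, not CodePen's Ruby setup, and the helper name `open_fails_without_descriptors` is made up for the example. The limit is lowered inside a forked child so the parent process is unaffected.

```python
import os
import resource


def open_fails_without_descriptors(path):
    """Fork, drop the child's open-file limit to 0, and report whether open() failed."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            # Descriptors that are already open (like our pipe) keep working;
            # only requests for new descriptors are refused.
            resource.setrlimit(resource.RLIMIT_NOFILE, (0, 0))
            try:
                open(path).close()
                verdict = b"opened"
            except OSError:
                verdict = b"blocked"
            os.write(write_fd, verdict)
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        verdict = pipe.read()
    os.waitpid(pid, 0)
    return verdict == b"blocked"


blocked = open_fails_without_descriptors(os.devnull)
```

Spawning a shell also needs new descriptors for its pipes, which is why the same limit gets in the way of backtick-style calls.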
Docker vs Vagrant
I personally use Vagrant almost daily. Tim used to as well, but he's recently switched to Docker. I asked him why, and now I want to play more with Docker...
Conceptually, they both start a server. A VM with Vagrant and a Linux Container in Docker.
Every time you start Vagrant, you're starting up a new VM. When you start a Docker container, to the operating system it looks the same: there's the normal set of files that an operating system expects to be there, except those files are actually recreated in a little file system that is separated from the host.
As you can imagine, Docker containers can boot up considerably faster.
Another cool thing with Docker is that you can mess around with Linux to see how things affect different services, delete that container, and start over from scratch within seconds. While you can achieve this with Vagrant too, it takes much more time.
Additionally, with Vagrant you can pull down an image once you start a box, but ultimately you can't distribute that image to your applications. With Docker, you can.
That container or image that you built can be used on all your different servers. Make a change? Destroy them and boot them back up in seconds. With Vagrant, you can build scripts with Chef or Ansible and distribute those, but then you have to run these scripts and it takes considerably more time. Just something to keep in mind.
Do you simply destroy containers when you need to make a change?
Yes. Because of the Union filesystem (UnionFS), you cache these different layers so if you're changing one thing all the steps before that change have already been provisioned so they don't have to re-run.
That's pretty powerful, and allows you to develop with speed.
With Vagrant, even if you make a tiny change, you have to reprovision the machine. With Docker you change it, destroy the containers, and restart them all within seconds.
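The layer caching described above can be modeled in a few lines. This is a simplified Python sketch, not Docker's actual cache logic (real cache keys also account for things like the contents of copied files); `layer_ids`, `build`, and the example build steps are assumptions made for illustration.

```python
import hashlib


def layer_ids(steps):
    """Give each build step an ID that depends on it and on every step before it."""
    ids, parent = [], ""
    for step in steps:
        parent = hashlib.sha256((parent + "\n" + step).encode()).hexdigest()
        ids.append(parent)
    return ids


def build(steps, cache):
    """Return the steps that actually had to run; cached layers are skipped."""
    executed = []
    for step, layer in zip(steps, layer_ids(steps)):
        if layer not in cache:
            executed.append(step)  # a real builder would run the step here
            cache.add(layer)
    return executed


cache = set()
v1 = ["FROM ruby", "RUN gem install sinatra haml", "COPY app.rb /app/"]
v2 = ["FROM ruby", "RUN gem install sinatra haml", "COPY app_v2.rb /app/"]
first = build(v1, cache)   # nothing cached yet, so every step runs
second = build(v2, cache)  # only the changed last step runs
```

Because each ID chains in the previous one, changing an early step would invalidate every layer after it, which is why slow, stable steps are usually placed first.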
Did you have any issues switching to Docker?
Mostly conceptual issues. Like a lot of things in the IT world, there's a lot you need to understand before you can even start the basic steps.
With Docker, there can be a lot of things you have to understand up-front before you can use it: linked containers, volumes, etc... This is one of Vagrant's strong points-- you just boot it up and get started.
Tim says "There's a high startup cost for learning Docker, but once you learn it and you kind of understand the nuance, I think it's pretty powerful."
Is Networking and Deployment challenging?
You can have a container locally but you still need to set it up and run it on your servers.
There are things coming along to ease this, like Docker Swarm, which helps with orchestration and setting up containers.
Networking is more of a challenge if you have multiple containers because you're dealing with networking at a "box" level and VPC level. There's port management you have to deal with, and there's load balancing as you start new servers.
As an example, Tim talked about a script they use for deployment where you start a container that runs side-by-side with your other one, you run a system check to make sure that container is up and running, and then you take down the currently running one.
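The shape of such a deployment script might look like the sketch below. It is written in Python with the container operations passed in as plain functions, so `swap_container`, `start_new`, `is_healthy`, and `stop` are all hypothetical names and not CodePen's script or any Docker API.

```python
import time


def swap_container(start_new, is_healthy, stop, old_id, attempts=5, delay=0.0):
    """Start a replacement next to the old container, verify it, then retire the old one."""
    new_id = start_new()
    for _ in range(attempts):
        if is_healthy(new_id):
            stop(old_id)  # only now is the old container taken down
            return new_id
        time.sleep(delay)
    stop(new_id)  # replacement never came up; keep serving from the old one
    return old_id


# Usage with in-memory fakes standing in for real containers.
running = {"web-1"}


def start_new():
    running.add("web-2")
    return "web-2"


def stop(container_id):
    running.discard(container_id)


active = swap_container(start_new, lambda cid: cid in running, stop, "web-1")
```

The key design choice is the ordering: the old container keeps serving until the health check on the new one passes.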
Does Docker have good docs and a large community? Compared to Vagrant?
Vagrant has a large community of users and if you have any issues you can usually find answers online. Is it the same with Docker?
Tim says that Docker seems really bleeding edge to him and sometimes solutions that were recently posted are already outdated.
At the same time, there are so many people who really love the concept of Docker and put a lot of effort into helping others that Tim hasn't had a harder time learning Docker than he did learning Vagrant. He does clarify that he has a lot more knowledge now than when he first started with Vagrant.
Why the switch from Chef to Ansible?
Chef has a lot of moving parts and can be difficult to explain. On the other hand, there is more immediate feedback with Ansible that makes it easier for someone learning to get started.
If you're interested in learning more about Chef, check out this interview.
What do you use Ansible for?
Gluing stuff together. They use Ansible for deployments: to push up application code and also for server management.
When a new server comes online, the first thing it does is run ansible-pull. This gets the server into whatever state you want it to be in. Once it's in that state, you pull down the bundle of code and put it in place. Bam, your server is ready to go.
Do you use Capistrano?
Yes, but not as much anymore. They use it for Sidekiq. This is what handles their queuing system.
What do you queue?
Queues are used every time someone saves a Pen, to create a snapshot image of that Pen's final result. They do this in 3-minute intervals. Even if you save 20 times during those 3 minutes, only the final result will be captured.
In order to accomplish this, they push Pens that have changed to a Redis Set.
They also use the queue for scheduling: jobs that run in the background, like cron jobs. Every 4 hours, for example, they queue up Pens that appear to be spam.
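The snapshot behavior (many saves, one screenshot per interval) follows naturally from Set semantics. Here is a minimal Python sketch that uses an in-memory set as a stand-in for the Redis Set; the class and method names are hypothetical, and a real implementation would talk to Redis instead.

```python
class SnapshotQueue:
    """In-memory stand-in for the Redis Set of Pens awaiting a screenshot."""

    def __init__(self):
        self._dirty = set()

    def mark_saved(self, pen_id):
        # Like adding to a Redis Set: adding an existing member changes nothing.
        self._dirty.add(pen_id)

    def drain(self):
        # Run once per interval (every 3 minutes in the interview).
        pending, self._dirty = self._dirty, set()
        return sorted(pending)


queue = SnapshotQueue()
for _ in range(20):
    queue.mark_saved("pen-abc")  # 20 saves of the same Pen
queue.mark_saved("pen-xyz")
batch = queue.drain()            # each changed Pen appears once
```

Twenty saves of the same Pen collapse into a single entry, so the screenshot job only sees each changed Pen once per interval.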
How many web servers, load balancers, etc. do you use?
Rails, Sinatra (Padrino), Express, Node
4 x m3.large: Web Apps
3 x m3.large: Preprocessor servers, containing Docker containers
3 x t2.small: Specialized preprocessor servers
3 x m3.xlarge SQL boxes: Master/Slave/Export
2 x r3.large Solr: Master/Slave
2 x m3.large: Redis Master/Slave
1 x Deployment box
1 x Sidekiq
1 x t2.small Gitlab
1 x VPN
1 x m1.micro: Hit counter
1 x monitoring box (icinga)
1 x NAT
2 x Nginx
Why m3.large instead of ElastiCache for Redis?
They use 4 different DBs in Redis for different levels of cache. They have a volatile cache, permanent cache (for counts), queueing, and another one (he couldn't remember).
Using Master/Slave would double the number of DBs needed. Since ElastiCache doesn't allow for more than one DB, this would cost a lot more.
What are some other painful parts of your infrastructure?
They've had a problem with every part of the infrastructure, but at different stages of CodePen's life.
Once CodePen started making money, they began to separate the different services. This allowed them to better understand memory and CPU profiles.
What is a painpoint you have right now?
They just recently had an issue with Solr scaling and running out of memory.
Solr was too tightly coupled to their application so if the query time ran slow, so did the application. This is very bad and dangerous: if Solr goes down, so does your app.
They solved this with Redis. Every time someone searches for something, the app checks Redis for Solr's status before making the HTTP call to Solr.
If Redis reports that Solr is down, the app redirects to a page that says search will be back up in a moment. The Redis key expires after 30 seconds, and then the app checks Solr again.
How does Redis check if Solr is down?
They write a Resque call around the HTTP call that checks for Solr's status and sets a key in Redis.
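The pattern described here resembles a simple circuit breaker. The sketch below is a Python approximation under several assumptions: `TTLStore` is an in-memory stand-in for a Redis key with an expiry, the key name `solr:down` is invented, and treating a `ConnectionError` as "Solr is down" is a guess at the failure signal rather than CodePen's actual check.

```python
import time


class TTLStore:
    """Tiny stand-in for a Redis key that is set with an expiry and read back."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def get(self, key):
        value, expires = self._data.get(key, (None, 0.0))
        if self._clock() >= expires:
            self._data.pop(key, None)  # expired or missing
            return None
        return value


FALLBACK = "search will be back in a moment"


def search(query, store, call_solr, ttl=30):
    """Skip Solr while the 'down' flag is set; set the flag when a call fails."""
    if store.get("solr:down"):
        return FALLBACK
    try:
        return call_solr(query)
    except ConnectionError:
        store.set("solr:down", "1", ttl)  # flag expires on its own after ttl seconds
        return FALLBACK
```

While the flag exists, searches never touch Solr, so a struggling Solr cannot slow down the rest of the app. Once the flag expires, the next search tries Solr again.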
Redis stats on Codepen's servers
11 billion calls have been made in the last 150 days.
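To put that figure in perspective, a quick back-of-the-envelope conversion to an average rate (assuming the calls were spread evenly, which real traffic is not):

```python
calls = 11_000_000_000         # "11 billion calls" from the interview
days = 150
seconds = days * 24 * 60 * 60  # 12,960,000 seconds
calls_per_second = calls / seconds
print(round(calls_per_second))  # about 849 calls per second on average
```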
How do you handle failures in your infrastructure?
CodePen uses Amazon Web Services Auto Scaling, but they use it for disaster recovery. Each web server is, in a way, self-healing.
If a server goes down, they can kill it off and spin up another one.
How do you change servers without downtime?
They use Phusion Passenger for code deployment.
For web servers, they either do it at the load balancer level or at the DNS level. In some cases they also use a VIP (Virtual IP). This allows you to assign a secondary IP to any box you want, and then inside the box you configure the network to also accept requests from the secondary IP. This is powerful because you can float that secondary IP address between boxes and the changes are instantaneous, unlike with DNS.
You had a DNS outage back in March. What happened?
DNS just went down on Amazon's end. It took CodePen down too.
So I asked him how they could avoid this situation in the future. This is a tricky one, but Tim answered that they could use Multi-AZ if DNS isn't failing in the other zones.
What configs have you tweaked on Redis?
On the Master, they don't write backups to disk. This is because they had keys changing so often that background saves were running almost constantly.
On the Slave, they have a high IOPS disk that runs and the machine itself is EBS-optimized so they can quickly write the RDB files.
How do you handle Redis backups?
They push up the RDB files to S3.
Which data-types do you use for which use case?
CodePen uses Sets to track the Pens that need a screenshot. Because Set members are unique, they don't screenshot the same Pen over and over again.
They use Sorted Sets for popularity scores.
For more use case examples, see this screencast and this article.
They've talked about using Redis as an LRU (least recently used) cache to avoid having to set manual expiration times.
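For readers unfamiliar with LRU eviction, the idea is that the cache has a fixed size and drops whatever was used longest ago when it fills up, so no per-key expiration is needed. A conceptual Python sketch follows (Redis's own eviction is configured on the server and only approximates this; the `LRUCache` class is illustrative):

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used key


cache = LRUCache(capacity=2)
cache.set("pen:1", "a")
cache.set("pen:2", "b")
cache.get("pen:1")       # touch pen:1 so pen:2 becomes the oldest
cache.set("pen:3", "c")  # cache is full, so pen:2 is evicted
```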
If you're interested in more Redis content, I have an entire series explaining various concepts and how it can be used to cache or even create a Twitter-like newsfeed, for example.
Why use Redis Master/Slave and how?
The process is simple. You take a slave and point it to the master.
For CodePen, it's used mostly for disaster recovery. Other people sometimes use it to farm out reads, but they don't.
How to get in touch?
Resources mentioned in the episode
Tim's blog on CodePen
Tim's other blog (Not as current but contains lots of great info)
Tim on GitHub (Good way to learn configs for Docker, Load balancers, Rails, etc..)
Books Tim is currently reading
Codepen.io, of course.
I didn't get to ask Tim all the questions I had planned because of time constraints. Tim was very prepared for this interview and he actually sent me notes to my questions the day before the interview. If you'd like to learn more, here are the raw, un-edited, notes.
How did this interview help you?
If you learned anything from this interview, please thank our guest for their time.