Building and scaling Code School with Docker and a service-oriented architecture
Interviewed by Christophe Limpalair on 02/29/2016
In this interview with Thomas Meeks, Code School's CTO, we learn how they use a service-oriented architecture to spread the load around. We also talk about Docker, which they use in their newer coding sandboxes to run students' code and verify it. On top of that, we touch on topics like caching, databases, monitoring, and company culture. You can listen, watch, or read the episode. Check it out and let me know what you think in the comments!
Thomas, can you tell us about your background and how you got started with Code School?
I have been a developer for about 14 years. I started off in the Java world and was very happy to move over to Rails quite a while back. I've done everything from Java desktop apps to web applications and a lot of server ops in between.
The story of Code School starts in 2011. It was born at a web consultancy called Envy Labs that I was working at. The point of that consultancy at that time was to try and create multiple products.
Code School ended up being super successful and that was the only product they ever made. A lot of us jumped from the consultancy to Code School in 2014.
Can you tell us more about what you do as CTO of Code School and what your job is like on a daily basis?
I am actually director of three different departments at Code School right now, so I spend a lot of my time being a hub for communications, mentoring other people and making strategic decisions about Code School, especially on the technical level of things ... making sure that things move forward on technical tasks.
I also make it a really big point to do some coding and server administration tasks every week.
It's really important for me to stay in touch with day to day work. At this point it feels weird to be out of touch and just do administrative stuff.
The transition from a senior developer role to CTO was scary for me. There's less coding, and things are less "clear-cut" than passing a spec. People are "harder," but they're also a lot more exciting.
Taking a higher level view of things has been rewarding.
It seems like switching back and forth between those roles would be difficult. Do you have tips for someone who might be in a similar situation (the switch from developer to CTO)?
One of the advantages of Code School is that we have these "work at home" days ... Tuesdays and Thursdays. I don't actually work at home, but we avoid scheduling meetings on those days. They are solid days where you can really get work done, and get through technical tasks.
It's really important to be involved with your team (do code reviews, etc) so you have a clear idea of what's going on.
My biggest piece of advice is to not underestimate how difficult it will be to transition from programming to working with people. As long as you take it seriously, things usually work out alright.
I've interviewed other CTOs, and some have said that they got to the point that they couldn't do any "hands on" work anymore because they felt it would harm the company. Do you think that happens when you get to a certain size with the company?
I think it happens naturally when it gets to a certain size. You could still make time to do a little hands-on work no matter the size, but maybe only once a month...not every week. There comes a point where you have to make sure you're delegating enough.
It's hard to delegate when you know how to do something, but you can't do it because you don't have time and you become a bottleneck. You have to slowly spin yourself down as you have more developers.
Can you give us an estimate of how much traffic you serve or what kind of scale you're running at Code School?
We're doing about 10 million page views with about ten times that in internal API calls. We're running about 40 small servers to keep Code School running although that scales up and down.
We can get crushed pretty hard with a lot of traffic at times, but we've been able to scale ourselves up to about 10 times traffic without making major architecture changes.
What platform are you running on and what kind of architecture is behind that?
We're running on Linode.
Code School is organized in a service-oriented architecture. I wouldn't go so far as to say microservices. Each of the services is pretty big. We're separated into clusters, one for each of the services. There are 5 right now. Each has its own database server.
We run at least 3 application servers for each of the clusters...a worker server or two, depending on how many are required.
It allows us to scale up our databases independently. We're able to spread that load around. If the front-end (the .com) is getting hammered hard, it doesn't necessarily affect courses.
So you have been able to clearly separate the role of each server? So if you have an issue with the main server, it doesn't take down the courses and vice versa?
Correct. All these different services are usually communicating through REST although the communication between things is generally pretty simple, so we've been lucky in that area.
We're also starting to dabble in RabbitMQ as well.
So you use primarily Ruby on Rails? Do you use other languages?
What was your insight in choosing Ruby on Rails?
Ruby on Rails is an amazing framework. It does so much for you that it's weird to step into another framework like Node after being in Rails for a long time.
One of the reasons we jumped into it, honestly, is because as a consultancy, that's what we were most familiar with.
There's a lot of value to sticking with what you know when you're firing up a new product that is full of unknowns. At the same time, it scales really well and handles the load really well and allows us to develop applications very rapidly.
While we see a lot of technology companies moving to Node, and we have as well on a couple of small places, we tend to only do that sort of thing when it's a very small service.
It's one of those dilemmas where you want to start a new project or just want to practice. You have to decide if you want to use something familiar, which allows you to develop things faster and not make as many mistakes, or try to move forward with some newer technology. It probably depends on who you ask, right?
If you're looking at something from an engineering or learning perspective, trying something new is always fun.....
... but starting a business or a product is really hard and you need to simplify as much as you possibly can. Unless there's a technical reason not to, stick with something you know.
You have "coding sandboxes" where students type in code and you check the code with whatever back-end engine you have to see if it's correct. How does that work behind the scenes?
It is really interesting. We have 7 or 8 servers dedicated to that right now split into 2 different clusters. One is called Legacy Executors. They are a "hodgepodge"... literally whatever it took to get it working at the time. If you go back to some of our older Ruby courses, they're running on JRuby, using some of the sandboxing features inside to keep things separate along with whitelisting and blacklisting.
We had some pretty fun security exploits where people found out how to call out to Java when their code was running inside JRuby.
We also have some Rails Executors that are running on MRI...that same sort of thing....your code gets thrown into some randomly named module and then we're whitelisting it to make sure you don't do anything weird.
The iOS Executor was a complete one-off for us. We ended up leasing 25 Mac Minis in a data center and installing VMware on each of them, running 6 copies of OS X on top of OS X and setting up this crazy...actually, interestingly enough Java-based messaging system so they could all talk back to a central server and receive messages with people's code and execute it inside all of the normal iOS tools and apps. We're still running that one for Try Objective C.
It's been interesting to see how those different ones come together.
When you mention JRuby, how would you describe it to someone who's never used it?
Java runs on top of something called the JVM, which is a very good system for running an interpreted language or any language really. An enormous amount of work has been put into it to make it fast and secure.
A while back, when CRuby (also known as Ruby MRI, the standard interpreter written in C) wasn't so great ... kind of slow ... a lot of work was put into porting Ruby onto the Java Virtual Machine.
It has advantages and disadvantages. It's kind of slow to start up compared to normal Ruby, but it's also faster if you let it run for a while.
You created a newer model? What does that look like?
We created a newer one, called Tutor, which is all Docker based ... no surprise to anyone familiar with Docker ... which allows us to spin up tiny little servers ... each one is used for one request and then thrown away.
It's really flexible and allows us to spend a lot less time developing Executors. That's the primary advantage ... it's actually a little slower per request than the old way because of having to spin up Docker containers. We've found plenty of performance gotchas with Docker that we've had to work with, primarily with file systems.
But it's a really nice, scalable system. We have a little Node app, very tiny, that manages all that ... and builds a Docker image that runs all the code and runs all the tests to check the code.
We usually spin a number of those up and let them sit on the server unused and just use them as requests come in and then spin those containers back up outside of the request.
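The pre-warming idea Thomas describes can be sketched roughly as follows. This is only an illustration under stated assumptions: `WarmPool` and `fake_container` are hypothetical names, and the "container" is just a label rather than a real Docker container. The interview does not show how Tutor's Node app actually does this.

```python
import itertools
import queue


class WarmPool:
    """Keeps a fixed number of ready-to-use sandboxes waiting for requests."""

    def __init__(self, size, create):
        self._create = create      # hypothetical factory; would start a container
        self._size = size
        self._ready = queue.Queue()
        for _ in range(size):
            self._ready.put(create())

    def acquire(self):
        # Hand out an already-started sandbox so the request does not pay
        # the startup cost. Raises queue.Empty if the pool has run dry.
        return self._ready.get_nowait()

    def refill(self):
        # Meant to run outside the request path, e.g. from a background worker.
        while self._ready.qsize() < self._size:
            self._ready.put(self._create())

    def available(self):
        return self._ready.qsize()


_ids = itertools.count(1)


def fake_container():
    # Stand-in for starting a real container; returns a label only.
    return f"sandbox-{next(_ids)}"


pool = WarmPool(size=3, create=fake_container)
first = pool.acquire()   # used for exactly one submission, then thrown away
pool.refill()            # the replacement is created after the request is done
```

The point of the split between `acquire` and `refill` is that the slow step (starting a container) never sits between a student's submission and the response.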
That's all automated as the requests scale up or down?
This reminds me a lot of my interview with CodePen. They have a similar model with Docker containers which they had switched to. When did you switch to Docker?
For about a year, that's where most of the work was going, so we didn't have to do much maintenance on our executors.
With JRuby, you said you could do whitelist or blacklist. How do you control security with Docker there?
We don't. We let Docker "do its thing." There are a few things we keep an eye on. Generally, Docker itself is fairly secure. We don't allow code to run as root inside the container where possible. Also, once that code has run, we trash the container and spin up a new one from a copy of the image.
We have resource limits so someone doesn't try to take up a bunch of memory or things like that.
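As a rough illustration of those precautions, here is how a throwaway, non-root, resource-limited container could be described with standard `docker run` flags. The image name, the limit values, and the `sandbox_command` helper are all assumptions made for this sketch, and `--pids-limit` is an extra guard that the interview does not mention. Code School's real settings are not given.

```python
def sandbox_command(image, memory="128m", cpus="0.5", user="1000:1000"):
    """Build a `docker run` argument list for a single-use sandbox."""
    return [
        "docker", "run",
        "--rm",                  # delete the container as soon as it exits
        "--user", user,          # run as an unprivileged UID:GID, not root
        "--memory", memory,      # hard memory cap
        "--cpus", cpus,          # CPU share cap
        "--pids-limit", "64",    # assumption: limit process count (fork bombs)
        image,
    ]


# "tutor-ruby" is a made-up image name used only for this example.
cmd = sandbox_command("tutor-ruby")
print(" ".join(cmd))
```

The list could be passed to `subprocess.run(cmd)` on a host with Docker installed; it is only printed here so the sketch runs anywhere.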
Are you running those containers inside a virtual machine? This is one way to have more security, or to have a "harder layer" so to speak.
Yes and all those servers are considered untrusted by the rest of our servers. Even if one gets hacked, it's just not that big of a deal. We just trash it and spin it back up.
In order to check the student's code: you run the code, something checks the result, and if it matches what you expected, the student can go on. If not, it gets an error. How do you give hints when there is an error?
The kind of tests that fail determine what kind of hint is given. There have been a few instances where using a parse tree and going through that has felt like a better solution... especially on courses that are beginner where there are about a hundred different ways to do the same thing.
What's a parse tree?
When a computer is looking at a language, it needs to take each of those language words and tokenize them out into something the computer can actually understand.
Then it creates a tree that lists out how the code would be executed.
Once you parse the code and get it to a point where the machine can understand it, you can walk through it and see what someone is trying to do.
For example, if we say, "Please write a loop," it's not necessarily accurate to use a regular expression to look for all the loop commands because you don't know if the loop started correctly or ended correctly. If you use a parse tree, you are breaking it down to a place where the interpreter can read it.
Then you can walk through the tree and see if this is an actual loop that's correctly formed and does what it's supposed to do.
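To make the regex-versus-parse-tree difference concrete, here is a small sketch using Python's built-in `ast` module. It is only a stand-in: the interview does not say which parsers Code School uses, and its courses cover other languages. The function names are made up for the example.

```python
import ast
import re


def has_loop(source):
    """Return True if the code parses and contains a real for/while loop."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # malformed code never counts as a correct loop
    return any(isinstance(node, (ast.For, ast.While)) for node in ast.walk(tree))


def regex_sees_loop(source):
    """Naive check: just look for the loop keywords anywhere in the text."""
    return re.search(r"\b(for|while)\b", source) is not None


in_string = 'print("wait for it")'          # keyword inside a string
broken = "for i in range(3) print(i)"       # missing colon, does not parse
real = "for i in range(3):\n    print(i)"   # a correctly formed loop

for label, code in [("in_string", in_string), ("broken", broken), ("real", real)]:
    print(label, regex_sees_loop(code), has_loop(code))
```

The regex reports a loop in all three snippets, while the parse-tree check accepts only the last one, which is the kind of false positive Thomas is describing.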
Now that you've used Docker to run this code and maintain the course section of things, do you think it makes sense to move Docker to your main or core services for databases, etc...?
I'm personally not convinced yet. I think Docker makes a lot of sense if you're just starting out and you'd like to save on costs and make it easier to scale up. It is definitely easier to take that pain early on and then be able to move the Docker containers to other servers.
Right now, though, one of the things that has me not so convinced is that we use a really nice application called Phusion Passenger to manage our Rails and Node services. It "feels" a lot like Docker, although, in practice, it's not anything like Docker.
It keeps them separated, spins them up and down, scales them up and down... We wouldn't gain a lot, from a business standpoint, to switch over to Docker.
What are some of the other tools you use for database or caching?
We use PostgreSQL everywhere. It's our database of choice and we really love it. It's been really nice on the scaling front. Our database servers are just 16 gigabyte Linodes, so we have enormous headroom just on scaling that vertically right now.
We haven't hit any major problems with scaling the database. We've been careful from day one ... cognizant of how we're using them. When we start doing heavy number crunching, we pull all that from the read pairs.
For caching, we use memcached and Redis. Redis is probably our bigger cache. We use them to cache the courses themselves. Our courses get compiled and loaded into the app. So Redis caches that for us.
We also use memcached to cache code submissions. So when you submit code, if you submit the same code as someone else, we don't bother executing, we just return it from memcached, which saves us a lot of time.
Is there anything else you're able to share like that?
On the CodeSchool.com site we're not doing anything really interesting in caching ... just your normal page caching and partial caching that's inside of Rails to let us toss pages out a little bit faster.
Chris: It's probably pretty complicated to try to share that load for the caching so that if a student already wrote the same code, you don't have to execute it again. I'm trying to think if there are any other use cases where you could share that load like that.
In some cases, we try to keep it kind of dumb. We've noticed that a lot of students just copy/paste hints. So that makes it a lot easier on us. You can't necessarily just strip out things like white space and that kind of thing.
There are a lot of ways to make sure you're not getting a cached request, like if you toss in a comment with a bunch of gibberish in it, we're not parsing out comments or anything like that.
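A deliberately "dumb" submission cache like the one described above might look something like this. It is a sketch under assumptions: a plain dictionary stands in for memcached, `fake_execute` stands in for running code in a sandbox, and keying by challenge ID plus a hash of the exact code is a guess at a reasonable key, not Code School's actual scheme.

```python
import hashlib


class SubmissionCache:
    """Returns a stored result when the exact same code was already checked."""

    def __init__(self, execute):
        self._execute = execute  # hypothetical: runs the code and returns a result
        self._store = {}         # stand-in for memcached
        self.executions = 0      # counts how often we really had to execute

    def _key(self, challenge_id, code):
        # No normalization: whitespace and comments are part of the key.
        digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
        return f"{challenge_id}:{digest}"

    def check(self, challenge_id, code):
        key = self._key(challenge_id, code)
        if key not in self._store:
            self.executions += 1
            self._store[key] = self._execute(code)
        return self._store[key]


def fake_execute(code):
    # Stand-in for the real sandboxed run and test suite.
    return "pass" if "puts" in code else "fail"


cache = SubmissionCache(fake_execute)
cache.check("ruby-1", 'puts "hi"')
cache.check("ruby-1", 'puts "hi"')           # identical code: served from cache
cache.check("ruby-1", 'puts "hi"  # zzqx')   # gibberish comment: cache miss
```

Because nothing is stripped before hashing, a pasted hint hits the cache every time, while any extra comment or whitespace forces a fresh execution, matching the trade-off Thomas mentions.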
You said you use memcached and Redis. What for?
By and large, we use memcached when all we want is a really simple, dumb cache that we can communicate with. It's going to store some plain-text sort of thing and dump it back out for us.
We use Redis where we want to use more of the Redis features like arrays and hashes and things like that. Our course application makes pretty heavy use of the extra features in Redis.
Chris: ...all the rich data types...I remember you had a forum and a section with podcasting. At one point you could vote on a course you wanted to have. That could be an interesting use case for Redis.
You said that Postgres is your "go to". Why?
There are a lot of specific reasons. I'm a big fan. I've been using Postgres for the full 14 years. One of the things I really love is that Postgres doesn't tend to die when you put it under a complex load. I've killed Redis by just putting it under a heavy write load. I've killed memcached the same way and I've seen MySQL die in a similar way.
PostgreSQL will slow down but it doesn't just stop. A long time ago it had a reputation for being slow, but it seems they've taken that very seriously and reversed course. Many times I have been "blown away" by the speed of it.
For example, we've started using the json data type. We're storing messages that come in from services like Pluralsight and Stripe and never planned on querying them, but in PostgreSQL you just cast the column to json. It looks over a million rows and finds what you want in less than a second, which is kind of impressive when I didn't create an index or do any of the things you normally do to make these queries fast.
What kind of data do you have to store in those databases....user progress, user accounts, Stripe detail.....?
Codeschool.com is the business side of things, so it stores user data and the information we need from Stripe. Obviously, we let Stripe handle the PCI-compliant stuff... credit card numbers or anything like that. We do store kind of a summary of user progress on Code School so we can display that ... badges, achievements, what videos they've watched...
Inside of each course, we store every submission that anyone has ever done, which is a rich piece of data that we haven't really explored yet, but it's kind of nice. It tells us how many times people have taken the course and how far they've gotten through it.
Is that how you're able to tell if someone has copied and pasted the hint?
Yes, we were really curious about that. We poked around some of the submissions and saw how similar they were.
Chris: It makes perfect sense to store that kind of data because you could tell that students are really struggling with certain sections and try to understand why. That's really interesting. I wonder if you could do some kind of Data Mining ("big data") with that.
We actually have a Data Mining tool called Guidance that pulls down progress on individual courses and challenges. It shows us where people are dropping off. Then we dig deeper.
Is that something you built custom and what is it built with?
It's a small custom app that pulls data down through jobs from all the different databases that we have. It's built with Ruby on Rails and D3.
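The kind of drop-off report described here can be sketched as a simple funnel over stored submissions. This is a toy illustration in Python, not Guidance itself (which is built with Rails and D3). The tuple layout, the challenge IDs, and the `drop_off` function are assumptions made for the example.

```python
from collections import defaultdict


def drop_off(submissions, challenge_order):
    """Count distinct users who passed each challenge, in course order."""
    passed = defaultdict(set)
    for user, challenge, ok in submissions:
        if ok:
            passed[challenge].add(user)
    return [(challenge, len(passed[challenge])) for challenge in challenge_order]


# Each row: (user, challenge id, whether the submission passed). Made-up data.
submissions = [
    ("ann", "c1", True), ("bob", "c1", True),
    ("cat", "c1", False), ("cat", "c1", True),
    ("ann", "c2", True), ("bob", "c2", False),
    ("ann", "c3", True),
]

funnel = drop_off(submissions, ["c1", "c2", "c3"])
print(funnel)  # [('c1', 3), ('c2', 1), ('c3', 1)]
```

In this made-up data, the fall from 3 users to 1 between `c1` and `c2` is the sort of gap that would prompt a closer look at that challenge.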
What other kinds of monitoring do you do? Can you tell when your database slows down? What kind of tools do you use?
We've switched a few times. Right now, we're using a service called Datadog to monitor all the servers. It makes keeping up to date on all that stuff easy.
My favorite metric to watch on all the servers is a line graph of load. Load is one of those often underappreciated and underutilized metrics for how a server is doing. It can be a watch on I/O, memory and CPU all in one number. That's the primary way of making sure that no one server is getting crushed too hard.
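For readers who have not looked at load before, this is one way to read it from a script. On Linux the load average counts tasks that are running, waiting for CPU, or blocked in uninterruptible (usually disk) I/O, which is why it can stand in for several resources at once. Dividing by core count and the 1.0 threshold are common rules of thumb assumed for this sketch, not figures from the interview.

```python
import os


def load_per_core(load, cores):
    """Normalize a load average by the number of CPU cores."""
    return load / cores


def is_overloaded(load, cores, threshold=1.0):
    # Assumption: more than one waiting task per core means "crushed".
    return load_per_core(load, cores) > threshold


try:
    # os.getloadavg() is available on Unix-like systems only.
    one_minute, five_minute, fifteen_minute = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1-minute load per core: {load_per_core(one_minute, cores):.2f}")
except (AttributeError, OSError):
    print("load average is not available on this platform")
```

A monitoring service such as Datadog collects the same numbers continuously; graphing them per server is what makes a single overloaded machine stand out.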
We also keep an eye on the top level of metrics. We want to make sure response time is not going too high.
We watch proxy errors on our internal load balancers to make sure the application is responding when it should. That lets us tell whether the internet is having issues today or Code School is having issues today, and the former seems more and more common.
I'm not sure why...for example, as we sign up more businesses in India or Europe, we get reports of issues and end up tracing it back to their ISP or maybe that part of the world is under some kind of DoS attack that has a particular line between them and us congested. There's not a lot we can do about that.
It doesn't happen on a weekly basis. You can dump a lot of time into that. We found it useful to monitor internal network performance so we know if it's us or them.
We mostly monitor that through Datadog. We keep an eye on HAProxy.
What about performance metrics? Do you also watch the time to load a page through Datadog?
Datadog and New Relic ... Datadog keeps an eye on how fast the load balancer is getting responses from the servers. New Relic is keeping an eye on how fast we're rendering, how fast the internal performance of the application is going and what the users are seeing on the end.
Let's talk about culture. What's it like working as an engineer at Code School?
I believe we have a very unique culture:
- We have Tuesday and Thursday "work at home" days (as I mentioned before).
- We're big on in-person collaboration, so we try to have everyone here 3 days a week so we can have quick "face to face" meetings.
- If you are a developer, we try to keep meetings to a minimum. These meetings are primarily on Mondays. No daily stand-ups.
- We're big on "do what works for you." We don't lay down a personal development process for everyone to follow. Test-driven, behavior-driven, document-driven, soda-driven development ... whatever development ... is not required.
- We care about output.
- We emphasize keeping the teams small and tight knit. We want to cultivate friendships. We want people happy to come to work and to focus on doing their best for their friends rather than for the company necessarily.
How do you group people? Do groups have a specific focus, or is it more important to have teams that collaborate?
We've gone toward a clear separation of teams that work on individual codebases. Now that things are getting big and complex enough, we've found that there is a lot of value to working on the same thing. Working on the same thing everyday helps "to learn that 70,000 line app," as opposed to jumping around, which winds up being "too much spin-up time."
How big is your team? Do your developers and engineers have concrete principles they have to follow?
We have 16 developers right now. We're really big on "finding your own work"; however, we don't take it to the extreme. If there are gaps in work, we encourage people to find things they're interested in that could improve the company.
We do ask people to work inside pull requests, get their pull requests reviewed, and make sure that tests pass ... We do run continuous integration (CI). We have a general feel for each project as to where the code coverage should be, which is usually around 80%. "How you get there" is not something we lay down the law on.
What tools do you use for continuous integration?
Travis CI is the main thing we use and we have it integrated through GitHub...we manage all our pull requests and issues like that through GitHub as well.
Thank you so much for being on ScaleYourCode, Thomas. How can people reach out to you?
Anyone is welcome to reach out to me at Thomas at codeschool dot com. We have a great support team, and our forums are a good way to reach out to us. We keep a close eye on that. The Code School Twitter account is active too.