Stack Exchange Engineering with Nick Craver
Interviewed by Christophe Limpalair on 01/19/2017
We're joined by Nick Craver, who is a developer, site reliability engineer, system administrator, database administrator, architect, network guy, data center guy, all around debugger, and everything in-between at Stack Overflow and Stack Exchange. His experience gives us a unique viewpoint of how these two massively popular websites run, and we also get a peek into their engineering culture.
Nick mentions the importance of monitoring and error tracking throughout the episode. If you're not tracking errors in production, get Rollbar for 90 days free and set it up in only 5 minutes.
He also mentioned how much he loves his job. Considering the amount of time spent at work, it doesn't make sense to waste it. Hired can help you find your dream job without going through the stressful traditional process. Sign up here for free, and get a $2,000 hiring bonus.
Links and Resources
Nick has written a number of excellent posts on nickcraver.com/blog which talk about hardware and architecture, and the software powering these extremely popular websites. I recommend referencing those as you follow along this episode.
I mentioned this in the intro, but how important do you think monitoring and error tracking is when we are talking about small scale or large scale operations?
You've got to have it and it's so easy to get! There are a lot of third party vendors and a lot of open source tools to be able to do that sort of stuff nowadays.
When you need to improve something, you need to have a way to measure the "before" in order to see if you made progress or if you had an impact.
The other part of error tracking and monitoring is knowing what went wrong after the fact, or being on the edge of something before it happens. Spotting an issue before you "go down" is a whole lot more valuable than simply knowing what went wrong afterward.
The monitoring side usually eliminates a lot of the error side if you are doing it correctly.
Where does your passion come from? You spend a lot of time writing about your work and we can really see the passion. Where does it come from, and how did you get started?
I've been doing IT of some sort since I was about 14 or so, for doctors' offices and hospitals ... networking, setting up the quick servers ... and various other productive or non-productive things. My grandfather got me into computers. He worked at AT&T Bell Labs. My parents fostered whatever I was interested in.
I started in college with CSE (Computer Science and Engineering), and ECE (Electronic and Communication Engineering). I didn't really learn that much in college. I consider myself a self-learner, like most people in our industry. That's not uncommon. I worked as a co-op at GSK for a while and went back as a contractor. They cut my hours short on the co-op. I wanted to take more than 28 hours a semester and they wouldn't let me, and I didn't want to do another semester (I wanted to finish up), so they turned me into a contractor and that worked out well.
I was a contractor for a while. I started long term at some medical, clinical research type of work, building .NET, databases, etc... Now I'm at Stack Overflow.
At one place, I worked with Jeff Atwood and Jarrod Dixon, which probably helped.
So, you said you knew Jeff Atwood and Jarrod Dixon. Were you one of the very early employees at Stack Overflow?
No, I joined later. Jeff contacted me after I was the top user on Stack Overflow for a year. I was bored at my job. It was a validated system. It's not to say you weren't working, but you couldn't make a big impact, which is what developers love to do. You are an engineer. "If it works perfectly, take it apart and put it back together again." We couldn't improve anything without it being validated, so there was a lot of "red tape" which gives you time to answer questions on Stack Overflow.
One of the things we look for is ability, not whether you know this code, this language, or platform. That stuff you learn as you go. It's hard to interview for, "How well do you learn vs. what do you know now?"
We have that problem. I think everyone else does too.
What are some interview techniques you use?
You just keep adjusting your interview process. No one has it down. We have multiple layers of interviews. We say, "from an easy problem to a sanity check":
Does this person have the criteria we need?
Can they develop code?
Can they analyze?
Given certain requirements can they interpret those?
Can they pass that kind of an interview as well as one with hard problems involving specific coding issues in whichever language they want?
Language doesn't matter. It's more about how you write the code.
How do you solve problems? Are you thinking the edge cases through or do you just start writing code (which some people can do while others can't)?
In my experience, most of your time programming doesn't involve a keyboard. Most of my time doesn't involve a keyboard for better or worse, but it ends up with a better system. Brain power should be involved more than fingers.
Is answering questions on Stack Overflow internally encouraged at the company? Is that something you still do in your spare time?
Spare time. We're not paid to do it necessarily, but if we're debugging a problem and we encounter a problem, we are encouraged to post it. If we run into problems, we post them.
Stack Overflow is, for all intents and purposes, a bootstrap device as well. You see this on Server Fault too, which is the admin side of things. When we encounter issues with HAProxy, or this weird thing with transaction buffers down in Linux in a deep dark subsystem no one has known about for the past twenty years, we ask on there and hopefully we get an expert that answers it.
If we find a question we think we would want an answer for, we self answer it. For example, if it's something we wish had been there. In a lot of cases, the engineering side of the company is the primary consumer of our product (our site).
Helping the next person is really important. That's in everyone's blood here which is really cool to work around.
You have different titles depending on which platform I checked (Twitter, your blog, etc...). In one you were a DBA (database administrator), on another you were a Site Reliability Engineer and on another one, Developer. Is that because you're involved in different parts of the application? Do you switch jobs to try different things?
I'm an oddball. I don't really have a role so much at the company as I hop around to help out. This meeting week has been exceptional. This has been my twenty-second time slot meeting this week. Other times, you're in the hangouts all day. "Hey Nick, how do you do this? How do you figure this out?" "How does this system connect to the other system."
My main role is knowing how all the pieces fit together and what things to consider.
If it's SQL (Structured Query Language), I'm the closest thing we have to a DBA (database administrator). I tune and upgrade SQL.
Greg Bray and George Beech are my partners on the SRE (Site Reliability Engineer) side. They help test the SQL upgrades. We work with Microsoft ahead of time. We're in the TAP (Technology Adoption Program). There's an old blog post on why we run SQL 2014 CTP 2 in production, well before RC (Release Candidate) releases, and why it's beneficial to us and you.
Fewer bugs hitting RTM (release to manufacturing) is better for everyone, right?
The closest person to my work at Stack Overflow is my buddy Geoff Dalgas. He also straddles multiple teams. Our delineation is something like this: the SRE team takes care of the sites (the datacenter side of things), and the developers take care of, obviously, Stack Overflow, the codebase, and the Careers/Jobs pieces that you see on Stack Overflow.
I need to do a blogpost on how we internally proxy it to make you think it's a single application (it's fun). It was a very quick way to ship an existing thing.
Each day we may wear three or four different hats. I love doing a hundred things a day. I can't just watch tv. I have to have a laptop and be coding while a movie is on.
It sounds like you have different responsibilities ... by "you" I mean the SRE team, the DBA team. They each have their own responsibilities. Are you the person in the middle that helps communication between the two parties, or how do you make sure you don't have silos where information gets trapped?
I help that, but I am by no means the only person doing that. If you need an SRE, you yank an SRE in. If you don't know who to contact you just ping the team lead and ask for someone that's related to:
- network bandwidth
Two days ago we had a Providence problem. Jason Punyon tells me when it's broken. "If the monitoring hasn't already gone off, Jason's going off." It was two hundred times worse because of the patch we did the night before. Then we rebooted it and it went away. I was furious because I couldn't figure out what it was.
"Reboot solving is infuriating."
For that kind of thing, they know who to grab, and the team lead will know if you don't. You just drop in the chat room. Each team has a chatroom. We use our own software. They are internal protected rooms that people can't see.
There's a "Wheel of Blame". If somebody says it's not my fault, we'll pick somebody and move on. We don't care who broke it. That kind of stuff is very handy for DevOps. Just drop in the room and grab someone. Either you have a point of contact or you can very quickly get one for a project.
All companies struggle with this: How do you get the right people involved early on a project, whether it be decision makers, technical expertise or whatever knowledge you need. I think we're pretty good with that. We're always trying to get better at it.
How did you get such a broad range of knowledge and skills? Was there a point in time where you did have more of a focus on a specific area, and you learned it inside and out and then moved onto a different topic, or has it just been trying different things, moving from one topic to another on a daily basis?
More of the latter. I see something and I'm curious. I'm a big believer of "If you see something and wish you knew how it works, you can. It's the internet." You can learn anything you want with a quick Google search.
Google gets my money's worth. That's what we're all doing all day. We are all learning at this. I did some sysadmin roles at previous companies. Sometimes I was a pure sysadmin, other times a pure developer. At other places, I was the only one learning SQL. Some places didn't have SQL. They were all on Oracle with DBAs that you really wanted to poke with sharp sticks.
When they stripped out all your values from the column names, that was a favorite. They had a program to do that. "There's a special place in hell for those people."
You just learn as you go. Doing multiple things at once helps me learn. I think you have to be learning all the time. I have fun doing that, so it pays off work wise. At the end of the day you want to play a game or read a book, do it.
Do what makes you happy!
It happens that this aligns close to what I do.
I'm in a dream job which is extremely lucky. I hope everyone else gets to do that in their life.
In some of your blog posts, you cover the architecture and hardware. If someone has not read those posts, can you give us a quick overview of what the architecture is, and the environment.
Sure. Primarily, it's a Windows stack: Windows, IIS, SQL Server, and .NET code, mostly C# for our programs. If someone's trying to pitch F#, good luck to them, because we'll battle it out.
There are certain parts of the infrastructure that are interesting, some as vanilla as possible. We're using HAProxy without exotic configs. George Beech is a good source. He blogs actual configs that we have. We have Elasticsearch for our search on the background running on Linux ... 3 node cluster.
With SQL, there is a primary server for Stack Overflow and a primary server for everything else on the network. I detail what the split of those databases is. Those each have two replicas: one in the New York data center (so we have redundancy there) and one in Colorado, where we only have one.
That has more to do with cluster quorum dynamics than cost. The other bits aren't all that interesting. We open source the things that we think are interesting, like Dapper, our ORM (object-relational mapping) layer for accessing SQL from C# or any .NET platform. It's a micro ORM built for efficiency. That's what our big thing is.
StackExchange.Redis is the official Azure cache client and it runs Redis underneath.
We try to be boring. Boring is stable ...scalable. The simpler something is, the higher it scales.
I think that you'll see a lot of that in our infrastructure.
There are nine web servers running production. We can run on one. This is entirely for redundancy. There are two more web servers at the end which are dev and meta. They run a different config. The "nine" are homogeneous web servers with the exact same config. It makes it easier to deploy. You want plain, boring, and simple.
Right now, behind that are three service boxes running. One of the cool things we might be doing on those is buying new ones and sticking GPUs in them. Those run background services. Like that Providence thing I was talking about; it just hits Redis back and forth ... very low latency connections. If it takes more than 10 milliseconds, we alert on it. It's a very low threshold. The 99th percentile of those requests is 1 millisecond. That's the kind of level we want.
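To make the numbers above concrete, here is a minimal sketch of a latency check with a 10 millisecond alert threshold and a 99th percentile summary. It is written in Python purely for illustration; the function names and sample timings are hypothetical and are not taken from Providence or any Stack Overflow code.

```python
import math

ALERT_THRESHOLD_MS = 10.0  # from the interview: alert when a request exceeds 10 ms


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slow_requests(samples, threshold_ms=ALERT_THRESHOLD_MS):
    """Return the timings that would trigger an alert."""
    return [t for t in samples if t > threshold_ms]


# Hypothetical request timings in milliseconds, made up for this example.
timings_ms = [0.4, 0.6, 0.5, 0.7, 0.9, 0.5, 0.8, 0.6, 1.0, 12.5]
print("p99 (ms):", percentile(timings_ms, 99))
print("would alert on:", slow_requests(timings_ms))
```

With only ten samples the 99th percentile is simply the slowest request; a real monitor would compute it over a much larger window.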
They run things like re-indexing related posts you see on the side.
When new ones pop up over time we don't run that live. We do it in the background. The indexing of the sites themselves run on that background tier.
Why the GPUs?
You can go to Stack Overflow and say, "I want Java questions or C# questions, or to write tags" (a big part of what we do). It's not just tags. A lot of people have blogged about this, and we snicker a little because they're solving a simple problem of "What's C#?" You can precompute that in the most common sort. Complicated ones are "What are the C# ones that aren't closed, that were open between these two dates with three duplicates?" All this has different criteria like "exclude this list by this person."
Our system supports all these scenarios, some of which are computationally expensive. Marc Gravell has done a ton of work on this. You have like a binary matrix in memory, and now we can fit all of Stack Overflow (the entire network) in that, I think, 700 megabytes. It's 1.7 gigabytes now. We can look at the Tag Engine, and we're doing that in memory on a CPU.
We figured out that a GPU lines up to this. With a GPU, you can take a simple task and put it across thousands of cores, it's an excellent use case. That is the greatest use case.
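As a rough illustration of why a binary matrix suits this kind of query, here is a toy sketch in Python. It keeps one bitset per tag (a plain integer, one bit per question) and answers include/exclude queries with bitwise operations. The class and method names are hypothetical; this is not the actual Tag Engine, only the general bitset idea, and the same per-bit independence is what makes the work easy to spread across many cores.

```python
from collections import defaultdict


class TagIndex:
    """Toy in-memory tag index: one bitset (a Python int) per tag, one bit per question id."""

    def __init__(self):
        self._bits = defaultdict(int)

    def add(self, question_id, tags):
        # Set the question's bit in the bitset of every tag it carries.
        for tag in tags:
            self._bits[tag] |= 1 << question_id

    def query(self, include=(), exclude=()):
        """Return ids of questions carrying every tag in `include` and none in `exclude`."""
        if not include:
            return []
        result = ~0  # start with all bits set, then narrow down
        for tag in include:
            result &= self._bits.get(tag, 0)
        for tag in exclude:
            result &= ~self._bits.get(tag, 0)
        return [i for i in range(result.bit_length()) if (result >> i) & 1]


index = TagIndex()
index.add(1, ["c#", "closed"])
index.add(2, ["c#"])
index.add(3, ["java"])
print(index.query(include=["c#"], exclude=["closed"]))
```

Here "closed" is modelled as just another tag for brevity; date ranges and per-user exclusions from the interview would need extra structures that this sketch leaves out.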
With the GTX 1080s, which have 8GB of RAM on them now (the old ones had 4), we can take a consumer graphics card for $700 and make a Tag Engine 10 to 20 times faster than our CPU rewrite. We can fit the whole network in card memory now and grow a little bit.
The alternative before was taking an Enterprise GPU ($5000) for 35 to 40% more performance. That's an easy call. We can stick two of these in the server, and it will be exactly like what we have at home. I run two GTX 1080s and so does Marc.
It's a pretty cool thing. We're just trying to get the quotas at the moment. There's a limit of two per customer. That's why we can't get them. That's the entire hold up.
So how do you actually test that? Did you test it locally on your computer and do some performance benchmarking and theorize that if you scaled to a certain level, it would help by a certain amount?
This came out of a mailing list Marc was on, and we decided that it seemed like a thing we could do. We looked into it and decided we could build it ourselves. We got him a GTX 1080 to try it on. Then we got another one to try the active/passive thing we'd need in a server.
After that, he got a GPU (he's running on a GTX 980, I think). At that point we decided it was viable. We have a working product, and we work with Dell and NVIDIA to run it in their labs, so we just hop on and run our workloads, and we ran it on all the hardware we have (server types). We did performance numbers. We took all the load of Stack Overflow and recorded it out. We hacked in some code, added some logging and replayed that load.
It's very simple when you control the whole stack. We decided it was viable and we bought some cards to do it. I got the same thing, reproduced his results.
Now, I just need to build a couple of Dell R730 servers and put some 2U graphics cards in.
If you set something like that up for production, how do you switch over to a new setup while ensuring there's no down time or issues with the processing?
Since there are three Tag Engine servers, we just take one out and fire it up. The Tag Engine server just listens on a port and runs whatever is behind it, so it may be in GPU mode or it may be in CPU mode, and we would just add that server back to rotation.
The important thing is that our requests to the Tag Engine on the background and even Elasticsearch don't go through the load balancer. They go direct. The app is aware of all the backend systems. That gives you a lot more insight into what's going on, what's healthy, what's unhealthy, how many requests it had, what timings it has etc.
Those one millisecond requests I was telling you about in Providence do go through HAProxy because it's easier to load balance across some IIS pools. Also, we get all the logging coming out of HAProxy.
Do you use anything else for logging or is it just going through HAProxy and pulling those logs from there?
HAProxy has built-in Syslog output and you can add whatever captures you want, so the applications send extra headers back with each response. If you go to Stack Overflow right now and look at your response headers, one of them will be a request GUID. We can log that exact request. Those things pass through (it could be how much time we spent in IIS, how many SQL queries we did, how long they took) and these headers come back from the app.
They get piped into the Syslog and shipped off to two SQL servers (one in New York and one in Colorado). They each log independently. So Syslog goes across the land from both places. We have a 10GB MPLS (Multiprotocol Label Switching) between them. We are talking like 5 megabytes of traffic, not a huge volume of traffic to log those, and then we stick it in SQL. The process that sticks it in SQL makes other things possible. We can do whatever we want to at that point.
One of the cool things we have that we work on during a rainy day is that it sends every question view over to Redis via Pub/Sub. We have a process that listens for that and builds a queue of who is actively hitting the site so it can demonstrate it to advertisers. For example, you post a job opening on Stack Overflow and you will know who will be seeing it.
Nothing makes quite as much impact as someone pulling up and saying, "Look at the hits you could be getting right now!" and look at the dots pop up on a map. We could actually do that for every load request we have although I think it would crash the browser.
Find out: can we put 4400 dots per second on a browser without killing it? I don't know.
Do you have any tips for smaller scale environments that are trying to implement this kind of detail where they can see all different kinds of relationships between calls, how long they take, where the errors come from? How do you set that up?
Get something that already exists. A lot of the big companies build this stuff and release it. There's MiniProfiler that we built. That's not ported to .NET Core yet because there are a lot of things to do with that. Then, there is Glimpse which is supported by .NET Core. They were bought by Microsoft. It's a good tool. It's a little more overhead than we want at Stack Overflow. But, long term maybe we might move to it instead of maintaining MiniProfiler.
We are not against anything. We have loyalty to nothing. If there's a better option that comes along, move to it!
The cost of better includes the cost of moving.
So things like MiniProfiler, I wrote a library called Exceptional that logs all your errors to a central place. We take all the errors from all the apps into one SQL central store. That is read by Opserver. The teams can very easily say, "Here are all my apps and errors that I care about today."
Exceptionless is another. There are plenty of these. Grab something that makes a fit. They are all a little different and they match whatever your approach is. Usually, you can very easily plug this stuff in. Just start looking. With MiniProfiler and Glimpse, you see the number right on the page.
If you are looking for performance issues, nothing screams performance like, "Wow! Why is this thing red, and why did it take 90 milliseconds?" You fire it up and see which file is slow, or whatever your case is. Usually there is so much "low hanging fruit".
The simplest thing is to grab something that pre exists and plug it in. A lot of people try to build it the first time. Don't do that. Take an existing tool and learn from it if nothing else for a day, or a week.
A lot of the time, that stuff is built for what you want. If it's almost what you want and open source, you can create PRs (Pull Requests) or issues. At least file an issue and someone else might do a PR. I encourage people to file issues on something that doesn't work. When I build MiniProfiler and work on it, I'm not the only one. Dapper or any of these libraries are being built for us. We know our use case.
There are some issues open for Redis Clustering because we don't use it yet. I can't dogfood this and tell you with high confidence that it's good. When we deploy a new NuGet library, one of the advantages we have working here is that we dogfood it on Stack Overflow. Every library we give you on NuGet is dogfooded. It's published and hidden and dogfooded on Stack Overflow before we hit the go button.
There is a new build of StackExchange.Redis for .NET Core out right now. It's just hidden; (.605) is out there. We're just making sure that it is rock solid under high load before it goes out.
If you have spin up concurrency issues, we'll find them really quickly.
Speaking of those "low hanging fruits," let's say you find something that takes way longer than it is supposed to, and you figure out what is causing it. How do you take the next step in figuring out how to solve the performance issue? For example, there's a database query that's taking longer to respond, how do you open that up and figure out what's going on behind the scenes?
There are two approaches:
If you only know the area it is in, that's one of the reasons we built MiniProfiler this way, with using statements. So if this chunk of code is taking too long, you can add more precise (more granular) usings to see where the time goes. That would be the first step because we don't usually profile everything, only the big stuff. We may need to go a layer down.
If you know it's a database query, you take it out and put it in SSMS (SQL Server Management Studio). If I go to the Stack Overflow homepage, one of the things MiniProfiler has (I'm pretty sure Glimpse has this too) is that you see every SQL query and how long it took, including the parameters, so you can easily copy/paste it and run it in SSMS. The first thing you want to do is reproduce any problem.
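For the first approach above (wrapping smaller and smaller chunks of code in timed steps), here is a minimal sketch of the idea using a Python context manager in place of C# using statements. The Profiler class and step names are hypothetical and much simpler than MiniProfiler; the point is only that nesting timed blocks lets you see which inner chunk accounts for the time.

```python
import time
from contextlib import contextmanager


class Profiler:
    """Records how long each named step took, along with its nesting depth."""

    def __init__(self):
        self.timings = []  # (depth, name, elapsed_ms), appended as each step finishes
        self._depth = 0

    @contextmanager
    def step(self, name):
        depth = self._depth
        self._depth += 1
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self._depth -= 1
            self.timings.append((depth, name, elapsed_ms))


profiler = Profiler()
with profiler.step("render page"):
    with profiler.step("load questions"):
        time.sleep(0.01)   # stand-in for a slow query
    with profiler.step("build sidebar"):
        time.sleep(0.002)  # stand-in for cheaper work

for depth, name, elapsed_ms in profiler.timings:
    print(f"{'  ' * depth}{name}: {elapsed_ms:.1f} ms")
```

Inner steps are listed before the outer one because a step is recorded when it finishes. If "render page" is slow, the nested entries show which part to split further.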
Keep in mind that SSMS isn't always going to be accurate in terms of run time. There are differences with the plan cache, not hitting the same thing, and not using prepared statements. There are all these things that affect it. I recommend the article, Slow in the Application, Fast in SSMS by Erland Sommarskog. It's a very long, old article that details everything to do with this. Every developer should read that.
SQL 2016 adds some bits here which are very handy; I don't know if you've seen live query plans, but you don't just see the query plan of where the cost was, you actually see it as it's running and which operations are in play. There is some overhead to do it, but you're just using it on your debugging and it doesn't fundamentally change the shape of the plan.
You say, "Do you need an index?" Some of that is intuition. Some is just experience.
You ask questions like, "How does this behave?" and "What is connecting?" The main thing is reproducing the problem, including reproducing more detail of the problem. "Oh, it's this hash match join that's slow."
For instance, we can't reproduce one of the things that came from the app and that was fine in SSMS - we're looking at it and we can't reproduce it. It was searching a Bot Detection based on an IP, on the Post table. The reason you couldn't detect it is because it was a varchar passed from the app as a parameter rather than a char which is what the database expects. That was doing a conversion for every row it looked at.
Sometimes you don't see that stuff, but if you had copied the query out of MiniProfiler, it would have been set to declare a varchar (we add the declare statements with the parameters for easy running). That's very handy.
The other thing that we do that is extremely handy is that we add a comment to every line of SQL that runs through our ORM (Dapper). On the way in, we use caller member attributes which say which file it came from and which line number it came from, what method, etc. You can put this in the SQL statement. Here is an example of doing this.
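The example linked in the original is not reproduced here. As a rough stand-in, this Python sketch shows the general idea of tagging a SQL string with where it came from. It uses Python's inspect module instead of C# caller info attributes, and the mark_sql helper and the comment format are assumptions made for this illustration, not Dapper's API or Stack Overflow's actual helper.

```python
import inspect
import os


def mark_sql(sql):
    """Append a SQL comment naming the caller's file, function, and line number."""
    caller = inspect.stack()[1]  # frame of whoever called mark_sql
    origin = f"{os.path.basename(caller.filename)}@{caller.function}:{caller.lineno}"
    return f"{sql}\n-- {origin}"


def load_recent_posts():
    # Hypothetical query, used only to show what the tagged SQL looks like.
    return mark_sql("SELECT TOP 10 Id, Title FROM Posts ORDER BY CreationDate DESC")


print(load_recent_posts())
```

When a slow query shows up in a server-side trace, the trailing comment points straight at the code that issued it.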
Christophe: You are able to see so much information of what is going on at any point in time, and you can always go back and look at the logs and trace back to, "This interacted with this, this is going on here, and try to debug that information."
The more information you have, the better. If you are logging too much so it's noisy, that's a different problem. Most people don't have that problem. Having the raw information is good, but people conflate, "We shouldn't collect that," with "We shouldn't get an alert." No, you should scale back the alerting, but having all the data points is good within reason. If you are filling up hard drives (5 per minute...), that's probably not good.
Christophe: If you end up alerting for small things or action items which don't need to be solved right then and there, you can desensitize alerts and your engineers will start ignoring them, even if they are important.
You said you don't necessarily look at the tiny details but more the overall picture of stats, like performance stats. You don't look at the tiny fractional numbers but the requests. Is that how you reduce some of that noise?
Sometimes, at our scale, that's the granularity that you look at and then you go further down. So, if we are getting, for example, 500 errors per hour, then I start to look at individual requests and see what happens. "Can I reproduce this in any way?" We capture each HTTP request that comes to us (not the form data). The user agent, the time it hit, the route it hit, how many SQL queries ran, and the time it took: that's the kind of detail that goes into a database, in a clustered columnstore in SQL, and we'll also pipe it to Logstash and into Elasticsearch. We have a 300 terabyte cluster for that.
That kind of stuff is very handy because someone comes on Meta (meta.stackoverflow.com), which is our error reporting and bug/feature area. They'll say, "Hey, I can't login," and we'll say, "What was this person's flow? Did their IP change in the middle or did it get kicked around? Why did they get a 514 or something like that in the middle?"
The person who designed this was Jarrod Dixon, who said we should log this stuff in SQL. I said, "Jarrod, you're an idiot. We'll never use that." Now it's used a dozen times a day. He was completely right. I'll admit it. It's tremendously handy to have that data. I can pull those stats for blog posts.
Christophe: Those are very useful and impressive. I remember looking at those stats from 2014. When I read those stats at first, I thought they were for a month. Then I went back, read the paragraph above, and saw that it was actually a 24 hour period. They were monstrous stats, so definitely check out all those stats.
You mentioned Elasticsearch (using it with the database). I saw this posted on your Trello suggestions board as a question: How do you synchronize data between the database and Elasticsearch when you are talking about massive amounts of data?
There are a couple of tricks we use. Sam Saffron (who works with Discourse now) and I found out about rowversion, a seldom used thing in SQL. You'd say, "This number looks like a date time," but it's not. It's just an incrementing sequence number that changes whenever anything on that row changes. That looks appealing for deltas. If you record the last one, and then come back and get everything since it, you know which rows on that table have changed, whether they are new or altered. So, questions and answers are stored in the post table.
If those changed (say the title changed, text changed, body changed, score changed, whatever), those get re-indexed because their rowversion changes. So they go back into Elasticsearch that way. So the service boxes I talked about, those 3 compete for a Redis lock and whoever gets it runs. All of them compete on each site.
Stack Overflow runs every 30 seconds. Most sites run every 5 or 10 minutes for child metas which are seldom in terms of volume in comparison. They are much lower traffic.
When you do an indexing pass, you ask for all the posts and other things. "Give me all the ones since this creation date (a big integer, effectively a sequence number)." "Give me all the ones updated since then and it loops through and indexes them." It's a very simple approach.
The key point is that we store that document in Elastic itself. Elastic maintains its own state and the position it's at. We index New York and Colorado independently from their local SQL server so each one keeps going. If we need to update Colorado to a new version of Elastic, we can, because these clusters don't talk to each other.
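The delta approach described above can be sketched roughly as follows. This is a Python toy, not the real indexer: PostsTable stands in for a SQL table with a rowversion-style column, a plain dictionary stands in for the Elasticsearch index, and the "_checkpoint" key is a hypothetical name for the position document stored alongside the indexed data.

```python
class PostsTable:
    """Stand-in for a SQL table whose rows carry a rowversion-style counter."""

    def __init__(self):
        self._next_version = 1
        self.rows = {}  # post_id -> {"title": ..., "version": ...}

    def upsert(self, post_id, title):
        # Any insert or update stamps the row with the next counter value.
        self.rows[post_id] = {"title": title, "version": self._next_version}
        self._next_version += 1

    def changed_since(self, version):
        changed = [(pid, row) for pid, row in self.rows.items() if row["version"] > version]
        return sorted(changed, key=lambda item: item[1]["version"])


def index_pass(table, search_index):
    """Index every row changed since the checkpoint kept in the index itself."""
    last_seen = search_index.get("_checkpoint", 0)
    indexed = 0
    for post_id, row in table.changed_since(last_seen):
        search_index[post_id] = row["title"]
        last_seen = row["version"]
        indexed += 1
    search_index["_checkpoint"] = last_seen
    return indexed


table, search_index = PostsTable(), {}
table.upsert(1, "How do I parse JSON?")
table.upsert(2, "What is a rowversion?")
print(index_pass(table, search_index))  # first pass picks up both rows
table.upsert(1, "How do I parse JSON in C#?")
print(index_pass(table, search_index))  # second pass picks up only the edit
```

Because the checkpoint lives with the indexed data, wiping the index also wipes the checkpoint, so the next pass starts from zero and rebuilds everything.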
One of the dev local setup bits that we have, which provisions all the developer machines, was made to install Elasticsearch 2.0, but we ran into issues, which I documented on Elasticsearch's issue list.
The 1.x line has a nasty little default, which is multicast discovery. So when we rolled back from 2.0, which was my plan, to 1.x (1.7.1 or 1.7.2), it did multicast discovery, and unfortunately everyone in the office was forming one giant Elasticsearch cluster with their developer machines; not a good side effect.
That actually happened to Amazon early on. Every customer saw each other's data. They had a lot of fun with that one. "Oops!"
The simplest way to set that up is to go to your local SQL server and index your local stuff. It self recovers. The Elasticsearch indexing, itself, is called "Project Cockroach." It's really hard to kill. We had to add some kill switches because it will bounce between service boxes and indexes. If this box is indexing Stack Overflow, the next two trying to take a lock will simply skip over it and go index the next site.
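The skip-over behaviour can be sketched like this. It is a single-process Python toy: LockStore stands in for the shared Redis lock, and the key format and box names are hypothetical. A real setup would also need lock expiry so a crashed box does not hold a site forever, which this sketch leaves out.

```python
class LockStore:
    """Stand-in for a shared lock service; acquire succeeds only if the key is free."""

    def __init__(self):
        self._owners = {}

    def try_acquire(self, key, owner):
        if key in self._owners:
            return False
        self._owners[key] = owner
        return True

    def release(self, key, owner):
        # Only the current holder may release the lock.
        if self._owners.get(key) == owner:
            del self._owners[key]


def pick_site(box, sites, locks):
    """Return the first site this box manages to lock, skipping sites held by other boxes."""
    for site in sites:
        if locks.try_acquire(f"index-lock:{site}", box):
            return site
    return None


locks = LockStore()
sites = ["stackoverflow", "superuser", "serverfault"]
for box in ["service-1", "service-2", "service-3"]:
    print(box, "indexes", pick_site(box, sites, locks))
```

Each box walks the same site list, so whichever box loses the race for one site simply moves on to the next one instead of waiting.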
So they are all racing to index (400 some databases). If you want to recover the thing, you basically just delete the index, and it will completely repopulate. It's a very simple setup. I've open sourced a little of that in Screenshots. We are using the Microsoft TPL (Task Parallel Library) Dataflow to balance the processor allocations there.
There's a blog post on How We Upgrade a Live Datacenter. We upgraded the web servers to higher power CPUs. It was time. They were four years old. They were running fine at 20%, but they said we should spend money that year, so that's what we did.
One service box would kill the Elasticsearch cluster by sending things too fast at that point, so we had to back it off a little. We are hoping Elasticsearch 2.0 improves that.
Honestly, if someone from Elasticsearch came to me and said, "You're doing it wrong," it wouldn't surprise me. There's a lot on the table there, but it works well enough out of the box with very little tuning so we don't go super "non vanilla" there.
You've mentioned Redis a couple of times, now. I know you use Redis as a caching system, but you seem to use it for other things as well. Can you talk more about that?
There are two major points of Redis. One is the caching system. A database in Redis is identified by an integer, and that correlates to our site id in the Sites table in SQL. That's your one to one. Stack Overflow is site one, Super User is two, Server Fault is three (those may be backwards). That's our key space. Then you have prod, or dev, or local, so you have key space isolation. You don't technically need that anymore. It's a safety mechanism in case someone downloads the site and connects it to their own place. We enforce that pretty hard.
When you store keys in there, there are a couple of things we do. Kevin Montrose has a good post on Meta Stack Exchange about how we do L1/L2 caching. L1 is the HTTP cache in .NET, just in the local app pool. L2 is Redis. It's not a transparent cache. We store it manually in Redis. Some other smaller apps might use JSON or something, including in our infrastructure.
Other than that, you are talking about going to an API or database or file (whatever the source of that data is). Sometimes you just want L1 cache. Sometimes you want L1/L2. Sometimes you just go to Redis.
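As a rough illustration of the L1/L2 lookup order described above (local cache first, then the shared cache, then the real source), here is a minimal sketch. It is not Stack Overflow's implementation: the class name, the key format, the TTL, and the plain dict standing in for Redis are all assumptions made for the example.

```python
import time


class TwoTierCache:
    """L1 is a per-process dict; L2 is a shared store (a dict standing in for Redis)."""

    def __init__(self, l2_store):
        self.l1 = {}          # local to this "web server"
        self.l2 = l2_store    # shared between all "web servers"

    def get(self, key, load_from_source, ttl_seconds=60):
        now = time.monotonic()
        hit = self.l1.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]                      # L1 hit: no network trip at all
        if key in self.l2:
            value = self.l2[key]               # L2 hit: one trip to the shared cache
        else:
            value = load_from_source()         # miss: go to the database, API, or file
            self.l2[key] = value               # explicit (non-transparent) write to L2
        self.l1[key] = (value, now + ttl_seconds)
        return value


shared_store = {}
web1 = TwoTierCache(shared_store)
web2 = TwoTierCache(shared_store)
source_calls = []


def load_question_count():
    source_calls.append(1)                     # count how often the source is hit
    return 42


a = web1.get("prod:question-count", load_question_count)   # miss everywhere, loads
b = web2.get("prod:question-count", load_question_count)   # L1 miss, L2 hit
c = web1.get("prod:question-count", load_question_count)   # L1 hit
```

Here the source is only consulted once even though two separate "servers" ask for the value, which is the point of putting a shared L2 behind each local L1.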
Things that go directly to Redis would be things like sets (anything with atomic operations), or an integer that you increment. An example is the top bar of Stack Overflow: your account, your reputation, your messages, how many are in your inbox, etc., would be a direct hit to Redis.
Even for a normal logged-in user, you're talking about hitting it seven times, and that's nothing. We can pipe a quarter million ops per second per server through Redis and it's fine. It's amazing software. We also use it for Pub/Sub. So, when you see a question list or a question itself, you see the score going up and down. That's the WebSockets we have, and they're using Redis.
When an event happens on a web server, we publish a message through Redis. The subscriber, the WebSocket server, is listening on the nine web servers, and they pipe the WebSocket message out to you if you are subscribed to one of those channels, whether it's the score on that particular question or the list of C# questions, where you see new ones come in.
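The publish/subscribe flow described above can be sketched with a tiny in-memory broker. This only shows the shape of the pattern; it does not use Redis, and the channel names, the payload, and the "tab" callbacks standing in for open WebSocket connections are all hypothetical.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a pub/sub server: channel name -> subscriber callbacks."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        """Deliver the message to every subscriber of the channel; return how many got it."""
        callbacks = self.subscribers.get(channel, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


delivered = []
broker = Broker()

# Two browser tabs, each subscribed to a different channel (names are made up).
broker.subscribe("question:123:score", lambda msg: delivered.append(("tab-A", msg)))
broker.subscribe("tag:c#:new", lambda msg: delivered.append(("tab-B", msg)))

# A web server publishes an event when a vote changes the score.
receivers = broker.publish("question:123:score", {"score": 11})
```

Only the tab subscribed to that question's channel receives the message; the tab watching the C# list hears nothing until something is published on its own channel.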
Redis has downstream servers, currently called Master/Slave (they may change that verbiage soon). The child servers in Redis let us sync between data centers. So we have the master and a replica in New York, and then two more replicas in Colorado. There are blog posts with pictures of that in Opserver. Pub/Sub actually travels through there.
A side benefit that was never designed, but that actually works out well, is that we could run WebSockets out of Colorado, or we could spin up a Redis replica in Amazon or Azure or the North Pole and do WebSockets out of there.
WebSockets are something we are working on right now. They are problematic at our scale. Most people don't do them at our scale. If someone does, please let them talk to me. We have so many issues with scaling those out.
People assume that Google, Gmail, and Facebook use WebSockets, but they don't.
What do they use?
It's just polling, long polling, the old tech. Concurrent connections, persistent connections. Each one is a file handle, or two, through the load balancer (frontside, backside). It's late on a Friday, so there's not a lot going on right now on Stack Overflow, but we have a constant monitor of WebSocket connections going through our infrastructure.
Right now, we have 455,000 concurrent connections through one IP, basically, on a load balancer. That opens up 1.58 million file handles. We usually get a hundred thousand more than this during the day.
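To put those two figures side by side, here is a quick back-of-the-envelope calculation. The only inputs are the numbers quoted above; the assumption that the handles-per-connection ratio stays the same at the daytime peak is ours, and the quoted figures are themselves rounded.

```python
connections = 455_000        # concurrent WebSocket connections quoted above
file_handles = 1_580_000     # file handles quoted above (rounded)

# Observed ratio of file handles to connections at this moment.
handles_per_connection = file_handles / connections

# Daytime peak: "a hundred thousand more than this".
peak_connections = connections + 100_000

# Assumption: the same ratio holds at peak.
estimated_peak_handles = peak_connections * handles_per_connection

print(f"{handles_per_connection:.2f} handles per connection")
print(f"about {estimated_peak_handles / 1e6:.2f} million handles at peak")
```

That works out to roughly three and a half file handles per connection, and somewhere around 1.9 million handles at the daytime peak if the ratio holds.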
What we're trying to do is say, "Can we use shared workers?", "Can we use some new stuff?" because if you have ten Stack Overflow tabs open, can we condense that down to one? We have that working, but it requires an extra web request because of the way the shared worker has to be initialized with a URL path.
Since they've locked that down, we have to do it with maybe an IFrame (inline frame) that creates the image which would allow us cross-domain postMessage.
So now we can condense DBA Stack Exchange, Stack Overflow, Server Fault, all those, down to one WebSocket connection for all the sites. That's pretty appealing. If we do that, it means that 455,000, we think based on numbers, will drop to about 100,000 less than that. That means we can enable WebSockets for more people.
On Stack Overflow, you have to have 30 reputation to get them right now because of these technical limitations. As we roll that down, it scales up who can access them. We're working on that.
Just in case someone is listening who doesn't know what WebSockets are, can you explain quickly what they are and how we can use them?
Instead of a really big HTTP connection that has headers, footers and all this orchestration around it, a WebSocket is very lightweight: a simple connection and upgrade. It's almost zero load to keep it open except for the sockets and file handles.
For us to send you a message after the fact is very cheap versus someone polling every five minutes for a new message.
A. It's instant. We pushed you exactly when it happened.
B. It's far less load overall for infrastructure.
It's very handy but has scale issues.
Can you think of any major mistakes you have made, looking back?
On the technical side? I don't think we've had that many. We do a lot of research and we make the best decision we can with the data available at the time. Redis seemed like a good idea at the time. Kevin answers this on Meta as well. Why did we use it? We did some research and it seemed like a good idea, and it has worked out pretty well. If something comes along that works out better, we'll change our mind. That's how brains are supposed to work. We can adapt over time.
We've obviously made some fat finger mistakes. When we were upgrading the data center in Denver, back in June, one of the last things I did was that I changed the buffers for the transmit and receive, and George did this tuning as well and rebooted the tier. It turns out NY is really close to CO on the keyboard, not really, but that's what I typed.
I rebooted the production web tier all in one. Luckily it rolled a little bit, and people only noticed for about ten seconds. Oops! Rebooted nine web servers. Restart-Computer is a dangerous command; it's even more dangerous if you don't give it an argument, because you just reboot your laptop. I've done that.
We've lost data before. One was not our fault. Sam Saffron, before me, did a really good job of cleaning up. There's a podcast on this; Developer's life podcast; took a server hard down and he kind of stitched everything back together from it.
The other was when we were moving both data centers and data from SQL 2008 to 2012.
We lost five minutes of data from one database. One of the T logs (transaction logs) wasn't restored correctly.
I feel bad about that, but there were only three questions. If that's all we ever lose, we're doing well.
We've lost data on internal systems before. We tried Cassandra for a little while. It was an experimental phase. We were using Cassandra. It lost power hard. Someone in our data center in New York made a mistake, and one circuit failure took out all the feeds.
One of the Logstash upgrades went badly on 2.0. Luckily, we were OK. We just tossed the data. That was okay. It wasn't user data. It was just some log stuff behind the scenes. It was duplicate, replication stuff.
I don't think we've had that many failures like big stuff that has gone wrong.
We've had technology fail on us, and I think we've done a great job of bringing it back online. "Fighting fires" is a little bit of fun. We've been lucky.
How do you protect against sudden failures, like power outages?
There are three things we learned from the Cassandra failure:
A. They don't validate their time types or integer types; to some degree, the wire type is not validated when it restores. This is a very scary thing and the reason we moved off. We broke Elasticsearch before that. We learned what went wrong, and you correct for it. Those machines had cheaper SSDs (the Samsung 940 series at that time, 950 now). They don't have capacitors in them. The DC versions are the higher-end supercap versions. Now we use Intel SSDs in everything that can't be lost. With those, power outages don't lose the data.
People say, "I have ten servers. I'm writing in quorum; there's no way I'm gonna lose data." That doesn't work. You have all the data going through the system at the same time. You're in the network, in the server and you're down the RAID controller maybe, and you're going through that same DIMM buffer with roughly the same commit from the same query, at the same time, on all the servers because you blasted it out to eight at once.
It's not inconceivable that all our DIMM buffers had roughly all the same thing in them when they lost power. Server quorum doesn't actually solve this issue. It's not a good thing to think.
We learned to really look under the floor and see where the cables are connected to the data center, that they go to two different panels.
You try to learn from your mistakes and not make them twice. If you fail two different ways, that's acceptable, do it and move on.
We do things that are risky. We do some things live, and riskier things not live. If it's failing over the load balancer back and forth, if it's safe to do it, do it.
What about non technical issues?
We've internally built some things that just didn't work out. We learned and improved our process. In the past we built some things that got passed around and the users said "UGH!" We should have been working on those things. Now, we go to the community way earlier.
Internally, we've also learned. For example, getting the right people on a call earlier is important, and every company can improve at that.
Just don't keep screwing up the same way. That's when you have to have a conversation. You don't get fired for screwing up here. You might get fired for screwing up the same way repeatedly. We don't have that problem.
The main episode ended there. We went on to talk about some office hardware equipment just for fun.
Chris: You were telling me you are setting up some wireless equipment in your house.
The first two pieces of gear arrived. I got some Ubiquiti access points. I got recommendations from people online and here. They are like enterprise grade: for campuses, etc.
You can't just plug these in. You need an engineering background to configure this stuff. Luckily, I happen to have one. You can do this cheaply. A lot of the campus and enterprise deployments take lots of equipment. You are talking $20,000 to $50,000 in hardware.
If we buy a switch for work (like a 10Gb 48-port switch), you are talking about a $9,000 purchase for the contract agreement and support for four years, etc.
This access point on Amazon (1.7 Gbps) is 450 Mbps on the 2.4GHz and the rest on the 5GHz. Hopefully two will do your house. You are talking about PoE, and they are $130.
When you think that a top-end router is over $400 on Amazon, you can buy two of these with the injectors and run the controller and you're cheaper, and you have two access points, which is way better than any central router that can only cover one area.
So, I'm really hopeful about this stuff. I had recommendations. They said, "Yeah, I used to live across the street from a park, and I couldn't get Wi-Fi, so I would just dial into home and crank up the wattage, and then I had good Wi-Fi across the street."
I'm going to point a dish in the attic down in the woods we've cleared out ... see if we can get good Wi-Fi about 400 feet away. I'm hopeful. I'm waiting for a controller key to arrive.
I need to run a cable. Those RVU TVs I told you about require a cable connection, unfortunately. They won't run off the wireless even though it's like 802.11ac.
Chris: I know you have some new hardware as well. You just built a new PC and you were mentioning the 1080s.
I keep the PC build on the website, by the way. There's a desktop build link up top on nickcraver.com. People are curious about what's in there. I need to do a blog post on the office. A lot of people ask what gear I have, because I do a lot of research when I order this stuff.
For example, this treadmill is really good, but I wish the connection on the back wasn't DB-15. The Z-5500 is the same way, so extending it is a pain because it's not the same wiring as a monitor HD15 cable. The treadmill is good, but you can't hold the up/down buttons; you have to press for every 0.1 miles/hour, which is annoying. I figured out how I can plug it into Bluetooth and control it from my desktop. I now click to start my treadmill.
If I click here at 2.5 miles/hour, you hear the beep starting and then it will crank up. I can fully automate it. I'm putting an API on this guy.
I'm looking forward to this kind of stuff ... trying to automate this house. I'll tweet about it. Do you think people care?
Chris: I like it. It's interesting. I just finished my own build a couple of weeks ago. I didn't end up getting the 1080, I got a 1070 instead, which is still a beast. I love it! I've been running some games I don't usually play just to do some benchmarking. Also, I got an M.2 Samsung SSD, and that thing is lightning fast.
I might actually do a blogpost as well about it.
There's a tweet stream somewhere with all the benchmarks and stuff, if that helps you as a point of reference for the stuff I'm getting.
Chris: I'm curious to see how they benchmark, actually.
I have one that seems to be crossing into one lane territory. I'm not really sure why. I'm not using the SSD very much right now. I want to order the 1TB M.2 that's coming out. The thing is that the new M.2s, the SM961s, are the new beasts. They come in 1TB versions, and they are 3.2GB per second read and 1.8 write. They're insane.
Here's the thing. They're PCIe 3.0 x4. The theoretical max of that, from the gigatransfers per second, works out to about 3.9GB per second. How many M.2 SSDs are there? A couple of months ago, there were like 10 out, and we've already maxed out another connector.
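For reference, the roughly 3.9GB per second ceiling falls out of the published PCIe 3.0 figures: 8 gigatransfers per second per lane, 128b/130b encoding, and four lanes. A quick sketch of that arithmetic follows; the function name is ours, and the 3.2GB per second read figure is the one quoted in the conversation.

```python
def pcie_gbytes_per_sec(gigatransfers_per_sec, lanes, payload_bits=128, encoded_bits=130):
    """Theoretical one-direction bandwidth in GB/s for a PCIe link.

    Each transfer carries one bit per lane; 128b/130b encoding means
    128 of every 130 bits on the wire are payload.
    """
    usable_gigabits = gigatransfers_per_sec * lanes * payload_bits / encoded_bits
    return usable_gigabits / 8  # bits -> bytes


pcie3_x4 = pcie_gbytes_per_sec(8.0, lanes=4)    # PCIe 3.0: 8 GT/s per lane
pcie4_x4 = pcie_gbytes_per_sec(16.0, lanes=4)   # PCIe 4.0: 16 GT/s per lane

quoted_read = 3.2                               # GB/s, the read speed quoted above
headroom = pcie3_x4 - quoted_read

print(f"PCIe 3.0 x4: {pcie3_x4:.2f} GB/s, headroom over 3.2 GB/s: {headroom:.2f} GB/s")
print(f"PCIe 4.0 x4: {pcie4_x4:.2f} GB/s")
```

So a drive reading at 3.2GB per second is already using most of what a PCIe 3.0 x4 link can carry in theory, and real-world protocol overhead eats into the remainder.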
People are mocking PCIe 4.0, saying we don't need the wattage ... it turns out the wattage in the card was misreported by everybody. Somebody double checked me on that the other day and said, "I don't think that's actually coming out." It turns out the PCI working group on August 24th came out and clarified. They said, "Here's the SM961 link." You can get them from Australia. They are OEMs.
Chris: Here I see 700 AUDs.
The price is not that expensive. There is 3.2GB per second read and 1.8GB per second write. I have a connector on it that only maxes out at 3.9.
PCIe 4.0 doubles it, so now you can put in an M.2 that's got room to grow. The connector isn't necessarily being outgrown because the bus on it is going up; whereas with SATA, you're locked in, plus you have that long command queue crap.
Chris: The problem I had was when I put it in the top slot on the motherboard and it shut down a bunch of the SATA ports, so I didn't have enough to run RAID for my other SSDs, so I had to move that down. I should have read the manual a little bit closer. I don't know why but I thought it would be the inverse.
You'd have to. That's the reason I built with the E-Series processors (whichever series you want to go with.) You should have Skylake-E eventually coming out.
I do a lot of research on this kind of product before I buy, but even on mine, I didn't understand exactly how it would disable the bottom PCI port on the second card.
I don't think it's documented that well, honestly.
Chris: The compatibility with the CPUs and motherboards is really weird. As you said, they don't have a whole lot of information about it. It ended up working out. I'm pretty satisfied with the build. It's a very nice build.
Chris: I guess we can wrap it up here. Thanks for coming on the show, Nick.
Contact: Twitter is Nick_Craver