Handling trillions of events daily and conquering scaling issues with Keen CTO
Interviewed by Christophe Limpalair on 10/31/2015
How would you build a service that processes trillions of events every single day? What if those events were critical to other businesses' operations? It's intense, and that's exactly why I asked the CTO of Keen IO to tell us how they do it. We also talk about handling critical scaling issues and making sure they never happen again (or even in the first place).
An effective way to avoid terrible scaling issues is to monitor and track errors. My sponsor, Rollbar, is offering free accounts to audience members. Start tracking errors in real-time within 5 minutes.
Links and Resources
Can you tell us a little bit about yourself?01:41
Daniel, or Dan, has been working on Keen for about 4 years now. Before that, he was at Salesforce where he spent 5 years there as an engineer on their developer platform team.
What is Keen.io, and how does it work?02:29
Keen makes it easy for any software developer to collect data from anything connected to the Internet. It could be a browser, mobile app, backend server, or a wearable device.
Once data is collected, it lives on their servers where they handle the hard problems of storing, processing, and securing data. Then, they make it really easy for the owners of that data to ask questions using Keen's API.
For a basic example, you can send data on product purchases and query that data to figure out how many sales of product X happened in the last day.
Once you have those results, you can easily share them through charts, dashboards, or other API integrations.
Keen has lots of different use cases but one of the main ones is for people who want to deliver analytics to their customers. Many of their customers rely on Keen's service for their business to work.
What's your stack?04:22
- Running on SoftLayer
- Nginx (Load balancer)
- Flask (Python, apps)
- Tornado (Python, for API)
- Kafka (Event queuing)
- Storm (Distributed computing)
- Cassandra (Data Store)
They are hosted entirely in SoftLayer today. That decision was made back in 2012 because they needed more IOPS than AWS could give them at the time.
They use Nginx as a load balancer for both the API and their website.
Webservers are Python built on top of Flask. The API is Python built on top of Tornado.
There are a couple of other services hiding underneath:
There is a data processing tier which is where most of the action is. Kafka is what handles their event queuing.
Then they have Storm, which handles the distributed computing for both writing data (persisting it to disk) and querying data.
Finally, they have Cassandra which is their persistence layer.
Dan came from a Java background and wanted to try something new. They could have used many different languages but Python has worked fine. There have been challenges with Tornado but a lot of good things have come from it as well.
You originally used MongoDB, but switched to Cassandra. Why?06:38
One of the co-founders built something similar to Keen at another startup using Python and MonogoDB, so that was another big reason they chose these. They also picked Mongo because it gave them very quick iteration speed. It was easy to get started, get moving, and make changes.
They knew from the beginning that it wasn't going to be what they needed to scale the company, but when they started the company the risk was not "does the tech scale?" rather, "can we deliver a product that people are going to be interested in?"
Once they saw a big business opportunity, they knew they needed to invest in scaling.
How many requests per second? How much data are you handling?07:40
Some of their stats aren't public, but Dan says they get thousands of requests a second. He says that the more important stat is how much data they process—which is trillions of events a day.
Events are rich JSON documents that potentially include hundreds of properties.
Did you get to a point where MongoDB couldn't handle your scale anymore?08:26
It caused some problems and it wasn't serving up customer data the way that they needed it to. It worked for the most part, but not for some of their customers. It was a call they made looking at the future.
What does Cassandra offer that MongoDB didn't?09:02
A distributed system from the core.
MongoDB has sharding but Cassandra was just built to be a distributed system. It organizes data by rows and columns, which maps well to analyzing specific properties within a JSON document. Cassandra gives them the ability to scale by adding more nodes.
Did you decide against something like MySQL because of the nature of your data?09:34
Most customers they talk to have something like MySQL. They've tried creating their events table, and if they achieve any reasonable amount of success, that event table breaks their database. It's too much data for a single SQL node.
How have you tweaked Cassandra to fit your needs?10:08
They've had to do a lot of work to get it to work just right for them. There are certain JVM tweaks you can do like heap size tuning and GC tuning.
There aren't any magic numbers that work here, it depends on your workload.
Daniel says that the most interesting thing they did with Cassandra was to figure out how to store JSON documents in a columnar way. They've had to tweak things at the data model layer like transforming groups of events into columns and then storing those columns in a more compressed format. This makes it easier for them to do bulk aggregations. The most important optimization has been the data model.
If you're interested in learning how they figured out the JSON challenge, check out slides by Josh Dzielak.
How much data can you cache?11:34
This is a big challenge.
Keen advertises the ability for customers to send large amounts of arbitrary data without predefining any schema. At base case, there's no caching.
That being said, there's a lot of caching they can do in practice.
Say somebody is building a dashboard. Those queries get run over and over again. So they do lots of different kinds of caching:
- Recency caching: You've queried that before, so lets keep it around for a little bit.
- Disk layer caching: Instead of pulling data from Cassandra, certain kinds of queries can be cached in memcached for certain periods of time.
For their Keen Pro product, they also have what they call query caching or dashboard caching. You can predefine queries that you want to happen quickly.
How are you using Kafka queues?14:59
Kafka is a core piece of their stack because it is the layer that all incoming data is persisted to. When an event is sent to Keen, they don't consider it safe until it is persisted to Kafka. Then, Storm listens to the Kafka queues and that, in turn, persists data to Cassandra.
What kind of processing do you use Storm for?15:33
All events from all customers go to the same queue. Storm is responsible for collating all these events and collecting them into the appropriate collections (the equivalent to tables in MySQL).
In addition, they make sure that they are keeping data secure and not giving access to anyone who shouldn't have access.
They also do a bunch of data processing related to what they call add-ons. These add-ons can include IP addresses, geo information, parsing user agent strings, and a few other options.
After that, they do storage optimizations like compression and serialization.
What is Storm and why Storm?16:31
Storm is a stream processing service. They went with it because it was the best they could find at the time. Nowadays there are other options including Spark which is becoming quite popular.
What do you use for data visualization?17:14
I told Dan I've used D3.js to visualize data, and asked him what they use to create dashboards and charts.
(If you're interested in learning a little bit about D3, I wrote a tutorial to make pie charts.)
D3 is a very powerful tool that has been used to create amazing documents. There's an example from the New York Times that shows a mapping of the decline of "Stop-and-Frisk", which uses thousands if not millions of data sets, but you can clearly see a difference within seconds of looking at the map. Could you imagine sifting through data on paper instead?
Sponsored: Get a free trial from Rollbar to track errors in your apps (any language, any platform)As you'll read in a short moment, monitoring and error tracking is crucial when it comes to scaling apps. Frankly, even if it's a small app it's still a necessity.
I use Rollbar to track my errors for ScaleYourCode and any other app I build. Why? Because it works extremely well, takes no time to set up, and can scale no matter how big of an app I have or how big of a team we have.
Get your free trial they have going for ScaleYourCode fans, and thank me later :).
What advice can you give other companies on building documentation?19:07
There are a few things that are important to keep in mind:
- Iteration speed: how quickly you can update the docs
- Getting the community involved: the more people can help, the better
- Docs should be elegant and easy to navigate
To involve the community, you could have your docs hosted on GitHub. Alternatively, you could open the docs up for comments and get feedback that way.
Another good point Dan made was that docs should be crawlable by Google and easy to search. This can be tricky with single page docs and should be kept in mind.
You've open sourced a lot of your tools. Was that for the same reason?20:57
Yes. It's to build a community together. Dan says he thinks they have one of the largest open source analytics community out there. It helps them as a business but they also want to help the community out—they wouldn't exist without it.
How do you protect data, and back it up?22:01
From a security perspective they are hosted inside of SoftLayer which is now owned by IBM. Their networks are completely isolated so they are on private networks, and everything is behind a firewall. The configuration is automatically managed by Chef.
How do they store it? Like we talked about, everything eventually gets persisted to Cassandra. They use methods to authenticate against Cassandra so even if someone got access to the network, they'd have to have access to credentials to do anything with the data.
In terms of storage format, they are taking collections of up to 5,000 events at a time and breaking them out into their properties. Those 5,000 properties get stored in a single column.
What do you use to monitor?23:34
This is incredibly important. Dan said they have probably spent more time building monitoring tools than in some ways the core product.
You have to know if everything is working properly. Keen is a critical piece of other companies' infrastructure. They have to know when things are going wrong before their customers, or even before things will go wrong.
They actually use Keen to monitor Keen, at least in part. Of course that doesn't help much if the system is down, so they also use other systems. The biggest one they rely on is DataDog. They use it as a central aggregation point for metrics. Metrics from frontend load balancer stats to Cassandra and JVM. Alerting happens through PagerDuty.
They did have to build custom tooling to monitor Storm. Since a lot of action goes on between Kafka, Storm, and Cassandra, they want to make sure everything is running properly. ie: No jobs restarting or getting dropped, persistence isn't failing, etc...
You had your first failure in 2013. What happened, and how do you handle failures when your service is critical to businesses?28:11
In 2013, when they were still using MongoDB, they ran into an issue where a customer sent a query that should have been killed by their rate limiting, but it wasn't. The result?
Frozen app servers.
Unfortunately, their load balancer was configured to send the requests off to another app server if it didn't receive a response within a certain amount of time. Nginx ended up sending the bad query to every single one of their app servers, grinding the entire service down to a halt. Ouch.
How did they fix it? They ended up doing a few things, but one of the first changes was to switch Nginx from passing requests to another server on error only instead of timeout. If you want more info, read the report here.
Keen monitors 24/7 to be able to detect these failures and prevent them if possible, but sometimes things just go wrong. Systems fail in really unexpected ways, and it can be hard to know how to detect these failures. The way they handle these situations is to get the entire engineering team on board, figure out a protocol, and just get to work.
There was another failure early this year. What happened?30:27
This was a failure in their Kafka tier.
This time, not all services were down. Queries were still working, but since Kafka was down, they were rejecting new events.
The outage was caused by a maintenance event that should have been routine. Kafka started behaving in nasty ways. To fix it, they had to shut the system down, reconfigure it in some critical ways, and bring it back up.
How do you prevent these things from happening again?32:01
A lot of it has to do with monitoring more things and getting more visibility in your systems. You want to detect failures as soon as you can, or even before they are about to happen if possible.
Once these failures happen, take a good look at what caused them and implement new protocols & procedures to make sure they never happen again.
Has your experience working at Salesforce helped you in these intense situations?32:43
Certainly. It's helped Dan a lot, especially in the code design and algorithms phase.
At Salesforce, Ops and Engineering are separate, so he wasn't doing much of the live infrastructure work.
How do you protect neighbors from noise?33:24
Imagine a customer runs a time-intensive and resource-intensive query. How can Keen ensure that this doesn't affect performance for other customers?
This is a tricky one, says Dan. Every multi-tenant SaaS provider has this challenge. From AWS all the way down to smaller SaaS companies.
It all comes down to metering and rate limiting.
Keen makes the best attempt they can at figuring out how much of a 'slice' you can have. Unfortunately, it's not as easy as it sounds.
They have pretty beefy systems and they want them to be used. So, while people should not use more than their fair share, if there's abundant capacity, they'd love for people to be able to use it.
It's a matter of constantly looking at algorithms and signals to see how they could be improved.
How do you set up rate limiting?31:03
From Dan's perspective, rate limiting is one of those things that really hasn't been solved in a generalized way. Dan mentions that AWS has an API Gateway which is going in that direction, but he still feels like there is opportunity to improve.
So they decided to build the system themselves. The first version was built of off MongoDB and atomic counters. It has since then evolved from there and happens in multiple places.
They have a service oriented architecture, with a service that handles their queries which has a separate rate limiter than the one in the Python servers, for example.
Why did you choose SoftLayer, by the way?36:13
When they started the company, there were two main reasons they chose SoftLayer:
- IOPS: They needed to have their database provisioned in a way that would give them a lot of IOPS. Back then, AWS didn't have SSD instances, and the provisioned IOPS weren't quite where they needed to be yet.
- Network: They wanted to be across multiple data centers for data security and redundancy.
SoftLayer gives you network transport across data centers on a private network for free. That was a very big advantage over AWS.
Dan adds that he thinks AWS has pretty much caught up these days. There are still questions around data transfers between availability zones or between regions. At least when it comes to price.
"If I were to start the company again today, I'd probably start with AWS."
As we move towards a world with wearables, how will data analytics change?37:59
"It's definitely going to change things."
Probably the most impactful thing is going to be the scale. There are going to be so many more things connected and generating a lot more data. Having tools to deal with that kind of scale is going to be really important.
Beyond that, you're going to be able to do new kinds of analysis.
For example, you could start having data on what people are doing. Wearables give you this granular detail on what someone is looking at, or what's going on with their body—heart rate monitoring, etc...
Dan does say that it opens the doors to some scary things and we have to be careful with that. They try to be aware of that and make sure people aren't sending them "nasty" things. But on the flip side, you can really do a lot of interesting and helpful things.
If you've ever heard people talk about the virtual reality experience with products like Oculus Rift, it opens the world up to some crazy opportunities. What about when we have technology embedded in our bodies? Brains? Think of the drove of information you could collect, and the kinds of analysis you could run. Crazy, and a bit frightening.
You went to Washington D.C. to offer your perspective on Net Neutrality. Can you tell us about your experience?41:01
The building they were working in was a shared building at the time, and the group that sent him out were looking for startups that had opinions on Net Neutrality. Dan flew out to D.C. and met with a bunch of lawyers and a bunch of other startups who were passionate about it.
They went to the SEC, congress, and talked to a lot of different people. It was interesting to see what political insiders were saying and what the main issues being debated were.
Even though there are a lot of powerful interests out there fighting against it, Dan says it was nice to see that a lot of people really understood the core issues and the perspectives that a lot of proponents of Net Neutrality have.
Reach out to Dan or KeenEmail him: dan at keen.io
How did this interview help you?
If you learned anything from this interview, please thank our guest for their time.
Oh, and help your followers learn by clicking the Tweet button below :)