Building a data analytics engine on AWS, the Simple way
Interviewed by Christophe Limpalair on 05/04/2016
Whether you are a smaller startup or a well-established company, data analytics can give you insights that drive engineering and business decisions. How can you build an engine for this, especially if you have limited resources? Once you do build the engine, how can you make sense of the data? These are two critical questions that Jeff Klukas answers for us in this episode.
See how Simple built an engine on AWS with Redshift and learn from it. Also learn performance tricks and how to efficiently query data from Redshift. How does it all work? Let's find out.
Scale and migrate large workloads after a free month test run on DigitalOcean, and get free Premier Support. Already work on AWS and want to learn it inside and out? Get trained at your own convenience for $9 the first month.
Links and Resources
You were a research assistant for the University of Wisconsin in Madison. Can you tell us a little about what you did there?
I ended up in grad school. I was intending to teach physics at the end of graduate level as a profession, but I got to grad school and ended up working in experimental particle physics on the Large Hadron Collider which shows up in the news every once in a while.
The LHC is the largest particle accelerator in the world. It finally found the Higgs boson, which we have been searching for for forty years. There was a Nobel prize for the theorists who originally came up with the idea. It was an inspiring project to work on because of the scale and being at a frontier of physics.
What that research life was like was doing a lot of computing. Particle physics has been pushing the boundaries of what we can do with distributed computing for decades.
I would often write some C++ code and test it out locally on some data and then send it to a global grid. Some jobs would run overnight. Some would run on our local cluster in Wisconsin, some would run in Taiwan, some would run at CERN in Switzerland where the actual experiment was. You'd come back and analyze the results the next day.
It was awesome to be part of an infrastructure that was that big. I ended up falling in love with programming as part of that. That's how I ended up getting into the kind of position I'm in today where I'm doing software development.
As I was doing research, I found that I was doing a lot of automating work flows for people and seeing analysis tasks that we did over and over again. I became very interested in how to structure this and build tooling so the people I'm working with can be as efficient as they can be. That has been a theme that has kept on going throughout my career.
Now you're at Simple, which is a banking platform. Did you take that experience of automation and finding relationships with data and carried it to Simple?
There are some direct connections in terms of technology. For example, I work with a lot of distributed systems now. A lot of it is also the mindset you get from doing a lot of analysis work in a field like physics. You get results that just don't make sense sometimes, and you need to make sense of it.
You have to go back and trace the pipeline to find out what we did and if it is something that is actually interesting or if it's just some kind of artifact of the way we've gone about doing the analysis.
There are a lot of parallels between that process and what you do to find bugs in code, as well as how to restructure it and make it perform well. The kind of questions to ask and the mindset you develop by doing that sort of analysis has been very useful when thinking about technology.
Dealing with data problems at Simple requires doing the data analysis itself and building infrastructure to make that efficient.
Simple is a banking platform. Can you tell us what it offers?
Simple works in the realm of consumer banking. We offer what you might expect from a bank like a full fledged checking account.
But we are very much a technology company, and what makes us different is that we're trying to build banking for the way people think rather than the way banks work.
We have a checking account, but we also have apps that have saving and budgeting tools built right in and very tightly integrated into your account. That really changes the relationship that people have with their money.
One of the best examples of that is that when you first sign up for simple and log into the app, the number that is presented to you front and center isn't your traditional bank account balance. It's what we call "safe to spend." It is the account balance, but it also takes into account what pending transactions and bill payments are coming up that you have scheduled as well as what goals you have set for yourself.
In the app, we have a system where you can set a goal and it will slowly take money out of your "safe to spend" and sock it away somewhere so you will have spending habits that are consistent with where you want to be and that number you see is "what you can spend today" without causing impact to goals or bills that need to be paid that are coming up.
Is that the kind of data that you're working with where you're crunching those numbers and trying to figure out what the "safe to spend" amount is and fraudulent activity and things like that, or are you working on a different set of data?
Our team sometimes gets involved with those product questions and helping to understand when we have a data problem like that, but that's not our main focus. Our team is really focused on analytics: getting a wider view of what's going on with the business rather than driving the product and writing the code that makes the app work and that gives you someone to talk to when you're on your phone.
So when I talk about analytics, a lot of what we've been focused on is creating a data warehouse which, by the way, we use Amazon Redshift for. It's about getting data from all the parts of the business because we have a lot of back-end databases that are powering all that user activity.
We want to integrate all that information into one database, and also have things like events from our web apps, from our marketing site, and other parts. Having all that in one place so we can do analyses that are across that historical time series without having to hit all those production databases, is important.
You've built a funnel that receives data from different source and then you sort it out and use Redshift to make sense of that data. Can you walk us through how that's set up? How does Redshift receive data? Where does the data come from, etc?
The data architecture that we've set up is largely mirroring what already existed in our back-end architecture. When I say back-end architecture, I'm talking about all the services that power the banking side of what we do and power the apps. They're what you're interacting with when you're using one of our web or mobile apps.
That was built as micro services that are mostly written in Scala. We adopted those same patterns so we've ended up building services to pull in data and then to load that data into Redshift.
The first thing we built was a service to do that loading of data into Redshift. So it reads information in on a message queue, in our case RabbitMQ. It batches that data up, puts it into S3 which is really the way you get information into Redshift.
That's how Amazon has set up the most efficient pipeline. So it batches up to S3 and then it has logic to actually do the copy commands to Redshift to take that data from S3 and get it into Redshift itself.
The rest of our data architecture is about funneling data into that loader service. So we have a variety of different services. One reads from various back-ends. RabbitMQ exchanges and transforms those messages, puts them onto the queue that our loader service eventually reads.
We have one that does queries against back-end databases to pull in new records and again transforms those and puts them on a message queue for our loader to take in.
Then we have one that accepts client analytics, so all those events of users clicking on our marketing site or on our web app or in the mobile apps end up hitting an HTTP service that then sends things off on a message queue to be loaded into Redshift.
(Image from Simple.com's engineering blog)
You have both internal and external facing APIs that receive data from different services you're using (like Postgres), as well as from other clients. After that, it goes to the loader which then batches it up and goes to Amazon S3 which then loads it into Redshift. Is that right?
Did you build the loader itself? Is it custom built?
The loader is custom built and this is one of the heartbreaking things about working in a world where technology is constantly changing. If we were building the same architecture from scratch today, Amazon offers a lot prebuilt that didn't exist a year and a half ago when we were first building this.
Amazon now has Kinesis Firehose. You put messages into Kinesis and it handles all the logic of batching and loading into Redshift for you and that's kind of behind the scenes using Lambda and Kinesis, their message queue system.
We're actually working to leverage some of what Amazon provides now so that we have less infrastructure to maintain ourselves. We're hoping to eventually get to a point where we'll use the Lambda loader part that Amazon offers so that once things are batched into S3, Amazon triggers the Lambda function that can copy the data into Redshift for you.
That will mean a whole set of service that we can take out of our infrastructure and leverage something Amazon manages for us.
They also offer SQS as a queueing service. Did you end up going with RabbitMQ instead because SQS wasn't there at the time or is there a specific reason that you chose RabbitMQ because it offers something that you can't find in other queueing services?
Yes, that's a really interesting question and that's something where again history becomes important. RabbitMQ has been a part of our back-end infrastructure fo many years. A decision was made at he time that it was the right tool for the job. I don't know that we've done a real in depth analysis as to whether it would make sense for us to move all of that to SQS or rather to Kafka.
Actually, what we're doing today is we have Kafka deployed and our back-end services are using Kafka as we're building new infrastructure. It looks like we're headed to a future where Kafka is going to be the backbone both for our back-end's banking production infrastructure and for our data infrastructure.
There are some nice things that Kafka provides. It has durability that RabbitMQ doesn't have. Like, you don't have to be directly subscribed to something. When the events happen, Kafka will keep those around for some number of days or weeks if you have it configured that way.
The durability is nice and the scalability and throughput that Kafka is able to handle is also very nice. It provides a lot of flexibility. Kafka can provide for a lot of use cases, whereas RabbitMq and SQS are a bit more focused in the guarantees that they provide in the places they are appropriate.
We're ending up using SQS a little bit in our infrastructure now, in that we use SQS to talk to some third party partners where they're able to put new information into an SQS queue. So SQS is an important part of our infrastructure now too. It's interesting how we now have three message queues that we're all dealing with in our production infrastructure.
Why Redshift instead of just leveraging PostgreSQL which you already have, or something like MySQL which a lot of people are really familiar with. What does Redshift have to offer that you can't find in those other engines?
At the scale of data which we have today, it's interesting to see things like PostgreSQL and MySQL evolve. PostgreSQL, in some of it's recent releases, has been adding some more analytical functionalities, some functions specifically designed to support that.
With our data scale, our warehouse is about a terabyte in size. We probably could get away with Postgres right now, but our data volume is going to be increasing a lot as we have customer growth and as we are building more pipelines to pull in more data.
I think we will be outstripping what Postgres will be able to do. Something Redshift can do that Postgres can't is that Redshift is an analytical database. It is meant to support being able to load data in and then do analytical queries where you're looking at all the records in a table rather than doing something transactional where you're putting in a record at a time, reading out a record at a time, and looking at information for one particular user.
With something like Redshift, you're generally doing a query where you're looking over all users to generate some aggregate like looking at "What is the average transaction frequency for customers?"
Architecturally, what makes Redshift different from something like Postgres is, first of all, that it is a columnar data store rather than storing by rows and it's also distributed.
The columnar data store like Postgres, every time you insert a record, it stores all the various fields for that row together, so it's easy to go in and pull out a particular row. All that information comes out together.
With Redshift, you only care about one column in a table that has hundreds of columns. With Redshift, you tend to load data in big batches, and it's able to take the data type that's in that particular one column and compress that information together so it's able to store it more efficiently and retrieve it more efficiently for these large queries where you need to get large blocks of data.
Often, there are a bunch of columns that you don't care about, so Redshift can avoid reading all that data for those columns that aren't included in your query. So, having that columnar focus optimizes it for analytical queries.
The distributed nature means that you can scale a Redshift cluster up to be much larger than what you could get from a single Postgres database.
As our data load increases, it's not going to be a problem to move up to ten terabytes and beyond.
Redshift is still SQL based so you can still come from a background of MySQL or Postgres and still have the ability to quickly and accurately pull data. Are there any cons to that or is it all beneficial?
It is beneficial in that it's an interface that's familiar to a lot of people and particularly our analysts on our data team. They spend most of their day writing SQL queries and using that as the basis of what they do.
It's nice to have people come in with SQL knowledge and be able to apply that directly to Redshift, but it's also ind of this false friend in some ways. If you are used to writing a lot of complex joins between a lot of tables in something like Postgres, it can be very hard to get that to perform on Redshift unless you're understanding how your data is distributed to the various nodes of your cluster.
If you write naive joins that don't take into account what keys we're using to distribute information to various places, Redshift ends up having to broadcast entire tables across to different nodes of the cluster.
You can end up with queries that have no hope of ever actually completing because there's so much information that needs to move around.
Can you talk a little bit more about the distributive nature of Redshift and how it is able to distribute the data and how you know how it's distributing it so you can query more efficiently?
When you build the Redshift schema you will notice that it looks very much like creating a table in Postgres. Redshift was actually forked from Postgres 2, I think, so it's very similar to Postgres in that sense, but you'll notice that it has two key concepts built into the table schema that don't exist in Postgres.
You define sort keys and distribution keys for all of your tables. When you create a table, you pick some column that's going to be the distribution key. In our case, we often use the user id as that distribution key. and what that means is that as a batch of records comes in and you're loading that into Redshift, you're going to look at what's the volume of that column for every record that's coming in and it will choose which node it's going to use based on some hash of that value.
With a specific user id, if that's what you've chosen for your distribution key, records associated with that id are always going to go to the same node within the cluster.
Once you have that, if you have two tables that are distributed on user id, that means user ids are already co-located and it's easy to do that join on user id for Redshift.
Whereas if you want to join on any other key of the table, it's very unlikely that those are going to be co-located and Redshift is going to have to do a lot of piping data around to get that to happen.
When your table structure matches the queries that you need to do, Redshift can be amazingly efficient, and we've had some great performance out of it, but when you need to do a query that doesn't match the structure of your cables, it can be difficult (if not impossible) to make that happen.
Is it really trial and error when you're first getting started with Redshift if you don't have a lot of experience, or is it really understanding how that distributive nature works and then customizing the queries around that and then just trying and seeing which queries take a while and which queries need to be changed? Is that how you approached it?
In our case there certainly has been a lot of trial and error. On a certain level, the documentation tells you certain things about what you need to pay attention to. When you are new, it can be hard to tell which of those are fundamentally important or not.
There's been some trial and error to convince ourselves that, "Yes, these distribution keys are pretty important, and you really need to pay attention to that." But I also think that going in we didn't know what queries were going to be important. It wasn't obvious going in what made sense to have for the distribution keys of various tables.
As we did analysis and learned what things were important, it helped us figure out what we should have done for the distribution keys. We ended up changing the structure and migrating to a new structure for a lot of tables.
We've been able to improve performance that way. We also have an ETL(extract, transform, load) solution that we do within Redshift. Nightly we have a lot of SQL jobs where we create new tables based on the kind of source tables that have data flowing into them.
In some cases we're taking the exact structure of a table and changing it to have a different distribution key so it's optimized for some other join that we need to make with another table. We've gotten some mileage and flexibility by duplicating some of the data by distributing or sorting it in a way that's going to make it efficient for analysis.
In that ETL process, it's been nice in that we can change those definitions very easily, and recreate them every night. We're not locked into that structure that we're locked into with the source tables that we're loading data into.
Redshift is actually an Amazon managed service, so are you able to easily plug it into CloudWatch and see some of those performance metrics or bottlenecks that you can quickly try to change or is that using other tools?
CloudWatch does provide a suite of metrics for Redshift. A lot of them are related to the machine level metrics of: what's the network throughput, disc usage, etc.
We do a lot of metrics visualization through Grafana. We have Graphite and StatsD in our architecture, and our services are all sending metrics that we visualize in Grafana by sending them through Graphite or StatsD.
At the time we started using Redshift, there wasn't a good way for Grafana to visualize CloudWatch metrics, so we ended up building a service to specifically pull in those CloudWatch metrics and republish them to Graphite so we could see them in Grafana.
There's a lot of stuff about performance that's important, but that's not published as CloudWatch metrics. There's a lot of information in Redshift's system tables that's important to understand: query queues, are things get backed up? Is there contention on commits ... that sort of thing .
Our service that we have that does this monitoring talks to CloudWatch and it does queries directly to Redshift to pull information out of those tables and publish it to Redshift. So we have a dashboard in Grafana that's able to look at all that information and give us a nice snapshot of how things are performing now and how that compares with previous hours or days.
I assume you can see those system tables in the Amazon Dashboard as well or you can use the API to pull that information and feed it into Grafana or whatever else you're using like that, right?
Within the console you can see those CloudWatch matrix for free. They have visualization for those. The information that's in the tables isn't visualized in the consoles. You need to do queries to Redshift in order to see that information, but Amazon also provides tools in their AWS labs GitHub accounts.
They have a whole suite of tools that give a lot of important administrative queries and provide the framework for you to be able to run those on a regular basis. If you want to have that regularly watched and convert it into metrics, there's some work you need to do on your own.
Amazon has been building up that suite of tools so that you know what queries are interesting and to get important information about performance.
Are you using the same monitoring tools to monitor things like RabbitMQ and Postgres and all those other services in your infrastructure?
Yes, we have all those other services hooked up to Graphite or StatsD. All of that's going to the same place and we have dashboards to monitor all that.
This is another case where tooling is continually evolving. Grafana now offers direct CloudWatch access, so that's a whole suite of stuff we can get directly into Grafana without having to do this roundabout thing of republishing to Graphite.
Things would certainly look differently if we were building it today, but what we have has been pretty reliable.
You also built a tool in Scala that collects and publishes Redshift performance metrics. Can you tell us a little bit more about that?
As I was discussing in our previous question about our whole ecosystem of Grafana and what exists, a lot of our infrastructure has been, "Let's build a service to do a particular thing." It's what our operations team is familiar with.
In particular, we do a lot of Scala so we have a really well defined workflow for how to create a workflow in Scala. We use the Dropwizard framework. In our case for data, very few of our services actually provide an HTTP connection. So we're using Dropwizard which is this web framework, and we're kind of disabling the HTTP part of it and using it as a nice framework for scheduling tasks and defining, "What do you want to do?".
That's the shape this took because we're familiar with it operationally. It was the thing that we know how to build. So we built a Scala Dropwizard service and it's pretty simple. It's organized into a couple of collectors for the different data sources.
One is CloudWatch, one is queries to Redshift system tables, and it just runs each of those periodically, a lot of them once a minute. Every minute it pulls down the new values to the metrics and publishes those to Graphite. Every minute it runs these queries, talks to Redshift and gets information back about things like: what does the commit key look like, how many user queries are sitting waiting to be executed. All those sorts of stats.
We have a dashboard and when somebody notices the performance on Redshift is slow, they'll be on our Slack channel and someone will say, "Hey, Redshift seems weird." Then the first thing we do is go right to that Grafana dashboard. A lot of times it will be obvious.
It'll be like, "Yeah, there are a bunch of user queries sitting around. Why is that?" You'll see there is this huge spike in disk usage which is a clear indication that someone has written a query that's spilling to disk and it's probably never going to complete and we end up killing that query in order to free up that disk space and get other people onto the queue.
Why Scala? The reason I'm asking this is because with Java 8 we've seen companies like LinkedIn where last I checked, they had moved to Scala and again, moved back to Java, I guess where they didn't see those benefits in performance or even finding the talent was difficult. Why are you using Scala?
It's going to be interesting to see how that evolves. We're pretty dedicated to the JVM (Java Virtual Machine). We have a lot of people with experience both in Java and Scala. We have people with experience doing performance tuning on the JVM. So I think it's unlikely that we'll move away from that suite of languages in the JVM as a base, at least any time soon.
But it is interesting to see the benefits that Java 8 has, and it does provide many of the things that made Scala particularly interesting compared to Java. Some of our back-end services are starting to get written in Java 8. We're doing some more migrating of Scala libraries. We're trying to factor that out to be pure Java so it can be used in either Java projects or Scala projects.
I'm not sure exactly where that's going to go. There are some folks that think we should do more Java 8 and simplify based on that, but even with the new features in Java 8 ... I'm working on a Java 8 project right now and it's a great learning experience to understand the things about Scala that are really nice like case classes. You can very easily create this immutable object that's thread safe.
You can pass it around, create copies of it with modified fields without affecting the old version of it. It turns out to be quite involved and error prone to try to match that same sort of pattern in Java. So we have a situation where you can choose between Scala that has a lot of this stuff built in or Java 8 where you might end up pulling in a bunch of third party libraries that are opinionated about how to do things. You woudn't have to use many of these libraries if you were using Scala.
If you have a Java developer, a lot of times they'll try to write Scala in a Java fashion in which you actually lose a lot of the benefits from Scala. You'll also have a few lines of Scala, but to do the exact same thing in Java you would need a few paragraphs. A lot of times you can really compact things up which is also a double edged sword because, in my case since I'm not that familiar with it, I could stare at the code for a few hours and still not really understand exactly what it's trying to do, so I guess it does have a steeper learning curve. How do you feel about that?
I had not had experience with the JVM before coming to Simple. I came into a culture that was writing all Scala rather than straight Java, so Scala was my introduction to the Java ecosystem. Now, we're starting to write more pure Java. That has been interesting to me. There are a lot of things that I take for granted in Scala. It seems crazy that you have to write all this boilerplate every time when you're writing pure Java.
We have some people in the company that come from a hardcore Java backgrounds and end up writing Scala that doesn't take advantage of a lot of the benefits that Scala provides. There are things that each are uniquely suited to.
"I'm expecting that we're going to end up continuing to use both of them in tandem. There will be cases where Scala makes a lot of sense and cases where Java 8 makes a lot of sense."
We've been talking a lot about the back-end of this data analytics engine. At one point in time you have to visualize this data, right? So maybe an engineer, an executive or a client wants to pull data out of it and make sense of it. How do you do that?
That's the whole point of having this infrastructure. At least for us originally, it was the executive leadership team wanting to be able to make clear quantitative statements about, "What is the business doing? Where are we headed? How are customers interacting with our product? What's their relationship with Simple?" and getting quantitative answers to that.
Some of that is Redshift offers a SQL interface. Our CEO, Josh, has a background as a programmer. He's done data stuff in previous careers. There are times that we've just given him access to Redshift. He dives into it and does his own analysis to kind of explore what data is there, which is a cool thing to see.
It's nice to have key people who have a very analytical mindset be able to do some digging on their own and to answer their own questions. To make the data more widely available, we do a bunch of visualization.
You want to talk about our data team. I'm a data engineer. I focus on infrastructure stuff like building out Redshift and the pipelines. But half our team is actually analysts that are focused on analyzing data, understanding the schemas, and answering business questions.
Basically the rest of our company is clients for the data analysts. So they spend a lot of time pairing with individual teams figuring out what kind of questions they're trying to ask, helping them refine what those questions are and how they relate to the information we actually have. Then they do visualization of that data.
Sometimes they're doing it on a case by case basis making presentations to groups, but probably the biggest form of visualization we have now is we use Periscope Data to create web dashboards.
We have a whole page that is various panels that are each a SQL query that's talking to Redshift and pulling information back and often visualizing that as a time series.
One thing you might take a look at is something we try to optimize for. When a customer first signs up, one of the frustrations of being a bank and providing a checking account is that we need to get the customer their debit card in order for them to really start getting the benefit from Simple and start to use the product.
That takes a little while to create the card, ship it off, and get it to them. We want to understand what's going on with that very closely. Our service that ends up talking to our card partners and ordering cards is sending events that end up being sucked into Redshift.
So we have tables that tell us (for a particular user) when their card was ordered, created, shipped and activated. Then we can keep track of who hasn't activated, how long it takes people to activate cards, whether it's trending upward or downward, and if there are things we need to talk with partners about. That's something our marketing team takes very seriously. That's something we can present in a dashboard to show in aggregate how things are going for customers in terms of that initial experience.
We want to know when they're getting their cards, if they are activating them, and if there are things we can do to get them more involved and make them more likely to activate their card when they get it.
Chris: So you can actually see sometimes when you lose the customer. They've opened the account, received the card, and maybe they never activate it. So you can think, maybe they should have had the card sooner, how long did shipment take and how to minimize that time. I can see that being super beneficial.
We definitely keep track of that funnel and the milestones for a new customer before they become active and begin really using their account.
It's a cool thing to see where data can really power a lot of what our marketing team does. We can optimize that experience and make it more pleasant, keeping customers more engaged and informed about what's going on.
We've also set up a whole email pipeline based on that data where we can let customers know when their card should arrive. We can show them what Simple offers in the meantime, and we can send them reminders if they haven't activated their card. It gives us a chance to build a relationship with that new customer.
How do you actually take the data from Redshift and plug it in to Periscope data?
I think one of the things that makes Periscope Data great is that it is a really simple idea. Its dashboard is in SQL, so there's really nothing we need to do other than signing up with Periscope and inputting some credentials for them to be able to connect to our database.
Beyond that it is doing direct queries to Redshift. It's actually a little more complicated than that behind the scenes:
Part of Periscope's magic is that they're running their own Redshift Clusters.
They actually pull data in in batches. They are kind of managing all that so they can provide answers more quickly rather than having to hit our database every time they do some caching and some smart things. But from our perspective, we don't have to know about any of that happening. You provide queries as if you were hitting Redshift directly. They handle whatever translation they need to do to get that to hit their cache.
Chris: So as a customer, all you have to do is write that SQL query and somehow the data just appears. It comes from your information, but behind the scenes they're doing a lot more magic. That's interesting.
When you build realtime data analytics, how can you make sure that when an executive, engineer, customer, marketing agent or whoever looks at realtime data, that the data isn't skewed or doesn't have noise to it that changes the way that we would maybe think of the data itself which would lead us to make a bad decision. How can you really do that and categorize the data in that way?
For the question you asked, there is an infrastructure piece of it. How do we know that the data is reliable and that what we're presenting is indeed actually what we're presenting. That's something we try to handle with alerting. We have these queries that run against Redshift.
Another piece that I didn't mention before about our monitoring service is that we do counts of the rows and various tables. We keep track of how that's changing.
If there is a table that we generally expect, there should be new records coming in every hour, and six hours go by and there's been no update in the number of records, we get alerts that show up in our Slack channel like, "Hey, there might be something going on with that."
We try to keep on top of some of the data issues that way and be able to intervene before someone is doing analysis and getting analysis that looks weird or looking at a dashboard and seeing there has been some fall off. We try to be proactive about that.
We're certainly not perfect at it and there are times when users have to come to us and said, "Hey,does this seem right to you?" and we end up having to figure out some pipeline issue.
The bigger problem that our analysts see is that in any analysis, there's a lot of context that's needed. You can present some graph that shows lines and those lines can be entirely true and infrastructure can be working perfectly, but they're true only if you understand in the sense of the context of where those numbers come from and then you have the right mental model to interpret what those mean for making decisions about what we're going to do or where does it look like we're headed.
That's the fundamental problem of being a data analyst. Really what you want to do is tell stories with data. You want to help people reach good conclusions from these quantities we're able to pull out of our infrastructure.
That can be hard. Something our analysts talk a lot about is averages can be be really misleading. A lot of what we do is aggregating to see what the average customer's experience is with some given part of the application.
What does that mean? The traditional concept of the mean where you kind of add up what it is for everybody and then divide by the total number of people and you get some number out of that.
Often, that's skewed from what the mean is. If you take that list of people and you take the one that's right in the middle, that might be a very different number from what the mean is. It can be a more telling number of what the average customer's experience is.
You might be interested in percentiles of the 99th percentile. If things are bad for 1% of customers that can be a really bad thing too. We need to provide context to really understand what we're looking at. "What does that really mean, what does it represent, why do we think that's really important or not?"
Especially when you're doing dashboards, it can be difficult to make sure that the people that are looking at that have all the context they need. A lot of times it's a problem of defining what you mean by the terms you are using.
So if you have some title on a panel that says "Looking at active customers." What is an active customer? Do we have a definition for that? Yes, we do have a definition. Sometimes there is a nuance of that definition depending on if you're marketing or if you're our risk team looking at fraud. Who do you want to consider to be an active customer?
If you're the finance department, you really care about those fraudulent customers because they end up costing us a lot of money. So we want to include that activity in what we present to our finance team. We might not want to include that in information we present to our marketing team. They're interested in what the experience is like for real customers that are signing up and will be engaged with the product.
Understanding those different needs of different teams has meant that we end up creating dashboards that are tailored to a specific audience. We have a better sense of the context going into it and what is the information they actually want? We have to present those numbers in a way that make sense to the audience.
This is a really interesting topic and a loaded topic as well, and a lot of what you just said can be applied to systems themselves where if you are measuring performance of a database for example. If you're just looking at the averages or that kind of information, what if you miss the big spikes that would tell you a story that something is wrong, but because you're looking at the averages itself, you're not seeing the spikes and think everything is ok.
I feel like this can be applied to all kinds of things, and so the reason I was asking you this question as we've talked about in the past is, a lot of times you'll see these political or financial graphs that are attempting to represent something, and they look very compelling, right? They're telling you a very compelling story, but maybe it's just the way the data is actually presented. If you represent it in a different way, it would tell you a different story and it would compel you toward a different side of things.
What if you have a big decision to make from this kind of data and it is skewing the data toward a position. It's not neutral data, so to speak, and you make the wrong decision. I think we could talk about this for days and we probably need to bring on a data analyst to dig through this. Definitely a very interesting topic!
It's interesting particularly as you talk about politics. There are two different layers there. You get to a point where you want to support a particular opinion and there's a lot you can do to choose how you're going to analyze it to come up with a result you want to show. But even if you're intentions are pure and you don't have an agenda, it can still be really difficult to tell a story that gets an answer that's useful for making that prediction of what's going to happen or for making a decision about what needs to be done.
Chris: To make a long story short, be careful what you believe on social media. Be careful what you share back and retweet because even if it looks right, make sure you're looking at how they even came up with those graphs.
Simple's data analytics engine has had a few years to mature now, so it's a well established platform, and you've had a lot of engineers work on it. If you were to go to a startup that needs this kind of data analytics, how would you recommend they get started without wasting too much time or too many resources?
On one side, it's changing every day in terms of what is available. Things have gotten a lot easier in a year and a half that we've been doing this. Starting today there are different things available than there were a year and a half ago. I also want to emphasize that we had a team of four engineers working on this. We now have six engineers.
It took four of us a good six months to build an initial version of this to have useful information in Redshift that people were using and being able to come up with the answers they need.
That can be a huge investment for a startup. In a lot of cases, that's out of reach (building your own architecture). There are a lot of things out there trying to make things easier. Amazon itself is making things easier. It's a lot easier to use Kinesis Firehose today, for example, and to put messages onto a Kinesis stream.
Amazon handles a lot of details of getting it to Redshift for you. Definitely use those tools that exist. A lot of providers are putting things together to manage the entire pipeline for you.
If you're a small startup, it can well be worth the money to go with a third party solution like that rather than trying to build the infrastructure yourself.
Keep in mind that as you grow, you're going to want to take on some of that responsibility and having a solution where you own the data and you can transition to somewhere else where you have more control over what you're doing.
Chris: Jeff, thank you so much for doing this episode. I learned a lot from it and I really enjoyed researching and speaking with you. If people want to reach out to you or have any questions, how do you recommend they do that?
@jeffklukas on Twitter I will be happy to talk with them.
Chris: I should have mentioned this earlier but Simple has an engineering blog where I really got inspired for this interview. Jeff wrote this detailed article about this entire data analytics engine. It has pictures and graphs which can really help you understand a little more of what we talked about. Simple engineering blog and Building analytics at Simple.
How did this interview help you?
If you learned anything from this interview, please thank our guest for their time.
Oh, and help your followers learn by clicking the Tweet button below :)