Building a Data Analytics Pipeline from Scratch on AWS with Manisha Sule

Interviewed by Christophe Limpalair on 11/19/2017

Imagine you were put in charge of building a data analytics platform for one of your products in a matter of weeks or months. How would you pull that off?

Manisha was put in that position when she helped build our new platform called Cloud Assessments. Her team built a pipeline based on a Lambda architecture, all using AWS services. This architecture is capable of handling real-time as well as historical and predictive analytics. We cover this, and more, in this episode.


Interview Snippets

In this episode, I'm joined by Manisha Sule, who is the Director of Big Data at Linux Academy. She is one of my coworkers, and she's been working on building a data analytics pipeline for one of our new products. She's going to share her experiences building it and answer questions that can help you take that information and apply it to your own products without requiring tens of thousands of dollars or a large team. For example, how can you build a scalable pipeline that collects data in real time, or how can you grab that information and make sense of it? How can you do all of that without spending tens of thousands of dollars, or having years of experience? Manisha, thank you so much for being on the show.

Manisha: I appreciate the honor of being on the show!

You've been busy building big data analytics into the platform for the past few months. I also know you've been busy on the weekends - for example, you just told me that you ran a triathlon. So I appreciate you taking the time to do this. To get started, I'd like for the audience to get to know you better, so could you tell us about your technical background and where you come from?


Manisha: Sure. I grew up in Mumbai, India, where I did my master's in computer science. I knew from eighth grade on that's what I wanted to do - I wanted to get into computer programming. Thankfully, I've had a chance to work with some wonderful people and wonderful teams, and - the dream continues - to be able to work with Linux Academy.

I actually started in enterprise software development with Java and J2EE. I spent about a decade with Sterling Commerce, which was later bought by IBM. And with IBM, that's when I made my switch from software engineering and software development to big data analytics. That's the time when "data science" was becoming a buzzword. Thankfully, I got the opportunity to get started in this interesting field.

My first experience was with Cognos, an IBM product that deals with business intelligence and data warehousing. They were also expanding into more general purpose analytics, collaborating with Watson Analytics. So that's where I started my analytics journey. Later on, I moved to IBM's Spark Center, which is where I got my experience with Apache Spark. I was performing the role of a data scientist and an Apache Spark evangelist. What we would do was work with various clients, understand their needs, create proofs of concept, and help them understand how Spark could fit into their solution.

After that, I came across this exciting opportunity at Linux Academy. And here I am. Initially, I started as a big data instructor. I've launched a few courses on Linux Academy - one on Apache Spark, then Big Data Essentials and Machine Learning. We have one on Azure, one on AWS, and the latest one is on the Big Data Specialty on AWS. So check those out if you haven't had a chance.


Manisha: Exactly.

And even though you have extensive experience with Spark, oddly enough, you didn't end up using Spark for this project. So you joined Linux Academy over the summer last year; please correct me if I'm wrong.


Manisha: That's correct, the summer of last year.

Perfect. Then you moved on to create courses for Linux Academy. Then earlier this year, I think in January, you got a phone call that said, "We've got an exciting opportunity, we're going to build a new product from the ground up, and that product is called Cloud Assessments." Can you tell us more about that?


Manisha: Of course, yes. Cloud Assessments is a brand new product launched by Linux Academy. It really serves two purposes. One is for the student. For example, if the student is wanting to target a particular job or a particular skill level, we provide assessments that students can use to measure where they stand with their skill level. If they're behind, we offer labs, which let them learn, or we offer training to get them up to speed. On the other hand are employers, who may use this as a recruitment tool for screening applicants.

The main theme behind Cloud Assessments is the concept of lean learning. What that means is to offer the right content to the right learner, at the right time, in the right way. So let me explain what that means. By the right content, I mean we do not want the student to get bogged down by too much. Statistically, we have this huge popularity for massive online courses like Coursera and edX, and you might be surprised to know that the rate of completion for these courses is actually less than 10%. The main reasons cited for students dropping out are either it's too difficult for them or it's too basic, or it's just too much and they don't have the time. What we want to do with Cloud Assessments is just offer the right content - exactly what you need, no more, no less - in the right way.

So what does the right way mean? We don't have just quizzes or plain questions and answers. We actually have live labs. For example, if you want to test your skill in AWS, we have a lab that lets you configure an EC2 server or an Elastic Load Balancer. We have a system that checks every configuration step you take, and that tells you how much you got right, how much you got wrong, and enables the learner to do exactly what they want. That's the concept of lean learning on Cloud Assessments.

So you have students go in and take assessments to test their skills, and they can take labs. Where does analytics play a part in the platform?


Manisha: Let me start by explaining the different types of analytics. We have something called descriptive analytics, predictive analytics, and prescriptive analytics.

Descriptive analytics allows you to understand the current state of affairs. So in this case, you can understand how your customer is using your product. Whether the customer likes the product or not. For example, for an assessment - we can determine, what is it that they like about the assessment? What is it that they don't like? What are their preferences and what would they like to see improved? This helps you in not only improving your product, but also in marketing and sales. Another type of descriptive analytics would be to understand a group of students that's more likely to use it, based on their skills, interests, maybe their job title or the industry that they work in. It gives you an immediate view into how your product is being used.

Another one is predictive analytics. This is more of a statistical or machine learning use case, where you uncover hidden trends, patterns, or correlations between the uses of your product. Again, the underlying outcome is to improve your product or retain your students. Or to maybe uncover a trend where students are more likely to terminate their membership - what you can do is take corrective action to prevent such situations.

So that's predictive analytics, and prescriptive is just taking it a step further: what action should you take to achieve a certain outcome? Under this category are also something called recommendation engines. When I say "the right content for the right learner," that's where recommendation engines play a big role. Based on a student's lab history, how many assessments they've taken, or just their user profile, we're able to make recommendations based on the data we have.

So there are several analytics use cases that help you improve the product, make strategic business decisions, to help you sell it to the right people and market it to the right people. Ultimately, the goal is to benefit your product.

That's what I find so fascinating about this story. When you think analytics, you often think, "Okay, this is for established companies that already have massive amounts of data, massive amounts of customers, and can derive value from it." But in this case, that's not what's going on. What you're saying is there are smaller use cases. For example, what's engaging to my users? What isn't engaging? Where are customers dropping off in the user experience? When you're thinking about startups of all sizes, in all industries, you can get a result that benefits you - you don't have to be a massive corporation. In fact, I mentioned that you started working on this sometime in January or maybe early February - what was the lifecycle of that project?


Manisha: Right, so the lifecycle was briefly divided into four or five phases. The first phase was the infrastructure phase - we determine what technology to use, what our data collection pipelines are going to look like, how we're going to process the data. You may have heard of the "data lake" term - it's synonymous with everything big data. So the question is, do we really need a data lake? Or do we already have some version of a data lake? All of those decisions. We need to, first of all, prove that it's going to work for us, and then implement it, and then test it.

The second phase is reporting. Once you have your data pipelines in place, data is flowing in, you have your data stores, then comes using that data. And there are two to three types of uses for the data. One is straightforward reports - you execute straightforward queries on your data stores. For example, if you want to know how many users have signed up, but are inactive, or not really engaging with the product. That's one example of reporting. Another may be to find out who are your student leaders - who's doing a great job of staying engaged? Who are your highest scorers? It's a way of motivating students and users to make better use of the product. That's the reporting side.

Next is analytics. As I said, it has a descriptive and a predictive part - with descriptive analytics, we use dashboards or graphs, which let you not only get historical trends, but also real-time. That's important in today's industry because your organization has to be able to react to real-time events. So in our architecture, we use both real-time and historical descriptive analytics. We've used AWS QuickSight. Along with dashboards, you also have the ability to go to a certain granular level. For example, you want to analyze your geography, where students are coming from. So your descriptive analytics solution has to be able to correlate all these different items.

The fourth stage is predictive analytics, and this is where machine learning comes into the picture. Really, the goal of machine learning is to find hidden patterns. One important concept about machine learning is that it needs data. If you have a very clever algorithm, and on the other hand, a very large dataset, the algorithm with the larger dataset always has a better chance of predicting accurate results. Under this phase, the quality of your data is important. In our situation, we were building our big data solution at the same time Cloud Assessments was being built. So we did not have real data, plus there was a lot of fluctuation - data formats were up for grabs and we didn't have our mechanisms finalized. Not to mention data noise that typically comes in when you're constructing a product from scratch and there's a lot of testing going on. So we had to be sure we weren't polluting our pipelines with noisy data.

Then the fifth phase is maintenance and monitoring. Of course, you repeat - you don't just do it once and be complacent about it. Predictive analytics is a continuous process, you're always tweaking and tuning your machine learning algorithms. Repetition is very important, but also monitoring. Because you have so many different pieces and parts, it's very difficult to monitor all these moving parts and make sure that each one is doing its job, and they're all coming together to deliver the big picture.

How long did it take you to go from that first phone call, saying "We're shifting our focus to this" to something that was able to go into production?


Manisha: We had a very strict timeline - I would say it was about six to eight weeks. That six to eight weeks included learning the technology. As I told you before, I have an Apache Spark background. AWS is something I had just started picking up, but AWS was our choice for multiple reasons.

The first one is that AWS takes care of all the nitty gritty for you - things like scalability, fault tolerance, high availability. You can take these things for granted on AWS. Instead of having to create your own Hadoop clusters, installing Spark, having to integrate it with other products to build an ecosystem. AWS just makes it very easy for you to use these services. It was the solution for getting us up to speed in such a quick and efficient way.

So let's break this down. How are you collecting all this data that you were talking about previously?


Manisha: So on data collection we stick to the big data Lambda architecture. What that tells you is that you have three layers. First is the real-time data collection, second is the batch processing, and third is a serving layer. What real-time means is it gives you an idea of user behavior with respect to time. So what is a particular user doing at this particular second, versus batch processing, which gives you a whole picture of what a user has done over a period of time. You combine real-time and batch into a serving layer to give you the best analytics.
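To make the three layers concrete, here is a minimal sketch in plain Python. It is not the Cloud Assessments implementation: the event shape and the function names are assumptions made for illustration, and real batch and speed layers would run on services rather than in-memory lists.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute totals from the full event history."""
    return Counter(event["user_id"] for event in master_dataset)

def speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(event["user_id"] for event in recent_events)

def serving_layer(batch, speed):
    """Serving layer: merge both views so queries see history plus recent activity."""
    return batch + speed

# Hypothetical events: each one records that a user finished an assessment.
history = [{"user_id": "alice"}, {"user_id": "bob"}, {"user_id": "alice"}]
recent = [{"user_id": "bob"}]

merged = serving_layer(batch_view(history), speed_view(recent))
print(merged["alice"], merged["bob"])  # 2 2
```

The point of the split is that the batch view can be rebuilt from scratch at any time, while the speed view only has to cover the gap since the last rebuild.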

Our architecture was based on this Lambda architecture, where we have three layers, and on top of that, we decided to go the serverless route. What I mean is that as a developer, you're free to just focus on the code instead of your servers - worrying about how to monitor or maintain or patch those. AWS services like Kinesis, Kinesis Firehose, Athena - this was the first time we had used them. Of course there were a lot of proofs of concept that we tried before we delved into this. That was a phase that was really important to the lifecycle. Instead of just diving into a particular service, you have to first ensure that it fits your use case.

You mentioned there's a batch layer for historical purposes. When does it make sense to use that versus the other type - a stream or a speed layer - which collects real time events? How do you know which one makes sense?


Manisha: For batch processing, the goal is to compute historical trends. For example, if you want to understand a timeline, which months are busiest for your product, or how a student is doing over time, what you need is a master dataset. That's what your batch layer does for you. It preserves your entire dataset, collected over time, which you can query and provide these different results. Whereas, on the real-time side, you want to know what your product usage is looking like right now, or the lag when an assessment is created. If you want a real-time look at how your product is being used, you want to capture that data as it's happening. But if you want a historical look, you consider batch layer processing. Ideally, what you do is combine the two - real-time and batch - to get the big picture.

How do you even collect real-time data, though? How do you grab that information and put it somewhere else, how does that work?


Manisha: We'll go to some of the services we've used in this case. For example, Kinesis and DynamoDB. Kinesis has a provision of streams, and there are multiple ways you can read and write to them. They can be written to by external applications; that's what our Cloud Assessments website and our engine does. They use the AWS SDK/API to write events to Kinesis streams - this is how we capture real-time data. So when a student is taking an assessment or a lab, the different events like the start of an assessment, or when a student finishes a lab, are written to the Kinesis stream.
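As a rough sketch of what writing such an event could look like from Python, the function below builds one event and hands it to a Kinesis client. The field names, the event types, and the stream name in the usage note are hypothetical; the only real API used is the `put_record` call, which boto3's Kinesis client provides with the `StreamName`, `Data`, and `PartitionKey` parameters.

```python
import json
import time

def send_event(kinesis_client, stream_name, student_id, event_type):
    """Write one assessment event to a Kinesis stream.

    kinesis_client is expected to behave like boto3.client("kinesis").
    """
    event = {
        "student_id": student_id,
        "event_type": event_type,  # hypothetical values, e.g. "assessment_started"
        "timestamp": int(time.time()),
    }
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records that share a partition key are routed to the same shard.
        PartitionKey=student_id,
    )
```

With boto3 installed and credentials configured, a call might look like `send_event(boto3.client("kinesis"), "assessment-events", "student-42", "assessment_started")`.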

Similarly, with DynamoDB, when an assessment is finished and graded, that's when the student's performance is stored in DynamoDB.

Kinesis streams and DynamoDB streams can then be processed by things like Lambda functions, or you can even have your own custom logic. You can send them to S3 for larger collections, and they can also be processed using Athena tables.
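On the consuming side, a Lambda function subscribed to a Kinesis stream receives batches of records whose payloads are base64-encoded. The handler below is a minimal sketch that assumes JSON payloads with a hypothetical `event_type` field; it only counts event types and does not persist anything.

```python
import base64
import json

def handler(event, context):
    """Decode Kinesis records delivered to a Lambda function and count event types."""
    counts = {}
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded inside the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        body = json.loads(payload)
        event_type = body.get("event_type", "unknown")
        counts[event_type] = counts.get(event_type, 0) + 1
    return counts
```

In a real pipeline the handler would typically write its result to S3 or another store instead of returning it.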

If you're not familiar with these services, be sure to check them out. They're all AWS services and I'll also have a description of them below the video so you can look them up. Manisha, you already listed a few different services like S3, DynamoDB, Lambda, Kinesis. Can you list all the various services you're using in this architecture?


Manisha: I would highlight serverless, for the fact that you only pay for what you use. You can compare serverless to something like EC2 - with EC2, you have to pay by the hour, whereas with serverless, you only pay for what you use. For example, in Kinesis streams, a unit is called a shard. A shard allows you to do 2MB per second for reading and 1MB per second for writing. Now, depending on the data that's going into the stream, that's how you're charged, versus a constant charge for a particular service. Cost is one factor that puts serverless above other technology, and the other is flexibility. You don't have to worry about setting up clusters or adjusting them to scale your application. Things like Lambda automatically scale as per the data that's being processed.
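The shard figures quoted above translate into simple sizing arithmetic. The sketch below uses only the two throughput numbers from the interview and ignores any other per-shard limits, so treat it as a rough estimate rather than a capacity plan.

```python
import math

WRITE_MB_PER_SHARD = 1.0  # per second, as quoted in the interview
READ_MB_PER_SHARD = 2.0   # per second, as quoted in the interview

def shards_needed(write_mb_per_sec, read_mb_per_sec):
    """Return the smallest shard count that covers both write and read throughput."""
    for_writes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    return max(1, for_writes, for_reads)

print(shards_needed(3.5, 5.0))  # 4
```

Here writes are the bottleneck: 3.5 MB per second needs four shards, while 5 MB per second of reads would only need three.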

To go back to your question, the services we use consist of S3, DynamoDB, Kinesis streams, Firehose, Kinesis analytics, Athena, QuickSight, SNS, and the most important of them all is Lambda.

Is there anything that you tried doing that didn't work with the serverless architecture?


Manisha: There were many situations where we had to rethink our strategies. For example, Athena is a fairly new service, and it's an interactive query service - which is serverless, of course. What it does is let you read data in your S3. Let's imagine you have a certain set of data in an S3 bucket. In Athena, you define a schema over that S3 bucket, which lets you create a table that is read-only. Then, using that table, you can execute queries.
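Defining a schema over an S3 location in Athena is done with a `CREATE EXTERNAL TABLE` statement. The helper below just assembles such a statement as a string; the table name, columns, and bucket path are hypothetical, and it assumes the files are JSON so that the OpenX JSON SerDe applies.

```python
def build_athena_ddl(table, columns, s3_location):
    """Build a CREATE EXTERNAL TABLE statement for JSON files stored in S3."""
    column_sql = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{s3_location}'"
    )

# Hypothetical table over a hypothetical bucket prefix.
ddl = build_athena_ddl(
    "assessment_events",
    [("student_id", "string"), ("event_type", "string"), ("score", "int")],
    "s3://example-bucket/assessment-events/",
)
print(ddl)
```

The table is only metadata: the files in S3 are left untouched, and the schema is applied each time a query reads them.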

Typically, Amazon Redshift and RDS are database services that have been proven with complex queries that use "joins" or time windows. We weren't sure if Athena would be able to support that. We were also not sure if Lambda functions could execute queries in Athena. Ultimately, we did find a way to do that using JDBC (Java Database Connectivity). In Athena, you can query using a JDBC adapter, which restricts you to Java. But if you want to use Python, you can use third party libraries that convert your calls to JDBC calls.

The point is yes, there were points where we contemplated using other managed services. But AWS has a good idea of how their services are meant to be used, and they make provisions so there's no use case that gets blocked.

A question that I see quite often - especially since we have serverless content at Linux Academy - is, that sounds really good but what's the catch? What can't we do using these serverless services? You hit some of those on the head. And I know sometimes you can lose a little bit of flexibility because they are managed. They're managed for you, so you often don't have to configure them as much, but you won't be able to go into the server and change some settings at that level. Sometimes you lose a little bit of configuration there. What do you think about that?


Manisha: Yes, definitely. Another example is the timeout of a Lambda function. Lambda functions are typically meant to be used in a microservice architecture. The idea is to build your application in many independent services rather than one monolithic codebase. And in that spirit, your code is not supposed to be a long running job, but something that gets executed in a short amount of time. The timeout on Lambda is currently set to five minutes, and you cannot exceed those five minutes. That puts a restriction on you: what if you have a very complex process? The alternative to such a restriction is putting your code in Docker and running it on EC2, which lets you create your own Lambda service. But it's not the same as a managed Lambda service. At that point, you lose the ability to monitor and integrate with things like API Gateway. So there are obvious pros and cons, like you said, of losing flexibility versus gaining integration points.
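One common way to live with the timeout, short of moving off Lambda, is to check how much time is left and stop early. The sketch below is an assumption about how that could be done, not something described in the interview; it relies on the `get_remaining_time_in_millis()` method that the Lambda context object provides in Python, and the `work` callable and safety margin are placeholders.

```python
def process_until_deadline(items, context, work, safety_margin_ms=10_000):
    """Process items until the Lambda invocation is close to timing out.

    Returns (results, leftover) so the caller can hand the leftover items
    to a follow-up invocation or a queue.
    """
    results = []
    for index, item in enumerate(items):
        # get_remaining_time_in_millis() is provided by the Lambda context object.
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return results, items[index:]
        results.append(work(item))
    return results, []
```

The leftover list is what makes this useful: it turns one long job into several short invocations that each fit inside the limit.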

You mentioned a few different services and talked about how you're using Kinesis and Lambda and all those different services. Can we break it down a step further and figure out - so I know you're using Kinesis streams and Firehose for Kinesis with Lambda - but how does that work? If someone has never used Kinesis or Lambda before, how do those services work together and why did you use them?


Manisha: Sure. The purpose of Kinesis, like I said before, is to capture real-time data. A stream has producers and consumers. In the case of a producer, you have apps that write to your Kinesis stream. Kinesis allows your data to be retained for 24 hours by default. It can also be increased to seven days. Plus, you may want your stream to be consumed by multiple applications downstream. For example, you're getting real live data in a Kinesis stream, and the next step is to process the data. To process it, let's say you have Lambda functions that perform some kind of logic on it, then finally persist it to S3.

Now what if you have multiple Lambda functions that are reading from the same Kinesis stream? In that case, there's a pattern called a fan-out. What you do is connect one Kinesis stream to write to multiple other Kinesis streams to increase your read throughput. There's a lot of flexibility given just one source of data. You can process it in multiple ways. This is the flexibility we were looking for - we wanted real-time data and the ability to process it in multiple ways. The combination of Kinesis and Lambda was perfect for that.
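The fan-out pattern can be sketched as a small function that copies each incoming batch to several target streams. The record shape and stream names are hypothetical; the one real API call is `put_records`, which boto3's Kinesis client provides with the `StreamName` and `Records` parameters.

```python
def fan_out(kinesis_client, records, target_streams):
    """Copy a batch of records from one source stream to several target streams.

    kinesis_client is expected to behave like boto3.client("kinesis").
    Each record is a dict with "data" (bytes) and "partition_key" (str).
    """
    entries = [
        {"Data": r["data"], "PartitionKey": r["partition_key"]} for r in records
    ]
    responses = {}
    for stream in target_streams:
        # put_records accepts up to 500 records per call; larger batches
        # would need to be chunked, which this sketch omits.
        responses[stream] = kinesis_client.put_records(
            StreamName=stream, Records=entries
        )
    return responses
```

Each target stream then has its own read capacity, so downstream consumers no longer compete for the source stream's read throughput.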

The next service was DynamoDB. Now we understand how that data is moving between Kinesis and Lambda, but how does DynamoDB play a role in that?


Manisha: The advantage of DynamoDB is that it's schema-less. That was exactly what we wanted while our product was being built. Let's say we want to store a user profile. It has to be schema-less because there are attributes that aren't present for every user, and we might want to add attributes at different times. For example, social media links - you may not have it today, but as your product matures, you may want to add that information. DynamoDB gives us the flexibility to be schema-less and add or subtract data as we grow.

DynamoDB also has support for streams. What that allows is similar to Kinesis streams, where you can connect your streams to Lambda functions. Whenever there's an update to your DynamoDB object, a Lambda function triggers and processes that change. That fits into the real-time processing story that we've talked about.
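A Lambda function attached to a DynamoDB stream receives one record per item change. The handler below is a minimal sketch that looks for profiles that just gained a social media link; the attribute names `user_id` and `social_link` are hypothetical, and it assumes the stream is configured to include both the new and old item images.

```python
def handler(event, context):
    """Return user IDs whose profile just gained a social media link."""
    updated_users = []
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        change = record["dynamodb"]
        new_image = change.get("NewImage", {})
        old_image = change.get("OldImage", {})
        # Stream images use DynamoDB's typed format, e.g. {"S": "value"}.
        if "social_link" in new_image and "social_link" not in old_image:
            updated_users.append(new_image["user_id"]["S"])
    return updated_users
```

Because the table is schema-less, the handler has to tolerate items where the attribute is simply absent, which is why it uses membership checks rather than direct lookups.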

And again, DynamoDB is a NoSQL database. How it compares to something like MongoDB or other NoSQL databases is that it's serverless, in the sense that you don't have to manage the servers behind it. If you want to spin up a MongoDB cluster, you have to manage that cluster yourself, whereas in AWS, you can say, "Provision this amount of capacity, in terms of read and write throughput," and Amazon promises to handle that for you. Just in case listeners don't know what DynamoDB is.


Manisha: Right.

You've got DynamoDB, you've got Kinesis and Lambda. We also have a service called Amazon S3, which stands for Simple Storage Service, where you can store all kinds of different objects. How does that differ from DynamoDB and why did you use both of those services?


Manisha: S3 has a great integration point with things like Athena. Athena was our query service that would ultimately be fed into our dashboards. S3 is also touted to be very powerful - powerful enough to replace HDFS, which is widely used for big data. HDFS, the Hadoop distributed filesystem, is the core component of Hadoop. Its goal is to be able to store an unlimited number of objects and volume of data, to provide a high throughput limit, and native support for versioning. All of these things are provided by S3. On top of that, you don't need to pay extra for data replication. That's what S3 takes care of under the covers. S3 is also supported by things like Spark and Hive. At this point, we don't use Spark, but there will definitely be a need. As we evolve our analytics product and add more sophisticated intelligence to it, we'll be using Spark for machine learning algorithms. So S3 is a perfect solution to act as a single source of data.

It can also act as storage for unprocessed data. When we have our real-time events coming in, we dump them to S3 as one straightforward pipeline. That's to make sure that if anything goes wrong down the pipeline, we still have a backup for the original form of that data.

S3 is also used to store processed data. In fact, we have a decoupled architecture where the unit is "data, process, and store," it's considered one unit. There are several of these "data, process, store" units where "store" represents S3. Once you have various forms of data stored in S3, there are multiple use cases that can plug into these different buckets that can provide things like reporting and analytics.

You mentioned the term "data lakes" a couple times, and I think that ties back to Amazon S3. Could you describe what "data lakes" means in case people don't know?


Manisha: Sure, so a "data lake" represents a large collection of data in any format. It doesn't demand that data should be processed in a particular way. In fact, data lakes are schema-less, and that's in direct contrast to a relational database, where data has to be structured in a particular way. A data lake is just a massive store where you dump your data.

The reason behind it is that when you're reading from the data lake, that's when you apply your schema. Let's say we have a machine learning use case that wants to look at the correlation between a student's skill set versus a student's assessment scores. At that time, you'll apply that schema on the data lake and read only that particular information. That's how a data lake is meant to be used - you don't need to apply a schema when you're writing to the data lake, only when you're reading it.
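Schema-on-read can be illustrated with a few lines of Python. The raw records and field names below are made up for the example; the idea is only that the data is stored as-is and the reader decides which fields matter for a given question.

```python
import json

# Hypothetical raw records, stored without any enforced structure.
RAW_LINES = [
    '{"student_id": "s1", "skills": ["aws", "linux"], "score": 82, "browser": "firefox"}',
    '{"student_id": "s2", "score": 67}',
    '{"student_id": "s3", "skills": ["azure"], "plan": "monthly"}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: keep only the requested fields."""
    for line in lines:
        raw = json.loads(line)
        # Missing fields become None instead of causing an error.
        yield {field: raw.get(field) for field in fields}

# The schema is chosen by the reader for one specific question.
rows = list(read_with_schema(RAW_LINES, ["student_id", "skills", "score"]))
print(rows[1])  # {'student_id': 's2', 'skills': None, 'score': 67}
```

A different use case could read the same lines with a different field list, without anything being rewritten in storage.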

So how do you pull data from it then, if it doesn't have a schema? You're just dumping the data in there. How do you know what data to pull out of it or what to do with that data?


Manisha: We have these pipelines consisting of Lambda functions, which pull out the data and process it into more schema-oriented data. For example, computing a student leaderboard. We'd want to look at all the assessments, the scores, and the time taken by each student. In that case, we'd have a Lambda function that reads data from the data lake. Again, the data is in different S3 buckets. We have one for assessment history, another for scores, another for user profiles. This Lambda function is able to read from these different buckets, join the data together, and output the top scorers. Our pipeline uses Lambda functions that write to S3, and so, as I said before, S3 acts as a store for processed data and unprocessed data.
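The leaderboard join described above can be sketched in plain Python. The three input lists stand in for data read from the separate S3 buckets, and every field name here is an assumption made for illustration.

```python
def leaderboard(history, scores, profiles, top_n=3):
    """Join three datasets on student_id and rank by score, then by time taken."""
    names = {p["student_id"]: p["name"] for p in profiles}
    minutes = {h["student_id"]: h["minutes_taken"] for h in history}
    rows = [
        {
            "name": names.get(s["student_id"], "unknown"),
            "score": s["score"],
            # Students with no recorded time sort last among equal scores.
            "minutes_taken": minutes.get(s["student_id"], float("inf")),
        }
        for s in scores
    ]
    # Higher score first; ties go to the student who finished faster.
    rows.sort(key=lambda r: (-r["score"], r["minutes_taken"]))
    return rows[:top_n]

# Hypothetical data standing in for the three buckets.
history = [
    {"student_id": "s1", "minutes_taken": 30},
    {"student_id": "s2", "minutes_taken": 25},
    {"student_id": "s3", "minutes_taken": 40},
]
scores = [
    {"student_id": "s1", "score": 90},
    {"student_id": "s2", "score": 90},
    {"student_id": "s3", "score": 75},
]
profiles = [
    {"student_id": "s1", "name": "Ana"},
    {"student_id": "s2", "name": "Ben"},
    {"student_id": "s3", "name": "Cy"},
]

top = leaderboard(history, scores, profiles, top_n=2)
print([row["name"] for row in top])  # ['Ben', 'Ana']
```

In a Lambda function the result would be written back to S3 as processed data, ready for reporting.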

With S3, I know you can store those objects and it can grow a lot - they call it "infinitely scaling," although if they can actually infinitely scale is a question to debate. But you have other features with S3 that make it attractive for data, such as versioning, right? It has high bandwidth and throughput, and it also has something called lifecycle transitions. Can you describe how you're using some of those features outside of just storing objects? Are you using any of those features?


Manisha: S3 is more useful when your data is being frequently accessed. And like I said, there's no need to pay extra for data replication, but you can make use of things like Glacier for archiving your old data. Since our product is relatively new, we don't necessarily have a need to archive old data. But we envision a point, after a year or so, when we'll want to archive this unprocessed, raw data into Glacier, but at the same time, keep it available if it's needed.
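Archiving to Glacier after a year is usually expressed as an S3 lifecycle rule. The helper below builds such a rule as a plain dictionary in the shape boto3 expects; the prefix and the 365-day threshold are assumptions drawn from the "after a year or so" remark, not an actual configuration.

```python
def glacier_lifecycle(prefix, days=365):
    """Build a lifecycle configuration that moves objects under prefix to Glacier."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }

# Hypothetical prefix holding the unprocessed raw events.
config = glacier_lifecycle("raw-events/")
print(config["Rules"][0]["Transitions"])  # [{'Days': 365, 'StorageClass': 'GLACIER'}]
```

With boto3, the dictionary could be applied with `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="example-bucket", LifecycleConfiguration=config)`, where the bucket name is a placeholder.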

You also mentioned Lambda a few times. Can you describe a little more about how it works? I know they have Lambda functions, you can feed it and it executes functions, but how does that work?


Manisha: Lambda is the heart of our serverless architecture, and it also represents microservice architecture. That means you have to divide your functionality into separate pieces of code. The advantage to this is they can be written, tested, and deployed independently. You can configure Lambda in a variety of languages. Whatever code you have, you can upload that into your Lambda, and further, you can set up triggers. Let's say you want to trigger, or invoke, this function when a DynamoDB event happens. Let's say when your user profile is updated with a social media link. At that point, you have a Lambda function that updates that link and goes and fetches some additional details. That's a great example of how you can have your Lambda unit of code respond to an event. Once it responds, it does its processing and it can emit results to various different sources. It can write to Kinesis streams, Firehose, RDS - it's a pretty large variety of services you can output your results to.

The important thing to remember about Lambda functions is they're event driven and stateless. You don't maintain any state - whatever storage you have, you take data from your storage, send it to your Lambda function to do processing, and output to that storage in a different format.

There's one other service that we haven't touched on too much, yet. You've mentioned it a few times and I know it's being used, and that's Athena. What is Amazon Athena?


Manisha: Great question. It's a relatively new service and it's meant to quickly query patterns in your data. Athena is built on Presto, which is an open-source distributed SQL query engine. What that allows you to do is query vast amounts of data in a relatively short amount of time.

The way we're using Athena is to apply a schema on our S3. Whatever data is in our S3 buckets, Athena reads from it and is able to create tables, which can then be queried. Furthermore, these Athena tables can be plugged into QuickSight, AWS's visualization tool. So Athena is an important component in our descriptive analytics solution. It has advantages of being able to yield results very fast because of the Presto support, and being able to integrate into QuickSight, which lets you create dashboards very quickly.

So does Athena actually have the visualization aspect? Or do you have to use something else to pull those queries out and actually visualize it?


Manisha: QuickSight is what integrates with Athena. Under the covers, when you create a dashboard in QuickSight, there's an Athena query that gets executed and fetches the data that gets displayed by QuickSight.

We're starting to run out of time here. I know there's a lot more we could be talking about, but is there anything specific that you think we might have missed or that we should mention as it relates to this?


Manisha: Along the way, there were a few lessons that we learned. One of those is don't let perfect be the enemy of good. What that means is you can start small - don't have a grand vision that you're trying to tackle. That can be very overwhelming and challenging. Instead, it's better to have small successes and build into a more grand vision. Following the Agile methodology, creating smaller goals and proofs of concept to ensure you're going to be able to reach those goals with the technology you've chosen - that's important.

You also have to avoid failure of collection. That means don't assume that because you don't have data today, your pipelines aren't going to work. That's not the case. Have your pipelines ready so that when your data starts rolling in, you're in the position to start making use of that data.

The other thing is it's better to have a diverse set of skills for your team, especially in a microservices architecture. It gives you the chance to develop and deploy in an independent fashion. You may have developers who are experts in Python, some maybe in Java, some in R. Each of these technologies has its own strengths and weaknesses. Having a diverse team will help you in utilizing all of them.

Another thing, especially for a beginner data project, is to have leadership. It's very important that they understand how data can be used to improve the product or make strategic decisions. In that matter, we've been very lucky to have Anthony as our leader. He has that vision of how we can use data.

After you've created your product - last, but not least is evangelizing. If you have a product, but there's nobody to use it, it's meaningless. You may have to go the extra step of training your team to use the tool. If you have a dashboard creator, you have to train your team on how to make best use of those dashboards. That's where all your evangelizing efforts need to be in place, where you're conveying to your company or your team how to make best use of the product you've created.

Manisha, thank you so much. I appreciate your time and your experience in sharing this information. If people have questions or if they're interested in learning more, how do you recommend they do that?


Manisha: We at Linux Academy have a big data team that can answer any of your big data questions. We have a diverse set of experience on AWS as well as Azure, so feel free to reach out to us on the Linux Academy Community and also through Scale Your Code.
