Building APIs, the Cloud Elements way
Interviewed by Christophe Limpalair on 09/19/2016
Cloud Elements is a platform for API integration and management. This episode walks through how they built the platform including their architecture setup, development pipeline, and how they've implemented features like performance analytics. We also talk about engineering culture and communication, which are both very important but difficult challenges.
Great products deserve great tools. Sign up with the fastest growing cloud infrastructure provider, DigitalOcean, and get 1 month free on a 1GB droplet (code: SCALEYOURCODE). Also, get 90 days free ($87 value) of real-time error tracking for all of your apps, with Rollbar. Set it up in 5 minutes and enjoy!
Can you tell us a little bit about your background and what you do at Cloud Elements?
I'm a Computer Science graduate, and I've been working with computers my whole life. After college, I got a job at a large healthcare IT company. I worked with Perl and C# before moving on to Cloud Elements. I'm loving being here at Cloud Elements and have been doing a little bit of everything.
You've been working with computers your whole life - do you mean building the hardware itself, or building software?
Kind of both actually. When I was in elementary school they had the Apple IIe which is where it started. My friend just sent me a picture of an Apple IIe. He loaded a program that said 1994 on it. I started programming and built a few computers, but mostly on the software side.
What do you actually do at Cloud Elements today?
I'm a Director of Engineering. I try to come up with some of the best practices for coding and some of our development processes. As much as I can, I try to continue coding. That's what I love to do. It keeps me on pace with what's going on in our development team as well as figuring out what will work best for our team from the development standpoint.
In the past, I've done a little bit of everything. I was employee number one. There was a lot of everything going on at the time so having a hand in all of that was cool and now it's leading the engineers, being a mentor, and making sure our processes are where they need to be.
Can you tell us about some of the features you offer, and how you built them?
Cloud Elements is an API integration platform. You can publish, integrate, aggregate and manage all your APIs through this one platform.
There are basically two main things you can do with it. You can connect to categories of cloud services; so you take something like CRM (customer relationship management) - you've got, say, Salesforce and others in the space. If you want to integrate to those APIs, normally you have to figure out how to integrate to Salesforce, what their authentication is like, and how does this and that work with their API.
If you integrate with Cloud Elements instead, you've already integrated with and figured out how to integrate with Salesforce and several others. You can just come to us, and the API will look the same which is a pretty cool thing. You don't have to dig through documentation for twenty different endpoints to figure out how they work. You just figure out how Cloud Elements' API works. We've tried to make it as simple as possible and you can connect to those others really easily.
The other major use case for our platform is using it to synchronize data between cloud services. So, if you have some contacts in Salesforce and support tickets in Zendesk and you need data to flow between the two, we have a cool toolkit that lets you integrate those two together and lets data pass to and from it, by transferring data as well and getting those over where you need them to be.
Those are the main points of what our platform is aiming to do.
Let's take an example with using Amazon S3 and Dropbox together. When I upload something to Dropbox, are you able to send a request to your API and also send that to S3 or vice versa? Is that how it works?
Absolutely! It's pretty flexible. We do as much as we can with webhooks, so if you upload something to Dropbox, which has pretty good webhook functionality, they would send us a webhook. You would simply set it up to tell Cloud Elements, "Hey, we received a file." So we would get that.
You could do any number of things with that.
Workflow is the normal industry term for something like this where you could say, "Here's the name of the file. Can you go grab that out of Dropbox and then dump it into S3?" We could do that for sure. That's definitely a use case that our platform would handle.
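A workflow like the Dropbox-to-S3 example can be pictured as a small webhook handler. The sketch below is only an illustration: `FakeStore`, `handle_file_event`, and the event fields are hypothetical stand-ins, not the Cloud Elements API or the Dropbox and S3 SDKs.

```python
from dataclasses import dataclass, field


@dataclass
class FakeStore:
    """In-memory stand-in for a cloud file service (hypothetical)."""
    name: str
    files: dict = field(default_factory=dict)

    def download(self, path: str) -> bytes:
        return self.files[path]

    def upload(self, path: str, data: bytes) -> None:
        self.files[path] = data


def handle_file_event(event: dict, source: FakeStore, target: FakeStore) -> bool:
    """React to a 'file uploaded' webhook by copying the file to the target."""
    if event.get("type") != "file_uploaded":
        return False  # ignore events this workflow does not care about
    path = event["path"]
    target.upload(path, source.download(path))
    return True


dropbox = FakeStore("dropbox", {"/reports/q3.pdf": b"pdf-bytes"})
s3 = FakeStore("s3")
copied = handle_file_event(
    {"type": "file_uploaded", "path": "/reports/q3.pdf"}, dropbox, s3
)
```

The handler only moves data when the event type matches, which mirrors the idea of receiving the webhook first and then deciding what to do with it.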
Let's talk about the infrastructure. What powers all this integration? For example, when you send a request, where does that request go, how does it get processed, and what pieces of infrastructure does it touch?
We leverage AWS pretty heavily, so we have most of our infrastructure hosted there. In fact, I'd say that today all of it is hosted at AWS. From the top part we use Elastic Load Balancing in AWS, and that load balances over to some internal NGINX web servers that we have.
That's where the first touches are. The Elastic Load Balancer goes down to NGINX and then down into Tomcat and then there is a lot of stuff under the covers making all that work.
What about the database? Do you use Amazon RDS (Amazon Relational Database Service) or did you roll out your own on EC2 instances?
We did not have all our infrastructure on AWS initially, so we use PostgreSQL for our database. We also leverage Elasticsearch as a datastore. So for those things that don't really match the data model of Postgres, like time series data - we throw into Elasticsearch.
We host them on AWS, but we don't leverage too many of the AWS specific services yet.
Is there a specific reason for that, or is it because you didn't start there and you haven't transitioned there yet?
Probably, yes, we transitioned to AWS for all our environments relatively recently. So we haven't put a ton of research into those other services that AWS offers. We are definitely not opposed to it. There are a lot of benefits that we can see, but it just hasn't happened yet.
Christophe: Sometimes you don't have any major difference, so why would you spend the time and risk downtime when you might not see a drastic difference.
What do you run to manage all those things? Are you using something like Chef or Puppet or Ansible, or how do you manage the infrastructure?
We tried to use Ansible for a while and it kind of got unwieldy for the amount we wanted to do with it, so we took a step back and started with just basic shell scripts to get some of our infrastructure going.
Going forward, we're moving more into the microservice realm and making use of VPCs (Amazon Virtual Private Cloud) with AWS so we can segregate some of our microservices. For those new microservices, we're using Terraform from HashiCorp as our infrastructure deployment tool.
That has been working pretty well for us as far as getting things up and running quickly and eventually allowing us to scale easily as well.
Speaking of scaling, how do you handle spikes in traffic? I assume sometimes you get more API requests than other times. Do you have Auto scaling running, or how does that work?
We actually don't have auto scaling at the moment, so when we get spikes, we actually have a pretty good amount of spare capacity I would say. We have a certain amount of app servers that can handle the spikes as they are.
We're definitely growing, so we're on that path to getting more of that elastic and automatic scaling going on. Currently, if we needed it, we would use some of those Shell scripts or Terraform scripts to get those up as quickly as we could.
Would you say you have any major pain points as far as scaling goes that you haven't really addressed yet or is it pretty smooth for the most part?
We started as a monolithic application, so there are a lot of pain points. Over a couple of years, we've tried to pull some of those things out, figure out what they are, and fix them. What we've started to do, rather than rewrite the whole thing (not a good idea in my opinion), is to pull out the things we think we could turn into microservices.
The other big pain point is putting the wrong kind of data in Postgres. We started out putting a bunch of time series data in there, so we were always running ourselves out of disk and overloading the database when it shouldn't have been. We've been migrating those kinds of things out to Elasticsearch and other scalable database systems that can handle that sort of thing and are designed for it.
That has all been kind of resolved and it's running pretty smoothly right now, but we've got more work to do pulling additional things out.
Those are the pain points we've had with having a Monolithic Application and a "tough to scale" database in Postgres.
What are the main differences between Postgres and Elasticsearch?
They're pretty different in that Postgres is your more relational database model, and Elasticsearch is schema-less so you can just sort of put data into it, and you can slice and dice the data however you want, but it's not necessarily in a structured query language like with Postgres.
We're not experts in these, so there are probably better ways to do things that we just haven't come across yet.
Postgres is tough to scale. You can scale vertically. You can add ram and disk and things like that, but it's tough to get Postgres to scale horizontally, meaning sharding or moving information from node to node where you don't have to worry about, well, "I've got a terabyte disk already. How much bigger can it get?" You can just add more nodes in an Elasticsearch situation. You get redundancy with Elasticsearch and the scalability with it.
With time series data, and event driven data, there is tons of data, so throwing all that into a relational database didn't really make sense for us even though we were doing it for a while when we had low volume.
Putting it in Elasticsearch was a lot better. The reason we chose Elasticsearch as opposed to Mongo or some other system was that we were already using Elasticsearch for some of our logging and monitoring and things like that.
We already had it set up and had the servers there, so we decided the best option was to stick with what we know. As you mentioned before, is it really worth it if the benefits are not necessarily that great? Elasticsearch has been working pretty well for us. We've put in some custom dashboards using Kibana on our ELK Stack. We use Logstash as well.
You actually just answered the question I had in mind of why Elasticsearch as opposed to Solr (Apache Solr). Do you know of others that people can look up?
It was a matter of, "We've already got this. It's working well. Let's just keep going. If there are any other pain points, we can change."
Speaking of Elasticsearch and searching, I know you give some insights to customers regarding performance of the API and other things. Do you also run any internal analytics that you can search through Elasticsearch?
Yes, absolutely. We have been working on some internal dashboards and actually improving our metrics and monitoring over to Elasticsearch that we can then pull back.
We have some dashboards that show API response time, and then traffic by account and by element.
An element in our system is kind of like an endpoint: Salesforce or Dropbox or something like that. So we can see what's happening. We can slice and dice the data pretty well so we can figure out who's generating traffic and what endpoints are getting hit the hardest. We definitely leverage Elasticsearch for a lot of those things.
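To make "traffic by account and by element" concrete, here is a minimal sketch in plain Python. In practice this kind of grouping would more likely be an aggregation query run inside the datastore; the document fields (`account`, `element`, `ms`) and the `summarize` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical metric documents, one per API call.
calls = [
    {"account": "acme", "element": "salesforce", "ms": 120},
    {"account": "acme", "element": "dropbox", "ms": 300},
    {"account": "globex", "element": "salesforce", "ms": 80},
    {"account": "acme", "element": "salesforce", "ms": 160},
]


def summarize(docs, key):
    """Group call documents by `key`; return call count and mean response time."""
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0})
    for doc in docs:
        bucket = totals[doc[key]]
        bucket["count"] += 1
        bucket["total_ms"] += doc["ms"]
    return {
        name: {"count": b["count"], "avg_ms": b["total_ms"] / b["count"]}
        for name, b in totals.items()
    }


by_account = summarize(calls, "account")
by_element = summarize(calls, "element")
```

The same helper answers both questions (who is generating traffic, and which endpoint is hit hardest) simply by changing the grouping key.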
So that's the kind of data you're talking about where it doesn't really make sense to store it in Postgres? Instead, it's better to store that kind of information in Elasticsearch?
Exactly. Metrics is one of them and another one we use in the application itself is like event data. So when I mentioned getting an event from Dropbox, we would store that event in Elasticsearch versus something like Postgres because we get thousands and thousands of events continuously, so we are putting those in Elasticsearch because it can handle it and scale that way, but Postgres kind of falls over in that scenario. We put that time series event related data in there as well.
Out of curiosity, you mentioned event driven data. Do you guys make use of AWS Lambda at all?
We don't actually. We looked into it a little bit, and I think that some of the limits that are currently on there were not quite up to what we would need because we have some payloads that are pretty big. Some of the event payloads are quite large and would blow out what Lambda would allow.
We're definitely keeping an eye out on that one and there's some potential there for some of the things we are doing. We just haven't leveraged that yet.
As far as security and protecting against things like DDoS attacks, or Noisy neighbors where you have one customer who sends a lot of requests to an endpoint and that might affect other customers. How are you able to implement a layer of security before that even reaches your infrastructure? Do you have something in place for that?
That's something we're actually in the middle of implementing. So we've got two things there:
- Rate limiting
- Noisy neighbor
As far as rate limiting, we haven't really had the need to do that yet. Like I said, we're in the middle of analyzing the data that we have, all of our metrics, figuring out what would be the right way to limit calls and how we want to do that.
Do you do like AWS does and charge per use or is there a tiering system? We're figuring that out now. It hasn't been a huge problem. We can pinpoint things as they happen because we're not megascale yet.
For the Noisy neighbor problem, we leveraged Kafka a lot in our infrastructure so when we get an event or something, we dump it into Kafka versus "Let's try to handle it right away," which kind of means you could get flooded with events. We have metrics around Kafka, so if we dump all our events to Kafka and then pull them off when we can, we can see if there is a queue that is growing rapidly and we can figure out what's going on pretty quickly with our monitoring and alerting.
If there are any issues there, we can shut that off.
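The "queue that is growing rapidly" check can be sketched as a comparison of backlog samples over time. This is a simplified illustration, not the team's actual monitoring: the queue names, the sample data, and the `growing_queues` function are all hypothetical.

```python
def growing_queues(samples, min_growth):
    """Return queues whose backlog grew by at least `min_growth` between
    the first and last sample. `samples` maps queue name -> list of depths."""
    alerts = {}
    for name, depths in samples.items():
        if len(depths) < 2:
            continue  # not enough data to judge a trend
        growth = depths[-1] - depths[0]
        if growth >= min_growth:
            alerts[name] = growth
    return alerts


# Hypothetical backlog depths sampled once a minute per queue.
samples = {
    "events-0": [10, 12, 9, 11],
    "events-1": [10, 400, 2500, 9000],
    "events-2": [5],
}
alerts = growing_queues(samples, min_growth=1000)
```

Because events are buffered rather than handled immediately, a check like this has time to fire before the backlog affects request handling.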
For the Noisy neighbor thing, we can use Kafka's partitioning which we partition by a subset of accounts so, if you have one noisy neighbor, they're not going to take out everybody.
They might take out one or two of the accounts in there, but at least it gives us enough time to figure out what's going on in the Kafka queue and find out who's causing the issue and then what we need to do about it.
I would say Kafka has been our big helper there as far as our message broker. We can dump things out there versus taking things as they come and then potentially falling down under a huge load. We can defer that to Kafka and then pull it off when we can.
So in Kafka, you have a subset of accounts that are walled off from the rest. Is that a software imposed restriction or is it hardware imposed? How does Kafka work in that domain?
Yes, that's software. Basically we have an N-number of app servers, and they connect to Kafka with an N-number of partitions. Those partitions get assigned to a particular app server. You can actually have a partition go to two or three app servers. Only those app servers will be working off that partition. Then you've got the other app servers that aren't dealing with that particular partition.
In that way it's possible to segregate large amounts of traffic in a particular area from other parts of the application.
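The isolation described here can be sketched in two steps: map each account to a partition, then spread partitions across app servers. The code below is an assumption-laden illustration, not Kafka's own partitioner or consumer assignment logic: CRC32 is just a convenient stable hash, and the sketch gives each partition to exactly one server for simplicity.

```python
import zlib


def partition_for(account_id: str, num_partitions: int) -> int:
    """Map an account to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(account_id.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, servers: list) -> dict:
    """Spread partitions across app servers round-robin."""
    assignment = {server: [] for server in servers}
    for p in range(num_partitions):
        assignment[servers[p % len(servers)]].append(p)
    return assignment


assignment = assign_partitions(6, ["app-1", "app-2", "app-3"])
noisy = partition_for("noisy-account", 6)
# Only the server that owns the noisy partition sees the extra load.
affected = [s for s, parts in assignment.items() if noisy in parts]
```

A noisy account lands on one partition, so only the accounts sharing that partition, and the server working it, feel the burst.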
Chris: It sounds like Kafka is a really good option for that kind of use case for sure.
Kafka has been really great. It was a little tough to integrate with initially. We're in Java, so trying to figure out what to do with partitions, what to do when we're shutting down, committing and things like that. There were some definite challenges with that, but now that we have it the way we want it, it has definitely been a good one for us.
Are you able to cache any of these calls or is it different information every time where you're not able to actually cache it?
We use Redis as a caching layer. We actually use Redis for lots of different things. We use it for the initial step in our Logstash system. So basically, all our logs go directly to Redis first and then Logstash pulls them off and puts them in Elasticsearch. That keeps things very asynchronous.
As far as actual caching goes with Redis, what we can do is cache our Element instance data. Each of our Element instances basically contains authentication information, transformation of data information, and that kind of thing.
So, how you set up your Salesforce accounts to look and whatever your OAuth token for Salesforce is, we can cache that in the caching layer at Redis and if we're getting thousands of those calls, we don't have to go to Postgres for it, we can go faster to Redis in-memory store for that.
If we have to switch out OAuth tokens, refresh OAuth tokens, or if someone goes in and changes the transformation or something, you will have to dump that, but for the most part, people aren't changing their instances that often. We can keep that warm in cache and then get some pretty good gains off that.
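The pattern described here is often called cache-aside: read from the cache, fall back to the database on a miss, and drop the entry when the underlying data changes. A minimal sketch, assuming a plain dict in place of Redis and a lambda in place of a Postgres query (both hypothetical placeholders so the example runs on its own):

```python
class InstanceCache:
    """Cache-aside lookup for element instance data.

    `cache` stands in for Redis and `load_from_db` for a Postgres query.
    """

    def __init__(self, load_from_db):
        self.cache = {}
        self.load_from_db = load_from_db
        self.db_reads = 0

    def get(self, instance_id):
        if instance_id in self.cache:
            return self.cache[instance_id]  # warm hit, no database trip
        self.db_reads += 1
        value = self.load_from_db(instance_id)
        self.cache[instance_id] = value
        return value

    def invalidate(self, instance_id):
        """Drop the entry after a token refresh or transformation change."""
        self.cache.pop(instance_id, None)


db = {42: {"element": "salesforce", "oauth_token": "token-v1"}}
instances = InstanceCache(lambda instance_id: dict(db[instance_id]))

first = instances.get(42)
second = instances.get(42)  # served from the cache
db[42]["oauth_token"] = "token-v2"  # token refreshed at the source
instances.invalidate(42)
third = instances.get(42)
```

Three lookups cost only two database reads here, and the gain grows with the number of calls made between changes.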
When you say instances on your platform, do you actually mean AWS instances that you spin up in the backend?
An instance in our system refers to an authenticated instance into an endpoint. So, if you come into Cloud Elements and you say, "I'm going to connect to Box or something," you click "add instance" and that will take you to the OAuth flow for Box, and then it will create an element instance.
You mentioned Logstash, etc. Are there tools that you use apart from Elasticsearch that you use to collect and aggregate all that data and then store and analyze it?
One of our dreams is to have all our logging and analytics in one spot, but that's not the reality currently. We have a bunch of different systems that we use for monitoring and alerting and metrics and things like that.
We use Datadog actually quite a bit which is really easy to use. You just put in their agents and they have integrations to pretty much anything; Tomcat, Redis ... essentially all our stack. You can just install one of their agents, tell it you have a Redis server, and it will start shooting data over to Datadog. You can immediately start seeing results of that.
You can easily put monitors on that, alerts, anything you need. We have been leveraging that for some of the system level kind of things, RAM, CPU, etc.
It tells us some throughput and response times as well, kind of "canary in the coal mine" type stuff there.
So that's how you get some of those response times that you can pass off to user and then say, "This is how long it took to process this." Does it also involve other times to reply to or to create and finish the API call?
From a New Relic standpoint, the response time isn't terribly helpful. What we provide to users on the console in our web page in our web console is actually response times and the number of API calls you make, and that's a little bit different. That actually is stored in Elasticsearch.
When you make a call into Cloud Elements, the first thing we do is say, "Start the clock," and the last thing we do is say, "End the clock," and we kick that out to Elasticsearch and that's how long your call took.
Obviously, that doesn't take into account the network delay between the client and us, but at least it tells you how long we took to handle your request.
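The "start the clock, end the clock" idea can be sketched as a context manager that wraps request handling. The `metrics` list stands in for the documents sent to the metrics store, and `timed_call` and its field names are hypothetical.

```python
import time
from contextlib import contextmanager

metrics = []  # stand-in for documents sent to a metrics store


@contextmanager
def timed_call(account, element, clock=time.perf_counter):
    """Record how long the wrapped block took, even if it raises."""
    start = clock()
    try:
        yield
    finally:
        elapsed_ms = (clock() - start) * 1000
        metrics.append({"account": account, "element": element, "ms": elapsed_ms})


with timed_call("acme", "salesforce"):
    time.sleep(0.01)  # pretend to handle the request
```

Since both timestamps are taken on the server, the recorded value covers only server-side handling, which matches the caveat about network delay.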
We have some Kibana dashboards to show us those values as well, and then we can slice and dice that data as well.
You offer something called data mapping with your element mapper, and I understand this allows you to map and normalize data objects and fields between those cloud services that we're talking about. How does that work?
Basically, you can open up the data mapper, and what that allows you to do is choose an object from your endpoint. I'll use Salesforce because that was one of the first ones we did really well as an endpoint. You can choose accounts. Then, what we try to do as much as we can is to discover what fields you have on an account object at Salesforce. We say you've got first name and last name, account name, whatever it is, and we try to populate that as best we can.
We have a couple of ways to discover it. Then you can create your own object in our system called "My Account." Then you say, "I want "My Account" to have first name and last name. That's all I need." Then you can go into another CRM system, pull that same window up and then match those fields. Maybe it's "first underscore name."
You can map "first underscore name" to your "My Account" object. Then you can do transformations to your "My Account" object. Then you can make API calls against "My Account".
Rather than getting the exact data from Salesforce or the exact data from "System B," whatever it is, you can just say, "Get me My Account," and no matter which system it comes from, it's going to look like "first name/last name" as you want it to look. That gives you a lot of flexibility because in your own system it helps. You can always expect the same structure of data in your system regardless of which endpoint you're hitting.
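A rough sketch of that mapping idea: each source gets a table from its own field names to the common "My Account" fields, and everything unmapped is dropped. The vendor field names, `FIELD_MAPS`, and `to_my_account` are invented for illustration and are not the actual element mapper format.

```python
# Hypothetical field maps: vendor field name -> "My Account" field name.
FIELD_MAPS = {
    "salesforce": {"FirstName": "firstName", "LastName": "lastName"},
    "system_b": {"first_name": "firstName", "last_name": "lastName"},
}


def to_my_account(source: str, record: dict) -> dict:
    """Keep only mapped fields and rename them to the common shape."""
    mapping = FIELD_MAPS[source]
    return {
        common: record[vendor]
        for vendor, common in mapping.items()
        if vendor in record
    }


a = to_my_account(
    "salesforce", {"FirstName": "Ada", "LastName": "Lovelace", "Industry": "Math"}
)
b = to_my_account("system_b", {"first_name": "Ada", "last_name": "Lovelace"})
```

Both records come out with the same shape, so calling code never needs to know which system the data came from.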
It is also nice for syncing data between systems. Basically, you can say, "I only need to sync the first and last name anyway, so let me make an object for that," and then you can move that around to a different system. It affords you a lot of flexibility there.
When it comes to the code, how do you test your API endpoint calls to see whether it's your code logic that has issues or if it's another API that isn't returning the proper response because maybe they have something going on on their end? How are you able to have that kind of visibility in your code?
Our elements builder is another one of our integration toolkit products. You can essentially create an element just by describing it. As long as it's a relatively RESTful endpoint and it has OAuth authentication, you can create an element by describing the resources, "What do I need to save off of it?" and things like that.
To go along with that you can also try it out and trace what's happening. If you say, "I want to try this out," we will show you exactly what we're calling on the endpoint and what they're returning. You can see if it looks right to you and whether it's what is supposed to come back in general.
That's how users of the system can check to see if things are working the way they should.
Obviously, as developers of our platform, we use that to create new elements. When we are doing it, we have access to code so we can step through and say that the endpoint is not returning something we think it should and then figure out how to fix that. We try to give as much flexibility and power to the user to do that as well.
Let's move into how you do testing overall.
In our engineering team, we are all responsible for creating those tests, and making sure and proving that what we are doing is working. We've tried to make our testing as easy as possible for our engineers to just go in and say, "This endpoint should be able to go in and do a full CRUD (create, retrieve, update, delete) on accounts."
That becomes part of a test suite, so when we run our test suite, that runs, we make sure it works, and if something is wrong we can take a look at it to see what has changed or what we need to change.
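A "full CRUD on accounts" check might look something like the sketch below. `FakeAccounts` is an in-memory stand-in for a real endpoint and `check_full_crud` is a hypothetical helper; the real suite would call live elements.

```python
import itertools


class FakeAccounts:
    """In-memory stand-in for an accounts endpoint (hypothetical)."""

    def __init__(self):
        self._rows = {}
        self._ids = itertools.count(1)

    def create(self, data):
        row = {"id": next(self._ids), **data}
        self._rows[row["id"]] = row
        return row

    def retrieve(self, row_id):
        return self._rows.get(row_id)

    def update(self, row_id, data):
        self._rows[row_id].update(data)
        return self._rows[row_id]

    def delete(self, row_id):
        return self._rows.pop(row_id, None) is not None


def check_full_crud(endpoint):
    """Run create, retrieve, update, delete and return the steps that passed."""
    passed = []
    created = endpoint.create({"name": "Acme"})
    if "id" in created:
        passed.append("create")
    if endpoint.retrieve(created["id"]) == created:
        passed.append("retrieve")
    if endpoint.update(created["id"], {"name": "Acme Inc"})["name"] == "Acme Inc":
        passed.append("update")
    if endpoint.delete(created["id"]) and endpoint.retrieve(created["id"]) is None:
        passed.append("delete")
    return passed


result = check_full_crud(FakeAccounts())
```

Returning the list of passed steps makes it easy to see which operation broke when an endpoint changes its behavior.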
Can you tell us more about your development culture as a whole: the pipeline or the application life cycle ... when does it touch which team and how do the teams collaborate?
We have a couple of different inputs into the products. We have sales requests, people wanting to know if we do a particular thing. We realize that it's a great idea and that we don't do it yet, and then we decide, "Let's do it." So it comes in that way.
We have customer feedback, from existing customers saying, "This works great, but I'd love it if it did x, y, z...." We also have internal input which is engineers deciding "... we could definitely do this better." All those things are what come into our pipeline.
Then we have product management which is just one or two people right now. They decide if things are worth doing, "Is it things a bunch of people are requesting?" Once we've decided that, "Yes, this is something we really want to do," then we pull that in and create user stories for it.
Actually, we're leveraging GitHub and ZenHub together for our Kanban/Scrum boards. We create user stories for that. Once those get in and someone pulls that off the top, we prioritize everything. Someone pulls that off the top and they are responsible for setting up any design sessions or talking with other engineers, getting opinions of how we think things should work or if it's not clear or if it's a bigger story, things like that. Then we go on from there.
How would you deal with this: say you went to a new organization. You are a few weeks or a month in, and you realize the entire development pipeline is in disarray. Maybe you show up and say, "You guys haven't really worked on having a pipeline and there's no collaboration between different teams like the QA team, development team, and security team." How would you change that and instill a culture where there is a pipeline and collaboration?
Definitely! We've struggled with that too. We had a QA team for a while. It's a tough thing to figure out when something goes over to QA, when does it come back and how do you deal with committing to points on the sprint, or anything like that.
I think the tools are really helpful. Taking some feedback from the engineers, we tried to streamline that into, "Let's just have one place to go, which is GitHub."
Then ZenHub integrates into that, and let's make sure everybody is onboard, product management and everybody is working off the same playbook. They're all in the same place. Let's be sure it's all organized in one spot.
All you have to do is go to one spot for your work. You can figure out what you need to do based on that.
Communication, tooling, making sure everybody is working on the same page is going to be key to get that to work.
There is definitely no one solution for a particular organization to get that to work.
We definitely had to work through and figure out what our engineers wanted to do and what would work for the product people.
Can they work in GitHub as well? It's more developer focused, but is it easy enough for them to use and put ideas in and put stories in and things like that? I think just listening to the organization and figuring out how you can unite them is probably the best way forward on that.
Chris: Communication is a tough problem, and how you solve that I'm not sure. Maybe encouraging that kind of discussion and communication. Maybe it has to do with tooling as well where it makes it easier to send a message to someone and say, "Hey, I've got a question about this," or "Check this out and give me your input." Those tools, an open culture where you're not afraid to ask questions and things like that. It is a difficult challenge for sure.
I would agree. We've been using Slack for a long time. That's pretty standard now. We have such a great culture and we want to maintain that. We're relatively small. There are about fifty people. That's getting to where you want to make sure that you can keep that culture.
We haven't had too much of a problem with communication. We try to be sure we don't put up Silos between teams because that can be detrimental. Like you said, we want to make sure that you're comfortable going from team to team and saying, "I just have a question."
Sometimes that can create animosity because you've got engineers that need to do their work, but you have another team coming in and bothering them. You think, "There's context switching going on." It's definitely a balancing act, trying to make sure there are no Silos, but then also making sure there are some sort of barriers there where it's not just a "free for all" where you can bother anybody any time.
I think some of that asynchronous communication through Slack is helpful. As long as people are respectful of other people's time and things like that, you can make it work pretty well.
You were talking about keeping that culture and things like that. Does that start with finding and hiring the right person and right character and personality of someone who can fit right into that team and then reinforcing that culture that you have? Or is it a balancing act of letting the culture grow beyond what it currently is ... letting it take root and do its own thing? How does that work?
I think it absolutely starts with hiring. When we hire, at least on the engineering side, we hire a lot for passion. We want to make sure someone is passionate about what they do and about learning.
We've been talking to some recruiters lately, and we've asked them similar questions like, "How do you make sure someone fits the culture?" That's a soft and squishy thing. There's not just one answer, but it's kind of a gut feeling when you're talking to someone, you try to get a sense of if they would fit your culture or not.
I absolutely think you need to try to find or try to make sure that whoever you're hiring will fit your culture, whatever that may be. That's not to say that you want to keep your culture as it is and there's no room for growth or difference.
Everyone is going to bring their own flavor of culture no matter what.
That baseline of being a hard worker, responsible, and passionate makes us comfortable bringing that person on. They keep that culture. As far as after that goes, we actually have a culture club, a new organization that tries to make sure we continue that culture.
We do outings and hike days once a month or ski days in the winter to keep that camaraderie and keep the team together. I think that's a really cool thing. We've got a great team. We're trying to keep that together and keep us all on the same pitch.
You're in Denver, Colorado. So there is plenty of skiing around there, right?
Yes, we have a good few months of skiing and it's a lot of fun to get out there with the rest of the team on a Friday. It's a lot of fun to go hiking too.
Are you in your office or at home?
Yeah, this is my home office. We have a pretty good remote working policy, so I usually work from home a couple of days a week (Wednesday and Friday). Originally, we tried to make Tuesdays and Thursdays "meeting free" days so you can focus. It has been working pretty well. Sometimes we'll stay home Tuesdays and Thursdays to get a lot done and then we'll be there Monday, Wednesday and Friday. Just in general, if there are meetings we'll be there. Otherwise we can work from home.
At the same time, finding talent is hard if you don't open it up to remote. Also, sometimes it's fun to just be home like me now. I left a little early. That flexibility really helps with the morale.
It attracts talent, but it has its own set of challenges.
We have an office in Dallas as well. We typically try to hire within that area of Dallas or Denver depending on where we need folks. We have a couple of people, too, that have been with us for a while. They are amazing engineers and they decided they wanted to move somewhere else in the country, but we are fine with that.
We've gotten a sense for them and know they are hard workers. We know they are amazing engineers, so we let them work from wherever in the country they happen to live.
Initially we hired in the Denver, or Dallas or San Francisco area depending on what we want, providing that flexibility that eventually comes with it.
Thanks for chatting! If people want to chat with you or have follow up questions, how do you recommend that they contact you?
Email: Travis at cloud-elements
Of course, check out Cloud Elements and leave a comment below thanking our guest for his time and knowledge.