Going from a monolithic app to microservices and building engineering teams

Interviewed by Christophe Limpalair on 05/31/2018

How do you go from having a monolithic application to microservices and how do you build teams and processes around increasing development velocity? In this episode, we find out with James Judd from Lucid Software.


Interview Snippets



Let's start with a little bit talking about you. How did you get started in this industry?


[00:23] James: I was originally a finance major in school and one summer a professor in the atmospheric science department was interested in having someone do some Perl development over the summer. I said, "Okay, I'm kind of interested. The money would be great," but I didn't think I wanted to do programming. My interview was, "Do you know Linux?" I'm like, "Yes. I've used Linux before." "Great. You're hired." I did Perl over the summer and by the end of the summer, I was like, "Crap. That was actually really fun." I had a mid-college crisis and decided, "I actually want to do computer science instead." I finished my finance degree, did computer science and the rest is history.


Can you give us an overview of what Lucidchart is and how it works?


[01:35] James: We're Lucid Software - we make Lucidchart and Lucidpress. These are tools that help people communicate more clearly. They're online, collaborative editors that run in the browser. You can use Lucidchart to do things like make flowcharts and relationship diagrams. You can also diagram your entire AWS account automatically in minutes.


We also make Lucidpress, which is a tool for making flyers, magazines, newspapers, et cetera, when you really need to stay on brand - think franchises, real estate, things like that.


You started, I believe, at Lucidchart as a software engineer and you're now an engineering manager. Can you talk about some of the skills that you've had to develop going from just being a software developer to having to manage other software developers and what that transition has been like?


[02:33] James: I think one of the most important things to do as a manager is you've got to learn to hire and retain the best people and somehow get results. It's really weird because as an engineer your work is the deliverable. You're directly working on it. When you become a manager, your team becomes the deliverable, which is really, really scary. You have to build up trust with your team. You have to develop a great relationship with them. Then, I'm a huge proponent of servant leadership - serving as that shield, absorbing distractions, miscellaneous meetings, maybe the small bug fixes, things like that, and helping your team succeed.


One of the hardest things to do, especially for first time managers, is learning how to delegate. Learning how to get those results from the team, not necessarily your own output just like you were saying. What are some tips that you have in terms of learning how to delegate and making that work happen through your team?


[03:43] James: One of the biggest things with delegation is trust. If you can trust that you'll ask someone to do something and they'll do it and get it done well, that makes it much easier. I think developing a great relationship with the people on your team is very important. Having weekly one-on-ones is a great way to accomplish that. You meet with someone on your team for 30 minutes, do the check-ins, see how things are going. If there's something that's brewing, you can catch it much earlier and resolve it. It doesn't sit there and fester. It's a great way to build trust, which helps with delegation.


What are some ways of measuring some of that success? You mentioned the one on ones and things like that, but how do you ensure that they're successful? How do you show them what success looks like so that they're really able to own that position and get results and succeed in their position?


[04:34] James: There are a few different levels of measuring success. It all starts with a solid foundation. We set very clear expectations - a description of the level we expect people to be at as an engineer, how they can meet those expectations, and how they can exceed them. Then we'll set individual goals in a sprint; we do two-week sprint development cycles, and we look at whether you're meeting your sprint commitments. Then we'll set quarterly goals, which we call OKRs - objectives and key results - around the business results we're interested in seeing, and we look at whether we're hitting our OKRs.


Ultimately, it comes down to "is the business succeeding?" We're doing all of this so that the business can actually succeed. Setting that foundation, meeting your sprint commitments, hitting the OKRs, and helping the business succeed - that's how we would measure success.


Christophe: We actually use the same system. We use OKRs here at Linux Academy and we use a very similar process. Having those OKRs defined on a quarterly basis means each individual team member has that internal compass. They know - this is the direction I need to head in.


Now, a lot of times, even though something is part of an OKR, it doesn't mean it's the only thing you're going to be working on, because you might have other requests that come in, things that are unexpected, other team members that need help. But having that internal compass you can go back to, one that guides you toward the organization's strategic goals, has been super helpful for us. It sounds like it's been helpful for you guys as well.


James: Totally agree.


One of the reasons that I was asking you some of the questions is obviously since you're in that management position, but also I'm curious to hear from the audience if you guys want us to spend more time on questions like these in terms of management, in terms of learning how others are doing it, how they're structuring their teams and things like that. Please leave some comments. Let us know what you think about that. I'd love to hear it.


Now, moving on to more of the architecture, speaking of which, can you describe what the Lucidchart architecture looks like just in a nutshell?


[06:47] James: We serve a rich application on AWS using microservices. Our front end is built with Angular, TypeScript, WebGL, and the Google Closure Compiler. We use WebGL because we do all of our own rendering - once you get into the actual editor, that little rectangle where the editing happens, we do all our own rendering so it's pixel perfect across different devices. Then on the back end, we're using a bunch of Scala microservices and, as I mentioned earlier, we're hosted on Amazon.


We also use some Angular, and I was asking some of our engineers if they had any questions, so we'll dive a little more into Angular. But you had mentioned previously that you used to have a monolithic PHP application and then you moved over to the Scala microservices you just mentioned. First of all, why the change? Why did that happen?


[07:36] James: Lucid had been around for about two years when we decided to switch away from PHP to Scala. We were constantly having problems with our monolith and with PHP. We were using a framework that had lots of security problems - we had trouble with things like injection. So we evaluated different frameworks to switch to, either in PHP or in a different language. The CTO went off and tried things out in Java, in Scala, and in PHP, which were the three finalists, and came back and said, "I really like Scala." Everyone else tried it out, agreed, and we went forward with Scala. It's been great to move to a microservices architecture using Scala.


There are two separate things here. It's been great to move away from a monolith because scaling a monolith is really hard. Performance is hard to reason about. You don't know whether spawning a thread in something related to fonts is going to impact document saving; putting those in completely separate applications, on separate servers, makes it much easier to reason about.


Along those lines as well, the failure blast radius was really big. If you chew up resources doing a bunch of font processing, again, your document might fail and you don't want that to happen. You want fonts to be able to fail independently of your document service.


Scala has been great because we're big fans of strongly typed languages. I know there's a religious war about that, but we can catch a lot of bugs at compile time that we would otherwise catch at runtime. Scala's great because it runs on the JVM. We get all the libraries, benefits, and problems that come with the JVM, but in general, we don't have to worry about rewriting a bunch of HTTP libraries or something like that. We can just use all the Apache stuff and call it good.
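
To make the compile-time point concrete, here is a toy Scala sketch (not Lucid's code): wrapping identifiers in distinct types turns a whole class of mix-ups into compiler errors instead of runtime failures, which is roughly the kind of bug a dynamically typed monolith would only surface in production.

```scala
// Toy illustration of compile-time checking; DocumentId, UserId, and loadDocument are made up.
object TypeSafetyDemo extends App {
  final case class DocumentId(value: Long)
  final case class UserId(value: Long)

  // Accepts only a DocumentId, so handing it a UserId is a compile error.
  def loadDocument(id: DocumentId): String = s"loaded document ${id.value}"

  println(loadDocument(DocumentId(42L)))  // compiles and runs
  // println(loadDocument(UserId(42L)))   // does not compile: type mismatch
}
```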


In terms of the actual rewrite itself, how do you go from having this monolithic PHP application to rewriting it into microservices? How do you even approach that project? The reason I ask is because we have also gone through some of these rewrites for parts of our applications, and one of the issues that comes with that is, obviously, you have limited resources. When you have limited resources, how do you maintain the current existing application - how many resources do you put behind that to fix bugs and add new features - versus how many resources do you throw behind the actual rewrite itself, especially if it can take a while? What would be your advice on how you approached that?


[10:19] James: What we did is we started off by saying all new services that are not going to be part of the current monolith are going to be written in Scala. That was how things were for about a year or so. One of the first things they did was write a document service, because they wanted to greatly expand the collaboration capabilities we had. That involved ripping some code out of PHP, but a lot of it was new. Then, a few years later, after we'd been writing new services in Scala, they decided, "Okay, we're going to start migrating all the PHP to Scala."


At first it was ad hoc. We set up a leaderboard: how many lines of PHP have you converted to Scala? That generated a little bit of internal competition, which was fun. Every line of PHP we converted translated into something like ten cents or a penny toward an engineering team activity where we all played laser tag or something like that, but that only got us about a third of the way there. We then dedicated a team of about four to five engineers for four months at the end of 2015 to take us over the finish line.


That's primarily how it got done.


What about Angular? When did you decide to switch to Angular, and what were you using before then?


[11:38] James: Before we switched to Angular, it was a mix of jQuery and an internal templating language based on jQuery - it was not very good. It was not very fun to use. We decided to switch to Angular - Angular 2 at the time - the day after it came out of alpha. We decided to do a new version of the Lucidchart editor using the Angular beta at the beginning of 2016. We spent a few months on it, we got a good chunk of it done, we had some conversion issues that we worked through with Google, but that's when we decided to switch over.


It was just, "Hey, we're taking a big crack at refreshing the UI that we could do it in what we have now, we could try to make what we have now better or we could just use Angular. We decided to use angular.


Can you talk a little more about the integration of Angular itself - how you're using it, and how that integration works between the front end and the back end, specifically through Angular?


[12:37] James: What we've got going on is, we have Angular primarily dealing with the UI - the menus, components, various things like that - and then a lot of the business logic is just TypeScript code. When we're communicating between the front end and the back end, integrating our Angular UIs with our actual back end, we've got a REST API on the back end.


We use hypermedia as the engine of application state. Everything returns URLs. What ends up happening is the TypeScript code, or our iOS code or Android code or integration code, will contact a bootstrap URL, sending a header saying, "I want this particular version of it and I want it in JSON," and it gets links back: "Okay, here's where you can go to contact the user service, here's where you can go to contact the document service."


When you contact the user service: "Okay, here's where you go to get this information about a user, here's where you get information about this user's subscription," and you just follow these links until you end up with the resource you want. The advantage of that is it makes it really easy to version your API, because the bootstrap URI stays the same and you just accept different MIME types whenever you need to mutate the model.
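
As a rough sketch of that link-following flow (the bootstrap URL, the versioned MIME type, and link names like "users" and "subscription" are illustrative placeholders, not Lucid's real API), the client hard-codes only the bootstrap URI and discovers every other URL from the links that come back:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

// A parsed hypermedia response: just the links it exposes, e.g. "users" -> "https://.../users".
final case class Links(links: Map[String, String])

trait JsonHttp {
  // GET the URL with an Accept header naming the API version, and parse out the links.
  def getLinks(url: String, accept: String): Future[Links]
}

class ApiClient(http: JsonHttp, bootstrapUrl: String) {
  private val accept = "application/vnd.example.v2+json" // versioning lives in the MIME type

  // Follow links: bootstrap -> user service -> the user's subscription resource.
  def userSubscriptionUrl(userId: String): Future[String] =
    for {
      root <- http.getLinks(bootstrapUrl, accept)
      user <- http.getLinks(s"${root.links("users")}/$userId", accept)
    } yield user.links("subscription")
}
```

Versioning then only means accepting a different MIME type at the same bootstrap URI; the chattiness James mentions next is the price of all those intermediate hops.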


The biggest challenge with it is that it's very chatty.


If we had zero latency between client and server, this would be fantastic, but over higher-latency connections you do see a bit of an issue because round-trip times are much higher. For the most part, though, people are using the application from desktop computers at work, so in general it works pretty well for us.


Do you have dedicated full-stack developers? Do people focus on the front end and the back end separately, or do you have full-stack developers that focus on both at the same time?


[14:34] James: Yes. Every engineer we hire at Lucid, we hire as full stack. We like to hire smart people who get things done and figure we can teach them the rest. We have these full-stack engineers that spend some time on teams - we switch teams up three times a year - working on a particular domain of the product. Some are more front end, some are more back end, some are more integrations, and some are more mobile, but at the end of the day we expect people to be able to work from front to back, granted at any given time you're probably working largely on the front end or largely on the back end.


I've worked all the way from infrastructure on the devops team, all the way to writing CSS on our front end and I really enjoy that.


When it comes to code standards, how do you really enforce those or even define those for the teams?


[15:20] James: What we do with code standards is we have automatic formatting and linting, which is actually a relatively new development - it's only in the last six months or so that we finally got big enough to decide we needed the automated tools. Beyond that, it's code reviews. All code, before it gets merged into master, has to go through a code review. You assign somebody who is an expert in the code, and we use some plugins to try to smartly figure out, "Okay, here are the people who've been working on this file, they probably know what it is, assign it to one of them." Get some code reviews, get QA, that sort of thing. That helps us keep our codebase sane and the code of pretty high quality.


Speaking of which, we talked about some of the processes that you guys as a team have implemented to help with some of the development, speed up the development and things like that. You mentioned that some of those that have been very helpful and positively impactful have been: releases once a week, smaller teams of about eight people or so. Finally, you talked a little bit more about the on-call team. Let's start first with the releases once a week. First of all, how did you even achieve those releases once a week?


[16:34] James: We used to release once every two weeks. Ever since I joined Lucid three and a half years ago, we had been releasing once every two weeks. We had a two-week sprint, which would end on a Friday. On the Monday after the sprint ended, we would cut a release candidate branch. That release candidate branch would go through manual QA regression, blockers would be fixed, and then we would release to production on Thursday afternoon.


What we did to get it down to a week is we took a look at the services we were releasing once every two weeks. For some of them we decided, "Okay, we have test coverage here that we can just continuously deploy with." We started continuously deploying a lot of our back-end services that are low risk and aren't likely to break anything too badly if they break in production. Then we started automating a lot of the manual regression testing.


We have a dedicated QA team at Lucid. They're awesome, but you can't have people just doing regression all the time - it's going to burn people out. So we took an engineering team and worked on automating a lot of that manual regression we were doing. We wrote a lot of end-to-end tests. We did all sorts of things we could to make that QA regression cycle shorter, from a full week down to just a couple of days. As we improved the regression time from five days to four days, to three days, to two days, we changed our release cadence correspondingly.


We said, "Okay, we're going to release every ten business days, then we're going to release every nine business days, then eight, then seven and six and so on and so forth. We eventually reached five business days, full release. We just release now every Wednesday, which is great.


Can you speak of some of the positive impacts that has had? The reason I bring that up is because, I'm willing to bet we have people listening to this right now who are not there yet. Their organizations are not anywhere near that and it could take a tremendous amount of investment to even get to that point. When does it truly make sense to put forth that investment and why and what is that impact?


[19:01] James: I will definitely say that it takes a lot of investment. It took us about a full year, and it was not an easy year, to get to weekly releases, but the benefits have actually been great. We used to see a lot of changes that would be released as hot fixes - or emergency releases, as we like to call them - to production, because a team would say, "Hey, this is going to go out in two weeks." "No, I need this sooner than two weeks. We just want to tweak something on a marketing page, we want to tweak this A/B test, we want to do this thing."


Those don't happen anywhere near as frequently now. There's something magic about "Okay, that's going to go out next week, I don't have to worry" versus two weeks. We saw a massive drop-off in those, which is fantastic, because it means a lot more things go through actual QA and there's less risk of breaking the production system. Then, one thing that was actually surprising - I didn't expect it to be as big of a benefit as it was - was that we decoupled sprints from the release cycle.


Sprints and the release cycle used to be in lockstep. Once every two weeks, we had a sprint. The big challenge with that is, on the Friday afternoon at the end of the sprint, there would be this massive wave of pull requests that would just get merged. Things would be going along pretty steadily and then, bam, 50 to 100 commits or even more would get merged all at once, and master would go from stable to just a giant mess. You'd have to spend a bit of time on Monday and Tuesday cleaning everything back up, because all these things getting combined at once caused regressions. Now that we've decoupled the sprints from the release cycle, you're not rushing to meet your sprint commitment and get everything merged right at the end of the sprint. It's been really great to have that. Things stay more stable, the change sets are smaller - I've been a big fan of it.


The next implementation that you mentioned is keeping those teams a little bit smaller. So, having about eight people per team, can you talk about why that's important, and also how or what those teams are comprised of?


[20:49] James: What we have is about eight people per team. This will usually be four to five engineers, a QA specialist, a product manager and a user experience designer. What we do is, the engineers are the people working on a specific domain of the product. They'll be working on, say, our data linking stuff, or our AWS integration or integrations with the enterprise team or something like this.


You have a product manager who is very familiar with that part of the product, and you have a QA specialist embedded on your team - they'll work with your team and maybe a couple of other teams - who is familiar with testing your part of the product, and a designer who is familiar with it as well. We find that these integrated teams do really, really well, because you don't have somebody disconnected from the team sitting down and saying, "Okay, now we are going to plan a sprint." No, they're with you on a day-to-day basis, they know what's going on. These cross-functional teams, we think, really help us achieve a better product.


We don't currently have a user experience designer as part of the engineering teams. That's actually something I'm pretty interested in, and we'll have to see if it makes sense for us as well, but I could see the benefit of having that - especially when, otherwise, you might have separate sprint plannings, and something might pop up that we need to create a design for that wasn't originally thought of because the task wasn't broken down or thought through.


[22:12] James: To that point, we've been really big fans of having UX designers working directly with the engineering teams, because you've got a UX designer who knows what's being worked on in that sprint. They'll go talk to an engineer and say, "Okay, your component - you've got this thing in the design, and it's infeasible to implement like this," or "Hey, can we tweak it? I've sat down, I've tried to use it, and I think this would work a little better." The engineers can pull them over to their machine when they're working on things, and we think it really helps speed up development when you don't have to wait for another cycle to start for a different team, another sprint, or anything like that. You can just pull them in day to day, get feedback, and work on these things as you're moving along.


Even if you have a very well-thought-out scope with wireframes and designs, and you look at those designs and say, "Man, this looks fantastic, this is going to be awesome," once it's on an actual web page it can sometimes be entirely different. You go through it and you're like, "This is not as good as I thought it was." There's just a disconnect that's very hard to get around. Again, I can definitely see that being useful for sure, thank you for that.


The last thing you mentioned was having these on-call teams. You explained what the on-call teams are responsible for and some of the changes you implemented there - can you tell us a little more about that?


[23:18] James: We've had an on-call team structured this way since I joined Lucid; I joined the on-call rotation about, well, I guess, a year and a half ago. Back then we had a very small on-call team, about five to seven people, because we didn't really have different tiers of access in our on-call. You either were unprivileged or you had root pretty much everywhere, which we have since changed - there are now intermediate levels - but that's how it was back then. It was a very small group of people embedded throughout the engineering organization.


You'd take on-call once, and it would cycle through everybody, so you'd be on call once every seven weeks or something like that. When an alert went off, you'd get paged because something had broken, you'd hop on, triage, try to get things healthy again, and move on. We love embedding these people directly in the teams, because it's really easy to become disconnected as an engineer working on an engineering team that never deals directly with production.


You write your feature, it gets shipped to production, but that's about it. Now, say you write a feature, you ship it to production, and it has scaling problems. Somebody from your team, or somebody else in the engineering organization - maybe your good friend - gets paged the night before, they come in all haggard the next morning, and you're like, "Man, that's not great. I woke so-and-so up at three in the morning last night." If somebody is on your team, they can talk about that, you see that impact, and you feel a little more ownership of production when there's an on-call person directly on your team. It also empowers the team. If they need to do a hotfix release or something like that, they can do all of that directly within the team; they don't have to go to a special on-call team or the devops team to get a release approved. They can release it, they're responsible for it, and if it breaks anything they can go fix it.


I do want to ask you a bit more about monitoring and tooling around errors and performance. Before we do that, let's talk a bit more about some of the other teams that you have and how you've structured those, and also some of the AWS services you're using, since you did mention that you run on AWS. Let's start with the other team structures. I know you have a systems team and a back-end team - can you talk about what those teams are comprised of and what they represent?


[25:28] James: I work directly with our systems team and our enterprise and integrations teams. Our systems team and our back-end team are pretty much the same team. Systems is our back-end applications team, and devops is more of a back-end infrastructure team. Systems is responsible for the scalability, reliability, performance, and architecture of the application. The devops team is responsible for a lot of the same things, but they approach it from: do we have the server capacity, are we using the right services, do we need to do more infrastructure-related things?


On the systems team, we've done all sorts of fun things. Occasionally we'll optimize or rework part of the architecture that we've outgrown. Sometimes we'll do the more technical parts of the product. We don't actually have a product manager on the systems team, because we tend to tackle the things where a product manager won't be as useful - there's pretty much no front-end or user-facing component to them - but they're still important: encryption, security, performance, things like that. Then, for the rest of the engineering organization, we have teams that are broken down by product domain.


We have got an enterprise team, an integrations team. We have a team that's related to our data linking, and data visualization features, things like that. As their names might imply, enterprise works on things that enterprises care about, like identity management and security, and integrations. Our data binding team works on our AWS integration and hooking up with Google Sheets and visualizing data, and stuff like that.


Let's talk more about that infrastructure you mentioned, running on AWS, what are some of the services that you use, and what do those services do?


[27:13] James: We use a lot of AWS services; I won't list and go through all of them. The primary ones we use are EC2 - that's where we run all of our services, the actual servers our application runs on. We use RDS with Aurora to store our data - pretty much everything is SQL, though we use a little bit of DynamoDB. We store a lot of data in S3. We also use things like SQS, SES, Kinesis, things like that. I know that's a laundry list of Amazon services, so I can talk a little more about how the architecture and the infrastructure look...


What we've got is nothing totally exotic. We've got requests coming in from the internet, they get routed to Amazon's ELBs so we can do SSL termination, then they get passed off to HAProxy, which is our load balancer, and from there they get sent off to different groups of boxes for individual services, because again, we're running microservices.


If you make a request to the document service, you're going to flow through HAProxy to a particular document service box, it's going to service your request, and it's going to send the response back. If it needs to contact the database - actually, just in the last two days, we now contact the database directly; we had a database load balancer, we still have a database load balancer, but that's that.


That's the flow - how requests get all the way in from the internet, to an application box, to the database, and back out. Then, just as another random note, we're using Chef for management of our servers. When we need a box to spin up in production, we say, "Okay, you need to be a document service box." Chef will configure it and run it - we'll bake an AMI - and you just have to make sure everything stays up to date, security patches are applied, and all that wonderful stuff.


When you use different services like that, especially with microservices, one of the things to keep in mind is that a single request can touch so many different pieces. One of the issues you can have is: what if one of those pieces becomes unhealthy, and response time really increases, or the error rate increases, or anything like that? How do you prevent that - or, if you can't always prevent it, how do you measure it, and how do you make sure somebody can see "this is the piece that's unhealthy, this is the piece that's taking longer to respond"?


[29:38] James: We don't always prevent it. I wish we could always prevent it, but we don't. I'll start with the monitoring piece, how do we monitor it, how do we know when there's a problem? Then I'll talk a little bit about some of the tools we've built to make our lives easier. We'll use StatsD and Datadog to measure a lot of metrics around application performance. We'll send those off to Datadog.


We'll alert on those metrics. We track latency: when latency for an endpoint or for a service slows down past some SLA we've defined, go ahead and alert, and someone will take a look at it. When there are enough errors, go ahead and alert. We ping things as well - we'll say, "Okay, constantly ping the document list through the editor, and if you can't get there, alert" - those sorts of things. That's how we know when things are wrong: we have a lot of metrics that we track with Datadog and Loggly.
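
To make the instrumentation side concrete, here is a minimal sketch of emitting a latency metric in the DogStatsD text format that StatsD/Datadog agents accept over UDP. The metric names, tags, and agent address are illustrative assumptions, not Lucid's actual metrics.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import java.nio.charset.StandardCharsets.UTF_8

// Minimal DogStatsD-format emitter: "metric.name:value|ms|#tag:value,..."
class StatsdEmitter(host: String = "127.0.0.1", port: Int = 8125) {
  private val socket  = new DatagramSocket()
  private val address = InetAddress.getByName(host)

  private def send(line: String): Unit = {
    val bytes = line.getBytes(UTF_8)
    socket.send(new DatagramPacket(bytes, bytes.length, address, port))
  }

  // Report a timing value in milliseconds with Datadog-style tags.
  def timing(metric: String, millis: Long, tags: Map[String, String]): Unit = {
    val tagStr = tags.map { case (k, v) => s"$k:$v" }.mkString(",")
    send(s"$metric:$millis|ms|#$tagStr")
  }

  // Time an arbitrary block and report how long it took.
  def timed[A](metric: String, tags: Map[String, String])(block: => A): A = {
    val start = System.nanoTime()
    try block
    finally timing(metric, (System.nanoTime() - start) / 1000000L, tags)
  }
}
```

An SLA alert is then just a monitor defined on that metric in Datadog - for example, paging when the 95th percentile of a hypothetical documents.list.latency metric stays above the agreed threshold for several minutes.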


That helps us keep track of when things are working well and when they aren't. Now, in response to some things we've seen - and these are pretty common problems with distributed systems and microservices - the classic example I like to use is retry attempts. You have service A that makes a call to service B, which makes a call to service C, and each call is set to retry five times. If a request fails from B to C, it retries five times; and because it's failing from B to C, the call from A to B fails too, and A to B then retries five times.


So you've got exponential growth in the number of requests being made to service C. Service C gets a little bit sick, and all of the other services pile on with this exponential explosion of retries - if service C was sick, it's now dead. What we've done to deal with this is inspired by Netflix's Hystrix library, which is a way to do inter-service communication in a fault-tolerant manner. It does things like circuit breaking and keeping track of requests as they flow through different services.


We ended up writing an internal library that we call Kronos, which manages timeouts: "This request should take this long; if it's taking longer than that, go ahead and stop. If we detect an upstream service is sick, don't just keep making requests to it - give it some time to recover and then connect back to it again."
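
Kronos itself is internal to Lucid, so the following is only a generic sketch of the circuit-breaker-plus-timeout idea it is described as implementing, in the spirit of Hystrix. The thresholds, names, and the blocking Await are simplifications; a real library would be non-blocking and track state per upstream service.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicLong}
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}
import scala.util.{Failure, Success, Try}

// Generic circuit-breaker sketch: bound how long a call may take, and stop calling
// an upstream that keeps failing until it has had time to recover.
class CircuitBreaker(maxFailures: Int = 5,
                     resetTimeout: FiniteDuration = 30.seconds,
                     callTimeout: FiniteDuration = 2.seconds) {
  private val failures = new AtomicInteger(0)
  private val openedAt = new AtomicLong(0L)

  // Open (refusing calls) while too many failures have been seen and the reset window hasn't passed.
  private def isOpen: Boolean =
    failures.get() >= maxFailures &&
      (System.currentTimeMillis() - openedAt.get()) < resetTimeout.toMillis

  def call[A](f: => Future[A]): Try[A] =
    if (isOpen) Failure(new RuntimeException("circuit open: upstream marked unhealthy"))
    else Try(Await.result(f, callTimeout)) match {
      case ok @ Success(_)  => failures.set(0); ok  // a healthy call closes the breaker
      case err @ Failure(_) =>                      // a failed or timed-out call counts against it
        if (failures.incrementAndGet() >= maxFailures) openedAt.set(System.currentTimeMillis())
        err
    }
}
```

With something along these lines wrapped around every inter-service client, service B stops hammering a sick service C after a handful of failures instead of multiplying retries on top of it.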


A great case where we saw this actually work was earlier last year - I guess it was a little over a year ago now - when Amazon S3 went down. We depend on S3 for a lot of things: we store a lot of images there, we store a lot of fonts there, for example. Before we had Kronos, the site would probably just have been down, because we would have been depending on S3 and it wouldn't have worked. What ended up happening was we were degraded, but we didn't go down.


We tried to contact S3 and it all failed, so we flipped the circuit breaker, and that circuit breaker didn't flip back until S3 started to become healthy again several hours later. People didn't have images, people didn't have fonts - and arguably, if you don't have images and fonts, how useful is it really? - but you could still go in and edit your text, you could still open documents; if you needed to get to something, you could still use the product.


That was really cool to see that in action.


That's excellent. Do you have another example of other pieces of your application where you have that kind of switch? Was it primarily focused around those images and fonts? I'm just trying to think from the perspective of how modular you can make the application, so that if one system goes down it doesn't impact the rest of it.


[33:25] James: Every single client, internal and external, now gets Kronos. Whether you're making a request from our document service to fonts and images, or from the document service to our user service, or the spelling service, or whatever - all of these different clients use Kronos.


We've also seen instances where we'll completely consume a thread pool, and the default behavior of something like a connection pool or a thread pool is to just block forever. You end up with a queue backing up infinitely, until either your service crashes because it runs out of memory, or everything grinds to a halt because nothing's ever getting done. You have thousands of requests coming in, but they're all blocking at a single point. This is why we developed something like Kronos - to handle this chaos that develops in the system.
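
As a companion sketch to the circuit breaker above, here is one way to avoid that block-forever failure mode: a bounded thread pool that rejects new work when it is saturated, plus a hard per-call timeout, so requests fail fast instead of queueing until the service runs out of memory. The pool sizes and timeout are arbitrary illustrative values, not Lucid's settings.

```scala
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.util.Try

object BoundedClient {
  // At most 16 threads and 100 queued calls; beyond that, submissions are rejected
  // immediately (AbortPolicy) instead of backing up without limit.
  private val pool = new ThreadPoolExecutor(
    4, 16, 60L, TimeUnit.SECONDS,
    new ArrayBlockingQueue[Runnable](100),
    new ThreadPoolExecutor.AbortPolicy())

  private implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

  // Fail fast after two seconds instead of holding a caller thread indefinitely.
  def callDownstream[A](work: => A): Try[A] =
    Try(Await.result(Future(work), 2.seconds))
}
```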


How do you measure the performance of the individuals or the team that works on that function? What I mean by that is, do you have a set of KPIs or anything that you track in order to say, "Okay, we need to have this amount of up-time. If this disaster happens, this is how it should be handled." How do you guys measure that?


[34:50] James: We have a set of KPIs that usually track back to OKRs. On every quarter's OKRs, we usually have an uptime OKR that we want to maintain. Some of these have been on the OKRs for a year or so and are now just part of the ingrained culture, so they come off the OKRs, but they're still very important to us. We measure uptime and bug count.


We have a special category of bugs that we call production issues for things that cause alerts to go off. We want to keep those low or when they happen, we want to solve them quickly so that the on-call team doesn't get burned out. Various things around this. Then we also measure things like response time, the percentage of healthy requests. We also measure inside the actual editor which is a little different than probably most web applications. We measure things like the frames per second we're getting and how many resource stores we have or is there a particular document where performance is just really anomalously horrible or something like this.


That's a lot of different things to measure, but these ultimately boil down to overall health of the application and overall performance of the application.


Let's wrap this up with a wildcard question, you ran an advertising campaign piggybacking off of the Doggo Craze. What was this all about?


[36:08] James: This is a doggo; a small doggo is a pupper - he's a sad pupper - but a big old pupper is actually a doggo, not to be confused with a big old doggo, because that's a woofer.


Can you tell me a little bit about how that went, what was that advertising campaign and how did that go?


[36:22] James: We have a video marketing manager named Caleb, and he's an awesome guy. He ran a couple of videos last year. We got together and he said, "Okay, we want to really get in touch with engineers. We're going to go to the engineers, figure out what they like, and build advertising around that." I got in a room with Caleb and a few other engineers and he said, "Okay, here are some ideas that I have, and I'd love to get your thoughts." We got to talking and I was like, "Hey, what about doggos? We've got this doggo thing we could do."


The Internet has this whole lingo around doggos. You've got doggos and puppers and woofers and floofers and subwoofers and a bunch of these made-up words. Caleb made a flowchart that walks people through, "Okay, this is a doggo, this is a pupper, this is a woofer, this is a floofer." We posted it on Facebook, we posted it on Reddit. It got a few thousand views and we were like, "Okay, well, I suppose that was a moderate success." We had launched the video on a Thursday or Friday afternoon, and then I woke up on a Saturday morning.


A friend texted me and said, "Hey, did you guys make this doggo thing? Because I'm seeing it on Facebook and it's blowing up right now." We hopped on there that Saturday and found all these people re-posting it - 9GAG picked it up, UNILAD picked it up, a few other people picked it up. It went from like 3,000-5,000 views to three million, five million views. Then we made a few more videos, and the next thing we knew it was millions and millions of views. We were ecstatic that we could just take largely nonsensical words from the Internet, put them into flowcharts, and have them be really, really fun. It's always fun to be able to have fun with your advertising and have it actually work.


It's oftentimes the weirdest things that end up going viral that way.


[38:05] James: It's true.


That sounds like a lot of fun. I'm glad that worked out for you guys. James, thanks again so much for your time and for doing this episode. I learned a lot personally from doing this and I'm sure listeners will as well. If anybody has any questions, like follow up questions or anything like that, or if they're just interested in following you for your work, how do you recommend they do that?


[38:23] James: You can find me on Twitter, I'm just @jamesajudd on Twitter, you can also send me an e-mail at james at lucidchart.


Thanks James and thanks everybody for tuning into this episode. Like I said earlier, please let us know if you want to see more questions about management and things like that. I know that can help a lot with career growth. Let us know if you want more of that and otherwise, we'll see you next time.


