Using peer-to-peer technology to change the CDN industry
Interviewed by Christophe Limpalair on 03/19/2018
Every time someone mentioned peer-to-peer in the past, I always thought of torrenting tools. Fast-forward to today, and P2P possibilities are now embedded in modern browsers thanks to technology like WebRTC. It enables browsers to talk with each other without needing any middlemen, like servers, to relay information. This has unlocked a number of fascinating possibilities, one of which we explore in this brand new episode. Shahar Mor from Peer5 joins us to explain how they are streaming data in a much faster and more scalable way than CDNs can do on their own.
Let's start with your background. How did you get started in the IT industry and where do you come from?
Shahar: I come from Tel Aviv, Israel. It's a small country in the Middle East where a lot of the startups come from these days. We're going to talk about that later, but more about myself...
I've been a developer for about 10 years now. I started as a web developer, and at my first job I started learning more and more about servers and Linux and things like that. It was an advertising company, and it had its own issues with scalability.
I stayed in the ad space for about seven years when I decided that I wanted to do something else. Something that can change the world.
I started looking around and I first encountered Peer5. When I first talked with them, they told me their idea of basically using the users themselves to broadcast video online, which sounded amazing to me. I thought, "It's going to change the world. It's going to do for the internet what broadcasting did for TV."
That's when I joined Peer5, about three years ago, as the first employee. That's how I got here.
Peer5 is a peer-to-peer CDN, but can you tell us a little bit more about what it is and how it works?
Shahar: Traditionally, a Content Delivery Network, or CDN as we said, deploys servers around the globe to reduce the physical distance between the assets they host and the users themselves.
The problem with that is obviously scale, because at any given point in time those CDNs have a limited number of servers that can support a limited number of users.
Those servers are also shared by many customers rather than just one, so performance is prone to degrading.
What we did is basically remove the servers from the equation. Now we don't have any servers that we need to scale and maintain at all times; we just reuse the viewers themselves.
The users viewing the content, the videos online, distribute parts of the video among themselves at any time, reconstruct the video in the browser, and play it back seamlessly, without installing any software or plug-ins.
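The flow Shahar describes can be sketched roughly like this (illustrative Python, not Peer5's actual code): each video segment is fetched from a nearby peer when one already has it, with the origin HTTP server as the fallback, and the segments are then reassembled in order for playback.

```python
# A minimal sketch of the core idea: fetch video segments from whichever
# peers already have them, fall back to the origin HTTP server for the rest,
# then reassemble the segments in order. All names here are illustrative.

def fetch_segments(segment_ids, peer_cache, http_fetch):
    """Return the ordered list of segments, preferring peers over HTTP."""
    segments = []
    for seg_id in segment_ids:
        if seg_id in peer_cache:             # a nearby viewer already has it
            segments.append(peer_cache[seg_id])
        else:                                # fall back to the CDN/origin
            segments.append(http_fetch(seg_id))
    return segments

# Simulated sources: peers hold segments 0 and 2; HTTP serves anything.
peer_cache = {0: b"seg0", 2: b"seg2"}
video = fetch_segments([0, 1, 2], peer_cache, lambda i: f"seg{i}".encode())
assert b"".join(video) == b"seg0seg1seg2"
```

The key property is that the HTTP path is never removed, only deprioritized, so playback still works when no peers are available.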
Why aren't more content providers using peer-to-peer streaming? Does it require a more specific use case? Or is it a matter of the technology that just wasn't there a few years ago and it's just now starting to be feasible?
Shahar: Yes. Until about five years ago, the technology didn't exist. The technology is called WebRTC. It's what enables neighboring browsers to talk with each other without any middleman, like a server, in between.
Until then, the entire peer-to-peer industry was based around software installations like BitTorrent. You had to manually install that software, which obviously no one would do just to watch a video online that they could otherwise watch right in the browser.
Now that WebRTC is here, it's enabled in most browsers, specifically Chrome and Firefox, and now Safari as well. It's much more reachable for all users.
We see more and more customers that come to us asking us to help them with scaling their video infrastructure. It's becoming more and more important for customers because everyone is watching video online now, not on television.
Can you dive a little bit deeper into what WebRTC is and how it works? I know it allows browsers to talk to each other, but are there security vulnerabilities with that? How do they talk, how does that work, how do they know to make those connections?
Shahar: Initially, when you connect to a website and you want to connect to a different browser, you need to somehow transfer some information, some metadata about yourself, to that browser.
The way it works is you have a signaling server, and that's where Peer5 comes in. We act as the signaling server: we pass that metadata about you to the other user, and they pass some information back to you through us.
Once that information is exchanged and the handshake is done, they can talk directly with each other. They basically expose themselves to the internet, connect directly to each other, and transfer information.
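The signaling role Shahar describes could be modeled like this (a toy in-process sketch, not Peer5's API): the server only relays opaque connection metadata, such as SDP offers and answers, between the two browsers; once the handshake completes, media flows peer-to-peer and never touches the signaling server.

```python
# A minimal model of a WebRTC signaling server's job (names are illustrative):
# it relays opaque metadata between two peers and never inspects or carries
# the media itself.

class SignalingServer:
    def __init__(self):
        self.inboxes = {}                    # peer_id -> pending messages

    def register(self, peer_id):
        self.inboxes[peer_id] = []

    def relay(self, sender, recipient, payload):
        # The server just forwards the payload; it does not parse it.
        self.inboxes[recipient].append({"from": sender, "payload": payload})

    def poll(self, peer_id):
        messages, self.inboxes[peer_id] = self.inboxes[peer_id], []
        return messages

server = SignalingServer()
server.register("alice")
server.register("bob")
server.relay("alice", "bob", "SDP offer")    # alice's connection metadata
server.relay("bob", "alice", "SDP answer")   # bob's reply completes the handshake
assert server.poll("bob") == [{"from": "alice", "payload": "SDP offer"}]
assert server.poll("alice")[0]["payload"] == "SDP answer"
```

After this exchange (plus ICE candidates, omitted here), the two browsers talk directly.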
Everything is secure; the data tunnel between the browsers is encrypted. It's really awesome technology because it allows you to do things you couldn't do in the past, like real-time video chat.
On your website, you claim that the more people show up -- for example, say you have a sporting event, a world cup or something like that -- the less buffering you should have because the more peers you can connect to and get that data from. Is that right?
Shahar: Exactly. It's exactly like that. Peer-to-peer is less relevant when there's no concurrency because, obviously, you need to connect to other users. In big events like the World Cup, the Champions League, the Olympics, or basically any major sports event or concert, there's a huge concentration of users.
You can connect to really good peers, meaning the network between you and them is very fast and you can transfer parts of the video really quickly. Even faster than the CDN itself and its HTTP servers.
It basically gives you a redundant network path that you can use at any time to get parts of the video from.
When you say that it can even be faster than the original CDN, is that because you could be geographically closer to the user? If you're in a more populated city you have more users watching that stream so they're closer to you. Are there other reasons that would be the case?
Shahar: Yes, that's the most obvious case. I can watch some stream online, and my neighbor next door watching the same thing can give me the same content without it ever leaving the building. That's obviously much faster than going out to the servers and back.
Some servers can be far away from where you are, and they can be overloaded if there are a lot of viewers watching the same content. For example, every time there's a big show online, you see users complaining that the server went offline or the network isn't responding, because the server just can't handle the load.
We need something that is elastic and can scale with the demand, unlike servers that are static. Users, when they join, become a node in the network. They can help spread the load between them.
How does something like WebRTC differ from WebSocket connections or other types of connections like that?
Shahar: WebSocket is mostly used for client-server connections. What it gives you is a way for the server to also push messages to the client rather than only waiting for requests from the client. It can be used alongside WebRTC -- as I said before, to exchange messages between the clients when they want to connect. WebRTC enables you to connect the browsers directly, without any server in the middle, which gives you a secure way of connecting clients without having to maintain anything in between.
I do have a few more questions about what you call the serverless side where you get the peer-to-peer connections. I do know that you also run some infrastructure behind the scenes. In fact, you run on three different cloud platforms: Google Cloud, AWS, and Microsoft Azure. Can you explain, first of all, what you run on the cloud, and second, why you are using these three different platforms?
Shahar: As I said before, we don't use any servers to transfer the files themselves, unlike a traditional CDN. We do have some servers for the signaling, as I said before, and some servers for analytics: collecting data from the clients about how many bytes were transferred via peer-to-peer, how many bytes were transferred via HTTP, and some other video metrics we collect, like playtime and buffering time, that we can then show the customer.
The reason we decided to use the three different cloud providers is because when we tested them, each of them had its own benefits and advantages and disadvantages. Some parts of our service required some advantages that others didn't have.
For example, when we first started using AWS, we noticed that their load balancer was not sufficient for us, because it wasn't scaling fast enough. The way it works is that they have their own nodes, their own servers, and they scale those servers when they see that you need more capacity. Our capacity spikes are much more sudden. For example, when there's a major soccer game, users don't show up an hour early; they show up a minute before the game starts. Everyone shows up at the same time, and AWS couldn't scale fast enough. When we moved to Google Cloud, we noticed that their load balancer could scale much faster.
It's basically always there and always available. There were no issues scaling up. That’s why we decided to put our main signaling service in Google.
Azure, for example, gave us a way to control how many servers can be taken down for maintenance by the cloud provider at any given time. They have a feature called availability sets, where you can tell them that for a specific set of servers you want at most one server down at a time. We needed that for our analytics servers, which are based on Elasticsearch, so that we would always have at least one node up and they could never take two servers down at the same time.
We now also use AWS for the assets, like our client library and Android SDK, iOS SDK, and stuff like that.
So on the Google Cloud Platform side, you don't have to wait for that warm-up period. It's immediately responsive to the increase in requests, and you can serve them really fast.
Shahar: Yes, exactly.
What are some of the challenges that you have had working between those three different cloud platforms? Has it introduced complexity? How have you mitigated that complexity?
Shahar: The first thing that was important for us when choosing a cloud provider was the performance. Obviously, speed and scalability are key factors for a CDN. When we tested that, we noticed that Google was the fastest for our use case.
We decided that each service will be a standalone service and will be hosted on its own cloud provider. In addition, we noticed that when we wanted to access some resources that are located on a different cloud provider there might be some network issue or stuff like that which caused us some trouble. We started using queues on the edges of each service, where when you need to read something, you just go and get it from the queue and not access directly the database.
What are some of the tools that you use and what does your architecture look like?
Shahar: The main entry point into Peer5 is the WebSocket connection, which has its own issues if you want to talk about that. It hits a Google load balancer, which does the SSL termination. Once that's done, it's routed to one of our tracker servers. Trackers match the different peers and give them the initial peers to connect to. That's all handled by a cluster that we use heavily.
When it comes to sending messages between different peers, let's look at an example:
If I'm connected to server one and you are connected to server two and I want to send a message to you, it needs to go through server one to server two and then on to you. The way the servers are connected is a pub/sub mechanism based on Redis. It's a custom solution we built in order to make it scalable.
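The routing Shahar describes can be modeled with a minimal in-process pub/sub (Redis and the production details are abstracted away here): each tracker subscribes to its own channel, and a message destined for a peer on another server is published to that server's channel instead of being sent directly.

```python
# A minimal in-process pub/sub model of inter-server message routing.
# Channel and peer names are illustrative, not Peer5's actual identifiers.

class PubSub:
    def __init__(self):
        self.subscribers = {}                # channel -> list of callbacks

    def subscribe(self, channel, callback):
        self.subscribers.setdefault(channel, []).append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers.get(channel, []):
            callback(message)

bus = PubSub()
delivered = []
# server-2 subscribes on behalf of the peers connected to it
bus.subscribe("server-2", delivered.append)
# a peer on server-1 sends a message to a peer on server-2
bus.publish("server-2", {"to": "peer-b", "body": "ICE candidate"})
assert delivered == [{"to": "peer-b", "body": "ICE candidate"}]
```

With Redis pub/sub the `publish`/`subscribe` calls go over the network, so any tracker can reach any peer without trackers holding connections to each other.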
Then we send all the analytics that we collect to a queue, which is then consumed by our analytics and BI (Business Intelligence) pipeline.
Can you talk a little more about that pub/sub in Redis that you just mentioned? You’re using pub/sub to-- You said to send information between servers, right?
Shahar: Yes. We need to transfer some information. When one user wants to send a message to another user who is connected to a different server, you need to pass that message along. There are services that the cloud providers offer, like Google Pub/Sub, but at our scale they're just too expensive to use. We had to build something of our own.
We started using Redis pub/sub as-is, but we noticed it wasn't fast enough for us either, so we built a custom solution around it and made it auto-scalable: it adds more nodes as users join the service. That gives us a practically unlimited number of pub/sub messages that we can send without any issues.
Can you talk about some of your stats of how much data you are pushing between the servers or how much data you're collecting from the peer-to-peer connections?
Shahar: On an average day, we open around one billion WebRTC sessions, meaning around a billion peer connections between users. We see around two billion messages coming into our servers on an average day, and I think around 20 to 30 million users connecting to our service.
The most users we've had at the same time was two million, all connected to our servers. All of them use WebSockets, which means a persistent connection to us, not a normal HTTP request. Each connection sends around two to three messages a second, or something like that, which sometimes gives us around three hundred messages in a second. After a while, once a peer stabilizes, it stops sending messages.
That's a tremendous amount of data in such a short period of time. How are you able to ingest all of that and make sense of it? Maybe we can look at a specific use cases where maybe you're looking to troubleshoot or see some of the issues that peers are having when they're trying to connect to each other. How can you filter it down to that level of detail and make sense of that data?
Shahar: When we first started collecting analytics about our users, we collected on a per-event basis. Basically, everything that happened on the client was collected, which eventually caused our servers to handle too much load. It was around one to two billion events a day. We decided that we had to aggregate on the client side in order to be able to use the data.
We do have some internal analytics that we collect for the debugging process, just as you said. It's more granular, so we can dive into a specific session and see what errors it had and what caused it to not work. That is also limited: if there are a million users, we're not going to collect a million sessions. That would be too much data, so we choose some percentage of the users to collect the data from.
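The sampling idea could look something like this (the percentage and hashing scheme are assumptions for illustration, not Peer5's actual policy): pick a deterministic fraction of sessions for detailed debug collection, so volume stays bounded and the same session is always either in or out of the sample.

```python
# A sketch of deterministic session sampling: hash the session ID into a
# bucket and collect debug data only for a fixed fraction of buckets.
# The 1% default is an assumption, not Peer5's real setting.

import hashlib

def should_collect(session_id, sample_percent=1.0):
    """Deterministically select ~sample_percent% of sessions for debug logging."""
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 10000   # 0..9999
    return bucket < sample_percent * 100

# Roughly 1% of sessions are selected, and the choice is stable per session.
sampled = sum(should_collect(f"session-{i}") for i in range(100_000))
assert 800 < sampled < 1200
assert should_collect("session-42") == should_collect("session-42")
```

Hashing rather than random sampling means all events from a chosen session end up in the sample, which is what makes per-session debugging possible.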
I know you also collect data such as buffering, session length and some other key metrics like that. How can you improve those metrics? Say you see that the buffering time or session length isn't where you want it to be and you can look at some of the data that's coming in. How do you figure out how to solve that? It could be out of your control, it could be due to network issues, latency issues, things like that. How do you fix that?
Shahar: We always monitor playback to make sure that it is always playing and there are no buffering issues. We always make sure to fill the playback buffer as much as we can. In fact, sometimes we can see that the HTTP server of the traditional CDN is too slow, and we can prefer to fetch some of the files from the peer-to-peer network instead, basically saving the user from buffering by having that additional path.
Shahar: Yes, it's all being done in real time. For example, you can start downloading, and let's say that after a few minutes the network goes down or the server goes down. You can keep watching the movie by using the peer-to-peer network, which may already have the files that you are missing.
Shahar: Right, exactly. That's exactly what it means. In fact, we've noticed some edge cases where, if a user connects to the website and starts watching the video and peer-to-peer is fast enough, we can serve even 100% of the video via peer-to-peer without ever going to the server. When the page loads we immediately connect to our network, and we know what video it's going to play, so we can start delivering parts of the video via peer-to-peer before it even starts loading. That way we can reach even 100% peer-to-peer, which is basically amazing.
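The real-time decision Shahar describes could be modeled as a per-segment source choice (the thresholds and names below are illustrative, not Peer5's logic): when the playback buffer runs low and the peer network is measurably faster than the HTTP server, the next segments come from peers instead.

```python
# A toy model of buffer-driven source selection. The 10-second low-water
# mark and the throughput inputs are made-up values for illustration.

def pick_source(buffer_seconds, http_kbps, p2p_kbps, low_water_mark=10):
    # When the buffer is healthy, either source will do; default to HTTP.
    if buffer_seconds >= low_water_mark:
        return "http"
    # Buffer is low: use whichever path is currently faster.
    return "p2p" if p2p_kbps > http_kbps else "http"

assert pick_source(buffer_seconds=30, http_kbps=2000, p2p_kbps=8000) == "http"
assert pick_source(buffer_seconds=4, http_kbps=500, p2p_kbps=8000) == "p2p"
assert pick_source(buffer_seconds=4, http_kbps=3000, p2p_kbps=1000) == "http"
```

The point is that the redundant path is consulted exactly when buffering would otherwise occur.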
One question, though: say I'm on a mobile device or on a capped internet plan with a limited amount of bandwidth per month or per day. How do you make sure the user's bandwidth is not being used too much, especially upload, which I know can be more limited than download?
Shahar: First of all, our customers have a control panel on their side where they can decide what they want to do with mobile users. For example, they can decide that they don't want them to upload anything and they don't want them to download anything or they want them to do both.
I've got a little bit of a wildcard question here. What would you say has been one of your biggest challenges since you've joined Peer5?
Shahar: That's an excellent question. We have a tradition here that every Champions League final or big event we gather around in the office and watch the stats go up as the users connect.
We were crossing two million users and started noticing that it was working but going to hit a limit, so we had to figure out how we were going to handle the next big event, the World Cup, which is coming up in the following months.
Then we decided that we were going to implement a custom scaling solution that can handle the increased number of users that join at the same time.
When users join in the initial minutes of the game, the peak is very high, so we need our own custom scaling solution beyond what the cloud providers offer.
For example, if users join too fast and we do not have enough servers to handle them, it's going to blow up.
What we did is, when we detect that the load from user connections is high, we automatically scale up the service by a few factors in order to stay ahead of the number of users joining.
That's probably going to be our biggest solution for handling massive amounts of users.
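The "scale by a few factors" idea could be sketched like this (the rates, threshold, and factor are invented for illustration): when the connection rate crosses a utilization threshold, capacity is multiplied rather than incremented, so provisioning stays ahead of the viewers pouring in during the first minutes of a game.

```python
# A sketch of multiplicative scale-up on detected spikes. All the numbers
# (500 connects/sec per server, 0.8 threshold, 3x factor) are assumptions.

def target_servers(current, connects_per_sec, per_server_rate=500,
                   spike_threshold=0.8, scale_factor=3):
    utilization = connects_per_sec / (current * per_server_rate)
    if utilization >= spike_threshold:
        return current * scale_factor        # jump ahead of the spike
    return current                           # steady state: no change

assert target_servers(current=10, connects_per_sec=1000) == 10   # 20% utilized
assert target_servers(current=10, connects_per_sec=4500) == 30   # spike: 3x up
```

Multiplying instead of adding trades some over-provisioning for never being behind the arrival curve, which matches the "everyone shows up one minute before the game" pattern.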
By the way, how big of a team does Peer5 have now?
Shahar: Our development team is four developers located in Israel, and we have a sales team in the United States, in San Francisco and LA. Part of the team is in Spain. We are about 11 people right now.
Congratulations on your round of funding, by the way! What was the thought process behind opening that up in California? I know your headquarters are in Israel. How has that helped you or what was the thought process behind that?
Shahar: We noticed that most of the business-development network is located in San Francisco. We were told that if we want to succeed, we basically need to open an office in San Francisco to be able to physically reach some of the customers and do the networking that we need. That's when we decided to open an office there.
Our CEO moved there and is now located in San Francisco. We also hired more salespeople in the United States and Spain in order to expand our network and reach. I think it really helps us increase the number of customers we can acquire in a really short time.
I've heard that the Israel startup ecosystem has been blowing up the last few years and some incredible technologies are coming out of that region. Can you tell us a little more about what it's like currently and what's going on?
Shahar: Yes, it's actually really amazing. Wherever you look, there's a startup. Everyone is looking for more employees and development teams. There are a lot of meetups and conferences right now, mostly about advanced technology, DevOps, and development. It's really emerging and really amazing.
Shahar, I appreciate your time, and thanks for doing this episode!
Shahar: I want to thank you for having me.
If anybody's got questions for you specifically or about Peer5, can they follow you on Twitter or reach out to you somewhere else?
Shahar: Yes, they can go online to our website peer5.com. There's a small box in the bottom right where they can contact us directly.
Christophe: Shahar, thanks again for doing this episode. If you have any questions, please reach out to Shahar. You can also reach out to me on YouTube by subscribing to the Scale Your Code channel, on LinkedIn, or on Twitter @ScaleYourCode.
Thank you all for tuning in and see you next time!