WebRTC stands for Web Real-Time Communication. It is an exciting, powerful, and highly disruptive technology and standard. WebRTC provides a set of plugin-free APIs that can be used in both desktop and mobile browsers, and is progressively becoming supported by all major modern browser vendors. Previously, external plugins were required to achieve functionality similar to what WebRTC offers.
WebRTC leverages multiple standards and protocols, most of which will be discussed in this article. These include data streams, STUN/TURN servers, signaling, JSEP, ICE, SIP, SDP, NAT, UDP/TCP, network sockets, and more.
This article is meant to be a high-level overview of WebRTC, and to cover any relevant terms and concepts so that the reader has a basic, but solid understanding of the technologies, protocols, and processes involved. Please refer to the references below for additional information and technical details.
WebRTC can be used for multiple tasks, but real-time peer-to-peer audio and video (i.e., multimedia) communication is its primary use. In order to communicate with another person (i.e., peer) via a web browser, each person’s web browser must agree to begin communication, know how to locate the other, get through security and firewall protections, and transmit all multimedia communications in real time.
One of the biggest challenges associated with browser-based peer-to-peer communications is knowing how to locate and establish a network socket connection with another computer’s web browser in order to bidirectionally transmit multimedia data. The difficulties associated with this may not seem obvious at first, but let me explain further.
Now let’s say I wanted to have a video chat with my dear ol’ mom. My mom’s computer is not a web server. Therefore, the problem is how do I make the request and actually receive her audio and video data directly, while also sending my audio and video data directly to her, but without going through an external server? Enter WebRTC!
Firewalls and NAT Traversal
Most of us access the internet from a work or home-based network. Our computer typically sits behind a firewall and a network address translation (NAT) device, and therefore is not assigned a static public IP address. From a very high level, a NAT device translates private IP addresses from inside a firewall to public-facing IP addresses. NAT devices are needed both for security and because of the limited number of public IPv4 addresses available.
Here is an example of NAT at work. Suppose you’re at a coffee shop and join their WiFi. Your computer will be assigned an IP address that exists only behind their NAT, say 192.168.1.23. To the outside world, however, your IP address may actually be 203.0.113.7. The outside world will therefore see your requests as coming from 203.0.113.7, but the NAT device will ensure responses to your requests are sent to 192.168.1.23 through the use of mapping tables. Note that in addition to the IP address, a port is also required for network communications, and the required knowledge of an accompanying port is therefore implied throughout this article.
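To make the mapping-table idea concrete, here is a toy Python model of it. This is only a sketch: the class name ToyNat, the starting port 40000, and the example addresses are all assumptions made up for illustration, and a real NAT device tracks far more state (protocols, timeouts, and so on).

```python
from dataclasses import dataclass, field


@dataclass
class ToyNat:
    """Toy model of a NAT mapping table (hypothetical, not a real NAT)."""

    public_ip: str
    next_port: int = 40000  # assumed start of the public port pool
    # (private_ip, private_port) -> public_port
    outbound: dict = field(default_factory=dict)
    # public_port -> (private_ip, private_port)
    inbound: dict = field(default_factory=dict)

    def translate_outbound(self, private_ip: str, private_port: int) -> tuple:
        """Return the public (ip, port) the outside world sees for a private endpoint."""
        key = (private_ip, private_port)
        if key not in self.outbound:
            # First packet from this endpoint: allocate a public port and remember it.
            self.outbound[key] = self.next_port
            self.inbound[self.next_port] = key
            self.next_port += 1
        return (self.public_ip, self.outbound[key])

    def translate_inbound(self, public_port: int):
        """Return the private endpoint for a response, or None if no mapping exists."""
        return self.inbound.get(public_port)


nat = ToyNat(public_ip="203.0.113.7")
seen_by_world = nat.translate_outbound("192.168.1.23", 51000)
delivered_to = nat.translate_inbound(seen_by_world[1])
```

Notice that an inbound packet for a port with no mapping has nowhere to go. That is roughly why a peer on the outside cannot simply start sending data to a computer behind a NAT.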
Given the involvement of a NAT device, how do I know my mom’s IP address to send audio and video data to, and likewise, how does she know what IP address to send audio and video back to?
This is where STUN (Session Traversal Utilities for NAT) and TURN (Traversal Using Relays around NAT) servers come into play. In order for WebRTC technologies to work, a request for your public-facing IP address is first made to a STUN server. Think of it like your computer asking a remote server, “Howdy, would you mind telling me what IP address you see me as having?” The server then responds with something like, “Sure thing, ol’ chap, the way I see it, your IP address is 203.0.113.7.”
Assuming this process works and you receive your public-facing IP address and port, you are then able to tell other peers how to contact you directly. These peers are also able to do the same thing using a STUN or TURN server and can tell you what address to contact them at as well.
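For the curious, the "what address do you see me as having?" exchange can be sketched at the byte level. The Python below builds a STUN Binding Request and decodes the XOR-MAPPED-ADDRESS attribute of a success response, following the message layout in RFC 5389. The function names are my own, only IPv4 is handled, and no network traffic is sent; treat it as a sketch rather than a complete STUN client.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442   # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id: bytes) -> bytes:
    """Build a 20-byte STUN Binding Request with no attributes."""
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    # Header: message type, message length (0 = no attributes), magic cookie.
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def parse_xor_mapped_address(message: bytes):
    """Return (ip, port) from an IPv4 XOR-MAPPED-ADDRESS attribute, or None."""
    if len(message) < 20:
        return None
    msg_type, length, cookie = struct.unpack("!HHI", message[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        return None
    offset, end = 20, 20 + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", message[offset:offset + 4])
        value = message[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            x_port, x_addr = struct.unpack("!HI", value[2:8])
            # Port and address are XOR-ed with the magic cookie on the wire.
            port = x_port ^ (MAGIC_COOKIE >> 16)
            ip = socket.inet_ntoa(struct.pack("!I", x_addr ^ MAGIC_COOKIE))
            return (ip, port)
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    return None


request = build_binding_request(os.urandom(12))
```

In a real client, the request would be sent over UDP to a STUN server (3478 is the standard port) and the reply passed to parse_xor_mapped_address. In the browser, WebRTC does all of this for you.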
Please refer to the resources section for more information on STUN/TURN servers, and note that TURN servers will be discussed below.
Signaling, Sessions, and Protocols
Signaling is neither specified by the WebRTC standard nor implemented by its APIs, which allows flexibility in the technologies and protocols used. Signaling, and the server that handles it, is left to the WebRTC application creator to sort out.
Assuming that your WebRTC browser-based application is able to determine its public-facing IP address using STUN as described, the next step is to actually negotiate and establish the network session connection with your peer. This process is analogous to making a phone call.
The initial session negotiation and establishment happens using a signaling/communication protocol specialized in multimedia communications. This protocol is also responsible for governing the rules by which the session is managed and terminated.
One such protocol is the Session Initiation Protocol (SIP). Note that due to the flexibility of WebRTC signaling, SIP is not the only signaling protocol that can be used. Whichever signaling protocol is chosen must also work with an application layer protocol called the Session Description Protocol (SDP), which WebRTC uses to describe sessions. All multimedia-specific metadata is passed using SDP.
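To give a feel for what SDP looks like, here is a small Python sketch that pulls the media sections and codecs out of a session description. The SDP text is a trimmed, made-up example (real browser offers are much longer), and parse_media_sections is a hypothetical helper written for this article, not part of any WebRTC API.

```python
EXAMPLE_SDP = """\
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 VP8/90000
"""


def parse_media_sections(sdp: str) -> list:
    """Split an SDP body into media sections with their rtpmap codecs."""
    sections = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if len(line) < 2 or line[1] != "=":
            continue  # every SDP line has the form <type>=<value>
        kind, value = line[0], line[2:]
        if kind == "m":
            # m=<media> <port> <proto> <format> ...
            media, port, proto, *formats = value.split()
            sections.append({"media": media, "port": int(port), "proto": proto,
                             "formats": formats, "codecs": {}})
        elif kind == "a" and value.startswith("rtpmap:") and sections:
            # a=rtpmap:<payload type> <codec name>/<clock rate>[/<channels>]
            payload, codec = value[len("rtpmap:"):].split(" ", 1)
            sections[-1]["codecs"][payload] = codec
    return sections


media_sections = parse_media_sections(EXAMPLE_SDP)
```

Here media_sections ends up with one entry for audio (Opus) and one for video (VP8), which is the kind of metadata the two peers need to agree on before any media flows.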
Any peer (i.e., WebRTC-leveraging application) that is attempting to communicate with another peer generates a set of ICE candidates, where ICE stands for the Interactive Connectivity Establishment protocol. Each candidate represents a given combination of IP address, port, and transport protocol to be used. Note that a single computer may have multiple network interfaces (wireless, wired, etc.), so it can be assigned multiple IP addresses, one for each interface.
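A candidate can be modeled as a small record, and candidates are ranked by a priority value. The Python sketch below uses the priority formula and recommended type preferences from RFC 8445 (section 5.1.2). The Candidate class and the example addresses are assumptions for illustration; browsers compute all of this internally.

```python
from dataclasses import dataclass

# Recommended type preferences from RFC 8445: direct (host) candidates rank
# highest, relayed (TURN) candidates lowest.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


@dataclass(frozen=True)
class Candidate:
    ip: str
    port: int
    transport: str                 # "udp" or "tcp"
    kind: str                      # "host", "srflx", "prflx", or "relay"
    component: int = 1             # component ID, between 1 and 256
    local_preference: int = 65535  # value suggested for a single-interface host

    @property
    def priority(self) -> int:
        # priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)
        return ((TYPE_PREFERENCE[self.kind] << 24)
                + (self.local_preference << 8)
                + (256 - self.component))


candidates = [
    Candidate("203.0.113.7", 40000, "udp", "srflx"),   # public address learned via STUN
    Candidate("192.168.1.23", 51000, "udp", "host"),   # local interface address
    Candidate("198.51.100.9", 3478, "udp", "relay"),   # address on a TURN server
]
ranked = sorted(candidates, key=lambda c: c.priority, reverse=True)
```

Sorting by priority puts the host candidate first, the STUN-derived (server reflexive) candidate second, and the relay candidate last.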
Here is a diagram from MDN depicting this exchange.
The Complete Process Summarized
Each peer first establishes its public-facing IP address as described. Signaling data “channels” are then dynamically created to detect peers and support peer-to-peer negotiations and session establishment.
Muaz Khan, in an article of his on signaling concepts, relates these “channels” to being unique and private rooms, in which only those who “know” about, and hang out in, the room (i.e., channel) are able to send and receive messages. These “channels” are not known or accessible to the outside world, and require a unique identifier to access them.
Note that due to the flexibility of WebRTC, and the fact that the signaling process is not specified by the standard, the concept and utilization of “channels” may be slightly different given the technologies used. In fact, some protocols do not require a “channel” mechanism to communicate. We will assume in this discussion that the implementation does utilize “channels”.
Once two or more peers are connected to the same “channel”, the peers are able to communicate and negotiate session information. This process is somewhat similar to the publish/subscribe pattern. Basically, the initiating peer sends an “offer” using a signaling protocol (e.g., SIP) and SDP. The initiator waits to receive an “answer” from any receivers that are connected to the given “channel”.
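The room idea and the offer/answer exchange can be modeled in a few lines of Python. This is an in-memory stand-in only: the SignalingServer class, the room and peer names, and the placeholder SDP strings are all hypothetical, and a real application would typically run this over WebSockets or a similar transport.

```python
from collections import defaultdict


class SignalingServer:
    """In-memory stand-in for a signaling server with private rooms ("channels")."""

    def __init__(self):
        self._rooms = defaultdict(dict)  # room_id -> {peer_id: inbox list}

    def join(self, room_id: str, peer_id: str) -> list:
        """Join a room by its identifier and return this peer's inbox."""
        inbox = []
        self._rooms[room_id][peer_id] = inbox
        return inbox

    def send(self, room_id: str, sender: str, message: dict) -> None:
        """Deliver a message to every other peer in the same room."""
        for peer_id, inbox in self._rooms.get(room_id, {}).items():
            if peer_id != sender:
                inbox.append({"from": sender, **message})


server = SignalingServer()
alice_inbox = server.join("room-42", "alice")
bob_inbox = server.join("room-42", "bob")
eve_inbox = server.join("other-room", "eve")

# The initiator publishes an offer; the receiver replies with an answer.
server.send("room-42", "alice", {"type": "offer", "sdp": "<offer SDP placeholder>"})
offer = bob_inbox.pop(0)
server.send("room-42", "bob", {"type": "answer", "sdp": "<answer SDP placeholder>"})
answer = alice_inbox.pop(0)
```

Eve, who joined a different room, never sees either message, which is the "private room" property described above.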
Once the answer is received, a process occurs to determine and negotiate the best of the ICE candidates gathered by each peer. Once the optimal ICE candidates are chosen, essentially all of the required metadata, network routing (IP address and port), and media information used to communicate for each peer is agreed upon. The network socket session between the peers is then fully established and active. Next, local data streams and data channel endpoints are created by each peer, and multimedia data is finally transmitted both ways using whatever bidirectional communication technology is employed.
If the process of agreeing on the best ICE candidate fails, which does happen sometimes due to firewalls and NAT technologies in use, the fallback is to use a TURN server as a relay instead. This process basically employs a server that acts as an intermediary, and which relays any transmitted data between peers.
This is in contrast to true peer-to-peer communication, in which both peers bidirectionally transmit data directly to one another. When using the TURN fallback for communications, the peers no longer need to know how to contact and transmit data to each other. Instead, they need to know which public TURN server to send real-time multimedia data to, and receive it from, during a communication session.
It’s important to understand that this is definitely a fail-safe and last resort only. TURN servers need to be quite robust, have extensive bandwidth and processing capabilities, and handle potentially large amounts of data. The use of a TURN server therefore incurs additional cost and complexity.
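The fallback logic can be sketched as "try the direct routes first, and only then the relay". The Python below is a deliberately simplified model: choose_route and the can_connect callback are hypothetical names, and real ICE checks pairs of local and remote candidates rather than single candidate types.

```python
def choose_route(candidate_kinds, can_connect):
    """Return the first candidate kind that passes a connectivity check.

    Direct kinds ("host", then "srflx") are tried before the TURN relay.
    Returns None if nothing works.
    """
    for kind in ("host", "srflx", "relay"):
        if kind in candidate_kinds and can_connect(kind):
            return kind
    return None


available = {"host", "srflx", "relay"}

# Both peers are reachable directly: the relay is never used.
direct = choose_route(available, lambda kind: True)

# Restrictive NATs block every direct path: fall back to the TURN relay.
relayed = choose_route(available, lambda kind: kind == "relay")
```

In the first case the result is a direct route, and in the second it is the relay, mirroring the last-resort role of TURN.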
WebRTC allows a desktop or mobile browser-based application to access the device’s microphone and video camera. The browser typically informs the user that an application is requesting access to their computer’s camera and microphone. Once the user allows access to use these devices, WebRTC can create individual streams of transmittable audio and video data from data generated by these input devices. This data is then transmitted via network data ‘channels’ established by the previously discussed processes.
The primary WebRTC APIs include Navigator.getUserMedia (capture audio and video; since deprecated in favor of MediaDevices.getUserMedia), RTCPeerConnection (create and negotiate peer-to-peer connections), and RTCDataChannel (represents a bidirectional data channel between peers).
Please refer to the references section below, particularly MDN, for more information about the individual WebRTC technologies and APIs.
WebRTC is an amazing and highly disruptive standard that involves the orchestration of many technologies and protocols. Both desktop and mobile-based multi-person multimedia chat applications are fully achievable by leveraging WebRTC.
I hope this article has provided an accessible and informative high-level overview of WebRTC. Happy WebRTC’ing!