In-browser media compositing via WebRTC Insertable Streams API

Screen sharing is a core part of real-time communication, especially for remote teams. When you share your screen from a video conferencing app in your browser, the app either opens a new RTCPeerConnection or sends the screen track over the existing one. In both cases, data usage goes up on both sides: the sender uploads a second video stream, and every receiver downloads one.

Most applications put a shared screen, when there is one, at the center of their UI and place camera streams around it. If you have a bad connection, you may want to stop streaming your camera while screen sharing, or receivers may want to stop watching your camera in order to see your screen better.

I use Loom a lot and like to show up in a circle at the bottom left while recording my screen, so I thought it’d be great if I could stream my screen and camera the same way. It would decrease the bandwidth usage of all participants, including me, because I would no longer need to send two different video streams.

Captured from loom.com

This is where WebRTC insertable streams come in handy. With these upcoming features, we will be able to manipulate captured media before encoding and after decoding. I have exciting ideas about the post-decoding side, but for now let’s focus on pre-encoding.

As described in the Funny Hats section of the WebRTC NV use cases, we’ll try to implement in-browser compositing. We have new interfaces like MediaStreamTrackProcessor and MediaStreamTrackGenerator. Before going into those, let’s talk about MediaStream and MediaStreamTrack.

A MediaStream consists of several tracks, such as video or audio tracks. Each track is an instance of MediaStreamTrack (an interface that represents a single media track within a stream). With insertable streams, we get direct access to the raw video (or audio) data flowing through a track and can apply changes to it directly.

We are already able to do this by drawing a video on a canvas and capturing the stream from it (after modifying frames), but all of this occurs on the main thread. Even if you want to use web workers with OffscreenCanvas, you have to grab frames of the video and send them to the worker from the main thread in a loop driven by requestAnimationFrame. That callback is throttled by browsers when the tab loses focus, and you need to work around it by playing almost silent audio in the tab. I’m glad we don’t need to do that anymore.

As in the diagram above, we have two video tracks (camera and screen), and we need to combine them into a single composed track.

A MediaStreamTrackProcessor allows the creation of a ReadableStream that exposes the media flowing through a given MediaStreamTrack. If it is a video track, the chunks exposed by the stream will be VideoFrame objects; if it is an audio track, the chunks will be AudioData objects. This makes MediaStreamTrackProcessor effectively a sink in the MediaStream model. In the other direction, a MediaStreamTrackGenerator allows the creation of a WritableStream that acts as a MediaStreamTrack source in the MediaStream model.
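To make the sink and source roles concrete, here is a minimal sketch of how the two interfaces can be wired up for one track. The helper name createTrackEndpoints and the shape of its return value are my own choices for illustration, and the interfaces are only available in Chromium-based browsers at the time of writing.

```javascript
// Hypothetical helper (not part of any API): builds both endpoints for a track.
function createTrackEndpoints(track) {
  // Sink side: a ReadableStream of VideoFrame (video) or AudioData (audio) chunks.
  const processor = new MediaStreamTrackProcessor({ track });
  // Source side: the generator is itself a MediaStreamTrack fed by its writable.
  const generator = new MediaStreamTrackGenerator({ kind: track.kind });
  return {
    readable: processor.readable,
    writable: generator.writable,
    outputTrack: generator, // can be added to a MediaStream or a peer connection
  };
}
```

For compositing, I would expect to need the readable side of both the camera and the screen track, and the writable side of a single video generator.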

After getting the tracks on the main thread, you just need to pass their readable and writable stream forms to the worker.
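A sketch of that hand-off could look like the following. ReadableStream and WritableStream are transferable in Chromium, so they can go into the transfer list of postMessage. The function name and the message shape (type: "compose") are assumptions made up for this example.

```javascript
// Hypothetical helper: moves the three stream endpoints into a worker.
// Transferred streams are moved, not copied, so the main thread can no
// longer use them after this call.
function sendStreamsToWorker(worker, cameraReadable, screenReadable, composedWritable) {
  worker.postMessage(
    // The message shape is an assumption; any structure works.
    { type: "compose", cameraReadable, screenReadable, composedWritable },
    // Transfer list: every stream named in the message must be listed here.
    [cameraReadable, screenReadable, composedWritable]
  );
}
```

On the main thread, the worker argument would be something like new Worker("composer-worker.js"), where the file name is again only a placeholder.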

In the worker, we’ll use a TransformStream to pipe the data between a ReadableStream and a WritableStream. Its constructor accepts a transform method that lets us modify each chunk as it flows through the stream.
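Stripped of anything media specific, the piping pattern can be sketched like this. The helper pipeThroughTransform and its transformChunk callback are names invented for the example; TransformStream, pipeThrough and pipeTo are the standard Streams API.

```javascript
// Hypothetical helper: pipes readable -> transform -> writable.
// transformChunk receives one chunk and returns (or resolves to) the new chunk.
function pipeThroughTransform(readable, writable, transformChunk) {
  const transformer = new TransformStream({
    async transform(chunk, controller) {
      // Whatever is enqueued here comes out of the readable side.
      controller.enqueue(await transformChunk(chunk));
    },
  });
  // The returned promise settles when the source ends or an error occurs.
  return readable.pipeThrough(transformer).pipeTo(writable);
}
```

With video, each chunk would be a VideoFrame and the callback would return the modified frame.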

We rely on the camera stream to drive the loop: the transform method runs once for every camera frame. Each time we get a new frame from the camera, we try to get another one from the screen stream.
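A first, naive version of that camera-driven loop might look like the sketch below. createNaiveComposer and the compose callback are hypothetical names; compose stands for whatever turns a camera frame and a screen frame into one output frame.

```javascript
// Hypothetical naive composer: one screen read per camera frame.
function createNaiveComposer(screenReadable, compose) {
  const screenReader = screenReadable.getReader();
  return new TransformStream({
    async transform(cameraFrame, controller) {
      // Waits for the next screen frame, so the whole pipeline stalls
      // whenever the screen track stops producing frames.
      const { value: screenFrame, done } = await screenReader.read();
      if (done) {
        // Screen stream ended: pass the camera frame through unchanged.
        controller.enqueue(cameraFrame);
        return;
      }
      controller.enqueue(compose(cameraFrame, screenFrame));
    },
  });
}
```

The camera readable would be piped through this transformer and into the generator's writable.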

When I tried this by drawing the frames and then creating a composed track, it worked like a charm until I tried to share a tab in Chrome. When you share a tab, auto-throttling is enabled by default to decrease CPU load and bandwidth usage. When the tab loses focus or there isn’t much animation in it, Chrome throttles and delays the generation of new frames, and this blocks our transform loop because the transform awaits the next screen frame. Even if there isn’t a new frame from the screen stream, we must continue to draw the scene using the last frame of the screen stream.

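So the screen stream should be read in its own loop, and the transform should always use the most recent screen frame instead of waiting for a new one.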
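Here is a minimal sketch of a non-blocking version, under the assumption that frames expose a close() method as VideoFrame does. The names createComposer, compose and lastScreenFrame are made up for the example. Note that compose has to cope with a null screen frame until the first one arrives, and must not close the screen frame because it is reused.

```javascript
// Hypothetical composer that never waits for the screen track.
function createComposer(screenReadable, compose) {
  let lastScreenFrame = null;

  // Independent loop: remember only the newest screen frame.
  const screenLoop = (async () => {
    const reader = screenReadable.getReader();
    for (;;) {
      const { value, done } = await reader.read();
      if (done) break;
      // Release the frame we are about to replace.
      if (lastScreenFrame) lastScreenFrame.close();
      lastScreenFrame = value;
    }
  })();

  // Camera-driven transform: runs once per camera frame and reuses the
  // last screen frame when the screen track is throttled.
  const transformer = new TransformStream({
    transform(cameraFrame, controller) {
      controller.enqueue(compose(cameraFrame, lastScreenFrame));
    },
  });

  return { transformer, screenLoop };
}
```

The camera readable can then be piped through transformer into the generator's writable, while screenLoop keeps running in the background.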

After getting frames from the camera and screen, we draw them on an OffscreenCanvas and create a new VideoFrame from the canvas.
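The drawing step could be sketched as follows, with the camera shown in a circle at the bottom left. The helper names (cameraCircle, composeFrame), the 25% size ratio and the 16 pixel margin are all assumptions for illustration. VideoFrame can be passed to drawImage, and a new VideoFrame can be built from a canvas as long as a timestamp is supplied.

```javascript
// Hypothetical layout helper: circle in the bottom-left corner of the canvas.
function cameraCircle(canvasWidth, canvasHeight, ratio = 0.25, margin = 16) {
  const radius = Math.round(Math.min(canvasWidth, canvasHeight) * ratio) / 2;
  return { cx: margin + radius, cy: canvasHeight - margin - radius, radius };
}

// Hypothetical compose step: draws both frames and returns a new VideoFrame.
function composeFrame(canvas, ctx, cameraFrame, screenFrame) {
  const { width, height } = canvas;
  ctx.clearRect(0, 0, width, height);
  // The screen frame may still be missing right after start-up.
  if (screenFrame) ctx.drawImage(screenFrame, 0, 0, width, height);

  const { cx, cy, radius } = cameraCircle(width, height);
  // Center-crop the camera to a square so it is not stretched.
  const side = Math.min(cameraFrame.displayWidth, cameraFrame.displayHeight);
  const sx = (cameraFrame.displayWidth - side) / 2;
  const sy = (cameraFrame.displayHeight - side) / 2;

  ctx.save();
  ctx.beginPath();
  ctx.arc(cx, cy, radius, 0, Math.PI * 2);
  ctx.clip(); // everything drawn until restore() stays inside the circle
  ctx.drawImage(cameraFrame, sx, sy, side, side,
    cx - radius, cy - radius, radius * 2, radius * 2);
  ctx.restore();

  const composed = new VideoFrame(canvas, { timestamp: cameraFrame.timestamp });
  cameraFrame.close(); // the screen frame stays open because it is reused
  return composed;
}
```

In the worker, canvas would be something like new OffscreenCanvas(1280, 720) and ctx its "2d" context; the size is just an example.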

What about the Audio?

Chrome allows sharing tab audio, and we need to mix our voice with it if it exists. You can create a MediaStreamTrackProcessor for an audio track and access the raw samples of all channels, but I think this is more suitable for processing tasks like low-pass filtering.

I could also use AudioWorkletProcessor and AudioWorkletNode to move the mixing process off the main thread, but that is outside the scope of this post. So I’ll keep it simple and use an AudioContext to create a media stream destination and connect all sources to it.
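A minimal sketch of that mixing step, assuming the microphone track and the optional tab audio track are already available. mixAudioTracks is a hypothetical helper name; createMediaStreamDestination and createMediaStreamSource are standard Web Audio methods.

```javascript
// Hypothetical helper: mixes any number of audio tracks into a single track.
function mixAudioTracks(audioContext, tracks) {
  const destination = audioContext.createMediaStreamDestination();
  // Tab audio may not exist, so missing entries are skipped.
  for (const track of tracks.filter(Boolean)) {
    // A source node needs a MediaStream, so each track gets its own wrapper.
    const source = audioContext.createMediaStreamSource(new MediaStream([track]));
    source.connect(destination);
  }
  // The destination's stream carries one audio track with everything mixed.
  return destination.stream.getAudioTracks()[0];
}
```

The returned track can be combined with the composed video track in a new MediaStream before sending it to the other participants.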

Now you can see the result:

https://yak0.github.io/media-composer

Chromium is becoming a powerful browser with these upcoming features, and I hope others will follow. The fragmentation of WebRTC into pieces like WebTransport + WebCodecs + WebGPU, together with features like insertable streams, opens new areas to explore.

I will continue to share my experiences on these topics.

Demo: https://yak0.github.io/media-composer

Source: https://github.com/yak0/media-composer

Software Developer @Superpeer