WebSocket & WebRTC Tutorial: Real-Time Communication on the Web

A hands-on guide to real-time communication on the web, built with Next.js, Socket.IO, the browser's native WebRTC APIs, and the OpenAI Realtime API. Six progressive tabs take you from a simple broadcast chat all the way to live AI-powered speech transcription.

Wayne Cheng

Complete Tutorial Code

Follow along with the complete source code for this WebSocket & WebRTC tutorial. Includes six self-contained demos — from a broadcast chat to peer-to-peer video calls and live AI transcription.

View on GitHub: https://github.com/audoir/websocket-webrtc-tutorial

Introduction

Real-time communication is at the heart of modern web applications — from live chat and collaborative editing to video calls and AI-powered voice interfaces. Two browser technologies make this possible: WebSockets, which provide a persistent bidirectional channel between a browser and a server, and WebRTC, which enables direct peer-to-peer connections between browsers for ultra-low-latency data, audio, and video.

This tutorial walks you through both technologies side by side. The app has six tabs, each a self-contained demo that teaches a different communication pattern — starting simple and building up to more advanced techniques, including live speech-to-text powered by the OpenAI Realtime API.

WebSocket vs. WebRTC: A Quick Comparison

Before diving into the code, it helps to understand when to reach for each technology.

WebSocket

Browser ↔ Server (persistent)

• Server is always in the loop
• Scales to N clients via broadcast
• Simple setup — no signaling needed
• Great for group chat, live feeds, notifications

WebRTC

Browser ↔ Browser (direct P2P)

• Server only assists during setup (signaling)
• Ultra-low latency once connected
• Supports data channels, audio, and video
• Great for video calls, P2P file transfer

Tech Stack

The tutorial uses a modern TypeScript stack that keeps all six demos running on a single port:

Frontend: Next.js, React, TypeScript, Tailwind CSS

Backend: Custom Node.js HTTP server (server.ts) that mounts both Next.js and Socket.IO on port 3000

Real-time: Socket.IO (WebSocket transport + signaling relay), native browser WebRTC APIs

AI: OpenAI Realtime API (gpt-realtime-whisper) for live speech transcription

Project Structure

The project is organized around a custom server, shared hooks, and per-tab UI components:

server.ts                              # Custom HTTP server: Next.js + Socket.IO
lib/
  webrtc.ts                            # Shared WebRTC constants & types (ICE servers, status labels)
  audio.ts                             # Shared audio utility: Float32 → PCM16 Base64 encoder
  socket/
    chatHandler.ts                     # Server-side: handles "chat message" events
    webrtcHandler.ts                   # Server-side: WebRTC data-channel signaling relay
    webrtcVideoHandler.ts              # Server-side: WebRTC video/audio signaling relay
    serverTranscriptionHandler.ts      # Server-side: proxies audio to OpenAI Realtime API
hooks/
  useSocket.ts                         # Client hook: WebSocket chat state & logic
  useWebRTC.ts                         # Client hook: WebRTC connection & data channel
  useWebRTCVideo.ts                    # Client hook: WebRTC video/audio peer connection
  useTranscription.ts                  # Client hook: shared transcription state & event handling
app/
  api/
    transcription-session/
      route.ts                         # API route: mints ephemeral OpenAI token (server-side)
  components/
    BasicWebSocket.tsx                 # Tab 1 UI component
    BasicWebRTC.tsx                    # Tab 2 UI component
    WebRTCVideo.tsx                    # Tab 3 UI component
    WebRTCTranscription.tsx            # Tab 4 UI component
    WebSocketTranscription.tsx         # Tab 5 UI component
    ServerTranscription.tsx            # Tab 6 UI component
types/
  message.ts                           # Shared Message type

Tutorial Overview

The six tabs are designed to be explored in order. Each one introduces a new concept while building on what came before:

🔌 Basic WebSocket — Real-Time Chat

A classic broadcast chat where every connected browser tab sees every message in real time. The server is the hub — all messages flow through it via Socket.IO.

Socket.IO · Broadcast · useSocket.ts

📡 Basic WebRTC — Peer-to-Peer Chat

Messages travel directly between two browser tabs via an RTCDataChannel. The server only assists during the initial signaling handshake (offer/answer/ICE).

RTCDataChannel · SDP Offer/Answer · ICE Candidates

🎥 WebRTC Video — Peer-to-Peer Video Call

Live camera and microphone streams flow directly between two browser tabs using RTCPeerConnection media tracks. The server is only involved during signaling.

getUserMedia · addTrack · ontrack

🎙️ WebRTC Transcription — Live Speech-to-Text via WebRTC

Microphone audio is streamed directly to the OpenAI Realtime API over a WebRTC peer connection. Transcript text appears word by word in real time. The server only mints a short-lived ephemeral token.

OpenAI Realtime API · Ephemeral Token · RTCDataChannel oai-events

🎙️ WebSocket Transcription — Live Speech-to-Text via WebSocket

The same live transcription as Tab 4, but the browser opens a WebSocket directly to OpenAI. Audio is captured via the Web Audio API, encoded as Base64 PCM16, and sent as JSON messages.

WebSocket · Web Audio API · PCM16 Base64

🖥️ Server Transcription — Secure Proxy via Socket.IO

The browser never touches the OpenAI API directly. Audio is sent to your own Node.js server via Socket.IO, and the server opens a WebSocket to OpenAI using the API key stored securely on the server.

Socket.IO Proxy · Server-side API Key · No Ephemeral Token

Getting Started

Follow these steps to run all six demos on your local machine:

Prerequisites

• Node.js (v18 or higher)
• OpenAI API key (required for Tabs 4, 5, and 6 only)

Installation Steps

1. Clone the repository and enter the project directory:
   git clone https://github.com/audoir/websocket-webrtc-tutorial.git
   cd websocket-webrtc-tutorial
2. Install dependencies:
   npm install
3. Set your OpenAI API key (for Tabs 4, 5 & 6):
   echo "OPENAI_API_KEY=sk-..." > .env
   The key is only used server-side and is never exposed to the browser.
4. Start the dev server:
   npm run dev
   This starts the custom Node.js server (server.ts) that mounts both Next.js and Socket.IO on port 3000, not the standard Next.js dev server.
5. Open your browser:
   http://localhost:3000

Tab 1 — Basic WebSocket: Real-Time Chat

The first tab demonstrates the simplest real-time pattern: a broadcast chat where every connected browser tab sees every message instantly. Open the app in two or more tabs, type a message in any tab, and watch it appear in all the others.

How It Works

When the component mounts, useSocket calls io() to open a Socket.IO connection. Socket.IO uses a WebSocket under the hood (with HTTP long-polling as a fallback). When you send a message, the hook emits a "chat message" event to the server. The server receives it and calls io.emit("chat message", msg, socket.id) — broadcasting to all connected clients. Each tab listens for the event and appends the message to its list. If the incoming senderId matches the local socket's own ID, the message is ignored (it was already added optimistically).

// lib/socket/chatHandler.ts
socket.on("chat message", (msg) => {
  // Broadcast to ALL connected clients, including the sender
  io.emit("chat message", msg, socket.id);
});

// hooks/useSocket.ts
socket.on("chat message", (msg, senderId) => {
  if (senderId === socket.id) return; // already added optimistically
  setMessages((prev) => [...prev, msg]);
});
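
The two snippets above leave out the sending side. The optimistic pattern can be sketched as a pair of pure functions; `ChatMessage`, `addOptimistic`, and `applyIncoming` are hypothetical names chosen for illustration and are not taken from the repository's hook.

```typescript
// Hypothetical message shape; the repo's shared Message type may differ.
interface ChatMessage {
  text: string;
  sentAt: number;
}

// Sender side: append locally right away, before the server echoes it back.
function addOptimistic(messages: ChatMessage[], msg: ChatMessage): ChatMessage[] {
  return [...messages, msg];
}

// Receiver side: skip the server's echo of our own message, append all others.
function applyIncoming(
  messages: ChatMessage[],
  msg: ChatMessage,
  senderId: string,
  localId: string | undefined,
): ChatMessage[] {
  if (senderId === localId) return messages; // already added optimistically
  return [...messages, msg];
}

// Example: the local tab ("a1") sends one message and receives one from "b2".
const mine: ChatMessage = { text: "hello", sentAt: 1 };
const theirs: ChatMessage = { text: "hi back", sentAt: 2 };
let chatLog = addOptimistic([], mine);
chatLog = applyIncoming(chatLog, mine, "a1", "a1"); // echo of our own message: ignored
chatLog = applyIncoming(chatLog, theirs, "b2", "a1"); // another tab's message: appended
console.log(chatLog.map((m) => m.text)); // prints [ 'hello', 'hi back' ]
```

Keeping the logic pure makes it easy to drop into a `setMessages((prev) => ...)` updater.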

Tab 2 — Basic WebRTC: Peer-to-Peer Chat

The second tab demonstrates a peer-to-peer chat where messages travel directly between two browser tabs — the server is only involved during the initial handshake. Once connected, the server is completely out of the loop.

Phase 1: Signaling (Server-Assisted)

WebRTC requires a brief setup phase called signaling before two peers can talk directly. Both tabs join the same room name. The server assigns roles: the second joiner becomes the initiator and creates an SDP offer. The server relays the offer to the first tab, which creates an SDP answer. ICE candidates are exchanged through the server relay in both directions.

// Signaling flow:
//   Tab A emits "webrtc:join" → server assigns { initiator: false }
//   Tab B emits "webrtc:join" → server assigns { initiator: true }

// Tab B (initiator) creates offer:
const pc = new RTCPeerConnection(ICE_SERVERS);
const channel = pc.createDataChannel("chat");
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
socket.emit("webrtc:offer", offer);

// Tab A receives the relayed offer and creates an answer:
await pc.setRemoteDescription(offer);
const answer = await pc.createAnswer();
await pc.setLocalDescription(answer);
socket.emit("webrtc:answer", answer);
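
The role assignment on the server side can be sketched as a small pure function. This is an illustration under stated assumptions, not the contents of webrtcHandler.ts: the `Rooms` map, the `joinRoom` name, and the choice to reject a third peer are all hypothetical.

```typescript
// Hypothetical in-memory room registry: room name -> socket IDs in join order.
type Rooms = Map<string, string[]>;

type JoinResult =
  | { ok: true; initiator: boolean }
  | { ok: false; reason: "room-full" };

// First joiner waits (initiator: false); second joiner starts the offer
// (initiator: true). Rejecting a third peer is an assumption of this sketch.
function joinRoom(rooms: Rooms, room: string, socketId: string): JoinResult {
  const members = rooms.get(room) ?? [];
  if (members.length >= 2) return { ok: false, reason: "room-full" };
  rooms.set(room, [...members, socketId]);
  return { ok: true, initiator: members.length === 1 };
}

// Example: three tabs try to join the same room.
const rooms: Rooms = new Map();
const tabA = joinRoom(rooms, "demo", "socket-a"); // { ok: true, initiator: false }
const tabB = joinRoom(rooms, "demo", "socket-b"); // { ok: true, initiator: true }
const tabC = joinRoom(rooms, "demo", "socket-c"); // { ok: false, reason: "room-full" }
console.log(tabA, tabB, tabC);
```

Making the second joiner the initiator means the offer is only created once both peers are present, so the server never has to hold an offer for a peer that has not arrived yet.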

Phase 2: Direct P2P (Server No Longer Involved)

Once ICE negotiation succeeds, the RTCDataChannel fires its onopen event and the status updates to "Connected (P2P)". From this point, channel.send(text) pushes messages directly to the other peer's browser. The server never sees these messages.
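
Because channel.send throws if the channel is not yet open, a UI that lets the user type early needs some guard. One possible approach, sketched below, queues text until onopen fires. The `OutgoingQueue` class and `ChannelLike` interface are hypothetical helpers for illustration; the tutorial's hook may simply disable the input until the status is "Connected (P2P)".

```typescript
// Minimal structural stand-in for the parts of RTCDataChannel used here.
interface ChannelLike {
  readyState: "connecting" | "open" | "closing" | "closed";
  send(data: string): void;
}

// Queues text typed before the channel opens, then flushes it in order.
class OutgoingQueue {
  private pending: string[] = [];
  private channel: ChannelLike;

  constructor(channel: ChannelLike) {
    this.channel = channel;
  }

  send(text: string): void {
    if (this.channel.readyState === "open") this.channel.send(text);
    else if (this.channel.readyState === "connecting") this.pending.push(text);
    // "closing" or "closed": drop the message, the peer is gone.
  }

  // Call this from the channel's onopen handler.
  flush(): void {
    for (const text of this.pending) this.channel.send(text);
    this.pending = [];
  }
}

// Example with a fake channel that records what was sent.
const sent: string[] = [];
const fakeChannel: ChannelLike = {
  readyState: "connecting",
  send: (data) => { sent.push(data); },
};
const queue = new OutgoingQueue(fakeChannel);
queue.send("first"); // queued: channel is still connecting
fakeChannel.readyState = "open";
queue.flush(); // what the onopen handler would do
queue.send("second"); // sent immediately
console.log(sent); // prints [ 'first', 'second' ]
```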

Tab 3 — WebRTC Video: Peer-to-Peer Video Call

The third tab extends the WebRTC pattern to live video and audio. The key difference from Tab 2 is the transport: instead of an RTCDataChannel carrying text, this tab uses media tracks (RTCPeerConnection.addTrack) to stream real-time video and audio.

Tab 2 — Data Channel

• pc.createDataChannel("chat")
• Carries text messages
• No media permissions needed

Tab 3 — Media Tracks

• getUserMedia({ video, audio })
• pc.addTrack(track, stream)
• pc.ontrack receives remote video

Before connecting to the server, useWebRTCVideo calls navigator.mediaDevices.getUserMedia({ video: true, audio: true }) to request camera and microphone access. The resulting MediaStream is immediately attached to the local <video> element so you see your own preview right away. All local tracks are added to the peer connection via localStream.getTracks().forEach(track => pc.addTrack(track, localStream)). When the remote peer's tracks arrive, pc.ontrack attaches them to the remote <video> element.
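
The sending and receiving halves described above can be sketched as two small functions. To keep the sketch runnable outside a browser it uses hypothetical structural interfaces (`TrackLike`, `StreamLike`, `PeerLike`, and so on) in place of the real DOM types; the function names are also illustrative and not taken from useWebRTCVideo.

```typescript
// Structural stand-ins so the sketch runs outside a browser. In the app these
// roles are played by MediaStreamTrack, MediaStream, RTCPeerConnection,
// RTCTrackEvent, and HTMLVideoElement.
interface TrackLike { kind: string }
interface StreamLike { getTracks(): TrackLike[] }
interface PeerLike { addTrack(track: TrackLike, stream: StreamLike): unknown }
interface TrackEventLike { streams: readonly StreamLike[] }
interface VideoLike { srcObject: StreamLike | null }

// Sending side: add every local track (camera + mic) to the peer connection.
function attachLocalTracks(pc: PeerLike, localStream: StreamLike): number {
  const tracks = localStream.getTracks();
  tracks.forEach((track) => pc.addTrack(track, localStream));
  return tracks.length;
}

// Receiving side: body of a pc.ontrack handler. Shows the first remote stream.
function showRemoteStream(event: TrackEventLike, remoteVideo: VideoLike): void {
  const [remoteStream] = event.streams;
  if (remoteStream) remoteVideo.srcObject = remoteStream;
}

// Example with fakes standing in for the browser objects.
const added: string[] = [];
const fakePc: PeerLike = { addTrack: (track) => added.push(track.kind) };
const fakeStream: StreamLike = { getTracks: () => [{ kind: "video" }, { kind: "audio" }] };
const fakeVideo: VideoLike = { srcObject: null };

const trackCount = attachLocalTracks(fakePc, fakeStream);
showRemoteStream({ streams: [fakeStream] }, fakeVideo);
console.log(trackCount, added); // prints 2 [ 'video', 'audio' ]
```

Passing the stream as the second argument to addTrack is what lets the remote side find it in event.streams.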

Tabs 4, 5 & 6 — Live AI Transcription

The final three tabs all produce the same result — live speech-to-text powered by the OpenAI Realtime API (gpt-realtime-whisper) — but they differ in how audio gets to OpenAI and where the API key lives.

	Tab 4 — WebRTC	Tab 5 — WebSocket	Tab 6 — Server
Audio transport	Native media track	PCM16 Base64 over WebSocket	PCM16 Base64 via Socket.IO → server WS
Auth mechanism	Ephemeral key in HTTP header	Ephemeral key as WS subprotocol	Server uses OPENAI_API_KEY directly
Ephemeral token	Yes	Yes	No
API key exposure	Ephemeral key in browser	Ephemeral key in browser	Key stays on server only

Tab 4 — WebRTC Transcription

The browser establishes a WebRTC peer connection directly with OpenAI. The server's only role is to securely mint a short-lived ephemeral token so the browser never needs to hold your real API key.

// Step 1: Mint ephemeral token (server-side)
// app/api/transcription-session/route.ts
//   POST https://api.openai.com/v1/realtime/client_secrets
//   → returns { value: "ek_..." }

// Step 2: Establish WebRTC connection (browser-side)
const pc = new RTCPeerConnection();
pc.addTrack(micTrack, stream);                    // send microphone audio
const dc = pc.createDataChannel("oai-events");    // receive transcript events
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

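// POST the SDP offer to OpenAI, authenticated with the ephemeral key ("ek_...")
const res = await fetch("https://api.openai.com/v1/realtime/calls", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${ephemeralKey}`,
    "Content-Type": "application/sdp",
  },
  body: offer.sdp,
});
const answer = await res.text(); // response body is the SDP answer

await pc.setRemoteDescription({ type: "answer", sdp: answer });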

// Step 3: Receive transcript events over the data channel
dc.onmessage = (e) => {
  const event = JSON.parse(e.data);
  // conversation.item.input_audio_transcription.delta → live text
  // conversation.item.input_audio_transcription.completed → final text
};
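
Step 1 above can be sketched as a small server-side helper. This is a sketch under assumptions, not the contents of route.ts: `mintEphemeralKey` and `FetchLike` are hypothetical names, and the request body (`sessionConfig`) is left as an opaque parameter because the article does not show what the route sends. Only the endpoint and the `{ value: "ek_..." }` response shape come from the text above.

```typescript
// Minimal fetch-like signature so a fake can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Requests an ephemeral client secret and returns only its value ("ek_...").
// sessionConfig is passed through untouched; its shape is not covered here.
async function mintEphemeralKey(
  apiKey: string,
  sessionConfig: unknown,
  fetchImpl: FetchLike,
): Promise<string> {
  const res = await fetchImpl("https://api.openai.com/v1/realtime/client_secrets", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(sessionConfig),
  });
  if (!res.ok) throw new Error(`client_secrets request failed: ${res.status}`);
  const data = (await res.json()) as { value?: unknown };
  if (typeof data.value !== "string") throw new Error("response had no value field");
  return data.value;
}

// Example with a fake fetch, so no network call or real key is needed.
const seenHeaders: Record<string, string>[] = [];
const fakeFetch: FetchLike = async (_url, init) => {
  seenHeaders.push(init.headers);
  return { ok: true, status: 200, json: async () => ({ value: "ek_test" }) };
};
const keyPromise = mintEphemeralKey("sk-test", {}, fakeFetch);
keyPromise.then((key) => console.log(key)); // prints ek_test
```

In a real route handler the long-lived key would come from process.env.OPENAI_API_KEY, and only the returned ephemeral value would be sent to the browser.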

Tab 5 — WebSocket Transcription

The browser opens a WebSocket directly to wss://api.openai.com/v1/realtime. Because the browser WebSocket API does not allow custom HTTP headers, the ephemeral key is passed as a WebSocket subprotocol string.

// Open WebSocket with ephemeral key as subprotocol
const ws = new WebSocket("wss://api.openai.com/v1/realtime", [
  "realtime",
  `openai-insecure-api-key.${ephemeralKey}`,
]);

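// Capture microphone via Web Audio API
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const ctx = new AudioContext({ sampleRate: 24000 });
const source = ctx.createMediaStreamSource(stream);
// ScriptProcessorNode is deprecated in favor of AudioWorklet but is still widely supported
const processor = ctx.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (e) => {
  if (ws.readyState !== WebSocket.OPEN) return; // skip audio until the socket is open
  const float32 = e.inputBuffer.getChannelData(0);
  const base64 = float32ToPcm16Base64(float32);
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64,
  }));
};
source.connect(processor);
processor.connect(ctx.destination); // some browsers only fire onaudioprocess when the node is connected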
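
The float32ToPcm16Base64 helper called above lives in lib/audio.ts. One possible implementation is sketched here; the repository's version may differ in detail. It clamps each sample, scales it to a signed 16-bit integer, writes it little-endian, and Base64-encodes the resulting bytes.

```typescript
// Converts Web Audio samples (floats in [-1, 1]) to little-endian 16-bit PCM,
// then Base64-encodes the bytes.
function float32ToPcm16Base64(float32: Float32Array): string {
  const bytes = new Uint8Array(float32.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp out-of-range samples
    // Negative samples scale to -32768, positive samples to 32767.
    view.setInt16(i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary); // btoa is global in browsers and in Node.js 16+
}

// Silence, full-scale positive, full-scale negative:
// bytes 00 00 | FF 7F | 00 80
console.log(float32ToPcm16Base64(new Float32Array([0, 1, -1]))); // prints AAD/fwCA
```

Building the binary string in a loop rather than with a spread call avoids argument-count limits on large buffers.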

Tab 6 — Server Transcription

The most secure approach: the browser sends audio to your own Node.js server via Socket.IO, and the server opens a WebSocket to OpenAI using the OPENAI_API_KEY stored securely on the server. No ephemeral token is needed, and the API key never leaves the server.

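// lib/socket/serverTranscriptionHandler.ts
import WebSocket from "ws"; // assumes the "ws" package, which supports custom headers

// One upstream socket per client connection, declared outside the handlers
// so both "start" and "audio" can reach it
let openaiWs: WebSocket | null = null;

socket.on("server-transcription:start", () => {
  // Server opens WebSocket to OpenAI with OPENAI_API_KEY
  openaiWs = new WebSocket(
    "wss://api.openai.com/v1/realtime?intent=transcription",
    { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
  );
  openaiWs.on("open", () => {
    socket.emit("server-transcription:connected");
  });
  openaiWs.on("message", (raw) => {
    socket.emit("server-transcription:event", raw.toString());
  });
});

socket.on("server-transcription:audio", (base64) => {
  // Drop audio that arrives before the upstream socket is open
  if (!openaiWs || openaiWs.readyState !== WebSocket.OPEN) return;
  openaiWs.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64,
  }));
});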

The Shared `useTranscription` Hook

Tabs 4, 5, and 6 all use the same useTranscription hook for event handling and transcript display. The hook accumulates delta events into a rolling text block, updating it in real time as you speak. When the completed event arrives, the in-progress segment is replaced with the corrected final text and a newline is added so the next utterance starts on a fresh line.

// hooks/useTranscription.ts
// Two event types from OpenAI:
// 1. delta  → small chunk of new text (fires continuously as you speak)
// 2. completed → full corrected text for that speech turn

function handleEvent(raw: string) {
  const event = JSON.parse(raw);
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    // Append delta to the in-progress segment identified by event.item_id
    updateSegment(event.item_id, (prev) => prev + event.delta);
  }
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    // Replace in-progress segment with the final corrected text
    finalizeSegment(event.item_id, event.transcript + "\n");
  }
}
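
The updateSegment and finalizeSegment helpers are not shown above. Their combined behavior can be sketched as a single pure reducer over a list of segments; the `Segment` shape and the `reduceTranscript` name are hypothetical, and the hook's real state layout may differ.

```typescript
// Hypothetical state: one text segment per OpenAI item_id, in arrival order.
interface Segment { itemId: string; text: string; final: boolean }

interface TranscriptionEvent {
  type: string;
  item_id: string;
  delta?: string;
  transcript?: string;
}

const DELTA = "conversation.item.input_audio_transcription.delta";
const COMPLETED = "conversation.item.input_audio_transcription.completed";

// Pure reducer: returns a new segment list for each incoming event.
function reduceTranscript(segments: Segment[], event: TranscriptionEvent): Segment[] {
  if (event.type !== DELTA && event.type !== COMPLETED) return segments;
  const existing = segments.find((s) => s.itemId === event.item_id);
  const next: Segment =
    event.type === DELTA
      ? { itemId: event.item_id, text: (existing?.text ?? "") + (event.delta ?? ""), final: false }
      : { itemId: event.item_id, text: (event.transcript ?? "") + "\n", final: true };
  return existing
    ? segments.map((s) => (s.itemId === event.item_id ? next : s))
    : [...segments, next];
}

// Example: two deltas followed by the corrected final text for the same item.
const events: TranscriptionEvent[] = [
  { type: DELTA, item_id: "item_1", delta: "helo" },
  { type: DELTA, item_id: "item_1", delta: " wrld" },
  { type: COMPLETED, item_id: "item_1", transcript: "Hello world." },
];
const inProgress = events.slice(0, 2).reduce<Segment[]>(reduceTranscript, []);
const finished = events.reduce<Segment[]>(reduceTranscript, []);
console.log(inProgress[0].text); // prints helo wrld
console.log(JSON.stringify(finished[0].text)); // prints "Hello world.\n"
```

Keying segments by item_id keeps each utterance separate even if events for different utterances arrive interleaved.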

Key Concepts Demonstrated

WebSocket Broadcasting

Persistent bidirectional connections with Socket.IO. Server broadcasts to all connected clients simultaneously.

WebRTC Signaling

SDP offer/answer exchange and ICE candidate relay to establish direct peer-to-peer connections across NATs and firewalls.

Media Capture & Streaming

Using getUserMedia to capture camera and microphone, then streaming media tracks over a WebRTC peer connection.

Real-Time AI Transcription

Three approaches to live speech-to-text with the OpenAI Realtime API — WebRTC, WebSocket, and server proxy — each with different security and complexity trade-offs.

Secure API Key Handling

Minting short-lived ephemeral tokens server-side so the real API key never reaches the browser, and a fully server-proxied approach where the key never leaves the server at all.

Custom React Hooks

Encapsulating all connection logic — WebSocket, WebRTC, and transcription — in reusable custom hooks that expose a clean API to UI components.

Full Comparison: All Six Tabs

Tab	Message/Media Path	Server Role	Scales to N Clients
1 — WebSocket Chat	Browser → Server → All browsers	Always in the loop	Yes
2 — WebRTC Chat	Browser ↔ Browser (direct)	Only during signaling	No (2 peers)
3 — WebRTC Video	Browser ↔ Browser (direct)	Only during signaling	No (2 peers)
4 — WebRTC Transcription	Browser → OpenAI (direct)	Only mints ephemeral token	Yes (per-user session)
5 — WebSocket Transcription	Browser → OpenAI (direct)	Only mints ephemeral token	Yes (per-user session)
6 — Server Transcription	Browser → Server → OpenAI	Full proxy (manages OpenAI WS)	Yes (per-user session)

Learning Outcomes

By working through this tutorial, you will have gained practical experience with:

• Building real-time broadcast chat with Socket.IO
• Implementing WebRTC signaling (offer/answer/ICE) from scratch
• Streaming live video and audio between browser peers
• Connecting to the OpenAI Realtime API via WebRTC and WebSocket
• Encoding microphone audio as PCM16 Base64 using the Web Audio API
• Minting ephemeral tokens server-side for secure API access
• Building a server-side proxy so the browser never calls the external API or holds any of its credentials
• Encapsulating real-time logic in reusable custom React hooks
• Comparing trade-offs between WebSocket, WebRTC, and server-proxy architectures

Conclusion

WebSockets and WebRTC are complementary technologies that together cover the full spectrum of real-time web communication. WebSockets excel at server-mediated scenarios — group chat, live feeds, notifications — where the server needs to be in the loop. WebRTC shines for peer-to-peer scenarios — video calls, file transfer, low-latency data — where you want to minimize server involvement after the initial handshake.

The three transcription tabs show how the same AI capability can be delivered with very different architectures, each with its own trade-offs around security, complexity, and browser compatibility. Tab 6's server-proxy approach is the most secure and the most straightforward to reason about — the API key never leaves the server, and no ephemeral token dance is required.

About the Author

Wayne Cheng is the founder and AI app developer at Audoir, LLC. Prior to founding Audoir, he worked as a hardware design engineer for Silicon Valley startups and an audio engineer for creative organizations. He holds an MSEE from UC Davis and a Music Technology degree from Foothill College.

Further Exploration

Explore the complete tutorial repository and experiment with extending the demos. Consider adding a third peer to the WebRTC rooms, implementing TURN server support for stricter network environments, or building a multi-user transcription session where all participants see a shared live transcript.

For more AI-powered development tools and tutorials, visit Audoir.