WebSocket & WebRTC Tutorial: Real-Time Communication on the Web

A hands-on guide to real-time communication on the web, built with Next.js, Socket.IO, the browser's native WebRTC APIs, and the OpenAI Realtime API. Six progressive tabs take you from a simple broadcast chat all the way to live AI-powered speech transcription.

Complete Tutorial Code

Follow along with the complete source code for this WebSocket & WebRTC tutorial. Includes six self-contained demos — from a broadcast chat to peer-to-peer video calls and live AI transcription.

Introduction

Real-time communication is at the heart of modern web applications — from live chat and collaborative editing to video calls and AI-powered voice interfaces. Two browser technologies make this possible: WebSockets, which provide a persistent bidirectional channel between a browser and a server, and WebRTC, which enables direct peer-to-peer connections between browsers for ultra-low-latency data, audio, and video.

This tutorial walks you through both technologies side by side. The app has six tabs, each a self-contained demo that teaches a different communication pattern — starting simple and building up to more advanced techniques, including live speech-to-text powered by the OpenAI Realtime API.

WebSocket vs. WebRTC: A Quick Comparison

Before diving into the code, it helps to understand when to reach for each technology.

WebSocket

Browser ↔ Server (persistent)
  • Server is always in the loop
  • Scales to N clients via broadcast
  • Simple setup: no signaling needed
  • Great for group chat, live feeds, notifications

WebRTC

Browser ↔ Browser (direct P2P)
  • Server only assists during setup (signaling)
  • Ultra-low latency once connected
  • Supports data channels, audio, and video
  • Great for video calls, P2P file transfer

Tech Stack

The tutorial uses a modern TypeScript stack that keeps all six demos running on a single port:

Frontend: Next.js, React, TypeScript, Tailwind CSS
Backend: Custom Node.js HTTP server (server.ts) that mounts both Next.js and Socket.IO on port 3000
Real-time: Socket.IO (WebSocket transport + signaling relay), native browser WebRTC APIs
AI: OpenAI Realtime API (gpt-realtime-whisper) for live speech transcription

Project Structure

The project is organized around a custom server, shared hooks, and per-tab UI components:

server.ts                              # Custom HTTP server: Next.js + Socket.IO
lib/
  webrtc.ts                            # Shared WebRTC constants & types (ICE servers, status labels)
  audio.ts                             # Shared audio utility: Float32 → PCM16 Base64 encoder
  socket/
    chatHandler.ts                     # Server-side: handles "chat message" events
    webrtcHandler.ts                   # Server-side: WebRTC data-channel signaling relay
    webrtcVideoHandler.ts              # Server-side: WebRTC video/audio signaling relay
    serverTranscriptionHandler.ts      # Server-side: proxies audio to OpenAI Realtime API via WebSocket
hooks/
  useSocket.ts                         # Client hook: WebSocket chat state & logic
  useWebRTC.ts                         # Client hook: WebRTC connection & data channel
  useWebRTCVideo.ts                    # Client hook: WebRTC video/audio peer connection
  useTranscription.ts                  # Client hook: shared transcription state & event handling
app/
  api/
    transcription-session/
      route.ts                         # API route: mints ephemeral OpenAI token (server-side)
  components/
    BasicWebSocket.tsx                 # Tab 1 UI component
    BasicWebRTC.tsx                    # Tab 2 UI component
    WebRTCVideo.tsx                    # Tab 3 UI component
    WebRTCTranscription.tsx            # Tab 4 UI component
    WebSocketTranscription.tsx         # Tab 5 UI component
    ServerTranscription.tsx            # Tab 6 UI component
    ui/                                # Shared UI: ChatInput, MessageList, RoomPicker, etc.
types/
  message.ts                           # Shared Message type

Getting Started

Follow these steps to run all six demos on your local machine. The full source code is available at github.com/audoir/websocket-webrtc-tutorial.

Prerequisites

  • Node.js (v18 or higher)
  • OpenAI API key (required for Tabs 4, 5, and 6 only)

Installation Steps

  1. Clone the repository:
     git clone https://github.com/audoir/websocket-webrtc-tutorial.git
  2. Install dependencies:
     npm install
  3. Set your OpenAI API key (for Tabs 4, 5 & 6):
     echo "OPENAI_API_KEY=sk-..." > .env

     The key is only used server-side and is never exposed to the browser.

  4. Start the dev server:
     npm run dev

     This starts the custom Node.js server (server.ts), which mounts both Next.js and Socket.IO on port 3000. It does not start the standard Next.js dev server.

  5. Open your browser and load the app on port 3000.
🔌 Tab 1 — Basic WebSocket: Real-Time Chat

A classic broadcast chat where every connected browser tab sees every message in real time. The server is the hub — all messages flow through it via Socket.IO. Open the app in two or more tabs, type a message in any tab, and watch it appear in all the others.

How it works — step by step

Browser Tab A                  Server (server.ts)              Browser Tab B
─────────────                  ──────────────────              ─────────────
socket.emit("chat message",    io.emit("chat message",         socket.on("chat message", …)
  "Hello!")          ───────▶    "Hello!", socketA.id) ──────▶  renders message

  1. Connection (hooks/useSocket.ts): When the component mounts, useSocket calls io() to open a Socket.IO connection. Socket.IO uses a WebSocket under the hood (with HTTP long-polling as a fallback).
  2. Sending a message (hooks/useSocket.ts → sendMessage): When you type and hit Send, the hook calls socket.emit("chat message", text). The message is also added to local state immediately (optimistic update).
  3. Server broadcasts (lib/socket/chatHandler.ts): The server receives the event and calls io.emit("chat message", msg, socket.id), which broadcasts to all connected clients.
  4. Receiving a message (hooks/useSocket.ts): Every tab listens for "chat message". If the incoming senderId matches the local socket's own ID, the message is ignored (it was already added optimistically). Otherwise it's appended to the message list.

Key code

// lib/socket/chatHandler.ts
socket.on("chat message", (msg) => {
  // Broadcast to ALL connected clients, including the sender
  io.emit("chat message", msg, socket.id);
});

// hooks/useSocket.ts
socket.on("chat message", (msg, senderId) => {
  if (senderId === socket.id) return; // already added optimistically
  setMessages((prev) => [...prev, msg]);
});
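
The optimistic-update rule from step 4 can be isolated as a pure function. The sketch below is only an illustration under simplifying assumptions: ChatState, addLocal, and addIncoming are hypothetical names that do not come from the repository, and messages are plain strings.

```typescript
// Hypothetical helper names; the tutorial keeps this logic inside the useSocket hook.
type ChatState = { messages: string[] };

// Sender side: show the message immediately, before the server echoes it back.
function addLocal(state: ChatState, text: string): ChatState {
  return { messages: [...state.messages, text] };
}

// Receiver side: skip the server's echo of our own message, append everyone else's.
function addIncoming(
  state: ChatState,
  text: string,
  senderId: string,
  localId: string,
): ChatState {
  if (senderId === localId) return state; // already added optimistically
  return { messages: [...state.messages, text] };
}

// Tab A sends "Hello!", then receives its own echo plus a reply from Tab B.
let tabA: ChatState = { messages: [] };
tabA = addLocal(tabA, "Hello!");
tabA = addIncoming(tabA, "Hello!", "socket-a", "socket-a"); // echo, ignored
tabA = addIncoming(tabA, "Hi there", "socket-b", "socket-a"); // appended
console.log(tabA.messages); // two entries: "Hello!" then "Hi there"
```

Because the server uses io.emit (which includes the sender), the senderId check is what prevents the sender from seeing its own message twice.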

Key files

| File | Role |
| --- | --- |
| hooks/useSocket.ts | Opens the socket, listens for events, exposes sendMessage |
| lib/socket/chatHandler.ts | Server: receives and broadcasts "chat message" |
| app/components/BasicWebSocket.tsx | Chat UI for this tab |

📡 Tab 2 — Basic WebRTC: Peer-to-Peer Chat

A peer-to-peer chat where messages travel directly between two browser tabs — the server is only involved during the initial handshake (signaling). Once the connection is established, the server is completely out of the loop.

Phase 1 — Signaling (server-assisted)

WebRTC requires a brief setup phase called signaling before the two peers can talk directly. The server acts as a relay only during this phase.

Tab A (first joiner)           Server (webrtcHandler.ts)       Tab B (second joiner)
────────────────────           ─────────────────────────       ─────────────────────
emit("webrtc:join", "room") ─▶ joins Socket.IO room
                               only 1 peer → waiting
  ◀─ "webrtc:waiting"

                                                               emit("webrtc:join", "room") ─▶
                               2 peers → ready!
  ◀─ "webrtc:ready"                                            ◀─ "webrtc:ready"
     { initiator: false }         (relay)                         { initiator: true }

                               Tab B creates offer:
                               emit("webrtc:offer", …) ──────▶ relayed to Tab A
Tab A creates answer:
emit("webrtc:answer", …) ────▶ relayed to Tab B
ICE candidates exchanged via server relay in both directions

  1. Joining a room (hooks/useWebRTC.ts): When you click Join, useWebRTC opens a Socket.IO connection and emits "webrtc:join" with the room name.
  2. Server assigns roles (lib/socket/webrtcHandler.ts): If the joiner is the only peer in the room, the server emits "webrtc:waiting". When a second peer joins, the server emits "webrtc:ready" to both tabs. The second joiner gets { initiator: true }; the first gets { initiator: false }.
  3. Offer/Answer exchange: The initiator (Tab B) creates an RTCPeerConnection, opens a data channel (pc.createDataChannel("chat")), generates an SDP offer, and emits it to the server. The server relays the offer to Tab A, which creates an SDP answer and sends it back through the server.
  4. ICE candidate exchange: As each peer discovers network paths, it emits "webrtc:ice-candidate" events to the server, which relays them to the other peer. Google's public STUN servers are used to discover each peer's public IP address.
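
The role assignment in step 2 can be sketched as a small pure function. This is a minimal model, not the repository's handler: the in-memory rooms map, the join and otherPeer helpers, and the "webrtc:full" event for a third joiner are all assumptions made for illustration.

```typescript
// Events the server would emit; "webrtc:full" is an assumed name, not from the tutorial.
type SignalEvent =
  | { to: string; event: "webrtc:waiting" }
  | { to: string; event: "webrtc:ready"; initiator: boolean }
  | { to: string; event: "webrtc:full" };

// Hypothetical in-memory registry; the real handler may rely on Socket.IO rooms instead.
const rooms = new Map<string, string[]>();

// Returns the events to emit after `peerId` asks to join `room`.
function join(room: string, peerId: string): SignalEvent[] {
  const existing = rooms.get(room) ?? [];
  if (existing.length >= 2) return [{ to: peerId, event: "webrtc:full" }];
  rooms.set(room, [...existing, peerId]);
  if (existing.length === 0) return [{ to: peerId, event: "webrtc:waiting" }];
  return [
    // The peer that was already waiting becomes the answerer.
    ...existing.map((id): SignalEvent => ({ to: id, event: "webrtc:ready", initiator: false })),
    // The second joiner becomes the initiator and creates the offer.
    { to: peerId, event: "webrtc:ready", initiator: true },
  ];
}

// Offers, answers, and ICE candidates are relayed to whoever else is in the room.
function otherPeer(room: string, senderId: string): string | undefined {
  return (rooms.get(room) ?? []).find((id) => id !== senderId);
}

const firstJoin = join("demo", "tab-a");
const secondJoin = join("demo", "tab-b");
console.log(firstJoin, secondJoin, otherPeer("demo", "tab-b"));
```

Keeping the decision separate from the socket calls makes the two-peer limit explicit and easy to test without a running server.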

Phase 2 — Direct P2P (server no longer involved)

Tab A ◀──────────────────────────────────────────────────▶ Tab B
              RTCDataChannel ("chat") — direct P2P

Once ICE negotiation succeeds, the RTCDataChannel fires its onopen event and the status updates to "Connected (P2P)". From this point, channel.send(text) pushes messages directly to the other peer's browser. The server never sees these messages.
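
Since messages can only flow once the channel is open, a send helper usually checks readyState first. The sketch below assumes a hypothetical sendChat helper and uses a structural ChannelLike interface (a subset of RTCDataChannel) so it can run outside a browser.

```typescript
// Structural subset of RTCDataChannel, so the sketch also runs outside a browser.
interface ChannelLike {
  readyState: "connecting" | "open" | "closing" | "closed";
  send(data: string): void;
}

// Sends only when the channel is open; returns whether the message went out.
function sendChat(channel: ChannelLike | null, text: string): boolean {
  const trimmed = text.trim();
  if (!channel || channel.readyState !== "open" || trimmed === "") return false;
  channel.send(trimmed);
  return true;
}

// Stand-in channel that records what was sent; a real RTCDataChannel fits the same interface.
const sent: string[] = [];
const fakeChannel: ChannelLike = {
  readyState: "connecting",
  send: (data) => {
    sent.push(data);
  },
};

const beforeOpen = sendChat(fakeChannel, "too early"); // false: still connecting
fakeChannel.readyState = "open"; // in the browser, onopen signals this transition
const afterOpen = sendChat(fakeChannel, "  hello  "); // true: delivered as "hello"
console.log(beforeOpen, afterOpen, sent);
```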

Key files

| File | Role |
| --- | --- |
| hooks/useWebRTC.ts | All client-side WebRTC logic: signaling, peer connection, data channel |
| lib/socket/webrtcHandler.ts | Server: relays signaling messages between the two peers in a room |
| app/components/BasicWebRTC.tsx | Chat UI for this tab (room picker + chat view) |

🎥 Tab 3 — WebRTC Video: Peer-to-Peer Video Call

A peer-to-peer video call where live camera and microphone streams travel directly between two browser tabs. Like Tab 2, the server only assists during the initial signaling handshake — once the connection is established, all media flows directly between the two browsers.

The key difference from Tab 2 is the transport mechanism: instead of an RTCDataChannel carrying text, this tab uses media tracks (RTCPeerConnection.addTrack) to stream real-time video and audio.

Phase 1 — Camera/Microphone Access

Browser
───────
navigator.mediaDevices.getUserMedia({ video: true, audio: true })
  → localStream (shown in the "You (local)" video element)

Before connecting to the server, useWebRTCVideo calls navigator.mediaDevices.getUserMedia({ video: true, audio: true }) to request camera and microphone access. The resulting MediaStream is immediately attached to the local <video> element so you see your own preview right away.
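
getUserMedia rejects when the user denies permission or no device is available, so the call is usually wrapped in error handling. The helper below is a hypothetical sketch: describeMediaError and its message wording are not from the repository, while the error names are the standard ones the browser reports.

```typescript
// Maps the error names getUserMedia can reject with to user-facing text.
// The wording is illustrative only.
function describeMediaError(err: unknown): string {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? String((err as { name: unknown }).name)
      : "";
  switch (name) {
    case "NotAllowedError":
      return "Camera/microphone permission was denied.";
    case "NotFoundError":
      return "No camera or microphone was found.";
    case "NotReadableError":
      return "The camera or microphone could not be read (it may be in use by another app).";
    default:
      return "Could not access the camera or microphone.";
  }
}

// In the browser this would wrap the call shown above (setError is hypothetical):
//   try { await navigator.mediaDevices.getUserMedia({ video: true, audio: true }); }
//   catch (err) { setError(describeMediaError(err)); }
const denied = Object.assign(new Error("denied"), { name: "NotAllowedError" });
console.log(describeMediaError(denied));
```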

Phase 2 — Signaling (server-assisted)

The signaling flow is identical to Tab 2. Both peers join the same room name, the server assigns roles (initiator / answerer), and the SDP offer/answer plus ICE candidates are relayed through the server.

// hooks/useWebRTCVideo.ts → createPC
const pc = new RTCPeerConnection(ICE_SERVERS);

// Add all local media tracks; this is what sends your video/audio to the remote peer
localStream.getTracks().forEach((track) => pc.addTrack(track, localStream));

// Receive the remote peer's tracks and attach them to the remote <video> element
pc.ontrack = (e) => {
  // The ref is null until the <video> element has mounted
  if (remoteVideoRef.current) {
    remoteVideoRef.current.srcObject = e.streams[0];
  }
};
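
For reference, ICE_SERVERS (imported from lib/webrtc.ts) is the configuration object passed to RTCPeerConnection. Its exact contents are not shown in this article, so the sketch below is an assumption: it lists two of Google's public STUN endpoints and no TURN relay. The local types and the hasTurnServer helper are hypothetical.

```typescript
// Local structural types so the sketch compiles without DOM typings;
// in the browser, RTCConfiguration has the same shape.
type IceServer = { urls: string | string[]; username?: string; credential?: string };
type RtcConfig = { iceServers: IceServer[] };

// Assumed contents: Google's public STUN endpoints, no TURN relay.
const ICE_SERVERS: RtcConfig = {
  iceServers: [
    { urls: ["stun:stun.l.google.com:19302", "stun:stun1.l.google.com:19302"] },
  ],
};

// Hypothetical check: does the config include a TURN relay for restrictive networks?
function hasTurnServer(config: RtcConfig): boolean {
  return config.iceServers.some((server) =>
    ([] as string[]).concat(server.urls).some((url) => /^turns?:/.test(url)),
  );
}

console.log(hasTurnServer(ICE_SERVERS)); // false: STUN only
```

STUN only helps peers learn their public addresses. When both peers sit behind restrictive NATs or firewalls, a TURN relay entry would also be needed.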

Phase 3 — Direct P2P Video (server no longer involved)

Tab A ◀──────────────────────────────────────────────────▶ Tab B
         RTCPeerConnection (video + audio tracks) — direct P2P

Once ICE negotiation succeeds, pc.onconnectionstatechange fires with "connected". The status updates to "Connected (P2P Video)". When the remote peer's tracks arrive, pc.ontrack attaches them to the remote <video> element. The server never sees any of this media.
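
A status bar like this is typically driven by a lookup from the connection state to a label. In the sketch below only "Connected (P2P Video)" comes from the tutorial; the other labels and the labelFor helper are placeholders, and the state names mirror the browser's RTCPeerConnectionState values.

```typescript
// Same string values as the browser's RTCPeerConnectionState.
type PeerState = "new" | "connecting" | "connected" | "disconnected" | "failed" | "closed";

// Only the "connected" label is taken from the tutorial; the rest are placeholders.
const STATUS_LABELS: Record<PeerState, string> = {
  new: "Idle",
  connecting: "Connecting…",
  connected: "Connected (P2P Video)",
  disconnected: "Connection interrupted",
  failed: "Connection failed",
  closed: "Call ended",
};

function labelFor(state: PeerState): string {
  return STATUS_LABELS[state];
}

// In the browser (setStatus is hypothetical):
//   pc.onconnectionstatechange = () => setStatus(labelFor(pc.connectionState));
console.log(labelFor("connected"));
```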

Key files

| File | Role |
| --- | --- |
| hooks/useWebRTCVideo.ts | All client-side logic: media capture, signaling, peer connection, track handling |
| lib/socket/webrtcVideoHandler.ts | Server: relays signaling messages between the two peers in a room |
| lib/webrtc.ts | Shared constants: ICE server config, connection status types & labels |
| app/components/WebRTCVideo.tsx | Video call UI: local/remote video panels, status bar, room picker |

🎙️ Tabs 4, 5 & 6 — Live AI Transcription

The final three tabs all produce the same result — live speech-to-text powered by the OpenAI Realtime API (gpt-realtime-whisper) — but they differ in how audio gets to OpenAI and where the API key lives.

| | Tab 4: WebRTC | Tab 5: WebSocket | Tab 6: Server |
| --- | --- | --- | --- |
| Audio transport | Native media track | PCM16 Base64 over WebSocket | PCM16 Base64 via Socket.IO → server WS |
| Auth mechanism | Ephemeral key in HTTP header | Ephemeral key as WS subprotocol | Server uses OPENAI_API_KEY directly |
| Ephemeral token | Yes | Yes | No |
| API key exposure | Ephemeral key in browser | Ephemeral key in browser | Key stays on server only |

Tab 4 — WebRTC Transcription

The browser establishes a WebRTC peer connection directly with OpenAI. The server's only role is to securely mint a short-lived ephemeral token so the browser never needs to hold your real API key.

Browser                        Your Server                     OpenAI Realtime API
───────                        ───────────                     ───────────────────
POST /api/transcription-session ──────────────────────────────▶ POST /v1/realtime/client_secrets
                               (uses OPENAI_API_KEY)
  ◀─ { value: "ek_..." } ◀────────────────────────────────────

RTCPeerConnection.createOffer()
POST https://api.openai.com/v1/realtime/calls  ──────────────▶ (SDP answer)
  Authorization: Bearer ek_...
  ◀─ SDP answer

RTCPeerConnection.setRemoteDescription(answer)

Microphone audio ────────────────────────────────────────────▶ gpt-realtime-whisper
  ◀─ transcript delta events (via RTCDataChannel "oai-events")

  1. Mint ephemeral token: The browser calls POST /api/transcription-session. The server reads OPENAI_API_KEY and calls OpenAI's /v1/realtime/client_secrets endpoint, returning a short-lived ek_... key.
  2. Establish WebRTC connection: The browser creates an RTCPeerConnection, adds the microphone track via pc.addTrack(track, stream), creates a data channel named "oai-events", generates an SDP offer, and POSTs it to OpenAI with the ephemeral key in the Authorization header.
  3. Receive transcript events: OpenAI sends conversation.item.input_audio_transcription.delta events (live chunks) and conversation.item.input_audio_transcription.completed events (final corrected text) over the data channel.
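
The SDP exchange in step 2 can be sketched as a single HTTP call. The URL and the Bearer header come from the diagram above. The exchangeSdp name, the injected fetchFn parameter, and the application/sdp content type are assumptions for illustration, not the repository's code.

```typescript
// Minimal fetch shape so the sketch can be exercised with a stub outside the browser.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// Posts the local SDP offer to OpenAI and resolves with the SDP answer text.
async function exchangeSdp(
  fetchFn: FetchLike,
  offerSdp: string,
  ephemeralKey: string,
): Promise<string> {
  const res = await fetchFn("https://api.openai.com/v1/realtime/calls", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ephemeralKey}`, // short-lived ek_... key, never the real API key
      "Content-Type": "application/sdp", // assumed content type for a raw SDP body
    },
    body: offerSdp,
  });
  if (!res.ok) throw new Error(`SDP exchange failed with status ${res.status}`);
  return res.text();
}
```

In the browser, the global fetch can be passed as fetchFn, and the resolved text would then be applied with pc.setRemoteDescription({ type: "answer", sdp }).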

Tab 5 — WebSocket Transcription

The same live transcription as Tab 4, but the browser opens a WebSocket directly to wss://api.openai.com/v1/realtime. Because the browser WebSocket API does not allow custom HTTP headers, the ephemeral key is passed as a WebSocket subprotocol string.

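// Open WebSocket with ephemeral key as subprotocol
const ws = new WebSocket("wss://api.openai.com/v1/realtime", [
  "realtime",
  `openai-insecure-api-key.${ephemeralKey}`,
]);

// Capture microphone via Web Audio API (run inside an async function).
// ScriptProcessorNode is deprecated in favor of AudioWorklet but still widely supported.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const ctx = new AudioContext({ sampleRate: 24000 });
const source = ctx.createMediaStreamSource(stream);
const processor = ctx.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (e) => {
  if (ws.readyState !== WebSocket.OPEN) return; // drop audio until the socket is open
  const float32 = e.inputBuffer.getChannelData(0);
  const base64 = float32ToPcm16Base64(float32);
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64,
  }));
};

// Assumed wiring: the processor only fires once it is part of a connected audio graph
source.connect(processor);
processor.connect(ctx.destination);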

Audio is captured via the Web Audio API, encoded as Base64 PCM16 using float32ToPcm16Base64 from lib/audio.ts, and sent as JSON messages. Transcript events arrive over the same WebSocket and are handled by the shared useTranscription hook.
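
The body of float32ToPcm16Base64 is not shown in this article. A plausible version, assuming little-endian 16-bit samples and the global btoa function, might look like the sketch below; the repository's lib/audio.ts may differ in detail.

```typescript
// A plausible shape for the encoder: Float32 samples in [-1, 1] → PCM16 → Base64.
function float32ToPcm16Base64(samples: Float32Array): string {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  samples.forEach((sample, i) => {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, sample));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  });
  // btoa expects a binary string with one character per byte.
  let binary = "";
  bytes.forEach((b) => {
    binary += String.fromCharCode(b);
  });
  return btoa(binary);
}

// Silence, full-scale positive, full-scale negative:
// samples 0, 32767, -32768 → bytes 00 00 FF 7F 00 80
console.log(float32ToPcm16Base64(new Float32Array([0, 1, -1]))); // "AAD/fwCA"
```

Each sample becomes two bytes, so a 4096-sample buffer produces 8192 bytes before Base64 encoding adds roughly one third more.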

Tab 6 — Server Transcription

The most secure approach: the browser sends audio to your own Node.js server via Socket.IO, and the server opens a WebSocket to OpenAI using the OPENAI_API_KEY stored securely on the server. No ephemeral token is needed, and the API key never leaves the server.

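// lib/socket/serverTranscriptionHandler.ts
import WebSocket from "ws"; // assumed: the `ws` package, whose client accepts custom headers

// Declared in the per-connection scope so both handlers below can reach it
let openaiWs: WebSocket | null = null;

socket.on("server-transcription:start", () => {
  // Server opens WebSocket to OpenAI with OPENAI_API_KEY
  openaiWs = new WebSocket(
    "wss://api.openai.com/v1/realtime?intent=transcription",
    { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
  );
  openaiWs.on("open", () => {
    socket.emit("server-transcription:connected");
  });
  openaiWs.on("message", (raw) => {
    socket.emit("server-transcription:event", raw.toString());
  });
});

socket.on("server-transcription:audio", (base64) => {
  // Skip chunks that arrive before the OpenAI socket is open
  if (!openaiWs || openaiWs.readyState !== WebSocket.OPEN) return;
  openaiWs.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64,
  }));
});

// Close the upstream socket when the browser disconnects
socket.on("disconnect", () => openaiWs?.close());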

The browser emits "server-transcription:start" to the server. The server opens the OpenAI WebSocket using OPENAI_API_KEY — something the browser cannot do directly. Audio chunks are forwarded from the browser to the server via Socket.IO, then from the server to OpenAI. Transcript events flow back the same way.
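
Audio can reach the server before the OpenAI socket has finished opening. One way to handle that gap is to queue chunks and flush them on open. The sketch below is a design illustration only: AudioForwarder, UpstreamLike, and the queue limit are hypothetical and not part of the repository.

```typescript
// Structural subset of a WebSocket client; readyState 1 means OPEN in both
// the browser API and the `ws` package.
interface UpstreamLike {
  readyState: number;
  send(data: string): void;
}

const OPEN = 1;

// Hypothetical helper: holds audio chunks that arrive before the OpenAI socket opens.
class AudioForwarder {
  private pending: string[] = [];
  private upstream: UpstreamLike;
  private maxPending: number;

  constructor(upstream: UpstreamLike, maxPending = 50) {
    this.upstream = upstream;
    this.maxPending = maxPending;
  }

  // Call for every "server-transcription:audio" chunk from the browser.
  push(base64: string): void {
    if (this.upstream.readyState === OPEN) {
      this.forward(base64);
      return;
    }
    this.pending.push(base64);
    if (this.pending.length > this.maxPending) this.pending.shift(); // drop the oldest chunk
  }

  // Call from the upstream "open" handler to drain anything queued so far.
  flush(): void {
    const queued = this.pending;
    this.pending = [];
    queued.forEach((chunk) => this.forward(chunk));
  }

  private forward(base64: string): void {
    this.upstream.send(
      JSON.stringify({ type: "input_audio_buffer.append", audio: base64 }),
    );
  }
}
```

Capping the queue keeps memory bounded if the upstream connection never opens. The trade-off is that the earliest audio is lost in that case.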

Key files

| File | Role |
| --- | --- |
| app/api/transcription-session/route.ts | Server: mints ephemeral OpenAI token (Tabs 4 & 5) |
| app/components/WebRTCTranscription.tsx | Tab 4: WebRTC connection, microphone capture, transcript display |
| app/components/WebSocketTranscription.tsx | Tab 5: WebSocket connection, PCM16 audio encoding, transcript display |
| lib/socket/serverTranscriptionHandler.ts | Tab 6: Server opens/manages the OpenAI WebSocket, forwards audio, relays events |
| app/components/ServerTranscription.tsx | Tab 6: Socket.IO connection, microphone capture, transcript display |
| lib/audio.ts | Shared utility: float32ToPcm16Base64 encoder (Tabs 5 & 6) |

The Shared useTranscription Hook

Tabs 4, 5, and 6 all use the same useTranscription hook for event handling and transcript display. The hook accumulates delta events into a rolling text block, updating it in real time as you speak. When the completed event arrives, the in-progress segment is replaced with the corrected final text and a newline is added so the next utterance starts on a fresh line.

Each speech turn is identified by a unique item_id, which is used internally to know which portion of the text to replace when the final transcript arrives.

// hooks/useTranscription.ts
// Two event types from OpenAI:
// 1. delta     → small chunk of new text (fires continuously as you speak)
// 2. completed → full corrected text for that speech turn

function handleEvent(raw: string) {
  const event = JSON.parse(raw);
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    // Append delta to the in-progress segment identified by event.item_id
    updateSegment(event.item_id, (prev) => prev + event.delta);
  } else if (event.type === "conversation.item.input_audio_transcription.completed") {
    // Replace in-progress segment with the final corrected text
    finalizeSegment(event.item_id, event.transcript + "\n");
  }
}

The hook also exposes clearTranscript and copyTranscript helpers used by all three transcription tab UIs, keeping the UI components thin and focused on rendering.
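
The delta/completed accumulation can be modeled as a reducer over segments keyed by item_id. The sketch below is one possible shape, not the hook's actual internals: Segment, applyEvent, and renderTranscript are hypothetical names, and only the two event types and their fields come from the text above.

```typescript
type TranscriptEvent =
  | { type: "conversation.item.input_audio_transcription.delta"; item_id: string; delta: string }
  | { type: "conversation.item.input_audio_transcription.completed"; item_id: string; transcript: string };

// Segments keep arrival order; `final` marks turns whose corrected text has arrived.
type Segment = { itemId: string; text: string; final: boolean };

function applyEvent(segments: Segment[], event: TranscriptEvent): Segment[] {
  const existing = segments.find((s) => s.itemId === event.item_id);
  let next: Segment;
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    // Append the chunk to the in-progress text for this speech turn.
    next = { itemId: event.item_id, text: (existing ? existing.text : "") + event.delta, final: false };
  } else {
    // Replace the in-progress text with the corrected final transcript.
    next = { itemId: event.item_id, text: event.transcript, final: true };
  }
  return existing
    ? segments.map((s) => (s === existing ? next : s))
    : [...segments, next];
}

// Finished turns end with a newline so the next utterance starts on a fresh line.
function renderTranscript(segments: Segment[]): string {
  return segments.map((s) => (s.final ? s.text + "\n" : s.text)).join("");
}

const events: TranscriptEvent[] = [
  { type: "conversation.item.input_audio_transcription.delta", item_id: "item_1", delta: "Hel" },
  { type: "conversation.item.input_audio_transcription.delta", item_id: "item_1", delta: "lo wrld" },
  { type: "conversation.item.input_audio_transcription.completed", item_id: "item_1", transcript: "Hello world." },
  { type: "conversation.item.input_audio_transcription.delta", item_id: "item_2", delta: "Next" },
];
const transcript = renderTranscript(events.reduce(applyEvent, [] as Segment[]));
console.log(JSON.stringify(transcript)); // "Hello world.\nNext"
```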

Full Comparison: All Six Tabs

The table below summarizes the key differences across all six demos — from message path and server role to scalability and API key handling.

| Tab | Message/Media Path | Server Role | Scales to N Clients | Good For |
| --- | --- | --- | --- | --- |
| 1: WebSocket Chat | Browser → Server → All browsers | Always in the loop | Yes | Group chat, live feeds, notifications |
| 2: WebRTC Chat | Browser ↔ Browser (direct) | Only during signaling | No (2 peers) | Low-latency P2P text, file transfer |
| 3: WebRTC Video | Browser ↔ Browser (direct) | Only during signaling | No (2 peers) | Video/audio calls |
| 4: WebRTC Transcription | Browser → OpenAI (direct) | Only mints ephemeral token | Yes (per-user session) | Live captions, voice notes, accessibility |
| 5: WebSocket Transcription | Browser → OpenAI (direct) | Only mints ephemeral token | Yes (per-user session) | Live captions when WebRTC is unavailable |
| 6: Server Transcription | Browser → Server → OpenAI | Full proxy (manages OpenAI WS) | Yes (per-user session) | Secure transcription; browser can't reach OpenAI directly |

WebRTC vs WebSocket vs Server-side for transcription (Tabs 4, 5 & 6)

All three tabs produce the same result — live transcription via the OpenAI Realtime API — but they differ in how audio gets there and where the API key lives:

| | Tab 4: WebRTC | Tab 5: WebSocket | Tab 6: Server |
| --- | --- | --- | --- |
| Audio transport | Native media track (browser handles encoding) | Web Audio API → PCM16 → Base64 → JSON | Web Audio API → PCM16 → Base64 → Socket.IO → server WS |
| Connection setup | SDP offer/answer exchange | Simple WebSocket handshake | Socket.IO emit; server opens WS to OpenAI |
| Auth mechanism | Ephemeral key in HTTP Authorization header | Ephemeral key as WebSocket subprotocol | Server uses OPENAI_API_KEY directly |
| Ephemeral token needed | Yes | Yes | No |
| Complexity | Lower (browser handles audio encoding) | Higher (manual PCM16 encoding in JS) | Moderate (server proxy adds a hop but simplifies auth) |
| Security | Ephemeral key visible in browser | Ephemeral key visible in browser | API key never leaves the server |

Conclusion

WebSockets and WebRTC are complementary technologies that together cover the full spectrum of real-time web communication. WebSockets excel at server-mediated scenarios — group chat, live feeds, notifications — where the server needs to be in the loop. WebRTC shines for peer-to-peer scenarios — video calls, file transfer, low-latency data — where you want to minimize server involvement after the initial handshake.

The three transcription tabs show how the same AI capability can be delivered with very different architectures, each with its own trade-offs around security, complexity, and browser compatibility. Tab 6's server-proxy approach is the most secure and the most straightforward to reason about — the API key never leaves the server, and no ephemeral token dance is required.

About the Author

Wayne Cheng is the founder and AI app developer at Audoir, LLC. Prior to founding Audoir, he worked as a hardware design engineer for Silicon Valley startups and an audio engineer for creative organizations. He holds an MSEE from UC Davis and a Music Technology degree from Foothill College.

Further Exploration

Explore the complete tutorial repository and experiment with extending the demos. Consider adding a third peer to the WebRTC rooms, implementing TURN server support for stricter network environments, or building a multi-user transcription session where all participants see a shared live transcript.

For more AI-powered development tools and tutorials, visit Audoir.