WebSocket & WebRTC Tutorial: Real-Time Communication on the Web
A hands-on guide to real-time communication on the web, built with Next.js, Socket.IO, the browser's native WebRTC APIs, and the OpenAI Realtime API. Six progressive tabs take you from a simple broadcast chat all the way to live AI-powered speech transcription.
Complete Tutorial Code
Follow along with the complete source code for this WebSocket & WebRTC tutorial. Includes six self-contained demos — from a broadcast chat to peer-to-peer video calls and live AI transcription.
Table of Contents
- Introduction
- WebSocket vs. WebRTC: A Quick Comparison
- Tech Stack
- Project Structure
- Getting Started
- Tab 1 — Basic WebSocket: Real-Time Chat
- Tab 2 — Basic WebRTC: Peer-to-Peer Chat
- Tab 3 — WebRTC Video: Peer-to-Peer Video Call
- Tabs 4, 5 & 6 — Live AI Transcription
- The Shared useTranscription Hook
- Full Comparison: All Six Tabs
- Conclusion
Introduction
Real-time communication is at the heart of modern web applications — from live chat and collaborative editing to video calls and AI-powered voice interfaces. Two browser technologies make this possible: WebSockets, which provide a persistent bidirectional channel between a browser and a server, and WebRTC, which enables direct peer-to-peer connections between browsers for ultra-low-latency data, audio, and video.
This tutorial walks you through both technologies side by side. The app has six tabs, each a self-contained demo that teaches a different communication pattern — starting simple and building up to more advanced techniques, including live speech-to-text powered by the OpenAI Realtime API.
WebSocket vs. WebRTC: A Quick Comparison
Before diving into the code, it helps to understand when to reach for each technology.
WebSocket
- Server is always in the loop
- Scales to N clients via broadcast
- Simple setup, no signaling needed
- Great for group chat, live feeds, notifications
WebRTC
- Server only assists during setup (signaling)
- Ultra-low latency once connected
- Supports data channels, audio, and video
- Great for video calls, P2P file transfer
Tech Stack
The tutorial uses a modern TypeScript stack that keeps all six demos running on a single port:
- Custom Node.js server (server.ts) that mounts both Next.js and Socket.IO on port 3000
- OpenAI Realtime API (gpt-realtime-whisper) for live speech transcription
Project Structure
The project is organized around a custom server, shared hooks, and per-tab UI components:
server.ts                            # Custom HTTP server: Next.js + Socket.IO
lib/
  webrtc.ts                          # Shared WebRTC constants & types (ICE servers, status labels)
  audio.ts                           # Shared audio utility: Float32 → PCM16 Base64 encoder
  socket/
    chatHandler.ts                   # Server-side: handles "chat message" events
    webrtcHandler.ts                 # Server-side: WebRTC data-channel signaling relay
    webrtcVideoHandler.ts            # Server-side: WebRTC video/audio signaling relay
    serverTranscriptionHandler.ts    # Server-side: proxies audio to OpenAI Realtime API via WebSocket
hooks/
  useSocket.ts                       # Client hook: WebSocket chat state & logic
  useWebRTC.ts                       # Client hook: WebRTC connection & data channel
  useWebRTCVideo.ts                  # Client hook: WebRTC video/audio peer connection
  useTranscription.ts                # Client hook: shared transcription state & event handling
app/
  api/
    transcription-session/
      route.ts                       # API route: mints ephemeral OpenAI token (server-side)
  components/
    BasicWebSocket.tsx               # Tab 1 UI component
    BasicWebRTC.tsx                  # Tab 2 UI component
    WebRTCVideo.tsx                  # Tab 3 UI component
    WebRTCTranscription.tsx          # Tab 4 UI component
    WebSocketTranscription.tsx       # Tab 5 UI component
    ServerTranscription.tsx          # Tab 6 UI component
    ui/                              # Shared UI: ChatInput, MessageList, RoomPicker, etc.
types/
  message.ts                         # Shared Message type
Getting Started
Follow these steps to run all six demos on your local machine. The full source code is available at github.com/audoir/websocket-webrtc-tutorial.
Prerequisites
- Node.js (v18 or higher)
- OpenAI API key (required for Tabs 4, 5, and 6 only)
Installation Steps
1. Clone the repository:
   git clone https://github.com/audoir/websocket-webrtc-tutorial.git
2. Install dependencies:
   npm install
3. Set your OpenAI API key (for Tabs 4, 5 & 6):
   echo "OPENAI_API_KEY=sk-..." > .env
   The key is only used server-side and is never exposed to the browser.
4. Start the dev server:
   npm run dev
   This starts the custom Node.js server (server.ts) that mounts both Next.js and Socket.IO on port 3000, not the standard Next.js dev server.
5. Open your browser:
🔌 Tab 1 — Basic WebSocket: Real-Time Chat
A classic broadcast chat where every connected browser tab sees every message in real time. The server is the hub — all messages flow through it via Socket.IO. Open the app in two or more tabs, type a message in any tab, and watch it appear in all the others.
How it works — step by step
Browser Tab A                     Server (server.ts)                Browser Tab B
─────────────                     ──────────────────                ─────────────
socket.emit("chat message",       io.emit("chat message",           socket.on("chat message", …)
  "Hello!")  ───────▶               "Hello!", socketA.id)  ──────▶    renders message

1. Connection (hooks/useSocket.ts): When the component mounts, useSocket calls io() to open a Socket.IO connection. Socket.IO uses WebSocket as its transport when available and falls back to HTTP long-polling otherwise.
2. Sending a message (hooks/useSocket.ts → sendMessage): When you type and hit Send, the hook calls socket.emit("chat message", text). The message is also added to local state immediately (optimistic update).
3. Server broadcasts (lib/socket/chatHandler.ts): The server receives the event and calls io.emit("chat message", msg, socket.id), broadcasting to all connected clients, including the sender.
4. Receiving a message (hooks/useSocket.ts): Every tab listens for "chat message". If the incoming senderId matches the local socket's own ID, the message is ignored (it was already added optimistically). Otherwise it is appended to the message list.
Key code
// lib/socket/chatHandler.ts
socket.on("chat message", (msg) => {
  // Broadcast to ALL connected clients, including the sender
  io.emit("chat message", msg, socket.id);
});

// hooks/useSocket.ts
socket.on("chat message", (msg, senderId) => {
  if (senderId === socket.id) return; // already added optimistically
  setMessages((prev) => [...prev, msg]);
});
Key files
| File | Role |
|---|---|
| hooks/useSocket.ts | Opens the socket, listens for events, exposes sendMessage |
| lib/socket/chatHandler.ts | Server: receives and broadcasts "chat message" |
| app/components/BasicWebSocket.tsx | Chat UI for this tab |
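The broadcast and optimistic-update flow in this tab can be tried out without a network. The sketch below is not the tutorial's code: `FakeHub` and `ChatClient` are hypothetical in-memory stand-ins for the Socket.IO server and the `useSocket` hook, and they only mirror the "emit to everyone, skip your own echo" logic shown in the key code.

```typescript
// Hypothetical in-memory stand-ins; no Socket.IO involved.
type ChatListener = (msg: string, senderId: string) => void;

class FakeHub {
  private listeners = new Map<string, ChatListener>();

  connect(clientId: string, listener: ChatListener): void {
    this.listeners.set(clientId, listener);
  }

  // Mirrors io.emit("chat message", msg, socket.id): every client gets it, sender included.
  broadcast(msg: string, senderId: string): void {
    this.listeners.forEach((listener) => listener(msg, senderId));
  }
}

class ChatClient {
  readonly id: string;
  readonly messages: string[] = [];
  private hub: FakeHub;

  constructor(id: string, hub: FakeHub) {
    this.id = id;
    this.hub = hub;
    hub.connect(id, (msg, senderId) => {
      if (senderId === this.id) return; // own echo: already added optimistically
      this.messages.push(msg);
    });
  }

  sendMessage(text: string): void {
    this.messages.push(text); // optimistic update
    this.hub.broadcast(text, this.id);
  }
}

const hub = new FakeHub();
const tabA = new ChatClient("A", hub);
const tabB = new ChatClient("B", hub);
tabA.sendMessage("Hello!");
tabB.sendMessage("Hi!");
```

Both clients end up with the same two messages in the same order, and neither shows its own message twice, which is the property the senderId check exists to guarantee.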
📡 Tab 2 — Basic WebRTC: Peer-to-Peer Chat
A peer-to-peer chat where messages travel directly between two browser tabs — the server is only involved during the initial handshake (signaling). Once the connection is established, the server is completely out of the loop.
Phase 1 — Signaling (server-assisted)
WebRTC requires a brief setup phase called signaling before the two peers can talk directly. The server acts as a relay only during this phase.
Tab A (first joiner)            Server (webrtcHandler.ts)       Tab B (second joiner)
────────────────────            ─────────────────────────       ─────────────────────
emit("webrtc:join", "room") ─▶  joins Socket.IO room
                                only 1 peer → waiting
◀─ "webrtc:waiting"
                                                                emit("webrtc:join", "room") ─▶
                                2 peers → ready!
◀─ "webrtc:ready"                                               ◀─ "webrtc:ready"
   { initiator: false }         (relay)                            { initiator: true }
Tab B creates offer:
  emit("webrtc:offer", …) ──────▶ relayed to Tab A
Tab A creates answer:
  emit("webrtc:answer", …) ────▶ relayed to Tab B
ICE candidates exchanged via server relay in both directions

1. Joining a room (hooks/useWebRTC.ts): When you click Join, useWebRTC opens a Socket.IO connection and emits "webrtc:join" with the room name.
2. Server assigns roles (lib/socket/webrtcHandler.ts): If the joiner is the first peer in the room (the room was empty), the server emits "webrtc:waiting" to it. When a second peer joins, the server emits "webrtc:ready" to both tabs. The second joiner gets { initiator: true }, the first gets { initiator: false }.
3. Offer/Answer exchange: The initiator (Tab B) creates an RTCPeerConnection, opens a data channel (pc.createDataChannel("chat")), generates an SDP offer, and emits it to the server. The server relays the offer to Tab A, which creates an SDP answer and sends it back.
4. ICE candidate exchange: As each peer discovers network paths, it emits "webrtc:ice-candidate" to the server, which relays each candidate to the other peer. Google's public STUN servers are used to discover public IP addresses.
Phase 2 — Direct P2P (server no longer involved)
Tab A ◀──────────────────────────────────────────────────▶ Tab B
          RTCDataChannel ("chat"), direct P2P
Once ICE negotiation succeeds, the RTCDataChannel fires its onopen event and the status updates to "Connected (P2P)". From this point, channel.send(text) pushes messages directly to the other peer's browser. The server never sees these messages.
Key files
| File | Role |
|---|---|
| hooks/useWebRTC.ts | All client-side WebRTC logic: signaling, peer connection, data channel |
| lib/socket/webrtcHandler.ts | Server: relays signaling messages between the two peers in a room |
| app/components/BasicWebRTC.tsx | Chat UI for this tab (room picker + chat view) |
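The role assignment from Phase 1 can be expressed as a small pure function. This is a sketch under assumptions, not the contents of lib/socket/webrtcHandler.ts: `joinRoom`, the `rooms` map, and the `SignalEvent` type are hypothetical, and the "webrtc:full" event for a third joiner is invented here purely to make the two-peer limit explicit.

```typescript
// Hypothetical sketch of the join logic; event delivery is returned as data
// instead of being emitted over Socket.IO.
type SignalEvent =
  | { to: string; type: "webrtc:waiting" }
  | { to: string; type: "webrtc:ready"; initiator: boolean }
  | { to: string; type: "webrtc:full" }; // assumption: not part of the tutorial

const rooms = new Map<string, string[]>(); // room name → peer ids in join order

function joinRoom(roomName: string, peerId: string): SignalEvent[] {
  const peers = rooms.get(roomName) ?? [];
  if (peers.length >= 2) return [{ to: peerId, type: "webrtc:full" }];

  peers.push(peerId);
  rooms.set(roomName, peers);

  // First joiner waits for a partner.
  if (peers.length === 1) return [{ to: peerId, type: "webrtc:waiting" }];

  // Second joiner becomes the initiator and will create the SDP offer.
  const [first, second] = peers;
  return [
    { to: first, type: "webrtc:ready", initiator: false },
    { to: second, type: "webrtc:ready", initiator: true },
  ];
}
```

Returning the events as data keeps the decision logic separate from the transport, so a real handler would only need to loop over the result and emit each event to the matching socket.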
🎥 Tab 3 — WebRTC Video: Peer-to-Peer Video Call
A peer-to-peer video call where live camera and microphone streams travel directly between two browser tabs. Like Tab 2, the server only assists during the initial signaling handshake — once the connection is established, all media flows directly between the two browsers.
The key difference from Tab 2 is the transport mechanism: instead of an RTCDataChannel carrying text, this tab uses media tracks (RTCPeerConnection.addTrack) to stream real-time video and audio.
Phase 1 — Camera/Microphone Access
Browser
───────
navigator.mediaDevices.getUserMedia({ video: true, audio: true })
  → localStream (shown in the "You (local)" video element)
Before connecting to the server, useWebRTCVideo calls navigator.mediaDevices.getUserMedia({ video: true, audio: true }) to request camera and microphone access. The resulting MediaStream is immediately attached to the local <video> element so you see your own preview right away.
Phase 2 — Signaling (server-assisted)
The signaling flow is identical to Tab 2. Both peers join the same room name, the server assigns roles (initiator / answerer), and the SDP offer/answer plus ICE candidates are relayed through the server.
// hooks/useWebRTCVideo.ts → createPC
const pc = new RTCPeerConnection(ICE_SERVERS);

// Add all local media tracks; this is what sends your video/audio to the remote peer
localStream.getTracks().forEach(track => pc.addTrack(track, localStream));

// Receive remote peer's tracks
pc.ontrack = (e) => {
  remoteVideoRef.current.srcObject = e.streams[0];
};
Phase 3 — Direct P2P Video (server no longer involved)
Tab A ◀──────────────────────────────────────────────────▶ Tab B
          RTCPeerConnection (video + audio tracks), direct P2P
Once ICE negotiation succeeds, the pc.onconnectionstatechange handler fires and pc.connectionState reads "connected". The status updates to "Connected (P2P Video)". When the remote peer's tracks arrive, the pc.ontrack handler attaches them to the remote <video> element. The server never sees any of this media.
Key files
| File | Role |
|---|---|
| hooks/useWebRTCVideo.ts | All client-side logic: media capture, signaling, peer connection, track handling |
| lib/socket/webrtcVideoHandler.ts | Server: relays signaling messages between the two peers in a room |
| lib/webrtc.ts | Shared constants: ICE server config, connection status types & labels |
| app/components/WebRTCVideo.tsx | Video call UI: local/remote video panels, status bar, room picker |
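For orientation, here is one way a shared constants module like lib/webrtc.ts might look. Everything in this sketch is an assumption except the "Connected (P2P Video)" label and the use of a Google public STUN server: the `IceConfig` and `ConnectionStatus` types, the other labels, and the `toStatus` helper are hypothetical. `IceConfig` is declared locally so the block does not depend on browser typings; in the browser the same object shape is what `RTCPeerConnection` accepts as its configuration.

```typescript
// Hypothetical sketch of a shared WebRTC constants module.
interface IceConfig {
  iceServers: { urls: string | string[] }[];
}

// A Google public STUN server, used only to discover each peer's public address.
const ICE_SERVERS: IceConfig = {
  iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
};

type ConnectionStatus = "idle" | "waiting" | "connecting" | "connected" | "disconnected";

const STATUS_LABELS: Record<ConnectionStatus, string> = {
  idle: "Not connected",
  waiting: "Waiting for a peer…",
  connecting: "Connecting…",
  connected: "Connected (P2P Video)",
  disconnected: "Disconnected",
};

// Maps RTCPeerConnection.connectionState values onto the UI status.
function toStatus(connectionState: string): ConnectionStatus {
  switch (connectionState) {
    case "connected":
      return "connected";
    case "new":
    case "connecting":
      return "connecting";
    default:
      return "disconnected"; // covers "disconnected", "failed", "closed"
  }
}
```

A STUN-only configuration works when both peers can reach each other after address discovery; stricter networks need a TURN relay, which this sketch does not configure.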
🎙️ Tabs 4, 5 & 6 — Live AI Transcription
The final three tabs all produce the same result — live speech-to-text powered by the OpenAI Realtime API (gpt-realtime-whisper) — but they differ in how audio gets to OpenAI and where the API key lives.
| | Tab 4 — WebRTC | Tab 5 — WebSocket | Tab 6 — Server |
|---|---|---|---|
| Audio transport | Native media track | PCM16 Base64 over WebSocket | PCM16 Base64 via Socket.IO → server WS |
| Auth mechanism | Ephemeral key in HTTP header | Ephemeral key as WS subprotocol | Server uses OPENAI_API_KEY directly |
| Ephemeral token | Yes | Yes | No |
| API key exposure | Ephemeral key in browser | Ephemeral key in browser | Key stays on server only |
Tab 4 — WebRTC Transcription
The browser establishes a WebRTC peer connection directly with OpenAI. The server's only role is to securely mint a short-lived ephemeral token so the browser never needs to hold your real API key.
Browser                               Your Server                     OpenAI Realtime API
───────                               ───────────                     ───────────────────
POST /api/transcription-session ──────────────────────────────▶ POST /v1/realtime/client_secrets
                                                                 (uses OPENAI_API_KEY)
◀─ { value: "ek_..." } ◀────────────────────────────────────
RTCPeerConnection.createOffer()
POST https://api.openai.com/v1/realtime/calls ──────────────▶ (SDP answer)
  Authorization: Bearer ek_...
◀─ SDP answer
RTCPeerConnection.setRemoteDescription(answer)
Microphone audio ────────────────────────────────────────────▶ gpt-realtime-whisper
◀─ transcript delta events (via RTCDataChannel "oai-events")

1. Mint ephemeral token: The browser calls POST /api/transcription-session. The server reads OPENAI_API_KEY and calls OpenAI's /v1/realtime/client_secrets endpoint, returning a short-lived ek_... key.
2. Establish WebRTC connection: The browser creates an RTCPeerConnection, adds the microphone track via pc.addTrack(track, stream), creates a data channel named "oai-events", generates an SDP offer, and POSTs it to OpenAI with the ephemeral key in the Authorization header.
3. Receive transcript events: OpenAI sends conversation.item.input_audio_transcription.delta events (live chunks) and conversation.item.input_audio_transcription.completed events (final corrected text) over the data channel.
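The SDP exchange in step 2 is a single HTTP request, sketched below. The function name `exchangeSdp` and the `FetchLike` type are hypothetical, and sending the offer as the raw request body with a Content-Type of application/sdp is an assumption about the request format; the URL and the Bearer ephemeral key come from the flow above. The fetch implementation is passed in so the function can be exercised with a fake.

```typescript
// Minimal fetch-like type so the sketch can run against a fake in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// Sends the browser's SDP offer to OpenAI and resolves with the SDP answer text.
// Assumption: the offer travels as the raw body with Content-Type application/sdp.
async function exchangeSdp(
  offerSdp: string,
  ephemeralKey: string,
  fetchImpl: FetchLike
): Promise<string> {
  const res = await fetchImpl("https://api.openai.com/v1/realtime/calls", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ephemeralKey}`,
      "Content-Type": "application/sdp",
    },
    body: offerSdp,
  });
  if (!res.ok) throw new Error(`SDP exchange failed with HTTP ${res.status}`);
  return res.text();
}
```

In a browser, the returned text would then be applied with pc.setRemoteDescription({ type: "answer", sdp: answer }), after which microphone audio starts flowing over the peer connection.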
Tab 5 — WebSocket Transcription
The same live transcription as Tab 4, but the browser opens a WebSocket directly to wss://api.openai.com/v1/realtime. Because the browser WebSocket API does not allow custom HTTP headers, the ephemeral key is passed as a WebSocket subprotocol string.
// Open WebSocket with ephemeral key as subprotocol
const ws = new WebSocket("wss://api.openai.com/v1/realtime", [
  "realtime",
  `openai-insecure-api-key.${ephemeralKey}`,
]);

// Capture microphone via Web Audio API
// (connecting the microphone source node to `processor` is omitted here)
const ctx = new AudioContext({ sampleRate: 24000 });
const processor = ctx.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (e) => {
  if (ws.readyState !== WebSocket.OPEN) return; // skip chunks until the socket is open
  const float32 = e.inputBuffer.getChannelData(0);
  const base64 = float32ToPcm16Base64(float32);
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64,
  }));
};
Audio is captured via the Web Audio API, encoded as Base64 PCM16 using float32ToPcm16Base64 from lib/audio.ts, and sent as JSON messages. Transcript events arrive over the same WebSocket and are handled by the shared useTranscription hook.
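The encoding step is easy to get subtly wrong, so here is a minimal sketch of what a Float32 to PCM16 to Base64 encoder can look like. It reuses the name float32ToPcm16Base64 for readability, but it is an illustration and not the contents of lib/audio.ts; the little-endian byte order and the asymmetric scaling (0x8000 for negative samples, 0x7fff for positive ones) are assumptions based on the usual PCM16 convention.

```typescript
// Sketch of a Float32 → PCM16 (little-endian) → Base64 encoder.
function float32ToPcm16Base64(samples: Float32Array): string {
  const view = new DataView(new ArrayBuffer(samples.length * 2));
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale to the signed 16-bit range and write little-endian.
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }

  // Convert the raw bytes to a binary string, then Base64-encode it.
  const bytes = new Uint8Array(view.buffer);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}
```

For example, the three samples 0, 1, and -1 become the bytes 00 00, FF 7F, and 00 80, which encode to "AAD/fwCA".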
Tab 6 — Server Transcription
The most secure approach: the browser sends audio to your own Node.js server via Socket.IO, and the server opens a WebSocket to OpenAI using the OPENAI_API_KEY stored securely on the server. No ephemeral token is needed, and the API key never leaves the server.
// lib/socket/serverTranscriptionHandler.ts
import WebSocket from "ws"; // assumes the Node.js "ws" package

// Declared in the per-connection scope so both handlers below can reach it
let openaiWs: WebSocket | null = null;

socket.on("server-transcription:start", () => {
  // Server opens WebSocket to OpenAI with OPENAI_API_KEY
  openaiWs = new WebSocket(
    "wss://api.openai.com/v1/realtime?intent=transcription",
    { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
  );
  openaiWs.on("open", () => {
    socket.emit("server-transcription:connected");
  });
  openaiWs.on("message", (raw) => {
    socket.emit("server-transcription:event", raw.toString());
  });
});

socket.on("server-transcription:audio", (base64) => {
  // Drop audio that arrives before the OpenAI socket is open
  if (!openaiWs || openaiWs.readyState !== WebSocket.OPEN) return;
  openaiWs.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64,
  }));
});
The browser emits "server-transcription:start" to the server. The server opens the OpenAI WebSocket using OPENAI_API_KEY, something the browser cannot do directly. Audio chunks are forwarded from the browser to the server via Socket.IO, then from the server to OpenAI. Transcript events flow back the same way.
Key files
| File | Role |
|---|---|
| app/api/transcription-session/route.ts | Server: mints ephemeral OpenAI token (Tabs 4 & 5) |
| app/components/WebRTCTranscription.tsx | Tab 4: WebRTC connection, microphone capture, transcript display |
| app/components/WebSocketTranscription.tsx | Tab 5: WebSocket connection, PCM16 audio encoding, transcript display |
| lib/socket/serverTranscriptionHandler.ts | Tab 6: Server opens/manages the OpenAI WebSocket, forwards audio, relays events |
| app/components/ServerTranscription.tsx | Tab 6: Socket.IO connection, microphone capture, transcript display |
| lib/audio.ts | Shared utility: float32ToPcm16Base64 encoder (Tabs 5 & 6) |
The Shared useTranscription Hook
Tabs 4, 5, and 6 all use the same useTranscription hook for event handling and transcript display. The hook accumulates delta events into a rolling text block, updating it in real time as you speak. When the completed event arrives, the in-progress segment is replaced with the corrected final text and a newline is added so the next utterance starts on a fresh line.
Each speech turn is identified by a unique item_id, which is used internally to know which portion of the text to replace when the final transcript arrives.
// hooks/useTranscription.ts
// Two event types from OpenAI:
//   1. delta → small chunk of new text (fires continuously as you speak)
//   2. completed → full corrected text for that speech turn
function handleEvent(raw: string) {
  const event = JSON.parse(raw);
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    // Append delta to the in-progress segment identified by event.item_id
    updateSegment(event.item_id, (prev) => prev + event.delta);
  }
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    // Replace in-progress segment with the final corrected text
    finalizeSegment(event.item_id, event.transcript + "\n");
  }
}
The hook also exposes clearTranscript and copyTranscript helpers used by all three transcription tab UIs, keeping the UI components thin and focused on rendering.
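The accumulate-then-replace behavior is easier to see as a pure reducer. The sketch below is a hypothetical model of that logic, not the hook itself: `TranscriptState`, `applyEvent`, and `renderTranscript` are illustrative names, and only the two event types and their item_id, delta, and transcript fields come from the description above.

```typescript
// Hypothetical reducer-style model of the transcript accumulation.
interface TranscriptState {
  order: string[];                  // item_ids in arrival order
  segments: Record<string, string>; // item_id → current text for that speech turn
}

type TranscriptEvent =
  | { type: "conversation.item.input_audio_transcription.delta"; item_id: string; delta: string }
  | { type: "conversation.item.input_audio_transcription.completed"; item_id: string; transcript: string };

function applyEvent(state: TranscriptState, ev: TranscriptEvent): TranscriptState {
  const known = Object.prototype.hasOwnProperty.call(state.segments, ev.item_id);
  const order = known ? state.order : [...state.order, ev.item_id];
  const previous = known ? state.segments[ev.item_id] : "";
  const text =
    ev.type === "conversation.item.input_audio_transcription.delta"
      ? previous + ev.delta    // append the live chunk
      : ev.transcript + "\n";  // replace with the final text and end the line
  return { order, segments: { ...state.segments, [ev.item_id]: text } };
}

// Joins all segments in arrival order into the text shown to the user.
function renderTranscript(state: TranscriptState): string {
  return state.order.map((id) => state.segments[id]).join("");
}
```

Keying segments by item_id is what lets a completed event overwrite exactly one speech turn while deltas for the next turn keep appending after it.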
Full Comparison: All Six Tabs
The table below summarizes the key differences across all six demos — from message path and server role to scalability and API key handling.
| Tab | Message/Media Path | Server Role | Scales to N Clients | Good For |
|---|---|---|---|---|
| 1 — WebSocket Chat | Browser → Server → All browsers | Always in the loop | Yes | Group chat, live feeds, notifications |
| 2 — WebRTC Chat | Browser ↔ Browser (direct) | Only during signaling | No (2 peers) | Low-latency P2P text, file transfer |
| 3 — WebRTC Video | Browser ↔ Browser (direct) | Only during signaling | No (2 peers) | Video/audio calls |
| 4 — WebRTC Transcription | Browser → OpenAI (direct) | Only mints ephemeral token | Yes (per-user session) | Live captions, voice notes, accessibility |
| 5 — WebSocket Transcription | Browser → OpenAI (direct) | Only mints ephemeral token | Yes (per-user session) | Live captions when WebRTC is unavailable |
| 6 — Server Transcription | Browser → Server → OpenAI | Full proxy (manages OpenAI WS) | Yes (per-user session) | Secure transcription; browser can't reach OpenAI directly |
WebRTC vs WebSocket vs Server-side for transcription (Tabs 4, 5 & 6)
All three tabs produce the same result — live transcription via the OpenAI Realtime API — but they differ in how audio gets there and where the API key lives:
| | Tab 4 — WebRTC | Tab 5 — WebSocket | Tab 6 — Server |
|---|---|---|---|
| Audio transport | Native media track (browser handles encoding) | Web Audio API → PCM16 → Base64 → JSON | Web Audio API → PCM16 → Base64 → Socket.IO → server WS |
| Connection setup | SDP offer/answer exchange | Simple WebSocket handshake | Socket.IO emit; server opens WS to OpenAI |
| Auth mechanism | Ephemeral key in HTTP Authorization header | Ephemeral key as WebSocket subprotocol | Server uses OPENAI_API_KEY directly |
| Ephemeral token needed | Yes | Yes | No |
| Complexity | Lower (browser handles audio encoding) | Higher (manual PCM16 encoding in JS) | Moderate (server proxy adds a hop but simplifies auth) |
| Security | Ephemeral key visible in browser | Ephemeral key visible in browser | API key never leaves the server |
Conclusion
WebSockets and WebRTC are complementary technologies that together cover the full spectrum of real-time web communication. WebSockets excel at server-mediated scenarios — group chat, live feeds, notifications — where the server needs to be in the loop. WebRTC shines for peer-to-peer scenarios — video calls, file transfer, low-latency data — where you want to minimize server involvement after the initial handshake.
The three transcription tabs show how the same AI capability can be delivered with very different architectures, each with its own trade-offs around security, complexity, and browser compatibility. Tab 6's server-proxy approach is the most secure and the most straightforward to reason about — the API key never leaves the server, and no ephemeral token dance is required.
About the Author
Wayne Cheng is the founder and AI app developer at Audoir, LLC. Prior to founding Audoir, he worked as a hardware design engineer for Silicon Valley startups and an audio engineer for creative organizations. He holds an MSEE from UC Davis and a Music Technology degree from Foothill College.
Further Exploration
Explore the complete tutorial repository and experiment with extending the demos. Consider adding a third peer to the WebRTC rooms, implementing TURN server support for stricter network environments, or building a multi-user transcription session where all participants see a shared live transcript.
For more AI-powered development tools and tutorials, visit Audoir.