WebSocket & WebRTC Tutorial: Real-Time Communication on the Web
A hands-on guide to real-time communication on the web, built with Next.js, Socket.IO, the browser's native WebRTC APIs, and the OpenAI Realtime API. Six progressive tabs take you from a simple broadcast chat all the way to live AI-powered speech transcription.
Complete Tutorial Code
Follow along with the complete source code for this WebSocket & WebRTC tutorial. Includes six self-contained demos — from a broadcast chat to peer-to-peer video calls and live AI transcription.
View on GitHub

Introduction
Real-time communication is at the heart of modern web applications — from live chat and collaborative editing to video calls and AI-powered voice interfaces. Two browser technologies make this possible: WebSockets, which provide a persistent bidirectional channel between a browser and a server, and WebRTC, which enables direct peer-to-peer connections between browsers for ultra-low-latency data, audio, and video.
This tutorial walks you through both technologies side by side. The app has six tabs, each a self-contained demo that teaches a different communication pattern — starting simple and building up to more advanced techniques, including live speech-to-text powered by the OpenAI Realtime API.
WebSocket vs. WebRTC: A Quick Comparison
Before diving into the code, it helps to understand when to reach for each technology.
WebSocket
- Server is always in the loop
- Scales to N clients via broadcast
- Simple setup — no signaling needed
- Great for group chat, live feeds, notifications
WebRTC
- Server only assists during setup (signaling)
- Ultra-low latency once connected
- Supports data channels, audio, and video
- Great for video calls, P2P file transfer
Tech Stack
The tutorial uses a modern TypeScript stack that keeps all six demos running on a single port:
- Next.js (App Router) with React and TypeScript
- Socket.IO for WebSocket communication
- The browser's native WebRTC APIs
- A custom Node.js server (server.ts) that mounts both Next.js and Socket.IO on port 3000
- The OpenAI Realtime API (gpt-realtime-whisper) for live speech transcription

Project Structure
The project is organized around a custom server, shared hooks, and per-tab UI components:
server.ts                          # Custom HTTP server: Next.js + Socket.IO
lib/
  webrtc.ts                        # Shared WebRTC constants & types (ICE servers, status labels)
  audio.ts                         # Shared audio utility: Float32 → PCM16 Base64 encoder
  socket/
    chatHandler.ts                 # Server-side: handles "chat message" events
    webrtcHandler.ts               # Server-side: WebRTC data-channel signaling relay
    webrtcVideoHandler.ts          # Server-side: WebRTC video/audio signaling relay
    serverTranscriptionHandler.ts  # Server-side: proxies audio to OpenAI Realtime API
hooks/
  useSocket.ts                     # Client hook: WebSocket chat state & logic
  useWebRTC.ts                     # Client hook: WebRTC connection & data channel
  useWebRTCVideo.ts                # Client hook: WebRTC video/audio peer connection
  useTranscription.ts              # Client hook: shared transcription state & event handling
app/
  api/
    transcription-session/
      route.ts                     # API route: mints ephemeral OpenAI token (server-side)
  components/
    BasicWebSocket.tsx             # Tab 1 UI component
    BasicWebRTC.tsx                # Tab 2 UI component
    WebRTCVideo.tsx                # Tab 3 UI component
    WebRTCTranscription.tsx        # Tab 4 UI component
    WebSocketTranscription.tsx     # Tab 5 UI component
    ServerTranscription.tsx        # Tab 6 UI component
  types/
    message.ts                     # Shared Message type

Tutorial Overview
The six tabs are designed to be explored in order. Each one introduces a new concept while building on what came before:
🔌 Basic WebSocket — Real-Time Chat
A classic broadcast chat where every connected browser tab sees every message in real time. The server is the hub — all messages flow through it via Socket.IO.
📡 Basic WebRTC — Peer-to-Peer Chat
Messages travel directly between two browser tabs via an RTCDataChannel. The server only assists during the initial signaling handshake (offer/answer/ICE).
🎥 WebRTC Video — Peer-to-Peer Video Call
Live camera and microphone streams flow directly between two browser tabs using RTCPeerConnection media tracks. The server is only involved during signaling.
🎙️ WebRTC Transcription — Live Speech-to-Text via WebRTC
Microphone audio is streamed directly to the OpenAI Realtime API over a WebRTC peer connection. Transcript text appears word by word in real time. The server only mints a short-lived ephemeral token.
🎙️ WebSocket Transcription — Live Speech-to-Text via WebSocket
The same live transcription as Tab 4, but the browser opens a WebSocket directly to OpenAI. Audio is captured via the Web Audio API, encoded as Base64 PCM16, and sent as JSON messages.
🖥️ Server Transcription — Secure Proxy via Socket.IO
The browser never touches the OpenAI API directly. Audio is sent to your own Node.js server via Socket.IO, and the server opens a WebSocket to OpenAI using the API key stored securely on the server.
Getting Started
Follow these steps to run all six demos on your local machine:
Prerequisites
- Node.js (v18 or higher)
- OpenAI API key (required for Tabs 4, 5, and 6 only)
Installation Steps
1. Clone the repository:

   git clone https://github.com/audoir/websocket-webrtc-tutorial.git

2. Install dependencies:

   npm install

3. Set your OpenAI API key (for Tabs 4, 5 & 6):

   echo "OPENAI_API_KEY=sk-..." > .env

   The key is only used server-side and is never exposed to the browser.

4. Start the dev server:

   npm run dev

   This starts the custom Node.js server (server.ts) that mounts both Next.js and Socket.IO on port 3000 — not the standard Next.js dev server.

5. Open http://localhost:3000 in your browser.
Tab 1 — Basic WebSocket: Real-Time Chat
The first tab demonstrates the simplest real-time pattern: a broadcast chat where every connected browser tab sees every message instantly. Open the app in two or more tabs, type a message in any tab, and watch it appear in all the others.
How It Works
When the component mounts, useSocket calls io() to open a Socket.IO connection. Socket.IO uses a WebSocket under the hood (with HTTP long-polling as a fallback). When you send a message, the hook emits a "chat message" event to the server. The server receives it and calls io.emit("chat message", msg, socket.id) — broadcasting to all connected clients. Each tab listens for the event and appends the message to its list. If the incoming senderId matches the local socket's own ID, the message is ignored (it was already added optimistically).
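The send path of the hook can be sketched as follows. The function name and the `Message` shape here are illustrative, not the repository's exact code; only the `"chat message"` event name and the optimistic-add behavior come from the tutorial:

```typescript
interface Message {
  text: string;
  timestamp: number;
}

// Minimal shape of the Socket.IO client this sketch relies on.
interface EmitterLike {
  emit(event: string, ...args: unknown[]): void;
}

// Append the message locally first (optimistic add), then emit it so the
// server can broadcast it to every other connected client.
function sendChatMessage(
  socket: EmitterLike,
  messages: Message[],
  text: string
): Message[] {
  const msg: Message = { text, timestamp: Date.now() };
  socket.emit("chat message", msg);
  return [...messages, msg];
}
```

Because the sender already has the message in its list, the server-side broadcast shown below can safely include the sender; the `senderId` check drops the duplicate.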
// lib/socket/chatHandler.ts
socket.on("chat message", (msg) => {
  // Broadcast to ALL connected clients, including the sender
  io.emit("chat message", msg, socket.id);
});

// hooks/useSocket.ts
socket.on("chat message", (msg, senderId) => {
  if (senderId === socket.id) return; // already added optimistically
  setMessages((prev) => [...prev, msg]);
});

Tab 2 — Basic WebRTC: Peer-to-Peer Chat
The second tab demonstrates a peer-to-peer chat where messages travel directly between two browser tabs — the server is only involved during the initial handshake. Once connected, the server is completely out of the loop.
Phase 1: Signaling (Server-Assisted)
WebRTC requires a brief setup phase called signaling before two peers can talk directly. Both tabs join the same room name. The server assigns roles: the second joiner becomes the initiator and creates an SDP offer. The server relays the offer to the first tab, which creates an SDP answer. ICE candidates are exchanged through the server relay in both directions.
// Signaling flow
Tab A emits "webrtc:join" → server assigns { initiator: false }
Tab B emits "webrtc:join" → server assigns { initiator: true }

// Tab B (initiator) creates offer:
const pc = new RTCPeerConnection(ICE_SERVERS);
const channel = pc.createDataChannel("chat");
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
socket.emit("webrtc:offer", offer);

// Tab A creates answer:
await pc.setRemoteDescription(offer);
const answer = await pc.createAnswer();
await pc.setLocalDescription(answer);
socket.emit("webrtc:answer", answer);

Phase 2: Direct P2P (Server No Longer Involved)
Once ICE negotiation succeeds, the RTCDataChannel fires its onopen event and the status updates to "Connected (P2P)". From this point, channel.send(text) pushes messages directly to the other peer's browser. The server never sees these messages.
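A minimal sketch of this post-connection messaging, typed against just the subset of RTCDataChannel it uses so the logic stands on its own (in the browser, the real object comes from `pc.createDataChannel("chat")` or the `pc.ondatachannel` event):

```typescript
// Subset of the RTCDataChannel API used below.
interface ChatChannel {
  readyState: string;
  send(data: string): void;
  onmessage: ((ev: { data: string }) => void) | null;
}

// Deliver incoming peer messages straight to the UI; the server never sees them.
function wireChannel(channel: ChatChannel, onText: (text: string) => void): void {
  channel.onmessage = (ev) => onText(ev.data);
}

// Send directly to the other peer, but only once the channel has fired onopen
// and reached the "open" state.
function sendIfOpen(channel: ChatChannel, text: string): boolean {
  if (channel.readyState !== "open") return false;
  channel.send(text);
  return true;
}
```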
Tab 3 — WebRTC Video: Peer-to-Peer Video Call
The third tab extends the WebRTC pattern to live video and audio. The key difference from Tab 2 is the transport: instead of an RTCDataChannel carrying text, this tab uses media tracks (RTCPeerConnection.addTrack) to stream real-time video and audio.
Tab 2 — Data Channel
- pc.createDataChannel("chat")
- Carries text messages
- No media permissions needed

Tab 3 — Media Tracks
- getUserMedia({ video, audio })
- pc.addTrack(track, stream)
- pc.ontrack receives remote video
Before connecting to the server, useWebRTCVideo calls navigator.mediaDevices.getUserMedia({ video: true, audio: true }) to request camera and microphone access. The resulting MediaStream is immediately attached to the local <video> element so you see your own preview right away. All local tracks are added to the peer connection via localStream.getTracks().forEach(track => pc.addTrack(track, localStream)). When the remote peer's tracks arrive, pc.ontrack attaches them to the remote <video> element.
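The capture-and-publish step can be sketched as a small helper, typed against minimal subsets of the browser APIs so it is self-contained (names are illustrative; in the browser, `stream` comes from `await navigator.mediaDevices.getUserMedia({ video: true, audio: true })`):

```typescript
// Subsets of the browser media APIs this sketch touches.
interface TrackLike { kind: string }
interface StreamLike { getTracks(): TrackLike[] }
interface PeerLike { addTrack(track: TrackLike, stream: StreamLike): void }

// Attach the local preview immediately, then publish every captured track
// (typically one video and one audio) to the peer connection.
function publishLocalMedia(
  stream: StreamLike,
  pc: PeerLike,
  attachPreview: (s: StreamLike) => void // e.g. (s) => { videoEl.srcObject = s }
): void {
  attachPreview(stream); // local preview appears before the call connects
  for (const track of stream.getTracks()) {
    pc.addTrack(track, stream);
  }
}
```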
Tabs 4, 5 & 6 — Live AI Transcription
The final three tabs all produce the same result — live speech-to-text powered by the OpenAI Realtime API (gpt-realtime-whisper) — but they differ in how audio gets to OpenAI and where the API key lives.
| | Tab 4 — WebRTC | Tab 5 — WebSocket | Tab 6 — Server |
|---|---|---|---|
| Audio transport | Native media track | PCM16 Base64 over WebSocket | PCM16 Base64 via Socket.IO → server WS |
| Auth mechanism | Ephemeral key in HTTP header | Ephemeral key as WS subprotocol | Server uses OPENAI_API_KEY directly |
| Ephemeral token | Yes | Yes | No |
| API key exposure | Ephemeral key in browser | Ephemeral key in browser | Key stays on server only |
Tab 4 — WebRTC Transcription
The browser establishes a WebRTC peer connection directly with OpenAI. The server's only role is to securely mint a short-lived ephemeral token so the browser never needs to hold your real API key.
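The minting request can be sketched as a small helper. Only the endpoint URL and the `{ value: "ek_..." }` response shape are taken from this tutorial; the header and body details below are assumptions:

```typescript
// Build the fetch arguments for minting an ephemeral client secret.
function buildMintRequest(apiKey: string): {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
} {
  return {
    url: "https://api.openai.com/v1/realtime/client_secrets",
    init: {
      method: "POST",
      headers: {
        // Read from process.env on the server; never sent to the browser.
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({}), // session options (model, transcription settings) omitted
    },
  };
}

// In app/api/transcription-session/route.ts this would be used roughly as:
//   const { url, init } = buildMintRequest(process.env.OPENAI_API_KEY!);
//   const data = await (await fetch(url, init)).json(); // → { value: "ek_..." }
```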
// Step 1: Mint ephemeral token (server-side)
// app/api/transcription-session/route.ts
POST https://api.openai.com/v1/realtime/client_secrets
  → returns { value: "ek_..." }

// Step 2: Establish WebRTC connection (browser-side)
const pc = new RTCPeerConnection();
pc.addTrack(micTrack, stream);                 // send microphone audio
const dc = pc.createDataChannel("oai-events"); // receive transcript events
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

// POST SDP to OpenAI with ephemeral key
POST https://api.openai.com/v1/realtime/calls
  Authorization: Bearer ek_...
  Body: SDP offer
  → SDP answer
await pc.setRemoteDescription({ type: "answer", sdp: answer });

// Step 3: Receive transcript events over the data channel
dc.onmessage = (e) => {
  const event = JSON.parse(e.data);
  // conversation.item.input_audio_transcription.delta → live text
  // conversation.item.input_audio_transcription.completed → final text
};

Tab 5 — WebSocket Transcription
The browser opens a WebSocket directly to wss://api.openai.com/v1/realtime. Because the browser WebSocket API does not allow custom HTTP headers, the ephemeral key is passed as a WebSocket subprotocol string.
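The `float32ToPcm16Base64` helper from `lib/audio.ts` could be implemented along these lines (a sketch; the repository's exact code may differ). It clamps each Float32 sample to [-1, 1], scales it to a little-endian signed 16-bit integer, and Base64-encodes the raw bytes:

```typescript
// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM, Base64-encoded.
function float32ToPcm16Base64(float32: Float32Array): string {
  const buffer = new ArrayBuffer(float32.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < float32.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // little-endian
  }
  // Base64-encode the raw bytes (btoa operates on a binary string).
  let binary = "";
  const bytes = new Uint8Array(buffer);
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}
```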
// Open WebSocket with ephemeral key as subprotocol
const ws = new WebSocket("wss://api.openai.com/v1/realtime", [
  "realtime",
  `openai-insecure-api-key.${ephemeralKey}`,
]);

// Capture microphone via Web Audio API
const ctx = new AudioContext({ sampleRate: 24000 });
const processor = ctx.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (e) => {
  const float32 = e.inputBuffer.getChannelData(0);
  const base64 = float32ToPcm16Base64(float32);
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64,
  }));
};

Tab 6 — Server Transcription
The most secure approach: the browser sends audio to your own Node.js server via Socket.IO, and the server opens a WebSocket to OpenAI using the OPENAI_API_KEY stored securely on the server. No ephemeral token is needed, and the API key never leaves the server.
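On the browser side, encoded audio chunks are forwarded over the existing Socket.IO connection. A sketch using the event names from the server handler (the helper's name and shape are illustrative):

```typescript
// Minimal shape of the Socket.IO client this sketch relies on.
interface EmitterLike {
  emit(event: string, ...args: unknown[]): void;
}

// Tell the server to open its OpenAI WebSocket, then stream audio chunks to it.
function startServerTranscription(socket: EmitterLike) {
  socket.emit("server-transcription:start");
  return {
    // Called from the audio capture loop with each Base64 PCM16 chunk.
    sendChunk(base64: string): void {
      socket.emit("server-transcription:audio", base64);
    },
  };
}
```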
// lib/socket/serverTranscriptionHandler.ts
// Held in the enclosing scope so the audio listener below can reach it
let openaiWs: WebSocket | null = null;

socket.on("server-transcription:start", () => {
  // Server opens WebSocket to OpenAI with OPENAI_API_KEY
  openaiWs = new WebSocket(
    "wss://api.openai.com/v1/realtime?intent=transcription",
    { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
  );
  openaiWs.on("open", () => {
    socket.emit("server-transcription:connected");
  });
  openaiWs.on("message", (raw) => {
    socket.emit("server-transcription:event", raw.toString());
  });
});

socket.on("server-transcription:audio", (base64) => {
  // Relay each audio chunk to the OpenAI connection opened above
  openaiWs?.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: base64,
  }));
});

The Shared useTranscription Hook
Tabs 4, 5, and 6 all use the same useTranscription hook for event handling and transcript display. The hook accumulates delta events into a rolling text block, updating it in real time as you speak. When the completed event arrives, the in-progress segment is replaced with the corrected final text and a newline is added so the next utterance starts on a fresh line.
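The `updateSegment` and `finalizeSegment` helpers referenced in the hook excerpt are not shown in full. A plausible minimal implementation keyed by OpenAI's `item_id` (names taken from the excerpt, bodies assumed):

```typescript
// In-progress and finalized transcript segments, keyed by OpenAI item_id.
const segments = new Map<string, string>();

// Append a delta to the in-progress segment for this item_id.
function updateSegment(itemId: string, update: (prev: string) => string): void {
  segments.set(itemId, update(segments.get(itemId) ?? ""));
}

// Replace the in-progress segment with the final corrected text.
function finalizeSegment(itemId: string, finalText: string): void {
  segments.set(itemId, finalText);
}

// The displayed transcript is the concatenation of all segments in insertion order.
function transcript(): string {
  return [...segments.values()].join("");
}
```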
// hooks/useTranscription.ts
// Two event types from OpenAI:
//   1. delta     → small chunk of new text (fires continuously as you speak)
//   2. completed → full corrected text for that speech turn
handleEvent(raw: string) {
  const event = JSON.parse(raw);
  if (event.type === "conversation.item.input_audio_transcription.delta") {
    // Append delta to the in-progress segment identified by event.item_id
    updateSegment(event.item_id, (prev) => prev + event.delta);
  }
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    // Replace in-progress segment with the final corrected text
    finalizeSegment(event.item_id, event.transcript + "\n");
  }
}

Key Concepts Demonstrated
WebSocket Broadcasting
Persistent bidirectional connections with Socket.IO. Server broadcasts to all connected clients simultaneously.
WebRTC Signaling
SDP offer/answer exchange and ICE candidate relay to establish direct peer-to-peer connections across NATs and firewalls.
Media Capture & Streaming
Using getUserMedia to capture camera and microphone, then streaming media tracks over a WebRTC peer connection.
Real-Time AI Transcription
Three approaches to live speech-to-text with the OpenAI Realtime API — WebRTC, WebSocket, and server proxy — each with different security and complexity trade-offs.
Secure API Key Handling
Minting short-lived ephemeral tokens server-side so the real API key never reaches the browser, and a fully server-proxied approach where the key never leaves the server at all.
Custom React Hooks
Encapsulating all connection logic — WebSocket, WebRTC, and transcription — in reusable custom hooks that expose a clean API to UI components.
Full Comparison: All Six Tabs
| Tab | Message/Media Path | Server Role | Scales to N Clients |
|---|---|---|---|
| 1 — WebSocket Chat | Browser → Server → All browsers | Always in the loop | Yes |
| 2 — WebRTC Chat | Browser ↔ Browser (direct) | Only during signaling | No (2 peers) |
| 3 — WebRTC Video | Browser ↔ Browser (direct) | Only during signaling | No (2 peers) |
| 4 — WebRTC Transcription | Browser → OpenAI (direct) | Only mints ephemeral token | Yes (per-user session) |
| 5 — WebSocket Transcription | Browser → OpenAI (direct) | Only mints ephemeral token | Yes (per-user session) |
| 6 — Server Transcription | Browser → Server → OpenAI | Full proxy (manages OpenAI WS) | Yes (per-user session) |
Learning Outcomes
By working through this tutorial, you will have gained practical experience with:
- Building real-time broadcast chat with Socket.IO
- Implementing WebRTC signaling (offer/answer/ICE) from scratch
- Streaming live video and audio between browser peers
- Connecting to the OpenAI Realtime API via WebRTC and WebSocket
- Encoding microphone audio as PCM16 Base64 using the Web Audio API
- Minting ephemeral tokens server-side for secure API access
- Building a server-side proxy for environments where the browser cannot reach an external API directly
- Encapsulating real-time logic in reusable custom React hooks
- Comparing trade-offs between WebSocket, WebRTC, and server-proxy architectures
Conclusion
WebSockets and WebRTC are complementary technologies that together cover the full spectrum of real-time web communication. WebSockets excel at server-mediated scenarios — group chat, live feeds, notifications — where the server needs to be in the loop. WebRTC shines for peer-to-peer scenarios — video calls, file transfer, low-latency data — where you want to minimize server involvement after the initial handshake.
The three transcription tabs show how the same AI capability can be delivered with very different architectures, each with its own trade-offs around security, complexity, and browser compatibility. Tab 6's server-proxy approach is the most secure and the most straightforward to reason about — the API key never leaves the server, and no ephemeral token dance is required.
About the Author
Wayne Cheng is the founder and AI app developer at Audoir, LLC. Prior to founding Audoir, he worked as a hardware design engineer for Silicon Valley startups and an audio engineer for creative organizations. He holds an MSEE from UC Davis and a Music Technology degree from Foothill College.
Further Exploration
Explore the complete tutorial repository and experiment with extending the demos. Consider adding a third peer to the WebRTC rooms, implementing TURN server support for stricter network environments, or building a multi-user transcription session where all participants see a shared live transcript.
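For the TURN suggestion, the shared ICE configuration in lib/webrtc.ts could be extended along these lines. Every URL and credential below is a placeholder (the repository's actual STUN entries may differ); you would point the TURN entry at your own deployment, such as coturn, and load real credentials securely:

```typescript
// Sketch: extending the shared ICE configuration with a TURN fallback for
// networks where direct peer-to-peer connectivity fails.
export const ICE_SERVERS = {
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" }, // public STUN (example)
    {
      urls: "turn:turn.example.com:3478",     // your TURN server (placeholder)
      username: "demo-user",                  // placeholder credentials
      credential: "demo-pass",
    },
  ],
};

// Used as before when constructing the peer connection:
//   const pc = new RTCPeerConnection(ICE_SERVERS);
```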
For more AI-powered development tools and tutorials, visit Audoir.