WebSocket Protocol

HTTP is a request-response protocol. The client speaks, the server answers, and the exchange is over; even when the connection stays open for reuse, the server cannot speak first. This works well for fetching documents, but it is a poor fit for anything that needs the server to push data unprompted: stock tickers, chat, live dashboards, collaborative editing.

Before WebSocket, developers worked around HTTP's limitations with polling (asking the server "anything new?" on a timer), long-polling (holding a request open until the server has something to say), and chunked transfer tricks. All of these are awkward. WebSocket, standardised in RFC 6455 (2011), adds a proper full-duplex channel to the web platform.

The HTTP/1.1 handshake

A WebSocket connection begins as a plain HTTP/1.1 request. The client asks the server to upgrade the connection, and if the server agrees, they switch protocols on the same TCP socket. No new connection is needed.

Client request:

GET /chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

Server response:

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

The Sec-WebSocket-Key is a random 16-byte value the client sends as a Base64 string. The server proves it understands the WebSocket protocol, rather than just forwarding the request, by appending the magic GUID

258EAFA5-E914-47DA-95CA-C5AB0DC85B11

to the key exactly as received (the Base64 text, not the decoded bytes), taking the SHA-1 hash of the result, and returning the Base64-encoded digest as Sec-WebSocket-Accept. After the 101 Switching Protocols response, the HTTP layer is discarded and both sides speak the WebSocket framing protocol directly.
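
The derivation of Sec-WebSocket-Accept can be sketched in a few lines. This is a minimal illustration assuming Node.js and its built-in crypto module; the function name computeAccept is made up for this example.

```javascript
const { createHash } = require("node:crypto");

// Fixed GUID defined by RFC 6455.
const WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

// computeAccept is a hypothetical helper name.
// The GUID is appended to the Base64 text of the key, not to its decoded bytes.
function computeAccept(secWebSocketKey) {
  return createHash("sha1")
    .update(secWebSocketKey + WS_GUID)
    .digest("base64");
}

// Key taken from the sample client request above.
console.log(computeAccept("dGhlIHNhbXBsZSBub25jZQ=="));
```

Run against the key from the sample request, this should print the same Sec-WebSocket-Accept value that appears in the sample server response.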

The HTTP/2 handshake

HTTP/2 is a binary, multiplexed protocol — multiple request/response streams share a single TCP connection concurrently. The HTTP/1.1 upgrade mechanism does not translate: HTTP/2 forbids Connection and Upgrade headers, and the 101 Switching Protocols status code does not exist in HTTP/2.

RFC 8441 (2018) solves this with an extended CONNECT method. The client adds a :protocol pseudo-header to signal the desired application protocol, leaving the existing HTTP/2 stream open rather than hijacking the underlying TCP connection.

Before the client can use this, the server must opt in by sending SETTINGS_ENABLE_CONNECT_PROTOCOL = 1 in its HTTP/2 SETTINGS frame. Once enabled, the handshake looks like this:

Client HEADERS frame:

:method = CONNECT
:protocol = websocket
:scheme = https
:path = /chat
:authority = example.com
sec-websocket-version = 13

Server HEADERS frame:

:status = 200

A 200 response establishes the WebSocket. The HTTP/2 stream is then treated as if it were a TCP connection: WebSocket frames flow through it directly. Sec-WebSocket-Key and Sec-WebSocket-Accept are dropped; the :protocol pseudo-header takes over their role. Your application server may still have to process these two headers, however, if a load balancer or proxy in front of it converts between HTTP/1.1 and HTTP/2, so that one hop of the connection still uses the HTTP/1.1 handshake.
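
On the server side, recognising such a request comes down to checking a few header values. The sketch below assumes the request headers arrive as a plain object keyed by lowercase name, which is how many HTTP/2 libraries expose them; the helper name isExtendedConnectForWebSocket is invented for this example and is not part of any library.

```javascript
// isExtendedConnectForWebSocket is a hypothetical helper, not a library API.
// `headers` is assumed to be a plain object keyed by lowercase header name.
function isExtendedConnectForWebSocket(headers) {
  return (
    headers[":method"] === "CONNECT" &&
    headers[":protocol"] === "websocket" &&
    typeof headers[":path"] === "string" &&
    typeof headers[":authority"] === "string" &&
    headers["sec-websocket-version"] === "13"
  );
}

// The sample client HEADERS frame from above, as an object.
const request = {
  ":method": "CONNECT",
  ":protocol": "websocket",
  ":scheme": "https",
  ":path": "/chat",
  ":authority": "example.com",
  "sec-websocket-version": "13",
};

// Answer with :status 200 when the check passes. What to send otherwise
// is left to the application in this sketch.
const responseHeaders = isExtendedConnectForWebSocket(request)
  ? { ":status": 200 }
  : null;
console.log(responseHeaders);
```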

The frame format

Everything sent over a WebSocket connection (text messages, binary payloads, pings, close signals) travels inside a frame. Frames are small: their header is just two bytes in the common case.

A frame

Once the handshake completes, every message travels in one or more frames. A frame is the atomic unit of the wire format: a short structured header followed by arbitrary payload bytes.

32 bits per row

Network protocol diagrams conventionally show fields in 32-bit (4-byte) rows, reading left-to-right with the most-significant bit on the left — bit 0. The four columns mark byte boundaries.

This layout comes from the way protocol engineers think about memory: each row is one machine word, and the visual grouping makes it easy to see which fields cross byte boundaries and how much space each one occupies.

FIN — final fragment

The first bit is FIN. It answers the question: is this the last frame of the message?

WebSocket allows a single logical message to be split across multiple frames: a first frame followed by one or more continuation frames. Every frame except the last has FIN=0. The final (or only) frame sets FIN=1. Single-shot messages, the common case, always have FIN=1.

Fragmentation lets a sender begin streaming a message whose total length is not yet known, interleaving control frames (like pings) between fragments.
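
The FIN rule can be illustrated with a small splitter that turns one payload into a sequence of frame descriptors. This is a sketch, assuming Node.js Buffers; fragmentMessage is a made-up name, and it uses the opcode values 0x1 (text) and 0x0 (continuation) that are described later in this article. It returns plain objects rather than encoded bytes.

```javascript
const OPCODE_CONTINUATION = 0x0;
const OPCODE_TEXT = 0x1;

// fragmentMessage is a hypothetical helper. It returns frame descriptors
// ({ fin, opcode, payload }) rather than encoded frames.
function fragmentMessage(payload, opcode, maxFragmentSize) {
  if (maxFragmentSize < 1) throw new RangeError("maxFragmentSize must be >= 1");
  const frames = [];
  let offset = 0;
  do {
    const end = Math.min(offset + maxFragmentSize, payload.length);
    frames.push({
      fin: end === payload.length,                        // only the last frame has FIN=1
      opcode: offset === 0 ? opcode : OPCODE_CONTINUATION, // first frame carries the real opcode
      payload: payload.subarray(offset, end),
    });
    offset = end;
  } while (offset < payload.length);
  return frames;
}

const frames = fragmentMessage(Buffer.from("Hello, WebSocket"), OPCODE_TEXT, 6);
for (const f of frames) {
  console.log(f.fin ? 1 : 0, f.opcode, JSON.stringify(f.payload.toString()));
}
```

With a 16-byte payload and a 6-byte limit this yields three frames: FIN is 0, 0, 1 and the opcodes are 0x1, 0x0, 0x0.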

RSV1 · RSV2 · RSV3

Three reserved bits follow. The spec says they must be 0 unless a WebSocket extension negotiated during the handshake has assigned them a meaning.

The most widely deployed extension is permessage-deflate: when both endpoints agree to it in their handshake headers, RSV1=1 on the first frame of a message signals that the message payload has been deflate-compressed and must be inflated before use. The other two reserved bits remain available for future extensions.

Opcode — frame type

The 4-bit opcode identifies what kind of frame this is. There are two categories: data frames carry application payload; control frames manage the connection itself.

Value   Meaning
0x0     Continuation frame
0x1     Text frame (UTF-8)
0x2     Binary frame
0x8     Connection close
0x9     Ping
0xA     Pong

Values 0x3–0x7 and 0xB–0xF are reserved for future data and control frames respectively. A receiver that encounters an unknown opcode must close the connection.
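
The first header byte packs FIN, the three reserved bits, and the opcode. A minimal decoder might look like the sketch below; decodeFirstByte and OPCODE_NAMES are names invented for this example, and throwing an Error stands in for whatever a real endpoint does to close the connection.

```javascript
const OPCODE_NAMES = {
  0x0: "continuation",
  0x1: "text",
  0x2: "binary",
  0x8: "close",
  0x9: "ping",
  0xa: "pong",
};

// decodeFirstByte is a hypothetical helper name.
function decodeFirstByte(byte) {
  const opcode = byte & 0x0f; // low four bits
  const name = OPCODE_NAMES[opcode];
  if (name === undefined) {
    // Reserved opcode: a real endpoint would close the connection here.
    throw new Error("unknown opcode 0x" + opcode.toString(16));
  }
  return {
    fin: (byte & 0x80) !== 0,  // most-significant bit
    rsv1: (byte & 0x40) !== 0,
    rsv2: (byte & 0x20) !== 0,
    rsv3: (byte & 0x10) !== 0,
    opcode,
    name,
    isControl: (opcode & 0x08) !== 0, // control opcodes are 0x8 and above
  };
}

console.log(decodeFirstByte(0x81)); // FIN set, opcode 0x1 (text)
console.log(decodeFirstByte(0x89)); // FIN set, opcode 0x9 (ping)
```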

MASK — payload masking

The MASK bit is asymmetric: its required value depends on which side is sending.

Client → server: MASK must be 1. The masking key field is present, and the payload is XOR-masked with it.

Server → client: MASK must be 0. There is no masking key field, and the payload bytes are sent as-is. A client that receives a masked server frame must close the connection.

The reason masking is mandatory from clients is defense against cache-poisoning attacks: a malicious web page could otherwise craft WebSocket traffic that looks like valid HTTP responses to an intervening proxy. Masking with a random per-frame key makes the payload look like noise to any intermediary.

Payload length

The 7-bit payload length field encodes size in three ranges:

  • 0–125: the actual byte count of the payload
  • 126: read the next 16 bits for the real length (up to 65,535 bytes)
  • 127: read the next 64 bits for the real length (up to 2^63 - 1 bytes; the most significant bit must be 0)

This variable-length encoding keeps the common case (small messages) minimal: most frames need only the 7-bit field. The extended fields only appear when the payload overflows the smaller encoding.
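
The three ranges translate directly into an encoder for the second header byte and any extended length bytes. The sketch below assumes Node.js Buffers; encodeLengthBytes is a hypothetical name, and the MASK bit is included because it shares a byte with the 7-bit length.

```javascript
// encodeLengthBytes is a hypothetical helper. It returns the second header
// byte plus any extended length bytes for a given payload size.
function encodeLengthBytes(payloadLength, masked) {
  if (!Number.isSafeInteger(payloadLength) || payloadLength < 0) {
    throw new RangeError("payloadLength must be a non-negative safe integer");
  }
  const maskBit = masked ? 0x80 : 0x00;
  if (payloadLength <= 125) {
    return Buffer.from([maskBit | payloadLength]);
  }
  if (payloadLength <= 0xffff) {
    const out = Buffer.alloc(3);
    out[0] = maskBit | 126;               // marker: 16-bit length follows
    out.writeUInt16BE(payloadLength, 1);  // big-endian
    return out;
  }
  const out = Buffer.alloc(9);
  out[0] = maskBit | 127;                 // marker: 64-bit length follows
  out.writeBigUInt64BE(BigInt(payloadLength), 1);
  return out;
}

console.log(encodeLengthBytes(5, false).toString("hex"));     // 05
console.log(encodeLengthBytes(200, false).toString("hex"));   // 7e00c8
console.log(encodeLengthBytes(70000, false).toString("hex")); // 7f0000000000011170
```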

Three branches

The 7-bit field creates three distinct frame layouts depending on its value. They differ in what comes immediately after the first two bytes.

With len 0–125 the header ends after two bytes (or six, including the masking key). With len = 126 two extra bytes carry the real length. With len = 127 eight extra bytes carry it. In all cases the masking key and payload follow in the same order; the only thing that changes is how many bytes the length occupies.

The break-point falls at 126 because a 7-bit field holds 0–127 and the two highest values are reserved as markers for the extended encodings. Control frames (close, ping, pong) are limited to 125-byte payloads and are never fragmented, so the control-frame path always stays at the 7-bit minimum.
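
On the receiving side, the three layouts can be handled by one small parser that works out where the payload starts. This is a sketch, assuming a Node.js Buffer that already contains the whole header; parseFrameHeader is a made-up name, and rejecting lengths above Number.MAX_SAFE_INTEGER is a simplification of this example rather than a protocol rule.

```javascript
// parseFrameHeader is a hypothetical helper. It assumes `buf` already holds
// the complete header and does not check that the buffer is long enough.
function parseFrameHeader(buf) {
  const masked = (buf[1] & 0x80) !== 0;
  const len7 = buf[1] & 0x7f;
  let offset = 2;
  let payloadLength;
  if (len7 <= 125) {
    payloadLength = len7;
  } else if (len7 === 126) {
    payloadLength = buf.readUInt16BE(offset);
    offset += 2;
  } else {
    const big = buf.readBigUInt64BE(offset);
    if (big > BigInt(Number.MAX_SAFE_INTEGER)) {
      throw new RangeError("payload length too large for this sketch");
    }
    payloadLength = Number(big);
    offset += 8;
  }
  let maskingKey = null;
  if (masked) {
    maskingKey = buf.subarray(offset, offset + 4);
    offset += 4;
  }
  // headerLength is also the offset at which the payload begins.
  return { masked, payloadLength, maskingKey, headerLength: offset };
}

console.log(parseFrameHeader(Buffer.from([0x81, 0x05])));
console.log(parseFrameHeader(Buffer.from([0x82, 0x7e, 0x01, 0x00])));
```

The first call describes an unmasked frame with a 2-byte header and a 5-byte payload; the second uses the 16-bit extended length and reports a 4-byte header and a 256-byte payload.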

Extended payload length

When payload length is 126 or 127, the extended length field expands the header: 16 more bits follow for 126, and 64 more bits for 127. The extended length is sent in network byte order (big-endian).

In practice, WebSocket frames are usually kept small. Sending one enormous frame is wasteful because the sender cannot interleave anything else, including control frames, until it completes. Many implementations therefore chunk large payloads into smaller fragments so that control frames (like pings) are not starved.

Masking key

When MASK is set, a 32-bit masking key immediately precedes the payload. The key is chosen randomly for every frame — never reuse it.

To apply or remove masking, XOR each payload byte with the corresponding key byte, cycling through the four-byte key:

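// payload and key are byte arrays (e.g. Uint8Array); payload is modified in place
for (let i = 0; i < payload.length; i++) {
    payload[i] ^= key[i % 4]; // the key is 4 bytes long, so the index wraps every 4 bytes
}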

The same operation both applies and removes the mask, so the receiver runs the same loop. Note that masking provides zero cryptographic security — its sole purpose is preventing proxy cache poisoning.
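
Putting the header fields and the masking loop together, a complete client-to-server text frame can be assembled as in the sketch below. It assumes Node.js and is limited to payloads of at most 125 bytes; buildClientTextFrame is a hypothetical name, and the key parameter exists only so the output can be compared with a known example.

```javascript
const { randomBytes } = require("node:crypto");

// buildClientTextFrame is a hypothetical helper limited to payloads of at
// most 125 bytes, so the 7-bit length field is enough.
// `key` is injectable only for demonstration; a real client should use a
// fresh random key for every frame, which is what the default does.
function buildClientTextFrame(text, key = randomBytes(4)) {
  const payload = Buffer.from(text, "utf8");
  if (payload.length > 125) {
    throw new RangeError("payload too long for this sketch");
  }
  const frame = Buffer.alloc(2 + 4 + payload.length);
  frame[0] = 0x80 | 0x1;            // FIN=1, opcode 0x1 (text)
  frame[1] = 0x80 | payload.length; // MASK=1, 7-bit length
  key.copy(frame, 2);               // masking key occupies bytes 2..5
  for (let i = 0; i < payload.length; i++) {
    frame[6 + i] = payload[i] ^ key[i % 4];
  }
  return frame;
}

// Fixed key taken from the masked "Hello" example in RFC 6455.
const exampleKey = Buffer.from([0x37, 0xfa, 0x21, 0x3d]);
console.log(buildClientTextFrame("Hello", exampleKey).toString("hex"));
// 818537fa213d7f9f4d5158
```

The printed bytes match the single-frame masked text message example given in RFC 6455.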

Payload data

Finally, the payload. What it contains depends on the opcode:

  • Continuation frames (0x0) carry subsequent fragments of a fragmented message. The first fragment uses opcode 0x1 or 0x2; every following fragment uses 0x0 to signal "more of the same message". The receiver reassembles them in order.
  • Text frames (0x1) must carry valid UTF-8. The receiver must close the connection if it receives malformed UTF-8.
  • Binary frames (0x2) carry arbitrary bytes — your protocol defines the structure.
  • Close frames (0x8) carry an optional 2-byte (16-bit big endian) status code, optionally followed by a UTF-8 reason string.
  • Ping (0x9) and Pong (0xA) payloads are limited to 125 bytes. A Pong sent in reply to a Ping must echo the Ping's payload verbatim.
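
As one concrete case from the list above, the Close frame payload can be encoded and decoded as follows. This is a sketch assuming Node.js Buffers; both function names are invented for this example, and the decoder does not validate that the reason is well-formed UTF-8.

```javascript
// Both function names are hypothetical.
function encodeClosePayload(code, reason = "") {
  const reasonBytes = Buffer.from(reason, "utf8");
  // Control frame payloads are capped at 125 bytes:
  // 2 for the status code, leaving at most 123 for the reason.
  if (reasonBytes.length > 123) throw new RangeError("reason too long");
  const payload = Buffer.alloc(2 + reasonBytes.length);
  payload.writeUInt16BE(code, 0); // 16-bit big-endian status code
  reasonBytes.copy(payload, 2);
  return payload;
}

function decodeClosePayload(payload) {
  if (payload.length === 0) return { code: null, reason: "" }; // status code is optional
  if (payload.length === 1) throw new Error("close payload cannot be a single byte");
  return {
    code: payload.readUInt16BE(0),
    // Note: toString does not reject malformed UTF-8; a real receiver must.
    reason: payload.subarray(2).toString("utf8"),
  };
}

const closePayload = encodeClosePayload(1000, "bye");
console.log(closePayload.toString("hex")); // 03e8627965
console.log(decodeClosePayload(closePayload));
```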

That's the complete frame header. Two bytes in the common case; a handful more when extended length or masking key fields are present.

Closing a connection

Either side can initiate a close by sending a Close frame (opcode 0x8). The other side must respond with its own Close frame, after which the TCP connection is closed. Status codes play a role similar to HTTP status codes but use their own numbering: 1000 is a normal closure, 1001 means the endpoint is "going away" (a server restarting, a browser navigating), 1002 is a protocol error.

The close handshake is graceful by design: both sides flush their send buffers before tearing down the socket, so no data is lost on an orderly shutdown.