
WebSocket Capacity Planning for Social Products

Budget memory, file descriptors, and fan-out per connection, so you add capacity before the connection layer (not CPU) breaks first.

Part of Distributed Systems Patterns That Hold Up in Production

By Colson · Distinguished Software Engineer, Founder

July 10, 2026 · 12 min read

Figure: WebSocket capacity planning visualized as many persistent connections fanning out from a single server node under a memory and file-descriptor budget.

WebSocket capacity planning starts with one correction: you are not planning for connections, you are planning for the work those connections generate. A server that holds a million idle sockets can fall over at a few thousand active ones, because the cost that matters is per-message work and fan-out, not the open file descriptor.

So the first question is never “how many connections can this hold.” It is “how many messages per second, fanned out to how many recipients, can this hold.” Get that model right and the connection count almost takes care of itself.

This post lays out a capacity model for social products: chat, presence, live feeds, multiplayer lobbies. The kind of system where one event has to reach many people fast, and where the failure mode is not a slow page, it is a node that silently stops delivering.

Why WebSocket capacity planning is different from request capacity

A normal HTTP service is sized on requests per second and the work per request. Connections are transient, the load balancer spreads them, and an idle client costs you nothing.

WebSockets break all three assumptions. The connection is long-lived, so an idle client still costs memory and a file descriptor. The connection is stateful, so it usually has to land on a node that knows about it. And one inbound event can trigger many outbound sends, so load is not symmetric with traffic in.

That asymmetry is the whole game. In a social product, a single message into a busy channel is not one unit of work, it is one unit times the number of subscribers. Size against inbound requests and you will undercount by the fan-out factor, and the error is largest in the busiest channels, which is exactly when it hurts. This post is part of the distributed systems patterns series.

How many WebSocket connections can one server hold?

A tuned server holds hundreds of thousands to low millions of idle WebSocket connections, bounded by RAM, file descriptors, and ephemeral ports, not CPU. The practical ceiling collapses once connections are active, because per-message work and fan-out, not the open sockets, are what consume the box.

The famous benchmarks (C10K, then C10M, then the WhatsApp and Phoenix “2 million connections on one node” writeups) all measure the easy axis: how many sockets a process can keep open. That number is real and impressive, and it is also the wrong number to plan against for a social product.

Holding 2M idle connections proves the runtime and kernel can do it. It says nothing about what happens when 5% of them are in active channels generating fan-out. Your honest capacity is the active ceiling, found by load-testing with realistic message rates and room sizes, not by counting open sockets.

On TYPEMUSE, the multiplayer typing platform I run, the connection layer is Elixir on the BEAM precisely because the per-connection cost is cheap and isolated; the planning effort goes almost entirely into the fan-out path, not into how many lobbies can stay open.

How much memory does a WebSocket connection use?

Budget a few kilobytes of kernel socket buffers plus your per-connection application state. The kernel part is fairly fixed and small. The application part (presence data, send buffers, subscription lists, decoded session) is the larger and more variable cost, and it is what actually sets your connection ceiling.

The trap is reasoning only about the kernel side. Yes, the socket itself is cheap. But a connection in a social product is rarely just a socket. It carries a user identity, a set of subscribed channels, possibly a buffered backlog for a slow client, and whatever per-session state your protocol needs.

Multiply that by your connection target and the application state, not the kernel, decides whether a node holds 200K or 2M. The cheapest win is to keep per-connection state lean: store subscriptions as references, not copies; bound send buffers; and push heavy state out of the connection process into shared storage.

Per-connection cost	Where it lives	Roughly fixed?	Planning note
Kernel socket buffers (rcv/snd)	Kernel	Mostly	Tunable via `net.ipv4.tcp_rmem` / `net.ipv4.tcp_wmem`; small per socket
File descriptor	Kernel	Yes (1 each)	Hard cap via `ulimit -n` and `fs.file-max`
Connection process / goroutine / task	Runtime	Mostly	BEAM process, goroutine, or async task; cheap but not free
Application session state	App heap	No	Identity, decoded auth, protocol state; keep lean
Subscription list	App heap	No	Reference channels, do not copy membership
Send buffer / backlog	App heap	No	Slow consumers grow this without bound; cap it

The last three rows are where capacity planning lives. The first three are tuning. Get the tuning right once; revisit the application rows every time the product adds a feature that rides the connection.
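To make the table concrete, the sketch below turns per-connection byte estimates into a rough connections-per-node ceiling. Everything in it is an illustrative assumption rather than a measurement: the names `ConnectionCost` and `max_connections`, the byte figures, and the 30% headroom. Substitute numbers from your own heap and kernel profiling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionCost:
    """Hypothetical per-connection byte estimates; measure your own."""
    kernel_buffers: int    # rcv + snd socket buffers actually resident
    runtime_task: int      # process / goroutine / async task overhead
    session_state: int     # identity, decoded auth, protocol state
    subscriptions: int     # references to channels, not copied membership
    send_backlog_cap: int  # worst case for a bounded send buffer

    def total(self) -> int:
        return (self.kernel_buffers + self.runtime_task + self.session_state
                + self.subscriptions + self.send_backlog_cap)


def max_connections(ram_bytes: int, cost: ConnectionCost, headroom: float = 0.3) -> int:
    """Connections that fit in RAM after reserving a headroom fraction."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = int(ram_bytes * (1 - headroom))
    return usable // cost.total()


GIB = 1024 ** 3
# Same kernel and runtime cost; only the application rows differ.
lean = ConnectionCost(4_096, 2_048, 1_024, 512, 8_192)
heavy = ConnectionCost(4_096, 2_048, 16_384, 8_192, 65_536)

print("lean: ", max_connections(16 * GIB, lean))
print("heavy:", max_connections(16 * GIB, heavy))
```

The two profiles share identical kernel and runtime costs, so the gap between the printed ceilings comes entirely from the application rows, which is the point of the table.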

What breaks first under WebSocket connection load?

While connections are idle, file descriptors, ephemeral ports, or memory break first, almost never CPU. Once traffic is active, fan-out amplification and backpressure break first: one publish becomes thousands of writes, and slow consumers fill send buffers until a node degrades. Plan for the active ceiling, not the idle one.

The idle failures are boring and fixable with tuning. You will hit the default `ulimit -n` of 1024 almost immediately, then `fs.file-max`, then ephemeral port exhaustion on the load balancer or any node that originates many outbound connections. These are configuration issues, and they are well documented in Linux kernel tuning guides.
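A quick way to catch the file-descriptor limits before a load test does is to read them from inside the process. This is a minimal sketch for Linux and other Unix systems using Python's standard `resource` module; the helper names `fd_limits` and `fd_headroom_ok` and the reserve of 1,024 descriptors are assumptions for illustration.

```python
import resource
from pathlib import Path


def fd_limits() -> dict:
    """Report per-process descriptor limits and, on Linux, the system-wide cap."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    limits = {"soft": soft, "hard": hard, "file_max": None}
    try:
        # Linux-only; the file is absent on other platforms.
        limits["file_max"] = int(Path("/proc/sys/fs/file-max").read_text().strip())
    except (OSError, ValueError):
        pass
    return limits


def fd_headroom_ok(target_connections: int, limits: dict, reserve: int = 1024) -> bool:
    """True if the soft limit covers the target plus a reserve for logs, DB sockets, etc."""
    soft = limits["soft"]
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= target_connections + reserve


limits = fd_limits()
print(limits)
print("room for 200K connections:", fd_headroom_ok(200_000, limits))
```

Running a check like this at process start, and refusing to boot when it fails, turns a silent default into a loud configuration error.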

The active failures are the ones that page you. The two that matter:

Fan-out amplification. One message into a large room is one inbound event and N outbound writes. CPU and network egress, fine while channels are small, become the bottleneck the moment a channel gets popular.
Backpressure from slow consumers. A client on bad wifi cannot drain its socket fast enough. Its send buffer grows. Without a bound, one slow consumer per popular channel is enough to pin memory and stall the writer.
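The slow-consumer failure above can be sketched as a per-connection send buffer with a hard cap. This is an in-memory illustration, not a real socket writer: `BoundedSendBuffer`, `fan_out`, the cap of 4 frames, and the "disconnect on overflow" policy are all assumptions chosen for the example. Dropping oldest frames or throttling are equally valid policies.

```python
from collections import deque


class BoundedSendBuffer:
    """Per-connection outbound queue with a hard cap on queued frames."""

    def __init__(self, max_frames: int = 256):
        self.max_frames = max_frames
        self.frames = deque()
        self.overflowed = False

    def offer(self, frame: bytes) -> bool:
        """Queue a frame; return False once the consumer has fallen too far behind."""
        if self.overflowed:
            return False
        if len(self.frames) >= self.max_frames:
            self.overflowed = True
            self.frames.clear()  # free the backlog; the caller should close the connection
            return False
        self.frames.append(frame)
        return True

    def drain(self, n: int) -> list:
        """Pop up to n frames, as a writer would when the socket becomes writable."""
        out = []
        while self.frames and len(out) < n:
            out.append(self.frames.popleft())
        return out


def fan_out(frame: bytes, buffers: dict) -> list:
    """Offer one frame to every subscriber; return ids that should be disconnected."""
    return [conn_id for conn_id, buf in buffers.items() if not buf.offer(frame)]


buffers = {"fast": BoundedSendBuffer(4), "slow": BoundedSendBuffer(4)}
for _ in range(6):
    dropped = fan_out(b"msg", buffers)
    buffers["fast"].drain(10)  # the fast client keeps up; the slow one never drains

print("disconnect:", dropped)
```

The property that matters is that the worst-case memory per connection is known in advance (cap times frame size), so one bad client cannot grow without bound.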

How do you plan capacity for WebSocket fan-out?

Model writes, not connections. A message into a room of N members costs N sends, so the load on a node is publishes per second times average room size, and that is the number to hold against its write budget. Cap room sizes, shard or batch large rooms, and never let a single hot channel saturate a node.

Here is the worked model. It is illustrative, not a benchmark; plug in your own measured numbers.

total_writes_per_sec = publishes_per_sec × avg_room_size × active_fraction

# Illustrative, NOT a benchmark:
#   publishes_per_sec   = 2,000   (inbound messages across the system)
#   avg_room_size       = 150     (average recipients per message)
#   active_fraction     = 0.4     (rooms with someone listening right now)
#
#   total_writes_per_sec = 2,000 × 150 × 0.4 = 120,000 writes/sec
#
# Now size nodes against 120K writes/sec of fan-out, NOT 2K messages/sec inbound.
# If one tuned node sustains ~40K writes/sec at your message size, 3 nodes is the
# bare minimum with zero headroom; provision more, even though 2K inbound looks trivial.
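The same model as a small function, so it can live next to a load-test report rather than on a whiteboard. The inputs are the illustrative figures from above, not measurements, and `nodes_needed` with its `target_utilization` parameter is a hypothetical helper added here to make the headroom explicit.

```python
import math


def amplified_writes_per_sec(publishes_per_sec: float, avg_room_size: float,
                             active_fraction: float) -> float:
    """Outbound writes generated by inbound publishes after fan-out."""
    return publishes_per_sec * avg_room_size * active_fraction


def nodes_needed(writes_per_sec: float, node_write_budget: float,
                 target_utilization: float = 0.5) -> int:
    """Nodes required so each runs at or below target_utilization of its measured budget."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(writes_per_sec / (node_write_budget * target_utilization))


writes = amplified_writes_per_sec(2_000, 150, 0.4)
print("amplified writes/sec:", writes)
print("nodes at 100% utilization:", nodes_needed(writes, 40_000, target_utilization=1.0))
print("nodes at 50% utilization: ", nodes_needed(writes, 40_000))
```

Running nodes at full utilization leaves nothing for a hot channel or a node drain, which is why the utilization target is a parameter rather than an afterthought.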

The discipline is to always carry the amplification factor. Inbound rate is what product and analytics talk about; amplified write rate is what your infrastructure actually does. When someone says “we only do a few thousand messages a second,” your next question is “into rooms of what size.”

Two levers keep the amplified rate sane. Cap room sizes so no single channel can produce unbounded fan-out; above the cap, switch a “room” into a broadcast or feed model with different delivery guarantees. Shard large rooms so a 50K-member channel is split across nodes and no single writer owns all the egress. Batching small messages into fewer frames helps egress too, at the cost of a little latency. This is the same backpressure discipline covered in backpressure design for real-time systems; fan-out is just backpressure you can predict in advance.
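The cap-and-shard lever can be sketched with a stable hash. This is one possible approach under stated assumptions: the cap of 5,000 members, the names `shard_count`, `shard_for`, and `plan_room`, and the choice of CRC32 are all illustrative. Hash-based assignment keeps shards close to the target size on average but does not guarantee that every shard stays under the cap.

```python
import zlib
from collections import defaultdict

ROOM_CAP = 5_000  # hypothetical cap; rooms above it are sharded


def shard_count(room_size: int, cap: int = ROOM_CAP) -> int:
    """Number of shards so that the average shard holds at most `cap` members."""
    return max(1, -(-room_size // cap))  # ceiling division


def shard_for(member_id: str, shards: int) -> int:
    """Stable assignment; crc32 is deterministic across processes, unlike hash()."""
    return zlib.crc32(member_id.encode("utf-8")) % shards


def plan_room(member_ids: list, cap: int = ROOM_CAP) -> dict:
    """Group members by shard index so each shard can be owned by a different node."""
    shards = shard_count(len(member_ids), cap)
    plan = defaultdict(list)
    for member_id in member_ids:
        plan[shard_for(member_id, shards)].append(member_id)
    return dict(plan)


members = [f"user-{i}" for i in range(50_000)]
plan = plan_room(members)
print("shards:", len(plan))
print("largest shard:", max(len(v) for v in plan.values()))
```

A publish into the room then becomes one backplane message per shard, and each shard owner writes only to its own slice of the membership.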

How do you scale WebSockets horizontally?

Spread connections across stateless gateway nodes behind an L4 load balancer, and move shared state, presence and pub/sub fan-out, into a backplane every node subscribes to. The hard part is not holding sockets, it is routing a message to the right connections across nodes without amplifying load further.

The architecture that holds up:

Stateless gateway tier. Each node terminates WebSocket connections and owns nothing another node needs. A connection can land anywhere. Use an L4 (TCP) load balancer; L7 routing buys you little once a connection is upgraded and long-lived, and it adds cost.
A backplane for cross-node fan-out. When a message belongs to subscribers spread across nodes, the originating node publishes once to the backplane (Redis pub/sub, a distributed process registry, Kafka for durable streams) and every node delivers to its own local subscribers. Phoenix’s PubSub and Discord’s Elixir guild-routing are both versions of this.
Sticky enough, not too sticky. You want a connection to stay on its node for its lifetime, which is natural, but you do not want session-level stickiness that pins users to a node across reconnects. Reconnect should be free to land anywhere.

The failure to avoid is sticky-session sprawl: pinning users to specific nodes so hard that you cannot drain a node for deploy without dropping connections, and cannot rebalance after a scale-up. Connections should be cheap to move. Design for a node to die or drain at any moment and clients to reconnect elsewhere within seconds.

The backplane is where horizontal scaling gets subtle. A naive broadcast-to-all-nodes design amplifies your fan-out by the node count: every message hits every node whether or not it has subscribers. At small scale that is fine. At large scale you want the backplane to route only to nodes that actually hold subscribers, which is exactly what a distributed registry (Erlang/OTP’s pg, a sharded Redis keyspace, consistent-hashed channel ownership) gives you. The timeout and retry behavior of that backplane hop matters too; see timeout budgets across service chains for why a slow backplane call should fail fast rather than stall the writer.
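The difference between broadcast-to-all and subscriber-aware routing is easy to see in a toy registry. The class below is a single-process, in-memory stand-in for a distributed registry, written only to show the routing decision. `ChannelRegistry`, the node names, and the 20-node cluster are assumptions, and nothing here replicates state across machines.

```python
from collections import defaultdict


class ChannelRegistry:
    """Maps channel -> nodes that currently hold at least one local subscriber."""

    def __init__(self):
        # channel -> node -> number of local subscribers on that node
        self._nodes_by_channel = defaultdict(lambda: defaultdict(int))

    def subscribe(self, channel: str, node: str) -> None:
        self._nodes_by_channel[channel][node] += 1

    def unsubscribe(self, channel: str, node: str) -> None:
        counts = self._nodes_by_channel.get(channel)
        if not counts or node not in counts:
            return
        counts[node] -= 1
        if counts[node] <= 0:
            del counts[node]  # last local subscriber left; stop routing here
        if not counts:
            del self._nodes_by_channel[channel]

    def targets(self, channel: str) -> set:
        """Nodes a publish to this channel must reach."""
        return set(self._nodes_by_channel.get(channel, {}))


all_nodes = {f"gw-{i}" for i in range(20)}
registry = ChannelRegistry()
registry.subscribe("room:42", "gw-3")
registry.subscribe("room:42", "gw-3")
registry.subscribe("room:42", "gw-7")

print("naive backplane hops: ", len(all_nodes))
print("routed backplane hops:", len(registry.targets("room:42")))
```

With many small rooms, most channels live on a handful of nodes, so routed delivery keeps backplane traffic proportional to where subscribers are rather than to cluster size.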

How do you handle presence at scale?

Treat presence as soft, eventually consistent state, not a strongly consistent database. Use a CRDT-style or per-node replicated model that tolerates churn, batch and debounce join and leave events, and accept that presence is approximate. Exact presence at scale costs far more than it is worth.

Presence (who is online, who is in this room, who is typing) feels simple and is one of the most expensive features to scale, because it is high-churn write traffic that everyone wants to read in real time. Every connect, disconnect, and reconnect is a presence event, and flaky networks generate a constant storm of them.

Three rules keep presence affordable:

Make it eventually consistent. Phoenix Presence uses a CRDT so each node tracks its own presence and replicates a conflict-free merge, with no central coordinator to bottleneck. You give up “instantly globally exact” and get a system that does not fall over.
Batch and debounce. Do not emit a global event per flap. Coalesce join/leave bursts over a short window so a client bouncing on bad wifi does not generate a fan-out storm of “joined/left/joined.”
Bound what you broadcast. “247 people online” rarely needs the exact 247 identities pushed to everyone. Send counts and deltas, materialize full lists on demand.
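The batch-and-debounce and counts-and-deltas rules can be combined in one small step: collect presence events for a short window, then emit only the net change. This sketch is not how Phoenix Presence works internally. It is a simplified, single-node illustration, and `coalesce_presence` and its event shape are assumptions.

```python
def coalesce_presence(events: list, initially_online: set) -> dict:
    """Collapse a window of (user_id, "join" | "leave") events into net deltas.

    A user who flaps join/leave/join inside the window yields at most one delta,
    and a user who joins and leaves within the window yields none.
    """
    final_state = {}
    for user_id, kind in events:  # events are processed in arrival order
        final_state[user_id] = (kind == "join")

    joined = {u for u, online in final_state.items()
              if online and u not in initially_online}
    left = {u for u, online in final_state.items()
            if not online and u in initially_online}
    count = len(initially_online) + len(joined) - len(left)
    return {"joined": joined, "left": left, "count": count}


window = [
    ("ana", "join"), ("ana", "leave"), ("ana", "join"),  # flapping on a bad connection
    ("bo", "leave"),
    ("cy", "join"), ("cy", "leave"),                     # came and went inside the window
]
delta = coalesce_presence(window, initially_online={"bo", "dee"})
print(delta)
```

Six raw events become two deltas and one count, and that reduced payload is what gets fanned out to the room.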

On TYPEMUSE, lobby presence and “who is in this match” ride this soft-state model. The product requirement is that presence feels live, not that it is transactionally exact, and that distinction is what makes it cheap. Spending strong-consistency money on presence is a classic over-engineering tax.

A WebSocket capacity model checklist

Before you trust a capacity plan, confirm each of these. Treat it as a linkable pre-launch gate.

You size against amplified writes per second (publishes × fan-out), not inbound message rate or connection count.
You have measured both the idle connection ceiling and the active ceiling under realistic message rates, and you plan against the active one.
File descriptors (ulimit -n, fs.file-max), ephemeral ports, and socket buffer sysctls are tuned and documented, not left at defaults.
Per-connection application state is lean and bounded, especially send buffers, so one slow consumer cannot pin a node.
Room sizes are capped, and channels above the cap switch to a broadcast/feed model with sharding.
Backpressure is explicit: slow consumers are dropped or throttled with bounded buffers, never allowed to grow unbounded.
The backplane routes only to nodes with subscribers, so fan-out is not multiplied by node count.
Presence is soft state, batched and debounced, with counts and deltas rather than full lists by default.
A node can drain or die and clients reconnect elsewhere within seconds; no hard session stickiness blocks deploys or rebalancing.
You have load-tested the worst realistic fan-out (the one big channel, the broadcast event), not just the average.

What I’d do differently

The mistake I have watched teams make, and made myself early on, is treating “how many connections can we hold” as the capacity question. It is the easy question with the impressive answer, and it is the wrong one. You tune the box to hold a million idle sockets, ship it, and discover the active ceiling is a tiny fraction of that the first time a channel gets popular.

If I were sizing a WebSocket layer from scratch today, I would build the fan-out model first and the connection tuning second. I would write the amplified-write formula on the wall, load-test against the worst realistic room size before the average one, and design backpressure and room-size caps in from day one instead of bolting them on after the first incident. I would also make presence soft-state on principle, because the instinct to make it exact is where a lot of real-time budgets quietly go to die.

The connection count is the headline. The fan-out is the system. Plan for the second and the first stops being scary. For the broader platform these connection tiers run on, see Kubernetes namespace strategy for SaaS platforms.

Sources

Dan Kegel, The C10K Problem (the original connection-scaling writeup): kegel.com/c10k.html
Phoenix Framework, Channels and Presence (CRDT-based presence, PubSub backplane): hexdocs.pm/phoenix/channels.html
Discord Engineering, How Discord Scaled Elixir to 5,000,000 Concurrent Users: discord.com/blog/how-discord-scaled-elixir-to-5-000-000-concurrent-users
Linux man pages, epoll(7) and socket(7) (file descriptors, buffers, event scaling): man7.org/linux/man-pages/man7/epoll.7.html


Frequently asked questions

How many WebSocket connections can one server hold?

A single tuned server can hold hundreds of thousands to low millions of idle WebSocket connections, bounded by RAM, file descriptors, and ephemeral ports rather than CPU. The real ceiling drops sharply once connections are active, because per-message work and fan-out, not the open sockets, consume your capacity.

How much memory does a WebSocket connection use?

Budget a few kilobytes of kernel socket buffers plus your per-connection application state, which is usually the larger and more variable cost. The kernel part is fairly fixed; the application part, presence data, buffers, and subscriptions, is what actually decides how many connections a server holds.

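How do you scale WebSockets horizontally?

Spread connections across stateless gateway nodes behind an L4 load balancer, and move shared state (presence and pub/sub fan-out) into a backplane every node subscribes to. The hard part is not holding sockets; it is routing a message to the right connections across nodes without amplifying load further.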

What breaks first under WebSocket connection load?

Usually file descriptors, ephemeral ports, or memory, not CPU, while connections are idle. Once traffic is active, fan-out amplification and backpressure break first: one publish becomes thousands of writes, and slow consumers fill send buffers until the node degrades. Plan for the active ceiling, not the idle one.

How do you plan capacity for WebSocket fan-out?

Model writes, not connections. A message into a room of N members costs N sends, so capacity is publishes per second times average room size. Size against that amplified write rate, cap room sizes, and shard or batch large rooms so a single hot channel cannot saturate a node.

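How do you handle presence at scale?

Treat presence as soft, eventually consistent state, not a strongly consistent database. Use a CRDT-style or per-node replicated model that tolerates churn, batch and debounce join and leave events, and accept that presence is approximate.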
