Sharding Strategies Before You Need Them
Pick database sharding strategies before traffic forces your hand. Shard keys, hash vs range vs directory, online resharding, and the traps that bite early.
Part of Distributed Systems Patterns That Hold Up in Production
The cheapest time to choose your database sharding strategies is before you need them, while a single primary still holds your data and you have the slack to plan instead of firefight. Sharding is the one scaling decision that gets exponentially more expensive the longer you wait, because by the time you are forced into it, you are doing it on a database that is already on fire.
This is not an argument for sharding early. It is an argument for deciding early. Pick the shard key, the partitioning scheme, and the resharding path now, write them down, and then do not shard until the data actually outgrows one node. Most teams get this backwards: they shard prematurely to feel scalable, or they refuse to think about it until an outage forces a rushed, irreversible choice.
When should you shard a database?
Shard when a single primary can no longer hold your working set in memory or absorb your write throughput, and you have already exhausted vertical scaling and read replicas. Sharding is the answer to a write or data-volume ceiling, not a read ceiling. If reads are the problem, add replicas or a cache first, because those are reversible and sharding is not.
The practical signal is a primary that you cannot make bigger and cannot make faster. Replicas offload reads but every write still lands on one node, so once write IOPS or write amplification saturate that node, replication does nothing for you. The working set is the other trigger: when the rows you touch in a normal hour no longer fit in RAM and you start paying disk on the hot path, a bigger box buys you months, and sharding buys you years.
Do this math before you are out of runway. If you are at 60% of a single node’s ceiling and growing, you have time to design and rehearse. If you are at 95%, you are going to be migrating a saturated database under load, which is the worst possible condition for the riskiest data operation you will ever run.
What is the best shard key?
The best shard key spreads writes evenly across shards, keeps the rows a single query needs on the same shard, and almost never changes for a given row. For most systems that is a tenant ID, a user ID, or an account ID. Pick the entity your queries are naturally scoped to, because that is the dimension along which the data wants to split.
Three properties matter, and they pull against each other:
- Cardinality and distribution. The key must have enough distinct values to fill all your shards and beyond, and those values must be roughly uniform. A boolean is useless. A
countrycolumn buries you under the one country that is 70% of your users. - Query locality. Whatever you filter by most should be the shard key, so the common query hits one shard instead of fanning out to all of them. If your hot query is “everything for tenant X,” tenant ID is your key.
- Stability. The key determines physical location, so a row whose key changes has to physically move. Mutable shard keys turn ordinary updates into cross-shard migrations.
The classic mistake is a monotonic key: a timestamp, an auto-increment ID, a snowflake ID with time in the high bits. It looks uniform but it is not, because every new write has the largest key, so every new write lands on the same shard. You build N shards and write to one of them. On TYPEMUSE, the natural shard key for match and session data is the player or room identity, not the match timestamp, precisely so that live writes spread instead of stacking on the newest range.
What is the difference between hash, range, and directory sharding?
Hash sharding hashes the shard key and assigns the row to hash(key) % N, which spreads writes evenly but destroys the ability to range-scan. Range sharding assigns contiguous key ranges to shards, which keeps ordered data together for scans but creates hotspots on the active range. Directory sharding keeps a lookup table mapping keys to shards, which gives full control and easy rebalancing at the cost of an extra lookup and a service to maintain.
These are the three real options. Most managed systems give you one or two of them, and the choice is determined more by your query pattern than by preference.
| Strategy | How it routes | Write distribution | Range scans | Resharding | Best for |
|---|---|---|---|---|---|
| Hash | hash(key) mod N (or a hash ring) | Even by design | Bad: keys scatter everywhere | Hard with plain modulo, easier with consistent hashing | High write throughput, point lookups by key |
| Range | Contiguous key intervals per shard | Uneven, newest range is hot | Excellent: ordered data is co-located | Split a hot range into two | Time-series, ordered scans, “give me a window” queries |
| Directory / lookup | Explicit key to shard map | Whatever you choose | Depends on how you group keys | Easiest: move a key, update the map | Multi-tenant, uneven tenant sizes, surgical rebalancing |
Plain mod N hashing has a vicious property: change N and almost every key remaps, so adding one shard reshuffles the entire dataset. Consistent hashing exists exactly to fix this. It places shards and keys on a ring so that adding a shard moves only the keys near the new position, roughly 1/N of the data, instead of all of it. If you are going to hash, use consistent hashing or a virtual-bucket scheme, not raw modulo.
Directory sharding is underrated for multi-tenant systems. Real tenants are wildly uneven: a handful of whales generate most of the load. A lookup table lets you put the three biggest tenants each on their own shard and pack the long tail together, then move any one of them by updating a row in the map. The cost is real: the directory becomes a critical dependency you must keep highly available and cached, because every query goes through it first.
For more on why physical layout follows logical ownership, see bounded contexts in real microservice systems and how data boundaries track service boundaries.
How do you reshard a database without downtime?
Online resharding is a four-phase migration: dual-write to both the old and new layout, backfill historical data into the new layout, verify the two copies agree, then cut reads over shard by shard behind a feature flag. Keep the old path live until you fully trust the new one. There is no config flag that does this safely. It is a project measured in weeks.
Walk the phases:
- Dual-write. Every write goes to both the old layout and the new sharded layout. Now new data is correct in both places, and you have stopped the bleeding even though history is not migrated yet.
- Backfill. Copy existing rows into the new layout in batches, throttled so you do not starve live traffic. This is the long pole, and it is where most of the elapsed time goes.
- Verify. Compare old and new, by row count, by checksum, by sampling. Do not trust a backfill you have not reconciled. This is also where dual-write earns its keep: it makes reprocessing safe, the same way idempotency keys make distributed operations safe to retry.
- Cutover. Flip reads to the new layout one shard, one tenant, or one percentage at a time, watching error rates and latency. Keep dual-write running so you can roll back instantly by flipping reads back. Only after the new path has owned production for a while do you stop writing to the old one.
The same discipline that protects a Kafka reprocess protects a reshard: assume you will run it more than once, so make every step replay-safe. If you have read the Kafka replay strategy without duplicate events playbook, the backfill-then-verify shape will feel familiar, because it is the same idea applied to a different log.
Vitess and Citus exist largely to automate this dance. Vitess does online resharding with its VReplication engine and tablet-by-tablet cutover; Citus rebalances by moving shards between nodes while keeping the table available. If you are on a system that offers managed resharding, use it, because hand-rolling dual-write and verification correctly is genuinely hard.
What breaks when you shard too early?
Sharding removes cross-shard joins, multi-row transactions, and database-enforced unique constraints, and it adds scatter-gather latency to any query that no longer maps to a single shard. If your data still fits comfortably on one node, you have taken on all of that cost for scale you are not using. That is the trade, and early sharding gets the worst side of it.
What you actually lose:
- Joins across shards. A join that touched two tables is now an application-side fan-out and merge. Queries that were one index seek become a request to every shard plus a reduce step.
- Transactions across shards. ACID stops at the shard boundary. Spanning shards means distributed transactions (slow and complex) or redesigning so a transaction never crosses a shard, which constrains your whole data model.
- Global uniqueness. A
UNIQUEindex only holds within a shard. Enforcing “this email exists once across the whole system” now needs a separate global index or a dedicated service. - Operational surface. You went from one database to back up, monitor, patch, and fail over to N of them. Every operational task multiplies.
The honest test: if a single primary, scaled vertically with read replicas, would carry you for the next 18 to 24 months, do not shard yet. Decide your strategy, write it down, and keep running one node. You can shard the day you need to, much more calmly, because you already did the thinking.
How many shards should you start with?
Start with more logical shards than physical nodes, and map many logical shards onto each physical node. Picking a number like 1024 logical shards up front lets you grow by moving shards between nodes instead of resplitting data, which turns future scaling into a rebalance rather than a remigration.
This is the virtual-bucket pattern, and it is how the systems that do sharding well avoid the mod N trap. You hash keys into a large fixed number of buckets, then assign buckets to nodes through a map. Adding a node means reassigning some buckets, which moves a slice of data and updates the map, with no rehashing of keys. The logical shard count is effectively permanent, so make it generous; physical nodes come and go underneath it.
A pre-sharding checklist
Work through this while you still have one node and time to think:
- Confirm the ceiling is writes or data volume, not reads. If reads are the bottleneck, exhaust replicas and caching first.
- Pick the shard key by query locality, even distribution, and stability. Reject any monotonic candidate (timestamp, auto-increment, time-prefixed ID).
- Identify your hot query and confirm the shard key keeps it on a single shard.
- List every cross-shard query you will be forced into, and decide how each one survives (fan-out, denormalization, secondary index).
- Choose hash, range, or directory based on whether you need even writes, ordered scans, or surgical rebalancing.
- If hashing, choose consistent hashing or virtual buckets, never raw
mod N. - Pick a generous logical shard count so growth is rebalancing, not remigration.
- Plan global uniqueness for any constraint that must hold across shards.
- Write the resharding runbook: dual-write, backfill, verify, cutover, rollback.
- Confirm your platform supports online resharding (Vitess, Citus, managed equivalents) or budget to build it.
- Estimate runway: how many months does vertical scaling buy, and does that exceed your migration lead time?
If you cannot answer the shard-key and cross-shard-query items, you are not ready to shard, and that is fine. Knowing it now is the entire point.
What I’d do differently
The mistake I have made, and watched others make, is conflating “decide on sharding” with “do sharding.” Early in a system’s life those feel like the same act, so people either shard prematurely to look scalable or refuse to touch the topic until it is an emergency. Both are wrong for the same reason: the decision is cheap early and the execution is expensive late, and the two should not be bundled.
What I do now is write a one-page sharding plan when a data store crosses into “this matters” territory, long before it needs splitting. Shard key, scheme, the cross-shard queries I am giving up, the resharding path. Then I shelve it and keep running a single node until the data forces my hand. When that day comes, I am executing a rehearsed plan, not inventing one at 2 a.m. against a saturated primary.
The second thing: I treat the shard key as a one-way door and review it like one. On TYPEMUSE, sharding live game state on player and room identity rather than match time was a deliberate call to keep concurrent-match writes spread instead of piling onto the newest range. Get the key wrong and no amount of clever rebalancing saves you, because every fix is a full migration. The week spent arguing about the shard key is the cheapest week in the whole project. For where these data decisions sit in the larger picture, the Distributed systems patterns hub collects the related plays.
Sources
- Vitess Documentation: Resharding, how online resharding works in production with VReplication and tablet cutover.
- Citus Documentation: Choosing the Distribution Column, practical shard-key (distribution column) selection for Postgres at scale.
- MongoDB Manual: Choosing a Shard Key, shard-key cardinality, frequency, and monotonicity guidance.
- Amazon DynamoDB: Partition Key Design and Hot Partitions, why uniform key distribution matters and how hot partitions form.
Frequently asked questions
When should you shard a database?
Shard when a single primary can no longer hold your working set or absorb your write throughput, and vertical scaling plus read replicas have run out. Shard before you hit the wall, not during the outage, because the migration itself needs headroom you will not have at peak.
What is the best shard key?
The best shard key spreads writes evenly, keeps related rows for a single query on one shard, and rarely changes. Tenant ID or user ID usually wins for multi-tenant systems. Avoid monotonic keys like timestamps or auto-increment IDs, which funnel all new writes to one shard.
What is the difference between hash, range, and directory sharding?
Hash sharding spreads rows evenly but kills range scans. Range sharding keeps ordered data together but risks hotspots on the newest range. Directory sharding uses a lookup table for full control and easy rebalancing, at the cost of an extra hop and a lookup service you must keep available.
How do you reshard a database without downtime?
Dual-write to old and new layouts, backfill historical data, verify both copies match, then cut reads over shard by shard behind a flag. Keep the old path until you trust the new one. Online resharding is a migration project, not a config change, so budget weeks.
What breaks when you shard too early?
You lose cross-shard joins, transactions, and unique constraints, and you pay scatter-gather latency on queries that used to be one index lookup. Operationally you now run N databases instead of one. If a single node still fits your data, sharding adds cost and complexity for no benefit.
Can you change a shard key later?
Only by a full data migration, because the shard key determines where every row lives. Changing it means rehashing or rerouting the entire dataset into a new layout. This is why choosing the shard key is the highest-impact and hardest-to-reverse decision in any sharding plan.