Database Replication: Why Master-Slave Is No Longer Enough

Classic master-slave replication has worked brilliantly for the past 20 years. One node (master) handles writes, the rest (slaves) handle reads only. Simple, understandable, predictable.

Note: In modern terminology, master-slave is often replaced with primary-replica or leader-follower due to negative associations with the old terms. In this article, we use both sets of terms as synonyms, since the technical essence doesn't change based on naming.

But in 2025, you're launching a global application with users in Singapore, Frankfurt, and San Francisco. Latency to a single master in Virginia kills the user experience. Moving the master to Europe just kills it for Asian users instead. And then you realize: the model that worked for blogs and e-commerce sites doesn't scale to modern requirements.

Welcome to a world where master-slave is the baseline, not the solution.

What's Wrong with Classic Replication

Master-slave replication is elegantly simple. One node, the master, accepts all write operations. Changes are written to the transaction log and shipped to the slave nodes. Slaves apply the changes in the same order and serve read operations. If the master fails, one of the slaves is promoted to master.

For applications where 80-90% of requests are reads (blogs, news sites, product catalogs), this works excellently. You scale reads horizontally by adding slaves. Writes remain centralized on one node, guaranteeing consistency.
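
A minimal sketch of how an application typically routes queries in this model. The hostnames and the choice of psycopg2 as the driver are assumptions for illustration, not a prescription; any driver and connection-pooling setup works the same way.

```python
# Minimal read/write split for a master-slave setup; the DSNs and the
# psycopg2 driver are placeholders, use whatever you actually run.
import random
import psycopg2

MASTER_DSN = "host=master.db.internal dbname=app user=app"
REPLICA_DSNS = [
    "host=replica-1.db.internal dbname=app user=app",
    "host=replica-2.db.internal dbname=app user=app",
]

def get_connection(for_write: bool):
    """Writes always go to the master; reads are spread across replicas."""
    dsn = MASTER_DSN if for_write else random.choice(REPLICA_DSNS)
    return psycopg2.connect(dsn)

# Usage:
#   with get_connection(for_write=True) as conn, conn.cursor() as cur:
#       cur.execute("INSERT INTO posts (title) VALUES (%s)", ("Hello",))
#   with get_connection(for_write=False) as conn, conn.cursor() as cur:
#       cur.execute("SELECT count(*) FROM posts")
```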

Problem One: Replication Lag

Replication lag is the time between a write on the master and its application on the slave. In an ideal world, this is milliseconds. In the real world, it can be seconds or even minutes under heavy load.

Typical scenario: a user registers, the system writes the account to the master. The user is redirected to their profile page, and the request hits a slave. The slave hasn't received the update yet. The user sees an "account not found" error one second after a successful registration.

This isn't a theoretical problem. In practice, replication lag remains one of the most common complaints about the classic architecture. During peak load, slaves can lag by minutes, creating inconsistency that is visible to users.
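
A common mitigation is "read your own writes": after a user writes, pin that user's reads to the master for a window longer than the lag you normally observe. A minimal sketch, assuming an in-memory map of last-write times (in practice this would live in the session or a shared cache):

```python
import time

# user_id -> timestamp of that user's most recent write.
last_write_at: dict[int, float] = {}

PIN_TO_MASTER_SECONDS = 5.0  # assumed to comfortably exceed typical replication lag

def record_write(user_id: int) -> None:
    last_write_at[user_id] = time.monotonic()

def should_read_from_master(user_id: int) -> bool:
    """Route this user's reads to the master for a few seconds after their last write."""
    wrote_at = last_write_at.get(user_id)
    return wrote_at is not None and time.monotonic() - wrote_at < PIN_TO_MASTER_SECONDS
```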

Problem Two: Single Point of Failure for Writes

Slaves provide fault tolerance for reads. But writes? Everything goes through one node. If the master fails, the system stops accepting writes until one of the slaves gets promoted to master. This failover process can take seconds to minutes depending on configuration.

Automatic slave promotion is trickier than it seems. You need to pick the slave with the most current data, repoint applications to the new master, and make sure the old master (if it comes back) doesn't keep accepting writes, or you end up with a split brain. Each step is a potential failure point. Many teams prefer manual promotion precisely because of this complexity.
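
To make the moving parts concrete, here is a deliberately simplified promotion flow. The Node structure and its replayed_lsn field are illustrative stand-ins; real tooling (Patroni, orchestrator, and the like) handles fencing, health checks, and client reconfiguration far more carefully.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    replayed_lsn: int   # how far this node has replayed the master's log (illustrative)
    read_only: bool = True

def failover(old_master: Node, replicas: list[Node]) -> Node:
    """A deliberately simplified promotion flow; the hard parts are hand-waved."""
    # 1. Fence the old master so it cannot accept writes if it comes back;
    #    skipping this step is how you end up with a split brain.
    old_master.read_only = True

    # 2. Pick the replica that has replayed the most of the write-ahead log.
    candidate = max(replicas, key=lambda r: r.replayed_lsn)

    # 3. Promote it; in a real system you would also repoint every application,
    #    update DNS/service discovery, and verify the other replicas re-attach.
    candidate.read_only = False
    return candidate

# Usage
new_master = failover(
    old_master=Node("db-1", replayed_lsn=1000, read_only=False),
    replicas=[Node("db-2", replayed_lsn=990), Node("db-3", replayed_lsn=985)],
)
```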

Problem Three: Write Scalability Limitations

Adding slaves scales reads. But writes? They are limited by the performance of one node. If your application writes 10,000 transactions per second and is approaching the master's capacity, what do you do? Vertical scaling (more CPUs, memory, faster disks) works up to a point. Beyond that, you hit the physical limits of the hardware.

Many teams discover this problem when it's already too late. You grow steadily for several years, adding slaves as read traffic grows. Then write traffic explodes (a new feature, viral content, a marketing campaign) and the master becomes the bottleneck. There's no quick fix.

Problem Four: Global Latency

A user in Singapore connects to the application. The database slave is in the same region — reading is fast, 5-10 milliseconds. But for writes, the request must travel halfway around the world to the master in Virginia. 200+ milliseconds there, 200+ back. Every save, every update — half a second of delay.

Modern users expect an instant response. A half-second delay on every action turns using the application into torture, especially for interactive features: collaborative editing, chats, real-time updates.

Why This Became Critical Now

Ten years ago, most applications were regional. An American startup served American users, a European company served Europeans. Master-slave worked great.

Now even small startups expect a global audience from day one. SaaS applications serve users on all continents. Mobile apps work wherever there's internet. User expectations have grown: people want the same speed as Google Docs or Figma, where changes appear instantly regardless of location.

Simultaneously, the nature of applications has changed. Typical web applications used to be read-heavy: users browsed content, occasionally leaving a comment or making a purchase. Now applications are interactive: collaborative document editing, instant messaging, real-time updates, streaming content. The read-write ratio has shifted toward something more balanced, sometimes even write-heavy.

Modern Replication Approaches

Multi-Master Replication

Instead of one master, you run several. Each can accept writes, and changes are synchronized between all masters. It sounds like the perfect solution for global applications: a master in each region, low latency everywhere.

Reality is more complex. The main problem is conflicts. Two users simultaneously update the same record on different masters. One in Europe sets the status to "active", another in Asia sets it to "inactive". When the nodes synchronize, which value wins?

Conflict resolution strategies exist. "Last write wins" relies on timestamps, but clocks in distributed systems aren't perfectly synchronized. "First write wins" requires a central arbiter, killing the benefits of distribution. Application-level resolution logic shifts complexity to developers.
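
Here is a sketch of what "last write wins" amounts to in practice; the record format is an assumption, and the clock-skew weakness is called out in the comments.

```python
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    written_at: float   # wall-clock timestamp from the node that wrote it
    node_id: str        # used only to break exact ties deterministically

def lww_merge(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Keep whichever write has the later timestamp.

    The weakness is right here: if the clocks on two nodes disagree by more than
    the gap between the writes, the "older" write can silently overwrite the newer one.
    """
    if (a.written_at, a.node_id) >= (b.written_at, b.node_id):
        return a
    return b

# Europe sets "active", Asia sets "inactive" a moment later; the later timestamp wins.
europe = VersionedValue("active", written_at=1700000000.000, node_id="eu-1")
asia = VersionedValue("inactive", written_at=1700000000.050, node_id="ap-1")
print(lww_merge(europe, asia).value)  # -> "inactive"
```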

In practice, multi-master works well when writes can be partitioned by region or category. For example, European users write only to the European master, Asian users only to the Asian one. But this requires careful application-level design and doesn't work for data that is genuinely modified globally.

Conflict-Free Replicated Data Types (CRDTs)

CRDTs are data structures designed with a mathematical guarantee: regardless of the order in which operations arrive, all replicas converge to the same state. It sounds like magic, and in many ways it is.

Simple example: a counter (a grow-only G-Counter). Instead of storing a single value, each replica stores a vector of values, one per node. When a node increments the counter, it increments its own position in the vector. During synchronization, nodes merge vectors by taking the maximum of each position. The final value is the sum of all positions. No matter in what order the increments happened, the result is always the same.
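
A minimal G-Counter sketch to make this concrete; the node names and the Python representation are just for illustration.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merge = element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        # A node only ever increments its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Taking the max per slot makes merging commutative, associative and idempotent,
        # so replicas converge no matter how or how often they exchange state.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

# Two replicas increment independently, then sync in either order.
a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

Note that a grow-only counter handles only increments; supporting decrements requires a slightly richer structure (a PN-Counter, which keeps separate increment and decrement vectors).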

CRDTs are used in Redis Enterprise for geo-distributed databases and in Riak for highly available storage; collaborative editors rely on closely related ideas (Figma uses a CRDT-inspired sync model, while Google Docs uses the related technique of operational transformation). They solve the conflict problem elegantly: every merge is deterministic, so conflicts simply cannot leave replicas permanently disagreeing.

But there's a cost. CRDTs require more memory — you need to store additional metadata. Performance depends on data type and operations. Not all operations can be implemented as CRDTs — some require coordination. And critically — garbage collection is required, otherwise metadata grows infinitely.

For most applications, a hybrid approach works better: traditional database for critical consistency, CRDTs for data where temporary inconsistency is acceptable and low latency is needed.

Distributed Consensus

Protocols like Raft and Paxos allow a group of nodes to agree on operation order even during failures. This is the foundation of modern distributed databases like CockroachDB, TiDB, YugabyteDB.

The idea is simple: for each write operation, a majority of nodes must agree before the operation is considered committed. This guarantees consistency: even if a minority of nodes fails, at least one node in any surviving majority has seen every committed write.
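
A toy illustration of the majority rule, not Raft itself (a real protocol also elects a leader and keeps the replicated logs consistent): a write counts as committed only once a strict majority of the cluster has acknowledged it.

```python
def commit_write(acknowledgements: int, cluster_size: int) -> bool:
    """A write is committed once a strict majority has acknowledged it."""
    return acknowledgements > cluster_size // 2

# With 5 nodes, 3 acknowledgements are enough; any two majorities overlap in at
# least one node, so losing 2 nodes still leaves a node that saw every committed write.
assert commit_write(3, cluster_size=5)
assert not commit_write(2, cluster_size=5)
```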

The cost is latency. Each write requires a round of messages between nodes to reach consensus. Within one datacenter, this adds milliseconds. With geo-distribution across regions, it adds tens or hundreds of milliseconds. For some applications this is acceptable; for others it's a dealbreaker.

Real Cases of Failures

GitHub and Replication Lag

In 2018, GitHub experienced a major outage due to replication issues. The master and its replicas were disconnected for 43 seconds. During this time, writes continued on the master. When connectivity was restored, it turned out the data had fallen so far out of sync that normal replication couldn't recover on its own. Restoring from backups and manually reconciling data was required.

The problem was compounded by the fact that automated monitoring didn't detect the divergence quickly enough. Users saw different data depending on which replica handled the request. The result was roughly a day of degraded service, with some writes recoverable only through manual reconciliation.

Lesson: replication lag isn't just an inconvenience. Under certain conditions, it can lead to catastrophic data divergence. You need not only monitoring tools but also processes for quick detection and response.

Figma and CRDT Transition

Figma started with a traditional client-server architecture. For collaborative editing, all changes were sent to the server, which resolved conflicts and distributed updates to all clients. This worked but scaled poorly: with many simultaneous editors, the server became the bottleneck.

Moving to a CRDT-inspired model took conflict resolution off the critical path for the user. Each change applies locally and instantly, then is sent to other clients, and conflicts resolve automatically thanks to CRDT-style merge properties. The result: instant response and scalability to hundreds of simultaneous editors.

But the implementation took months of complex engineering work. They had to rewrite significant portions of the application logic, develop efficient garbage-collection algorithms for CRDT metadata, and carefully test edge cases. For the Figma team this paid off, but not every startup can afford such an investment.

Amazon Shopping Cart and Conflicts

A classic example from the academic literature is Amazon's shopping cart in its early years, described in the Dynamo paper. With multi-master replication, a problem arose: a user removed an item from the cart on one node, but after merging with another node's divergent copy, the item reappeared.

Amazon chose the principle "add wins over remove". Better to show the user an item they supposedly removed (and they can remove again) than to lose an item they wanted to buy. This is a business decision embedded in technical architecture.
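
Here is a sketch of "add wins" semantics in the spirit of an observed-remove set; it illustrates the principle, not Amazon's actual implementation. Each add gets a unique tag, and a remove only cancels the tags it has observed, so a concurrent add survives the merge.

```python
import uuid

class AddWinsSet:
    """Simplified observed-remove set: an add beats a concurrent remove."""

    def __init__(self):
        self.adds: dict[str, set[str]] = {}   # item -> set of unique add-tags
        self.removed_tags: set[str] = set()   # add-tags this replica has seen removed

    def add(self, item: str) -> None:
        self.adds.setdefault(item, set()).add(uuid.uuid4().hex)

    def remove(self, item: str) -> None:
        # Only cancels the adds observed locally; a concurrent add elsewhere survives.
        self.removed_tags |= self.adds.get(item, set())

    def merge(self, other: "AddWinsSet") -> None:
        for item, tags in other.adds.items():
            self.adds.setdefault(item, set()).update(tags)
        self.removed_tags |= other.removed_tags

    def contains(self, item: str) -> bool:
        return bool(self.adds.get(item, set()) - self.removed_tags)

# Replica A removes the item while replica B concurrently re-adds it: the add wins.
a, b = AddWinsSet(), AddWinsSet()
a.add("book")
b.merge(a)          # both replicas now see the item
a.remove("book")    # A removes it...
b.add("book")       # ...while B concurrently adds it again
a.merge(b)
b.merge(a)
assert a.contains("book") and b.contains("book")
```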

Lesson: conflict resolution isn't purely a technical task. You need to understand business context and make decisions based on what's acceptable for users and business.

When to Use Which Approach

Master-slave remains an excellent choice for many applications. If your read-write ratio is heavily skewed toward reads, if users are mostly in one region, if a write latency of hundreds of milliseconds is acceptable, then don't overcomplicate things. This model has been proven by decades of use, the tools are mature, and most developers understand it.

Multi-master makes sense when you have a clear geographic or functional partitioning of data. European users write European data, Asian users write Asian data. Or different data categories live on different masters. The key is to avoid situations where the same data can be modified on different nodes.

CRDTs shine in real-time applications where instant local response is needed and temporary inconsistency is acceptable. Collaborative editing, instant messaging, collaborative boards, games. But be ready to invest in understanding the math behind CRDTs and developing proper garbage collection.

Choose distributed consensus when you need strict consistency with geo-distribution. Financial transactions, inventory, booking systems: anywhere inconsistency is unacceptable. You pay for the consistency guarantees with latency.

For most complex applications, the answer is a hybrid. Different data types call for different replication strategies. Critical data like payments goes through consensus. User profiles use master-slave replication. Collaborative editing uses CRDTs. Caches use lazy replication without strict consistency.

Hidden Complexities of Modern Replication

Garbage Collection in CRDTs

CRDTs accumulate metadata about every operation so that merges stay correct. Without garbage collection, this metadata grows without bound. Even a simple operation like adding an element to a set can leave behind bookkeeping that is never deleted.

Garbage collection strategies exist. Time-based expiry — delete metadata older than a certain time. Checkpoints — periodically create a state snapshot and delete everything before it. Coordinated deletion — coordinate between nodes when it's safe to delete metadata.

Each strategy has tradeoffs. Time-based deletion can lose data if a node was offline longer than the expiry period. Checkpoints require coordination and can create load spikes. Coordinated deletion partially defeats the point of CRDTs, which is to avoid coordination in the first place.
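
As a sketch of the first tradeoff, assume the remove-metadata is stored as tag-to-timestamp pairs (an illustrative variation on the set above); the resurrection risk is visible directly in the code.

```python
import time

TOMBSTONE_TTL = 7 * 24 * 3600  # a week; must exceed the longest expected offline period

def expire_tombstones(removed_tags: dict[str, float]) -> None:
    """Drop remove-metadata older than the TTL.

    The tradeoff: if a replica stays offline longer than TOMBSTONE_TTL and still
    carries the "removed" element, the next merge can bring that element back to life.
    """
    now = time.time()
    for tag, removed_at in list(removed_tags.items()):
        if now - removed_at > TOMBSTONE_TTL:
            del removed_tags[tag]
```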

Causality and Delivery Order

Some CRDTs require causal order of operation delivery. If operation B depends on operation A, B must apply after A on all nodes. Ensuring causal order in a distributed system is non-trivial.

Vector clocks are the classic solution. Each node tracks how many operations it has seen from every other node. Every operation carries the sender's vector clock, so nodes can determine causal dependencies. But vector clocks grow linearly with the number of nodes, which becomes a problem in large systems.
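
A minimal vector clock sketch to make the "happened before" check concrete; note that the clock dictionary grows with the number of nodes, which is exactly the scaling problem mentioned above.

```python
class VectorClock:
    """Per-node counters: A happened before B if A <= B component-wise and A < B somewhere."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.clock: dict[str, int] = {}

    def tick(self) -> dict[str, int]:
        """Advance our own counter before sending an operation; return a snapshot to attach."""
        self.clock[self.node_id] = self.clock.get(self.node_id, 0) + 1
        return dict(self.clock)

    def observe(self, other: dict[str, int]) -> None:
        """Merge a received clock: take the max of every component."""
        for node, count in other.items():
            self.clock[node] = max(self.clock.get(node, 0), count)

def happened_before(a: dict[str, int], b: dict[str, int]) -> bool:
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))

# If neither clock happened before the other, the two operations are concurrent,
# and a CRDT-style merge (or explicit conflict resolution) is needed.
```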

More efficient structures like interval tree clocks exist, but they're more complex to implement and understand.

Monitoring and Debugging

Debugging replication problems in traditional master-slave architecture is relatively straightforward. There's one source of truth (master), there are replication logs, there are clear lag metrics. The problem is visible and understandable.

In a distributed system with multi-master or CRDTs, everything is harder. There's no single source of truth. State can temporarily diverge between nodes. Determining why data looks a certain way requires understanding the entire operation history on all nodes.

Tools for monitoring and debugging distributed databases are less mature than for traditional ones. You need specialized metrics, synchronization state visualization, operation tracing across nodes. Many teams develop their own tools, adding complexity.

Expertise Cost

Master-slave replication is understood by most developers. Finding a PostgreSQL or MySQL replication specialist isn't a problem. Training materials abound, best practices are well documented, Stack Overflow is full of answers.

For CRDTs or distributed consensus, the expert pool is orders of magnitude smaller. Training materials are mostly academic papers. There are few off-the-shelf solutions, so much has to be built in-house. Hiring a specialist is more expensive and takes longer.

This isn't an argument against modern approaches, but a reality that needs to be considered. If you're a startup with a five-person team, investing in mastering CRDTs may not pay off. If you're a company at Figma or Discord scale, such investments are absolutely justified.

Key Takeaways

Master-slave replication isn't dead. For many applications, it's still the right choice. Simple, reliable, well-understood architecture with mature tools and wide expertise pool.

But the world has changed. Global applications, real-time interactivity, expectations of instant response — all this requires rethinking replication approaches. One model no longer fits all.

Multi-master solves global latency problems but creates conflict problems. CRDTs solve conflicts elegantly but require investment in understanding and implementation. Distributed consensus provides strict consistency, but the price is latency.

For most complex applications, the answer is hybrid. Different data requires different strategies. It's critical to understand each approach's tradeoffs and choose the right tool for the specific task.

Replication has ceased to be a low-level implementation detail. It's an architectural decision that affects user experience, scalability, and development complexity. And like any architectural decision, it requires careful requirements analysis, an understanding of the tradeoffs, and a readiness to reconsider the choice as the application evolves.

Good news: tools and platforms are getting better. Managed databases take on the complexity of replication. CRDT libraries are becoming more accessible. Consensus protocols are standardizing. What required months of work by PhD-level specialists five years ago is now accessible to an ordinary development team.

Bad news: the fundamental tradeoffs of distributed systems haven't gone anywhere. The CAP theorem still holds: when the network partitions, you have to choose between consistency and availability. Latency is limited by the speed of light. Conflicts are inevitable in a system where multiple nodes can modify data independently.

Best approach: start simple. Master-slave works great up to a certain scale. When you hit its limits, analyze the specific problems and solve them deliberately. Don't try to solve global distribution, conflicts, and strict consistency all at once. Solve problems as they arise, investing in complexity only where it's justified by business value.