Solving Raft's practical problems in Tarantool. What, how and why

In Russian

Tarantool is an in-memory computing platform; technically, it is a DBMS and an application server. The database supports two storage engines: an in-memory one (with persistence through logging and snapshots) and an on-disk one (based on LSM trees).

Originally, Tarantool had only asynchronous replication, and users who needed synchronous replication had to build their own solutions. Failover was in a similar state: it had to rely on external systems.

The speakers decided to add synchronous replication and a built-in automatic failover mechanism to Tarantool. After investigating the possible alternatives, they chose Raft. It then turned out that there is quite a big gap between a Raft implementation and a production-ready Raft implementation: a system running in production demands more than canonical Raft can provide.

First, the cluster must remain writable and readable under a partial loss of network connectivity. Otherwise it could end up like Cloudflare in 2020, when one of the replicas could not see the leader and kept starting new elections for 6.5 hours, preventing the leader from doing any work.

Solving the availability problem under partial loss of connectivity creates another one: under certain conditions the cluster cannot elect a new leader even though enough nodes are alive and connected, while the previous leader no longer has enough live connections to commit a write. To fix this, the leader must "resign" once it loses a quorum of live connections. Voluntary resignation also helps ensure that the cluster has only one leader at a time: by the time a new leader is elected, the old one has already resigned.
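The resignation rule can be illustrated with a minimal sketch. It is not Tarantool's implementation: the `Leader` class, its field names, and the 4-second `dead_timeout` are all assumptions made for illustration. The leader counts itself plus every peer it has heard from recently, and steps down when that count drops below a majority of the cluster.

```python
from dataclasses import dataclass, field


@dataclass
class Leader:
    """Hypothetical leader state; names are illustrative, not Tarantool's API."""
    cluster_size: int
    # Time (in seconds) of the last heartbeat reply received from each peer.
    last_seen: dict = field(default_factory=dict)
    # Assumed value: a peer silent for this long is considered disconnected.
    dead_timeout: float = 4.0
    is_leader: bool = True

    def quorum(self) -> int:
        # Majority of the whole cluster, the leader included.
        return self.cluster_size // 2 + 1

    def live_count(self, now: float) -> int:
        # The leader always counts itself as live.
        live_peers = sum(
            1 for t in self.last_seen.values() if now - t < self.dead_timeout
        )
        return 1 + live_peers

    def check_quorum(self, now: float) -> bool:
        """Resign if fewer than a quorum of nodes are reachable."""
        if self.is_leader and self.live_count(now) < self.quorum():
            self.is_leader = False
        return self.is_leader


leader = Leader(cluster_size=3, last_seen={"b": 10.0, "c": 10.0})
print(leader.check_quorum(now=11.0))  # both peers heard from recently
print(leader.check_quorum(now=20.0))  # both peers silent for 10 seconds
```

In this sketch the second call makes the leader resign, since only the leader itself is still counted as live and a three-node cluster needs two.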

It should also be possible to intervene in the cluster manually and appoint a leader from outside. Otherwise, in the popular layout where the cluster spans two data centers with half of the nodes in each, losing one data center is guaranteed to leave the cluster unable to elect a leader or write anything, because the surviving half is not a majority.
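The arithmetic behind the two-data-center case is easy to check. The helper names below are hypothetical and the snippet only restates the standard majority rule; it says nothing about how Tarantool handles manual promotion.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of votes that is more than half of the cluster."""
    return cluster_size // 2 + 1


def can_elect_leader(cluster_size: int, reachable: int) -> bool:
    """An election can succeed only if a majority of nodes can vote."""
    return reachable >= majority(cluster_size)


# Six nodes split evenly across two data centers; one data center is lost.
print(can_elect_leader(6, 3))  # False: 3 survivors, but a majority is 4

# An uneven 3 + 2 split survives only the loss of the smaller data center.
print(can_elect_leader(5, 3))  # True
print(can_elect_leader(5, 2))  # False
```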

Finally, the cluster should become writable again (by electing a new leader) as quickly as possible, within 10-15 seconds of the old leader's death.

We will talk about the Raft add-ons that the speakers applied in Tarantool to meet all of the above requirements.