When we design a distributed system we typically care about certain properties: availability, consistency, scalability, latency to name a few. But distributed systems can be complex, racey, stateful systems that can fail in complex and surprising ways. Reasoning about them can be difficult, even when you have a good grasp of the design. So how do we know that a system conforms to the properties we care about?
Jepsen was born to test these properties on implementations. These implementations typically take multiple man-years to write. Do we want to wait until then to know if our design is correct? No.
This is where modeling and formal verification come in. It's not just about verifying the correctness of the design before we spend years on implementation, it also gives us a tool for reasoning about it and communicating it to others.
TLA+ and its model checker are one such tool. It has been used to model and verify the designs of many well-known open source data systems: Apache Kafka, Apache BookKeeper, Elasticsearch, MongoDB, RabbitMQ, and many more, as well as being used heavily at AWS and Azure.
People always ask about how you go from an abstract TLA+ specification to code implementation and many avoid TLA+ altogether because it is abstract and can be difficult to learn.
Jepsen.io has a tool called Maelstrom, it allows you to build programs in any language you want, where the network layer is stdin and stdout. It performs the routing of messages between nodes and uses Jepsen to test some of the properties you care about.
Could you implement the most minimalist version of your distributed protocol in real code and use Maelstrom/Jepsen to model check it?
In this talk, we compare modelling the complexities of the Apache BookKeeper replication protocol in both TLA+ and Java with Maelstrom/Jepsen.
- How did the process of modelling the protocol in both technologies differ?
- Can we get the same benefits of TLA+ but with real code?
- By modelling with real code, does it make the jump to writing the real system easier?
- Does TLA+ offer anything that our real code solution cannot?
- Do both offer easy to use analysis tooling such as visualizations?