Much of Yandex Infrastructure revolves around the issues of scale and performance.
Our primary storage and compute system is called YT; it currently stores over 1 EB of data and uses about a million compute cores. It runs a mix of production and ad-hoc workloads on a shared fleet of servers spread across a number of geo-distributed datacenters.
At this scale, storing and handling data efficiently becomes a matter of survival. Moreover, with such a multitude of servers, individual hardware faults are no longer anomalies; they are ordinary events that happen every day at predictable rates.
Disk drive faults are usually handled via data replication, i.e. making physical copies of the same piece of data and placing them on several machines (typically located in disjoint failure domains such as racks or even whole datacenters). The most common replication factor nowadays is 3. RF=3 tolerates the simultaneous loss of any two drives while preserving data integrity and availability; it incurs, however, a 3x overhead in disk space.
In 2013 we adopted a well-known technique called Erasure Coding, which reduces this overhead factor to lower values (such as 4/3) while maintaining the same level of fault tolerance. CPU and network bandwidth consumption, however, are typically higher for Erasure Coding than for plain replication, so designing and scaling such a storage system requires additional insight and care.
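To make the 4/3 figure concrete: a Reed-Solomon-style scheme with 6 data parts and 2 parity parts stores 8 parts for every 6 parts of payload (8/6 = 4/3) yet survives the loss of any two parts, matching RF=3's fault tolerance. The sketch below is purely illustrative (it is not YT's actual codec): it shows the simplest member of this family, a single XOR parity over k data parts, which costs (k+1)/k in space and tolerates one erasure.

```python
# Illustrative only, not YT's codec: a single-XOR-parity erasure code over
# equal-length byte parts. Real schemes (e.g. Reed-Solomon) add several
# parity parts and tolerate several simultaneous erasures.

def xor_parity(parts):
    """Compute one parity part as the bytewise XOR of k data parts."""
    parity = bytearray(len(parts[0]))
    for part in parts:
        for i, b in enumerate(part):
            parity[i] ^= b
    return bytes(parity)

def recover(parts, parity, lost_index):
    """Reconstruct the part at lost_index by XOR-ing the survivors and parity."""
    result = bytearray(parity)
    for j, part in enumerate(parts):
        if j == lost_index:
            continue
        for i, b in enumerate(part):
            result[i] ^= b
    return bytes(result)

data = [b"chunk-", b"parts-", b"abcdef"]   # k = 3 data parts
parity = xor_parity(data)                  # total 4 parts stored: 4/3 overhead
assert recover(data, parity, 1) == b"parts-"
```

With k = 3 this simple scheme already hits the 4/3 space overhead, but it only survives one loss; multi-parity codes are what let the overhead stay low while tolerating two.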
The standard Erasure Coding approach applies directly to immutable data (called "blob chunks" in YT). The evolution of highly scalable KV storages (sometimes reaching many PBs in size and ingesting tens of GBs of writes per second) prompted us to apply these ideas to so-called "journal chunks": append-only structures used to implement WAL journalling for Raft-like fault-tolerant data shards.
We have devised a scheme that combines the notions of (read and write) quorums with Erasure Coding to reduce disk and network bandwidth consumption, saving on expensive NVMe disks and cross-DC traffic while maintaining high throughput and low write latencies, even at the higher quantiles.
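One way to reason about such a combination (a hedged sketch; N, W, R, and K are illustrative parameters, not YT's actual configuration): suppose N replicas each hold one erasure part of a journal record, a write is acknowledged after any W replicas confirm it, and a read contacts any R replicas. Every read set then intersects every write set in at least W + R - N replicas, so an acknowledged record is guaranteed reconstructible whenever that intersection covers at least K data parts:

```python
# Hypothetical quorum-safety check for an erasure-coded journal; the
# parameter names are illustrative, not YT's API.

def quorums_safe(n, w, r, k):
    """True if any read quorum of size r is guaranteed to see at least k
    parts of a record acknowledged by any write quorum of size w, out of
    n replicas holding one part each."""
    return (w + r - n) >= k

# 6 data + 2 parity parts (4/3 overhead): write quorum 7, read quorum 7.
assert quorums_safe(8, 7, 7, 6)
# Shrinking the write quorum to 6 breaks the guarantee.
assert not quorums_safe(8, 6, 7, 6)
# Plain RF=3 replication is the degenerate case: k = 1 full copy.
assert quorums_safe(3, 3, 1, 1)
```

The appeal of the quorum side is latency: the writer need not wait for the slowest replicas, which is what keeps the higher write-latency quantiles low.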
In this talk, Maxim will give a brief overview of the Erasure Coding schemes that we employ and discuss real-world scenarios and lessons learned while operating these systems at scale.