This book is definitely a classic. It skims the surface of a wide array of topics related to handling data in a distributed environment, ranging from basic database theory, ACID, replication, and partitioning to more complex (and “modern”) topics like stream and batch processing on the cloud.
“Data outlives code.”
- Martin Kleppmann
Martin Kleppmann lays out what every engineer needs to know about designing systems that deal with any kind of data.
The first part is about basic database concepts:
- Relational vs. NoSQL
- Different query languages
- How data is actually persisted on storage devices (B-Trees / LSM-Trees)
- Encoding and serialization/deserialization
The second part goes deeper and discusses the following concepts (and their issues):
- CAP Theorem
- Replication
- Partitioning
- Transactions
- Consistency
The final part discusses derived data. It aims to tackle the issues raised in the previous part and introduces more “modern” concepts like:
- Batch and Stream processing (MapReduce/Spark)
- Eventual consistency and “Change Data Capture”
Although the book doesn’t dive into deep technical or implementation details, it has a very good bibliography and footnotes that lead you to all of the academic papers you need. Overall, the book is an essential read for any software/data engineer in 2019.