
Making Reliable Distributed Systems in the Presence of Software Errors

📖 Overview

Making Reliable Distributed Systems in the Presence of Software Errors presents foundational principles and techniques for building fault-tolerant distributed systems. The book draws on Joe Armstrong's experience developing Erlang and implementing large-scale telecommunications systems at Ericsson. It covers core concepts including supervision trees, process isolation, and message passing between concurrent processes. Technical chapters examine error handling, system recovery mechanisms, and methods for upgrading live systems without interruption. Armstrong provides concrete examples and case studies from real industrial applications throughout, and illustrates the implementation patterns and architectural approaches with practical code samples in Erlang. The work is both a technical manual and a philosophical treatise on reliability in complex distributed systems. Its principles extend beyond any specific programming language, offering broader insight into designing robust software that handles inevitable failures gracefully.

👀 Reviews

Readers highlight the book's practical approach to building fault-tolerant systems, with detailed examples from Erlang's development at Ericsson. The technical depth and real-world focus set it apart from theoretical distributed systems texts.

Liked:
- Clear explanation of supervision trees and error handling patterns
- Concrete implementation examples from telecom systems
- Balance of theory and practice
- Coverage of OTP design principles

Disliked:
- Available only as a PhD thesis, making it hard to obtain
- Some sections feel dated (written in 2003)
- Dense academic writing style in certain chapters
- Limited coverage of modern distributed systems challenges

Ratings:
- Goodreads: 4.4/5 (89 ratings)
- Amazon: Not commercially available

One reader noted: "The best practical guide to building reliable distributed systems I've found, though obtaining a copy is difficult."

Another mentioned: "The academic format makes some valuable content less accessible than it could be."

📚 Similar books

Designing Data-Intensive Applications by Martin Kleppmann: The book presents distributed systems concepts through real-world database implementations and system architectures.

Release It! by Michael Nygard: The text examines patterns and anti-patterns for building resilient distributed systems that survive production environments.

Distributed Systems by Andrew S. Tanenbaum and Maarten van Steen: This work provides foundational principles of distributed systems with a focus on reliability, consistency, and fault tolerance.

Distributed Algorithms by Nancy Lynch: The book delivers mathematical foundations and formal proofs for distributed computing algorithms and protocols.

Building Microservices by Sam Newman: The text explains distributed system architecture through the lens of microservices, with emphasis on system reliability and deployment.

🤔 Interesting facts

🔹 The book is Joe Armstrong's PhD thesis, submitted to the Royal Institute of Technology (KTH) in Stockholm in 2003, and has since come to be read as a comprehensive guide to building fault-tolerant systems using Erlang.

🔸 Joe Armstrong was one of the creators of Erlang, developing it at Ericsson in the late 1980s for telecommunications applications that required extreme reliability, such as telephone switches expected to stay in service almost continuously.

🔹 The systems described in the book have been proven in real-world applications: Ericsson's AXD301 switch, built using these principles, reportedly achieved "nine nines" (99.9999999%) availability, with some installations said to have run for years without failure.

🔸 The book introduces the "Let it crash" philosophy: instead of trying to prevent every error with defensive code, a failing process is allowed to crash and the system is designed to recover from the failure gracefully. This counter-intuitive approach has become a fundamental principle in modern distributed systems design.

🔹 Though written in 2003, the book's principles have become increasingly relevant with the rise of cloud computing and microservices architecture, influencing modern platforms like WhatsApp, which used Erlang to handle millions of concurrent connections.