The key to high availability is the ability to isolate faults and to recover from them rapidly. Systems that provide multiple services, each with high-availability requirements, must also ensure that the failure or dynamic replacement of one service does not compromise the availability of the remaining services. This paper examines the steps the designer must take to achieve these goals: partitioning interacting components, analyzing fault domains, and implementing fault tolerance through design principles such as active-spare redundancy, load balancing, and clustering.