🖥️
Sunil Notebook
Interview Preparation
  • 📒Notebook
    • What is this about ?
  • System Design
    • 💡Key Concepts
      • 🌐Scalability
      • 🌐Latency Vs Throughput
      • 🌐Databases
      • 🌐CAP Theorem
      • 🌐ACID Transactions
      • 🌐Rate limiting
      • 🌐API Design
      • 🌐Strong Vs eventual consistency
      • 🌐Distributed tracing
      • 🌐Synchronous Vs asynchronous Communication
      • 🌐Batch Processing Vs Stream Processing
      • 🌐Fault Tolerance
    • 💎Building Blocks
      • 🔹Message
      • 🔹Cache
      • 🔹Load Balancer Vs API Gateway
    • 🖥️Introduction to system design
    • ⏱️Step By Step Guide
    • ♨️Emerging Technologies in System Design
    • ☑️System design component checklist
      • 🔷Azure
      • 🔶AWS
      • ♦️Google Cloud
    • 🧊LinkedIn feed Design
    • 🏏Scalable Emoji Broadcasting System - Hotstar
    • 💲UPI Payment System Design
    • 📈Stock Broker System Design - Groww
    • 🧑‍🤝‍🧑Designing Instagram's Collaborative Content Creation - Close Friends Only
    • 🌳Vending Machines - Over the air Systems
    • Reference Links
  • DSA
    • Topics
      • Introduction
      • Algorithm analysis
        • Asymptotic Notation
        • Memory
      • Sorting
        • Selection Sort
        • Insertion Sort
        • Merge Sort
        • Quick Sort
        • Quick'3 Sort
        • Shell Sort
        • Shuffle sort
        • Heap Sort
        • Arrays.sort()
        • Key Points
        • Problems
          • Reorder Log files
      • Stacks and Queues
        • Stack Implementations
        • Queue Implementations
        • Priority Queues
        • Problems
          • Dijkstra's two-stack algorithm
      • Binary Search Tree
        • Left Leaning Red Black Tree
          • Java Implementations
        • 2-3 Tree
          • Search Operation - 2-3 Tree
          • Insert Operation - 2-3 Tree
        • Geometric Applications of BST
      • B-Tree
      • Graphs
        • Undirected Graphs
        • Directed Graphs
        • Topological Sort
      • Union Find
        • Dynamic Connectivity
        • Quick Find - Eager Approach
        • Quick Find - Lazy Approach
        • Defects
        • Weighted Quick Union
        • Quick Union + path comparison
        • Amortized Analysis
      • Convex Hull
      • Binary Heaps and Priority Queue
      • Hash Table vs Binary Search Trees
  • Concurrency and Multithreading
    • Introduction
    • Visibility Problem
    • Interview Questions
    • References
      • System design
  • Design Patterns
    • ℹ️Introduction
    • 💠Classification of patterns
    • 1️⃣Structural Design Patterns
      • Adapter Design Pattern
      • Bridge Design Pattern
      • Composite Design Pattern
      • Decorator Design Pattern
      • Facade Design Pattern
      • Flyweight Design Pattern
      • Private Class Data Design Pattern
      • Proxy Design Pattern
    • 2️⃣Behavioral Design Patterns
      • Chain Of Responsibility
      • Command Design Pattern
      • Interpreter Design Pattern
      • Iterator Design Pattern
      • Mediator Design Pattern
      • Memento Design Pattern
      • Null Object Design Pattern
      • Observer Design Pattern
      • State Design Pattern
      • Strategy Design Pattern
      • Template Design Pattern
    • 3️⃣Creational Design Patterns
      • Abstract Factory Design Pattern
      • Builder Design Pattern
      • Factory Method Design Pattern
      • Object Pool Design Pattern
      • Prototype Design Pattern
      • Singleton Design Pattern
    • Java Pass by Value or Pass by Reference
  • Designing Data-Intensive Applications - O'Reilly
    • Read Me
    • 1️⃣Reliable, Scalable, and Maintainable Applications
      • Reliability
      • Scalability
      • Maintainability
      • References
    • 2️⃣Data Models and Query Languages
      • Read me
      • References
    • Miscellaneous
  • Preparation Manual
    • Disclaimer
    • What is it all about?
    • About a bunch of links
    • Before you start preparing
    • Algorithms and Coding
    • Concurrency and Multithreading
    • Programming Language and Fundementals
    • Best Practices and Experience
  • Web Applications
    • Typescript Guidelines
  • Research Papers
    • Research Papers
      • Real-Time Data Infrastructure at Uber
      • Scaling Memcache at Facebook
  • Interview Questions
    • Important links for preparation
    • Google Interview Questions
      • L4
        • Phone Interview Questions
      • L3
        • Interview Questions
      • Phone Screen Questions
  • Miscellaneous
    • 90 Days Preparation Schedule
    • My Preparation for Tech Giants
    • Top Product Based Companies
  • Links
    • Github
    • LinkedIn
Powered by GitBook
On this page
  • Hardware Faults
  • Software Errors
  • Human Errors

Was this helpful?

Edit on GitHub
  1. Designing Data-Intensive Applications - O'Reilly
  2. Reliable, Scalable, and Maintainable Applications

Reliability

Tolerating hardware & software faults human error

PreviousReliable, Scalable, and Maintainable ApplicationsNextScalability

Last updated 10 months ago

Was this helpful?

For Software, typical expectations include:

  • The application performs the function that the user expected

  • It can tolerate the user making mistakes or using the software in unexpected ways

  • Its performance is good enough for the required use case, under the expected load and data volume

  • The system prevents any unauthorized access and abuse

Continuing to work correctly, even when things go wrong.

Fault VS Failure

A Fault is not the same as failure. The things that can go wrong are faults. and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops proving the required service to the user.

It's impossible to reduce the probability of faults to zero, thus, it's useful to design fault-tolerance mechanisms that prevent faults from causing failures.

Fault-Tolerant System:

The Netflix Simian Army -

Hardware Faults

The Disk may be set up in a , servers may have dual power supplies and hot-swappable CPU's and data centers may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced.

As data volumes and applications' computing demands have increased, there's been a shift towards using software fault-tolerance techniques in preference or in addition to hardware redundancy. An advantage of these software fault-tolerant systems is that: For a single server system, it requires planned downtime if the machine needs to be rebooted (e.g. to apply operating system security patches). However, for a system that can tolerate machine failure, it can be patched one node at a time (without the downtime of the entire system - a rolling upgrade)

Software Errors

Software failures are more correlated than hardware failures. There is no quick solution to the problem of systematic faults in the software. Below are things that can help.

  • Carefully thinking about the assumptions and interactions in the system.

  • Thorough testing

  • Process isolation

  • Allowing the process to crash and restart

  • Measuring, Monitoring, and analyzing system behavior in production.

Human Errors

Humans are unreliable.

How do we make our system reliable, in spite of unreliable humans?

  • Design Systems in a way that minimizes the opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do "the right thing" and discourage "the wrong thing." However, if the interfaces are too restrictive people will work around them, negating their benefits, so this is the tricky balance to get right.

  • Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting the real users.

  • Test thoroughly at all levels, from unit tests to whole system integrations tests and manual tests. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.

  • Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For instance, make it fast to roll back configuration changes, roll our new code gradually and provide the tools to recompute the data if required.

  • Setup detailed and clear monitoring such as performance metrics and error rates.

We have a responsibility to our users, and hence reliability is very important. There are situations in which we may choose to sacrifice reliability in order to reduce development costs or operational costs.

1️⃣
https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116
RAID configuration