System Design
Scalability, resilience, rate limiting, idempotency, and recovery goals.
What is idempotency and why is it important in distributed systems?
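Idempotency means that repeating the same operation produces the same result as performing it once, with no additional side effects. It matters in distributed systems because messages can be lost or duplicated, so requests get retried, and a retried request must not apply its effect twice.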
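One common way to get this behavior is an idempotency key supplied by the client. The sketch below is illustrative only: PaymentService, charge, and the in-memory dictionary are hypothetical names, and a real system would keep the key store in durable storage.

```python
class PaymentService:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first call
        self.charges = []   # side effects that were actually performed

    def charge(self, idempotency_key, account, amount):
        # A repeated key returns the stored result and skips the side effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount))
        result = {"status": "charged", "account": account, "amount": amount}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("key-123", "acct-1", 50)
retry = service.charge("key-123", "acct-1", 50)  # simulated client retry
```

The retry returns the same response as the first call, and only one charge is recorded.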
What is a single point of failure?
A single point of failure is a component whose failure causes the entire system or critical service to fail.
What is the difference between RTO and RPO?
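RTO (Recovery Time Objective) is the maximum acceptable time to restore service after an outage, while RPO (Recovery Point Objective) is the maximum acceptable data loss, measured as the span of time before the failure whose data may be lost.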
What is scalability?
Scalability is the ability of a system to handle increased load by adding resources, without an unacceptable drop in performance.
Why is caching used?
Caching improves performance by keeping frequently accessed data in a faster store closer to where it is used, so repeated reads avoid the slower origin.
What is rate limiting?
Rate limiting controls how many requests a client can make in a given time.
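A token bucket is one common way to implement this. The sketch below is a minimal single-client version under stated assumptions: the TokenBucket class and its parameters are hypothetical, and the caller passes in the current time so the behavior is easy to follow.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: the request should be rejected


bucket = TokenBucket(capacity=2, refill_per_second=1)
burst = [bucket.allow(now=0.0) for _ in range(3)]  # [True, True, False]
later = bucket.allow(now=1.0)                      # True, one token refilled
```

In practice there would be one bucket per client or API key, and a rejected request would typically get an HTTP 429 response.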
What is system design?
System design is the process of planning how components work together to build a scalable and reliable system.
What is availability?
Availability is the ability of a system to remain accessible and operational.
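Availability is often quoted as a percentage of time, and it can help to convert that into a downtime budget. The helper below is a small arithmetic sketch; the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_percent, period_days=365):
    """Minutes of downtime permitted over the period at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * 24 * 60


three_nines = allowed_downtime_minutes(99.9)   # about 525.6 minutes per year
four_nines = allowed_downtime_minutes(99.99)   # about 52.56 minutes per year
```

Each extra "nine" cuts the yearly downtime budget by a factor of ten.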
What is reliability?
Reliability is the ability of a system to perform correctly over time.
What is fault tolerance?
Fault tolerance is the ability of a system to continue working even when parts of it fail.
What is the difference between horizontal and vertical scaling?
Vertical scaling adds power to one machine, while horizontal scaling adds more machines.
Why should single points of failure be avoided?
Because failure of that one component can bring down the whole system.
Why is load balancing important in system design?
Load balancing spreads traffic across multiple instances and improves availability.
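As a rough illustration, the sketch below does round-robin selection and skips instances marked unhealthy. RoundRobinBalancer and its methods are hypothetical names; real load balancers also run health checks and support other strategies such as least connections.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Cycles through instances in order, skipping any marked as down."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._ring = cycle(self.instances)

    def mark_down(self, instance):
        self.healthy.discard(instance)

    def pick(self):
        # Look at each instance at most once per call.
        for _ in range(len(self.instances)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances")


lb = RoundRobinBalancer(["a", "b", "c"])
before = [lb.pick() for _ in range(3)]  # ["a", "b", "c"]
lb.mark_down("b")
after = [lb.pick() for _ in range(4)]   # ["a", "c", "a", "c"]
```

Traffic keeps flowing after instance "b" goes down, which is where the availability gain comes from.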
What is the difference between stateless and stateful services?
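Stateless services keep no client session state between requests, so any instance can handle any request; stateful services keep such state locally, so a request must reach the instance that holds it.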
Why is caching used in system design?
Caching reduces latency and lowers load on slower backend systems.
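The cache-aside pattern is one simple way to get this effect. In the sketch below, CacheAside and slow_lookup are hypothetical stand-ins for a cache layer and a slow backend call.

```python
class CacheAside:
    """Checks the cache first and loads from the backend only on a miss."""

    def __init__(self, load_from_backend):
        self._load = load_from_backend
        self._cache = {}
        self.backend_calls = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]      # hit: no backend work
        self.backend_calls += 1          # miss: pay the backend cost once
        value = self._load(key)
        self._cache[key] = value
        return value


def slow_lookup(key):
    # Stand-in for a database query or remote call.
    return key.upper()


cache = CacheAside(slow_lookup)
values = [cache.get("user") for _ in range(3)]
```

Three reads reach the backend only once; the other two are served from memory.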
Why is cache invalidation difficult?
Because cached data can become stale and must be refreshed or removed at the right time.
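Two common tools for limiting staleness are a time-to-live (TTL) and explicit invalidation on write. The sketch below shows both; TtlCache is a hypothetical class and the caller supplies the current time.

```python
class TtlCache:
    """Entries expire after `ttl_seconds` or when explicitly invalidated."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, stored_at)

    def put(self, key, value, now):
        self._entries[key] = (value, now)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at >= self.ttl:
            del self._entries[key]  # too old: treat as a miss
            return None
        return value

    def invalidate(self, key):
        # Called when the source of truth changes.
        self._entries.pop(key, None)


cache = TtlCache(ttl_seconds=30)
cache.put("price", 100, now=0)
fresh = cache.get("price", now=10)        # still within the TTL
expired = cache.get("price", now=30)      # TTL elapsed, entry dropped
cache.put("price", 120, now=40)
cache.invalidate("price")                 # the underlying price was updated
after_write = cache.get("price", now=41)  # miss, forcing a reload
```

The hard part in practice is choosing the TTL and making sure every write path triggers the invalidation.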
What is consistency in distributed systems?
Consistency means all users or nodes see the same data state according to the system's guarantees.
What is eventual consistency?
Eventual consistency means replicas may differ temporarily but converge to the same state once updates stop arriving.
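A toy model of asynchronous replication makes the temporary divergence visible. Primary, Replica, and sync_from below are hypothetical names; the sketch ignores conflicts and failures.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered record of writes to ship to replicas

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already applied

    def sync_from(self, primary):
        # Apply only the writes this replica has not seen yet.
        for key, value in primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(primary.log)


primary, replica = Primary(), Replica()
primary.write("x", 1)
stale_read = replica.data.get("x")   # replica has not caught up yet
replica.sync_from(primary)
caught_up_read = replica.data.get("x")
```

Between the write and the sync, a read from the replica returns old data; after the sync, both copies agree.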
What is replication?
Replication means storing copies of data or services in multiple places.
What is sharding?
Sharding splits data across multiple databases or nodes.
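A simple routing scheme hashes each key and takes the result modulo the number of shards. The sketch below assumes four in-memory shards and a hypothetical shard_for helper.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index using a stable hash."""
    # Python's built-in hash() is randomized per process, so use SHA-256.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


SHARD_COUNT = 4
shards = [dict() for _ in range(SHARD_COUNT)]

for user_id in ["alice", "bob", "carol", "dave"]:
    shards[shard_for(user_id, SHARD_COUNT)][user_id] = {"id": user_id}

# Reads use the same function, so they go to the shard that holds the key.
found = shards[shard_for("alice", SHARD_COUNT)].get("alice")
```

Modulo hashing moves most keys when the shard count changes; consistent hashing is a common alternative when shards are added or removed often.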
Why are queues used in system design?
Queues decouple components and allow asynchronous processing.
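The sketch below uses Python's standard queue and threading modules as a stand-in for a message broker. The producer only enqueues jobs; a separate worker processes them on its own schedule. The doubling step is a placeholder for real work.

```python
import queue
import threading

jobs = queue.Queue()
processed = []


def worker():
    while True:
        job = jobs.get()
        if job is None:          # sentinel: no more work
            jobs.task_done()
            break
        processed.append(job * 2)  # placeholder for slow processing
        jobs.task_done()


consumer = threading.Thread(target=worker)
consumer.start()

# The producer returns as soon as each job is enqueued.
for n in [1, 2, 3]:
    jobs.put(n)
jobs.put(None)

jobs.join()      # wait until every queued item has been handled
consumer.join()
```

The producer never calls the worker directly, so either side can be slowed, restarted, or scaled without changing the other.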
What is backpressure?
Backpressure is a mechanism to slow down or control producers when consumers cannot keep up.
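A bounded queue is one of the simplest forms of backpressure. In the sketch below, try_submit is a hypothetical helper that rejects work when the queue is full instead of letting it grow without limit.

```python
import queue

work = queue.Queue(maxsize=2)  # small bound chosen for illustration


def try_submit(item):
    """Return True if the item was accepted, False if the queue is full."""
    try:
        work.put_nowait(item)
        return True
    except queue.Full:
        return False  # signal the producer to slow down or shed load


accepted = [try_submit(n) for n in range(3)]  # [True, True, False]
work.get_nowait()                             # a consumer drains one item
accepted_after_drain = try_submit(99)         # True, capacity is available
```

A producer could also block on put() rather than fail fast; which is better depends on whether the caller can wait.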
Why should retries be designed carefully?
Bad retry logic can amplify failures and overload already unhealthy systems.
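Safer retry logic usually bounds the number of attempts and spaces them out with exponential backoff plus random jitter. The sketch below is one way to do that; call_with_retries and its parameters are hypothetical, and sleep and rng are injectable so the example can be run without waiting.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Retry on ConnectionError with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after a bounded number of attempts
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)  # random delay up to the ceiling


attempts = []
slept = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# rng is pinned to 1.0 here only to make the delays predictable.
result = call_with_retries(flaky, sleep=slept.append, rng=lambda: 1.0)
```

With the default random jitter, clients that failed together do not all retry at the same instant, which avoids synchronized retry spikes.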
What is a circuit breaker pattern?
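A circuit breaker stops calls to a dependency once failures pass a threshold, fails fast for a cooldown period, and then lets a trial call through to test recovery, so a struggling dependency does not drag down its callers.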
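The sketch below is a minimal circuit breaker, assuming a consecutive-failure threshold and a fixed cooldown. CircuitBreaker and CircuitOpenError are hypothetical names, and the caller passes in the current time.

```python
class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_after):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, now):
        if self.opened_at is not None and now - self.opened_at < self.reset_after:
            raise CircuitOpenError("dependency call skipped")
        # Closed, or cooldown elapsed: let the call through as a trial.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = now  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


calls = []


def failing():
    calls.append(1)
    raise ConnectionError("down")


breaker = CircuitBreaker(failure_threshold=2, reset_after=10)
outcomes = []
for t in [0, 1, 2, 3]:
    try:
        breaker.call(failing, now=t)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

recovered = breaker.call(lambda: "ok", now=20)  # trial call after cooldown
```

After two failures the breaker rejects calls without touching the dependency, then closes again once a trial call succeeds.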
Why are timeouts important in distributed systems?
Timeouts prevent requests from hanging indefinitely when a dependency is slow or broken.
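The sketch below puts a deadline on a call using the standard concurrent.futures module. slow_dependency and call_with_timeout are hypothetical; the Event stands in for a dependency that hangs until released.

```python
import concurrent.futures
import threading

release = threading.Event()


def slow_dependency():
    release.wait()  # blocks until released, like a hung remote call
    return "late"


def call_with_timeout(executor, fn, timeout_seconds, fallback):
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return fallback  # stop waiting and answer with a fallback


with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    outcome = call_with_timeout(pool, slow_dependency, 0.05, "fallback")
    release.set()  # let the worker finish so the pool can shut down
```

Note that this timeout only stops the caller from waiting; the underlying work keeps running until it finishes, so network clients should also set their own connect and read timeouts.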
What is the bulkhead pattern?
The bulkhead pattern isolates failures so one part of a system does not take down another.
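One way to apply this is to give each dependency or feature its own concurrency limit. The sketch below uses a semaphore per compartment; Bulkhead, reports, and checkout are hypothetical names.

```python
import threading


class Bulkhead:
    """Caps how many calls may be in flight for one compartment."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_enter(self):
        # Non-blocking: reject immediately when the compartment is full.
        return self._slots.acquire(blocking=False)

    def leave(self):
        self._slots.release()


reports = Bulkhead(max_concurrent=2)
checkout = Bulkhead(max_concurrent=2)

# Slow report requests fill their own compartment...
reports_admitted = [reports.try_enter() for _ in range(3)]  # [True, True, False]
# ...but checkout still has capacity.
checkout_admitted = checkout.try_enter()                    # True
```

Because each compartment has its own limit, a flood of report requests cannot consume the capacity that checkout depends on.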
Why is rate limiting used in system design?
Rate limiting protects systems from overload and abuse.
What is the CAP theorem?
The CAP theorem says a distributed system cannot guarantee consistency, availability, and partition tolerance all at once; when a network partition occurs, it must give up either consistency or availability.
Why are RPO and RTO important in disaster recovery design?
They define acceptable data loss and recovery time after failures.
Why would a system use multi-region architecture?
Multi-region design can improve resilience, reduce latency, and support disaster recovery.
What is graceful degradation?
Graceful degradation means a system keeps delivering partial value instead of fully failing.
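As an illustration, the sketch below renders a product page even when an optional dependency is down. fetch_recommendations and render_product_page are hypothetical, and the failure is simulated.

```python
def fetch_recommendations(user_id):
    # Simulated outage of a non-critical dependency.
    raise TimeoutError("recommendation service unavailable")


def render_product_page(product, user_id):
    page = {"product": product, "recommendations": [], "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
    except (TimeoutError, ConnectionError):
        # Core content is still served; only the optional section is dropped.
        page["degraded"] = True
    return page


page = render_product_page({"name": "keyboard", "price": 49}, user_id="u1")
```

The user still sees the product and can buy it; only the recommendations panel is missing.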
Why must observability be considered in system design?
Because systems cannot be operated reliably if they cannot be understood and debugged.
Why use an API gateway?
An API gateway centralizes routing, authentication, rate limiting, and request handling for APIs.
Why is idempotency important for retries and distributed workflows?
Because retries should not create duplicate side effects.
How does system design differ for read-heavy and write-heavy workloads?
Read-heavy systems often prioritize caching and replicas, while write-heavy systems focus on ingestion throughput and durability.
Why is blast radius reduction an important system design principle?
Because failures should be contained so they do not impact the whole platform.