## Client-Server Model
- **Definition:** A _distributed architecture_ where **clients** (user devices or programs) request services and **servers** (machines or processes that provide those services) fulfill them [geeksforgeeks.org](https://www.geeksforgeeks.org/client-server-architecture-system-design/#:~:text=Client,enhances%20performance%2C%20scalability%2C%20and%20security). Clients (e.g. web browsers or mobile apps) send requests (HTTP, RPC, etc.) to a server, which processes data and returns responses.
- **Advantages:** Centralizes data and business logic on servers, making updates and maintenance easier; clients can be “thin” (lightweight) and focus on the user interface. Servers can be replicated or scaled to handle many clients.
- **Key point:** Separating roles (client vs. server) improves performance and scalability[geeksforgeeks.org](https://www.geeksforgeeks.org/client-server-architecture-system-design/#:~:text=Client,enhances%20performance%2C%20scalability%2C%20and%20security). The server can cache or pre-compute data, while clients handle user interaction.
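The request/response cycle above can be sketched end to end with the Python standard library. This is a toy, not production code: a one-handler HTTP server plays the "server" role and `urllib` plays the "client".

```python
# Minimal client-server sketch using only the standard library.
# The server centralizes the logic; the client just sends a request
# and renders the response.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello from the server"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "client": send a request, wait for the response.
url = f"http://127.0.0.1:{server.server_port}/"
with urllib.request.urlopen(url) as resp:
    reply = resp.read().decode()
server.shutdown()
print(reply)  # hello from the server
```

The same shape holds at any scale: only the transport (HTTP here) and the server's processing change.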
## DNS and IP Addressing
- **DNS = “Internet Phonebook”:** Domain Name System (DNS) maps human-friendly domain names (e.g. `example.com`) to machine-friendly IP addresses[cloudflare.com](https://www.cloudflare.com/learning/dns/what-is-dns/#:~:text=The%20Domain%20Name%20System%20,browsers%20can%20load%20Internet%20resources). Every device on the Internet has a unique IP (IPv4 or IPv6); DNS lets users use memorable names instead of numeric addresses[cloudflare.com](https://www.cloudflare.com/learning/dns/what-is-dns/#:~:text=The%20Domain%20Name%20System%20,browsers%20can%20load%20Internet%20resources).
- **Hierarchical Design:** DNS is a distributed hierarchy (root → TLD → authoritative servers). Clients (resolvers) cache DNS responses (with TTL) for speed. Tools: public DNS (e.g. Cloudflare DNS, Google DNS, AWS Route 53) or private DNS for internal networks.
- **Practical uses:** You can use DNS for load balancing (multiple A records in round-robin), failover (primary/secondary name servers), and geo-routing (GeoDNS/CDN). For example, Amazon’s Route 53 and Cloudflare DNS can direct users to the closest data center.
- **Interview tip:** Understand **A vs. CNAME records**, TTL, and the differences between IPv4 (32-bit, dotted-decimal notation) and IPv6 (128-bit, hexadecimal notation) addressing. Note that DNS translates names to IPs behind the scenes[cloudflare.com](https://www.cloudflare.com/learning/dns/what-is-dns/#:~:text=The%20Domain%20Name%20System%20,browsers%20can%20load%20Internet%20resources) so clients don’t need to know actual IPs.
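The resolver-caching behavior described above (cache hits until the TTL expires, then re-query) can be sketched with a toy cache. This is not a real resolver; the "upstream" is a stand-in function, and the IP is example.com's well-known address.

```python
# Toy resolver cache illustrating DNS TTL behavior (not a real resolver).
# lookup() serves from cache while the entry is fresh and only queries
# "upstream" on a miss or after the TTL expires.
import time

class CachingResolver:
    def __init__(self, upstream, ttl_seconds):
        self.upstream = upstream      # callable: name -> IP (stand-in here)
        self.ttl = ttl_seconds
        self.cache = {}               # name -> (ip, expires_at)

    def lookup(self, name, now=None):
        now = time.monotonic() if now is None else now
        entry = self.cache.get(name)
        if entry and entry[1] > now:          # fresh cache hit
            return entry[0]
        ip = self.upstream(name)              # miss or expired: re-query
        self.cache[name] = (ip, now + self.ttl)
        return ip

queries = []
def fake_upstream(name):
    queries.append(name)
    return "93.184.216.34"

resolver = CachingResolver(fake_upstream, ttl_seconds=60)
resolver.lookup("example.com", now=0)    # miss -> upstream query
resolver.lookup("example.com", now=30)   # hit  -> served from cache
resolver.lookup("example.com", now=120)  # TTL expired -> re-query
print(len(queries))  # 2
```

Real resolvers layer this caching at every hop (stub resolver, recursive resolver, OS cache), which is why DNS changes take up to a TTL to propagate.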
## Scaling: Vertical vs. Horizontal
- **Vertical Scaling (Scale-Up):** Increase the capacity of a single machine by adding CPU, RAM, or disk[geeksforgeeks.org](https://www.geeksforgeeks.org/system-design-horizontal-and-vertical-scaling/#:~:text=Vertical%20scaling%2C%20also%20known%20as,software%20component%20within%20a%20system). Easy to implement (just upgrade the server), but limited by hardware and can cause downtime during upgrades[geeksforgeeks.org](https://www.geeksforgeeks.org/system-design-horizontal-and-vertical-scaling/#:~:text=Vertical%20scaling%2C%20also%20known%20as,software%20component%20within%20a%20system). Example: upgrading an EC2 instance from 4 CPUs/8 GB RAM to 8 CPUs/16 GB RAM.
- **Horizontal Scaling (Scale-Out):** Add more machines/servers to handle load[geeksforgeeks.org](https://www.geeksforgeeks.org/system-design-horizontal-and-vertical-scaling/#:~:text=Horizontal%20scaling%2C%20also%20known%20as,larger%20number%20of%20individual%20units). Distributes workload across multiple nodes, enabling huge capacity and high availability[geeksforgeeks.org](https://www.geeksforgeeks.org/system-design-horizontal-and-vertical-scaling/#:~:text=Horizontal%20scaling%2C%20also%20known%20as,larger%20number%20of%20individual%20units)[geeksforgeeks.org](https://www.geeksforgeeks.org/system-design-horizontal-and-vertical-scaling/#:~:text=,than%20managing%20a%20single%20node). For instance, spin up additional web servers behind a load balancer. No single upgrade point; you keep adding new instances as needed.
- **Comparison:** Vertical scaling simplifies architecture initially, but has limits (e.g. a machine can only have so much RAM). Horizontal scaling is more complex (requires distributed coordination) but avoids single points of failure and is the norm for large systems[geeksforgeeks.org](https://www.geeksforgeeks.org/system-design-horizontal-and-vertical-scaling/#:~:text=,than%20managing%20a%20single%20node). **Best practice:** start with vertical (for simplicity) and design for later horizontal growth (e.g. stateless servers, database sharding).
- **Interview note:** Mention **auto-scaling** (e.g. AWS Auto Scaling Groups) for horizontal scaling and the trade-offs (cost, complexity, fault tolerance)[geeksforgeeks.org](https://www.geeksforgeeks.org/system-design-horizontal-and-vertical-scaling/#:~:text=,than%20managing%20a%20single%20node). Highlight that horizontal scaling (scale-out) is often necessary for global high-availability services (e.g. Netflix, Amazon)[geeksforgeeks.org](https://www.geeksforgeeks.org/system-design-horizontal-and-vertical-scaling/#:~:text=,than%20managing%20a%20single%20node).
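The auto-scaling decision above boils down to simple arithmetic: cover the current load with enough replicas, clamped to configured bounds. A back-of-the-envelope sketch (the per-instance capacity and bounds are illustrative assumptions, not real service numbers):

```python
# Horizontal scaling as arithmetic: replicas needed to cover current QPS,
# clamped between a safety floor and a cost ceiling.
import math

def desired_replicas(current_qps, qps_per_instance, min_replicas=2, max_replicas=20):
    """Scale out to cover current load, within configured bounds."""
    needed = math.ceil(current_qps / qps_per_instance)
    return max(min_replicas, min(needed, max_replicas))

print(desired_replicas(4500, qps_per_instance=500))  # 9
print(desired_replicas(100, qps_per_instance=500))   # 2 (the floor)
```

Real auto-scalers (e.g. AWS Auto Scaling, Kubernetes HPA) apply the same idea to CPU, memory, or custom metrics, with cooldowns to avoid flapping.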
## Load Balancers
- **Purpose:** A load balancer (LB) distributes incoming client requests across multiple servers[geeksforgeeks.org](https://www.geeksforgeeks.org/load-balancer-system-design-interview-question/#:~:text=A%20load%20balancer%20is%20a,of%20servers%2C%20and%20high%20performance). This prevents any single server from becoming a bottleneck or point of failure, and improves overall throughput[geeksforgeeks.org](https://www.geeksforgeeks.org/load-balancer-system-design-interview-question/#:~:text=A%20load%20balancer%20is%20a,of%20servers%2C%20and%20high%20performance).
- **How it works:** The LB acts as a reverse proxy between clients and servers. It uses health checks to monitor servers and routes traffic only to healthy nodes[geeksforgeeks.org](https://www.geeksforgeeks.org/load-balancer-system-design-interview-question/#:~:text=Load%20balancers%20minimize%20server%20response,remove%20the%20number%20of%20servers). If a server goes down, the LB automatically stops sending requests to it, ensuring high availability and uptime.
- **Algorithms:** Common LB strategies include round-robin (even distribution), least-connections, IP-hash, and weighted routing. Advanced LBs (Layer 7) can route based on HTTP attributes (paths, headers, cookies).
- **Types & Tools:**
- **Layer 4 (Network) Balancer:** Works at TCP/UDP level (e.g. AWS Network Load Balancer, F5 hardware). Fast and handles millions of connections.
- **Layer 7 (Application) Balancer:** Understands HTTP(S) (e.g. NGINX, HAProxy, AWS Application Load Balancer). Can route on URLs or do SSL termination.
- **Global DNS Balancing:** Services like AWS Route 53 or Cloudflare can act as DNS LBs across regions (geo-routing, latency-based routing).
- **Benefits:** Ensures high availability (removes single server failure) and lowers latency by spreading load[geeksforgeeks.org](https://www.geeksforgeeks.org/load-balancer-system-design-interview-question/#:~:text=Load%20balancers%20minimize%20server%20response,remove%20the%20number%20of%20servers). Also enables **zero-downtime scaling** (add/remove servers seamlessly). Example tools: **NGINX/HAProxy** (software), **AWS ELB/ALB** (managed), **Kubernetes Ingress** controllers (for clusters).
- **Important:** Do not make the LB a single point of failure – use redundant/multi-zone LBs. Also consider session affinity (“sticky sessions”) only if absolutely needed; ideally design servers to be stateless so any LB target can handle any request.
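The round-robin-plus-health-checks behavior described above can be sketched in a few lines. This is a toy: health state is fed in externally, whereas a real LB probes servers periodically.

```python
# Toy load balancer: round-robin over the healthy servers only.
import itertools

class RoundRobinBalancer:
    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        self.healthy.add(server)

    def pick(self):
        # Skip unhealthy nodes; fail loudly if nothing is left to route to.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers")

lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
first = [lb.pick() for _ in range(3)]
lb.mark_down("10.0.0.2")
second = [lb.pick() for _ in range(4)]
print(first)   # each server once, in turn
print(second)  # traffic now flows only to .1 and .3
```

Least-connections or weighted routing would change only the `pick()` policy; the health-check filtering stays the same.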
## Microservices Architecture
- **Definition:** Break a large monolithic application into many _small, independent services_, each responsible for a specific business capability[geeksforgeeks.org](https://www.geeksforgeeks.org/microservices/#:~:text=Microservices%20are%20an%20architectural%20approach,into%20smaller%2C%20loosely%20coupled%20services). Each microservice has its own codebase and can be developed, deployed, and scaled on its own[geeksforgeeks.org](https://www.geeksforgeeks.org/microservices/#:~:text=Microservices%20are%20an%20architectural%20approach,into%20smaller%2C%20loosely%20coupled%20services).
- **Benefits:** Improves modularity and scalability. Teams can work on different services simultaneously using different languages or databases. Individual services fail independently (fault isolation), and only the impacted service needs to be scaled or redeployed.
- **Communication:** Microservices interact via lightweight mechanisms (usually RESTful HTTP APIs, gRPC, or messaging). A **service discovery** mechanism (e.g. Consul, DNS-based, or Kubernetes DNS) helps services find each other at runtime.
- **Deployment:** Often run in containers (Docker) on orchestrators like Kubernetes or AWS ECS. Containers ensure isolation and consistent environments. Kubernetes can auto-scale pods of a service based on load.
- **Drawbacks:** Increases complexity (network calls, data consistency challenges). Must handle distributed concerns: retries, timeouts, circuit-breakers, and data transactions across services. Monitoring and logging become more involved (use centralized logging, distributed tracing).
- **Example:** In an e-commerce site, one microservice might handle user accounts, another handles orders, another handles inventory. An API Gateway often fronts them all. Large companies (Netflix, Amazon, Uber) successfully use microservices at scale.
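The service-discovery idea above can be sketched in-process: services register under a name, and callers look them up at call time instead of hard-coding addresses. A dict stands in for a real registry like Consul or Kubernetes DNS, and plain functions stand in for the services.

```python
# Sketch of service discovery: a registry maps service names to handlers,
# looked up at call time so instances can move or be replaced.
registry = {}

def register(name, handler):
    registry[name] = handler

def call(name, payload):
    handler = registry.get(name)
    if handler is None:
        raise LookupError(f"service not found: {name}")
    return handler(payload)

# Two independent "services", each owning one business capability.
register("orders", lambda payload: {"order_id": 42, "item": payload["item"]})
register("inventory", lambda payload: {"in_stock": payload["item"] == "widget"})

order = call("orders", {"item": "widget"})
stock = call("inventory", {"item": "widget"})
print(order, stock)
```

In a real deployment the `call` would be a network hop (REST/gRPC), which is exactly why the drawbacks above (retries, timeouts, tracing) appear.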
## API Gateways
- **Definition:** An API Gateway is a reverse-proxy **single entry point** for client requests in a microservices system[geeksforgeeks.org](https://www.geeksforgeeks.org/what-is-api-gateway-system-design/#:~:text=One%20service%20that%20serves%20as,to%20the%20appropriate%20backend%20services). Clients send all requests to the gateway, which then routes them to the appropriate backend service[geeksforgeeks.org](https://www.geeksforgeeks.org/what-is-api-gateway-system-design/#:~:text=One%20service%20that%20serves%20as,to%20the%20appropriate%20backend%20services).
- **Responsibilities:** The gateway handles common tasks like authentication/authorization (using JWT, OAuth, API keys), rate limiting/throttling, request routing, load balancing, caching, and metrics collection[geeksforgeeks.org](https://www.geeksforgeeks.org/what-is-api-gateway-system-design/#:~:text=One%20service%20that%20serves%20as,to%20the%20appropriate%20backend%20services)[geeksforgeeks.org](https://www.geeksforgeeks.org/what-is-api-gateway-system-design/#:~:text=,and%20monitoring%20systems%20for%20centralized). It may also translate protocols (e.g. HTTP → gRPC) and aggregate responses from multiple services into one.
- **Benefits:** Simplifies the client (it only needs to know one endpoint). Hides the complexity of the microservices topology and provides a unified API surface. Improves security by centralizing cross-cutting concerns[geeksforgeeks.org](https://www.geeksforgeeks.org/what-is-api-gateway-system-design/#:~:text=,access%20a%20variety%20of%20services).
- **Tools:** AWS API Gateway (often paired with Lambda), **Kong** (built on NGINX) or **Tyk** (self-hosted API gateways), **Istio**/Envoy (service-mesh style gateway).
- **Caution:** The gateway itself can be a bottleneck or single point of failure, so it must be scaled (e.g. run multiple instances) and monitored. It can introduce extra latency, so use caching and efficient routing.
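Two of the gateway's core duties above, centralized auth and path-based routing, fit in a small sketch. Backends are plain functions standing in for microservices, and the API-key check is a stand-in for real auth (JWT/OAuth).

```python
# Toy API gateway: one entry point that authenticates, then routes by
# path prefix to the right backend service.
VALID_KEYS = {"secret-key"}

routes = {
    "/users": lambda path: f"user-service handled {path}",
    "/orders": lambda path: f"order-service handled {path}",
}

def gateway(path, api_key):
    if api_key not in VALID_KEYS:           # auth centralized at the edge
        return 401, "unauthorized"
    for prefix, backend in routes.items():  # route on the path prefix
        if path.startswith(prefix):
            return 200, backend(path)
    return 404, "no route"

print(gateway("/orders/17", "secret-key"))  # routed to the order service
print(gateway("/orders/17", "bad-key"))     # rejected before any routing
```

Rate limiting, caching, and protocol translation slot into the same pipeline as extra steps before or after the routing decision.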
## Asynchronous Processing (Queues & Workers)
- **Concept:** Instead of handling tasks synchronously (client waits for each to complete), use _asynchronous messaging_. The client enqueues a job in a queue and the server immediately returns a response (e.g. “Job received”)[geeksforgeeks.org](https://www.geeksforgeeks.org/asynchronous-processing-in-system-design/#:~:text=,the%20need%20for%20constant%20polling). Background **worker** processes then pull jobs from the queue and execute them independently.
- **Message Queues:** Systems like **RabbitMQ**, **Apache Kafka**, **AWS SQS**, or **Google Cloud Pub/Sub** store and forward messages. Queues decouple the request layer from the processing layer[geeksforgeeks.org](https://www.geeksforgeeks.org/asynchronous-processing-in-system-design/#:~:text=,the%20need%20for%20constant%20polling), enabling smooth load handling.
- **Use Cases:** Long-running or resource-intensive tasks (sending emails, video processing, report generation, machine learning jobs) are offloaded to queues. This keeps the system responsive to users and enables **batch or delayed processing**.
- **Advantages:** Improves throughput and fault tolerance. If worker services crash, queued tasks are not lost (as long as the queue is durable). Workers can be scaled independently based on queue depth.
- **Patterns:** Producer-Consumer is common: the web/app service produces (enqueues) tasks, and multiple consumers (workers) process them. Implement retries and dead-letter queues for failed tasks.
- **Tools:** _Celery_ (Python) with RabbitMQ/Redis, _Sidekiq/Resque_ (Ruby) with Redis, _Java JMS_ (ActiveMQ), _AWS SQS + Lambda/EC2_, etc.
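The producer-consumer pattern above can be demonstrated in-process with `queue.Queue` and worker threads: the "web tier" enqueues jobs and returns immediately, and workers drain the backlog in the background. The job names are illustrative.

```python
# Producer-consumer sketch: enqueue fast, process in background workers.
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        job = jobs.get()
        if job is None:                          # sentinel: shut down
            jobs.task_done()
            break
        with lock:
            results.append(f"processed {job}")   # stand-in for real work
        jobs.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for i in range(5):            # the "producer": enqueue and return fast
    jobs.put(f"email-{i}")

jobs.join()                   # wait until the backlog is drained
for _ in workers:
    jobs.put(None)
for w in workers:
    w.join()
print(len(results))  # 5
```

A durable broker (RabbitMQ, SQS) replaces the in-memory queue in production, which is what makes queued tasks survive worker crashes.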
## Publish/Subscribe (Pub/Sub) Model
- **Definition:** In Pub/Sub, **publishers** emit messages to a _topic_ without specifying recipients, and **subscribers** receive messages from topics they subscribe to[geeksforgeeks.org](https://www.geeksforgeeks.org/what-is-pub-sub/#:~:text=The%20Pub%2FSub%20,of%20the%20Pub%2FSub%20model%20include). A message broker routes each published message to all interested subscribers[geeksforgeeks.org](https://www.geeksforgeeks.org/what-is-pub-sub/#:~:text=The%20Pub%2FSub%20,of%20the%20Pub%2FSub%20model%20include).
- **Decoupling:** Publishers and subscribers do not know about each other (loose coupling). This enables horizontal scaling and flexibility. You can add more subscribers or publishers without changing others.
- **Typical Flow:** A publisher tags messages with a topic (e.g. “order_placed”) and sends to the broker. The broker (e.g. Kafka, Google Pub/Sub, AWS SNS) delivers the message to all subscribers listening on that topic.
- **Use Cases:** Event-driven architecture: e.g. when a user places an order, publish an “order placed” event. One service can charge the user, another can update inventory, another can send a confirmation email – all triggered by the same event.
- **Systems:** **Apache Kafka** (distributed log with pub/sub semantics), **Google Pub/Sub**, **AWS SNS** or **Kinesis**, **Redis Pub/Sub** (simple), **RabbitMQ** (topics/exchanges). Kafka is widely used for high-throughput streaming and retention.
- **Key point:** Pub/Sub enables one-to-many communication and high scalability. It’s essential for real-time data pipelines, notifications, and decoupled microservices.
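The one-to-many fan-out described above fits in a minimal in-memory broker sketch: publishers emit to a topic without naming recipients, and the broker delivers to every subscriber of that topic. Real brokers add durability, ordering, and delivery guarantees this toy omits.

```python
# Minimal in-memory pub/sub broker: publish to a topic, fan out to all
# subscribers of that topic.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher never names recipients; the broker does the fan-out.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
billing, inventory = [], []
broker.subscribe("order_placed", billing.append)
broker.subscribe("order_placed", inventory.append)

broker.publish("order_placed", {"order_id": 7, "total": 19.99})
print(billing[0] == inventory[0])  # True: both services saw the same event
```

Adding a third consumer (say, an email service) is one `subscribe` call, with no change to the publisher — the loose coupling the bullet points describe.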
## Real-World Scenarios and Best Practices
- **Data Stores:** Use the right database for each need. Relational DBs (PostgreSQL, MySQL) for transactional data; NoSQL (MongoDB, Cassandra, DynamoDB) for flexible schemas or huge scale. Always replicate (primary–replica, or multi-primary) and consider sharding/partitioning large tables to scale reads/writes.
- **Caching:** Heavily cache read-heavy data. In-memory caches like **Redis** or **Memcached** can drastically reduce DB load[medium.com](https://medium.com/@devcorner/mastering-system-design-key-rules-to-guide-your-interviews-cc582169a609#:~:text=Mastering%20System%20Design%3A%20Key%20Rules,). For example, cache user sessions or product catalog entries. Use CDNs (e.g. CloudFront, Cloudflare) for static assets (images, JS/CSS) to serve users globally with low latency.
- **API and Load Testing:** Always estimate expected load (QPS) and simulate high traffic. Identify bottlenecks under stress and iterate design.
- **Monitoring & Logging:** Instrument every component (use Prometheus/Grafana, ELK stack, or AWS CloudWatch) to track latency, error rates, and resource usage. Set up alerts (e.g. on high error rate or CPU). Detailed logs and distributed tracing (Jaeger, Zipkin) help diagnose issues.
- **Fault Tolerance:** Assume failures. Use retries with exponential backoff, circuit breakers, and bulkheads. Deploy critical services across multiple availability zones or regions for disaster recovery. Use health checks (in load balancers or orchestrators) to auto-replace failed nodes.
- **Scalability Practices:**
- **Stateless Services:** Keep servers stateless (no sticky sessions) so they are replaceable and scalable. Store user sessions in a shared cache or database.
- **Async and CQRS:** Where possible, separate reads and writes (Command Query Responsibility Segregation) and use asynchronous processing to handle spikes.
- **12-Factor App:** Follow principles (config via environment, backing services as attached resources, logs as event streams, etc.) for cloud-native designs.
- **Security:** Encrypt data in transit (TLS) and at rest (disk encryption, KMS). Authenticate and authorize all requests (use OAuth, API keys). Protect against common attacks (SQL injection, XSS) at the application level. Use WAF/CDN for DDoS protection.
- **DevOps/Automation:** Use Infrastructure-as-Code (Terraform, CloudFormation) for reproducible environments. Automate CI/CD pipelines so deployments are reliable and rollback-able.
- **Example Scenario:** For a scalable web service, you might use DNS (Route 53) → Load Balancer (ELB) → Auto-scaled EC2/Kubernetes pods running the app → Redis cache and RDS/MySQL cluster → RabbitMQ for background jobs → CloudWatch/Grafana for monitoring. Every piece (LB, web tier, DB, queue) can scale independently.
- **Interview Tip:** When solving a design problem, outline requirements (load, data size), define APIs, choose components (DNS, LB, cache, DB, queues) and justify each. Discuss trade-offs (consistency vs. availability, SQL vs. NoSQL, monolith vs. microservices) and highlight fault tolerance and scaling strategies throughout.
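Of the fault-tolerance practices above, retry with exponential backoff is the most mechanical and worth knowing cold: wait twice as long after each failure before retrying (real systems also add jitter to avoid thundering herds). A minimal sketch, with an injectable `sleep` so the delays are visible:

```python
# Retry with exponential backoff: delays of 1x, 2x, 4x, ... the base delay.
import time

def retry(operation, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                           # out of attempts: surface it
            sleep(base_delay * (2 ** attempt))  # back off before retrying

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

delays = []
print(retry(flaky, sleep=delays.append))  # ok (succeeded on attempt 3)
print(delays)  # [0.01, 0.02]
```

Pair this with a circuit breaker so a hard-down dependency isn't hammered with retries indefinitely.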
**Key Takeaways:** System design is about building **scalable, reliable, and maintainable** architectures. Think in layers (client, API/gateway, service, data), use abstraction (DNS/IP, load balancers, gateways) to decouple components, and leverage asynchronous/messaging patterns for resilience. Always back design choices with use-case needs (throughput, latency, complexity) and industry best practices (monitoring, automation, security) to create real-world robust systems.
**Sources:** Based on industry-standard concepts and best practices[geeksforgeeks.org](https://www.geeksforgeeks.org/client-server-architecture-system-design/#:~:text=Client,enhances%20performance%2C%20scalability%2C%20and%20security)[cloudflare.com](https://www.cloudflare.com/learning/dns/what-is-dns/#:~:text=The%20Domain%20Name%20System%20,browsers%20can%20load%20Internet%20resources)[geeksforgeeks.org](https://www.geeksforgeeks.org/system-design-horizontal-and-vertical-scaling/#:~:text=What%20is%20Horizontal%20Scaling%3F)[geeksforgeeks.org](https://www.geeksforgeeks.org/load-balancer-system-design-interview-question/#:~:text=A%20load%20balancer%20is%20a,of%20servers%2C%20and%20high%20performance)[geeksforgeeks.org](https://www.geeksforgeeks.org/microservices/#:~:text=Microservices%20are%20an%20architectural%20approach,into%20smaller%2C%20loosely%20coupled%20services)[geeksforgeeks.org](https://www.geeksforgeeks.org/what-is-api-gateway-system-design/#:~:text=One%20service%20that%20serves%20as,to%20the%20appropriate%20backend%20services)[geeksforgeeks.org](https://www.geeksforgeeks.org/asynchronous-processing-in-system-design/#:~:text=,the%20need%20for%20constant%20polling)[geeksforgeeks.org](https://www.geeksforgeeks.org/what-is-pub-sub/#:~:text=The%20Pub%2FSub%20,of%20the%20Pub%2FSub%20model%20include), with examples from AWS, Kubernetes, NGINX, etc. (For further reading, see GeeksforGeeks, Cloudflare, AWS documentation, and system design guides.)