My personal learning notes and architecture lessons

Day 13: AWS Solutions Architect Professional Prep — Serverless architectures

Today’s lesson was all about serverless architectures. Serverless is an approach where you simply write your code and let Amazon Web Services (AWS) handle the maintenance, scaling, and heavy lifting automatically. With serverless you can focus purely on creating amazing features rather than managing infrastructure. You don’t have to manage complicated servers or pay for expensive computing power that just sits idle. You only pay when your code is actually running.

Serverless computing can be compared to a vending machine. You just pay for the soda (code execution) right when you need it instead of buying a whole store (a dedicated server) and hiring staff. AWS handles the machine maintenance and stocking.

I delved into three core services that require zero server management: Lambda, API Gateway, and EventBridge.

I found answers to 3 key questions:

1. Where does my code actually run? (AWS Lambda)

2. How do customers access my application? (API Gateway)

3. How do different parts of my application talk to each other without breaking? (EventBridge)

1. Where does my code actually run? (AWS Lambda)

AWS Lambda is like a professional chef working on call. The chef is not paid to stand around waiting. The chef only shows up and cooks when someone places an order (the trigger). Likewise, Lambda executes your function code when it is triggered, meaning you don’t have to manage any dedicated computers (EC2 instances or containers).
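To make the "chef on call" idea concrete, here is a minimal sketch of a Python Lambda function. The `lambda_handler(event, context)` shape is the standard entry point for the Python runtime; the `order_id` field in the event is a hypothetical payload used only for illustration, since real event shapes depend on the trigger.

```python
import json


def lambda_handler(event, context):
    """Entry point that Lambda invokes once per trigger."""
    # "order_id" is a made-up field for this example, not a fixed AWS event key.
    order_id = event.get("order_id", "unknown")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Order {order_id} received"}),
    }


if __name__ == "__main__":
    # Local smoke test; inside AWS the service supplies both arguments.
    print(lambda_handler({"order_id": "A-42"}, None))
```

Nothing in the function knows about servers: it receives an event, does its work, and returns.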

When configuring Lambda, key parameters include Memory, which proportionally affects CPU and network performance, and Timeout, which can be up to 15 minutes.
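The memory and timeout settings also drive the bill, because Lambda compute is charged by the GB-second (configured memory multiplied by run time). The sketch below shows that arithmetic. The price is deliberately a parameter you pass in: it is an assumption for illustration, not a quoted AWS rate.

```python
MAX_TIMEOUT_SECONDS = 15 * 60  # the 15-minute ceiling mentioned above


def gb_seconds(memory_mb: int, duration_ms: float, invocations: int) -> float:
    """Billable compute: memory (GB) x run time (s) x number of invocations."""
    if duration_ms > MAX_TIMEOUT_SECONDS * 1000:
        raise ValueError("a single invocation cannot outlive the timeout limit")
    return (memory_mb / 1024) * (duration_ms / 1000) * invocations


def compute_cost(memory_mb: int, duration_ms: float, invocations: int,
                 price_per_gb_second: float) -> float:
    """price_per_gb_second is a caller-supplied assumption, not an AWS price."""
    return gb_seconds(memory_mb, duration_ms, invocations) * price_per_gb_second
```

For example, one million invocations at 512 MB that each run for 200 ms come to roughly 100,000 GB-seconds, whatever the current price per GB-second happens to be.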

But there is a trick to increase performance: Provisioned Concurrency. This is akin to keeping the chef on standby. Since the chef might take a minute to grab ingredients (a “cold start”), Provisioned Concurrency means you pay a bit extra to keep the chef ready at the stove so the meal starts instantly. It keeps the function “warm,” reducing the delay caused by initialization.

There are other optimization tips like keeping dependencies small and tuning memory settings. Lambda supports various runtimes like Node.js, Python, and Java, as well as custom runtimes. Like a trustworthy employee, each Lambda function is given only the specific permissions it needs to do its job, following the principle of least privilege.

2. How do customers access my application? (API Gateway)

API Gateway is the official entrance for all incoming requests to an application. It manages all the people coming in and directs their orders to the right Lambda chef. It allows you to build standard REST APIs, faster HTTP APIs, or WebSocket APIs for real-time communication. It helps manage crowds by using Throttling and Quotas to control traffic bursts, so the system isn’t overwhelmed. It can also cache responses, which reduces latency and decreases the number of times a Lambda function needs to be invoked.

It is comparable to the maître d’ at a restaurant. This service stands at the front entrance, managing the traffic, checking IDs (security), setting limits on how many people can enter (throttling), and directing guests (requests) to the right kitchen (Lambda function).
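Throttling is easier to picture with a token bucket, which is the algorithm API Gateway uses for request throttling. The class below is a simplified, single-process sketch of that idea, not the service’s implementation; the class name and the explicit `now` argument (used so the example is deterministic) are my own choices.

```python
class TokenBucket:
    """Allow bursts up to `burst` requests, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous call

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, never beyond the burst capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend one token for this request
            return True
        return False          # bucket empty: the request is throttled
```

With a rate of 1 request per second and a burst of 2, two back-to-back requests pass, a third is rejected, and one more is allowed a second later.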

3. How do different parts of my application talk to each other without breaking? (EventBridge)

EventBridge is an invisible communication system. Think of the postal service or mail sorter. When one service (like an Order Service) finishes a job, it drops a letter (the event) into the system. EventBridge reads the address (the rule) and routes it efficiently to any other service (the target, like a Payment Service) that needs to know about it. This makes services independent of each other.

If Lambda is the engine that runs your code and API Gateway is the front door of the application, then EventBridge can be called The Decoupling Glue. EventBridge creates decoupled microservices; that is, if the Shipping Department breaks, the rest of your system (like the Order Entry system) can keep running. Instead of Service A sending a message directly to Service B (creating a tight connection), Service A sends a message to the Event Bus. EventBridge uses Rules to automatically route that message to Service B (the Target).

If the target service is temporarily unavailable, EventBridge will retry the delivery for up to 24 hours to prevent data loss.
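The "read the address and route the letter" step can be sketched in a few lines. This is a heavily simplified model of event-pattern matching (exact-value matching only, where each pattern field lists its allowed values); the event fields, service names, and function names are hypothetical.

```python
def matches(pattern: dict, event: dict) -> bool:
    """True if every field in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested part of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: `expected` is a list of allowed values.
            return False
    return True


def route(event: dict, rules: list) -> list:
    """Return the targets of every rule whose pattern matches the event."""
    return [target for pattern, target in rules if matches(pattern, event)]


rules = [
    ({"source": ["orders.service"], "detail-type": ["OrderPlaced"]}, "payment-service"),
    ({"source": ["orders.service"], "detail": {"status": ["shipped"]}}, "shipping-service"),
]
event = {"source": "orders.service", "detail-type": "OrderPlaced",
         "detail": {"status": "paid"}}
targets = route(event, rules)  # only the first rule matches this event
```

The Order Service never names the Payment Service; it only publishes the event, and the rules decide who hears about it.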

To sum it all up, the three services I studied today are said to form the backbone of modern, event-driven, cost-effective serverless architectures. By linking API Gateway (front door) to Lambda (task runner) to a database like DynamoDB (storage), and using EventBridge (communicator) between services, it becomes possible to build a complete, powerful system without worrying about server maintenance.

Day 12: AWS Solutions Architect Professional Prep – Caching and Performance Optimization

Have you ever slammed your laptop shut after waiting five seconds for a webpage to load? Latency, indeed, kills user experience. Worse yet, it can translate into overloaded databases and skyrocketing operational costs.

Thankfully, there’s a solution for this: Caching and Content Delivery Networks (CDNs). Caching means keeping frequently used data closer to the user or application so it can be served instantly with no waiting and no heavy lifting from your database.

AWS tools like ElastiCache and CloudFront make it possible to build systems that respond fast, scale easily, and cost less while giving users a smooth experience anywhere in the world.

1. How does AWS use caching?

To illustrate caching, we might compare your website or app to a busy restaurant. Every time a customer (user) asks for a menu item (data), the chef (database) has to cook it from scratch. This takes time, creates long wait lines, and costs a lot in ingredients and labor.

Caching solves this by saving popular items so they can be served instantly. AWS offers two powerful tools to do this: one for speeding up the application, and one for speeding up delivery around the globe.

AWS uses caching at two main layers:

Application Cache (ElastiCache): Think of this as your app’s short-term memory. It keeps data like database query results or user sessions ready to grab — cutting database reads by up to 90%.

To illustrate, ElastiCache can be compared to an application’s scratchpad. Imagine a scratchpad right next to a librarian (the application). When the librarian looks up a fact (like a user’s session ID or a frequent query result), he writes it on the scratchpad. Next time, he checks the scratchpad before going back to the library, and this can reduce the effort needed by the library (database reads) by 80–90%.

Edge Cache (CloudFront): This is a global network of servers that store your website content closer to users, so pages and images load almost instantly anywhere.
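The scratchpad behaviour of the application cache is usually called the cache-aside pattern. Here is a minimal sketch of it, with a plain dictionary standing in for ElastiCache and a made-up lookup function standing in for the database; none of these names come from an AWS SDK.

```python
class CacheAside:
    """Check the cache first; only go to the database on a miss."""

    def __init__(self, load_from_db):
        self.load_from_db = load_from_db  # function: key -> value
        self.cache = {}                   # stand-in for Redis or Memcached
        self.db_reads = 0                 # counts how often the database is hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]        # cache hit: no database work
        self.db_reads += 1                # cache miss: read from the database
        value = self.load_from_db(key)
        self.cache[key] = value           # remember it for next time
        return value


fake_db = {"user:1": "Asha", "user:2": "Ben"}  # hypothetical data
store = CacheAside(fake_db.get)
results = [store.get("user:1") for _ in range(10)]
```

Ten lookups of the same key cost a single database read, which is where the large reduction in reads comes from.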

2. Redis or Memcached — which should I pick?

ElastiCache offers you a choice between two powerful engines and your choice depends on whether you need features or just raw speed.

Redis has rich features and reliability. It supports persistence, failover, clustering, and even Pub/Sub messaging. Use Redis when you need durability or advanced features. Best use cases include: complex tasks, session stores, and leaderboards.

Memcached, on the other hand, has pure speed and simplicity. It is great for lightweight key-value caching and horizontal scaling when you just need fast lookups. Best use cases include: Simple, basic key-value caching.

3. How does CloudFront deliver content so quickly?

CloudFront is a Content Delivery Network (CDN). It is a worldwide network of mini-warehouses (called edge locations) that hold copies of your static and dynamic content (images, videos, even website code) close to your users.

When a user requests content, CloudFront first checks the nearest edge cache. If the content is already there, it is served instantly. So when a user in London requests an image, CloudFront serves it from the closest European warehouse, not from the main server in America.

If it is not available in the edge cache, CloudFront fetches it from the main server (the Origin), which can be S3 or an ALB, caches it, and serves future requests from the edge. This is how global companies achieve what appear to be instant websites and apps.

4. How do I protect private files while still using CloudFront?

If your files live in an S3 storage bucket, you can use Origin Access Control (OAC). OAC locks down the S3 bucket so no one can bypass the CloudFront network to access the files directly. For extra layers of protection against attacks, you can even add AWS WAF and AWS Shield to ensure your content is both fast and secure.
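The lock-down works through an S3 bucket policy that only lets the CloudFront service read objects, and only on behalf of one specific distribution. Below is a sketch that builds such a policy as a Python dictionary. The bucket name, account ID, and distribution ID are placeholders, and the helper name is my own.

```python
import json


def oac_bucket_policy(bucket: str, account_id: str, distribution_id: str) -> dict:
    """Bucket policy granting read access to one CloudFront distribution only."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowCloudFrontServicePrincipalReadOnly",
            "Effect": "Allow",
            "Principal": {"Service": "cloudfront.amazonaws.com"},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
            # Only requests made for this distribution are accepted.
            "Condition": {"StringEquals": {
                "AWS:SourceArn":
                    f"arn:aws:cloudfront::{account_id}:distribution/{distribution_id}"
            }},
        }],
    }


# Placeholder identifiers for illustration only.
policy = oac_bucket_policy("example-private-assets", "111122223333", "EXAMPLEDISTID")
policy_json = json.dumps(policy, indent=2)
```

With the bucket’s public access blocked and this as the only grant, a direct S3 URL no longer works and everything has to come through CloudFront.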

To sum up, the ultimate strategy is multi-tier caching: combining CloudFront for global delivery and ElastiCache for application-level performance. This results in incredible global speed and massive cost savings by keeping the database rested. Also remember: for reliability, choose Redis when you need persistence and automatic failover, and for security, always secure S3 origins with OAC.

Day 11: AWS Solutions Architect Professional Prep – Database Migration & Global Databases

In today’s digital world, customers won’t wait for slow pages, and businesses can’t afford downtime or data loss. The bar for expected speed, uptime, and reliability is at an all-time high.

To that end, AWS provides powerful, fully managed tools to help you move data safely, serve users globally, and recover quickly when things go wrong.

This lesson aims at breaking down how AWS makes database migration and global replication simple.

Here are four key questions (and answers) that explain how AWS helps organizations handle data movement, migration, and global access without the headaches.

1. How do we move an existing database to the cloud without interrupting the business?

We can use the “digital moving truck”, AWS Database Migration Service (DMS). How does it work? DMS first copies everything from your existing database (a “Full Load”).

Then it switches to Change Data Capture (CDC) to constantly track new transactions, sign-ups, or orders as they happen. The CDC mode allows the old database to keep running while the new one is built. DMS simply copies every new change and transaction in near real-time until you’re ready to switch over. This ongoing replication ensures that when you switch to the cloud version, all your data is up to date and downtime is close to zero. It is simple, smooth, and reliable. Think of the moving truck taking the furniture to your new house and then watching the old house and sending over any new mail or package deliveries in near real-time.
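The two phases described above (full load, then CDC) can be modelled with plain dictionaries. This is only a toy simulation of the idea, not how DMS is driven in practice; the table contents and function names are invented for the example.

```python
def full_load(source: dict) -> dict:
    """Phase 1: copy every existing row to the target."""
    return dict(source)


def apply_changes(table: dict, changes: list) -> None:
    """Apply (operation, key, value) change records in order."""
    for op, key, value in changes:
        if op == "delete":
            table.pop(key, None)
        else:  # "insert" or "update"
            table[key] = value


# Hypothetical data: the old database keeps taking traffic during the migration.
source = {1: "alice", 2: "bob"}
target = full_load(source)

# Changes that arrive after the full load has finished.
change_log = [("insert", 3, "carol"), ("update", 1, "alice-v2"), ("delete", 2, None)]
apply_changes(source, change_log)  # the live database processes them
apply_changes(target, change_log)  # phase 2: CDC replays them on the target

in_sync = source == target  # cut over once the target has caught up
```

Because the change log keeps being replayed, the cutover only has to wait for the last few changes rather than for a full copy.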

DMS can move data between homogeneous databases, i.e. the same database engine (like moving from one MySQL database to another), or between heterogeneous databases, i.e. different engines (like moving from an Oracle database to a PostgreSQL database).

2. What if our new cloud database speaks a different “language”?

The AWS Schema Conversion Tool (SCT) handles the translation for heterogeneous migration. SCT converts the structure, code, functions, and rules (the schema) from one type of database to another. It also gives you a report card on how hard the translation will be. To illustrate, if you move from a foreign country (like Oracle) to a new one (like Aurora PostgreSQL), you don’t just move your furniture; you need to translate your legal documents and instructions.

So, the 2 main tools we need for moving house (migration) are AWS Database Migration Service (DMS), the Data Mover, and AWS Schema Conversion Tool (SCT), the Language Translator. Together, they make cloud migration seamless.

3. How do we make our database fast for users all over the world?

We can use Aurora Global Database, the global relational powerhouse. It is best for relational data (like account records or complex sales transactions). The idea is that one primary region handles all new writes and up to five secondary regions hold live read-only copies. These replicas are typically updated with sub-second latency, so users in Europe, Asia, or America get lightning-fast responses.

If our main region fails, a secondary one can be promoted in under a minute (RTO < 1 minute), thus keeping global operations alive.

4. What if users need to write data from any region, anytime?

We can use DynamoDB Global Tables. It is best used for high-speed, simple data (like user profiles or gaming scores). How does it work? You create replica tables in multiple regions, and all of them can accept new writes simultaneously (active-active). In other words, every region can handle both reads and writes. Each change made anywhere is replicated to every other region, usually within about a second. If the same item is changed in two regions at nearly the same time, DynamoDB automatically keeps the most recent write (the “last writer wins” rule). This makes it a good fit for real-time, global applications like gaming, finance, or user profiles where every millisecond counts.

Expressed differently, if two regions try to change the same record at the same moment, the system resolves the conflict automatically by keeping the write with the latest timestamp. No manual failover is needed for disaster recovery; it’s built in, since all regions can handle writes.
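The “last writer wins” rule is simple enough to sketch directly. The example below is a toy model of the reconciliation idea, assuming each write carries a timestamp; the `Write` class and the sample values are hypothetical, and real conflict handling happens inside DynamoDB.

```python
from dataclasses import dataclass


@dataclass
class Write:
    region: str       # where the write was accepted
    value: str        # the new value for the item
    timestamp: float  # when the write happened


def last_writer_wins(*writes: Write) -> Write:
    """Keep the write with the latest timestamp; the others are discarded."""
    return max(writes, key=lambda w: w.timestamp)


# Two regions update the same player profile a moment apart.
us = Write("us-east-1", "rank=gold", 100.0)
eu = Write("eu-west-1", "rank=silver", 100.5)
winner = last_writer_wins(us, eu)
```

The trade-off is that the earlier write is silently dropped, which is fine for profiles or scores but worth keeping in mind for data where every update matters.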

The bottom line is that mastering database migration and global design on AWS is about knowing which tool fits your need. Need to move existing databases with minimal downtime? Use AWS DMS. Need to translate database code and structure? Use AWS SCT. Need global relational performance? Use Aurora Global Database. Need global NoSQL with multi-region writes? Use DynamoDB Global Tables. Thanks to these tools an organization can move data smoothly and build highly resilient, super-fast applications accessible to anyone, anywhere in the world.

Day 10: AWS Solutions Architect Professional Prep — Databases Deep Dive

Have you ever wondered how the fastest websites manage billions of data requests without crashing? The secret… psst… is managed databases like Amazon RDS and Amazon Aurora.

With these services you can say goodbye to sleepless nights of configuring servers or worrying about backups. These services take care of setup, operation, and scaling so you can focus on what truly matters — your applications and your learning.

Mastering RDS and Aurora is a non-negotiable part of understanding AWS cloud architecture. The goal of this post is to simplify the inner workings of these services.

1. How do I choose between high availability and read speed?

You actually need both, because they solve different problems.

If your goal is High Availability (HA) / Disaster Recovery, you must use Multi-AZ. This is your safety net. It creates an identical standby copy of your database in a separate location (AZ).

Every single data change is copied instantly (synchronously) to the standby. If your main database fails, the standby takes over automatically. This option provides disaster insurance and can be called The Safety Net option.

For Read Speed (Performance Scaling), you use Read Replicas. If thousands of users are just reading data (like checking products or running reports), you send them to these extra copies to take the pressure off your primary database. Data is copied over shortly after (asynchronously). Aurora allows up to 15 replicas. This option provides traffic relief and can be called The Speed Boost option.

Simply put: Multi-AZ = uptime. Read Replicas = speed.
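In application code, using Read Replicas usually means sending writes to the primary endpoint and spreading reads over the replicas. Here is a small sketch of that split. The endpoint names are placeholders, and deciding "read or write" by checking for `SELECT` is a deliberate simplification.

```python
import itertools


class DbRouter:
    """Send writes to the primary and rotate reads across the replicas."""

    def __init__(self, writer: str, readers: list):
        self.writer = writer
        self._readers = itertools.cycle(readers)  # simple round-robin

    def endpoint_for(self, sql: str) -> str:
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._readers) if is_read else self.writer


# Placeholder endpoint names for illustration.
router = DbRouter("primary-db", ["replica-1", "replica-2"])
first = router.endpoint_for("SELECT * FROM products")
second = router.endpoint_for("SELECT * FROM orders")
write = router.endpoint_for("INSERT INTO orders VALUES (1)")
```

Because replication to the replicas is asynchronous, a read that must see a write made a moment ago is safer going to the primary.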

2. Why is Aurora so much faster than a standard database?

AWS states that Aurora delivers up to 5 times the throughput of standard MySQL and up to 3 times that of standard PostgreSQL. Aurora is a next-generation database built specifically for the cloud. It has a unique, distributed architecture.

Key features that make it fast:

1. Aurora separates compute (the DB instance) from storage.

2. Storage is distributed, fault-tolerant, and self-healing. It has 6 copies of your data spread across 3 AZs.

3. Aurora only needs 4 of those 6 copies to confirm a write and that makes it lightning-fast.

4. Because compute and storage are decoupled, Aurora can recover from crashes in under 30 seconds.

It has enterprise-grade speed and resilience built right into its architecture.
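The 4-of-6 write rule from the list above is worth checking with a little arithmetic. The sketch assumes the six copies are spread evenly, two per Availability Zone; the constant and function names are my own.

```python
COPIES = 6         # total copies of the data
WRITE_QUORUM = 4   # acknowledgements needed to confirm a write
COPIES_PER_AZ = 2  # assumption: 6 copies spread evenly over 3 AZs


def write_succeeds(acks: int) -> bool:
    """A write is confirmed once enough copies acknowledge it."""
    return acks >= WRITE_QUORUM


def copies_left(azs_lost: int) -> int:
    """Copies still reachable after losing whole Availability Zones."""
    return COPIES - azs_lost * COPIES_PER_AZ


one_az_down = write_succeeds(copies_left(1))   # 4 copies left
two_azs_down = write_succeeds(copies_left(2))  # 2 copies left
```

So an entire AZ can disappear and writes still go through, because the four remaining copies are exactly enough to meet the quorum.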

3. How does Aurora handle unpredictable traffic without wasting money?

This is handled by Aurora Serverless, which automatically adjusts its power based on how many people are using the application. It starts, stops, and scales automatically based on the workload. Compute capacity is measured in Aurora Capacity Units, or ACUs.

Serverless v1 is suited to testing or infrequent workloads. It scales in coarse steps, doubling or halving its ACUs, and can automatically pause when idle.

Serverless v2 is made for production environments. It scales quickly and precisely, in fine-grained increments as small as 0.5 ACU, up to hundreds of ACUs. You only pay for the compute you actually use.

There is no over-provisioning, no waste; it is just the right power at the right time.

4. What are Aurora’s cool recovery and scaling features?

Aurora has tools that make recovery and global scaling feel almost magical.

Backtrack: This feature is like an undo button for your database. You can quickly roll back data to a previous point in time without a long, full restoration from a backup. You might call this Aurora’s “time machine”.

For global reach, the Aurora Global Database feature lets you keep fast, low-latency copies of your data running in different regions around the world for massive global applications or disaster recovery. It replicates data across regions with typically under 1 second of latency, which is perfect for low-latency apps and cross-region disaster recovery.

Another feature that makes recovery quick and easy is:

Point-in-Time Recovery (PITR): This feature is available in both RDS and Aurora. Thanks to PITR, you can restore your database to any specific second within your retention window (up to 35 days). This is possible since AWS stores continuous backups in S3.

Some of the big take-aways from today’s lesson are:

  1. Multi-AZ is not the same as Read Replicas. Multi-AZ keeps you online; Read Replicas keep you fast.
  2. Aurora’s architecture has 6 copies across 3 AZs and that is what gives it its speed and reliability.
  3. Always secure your data. Use KMS encryption for storage and snapshots.

Day 9: AWS Solutions Architect Professional Prep — Storage Deep Dive

The right choice of digital storage service determines your speed, your ability to share data, and ultimately, your monthly bill, so it is a critical decision. Amazon Web Services (AWS) offers several primary ways to store data and today’s lesson examines how to pick the right one.

Here are four key questions that can guide us in selecting and optimizing AWS storage:

Q1: What are the three main types of storage, and when do I use each one?

There are three primary services, differentiated by the kind of data access they offer:

1. EBS (Elastic Block Store): It is like a personal hard drive. It is best for the computer’s operating system (boot volumes) and crucial, constantly changing data, like customer databases. Generally, only one virtual computer can use it at a time.

2. EFS (Elastic File System): This is comparable to a shared network folder. It is best for sharing data among many virtual computers simultaneously. It grows and shrinks automatically as you add files. EFS is described as being fully managed and elastic, meaning it handles all the growing and shrinking on its own.

Because it is a regional service, the data is spread across multiple locations (AZs) for high availability.

3. FSx Family: FSx is like a specialized enterprise server. It provides tailored file systems for specific enterprise tasks. It is best for handling complex business needs, like running specific Windows applications, Active Directory integration, or doing heavy-duty data crunching (HPC).

It offers specialized, fully managed file systems (for example, ones that speak the SMB protocol or run Lustre).

For Windows: FSx for Windows File Server works perfectly if your company uses Microsoft’s protocols (SMB) and Active Directory. It supports high availability across multiple locations (Multi-AZ).

For Supercomputers: FSx for Lustre is built for extreme speed, great for things like machine learning and big data analytics. It can even automatically sync data with S3 storage.

Q2: How do these services handle growth and sharing?

The key differences are in scaling and shared access:

EBS generally allows only a single instance to access the volume, although there is a Multi-Attach feature for io1/io2 volumes in the same AZ. A manual process is required to scale.

EFS is designed for multiple instances and uses automatic, elastic scaling so it can grow and shrink automatically as you add files. It is a regional service so it is not limited to one Availability Zone (AZ).

FSx also supports multiple instances but uses managed scaling.

Q3: How can I save money on my cloud storage?

You can implement lifecycle management and choose cost-effective volume types: gp3 and EFS Infrequent Access.

For EBS, the recommendation is to always choose the gp3 volume type over the older gp2. The gp3 volume provides custom IOPS and throughput, and, for the same performance, using it can be about 20% cheaper.

EFS Lifecycle Management is the biggest cost-saver. You can enable policies to automatically move files that haven’t been touched after a certain number of days (say, 30 days) into the Infrequent Access (IA) storage class. This move can cut your storage costs for that cold data by up to 92%.
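The lifecycle rule boils down to "anything not accessed in N days moves to IA". EFS applies this for you once the policy is enabled, so the sketch below only illustrates the selection logic. The file names and dates are invented.

```python
from datetime import datetime, timedelta


def files_to_move_to_ia(last_access: dict, now: datetime, days: int = 30) -> list:
    """Return the files whose last access is older than the policy window."""
    cutoff = now - timedelta(days=days)
    return sorted(name for name, accessed in last_access.items() if accessed < cutoff)


# Hypothetical file system contents with last-access timestamps.
now = datetime(2024, 6, 30)
last_access = {
    "reports/2023-summary.csv": datetime(2024, 5, 1),   # 60 days ago: cold
    "logs/app.log": datetime(2024, 6, 25),              # 5 days ago: still hot
    "archive/old-backup.tar": datetime(2024, 1, 15),    # months ago: cold
}
cold_files = files_to_move_to_ia(last_access, now)
```

Only the cold files change storage class; anything touched within the window stays in the standard class at full speed.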

General Optimization: Automate EBS snapshot creation using the Data Lifecycle Manager (DLM) and regularly delete unused snapshots.

Q4: Which storage options are best for the absolute highest performance?

To optimize performance, faster drive types like io2 can be used.

For extreme speed for specialized or critical workloads, there are specific options:

For Databases, use the io2 or io2 Block Express types, which are designed for high-performance, latency-critical workloads and support up to 256,000 IOPS.

For Supercomputing, use FSx for Lustre. This file system is purpose-built for high-performance computing (HPC), machine learning (ML), and analytics. It can integrate directly with S3 buckets and automatically sync changes.

To conclude, EBS is block storage and is best used for Databases, OS, Boot Volumes. EFS is file storage and is best used for shared application data. FSx is file storage and is best used for Enterprise Workloads, High Performance Computing (HPC).

Day 8: AWS Solutions Architect Professional Prep — Load Balancing Deep Dive

How do major websites stay fast even when millions of people are clicking at the same time? They avoid virtual traffic jams using something called an AWS Load Balancer (ELB).

These tools might be compared to a specialized traffic manager whose job is to distribute incoming requests efficiently across multiple servers, sending each request to the best available server so that no single server gets overloaded. The value they provide is better speed, better stability, and a better customer experience.

Q1: What are the three main types of AWS Load Balancers?

AWS Elastic Load Balancing, or ELB, offers three main types of load balancers, each specialized for a different job: the Application Load Balancer (ALB), the Network Load Balancer (NLB), and the Gateway Load Balancer (GWLB).

  1. Application Load Balancer (ALB): This is the Smart Router because it does not just look at the network address of the request but actually reads the content. It handles web protocols like HTTP, HTTPS, and WebSockets. It operates at Layer 7 (Application).
  2. Network Load Balancer (NLB): This is the High-Speed Express Lane; it offers ultra-low latency. It handles basic network traffic like TCP, UDP, and TLS. It operates at Layer 4 (Transport).
  3. Gateway Load Balancer (GWLB): This is the Security Checkpoint; it is designed specifically for security and network appliances. It operates at Layer 3 (Network).

Q2: Which Load Balancer should I use for a standard website or API?

You should choose the Application Load Balancer (ALB). It is ideal for modern web apps, microservices, and API routing.

The ALB operates at Layer 7 (the Application layer). This means it is sophisticated enough to read the URL path or HTTP headers of a request. It does not just look at where the traffic is going but actually reads the content of the request.

If you ask the ALB for /images, it knows to send you to the server that only handles photos. If you ask for /checkout, it sends you to the server that handles payments. This is known as path-based routing. The ALB is also the only type that supports advanced features like maintaining sticky sessions using cookies so a user stays on the same server.
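Path-based routing can be pictured as an ordered list of rules where the first match wins and a default catches everything else. The sketch below mimics that behaviour with simple prefix checks. The target group names are made up, and real ALB rules support richer conditions than this.

```python
# (path prefix, target group) pairs, checked in priority order.
RULES = [
    ("/images", "image-servers"),
    ("/checkout", "payment-servers"),
]
DEFAULT_TARGET_GROUP = "web-servers"


def pick_target_group(path: str) -> str:
    """Return the target group for the first rule whose prefix matches."""
    for prefix, target_group in RULES:
        # Match the prefix itself or anything underneath it.
        if path == prefix or path.startswith(prefix + "/"):
            return target_group
    return DEFAULT_TARGET_GROUP  # no rule matched


photo = pick_target_group("/images/cat.png")
payment = pick_target_group("/checkout")
other = pick_target_group("/about")
```

This is what lets one load balancer sit in front of several microservices, each with its own group of servers.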

Q3: Which Load Balancer should I use if I need extreme speed or I am handling gaming traffic?

You should choose the Network Load Balancer (NLB). The NLB operates at Layer 4 (the Transport layer). It handles TCP, UDP, and TLS traffic, and does not understand HTTP content. This means that it does not look at the content of the request like an ALB does but just looks at the basic port and protocol.

It is built for extreme performance and ultra-low latency so it can handle millions of requests per second. This makes it ideal for high-volume TCP/UDP microservices, gaming, and real-time applications. You may call it  the High-Speed Express Lane. A key feature is that it preserves the client’s original source IP address, which is helpful for backend security and logging.

Q4: How do Load Balancers know if a server is healthy?

Regardless of which load balancer you choose, it needs a way to constantly monitor the servers it directs traffic to, so that it knows whether they are actually ready to receive requests. It does this through health checks.

An ALB sends an HTTP or HTTPS request and expects a successful status code back. The return of a status code within the range 200–399 usually indicates success. A best practice is to use a lightweight path, like /health, for this check.

An NLB supports TCP, HTTP, or HTTPS health checks. A common check is a simple TCP check to see if the server port is open and listening. If a server fails too many checks, it is quickly removed from the rotation to prevent users from seeing errors. The thresholds can be adjusted for faster failover in critical systems.

GWLB relies on target group health checks, using TCP or custom ports.
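The "fails too many checks" logic described in this section is a pair of counters: a target flips to unhealthy after a run of consecutive failures and back to healthy after a run of consecutive successes. Here is a sketch of that state machine. The default thresholds are arbitrary example values, and the class is illustrative rather than any AWS API.

```python
def is_success(status_code: int) -> bool:
    """Treat 200-399 as a passing HTTP health check, as described above."""
    return 200 <= status_code <= 399


class TargetHealth:
    """Track one target's state using consecutive-result thresholds."""

    def __init__(self, unhealthy_threshold: int = 2, healthy_threshold: int = 3):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with the state

    def record(self, passed: bool) -> bool:
        if passed == self.healthy:
            self._streak = 0  # result agrees with the current state
            return self.healthy
        self._streak += 1
        needed = self.healthy_threshold if passed else self.unhealthy_threshold
        if self._streak >= needed:
            self.healthy = passed  # enough evidence: flip the state
            self._streak = 0
        return self.healthy
```

Lowering the unhealthy threshold makes failover faster, at the cost of reacting to a single slow response more often.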

Q5: When should I use the Gateway Load Balancer (GWLB)?

The Gateway Load Balancer (GWLB) is specialised. It is used when network traffic must first pass through a security scanner (like a firewall or intrusion detection system) before reaching its final destination.

It might be called an appliance traffic orchestrator. It operates at Layer 3 (Network) but uses a special protocol (GENEVE encapsulation) to ensure all traffic passes through the security appliance before moving forward. This allows the security gear to scale independently without complex routing changes. It is the best choice for security inspection architectures.

To summarize the crucial differences, ALB is Layer 7 (Application) and is the advanced router that is necessary for path-based routing and features like sticky sessions. NLB is Layer 4 (Transport) and is built for raw speed, ultra-low latency, and preserving the source IP. GWLB is the Appliance Handler and its job is to manage traffic for integrated security systems. Health checks are critical for maintaining the stability of applications and enabling fast failure detection.

Day 7: AWS Solutions Architect Professional Prep — Auto Scaling & High Availability Patterns

How do major online services stay fast even when millions of people log in at once? How do they survive a major regional power outage without skipping a beat? Today’s lesson examines two foundational architectural principles that make it possible, namely, Auto Scaling and High Availability.

The key questions I examined today were:

What are the fundamental components of Auto Scaling that enable automatic capacity management?

Which scaling policy is best suited for handling predictable, recurring spikes in user traffic?

How does the system ensure fault tolerance and automatically replace instances that fail?

What is the required architectural pattern for designing highly available applications?

Question 1: What are the fundamental components of Auto Scaling that enable automatic capacity management?

Auto Scaling is the process of automatically adjusting server capacity by adding or removing EC2 instances based on demand. It is like having a smart manager for your fleet of servers who automatically adds servers when you get busy and removes servers to save money when traffic slows down.

It has three core components:

1. The Launch Template. It defines how new instances are configured (e.g., AMI, type, security groups). The recommendation is to use Launch Templates for modern designs because they support versioning and multiple instance types.

2. The Auto Scaling Group (ASG). This is the logical unit managing the group of EC2 instances. Key ASG parameters include a min size (the smallest number of servers always running), a max size (the absolute capacity limit), and a desired capacity (the target number of instances).

3. Scaling Policies. They are the rules which dictate exactly when and how capacity should change.

Question 2: Which scaling policy is best suited for handling predictable, recurring spikes in user traffic?

There are four main scaling policy types; three are critical to remember.

1. Predictive Scaling: This uses machine learning to forecast demand, so it is ideal for predictable, recurring traffic spikes, such as 9-to-5 office hours or nightly batch jobs. The system scales out capacity before the traffic actually arrives.

2. Target Tracking Scaling: This is the simplest and most common policy. It is like setting a thermostat. You set the goal, e.g. keep the average CPU utilization at around 50%. If the temperature (CPU) goes up, the system automatically adds capacity until the goal is met, and it removes capacity again when the load drops.

3. Step Scaling: This scales incrementally based on specific threshold breaches. E.g., add 1 instance if CPU hits 60%; add 2 instances if it hits 80%.
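The thermostat and the step rules can both be written as small functions over the ASG’s min, max, and current size. The target-tracking formula below is a simplified proportional model that I am assuming for illustration (AWS does not document its exact calculation this way); the step thresholds are the ones from the example above.

```python
import math


def clamp(desired: int, min_size: int, max_size: int) -> int:
    """Keep the desired capacity inside the ASG's min and max size."""
    return max(min_size, min(max_size, desired))


def target_tracking(current: int, metric: float, target: float,
                    min_size: int, max_size: int) -> int:
    """Scale capacity in proportion to how far the metric is from the target."""
    desired = math.ceil(current * metric / target)
    return clamp(desired, min_size, max_size)


def step_scaling(current: int, cpu: float, min_size: int, max_size: int) -> int:
    """Add 1 instance at 60% CPU and 2 instances at 80% CPU."""
    if cpu >= 80:
        add = 2
    elif cpu >= 60:
        add = 1
    else:
        add = 0
    return clamp(current + add, min_size, max_size)
```

For instance, four instances averaging 75% CPU against a 50% target work out to six instances, and the max size always has the final say.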

Question 3: How does the system ensure fault tolerance and automatically replace instances that fail?

Fault tolerance relies heavily on Health Checks. Health Checks determine if an instance is operational. The ASG is constantly checking the health of servers. If a server fails its check – freezes up or stops responding – the ASG does not try to fix it. It immediately fires that instance and spins up a brand new, healthy replacement automatically.

While the default is the EC2 status check, it is advanced practice to combine this with a check through the load balancer (ELB check) to ensure the instance is serving traffic correctly (end-to-end resilience).

Lifecycle Hooks are like a pause button that allows you to hold an instance briefly during a transition to ensure everything is handled gracefully. When a new server starts up or an old one shuts down, it might need a minute for running configuration scripts, attaching necessary monitoring agents, or gracefully deregistering the instance before shutdown. These hooks pause the instance during this phase.

Question 4: What is the required architectural pattern for designing highly available applications?

To ensure service never goes down, redundancy is key. Never put all your eggs in one basket, as the expression goes.

Multi-AZ Architecture: This means you don’t put all your servers in one data center location (Availability Zone). If you did and there were a power outage in that location, the whole application would go down.

Using a Multi-AZ pattern involves spreading servers across at least two Availability Zones so that if one fails, the others keep running, which is required for high availability. This pattern usually uses an Application Load Balancer (ALB) combined with an ASG spanning the AZs to ensure fault-tolerant load balancing and redundancy.

Multi-Region Active-Passive: In this design, a primary region handles traffic while a secondary region waits with minimal standby resources, using Route 53 Failover Routing for switchover.

Multi-Region Active-Active: In this global strategy, both regions serve traffic simultaneously, which requires global data replication and often uses Route 53 Latency Routing to send users to the closest region.
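
The two routing behaviours can be compared with a small Python sketch. This only models the decision logic; `failover_route` and `latency_route` are hypothetical helper names of mine, and the latency numbers are made up for illustration.

```python
from typing import Optional


def failover_route(primary: str, secondary: str, healthy: set[str]) -> Optional[str]:
    """Active-passive: always use the primary unless its health check fails."""
    if primary in healthy:
        return primary
    return secondary if secondary in healthy else None


def latency_route(latency_ms: dict[str, float], healthy: set[str]) -> Optional[str]:
    """Active-active: send the user to the healthy region with the lowest latency."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    return min(candidates, key=candidates.get) if candidates else None


# Illustrative latencies as seen by one user (assumed values).
user_latency = {"us-east-1": 120.0, "eu-west-1": 35.0}

print(failover_route("us-east-1", "eu-west-1", healthy={"us-east-1", "eu-west-1"}))
print(failover_route("us-east-1", "eu-west-1", healthy={"eu-west-1"}))
print(latency_route(user_latency, healthy={"us-east-1", "eu-west-1"}))
print(latency_route(user_latency, healthy={"us-east-1"}))
```

In the active-passive case the secondary only sees traffic after a failure, whereas in the active-active case every healthy region is a candidate for every request.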

To sum up, here are a few highlights from the lesson:

The Auto Scaling Group (ASG) manages both automatic EC2 scaling and automated replacement of unhealthy instances.

Use Launch Templates over Launch Configurations for modern designs.

Choose Predictive Scaling for managing forecasted or cyclical demand.

For high availability, use a Multi-AZ architecture.

See you tomorrow!

Day 6 AWS Solutions Architect Exam Prep: Advanced EC2 Design — Placement Groups, Instance Types, AMIs, and Spot Instances

Today I covered the foundational pillars of advanced EC2 design. I reviewed the answers to the following questions:

1. How do I choose the single best type of machine for my unique job?

Choosing the best type of machine or instance type is like buying a computer from a store. There may be different departments and what you choose will likely be based on factors like the purpose of the purchase and the price. AWS has 5 main “departments” so you don’t overspend or buy what is not suited for you.

  1. General Purpose (T/M family): It is good for most things. It offers a balanced mix of computing power, memory, and networking.
  2. Compute Optimized (C family): This is great for heavy math or scientific work that needs a fast brain (CPU). When you hear scientific modeling or CPU-bound workloads, think of compute optimised.
  3. Memory Optimized (R/X family): This server family is the best choice if your job needs to hold massive amounts of data in quick-access memory. When you hear in-memory databases and caching applications, think of memory optimised.
  4. Storage Optimized (e.g., I4i, D3): This type is optimized for reading and writing large files very quickly. You may call it the filing expert. When you hear high sequential I/O (Input/Output), think of storage optimised.
  5. Accelerated Computing (e.g., P5, G6, F1): These instances are often used for machine learning (ML) or advanced graphics and use specialized hardware like GPUs or FPGAs.

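To cut costs, look into Graviton-based instances (e.g., t4g, c7g). Graviton is AWS’s own Arm-based chip, so it is like switching from Intel to a newer, more efficient processor. Savings in the range of 20–40% are often quoted, but they depend on the workload and assume your software runs on Arm.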

2. How can I make my servers talk to each other as fast as humanly possible?

Where AWS physically places your servers in its data centers affects speed and reliability. Placement Groups let you control the physical organization of your EC2 instances to optimize for either speed or fault tolerance. We might compare placement groups to a seating chart. There are three types:

Cluster placement group is the type in which instances are placed close together in one AZ, like everyone sitting shoulder-to-shoulder (same rack in one location), for near-instant communication. This is the best choice for ML or high-performance computing (HPC) in which every millisecond counts. This is the low-latency, high-bandwidth solution.

Spread Group is the resilient, maximum fault isolation option. Instances are placed on distinct racks: everyone gets their own separate desk, so if one desk breaks, the others are fine. This is the best way to protect small, critical applications from hardware failure, with a limit of 7 running instances per AZ per group.

Partition is the scalable isolation option. It is comparable to dividing the room into several isolated groups. This is used for Big Data (Hadoop, Kafka, Cassandra) where you need hundreds of machines, but you still want some hardware separation between those large groups.
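
Two of these rules are easy to express as a quick Python sketch: the size limit of a spread group and the way a partition group keeps large sets of nodes on separate hardware. The round-robin assignment is my own simplification for illustration (AWS can distribute instances across partitions for you), and the node names are hypothetical.

```python
SPREAD_LIMIT_PER_AZ = 7  # running instances per AZ in one spread group


def max_spread_instances(num_azs: int) -> int:
    """Upper bound on the size of one spread group across several AZs."""
    return SPREAD_LIMIT_PER_AZ * num_azs


def assign_partitions(node_names: list[str], num_partitions: int) -> dict[int, list[str]]:
    """Round-robin nodes across partitions so that losing one partition
    (one set of racks) only takes out a fraction of the cluster."""
    partitions: dict[int, list[str]] = {p: [] for p in range(num_partitions)}
    for index, name in enumerate(node_names):
        partitions[index % num_partitions].append(name)
    return partitions


print(max_spread_instances(3))
print(assign_partitions(["kafka-1", "kafka-2", "kafka-3", "kafka-4", "kafka-5"], 3))
```

A replicated system such as Kafka or Cassandra can then place replicas in different partitions so that a single hardware failure never hits every copy.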

3. How do I guarantee that every new server I launch is perfectly identical?

By using an Amazon Machine Image (AMI). An AMI is a pre-configured template that serves as a blueprint, containing the OS, software, permissions, and necessary configurations. Using Custom AMIs built by the team ensures consistency across all deployed servers.

Types of AMIs: AWS Managed, Marketplace, and Custom (built by the user)

Sharing Blueprints: AMIs can be easily shared across different AWS accounts, but the other account also needs access to the underlying snapshots.

Going Multi-Region: If you need to use that identical setup in a different region (e.g. moving from US-East to EU-West), you must first copy the AMI to the new region.

Best Practices: 1. Tag AMIs with version numbers 2. Use automation tools like Packer or EC2 Image Builder 3. Encrypt AMIs for compliance

4. Where can I get cloud capacity for up to 90% off, and what’s the risk?

By using Spot instances. Spot Instances let you rent AWS’s spare computing power for up to 90% less than usual. The risk? If AWS suddenly needs that spare power back, they will kick your job off the server with only a 2-minute notice before termination.

While they should never be relied on solely for critical applications, they are great for flexible jobs that don’t need to run 24/7, like CI/CD, batch processing or rendering animations, where it is okay if they stop and start again later.

The best way to use them is to mix them with standard-priced servers (On-Demand) so that the whole system does not crash if the Spot capacity runs out.

In conclusion, to build truly advanced, resilient, and cost-efficient cloud architectures, we must: 1. select the perfect Instance Type, 2. place them correctly with Placement Groups, 3. standardize with AMIs, and 4. optimize costs with Spot Instances.
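
A bit of arithmetic shows why the mix is attractive. The sketch below is a rough calculation only: the hourly price, the 70% Spot discount, and the function name are assumptions I picked for illustration, and real Spot prices vary by instance type and over time.

```python
def blended_hourly_cost(total_instances: int, on_demand_share: float,
                        on_demand_price: float, spot_discount: float) -> float:
    """Hourly cost of a fleet that keeps a guaranteed On-Demand base
    and fills the rest with discounted Spot capacity."""
    on_demand_count = round(total_instances * on_demand_share)
    spot_count = total_instances - on_demand_count
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_count * on_demand_price + spot_count * spot_price


# Assumed numbers: 10 instances at $0.10/hour On-Demand, Spot 70% cheaper.
all_on_demand = blended_hourly_cost(10, 1.0, 0.10, 0.70)
mixed = blended_hourly_cost(10, 0.3, 0.10, 0.70)
print(f"all On-Demand: ${all_on_demand:.2f}/hour")
print(f"30% On-Demand + 70% Spot: ${mixed:.2f}/hour")
```

The On-Demand share is the part of the fleet that keeps running even if every Spot instance is reclaimed at once.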

Day 5: AWS Solutions Architect Professional Prep: Advanced VPC Networking

Today’s lesson revolved around how to set up a network in a way that is scalable, secure, and highly available (HA). These were the big questions I delved into:

1. How do Security Groups (SGs) and NACLs differ in function?

2. Which type of Endpoint should I choose for private AWS connectivity?

3. What is the scalable solution for connecting many VPCs?

4. How should NAT Gateways be deployed for High Availability (HA)?

First, a brief overview of key concepts. Your Virtual Private Cloud (VPC) is your entire network setup in the cloud. We might compare it to a large building or campus. If the VPC is your own private property, you would no doubt organise it into rooms; subnets are the specialized rooms within that building.

Continuing that analogy, the Public Subnet would be the front door since it has a direct route to the street, which is the Internet Gateway (IGW). Any resource here needs a public address to be accessible from the internet. We might call the Private Subnet the back office: this room has no direct route to the street, or IGW. If resources here need to communicate with the internet (e.g., for updates), they must use a dedicated, shared payphone, the NAT Gateway, located in a public subnet. An isolated subnet is the vault, a room that is completely locked down and has no route to the street (IGW) or the payphone (NAT Gateway). It is ideal for highly sensitive data like internal databases and backups.

Every room has a Route Table, a map that tells the traffic leaving that room where to go.

1. How do Security Groups (SGs) and NACLs differ in function?

We have two main tools or security layers to protect our resources: SGs and NACLs.

A Security Group (SG) is a bodyguard for a single resource like an EC2 instance or ENI. SGs are stateful: If the bodyguard allows traffic in based on an inbound rule, it automatically remembers and allows the return traffic. They only use allow rules.

A NACL (Network Access Control List) is a fence around the entire subnet. NACLs are stateless: you must explicitly write rules for traffic in and traffic out. For inbound traffic, the NACL (Fence) is checked before the Security Group (Bodyguard). NACLs are best used for coarse perimeter filtering, especially to implement deny rules (e.g., blocking known malicious IP ranges).

Remember: whereas SGs are applied per resource (instance, ENI), NACLs are applied per subnet.
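
The fence-then-bodyguard order for inbound traffic can be modelled in plain Python. This is a toy evaluation model, not an AWS API: the `NaclRule` class and the helper functions are hypothetical names, and the IP ranges are documentation-only example addresses.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int   # lower rule numbers are evaluated first
    cidr: str
    port: int
    allow: bool   # NACLs support both allow and deny


def nacl_allows(rules: list[NaclRule], source_ip: str, port: int) -> bool:
    """Subnet fence: the first matching rule wins; no match means deny."""
    addr = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r.number):
        if addr in ipaddress.ip_network(rule.cidr) and rule.port == port:
            return rule.allow
    return False


def sg_allows(allowed: set[tuple[str, int]], source_ip: str, port: int) -> bool:
    """Bodyguard: only allow rules exist, anything not listed is dropped."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) and p == port for cidr, p in allowed)


def inbound_permitted(nacl: list[NaclRule], sg: set[tuple[str, int]],
                      source_ip: str, port: int) -> bool:
    """Inbound traffic must pass the fence first, then the bodyguard."""
    return nacl_allows(nacl, source_ip, port) and sg_allows(sg, source_ip, port)


nacl = [
    NaclRule(90, "203.0.113.0/24", 443, allow=False),  # block a known bad range
    NaclRule(100, "0.0.0.0/0", 443, allow=True),
]
sg = {("0.0.0.0/0", 443)}

print(inbound_permitted(nacl, sg, "203.0.113.7", 443))   # blocked at the fence
print(inbound_permitted(nacl, sg, "198.51.100.4", 443))  # passes both layers
print(inbound_permitted(nacl, sg, "198.51.100.4", 22))   # no rule allows SSH
```

The sketch only covers the inbound direction. Because NACLs are stateless, the reply traffic would need its own outbound rule, whereas the Security Group would let the reply out automatically.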

2. Which type of Endpoint should I choose for private AWS connectivity?

If your private resources need to use AWS services without traversing the internet or NAT Gateway, you use Endpoints. The type depends on the service:

When accessing S3 or DynamoDB, use a Gateway Endpoint. This is a free, dedicated route defined in your route table via a prefix list.

When accessing private service APIs (like KMS, SSM, or third-party SaaS services), use an Interface Endpoint (PrivateLink). This works by creating a dedicated network card (ENI) inside your subnet.
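
As a memory aid, the rule of thumb above fits in a tiny Python helper. The function name and the lowercase service labels are my own shorthand, not official identifiers.

```python
GATEWAY_SERVICES = {"s3", "dynamodb"}  # the two services named above


def endpoint_type(service: str) -> str:
    """Rule of thumb from these notes: gateway for S3/DynamoDB, interface otherwise."""
    return "gateway" if service.strip().lower() in GATEWAY_SERVICES else "interface"


for name in ("S3", "DynamoDB", "KMS", "SSM"):
    print(name, "->", endpoint_type(name))
```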

3. What is the scalable solution for connecting many VPCs?

VPC Peering is like a secret tunnel joining two houses. It is a simple, low-latency, point-to-point connection which is great for connecting two neighboring VPCs. The downside is that it does not scale well. Peering is non-transitive: traffic cannot pass through one peer to reach a third VPC, so every pair of VPCs that needs to talk requires its own connection. If you have 10 VPCs and want all of them to reach each other, you need 45 peering connections.

Transit Gateway (TGW), on the other hand, is the scalable, centralized hub-and-spoke solution. All VPCs attach to the TGW, thus allowing traffic to flow between any of them. Use TGW for enterprise scale involving many VPCs and connections back to your on-premises network.
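
The scaling difference is simple arithmetic: a full mesh of n VPCs needs n(n-1)/2 peering connections, while a hub-and-spoke design needs one attachment per VPC. A short Python sketch (function names are mine) shows how quickly the gap grows:

```python
def peering_connections(vpcs: int) -> int:
    """Full mesh: every pair of VPCs needs its own peering connection."""
    return vpcs * (vpcs - 1) // 2


def tgw_attachments(vpcs: int) -> int:
    """Hub-and-spoke: each VPC attaches once to the Transit Gateway."""
    return vpcs


for n in (3, 10, 50):
    print(f"{n} VPCs: {peering_connections(n)} peerings vs {tgw_attachments(n)} TGW attachments")
```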

4. How should NAT Gateways be deployed for High Availability (HA)?

Remember the payphone analogy? The NAT Gateway, our payphone, is AZ-specific. So if you only install one payphone in one specific area (Availability Zone), two things happen:

1. If that area fails, all your private rooms in all other areas lose their outbound calling ability.

2. All traffic crossing to use that single payphone costs you extra money.

The solution? For robust design, you need a payphone in every area where you have private rooms. In other words, in each Availability Zone where you have private subnets, you must deploy a separate NAT Gateway to ensure high availability and avoid cross-AZ data transfer costs.
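
The one-payphone-per-area rule can be written as a small planning helper in Python. Everything here is hypothetical naming for illustration: the subnet names, the `nat-<az>` labels, and the two functions are not real AWS identifiers.

```python
def nat_gateways_needed(private_subnets: dict[str, str]) -> int:
    """One NAT Gateway per AZ that contains at least one private subnet.

    private_subnets maps a subnet name to the AZ it lives in.
    """
    return len(set(private_subnets.values()))


def plan_nat_routes(private_subnets: dict[str, str]) -> dict[str, str]:
    """Point each private subnet's default route at the NAT Gateway in its own AZ,
    so an AZ failure elsewhere does not cut it off and no traffic crosses AZs."""
    return {subnet: f"nat-{az}" for subnet, az in private_subnets.items()}


subnets = {
    "app-a": "us-east-1a",
    "db-a": "us-east-1a",
    "app-b": "us-east-1b",
}
print(nat_gateways_needed(subnets))
print(plan_nat_routes(subnets))
```

Two private subnets in the same AZ can share one NAT Gateway; what matters is that no subnet depends on a gateway in a different AZ.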

These were my big takeaways from today’s lesson:

  1. SGs are stateful and applied at the instance level, while NACLs are stateless and checked first at the subnet level, allowing you to use deny rules for blocking traffic.
  2. Use the free Gateway Endpoint for S3 and DynamoDB, and use the Interface Endpoint for other services and APIs.
  3. Use Transit Gateway for enterprise scenarios involving many VPCs (hub-and-spoke). VPC peering is non-transitive.
  4. For High Availability (HA) in production, deploy one NAT Gateway per AZ. Do not assume a NAT Gateway is multi-AZ.

See you tomorrow!

Day 4: AWS Solutions Architect Professional Prep – Disaster recovery

Today’s lesson is all about how to create a digital safety net. A technical failure, a fire, or even a simple human error: any of these could cause a digital system to stop working, with dire consequences for a business. Every minute an e-commerce site is down during peak shopping season means lost revenue and unhappy customers. Losing even a few seconds of data could cost a financial trading platform millions.

What digital insurance policy, or plan, can a business design to get back up and running as quickly as possible while losing the minimum amount of data? In my lessons, I examined the following fundamental questions about Disaster Recovery (DR):

1. What are the two measurements that dictate our entire recovery strategy?

2. What is the fastest, most resilient recovery plan, and when is it necessary?

3. If we need to save money, what is the simplest, low-cost recovery plan?

4. How do we find a balance between speed and cost?

1. What are the two measurements that dictate our entire recovery strategy?

There are two golden rules of recovery, or 2 key questions to ask when things go wrong:

How quickly can we get the service back up and running?

How much data are we willing to lose?

These two questions map to the 2 fundamental metrics of DR: Recovery Time Objective, or RTO, and Recovery Point Objective, or RPO.

  1. Recovery Time Objective (RTO): RTO is the maximum acceptable time between the failure and the service being restored. If your car breaks down, the RTO is how long you can afford to wait for the tow truck to arrive and fix it. For a major shopping website during a sale, the RTO might be 5 minutes: you fix it almost instantly, or you lose money. By contrast, for a system that only archives old files, 24 hours might be fine.
  2. Recovery Point Objective (RPO): RPO is the maximum amount of data, measured in time, that you can afford to lose. In practice, the actual loss is the gap between the incident and your last successful backup. If you are typing a paper and the power goes out, you lose everything since the last time you saved. If your RPO is 20 minutes, you must save at least every 20 minutes, and you might lose up to 20 minutes of work. A financial trading system needs “near-zero” RPO, meaning it can’t lose any transactions; real-time replication is required. For analytics data, you might be able to tolerate an RPO of 4 hours.
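
Here is a minimal Python sketch of how the two metrics are checked against a real incident. The timestamps and function names are made up for illustration; the point is only that RPO is measured backwards from the incident and RTO forwards from it.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, incident: datetime) -> timedelta:
    """Everything written after the last good backup is lost."""
    return incident - last_backup


def meets_objectives(last_backup: datetime, incident: datetime, restored: datetime,
                     rpo: timedelta, rto: timedelta) -> dict[str, bool]:
    """Compare one incident against the agreed RPO and RTO."""
    return {
        "rpo_met": data_loss_window(last_backup, incident) <= rpo,
        "rto_met": (restored - incident) <= rto,
    }


# Hypothetical incident timeline.
last_backup = datetime(2024, 1, 1, 12, 0)
incident = datetime(2024, 1, 1, 12, 15)
restored = datetime(2024, 1, 1, 13, 30)

print(data_loss_window(last_backup, incident))
print(meets_objectives(last_backup, incident, restored,
                       rpo=timedelta(minutes=20), rto=timedelta(minutes=30)))
```

In this timeline only 15 minutes of data are lost, which is within a 20-minute RPO, but the service took 75 minutes to come back, which misses a 30-minute RTO.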

2. What is the fastest, most resilient recovery plan, and when is it necessary?

There are four fundamental DR strategies. Just like in choosing a car, you ask yourself: Do you need a Ferrari (fastest, most expensive) or will a dependable commuter car (slowest, cheapest) suffice?

The fastest plan is the Multi-Region Hot Site Strategy. You might call this the Identical Twin Factory: you have two full factories running at the same time. This requires running a complete, full-scale, mirrored copy of your entire application in a second region, potentially in an Active-Active setup where both regions handle traffic simultaneously. Data is copied between the two sites in real-time using services like Aurora Global Database or DynamoDB Global Tables.

This is the gold standard for resiliency and the highest cost. Although the investment is the highest, the returns are near-instantaneous recovery.

RTO/RPO: This high-cost strategy provides an RTO of seconds to minutes (via automated failover) and a near-zero RPO.

This strategy is best for mission-critical systems, such as financial trading platforms or high-traffic e-commerce during peak seasons.

3. If we need to save money, what is the simplest, low-cost recovery plan?

Backup & Restore Strategy is the simplest and lowest-cost approach. It can be called the Archive Approach, or storing copies in a safe deposit box.

It involves relying entirely on automated backups of data (using services like AWS Backup or S3 Cross-Region Replication) and storing them cheaply in services like S3 or Glacier Deep Archive. If the primary region fails, you first build new infrastructure from scratch and then load the data back in.

RTO/RPO: The cost is lowest, but the RTO is the longest, ranging from 8 to 24+ hours. The RPO is also large, typically 1 to 24 hours, depending on how often backups occur.

This strategy is best for non-critical systems, such as monthly financial reports or development environments.

4. How do we find a balance between speed and cost?

To achieve better speed without paying the highest cost, we can choose between two intermediate strategies.

Pilot Light: This can be called the minimal engine approach, in which you keep the engine block assembled. The strategy involves keeping the “bare minimum” running in the backup region, that is, the core components like database replicas and critical data replication. It is a Low-Medium cost strategy. During a disaster, you start and scale up the application servers and load balancers around that ready core. This offers a reasonable RTO of 30 minutes to 2 hours and an RPO of 5–15 minutes. This is best for important data that needs reasonable coverage.

Warm Standby: This can be called the reduced kitchen approach, in which you have a smaller, but ready-to-use, second location. With this strategy, you maintain a fully functional, scaled-down replica of your production environment (perhaps 25–50% capacity) in the DR region. This is a Medium-High cost strategy where the DR region is always running at a reduced capacity. The application is functional and ready to take traffic immediately, only requiring a scale-up to handle the full load. This achieves a fast RTO of 5–30 minutes and a strong RPO of 1–5 minutes. This is best for critical e-commerce platforms that cannot endure extended downtime.

The key services that help with DR are:

S3 (Storage): Great for backing up files, especially using Cross-Region Replication (CRR) to automatically copy files to another region.

RDS/Aurora (Databases): You use cross-region read replicas or the super-fast Aurora Global Database to keep a live copy of your data in the backup region.

Route 53 (“The Traffic Cop”): This is what directs users. It uses Health Checks (checking if your application is alive) and automatically reroutes traffic to the healthy region during a failure.

To sum it all up, business criticality is the key driver of Disaster Recovery strategy. At one extreme, for mission-critical applications requiring an RTO of less than 5 minutes and an RPO of less than 1 minute, the recommended strategy is Hot Site. At the other, for non-critical systems that can tolerate an RTO up to 24 hours and an RPO up to 4 hours, Backup & Restore is the low-cost solution. Simply put, don’t build a Ferrari when a Toyota will do, but don’t drive a bicycle on the highway either.
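
The whole decision can be sketched as a small Python lookup. This is a rough starting point, not a definitive sizing tool: the function name is mine, and the table uses the optimistic (lower) end of each RTO/RPO range quoted in these notes, so a real choice should be validated against your own recovery tests.

```python
from datetime import timedelta

# (name, best achievable RTO, best achievable RPO), ordered cheapest first.
# Values are the lower ends of the ranges from the notes above.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=8), timedelta(hours=1)),
    ("Pilot Light", timedelta(minutes=30), timedelta(minutes=5)),
    ("Warm Standby", timedelta(minutes=5), timedelta(minutes=1)),
]


def choose_strategy(required_rto: timedelta, required_rpo: timedelta) -> str:
    """Pick the cheapest strategy that can plausibly meet both objectives."""
    for name, best_rto, best_rpo in STRATEGIES:
        if best_rto <= required_rto and best_rpo <= required_rpo:
            return name
    return "Hot Site"  # nothing cheaper is fast enough


print(choose_strategy(timedelta(hours=24), timedelta(hours=4)))
print(choose_strategy(timedelta(hours=1), timedelta(minutes=10)))
print(choose_strategy(timedelta(minutes=10), timedelta(minutes=2)))
print(choose_strategy(timedelta(minutes=4), timedelta(seconds=30)))
```

Ordering the table from cheapest to most expensive is what encodes the "don’t build a Ferrari when a Toyota will do" rule: the first strategy that is fast enough wins.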

See you tomorrow!