AWS Interview Questions (2026)

100 real interview questions with in-depth answers — 30 basic, 40 intermediate, 30 advanced. Updated April 2026.

Preparing for an AWS / Cloud Engineer role?

AWS (Amazon Web Services) is a cloud computing platform offering 200+ services from data centers worldwide. The infrastructure is organized into Regions (geographically isolated groups of data centers, e.g., us-east-1), Availability Zones (AZs — one or more discrete data centers within a Region with redundant power/networking), and Edge Locations (Points of Presence used by CloudFront and Route 53 for low-latency content delivery). As of 2025 AWS has 34 launched regions, 108 AZs, and 600+ edge locations. Choosing the right Region matters for latency, compliance, and data residency.

EC2 instance families are grouped by workload: General Purpose (t4g, m7i) for balanced CPU/memory; Compute Optimized (c7g, c7i) for CPU-intensive apps like batch processing; Memory Optimized (r7g, x2idn) for in-memory databases and large datasets; Storage Optimized (i4i, d3) for high sequential read/write and NVMe SSD; Accelerated Computing (p5, g5, inf2) for ML training/inference and GPU workloads. The t-family uses burstable CPU credits — good for dev environments but risky for sustained load. Always benchmark with actual workloads before committing to a family.

You need the private key (.pem file) from the key pair specified at launch, the public IP or DNS of the instance, and port 22 open in the security group. The default username depends on the AMI: `ec2-user` for Amazon Linux/RHEL, `ubuntu` for Ubuntu, `admin` for Debian.

```bash
chmod 400 my-key.pem
ssh -i my-key.pem ec2-user@<public-ip-or-dns>
```

For instances in private subnets use a bastion host or AWS Systems Manager Session Manager (no open port 22 required, preferred for production).

S3 stores data as objects (up to 5 TB each) inside buckets. An object consists of the data, metadata, and a key — the full path-like name (e.g., `photos/2025/cat.jpg`). There is no real folder hierarchy; the slash is just part of the key. Versioning, when enabled on a bucket, keeps every version of every object instead of overwriting. Deleting an object inserts a delete marker rather than permanently removing it; you can restore by removing the marker. Versioning protects against accidental deletion and overwrites but increases storage costs.
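
The versioning behaviour described above can be sketched with a toy in-memory model. This is not the S3 API: `VersionedBucket`, its method names, and the `DELETE_MARKER` sentinel are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

DELETE_MARKER = object()  # sentinel standing in for an S3 delete marker


@dataclass
class VersionedBucket:
    """Toy model of a versioning-enabled bucket (illustration only)."""
    versions: dict = field(default_factory=dict)  # key -> versions, oldest first

    def put(self, key: str, data: bytes) -> None:
        # A PUT never overwrites: it appends a new version.
        self.versions.setdefault(key, []).append(data)

    def delete(self, key: str) -> None:
        # A plain DELETE only stacks a delete marker on top of the versions.
        self.versions.setdefault(key, []).append(DELETE_MARKER)

    def get(self, key: str) -> Optional[bytes]:
        stack = self.versions.get(key, [])
        if not stack or stack[-1] is DELETE_MARKER:
            return None  # the real service would answer "not found" here
        return stack[-1]

    def remove_delete_marker(self, key: str) -> None:
        # Removing the marker makes the previous version current again.
        stack = self.versions.get(key, [])
        if stack and stack[-1] is DELETE_MARKER:
            stack.pop()
```

Every `put` grows the version list, which is the source of the extra storage cost the paragraph mentions.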

S3 Standard is for frequently accessed data with low latency and high throughput. S3 Intelligent-Tiering automatically moves objects between access tiers based on usage patterns — ideal when access is unpredictable. S3 Standard-IA (Infrequent Access) costs less per GB stored but charges a retrieval fee — good for backups accessed monthly. S3 One Zone-IA is cheaper still but stores data in only one AZ (no AZ redundancy). S3 Glacier Instant Retrieval suits archival data needing millisecond access. Glacier Flexible Retrieval (formerly Glacier) offers minutes-to-hours retrieval at very low cost. Glacier Deep Archive is the cheapest tier with 12–48 hour retrieval for long-term compliance archives.

An IAM user is a permanent identity representing a person or service, with long-term credentials (password + access keys). A group is a collection of users that share the same permissions — you attach policies to the group, not individual users. A role is a temporary identity assumed by a trusted entity (EC2 instance, Lambda function, federated user) — it has no long-term credentials and is accessed via short-lived tokens from STS. A policy is a JSON document defining permissions (Allow/Deny on Actions/Resources). Best practice: avoid long-term keys; use roles for everything that runs on AWS.
A Virtual Private Cloud (VPC) is a logically isolated section of the AWS network where you define your own IP address range, subnets, route tables, and gateways. Every AWS account gets a default VPC in each Region so you can launch resources immediately. Custom VPCs let you control network topology, isolate workloads, apply security controls, and connect securely to on-premises networks. Without a VPC, you cannot have private IP addressing, subnet segmentation, or fine-grained network ACLs.
A public subnet has a route in its route table pointing `0.0.0.0/0` to an Internet Gateway, and instances can have public IPs — they are directly reachable from the internet. A private subnet has no route to an Internet Gateway; instances are not directly reachable from the internet. Resources in private subnets (databases, app servers) access the internet for outbound traffic through a NAT Gateway placed in a public subnet. The NAT Gateway translates private IPs to its own Elastic IP for outbound requests but blocks unsolicited inbound connections.
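
As a rough illustration of the rule above, a subnet can be classified by looking for a default route that targets an Internet Gateway. The route dictionaries here use a hypothetical simplified shape (`destination`, `target`), not the exact structure returned by the EC2 API.

```python
def is_public_subnet(routes: list[dict]) -> bool:
    """True if any default route (0.0.0.0/0) targets an Internet Gateway."""
    return any(
        r.get("destination") == "0.0.0.0/0" and r.get("target", "").startswith("igw-")
        for r in routes
    )


# Hypothetical route tables for two subnets in the same VPC.
public_routes = [
    {"destination": "10.0.0.0/16", "target": "local"},
    {"destination": "0.0.0.0/0", "target": "igw-0123456789abcdef0"},
]
private_routes = [
    {"destination": "10.0.0.0/16", "target": "local"},
    {"destination": "0.0.0.0/0", "target": "nat-0123456789abcdef0"},
]
```

A subnet whose default route points at a NAT Gateway still counts as private, because the NAT only carries outbound traffic.
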
An Internet Gateway (IGW) is attached to a VPC and enables bidirectional internet access for resources with public IPs — it routes both inbound and outbound traffic. A NAT Gateway resides in a public subnet and provides outbound-only internet access for private subnet resources, translating their private IPs to the NAT Gateway's Elastic IP. Inbound connections from the internet to private resources are blocked by the NAT. NAT Gateways are managed, highly available within an AZ, and billed per hour plus per GB processed.
Security Groups (SGs) are stateful firewalls attached to individual EC2 instances (or ENIs). If you allow inbound traffic, the return traffic is automatically allowed. Rules are allow-only (no explicit deny). Network ACLs (NACLs) are stateless firewalls at the subnet level — you must explicitly allow both inbound and outbound traffic for a connection. NACLs support both allow and deny rules and are evaluated in numerical order. Use SGs for instance-level control and NACLs for broad subnet-level blocks (e.g., blocking a specific IP range).
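
The "evaluated in numerical order" behaviour of NACLs can be sketched as a first-match-wins loop. The rule format (`number`, `cidr`, `ports`, `action`) is a simplified assumption for illustration and ignores protocol and direction.

```python
import ipaddress


def evaluate_nacl(rules: list[dict], source_ip: str, port: int) -> str:
    """Return 'allow' or 'deny': lowest rule number first, first match wins."""
    addr = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        low, high = rule["ports"]
        if addr in ipaddress.ip_network(rule["cidr"]) and low <= port <= high:
            return rule["action"]
    return "deny"  # the implicit final rule denies anything unmatched


# Hypothetical rules: block one range outright, then allow HTTPS from anywhere.
EXAMPLE_RULES = [
    {"number": 200, "cidr": "0.0.0.0/0", "ports": (443, 443), "action": "allow"},
    {"number": 100, "cidr": "203.0.113.0/24", "ports": (0, 65535), "action": "deny"},
]
```

Because rule 100 is checked before rule 200, HTTPS from `203.0.113.0/24` is denied even though a later rule would allow it. A security group cannot express this, since it has no deny rules.
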
Amazon RDS is a managed relational database service that handles provisioning, patching, backups, Multi-AZ failover, and read replicas automatically. Running a database on EC2 gives you full control (OS tuning, custom DB versions, uncommon engines) but requires you to manage everything yourself — backups, HA, patching, storage expansion. RDS is preferred for standard workloads because it reduces operational burden. Use EC2-hosted databases only when you need specific OS/DB features unavailable in RDS or need licenses you already own.
Lambda is a serverless compute service that runs code in response to events without provisioning or managing servers. You upload a function, configure triggers (API Gateway, S3 events, SQS, etc.), and AWS handles scaling automatically — down to zero when idle. Billing has two dimensions: number of requests ($0.20 per 1M requests) and duration (rounded up to 1 ms) multiplied by allocated memory. A 128 MB function running 100 ms costs almost nothing. The free tier covers 1M requests and 400,000 GB-seconds per month indefinitely.
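
The two billing dimensions can be turned into a quick estimate. The request price comes from the text above; the per-GB-second rate is an assumption (the commonly published x86 rate) and should be checked against current pricing for your Region and architecture.

```python
REQUEST_PRICE_PER_MILLION = 0.20  # USD, from the paragraph above
GB_SECOND_PRICE = 0.0000166667    # USD, assumed x86 rate; verify before relying on it


def lambda_cost(requests: int, avg_duration_ms: float, memory_mb: int) -> float:
    """Estimated monthly cost in USD, before any free-tier allowance."""
    request_cost = requests / 1_000_000 * REQUEST_PRICE_PER_MILLION
    # Duration is billed as seconds of runtime multiplied by allocated GB.
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return request_cost + gb_seconds * GB_SECOND_PRICE
```

Under these assumptions, one million invocations of a 128 MB function running 100 ms each use 12,500 GB-seconds and cost roughly $0.41, which the free tier would cover entirely.
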
Amazon SNS (Simple Notification Service) is a pub/sub messaging service that pushes messages to multiple subscribers (Lambda, SQS, HTTP, email) simultaneously — one-to-many fan-out. Amazon SQS (Simple Queue Service) is a message queue where a producer enqueues messages and a consumer polls and processes them — one-to-one or one-to-few. SQS decouples producers from consumers and buffers traffic spikes. A common pattern is SNS fan-out to multiple SQS queues so different services each process their own copy of an event independently.
CloudFront is AWS's Content Delivery Network (CDN). It caches content at 600+ edge locations worldwide so users download from a nearby edge node rather than the origin server (S3, ALB, EC2, etc.), reducing latency and origin load. It supports both static (images, JS, CSS) and dynamic content, HTTPS with custom SSL certificates via ACM, field-level encryption, Lambda@Edge for serverless logic at the edge, and integrates with AWS WAF for DDoS and bot protection. CloudFront also accelerates API calls and can compress content on the fly.
Route 53 is AWS's DNS service. Key record types: A record maps a hostname to an IPv4 address. AAAA maps to IPv6. CNAME maps one hostname to another (cannot be used on the zone apex/root domain). Alias record is AWS-specific — it maps a hostname to an AWS resource (ALB, CloudFront, S3 website) and works at the zone apex with no extra DNS hop. MX records route email. TXT records hold arbitrary text (SPF, domain verification). Route 53 also supports routing policies: Simple, Weighted, Latency, Failover, Geolocation, Geoproximity, and Multivalue.
CloudWatch Metrics are numerical time-series data points emitted by AWS services (CPU utilization, request count, latency) or your custom code. They are stored for up to 15 months and drive alarms and dashboards. CloudWatch Logs capture raw log streams (application logs, VPC flow logs, CloudTrail logs) as text. You can run Logs Insights queries against logs for ad-hoc analysis, create metric filters to extract numeric signals from logs (e.g., count of ERROR lines), and export logs to S3 for archival. Metrics are for monitoring KPIs; Logs are for debugging and auditing.
CloudFormation is AWS's Infrastructure as Code (IaC) service. You define your infrastructure in YAML or JSON templates and CloudFormation provisions and manages the lifecycle of those resources as a stack. It handles dependency ordering, rollback on failure, change sets (preview diffs before applying), and stack drift detection. Templates are reusable and versionable in source control. The alternative AWS-native IaC tool is CDK (Cloud Development Kit), which generates CloudFormation under the hood from TypeScript, Python, or other languages.
Elastic Beanstalk is a Platform-as-a-Service (PaaS) that lets you deploy web applications and services without managing the underlying infrastructure. You upload your code (ZIP, WAR, Docker image) and Beanstalk automatically provisions EC2 instances, a load balancer, Auto Scaling, and monitoring. It supports Node.js, Python, Java, .NET, PHP, Ruby, Go, and Docker. You retain full control of the underlying resources and can customize via `.ebextensions` config files. It's best for teams that want a quick deployment path without DevOps overhead, though ECS/EKS or App Runner are preferred for container-native workloads.
Amazon Elastic Container Service (ECS) is a fully managed container orchestration service for running Docker containers. You define tasks (one or more containers with CPU/memory) and services (desired count, load balancer integration, auto-scaling). ECS runs on two launch types: EC2 (you manage the underlying instances) and Fargate (serverless — AWS manages the compute). ECS integrates natively with IAM, ALB, CloudWatch, ECR, and Secrets Manager. It's simpler than Kubernetes for teams that don't need multi-cloud portability.
Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes control plane service. AWS runs and upgrades the Kubernetes API server and etcd across multiple AZs; you manage the worker nodes (EC2 managed node groups, self-managed nodes, or Fargate). EKS is compatible with standard Kubernetes tooling (kubectl, Helm, Argo CD). Choose EKS over ECS when you need Kubernetes-native features (custom operators, CRDs, multi-cloud portability) or already have Kubernetes expertise. It integrates with IAM via IRSA (IAM Roles for Service Accounts) for pod-level permissions.
EC2 Auto Scaling automatically adjusts the number of EC2 instances in an Auto Scaling Group (ASG) based on demand. You define minimum, maximum, and desired capacity. Scaling policies trigger on CloudWatch alarms (e.g., CPU > 70%) using Target Tracking (maintain a metric at a target value), Step Scaling (scale by X based on alarm breach thresholds), or Scheduled Scaling (predictable load patterns). Auto Scaling replaces unhealthy instances automatically. It works with ALB to distribute traffic across healthy instances and is fundamental to HA architectures.
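
Target Tracking can be pictured as a proportional controller. The sketch below is a simplified model, not AWS's exact algorithm (which also applies cooldowns, instance warm-up, and conservative scale-in); the function and parameter names are hypothetical.

```python
import math


def target_tracking_desired(current: int, metric: float, target: float,
                            min_size: int, max_size: int) -> int:
    """Scale capacity in proportion to metric/target, clamped to the ASG bounds."""
    proposed = math.ceil(current * metric / target)
    return max(min_size, min(max_size, proposed))
```

With 4 instances at 90% CPU and a 70% target, the model asks for 6 instances; at 35% CPU it drops to the minimum of 2 when `min_size=2`.
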
On-Demand instances are billed by the second/hour with no commitment: maximum flexibility, highest price. Reserved Instances (Standard or Convertible) commit to 1 or 3 years and save up to 72% vs On-Demand; Standard RIs lock to instance type/AZ, Convertible RIs allow flexibility in exchange for a smaller discount. Savings Plans are a flexible alternative covering EC2, Lambda, and Fargate. Spot Instances run on unused AWS capacity at the current Spot price (there is no longer any bidding) and can save up to 90%, but AWS can reclaim them with a 2-minute warning, so they suit batch jobs, stateless workloads, and fault-tolerant applications.
An AMI is a template containing the software configuration (OS, application server, application) needed to launch an EC2 instance. It includes one or more EBS snapshots (or a template for instance store), launch permissions, and a block device mapping. You can use AWS-provided AMIs (Amazon Linux 2023, Ubuntu, Windows Server), AWS Marketplace AMIs, or create custom AMIs from a running instance after configuring it. Custom AMIs are the fastest way to launch pre-configured instances at scale, and are essential for immutable infrastructure patterns.
An Elastic IP (EIP) is a static, public IPv4 address you allocate in your account and can associate with an EC2 instance or network interface. Unlike a standard public IP (which changes on stop/start), an EIP persists until you release it. This lets you remap a fixed IP to a new instance after failure. Since February 2024 AWS bills every public IPv4 address, Elastic IPs included, at $0.005/hour whether or not it is associated with a running instance; the charge exists to discourage hoarding of scarce IPv4 space. For most architectures, using a DNS name (Route 53 Alias to an ALB) is preferred over Elastic IPs.
Classic Load Balancer (CLB) is the legacy load balancer operating at both Layer 4 and 7 — avoid for new workloads. Application Load Balancer (ALB) operates at Layer 7 (HTTP/HTTPS/WebSocket) and supports path-based routing, host-based routing, header routing, and Lambda targets — ideal for microservices and web applications. Network Load Balancer (NLB) operates at Layer 4 (TCP/UDP/TLS) with ultra-low latency, static IPs, and the ability to handle millions of requests per second — ideal for gaming, IoT, financial trading, and preserving client IP. ALB integrates with Cognito and WAF; NLB does not.
DynamoDB is a fully managed, serverless NoSQL key-value and document database that delivers single-digit millisecond performance at any scale. Data is stored in tables; each item is identified by a primary key (partition key alone, or partition key + sort key). DynamoDB automatically partitions and replicates data across three AZs. It offers two capacity modes: provisioned (predictable workloads) and on-demand (pay-per-request for variable traffic). It is schema-less (except for primary key attributes) and supports strong or eventual read consistency. Ideal for session stores, leaderboards, IoT telemetry, and high-scale web apps.
ElastiCache is a managed in-memory caching service supporting Redis and Memcached engines. It reduces database load by caching frequently read data (session data, query results, leaderboards) in memory, with sub-millisecond latency. Redis supports persistence, replication, Cluster Mode for horizontal scaling, pub/sub, Lua scripting, and complex data structures (sorted sets, streams). Memcached is simpler: a pure cache with multi-threading and no persistence. ElastiCache for Redis also supports read replicas with automatic Multi-AZ failover, and Global Datastore for cross-region replication.
CodeDeploy is a deployment service that automates application deployments to EC2 instances, on-premises servers, Lambda functions, and ECS services. It supports deployment strategies including rolling, blue/green, and canary. CodePipeline is a fully managed CI/CD pipeline service that orchestrates source (CodeCommit, GitHub, S3), build (CodeBuild), test, and deploy stages into an automated release workflow. Together they form the AWS Developer Tools suite alongside CodeCommit and CodeBuild. Most modern teams use them with GitHub Actions or integrate with third-party tools like Jenkins or GitLab CI.
The Shared Responsibility Model divides security duties between AWS and the customer. AWS is responsible for security OF the cloud: physical infrastructure, hardware, hypervisor, managed service software, global network, and AZ/Region redundancy. The customer is responsible for security IN the cloud: OS patching on EC2, network/firewall configuration, IAM policies, data encryption, application code, and customer data. For managed services like RDS or Lambda, AWS takes on more of the stack (OS and runtime patching), leaving the customer mainly responsible for data, access control, and configuration. Understanding this split is essential for compliance and audit conversations.
The Free Tier has three types: Always Free (no expiry), 12-month free (from account creation), and short-term trials. Key always-free limits include: Lambda — 1M requests/month and 400,000 GB-seconds; DynamoDB — 25 GB storage and 25 WCUs/RCUs; CloudWatch — 10 custom metrics; SNS — 1M publishes. Key 12-month limits include: EC2 — 750 hours/month of t2.micro or t3.micro; S3 — 5 GB standard storage; RDS — 750 hours of db.t2.micro/db.t3.micro; CloudFront — 1 TB data transfer out. Exceed these and charges apply — set up billing alerts immediately after account creation.
User data is a script (Bash on Linux, PowerShell on Windows) you provide at instance launch. It runs once automatically on first boot as the root user via cloud-init (Linux) or EC2Launch (Windows). Common uses: installing packages, pulling application code, configuring services. It does NOT run on subsequent reboots unless you explicitly configure cloud-init to run it on every boot. You can pass user data via the console, CLI, or launch template. Logs go to `/var/log/cloud-init-output.log`.

```bash
#!/bin/bash
# Patch the OS, then install nginx
yum update -y
yum install -y nginx
# Start nginx now and on every boot
systemctl enable --now nginx
```
Instance store volumes are physically attached NVMe SSDs on the host machine — they provide extremely high IOPS and throughput but are ephemeral: data is lost when the instance stops, terminates, or fails. Instance store is ideal for temporary buffers, caches, or scratch space. EBS (Elastic Block Store) volumes are network-attached persistent storage that survive instance stop/start. They can be detached and reattached to different instances, snapshotted to S3, and encrypted. For production data always use EBS. Instance store's key advantage is raw performance (some i4i instances deliver millions of IOPS at sub-100µs latency).
gp3 is the default general-purpose SSD: it includes a baseline of 3,000 IOPS and 125 MB/s, and IOPS and throughput can be provisioned independently of size (16,000 IOPS and 1,000 MB/s at the limits cited here), which makes it suitable for boot volumes and most workloads. gp2 is the older general-purpose SSD where IOPS scales with size (3 IOPS/GB); migrate to gp3 for cost savings. io2 Block Express is provisioned IOPS SSD for I/O-intensive databases (SQL Server, Oracle) needing >16,000 IOPS or sub-millisecond latency; up to 256,000 IOPS per volume. st1 (Throughput Optimized HDD) is for large sequential workloads like data warehouses and log processing, offering high throughput at lower cost. sc1 (Cold HDD) is the cheapest option for infrequently accessed data. The HDD types (st1, sc1) cannot be used as boot volumes.
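
The gp2 "IOPS scales with size" rule can be written as a one-line formula. The 3 IOPS/GB ratio is from the text; the 100 IOPS floor and 16,000 IOPS cap are assumptions taken from the published gp2 limits, and burst credits are ignored.

```python
GP3_BASELINE_IOPS = 3_000  # assumed gp3 baseline, included at any volume size


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, with an assumed 100 floor and 16,000 cap."""
    return max(100, min(16_000, 3 * size_gib))


def gp2_size_to_match_gp3_baseline() -> int:
    """Smallest gp2 volume (GiB) whose baseline reaches the gp3 baseline."""
    return -(-GP3_BASELINE_IOPS // 3)  # ceiling division
```

Under these assumptions a gp2 volume must be about 1,000 GiB before its baseline matches what gp3 includes at any size, which is the main reason migrating small volumes tends to pay off.
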
EBS snapshots are incremental backups stored in S3 (managed by AWS, not visible in your S3 console). The first snapshot copies all data; subsequent snapshots only copy blocks changed since the last snapshot, reducing cost and time. You can create volumes from snapshots in any AZ within the same Region, or copy snapshots cross-region for DR. Amazon Data Lifecycle Manager (DLM) automates snapshot creation, retention, and deletion via policies. Example policy: take a snapshot daily, retain 7, cross-copy to us-west-2 for DR.

```bash
aws ec2 create-snapshot --volume-id vol-0abc123 --description "nightly backup"
```
S3 lifecycle policies automate transitioning objects between storage classes and expiring (deleting) objects after a specified time. You define rules scoped by prefix or tag. Example: transition objects to Standard-IA after 30 days, to Glacier Instant Retrieval after 90 days, delete after 365 days. Lifecycle rules also manage incomplete multipart uploads and noncurrent versions when versioning is enabled. This is critical for cost control: without lifecycle rules, objects in Standard storage accumulate indefinitely.

```json
{
  "Rules": [
    {
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
      "Expiration": {"Days": 365}
    }
  ]
}
```
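
To make the prose example concrete (Standard-IA at 30 days, Glacier Instant Retrieval at 90, deletion at 365), the sketch below maps an object's age to the class it would sit in. It is a simplified model that assumes transitions happen exactly on the configured day; in practice lifecycle actions run asynchronously.

```python
from typing import Optional


def storage_class_for_age(age_days: int) -> Optional[str]:
    """Storage class under the example rule set, or None once the object expires."""
    if age_days >= 365:
        return None            # expired (deleted)
    if age_days >= 90:
        return "GLACIER_IR"    # Glacier Instant Retrieval
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"
```
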
S3 replication automatically copies objects from a source bucket to a destination bucket asynchronously. CRR (Cross-Region Replication) replicates to a bucket in a different AWS Region — used for disaster recovery, latency reduction for geographically distributed users, and data sovereignty compliance. SRR (Same-Region Replication) replicates within the same Region — used for log aggregation, production-to-test data sync, or compliance requirements for in-region copies. Both require versioning enabled on source and destination. Replication copies new objects only (existing objects require the S3 Batch Replication feature). Delete markers are not replicated by default.
A pre-signed URL is a temporary URL that grants time-limited access to a specific S3 object without requiring the requester to have AWS credentials. The URL is generated server-side using the credentials of an IAM principal that has access to the object. It embeds an expiry time of up to 7 days when signed with IAM user credentials; a URL signed with temporary STS credentials stops working as soon as those credentials expire, whatever expiry was requested. Use cases: allowing users to download private files directly from S3 without passing through your server (offloading its bandwidth), or allowing users to upload directly to S3 without exposing credentials.

```bash
aws s3 presign s3://my-bucket/report.pdf --expires-in 3600
```
IAM policy conditions restrict when a policy statement applies. The Condition block contains condition operators mapped to keys and values. `StringEquals` requires an exact match; `StringLike` allows wildcards (`*` and `?`). `ArnLike` matches ARN patterns with wildcards. `IpAddress` restricts to a CIDR range. `Bool` checks true/false (e.g., `aws:SecureTransport`). `DateLessThan` and `DateGreaterThan` create time windows. Within one statement, multiple operators and multiple keys under an operator are AND-ed together; multiple values listed for a single key are OR-ed.

```json
{
  "Condition": {
    "StringLike": {"s3:prefix": ["home/${aws:username}/*"]},
    "Bool": {"aws:SecureTransport": "true"}
  }
}
```
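
The evaluation logic for a Condition block (AND across operators and keys, OR across the values of one key) can be sketched as follows. This is a toy evaluator covering only two operators; the context dictionary and the tag key in the example are hypothetical, and `fnmatchcase` also accepts `[...]` patterns that IAM does not.

```python
from fnmatch import fnmatchcase

OPERATORS = {
    "StringEquals": lambda actual, expected: actual == expected,
    "StringLike": lambda actual, pattern: fnmatchcase(actual, pattern),
}


def condition_matches(condition: dict, context: dict) -> bool:
    """Evaluate a simplified Condition block against request context values."""
    for operator, keys in condition.items():      # AND across operators
        test = OPERATORS[operator]
        for key, expected in keys.items():        # AND across keys
            values = expected if isinstance(expected, list) else [expected]
            actual = context.get(key)
            # OR across the values listed for this key
            if actual is None or not any(test(actual, v) for v in values):
                return False
    return True


# Hypothetical condition: either prefix is acceptable, but the team tag must match.
EXAMPLE_CONDITION = {
    "StringLike": {"s3:prefix": ["home/alice/*", "shared/*"]},
    "StringEquals": {"aws:PrincipalTag/team": "data"},
}
```
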
A permissions boundary is an IAM managed policy set as the maximum permissions that an IAM entity (user or role) can have. Even if an identity-based policy grants broader access, the effective permissions are the intersection of the identity-based policy and the boundary. This prevents privilege escalation: a developer with IAM admin rights cannot create a role more powerful than their own boundary. Common use case: delegate IAM management to developers within guardrails — they can create roles for their Lambda functions but cannot create roles with S3 full access if their boundary excludes it.
SCPs are a feature of AWS Organizations that set the maximum available permissions for accounts in an organizational unit (OU) or the entire organization. Unlike IAM policies, SCPs do not grant permissions; they restrict what IAM policies can allow. An SCP denying EC2 `TerminateInstances` prevents any IAM policy from allowing it, even AdministratorAccess. SCPs apply to every principal in a member account, including its root user, but they do not affect the management account. They are the primary tool for preventative controls in a multi-account strategy, ensuring no account can deviate from organizational governance (e.g., forbid leaving the organization, require encryption, restrict regions).
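
Permissions boundaries and SCPs both act as ceilings, so the effective permissions are an intersection. The sketch below is a deliberately simplified model: actions are exact strings with no wildcards, the SCP is represented as an allow set (real SCPs are often written as deny statements), and resource-based policies are ignored.

```python
def effective_actions(identity_allow, boundary_allow, scp_allow,
                      explicit_deny=frozenset()):
    """Actions allowed by all three layers, minus anything explicitly denied."""
    allowed = set(identity_allow) & set(boundary_allow) & set(scp_allow)
    return allowed - set(explicit_deny)


# Hypothetical policies: the identity policy is broad, the SCP is the tightest layer.
identity = {"s3:GetObject", "lambda:InvokeFunction", "ec2:TerminateInstances"}
boundary = {"s3:GetObject", "lambda:InvokeFunction", "ec2:TerminateInstances"}
scp = {"s3:GetObject", "lambda:InvokeFunction"}
```

Here `ec2:TerminateInstances` drops out because the SCP layer does not permit it, even though the identity policy and boundary both do.
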
VPC Peering creates a direct, private network connection between two VPCs (same or different accounts/regions) — non-transitive (A↔B and B↔C does not mean A↔C) and requires CIDR ranges not to overlap. Transit Gateway is a hub-and-spoke network transit hub — attach hundreds of VPCs and on-premises networks; routing is transitive and managed centrally. It scales better than a mesh of peering connections. PrivateLink exposes a specific service endpoint (an NLB-backed service) to consumer VPCs privately without routing all traffic across a VPC — no CIDR overlap issue, no transitive routing needed. Use PrivateLink for SaaS service exposure; Transit Gateway for full network connectivity.
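
The non-overlapping CIDR requirement for peering is easy to check before requesting a connection. A minimal sketch using Python's standard `ipaddress` module; the function name is hypothetical.

```python
import ipaddress


def can_peer(cidr_a: str, cidr_b: str) -> bool:
    """True if the two VPC CIDR blocks do not overlap."""
    return not ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))
```

For example, `10.0.0.0/16` and `10.1.0.0/16` can be peered, while `10.0.0.0/16` and `10.0.128.0/17` cannot because the second range sits inside the first.
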
A Site-to-Site VPN connects your on-premises network to a VPC over the public internet using IPsec encryption — quick to set up (minutes), low cost, but variable latency and bandwidth limited by internet congestion. AWS Direct Connect is a dedicated physical network connection from your data center to an AWS Direct Connect location — consistent low latency, high bandwidth (1–100 Gbps), and reduced data transfer costs, but takes weeks to provision and is more expensive. Direct Connect does not encrypt traffic by default; pairing it with a VPN provides both dedicated bandwidth and encryption for compliance requirements.
An ALB listener forwards traffic to target groups based on rules evaluated in priority order. Each rule has conditions (path, host header, HTTP method, query string, source IP) and an action (forward, redirect, fixed-response, authenticate). Path-based routing sends `/api/*` to an ECS service target group and `/static/*` to a separate target group serving static assets, both from the same listener port (an S3 bucket cannot be registered as an ALB target directly). Target groups contain EC2 instances, IP addresses, or Lambda functions, with health checks to route around unhealthy targets. This enables a single ALB to front an entire microservices application.

```bash
aws elbv2 create-rule --listener-arn <arn> --priority 10 \
  --conditions Field=path-pattern,Values="/api/*" \
  --actions Type=forward,TargetGroupArn=<tg-arn>
```
Choose ALB when you need Layer 7 features: content-based routing (path, host, header, query string), WebSocket support, HTTP/2, gRPC, native authentication via Cognito or OIDC, WAF integration, or Lambda as a target. Choose NLB when you need Layer 4 performance: ultra-low latency (sub-millisecond), static IP addresses (important for firewall whitelisting), TCP/UDP/TLS passthrough, source IP preservation without X-Forwarded-For, or extremely high throughput. NLB also supports PrivateLink endpoint services. NLB cannot inspect HTTP headers or route based on content.
Lifecycle hooks pause an EC2 instance during launch (Pending:Wait) or termination (Terminating:Wait) so you can run custom actions before the instance becomes active or is deleted. Use cases during launch: register the instance with a service mesh, configure monitoring agents, warm up caches. Use cases during termination: drain connections, deregister from service discovery, copy logs to S3. The instance remains paused until you send `complete-lifecycle-action` or the heartbeat timeout expires (default 1 hour, max 48 hours). Hooks publish events to SNS or SQS, or trigger EventBridge rules.

```bash
aws autoscaling complete-lifecycle-action --lifecycle-hook-name my-hook \
  --auto-scaling-group-name my-asg --lifecycle-action-result CONTINUE \
  --instance-id i-0abc123
```
Launch configurations are the legacy way to specify instance settings (AMI, instance type, key pair, SGs, user data) for an ASG — immutable once created, cannot be modified, and do not support all current EC2 features. Launch templates are the modern replacement — versioned, editable, support multiple instance types and Spot/On-Demand mix (mixed instances policy), T3/T4g unlimited credit mode, placement groups, and all current EC2 features. AWS recommends migrating all ASGs from launch configurations to launch templates. Launch templates can also be used directly with `RunInstances` outside of ASGs.
A cold start occurs when Lambda must initialize a new execution environment: download the code package, start the runtime (JVM, Node.js, Python interpreter), and run initialization code outside the handler. This adds latency ranging from ~100ms (Python/Node) to 500ms–2s+ (Java/C#). Cold starts happen on first invocation, after idle periods, and during scaling bursts. Mitigation strategies: use Provisioned Concurrency (keeps environments initialized, so requests served within the provisioned level avoid cold starts, at a cost), prefer lightweight runtimes (Node.js, Python) over JVM, keep deployment packages small, move initialization (DB connections, SDK clients) outside the handler to the module level so it is reused across invocations, and use Lambda SnapStart for Java.
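
The "initialize at module level" advice looks like this in a Python function. `_create_client` is a hypothetical stand-in for an expensive setup step such as opening a database connection; `INIT_COUNT` exists only to make the reuse visible.

```python
import time

INIT_COUNT = 0


def _create_client() -> dict:
    """Stand-in for expensive setup work (hypothetical)."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"created_at": time.time()}


# Module level: runs once per execution environment, during the cold start.
CLIENT = _create_client()


def handler(event, context):
    # Warm invocations reuse CLIENT instead of rebuilding it on every call.
    return {"statusCode": 200, "client_created_at": CLIENT["created_at"]}
```

Had `_create_client()` been called inside `handler`, every invocation would pay the setup cost, not only the cold ones.
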
Lambda Layers are ZIP archives containing libraries, custom runtimes, or configuration that you attach to functions. Layers are mounted at `/opt` in the execution environment. Benefits: share common code (SDKs, utility libraries) across multiple functions without including them in each deployment package; keep function packages small, speeding up deployment and cold starts; version and manage dependencies independently from function code. A function can have up to 5 layers. AWS and third parties publish public layers (e.g., AWS SDK for Pandas a.k.a. AWSDataWrangler). Layers are not useful for large ML models — use container images (up to 10 GB) instead.
Lambda event sources differ in polling vs push: SQS — Lambda polls the queue using long polling (event source mapping), processes batches of messages, deletes on success, returns failed messages to the queue via partial batch response. SNS — SNS pushes events directly to Lambda (asynchronous invocation); no polling. DynamoDB Streams — Lambda polls the shard, processes records in order per shard, controlled by batch size and bisect-on-error. API Gateway — synchronous invocation (request/response); Lambda must return within 29 seconds. Kinesis — Lambda polls shards, processes records in order. S3 — S3 pushes asynchronous events on object operations. Each model has different retry semantics and error handling patterns.
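
For the SQS case, a partial batch response reports only the failed message IDs so that successful messages are not retried. The sketch below assumes `ReportBatchItemFailures` is enabled on the event source mapping; `process` is hypothetical business logic.

```python
import json


def process(body: dict) -> None:
    """Hypothetical business logic that rejects malformed messages."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message becomes visible on the queue again.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list tells Lambda the whole batch succeeded; raising an exception from the handler instead would cause the entire batch to be retried.
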
HTTP API is the newer, cheaper (~70% cheaper), lower-latency API Gateway offering. It supports JWT authorizers (Cognito, Auth0), Lambda proxy integrations, CORS, and basic routing. REST API supports more advanced features: API keys and usage plans, request/response transformation (Velocity templates), request validation, WAF integration, private endpoints (VPC endpoint), AWS service proxies (direct DynamoDB/SQS integration), caching, and custom domain names with more control. Choose HTTP API for straightforward Lambda or HTTP backend proxying. Choose REST API when you need request transformation, usage plans, or AWS service proxy integrations.
The partition key determines how data is distributed across DynamoDB's internal partitions. A bad partition key causes "hot partition" problems where one partition receives disproportionate traffic, throttling requests. A good partition key has high cardinality (many unique values) and uniform access patterns. User ID, order ID, device ID, and generated unique IDs (ULID, UUIDv4) are typically good. Avoid small enum values (status: active/inactive) and dates alone (all today's writes go to one partition). Sequential IDs are hashed like any other key, so they only become a problem when traffic concentrates on a few of them. For write-heavy scenarios, add a random suffix and aggregate results (write sharding). Access patterns must be known upfront since adding indexes later is the only way to query differently.
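
Write sharding can be sketched as appending a random suffix on write and fanning out over all suffixes on read. The shard count of 10 and the `#` separator are arbitrary choices for illustration.

```python
import random

SHARD_COUNT = 10  # assumed value; size it to the write rate you need to absorb


def sharded_key(base_key: str) -> str:
    """Partition key for a write: the base key plus a random shard suffix."""
    return f"{base_key}#{random.randrange(SHARD_COUNT)}"


def all_shard_keys(base_key: str) -> list[str]:
    """Every partition key a reader must query and then aggregate."""
    return [f"{base_key}#{i}" for i in range(SHARD_COUNT)]
```

The trade-off is on the read side: one logical query becomes `SHARD_COUNT` queries whose results must be merged by the application.
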
A Local Secondary Index (LSI) shares the same partition key as the base table but uses a different sort key, enabling range queries on a different attribute within the same partition. LSIs must be defined at table creation, share the table's read/write capacity, and are limited to 10 GB per partition key value. A Global Secondary Index (GSI) has a completely different partition key and optional sort key — it creates a new distributed table projection and has its own read/write capacity settings. GSIs can be added or deleted at any time. GSIs are far more flexible and are the standard solution for supporting additional access patterns. Use LSIs only when you need strong consistency on the indexed query.
DynamoDB Streams captures a time-ordered sequence of item-level modifications (INSERT, MODIFY, REMOVE) in a DynamoDB table, retaining records for 24 hours. Stream records contain the old image, new image, or both depending on the stream view type. Primary use cases: trigger Lambda functions for event-driven processing (sending notifications, updating a search index in OpenSearch, invalidating a cache), implement cross-region replication (Global Tables uses streams internally), build audit logs, or implement CQRS by projecting changes to a read model. Lambda event source mappings poll the stream shards and process records in order per shard.
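As an illustration of the Lambda trigger use case, the sketch below walks a batch of stream records and collects the keys whose cached copies would need invalidating. The handler name and the returned structure are assumptions for the example; the `Records`, `eventName`, and `dynamodb.Keys` fields follow the stream record format.

```python
def handler(event, context=None):
    """Count item-level changes and collect keys that were modified or removed."""
    counts = {"INSERT": 0, "MODIFY": 0, "REMOVE": 0}
    stale_keys = []
    for record in event.get("Records", []):
        event_name = record["eventName"]
        counts[event_name] += 1
        # Only changed or deleted items can leave a stale entry in a cache.
        if event_name in ("MODIFY", "REMOVE"):
            stale_keys.append(record["dynamodb"]["Keys"])
    return {"counts": counts, "stale_keys": stale_keys}
```

A real handler would pass `stale_keys` to the cache or search index client; keys arrive in DynamoDB JSON form, for example `{"pk": {"S": "user#2"}}`.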
Provisioned capacity requires you to specify Read Capacity Units (RCUs) and Write Capacity Units (WCUs) in advance. One RCU = 1 strongly consistent read or 2 eventually consistent reads per second for items ≤4 KB. One WCU = 1 write per second for items ≤1 KB. Larger items consume additional units, rounded up to the next 4 KB (reads) or 1 KB (writes). Auto Scaling adjusts capacity based on utilization targets. On-Demand mode charges per request (historically $1.25 per million write request units and $0.25 per million read request units in us-east-1; AWS halved on-demand rates in November 2024, so check current pricing for your Region) with no capacity planning, which makes it ideal for unpredictable or spiky workloads and new tables where access patterns are unknown. On-Demand costs several times more than right-sized provisioned capacity at sustained load. Switch from on-demand to provisioned once you have 30+ days of usage data and predictable patterns.
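The capacity-unit arithmetic can be sketched as follows. The function names are illustrative; the 4 KB and 1 KB block sizes and the halving for eventually consistent reads come from the definitions above.

```python
import math


def read_capacity_units(item_size_bytes: int, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed: one unit per 4 KB block per read, halved for eventual consistency."""
    units_per_read = max(1, math.ceil(item_size_bytes / 4096))
    total = units_per_read * reads_per_second
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """WCUs needed: one unit per 1 KB block per write."""
    return max(1, math.ceil(item_size_bytes / 1024)) * writes_per_second
```

For example, 100 strongly consistent reads per second of a 6 KB item need 200 RCUs, because each read spans two 4 KB blocks.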
DAX (DynamoDB Accelerator) is a fully managed, in-memory cache for DynamoDB that delivers microsecond read latency vs DynamoDB's single-digit milliseconds. It is a write-through cache: writes go to both DAX and DynamoDB. DAX is API-compatible with DynamoDB (minimal code change, just swap the client). Use DAX when read latency is the bottleneck (gaming leaderboards, real-time bidding, ad serving) or when you need to absorb read traffic spikes without over-provisioning RCUs. Do NOT use DAX for: strongly consistent reads (DAX passes them through to DynamoDB without caching the result, so they gain nothing), write-heavy tables where the cache hit rate will be low, or workloads that need richer data structures or caching of non-DynamoDB data, which ElastiCache serves better.
Multi-AZ is a high-availability feature that maintains a synchronous standby replica in a different AZ. If the primary instance fails, RDS automatically fails over to the standby with ~60–120 seconds of downtime and no data loss (synchronous replication). In a Multi-AZ DB instance deployment the standby cannot serve reads; it exists only for failover. Read Replicas use asynchronous replication to one or more replica instances that serve read traffic, reducing load on the primary. They can be in the same Region, a different Region (cross-region read replicas for DR and latency), or promoted to standalone instances. Aurora supports up to 15 replicas per cluster; RDS for MySQL, MariaDB, and PostgreSQL also supports up to 15 read replicas per instance, while RDS for Oracle and SQL Server supports up to 5. Combine both: Multi-AZ for HA, Read Replicas for scale-out.
Aurora is AWS's cloud-native relational database engine compatible with MySQL and PostgreSQL. Key architectural differences: Aurora uses a shared distributed storage layer (6-way replication across 3 AZs, self-healing) separate from the compute layer, and storage auto-grows in 10 GB increments up to 128 TB. Standard RDS attaches EBS volumes to individual instances. Aurora offers up to 15 read replicas that share the same storage volume, so replica lag is typically well under 100 ms because no data is copied; RDS read replicas each keep their own copy of the data and replicate asynchronously. Aurora failover is faster (~30s vs ~60-120s for Multi-AZ). AWS claims up to 5× the throughput of standard MySQL and up to 3× that of standard PostgreSQL; benchmark your own workload before relying on those figures. Aurora costs ~20% more than RDS but is often cheaper at scale due to storage efficiency.
Aurora Serverless v2 scales Aurora compute capacity automatically in fine-grained increments (0.5 ACU steps) between a minimum and maximum you set, responding to workload changes within seconds. Unlike v1, which had noticeable scaling delays and cold starts from zero, v2 can scale from a 0.5 ACU floor (about 1 GB of RAM) to 128 ACUs without pausing queries. Newer engine versions also allow a minimum of 0 ACUs with automatic pause and resume, at the cost of a resume delay on the first connection. Ideal use cases: development/test environments that sit idle for long periods, multi-tenant SaaS with variable per-tenant loads, and applications with unpredictable spikes. Not ideal for steady high-load workloads, where provisioned capacity is more cost-efficient.
Redis supports persistence (RDB snapshots, AOF), replication (primary + replicas), automatic failover (ElastiCache for Redis Multi-AZ), Cluster Mode (horizontal sharding), pub/sub, Lua scripting, transactions, and complex data types (sorted sets, lists, hashes, streams, HyperLogLog). Memcached is simpler: pure in-memory cache, multi-threaded for better CPU utilization on multi-core servers, no persistence, no replication, horizontal scaling by adding nodes with consistent hashing. Choose Redis for: session storage, leaderboards, rate limiting, pub/sub, or when you need durability. Choose Memcached for: simple object caching where you want maximum CPU throughput and can tolerate node failure data loss.
When a consumer receives an SQS message, it becomes invisible to other consumers for the visibility timeout period (default 30s, max 12h). If the consumer processes and deletes the message before the timeout expires, it's gone. If processing fails and the consumer doesn't delete it, the message becomes visible again for retry. The `ApproximateReceiveCount` attribute increments on each receive. A dead-letter queue (DLQ) receives messages whose receive count exceeds the `maxReceiveCount` threshold in the source queue's redrive policy, isolating poison-pill messages for inspection. Configure a CloudWatch alarm on the DLQ to alert when messages arrive. For Lambda-SQS integrations, enable partial batch responses (`ReportBatchItemFailures`) so that only the failed messages in a batch return to the queue; bisect-on-error batch splitting applies to Kinesis and DynamoDB stream sources, not SQS.
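One way to isolate failing messages in a Lambda-SQS integration is a partial batch response. The sketch below assumes the event source mapping has `ReportBatchItemFailures` enabled; `process` and the `order_id` field are hypothetical stand-ins for real business logic.

```python
import json


def process(body) -> None:
    """Hypothetical business logic; raises on a message it cannot handle."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event, context=None):
    """Return only the IDs of failed messages so SQS retries those alone."""
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Reported messages become visible again after the visibility timeout.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Messages that keep failing are still counted against `maxReceiveCount`, so a poison pill ends up in the DLQ without blocking the rest of the batch.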
The fan-out pattern uses a single SNS topic to push one message to multiple SQS queues simultaneously, allowing different services to independently process the same event. Example: an e-commerce order event published to SNS triggers a fulfillment SQS queue, an email notification SQS queue, and an analytics SQS queue — each consumed by different services at their own pace without coupling. Benefits: each consumer processes independently, can retry without affecting others, can be scaled separately. SNS also supports message filtering so each SQS subscription only receives relevant events based on message attributes, reducing unnecessary processing.
Kinesis Data Streams is a real-time streaming platform you manage — you set shard count (1 MB/s in, 2 MB/s out per shard), choose consumer type (standard polling or enhanced fan-out), and retain data up to 365 days. It's used for custom real-time processing. Kinesis Data Firehose (now Amazon Data Firehose) is a fully managed delivery service that buffers and loads streaming data into destinations (S3, Redshift, OpenSearch, Splunk) without managing shards — easiest path for streaming ETL. Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) runs SQL or Apache Flink on streaming data for real-time analytics. Typical pipeline: Producers → Data Streams → Lambda/Flink → Firehose → S3.
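A rough shard-count estimate from the per-shard limits quoted above (1 MB/s in, 2 MB/s out). The function name is illustrative, and the sketch ignores the separate per-shard limit of 1,000 records per second on writes, so treat the result as a lower bound.

```python
import math


def required_shards(ingress_mb_per_s: float, egress_mb_per_s: float) -> int:
    """Smallest shard count that satisfies both the write and the read throughput."""
    by_ingress = math.ceil(ingress_mb_per_s / 1.0)  # 1 MB/s in per shard
    by_egress = math.ceil(egress_mb_per_s / 2.0)    # 2 MB/s out per shard
    return max(1, by_ingress, by_egress)
```

With standard (polling) consumers the 2 MB/s of read throughput is shared by all consumers of a shard, so `egress_mb_per_s` should be the sum across consumers; enhanced fan-out gives each consumer its own 2 MB/s.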
EventBridge is a serverless event bus with content-based routing via event patterns (JSON rules matching event source, detail-type, and payload fields). It natively receives events from 100+ AWS services and SaaS partners (Datadog, Zendesk, PagerDuty) without custom code. EventBridge has built-in schema registry, event archiving/replay, and pipes (point-to-point integration with filtering and enrichment). SNS is simpler pub/sub with filtering on message attributes — excellent for high-throughput push notifications. SQS is a durable queue for load leveling and decoupling. EventBridge is ideal for orchestrating complex event-driven architectures between AWS services and SaaS; SNS/SQS are better for internal high-throughput messaging.
A CloudWatch Alarm watches a single metric over a time period and changes state (OK, ALARM, INSUFFICIENT_DATA) when the metric crosses a threshold for a specified number of evaluation periods. Actions on ALARM state: notify SNS, trigger Auto Scaling, stop/terminate/reboot/recover an EC2 instance, or create a Systems Manager OpsItem. Composite alarms combine multiple alarms using Boolean logic (AND/OR/NOT) into a single alarm and are used to reduce alert noise. Example: page on-call only if BOTH CPU > 90% AND error_rate > 5% simultaneously. This prevents false pages when one metric spikes transiently.

```bash
aws cloudwatch put-composite-alarm --alarm-name high-cpu-and-errors \
  --alarm-rule "ALARM(cpu-alarm) AND ALARM(error-alarm)"
```
CloudTrail records API calls made to AWS services — who called what action on which resource and when, from what IP. It is the primary audit and governance tool. Logs go to S3 and optionally CloudWatch Logs. CloudWatch is an observability service for monitoring operational metrics (CPU, request count, latency), collecting logs from applications and AWS services, creating alarms, and building dashboards. Think of it this way: CloudTrail answers "who did what to my AWS resources?" (security/audit); CloudWatch answers "how is my system performing right now?" (operations/alerting). CloudTrail is enabled by default for 90-day event history; creating a Trail sends events to S3 for longer retention.
X-Ray provides end-to-end distributed tracing for applications. The X-Ray SDK instruments your code to generate trace segments containing timing data, annotations (key-value pairs for filtering), and metadata. Downstream calls (DynamoDB, S3, SQS, downstream services) are captured as subsegments. Traces are assembled into a service map showing latency and error rates between components. Lambda, API Gateway, and App Runner have native X-Ray integration (enable active tracing in config). X-Ray sampling rules prevent overwhelming trace volume in production — by default it samples 5% of requests plus 1/s guaranteed. Use X-Ray to find bottlenecks, trace the root cause of errors, and understand service dependencies.
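To see what the default sampling rule means for trace volume, the arithmetic can be sketched like this: one request per second is always traced (the reservoir), and 5% of the remainder is sampled. The function name is illustrative, and the result is an expected value, not an exact count.

```python
def expected_traces_per_second(requests_per_second: float,
                               reservoir: float = 1.0,
                               rate: float = 0.05) -> float:
    """Expected sampled requests per second under a reservoir-plus-rate rule."""
    guaranteed = min(requests_per_second, reservoir)
    return guaranteed + (requests_per_second - guaranteed) * rate
```

At 1,000 requests per second this works out to roughly 51 traces per second, which is why low-traffic services appear almost fully traced while busy ones are sampled.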
Secrets Manager is purpose-built for secrets (database credentials, API keys, OAuth tokens) with built-in automatic rotation — it can rotate RDS, Redshift, DocumentDB, and custom secrets on a schedule by calling a Lambda function. It charges $0.40/secret/month. Parameter Store (SSM Parameter Store) stores configuration data and secrets; Standard tier is free, Advanced tier supports up to 8 KB values and parameter policies. SecureString parameters are encrypted with KMS. Parameter Store lacks native rotation but is cheaper for non-rotating config. Use Secrets Manager for credentials that rotate; Parameter Store for environment-specific config, feature flags, and any secret that doesn't need rotation.
AWS WAF (Web Application Firewall) protects ALBs, CloudFront, API Gateway, and AppSync by inspecting HTTP requests against rule sets called Web ACLs. Rules evaluate request components (URI, headers, query string, body, IP, country) and take actions: Allow, Block, or Count. Rule types include: AWS Managed Rules (pre-built rules for OWASP Top 10, bot control, known bad inputs), rate-based rules (block IPs making more than X requests in 5 minutes — rate limiting and DDoS mitigation), and custom rules (regex matching, geo-block). WAF logs can be sent to S3, CloudWatch Logs, or Kinesis Firehose for security analysis. Associate a Web ACL with CloudFront for globally distributed protection.
Envelope encryption is a two-layer encryption approach. A KMS key (formerly called a customer master key, CMK) never leaves AWS KMS. When you need to encrypt data: call KMS `GenerateDataKey` to receive a plaintext data key and an encrypted copy of the same key. Encrypt your data locally with the plaintext key (using AES-256-GCM), then discard the plaintext key. Store the encrypted data key alongside the ciphertext. To decrypt: call KMS `Decrypt` with the encrypted data key to get the plaintext key back, decrypt your data, then discard the plaintext key again. This avoids sending large amounts of data to KMS (the `Encrypt` API accepts at most 4 KB of plaintext) while keeping KMS as the root of trust. Used by S3 SSE-KMS, EBS encryption, RDS encryption, and most AWS services that encrypt data at rest.
CloudWatch Logs Insights is an interactive query service for log groups stored in CloudWatch Logs. It uses a purpose-built query language (not SQL) with commands like `filter`, `stats`, `sort`, `limit`, and `parse`. Queries scan log data in place — no ETL or schema definition required — making it ideal for ad-hoc troubleshooting (finding all 5xx errors in the last hour, top error messages by Lambda function, P99 latency across API Gateway). Results appear in seconds to minutes depending on log volume. Use Insights when your logs are already in CloudWatch and you need fast operational queries with no setup. Use Athena when logs are exported to S3, you need SQL joins across multiple data sources, or you want to run complex analytics at lower cost on very large historical datasets. CloudWatch Logs Insights bills per GB of data scanned; exporting to S3 + Athena is cheaper for bulk historical analysis.
A multi-account strategy isolates workloads, teams, and environments into separate AWS accounts to contain blast radius, enforce billing boundaries, and meet compliance requirements. The recommended structure uses AWS Organizations with OUs (Organizational Units): a Root OU containing a Management account (billing only), a Security OU (Log Archive + Audit accounts), and workload OUs (Production, SDLC, Sandbox). AWS Control Tower automates landing zone setup — it provisions the account vending machine (Account Factory), applies guardrails (SCPs + AWS Config rules) across OUs, and aggregates logs in the Log Archive account. Landing zones define the account baseline: VPC structure, CloudTrail, Config, SSO/IAM Identity Center, and security tooling. Without Control Tower you must implement all this manually.
Service Control Policies should be written as deny-list policies (start with FullAWSAccess at root, add targeted deny SCPs at lower OUs/accounts) rather than allow-list policies, which block every service you forget to list, including newly launched ones. Common guardrails: deny leaving the organization (`organizations:LeaveOrganization`), deny disabling CloudTrail or Config, deny creating IAM users with access keys in production, require encryption for all S3 puts (`s3:x-amz-server-side-encryption` condition), restrict regions to approved ones only, prevent creation of public S3 buckets. SCPs should be applied to OUs, not individual accounts where possible, because account-level SCPs are hard to audit at scale.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
      "Resource": "*"
    }
  ]
}
```
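The "restrict regions to approved ones" guardrail is commonly written as a deny with a `NotAction` list that exempts global services. The sketch below generates such a policy; the function name is hypothetical and the exempted actions in the usage note are illustrative, not a complete list for any real organization.

```python
import json


def region_restriction_scp(allowed_regions, exempt_actions):
    """Build a deny-list SCP that blocks requests outside the approved Regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideApprovedRegions",
                "Effect": "Deny",
                # Global services are exempted, otherwise IAM, STS, etc. would break.
                "NotAction": sorted(exempt_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": sorted(allowed_regions)}
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = region_restriction_scp(
        ["eu-west-1", "eu-central-1"],
        ["iam:*", "sts:*", "organizations:*", "route53:*", "cloudfront:*", "support:*"],
    )
    print(json.dumps(policy, indent=2))
```

Test a policy like this on a sandbox OU first: a missing exemption for a global service shows up as unexplained access-denied errors in every account under the OU.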
AWS Config continuously records resource configuration changes and evaluates them against Config rules (managed or custom Lambda-backed). Each rule produces a compliance result (COMPLIANT, NON_COMPLIANT) per resource. Remediation actions (SSM documents) can auto-remediate violations. Conformance packs are collections of Config rules and remediation actions packaged as a single entity deployable across an organization via CloudFormation StackSets. AWS provides pre-built conformance packs for CIS Benchmarks, PCI DSS, HIPAA, NIST 800-53. Aggregators collect compliance data across all accounts/regions into a single account for centralized reporting. Config is the detective control layer that complements SCPs (preventative) and GuardDuty (threat detection).
Security Hub aggregates findings from multiple AWS security services (GuardDuty, Inspector, Macie, IAM Access Analyzer, Firewall Manager, Systems Manager Patch Manager) and partner tools into a single pane using the AWS Security Finding Format (ASFF). It runs automated security best practice checks against CIS Benchmarks, AWS Foundational Security Best Practices, and PCI DSS controls. Findings are scored by severity (CRITICAL, HIGH, MEDIUM, LOW, INFORMATIONAL). Cross-region and cross-account aggregation requires designating an aggregation region and enabling in member accounts via Organizations integration. EventBridge can trigger automated response workflows (Lambda, Step Functions) on specific finding types — e.g., auto-isolate an instance with a public security group.
GuardDuty is a threat detection service that uses ML, anomaly detection, and threat intelligence to identify malicious activity. It analyzes: VPC Flow Logs (unusual network traffic, port scanning, C2 communication), CloudTrail logs (API anomalies, credential exfiltration, privilege escalation attempts), DNS logs (queries to known malicious domains), and optionally EKS audit logs, EBS volumes (malware scans), RDS login events, Lambda network activity, and S3 data events. Findings are categorized by threat purpose, for example: Backdoor, CryptoCurrency, Persistence, Stealth, Trojan, UnauthorizedAccess. You do not need to enable VPC Flow Logs or create a CloudTrail trail yourself: GuardDuty reads independent streams of these data sources directly. Integrate with Security Hub for centralized management and EventBridge for automated response.
Macie is a data security service that uses ML to discover, classify, and protect sensitive data in S3. It scans S3 object contents (not metadata) for personally identifiable information (PII: names, addresses, SSNs, passport numbers, credit cards), credentials (AWS keys, private keys), financial data, and PHI under HIPAA. Macie generates findings categorized as Policy findings (bucket made public, encryption disabled, replication disabled) and Sensitive data findings (specific data types and locations). You can define custom data identifiers using regex and keyword patterns for proprietary sensitive data. Macie is useful for discovering shadow data stores, validating DLP policies, and meeting GDPR/HIPAA data discovery requirements.
VPC Flow Logs capture metadata about accepted and rejected IP traffic at ENI, subnet, or VPC level. Fields include: version, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action (ACCEPT/REJECT), log-status. To troubleshoot: filter for `REJECT` action on the destination IP and port to confirm a security group or NACL is blocking. Query in CloudWatch Logs Insights or Athena (partition logs in S3 by account/region/date for cost-efficient querying). Common patterns: an inbound `ACCEPT` paired with a `REJECT` on the return traffic to an ephemeral port points to a NACL, because NACLs are stateless while security groups are stateful and allow replies automatically; short-duration, high-volume flows from unexpected IPs can indicate lateral movement. VPC Reachability Analyzer is the proactive complement: it validates network path configuration without sending actual traffic.

```sql
SELECT srcaddr, dstaddr, dstport, action, COUNT(*) AS c
FROM vpc_flow_logs
WHERE action = 'REJECT'
GROUP BY 1, 2, 3, 4
ORDER BY c DESC
```
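For quick checks outside Athena, a flow log line in the default (version 2) format can be parsed with a few lines of code. The default format also carries `account-id` and `interface-id` right after `version`; the function names here are illustrative.

```python
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr", "srcport",
    "dstport", "protocol", "packets", "bytes", "start", "end", "action", "log_status",
]


def parse_flow_log(line: str) -> dict:
    """Split one space-separated default-format record into named fields."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected_to(lines, dstaddr: str, dstport: int) -> list:
    """Return the REJECT records aimed at one destination IP and port."""
    hits = []
    for line in lines:
        record = parse_flow_log(line)
        if (record["action"] == "REJECT" and record["dstaddr"] == dstaddr
                and record["dstport"] == str(dstport)):
            hits.append(record)
    return hits
```

Custom log formats change the field order, so `FIELDS` would need to match whatever format the flow log was created with.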
Transit Gateway (TGW) supports multiple route tables, enabling complex routing policies across attached VPCs and VPNs. Each attachment (VPC, VPN, Direct Connect Gateway) can be associated with one route table (determines what destinations it can reach) and propagated into one or more route tables (makes it reachable). Security pattern — "inspection VPC" architecture: place a firewall appliance VPC as the hub. Production VPCs have routes defaulting to the inspection VPC attachment. The inspection VPC routes between VPCs after traffic inspection, and routes internet-bound traffic through its own egress. TGW route tables replace the need for transitive VPC peering or complex routing via transit VPCs using NAT. Blackhole routes explicitly drop traffic matching a CIDR, useful for blocking ex-peered networks.
PrivateLink creates a one-way, private connection from a consumer VPC to a service endpoint in a provider VPC without exposing the provider VPC to the consumer or traversing the internet. The provider creates an endpoint service backed by an NLB. Consumers create interface VPC endpoints (elastic network interfaces with private IPs in their VPC subnets) that connect to the endpoint service. Traffic flows over AWS's private network — never public internet. Benefits over VPC peering: no overlapping CIDR issues, no transitive routing, provider VPC stays completely private, and consumers cannot initiate connections back to the provider VPC. PrivateLink is used by 100+ AWS services (S3 interface endpoint, SSM, Secrets Manager) and by SaaS providers for private connectivity.
AWS Network Firewall is a managed, stateful network firewall and IPS/IDS service deployed in a VPC subnet. It provides capabilities beyond SGs and NACLs: deep packet inspection, domain-name filtering (allow/block by FQDN, not just IP), protocol-specific rules (HTTP, TLS, DNS), Suricata-compatible IPS rules for signature-based threat detection, and stateful traffic inspection across flows. Architecture: deploy Network Firewall in a dedicated firewall subnet with a route table directing traffic through it before reaching the internet or spoke VPCs. Use it when you need: outbound domain filtering (block crypto miners), IDS/IPS signatures, protocol filtering, or compliance requirements (PCI, HIPAA). SGs and NACLs handle IP/port-based rules adequately for most workloads.
RPO (Recovery Point Objective) is the maximum acceptable data loss (time). RTO (Recovery Time Objective) is the maximum acceptable downtime. AWS DR strategies by cost and recovery speed: Backup & Restore — cheapest, RPO hours, RTO hours; replicate data to S3/Glacier and restore manually on failure. Pilot Light — replicate data continuously, keep a minimal skeleton (DNS pointing nowhere, DB replica running), scale up compute on failure; RPO minutes, RTO 10–60 minutes. Warm Standby — run a scaled-down version of the full production stack in the DR region at all times; failover is fast scale-out; RPO seconds, RTO minutes. Multi-Site Active/Active — full-capacity stack in two regions serving traffic simultaneously via Route 53 health checks; near-zero RPO/RTO but 2× cost. Route 53 Application Recovery Controller (ARC) provides granular failover control for active/active.
S3 Object Lock implements WORM (Write Once Read Many) storage. Objects are locked in one of two modes. Governance mode: users with `s3:BypassGovernanceRetention` permission can override the lock; useful for testing policies. Compliance mode: no user, including root, can delete or overwrite the protected object version until the retention period expires; this meets SEC 17a-4(f), FINRA, CFTC, and other financial compliance requirements. Retention can be set with a retain-until date or a legal hold (no expiry). Object Lock requires versioning to be enabled. Deleting a locked object creates a delete marker, but the locked version remains. Combining Object Lock with S3 Replication to a bucket owned by a separate account provides additional protection against accidental or malicious deletion.
Glacier Vault Lock applies a lockable access policy (enforcing WORM) to a Glacier vault with a two-step process: initiate the lock (creating a 24-hour window to test and validate), then complete the lock (policy becomes immutable — even AWS cannot override it). Once locked, the vault policy cannot be changed or deleted for the lifetime of the vault. This provides the most stringent compliance guarantee for long-term archives. S3 Object Lock applies at the individual object level within S3, supports both Governance and Compliance modes, and integrates with S3 lifecycle policies. For modern workloads prefer S3 Object Lock in Compliance mode with Glacier Instant Retrieval or Glacier Flexible Retrieval storage class, as it provides object-level granularity.
Aurora Global Database spans multiple AWS Regions with a primary cluster in one Region and up to 5 read-only secondary clusters. Replication uses dedicated infrastructure in the storage layer (not binlog-based), achieving a low RPO (typically about 1 second of lag across regions). Secondary regions serve local reads with low latency. A planned switchover (formerly "managed planned failover", used for maintenance or region migration) takes under a minute and promotes a secondary with zero data loss. For an unplanned outage, use managed failover or detach-and-promote: the chosen secondary is promoted to a standalone writable cluster in roughly 60-120 seconds, and data loss is limited to the replication lag at the time of the failure. Use Aurora Global Database for globally distributed applications, cross-region read scaling, and achieving very low RPO for relational workloads across regions.
DynamoDB Global Tables v2 (2019+) replicates data across multiple AWS Regions using multi-master active-active replication. All replicas accept reads and writes. Writes are asynchronously propagated to all replicas — typically within 1 second under normal conditions. This means reads from a non-local replica can return stale data (eventual consistency across regions). Last-writer-wins conflict resolution is used based on the write timestamp — no application-level conflict resolution is exposed. For strong consistency, reads must target the specific region where the write occurred. Design for eventual consistency: avoid patterns where a write followed immediately by a read-from-different-region must see the write. Version attributes and optimistic locking reduce conflict probability.
Lambda lets you configure only memory; CPU power and network bandwidth are allocated in proportion to it, so the memory setting is really a performance setting. The AWS Lambda Power Tuning open-source tool (a Step Functions state machine) runs your function at different memory settings (128 MB to 10 GB), measures cost and duration, and plots the cost-optimal and speed-optimal configurations. Increasing memory often reduces cost, because faster execution more than compensates for the higher per-ms price. ARM64 (Graviton2) Lambda functions are priced about 20% lower per GB-second than x86, and AWS cites up to 34% better price/performance; the actual speedup depends on the workload, so measure it. Migration from x86 to ARM is a simple configuration change for most functions; code that uses native extensions must be compiled for ARM64.
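The memory-versus-cost trade-off is simple arithmetic: the compute charge is memory (in GB) times duration (in seconds) times the price per GB-second. In the sketch below the price constant is an assumed placeholder, and the per-request charge is left out because it does not depend on memory.

```python
ASSUMED_PRICE_PER_GB_SECOND = 0.0000166667  # placeholder; check current pricing


def invocation_cost(memory_mb: int, duration_ms: float,
                    price_per_gb_second: float = ASSUMED_PRICE_PER_GB_SECOND) -> float:
    """Compute charge for one invocation, excluding the per-request fee."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price_per_gb_second


if __name__ == "__main__":
    slow = invocation_cost(128, 2000)  # 0.25 GB-seconds
    fast = invocation_cost(512, 400)   # 0.20 GB-seconds
    print(f"128 MB: {slow:.10f}  512 MB: {fast:.10f}")
```

In this made-up example, quadrupling memory cuts the duration fivefold, so the larger setting is both faster and cheaper. Whether a real function behaves this way depends on how CPU-bound it is, which is what Power Tuning measures.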
Lambda SnapStart (launched for Java 11 and later on Corretto; AWS has since extended it to Python and .NET managed runtimes) cuts cold start initialization time by taking a snapshot of the initialized execution environment after the runtime starts and your init code runs. When a new execution environment is needed, Lambda restores from the snapshot instead of initializing from scratch, reducing Java cold starts from 1-8 seconds to under 500 ms (and often under 200 ms). Enable it in the function configuration; Lambda creates the snapshot when you publish a function version. In Java, implement the CRaC `Resource` interface (`org.crac`) and its `beforeCheckpoint` / `afterRestore` hooks to handle state that must not be captured in the snapshot (randomness, network connections, temporary credentials). SnapStart cannot be combined with Provisioned Concurrency; use Provisioned Concurrency instead when you need consistently sub-100 ms P99 latency.
Step Functions Standard workflows support executions up to 1 year, are durable (execution state survives failures), support all service integrations (including .waitForTaskToken), and provide full audit history queryable via the console and APIs. Billing is per state transition. Choose Standard for: human-approval workflows, long-running orchestration, business-critical processes requiring audit trails. Express workflows have a maximum duration of 5 minutes, are designed for high-throughput (100K executions/second), billed per execution duration and memory, and use CloudWatch Logs for history (not persisted in Step Functions). Synchronous Express workflows return results to the caller inline. Choose Express for: event processing pipelines, IoT data ingestion, real-time data transformation, microservice orchestration where each step is fast.
In `awsvpc` network mode, each ECS task gets its own elastic network interface (ENI) with a dedicated private IP from the VPC subnet, the same networking model as EC2 instances. Security groups are attached directly to task ENIs (not to the underlying host). This enables fine-grained security group rules per task/service, VPC Flow Log visibility per task, and no port-mapping conflicts on shared hosts. Implications: each task consumes an ENI and a subnet IP address. For the EC2 launch type, the number of ENIs per container instance is capped by instance type (ENI trunking raises the cap on supported types). For Fargate, AWS manages the ENI, but each task still takes one IP address from the subnet. In high task-count environments, size subnets generously and spread tasks over multiple subnets across multiple AZs. For private tasks that need internet access, place them in private subnets with a NAT Gateway; the task's ENI private IP is translated at the NAT.
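Because every `awsvpc` task takes one IP address, subnet size puts a ceiling on task count. A minimal capacity check, assuming IPv4 subnets and using the fact that AWS reserves five addresses in every subnet; the function names are illustrative.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def usable_ips(cidr: str) -> int:
    """Addresses in a subnet that can actually be assigned to ENIs."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(0, total - AWS_RESERVED_PER_SUBNET)


def max_tasks(subnet_cidrs, ips_in_use: int = 0) -> int:
    """Upper bound on awsvpc tasks across subnets, one IP per task."""
    return max(0, sum(usable_ips(c) for c in subnet_cidrs) - ips_in_use)
```

Two /24 subnets with 20 addresses already taken leave room for at most 482 tasks; load balancer nodes, VPC endpoints, and other ENIs in the same subnets count toward `ips_in_use`.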
EKS managed node groups provision EC2 instances into Auto Scaling Groups with a fixed instance type per group. Scaling is reactive: cluster-autoscaler detects unschedulable pods, scales the ASG, and waits for the new node to join (~2-4 minutes). Karpenter is an open-source Kubernetes node provisioner that replaces cluster-autoscaler. It watches for unschedulable pods and directly calls EC2 APIs to provision the best-fit instance (considering requirements like CPU, memory, GPU, Spot/On-Demand, architecture) in ~30-60 seconds. Karpenter consolidates underutilized nodes (binpacking) by rescheduling workloads and terminating empty nodes, reducing waste. It supports multi-instance type provisioning and deep AWS integration (EFA, Graviton, Spot interruption handling). Use Karpenter for dynamic, cost-optimized EKS clusters.
ECR (Elastic Container Registry) lifecycle policies automate the expiry and deletion of container image versions from a repository. Without them, repositories accumulate thousands of image tags consuming storage and increasing the attack surface (old images may have unpatched vulnerabilities). Lifecycle rules select images by tag status (tagged, untagged, or any) and, for tagged images, by tag prefix (for example `semver-` or `dev-`), then apply either a count-based rule (keep the last N images by push date) or an age-based rule (expire images older than X days). Rules are evaluated in priority order. Best practice: keep the last 5 production images (rollback ability), delete all untagged images after 1 day, and delete feature-branch images after 30 days.

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 1
      },
      "action": { "type": "expire" }
    }
  ]
}
```
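The three best-practice rules can be generated as one policy document. This is a sketch that assumes production images are tagged with a `prod-` prefix and feature-branch images with `feature-`; both prefixes and the function name are hypothetical.

```python
import json


def lifecycle_policy(prod_prefix: str = "prod-", feature_prefix: str = "feature-") -> str:
    """Return lifecycle policy text with untagged, production, and feature rules."""
    rules = [
        {
            "rulePriority": 1,
            "description": "Expire untagged images after 1 day",
            "selection": {"tagStatus": "untagged", "countType": "sinceImagePushed",
                          "countUnit": "days", "countNumber": 1},
            "action": {"type": "expire"},
        },
        {
            "rulePriority": 2,
            "description": "Keep only the last 5 production images",
            "selection": {"tagStatus": "tagged", "tagPrefixList": [prod_prefix],
                          "countType": "imageCountMoreThan", "countNumber": 5},
            "action": {"type": "expire"},
        },
        {
            "rulePriority": 3,
            "description": "Expire feature-branch images after 30 days",
            "selection": {"tagStatus": "tagged", "tagPrefixList": [feature_prefix],
                          "countType": "sinceImagePushed", "countUnit": "days",
                          "countNumber": 30},
            "action": {"type": "expire"},
        },
    ]
    return json.dumps({"rules": rules}, indent=2)
```

The returned string is what `aws ecr put-lifecycle-policy --lifecycle-policy-text` expects; previewing the policy before applying it shows which images would expire.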
When AWS needs Spot capacity back, it sends a 2-minute interruption notice via EC2 instance metadata (poll `http://169.254.169.254/latest/meta-data/spot/termination-time` until it returns a non-empty response instead of a 404) and an EventBridge event (`EC2 Spot Instance Interruption Warning`). Handling patterns: for ASG Spot, configure a mixed instances policy with the `price-capacity-optimized` or `capacity-optimized` allocation strategy (AWS discourages `lowest-price` because it favors the pools most likely to be interrupted) and at least 3 instance pools; the ASG replaces interrupted instances. For Spot Fleet, use a request of type `maintain` and act on the rebalance recommendation signal, which can arrive before the 2-minute notice but is not guaranteed to. For containers, enable ECS Spot instance draining (`ECS_ENABLE_SPOT_INSTANCE_DRAINING=true` on the container agent, which moves the instance to DRAINING) or use Karpenter on EKS (automatic pod rescheduling). Application-level: checkpoint work to S3, design for stateless processing, and use SQS to requeue in-flight work on the shutdown signal.
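A polling sketch for the metadata notice. It assumes IMDSv2 (session token required) and reads the sibling path `spot/instance-action`, which returns a small JSON document once an interruption is scheduled and a 404 otherwise. The function names are illustrative, and `fetch` is injectable so the parsing logic can be exercised off-instance.

```python
import json
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0):
    """Return the spot/instance-action body, or None if no interruption is scheduled."""
    token_request = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    with urllib.request.urlopen(token_request, timeout=timeout) as response:
        token = response.read().decode()
    request = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption pending
            return None
        raise


def pending_interruption(fetch=fetch_instance_action):
    """Return the parsed notice, e.g. {'action': 'terminate', 'time': ...}, or None."""
    body = fetch()
    return json.loads(body) if body else None
```

A worker would call `pending_interruption()` every few seconds and, on a non-`None` result, stop taking new work and checkpoint or requeue what is in flight.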
SQS FIFO queues guarantee exactly-once processing and strict message ordering within a message group (identified by `MessageGroupId`). Messages with the same group ID are delivered in order; different group IDs can be processed in parallel by different consumers. Without high throughput mode, a FIFO queue supports 300 API calls per second per action (3,000 messages per second with batches of 10); high throughput mode raises these limits substantially. Deduplication uses a 5-minute deduplication interval: if two messages with the same `MessageDeduplicationId` arrive within 5 minutes, the second is accepted but not delivered. The deduplication ID can be content-based (SHA-256 of the body) or explicitly provided. Limitation: FIFO queues do not support per-message delays (only queue-level delay). While messages from a group are in flight, SQS delivers no further messages from that group to any consumer, so parallelism is bounded by the number of distinct group IDs; spread work across many group IDs to scale out consumers.
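The deduplication window can be modelled locally to reason about producer retries. This is a toy model of the behavior described above, not how SQS implements it; the class name is hypothetical and timestamps are plain seconds supplied by the caller.

```python
import hashlib

DEDUP_WINDOW_SECONDS = 5 * 60


class DedupWindow:
    """Toy model of FIFO deduplication: drop repeats seen within five minutes."""

    def __init__(self):
        self._first_seen = {}  # deduplication ID -> time the message was accepted

    def accept(self, body: str, now: float, dedup_id=None) -> bool:
        """Return True if the message would be delivered, False if deduplicated."""
        # Content-based deduplication derives the ID from the message body.
        dedup_id = dedup_id or hashlib.sha256(body.encode("utf-8")).hexdigest()
        seen_at = self._first_seen.get(dedup_id)
        if seen_at is not None and now - seen_at < DEDUP_WINDOW_SECONDS:
            return False
        self._first_seen[dedup_id] = now
        return True
```

The model makes one practical point visible: with content-based deduplication, two intentionally identical messages sent within five minutes collapse into one, so producers that need both delivered should set an explicit `MessageDeduplicationId`.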
EventBridge Pipes provide point-to-point integration between a source and a target with optional filtering and enrichment in between, without writing glue code. Source → [Filter] → [Enrichment (Lambda/Step Functions/API GW)] → Target. Supported sources include SQS, Kinesis, DynamoDB Streams, MSK, MQ. Supported targets include Lambda, Step Functions, SQS, SNS, EventBridge, API Gateway, and more. Before Pipes, connecting DynamoDB Streams to EventBridge required a Lambda polling function. Pipes eliminates that Lambda, reducing operational overhead and cost. Filtering reduces the data passed to enrichment and target, saving compute. Use Pipes for: event-driven pipelines from streaming sources, CDC (change data capture) workflows, and any fan-in/fan-out pattern needing content-based filtering.
Cost Explorer provides cost and usage visualizations, forecasts (12-month ML-based), and rightsizing recommendations (EC2 instances running at <40% CPU for 14 days). Key analysis steps: group by service, then by usage type, then by tag to find cost drivers. RI/Savings Plans coverage reports show what percentage of eligible usage is covered. The two compute-focused Savings Plans types are Compute Savings Plans (up to 66% off On-Demand; apply to EC2 in any Region, family, size, or OS, plus Lambda and Fargate; most flexible) and EC2 Instance Savings Plans (up to 72% off; locked to a specific instance family in a Region but still flexible across size, OS, and tenancy; least flexible). Commit to 1-year no-upfront as the starting point, since it carries the least lock-in. Use Cost Anomaly Detection for ML-based spend alerts. Tag governance is a prerequisite: enforce tags via SCPs, tag policies, and Config rules to enable meaningful cost allocation.
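The commitment arithmetic behind a Savings Plan can be sketched as follows. This is a simplified model, assuming a single blended discount rate across all covered usage; real plans apply per-service, per-instance rates, and the function names are illustrative.

```python
def hourly_cost(on_demand_usage: float, commitment: float, discount: float) -> float:
    """Hourly bill under a Savings Plan.

    on_demand_usage: eligible usage valued at On-Demand rates ($/hour)
    commitment:      committed spend at Savings Plan rates ($/hour), always charged
    discount:        blended discount as a fraction, e.g. 0.5 for 50% off
    """
    # $1 of commitment buys 1 / (1 - discount) dollars of On-Demand-equivalent usage.
    covered = commitment / (1.0 - discount)
    overflow = max(0.0, on_demand_usage - covered)  # billed at On-Demand rates
    return commitment + overflow


def coverage(on_demand_usage: float, commitment: float, discount: float) -> float:
    """Fraction of eligible usage covered by the plan (what coverage reports show)."""
    if on_demand_usage <= 0:
        return 1.0
    return min(1.0, commitment / (1.0 - discount) / on_demand_usage)


busy_hour = hourly_cost(on_demand_usage=10.0, commitment=2.0, discount=0.5)
quiet_hour = hourly_cost(on_demand_usage=3.0, commitment=2.0, discount=0.5)
```

In the quiet hour the full commitment is still charged even though usage fell below what it covers, which is why sizing the commitment to the steady baseline rather than the peak is the usual advice.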
AWS CDK uses familiar programming languages (TypeScript, JavaScript, Python, Java, C#, Go) with full type safety, IDE completion, loops, conditionals, and reusable constructs; it generates CloudFormation under the hood. Strengths: AWS-native integrations, L2/L3 constructs with opinionated defaults, CDK Pipelines for self-mutating CI/CD. Weaknesses: CloudFormation stack size limits (500 resources), AWS-only (no Azure/GCP), and a slower feedback loop (synthesize → deploy). Terraform uses HCL, is multi-cloud, has a massive ecosystem of providers and modules, follows a stateful plan/apply workflow, and handles non-AWS resources (Cloudflare, Datadog, GitHub). Strengths: declarative state management, large community, provider breadth. Weaknesses: HCL offers only limited programming constructs (`count`, `for_each`, conditional expressions) unless you add a layer such as CDKTF or switch to a general-purpose-language tool like Pulumi, and the state file must be stored, locked, and secured. Use CDK for AWS-only teams wanting developer-friendly IaC; use Terraform for multi-cloud or when existing org standards mandate it.
CDK constructs have three levels of abstraction. L1 (Cfn constructs) are auto-generated 1:1 mappings of CloudFormation resource types (e.g., `CfnBucket`, `CfnInstance`) — full control, maximum verbosity, no defaults. Use when you need a resource property not yet exposed by an L2. L2 (curated constructs, the default "happy path") wrap L1 resources with sensible defaults, higher-level APIs, and helper methods (e.g., `Bucket`, `Function`, `Table`) — they hide CloudFormation boilerplate and add IAM grant methods (`bucket.grantRead(fn)`). L3 (patterns) compose multiple L2 constructs into a common architecture pattern (e.g., `ApplicationLoadBalancedFargateService` from `aws-ecs-patterns`). Start with L3 for common patterns, L2 for most resources, and drop to L1 only when needed.
Every AWS service has default quotas (limits) per account and Region, for example: 5 VPCs per Region, 1,000 Lambda concurrent executions per Region, and 10,000 S3 general purpose buckets per account (the default was 100 until late 2024). Quotas prevent runaway spending and protect shared infrastructure. View quotas in the Service Quotas console or via CLI (`aws service-quotas list-service-quotas --service-code lambda`). To increase: adjustable quotas have a "Request increase" action in Service Quotas, which asks for the desired value; large requests are routed to AWS Support and may need a business justification. Not every quota is adjustable. Critical quotas to increase early: Lambda concurrent executions (default 1,000, request 10K+ before launch), EC2 vCPU limits by instance family, API Gateway requests/second, and DynamoDB table count. Set CloudWatch alarms on quota utilization (available via the Service Quotas console for supported quotas) to proactively catch approaching limits.
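The utilization check behind such an alarm is simple arithmetic. The Python sketch below assumes you have already collected current usage and applied quota values (for example from CloudWatch usage metrics and the Service Quotas API); the dictionary keys and the 80% threshold are illustrative.

```python
from typing import Dict, List, Tuple


def quotas_near_limit(usage: Dict[str, float], limits: Dict[str, float],
                      threshold: float = 0.8) -> List[Tuple[str, float]]:
    """Return (quota name, utilization) pairs at or above the threshold, highest first."""
    flagged = []
    for name, limit in limits.items():
        if limit <= 0:
            continue  # skip quotas with no meaningful ceiling
        ratio = usage.get(name, 0.0) / limit
        if ratio >= threshold:
            flagged.append((name, round(ratio, 2)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


limits = {"Lambda concurrent executions": 1000, "VPCs per Region": 5}
usage = {"Lambda concurrent executions": 850, "VPCs per Region": 2}
alerts = quotas_near_limit(usage, limits)
```

Anything returned in `alerts` is a candidate for a quota increase request before the limit is actually hit.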
A production-grade serverless data pipeline: (1) Producers send events to Kinesis Data Streams (sized by shard: 1 MB/s or 1,000 records/s in, 2 MB/s out). (2) An enhanced fan-out consumer Lambda processes records in real time: it validates the schema, enriches (reverse geocode, user lookup from DynamoDB), and batches records. (3) Lambda writes Parquet files to S3 using a date-partitioned prefix (`year=/month=/day=/hour=`) for Athena query efficiency. (4) Alternatively, use Amazon Data Firehose (formerly Kinesis Data Firehose) with a Lambda transformation function for the S3 delivery; Firehose buffers, compresses (GZIP/Snappy), and partitions automatically. (5) AWS Glue Crawler (or manual DDL) creates/updates the Athena table schema. (6) Athena queries the Parquet data with SQL, scanning only relevant partitions. (7) CloudWatch alarms monitor IteratorAge (Kinesis lag) and Lambda error rates. The DDL below assumes a single `dt=YYYY-MM-DD` partition key rather than the four-level prefix from step 3.

```sql
-- Partition projection reduces Glue Crawler dependency (column names are illustrative)
CREATE EXTERNAL TABLE events (event_id string, payload string)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://bucket/events/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.range' = '2025-01-01,NOW'
);
```
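Two small calculations from the pipeline above, sketched in Python: sizing the stream from the per-shard limits, and building the Hive-style S3 prefix used in step 3. The function names are illustrative, and the egress figure applies to shared-throughput consumers (enhanced fan-out consumers each get their own 2 MB/s per shard).

```python
import math
from datetime import datetime, timezone


def shards_needed(ingress_mb_s: float, egress_mb_s: float) -> int:
    """Minimum shard count given 1 MB/s in and 2 MB/s out per shard."""
    return max(1, math.ceil(ingress_mb_s / 1.0), math.ceil(egress_mb_s / 2.0))


def partition_prefix(ts: datetime) -> str:
    """Hive-style prefix (year=/month=/day=/hour=) in UTC for a timezone-aware timestamp."""
    ts = ts.astimezone(timezone.utc)
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"


shards = shards_needed(ingress_mb_s=4.5, egress_mb_s=6.0)
prefix = partition_prefix(datetime(2025, 3, 7, 9, 30, tzinfo=timezone.utc))
```

Zero-padding the month, day, and hour keeps prefixes lexicographically sortable and lets Athena prune partitions with simple string comparisons.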
CloudFront signed URLs grant time-limited access, typically to a single object. A canned policy carries only the expiry; a custom policy can add a start time, an allowed IP range, and wildcards in the resource path. The policy is signed with the private key of an RSA-2048 key pair whose public key is registered with CloudFront. Signed cookies grant time-limited access to multiple files or entire path patterns without changing each URL: the browser sends three cookies (`CloudFront-Policy` or `CloudFront-Expires`, `CloudFront-Signature`, and `CloudFront-Key-Pair-Id`) in the `Cookie` request header. Use signed URLs for distributing individual files (SaaS file downloads, one-time links). Use signed cookies for protecting entire sections of a site (media streaming, subscriber content). Implementation: upload the public key and add it to a CloudFront key group (the recommended replacement for legacy root-account CloudFront key pairs), sign in a Lambda or server-side function with the private key (stored in Secrets Manager), and configure the distribution's cache behavior to require signed requests by naming the trusted key group. Set a short expiry (15-60 minutes) for sensitive content.
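The custom policy document and CloudFront's URL-safe base64 variant can be built with the standard library alone. This Python sketch stops short of the RSA signature, which needs a cryptography library and the private key; the distribution domain is a placeholder and the function names are illustrative.

```python
import base64
import json
from typing import Optional


def custom_policy(resource: str, expires_epoch: int,
                  source_ip_cidr: Optional[str] = None) -> str:
    """Build a CloudFront custom policy as compact JSON (no whitespace)."""
    condition = {"DateLessThan": {"AWS:EpochTime": expires_epoch}}
    if source_ip_cidr:
        condition["IpAddress"] = {"AWS:SourceIp": source_ip_cidr}
    policy = {"Statement": [{"Resource": resource, "Condition": condition}]}
    return json.dumps(policy, separators=(",", ":"))


def cloudfront_b64(data: bytes) -> str:
    """Base64-encode, then swap the characters that are invalid in a URL query string."""
    encoded = base64.b64encode(data).decode("ascii")
    return encoded.replace("+", "-").replace("=", "_").replace("/", "~")


policy = custom_policy(
    "https://d111111abcdef8.cloudfront.net/private/*",  # placeholder domain
    expires_epoch=1767225600,
    source_ip_cidr="192.0.2.0/24",
)
policy_param = cloudfront_b64(policy.encode("utf-8"))
```

The encoded policy becomes the `Policy` query parameter (or the `CloudFront-Policy` cookie); the signature over the same policy string is encoded the same way.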

Frequently Asked Questions

Do I need an AWS certification for interviews?

Helpful for breaking in but not required. Hands-on project experience trumps certs at senior levels.

Which services come up most?

EC2, S3, IAM, VPC, RDS, Lambda, DynamoDB, CloudWatch, CloudFormation/Terraform. Know these cold.

How important is IAM?

Very — most cloud security questions trace back to IAM policies, roles, and trust relationships.

Terraform or CloudFormation?

Terraform is more widely used; CloudFormation/CDK is AWS-native. Know one well and the concepts translate.

What about cost optimization?

Senior AWS roles always ask. Understand reserved instances, spot, S3 storage classes, and common anti-patterns (NAT gateway egress).
