Docker is an open-source platform that packages an application and all its dependencies into a self-contained unit called a container. Containers share the host OS kernel but are isolated via Linux namespaces and cgroups, making them far lighter than virtual machines. The key benefit is environment consistency: a container that runs on a developer laptop runs identically in CI, staging, and production. Docker also provides a layered image format that caches unchanged layers so rebuilds and pulls are fast.
Virtual machines include a full guest OS kernel running on virtualised hardware under a hypervisor, so each VM typically uses gigabytes of RAM and takes tens of seconds to minutes to boot. Containers share the host OS kernel and use namespaces and cgroups for isolation, so they typically start in well under a second and add only megabytes of overhead. The trade-off is isolation: a VM provides hardware-level separation (a guest kernel bug is contained by the hypervisor), while containers share the host kernel, so a kernel vulnerability can affect every container on that host. For most workloads the performance and density advantages of containers outweigh the isolation trade-off.
A Docker image is a read-only, layered filesystem snapshot that contains the application code, runtime, libraries, and config. It is the blueprint. A container is a running instance of an image: Docker takes the image layers, adds a thin read-write layer on top, and starts a process inside an isolated namespace. You can run many containers from the same image simultaneously; each gets its own writable layer but shares the underlying read-only layers, saving disk space.
A Dockerfile is a plain-text script of instructions that Docker executes sequentially to build an image. Instructions that change the filesystem (`RUN`, `COPY`, `ADD`) each create a new immutable layer on top of the base set by `FROM`; instructions such as `CMD` or `ENV` only record metadata in the image configuration. The file is checked into source control alongside the application code, making builds reproducible and auditable.
```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
CMD ["node", "server.js"]
```
The `FROM` instruction sets the base image for the build. Every subsequent instruction operates on top of that base. Docker pulls the specified image from the registry if it is not cached locally. You can use `FROM scratch` to start with an empty filesystem, which is useful for statically compiled binaries. In multi-stage builds you write multiple `FROM` statements, each starting a new stage with a clean base layer.
`RUN` executes a command at build time and commits the result as a new image layer — used for installing packages. `CMD` provides the default command that runs when the container starts, but it can be overridden by arguments passed to `docker run`. `ENTRYPOINT` sets the fixed executable; any `CMD` or `docker run` arguments are appended to it as parameters. Best practice: set `ENTRYPOINT` to the binary and `CMD` to default flags so the container is easy to use as a CLI tool.
```dockerfile
ENTRYPOINT ["nginx"]
CMD ["-g", "daemon off;"]
```
`COPY` copies files or directories from the build context into the image. `ADD` does everything `COPY` does but also auto-extracts `.tar` archives and fetches remote URLs. The Docker best-practice guide recommends always using `COPY` unless you explicitly need tar extraction, because `ADD` with remote URLs bypasses the layer cache in unpredictable ways and is harder to audit. For downloading remote files, use `RUN curl` with explicit checksums instead.
Each instruction in a Dockerfile that modifies the filesystem (FROM, RUN, COPY, ADD) creates an immutable layer. Layers are content-addressed by a SHA-256 hash and stored once on disk, so two images that share a base layer only store it once. When you rebuild an image Docker checks whether the layer's cache key has changed — if not, it reuses the cached layer, making rebuilds fast. Layer order matters: put instructions that change rarely (like installing OS packages) before instructions that change often (like copying app code).
A container registry is a server that stores, versions, and distributes Docker images. Docker Hub is the default public registry at hub.docker.com; it hosts official images like `node`, `postgres`, and `nginx`. Images are referenced as `registry/repository:tag` — if the registry is omitted, Docker assumes Docker Hub. You can also run private registries: AWS ECR, Google Artifact Registry, GitHub Container Registry, or a self-hosted Harbor instance.
Use `docker build` with the build context (directory containing the Dockerfile) and an optional tag.
```bash
# Build and tag from current directory
docker build -t myapp:1.0 .
# Use a different Dockerfile location
docker build -f docker/Dockerfile.prod -t myapp:prod .
# Pass a build argument
docker build --build-arg NODE_ENV=production -t myapp:prod .
```
Docker sends the build context to the Docker daemon, which executes each instruction in order. Use `.dockerignore` to exclude files from the context to keep it small.
Use `docker run` with the image name. Common flags: `-d` to run detached, `-p` to publish ports, `-e` to set environment variables, `-v` to mount volumes, `--name` to give the container a name.
```bash
# Run detached, publish port, set env var
docker run -d -p 8080:3000 -e NODE_ENV=production --name api myapp:1.0
# Run interactively and remove on exit
docker run -it --rm node:20-alpine sh
```
`docker ps` lists running containers, showing the container ID, image, command, created time, status, ports, and name. `docker ps -a` includes stopped containers. `docker logs <container>` streams the stdout/stderr of a container; `docker logs -f` follows the log output in real time, and `--tail 100` limits output to the last 100 lines.
```bash
docker ps
docker ps -a --format "table {{.Names}}\t{{.Status}}"
docker logs -f --tail 50 api
```
A container's writable layer is destroyed when the container is removed. Docker volumes provide persistent storage that lives outside the container lifecycle and is managed by the Docker daemon on the host at `/var/lib/docker/volumes/`. Volumes can be shared between containers and are the recommended way to persist database files, user uploads, and logs. Bind mounts map a specific host path into the container and are useful for development (live code reloading).
```bash
# Named volume
docker run -v pgdata:/var/lib/postgresql/data postgres:16
# Bind mount for dev
docker run -v $(pwd):/app node:20-alpine
```
Docker networks control how containers communicate with each other and the host. The default **bridge** network creates a virtual ethernet switch; containers get private IPs and communicate by name on user-defined bridge networks. **Host** mode removes network isolation — the container shares the host's network stack directly, useful for high-throughput scenarios but risky. **None** gives the container a loopback interface only, completely isolating it from all networks. User-defined bridge networks provide automatic DNS resolution between containers, which the default `bridge` network does not.
```bash
docker network create mynet
docker run --network mynet --name db postgres:16
docker run --network mynet --name api myapp:1.0
# api container can reach db via hostname "db"
```
Docker Compose is a tool for defining and running multi-container applications with a single YAML file (`compose.yaml`, or the older `docker-compose.yml` name). It manages service definitions, networks, volumes, environment variables, and port mappings declaratively. `docker compose up -d` starts all services; `docker compose down` stops and removes them. It is ideal for local development environments that need a database, cache, and app server to run together reproducibly.
```yaml
services:
  api:
    build: .
    ports: ["3000:3000"]
    depends_on: [db]
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```
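The short `depends_on: [db]` form only orders container start-up; it does not wait for the database to accept connections. A possible variation is sketched below, assuming the stock `postgres` image (which ships `pg_isready`) and its default `postgres` user. The interval and retry values are illustrative, not recommendations.

```yaml
services:
  api:
    build: .
    ports: ["3000:3000"]
    depends_on:
      db:
        # Wait until the db healthcheck passes, not just until the container starts
        condition: service_healthy
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    healthcheck:
      # pg_isready exits 0 once the server accepts connections
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10
```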
Container orchestration is the automated management of containerised workloads at scale. It covers scheduling containers onto nodes, scaling them up or down based on load, restarting failed containers, rolling out new versions without downtime, managing service discovery and load balancing, and handling secrets and configuration. Without orchestration, operating hundreds of containers across many machines would require extensive manual work. Kubernetes is the dominant open-source orchestrator; alternatives include Docker Swarm, Nomad, and managed platforms like AWS ECS.
Kubernetes (K8s) is an open-source container orchestration platform originally developed by Google, open-sourced in 2014, and donated to the CNCF in 2015. It automates deploying, scaling, and managing containerised applications. The core model is declarative: you describe desired state in YAML manifests, and Kubernetes continuously reconciles the actual state of the cluster toward that desired state. It provides primitives for workloads (Pods, Deployments, StatefulSets), networking (Services, Ingress), storage (PVCs), and configuration (ConfigMaps, Secrets).
A Pod is the smallest deployable unit in Kubernetes. It wraps one or more containers that share the same network namespace (same IP, same ports) and can share storage volumes. Containers in a Pod are co-scheduled on the same node and communicate via `localhost`. Pods are ephemeral: if a Pod dies, Kubernetes creates a new one with a new IP rather than restarting the old one. You rarely create bare Pods in production; Deployments and StatefulSets manage Pods for you.
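To make the shared-volume idea concrete, here is a minimal sketch of a two-container Pod. The Pod name, the `log-tailer` sidecar, and the mount paths are hypothetical; the point is only that both containers mount the same `emptyDir` volume and share one network namespace.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar        # hypothetical name
  labels:
    app: web
spec:
  containers:
    - name: web
      image: nginx:1.25
      ports:
        - containerPort: 80
      volumeMounts:
        - name: shared-logs
          mountPath: /var/log/nginx   # nginx writes its log files here
    - name: log-tailer                # hypothetical sidecar reading the same files
      image: busybox:1.36
      command: ["sh", "-c", "tail -F /logs/access.log"]
      volumeMounts:
        - name: shared-logs
          mountPath: /logs
  volumes:
    - name: shared-logs
      emptyDir: {}                    # lives only as long as the Pod
```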
A Deployment is a higher-level Kubernetes object that manages a desired number of identical Pod replicas. It manages Pods through ReplicaSets and uses a rolling update strategy by default: when you update the pod template, the Deployment creates a new ReplicaSet, gradually scales it up, and scales the old one down, which avoids downtime as long as the new pods pass their readiness checks. You can pause, resume, and roll back Deployments. Deployments are the standard way to run stateless workloads.
```bash
kubectl create deployment nginx --image=nginx:1.25 --replicas=3
kubectl set image deployment/nginx nginx=nginx:1.26
kubectl rollout status deployment/nginx
```
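The same Deployment can be written declaratively. The manifest below is a sketch rather than a production configuration: the `maxSurge` and `maxUnavailable` values and the readiness probe path are illustrative choices, and the `app: nginx` label matches what `kubectl create deployment nginx` generates.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra Pod during a rollout
      maxUnavailable: 0    # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
          readinessProbe:   # new Pods receive traffic only after this passes
            httpGet:
              path: /
              port: 80
```

Applying an edited copy with `kubectl apply -f` (for example with the image changed to `nginx:1.26`) triggers the rolling update.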
A Service is a stable network endpoint that provides load-balanced access to a dynamic set of Pods selected by a label selector. Because Pods are ephemeral and their IPs change, Services provide a constant DNS name and ClusterIP that other workloads can rely on. A Service proxies traffic to healthy Pods via kube-proxy. Types include ClusterIP (internal only), NodePort (expose on each node's port), LoadBalancer (provision a cloud load balancer), and ExternalName (alias to an external DNS name).
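As an illustration, a ClusterIP Service selecting Pods labelled `app: nginx` might look like the following. The name and label are assumptions chosen to match a Deployment called `nginx`.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: ClusterIP        # internal only; the default type
  selector:
    app: nginx           # traffic goes to ready Pods carrying this label
  ports:
    - port: 80           # port exposed on the Service's ClusterIP
      targetPort: 80     # container port on the selected Pods
```

Other workloads in the same Namespace can then reach it simply as `nginx`.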
A Namespace is a logical partition within a Kubernetes cluster that groups resources and provides a scope for names. Resources in different Namespaces can have the same name without conflict. Namespaces are commonly used to separate environments (dev, staging, prod) within a shared cluster, or to isolate teams. Resource quotas and RBAC policies can be applied per-Namespace. Cluster-scoped resources like Nodes and PersistentVolumes are not Namespace-scoped.
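A per-Namespace quota could be sketched like this. The `staging` name and every number are placeholders to adapt, not recommended values.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota      # hypothetical name
  namespace: staging
spec:
  hard:
    pods: "20"             # maximum number of Pods in the Namespace
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```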
`kubectl` is the command-line tool for interacting with a Kubernetes cluster via the Kubernetes API server. It reads cluster connection details from `~/.kube/config` (the kubeconfig file) and communicates over HTTPS. Common operations include applying manifests, inspecting resources, viewing logs, exec-ing into containers, and managing rollouts. `kubectl` supports multiple contexts so you can switch between clusters quickly.
```bash
kubectl config get-contexts
kubectl config use-context prod-cluster
```
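For orientation, a kubeconfig file ties clusters, users, and contexts together roughly as follows. The server URL, names, and certificate paths are placeholders; real files often embed certificate data or use an exec-based credential plugin instead.

```yaml
apiVersion: v1
kind: Config
current-context: prod-cluster       # what `kubectl config use-context` changes
clusters:
  - name: prod
    cluster:
      server: https://prod.example.com:6443    # placeholder API server URL
      certificate-authority: /path/to/ca.crt   # placeholder path
contexts:
  - name: prod-cluster
    context:
      cluster: prod
      user: prod-admin
      namespace: production         # default namespace for this context
users:
  - name: prod-admin
    user:
      client-certificate: /path/to/client.crt  # placeholder path
      client-key: /path/to/client.key          # placeholder path
```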
`kubectl get` lists resources (pods, services, deployments, etc.) in table form. `kubectl describe` shows detailed information about a specific resource including events, which is invaluable for debugging. `kubectl apply -f manifest.yaml` creates or updates resources to match the manifest (declarative). `kubectl delete` removes resources by name or label selector.
```bash
kubectl get pods -n production -o wide
kubectl describe pod api-6f9d8b-xyz -n production
kubectl apply -f k8s/
kubectl delete deployment old-service -n staging
```
A ConfigMap stores non-sensitive configuration data as key-value pairs that can be injected into Pods as environment variables, command-line arguments, or mounted as files in a volume. This decouples configuration from container images, so you can change settings without rebuilding. Changes to a mounted ConfigMap volume propagate to running Pods (with some latency), while env var injection requires a Pod restart.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: "info"
  DATABASE_URL: "postgres://db:5432/mydb"
```
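A Pod could consume that ConfigMap in both ways at once, as sketched below. The Pod name, the `myapp:1.0` image, and the `/etc/app` mount path are assumptions for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: myapp:1.0
      envFrom:
        - configMapRef:
            name: app-config      # every key becomes an env var (fixed at start)
      volumeMounts:
        - name: config
          mountPath: /etc/app     # each key appears as a file; updates propagate
          readOnly: true
  volumes:
    - name: config
      configMap:
        name: app-config
```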
A Secret stores sensitive data such as passwords, API keys, and TLS certificates. Secrets are base64-encoded (not encrypted by default) in etcd; enabling encryption at rest requires configuring an EncryptionConfiguration for the API server. Secrets can be mounted as files or injected as environment variables. Best practice is to avoid env var injection (secrets can leak into logs) and instead mount as files with tight file permissions. Tools like Sealed Secrets or External Secrets Operator manage secrets safely in GitOps workflows.
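The file-mount approach can be sketched as follows. The Secret name, key, placeholder value, and mount path are all hypothetical; in a GitOps setup the Secret itself would come from a tool such as Sealed Secrets rather than a plain manifest.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
stringData:                  # plain text here; the API server stores it base64-encoded
  password: change-me        # placeholder value, never commit real secrets
---
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: myapp:1.0
      volumeMounts:
        - name: db-credentials
          mountPath: /etc/secrets   # the key is exposed as /etc/secrets/password
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        secretName: db-credentials
        defaultMode: 0400           # octal: readable by the file owner only
```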
A Node is a physical or virtual machine that runs container workloads. Each Node runs three core components: the **kubelet** (communicates with the control plane and manages Pods), **kube-proxy** (handles network routing for Services), and a **container runtime** (containerd or CRI-O). Nodes are registered with the control plane, which schedules Pods onto them based on available resources, taints, labels, and affinity rules. `kubectl get nodes` shows node status, roles, and version; `kubectl describe node <name>` adds capacity and allocated resources.
The Control Plane is the set of components that manage the overall cluster state. It runs on dedicated control-plane nodes (formerly called master nodes) or is managed by the cloud provider. Key components: **kube-apiserver** (the REST API front end; every other component goes through it, and it is the only one that talks to etcd directly), **etcd** (distributed key-value store, the source of truth for all cluster state), **kube-scheduler** (assigns Pods to Nodes), **kube-controller-manager** (runs controllers that reconcile desired vs actual state), and **cloud-controller-manager** (integrates with the cloud provider API).
The kubelet is an agent that runs on every Node. It watches the API server for PodSpecs assigned to its Node and ensures the containers described in those specs are running and healthy. It interacts with the container runtime via the CRI (Container Runtime Interface) to pull images and start/stop containers. The kubelet also runs liveness/readiness/startup probes and reports Pod status back to the API server. If a container fails, the kubelet restarts it according to the Pod's `restartPolicy`.
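The three probe types and `restartPolicy` are declared in the Pod spec. In the sketch below the `/healthz` and `/ready` endpoints, port 3000, and all timing values are assumptions about a hypothetical application.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  restartPolicy: Always          # kubelet restarts containers that exit or fail liveness
  containers:
    - name: api
      image: myapp:1.0
      ports:
        - containerPort: 3000
      startupProbe:              # holds off the other probes until the app has booted
        httpGet:
          path: /healthz
          port: 3000
        failureThreshold: 30
        periodSeconds: 2
      livenessProbe:             # failure leads to a container restart
        httpGet:
          path: /healthz
          port: 3000
        periodSeconds: 10
      readinessProbe:            # failure removes the Pod from Service endpoints
        httpGet:
          path: /ready
          port: 3000
        periodSeconds: 5
```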
A ReplicaSet ensures that a specified number of Pod replicas are running at any given time. It uses a label selector to identify the Pods it owns; if a Pod dies, the ReplicaSet creates a replacement. In practice you almost never create ReplicaSets directly — Deployments manage them for you and add rolling update and rollback capabilities. You might inspect a ReplicaSet to understand how a Deployment manages versions: `kubectl get replicaset -n <ns>` shows old and new RS during a rollout.
A container registry stores and distributes container images by name and tag. Kubernetes nodes pull images via the container runtime (containerd/CRI-O) using the image reference in the Pod spec. For private registries, you create a Kubernetes Secret of type `kubernetes.io/dockerconfigjson` containing registry credentials and reference it in the Pod spec's `imagePullSecrets` field. Cloud providers offer integrated credential helpers (e.g., AWS ECR, GCP Artifact Registry) that auto-refresh short-lived tokens so you don't need to rotate the Secret manually.
```bash
kubectl create secret docker-registry regcred \
--docker-server=myregistry.io \
--docker-username=user --docker-password=pass
```
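A Pod then references that Secret by name. The image path under `myregistry.io` is a hypothetical example.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  imagePullSecrets:
    - name: regcred                       # the Secret created with kubectl above
  containers:
    - name: api
      image: myregistry.io/team/myapp:1.0 # hypothetical private image
```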
Multi-stage builds use multiple `FROM` statements in one Dockerfile. Each stage is isolated: you can use a full build toolchain (e.g., `golang:1.22`) in stage 1 to compile the binary, then copy only the compiled artifact into a minimal runtime image (e.g., `gcr.io/distroless/static`) in stage 2. The final image contains none of the build tools, source code, or intermediate files, dramatically reducing the attack surface and image size. You reference a previous stage with `COPY --from=builder`.
```dockerfile
FROM golang:1.22 AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server
FROM gcr.io/distroless/static:nonroot
COPY --from=builder /app /app
ENTRYPOINT ["/app"]
```
This typically reduces a Go image from ~900 MB to ~10 MB.
The `.dockerignore` file lists paths that Docker excludes from the build context sent to the daemon. Without it, `docker build .` sends everything — including `node_modules/`, `.git/`, test fixtures, and local `.env` files — which inflates build time and risks leaking secrets into the image. Common entries: `.git`, `node_modules`, `*.log`, `.env*`, `coverage/`, `dist/` (if not needed). Excluding `node_modules` is especially important because `npm install` inside the container must run fresh anyway and the local folder can be gigabytes.
```
.git
.env*
node_modules
dist
coverage
*.log
```
Linux namespaces wrap a system resource so each container sees its own isolated view. The six main types: **pid**: each container gets its own PID 1, so processes cannot see or signal processes in other containers. **net**: each container gets its own network interfaces, routing tables, and firewall rules. **mnt**: a separate mount table, so bind mounts and volumes are per-container. **uts**: containers can have their own hostname. **ipc**: isolates System V IPC and POSIX message queues. **user**: maps container UIDs to different host UIDs, enabling rootless containers where the in-container root maps to an unprivileged host user. Docker creates new pid, net, mnt, uts, and ipc namespaces for each container by default; the user namespace is opt-in (via `userns-remap` on the daemon or rootless mode). Passing `--network host` or `--pid host` shares the host's namespace and removes that specific isolation.
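To make the per-process view concrete, the sketch below lists the namespace identifiers of the current process by reading the symlinks under `/proc/self/ns`. It assumes a Linux host; the helper name `current_namespaces` is illustrative, and on other systems it simply returns an empty mapping.

```python
import os

# The six namespace types discussed above, as named under /proc/<pid>/ns.
NAMESPACE_TYPES = ("pid", "net", "mnt", "uts", "ipc", "user")


def current_namespaces(proc_dir="/proc/self/ns"):
    """Map each namespace type to the identifier the kernel reports for it."""
    found = {}
    for ns_type in NAMESPACE_TYPES:
        try:
            # Each entry is a symlink whose target encodes the namespace inode.
            found[ns_type] = os.readlink(os.path.join(proc_dir, ns_type))
        except OSError:
            # Not Linux, or the entry is not readable: skip it.
            continue
    return found


if __name__ == "__main__":
    for ns_type, ident in current_namespaces().items():
        print(f"{ns_type:5} {ident}")
```

Two processes that share a namespace report the same identifier for that type, so running this on the host and inside a container should show different values for every type the container isolates.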
Control groups (cgroups) are a Linux kernel mechanism that limits and accounts for the resource usage of process groups. **cgroups v1** has a separate hierarchy for each resource controller (cpu, memory, blkio), which makes it hard to coordinate accounting across controllers: a process can sit in a different group in each hierarchy. **cgroups v2** uses a unified hierarchy with all controllers under a single tree, which fixes the coordination issues and enables better memory accounting (including page cache). Docker and other container runtimes map `--memory` and `--cpus` to cgroup limits. Kubernetes sets cgroup limits via the kubelet from Pod `resources.limits`. cgroups v2 is the default on most current Linux distributions (Kubernetes documents kernel 5.8 or later as a requirement for it) and enables better OOM handling and more accurate CPU throttling metrics.
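As a rough illustration of that mapping, the sketch below converts Docker-style `--cpus` and `--memory` values into the strings a cgroup v2 hierarchy stores in `cpu.max` and `memory.max`. The function names are illustrative; the 100 ms period is Docker's documented default for `--cpus`, and the memory suffixes are treated as binary units.

```python
CPU_PERIOD_US = 100_000  # default CFS period used for --cpus, in microseconds

_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def cpu_max(cpus, period_us=CPU_PERIOD_US):
    """Return the cgroup v2 cpu.max value ('<quota> <period>') for a --cpus setting."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    quota_us = round(cpus * period_us)
    return f"{quota_us} {period_us}"


def memory_max(limit):
    """Convert a --memory value such as '512m' into bytes for memory.max."""
    limit = limit.strip().lower()
    if limit[-1] in _UNITS:
        return int(limit[:-1]) * _UNITS[limit[-1]]
    return int(limit)  # a bare number is already in bytes


if __name__ == "__main__":
    print(cpu_max(1.5))        # quota and period for one and a half CPUs
    print(memory_max("512m"))  # byte value for a 512 MiB limit
```

A container limited to 1.5 CPUs may therefore run for 150 ms of CPU time in every 100 ms window summed across cores, after which the kernel throttles it until the next period.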
Docker uses overlayfs (overlay2 storage driver) to compose image layers. An overlay mount has a lower directory (one or more read-only layers stacked from bottom to top), an upper directory (the writable container layer), and a merged view. When a container reads a file, the kernel looks from the top (upper, then each lower layer in order) and returns the first match. When a container writes to a file that only exists in a lower layer, the kernel performs a copy-on-write: it copies the file into the upper directory before modifying it. Deleting a file creates a "whiteout" file in the upper layer that hides the lower layer version. This means image layers are never modified in place, keeping them immutable and safely shareable across containers.
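The lookup, copy-on-write, and whiteout rules above can be modelled in a few lines. This is a toy in-memory sketch, not how the kernel implements overlayfs: layers are plain dictionaries, and the class and method names are made up for illustration.

```python
WHITEOUT = object()  # marker standing in for an overlayfs whiteout entry


class OverlayMount:
    """Toy model: read-only lower layers plus one writable upper layer."""

    def __init__(self, lower_layers):
        self.lowers = lower_layers  # list of dicts, bottom layer first
        self.upper = {}             # the container's writable layer

    def read(self, path):
        # Upper layer wins; a whiteout there hides any lower copy.
        if path in self.upper:
            if self.upper[path] is WHITEOUT:
                raise FileNotFoundError(path)
            return self.upper[path]
        # Otherwise search the lower layers from top to bottom.
        for layer in reversed(self.lowers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        # Copy-on-write: copy the current content up, then modify the copy.
        self.upper[path] = self.read(path) + data

    def delete(self, path):
        self.read(path)               # raises if the file is not visible
        self.upper[path] = WHITEOUT   # hide it without touching lower layers


if __name__ == "__main__":
    base = {"/etc/motd": "hello"}
    app = {"/app/server.js": "v1"}
    mount = OverlayMount([base, app])
    mount.append("/etc/motd", " world")
    mount.delete("/app/server.js")
    print(mount.read("/etc/motd"), "| image layer still holds:", base["/etc/motd"])
```

Because every write lands in `upper`, a second `OverlayMount` built over the same `base` and `app` dictionaries would still see the original files, which is the property that lets containers share image layers.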
`docker history <image>` shows each layer, its creation command, and its size. Layers that consume the most space are candidates for optimisation. Because Docker's build cache invalidates all layers below the first changed instruction, layer ordering directly affects cache hit rates. Put infrequently changing instructions (OS package installs, dependency downloads) near the top and frequently changing instructions (copying app source code) near the bottom.
```bash
docker history myapp:latest
# Good order:
# COPY package*.json ./   ← changes rarely (npm ci also needs the lockfile)
# RUN npm ci              ← cached if the manifests are unchanged
# COPY . .                ← changes often; only this step and later ones rebuild
```
A poorly ordered Dockerfile that copies source code before installing dependencies re-runs `npm ci` on every code change.
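The effect of ordering can be sketched with a simplified model of the cache: the first instruction whose inputs changed, and everything after it, must be rebuilt. The function name and the way changes are passed in are illustrative assumptions; the real cache keys on instruction text and file checksums.

```python
def steps_to_rebuild(instructions, changed):
    """Return the instructions that miss the cache, given which ones changed."""
    for index, instruction in enumerate(instructions):
        if instruction in changed:
            # A cache miss invalidates this step and every later one.
            return instructions[index:]
    return []  # nothing changed: every step is served from cache


if __name__ == "__main__":
    good = ["FROM node:20-alpine", "COPY package*.json ./", "RUN npm ci", "COPY . ."]
    bad = ["FROM node:20-alpine", "COPY . .", "RUN npm ci"]
    source_edit = {"COPY . ."}  # only application source changed
    print(steps_to_rebuild(good, source_edit))
    print(steps_to_rebuild(bad, source_edit))
```

With the good order a source-only edit re-runs just the final `COPY`; with the bad order the same edit also forces `RUN npm ci` to run again.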
`FROM scratch` builds an image with an empty filesystem, which is ideal for statically compiled binaries (Go, Rust) that have no external dependencies. Distroless images (from Google's `gcr.io/distroless`) contain only the application's runtime dependencies, with no shell, package manager, or other OS utilities. Both approaches drastically reduce image size and attack surface: fewer binaries means fewer CVEs, and without a shell an attacker who gains code execution has far fewer tools to work with. The trade-off is debuggability: you cannot `exec` into the container and run shell commands. The workaround is a debug variant (`gcr.io/distroless/static:debug`), which adds a busybox shell and should be kept out of production.
By default Docker containers run as UID 0 (root) inside the container. If the container runtime is misconfigured or a container escape vulnerability is exploited, root in the container can become root on the host. The `USER` instruction in a Dockerfile switches to a non-root user for all subsequent `RUN`, `CMD`, and `ENTRYPOINT` instructions. Create the user explicitly in the Dockerfile so you control the UID.
```dockerfile
FROM node:20-alpine
# Pin the UID explicitly (10001 is an arbitrary example value)
RUN addgroup -S appgroup && adduser -S -u 10001 -G appgroup appuser
WORKDIR /app
COPY --chown=appuser:appgroup . .
USER appuser
CMD ["node", "server.js"]
```
Kubernetes can enforce the same rule per Pod or container via `securityContext.runAsNonRoot: true` and `runAsUser`.
BuildKit is Docker's next-generation build engine. It is the default builder since Docker Engine 23.0; on older versions enable it with `DOCKER_BUILDKIT=1` or use `docker buildx build`. `RUN --mount=type=cache` mounts a persistent cache directory that is not included in the final image layer, which suits `npm`, `pip`, or `go` module caches that speed up repeated builds without bloating the image. `RUN --mount=type=secret` makes a secret (e.g. a private npm token) available at build time without baking it into any layer. `RUN --mount=type=ssh`, paired with the `--ssh` build flag, forwards SSH agent access into the build so you can clone private repositories without exposing keys.
```dockerfile
# Cache npm modules across builds
RUN --mount=type=cache,target=/root/.npm \
    npm ci --prefer-offline
# Use a secret at build time (never in a layer).
# Supply it with: docker build --secret id=npmrc,src=$HOME/.npmrc .
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm install
```
Three complementary hardening strategies: **Read-only root filesystem** (`--read-only`) prevents the container process from writing anywhere except explicitly mounted writable volumes or tmpfs mounts, blocking most persistence-based attacks. **Capability dropping** (`--cap-drop ALL --cap-add NET_BIND_SERVICE`) removes Linux capabilities the container does not need; most apps need zero capabilities. **Seccomp profiles** filter which syscalls the container can make — the Docker default profile blocks ~44 dangerous syscalls; a custom profile can be even stricter. In Kubernetes these map to `securityContext.readOnlyRootFilesystem`, `securityContext.capabilities`, and `securityContext.seccompProfile`.
```bash
docker run --read-only --cap-drop ALL \
--security-opt seccomp=/path/to/profile.json myapp
```
A `HEALTHCHECK` in a Dockerfile defines how the Docker daemon checks whether the container is still functioning. If it fails repeatedly, `docker ps` marks the container as unhealthy, and Docker Swarm replaces the unhealthy task. Kubernetes ignores the Dockerfile `HEALTHCHECK` and uses its own probe system: **Liveness probes** restart a container that has deadlocked. **Readiness probes** remove a Pod from Service endpoints until it is ready to serve traffic, so requests are not routed to a starting or overloaded Pod. **Startup probes** give slow-starting containers extra time before liveness checks kick in. Probes can be HTTP GET, TCP socket, gRPC, or exec-based, and each is tuned with `initialDelaySeconds`, `periodSeconds`, and `failureThreshold`.
The default **RollingUpdate** strategy gradually replaces old Pods with new ones. `maxSurge` controls how many extra Pods above the desired count can run during the update; `maxUnavailable` controls how many Pods can be unavailable. This achieves zero downtime but means both versions run simultaneously, so your app must be backward-compatible. The **Recreate** strategy terminates all old Pods before starting new ones, causing downtime but ensuring only one version runs at a time — useful for databases or apps with breaking schema changes.
```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```
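To see what a given pair of settings allows, the sketch below computes the Pod-count bounds during a rollout. It assumes the documented rounding for percentage values (`maxSurge` rounds up, `maxUnavailable` rounds down); the function names are illustrative.

```python
def _resolve(value, desired, round_up):
    """Turn an absolute number or a percentage string into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * desired
        # Integer arithmetic avoids float rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(desired, max_surge, max_unavailable):
    """Return (max total Pods, min available Pods) during a rolling update."""
    surge = _resolve(max_surge, desired, round_up=True)
    unavailable = _resolve(max_unavailable, desired, round_up=False)
    if surge == 0 and unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be zero")
    return desired + surge, desired - unavailable


if __name__ == "__main__":
    print(rollout_bounds(4, 1, 0))           # the settings from the manifest above
    print(rollout_bounds(10, "25%", "25%"))  # the Deployment defaults
```

With `maxSurge: 1` and `maxUnavailable: 0` on four replicas, the rollout never drops below four available Pods and never exceeds five in total, which trades rollout speed for capacity.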
A **request** is the amount of a resource reserved for a container; the scheduler uses requests to decide which Node has enough capacity to place the Pod. A **limit** is the maximum the container may use; the kernel enforces CPU limits via cgroup throttling and memory limits via the OOM killer. Setting requests without limits lets a noisy neighbour consume all spare capacity on the node, and for memory that can push the node into pressure and trigger evictions. Setting limits below actual usage causes CPU throttling (bad for latency) or an OOMKill (bad for reliability). A common starting point: set requests near P95 usage, set CPU limits at 2–3× requests, and set memory limits tightly, since memory is not compressible.
```yaml
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "512Mi"
```
Kubernetes assigns every Pod a Quality of Service class based on its resource declarations. **Guaranteed**: every container has equal requests and limits for both CPU and memory; these Pods are last to be evicted under pressure. **Burstable**: at least one container has a request or limit set, but the Pod does not meet the Guaranteed criteria; these are evicted second. **BestEffort**: no requests or limits set at all; these Pods are first to be evicted when the Node is under memory pressure. QoS class therefore largely determines eviction order when a node runs low on resources. For production workloads, aim for Guaranteed or at minimum Burstable.
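The classification rules can be written down as a short function. This is a simplified sketch: containers are plain dictionaries, quantities are compared as strings (so `"1"` and `"1000m"` would count as different), and the function name is illustrative.

```python
def qos_class(containers):
    """Classify a Pod from its containers' 'requests' and 'limits' mappings."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    pod = [{"requests": {"cpu": "250m", "memory": "256Mi"},
            "limits": {"cpu": "1000m", "memory": "512Mi"}}]
    print(qos_class(pod))
```

A Pod whose containers declare requests below their limits, as in this usage example, comes out as Burstable; matching the two values for both resources in every container would make it Guaranteed.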
The HPA controller queries the Metrics Server (or a custom metrics adapter) every 15 seconds by default and compares current utilisation to a target. For CPU-based scaling: `desiredReplicas = ceil(currentReplicas * currentCPU / targetCPU)`. The HPA respects the `minReplicas` and `maxReplicas` bounds and uses a stabilisation window to prevent flapping (default: 5 minutes for scale-down and none for scale-up). You can also scale on custom metrics (e.g., RPS from Prometheus via an adapter such as prometheus-adapter or KEDA) or external metrics (e.g., SQS queue depth).
```bash
kubectl autoscale deployment api \
--cpu-percent=60 --min=2 --max=20
kubectl get hpa
```
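The scaling formula is easy to check by hand with a small function. The sketch below applies the formula, the min/max bounds, and the controller's default 10% tolerance band; it leaves out stabilisation windows and multi-metric handling, and the function name is illustrative.

```python
import math


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas, max_replicas, tolerance=0.1):
    """Apply the HPA formula, skipping changes inside the tolerance band."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target, bounds 2..20
    print(desired_replicas(4, 90, 60, 2, 20))
```

For the `kubectl autoscale` command above, four replicas at 90% average CPU would scale to six, since 4 × 90 / 60 = 6.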
**ClusterIP** (default) exposes the Service on an internal cluster IP only, with no external access. **NodePort** opens a port on every Node (30000–32767 by default) and forwards to the Service; it is simple but exposes every node. **LoadBalancer** provisions a cloud load balancer with an external IP; each such Service typically gets its own load balancer and IP, so cost grows with the number of Services. **Ingress** is not a Service type but a separate L7 HTTP/HTTPS routing resource: it sits in front of multiple Services and routes traffic by host or path through a single external load balancer, which is far more cost-effective and enables TLS termination, virtual hosting, and path-based routing.
An Ingress controller is the component that implements the Kubernetes Ingress resource. **Nginx Ingress Controller** (ingress-nginx) is the most widely deployed; it runs Nginx inside the cluster, supports rich annotation-based configuration, and is well understood. **Traefik** is a cloud-native reverse proxy with automatic Let's Encrypt, a built-in dashboard, and native Kubernetes service discovery via CRDs. **AWS Load Balancer Controller** provisions an AWS ALB for Ingress resources (and an NLB for Services of type LoadBalancer), keeping load balancing outside the cluster; this offloads TLS to AWS ACM, provides WAF integration, and is operationally simpler on EKS, but ties you to AWS.
A **PersistentVolume (PV)** is a cluster-level storage resource provisioned by an admin or dynamically by a StorageClass. A **PersistentVolumeClaim (PVC)** is a user's request for storage — it specifies size, access mode (ReadWriteOnce, ReadWriteMany), and optionally a StorageClass. Kubernetes binds a PVC to a suitable PV. A **StorageClass** defines the "class" of storage (e.g., `gp3`, `io2`, `nfs`) and the provisioner that creates volumes on demand. Dynamic provisioning via StorageClass is the modern approach: a PVC is created, the provisioner creates a cloud disk, and a PV is automatically bound.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: gp3
  resources:
    requests:
      storage: 50Gi
```
StatefulSets are designed for stateful applications that require stable, persistent identities. Unlike Deployments, each StatefulSet Pod gets a stable hostname of the form `<statefulset-name>-<ordinal>` (for example `web-0`, `web-1`, ...) that persists across restarts, an ordered startup/shutdown sequence, and a dedicated PersistentVolumeClaim per replica that is not shared. This makes StatefulSets the correct choice for databases (PostgreSQL, MongoDB), message brokers (Kafka, RabbitMQ), and distributed caches (Redis cluster): anywhere a Pod needs to know its own identity or have exclusive access to persistent storage. Pods are started in order (0, 1, 2...) and terminated in reverse order, which matters for primary/replica replication setup.
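The stable identities follow a predictable naming scheme, sketched below. The StatefulSet, headless Service, namespace, and claim-template names in the usage example are hypothetical, and `cluster.local` is assumed as the cluster domain.

```python
def pod_identities(statefulset, service, namespace, replicas,
                   claim_template, cluster_domain="cluster.local"):
    """Return (pod name, stable DNS name, PVC name) for each ordinal."""
    identities = []
    for ordinal in range(replicas):
        pod = f"{statefulset}-{ordinal}"
        # Stable DNS name published through the governing headless Service.
        dns = f"{pod}.{service}.{namespace}.svc.{cluster_domain}"
        # One claim per replica, derived from the volumeClaimTemplate name.
        pvc = f"{claim_template}-{pod}"
        identities.append((pod, dns, pvc))
    return identities


if __name__ == "__main__":
    ids = pod_identities("postgres", "postgres-headless", "prod", 3, "data")
    startup_order = [pod for pod, _, _ in ids]
    shutdown_order = list(reversed(startup_order))
    for pod, dns, pvc in ids:
        print(pod, dns, pvc)
    print("start:", startup_order, "stop:", shutdown_order)
```

Because the names are derived from the ordinal rather than generated randomly, a replica can find its primary at a fixed address, and a rescheduled Pod reattaches to the same claim.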
A DaemonSet ensures that exactly one Pod runs on every (or a subset of) Node(s) in the cluster. When a new Node joins the cluster, the DaemonSet automatically schedules a Pod on it; when a Node is removed, the Pod is garbage-collected. Classic use cases: **node-level log collection** (Fluent Bit, Fluentd), **node monitoring agents** (Datadog agent, Prometheus Node Exporter), **CNI plugins** (Calico, Cilium node agents), **storage daemons** (Ceph, GlusterFS). DaemonSets can use `nodeSelector` or `nodeAffinity` to target specific Node roles (e.g., only GPU nodes).
A **Job** runs one or more Pods to completion — it tracks successful completions and retries on failure up to `backoffLimit`. Jobs are for batch tasks: data migrations, report generation, one-off scripts. `parallelism` and `completions` control concurrent execution. A **CronJob** wraps a Job with a cron schedule and creates a new Job at each trigger. `concurrencyPolicy` controls what happens if a previous Job is still running: `Allow`, `Forbid`, or `Replace`.
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-cleanup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: myapp:latest
              command: ["node", "scripts/cleanup.js"]
          restartPolicy: OnFailure
```
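The three `concurrencyPolicy` values differ only in what happens when a schedule tick arrives while the previous Job is still running. A minimal decision sketch, with an illustrative function name and plain-text action labels:

```python
def on_schedule_tick(policy, previous_job_running):
    """Return the actions taken when the schedule fires."""
    if not previous_job_running or policy == "Allow":
        return ["start new Job"]               # runs may overlap under Allow
    if policy == "Forbid":
        return ["skip this run"]               # let the running Job finish
    if policy == "Replace":
        return ["delete running Job", "start new Job"]
    raise ValueError(f"unknown concurrencyPolicy: {policy}")


if __name__ == "__main__":
    for policy in ("Allow", "Forbid", "Replace"):
        print(policy, on_schedule_tick(policy, previous_job_running=True))
```

For a cleanup task like the one above, `Forbid` is often the safer choice, since two overlapping cleanup runs could contend for the same rows.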
Init containers are special containers that run and complete before the main application containers start. They run sequentially, one at a time, and must exit with code 0 for the Pod to proceed. Use cases: waiting for a dependency to become available (database readiness check), running database migrations before the app starts, downloading configuration or secrets from a vault, setting up file permissions in a shared volume. Because they run before the app, they can use a different (more privileged) image without exposing those tools in the production container.
```yaml
initContainers:
  - name: wait-for-db
    image: busybox
    command: ['sh', '-c', 'until nc -z db 5432; do sleep 2; done']
```
The sidecar pattern places a helper container in the same Pod as the main application container. Both containers share the same network namespace and can share volumes. The sidecar augments the main container without modifying it, enforcing the single-responsibility principle at the container level. Common sidecars: **service mesh proxies** (Envoy in Istio) that handle mTLS, circuit breaking, and telemetry transparently; **log shippers** (Fluent Bit) that tail the app's log files from a shared volume; **credential refreshers** that write refreshed tokens into a shared volume for the app to read; **reverse proxies** for TLS termination. In Kubernetes 1.29+ there is native sidecar support via `initContainers` with `restartPolicy: Always` that ensures the sidecar starts before and outlives the main container.
ConfigMaps and Secrets both store key-value data that can be injected into Pods. The difference is intent and handling. Secret values are base64-encoded (not encrypted!) by default and are treated specially: `kubectl get` and `kubectl describe` do not print their values unless you request the full object (for example with `-o yaml`), and they can be stored encrypted. Encryption at rest requires configuring an `EncryptionConfiguration` on the kube-apiserver that specifies an encryption provider (AES-CBC, AES-GCM, Secretbox, or KMS). The KMS provider is the strongest option because the key-encryption key is held externally (AWS KMS, GCP KMS, Vault) and never touches etcd. Without encryption at rest, anyone with raw etcd access can decode all Secrets.
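The point that base64 is an encoding rather than encryption is worth demonstrating: no key is involved, so anyone holding the stored value can reverse it. A short sketch using only the standard library (the secret value and function names are made up):

```python
import base64


def encode_secret_value(plaintext):
    """Encode a value the way it appears in a Secret manifest's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded):
    """Reverse the encoding; no key or password is required."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    stored = encode_secret_value("hunter2")  # hypothetical password
    print("stored in the Secret:", stored)
    print("recovered by anyone: ", decode_secret_value(stored))
```

This is why access to Secrets should be restricted through RBAC and why encryption at rest matters for etcd backups and snapshots.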
**Node selectors** (`nodeSelector`) are the simplest form: schedule a Pod only on Nodes with a specific label (e.g., `disktype: ssd`). **Node affinity** is more expressive — it supports `In`, `NotIn`, `Exists` operators and has `requiredDuringSchedulingIgnoredDuringExecution` (hard) vs `preferredDuringSchedulingIgnoredDuringExecution` (soft) rules. **Pod affinity** schedules a Pod near (same zone/node) other Pods matching a label selector — useful for co-locating a cache with an app for low latency. **Pod anti-affinity** is the opposite: spread Pods across failure domains (nodes, zones) to improve availability.
```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: api
```
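The `In`, `NotIn`, and `Exists` operators used by node affinity can be sketched as a small matcher over a node's labels. This covers only those three operators and a single list of expressions that must all match; the function name and the sample labels are illustrative.

```python
def node_matches(node_labels, match_expressions):
    """Return True if the node's labels satisfy every expression."""
    for expression in match_expressions:
        key = expression["key"]
        operator = expression["operator"]
        values = expression.get("values", [])
        if operator == "In":
            satisfied = node_labels.get(key) in values
        elif operator == "NotIn":
            # A node without the label also satisfies NotIn.
            satisfied = node_labels.get(key) not in values
        elif operator == "Exists":
            satisfied = key in node_labels
        else:
            raise ValueError(f"unsupported operator: {operator}")
        if not satisfied:
            return False
    return True


if __name__ == "__main__":
    node = {"disktype": "ssd", "topology.kubernetes.io/zone": "eu-west-1a"}
    rule = [{"key": "disktype", "operator": "In", "values": ["ssd", "nvme"]}]
    print(node_matches(node, rule))
```

A `required...` rule treats a failed match as a hard filter, while a `preferred...` rule would only lower the node's score, which this sketch does not model.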
Taints are applied to Nodes to repel Pods that do not explicitly tolerate them, allowing you to reserve Nodes for specific workloads (GPU nodes, spot nodes, control-plane nodes). A Pod must declare a matching **toleration** to be scheduled on a tainted Node. Taint effects: `NoSchedule` (don't schedule without toleration), `PreferNoSchedule` (try to avoid), `NoExecute` (evict existing Pods that don't tolerate). Tolerations do not require the Pod to be scheduled on a tainted Node — combine with node affinity to both attract and repel.
```bash
# Taint a GPU node
kubectl taint nodes gpu-node-1 nvidia.com/gpu=true:NoSchedule
```
```yaml
# Toleration in Pod spec
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```
RBAC (Role-Based Access Control) controls which subjects (users, groups, ServiceAccounts) can perform which verbs (get, list, create, delete, patch) on which API resources. A **Role** grants permissions within a single Namespace; a **ClusterRole** grants permissions cluster-wide or can be bound per-Namespace. A **RoleBinding** attaches a Role or ClusterRole to a subject within a Namespace; a **ClusterRoleBinding** attaches a ClusterRole cluster-wide. The principle of least privilege: ServiceAccounts for applications should only have the specific permissions needed.
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
```
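A Role grants nothing until it is bound to a subject. A minimal sketch of a RoleBinding for the `pod-reader` Role above, assuming a ServiceAccount named `api` exists in the `production` Namespace (that name is hypothetical):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: production
subjects:
  - kind: ServiceAccount
    name: api                 # hypothetical ServiceAccount
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader            # the Role defined above
```

One way to check the result is `kubectl auth can-i list pods --as=system:serviceaccount:production:api -n production`.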
By default, all Pods in a Kubernetes cluster can communicate with all other Pods. A **NetworkPolicy** is a namespace-scoped resource that uses label selectors to define allowed ingress/egress traffic for Pods. NetworkPolicies are enforced by the CNI plugin — not all CNIs support them (Flannel does not; Calico and Cilium do). The default-deny pattern works by applying an empty policy that selects all Pods but specifies no `ingress` or `egress` rules, then adding specific allow policies.
```yaml
# Default deny all ingress in namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes: [Ingress]
```
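With default deny in place, each permitted flow needs its own allow policy. A sketch of one such policy, assuming Pods labelled `app: frontend` should reach Pods labelled `app: api` on TCP port 8080 (the `frontend` label and the port are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  podSelector:
    matchLabels:
      app: api                # the Pods being protected
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # hypothetical client label
      ports:
        - protocol: TCP
          port: 8080          # hypothetical port
```

Policies are additive: traffic is allowed if any policy selecting the Pod permits it.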
Helm is the package manager for Kubernetes. A **chart** is a directory of templates (Go templates that produce YAML manifests), a `values.yaml` file with default configuration, and a `Chart.yaml` with metadata. When you run `helm install`, Helm renders templates by merging `values.yaml` with any overrides (`--set` or `-f`), sends the resulting manifests to the API server, and records the **release** (name + revision + rendered manifests) as a Secret in the target Namespace. `helm upgrade` creates a new revision; `helm rollback` returns to a previous one.
```bash
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-postgres bitnami/postgresql \
  --set auth.postgresPassword=secret \
  --namespace db --create-namespace
helm upgrade my-postgres bitnami/postgresql --set image.tag=16.2.0
```
Kustomize is a built-in Kubernetes tool (`kubectl apply -k`) for customising YAML without templating. It uses a **base** (the original manifests) and **overlays** (patches for each environment) composed via a `kustomization.yaml` file. Choose Kustomize over Helm when: you own the manifests and don't need packaging/versioning, you want patches rather than parameterisation, or you need to patch third-party YAML without forking it. Choose Helm when: you need versioned, distributable packages (e.g. distributing your software to customers), need conditional logic in templates, or need complex dependency management. In practice, many teams use both: Kustomize for their own app overlays and Helm for third-party dependencies.
```yaml
# kustomization.yaml (production overlay)
# `resources` and `patches` replace the deprecated `bases` and `patchesStrategicMerge` fields
resources:
  - ../../base
patches:
  - path: increase-replicas.yaml
images:
  - name: myapp
    newTag: "1.5.0"
```
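The overlay references a patch file, `increase-replicas.yaml`, which is not shown. A strategic-merge patch of that kind could be as small as the sketch below; the Deployment name `myapp` and the replica count are assumptions.

```yaml
# increase-replicas.yaml (hypothetical contents)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp      # must match the Deployment name in the base
spec:
  replicas: 5      # only the fields listed here are overridden
```

Running `kubectl kustomize <overlay-dir>` renders the merged output without applying it, which is a convenient way to review the patch.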
`kubectl rollout status` watches a Deployment (or StatefulSet/DaemonSet) and streams progress, returning exit code 0 when the rollout completes successfully or non-zero on failure — making it composable with CI scripts. `kubectl rollout undo` rolls back to the previous revision by reverting the pod template; you can target a specific revision with `--to-revision=N`. The Deployment's `revisionHistoryLimit` (default 10) controls how many old ReplicaSets are kept for rollback.
```bash
kubectl rollout status deployment/api -n production
# Watch live progress, exit 0 on success
kubectl rollout undo deployment/api -n production
# Roll back to previous revision
kubectl rollout history deployment/api
# Show revision list
```
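The behaviour these commands observe is configured on the Deployment itself. A sketch of the relevant fields; the surge and unavailability numbers are illustrative assumptions rather than recommendations.

```yaml
# Fragment of a Deployment spec
spec:
  revisionHistoryLimit: 10       # old ReplicaSets kept for rollback
  progressDeadlineSeconds: 600   # past this, the rollout is reported as failed
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # at most one extra Pod during the rollout
      maxUnavailable: 0          # never drop below the desired replica count
```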
`kubectl port-forward` tunnels a local port to a port on a Pod (or Service), allowing you to access services inside the cluster without exposing them externally. `kubectl exec` runs a command inside a running container, or opens an interactive shell.
```bash
# Forward local 5432 to postgres Pod port 5432
kubectl port-forward pod/postgres-0 5432:5432 -n db
# Open interactive shell in a Pod
kubectl exec -it deployment/api -n production -- sh
# Run a one-off command
kubectl exec deployment/api -- node -e "console.log(process.env)"
# Debug with an ephemeral container (1.23+)
kubectl debug -it pod/api-xyz --image=busybox --target=api
```
Ephemeral debug containers (1.23+) let you attach a debug image to a running Pod without restarting it.
A **ResourceQuota** limits the total aggregate resources consumed within a Namespace — total CPU, memory, and number of objects (Pods, Services, PVCs). It prevents one team from consuming the entire cluster. A **LimitRange** sets default and maximum resource requests/limits for individual containers and Pods in a Namespace. Without LimitRange, a Pod can be created with no `resources` set, giving it BestEffort QoS and potentially unlimited consumption. Together, they enforce a "guardrails" policy: teams work within their allocated quota, and all Pods get sensible defaults.
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
```
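The matching ResourceQuota caps the Namespace as a whole. A sketch with made-up numbers, assuming a Namespace called `team-a`:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a              # hypothetical Namespace
spec:
  hard:
    requests.cpu: "10"           # sum of CPU requests across all Pods
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"                   # object-count limits
    services: "10"
    persistentvolumeclaims: "20"
```

Once a quota covers CPU or memory, Pods that omit those requests or limits are rejected at admission, which is one reason to pair the quota with LimitRange defaults.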
A PodDisruptionBudget sets a policy on how many replicas of a labelled set of Pods must remain available (or can be unavailable) during voluntary disruptions such as `kubectl drain` during a Node upgrade. Without a PDB, draining a Node could take all replicas of a service offline simultaneously. `minAvailable: 2` tells the eviction API to keep at least 2 replicas up; `maxUnavailable: 1` allows only one to be down at a time. PDBs are enforced only through the eviction API: a Deployment rollout is governed by its own `maxUnavailable`/`maxSurge` settings rather than by the PDB, and involuntary disruptions (hardware failure) can still reduce replicas below the budget.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```
Kubernetes communicates with container runtimes via the Container Runtime Interface (CRI). Docker was never a CRI-compliant runtime — Kubernetes used a translation layer called **dockershim** maintained in the kubelet. In Kubernetes 1.24, dockershim was removed. Clusters now use **containerd** (the runtime that was already inside Docker) or **CRI-O** directly. For most users nothing changes because the images (OCI-format) are identical, but tooling changes: you use `crictl` instead of `docker` on nodes, and `ctr` or `nerdctl` for lower-level inspection. Docker CLI still works fine on developer machines to build images.
The `imagePullPolicy` field on a container spec controls when the kubelet pulls the image. **IfNotPresent** (default for non-latest tags) pulls only if the image is not already cached on the Node, which is fast and efficient. **Always** (default for `:latest` and untagged images) makes the kubelet contact the registry on every container start to resolve the image digest; it downloads layers only if the cached copy differs, so you get the current content at the cost of extra latency and registry load. **Never** skips the pull entirely and fails if the image is not already on the Node, which is useful in air-gapped environments. Best practice: always use specific immutable tags (e.g., `myapp:1.5.2`) with `IfNotPresent`; never use `:latest` in production because you lose reproducibility and rollback capability.
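In a manifest, the recommended combination could look like this fragment. The registry host `registry.example.com` is a placeholder; the tag is the one used as an example above.

```yaml
# Fragment of a Pod spec
containers:
  - name: api
    image: registry.example.com/myapp:1.5.2   # pinned tag, placeholder registry
    imagePullPolicy: IfNotPresent             # reuse the Node's cached copy
```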
Kubernetes runs CoreDNS (the default since 1.13, replacing kube-dns) as a Deployment inside the cluster. Every Pod is configured to use CoreDNS as its DNS resolver. A ClusterIP Service named `my-service` in Namespace `prod` gets the fully qualified domain name `my-service.prod.svc.cluster.local`. Within the same Namespace, `my-service` resolves; from another Namespace you need at least `my-service.prod`. The `ndots:5` option in the Pod's `/etc/resolv.conf` means that any name with fewer than five dots is first tried against each domain in the search list (`<namespace>.svc.cluster.local`, `svc.cluster.local`, `cluster.local`) before being looked up as an absolute name. Headless Services (`clusterIP: None`) return individual Pod IPs instead of a single ClusterIP, enabling direct Pod addressing for StatefulSets.
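A headless Service differs from a regular one only in its `clusterIP` field. A sketch, assuming a PostgreSQL StatefulSet in the `db` Namespace whose Pods carry the label `app: postgres` (the label is an assumption):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: db
spec:
  clusterIP: None        # headless: no virtual IP is allocated
  selector:
    app: postgres        # hypothetical Pod label
  ports:
    - port: 5432
```

If a StatefulSet sets `serviceName: postgres`, its first Pod becomes resolvable as `postgres-0.postgres.db.svc.cluster.local`.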
Liveness, readiness, and startup probes are all health checks run by the kubelet, but with different consequences. **Liveness probe**: if it fails `failureThreshold` consecutive times, the kubelet kills and restarts the container; use it for detecting deadlocks. **Readiness probe**: if it fails, the Pod's IP is removed from all Service endpoint lists, so traffic stops being routed to it, but the container keeps running. **Startup probe**: runs first and disables liveness/readiness until it succeeds; designed for slow-starting apps (legacy apps that take 60s to warm up) so that liveness doesn't kill them prematurely. After the startup probe passes, liveness and readiness take over. Use HTTP GET probes for web services, TCP socket for databases, and exec for custom checks.
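A sketch of how the three probes might be combined on one container. The endpoint paths, port, and timings are hypothetical and should be tuned to the application.

```yaml
# Fragment of a container spec
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 24     # allows up to 5s x 24 = 120s for startup
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3      # restart after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5         # failing only removes the Pod from endpoints
```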
Metrics Server is a cluster add-on that collects real-time CPU and memory usage from every kubelet and exposes them through the Kubernetes Metrics API (`metrics.k8s.io`). `kubectl top pods` and `kubectl top nodes` query this API to display current resource consumption. It is designed for autoscaling (HPA reads from it) and operational visibility — it does not store historical data (use Prometheus/Grafana for that). Metrics Server must be installed separately (it's not included in all distributions); on EKS you deploy it from the official manifest, on GKE it's pre-installed.
```bash
kubectl top pods -n production --sort-by=memory
kubectl top nodes
```
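The same Metrics API feeds the HorizontalPodAutoscaler. A sketch of an HPA for the `api` Deployment; the replica bounds and the 70% target are illustrative assumptions.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the containers' CPU requests
```

Utilization is measured relative to each container's CPU request, so the target Pods need `resources.requests.cpu` set for this to work.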
Putting multiple containers in a Pod is appropriate when they are tightly coupled and need to share local network or filesystem state. The canonical cases: **sidecar** (log shipper reading from a shared volume), **ambassador** (proxy that adapts network access for the main container), **adapter** (transforms output of the main container before it is exported). Avoid packing unrelated services into one Pod: they are scheduled, scaled, and deleted together as a unit, so independent scaling is impossible, and a single unready container marks the whole Pod as not ready. The question to ask: "Could these two containers be run on different nodes and still function?" If yes, they should be in separate Pods.
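The sidecar case can be sketched as a Pod whose two containers share an `emptyDir` volume. Both image names and the log path are placeholders, not real images.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-with-log-shipper
spec:
  volumes:
    - name: logs
      emptyDir: {}                 # lives as long as the Pod
  containers:
    - name: api
      image: registry.example.com/myapp:1.5.2       # placeholder image
      volumeMounts:
        - name: logs
          mountPath: /var/log/app  # the app writes its logs here
    - name: log-shipper
      image: registry.example.com/log-shipper:1.0   # placeholder image
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
          readOnly: true           # the sidecar only reads
```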
**kube-apiserver** is the only component that reads and writes etcd; all others communicate only through the API server. It validates and processes REST requests, enforces admission control, handles authentication/authorisation, and emits watch events. It is horizontally scalable — multiple replicas behind a load balancer for HA. **etcd** is a distributed key-value store using the Raft consensus protocol; it is the sole persistent store for all cluster state. It should be backed up regularly and sized for low-latency disk I/O. **kube-scheduler** watches for unbound Pods and assigns them to Nodes in two phases: filtering (removing Nodes that cannot run the Pod) and scoring (ranking remaining Nodes). It is pluggable via the scheduling framework. **kube-controller-manager** runs dozens of control loops (Deployment controller, ReplicaSet controller, Node controller, Job controller, etc.) as goroutines in one process; each loop watches API objects and reconciles actual vs desired state. **cloud-controller-manager** was split out to decouple cloud-provider logic; it manages cloud-specific resources like load balancers, routes, and node lifecycle for the cloud provider.
etcd uses the **Raft** distributed consensus algorithm to replicate a log of state transitions across an odd number of members (typically 3 or 5). Raft elects one leader; all writes go to the leader, which appends the entry to its log and replicates it to followers. An entry is committed once a majority (quorum) acknowledges it, then applied to the state machine. If the leader fails, followers hold a randomised election timeout; the first to time out becomes a candidate and solicits votes. A cluster of N members needs a quorum of ⌊N/2⌋ + 1 and therefore tolerates ⌊(N-1)/2⌋ failures, which is why an even member count adds no fault tolerance over the next lower odd count. For a 3-member etcd cluster this means one member failure is acceptable, but a second failure leaves the cluster without quorum: it cannot elect a leader, and all writes through the API server fail. etcd is extremely sensitive to disk latency: a slow disk causes leader heartbeat timeouts and frequent re-elections. For production, use local NVMe SSDs, separate etcd from noisy workloads, and monitor `etcd_server_leader_changes_seen_total` closely.
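The `etcd_server_leader_changes_seen_total` counter can be turned into an alert. A sketch of a Prometheus alerting rule, assuming Prometheus already scrapes the etcd members; the alert name, the 15-minute window, and the threshold of 4 are illustrative assumptions to tune per cluster.

```yaml
groups:
  - name: etcd
    rules:
      - alert: EtcdFrequentLeaderChanges      # hypothetical alert name
        # Fires when a member has seen 4 or more leader changes in 15 minutes
        expr: increase(etcd_server_leader_changes_seen_total[15m]) >= 4
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "etcd member {{ $labels.instance }} is seeing frequent leader changes"
```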
Scheduling a Pod proceeds through the **scheduling framework**, which replaced the old predicates/priorities model in 1.19. The framework defines extension points, and plugins can register for each phase. The **PreFilter/Filter** phase eliminates Nodes that cannot satisfy the Pod (insufficient CPU/memory, missing labels, taint mismatch, volume zone mismatch, affinity/anti-affinity). **PostFilter** handles preemption: if no Node passes Filter, the scheduler tries to evict lower-priority Pods to make room. The **Score** phase ranks passing Nodes using weighted scoring functions (for example the `NodeResourcesFit` strategies `LeastAllocated`, which spreads load, and `MostAllocated`, which bin-packs, or `PodTopologySpread` for fault distribution). The **Reserve** and **Bind** phases finalise the assignment and write it to the API server. The scheduler is pluggable: custom plugins can be compiled in, or a second scheduler can run alongside the default one. Key config: `--leader-elect` for HA, `profiles` for multiple scheduling policies.
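The `profiles` setting lives in a `KubeSchedulerConfiguration` file passed to the scheduler. A sketch that keeps the default profile and adds a second one, with the hypothetical name `bin-packing-scheduler`, that scores Nodes with the `MostAllocated` strategy so Pods are packed onto fewer Nodes:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
profiles:
  - schedulerName: default-scheduler
  - schedulerName: bin-packing-scheduler   # hypothetical profile name
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated            # favour Nodes that are already busy
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```

A Pod opts into the second profile by setting `spec.schedulerName: bin-packing-scheduler`.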
The kubelet maintains a **pod manager** with the desired Pod specs (from the API server watch stream, static Pod manifests in `/etc/kubernetes/manifests/`, and the mirror Pod mechanism for static Pods). Its main reconciliation loop runs at least every `--sync-frequency` (default 1m). For each desired Pod it checks the **container runtime state** via CRI and takes the appropriate action: create, start, stop, or remove containers. The kubelet also manages volumes (mounts, unmounts), pulls images via the CRI, runs probes, and updates Pod status back to the API server. The **PLEG (Pod Lifecycle Event Generator)** polls the runtime every second and emits events when container states change, triggering reconciliation without waiting for the full sync cycle. When a container fails and `restartPolicy` is `Always` or `OnFailure`, the kubelet uses an exponential back-off (capped at 5 minutes) before restarting; this is the `CrashLoopBackOff` you see in `kubectl describe`.
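The static Pod path and the sync interval are usually set in the kubelet's configuration file rather than through flags. A minimal sketch of such a file; the values shown are illustrative, and the file location depends on how the node was provisioned.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
staticPodPath: /etc/kubernetes/manifests   # static Pod manifests are read from here
syncFrequency: 1m0s                        # maximum interval between full Pod syncs
```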
CNI is a CNCF specification that defines how container runtimes call network plugins. When a Pod is created, the container runtime invokes the configured CNI plugin binary with the Pod's network namespace, and the plugin wires up the network interface, assigns an IP, and programs routes. **Flannel** is the simplest: it allocates a subnet per Node and uses VXLAN (or host-gw) encapsulation to route Pod-to-Pod traffic across Nodes. No NetworkPolicy support. **Calico** uses BGP to distribute Pod routes without encapsulation (pure L3, performant), and has a full NetworkPolicy implementation including Calico-specific GlobalNetworkPolicy. It can run in VXLAN mode where BGP is unavailable. **Cilium** uses eBPF programs loaded into the Linux kernel to replace iptables entirely for Service load-balancing and NetworkPolicy enforcement, achieving superior throughput and observability. Cilium also provides L7 policies (HTTP path, gRPC method) and the Hubble observability layer.
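The L7 policies mentioned for Cilium are expressed through its own `CiliumNetworkPolicy` resource. A sketch, assuming Cilium is installed and that Pods labelled `app: frontend` may only issue `GET /healthz` against Pods labelled `app: api` on port 8080; the labels, port, and path are all hypothetical.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-get-healthz
spec:
  endpointSelector:
    matchLabels:
      app: api                   # Pods the policy applies to
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend        # hypothetical client label
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: "GET"    # any other method or path is rejected
                path: "/healthz"
```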
Traditional Kubernetes networking uses iptables for Service routing (via kube-proxy) and NetworkPolicy enforcement. iptables is a sequential rule list: in a cluster with thousands of Services, a packet traverses thousands of NAT rules, adding significant CPU overhead and packet loss during updates (iptables lock). **eBPF** (Extended Berkeley Packet Filter) allows attaching JIT-compiled programs to kernel hooks — tc (traffic control), XDP (at the NIC driver level), and socket layers. Cilium replaces kube-proxy entirely: Service load-balancing is done with eBPF hash-map lookups at the socket layer (before the packet even hits the network stack) with O(1) complexity instead of O(n). NetworkPolicy is enforced at the eBPF layer with sub-microsecond latency. Cilium also provides transparent encryption (WireGuard or IPSec via eBPF), L7 observability with Hubble (capturing HTTP request metadata without a sidecar), and bandwidth management. Benchmark: at 1,000 Services, Cilium's eBPF datapath is ~3–5× faster than iptables mode.
kube-proxy watches the API server for Service and Endpoints/EndpointSlice changes and programs the kernel accordingly. In **iptables mode** (still the most common default), kube-proxy hooks its own chains (such as `KUBE-SERVICES`) into PREROUTING/OUTPUT, with DNAT rules that rewrite the destination IP from the ClusterIP to a randomly selected Pod IP. Rules scale linearly with the number of Services and Endpoints: 10,000 Services can mean 100,000+ iptables rules, causing high rule-update latency and contention on the iptables lock. In **IPVS mode**, kube-proxy uses the Linux Virtual Server kernel module, which uses hash tables for O(1) Service lookup regardless of the number of Services. IPVS also supports load-balancing algorithms beyond random: round-robin, least connections, source-hash. IPVS mode requires the `ip_vs` kernel modules and is recommended for large clusters (500+ Services). Cilium with kube-proxy replacement bypasses both.
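Switching modes is a configuration change rather than a different binary. A minimal sketch of a kube-proxy configuration selecting IPVS with least-connection scheduling is shown below; in kubeadm clusters this document normally lives in the `kube-proxy` ConfigMap, which is an assumption about your setup.

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "lc"   # least connections; "rr" (round-robin) and "sh" (source hash) are other options
```

The kube-proxy Pods have to be restarted to pick up the new mode, and the nodes need the `ip_vs` kernel modules loaded.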
An Operator extends Kubernetes with domain-specific knowledge about an application by combining a **CustomResourceDefinition (CRD)** (which registers a new API type, e.g. `kind: PostgresCluster`) with a **controller** (a reconcile loop that watches instances of that CRD and drives the actual cluster state toward the declared spec). The operator encodes human operational knowledge — how to take backups, perform major version upgrades, handle failover. **kubebuilder** is the official scaffolding tool built on **controller-runtime**, which provides a cached informer-based API client, a reconcile manager, and webhook scaffolding. The reconcile function is idempotent: given a resource name, it reads the current state of the world and makes API calls to move toward the desired state, then returns a `Result{Requeue: true}` if it needs to re-run. Leader election ensures only one controller instance is active at a time for safety.
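To make the CRD half of an Operator concrete, here is a minimal sketch of a definition for the `PostgresCluster` kind mentioned above, followed by one instance that a controller would reconcile. The group `db.example.com` and the `replicas`/`version` fields are hypothetical, chosen only for illustration.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresclusters.db.example.com   # must be <plural>.<group>
spec:
  group: db.example.com
  scope: Namespaced
  names:
    kind: PostgresCluster
    plural: postgresclusters
    singular: postgrescluster
  versions:
    - name: v1alpha1
      served: true
      storage: true
      subresources:
        status: {}                         # lets the controller update status separately from spec
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: ["replicas", "version"]
              properties:
                replicas:
                  type: integer
                  minimum: 1
                version:
                  type: string
            status:
              type: object
              properties:
                readyReplicas:
                  type: integer
---
# An instance of the new kind: the declared spec the reconcile loop drives toward
apiVersion: db.example.com/v1alpha1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  replicas: 3
  version: "16"
```

The controller watches objects of this kind, compares `spec` with what actually exists, and writes its observations back to `status`.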
When a resource is created or updated, the request flows through the API server's admission control pipeline after authentication and authorisation but before being persisted to etcd. The pipeline has two webhook phases. **MutatingAdmissionWebhooks** can modify the object (e.g., inject sidecar containers, add default labels, set resource requests). They run first, and a webhook may be called again if a later mutating webhook changes the object and its `reinvocationPolicy` is `IfNeeded`, so mutations should be idempotent. **ValidatingAdmissionWebhooks** can only allow or deny; they run after all mutations are complete and cannot change the object. Both are registered via `MutatingWebhookConfiguration` / `ValidatingWebhookConfiguration` objects that specify which API groups/resources/operations to intercept. Webhook servers must serve over TLS, with the CA bundle supplied in the configuration. `failurePolicy: Fail` blocks the request if the webhook is unreachable, which is operationally risky; use `Ignore` for non-critical webhooks. Key operational concern: if a webhook that intercepts Pod creation is down and has `failurePolicy: Fail`, no Pods can be created in the namespaces it matches.
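A registration for a non-critical validating webhook could look like the sketch below. The webhook name, Service, namespace, and path are hypothetical; the `namespaceSelector` excludes system namespaces so that an outage of the webhook cannot block control-plane Pods.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-policy.example.com
webhooks:
  - name: pod-policy.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore        # do not block requests if the webhook is unreachable
    timeoutSeconds: 5
    clientConfig:
      service:
        namespace: policy-system # hypothetical namespace and Service
        name: pod-policy-webhook
        path: /validate
      caBundle: <base64-encoded CA certificate>
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system", "policy-system"]
```

Excluding the webhook's own namespace avoids a deadlock where the webhook Pods cannot be recreated because the webhook that must admit them is down.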
PodSecurityPolicy (PSP) was a cluster-wide admission controller that enforced security constraints on Pods. It was notoriously complex to configure correctly and was removed in Kubernetes 1.25. **Pod Security Admission** (PSA) is its replacement, implemented as a built-in admission controller in the API server. It enforces three **Pod Security Standards**: **Privileged** (unrestricted), **Baseline** (minimally restrictive, prevents known privilege escalations), and **Restricted** (heavily restricted, following hardening best practices: requires non-root, drops all capabilities, enforces seccomp). PSA is configured at the Namespace level via labels, e.g. `pod-security.kubernetes.io/enforce: restricted`. The three modes are **enforce** (reject), **audit** (record the violation in the audit log), and **warn** (return a warning to the caller). This is simpler than PSP but less flexible; for complex policies use OPA/Gatekeeper or Kyverno.
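The three modes can be combined on one Namespace, which is a common way to roll out a stricter level gradually. In this sketch (the namespace name is hypothetical), Baseline is enforced while Restricted violations are only logged and warned about:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

Once the audit log and warnings show no remaining violations, `enforce` can be raised to `restricted`.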
Gatekeeper is a Kubernetes-native policy engine that runs as a ValidatingAdmissionWebhook backed by Open Policy Agent (OPA). Policy logic is written in **Rego** (OPA's query language) and packaged as a **ConstraintTemplate** CRD, which defines a new CRD kind (e.g., `K8sRequiredLabels`). A **Constraint** is an instance of that template applied to specific resource types and scopes, carrying the policy parameters. When a resource is created, Gatekeeper evaluates all matching Constraints and denies the request if any Rego policy returns violations. Gatekeeper also runs an audit controller that periodically checks existing resources against all policies and reports violations. This enables a GitOps-friendly approach: policies are YAML in a repo, continuously reconciled by the Gatekeeper controller.
```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  match:
    kinds: [{apiGroups: [""], kinds: ["Pod"]}]
  parameters:
    labels: ["team"]
```
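The Constraint above only works if a matching ConstraintTemplate defines the `K8sRequiredLabels` kind. The sketch below is modelled on the required-labels example commonly shown in the Gatekeeper documentation; treat the exact Rego as illustrative rather than as the policy any particular cluster runs.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels      # the kind that Constraints instantiate
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # A violation is reported when any required label is missing from the object.
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
```

The `parameters.labels` list in the Constraint is what appears as `input.parameters.labels` inside the Rego.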
A service mesh adds a layer of infrastructure for service-to-service communication. In **Istio**, an Envoy proxy sidecar is injected into every Pod. The **control plane** (istiod) distributes configuration (xDS API) and acts as a certificate authority for mTLS. mTLS is transparent to the application: Envoy intercepts all TCP traffic, upgrades connections to mTLS using short-lived SPIFFE X.509 certificates, and presents service identity — enabling zero-trust networking. **Traffic policies** (VirtualService, DestinationRule) allow canary releases (weight-based routing), circuit breaking, retry logic, and timeout enforcement — all without changing application code. **Observability** is automatic: Envoy emits per-request metrics (latency, success rate, bytes) to Prometheus, distributed traces to Jaeger/Zipkin, and Kiali visualises the topology. **Linkerd** achieves similar results with a lighter-weight Rust-based proxy (lower CPU/memory overhead) but fewer advanced traffic features than Istio.
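A minimal sketch of how these Istio pieces fit together for a canary release: strict mTLS for a namespace, two subsets of a service, and a 90/10 weighted route with retries and a timeout. The `shop` namespace, the `checkout` service, and its `version` labels are hypothetical.

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: shop
spec:
  mtls:
    mode: STRICT            # reject plaintext traffic to workloads in this namespace
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout
  namespace: shop
spec:
  host: checkout
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
  namespace: shop
spec:
  hosts: ["checkout"]
  http:
    - route:
        - destination:
            host: checkout
            subset: v1
          weight: 90
        - destination:
            host: checkout
            subset: v2
          weight: 10        # canary receives 10% of requests
      retries:
        attempts: 3
        perTryTimeout: 2s
      timeout: 10s
```

Shifting the canary forward is then a matter of editing the two `weight` values; the application itself is unchanged.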
The Gateway API is a set of CRDs (`GatewayClass`, `Gateway`, `HTTPRoute`, `TCPRoute`, etc.) developed as the official successor to the Ingress resource. It is versioned independently of Kubernetes: the core resources (`GatewayClass`, `Gateway`, `HTTPRoute`) reached GA in Gateway API v1.0 (late 2023), while some route types remain in the experimental channel. Key improvements over Ingress: **Role separation**: a GatewayClass is managed by the infrastructure team, a Gateway by the platform team, and HTTPRoutes by application teams, with RBAC naturally enforced by the Kubernetes API. **Expressiveness**: header-based routing, traffic splitting, request mirroring, and URL rewrites are first-class fields, not implementation-specific annotations. **Protocol support**: HTTP, HTTPS, gRPC, TCP, and TLS routes are separate typed resources. **Portability**: moving between Nginx, Traefik, Istio, or cloud-provider gateways ideally requires changing only the GatewayClass reference, not the Route manifests.
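As an illustration of traffic splitting as a first-class field, the HTTPRoute below sends 90% of `/checkout` traffic to one backend and 10% to another. The Gateway name, namespaces, hostname, and Service names are hypothetical, and the route assumes a Gateway that allows routes from the `shop` namespace.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout
  namespace: shop
spec:
  parentRefs:
    - name: shared-gateway     # Gateway owned by the platform team
      namespace: infra
  hostnames: ["shop.example.com"]
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /checkout
      backendRefs:
        - name: checkout-v1
          port: 8080
          weight: 90
        - name: checkout-v2
          port: 8080
          weight: 10
```

The application team owns only this object; the listener, TLS configuration, and controller choice stay with the Gateway and GatewayClass owners.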
HPA scales horizontally (adds/removes replicas based on metrics) and is well-suited for stateless services. VPA scales vertically (adjusts CPU/memory requests and limits of running Pods) and is better for workloads that cannot scale out (single-instance databases, memory-intensive batch jobs). The current VPA implementation requires a Pod restart to apply new resource values (except with the in-place resize feature, alpha since Kubernetes 1.27). **Do not use HPA and VPA together on the same CPU/memory metric**: they will fight each other. HPA computes utilisation relative to the Pod's requests, so when VPA raises requests the measured utilisation drops and HPA scales in, while replicas added by HPA lower per-Pod usage and prompt VPA to shrink requests again. Safe combined use: VPA on memory (provided the HPA does not scale on memory) and HPA on custom metrics like RPS. The Goldilocks tool from FairwindsOps runs VPA in recommendation-only mode and surfaces right-sizing suggestions without auto-applying them.
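One way to sketch the safe combination is a VPA in recommendation-only mode next to an HPA driven by a per-Pod request rate. The Deployment name `api` and the metric `http_requests_per_second` are hypothetical, and the HPA part assumes a custom metrics adapter (for example one backed by Prometheus) is installed.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"          # only publish recommendations, never evict Pods
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"  # target average of 100 RPS per Pod
```

Because the HPA does not look at CPU or memory here, changes to requests suggested by the VPA cannot feed back into the replica count.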
**Cluster Autoscaler (CA)** watches for unschedulable Pods and checks which node group (ASG on AWS, MIG on GCP) could accommodate them by simulating the scheduler. It scales up the pre-defined node groups, then scales down nodes that have been underutilised for a configurable period (default 10 min). It is limited to the instance types configured in each node group. **Karpenter** (originally built by AWS, now maintained under Kubernetes SIG Autoscaling) provisions nodes directly via the EC2 API without pre-defined node groups. It reads the unscheduled Pods' requirements (CPU, memory, GPU, architecture) and selects a suitable instance type from the full EC2 catalogue in real time. Karpenter can select Spot vs On-Demand automatically, chooses right-sized instances (avoiding over-provisioning), and actively consolidates nodes during low utilisation. Karpenter typically provisions nodes faster than CA (figures of 3–5× are commonly quoted) and achieves better cost efficiency through flexible instance selection and aggressive consolidation.
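The Karpenter behaviour described above is configured declaratively. The sketch below assumes the `karpenter.sh/v1` API and an existing `EC2NodeClass` named `default`; field names differ in older releases (for example the `v1beta1` API and the earlier `Provisioner` kind), so verify against the version you run.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                     # assumed to exist already
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # let Karpenter choose between Spot and On-Demand
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
  limits:
    cpu: "1000"                           # cap on total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

The looser the requirements, the more instance types Karpenter can consider, which is where most of the cost advantage over fixed node groups comes from.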
**Namespace-based multi-tenancy** is the lightest weight: teams share one cluster's control plane, isolated by RBAC + NetworkPolicy + ResourceQuotas. Cost is minimal, but isolation is soft — a compromised workload can attempt API server attacks, kernel exploits could escape namespace isolation, and control plane is a shared blast radius. **vcluster** runs a full virtual Kubernetes control plane as a StatefulSet inside a Namespace, giving each tenant their own kube-apiserver, scheduler, and CRD space while sharing the underlying node infrastructure. This provides stronger isolation and allows tenants to install cluster-scoped resources (CRDs, webhooks) without polluting the host cluster. **Separate clusters** provide the strongest isolation (separate etcd, control plane, network, IAM) at the highest cost and operational overhead. Use separate clusters when: tenants are different business units with compliance boundaries, different upgrade cadences are needed, or blast radius must be fully contained.
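For the namespace-based model, the per-tenant guard rails are ordinary namespaced objects. A minimal sketch for a hypothetical `team-a` namespace, with illustrative quota values, caps resource consumption and restricts ingress to traffic from the same namespace:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "200"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: team-a
spec:
  podSelector: {}              # applies to every Pod in the namespace
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector: {}      # only Pods from this same namespace may connect
```

The NetworkPolicy only takes effect if the cluster's CNI plugin enforces NetworkPolicy.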
Container image signing proves that an image was produced by a trusted CI system and has not been tampered with. **Cosign** (part of the Sigstore project) signs OCI images by attaching a cryptographic signature as an OCI artifact in the same registry. In keyless mode, Cosign uses Fulcio (an OIDC-backed CA) to issue short-lived signing certificates tied to the CI identity (e.g. a GitHub Actions OIDC token), and logs signatures to the Rekor transparency log, so no long-lived private key has to be managed. **Notary v2** (now Notation) uses a similar OCI artifact attachment model but with traditional key management. In Kubernetes, signatures are verified at admission time by policy engines: Kyverno (a `verifyImages` rule that checks Cosign signatures) or Connaisseur. The admission webhook fetches the signature from the registry and verifies it against the trusted key/certificate before allowing the Pod to start. This prevents "tag squatting", i.e. pushing a malicious image over an existing tag.
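A keyless verification policy in Kyverno could be sketched as follows. The registry path, GitHub organisation, and workflow path are hypothetical, and the field layout follows the Kyverno `verifyImages` documentation as I understand it; newer Kyverno releases move some fields (such as the failure action) to the rule level, so check your version.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-keyless
      match:
        any:
          - resources:
              kinds: ["Pod"]
      verifyImages:
        - imageReferences:
            - "registry.example.com/myorg/*"
          attestors:
            - entries:
                - keyless:
                    # identity of the CI workflow that is allowed to sign
                    subject: "https://github.com/myorg/myrepo/.github/workflows/release.yaml@refs/heads/main"
                    issuer: "https://token.actions.githubusercontent.com"
                    rekor:
                      url: https://rekor.sigstore.dev
```

Pods whose images match `imageReferences` but carry no signature from that identity are rejected at admission.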
A Go binary compiled with `CGO_ENABLED=0` is fully statically linked and needs no shared libraries, making it a perfect fit for a distroless or scratch base image. The build uses multi-stage to keep the final image minimal.
```dockerfile
FROM golang:1.22 AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
go build -ldflags="-s -w" -trimpath -o /app ./cmd/server
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder /app /app
USER nonroot:nonroot
ENTRYPOINT ["/app"]
```
`-s -w` strips debug symbols and DWARF info. `-trimpath` removes local file paths from the binary. `nonroot` ensures the container starts as UID 65532. The resulting image is typically 8–15 MB with zero shell, no package manager, and a minimal CVE footprint. Use `gcr.io/distroless/static-debian12:debug` locally if you need a shell.
Standard containers (runc) share the host Linux kernel, so a kernel CVE or container escape can compromise the host. **Kata Containers** addresses this by running each Pod inside a lightweight VM (using QEMU, Cloud Hypervisor, or Firecracker) with its own guest kernel. From a Kubernetes perspective it is a drop-in OCI-compatible runtime (configured via RuntimeClass); the performance overhead is roughly 5–10% vs runc for CPU-bound workloads, plus VM boot latency (~100ms). **gVisor** takes a different approach: it interposes a user-space kernel (Sentry, written in Go) between the container process and the host kernel. Sentry implements a large subset of the Linux syscall interface (~200 syscalls) and translates calls to a small set of host syscalls, drastically reducing attack surface. gVisor has higher syscall overhead than runc (~10–50× for syscall-intensive workloads) but near-zero overhead for CPU computation. Both are used in production (gVisor powers GKE Sandbox) and are suitable for running untrusted workloads or multi-tenant SaaS.
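Selecting a sandboxed runtime per workload is done with a RuntimeClass. In this sketch the handler name `runsc` is an assumption that must match the runtime handler configured in containerd on the nodes, and the image is a hypothetical placeholder.

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc                  # must match the handler name in the node's CRI config
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-job
spec:
  runtimeClassName: gvisor      # run this Pod under gVisor instead of runc
  containers:
    - name: job
      image: registry.example.com/untrusted-job:1.0
```

Pods without `runtimeClassName` keep using the default runtime, so sandboxing can be limited to the workloads that need it.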
In CI environments, each build runs in a fresh agent, so package managers re-download all dependencies from the internet every time; `npm ci`, `pip install`, and `go mod download` can take minutes. BuildKit's `--mount=type=cache` mounts a persistent directory at build time (not included in the image layer) that survives between builds on the same builder. In GitHub Actions with `docker/build-push-action`, you set `cache-from: type=gha` and `cache-to: type=gha,mode=max` to persist BuildKit's layer cache in GitHub's cache service. Note that cache mounts are stored separately from the layer cache and are not exported by `cache-to` on their own, so on ephemeral runners they need a persistent builder or an extra step that saves and restores the mount contents.
```dockerfile
# Go modules cache — survives rebuilds
RUN --mount=type=cache,target=/go/pkg/mod \
--mount=type=cache,target=/root/.cache/go-build \
go build -o /app ./cmd/server
# npm cache
RUN --mount=type=cache,target=/root/.npm \
npm ci
```
This can reduce a 6-minute dependency install to under 30 seconds on a warm cache.
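For reference, a minimal GitHub Actions workflow wiring up the `gha` cache backend might look like this sketch. The action major versions and the image tag are assumptions; pin whatever versions your organisation has vetted.

```yaml
name: build
on: [push]
jobs:
  image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3   # the gha cache backend requires a Buildx builder
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: false
          tags: myapp:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max         # mode=max also caches intermediate stage layers
```

With `mode=max`, layers from intermediate build stages are cached too, which matters for multi-stage builds where the expensive work happens before the final stage.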
An SBOM is a machine-readable list of all software components, libraries, and their versions inside a container image — enabling vulnerability tracking and licence compliance. **Syft** (by Anchore) generates SBOMs in SPDX, CycloneDX, or Syft JSON format from images, directories, or OCI archives. **Trivy** can both generate SBOMs and scan them for known CVEs against the NVD and OS vulnerability databases. A production CI workflow:
```bash
# Build image
docker build -t myapp:$SHA .
# Generate SBOM
syft myapp:$SHA -o spdx-json > sbom.spdx.json
# Scan for CVEs with exit code on HIGH/CRITICAL
trivy image --exit-code 1 --severity HIGH,CRITICAL myapp:$SHA
# Attach SBOM to image in registry (OCI artifact)
cosign attach sbom --sbom sbom.spdx.json myapp:$SHA
```
Failing builds on critical CVEs creates a shift-left security gate that prevents known-vulnerable images from ever reaching production.
GitOps treats a Git repository as the single source of truth for cluster state. **Argo CD** runs as a Kubernetes controller that continuously compares the desired state (YAML/Helm/Kustomize in a Git repo) with the actual state of the cluster. When it detects drift it can auto-sync (apply the diff) or alert. The reconciliation cycle: the **application controller** polls Git (or is notified via webhook) for changes, renders the manifests, computes a diff against live objects using a three-way merge (desired, live, last-applied), and applies the delta. **Flux** (CNCF) uses a similar model with separate controllers per source type (GitRepository, HelmRepository, OCIRepository) and Kustomization/HelmRelease objects for reconciliation. Key GitOps properties: all changes are PR-reviewed before reaching production; the cluster state is self-healing (drift is corrected automatically); rollback is `git revert`; and full audit trail lives in Git commit history.
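In Argo CD, the link between a Git path and a cluster namespace is itself a Kubernetes object. A minimal sketch with auto-sync and self-healing enabled is shown below; the repository URL, path, and names are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deploy-config.git
    targetRevision: main
    path: apps/myapp/overlays/production
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: myapp
  syncPolicy:
    automated:
      prune: true       # delete resources that were removed from Git
      selfHeal: true    # revert manual changes made directly in the cluster
    syncOptions:
      - CreateNamespace=true
```

Leaving out `automated` turns drift detection into an alert-only workflow where a human triggers each sync.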
Always upgrade the **control plane first**, then nodes. For managed clusters (EKS, GKE, AKS), the cloud provider handles control plane upgrades, normally without API downtime. For self-managed clusters, upgrade one control plane node at a time: drain it (`kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`), upgrade kubeadm/kubelet/kubectl, uncordon. Control plane components tolerate a version skew of one minor version from the API server. For worker nodes, upgrade using the **cordon → drain → upgrade → uncordon** cycle one node at a time, ensuring workloads have enough replicas for their PDBs to allow the drain. Always test upgrades in a lower environment first. Check the Kubernetes deprecation guide for API versions removed in the target version, and migrate affected manifests (the `kubectl convert` plugin can help). As a last resort, roll back by restoring the etcd snapshot taken before the upgrade.
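Whether a drain is safe depends on the PodDisruptionBudgets in place. A minimal sketch for a hypothetical `api` Deployment that should never drop below two ready Pods during voluntary disruptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```

If the Deployment runs only two replicas, this budget blocks every eviction and `kubectl drain` will wait indefinitely, so the replica count needs to exceed `minAvailable` before the upgrade starts.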
`CrashLoopBackOff` means the container starts, exits (usually with a non-zero code), and is restarted by the kubelet with an exponential back-off (10s, 20s, 40s, ... capped at 5 minutes). A systematic approach:
```bash
# 1. Get the exit code and last state
kubectl describe pod <pod> -n <ns>
# Look at: Last State, Exit Code, Reason (OOMKilled vs Error)
# 2. Read logs from the crashed container (previous instance)
kubectl logs <pod> --previous -n <ns>
# 3. If logs are empty (crash before writing anything) — run the image manually
docker run --rm myapp:tag
# 4. Check events for image pull errors, config mount issues
kubectl get events -n <ns> --sort-by=.lastTimestamp
# 5. Use an ephemeral debug container if a shell is needed
kubectl debug -it <pod> -n <ns> --image=busybox --target=<container>
```
Common root causes: wrong `ENTRYPOINT`/`CMD`, missing env vars or secrets, config file not found, permission error on mounted volume, OOMKilled (check `Exit Code: 137`), or application crash on startup (check `--previous` logs).
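Reading the exit code is often the fastest way to narrow down these causes. The helper below is a rough sketch: `explain_exit_code` is a hypothetical name, and the mapping follows the usual shell convention that codes above 128 mean "killed by signal (code minus 128)".

```bash
# Translate a container exit code (from `kubectl describe pod`) into a likely cause.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited cleanly (a long-running container should not exit at all)" ;;
    126) echo "command found but not executable (check file permissions)" ;;
    127) echo "command not found (check ENTRYPOINT/CMD and PATH)" ;;
    137) echo "killed by SIGKILL (signal 9): often OOMKilled, check the Reason field" ;;
    139) echo "killed by SIGSEGV (signal 11): segmentation fault" ;;
    143) echo "killed by SIGTERM (signal 15): asked to stop and did" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error (see logs from the previous instance)"
      fi
      ;;
  esac
}

# Usage: explain_exit_code 137
```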
`OOMKilled` (Exit Code 137) means the container exceeded its memory limit and the kernel's OOM killer terminated it. Diagnostic approach:
```bash
# 1. Confirm OOMKill in describe
kubectl describe pod <pod> | grep -A5 "Last State"
# Reason: OOMKilled
# 2. Check actual memory usage trend
kubectl top pod <pod> --containers
# Use Grafana/Prometheus: container_memory_working_set_bytes
# 3. Look for memory growth over time (leak vs spike)
# container_memory_working_set_bytes{pod=~"api-.*"}
# 4. Heap snapshot (Node.js example; assumes the app runs as PID 1 and was
#    started with `node --heapsnapshot-signal=SIGUSR2`)
kubectl exec <pod> -- kill -USR2 1
# Writes a .heapsnapshot file to the working directory; copy it out with kubectl cp
```
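For Node.js, compare heap snapshots over time or profile with `clinic.js`; `--expose-gc` lets you force a collection before each snapshot so that only retained objects show up. For Go, expose the `net/http/pprof` handlers and fetch `/debug/pprof/heap`. Set memory limits to roughly P99 usage plus 20% headroom. For legitimate spikes (batch jobs), run the memory-intensive work separately, in its own Job or in an init container if it only happens at startup, or increase the limit temporarily during the batch window.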
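The "P99 plus 20% headroom" rule is simple integer arithmetic. A minimal sketch, assuming the P99 value in MiB has already been read from monitoring; `recommend_memory_limit_mib` is a hypothetical helper name:

```bash
# Recommend a memory limit in MiB: observed P99 plus a headroom percentage, rounded up.
recommend_memory_limit_mib() {
  local p99_mib="$1"
  local headroom_pct="${2:-20}"   # defaults to the 20% headroom from the text
  echo $(( (p99_mib * (100 + headroom_pct) + 99) / 100 ))
}

recommend_memory_limit_mib 400      # prints 480
recommend_memory_limit_mib 333      # prints 400 (399.6 rounded up)
```

Rounding up rather than down keeps the limit from landing just below the intended headroom.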
The Container Storage Interface (CSI) is the standard API that decouples Kubernetes from storage vendor implementations. A CSI driver consists of a **controller plugin** (manages volume lifecycle: create, delete, attach, snapshot; usually runs as a Deployment or StatefulSet) and a **node plugin** (mounts the volume into the Pod; runs as a DaemonSet). Kubernetes does not call the driver directly: sidecar containers (`external-provisioner`, `external-attacher`, `external-snapshotter`) watch the Kubernetes API and translate object changes into gRPC calls to the driver. **Volume Snapshots** (GA in 1.20) let you create a point-in-time snapshot of a PVC via a `VolumeSnapshot` object: the CSI driver calls the storage API (e.g., EBS snapshot) and records the snapshot handle. **Volume Clones** create a new PVC pre-populated with data from an existing PVC at provisioning time, useful for spinning up test environments with production-like data. Both require the CSI driver to implement the optional snapshot and clone capabilities.
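As an illustration of the two objects, the script below writes a `VolumeSnapshot` and a clone PVC to a temporary directory. It is a sketch under assumptions: the names `data`, `data-snap`, `data-clone`, `csi-snapclass`, and `csi-sc` are hypothetical, and the cluster is assumed to have the snapshot CRDs and a CSI driver that supports both features.

```bash
outdir="$(mktemp -d)"

# Point-in-time snapshot of the PVC named "data"
cat > "$outdir/snapshot.yaml" <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snap
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data
EOF

# New PVC pre-populated from the existing PVC "data" (a clone)
cat > "$outdir/clone.yaml" <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-clone
spec:
  storageClassName: csi-sc        # cloning works within the same StorageClass
  dataSource:
    kind: PersistentVolumeClaim
    name: data
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi               # at least the size of the source volume
EOF

echo "manifests written to $outdir"
# Apply with: kubectl apply -f "$outdir"
```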
Multi-cluster networking lets Pods in one cluster communicate directly with Pods or Services in another without traversing the public internet or an application-level proxy. **Submariner** establishes encrypted tunnels (IPsec via Libreswan, or WireGuard) between the clusters' gateway nodes and synchronises Service discovery through the `ServiceExport`/`ServiceImport` objects of the Multi-Cluster Services API. Each cluster keeps its own Pod CIDR and Submariner handles the cross-cluster routing; if the CIDRs overlap, its Globalnet component is needed to remap addresses. **Cilium Cluster Mesh** uses the eBPF datapath and state shared through a KVStore (etcd) to expose Services across clusters. A Service annotated with `service.cilium.io/global: "true"` (`io.cilium/global-service` in older releases) is reachable under the same name from any cluster in the mesh, provided it is defined with the same name and namespace in each cluster. Cilium load-balances across local and remote endpoints natively and can add mutual authentication based on SPIFFE identities. Cilium Cluster Mesh requires non-overlapping Pod CIDRs and is more deeply integrated but Cilium-only; Submariner is CNI-agnostic.
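The two approaches look like this at the manifest level. This is a sketch only: the Service name `api`, the namespace `prod`, and the ports are hypothetical, and the Cilium annotation is written in its newer form (`service.cilium.io/global`), which is an assumption about the Cilium version in use.

```bash
outdir="$(mktemp -d)"

# Submariner / Multi-Cluster Services API: export an existing Service.
# Other clusters can then resolve it as api.prod.svc.clusterset.local.
cat > "$outdir/service-export.yaml" <<'EOF'
apiVersion: multicluster.x-k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: api          # must match the name of the Service being exported
  namespace: prod
EOF

# Cilium Cluster Mesh: the same Service, annotated as global, in every cluster.
cat > "$outdir/global-service.yaml" <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: prod
  annotations:
    service.cilium.io/global: "true"
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
EOF

echo "manifests written to $outdir"
# Apply with: kubectl apply -f "$outdir/<file>.yaml" in the relevant clusters
```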
Kubernetes clusters are frequently over-provisioned because teams set conservative resource requests. The main levers: **Right-sizing requests**: run VPA in recommendation mode for about two weeks, then apply the recommended values; `kubectl-view-allocations` shows how much of each node is requested versus allocatable. **Goldilocks** (Fairwinds) creates recommendation-mode VPAs for the Deployments in the namespaces you enable and serves a dashboard of suggested requests/limits. **Spot/preemptible instances** can cut EC2/GCE compute costs by 60–90%: use node pools with mixed On-Demand (for critical workloads) and Spot (for stateless, restartable workloads). On AWS, Karpenter can react to the 2-minute Spot interruption notice by draining the node and launching a replacement. **Bin-packing**: configure the scheduler's `NodeResourcesFit` plugin with the `MostAllocated` scoring strategy so Pods fill existing nodes before spreading to new ones, reducing idle node count. **Cluster Autoscaler** scale-down should be tuned: a node becomes a removal candidate when its requested resources fall below `--scale-down-utilization-threshold` (default 0.5), so raising the value to 0.6 removes half-empty nodes more aggressively. Use **Kubecost** or **OpenCost** to attribute spend by namespace, team, and label for chargeback.
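The scale-down threshold is easier to reason about with numbers. The sketch below checks a single resource for one node; `is_scale_down_candidate` is a hypothetical helper, and the real Cluster Autoscaler also looks at memory and at whether the node's Pods can be rescheduled elsewhere.

```bash
# Succeeds (exit 0) if requested/allocatable is below the threshold percentage.
# Arguments: requested millicores, allocatable millicores, threshold in percent.
is_scale_down_candidate() {
  local requested_m="$1" allocatable_m="$2" threshold_pct="${3:-50}"
  # Compare without division: requested/allocatable < threshold/100
  [ $(( requested_m * 100 )) -lt $(( allocatable_m * threshold_pct )) ]
}

# A node with 2200m requested out of 4000m allocatable is 55% utilised:
if is_scale_down_candidate 2200 4000 60; then
  echo "candidate at threshold 0.6"
fi
if ! is_scale_down_candidate 2200 4000 50; then
  echo "kept at threshold 0.5"
fi
```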
A production CI/CD pipeline should be automated, reproducible, and safe. Each stage gates the next:
```bash
set -euo pipefail

# 1. Build (BuildKit, layer cache from registry)
docker buildx build \
  --cache-from "type=registry,ref=$CACHE_REF" \
  --cache-to "type=registry,ref=$CACHE_REF,mode=max" \
  -t "$IMAGE_REF" --push .

# 2. Scan: fail on HIGH/CRITICAL CVEs
trivy image --exit-code 1 --severity HIGH,CRITICAL "$IMAGE_REF"

# 3. Sign image (signing by digest rather than tag is more robust)
cosign sign --yes "$IMAGE_REF"

# 4. Deploy to staging (Kustomize overlay)
kubectl apply -k k8s/overlays/staging
kubectl rollout status deployment/api -n staging --timeout=5m

# 5. Run smoke tests against staging here; only then promote to production
kubectl apply -k k8s/overlays/production

# 6. Automatic rollback if the production rollout does not complete in time
if ! kubectl rollout status deployment/api -n production --timeout=10m; then
  kubectl rollout undo deployment/api -n production
  exit 1
fi
```
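GitOps variant: merge to main → CI writes the new image tag to the Git manifests → Argo CD detects the change and syncs → health checks gate promotion. Rollout safety comes from the Deployment's `maxUnavailable`/`maxSurge` settings; a `PodDisruptionBudget` with `minAvailable` additionally protects production against node drains that happen during the rollout.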
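Step 5 of the pipeline leaves the smoke test itself unspecified. One way to gate promotion is a small retry wrapper around whatever check you use; the sketch below is an assumption about how that could look, with `retry` and the staging URL both hypothetical.

```bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  local max="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "failed after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage (hypothetical staging endpoint):
# retry 5 3 curl --fail --silent --show-error https://staging.example.com/healthz
```

Because the wrapper returns non-zero when every attempt fails, a pipeline running with `set -e` stops before the production deploy.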
Zero-downtime deployments require coordination at three layers. **Rolling update configuration**: set `maxUnavailable: 0` and `maxSurge: 1` so each new Pod is ready before an old one is terminated. **PodDisruptionBudget**: set `minAvailable` to at least 50% of replicas to limit voluntary disruptions such as node drains; it does not govern the rolling update itself, which follows `maxUnavailable`/`maxSurge`. **Graceful shutdown with preStop hook**: when Kubernetes terminates a Pod, it starts the shutdown sequence (the `preStop` hook, then SIGTERM) while removing the Pod from Service endpoints in parallel. This is a race: until every kube-proxy and load balancer has seen the endpoint update, new requests can still be routed to a Pod that is already shutting down. A `preStop` sleep delays SIGTERM long enough for the endpoint update to propagate, after which the app handles SIGTERM:
```yaml
spec:
  terminationGracePeriodSeconds: 60   # Pod-level field
  containers:
    - name: api                       # container name is illustrative
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]
```
Set `terminationGracePeriodSeconds` longer than the `preStop` sleep plus the slowest request, because the grace period covers both. For Node.js/Express, listen for SIGTERM, stop accepting new connections, wait for active requests to finish (`server.close()`), then exit. Readiness probes ensure the new Pod only receives traffic when it is truly ready. Together these eliminate the 502/503 errors typical of naive rolling deployments.
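Put together, the three layers might look like the manifests below. This is a sketch rather than a production template: the name `api`, the image reference, the port, and the `/healthz` path are all hypothetical.

```bash
outdir="$(mktemp -d)"

cat > "$outdir/zero-downtime.yaml" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0      # never drop below the desired replica count
      maxSurge: 1            # add one new Pod at a time
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: registry.example.com/api:1.0.0
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 50%
  selector:
    matchLabels:
      app: api
EOF

echo "manifest written to $outdir/zero-downtime.yaml"
# Apply with: kubectl apply -f "$outdir/zero-downtime.yaml"
```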
Frequently Asked Questions
Do I need CKA/CKAD certifications?
Helpful for job applications, not required to pass interviews. They force you to actually run kubectl, which pays off.
What kubectl commands should I know?
get, describe, logs, exec, apply, rollout, port-forward, and cp. You should also read YAML fluently.
Docker vs containerd vs Podman?
Most Kubernetes clusters run containerd (or CRI-O) as the container runtime. Docker Desktop is a dev tool. Podman is a daemonless alternative that supports rootless containers. The fundamentals are the same.
Helm or Kustomize?
Helm has the ecosystem; Kustomize has a simpler mental model. Most orgs use one, often both, so know both conceptually.
How do you secure a container?
Run as a non-root user with a read-only root filesystem, use a minimal base image, scan it for CVEs, drop Linux capabilities, avoid privileged mode, and set a `securityContext` in Kubernetes.
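In Kubernetes terms, most of that checklist maps onto `securityContext` fields. A minimal sketch, assuming a hypothetical image and UID; the `emptyDir` mount is there only because a read-only root filesystem usually still needs a writable `/tmp`.

```bash
outdir="$(mktemp -d)"

cat > "$outdir/hardened-pod.yaml" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: hardened-example
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir: {}
EOF

echo "manifest written to $outdir/hardened-pod.yaml"
# Apply with: kubectl apply -f "$outdir/hardened-pod.yaml"
```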