SELECT retrieves rows and columns from one or more tables. You specify the columns you want (or * for all), the source table in a FROM clause, and optional filtering, sorting, and limiting clauses.
sql
SELECT first_name, last_name, email
FROM users
WHERE active = TRUE
ORDER BY last_name ASC
LIMIT 10;
Every SQL query starts with SELECT, making it the most fundamental statement to master.
WHERE filters rows before they are returned, keeping only rows where the condition evaluates to TRUE. NULL values are excluded because NULL comparisons return UNKNOWN, not TRUE.
sql
SELECT * FROM orders
WHERE status = 'shipped' AND total > 100;
WHERE is evaluated before GROUP BY and HAVING, so you cannot reference aggregate functions in it.
ORDER BY sorts the result set by one or more columns (ASC by default). LIMIT caps the number of rows returned; OFFSET skips the first N rows, enabling pagination.
sql
SELECT id, title
FROM articles
ORDER BY published_at DESC
LIMIT 20 OFFSET 40; -- page 3, 20 per page
Without ORDER BY, the row order is non-deterministic, so OFFSET-based pagination can return duplicate or missing rows across pages on a live table.
DISTINCT removes duplicate rows from the result set, comparing all selected columns together. It is useful when a join or query can produce repeated combinations you do not care about.
sql
SELECT DISTINCT country FROM customers;
DISTINCT adds a sort/hash step and can be expensive on large tables. If you find yourself adding it to fix "too many rows," it is often a sign of an incorrect join or missing filter.
Aggregate functions collapse a set of rows into a single value. COUNT(*) counts all rows; COUNT(col) counts non-NULL values. SUM and AVG work on numeric columns. MIN/MAX work on any orderable type.
sql
SELECT
COUNT(*) AS total_orders,
SUM(amount) AS revenue,
AVG(amount) AS avg_order,
MIN(created_at) AS first_order,
MAX(created_at) AS last_order
FROM orders;
All aggregates ignore NULL values except COUNT(*).
GROUP BY collapses rows that share the same values in the specified columns into a single group, allowing aggregates to be computed per group. Every column in the SELECT that is not inside an aggregate function must appear in GROUP BY.
sql
SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
Omitting a non-aggregated column from GROUP BY is a SQL error in strict mode (and in PostgreSQL).
GROUP BY collapses rows that share the same values in the specified columns into a single group, allowing aggregates to be computed per group. Every column in the SELECT that is not inside an aggregate function must appear in GROUP BY.
```sql
SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
```
Omitting a non-aggregated column from GROUP BY is a SQL error in strict mode (and in PostgreSQL).
WHERE filters individual rows before grouping; HAVING filters groups after GROUP BY and aggregate calculation. You cannot use aggregate functions in WHERE, but you can in HAVING.
sql
SELECT department, COUNT(*) AS cnt
FROM employees
WHERE active = TRUE -- filters rows first
GROUP BY department
HAVING COUNT(*) > 5; -- filters groups after aggregation
The execution order is: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT.
WHERE filters individual rows before grouping; HAVING filters groups after GROUP BY and aggregate calculation. You cannot use aggregate functions in WHERE, but you can in HAVING.
```sql
SELECT department, COUNT(*) AS cnt
FROM employees
WHERE active = TRUE -- filters rows first
GROUP BY department
HAVING COUNT(*) > 5; -- filters groups after aggregation
```
The execution order is: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT.
NULL represents an unknown or missing value. Any arithmetic or comparison with NULL yields NULL (or UNKNOWN in a boolean context), meaning NULL = NULL is not TRUE — use IS NULL or IS NOT NULL instead. COALESCE returns the first non-NULL argument; NULLIF returns NULL when two arguments are equal.
sql
SELECT COALESCE(phone, 'N/A') AS phone_display
FROM contacts
WHERE deleted_at IS NULL;
NULL propagation is a frequent source of bugs in WHERE clauses and JOIN conditions.
NULL represents an unknown or missing value. Any arithmetic or comparison with NULL yields NULL (or UNKNOWN in a boolean context), meaning NULL = NULL is not TRUE — use IS NULL or IS NOT NULL instead. COALESCE returns the first non-NULL argument; NULLIF returns NULL when two arguments are equal.
```sql
SELECT COALESCE(phone, 'N/A') AS phone_display
FROM contacts
WHERE deleted_at IS NULL;
```
NULL propagation is a frequent source of bugs in WHERE clauses and JOIN conditions.
INNER JOIN returns rows where a matching row exists in both tables based on the join condition. Rows that have no match in either table are excluded from the result set.
sql
SELECT o.id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;
Use INNER JOIN when you only care about records that have a corresponding entry in both tables.
INNER JOIN returns rows where a matching row exists in both tables based on the join condition. Rows that have no match in either table are excluded from the result set.
```sql
SELECT o.id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;
```
Use INNER JOIN when you only care about records that have a corresponding entry in both tables.
LEFT JOIN (also LEFT OUTER JOIN) returns all rows from the left table and matching rows from the right table. Where no match exists, right-table columns are NULL. Use it when you want all records from the primary table regardless of whether a related record exists.
sql
SELECT u.id, u.email, p.plan
FROM users u
LEFT JOIN subscriptions p ON u.id = p.user_id;
-- users without a subscription still appear, p.plan is NULL
LEFT JOIN (also LEFT OUTER JOIN) returns all rows from the left table and matching rows from the right table. Where no match exists, right-table columns are NULL. Use it when you want all records from the primary table regardless of whether a related record exists.
```sql
SELECT u.id, u.email, p.plan
FROM users u
LEFT JOIN subscriptions p ON u.id = p.user_id;
-- users without a subscription still appear, p.plan is NULL
```
RIGHT JOIN returns all rows from the right table and matching rows from the left table, with NULLs for unmatched left-table columns. It is the mirror of LEFT JOIN and is less commonly used because you can always rewrite a RIGHT JOIN as a LEFT JOIN by swapping the table order.
sql
SELECT e.name, d.name AS dept
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.id;
-- departments with no employees still appear
RIGHT JOIN returns all rows from the right table and matching rows from the left table, with NULLs for unmatched left-table columns. It is the mirror of LEFT JOIN and is less commonly used because you can always rewrite a RIGHT JOIN as a LEFT JOIN by swapping the table order.
```sql
SELECT e.name, d.name AS dept
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.id;
-- departments with no employees still appear
```
FULL OUTER JOIN returns all rows from both tables, filling in NULLs where there is no matching row on either side. Use it to find unmatched rows in both tables simultaneously.
sql
SELECT a.id AS a_id, b.id AS b_id
FROM table_a a
FULL OUTER JOIN table_b b ON a.key = b.key
WHERE a.id IS NULL OR b.id IS NULL; -- only unmatched rows
PostgreSQL supports FULL OUTER JOIN natively; MySQL did not until version 8.
FULL OUTER JOIN returns all rows from both tables, filling in NULLs where there is no matching row on either side. Use it to find unmatched rows in both tables simultaneously.
```sql
SELECT a.id AS a_id, b.id AS b_id
FROM table_a a
FULL OUTER JOIN table_b b ON a.key = b.key
WHERE a.id IS NULL OR b.id IS NULL; -- only unmatched rows
```
PostgreSQL supports FULL OUTER JOIN natively; MySQL did not until version 8.
CROSS JOIN produces the Cartesian product of two tables — every row in the first table paired with every row in the second. A 10-row table crossed with a 100-row table yields 1,000 rows.
sql
SELECT s.size, c.color
FROM sizes s
CROSS JOIN colors c;
Cross joins are intentional when you need all combinations (e.g., generating a calendar, pricing matrix), but accidental cross joins from a missing ON clause are a common performance bug.
CROSS JOIN produces the Cartesian product of two tables — every row in the first table paired with every row in the second. A 10-row table crossed with a 100-row table yields 1,000 rows.
```sql
SELECT s.size, c.color
FROM sizes s
CROSS JOIN colors c;
```
Cross joins are intentional when you need all combinations (e.g., generating a calendar, pricing matrix), but accidental cross joins from a missing ON clause are a common performance bug.
A SELF JOIN joins a table to itself using aliases, treating the same table as two logical tables. Common use cases include hierarchical data (employees and their managers) and comparing rows within the same table.
sql
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
Self joins require aliases to distinguish which copy of the table each column reference comes from.
A SELF JOIN joins a table to itself using aliases, treating the same table as two logical tables. Common use cases include hierarchical data (employees and their managers) and comparing rows within the same table.
```sql
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
```
Self joins require aliases to distinguish which copy of the table each column reference comes from.
Aliases assign a temporary name to a table or column for the duration of the query. Column aliases appear in the output header; table aliases shorten join syntax. The AS keyword is optional in most databases.
sql
SELECT
u.first_name || ' ' || u.last_name AS full_name,
o.total AS order_total
FROM users AS u
JOIN orders AS o ON o.user_id = u.id;
Column aliases defined in SELECT are not available in WHERE or HAVING in most databases (they are in PostgreSQL ORDER BY).
Aliases assign a temporary name to a table or column for the duration of the query. Column aliases appear in the output header; table aliases shorten join syntax. The AS keyword is optional in most databases.
```sql
SELECT
u.first_name || ' ' || u.last_name AS full_name,
o.total AS order_total
FROM users AS u
JOIN orders AS o ON o.user_id = u.id;
```
Column aliases defined in SELECT are not available in WHERE or HAVING in most databases (they are in PostgreSQL ORDER BY).
INSERT INTO adds one or more rows to a table. You can specify explicit column names (recommended) or rely on column order (fragile).
Omitting the column list requires values for every column in table-definition order and breaks if the schema changes.
INSERT INTO adds one or more rows to a table. You can specify explicit column names (recommended) or rely on column order (fragile).
```sql
INSERT INTO products (name, price, stock)
VALUES ('Widget', 9.99, 100);
-- Multi-row insert
INSERT INTO tags (name) VALUES ('sql'), ('database'), ('backend');
```
Omitting the column list requires values for every column in table-definition order and breaks if the schema changes.
UPDATE modifies existing rows in a table. Always include a WHERE clause — omitting it updates every row in the table.
sql
UPDATE users
SET last_login = NOW(), login_count = login_count + 1
WHERE id = 42;
PostgreSQL also supports UPDATE ... FROM for joining another table during the update, and UPDATE ... RETURNING to fetch affected rows without a second query.
UPDATE modifies existing rows in a table. Always include a WHERE clause — omitting it updates every row in the table.
```sql
UPDATE users
SET last_login = NOW(), login_count = login_count + 1
WHERE id = 42;
```
PostgreSQL also supports UPDATE ... FROM for joining another table during the update, and UPDATE ... RETURNING to fetch affected rows without a second query.
DELETE removes rows matching the WHERE condition. Without WHERE, all rows are deleted (but the table structure remains). Always run the equivalent SELECT first to confirm what will be deleted.
sql
DELETE FROM sessions
WHERE expires_at < NOW();
DELETE fires row-level triggers and generates WAL entries, making it slower than TRUNCATE for clearing a whole table.
DELETE removes rows matching the WHERE condition. Without WHERE, all rows are deleted (but the table structure remains). Always run the equivalent SELECT first to confirm what will be deleted.
```sql
DELETE FROM sessions
WHERE expires_at < NOW();
```
DELETE fires row-level triggers and generates WAL entries, making it slower than TRUNCATE for clearing a whole table.
DELETE removes rows one at a time, fires triggers, and can have a WHERE clause. TRUNCATE removes all rows at once, is much faster, does not fire row-level triggers, and resets sequences. TRUNCATE acquires an ACCESS EXCLUSIVE lock on the table.
sql
-- Slower, transactional, supports WHERE
DELETE FROM audit_log WHERE created_at < '2023-01-01';
-- Faster, no WHERE, resets sequences
TRUNCATE TABLE session_cache;
In PostgreSQL, TRUNCATE is transactional and can be rolled back, unlike in MySQL.
DELETE removes rows one at a time, fires triggers, and can have a WHERE clause. TRUNCATE removes all rows at once, is much faster, does not fire row-level triggers, and resets sequences. TRUNCATE acquires an ACCESS EXCLUSIVE lock on the table.
```sql
-- Slower, transactional, supports WHERE
DELETE FROM audit_log WHERE created_at < '2023-01-01';
-- Faster, no WHERE, resets sequences
TRUNCATE TABLE session_cache;
```
In PostgreSQL, TRUNCATE is transactional and can be rolled back, unlike in MySQL.
A PRIMARY KEY uniquely identifies each row in a table. It automatically implies NOT NULL and UNIQUE. A table can have only one primary key, which may span multiple columns (composite key).
sql
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
Databases create a unique B-tree index on the primary key automatically, making lookups by primary key very fast.
A PRIMARY KEY uniquely identifies each row in a table. It automatically implies NOT NULL and UNIQUE. A table can have only one primary key, which may span multiple columns (composite key).
```sql
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
```
Databases create a unique B-tree index on the primary key automatically, making lookups by primary key very fast.
A FOREIGN KEY enforces referential integrity by requiring that the value in the referencing column exists in the referenced table's column (usually a primary key). It prevents orphaned rows.
sql
CREATE TABLE order_items (
id SERIAL PRIMARY KEY,
order_id INT REFERENCES orders(id) ON DELETE CASCADE,
sku TEXT NOT NULL
);
ON DELETE CASCADE, SET NULL, or RESTRICT control what happens when the referenced row is deleted.
A FOREIGN KEY enforces referential integrity by requiring that the value in the referencing column exists in the referenced table's column (usually a primary key). It prevents orphaned rows.
```sql
CREATE TABLE order_items (
id SERIAL PRIMARY KEY,
order_id INT REFERENCES orders(id) ON DELETE CASCADE,
sku TEXT NOT NULL
);
```
ON DELETE CASCADE, SET NULL, or RESTRICT control what happens when the referenced row is deleted.
UNIQUE ensures no two rows have the same value in the constrained column(s); NULLs are usually exempt (multiple NULLs are allowed because NULL != NULL). NOT NULL prevents storing a NULL in the column.
sql
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email TEXT NOT NULL UNIQUE,
name TEXT NOT NULL
);
A composite UNIQUE constraint covers combinations of columns, not each column individually.
UNIQUE ensures no two rows have the same value in the constrained column(s); NULLs are usually exempt (multiple NULLs are allowed because NULL != NULL). NOT NULL prevents storing a NULL in the column.
```sql
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email TEXT NOT NULL UNIQUE,
name TEXT NOT NULL
);
```
A composite UNIQUE constraint covers combinations of columns, not each column individually.
CREATE TABLE defines the table name, column names, data types, and optional constraints. Column-level constraints follow the type; table-level constraints (like composite keys) appear after all columns.
sql
CREATE TABLE products (
id SERIAL PRIMARY KEY,
sku VARCHAR(50) NOT NULL UNIQUE,
name TEXT NOT NULL,
price NUMERIC(10,2) NOT NULL CHECK (price >= 0),
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE TABLE defines the table name, column names, data types, and optional constraints. Column-level constraints follow the type; table-level constraints (like composite keys) appear after all columns.
```sql
CREATE TABLE products (
id SERIAL PRIMARY KEY,
sku VARCHAR(50) NOT NULL UNIQUE,
name TEXT NOT NULL,
price NUMERIC(10,2) NOT NULL CHECK (price >= 0),
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
```
Core types: INT / BIGINT (integers), NUMERIC(p,s) / DECIMAL (exact decimals), FLOAT / DOUBLE PRECISION (approximate), VARCHAR(n) (variable-length string up to n chars), TEXT (unlimited string), DATE, TIME, TIMESTAMP / TIMESTAMPTZ (with time zone), BOOLEAN. PostgreSQL adds JSONB, UUID, ARRAY, ENUM, and more.
Choose TIMESTAMPTZ over TIMESTAMP when storing user events — it preserves the UTC offset. Use NUMERIC for money, never FLOAT (rounding errors).
Core types: INT / BIGINT (integers), NUMERIC(p,s) / DECIMAL (exact decimals), FLOAT / DOUBLE PRECISION (approximate), VARCHAR(n) (variable-length string up to n chars), TEXT (unlimited string), DATE, TIME, TIMESTAMP / TIMESTAMPTZ (with time zone), BOOLEAN. PostgreSQL adds JSONB, UUID, ARRAY, ENUM, and more.
Choose TIMESTAMPTZ over TIMESTAMP when storing user events — it preserves the UTC offset. Use NUMERIC for money, never FLOAT (rounding errors).
A subquery is a SELECT nested inside another SQL statement. It can appear in WHERE, FROM (as a derived table), or SELECT (scalar subquery). Use subqueries when the inner result is needed as a filter set or inline value.
sql
SELECT name FROM products
WHERE id IN (
SELECT product_id FROM order_items
WHERE quantity > 10
);
Correlated subqueries (referencing the outer query) run once per outer row and can be slow; CTEs or JOINs are often better alternatives.
A subquery is a SELECT nested inside another SQL statement. It can appear in WHERE, FROM (as a derived table), or SELECT (scalar subquery). Use subqueries when the inner result is needed as a filter set or inline value.
```sql
SELECT name FROM products
WHERE id IN (
SELECT product_id FROM order_items
WHERE quantity > 10
);
```
Correlated subqueries (referencing the outer query) run once per outer row and can be slow; CTEs or JOINs are often better alternatives.
IN checks whether a value appears in a static list or subquery result set. EXISTS checks whether a subquery returns at least one row and short-circuits as soon as it finds one. EXISTS is generally faster when the subquery result is large because it does not materialise the full set.
sql
-- IN: loads all product_ids
SELECT * FROM products WHERE id IN (SELECT product_id FROM deals);
-- EXISTS: stops at first match
SELECT * FROM products p
WHERE EXISTS (SELECT 1 FROM deals d WHERE d.product_id = p.id);
With NULLs, IN can produce surprising results — if the subquery contains a NULL, the NOT IN version may return zero rows.
IN checks whether a value appears in a static list or subquery result set. EXISTS checks whether a subquery returns at least one row and short-circuits as soon as it finds one. EXISTS is generally faster when the subquery result is large because it does not materialise the full set.
```sql
-- IN: loads all product_ids
SELECT * FROM products WHERE id IN (SELECT product_id FROM deals);
-- EXISTS: stops at first match
SELECT * FROM products p
WHERE EXISTS (SELECT 1 FROM deals d WHERE d.product_id = p.id);
```
With NULLs, IN can produce surprising results — if the subquery contains a NULL, the NOT IN version may return zero rows.
UNION combines the result sets of two queries and removes duplicates (adds a sort/hash step). UNION ALL combines them and keeps all rows including duplicates, making it faster.
sql
SELECT email FROM customers
UNION ALL
SELECT email FROM leads;
-- keeps duplicates, faster
SELECT email FROM customers
UNION
SELECT email FROM leads;
-- removes duplicates, slower
Both require the same number of columns and compatible data types in each SELECT. Prefer UNION ALL unless duplicate removal is required.
UNION combines the result sets of two queries and removes duplicates (adds a sort/hash step). UNION ALL combines them and keeps all rows including duplicates, making it faster.
```sql
SELECT email FROM customers
UNION ALL
SELECT email FROM leads;
-- keeps duplicates, faster
SELECT email FROM customers
UNION
SELECT email FROM leads;
-- removes duplicates, slower
```
Both require the same number of columns and compatible data types in each SELECT. Prefer UNION ALL unless duplicate removal is required.
CASE WHEN is SQL's conditional expression, evaluating conditions in order and returning the first matching THEN value. It can appear anywhere an expression is valid: SELECT, WHERE, ORDER BY, aggregates.
sql
SELECT
name,
CASE
WHEN score >= 90 THEN 'A'
WHEN score >= 70 THEN 'B'
WHEN score >= 50 THEN 'C'
ELSE 'F'
END AS grade
FROM students;
The simple CASE form (CASE col WHEN val THEN ...) is equivalent to equality checks.
CASE WHEN is SQL's conditional expression, evaluating conditions in order and returning the first matching THEN value. It can appear anywhere an expression is valid: SELECT, WHERE, ORDER BY, aggregates.
```sql
SELECT
name,
CASE
WHEN score >= 90 THEN 'A'
WHEN score >= 70 THEN 'B'
WHEN score >= 50 THEN 'C'
ELSE 'F'
END AS grade
FROM students;
```
The simple CASE form (CASE col WHEN val THEN ...) is equivalent to equality checks.
CHAR(n) is fixed-length: values shorter than n are padded with spaces, and it always occupies n characters of storage. VARCHAR(n) is variable-length: it stores only the actual characters plus a small length prefix, up to a maximum of n.
For most use cases VARCHAR is preferred because it saves storage and does not introduce trailing-space comparison issues. CHAR(n) can be slightly faster on fixed-width data like country codes (CHAR(2)) because the storage is predictable. In PostgreSQL, CHAR, VARCHAR, and TEXT have nearly identical performance.
CHAR(n) is fixed-length: values shorter than n are padded with spaces, and it always occupies n characters of storage. VARCHAR(n) is variable-length: it stores only the actual characters plus a small length prefix, up to a maximum of n.
For most use cases VARCHAR is preferred because it saves storage and does not introduce trailing-space comparison issues. CHAR(n) can be slightly faster on fixed-width data like country codes (CHAR(2)) because the storage is predictable. In PostgreSQL, CHAR, VARCHAR, and TEXT have nearly identical performance.
LIKE performs case-sensitive pattern matching using % (any sequence of characters) and _ (any single character). ILIKE in PostgreSQL does case-insensitive matching.
sql
SELECT * FROM products WHERE name LIKE 'Bolt%'; -- starts with Bolt
SELECT * FROM users WHERE email LIKE '%@gmail.com'; -- Gmail addresses
SELECT * FROM codes WHERE code LIKE 'A_C'; -- A, any char, C
Leading % prevents index use because the database cannot do a range scan. For heavy text search, use full-text search (tsvector) instead.
LIKE performs case-sensitive pattern matching using % (any sequence of characters) and _ (any single character). ILIKE in PostgreSQL does case-insensitive matching.
```sql
SELECT * FROM products WHERE name LIKE 'Bolt%'; -- starts with Bolt
SELECT * FROM users WHERE email LIKE '%@gmail.com'; -- Gmail addresses
SELECT * FROM codes WHERE code LIKE 'A_C'; -- A, any char, C
```
Leading % prevents index use because the database cannot do a range scan. For heavy text search, use full-text search (tsvector) instead.
Window functions compute a value across a "window" of rows related to the current row, without collapsing the result set to one row per group. The OVER() clause defines the window. GROUP BY collapses rows; window functions preserve every row.
sql
SELECT
employee_id,
department,
salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees;
-- Each employee row is returned, plus the department average alongside it.
Common window functions: ROW_NUMBER, RANK, DENSE_RANK, NTILE, LAG, LEAD, FIRST_VALUE, LAST_VALUE, SUM/AVG/COUNT with OVER().
Window functions compute a value across a "window" of rows related to the current row, without collapsing the result set to one row per group. The OVER() clause defines the window. GROUP BY collapses rows; window functions preserve every row.
```sql
SELECT
employee_id,
department,
salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees;
-- Each employee row is returned, plus the department average alongside it.
```
Common window functions: ROW_NUMBER, RANK, DENSE_RANK, NTILE, LAG, LEAD, FIRST_VALUE, LAST_VALUE, SUM/AVG/COUNT with OVER().
All three assign a numeric rank within a partition ordered by a column. They differ when ties exist. ROW_NUMBER always assigns unique sequential integers regardless of ties. RANK skips numbers after a tie (1,1,3). DENSE_RANK never skips (1,1,2).
sql
SELECT
name, score,
ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
RANK() OVER (ORDER BY score DESC) AS rnk,
DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM leaderboard;
-- Scores: 100,100,90 → row_num: 1,2,3 | rank: 1,1,3 | dense_rank: 1,1,2
Use DENSE_RANK when gaps in ranking numbers would confuse users; use ROW_NUMBER to pick exactly one row per group.
All three assign a numeric rank within a partition ordered by a column. They differ when ties exist. ROW_NUMBER always assigns unique sequential integers regardless of ties. RANK skips numbers after a tie (1,1,3). DENSE_RANK never skips (1,1,2).
```sql
SELECT
name, score,
ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
RANK() OVER (ORDER BY score DESC) AS rnk,
DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM leaderboard;
-- Scores: 100,100,90 → row_num: 1,2,3 | rank: 1,1,3 | dense_rank: 1,1,2
```
Use DENSE_RANK when gaps in ranking numbers would confuse users; use ROW_NUMBER to pick exactly one row per group.
NTILE(n) divides the ordered partition into n roughly equal buckets and assigns each row a bucket number from 1 to n. If the rows do not divide evenly, earlier buckets get one extra row.
sql
SELECT
customer_id,
total_spend,
NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile
FROM customer_summary;
-- quartile 1 = top 25% spenders
NTILE is commonly used for percentile bucketing and A/B test stratification.
NTILE(n) divides the ordered partition into n roughly equal buckets and assigns each row a bucket number from 1 to n. If the rows do not divide evenly, earlier buckets get one extra row.
```sql
SELECT
customer_id,
total_spend,
NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile
FROM customer_summary;
-- quartile 1 = top 25% spenders
```
NTILE is commonly used for percentile bucketing and A/B test stratification.
LAG accesses a column value from a previous row within the partition; LEAD accesses a future row. Both accept an offset (default 1) and an optional default for when no prior/next row exists.
sql
SELECT
date,
revenue,
LAG(revenue, 1, 0) OVER (ORDER BY date) AS prev_day_revenue,
revenue - LAG(revenue, 1, 0) OVER (ORDER BY date) AS day_over_day_change
FROM daily_revenue;
Useful for day-over-day comparisons, detecting changes in status, and computing time-between-events without a self-join.
LAG accesses a column value from a previous row within the partition; LEAD accesses a future row. Both accept an offset (default 1) and an optional default for when no prior/next row exists.
```sql
SELECT
date,
revenue,
LAG(revenue, 1, 0) OVER (ORDER BY date) AS prev_day_revenue,
revenue - LAG(revenue, 1, 0) OVER (ORDER BY date) AS day_over_day_change
FROM daily_revenue;
```
Useful for day-over-day comparisons, detecting changes in status, and computing time-between-events without a self-join.
Using SUM as a window function with an ORDER BY in the OVER clause computes a cumulative (running) sum. The default frame is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
sql
SELECT
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM orders;
Adding PARTITION BY resets the running total per group (e.g., per customer or per month).
Using SUM as a window function with an ORDER BY in the OVER clause computes a cumulative (running) sum. The default frame is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
```sql
SELECT
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM orders;
```
Adding PARTITION BY resets the running total per group (e.g., per customer or per month).
PARTITION BY divides rows into independent groups (partitions) before the window function is applied. The function restarts for each partition, similar to GROUP BY but without collapsing rows.
sql
SELECT
region,
salesperson,
sales,
SUM(sales) OVER (PARTITION BY region) AS region_total,
sales * 1.0 / SUM(sales) OVER (PARTITION BY region) AS pct_of_region
FROM sales_data;
Without PARTITION BY, the window spans the entire result set.
PARTITION BY divides rows into independent groups (partitions) before the window function is applied. The function restarts for each partition, similar to GROUP BY but without collapsing rows.
```sql
SELECT
region,
salesperson,
sales,
SUM(sales) OVER (PARTITION BY region) AS region_total,
sales * 1.0 / SUM(sales) OVER (PARTITION BY region) AS pct_of_region
FROM sales_data;
```
Without PARTITION BY, the window spans the entire result set.
A CTE is a named temporary result set defined in a WITH clause at the start of a query. It improves readability by breaking complex queries into named steps and can be referenced multiple times.
sql
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', created_at) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY 1
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month
FROM monthly_revenue;
In PostgreSQL, CTEs are optimisation fences by default (materialised), but you can use WITH ... AS NOT MATERIALIZED to let the planner inline them.
A CTE is a named temporary result set defined in a WITH clause at the start of a query. It improves readability by breaking complex queries into named steps and can be referenced multiple times.
```sql
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', created_at) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY 1
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month
FROM monthly_revenue;
```
In PostgreSQL, CTEs are optimisation fences by default (materialised), but you can use WITH ... AS NOT MATERIALIZED to let the planner inline them.
A recursive CTE references itself to traverse hierarchical or graph data. It consists of an anchor (base case) and a recursive member joined with UNION ALL, and runs until no new rows are produced.
sql
WITH RECURSIVE org_chart AS (
-- anchor: top-level employees
SELECT id, name, manager_id, 0 AS depth
FROM employees WHERE manager_id IS NULL
UNION ALL
-- recursive: each employee's reports
SELECT e.id, e.name, e.manager_id, oc.depth + 1
FROM employees e
JOIN org_chart oc ON e.manager_id = oc.id
)
SELECT * FROM org_chart ORDER BY depth;
A recursive CTE references itself to traverse hierarchical or graph data. It consists of an anchor (base case) and a recursive member joined with UNION ALL, and runs until no new rows are produced.
```sql
WITH RECURSIVE org_chart AS (
-- anchor: top-level employees
SELECT id, name, manager_id, 0 AS depth
FROM employees WHERE manager_id IS NULL
UNION ALL
-- recursive: each employee's reports
SELECT e.id, e.name, e.manager_id, oc.depth + 1
FROM employees e
JOIN org_chart oc ON e.manager_id = oc.id
)
SELECT * FROM org_chart ORDER BY depth;
```
Common uses: org charts, folder trees, bill-of-materials, graph traversal, path finding.
A correlated subquery references columns from the outer query, causing it to execute once per outer row. Unlike an uncorrelated subquery (which runs once), it can be slow on large datasets.
sql
SELECT e.name, e.salary
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department = e.department -- references outer row
);
This returns employees earning above their department average. The subquery runs for every row in `e`. Rewriting with a window function (AVG OVER PARTITION BY department) is usually much faster.
A correlated subquery references columns from the outer query, causing it to execute once per outer row. Unlike an uncorrelated subquery (which runs once), it can be slow on large datasets.
```sql
SELECT e.name, e.salary
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department = e.department -- references outer row
);
```
This returns employees earning above their department average. The subquery runs for every row in `e`. Rewriting with a window function (AVG OVER PARTITION BY department) is usually much faster.
EXISTS short-circuits on the first match and never materialises the full subquery result, making it fast when the subquery is large or the existence check is selective. IN materialises the subquery into a hash set — efficient when the subquery is small and re-used, or when NOT IN logic is needed (though watch for NULL pitfalls with NOT IN).
For large correlated datasets, EXISTS with a proper index on the join column typically wins. Modern query planners (PostgreSQL 12+) can rewrite many IN subqueries into semi-joins automatically, narrowing the practical difference. The NULL behaviour of NOT IN is the most important semantic difference: if the subquery returns any NULL, NOT IN returns zero rows.
EXISTS short-circuits on the first match and never materialises the full subquery result, making it fast when the subquery is large or the existence check is selective. IN materialises the subquery into a hash set — efficient when the subquery is small and re-used, or when NOT IN logic is needed (though watch for NULL pitfalls with NOT IN).
For large correlated datasets, EXISTS with a proper index on the join column typically wins. Modern query planners (PostgreSQL 12+) can rewrite many IN subqueries into semi-joins automatically, narrowing the practical difference. The NULL behaviour of NOT IN is the most important semantic difference: if the subquery returns any NULL, NOT IN returns zero rows.
A B-tree (balanced tree) index keeps column values in a sorted structure of tree nodes, allowing the database to find a value in O(log n) time instead of scanning all rows (O(n)). Each leaf node stores the indexed value and a pointer to the heap row.
sql
CREATE INDEX idx_orders_user_id ON orders (user_id);
-- Now lookups like WHERE user_id = 42 go through the index tree,
-- not a full table scan.
B-tree indexes support equality (=), range (<, >, BETWEEN), and prefix LIKE ('abc%') lookups. They are the default index type in PostgreSQL.
A B-tree (balanced tree) index keeps column values in a sorted structure of tree nodes, allowing the database to find a value in O(log n) time instead of scanning all rows (O(n)). Each leaf node stores the indexed value and a pointer to the heap row.
```sql
CREATE INDEX idx_orders_user_id ON orders (user_id);
-- Now lookups like WHERE user_id = 42 go through the index tree,
-- not a full table scan.
```
B-tree indexes support equality (=), range (<, >, BETWEEN), and prefix LIKE ('abc%') lookups. They are the default index type in PostgreSQL.
A composite (multi-column) index covers multiple columns in a specified order. The leftmost column(s) can be used independently; columns further right are only used when all preceding columns are also constrained.
sql
CREATE INDEX idx_orders_user_status ON orders (user_id, status);
-- Efficient for: WHERE user_id = 1
-- Efficient for: WHERE user_id = 1 AND status = 'shipped'
-- NOT efficient for: WHERE status = 'shipped' alone (no left-prefix)
Place the most selective column first, and put equality conditions before range conditions for maximum index utility.
A composite (multi-column) index covers multiple columns in a specified order. The leftmost column(s) can be used independently; columns further right are only used when all preceding columns are also constrained.
```sql
CREATE INDEX idx_orders_user_status ON orders (user_id, status);
-- Efficient for: WHERE user_id = 1
-- Efficient for: WHERE user_id = 1 AND status = 'shipped'
-- NOT efficient for: WHERE status = 'shipped' alone (no left-prefix)
```
Place the most selective column first, and put equality conditions before range conditions for maximum index utility.
An expression index indexes the result of a function or expression rather than a raw column value. The query must use the identical expression to benefit from the index.
sql
-- Index the lowercased email for case-insensitive lookups
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
-- Now this query uses the index:
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
PostgreSQL evaluates the expression at write time and stores the result, so reads are fast but writes are slightly slower. Ensure the function is immutable (same output for same input).
An expression index indexes the result of a function or expression rather than a raw column value. The query must use the identical expression to benefit from the index.
```sql
-- Index the lowercased email for case-insensitive lookups
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
-- Now this query uses the index:
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
```
PostgreSQL evaluates the expression at write time and stores the result, so reads are fast but writes are slightly slower. Ensure the function is immutable (same output for same input).
The query planner skips an index when: (1) the table is tiny and a sequential scan is cheaper; (2) the query is not selective (e.g., a boolean column where 90% of rows match); (3) the WHERE clause wraps the indexed column in a function (e.g., WHERE LOWER(email) = ... without a function index); (4) the index column order does not match the query's leading conditions; (5) implicit type casting forces a function call; (6) query statistics are stale (run ANALYZE).
Always use EXPLAIN to confirm whether your index is being used. The planner uses row count estimates from table statistics, so an outdated pg_statistic can cause bad plans.
The query planner skips an index when: (1) the table is tiny and a sequential scan is cheaper; (2) the query is not selective (e.g., a boolean column where 90% of rows match); (3) the WHERE clause wraps the indexed column in a function (e.g., WHERE LOWER(email) = ... without a function index); (4) the index column order does not match the query's leading conditions; (5) implicit type casting forces a function call; (6) query statistics are stale (run ANALYZE).
Always use EXPLAIN to confirm whether your index is being used. The planner uses row count estimates from table statistics, so an outdated pg_statistic can cause bad plans.
EXPLAIN shows the query plan with estimated costs. EXPLAIN ANALYZE actually executes the query and shows real timings. Key fields: "cost=startup..total" (planner estimate), "rows" (estimated row count), "actual time=start..end" (real time), "loops" (how many times the node executed).
sql
EXPLAIN ANALYZE
SELECT * FROM orders WHERE user_id = 42;
-- Seq Scan: full table scan (bad for large tables)
-- Index Scan: uses an index (usually good)
-- Hash Join / Nested Loop / Merge Join: join strategies
Look for large discrepancies between "rows" (estimate) and "actual rows" — that signals stale statistics. Nodes with high "actual time" and many "loops" are tuning targets.
EXPLAIN shows the query plan with estimated costs. EXPLAIN ANALYZE actually executes the query and shows real timings. Key fields: "cost=startup..total" (planner estimate), "rows" (estimated row count), "actual time=start..end" (real time), "loops" (how many times the node executed).
```sql
EXPLAIN ANALYZE
SELECT * FROM orders WHERE user_id = 42;
-- Seq Scan: full table scan (bad for large tables)
-- Index Scan: uses an index (usually good)
-- Hash Join / Nested Loop / Merge Join: join strategies
```
Look for large discrepancies between "rows" (estimate) and "actual rows" — that signals stale statistics. Nodes with high "actual time" and many "loops" are tuning targets.
A covering index includes all columns referenced by a query so the database can answer the query entirely from the index without accessing the main table (heap) at all. This eliminates the extra I/O of following row pointers.
sql
-- Query needs id, status, and created_at
SELECT id, status, created_at FROM orders WHERE user_id = 42;
-- A covering index on (user_id, status, created_at, id)
-- lets PostgreSQL use an Index Only Scan.
CREATE INDEX idx_orders_cover ON orders (user_id, status, created_at, id);
PostgreSQL shows "Index Only Scan" in EXPLAIN when a covering index is used. Over-indexing wastes write performance, so add extra columns to an index only when the query is very hot.
A covering index includes all columns referenced by a query so the database can answer the query entirely from the index without accessing the main table (heap) at all. This eliminates the extra I/O of following row pointers.
```sql
-- Query needs id, status, and created_at
SELECT id, status, created_at FROM orders WHERE user_id = 42;
-- A covering index on (user_id, status, created_at, id)
-- lets PostgreSQL use an Index Only Scan.
CREATE INDEX idx_orders_cover ON orders (user_id, status, created_at, id);
```
PostgreSQL shows "Index Only Scan" in EXPLAIN when a covering index is used. Over-indexing wastes write performance, so add extra columns to an index only when the query is very hot.
A partial index is built on a subset of rows defined by a WHERE clause. It is smaller, faster to update, and more likely to fit in cache than a full index on the same column.
sql
-- Index only unprocessed orders (far fewer rows than all orders)
CREATE INDEX idx_orders_pending
ON orders (created_at)
WHERE status = 'pending';
-- This query benefits:
SELECT * FROM orders
WHERE status = 'pending' AND created_at > NOW() - INTERVAL '1 day';
The query's WHERE clause must imply the index predicate for the partial index to be used.
A partial index is built on a subset of rows defined by a WHERE clause. It is smaller, faster to update, and more likely to fit in cache than a full index on the same column.
```sql
-- Index only unprocessed orders (far fewer rows than all orders)
CREATE INDEX idx_orders_pending
ON orders (created_at)
WHERE status = 'pending';
-- This query benefits:
SELECT * FROM orders
WHERE status = 'pending' AND created_at > NOW() - INTERVAL '1 day';
```
The query's WHERE clause must imply the index predicate for the partial index to be used.
1NF: each column holds atomic values (no repeating groups or arrays) and each row is unique. 2NF: meets 1NF and every non-key column is fully dependent on the entire primary key (no partial dependencies on a composite key). 3NF: meets 2NF and no non-key column depends on another non-key column (no transitive dependencies). BCNF (Boyce-Codd NF): stricter 3NF — every determinant is a candidate key.
Example: a table (order_id, product_id, product_name, quantity) violates 2NF because product_name depends only on product_id, not the full key. Fix: split into orders_items and products tables. Normalisation reduces redundancy and anomalies but increases the number of joins needed.
1NF: each column holds atomic values (no repeating groups or arrays) and each row is unique. 2NF: meets 1NF and every non-key column is fully dependent on the entire primary key (no partial dependencies on a composite key). 3NF: meets 2NF and no non-key column depends on another non-key column (no transitive dependencies). BCNF (Boyce-Codd NF): stricter 3NF — every determinant is a candidate key.
Example: a table (order_id, product_id, product_name, quantity) violates 2NF because product_name depends only on product_id, not the full key. Fix: split into orders_items and products tables. Normalisation reduces redundancy and anomalies but increases the number of joins needed.
Denormalization intentionally introduces redundancy to improve read performance. Storing a precomputed column (e.g., order_total on an orders table) avoids a join or aggregation on every read, which matters for high-traffic queries.
Trade-offs: write complexity increases (you must keep the redundant data consistent), data can drift out of sync causing bugs, and storage grows. Denormalization is justified for analytics tables, reporting snapshots, caches, and columns that are read far more often than written. Materialized views are a controlled form of denormalization.
Denormalization intentionally introduces redundancy to improve read performance. Storing a precomputed column (e.g., order_total on an orders table) avoids a join or aggregation on every read, which matters for high-traffic queries.
Trade-offs: write complexity increases (you must keep the redundant data consistent), data can drift out of sync causing bugs, and storage grows. Denormalization is justified for analytics tables, reporting snapshots, caches, and columns that are read far more often than written. Materialized views are a controlled form of denormalization.
ACID stands for Atomicity, Consistency, Isolation, and Durability. Atomicity: the transaction either fully commits or fully rolls back — no partial writes. Consistency: the database moves from one valid state to another, respecting all constraints. Isolation: concurrent transactions behave as if they ran serially. Durability: committed data survives crashes (written to WAL / disk).
ACID guarantees are why relational databases are trusted for financial and inventory systems. NoSQL systems often relax isolation or durability for higher throughput.
ACID stands for Atomicity, Consistency, Isolation, and Durability. Atomicity: the transaction either fully commits or fully rolls back — no partial writes. Consistency: the database moves from one valid state to another, respecting all constraints. Isolation: concurrent transactions behave as if they ran serially. Durability: committed data survives crashes (written to WAL / disk).
ACID guarantees are why relational databases are trusted for financial and inventory systems. NoSQL systems often relax isolation or durability for higher throughput.
From weakest to strongest: READ UNCOMMITTED (can see dirty reads — not supported in PostgreSQL, treated as READ COMMITTED), READ COMMITTED (default in PostgreSQL — sees only committed rows, but non-repeatable reads possible), REPEATABLE READ (same row returns same value within a transaction, prevents non-repeatable reads; PostgreSQL also prevents phantom reads at this level), SERIALIZABLE (full isolation — transactions appear to execute one at a time).
sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- your queries
COMMIT;
Higher isolation levels reduce concurrency and can increase deadlock risk.
From weakest to strongest: READ UNCOMMITTED (can see dirty reads — not supported in PostgreSQL, treated as READ COMMITTED), READ COMMITTED (default in PostgreSQL — sees only committed rows, but non-repeatable reads possible), REPEATABLE READ (same row returns same value within a transaction, prevents non-repeatable reads; PostgreSQL also prevents phantom reads at this level), SERIALIZABLE (full isolation — transactions appear to execute one at a time).
```sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- your queries
COMMIT;
```
Higher isolation levels reduce concurrency and can increase deadlock risk.
COMMIT makes all changes in the current transaction permanent and visible to other sessions. ROLLBACK undoes all changes back to the start of the transaction (or to the last SAVEPOINT). SAVEPOINT creates a named checkpoint within a transaction; ROLLBACK TO SAVEPOINT undoes only the work since that checkpoint.
sql
BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 1000);
SAVEPOINT after_insert;
UPDATE accounts SET balance = -500 WHERE id = 1; -- oops
ROLLBACK TO SAVEPOINT after_insert; -- undo the bad update
UPDATE accounts SET balance = 900 WHERE id = 1;
COMMIT;
COMMIT makes all changes in the current transaction permanent and visible to other sessions. ROLLBACK undoes all changes back to the start of the transaction (or to the last SAVEPOINT). SAVEPOINT creates a named checkpoint within a transaction; ROLLBACK TO SAVEPOINT undoes only the work since that checkpoint.
```sql
BEGIN;
INSERT INTO accounts (id, balance) VALUES (1, 1000);
SAVEPOINT after_insert;
UPDATE accounts SET balance = -500 WHERE id = 1; -- oops
ROLLBACK TO SAVEPOINT after_insert; -- undo the bad update
UPDATE accounts SET balance = 900 WHERE id = 1;
COMMIT;
```
A deadlock occurs when two or more transactions each hold a lock and wait for a lock held by another, forming a cycle. Neither can proceed. The database detects the cycle (usually via a lock wait graph) and kills one transaction (the "deadlock victim"), rolling it back so the others can complete.
Prevention strategies: always acquire locks in the same order across transactions; keep transactions short; use SELECT ... FOR UPDATE SKIP LOCKED for queue-style workloads. In PostgreSQL, deadlock detection runs periodically (default 1 second); the killed transaction receives a deadlock error and must be retried by the application.
A deadlock occurs when two or more transactions each hold a lock and wait for a lock held by another, forming a cycle. Neither can proceed. The database detects the cycle (usually via a lock wait graph) and kills one transaction (the "deadlock victim"), rolling it back so the others can complete.
Prevention strategies: always acquire locks in the same order across transactions; keep transactions short; use SELECT ... FOR UPDATE SKIP LOCKED for queue-style workloads. In PostgreSQL, deadlock detection runs periodically (default 1 second); the killed transaction receives a deadlock error and must be retried by the application.
Pessimistic locking acquires a row lock at read time, preventing others from modifying the row until the transaction completes (SELECT ... FOR UPDATE). Optimistic locking reads the row without locking, attempts the update, and checks a version counter or timestamp to detect concurrent modification — rolling back if the version changed.
sql
-- Pessimistic
SELECT * FROM inventory WHERE product_id = 5 FOR UPDATE;
-- Optimistic: application re-reads and retries if version changed
UPDATE inventory
SET quantity = quantity - 1, version = version + 1
WHERE product_id = 5 AND version = 7;
Optimistic is better for low-contention workloads (avoids lock overhead); pessimistic is safer under high contention.
Pessimistic locking acquires a row lock at read time, preventing others from modifying the row until the transaction completes (SELECT ... FOR UPDATE). Optimistic locking reads the row without locking, attempts the update, and checks a version counter or timestamp to detect concurrent modification — rolling back if the version changed.
```sql
-- Pessimistic
SELECT * FROM inventory WHERE product_id = 5 FOR UPDATE;
-- Optimistic: application re-reads and retries if version changed
UPDATE inventory
SET quantity = quantity - 1, version = version + 1
WHERE product_id = 5 AND version = 7;
```
Optimistic is better for low-contention workloads (avoids lock overhead); pessimistic is safer under high contention.
INSERT ... ON CONFLICT allows you to specify what to do when the inserted row violates a unique or primary key constraint: either DO NOTHING (silently skip) or DO UPDATE SET (update the existing row — an atomic upsert).
sql
INSERT INTO page_views (page_id, views)
VALUES (42, 1)
ON CONFLICT (page_id)
DO UPDATE SET views = page_views.views + EXCLUDED.views,
updated_at = NOW();
EXCLUDED refers to the row that was proposed for insertion. This pattern is safe under concurrent writes because the entire operation is atomic at the row level.
INSERT ... ON CONFLICT allows you to specify what to do when the inserted row violates a unique or primary key constraint: either DO NOTHING (silently skip) or DO UPDATE SET (update the existing row — an atomic upsert).
```sql
INSERT INTO page_views (page_id, views)
VALUES (42, 1)
ON CONFLICT (page_id)
DO UPDATE SET views = page_views.views + EXCLUDED.views,
updated_at = NOW();
```
EXCLUDED refers to the row that was proposed for insertion. This pattern is safe under concurrent writes because the entire operation is atomic at the row level.
STRING_AGG concatenates values from a group into a single string, separated by a delimiter. An ORDER BY inside the aggregate controls the order of concatenation.
sql
-- PostgreSQL
SELECT
order_id,
STRING_AGG(product_name, ', ' ORDER BY product_name) AS products
FROM order_items
GROUP BY order_id;
-- MySQL / SQLite equivalent
SELECT order_id, GROUP_CONCAT(product_name ORDER BY product_name SEPARATOR ', ')
FROM order_items GROUP BY order_id;
Useful for creating comma-separated lists without application-side looping.
STRING_AGG concatenates values from a group into a single string, separated by a delimiter. An ORDER BY inside the aggregate controls the order of concatenation.
```sql
-- PostgreSQL
SELECT
order_id,
STRING_AGG(product_name, ', ' ORDER BY product_name) AS products
FROM order_items
GROUP BY order_id;
-- MySQL / SQLite equivalent
SELECT order_id, GROUP_CONCAT(product_name ORDER BY product_name SEPARATOR ', ')
FROM order_items GROUP BY order_id;
```
Useful for creating comma-separated lists without application-side looping.
PostgreSQL has two JSON types: json (stored as text, preserves whitespace and key order) and jsonb (stored as a binary decomposed format — faster for querying, supports GIN indexing, deduplicates keys). Prefer jsonb in almost all cases.
sql
-- Query a jsonb column
SELECT data->>'name' AS name,
(data->'address'->>'city') AS city
FROM profiles
WHERE data @> '{"role": "admin"}';
-- Index a jsonb field
CREATE INDEX idx_profiles_role ON profiles USING GIN (data);
PostgreSQL has two JSON types: json (stored as text, preserves whitespace and key order) and jsonb (stored as a binary decomposed format — faster for querying, supports GIN indexing, deduplicates keys). Prefer jsonb in almost all cases.
```sql
-- Query a jsonb column
SELECT data->>'name' AS name,
(data->'address'->>'city') AS city
FROM profiles
WHERE data @> '{"role": "admin"}';
-- Index a jsonb field
CREATE INDEX idx_profiles_role ON profiles USING GIN (data);
```
`->` returns jsonb; `->>` returns text. `@>` checks containment and uses GIN indexes.
A view is a named saved SELECT query that behaves like a virtual table. It does not store data (unless materialised); every query against it re-executes the underlying SELECT.
sql
CREATE VIEW active_users AS
SELECT id, email, created_at
FROM users
WHERE deleted_at IS NULL;
SELECT * FROM active_users WHERE id = 5;
Use views to: simplify complex queries, apply row-level security by exposing subsets, or provide stable interfaces over tables that may change schema. Do not use views as a performance tool — the query still executes in full unless it is a materialized view.
A view is a named saved SELECT query that behaves like a virtual table. It does not store data (unless materialised); every query against it re-executes the underlying SELECT.
```sql
CREATE VIEW active_users AS
SELECT id, email, created_at
FROM users
WHERE deleted_at IS NULL;
SELECT * FROM active_users WHERE id = 5;
```
Use views to: simplify complex queries, apply row-level security by exposing subsets, or provide stable interfaces over tables that may change schema. Do not use views as a performance tool — the query still executes in full unless it is a materialized view.
A materialized view stores the query result physically on disk at creation or refresh time. Reads are fast (it is a real table), but data is stale until refreshed. A regular view always reflects current data but re-runs the query on every access.
sql
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT DATE_TRUNC('month', created_at) AS month,
SUM(amount) AS revenue
FROM orders GROUP BY 1;
-- Refresh (blocks reads in PostgreSQL by default)
REFRESH MATERIALIZED VIEW monthly_sales;
-- Non-blocking refresh (requires a unique index)
REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_sales;
Materialized views are ideal for expensive aggregations that do not need to be real-time.
A materialized view stores the query result physically on disk at creation or refresh time. Reads are fast (it is a real table), but data is stale until refreshed. A regular view always reflects current data but re-runs the query on every access.
```sql
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT DATE_TRUNC('month', created_at) AS month,
SUM(amount) AS revenue
FROM orders GROUP BY 1;
-- Refresh (blocks reads in PostgreSQL by default)
REFRESH MATERIALIZED VIEW monthly_sales;
-- Non-blocking refresh (requires a unique index)
REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_sales;
```
Materialized views are ideal for expensive aggregations that do not need to be real-time.
A function returns a value and can be used inside a SELECT or WHERE clause. A stored procedure is called with CALL, does not return a value directly, and (in PostgreSQL 11+) can manage transactions (COMMIT/ROLLBACK inside the procedure). Functions in PostgreSQL run inside the caller's transaction and cannot commit independently.
sql
-- Function
CREATE FUNCTION get_user_tier(uid UUID)
RETURNS TEXT AS $$
SELECT plan FROM subscriptions WHERE user_id = uid;
$$ LANGUAGE sql STABLE;
SELECT get_user_tier('abc-123');
-- Procedure (PostgreSQL 11+)
CREATE PROCEDURE archive_old_orders(cutoff DATE)
LANGUAGE plpgsql AS $$
BEGIN
INSERT INTO order_archive SELECT * FROM orders WHERE created_at < cutoff;
DELETE FROM orders WHERE created_at < cutoff;
END;
$$;
CALL archive_old_orders('2023-01-01');
A function returns a value and can be used inside a SELECT or WHERE clause. A stored procedure is called with CALL, does not return a value directly, and (in PostgreSQL 11+) can manage transactions (COMMIT/ROLLBACK inside the procedure). Functions in PostgreSQL run inside the caller's transaction and cannot commit independently.
```sql
-- Function
CREATE FUNCTION get_user_tier(uid UUID)
RETURNS TEXT AS $$
SELECT plan FROM subscriptions WHERE user_id = uid;
$$ LANGUAGE sql STABLE;
SELECT get_user_tier('abc-123');
-- Procedure (PostgreSQL 11+)
CREATE PROCEDURE archive_old_orders(cutoff DATE)
LANGUAGE plpgsql AS $$
BEGIN
INSERT INTO order_archive SELECT * FROM orders WHERE created_at < cutoff;
DELETE FROM orders WHERE created_at < cutoff;
END;
$$;
CALL archive_old_orders('2023-01-01');
```
FIRST_VALUE returns the value of an expression from the first row of the window frame; LAST_VALUE returns it from the last row. The default window frame for LAST_VALUE is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which means it returns the current row — you usually need to extend the frame to get the true last row.
sql
SELECT
product_id,
sale_date,
amount,
FIRST_VALUE(amount) OVER w AS first_sale,
LAST_VALUE(amount) OVER w AS last_sale
FROM sales
WINDOW w AS (
PARTITION BY product_id
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
Always specify the frame explicitly when using LAST_VALUE to avoid the default frame surprise.
FIRST_VALUE returns the value of an expression from the first row of the window frame; LAST_VALUE returns it from the last row. The default window frame for LAST_VALUE is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which means it returns the current row — you usually need to extend the frame to get the true last row.
```sql
SELECT
product_id,
sale_date,
amount,
FIRST_VALUE(amount) OVER w AS first_sale,
LAST_VALUE(amount) OVER w AS last_sale
FROM sales
WINDOW w AS (
PARTITION BY product_id
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
```
Always specify the frame explicitly when using LAST_VALUE to avoid the default frame surprise.
ROW_NUMBER within a PARTITION of the duplicate group lets you label duplicates and then delete all but the first (or last).
sql
WITH ranked AS (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at ASC
) AS rn
FROM users
)
DELETE FROM users
WHERE id IN (
SELECT id FROM ranked WHERE rn > 1
);
Always verify with a SELECT first before running the DELETE. Wrapping it in a transaction lets you ROLLBACK if needed.
ROW_NUMBER within a PARTITION of the duplicate group lets you label duplicates and then delete all but the first (or last).
```sql
WITH ranked AS (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at ASC
) AS rn
FROM users
)
DELETE FROM users
WHERE id IN (
SELECT id FROM ranked WHERE rn > 1
);
```
Always verify with a SELECT first before running the DELETE. Wrapping it in a transaction lets you ROLLBACK if needed.
The FILTER clause (SQL standard, supported in PostgreSQL) adds a conditional to an aggregate function without requiring a CASE WHEN inside the aggregate. It is cleaner and can be more efficient than the CASE workaround.
sql
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE status = 'shipped') AS shipped,
COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled,
AVG(amount) FILTER (WHERE status = 'shipped') AS avg_shipped_amount
FROM orders;
Equivalent CASE workaround: COUNT(CASE WHEN status = 'shipped' THEN 1 END). FILTER is also valid inside window functions.
The FILTER clause (SQL standard, supported in PostgreSQL) adds a conditional to an aggregate function without requiring a CASE WHEN inside the aggregate. It is cleaner and can be more efficient than the CASE workaround.
```sql
SELECT
COUNT(*) AS total,
COUNT(*) FILTER (WHERE status = 'shipped') AS shipped,
COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled,
AVG(amount) FILTER (WHERE status = 'shipped') AS avg_shipped_amount
FROM orders;
```
Equivalent CASE workaround: COUNT(CASE WHEN status = 'shipped' THEN 1 END). FILTER is also valid inside window functions.
These extensions to GROUP BY compute multiple grouping combinations in a single pass. ROLLUP generates hierarchical subtotals from left to right. CUBE generates all possible subtotal combinations. GROUPING SETS explicitly lists the grouping combinations you want.
sql
-- ROLLUP: totals by (year, quarter), (year), and grand total
SELECT year, quarter, SUM(revenue)
FROM sales
GROUP BY ROLLUP (year, quarter);
-- CUBE: all combinations of region and product
SELECT region, product, SUM(sales)
FROM facts
GROUP BY CUBE (region, product);
-- GROUPING SETS: explicit
GROUP BY GROUPING SETS ((region, product), (region), ());
NULL in the output marks a rolled-up dimension; use GROUPING() to distinguish intentional NULLs from rollup NULLs.
These extensions to GROUP BY compute multiple grouping combinations in a single pass. ROLLUP generates hierarchical subtotals from left to right. CUBE generates all possible subtotal combinations. GROUPING SETS explicitly lists the grouping combinations you want.
```sql
-- ROLLUP: totals by (year, quarter), (year), and grand total
SELECT year, quarter, SUM(revenue)
FROM sales
GROUP BY ROLLUP (year, quarter);
-- CUBE: all combinations of region and product
SELECT region, product, SUM(sales)
FROM facts
GROUP BY CUBE (region, product);
-- GROUPING SETS: explicit
GROUP BY GROUPING SETS ((region, product), (region), ());
```
NULL in the output marks a rolled-up dimension; use GROUPING() to distinguish intentional NULLs from rollup NULLs.
Most SQL databases (including PostgreSQL) lack a PIVOT syntax, but you can simulate it using conditional aggregation with CASE WHEN inside aggregate functions.
sql
-- Pivot monthly revenue into columns
SELECT
product_id,
SUM(CASE WHEN month = 1 THEN revenue ELSE 0 END) AS jan,
SUM(CASE WHEN month = 2 THEN revenue ELSE 0 END) AS feb,
SUM(CASE WHEN month = 3 THEN revenue ELSE 0 END) AS mar
FROM monthly_revenue
GROUP BY product_id;
This approach requires knowing the pivot values at query-write time. For dynamic pivots (unknown values), generate the SQL dynamically in application code or a stored procedure. The FILTER clause is a cleaner modern alternative: SUM(revenue) FILTER (WHERE month = 1).
Most SQL databases (including PostgreSQL) lack a PIVOT syntax, but you can simulate it using conditional aggregation with CASE WHEN inside aggregate functions.
```sql
-- Pivot monthly revenue into columns
SELECT
product_id,
SUM(CASE WHEN month = 1 THEN revenue ELSE 0 END) AS jan,
SUM(CASE WHEN month = 2 THEN revenue ELSE 0 END) AS feb,
SUM(CASE WHEN month = 3 THEN revenue ELSE 0 END) AS mar
FROM monthly_revenue
GROUP BY product_id;
```
This approach requires knowing the pivot values at query-write time. For dynamic pivots (unknown values), generate the SQL dynamically in application code or a stored procedure. The FILTER clause is a cleaner modern alternative: SUM(revenue) FILTER (WHERE month = 1).
A clustered index determines the physical storage order of rows in the table — there can be only one per table, and it is typically the primary key. In SQL Server and MySQL InnoDB, the primary key is always the clustered index. A non-clustered index is a separate structure that stores the indexed values with pointers back to the heap rows; you can have many per table.
PostgreSQL does not have clustered indexes in the traditional sense, but the CLUSTER command rewrites a table in index order and the planner can exploit that order. After new inserts, the clustering degrades. For PostgreSQL, "covering indexes" (Index Only Scans) are usually more impactful than clustering.
In practice: choose primary keys that are monotonically increasing (e.g., BIGSERIAL) to avoid random page splits in clustered structures and keep index inserts fast.
A clustered index determines the physical storage order of rows in the table — there can be only one per table, and it is typically the primary key. In SQL Server and MySQL InnoDB, the primary key is always the clustered index. A non-clustered index is a separate structure that stores the indexed values with pointers back to the heap rows; you can have many per table.
PostgreSQL does not have clustered indexes in the traditional sense, but the CLUSTER command rewrites a table in index order and the planner can exploit that order. After new inserts, the clustering degrades. For PostgreSQL, "covering indexes" (Index Only Scans) are usually more impactful than clustering.
In practice: choose primary keys that are monotonically increasing (e.g., BIGSERIAL) to avoid random page splits in clustered structures and keep index inserts fast.
The N+1 problem occurs when an application fetches a list of N records (1 query) and then executes a separate query for each record to fetch related data (N queries). For 1,000 users, this becomes 1,001 round-trips to the database.
sql
-- N+1 pattern (bad): one query per user to get their latest order
SELECT id FROM users; -- returns N users
-- then per user:
SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT 1;
-- Fixed with a single query using LATERAL or ROW_NUMBER
SELECT u.id, u.email, o.id AS latest_order_id
FROM users u
LEFT JOIN LATERAL (
SELECT id FROM orders WHERE user_id = u.id
ORDER BY created_at DESC LIMIT 1
) o ON TRUE;
ORM solutions: eager loading (SELECT IN or JOIN), DataLoader pattern for batching. Always profile with pg_stat_statements to detect repeated similar queries.
The N+1 problem occurs when an application fetches a list of N records (1 query) and then executes a separate query for each record to fetch related data (N queries). For 1,000 users, this becomes 1,001 round-trips to the database.
```sql
-- N+1 pattern (bad): one query per user to get their latest order
SELECT id FROM users; -- returns N users
-- then per user:
SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT 1;
-- Fixed with a single query using LATERAL or ROW_NUMBER
SELECT u.id, u.email, o.id AS latest_order_id
FROM users u
LEFT JOIN LATERAL (
SELECT id FROM orders WHERE user_id = u.id
ORDER BY created_at DESC LIMIT 1
) o ON TRUE;
```
ORM solutions: eager loading (SELECT IN or JOIN), DataLoader pattern for batching. Always profile with pg_stat_statements to detect repeated similar queries.
Every index on a table must be maintained on each write. An INSERT adds an entry to every index covering written columns. An UPDATE that changes an indexed column removes the old index entry and inserts a new one. DELETE removes entries from all indexes for the deleted row.
Consequences: (1) Write throughput decreases as index count increases. (2) Bulk loads should drop non-essential indexes first, load data, then rebuild. (3) Index maintenance generates WAL, increasing replication bandwidth. (4) Index bloat accumulates from dead entries until VACUUM cleans them.
Rule of thumb: index columns that are frequently filtered, joined, or sorted, and keep the index count per table under ~5–8 for OLTP tables. Use pg_stat_user_indexes to find indexes with zero scans that can be safely dropped.
Every index on a table must be maintained on each write. An INSERT adds an entry to every index covering written columns. An UPDATE that changes an indexed column removes the old index entry and inserts a new one. DELETE removes entries from all indexes for the deleted row.
Consequences: (1) Write throughput decreases as index count increases. (2) Bulk loads should drop non-essential indexes first, load data, then rebuild. (3) Index maintenance generates WAL, increasing replication bandwidth. (4) Index bloat accumulates from dead entries until VACUUM cleans them.
Rule of thumb: index columns that are frequently filtered, joined, or sorted, and keep the index count per table under ~5–8 for OLTP tables. Use pg_stat_user_indexes to find indexes with zero scans that can be safely dropped.
The window frame clause inside OVER(...) refines which rows participate in the window function computation for each current row. ROWS counts physical rows; RANGE groups rows with identical ORDER BY values together (logical peers).
sql
SELECT
sale_date, amount,
-- ROWS: exactly the 3 rows before current row
SUM(amount) OVER (ORDER BY sale_date
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS rolling_3row,
-- RANGE: all rows with same or earlier sale_date value
SUM(amount) OVER (ORDER BY sale_date
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM daily_sales;
Rrows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the safe default for running totals. RANGE is the default when you write just ORDER BY with no frame clause — which can cause surprising results with duplicate ORDER BY values.
The window frame clause inside OVER(...) refines which rows participate in the window function computation for each current row. ROWS counts physical rows; RANGE groups rows with identical ORDER BY values together (logical peers).
```sql
SELECT
sale_date, amount,
-- ROWS: exactly the 3 rows before current row
SUM(amount) OVER (ORDER BY sale_date
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS rolling_3row,
-- RANGE: all rows with same or earlier sale_date value
SUM(amount) OVER (ORDER BY sale_date
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM daily_sales;
```
Rrows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the safe default for running totals. RANGE is the default when you write just ORDER BY with no frame clause — which can cause surprising results with duplicate ORDER BY values.
A trigger is a stored function that executes automatically in response to INSERT, UPDATE, or DELETE events on a table. Triggers can fire BEFORE or AFTER the event, and FOR EACH ROW or FOR EACH STATEMENT.
sql
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at := NOW();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_set_updated_at
BEFORE UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION set_updated_at();
Common uses: auto-updating updated_at timestamps, audit logging to a history table, enforcing complex constraints, replicating changes to a denormalised summary table. Pitfalls: triggers are invisible — they make behaviour surprising. Cascading triggers (trigger A fires trigger B) can cause infinite loops. Heavy trigger logic degrades write throughput. Prefer application-layer logic or generated columns for simple cases.
A trigger is a stored function that executes automatically in response to INSERT, UPDATE, or DELETE events on a table. Triggers can fire BEFORE or AFTER the event, and FOR EACH ROW or FOR EACH STATEMENT.
```sql
CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at := NOW();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_set_updated_at
BEFORE UPDATE ON products
FOR EACH ROW EXECUTE FUNCTION set_updated_at();
```
Common uses: auto-updating updated_at timestamps, audit logging to a history table, enforcing complex constraints, replicating changes to a denormalised summary table. Pitfalls: triggers are invisible — they make behaviour surprising. Cascading triggers (trigger A fires trigger B) can cause infinite loops. Heavy trigger logic degrades write throughput. Prefer application-layer logic or generated columns for simple cases.
The planner receives a parsed, rewritten query tree and generates candidate execution plans. It estimates the cost (in abstract I/O units) of each plan using table statistics (pg_statistic), cardinality estimates, and a cost model with tunable parameters (seq_page_cost, random_page_cost, cpu_tuple_cost). For small tables it exhausts all join orders; for larger ones it switches to GEQO (genetic query optimiser) to avoid exponential search.
The planner chooses joins strategies (nested loop, hash join, merge join), access methods (seq scan, index scan, bitmap scan), and whether to materialise subqueries. Stale statistics from pg_statistic (caused by heavy writes without ANALYZE) lead to bad cardinality estimates and wrong plan choices. Running ANALYZE or enabling autovacuum keeps statistics fresh.
Key tuning levers: increase statistics targets with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS, set enable_seqscan=off for debugging, and use pg_hint_plan for explicit hints when the planner consistently chooses poorly.
The planner receives a parsed, rewritten query tree and generates candidate execution plans. It estimates the cost (in abstract I/O units) of each plan using table statistics (pg_statistic), cardinality estimates, and a cost model with tunable parameters (seq_page_cost, random_page_cost, cpu_tuple_cost). For small tables it exhausts all join orders; for larger ones it switches to GEQO (genetic query optimiser) to avoid exponential search.
The planner chooses joins strategies (nested loop, hash join, merge join), access methods (seq scan, index scan, bitmap scan), and whether to materialise subqueries. Stale statistics from pg_statistic (caused by heavy writes without ANALYZE) lead to bad cardinality estimates and wrong plan choices. Running ANALYZE or enabling autovacuum keeps statistics fresh.
Key tuning levers: increase statistics targets with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS, set enable_seqscan=off for debugging, and use pg_hint_plan for explicit hints when the planner consistently chooses poorly.
PostgreSQL can parallelise sequential scans, aggregations, joins, and index scans by spawning worker processes that each handle a slice of the data. The leader process gathers partial results from workers via a Gather or Gather Merge node.
sql
SET max_parallel_workers_per_gather = 4;
EXPLAIN ANALYZE SELECT SUM(amount) FROM orders;
-- Look for: Parallel Seq Scan, Partial Aggregate, Gather
Key parameters: max_parallel_workers, max_parallel_workers_per_gather, min_parallel_table_scan_size. Parallelism is disabled for functions marked VOLATILE or for tables with row-level security in some configurations. Not all query shapes parallelise — correlated subqueries and CTEs (materialised) typically do not. Parallel query helps most for large analytical scans on tables that do not fit in shared_buffers.
PostgreSQL can parallelise sequential scans, aggregations, joins, and index scans by spawning worker processes that each handle a slice of the data. The leader process gathers partial results from workers via a Gather or Gather Merge node.
```sql
SET max_parallel_workers_per_gather = 4;
EXPLAIN ANALYZE SELECT SUM(amount) FROM orders;
-- Look for: Parallel Seq Scan, Partial Aggregate, Gather
```
Key parameters: max_parallel_workers, max_parallel_workers_per_gather, min_parallel_table_scan_size. Parallelism is disabled for functions marked VOLATILE or for tables with row-level security in some configurations. Not all query shapes parallelise — correlated subqueries and CTEs (materialised) typically do not. Parallel query helps most for large analytical scans on tables that do not fit in shared_buffers.
Partitioning divides a large logical table into smaller physical child tables (partitions) while queries target the parent table transparently. Three built-in strategies: Range — rows go into the partition whose range covers the partition key value (ideal for time-series data, e.g., one partition per month). List — rows are routed by explicit discrete values (e.g., country code). Hash — rows are distributed by a hash of the key, giving even distribution across N partitions.
sql
CREATE TABLE events (
id BIGSERIAL,
created_at TIMESTAMPTZ NOT NULL,
payload JSONB
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2026_q1
PARTITION OF events
FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
Partitioning benefits: partition pruning (the planner skips irrelevant partitions), faster bulk deletes (DROP the old partition), and better cache utilisation. Drawbacks: foreign keys into partitioned tables have restrictions, and global unique indexes are not supported (only per-partition).
Partitioning divides a large logical table into smaller physical child tables (partitions) while queries target the parent table transparently. Three built-in strategies: Range — rows go into the partition whose range covers the partition key value (ideal for time-series data, e.g., one partition per month). List — rows are routed by explicit discrete values (e.g., country code). Hash — rows are distributed by a hash of the key, giving even distribution across N partitions.
```sql
CREATE TABLE events (
id BIGSERIAL,
created_at TIMESTAMPTZ NOT NULL,
payload JSONB
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2026_q1
PARTITION OF events
FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
```
Partitioning benefits: partition pruning (the planner skips irrelevant partitions), faster bulk deletes (DROP the old partition), and better cache utilisation. Drawbacks: foreign keys into partitioned tables have restrictions, and global unique indexes are not supported (only per-partition).
Partition pruning is the query planner's ability to eliminate child partitions that cannot possibly contain rows matching the WHERE clause, avoiding scanning them entirely. For a range-partitioned table with 36 monthly partitions, a query with WHERE created_at BETWEEN '2026-01-01' AND '2026-01-31' scans only the January 2026 partition.
sql
EXPLAIN SELECT * FROM events
WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01';
-- Plan shows: Seq Scan on events_2026_01 (1 partition, not 36)
Pruning works when the partition key appears in the WHERE clause with a constant or stable expression. Dynamic pruning (runtime pruning for parameterised queries) was added in PostgreSQL 11. Without proper partition key predicates, the planner scans all partitions — potentially worse than an unpartitioned table due to partition overhead.
Partition pruning is the query planner's ability to eliminate child partitions that cannot possibly contain rows matching the WHERE clause, avoiding scanning them entirely. For a range-partitioned table with 36 monthly partitions, a query with WHERE created_at BETWEEN '2026-01-01' AND '2026-01-31' scans only the January 2026 partition.
```sql
EXPLAIN SELECT * FROM events
WHERE created_at >= '2026-01-01' AND created_at < '2026-02-01';
-- Plan shows: Seq Scan on events_2026_01 (1 partition, not 36)
```
Pruning works when the partition key appears in the WHERE clause with a constant or stable expression. Dynamic pruning (runtime pruning for parameterised queries) was added in PostgreSQL 11. Without proper partition key predicates, the planner scans all partitions — potentially worse than an unpartitioned table due to partition overhead.
pg_stat_statements is a PostgreSQL extension that tracks execution statistics for every unique query text: total time, mean time, call count, rows returned, buffer hits/misses, and I/O time. It is the first tool to reach for when identifying slow queries.
sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 slowest queries by total time
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Key columns: calls (frequency), mean_exec_time (average latency), stddev_exec_time (variance — high variance = occasional slow outliers), blk_read_time (I/O-bound). Reset with pg_stat_statements_reset(). Enable track_io_timing = on in postgresql.conf to populate I/O timing columns.
pg_stat_statements is a PostgreSQL extension that tracks execution statistics for every unique query text: total time, mean time, call count, rows returned, buffer hits/misses, and I/O time. It is the first tool to reach for when identifying slow queries.
```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 slowest queries by total time
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```
Key columns: calls (frequency), mean_exec_time (average latency), stddev_exec_time (variance — high variance = occasional slow outliers), blk_read_time (I/O-bound). Reset with pg_stat_statements_reset(). Enable track_io_timing = on in postgresql.conf to populate I/O timing columns.
PostgreSQL's MVCC model never overwrites rows in place — an UPDATE creates a new row version (tuple) and marks the old one as dead. Dead tuples accumulate ("bloat") until VACUUM reclaims their space for reuse. Autovacuum is a background daemon that triggers VACUUM and ANALYZE automatically when dead tuple counts cross a threshold (autovacuum_vacuum_scale_factor × table_rows + autovacuum_vacuum_threshold).
Signs of bloat: table file size grows disproportionate to live row count; index scans slow down. Check with pgstattuple extension or query pg_stat_user_tables for n_dead_tup. Fix: run VACUUM VERBOSE tablename, or VACUUM FULL (rewrites the table — takes an exclusive lock). High-churn tables (e.g., audit logs, queues) need more aggressive autovacuum settings: lower scale factors, higher cost limits.
PostgreSQL's MVCC model never overwrites rows in place — an UPDATE creates a new row version (tuple) and marks the old one as dead. Dead tuples accumulate ("bloat") until VACUUM reclaims their space for reuse. Autovacuum is a background daemon that triggers VACUUM and ANALYZE automatically when dead tuple counts cross a threshold (autovacuum_vacuum_scale_factor × table_rows + autovacuum_vacuum_threshold).
Signs of bloat: table file size grows disproportionate to live row count; index scans slow down. Check with pgstattuple extension or query pg_stat_user_tables for n_dead_tup. Fix: run VACUUM VERBOSE tablename, or VACUUM FULL (rewrites the table — takes an exclusive lock). High-churn tables (e.g., audit logs, queues) need more aggressive autovacuum settings: lower scale factors, higher cost limits.
MVCC allows readers and writers to operate concurrently without blocking each other. Instead of locking rows for reads, PostgreSQL gives each transaction a consistent snapshot of the database at the moment the transaction starts. Writes create new row versions (tuples) tagged with the creating transaction's ID (xmin) and the deleting transaction's ID (xmax). A reader sees a row if xmin committed before its snapshot and xmax has not yet committed.
This means: SELECT never blocks INSERT/UPDATE/DELETE and vice versa. The cost is dead tuple accumulation requiring VACUUM. MVCC also underpins REPEATABLE READ and SERIALIZABLE isolation — PostgreSQL detects read-write conflicts via predicate locks rather than actual row locks.
Every row has hidden system columns: xmin (insert XID), xmax (delete/update XID), cmin, cmax (command counters), ctid (physical location). Inspect them with SELECT xmin, xmax, * FROM my_table.
MVCC allows readers and writers to operate concurrently without blocking each other. Instead of locking rows for reads, PostgreSQL gives each transaction a consistent snapshot of the database at the moment the transaction starts. Writes create new row versions (tuples) tagged with the creating transaction's ID (xmin) and the deleting transaction's ID (xmax). A reader sees a row if xmin committed before its snapshot and xmax has not yet committed.
This means: SELECT never blocks INSERT/UPDATE/DELETE and vice versa. The cost is dead tuple accumulation requiring VACUUM. MVCC also underpins REPEATABLE READ and SERIALIZABLE isolation — PostgreSQL detects read-write conflicts via predicate locks rather than actual row locks.
Every row has hidden system columns: xmin (insert XID), xmax (delete/update XID), cmin, cmax (command counters), ctid (physical location). Inspect them with SELECT xmin, xmax, * FROM my_table.
The WAL is a sequential append-only log of all database changes written to disk before the corresponding data page is modified. This ensures crash recovery: on restart, PostgreSQL replays WAL records to bring data files to a consistent state (REDO). WAL enables both streaming and logical replication.
Streaming replication: standby servers connect to the primary and receive WAL segments in real time, applying them to stay in sync (binary level, same PostgreSQL version required). Logical replication (PostgreSQL 10+): decodes WAL changes into logical operations (INSERT/UPDATE/DELETE) allowing cross-version and cross-table-subset replication. WAL level setting: minimal < replica < logical (increasing detail, increasing overhead). pg_wal directory holds WAL segments (each 16 MB by default). WAL archiving (archive_command) enables point-in-time recovery (PITR).
The WAL is a sequential append-only log of all database changes written to disk before the corresponding data page is modified. This ensures crash recovery: on restart, PostgreSQL replays WAL records to bring data files to a consistent state (REDO). WAL enables both streaming and logical replication.
Streaming replication: standby servers connect to the primary and receive WAL segments in real time, applying them to stay in sync (binary level, same PostgreSQL version required). Logical replication (PostgreSQL 10+): decodes WAL changes into logical operations (INSERT/UPDATE/DELETE) allowing cross-version and cross-table-subset replication. WAL level setting: minimal < replica < logical (increasing detail, increasing overhead). pg_wal directory holds WAL segments (each 16 MB by default). WAL archiving (archive_command) enables point-in-time recovery (PITR).
Streaming replication is physical (byte-for-byte): WAL records are shipped to standbys and applied at the storage level. The standby is an exact replica — same schema, same PostgreSQL version (roughly), and the standby is read-only. Failover promotes a standby to primary.
Logical replication decodes WAL into row-level change events (INSERT/UPDATE/DELETE per publication) and sends them to subscribers. It allows: different PostgreSQL versions, replication of a subset of tables, bi-directional replication, and writing to the subscriber. It requires a primary key or replica identity on replicated tables.
Use streaming replication for HA (high-availability) standby and read scaling. Use logical replication for zero-downtime major version upgrades, cross-region read replicas with different schemas, or feeding change data to external systems (Debezium, Kafka).
Streaming replication is physical (byte-for-byte): WAL records are shipped to standbys and applied at the storage level. The standby is an exact replica — same schema, same PostgreSQL version (roughly), and the standby is read-only. Failover promotes a standby to primary.
Logical replication decodes WAL into row-level change events (INSERT/UPDATE/DELETE per publication) and sends them to subscribers. It allows: different PostgreSQL versions, replication of a subset of tables, bi-directional replication, and writing to the subscriber. It requires a primary key or replica identity on replicated tables.
Use streaming replication for HA (high-availability) standby and read scaling. Use logical replication for zero-downtime major version upgrades, cross-region read replicas with different schemas, or feeding change data to external systems (Debezium, Kafka).
Sharding horizontally splits a dataset across multiple database instances (shards) to scale beyond a single server's capacity. Common strategies: Hash sharding — route rows by hash(shard_key) mod N; gives even distribution but makes range queries span all shards. Range sharding — each shard owns a contiguous range of the key (e.g., user IDs 1–1M on shard 1); range queries stay on one shard but hot new-data shards cause imbalance. Directory sharding — a lookup table maps each key to its shard; flexible but the directory is a bottleneck. Geographic sharding — data is placed in the region closest to users.
Challenges: cross-shard JOINs are expensive (denormalise or do application-side joins), distributed transactions require 2-phase commit, re-sharding is painful. Choose the shard key carefully: high cardinality, uniform access, and collocated with common query patterns. Citus (PostgreSQL extension) and Vitess (MySQL) handle sharding transparently.
Sharding horizontally splits a dataset across multiple database instances (shards) to scale beyond a single server's capacity. Common strategies: Hash sharding — route rows by hash(shard_key) mod N; gives even distribution but makes range queries span all shards. Range sharding — each shard owns a contiguous range of the key (e.g., user IDs 1–1M on shard 1); range queries stay on one shard but hot new-data shards cause imbalance. Directory sharding — a lookup table maps each key to its shard; flexible but the directory is a bottleneck. Geographic sharding — data is placed in the region closest to users.
Challenges: cross-shard JOINs are expensive (denormalise or do application-side joins), distributed transactions require 2-phase commit, re-sharding is painful. Choose the shard key carefully: high cardinality, uniform access, and collocated with common query patterns. Citus (PostgreSQL extension) and Vitess (MySQL) handle sharding transparently.
Read replicas are standby database instances that receive a stream of changes from the primary and serve read traffic. They reduce read load on the primary and can be placed geographically closer to users.
Eventual consistency issue (replication lag): changes committed on the primary may not yet be visible on a replica. A user who just saved a profile may query a replica and see stale data — the "read-your-own-writes" problem. Mitigations: route writes and the immediately subsequent read to the primary; use synchronous replication for critical data (performance cost); track replication lag with pg_stat_replication and warn/wait when lag exceeds a threshold.
In PostgreSQL you can set synchronous_standby_names to require at least one standby to confirm receipt before COMMIT returns, ensuring zero lag for those standbys at the cost of added latency.
Read replicas are standby database instances that receive a stream of changes from the primary and serve read traffic. They reduce read load on the primary and can be placed geographically closer to users.
Eventual consistency issue (replication lag): changes committed on the primary may not yet be visible on a replica. A user who just saved a profile may query a replica and see stale data — the "read-your-own-writes" problem. Mitigations: route writes and the immediately subsequent read to the primary; use synchronous replication for critical data (performance cost); track replication lag with pg_stat_replication and warn/wait when lag exceeds a threshold.
In PostgreSQL you can set synchronous_standby_names to require at least one standby to confirm receipt before COMMIT returns, ensuring zero lag for those standbys at the cost of added latency.
Key patterns: (1) Range-partition by time (one partition per day/month) so old data can be dropped cheaply and queries on recent data only scan recent partitions. (2) Use TIMESTAMPTZ for all timestamps. (3) Create composite indexes on (entity_id, recorded_at) for per-entity time queries. (4) Use materialised views or summary tables for common aggregate windows (hourly/daily rollups). (5) TimescaleDB (a PostgreSQL extension) adds automatic time-based partitioning, continuous aggregates, and compression.
sql
-- Efficient hourly rollup query
SELECT
DATE_TRUNC('hour', recorded_at) AS hour,
AVG(cpu_pct) AS avg_cpu
FROM metrics
WHERE device_id = 'srv-01'
AND recorded_at >= NOW() - INTERVAL '24 hours'
GROUP BY 1
ORDER BY 1;
Avoid storing one row per metric per timestamp for thousands of devices — use JSONB or array columns for batched metrics to reduce row count.
Key patterns: (1) Range-partition by time (one partition per day/month) so old data can be dropped cheaply and queries on recent data only scan recent partitions. (2) Use TIMESTAMPTZ for all timestamps. (3) Create composite indexes on (entity_id, recorded_at) for per-entity time queries. (4) Use materialised views or summary tables for common aggregate windows (hourly/daily rollups). (5) TimescaleDB (a PostgreSQL extension) adds automatic time-based partitioning, continuous aggregates, and compression.
```sql
-- Efficient hourly rollup query
SELECT
DATE_TRUNC('hour', recorded_at) AS hour,
AVG(cpu_pct) AS avg_cpu
FROM metrics
WHERE device_id = 'srv-01'
AND recorded_at >= NOW() - INTERVAL '24 hours'
GROUP BY 1
ORDER BY 1;
```
Avoid storing one row per metric per timestamp for thousands of devices — use JSONB or array columns for batched metrics to reduce row count.
EAV stores flexible attributes as rows in a three-column table (entity_id, attribute_name, value) instead of as columns. It enables schema-free extensibility without ALTER TABLE but comes with severe costs: no data types (everything is text or requires multiple value columns), no NOT NULL constraints per attribute, joins to reconstruct an entity are complex, query performance degrades quickly, and it is hard to enforce referential integrity.
sql
-- EAV (anti-pattern)
SELECT entity_id,
MAX(CASE WHEN attr='price' THEN value END)::NUMERIC AS price,
MAX(CASE WHEN attr='name' THEN value END) AS name
FROM attributes
WHERE entity_id = 5
GROUP BY entity_id;
Better alternatives: PostgreSQL JSONB column for sparse/variable attributes (typed, indexed, queryable); hstore; partial columns with NULLs for known optional fields; or a proper polymorphic design with a base table and type-specific extension tables.
EAV stores flexible attributes as rows in a three-column table (entity_id, attribute_name, value) instead of as columns. It enables schema-free extensibility without ALTER TABLE but comes with severe costs: no data types (everything is text or requires multiple value columns), no NOT NULL constraints per attribute, joins to reconstruct an entity are complex, query performance degrades quickly, and it is hard to enforce referential integrity.
```sql
-- EAV (anti-pattern)
SELECT entity_id,
MAX(CASE WHEN attr='price' THEN value END)::NUMERIC AS price,
MAX(CASE WHEN attr='name' THEN value END) AS name
FROM attributes
WHERE entity_id = 5
GROUP BY entity_id;
```
Better alternatives: PostgreSQL JSONB column for sparse/variable attributes (typed, indexed, queryable); hstore; partial columns with NULLs for known optional fields; or a proper polymorphic design with a base table and type-specific extension tables.
SCDs handle how historical changes to dimension data (e.g., customer address, product price) are tracked in a data warehouse. Type 1: Overwrite — just update the current value, no history kept. Simple but loses history. Type 2: Add a new row — each change creates a new record with effective_from/effective_to dates and a current_flag. Full history is preserved; the natural key + effective_from uniquely identifies a version.
sql
-- SCD Type 2 customer dimension
UPDATE dim_customer SET effective_to = NOW(), current_flag = FALSE
WHERE customer_id = 42 AND current_flag = TRUE;
INSERT INTO dim_customer (customer_id, address, effective_from, effective_to, current_flag)
VALUES (42, '456 New St', NOW(), '9999-12-31', TRUE);
Type 3: Add a column — store both old_value and new_value columns. Only the most recent change is tracked, not the full history. Type 2 is the most common in data warehouses.
SCDs handle how historical changes to dimension data (e.g., customer address, product price) are tracked in a data warehouse. Type 1: Overwrite — just update the current value, no history kept. Simple but loses history. Type 2: Add a new row — each change creates a new record with effective_from/effective_to dates and a current_flag. Full history is preserved; the natural key + effective_from uniquely identifies a version.
```sql
-- SCD Type 2 customer dimension
UPDATE dim_customer SET effective_to = NOW(), current_flag = FALSE
WHERE customer_id = 42 AND current_flag = TRUE;
INSERT INTO dim_customer (customer_id, address, effective_from, effective_to, current_flag)
VALUES (42, '456 New St', NOW(), '9999-12-31', TRUE);
```
Type 3: Add a column — store both old_value and new_value columns. Only the most recent change is tracked, not the full history. Type 2 is the most common in data warehouses.
A star schema has a central fact table surrounded by denormalised dimension tables. Each dimension is a single flat table, making queries simple (few joins). Storage is larger because dimension data is repeated. A snowflake schema normalises dimension tables into multiple levels (e.g., product → category → department), reducing storage and update anomalies but requiring more joins and making queries more complex.
Star schema is preferred in OLAP and BI tools (Tableau, Looker) because fewer joins mean faster queries and simpler SQL generation. Snowflake is better when dimensions change frequently or have large, reused hierarchies. Most modern columnar warehouses (Redshift, BigQuery, Snowflake the product) handle star schemas efficiently and denormalisation overhead is minimal.
A star schema has a central fact table surrounded by denormalised dimension tables. Each dimension is a single flat table, making queries simple (few joins). Storage is larger because dimension data is repeated. A snowflake schema normalises dimension tables into multiple levels (e.g., product → category → department), reducing storage and update anomalies but requiring more joins and making queries more complex.
Star schema is preferred in OLAP and BI tools (Tableau, Looker) because fewer joins mean faster queries and simpler SQL generation. Snowflake is better when dimensions change frequently or have large, reused hierarchies. Most modern columnar warehouses (Redshift, BigQuery, Snowflake the product) handle star schemas efficiently and denormalisation overhead is minimal.
Row-oriented storage (PostgreSQL heap, InnoDB) stores all column values for a row contiguously — optimal for OLTP (read/write a full row). Columnar storage (Redshift, BigQuery, Parquet files, DuckDB) stores all values for a single column contiguously — optimal for analytics that aggregate a few columns across millions of rows.
Benefits: (1) Only read the columns referenced in the query (projection pushdown eliminates I/O for unneeded columns). (2) Homogeneous data in a column compresses far better (run-length encoding, dictionary encoding, bit-packing). (3) SIMD vectorised operations can process batches of same-type values. A query doing SUM(revenue) on 1 billion rows may read only the revenue column — 8 bytes × 1B = 8 GB instead of reading full rows of 200 bytes each. For Parquet-based lakes (S3 + Athena), column pruning and row group filtering (min/max statistics) further reduce data scanned.
Row-oriented storage (PostgreSQL heap, InnoDB) stores all column values for a row contiguously — optimal for OLTP (read/write a full row). Columnar storage (Redshift, BigQuery, Parquet files, DuckDB) stores all values for a single column contiguously — optimal for analytics that aggregate a few columns across millions of rows.
Benefits: (1) Only read the columns referenced in the query (projection pushdown eliminates I/O for unneeded columns). (2) Homogeneous data in a column compresses far better (run-length encoding, dictionary encoding, bit-packing). (3) SIMD vectorised operations can process batches of same-type values. A query doing SUM(revenue) on 1 billion rows may read only the revenue column — 8 bytes × 1B = 8 GB instead of reading full rows of 200 bytes each. For Parquet-based lakes (S3 + Athena), column pruning and row group filtering (min/max statistics) further reduce data scanned.
PostgreSQL converts text to a tsvector (sorted list of normalised lexemes with position info) using to_tsvector(). Queries are expressed as tsquery with Boolean operators. The @@ operator checks if a tsquery matches a tsvector. A GIN index on the tsvector column makes search fast.
sql
-- Add a generated tsvector column (PostgreSQL 12+)
ALTER TABLE articles ADD COLUMN fts tsvector
GENERATED ALWAYS AS
(to_tsvector('english', coalesce(title,'')||' '||coalesce(body,''))) STORED;
CREATE INDEX idx_articles_fts ON articles USING GIN (fts);
-- Search
SELECT title, ts_rank(fts, q) AS rank
FROM articles, to_tsquery('english', 'database & performance') q
WHERE fts @@ q
ORDER BY rank DESC
LIMIT 10;
ts_rank and ts_rank_cd compute relevance scores. For production use, consider a dedicated search engine (Elasticsearch, Typesense) for richer features like fuzzy matching and facets.
PostgreSQL converts text to a tsvector (sorted list of normalised lexemes with position info) using to_tsvector(). Queries are expressed as tsquery with Boolean operators. The @@ operator checks if a tsquery matches a tsvector. A GIN index on the tsvector column makes search fast.
```sql
-- Add a generated tsvector column (PostgreSQL 12+)
ALTER TABLE articles ADD COLUMN fts tsvector
GENERATED ALWAYS AS
(to_tsvector('english', coalesce(title,'')||' '||coalesce(body,''))) STORED;
CREATE INDEX idx_articles_fts ON articles USING GIN (fts);
-- Search
SELECT title, ts_rank(fts, q) AS rank
FROM articles, to_tsquery('english', 'database & performance') q
WHERE fts @@ q
ORDER BY rank DESC
LIMIT 10;
```
ts_rank and ts_rank_cd compute relevance scores. For production use, consider a dedicated search engine (Elasticsearch, Typesense) for richer features like fuzzy matching and facets.
PostGIS is a PostgreSQL extension that adds geometry and geography data types plus hundreds of spatial functions. GEOMETRY uses a Cartesian coordinate system; GEOGRAPHY uses spherical coordinates (longitude/latitude) and computes real-world distances in meters.
sql
-- Find restaurants within 1 km of a point
SELECT name, ST_Distance(location::geography,
ST_MakePoint(-73.9857, 40.7484)::geography) AS dist_m
FROM restaurants
WHERE ST_DWithin(
location::geography,
ST_MakePoint(-73.9857, 40.7484)::geography,
1000 -- 1 km radius
)
ORDER BY dist_m;
-- Speed it up with a GIST index
CREATE INDEX idx_restaurants_location ON restaurants USING GIST (location);
Common functions: ST_Distance, ST_Contains, ST_Intersects, ST_Within, ST_Buffer, ST_Area, ST_AsGeoJSON.
PostGIS is a PostgreSQL extension that adds geometry and geography data types plus hundreds of spatial functions. GEOMETRY uses a Cartesian coordinate system; GEOGRAPHY uses spherical coordinates (longitude/latitude) and computes real-world distances in meters.
```sql
-- Find restaurants within 1 km of a point
SELECT name, ST_Distance(location::geography,
ST_MakePoint(-73.9857, 40.7484)::geography) AS dist_m
FROM restaurants
WHERE ST_DWithin(
location::geography,
ST_MakePoint(-73.9857, 40.7484)::geography,
1000 -- 1 km radius
)
ORDER BY dist_m;
-- Speed it up with a GIST index
CREATE INDEX idx_restaurants_location ON restaurants USING GIST (location);
```
Common functions: ST_Distance, ST_Contains, ST_Intersects, ST_Within, ST_Buffer, ST_Area, ST_AsGeoJSON.
RLS lets you attach WHERE-clause-like policies to a table so that different database users (or application roles) see and can modify only the rows they are permitted to access. The policy is enforced at the storage layer regardless of which query is issued — even SELECT * FROM users only returns the allowed rows.
sql
ALTER TABLE notes ENABLE ROW LEVEL SECURITY;
-- Users can only see their own notes
CREATE POLICY notes_isolation ON notes
USING (owner_id = current_setting('app.current_user_id')::INT);
-- Application sets the context variable at session start
SET app.current_user_id = 42;
SELECT * FROM notes; -- returns only notes where owner_id = 42
RLS is used by Supabase to implement per-user data isolation without application-layer filtering. BYPASSRLS privilege allows superusers and service roles to skip policies. Always test policies with a non-superuser role.
RLS lets you attach WHERE-clause-like policies to a table so that different database users (or application roles) see and can modify only the rows they are permitted to access. The policy is enforced at the storage layer regardless of which query is issued — even SELECT * FROM users only returns the allowed rows.
```sql
ALTER TABLE notes ENABLE ROW LEVEL SECURITY;
-- Users can only see their own notes
CREATE POLICY notes_isolation ON notes
USING (owner_id = current_setting('app.current_user_id')::INT);
-- Application sets the context variable at session start
SET app.current_user_id = 42;
SELECT * FROM notes; -- returns only notes where owner_id = 42
```
RLS is used by Supabase to implement per-user data isolation without application-layer filtering. BYPASSRLS privilege allows superusers and service roles to skip policies. Always test policies with a non-superuser role.
SQL injection occurs when untrusted user input is concatenated into a SQL string, allowing an attacker to alter query logic. Classic example: input' OR '1'='1 turns a login check into an always-true condition.
sql
-- VULNERABLE (never do this)
query = "SELECT * FROM users WHERE email = '" + user_input + "'";
-- SAFE: parameterized query (PostgreSQL node-postgres)
const result = await client.query(
'SELECT * FROM users WHERE email = $1',
[userInput]
);
With a parameterised query the driver sends the SQL template and the parameter value separately — the database treats the value purely as data, never as SQL syntax. Additional defences: least-privilege DB roles (no DROP/ALTER), input validation, WAF rules, and pg_audit logging. ORMs use parameterised queries by default; raw string interpolation is the primary injection vector.
SQL injection occurs when untrusted user input is concatenated into a SQL string, allowing an attacker to alter query logic. Classic example: input' OR '1'='1 turns a login check into an always-true condition.
```sql
-- VULNERABLE (never do this)
query = "SELECT * FROM users WHERE email = '" + user_input + "'";
-- SAFE: parameterized query (PostgreSQL node-postgres)
const result = await client.query(
'SELECT * FROM users WHERE email = $1',
[userInput]
);
```
With a parameterised query the driver sends the SQL template and the parameter value separately — the database treats the value purely as data, never as SQL syntax. Additional defences: least-privilege DB roles (no DROP/ALTER), input validation, WAF rules, and pg_audit logging. ORMs use parameterised queries by default; raw string interpolation is the primary injection vector.
Opening a PostgreSQL connection is expensive (~10 ms, ~5 MB RAM per backend process). A web server handling 1,000 concurrent requests would open 1,000 connections, exhausting memory. A connection pooler sits between the application and PostgreSQL, maintaining a small pool of real database connections and multiplexing many application connections onto them.
PgBouncer is the most popular PostgreSQL pooler. It supports three modes: Session pooling (1 backend per app session — minimal benefit), Transaction pooling (backend assigned only during a transaction — best density, most apps use this), Statement pooling (backend released after each statement — incompatible with multi-statement transactions).
Typical setup: applications connect to PgBouncer on port 6432; PgBouncer maintains 20–100 real connections to PostgreSQL. Max connections in postgresql.conf is set conservatively (e.g., 100) to protect memory. Supabase Vitest uses Supavisor (a Supabase-built pooler) in front of every project.
Opening a PostgreSQL connection is expensive (~10 ms, ~5 MB RAM per backend process). A web server handling 1,000 concurrent requests would open 1,000 connections, exhausting memory. A connection pooler sits between the application and PostgreSQL, maintaining a small pool of real database connections and multiplexing many application connections onto them.
PgBouncer is the most popular PostgreSQL pooler. It supports three modes: Session pooling (1 backend per app session — minimal benefit), Transaction pooling (backend assigned only during a transaction — best density, most apps use this), Statement pooling (backend released after each statement — incompatible with multi-statement transactions).
Typical setup: applications connect to PgBouncer on port 6432; PgBouncer maintains 20–100 real connections to PostgreSQL. Max connections in postgresql.conf is set conservatively (e.g., 100) to protect memory. Supabase Vitest uses Supavisor (a Supabase-built pooler) in front of every project.
VACUUM reclaims storage occupied by dead tuples (MVCC leftovers) and updates the visibility map to allow index-only scans. It does not shrink the table file; VACUUM FULL rewrites the table to return space to the OS but takes an exclusive lock. ANALYZE collects column statistics (histograms, most-common values) used by the query planner to estimate row counts.
Autovacuum handles both automatically, but run them manually when: (1) after a large bulk DELETE/UPDATE to reclaim space immediately; (2) after a bulk COPY/INSERT before a performance-critical query (so planner has fresh stats); (3) after enabling a new index to ensure the planner knows about updated statistics.
sql
VACUUM ANALYZE orders; -- reclaim + refresh stats
VACUUM VERBOSE orders; -- shows what was done
VACUUM FULL orders; -- full rewrite (exclusive lock!)
VACUUM reclaims storage occupied by dead tuples (MVCC leftovers) and updates the visibility map to allow index-only scans. It does not shrink the table file; VACUUM FULL rewrites the table to return space to the OS but takes an exclusive lock. ANALYZE collects column statistics (histograms, most-common values) used by the query planner to estimate row counts.
Autovacuum handles both automatically, but run them manually when: (1) after a large bulk DELETE/UPDATE to reclaim space immediately; (2) after a bulk COPY/INSERT before a performance-critical query (so planner has fresh stats); (3) after enabling a new index to ensure the planner knows about updated statistics.
```sql
VACUUM ANALYZE orders; -- reclaim + refresh stats
VACUUM VERBOSE orders; -- shows what was done
VACUUM FULL orders; -- full rewrite (exclusive lock!)
```
A sequence is a persistent counter object that generates unique integers. SERIAL is a legacy shorthand that creates a sequence and sets the column default to nextval(). Identity columns (PostgreSQL 10+, SQL standard) use GENERATED ALWAYS AS IDENTITY or GENERATED BY DEFAULT AS IDENTITY and are more portable.
sql
-- Modern identity column
CREATE TABLE events (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
data JSONB
);
-- Inspect or alter the sequence
SELECT last_value FROM events_id_seq;
ALTER SEQUENCE events_id_seq RESTART WITH 1000;
Sequences are non-transactional: if an INSERT rolls back, the sequence value is NOT returned, creating gaps. Gaps are normal and expected. Use uuid_generate_v4() or gen_random_uuid() (PostgreSQL 13+) when you need globally unique IDs across distributed systems.
A sequence is a persistent counter object that generates unique integers. SERIAL is a legacy shorthand that creates a sequence and sets the column default to nextval(). Identity columns (PostgreSQL 10+, SQL standard) use GENERATED ALWAYS AS IDENTITY or GENERATED BY DEFAULT AS IDENTITY and are more portable.
```sql
-- Modern identity column
CREATE TABLE events (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
data JSONB
);
-- Inspect or alter the sequence
SELECT last_value FROM events_id_seq;
ALTER SEQUENCE events_id_seq RESTART WITH 1000;
```
Sequences are non-transactional: if an INSERT rolls back, the sequence value is NOT returned, creating gaps. Gaps are normal and expected. Use uuid_generate_v4() or gen_random_uuid() (PostgreSQL 13+) when you need globally unique IDs across distributed systems.
Advisory locks are application-defined locks that PostgreSQL stores and manages but does not automatically associate with any table row or transaction. Applications acquire them by passing an integer key; the lock is exclusive or shared, and can be session-level or transaction-level.
sql
-- Acquire an exclusive session-level advisory lock (blocks until available)
SELECT pg_advisory_lock(12345);
-- ... do work that requires mutual exclusion ...
SELECT pg_advisory_unlock(12345);
-- Non-blocking try-acquire (returns false instead of waiting)
SELECT pg_try_advisory_lock(12345);
Common uses: leader election (only one cron worker runs at a time), distributed mutex (prevent duplicate job processing), co-ordination between background workers. They are lightweight compared to row locks and do not cause MVCC bloat. Session-level locks persist until released or the session ends; transaction-level locks release on COMMIT/ROLLBACK.
Advisory locks are application-defined locks that PostgreSQL stores and manages but does not automatically associate with any table row or transaction. Applications acquire them by passing an integer key; the lock is exclusive or shared, and can be session-level or transaction-level.
```sql
-- Acquire an exclusive session-level advisory lock (blocks until available)
SELECT pg_advisory_lock(12345);
-- ... do work that requires mutual exclusion ...
SELECT pg_advisory_unlock(12345);
-- Non-blocking try-acquire (returns false instead of waiting)
SELECT pg_try_advisory_lock(12345);
```
Common uses: leader election (only one cron worker runs at a time), distributed mutex (prevent duplicate job processing), co-ordination between background workers. They are lightweight compared to row locks and do not cause MVCC bloat. Session-level locks persist until released or the session ends; transaction-level locks release on COMMIT/ROLLBACK.
A LATERAL join allows a subquery in the FROM clause to reference columns from preceding FROM items (like a correlated subquery, but in FROM position). It is essential when the subquery must produce multiple rows per outer row — something a scalar correlated subquery cannot do.
sql
-- For each user, get their 3 most recent orders
SELECT u.id, u.email, o.id AS order_id, o.created_at
FROM users u
CROSS JOIN LATERAL (
SELECT id, created_at
FROM orders
WHERE user_id = u.id
ORDER BY created_at DESC
LIMIT 3
) o;
LATERAL is also useful with set-returning functions like unnest(), jsonb_array_elements(), and generate_series() applied per row. Without LATERAL, a subquery in FROM cannot reference the outer table — it is evaluated independently.
A LATERAL join allows a subquery in the FROM clause to reference columns from preceding FROM items (like a correlated subquery, but in FROM position). It is essential when the subquery must produce multiple rows per outer row — something a scalar correlated subquery cannot do.
```sql
-- For each user, get their 3 most recent orders
SELECT u.id, u.email, o.id AS order_id, o.created_at
FROM users u
CROSS JOIN LATERAL (
SELECT id, created_at
FROM orders
WHERE user_id = u.id
ORDER BY created_at DESC
LIMIT 3
) o;
```
LATERAL is also useful with set-returning functions like unnest(), jsonb_array_elements(), and generate_series() applied per row. Without LATERAL, a subquery in FROM cannot reference the outer table — it is evaluated independently.
At OLAP scale (billions of rows), standard OLTP techniques are insufficient. Key strategies: (1) Use a columnar store (BigQuery, Redshift, DuckDB, ClickHouse) — they scan only needed columns and compress homogeneous data efficiently. (2) Partition tables by date and enforce partition pruning in every query. (3) Pre-aggregate with materialised views or scheduled summary tables; avoid real-time aggregation of raw events. (4) Cluster/sort tables on the most common filter column (Redshift SORTKEY, BigQuery CLUSTER BY) so the engine can skip row groups. (5) Avoid SELECT * — list only needed columns. (6) Push filters as early as possible (predicate pushdown). (7) Avoid correlated subqueries; use window functions or CTEs instead. (8) Use approximate aggregates (HyperLogLog for COUNT DISTINCT, percentile_approx) when exact answers are unnecessary. (9) Cache results for dashboards (e.g., with Redis or BI-layer caching).
At OLAP scale (billions of rows), standard OLTP techniques are insufficient. Key strategies: (1) Use a columnar store (BigQuery, Redshift, DuckDB, ClickHouse) — they scan only needed columns and compress homogeneous data efficiently. (2) Partition tables by date and enforce partition pruning in every query. (3) Pre-aggregate with materialised views or scheduled summary tables; avoid real-time aggregation of raw events. (4) Cluster/sort tables on the most common filter column (Redshift SORTKEY, BigQuery CLUSTER BY) so the engine can skip row groups. (5) Avoid SELECT * — list only needed columns. (6) Push filters as early as possible (predicate pushdown). (7) Avoid correlated subqueries; use window functions or CTEs instead. (8) Use approximate aggregates (HyperLogLog for COUNT DISTINCT, percentile_approx) when exact answers are unnecessary. (9) Cache results for dashboards (e.g., with Redis or BI-layer caching).
Under MVCC, every transaction in PostgreSQL operates on a consistent snapshot taken at transaction start (REPEATABLE READ) or statement start (READ COMMITTED). When a row is updated, the old version is preserved with its xmax set; the new version gets a new xmin. Each transaction's snapshot determines which versions are visible: a tuple is visible if xmin committed before the snapshot and xmax either has not committed or committed after the snapshot.
This means a long-running analytical query sees a stable view of the data even while other transactions modify rows — no read locks needed. The cost is that dead (old) versions accumulate until VACUUM removes them. Under SERIALIZABLE, PostgreSQL additionally tracks predicate reads via Serializable Snapshot Isolation (SSI) with predicate locks, aborting transactions that would create serialization anomalies. Unlike traditional 2PL, SSI does not hold read locks — it detects and resolves conflicts lazily, offering true serializability with good concurrency.
Under MVCC, every transaction in PostgreSQL operates on a consistent snapshot taken at transaction start (REPEATABLE READ) or statement start (READ COMMITTED). When a row is updated, the old version is preserved with its xmax set; the new version gets a new xmin. Each transaction's snapshot determines which versions are visible: a tuple is visible if xmin committed before the snapshot and xmax either has not committed or committed after the snapshot.
This means a long-running analytical query sees a stable view of the data even while other transactions modify rows — no read locks needed. The cost is that dead (old) versions accumulate until VACUUM removes them. Under SERIALIZABLE, PostgreSQL additionally tracks predicate reads via Serializable Snapshot Isolation (SSI) with predicate locks, aborting transactions that would create serialization anomalies. Unlike traditional 2PL, SSI does not hold read locks — it detects and resolves conflicts lazily, offering true serializability with good concurrency.
Bulk DELETE or UPDATE of millions of rows in a single transaction: holds locks for the entire duration, generates massive WAL, can bloat tables severely, and may cause replication lag to spike. Best practices:
1. Batch in small chunks with a loop:
sql
DO $$
DECLARE deleted INT;
BEGIN
LOOP
DELETE FROM events
WHERE id IN (
SELECT id FROM events WHERE archived = TRUE LIMIT 5000
);
GET DIAGNOSTICS deleted = ROW_COUNT;
EXIT WHEN deleted = 0;
PERFORM pg_sleep(0.05); -- throttle WAL / replication
END LOOP;
END $$;
2. Use partition drops for time-based data instead of DELETE. 3. Run VACUUM ANALYZE after large deletes. 4. Schedule during low-traffic windows. 5. Monitor replication lag (pg_stat_replication) during the operation.
Bulk DELETE or UPDATE of millions of rows in a single transaction: holds locks for the entire duration, generates massive WAL, can bloat tables severely, and may cause replication lag to spike. Best practices:
1. Batch in small chunks with a loop:
```sql
DO $$
DECLARE deleted INT;
BEGIN
LOOP
DELETE FROM events
WHERE id IN (
SELECT id FROM events WHERE archived = TRUE LIMIT 5000
);
GET DIAGNOSTICS deleted = ROW_COUNT;
EXIT WHEN deleted = 0;
PERFORM pg_sleep(0.05); -- throttle WAL / replication
END LOOP;
END $$;
```
2. Use partition drops for time-based data instead of DELETE. 3. Run VACUUM ANALYZE after large deletes. 4. Schedule during low-traffic windows. 5. Monitor replication lag (pg_stat_replication) during the operation.
INTERSECT returns only rows that appear in both query result sets (the mathematical intersection). EXCEPT (MINUS in Oracle) returns rows from the first query that do not appear in the second. Both remove duplicates by default; append ALL to keep duplicates (INTERSECT ALL, EXCEPT ALL).
sql
-- Customers who have both placed orders AND left reviews
SELECT customer_id FROM orders
INTERSECT
SELECT customer_id FROM reviews;
-- Customers who ordered but have NOT reviewed anything
SELECT customer_id FROM orders
EXCEPT
SELECT customer_id FROM reviews;
Like UNION, both operators require matching column counts and compatible types. EXCEPT is a clean alternative to NOT IN / NOT EXISTS when the semantics are set-based and NULLs could cause NOT IN to misbehave.
INTERSECT returns only rows that appear in both query result sets (the mathematical intersection). EXCEPT (MINUS in Oracle) returns rows from the first query that do not appear in the second. Both remove duplicates by default; append ALL to keep duplicates (INTERSECT ALL, EXCEPT ALL).
```sql
-- Customers who have both placed orders AND left reviews
SELECT customer_id FROM orders
INTERSECT
SELECT customer_id FROM reviews;
-- Customers who ordered but have NOT reviewed anything
SELECT customer_id FROM orders
EXCEPT
SELECT customer_id FROM reviews;
```
Like UNION, both operators require matching column counts and compatible types. EXCEPT is a clean alternative to NOT IN / NOT EXISTS when the semantics are set-based and NULLs could cause NOT IN to misbehave.
A bitmap index scan is a two-phase strategy: phase 1 scans one or more indexes and builds an in-memory bitmap where each bit corresponds to a heap page, marking pages that contain matching rows. Phase 2 reads only the marked heap pages in physical page order (avoiding random I/O). PostgreSQL can AND/OR multiple bitmaps from separate indexes before hitting the heap — enabling multi-index queries.
sql
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE status = 'pending' AND amount > 500;
-- May show:
-- Bitmap Heap Scan on orders
-- -> BitmapAnd
-- -> Bitmap Index Scan on idx_orders_status
-- -> Bitmap Index Scan on idx_orders_amount
Bitmap scans are chosen when a plain Index Scan would require too many random reads (moderately selective queries on large tables). For highly selective queries (few matching rows) a regular Index Scan is used; for very low selectivity a Seq Scan wins. The bitmap is stored in work_mem — if it overflows, PostgreSQL uses "lossy" bitmaps (page-level, not row-level), adding a recheck step.
A bitmap index scan is a two-phase strategy: phase 1 scans one or more indexes and builds an in-memory bitmap where each bit corresponds to a heap page, marking pages that contain matching rows. Phase 2 reads only the marked heap pages in physical page order (avoiding random I/O). PostgreSQL can AND/OR multiple bitmaps from separate indexes before hitting the heap — enabling multi-index queries.
```sql
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE status = 'pending' AND amount > 500;
-- May show:
-- Bitmap Heap Scan on orders
-- -> BitmapAnd
-- -> Bitmap Index Scan on idx_orders_status
-- -> Bitmap Index Scan on idx_orders_amount
```
Bitmap scans are chosen when a plain Index Scan would require too many random reads (moderately selective queries on large tables). For highly selective queries (few matching rows) a regular Index Scan is used; for very low selectivity a Seq Scan wins. The bitmap is stored in work_mem — if it overflows, PostgreSQL uses "lossy" bitmaps (page-level, not row-level), adding a recheck step.
Frequently Asked Questions
Which SQL dialect should I study?
Start with ANSI SQL then learn one of PostgreSQL or MySQL deeply. Interviewers are usually flexible with syntax as long as the logic is correct.
How important are window functions?
Very — most medium/hard interview problems use ROW_NUMBER, RANK, LAG, or running totals. They come up in nearly every data-heavy loop.
What about indexes?
Know how B-trees work, when indexes help and hurt (writes), and how to read an EXPLAIN plan. This separates senior candidates from juniors.
Do I need NoSQL too?
For data/backend roles, basic understanding (document vs key-value vs wide-column) is enough. Deep NoSQL usually has its own loop.
How do you practice SQL?
Sites like StrataScratch, LeetCode SQL, and HackerRank have real interview-style problems. Run queries locally on sample datasets too.