Database Concepts
— Database, Introduction, PostgreSQL — 26 min read
Introduction
Databases are used to store information. We take information from some source, put it into a database where it is persisted in memory or on disk, and eventually retrieve it at some point in time. Database clients are used to connect to databases. SQL is a programming language used to interact with databases.
Simple database design process questions:
- What kind of thing are we storing?
- What properties does this thing have?
- What type of data does each of those properties contain?
Tables
A table is a collection of records that are related to each other; a table is also called a relation. A table is made up of columns and rows: columns are also called fields, and rows are also called records. In relational and flat-file databases, a table is a set of data elements (values) organized into vertical columns (identifiable by name) and horizontal rows, the cell being the unit where a row and column intersect. A table has a specified number of columns but can have any number of rows. Columns store information about specific properties of the records; a row represents a single, implicitly structured data item in the table.
Creating a table and inserting data
-- Create a table called tablename1, with the two columns shown, for
-- the database currently in use. Lots of other options are available
-- for how you specify the columns, such as their datatypes.
CREATE TABLE tablename1 (fname VARCHAR(20), lname VARCHAR(20));

-- Insert a row of data into the table tablename1. This assumes that the
-- table has been defined to accept these values as appropriate for it.
INSERT INTO tablename1 VALUES ('Richard', 'Mutt');
-- another format:
INSERT INTO tablename1 (fname, lname) VALUES ('Richard', 'Mutt');

-- Inserting multiple rows:
INSERT INTO tablename1 VALUES ('Ethan', 'Hunt'), ('Marcel', 'Duchamp');
SQL statements consist of keywords, options, and identifiers. Keywords are the words that make up the SQL statement and specify what needs to be done on the database. Options are words that modify the statement. Identifiers are the names of the things the statement acts on. SQL is not just about pulling raw data out of a table; it can also transform or process data before we retrieve it. For any SQL statement, the database first considers the data source, then applies the filter criteria, and finally selects and retrieves the specified columns from the data source.
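For instance, in the following minimal sketch against the departments table used below (the value 'd005' is illustrative), SELECT, FROM, and WHERE are keywords, while departments, dept_no, and dept_name are identifiers:

SELECT dept_name          -- 3. select and retrieve the specified columns
FROM departments          -- 1. consider the data source
WHERE dept_no = 'd005';   -- 2. apply the filter criteria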
Retrieving, updating and deleting data from a table
-- Select all rows and columns from the current database's departments table.
-- Default activity is for the interpreter to scroll the results on your screen.
SELECT * FROM departments;

-- Retrieve all rows from the departments table, but only the dept_no and dept_name columns.
-- Splitting up commands across lines is OK.
SELECT dept_no, dept_name FROM departments;

-- Retrieve all departments columns, but just 5 rows.
SELECT * FROM departments LIMIT 5;

-- Retrieve dept_name column values from the departments
-- table where the dept_name value has the substring 'en'.
SELECT dept_name FROM departments WHERE dept_name LIKE '%en%';

-- Retrieve all columns from the departments table where the dept_name
-- column starts with an 'S' and has exactly 4 characters after it.
SELECT * FROM departments WHERE dept_name LIKE 'S____';

-- Select title values from the titles table but don't show duplicates.
SELECT DISTINCT title FROM titles;

-- Same as above, but sorted (case-sensitive) by the title values.
SELECT DISTINCT title FROM titles ORDER BY title;

-- Show the number of rows in the departments table.
SELECT COUNT(*) FROM departments;

-- Show the number of rows in the departments table that
-- have 'en' as a substring of the dept_name value.
SELECT COUNT(*) FROM departments WHERE dept_name LIKE '%en%';

-- In tablename1, change the fname value to 'John'
-- for all rows that have an lname value of 'Mutt'.
UPDATE tablename1 SET fname='John' WHERE lname='Mutt';

-- Delete rows from the tablename1 table
-- where the lname value begins with 'M'.
DELETE FROM tablename1 WHERE lname LIKE 'M%';

-- Delete all rows from the tablename1 table, leaving the empty table.
DELETE FROM tablename1;

-- Remove the entire tablename1 table.
DROP TABLE tablename1;
Table relationships
Relationship types between tables: one-to-one, one-to-many, many-to-one, many-to-many.
- Primary Key - uniquely identifies a record in the table. Each row in every table has one primary key, which never changes.
- Foreign Key - identifies a record (usually in another table) that a particular row is associated with. The 'many' side of the relationship usually gets the foreign key column. Many rows in the same table can have the same foreign key, which can change if the relationship changes.
-- Foreign Key example
CREATE TABLE photos (
  id SERIAL PRIMARY KEY,
  url VARCHAR(200),
  user_id INTEGER REFERENCES users(id)
);
-- The SERIAL datatype in Postgres automatically generates integer values
-- for the specified column, starting from 1.
Data consistency - the state of the data in a database when all foreign key references are valid. If a foreign key references a primary key that doesn't exist, the data is inconsistent.
There are several deletion constraints that can be used in foreign key references:
- ON DELETE RESTRICT - the default; prevents deletion of a row if it is referenced by a foreign key in another table, and throws an error
- ON DELETE NO ACTION - essentially the same as RESTRICT (though the check can be deferred to the end of the transaction)
- ON DELETE CASCADE - deletes all rows that reference the deleted row
- ON DELETE SET NULL - sets the foreign key value to NULL in all rows that reference the deleted row
- ON DELETE SET DEFAULT - sets the foreign key value to its default value in all rows that reference the deleted row
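For example, as a sketch reusing the photos table above, ON DELETE CASCADE makes Postgres delete a user's photos automatically when the user row is deleted:

CREATE TABLE photos (
  id SERIAL PRIMARY KEY,
  url VARCHAR(200),
  user_id INTEGER REFERENCES users(id) ON DELETE CASCADE
);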
SQL joins
- Inner join - returns rows that have matching values in both tables involved in the join. The join condition is specified in the ON clause (or in the WHERE clause when using the older implicit-join syntax). The INNER keyword is optional; it is the default join type.
- Left outer join - returns all rows from the left table, and the matched rows from the right table. Null values are returned for the right table if there is no match.
- Right outer join - returns all rows from the right table, and the matched rows from the left table. Null values are returned for the left table if there is no match.
- Full outer join - returns all rows from both tables, regardless of whether there is a match or not. If there is no match, NULL values are used.
Joins - produce values by merging together rows from different related tables. Joins are used most of the time when we are asked to find data that involves multiple resources.
Aggregation - looks at many rows and calculates a single value. Words like 'sum', 'average', 'maximum', 'minimum', 'count' are used in aggregation.
The order of the tables between FROM and JOIN frequently makes a difference (particularly for outer joins). When column names collide during a join, we must qualify them with their table name to give context. Tables can be renamed during joins using the AS keyword.
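A minimal sketch of an inner join with aliases, assuming the users and photos tables from earlier and a username column on users (an assumption):

SELECT u.username, p.url
FROM photos AS p
JOIN users AS u ON u.id = p.user_id;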
Grouping and Aggregation
Grouping - divides rows into groups and applies an aggregate function to each group. The GROUP BY clause is used to group rows together. It must appear after the FROM and WHERE clauses, but before the ORDER BY clause. It can group by multiple columns, and it can group by column numbers instead of column names.
Aggregation - as described above, looks at many rows and calculates a single value. The HAVING clause is used to filter groups. It must appear after the GROUP BY clause, but before the ORDER BY clause, and it can use aggregate functions.
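A minimal sketch, again assuming the photos table from earlier: group photos by user, aggregate with COUNT, and keep only the groups having more than five rows:

SELECT user_id, COUNT(*) AS photo_count
FROM photos
GROUP BY user_id
HAVING COUNT(*) > 5;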
Sorting
The ORDER BY clause is used to sort rows. It must appear after the FROM, WHERE, and GROUP BY clauses, but before the LIMIT clause. It can sort by multiple columns, in which case the order in which the columns are specified determines the sort priority, and it can sort by column numbers and expressions as well. The LIMIT clause is used to limit the number of rows returned, and the OFFSET clause is used to skip a specified number of rows before returning the rest.
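A minimal sketch against the titles table used earlier (the from_date column is an assumption):

SELECT title, from_date
FROM titles
ORDER BY title ASC, from_date DESC  -- title has priority; from_date breaks ties
LIMIT 10 OFFSET 20;                 -- skip 20 rows, then return at most 10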
Union and Intersection
The UNION operator is used to combine the result sets of two or more SELECT statements. Each SELECT statement within a UNION must have the same number of columns, the columns must have similar data types, and the columns in each SELECT statement must be in the same order. The UNION operator selects only distinct values by default; to allow duplicate values, use the UNION ALL operator.

The INTERSECT operator also combines the result sets of two or more SELECT statements, but returns only the rows from the first SELECT statement that are identical to a row in the second SELECT statement. It selects only distinct values by default; to allow duplicates, use the INTERSECT ALL operator. Similarly, the EXCEPT operator combines the result sets of two or more SELECT statements, but returns only the rows from the first SELECT statement that are not identical to any row in the second SELECT statement. It selects only distinct values by default; to allow duplicates, use the EXCEPT ALL operator.
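A minimal sketch, assuming two hypothetical tables customers and suppliers that each have a city column:

SELECT city FROM customers
UNION
SELECT city FROM suppliers;    -- all cities, duplicates removed

SELECT city FROM customers
INTERSECT
SELECT city FROM suppliers;    -- cities appearing in both tables

SELECT city FROM customers
EXCEPT
SELECT city FROM suppliers;    -- cities with customers but no suppliers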
Data storage in Postgres
- SHOW data_directory; - the SHOW keyword is used to retrieve individual configuration options; in this case it retrieves the path to the directory where Postgres stores its data. All data for the different databases is placed in the base directory.
- SELECT oid, datname FROM pg_database; - provides a list of all databases and their unique identifiers, where oid stands for object identifier. Folders named after these oids are present within the base directory.
- SELECT * FROM pg_class; - provides a list of all objects in a database and their unique identifiers; these identifiers are used as the file names within those folders.
- SELECT * FROM pg_stats WHERE tablename = 'users'; - provides per-column planner statistics for a table, such as the fraction of NULL values and the number of distinct values in each column.
File terminologies:
- Heap or Heap File - file that contains all the data (rows) of a table (not related to heap data structure)
- Tuple or Item - Individual row from the table
- Block or Page - the heap file is divided into many different "blocks" or "pages". Each page/block stores some number of rows. By default they are 8KB in size. The number of rows that can be stored in a page depends on the size of the rows. The more rows that can fit in a page, the more efficient the database is.
Full Table Scan - the scenario in which the database engine has to read every row in a table to find the rows that match the query. This is very slow, especially for large tables: Postgres has to load all the rows stored in the heap file into memory and then iterate over them to find the matching records.
Indexes
Index - a data structure that makes it faster to find rows in a table. It efficiently tells us exactly what block and index a particular record is stored at. Indexes are created on columns. Indexes are stored in a data structure that is optimized for searching, so searching is faster than scanning the whole table. Indexes are not free. They take up space on disk, can be large as they store data from at least one column of the real table, and they slow down writes as they need to be updated for every insert/update/delete operation. Indexes are automatically created for primary keys and unique constraints, which can be viewed using the following query:
SELECT relname, relkind
FROM pg_class  -- this table lists all the different objects present in the database
WHERE relkind = 'i';
Example to create an index on a column: CREATE INDEX idx_person_name ON person (name); If we don't provide a name, one is assigned automatically in the format <table-name>_<column-name>_idx. SELECT pg_size_pretty(pg_relation_size('idx_person_name')); provides the size of the index.
Keywords used for benchmarking and evaluating queries, but not for real data fetching:
- EXPLAIN - builds a query plan and displays information about it without actually executing the query.
- EXPLAIN ANALYZE - builds a query plan, runs it, and then provides information about the plan (the strategy that the database engine uses to execute the query), the query execution time, and the query results.
Within the results of these keywords, the rows marked with -> are called query nodes; each indicates a step where we access data stored in the database or perform some processing. The top row is a query node as well. Evaluation runs from the innermost node to the outermost top node. Within each node, the first phrase (example: Hash Join) describes how the node generates data, the second (cost) is the amount of processing power required for that step, the third (rows) is an estimate of how many rows the step will produce, and the last (width) is an estimate of the average number of bytes in each row.
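A minimal sketch, assuming the person table and idx_person_name index defined above; the plan in the comment is representative of the output shape, not exact values:

EXPLAIN ANALYZE SELECT * FROM person WHERE name = 'Alice';
-- Index Scan using idx_person_name on person
--   (cost=0.28..8.30 rows=1 width=40) (actual time=0.02..0.03 rows=1 loops=1)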
Index Types:
- B-Tree Index - default index type in Postgres. It is a balanced tree data structure. It is sorted by the column that we are indexing. It is very efficient for equality and range queries.
- Hash Index - it is a hash table data structure. It is very efficient for equality queries, but not for range queries.
- GiST Index - it is a generalized search tree data structure used for geometry, full-text search.
- SP-GiST Index - space partitioned GiST index is used for clustered data, such as dates - where many rows might have the same value.
- GIN Index - it is a generalized inverted index. Used for columns that contain arrays or JSON data.
- BRIN Index - it is a block range index. It is very efficient for range queries and specialized for really large datasets.
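As a sketch, the index type is selected with the USING clause when the index is created (the documents and events tables here are hypothetical):

CREATE INDEX idx_person_name_hash ON person USING HASH (name);       -- equality lookups
CREATE INDEX idx_documents_tags ON documents USING GIN (tags);       -- array / JSON column
CREATE INDEX idx_events_created ON events USING BRIN (created_at);   -- very large, ordered table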
How indexes are created
First, we identify the column that we want to create an index on - the column we want fast lookups by. Then we extract only that property, together with the block and index at which each record is stored. These values are sorted in a meaningful order (ascending / descending) and organized into a tree data structure, distributed evenly across the leaf nodes of the tree in left-to-right fashion. Helpers are added to the root node to make it easier to traverse the tree, and the tree is then stored on disk.
When an SQL query is given to Postgres, the parser first interprets the meaning of every character in the query and builds a query tree, which is essentially a programmatic description of the query to run. The tree is then passed to the rewriter, which decomposes views into underlying table references, expands wildcards, and so on. Next, the planner generates a query plan: the strategy the database engine will use to execute the query in the most efficient way. The query plan is finally passed to the executor, which executes it and returns the results.
Cost = amount of processing power required for a particular step in the query plan:

Cost = (number of pages read sequentially)         * seq_page_cost
     + (number of pages read randomly)             * random_page_cost
     + (number of rows scanned)                    * cpu_tuple_cost
     + (number of index entries scanned)           * cpu_index_tuple_cost
     + (number of function / operator evaluations) * cpu_operator_cost
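As a worked sketch using Postgres's default planner constants (seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01, cpu_index_tuple_cost = 0.005, cpu_operator_cost = 0.0025), a sequential scan that reads 100 pages and examines 1,000 rows would be estimated at:

Cost = (100 * 1.0) + (0 * 4.0) + (1000 * 0.01) + (0 * 0.005) + (0 * 0.0025) = 110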
Common Table Expressions (CTE)
CTE is a temporary result set that is defined within the execution scope of a single SQL statement. It is similar to a derived table in that it is not stored as an object and lasts only for the duration of the query. Unlike a derived table, a CTE can be self-referencing and can be referenced multiple times in the same query. CTEs are defined using the WITH keyword before the main query.
There are two forms of CTEs: non-recursive and recursive. Non-recursive CTEs are used to define temporary result sets that can be referenced multiple times in a query. Recursive CTEs are used to define temporary result sets that reference themselves; they solve problems that involve hierarchical data (tree- or graph-type structures), such as organizational charts, bills of materials, and parts explosions, and unlike simple / non-recursive CTEs they must use the UNION keyword.
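A minimal non-recursive sketch, assuming the photos table from earlier - the CTE names an intermediate result that the main query can then filter like an ordinary table:

WITH photo_counts AS (
    SELECT user_id, COUNT(*) AS n
    FROM photos
    GROUP BY user_id
)
SELECT * FROM photo_counts WHERE n > 5;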
Example for recursive CTE:
WITH RECURSIVE countdown(val) AS (
    SELECT 3 AS val                               -- initial or non-recursive query
    UNION
    SELECT val - 1 FROM countdown WHERE val > 1   -- recursive query
)
SELECT * FROM countdown;
-- Output: 3, 2, 1
Working of recursive CTEs:
- Postgres defines a results table for the non-recursive query and a working table for the recursive query
- It runs the initial non-recursive statement and puts the results into both the results table and the working table
- It then runs the recursive statement, replacing the table name with a reference to the working table
- If the recursive statement returns some rows, it appends them to the results table and runs the recursion again; if it returns no rows, the recursion stops
Views
A View is a virtual table that consists of a subset of the data contained in a table. Views are built on top of tables or other views. They are used for security purposes, to hide the complexity of queries, and to provide a layer of abstraction between the user and the database. Views are created using the CREATE VIEW statement; they can be created from a single table, multiple tables, or other views, and can be referenced anywhere we would normally reference a table. Views can be created with or without a check option; the check option prevents the user from updating the view in a way that would produce rows that are not included in the view.

Materialized Views are similar to views, but they store the results of the query in a table-like structure. They are used to improve the performance of queries that involve expensive operations, such as aggregations and joins, and are updated periodically, either manually or automatically. They are created using the CREATE MATERIALIZED VIEW statement; the underlying query is executed only at very specific times, but the results are saved and can be referenced without re-running the query. The WITH DATA option populates the materialized view with data, WITH NO DATA creates an empty materialized view, and REFRESH MATERIALIZED VIEW refreshes it.
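A minimal sketch, assuming the photos table from earlier:

-- A plain view: the query runs every time the view is referenced.
CREATE VIEW recent_photos AS
SELECT * FROM photos ORDER BY id DESC LIMIT 10;

-- A materialized view: the results are stored and must be refreshed explicitly.
CREATE MATERIALIZED VIEW photos_per_user AS
SELECT user_id, COUNT(*) AS n FROM photos GROUP BY user_id
WITH DATA;

REFRESH MATERIALIZED VIEW photos_per_user;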
Transactions
A transaction is a logical unit of work that contains one or more SQL statements. A transaction is an atomic unit: the effects of all the SQL statements in it are either all committed (applied to the database) or all rolled back (undone). Transactions are started with the BEGIN keyword, committed with the COMMIT keyword, and rolled back with the ROLLBACK keyword. Transactions are used to ensure data integrity and consistency - the database remains in a consistent state even in the event of power failures, errors, etc. - because either all the statements in a transaction execute successfully or none of them do. If any statement in a transaction fails, the entire transaction is put into an "aborted" state, from which we must roll back. Losing the connection to the database automatically rolls back the transaction.
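A minimal sketch, assuming a hypothetical accounts table - the classic transfer where both updates must succeed together or not at all:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- or ROLLBACK; to undo both updates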
Schema Migration Files
Schema migration files are used to make changes to the database schema: to create, modify, or delete tables, indexes, views, etc. They make schema changes in a consistent, repeatable way that is safe and reversible, and they can be used to keep track of changes over time. A schema migration file can be written in any programming language, and it contains two sections: up and down. The up section makes changes to the database schema, and the down section undoes the changes made in the up section. Schema migration files are executed in the order in which they were created, and each is executed in a transaction, which means that if any of its statements fails, the entire schema migration is rolled back.
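A minimal SQL sketch of such a migration, reusing the photos table from earlier (the caption column is hypothetical):

-- up: apply the schema change
ALTER TABLE photos ADD COLUMN caption VARCHAR(255);

-- down: undo the schema change
ALTER TABLE photos DROP COLUMN caption;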
Multi-Version Concurrency Control (MVCC)
Multi-Version Concurrency Control (MVCC) is a technique used by PostgreSQL to allow multiple transactions to access the same data concurrently without conflicts or delays. It ensures that each transaction has a consistent snapshot of the database and can operate on its own version of the data.
How MVCC Works
- When a transaction starts, it gets a unique transaction ID (TXID). This ID is later used to keep track of changes made by the transaction.
- When a transaction reads data, it only sees the data that was committed before the transaction started, as well as any changes it made itself. This ensures that every transaction has a consistent view of the database.
- Whenever a transaction modifies data (INSERT, UPDATE, or DELETE), PostgreSQL creates a new version of the affected rows and assigns the new version the same TXID as the transaction. These new versions are called "tuples".
- Other transactions running at the same time will only see the old versions of the modified rows since their snapshots are still based on the earlier state of the data.
- When a transaction is committed, PostgreSQL checks for conflicts (such as two transactions trying to modify the same row). If there are no conflicts, the changes are permanently applied to the database, and other transactions can now see the updated data.
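This bookkeeping is visible from SQL through hidden system columns present on every table: xmin holds the TXID of the transaction that created a row version, and xmax the TXID of the transaction that deleted or superseded it (0 while the version is live). A quick sketch:

SELECT xmin, xmax, * FROM tablename1;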
Write Ahead Log (WAL)
The Write Ahead Log is a technique where any modification to the data is first recorded in the log before being written into the main data storage. The primary purpose of the WAL is to guarantee that the database state is recoverable to a consistent state even in the event of a crash or hardware failure.
How WAL Works
- Write operation: When a change is made to the data, PostgreSQL writes the changes to the WAL buffer instead of immediately modifying the disk pages.
- Flush operation: Once the transaction is committed, the WAL buffer contents are flushed to the on-disk WAL file.
- Checkpoint: The background writer process writes the 'dirty' pages from the shared buffer to the main data files at specific intervals called 'checkpoints.' It ensures that the actual data files are updated to match the state recorded in the WAL logs.
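A few related commands, as a sketch (CHECKPOINT requires superuser privileges, and pg_current_wal_lsn() is available in PostgreSQL 10+):

SHOW wal_level;               -- how much information is written to the WAL
SELECT pg_current_wal_lsn();  -- current position in the write-ahead log
CHECKPOINT;                   -- force an immediate checkpoint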
Stages of Query Processing
Query processing in PostgreSQL involves several stages, from parsing SQL queries to producing the final result set:
- Parsing: This is the first stage in query processing, where the SQL query is broken down into smaller components and checked for any syntactical errors. The parser creates a parse tree, a data structure representing the different elements of the query.
- Rewriting: At this stage, the parse tree might be modified to apply any necessary optimization or transformation. Examples include removing redundant conditions, simplifying expressions, expanding views, and applying security-related checks.
- Optimization: This stage involves selecting the best execution plan from multiple alternatives. The query optimizer evaluates various strategies based on factors like the availability of indexes, the size of the tables, and the complexity of the conditions in the query. The cost of each plan is estimated, and the one with the lowest cost is chosen as the final plan.
- Plan Execution: The selected execution plan is converted into a series of low-level operations, which are then executed by the executor. The executor retrieves or modifies the data as specified by the plan, executing the required joins, filtering, aggregations, and sorting steps.
- Returning Results: After the successful execution of the plan, the final result set is sent back to the client application. This result set might be in the form of rows of data, a single value, or a confirmation message of completed operations.
Key Components in Query Processing
There are several key components of PostgreSQL's query processing engine:
- Parser: The component responsible for breaking down SQL queries and creating parse trees.
- Optimizer: The part of the system that evaluates and chooses the optimal execution plan for a given query.
- Executor: The component that runs the selected execution plan, performing the required operations to retrieve or modify the data.
- Statistics Collector: This component gathers essential information about the status of the database, including table sizes, distribution of the data, and access frequency. This information is used by the optimizer to make better decisions when choosing execution plans.
Row Level Security (RLS)
Row Level Security (RLS) is a feature introduced in PostgreSQL 9.5 that allows us to control access to rows in a table based on a user or role's permissions. This level of granularity in data access provides an extra layer of security for protecting sensitive information from unauthorized access.
To enable RLS, we need to set up policies for our table. A policy is a set of rules that defines how users can read or modify table rows. First, enable RLS on the table using the ALTER TABLE command with the ENABLE ROW LEVEL SECURITY option (FORCE ROW LEVEL SECURITY additionally applies the policies to the table's owner):

ALTER TABLE my_table ENABLE ROW LEVEL SECURITY;
To create a policy, use the CREATE POLICY command with a USING clause that specifies the conditions for allowing access to a row. Here's an example of a policy that allows users to read rows only if the user's id is equal to the user_id column in the table:

CREATE POLICY my_policy ON my_table
FOR SELECT
USING (current_user_id() = user_id);  -- current_user_id() is assumed to be an application-defined helper
We can also create policies for modifying rows by specifying the FOR action as INSERT, UPDATE, or DELETE.
Data Partitioning and Sharding
Data partitioning is a technique that divides a large table into smaller, more manageable pieces called partitions. Each partition is a smaller table that stores a subset of the data, usually based on specific criteria such as ranges, lists, or hashes. PostgreSQL supports different partitioning methods, such as:
- Range Partitioning: The data in a range-partitioned table is separated into partitions based on a specified range of values for a given column. For example, orders could be partitioned by date range, with each partition containing orders within a specific date interval.
- List Partitioning: The data in a list-partitioned table is separated into partitions based on specified discrete sets of values for a given column. For example, customers could be partitioned by their country, with each partition storing customers from a specific country.
- Hash Partitioning: The data in a hash-partitioned table is divided into partitions using a hash function applied to one or more columns. This method distributes data uniformly across all partitions, which helps in load balancing and parallel query processing. For example, products could be hash partitioned based on the product ID.
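A minimal sketch of declarative range partitioning (available since PostgreSQL 10; the orders table and its partition bounds are hypothetical):

CREATE TABLE orders (
    id         BIGINT,
    order_date DATE NOT NULL
) PARTITION BY RANGE (order_date);

-- One partition per year; rows are routed automatically by order_date.
CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');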
Sharding is a technique that splits a large dataset across multiple database instances or servers, called shards. Each shard is an independent and self-contained unit that holds a portion of the overall data, and shards can be distributed across different geographical locations or infrastructures. A horizontal fragment or shard of a relation is a subset of the tuples in that relation. The tuples that belong to the horizontal fragment can be specified by a condition on one or more attributes of the relation, or by some other mechanism. Often, only a single attribute is involved in the condition.
In a PostgreSQL environment, sharding can be achieved in different ways:
- Sharding at the application level: The application defines the logic to decide which shard will store a specific data record. The application communicates directly with each shard for querying or modifying the data.
- Sharding using foreign data wrappers: PostgreSQL provides a feature called foreign data wrappers (FDW) that allows a PostgreSQL server to access data stored in remote servers, treating them as local tables. By using this technique, the data can be sharded across multiple remote servers, and the local PostgreSQL instance acts as a coordinator for accessing these shards.
- Sharding using 3rd-party tools: Several 3rd-party tools, such as Pgpool-II, Citus, and PLProxy, can be used for sharding purposes. These tools handle connection pooling, load balancing, and data distribution across multiple PostgreSQL instances. The choice of tools depends on the requirements, complexity, and the desired level of control over the sharding logic.
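A sketch of the FDW approach using the postgres_fdw extension (the host, database, credentials, and table names are hypothetical):

CREATE EXTENSION postgres_fdw;

-- Register the remote shard and how to connect to it.
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'app');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app', password 'secret');

-- Expose the remote users table as if it were local.
CREATE FOREIGN TABLE users_shard1 (
    id   INTEGER,
    name VARCHAR(50)
) SERVER shard1 OPTIONS (table_name 'users');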
CAP Theorem
Consistency means that the nodes will have the same copies of a replicated data item visible for various transactions. Availability means that each read or write request for a data item will either be processed successfully or will receive a message that the operation cannot be completed. Partition Tolerance means that the system can continue operating even if the network connecting the nodes has a fault that results in two or more partitions, where the nodes in each partition can only communicate among each other. The CAP Theorem states that it is not possible to guarantee all three of the desirable properties - consistency, availability, and partition tolerance - at the same time in a distributed system with data replication. If this is the case, then the distributed system designer has to choose two of the three properties to guarantee.
Algorithms for SELECT Operation
A number of search algorithms are possible for selecting records from a file. These are known as file scans, because they scan the records of a file to search for and retrieve records that satisfy a selection condition. If a search algorithm involves the use of an index, the index search is called an index scan. Different search algorithms:
- S1 - Linear search (brute force algorithm)
- S2 - Binary search (requires sorted data)
- S3a - Using a primary index, S3b - Using a hash key
- S4 - Using a primary index to retrieve multiple records
- S5 - Using a clustering index to retrieve multiple records
- S6 - Using a secondary (B+ tree) index on an equality comparison
- S7a - Using a bitmap index, S7b - Using a functional index
- S8 - Conjunctive selection using an individual index
- S9 - Conjunctive selection using a composite index
- S10 - Conjunctive selection by intersection of record pointers