1. Introduction

In this chapter, we will discuss storage and retrieval from the database’s point of view: how we can store the data that we’re given, and how we can find it again when we’re asked for it.

We will examine two families of storage engines: log-structured storage engines, and page-oriented storage engines such as B-trees.

2. Data Structures That Power Your Database

Consider the world’s simplest database, implemented as two Bash functions:

#!/bin/bash

# Append the key and value as a comma-separated line at the end of the file.
db_set () {
    echo "$1,$2" >> database
}

# Find all lines starting with the key, strip the key prefix, and print
# only the last (most recent) match.
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

You can call db_set key value, which will store key and value in the database. You can then call db_get key, which looks up the most recent value associated with that particular key and returns it.

$ db_set 123456 '{"name":"London","attractions":["Big Ben","London Eye"]}' 
$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}

The underlying storage format is very simple: a text file where each line contains a key-value pair, separated by a comma. Every call to db_set appends to the end of the file, so if you update a key several times, the old versions of the value are not overwritten; you need to look at the last occurrence of a key in the file to find the latest value (hence the tail -n 1 in db_get).

Our db_set function actually has pretty good performance for something so simple, because appending to a file is generally very efficient. Similarly to what db_set does, many databases internally use a log: an append-only data file. Real databases have more issues to deal with, though, such as concurrency control, reclaiming disk space so the log doesn’t grow forever, and handling errors and partially written records.

In order to efficiently find the value for a particular key in the database, we need a different data structure: an index.

An index is an additional structure that is derived from the primary data. Maintaining it incurs overhead, especially on writes, so this is an important trade-off: well-chosen indexes speed up read queries, but every index slows down writes.

a. Hash Indexes

Let’s start with indexes for key-value data.

Let’s say our data storage consists only of appending to a file, as in the preceding example. Then the simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file, the location at which the value can be found.
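
To make this concrete, here is a minimal sketch in Python (illustrative, not the book’s code; the class and method names are invented) of an append-only log with an in-memory hash index:

class HashIndexedLog:
    def __init__(self, path="database"):
        self.path = path
        self.index = {}                  # key -> byte offset of latest entry
        open(self.path, "ab").close()    # make sure the file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            self.index[key] = f.tell()   # remember where this entry starts
            f.write(f"{key},{value}\n".encode())

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])      # jump straight to the entry
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

log = HashIndexedLog()
log.set("42", '{"name":"San Francisco"}')
log.get("42")                            # '{"name":"San Francisco"}'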

Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote. When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.

As described so far, we only ever append to a file — so how do we avoid eventually running out of disk space?

A good solution is to break the log into segments: close a segment file when it reaches a certain size, and make subsequent writes to a new segment file.

We can then perform compaction on these segments. Compaction means throwing away duplicate keys in the log and keeping only the most recent update for each key.

Moreover, since compaction often makes segments much smaller (assuming that a key is overwritten several times on average within one segment), we can also merge several segments together at the same time as performing the compaction.

The merging and compaction of frozen segments can be done in a background thread, and while it is going on, we can still continue to serve read and write requests as normal, using the old segment files.
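
As a minimal sketch of compaction with merging (again illustrative Python, not the book’s code): because later entries take precedence, a single pass with a dict over segments from oldest to newest keeps exactly the latest value for each key:

def compact_and_merge(segments):
    # segments: lists of (key, value) pairs, ordered oldest to newest
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value      # a later write overwrites an earlier one
    return list(latest.items())

# Two segments with overlapping keys collapse to one entry per key:
old = [("mew", "1078"), ("purr", "2103"), ("purr", "2104")]
new = [("purr", "2105"), ("mew", "1079")]
compact_and_merge([old, new])        # [('mew', '1079'), ('purr', '2105')]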

An append-only design turns out to be good for several reasons:

  • Appending and segment merging are sequential write operations, which are generally much faster than random writes.
  • Concurrency and crash recovery are much simpler if segment files are append-only or immutable.
  • Merging old segments avoids the problem of data files getting fragmented over time.

However, the hash table index also has limitations:

  • The hash table must fit in memory, so this approach does not work well if you have a very large number of keys.
  • Range queries are not efficient. For example, you cannot easily scan over all keys between kitty00000 and kitty99999; you’d have to look up each key individually in the hash map.

In the next section we will look at an indexing structure that doesn’t have those limitations.

b. SSTables and LSM-Trees

Each log-structured storage segment is a sequence of key-value pairs. These pairs appear in the order that they were written, and values later in the log take precedence over values for the same key earlier in the log. Apart from that, the order of key-value pairs in the file does not matter.

Now we can make a simple change to the format of our segment files: we require that the sequence of key-value pairs is sorted by key. We call this format Sorted String Table, or SSTable for short.

SSTables have several big advantages over log segments with hash indexes:

1. Merging segments is simple and efficient, even if the files are bigger than the available memory. For example:

You start reading the input files side by side, look at the first key in each file, copy the lowest key (according to the sort order) to the output file, and repeat. When the same key appears in several input segments, keep only the value from the most recent segment; this way the output retains exactly one, most recent value per key.
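
Here is one way that merge might look as a Python sketch (the tuple tagging is an illustrative device for breaking ties in favour of newer segments):

import heapq

def merge_sorted_segments(segments):
    # segments: sorted lists of (key, value) pairs, newest segment first.
    # Tag entries with their segment index (0 = newest) so that, for equal
    # keys, the newest segment's entry comes first in the merged stream.
    streams = [[(key, i, value) for key, value in seg]
               for i, seg in enumerate(segments)]
    output, last_key = [], object()
    for key, _, value in heapq.merge(*streams):
        if key != last_key:          # first occurrence is the newest value
            output.append((key, value))
            last_key = key
    return output

newest = [("handbag", "8786"), ("handful", "40308")]
older  = [("handbag", "2865"), ("handsome", "86543")]
merge_sorted_segments([newest, older])
# [('handbag', '8786'), ('handful', '40308'), ('handsome', '86543')]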

2. In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory. For example, say you’re looking for the key handiwork, but you don’t know the exact offset of that key in the segment file.

However, you do know the offsets for the keys handbag and handsome, and because of the sorting you know that handiwork must appear between those two. This means you can jump to the offset for handbag and scan from there until you find handiwork.
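
A sketch of this sparse-index lookup (the file layout follows the Bash example, one key,value pair per line; names are illustrative):

import bisect

def sparse_lookup(sparse_index, segment_path, target):
    # sparse_index: sorted list of (key, byte_offset) for *some* of the keys
    keys = [k for k, _ in sparse_index]
    # The greatest indexed key <= target tells us where to start scanning.
    pos = bisect.bisect_right(keys, target) - 1
    start = sparse_index[pos][1] if pos >= 0 else 0
    with open(segment_path, "r") as f:
        f.seek(start)                    # jump close to the target key
        for line in f:
            key, _, value = line.rstrip("\n").partition(",")
            if key == target:
                return value
            if key > target:             # sorted order: the key is absent
                return None
    return None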

3. Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk.

Constructing and maintaining SSTables

How do you get your data to be sorted by key in the first place, when incoming writes can occur in any order? Maintaining a sorted structure on disk is possible, but maintaining it in memory is much easier: there are plenty of well-known balanced tree data structures you can use, such as red-black trees or AVL trees.

We can now make our storage engine work as follows (a toy sketch follows the list):

  • When a write comes in, add it to an in-memory balanced tree data structure. This in-memory tree is sometimes called a memtable.
  • When the memtable gets bigger than some threshold, write it out to disk as an SSTable file.
  • In order to serve a read request, first try to find the key in the memtable → then in the most recent on-disk segment → then in the next-older segment, etc.
  • From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
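
To make the workflow concrete, here is a toy sketch (segments are kept in memory as sorted lists for brevity; a real engine would use a balanced tree for the memtable and write segments to disk as SSTable files):

class MiniLSM:
    def __init__(self, flush_threshold=4):
        self.memtable = {}        # stand-in for a red-black/AVL tree
        self.segments = []        # newest first; each a sorted list of pairs
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # "Write out" the memtable as a sorted segment (an SSTable).
            self.segments.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # 1. check the memtable first
            return self.memtable[key]
        for seg in self.segments:         # 2. then newest -> oldest segments
            for k, v in seg:
                if k == key:
                    return v
                if k > key:
                    break                 # sorted: not in this segment
        return None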

Making an LSM-tree out of SSTables

This indexing structure, based on merging and compacting sorted files, is known as a Log-Structured Merge-Tree (LSM-Tree). The algorithm described here is the foundation of LevelDB and RocksDB, key-value storage engine libraries that are embedded into other applications; similar storage engines are used in Riak, Cassandra, and HBase. Lucene, the indexing engine used by Elasticsearch and Solr, employs a similar approach for its term dictionary in full-text search, merging sorted files in the background.

Performance optimizations

Making an LSM-tree perform well takes some care. A lookup can be slow when the key does not exist: you must check the memtable and then every segment back to the oldest before concluding the key is absent. Storage engines therefore often use Bloom filters, memory-efficient structures that can tell you when a key definitely does not appear in the database, saving many unnecessary disk reads. There are also different strategies for deciding the order and timing of compaction, most commonly size-tiered and leveled compaction. Despite these subtleties, the basic idea of LSM-trees (keeping a cascade of SSTables that are merged in the background) is simple and effective, even for datasets much bigger than memory. Because data is stored in sorted order, range queries are efficient; and because disk writes are sequential, LSM-trees can support remarkably high write throughput.
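
For illustration, a minimal Bloom filter sketch (the sizes and the salted SHA-256 hashing are arbitrary choices for this example; real engines tune the bit array and hash count for a target false-positive rate):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        # Derive k bit positions by hashing the key with k different salts.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False -> definitely absent (skip the disk); True -> maybe present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))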

c. B-Trees

B-Trees are a widely used indexing structure in databases. They keep key-value pairs sorted, enabling efficient lookups and range queries. Unlike log-structured indexes, B-Trees break the database into fixed-size blocks, usually 4 KB, aligning with the underlying hardware. This design allows reading or writing one page at a time.

Each page can be identified using an address or location, which allows one page to refer to another, similar to a pointer, but on disk instead of in memory. One page is designated as the root of the B-tree; whenever you want to look up a key in the index, you start at the root and follow the page references down to a leaf page.

The branching factor, the number of references to child pages in one page, is typically several hundred. Updating the value for an existing key involves finding the leaf page containing that key, changing the value in that page, and writing the page back to disk. Adding a new key may require splitting a page into two halves if there isn’t enough free space, and updating the parent page to account for the new subdivision. This algorithm keeps the tree balanced: a B-tree with n keys always has a depth of O(log n), so key retrieval stays efficient even in large databases.
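
A sketch of the lookup path, with pages modeled as in-memory tuples (in a real B-tree, each child reference would be an on-disk page address rather than a Python object):

import bisect

def btree_lookup(page, key):
    # A page is ("leaf", sorted_pairs) or ("interior", keys, children),
    # where children[i] covers keys below keys[i] and children[-1] the rest.
    if page[0] == "leaf":
        for k, v in page[1]:
            if k == key:
                return v
        return None
    _, keys, children = page
    i = bisect.bisect_right(keys, key)   # child whose key range contains key
    return btree_lookup(children[i], key)

# Root with two leaf pages; a branching factor of 2 for brevity.
leaves = [("leaf", [(100, "boston"), (150, "nyc")]),
          ("leaf", [(200, "london"), (250, "paris")])]
root = ("interior", [200], leaves)
btree_lookup(root, 250)                  # 'paris'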

Making B-trees reliable

In B-trees, the basic underlying write operation is to overwrite a page on disk with new data. Unlike log-structured indexes, which only ever append, B-trees modify files in place. Overwriting a page is a real hardware operation (on an SSD, a whole block must be erased and rewritten), and some operations require several pages to be overwritten at once: a page split must write the two halves as well as their parent page. This is dangerous, because a crash after only some of those pages have been written can leave the index corrupted.

To handle crashes, B-tree implementations often include a write-ahead log (WAL): an append-only file to which every modification is written before it is applied to the pages of the tree itself. After a crash, this log is used to restore the B-tree to a consistent state. Updating pages in place also requires careful concurrency control, typically using latches (lightweight locks), to avoid inconsistencies when multiple threads access the B-tree simultaneously. Log-structured approaches are simpler in this regard, because they do their merging in the background without disturbing incoming queries.
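
A sketch of the WAL discipline (the file format and names here are invented; the essential point is the ordering: append to the log and fsync before touching the tree’s pages):

import json, os

class WriteAheadLog:
    def __init__(self, path="btree.wal"):
        self.path = path

    def log(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())     # durable *before* any page overwrite

    def replay(self):
        # After a crash, re-apply logged records to restore consistency.
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]

wal = WriteAheadLog()
wal.log({"op": "put", "key": 42, "value": "san francisco"})
# ...only now is it safe to overwrite the affected B-tree pages...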

B-tree optimizations

Over the years, various optimizations have enhanced B-trees:

  1. Copy-on-Write Scheme: Instead of overwriting pages directly, some databases (e.g., LMDB) use a copy-on-write approach. Modified pages are written to new locations, and new versions of parent pages are created, pointing to these locations. This method aids crash recovery and concurrency control.
  2. Key Abbreviation: To save space, especially in interior tree pages, keys can be abbreviated, providing enough information for key range boundaries. This allows for a higher branching factor and fewer tree levels.
  3. Disk Layout Optimization: While pages can be anywhere on disk, some B-tree implementations organize leaf pages sequentially to improve query efficiency. This sequential order is challenging to maintain as the tree grows, unlike LSM-trees that rewrite large segments and can keep sequential keys close on disk.
  4. Additional Pointers: Leaf pages may have references to neighboring pages, facilitating ordered key scanning without returning to parent pages.
  5. B-Tree Variants: Innovations like fractal trees incorporate log-structured ideas to reduce disk seeks, despite having no connection to actual fractals.

d. Comparing B-Trees and LSM-Trees

B-tree implementations are generally more mature, while LSM-trees offer interesting performance characteristics. As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads: a read on an LSM-tree may have to check several different data structures and SSTables at different stages of compaction.

Advantages of LSM-trees

LSM-trees offer advantages over B-trees, particularly in write-heavy scenarios. A B-tree index must write every piece of data at least twice: once to the write-ahead log and once to the tree page itself. Repeatedly writing the same data over the database’s lifetime is known as write amplification, and it is a particular concern on SSDs, which can only overwrite blocks a limited number of times before wearing out. LSM-trees usually sustain higher write throughput, partly because they turn random writes into sequential ones.

Downsides of LSM-trees

LSM-trees have downsides too. The compaction process, although performed incrementally, can interfere with ongoing reads and writes: requests that have to wait for the disk to finish an expensive compaction operation occasionally see high response times at upper percentiles. Compaction also shares disk bandwidth with incoming writes, and if it cannot keep up, unmerged segments accumulate and reads slow down.

e. Other Indexing Structures

1. Introduction to Secondary Indexes:

  • Secondary indexes complement primary key indexes.
  • Common in relational databases, they enable efficient joins and queries on non-primary key columns.

2. Construction from Key-Value Indexes:

  • Secondary indexes can be derived from key-value indexes.
  • Non-unique keys are allowed, and values can be references or actual data.
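
As a sketch of point 2, here is a secondary index over a non-unique column, where each value maps to a postings list of matching primary keys (the table and column names are invented):

from collections import defaultdict

rows = {
    1: {"name": "alice", "city": "london"},
    2: {"name": "bob",   "city": "london"},
    3: {"name": "carol", "city": "paris"},
}

city_index = defaultdict(list)
for pk, row in rows.items():
    city_index[row["city"]].append(pk)   # value -> list of row identifiers

city_index["london"]                     # [1, 2]: non-unique keys are fine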

3. Clustered Indexes and Covering Indexes:

  • Clustered Index: Stores data directly within the index; MySQL’s InnoDB uses a clustered index for the primary key.
  • Covering Index: Stores some columns in the index, allowing certain queries to be answered using the index alone.

4. Multi-Column Indexing:

  • Concatenated indexes combine multiple fields into one key.
  • Useful for querying multiple columns simultaneously.
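
As a sketch of point 4: concatenating (last_name, first_name) into one key, as in a phone book. The combined key supports lookups by last name, or by last name plus first name, but not by first name alone (names invented):

import bisect

entries = sorted([("smith", "alice"), ("jones", "carol"), ("smith", "bob")])

# Range scan over everyone named smith, ordered by first name:
lo = bisect.bisect_left(entries, ("smith",))
hi = bisect.bisect_right(entries, ("smith", chr(0x10FFFF)))
entries[lo:hi]                # [('smith', 'alice'), ('smith', 'bob')]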

5. Multi-Dimensional Indexing:

  • Addressing geospatial data challenges.
  • R-trees and space-filling curves provide efficient solutions.
  • Applicable beyond geography, e.g., color ranges in an e-commerce database or weather observations.

6. Full-Text Search and Fuzzy Indexing:

  • Full-text search engines expand queries to include synonyms, ignore grammatical variations, and support proximity searches.
  • Fuzzy search deals with misspelled words; Lucene handles edit distances.
  • Techniques include in-memory finite state automata over the characters of keys (similar to tries), Levenshtein automata for edit-distance search, and machine learning for document classification.

7. In-Memory Databases:

  • Leverage decreasing RAM costs for performance gains.
  • Some databases, like VoltDB and Redis, offer in-memory options.
  • Different from caching solutions; durability achieved through snapshots, replication, or specialized hardware.
  • Performance advantages come from avoiding encoding overhead for disk storage.

8. Future Trends:

  • Exploration of in-memory database architectures for datasets larger than available memory.
  • Anti-caching approach for efficient memory management.
  • Considerations for non-volatile memory (NVM) technologies in future storage engine designs.

3. Transaction Processing or Analytics?

In the early days of business data processing, a “transaction” referred to a commercial activity like a sale or order. The term persisted even as databases expanded beyond monetary transactions.

Transaction Processing (OLTP):

  • Focus on low-latency reads and writes for interactive applications.
  • A transaction needn’t have ACID properties; transaction processing just means allowing clients to make low-latency reads and writes.
  • Common in areas involving business transactions.
  • Access pattern involves looking up a small number of records by key, typical of business applications.

Analytics (OLAP):

  • Involves scanning a large number of records and calculating aggregate statistics.
  • Queries written by business analysts for decision support.
  • Differentiated from OLTP as it deals with extensive data analysis.
  • Access pattern includes aggregating data for reports, e.g., total revenue, sales trends.

a. Data Warehousing

An enterprise may operate numerous transaction processing systems, each serving specific functions such as managing the website, point-of-sale systems, inventory tracking, route planning, supplier management, and employee administration. These systems, vital for business operations, operate autonomously, and their complexity requires dedicated maintenance teams.

OLTP Systems:

  • Multiple systems operating independently.
  • Highly available, low-latency transaction processing.
  • Closely guarded by database administrators.
  • Reluctance to run ad hoc analytic queries due to potential performance impact on transactions.

Data Warehouses:

  • Separate database allowing analysts to query without affecting OLTP.
  • Contains a read-only copy of data from various OLTP systems.
  • Data gets there via Extract-Transform-Load (ETL): it is extracted from OLTP systems, transformed into an analysis-friendly schema, cleaned up, and loaded into the warehouse.
  • Allows optimized analytics without disrupting OLTP operations.

Advantages of Data Warehouses:

  • Optimized for analytic access patterns.
  • Facilitates querying diverse OLTP systems.
  • Enables exploration of data without affecting transaction processing.

The separation between OLTP systems and data warehouses facilitates optimized analytics and efficient exploration of extensive datasets, addressing the unique needs of business intelligence in large enterprises.

b. Stars and Snowflakes: Schemas for Analytics

In analytics, where diverse data models are less common, data warehouses often adopt formulaic structures known as star schemas or dimensional modeling. This schema provides a structured approach to organize data for efficient analysis.

Star Schema:

  • Centralized around a fact table, representing events (e.g., customer purchases).
  • Each row in the fact table corresponds to an individual event.
  • Fact table includes attributes and foreign key references to dimension tables.
  • Dimensions represent various aspects of the event, such as who, what, where, when, how, and why.
  • Visual representation resembles a star, with the fact table at the center surrounded by dimension tables.

A variation of this template, the snowflake schema, breaks dimensions down further into subdimensions; snowflake schemas are more normalized, but star schemas are often preferred because they are simpler for analysts to work with.

(Figure: example of a star schema for use in a data warehouse.)

4. Column-Oriented Storage

Challenges:

  • Managing trillions of rows and petabytes in fact tables presents storage and query challenges.
  • Fact tables are often over 100 columns wide and require efficient storage for analytics.

Query Optimization:

  • Data warehouse queries access a subset of columns, making “SELECT *” queries rare.
  • The example query below touches only three of the fact table’s columns (date_key, product_sk, quantity).

SELECT
    dim_date.weekday,
    dim_product.category,
    SUM(fact_sales.quantity) AS quantity_sold
FROM fact_sales
    JOIN dim_date ON fact_sales.date_key = dim_date.date_key
    JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
WHERE
    dim_date.year = 2013
    AND dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
    dim_date.weekday, dim_product.category;

Column-Oriented Storage:

  • Differs from row-oriented storage by storing values from each column in separate files.
  • Allows queries to read and parse only relevant columns, reducing processing time.
  • The idea is easiest to picture in the relational data model, but it applies equally to non-relational data (e.g., Parquet, a columnar storage format).

Efficient Query Execution:

  • In row-oriented storage, the engine loads entire rows from disk, parses them, and discards the columns it does not need, even when a query touches only a few columns.
  • Column-oriented storage reads only the necessary columns, enhancing query performance.

Advantages:

  • Optimized for analytics, prioritizing relevant columns.
  • Reduces unnecessary data loading, improving query efficiency.
  • Enables reassembly of entire rows when needed.

Column-oriented storage addresses challenges in processing large fact tables, optimizing storage for analytics by prioritizing relevant columns and improving query performance.
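
As a toy sketch of the difference between the two layouts (values invented; the column names follow the query above):

# Row-oriented: each row's values are stored together.
rows = [
    {"date_key": 140102, "product_sk": 69, "quantity": 1},
    {"date_key": 140102, "product_sk": 74, "quantity": 3},
]

# Column-oriented: one sequence per column, all in the same row order,
# so row i can be reassembled from the i-th entry of every column file.
columns = {
    "date_key":   [140102, 140102],
    "product_sk": [69, 74],
    "quantity":   [1, 3],
}

# A query touching only `quantity` reads one small file, not every row:
total_quantity = sum(columns["quantity"])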

a. Column Compression

Column-oriented storage lends itself well to compression, because the sequences of values in a column are often repetitive. One effective technique is bitmap encoding: the number of distinct values in a column is often small compared to the number of rows, so each distinct value can be represented by a separate bitmap, with a 1 or 0 per row indicating whether that row has the value.
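
A sketch of bitmap encoding for a column with few distinct values (data invented):

column = ["fruit", "candy", "fruit", "fruit", "candy"]

bitmaps = {}
for i, value in enumerate(column):
    bitmaps.setdefault(value, [0] * len(column))[i] = 1

bitmaps["fruit"]   # [1, 0, 1, 1, 0]

# A predicate like `value IN ('fruit', 'candy')` becomes a bitwise OR of
# the two bitmaps; ANDing bitmaps from different columns combines predicates.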

b. Sort Order in Column Storage

In a column store, rows are usually stored in the order they were inserted, but we can choose to impose a sort order and use it as an indexing mechanism. Sorting each column independently would not work, because we would lose track of which items in different columns belong to the same row. Instead, rows must be sorted as a whole, and administrators choose the sort columns based on common query patterns: sorting by, say, date_key as the first sort key accelerates queries that target specific date ranges.

Sorted order not only aids query optimization but also facilitates compression, especially for the primary sort column with fewer distinct values. Techniques like run-length encoding efficiently compress long sequences of repeated values in this column. The compression effect is strongest on the first sort key, making it space-efficient. However, subsequent sort keys may not compress as well.
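
A sketch of run-length encoding, the technique mentioned above, applied to a sorted first-sort-key column (data invented):

def run_length_encode(values):
    runs = []                            # list of [value, count] pairs
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A sorted date column compresses to one run per distinct date:
run_length_encode(["2024-01-01"] * 3 + ["2024-01-02"] * 2)
# [['2024-01-01', 3], ['2024-01-02', 2]]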

A clever extension, used in C-Store and adopted in the commercial Vertica, is to store the same data sorted in several different ways, so that different sort orders can serve different query patterns. This is similar to having multiple secondary indexes in a row-oriented store, except that a column store has no pointers to data elsewhere: each sort order contains the actual column values.

c. Writing to Column-Oriented Storage

In data warehouses, optimizations like column-oriented storage, compression, and sorting boost the speed of large read-only queries, but they complicate write operations. Unlike B-trees’ update-in-place approach, compressed columns make inserting a row in a sorted table challenging, often requiring rewriting all column files. LSM-trees offer a solution: writes first go to an in-memory store, where they’re sorted and readied for disk storage. Whether row-oriented or column-oriented, this in-memory store acts as a buffer. When enough writes accumulate, they merge with existing column files on disk and are written in bulk, a process effectively utilized by Vertica.

Queries in these systems must consider both on-disk column data and recent in-memory writes, combining them for accurate results. Despite this complexity, the query optimizer shields users, allowing analysts to perceive immediate reflection of data modifications (inserts, updates, deletes) in subsequent queries.

d. Aggregation: Data Cubes and Materialized Views

In the realm of data warehouses, not all are exclusively column stores; traditional row-oriented databases and other architectures are still in use. However, the efficiency of ad hoc analytical queries has led to the growing popularity of columnar storage. Another significant aspect is materialized aggregates, addressing the inefficiency of recalculating aggregate functions like COUNT, SUM, AVG, MIN, or MAX for multiple queries. Materialized views act as caches, providing actual copies of frequently used query results stored on disk.

Materialized views, unlike virtual views, are denormalized copies of query results and require updates when underlying data changes. While these updates increase write costs, they prove valuable in read-heavy data warehouses. One common form of materialized view is a data cube or OLAP cube, representing aggregates arranged in a grid by different dimensions. For instance, a two-dimensional data cube might group aggregates by date and product, allowing rapid querying along these dimensions.
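
A sketch of a two-dimensional data cube aggregating the sum of price by date and product (data and names invented):

from collections import defaultdict

sales = [
    ("2024-01-01", "apple", 3.0),
    ("2024-01-01", "candy", 1.5),
    ("2024-01-02", "apple", 2.0),
]

cube = defaultdict(float)
for date, product, price in sales:
    cube[(date, product)] += price       # one cell per (date, product)

# Summing along one dimension answers "total per date" without
# re-scanning the raw rows:
sum(v for (d, _), v in cube.items() if d == "2024-01-01")   # 4.5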

In more complex scenarios, facts in data cubes may have multiple dimensions. Each cell in a hypercube represents the sales for a specific combination of dimensions. While materialized data cubes enable swift precomputed queries, they lack the flexibility of querying raw data. Maintaining as much raw data as possible is a common practice in data warehouses, with materialized aggregates like data cubes employed selectively to enhance the performance of specific queries.

5. Conclusion

In this chapter of “Designing Data-Intensive Applications,” we explored fundamental aspects of storage and retrieval in databases. We covered storage engine families, including log-structured and page-oriented (B-trees), delving into hash indexes, SSTables, LSM-Trees, and B-trees. The discussion extended to writing in column-oriented storage, materialized aggregates like data cubes, and a comparison between B-trees and LSM-Trees.

We also examined various indexing structures, transaction processing (OLTP) versus analytics (OLAP), and the role of data warehousing. The chapter concluded with insights into column-oriented storage challenges, compression techniques, and the impact of sort order on query optimization.

Overall, the chapter provided a comprehensive overview of storage and retrieval mechanisms, setting the stage for deeper exploration in subsequent chapters.

If you found this chapter helpful, give it a round of applause! 👏

--

Sunny, Lee
Sunny’s Life in CMU 🔅

CMU Master of Software Engineering student who loves outdoor activities