1. Introduction

Data models play a crucial role in software development: they influence not only how software is written but also how we think about the problem being solved. Applications are built by layering one data model on top of another, from real-world entities down to bytes on hardware, with each layer hiding the complexity of the layers below it. Every data model embodies different assumptions and performs well for some uses and poorly for others, so choosing the right one matters. This chapter explores general-purpose models for storage and querying, namely the relational, document, and graph-based models, along with their query languages and some storage engine implementation details.

2. Relational Model Versus Document Model

The relational model, proposed by Edgar Codd in 1970, serves as the foundation of SQL: data is organized into relations (tables). Initially met with skepticism, it gained popularity by the mid-1980s for its ability to hide implementation details behind a cleaner interface. Relational databases originated in business data processing, particularly transaction processing (sales, banking) and batch processing (customer invoicing, payroll). They have endured waves of competition and remain crucial for many web applications, such as online publishing and social networking.

a. The Birth of NoSQL

In the 2010s, NoSQL emerged as a challenge to the dominance of the relational model. The term “NoSQL” originated as a catchy hashtag for a 2009 meetup on open source, distributed, nonrelational databases. Despite its name, NoSQL doesn’t refer to a specific technology and has been retroactively interpreted as “Not Only SQL.” Various database systems are associated with the #NoSQL hashtag, driven by factors like the need for greater scalability, preference for open source software, specialized query operations, and frustration with the restrictiveness of relational schemas. Different applications have different requirements, leading to the coexistence of relational databases and non-relational datastores — a concept known as polyglot persistence.

b. The Object-Relational Mismatch

The Object-Relational Mismatch is a challenge in application development, especially in object-oriented programming languages using the SQL data model. Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate ease the translation layer between application objects and relational databases but don’t fully eliminate the mismatch.
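To make the mismatch concrete, here is a minimal sketch (all names are illustrative, not from any particular ORM) of the translation an ORM performs: a single in-memory object with a nested array has to be flattened into rows for separate tables, linked by a foreign key.

```javascript
// Hypothetical in-memory profile object with a nested one-to-many array.
const profile = {
  userId: 251,
  firstName: "Bill",
  positions: [
    { jobTitle: "Co-chair", organization: "Bill & Melinda Gates Foundation" },
    { jobTitle: "Co-founder, Chairman", organization: "Microsoft" }
  ]
};

// Roughly what an ORM does under the hood: one row per table,
// with nested items split out and linked back by a foreign key.
const userRow = { user_id: profile.userId, first_name: profile.firstName };
const positionRows = profile.positions.map(p => ({
  user_id: profile.userId, // foreign key back to the users table
  job_title: p.jobTitle,
  organization: p.organization
}));

console.log(positionRows.length); // 2 rows for one nested array
```

One object becomes three rows across two tables; reassembling the object on read requires a join, which is the awkward translation layer ORMs try to hide.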

For example, representing a LinkedIn profile in a relational schema involves addressing the one-to-many relationships, such as jobs, education, and contact information. Traditional SQL models normalize these into separate tables, while later versions support structured datatypes, XML, and JSON for more flexibility.

For a data structure like a résumé, a JSON representation can be quite appropriate:

{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id": "us:91",
  "industry_id": 131,
  "photo_url": "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {
      "job_title": "Co-chair",
      "organization": "Bill & Melinda Gates Foundation"
    },
    {
      "job_title": "Co-founder, Chairman",
      "organization": "Microsoft"
    }
  ],
  "education": [
    {"school_name": "Harvard University", "start": 1973, "end": 1975},
    {"school_name": "Lakeside School, Seattle", "start": null, "end": null}
  ],
  "contact_info": {
    "blog": "http://thegatesnotes.com",
    "twitter": "http://twitter.com/BillGates"
  }
}

Some developers favor the JSON representation for a self-contained document like a résumé, as it reduces the impedance mismatch between application code and the storage layer. However, JSON has its own problems as a data encoding format, including the lack of a schema, which are discussed in Chapter 4. Despite these issues, the JSON representation has better locality than a multi-table schema: all the relevant information is in one place, retrievable with a single query rather than multiple queries or joins.

The one-to-many relationships from the user profile to the user’s positions, educational history, and contact information imply a tree structure in the data, and the JSON representation makes this tree structure explicit.

c. Many-to-One and Many-to-Many Relationships

Data started out being represented as one big tree (the hierarchical model), but that wasn’t good for representing many-to-many relationships, so the relational model was invented to solve that problem.

For example, consider some changes we could make to the résumé example: treating organizations and schools as entities in their own right, and adding recommendations written by other users.

The figure below illustrates how these new features require many-to-many relationships.

d. Relational Versus Document Databases Today

The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.

Schema-on-read: in a document database, you would just start writing new documents with the new fields (the schema is implicit and only interpreted when the data is read, analogous to dynamic type checking) and have code in the application that handles the case when old documents are read. For example:

if (user && user.name && !user.first_name) {
  // Documents written before Dec 8, 2013 don't have first_name
  user.first_name = user.name.split(" ")[0];
}

On the other hand, in a “statically typed” database schema, you would typically perform a migration along the lines of:

ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1); -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL

A hybrid of the relational and document models is a good route for databases to take in the future.

3. Query Languages for Data

SQL is a declarative query language, whereas IMS and CODASYL queried the database using imperative code.

Imperative:

function getSharks() {
  var sharks = [];
  for (var i = 0; i < animals.length; i++) {
    if (animals[i].family === "Sharks") {
      sharks.push(animals[i]);
    }
  }
  return sharks;
}

Declarative:

SELECT * FROM animals WHERE family = 'Sharks';

An imperative language tells the computer to perform certain operations in a certain order. You can imagine stepping through the code line by line, evaluating conditions, updating variables, and deciding whether to go around the loop one more time.

In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want. It is up to the database system’s query optimizer to decide which indexes and which join methods to use, and in which order to execute various parts of the query.
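The same contrast exists inside JavaScript itself. As a sketch (assuming a hypothetical `animals` array shaped like the one in the imperative example), a higher-order function such as `filter` lets you state *what* you want and leave the iteration to the runtime, closer in spirit to `SELECT ... WHERE` than to a hand-written loop:

```javascript
// Hypothetical data matching the shape used in the imperative example above.
const animals = [
  { name: "Great White Shark", family: "Sharks" },
  { name: "Blue Whale", family: "Whales" },
  { name: "Tiger Shark", family: "Sharks" }
];

// Declare the pattern (family === "Sharks"); the runtime decides how to iterate.
const sharks = animals.filter(a => a.family === "Sharks");

console.log(sharks.map(a => a.name)); // ["Great White Shark", "Tiger Shark"]
```

The declarative version is not only shorter; because it does not pin down an execution order, the runtime (or a database optimizer, in SQL's case) is free to choose a better one.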

a. Declarative Queries on the Web

<ul>
<li class="selected">
<p>Sharks</p>
<ul>
<li>Great White Shark</li>
<li>Tiger Shark</li>
<li>Hammerhead Shark</li>
</ul>
</li>
<li>
<p>Whales</p>
<ul>
<li>Blue Whale</li>
<li>Humpback Whale</li>
<li>Fin Whale</li>
</ul>
</li>
</ul>

Now say you want the title of the currently selected page to have a blue background, so that it is visually highlighted. This is easy, using CSS:

li.selected > p {
  background-color: blue;
}

b. MapReduce Querying

MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google.

MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework. It is based on the map (collect) and reduce (fold or inject) functions that exist in many functional programming languages.
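Those functional building blocks can be seen in plain JavaScript arrays, as a small sketch: `map` transforms each element independently, and `reduce` folds a list of values down to one.

```javascript
// map: transform each element independently (parallelizable per element).
const nums = [3, 4];
const doubled = nums.map(n => n * 2); // [6, 8]

// reduce: fold the list into a single value, starting from 0.
const total = nums.reduce((acc, n) => acc + n, 0); // 7

console.log(doubled, total);
```

MapReduce applies the same two ideas at scale: the map step runs independently on each record (so it can be spread across machines), and the reduce step combines the per-key results.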

To give an example, imagine you are a marine biologist, and you add an observation record to your database every time you see animals in the ocean. Now you want to generate a report saying how many sharks you have sighted per month.

In PostgreSQL you might express that query like this:

SELECT date_trunc('month', observation_timestamp) AS observation_month,
       sum(num_animals) AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;

The same can be expressed with MongoDB’s MapReduce feature as follows:

db.observations.mapReduce(
  function map() {
    var year = this.observationTimestamp.getFullYear();
    var month = this.observationTimestamp.getMonth() + 1;
    emit(year + "-" + month, this.numAnimals);
  },
  function reduce(key, values) {
    return Array.sum(values);
  },
  {
    query: { family: "Sharks" },
    out: "monthlySharkReport"
  }
);

For example, say the observations collection contains these two documents:

{
  observationTimestamp: Date.parse("Mon, 25 Dec 1995 12:34:56 GMT"),
  family: "Sharks",
  species: "Carcharodon carcharias",
  numAnimals: 3
},
{
  observationTimestamp: Date.parse("Tue, 12 Dec 1995 16:17:18 GMT"),
  family: "Sharks",
  species: "Carcharias taurus",
  numAnimals: 4
}

The map function would be called once for each document, resulting in emit("1995-12", 3) and emit("1995-12", 4). Subsequently, the reduce function would be called with reduce("1995-12", [3, 4]), returning 7.
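What the framework does between those two calls can be simulated in a few lines of plain JavaScript, as a minimal sketch: run map once per matching document, group the emitted values by key, then run reduce once per key.

```javascript
// Two observation documents, mirroring the example above.
const observations = [
  { observationTimestamp: new Date("1995-12-25T12:34:56Z"), family: "Sharks", numAnimals: 3 },
  { observationTimestamp: new Date("1995-12-12T16:17:18Z"), family: "Sharks", numAnimals: 4 }
];

// emit() collects values into per-key groups, as the framework would.
const groups = new Map();
function emit(key, value) {
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key).push(value);
}

// Map phase: called once per document that passes the query filter.
for (const doc of observations.filter(d => d.family === "Sharks")) {
  const year = doc.observationTimestamp.getUTCFullYear();
  const month = doc.observationTimestamp.getUTCMonth() + 1;
  emit(year + "-" + month, doc.numAnimals);
}

// Reduce phase: called once per distinct key with all its values.
const report = {};
for (const [key, values] of groups) {
  report[key] = values.reduce((a, b) => a + b, 0);
}

console.log(report); // { "1995-12": 7 }
```

The grouping step in the middle is exactly what makes the reduce function's contract work: it always receives every value emitted under one key.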

A usability problem with MapReduce is that you have to write two carefully coordinated JavaScript functions, which is often harder than writing a single query. Moreover, a declarative query language offers more opportunities for a query optimizer to improve the performance of a query. For these reasons, MongoDB 2.2 added support for a declarative query language called the aggregation pipeline.

db.observations.aggregate([
  { $match: { family: "Sharks" } },
  { $group: {
      _id: {
        year:  { $year:  "$observationTimestamp" },
        month: { $month: "$observationTimestamp" }
      },
      totalAnimals: { $sum: "$numAnimals" }
  } }
]);

4. Graph-Like Data Models

If your application has mostly one-to-many relationships or no relationships between records, the document model is appropriate. What if many-to-many relationships are very common in your data?

  • The relational model can handle simple cases of many-to-many relationships.
  • As the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.

A graph consists of two kinds of objects: vertices (nodes or entities) and edges (relationships or arcs). Many kinds of data can be modeled as a graph. Typical examples include: Social graphs, The web graph and Road or rail networks.
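As a tiny sketch (ids and labels are illustrative), a graph can be held in memory as two plain arrays, one for vertices and one for edges, mirroring the split that the relational representation below also uses:

```javascript
// Vertices carry an id plus arbitrary key-value properties.
const vertices = [
  { id: 1, properties: { name: "Lucy", type: "person" } },
  { id: 2, properties: { name: "Idaho", type: "state" } }
];

// Edges point from a tail vertex to a head vertex, with a label.
const edges = [
  { id: 10, tail: 1, head: 2, label: "born_in", properties: {} }
];

// The outgoing edges of a vertex are simply the edges whose tail is that vertex.
const outgoing = edges.filter(e => e.tail === 1);
console.log(outgoing[0].label); // "born_in"
```

Because edges are first-class records rather than fields inside a vertex, any vertex can be connected to any other, which is what makes the model flexible for many-to-many data.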

We will use the example shown in the figure below. It could be taken from a social network or a genealogical database: it shows two people, Lucy from Idaho and Alain from Beaune, France. They are married and living in London.

There are several different ways of structuring and querying data in graphs. In this section we will discuss the property graph model and the triple-store model. We will look at three declarative query languages for graphs: Cypher, SPARQL, and Datalog.

a. Property Graphs

In the property graph model, each vertex consists of:

  • A unique identifier
  • A set of outgoing edges
  • A set of incoming edges
  • A collection of properties (key-value pairs)

Each edge consists of:

  • A unique identifier
  • The vertex at which the edge starts (the tail vertex)
  • The vertex at which the edge ends (the head vertex)
  • A label to describe the kind of relationship between the two vertices
  • A collection of properties (key-value pairs)

You can think of a graph store as consisting of two relational tables, one for vertices and one for edges:

CREATE TABLE vertices (
  vertex_id  integer PRIMARY KEY,
  properties json
);

CREATE TABLE edges (
  edge_id     integer PRIMARY KEY,
  tail_vertex integer REFERENCES vertices (vertex_id),
  head_vertex integer REFERENCES vertices (vertex_id),
  label       text,
  properties  json
);

CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);

Those features give graphs a great deal of flexibility for data modeling.

b. The Cypher Query Language

Cypher is a declarative query language for property graphs, created for the Neo4j graph database.

CREATE
(NAmerica:Location {name:'North America', type:'continent'}),
(USA:Location {name:'United States', type:'country' }),
(Idaho:Location {name:'Idaho', type:'state' }),
(Lucy:Person {name:'Lucy' }),
(Idaho) -[:WITHIN]-> (USA) -[:WITHIN]-> (NAmerica),
(Lucy) -[:BORN_IN]-> (Idaho)

A Cypher query to find people who emigrated from the US to Europe:

MATCH
(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (us:Location {name:'United States'}),
(person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (eu:Location {name:'Europe'})
RETURN person.name

c. Graph Queries in SQL

The same query as the previous example, expressed in SQL using recursive common table expressions:

WITH RECURSIVE
-- in_usa is the set of vertex IDs of all locations within the United States
in_usa(vertex_id) AS (
SELECT vertex_id FROM vertices WHERE properties->>'name' = 'United States'
UNION
SELECT edges.tail_vertex FROM edges
JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
WHERE edges.label = 'within'
),
-- in_europe is the set of vertex IDs of all locations within Europe
in_europe(vertex_id) AS (
SELECT vertex_id FROM vertices WHERE properties->>'name' = 'Europe'
UNION
SELECT edges.tail_vertex FROM edges
JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
WHERE edges.label = 'within'
),
-- born_in_usa is the set of vertex IDs of all people born in the US
born_in_usa(vertex_id) AS (
SELECT edges.tail_vertex FROM edges
JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
WHERE edges.label = 'born_in'
),
-- lives_in_europe is the set of vertex IDs of all people living in Europe
lives_in_europe(vertex_id) AS (
SELECT edges.tail_vertex FROM edges
JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
WHERE edges.label = 'lives_in'
)

SELECT vertices.properties->>'name'
FROM vertices

-- join to find those people who were both born in the US *and* live in Europe
JOIN born_in_usa ON vertices.vertex_id = born_in_usa.vertex_id
JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id;

If the same query can be written in 4 lines in one query language but requires 29 lines in another, that just shows that different data models are designed to satisfy different use cases. It’s important to pick a data model that is suitable for your application.

d. Triple-Stores and SPARQL

The triple-store model is mostly equivalent to the property graph model, using different words to describe the same ideas. In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object.
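To make that shape concrete, here is a naive sketch of a triple-store as an array of `[subject, predicate, object]` arrays with a simple lookup (names and helper are illustrative, not any real triple-store API):

```javascript
// All information as (subject, predicate, object) statements.
const triples = [
  ["lucy",  "a",      "Person"],
  ["lucy",  "name",   "Lucy"],
  ["lucy",  "bornIn", "idaho"],
  ["idaho", "name",   "Idaho"]
];

// Find every object stored for a given (subject, predicate) pair.
function objectsOf(subject, predicate) {
  return triples
    .filter(([s, p]) => s === subject && p === predicate)
    .map(([, , o]) => o);
}

console.log(objectsOf("lucy", "bornIn")); // ["idaho"]
```

Note that the object can be either a plain value (like the name "Lucy") or a reference to another subject (like "idaho"), which is how triples form a graph.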

The same data, represented as Turtle triples:

@prefix : <urn:example:>.
_:lucy a :Person.
_:lucy :name "Lucy".
_:lucy :bornIn _:idaho.
_:idaho a :Location.
_:idaho :name "Idaho".
_:idaho :type "state".
_:idaho :within _:usa.
_:usa a :Location.
_:usa :name "United States".
_:usa :type "country".
_:usa :within _:namerica.
_:namerica a :Location.
_:namerica :name "North America".
_:namerica :type "continent".

The same data in the RDF/XML syntax:

<rdf:RDF xmlns="urn:example:"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Location rdf:nodeID="idaho">
    <name>Idaho</name>
    <type>state</type>
    <within>
      <Location rdf:nodeID="usa">
        <name>United States</name>
        <type>country</type>
        <within>
          <Location rdf:nodeID="namerica">
            <name>North America</name>
            <type>continent</type>
          </Location>
        </within>
      </Location>
    </within>
  </Location>

  <Person rdf:nodeID="lucy">
    <name>Lucy</name>
    <bornIn rdf:nodeID="idaho"/>
  </Person>
</rdf:RDF>

The same emigration query, expressed in the SPARQL query language:

PREFIX : <urn:example:>
SELECT ?personName WHERE {
?person :name ?personName.
?person :bornIn / :within* / :name "United States".
?person :livesIn / :within* / :name "Europe".
}

The structure is very similar.

(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (location)   # Cypher
?person :bornIn / :within* ?location.                   # SPARQL

(usa {name:'United States'})                            # Cypher
?usa :name "United States".                             # SPARQL

e. The Foundation: Datalog

Datalog’s data model is similar to the triple-store model, generalized a bit. Instead of writing a triple as (subject, predicate, object), we write it as predicate(subject, object).

name(namerica, 'North America').
type(namerica, continent).

name(usa, 'United States').
type(usa, country).
within(usa, namerica).

name(idaho, 'Idaho').
type(idaho, state).
within(idaho, usa).

name(lucy, 'Lucy').
born_in(lucy, idaho).

Query:

within_recursive(Location, Name) :- name(Location, Name). /* Rule 1 */

within_recursive(Location, Name) :- within(Location, Via), /* Rule 2 */
                                    within_recursive(Via, Name).

migrated(Name, BornIn, LivingIn) :- name(Person, Name), /* Rule 3 */
                                    born_in(Person, BornLoc),
                                    within_recursive(BornLoc, BornIn),
                                    lives_in(Person, LivingLoc),
                                    within_recursive(LivingLoc, LivingIn).

?- migrated(Who, 'United States', 'Europe').
/* Who = 'Lucy'. */

Cypher and SPARQL jump in right away with SELECT, but Datalog takes a small step at a time. Rules can be combined and reused in different queries. It’s less convenient for simple one-off queries, but it can cope better if your data is complex.
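The recursive heart of the query, the within_recursive rules, can be sketched in plain JavaScript to show what the two rules compute (the fact tables below mirror the Datalog facts; the function names are illustrative):

```javascript
// within(X, Y) facts: each location and the location directly containing it.
const within = { idaho: "usa", usa: "namerica" };
// name(X, N) facts.
const names = { idaho: "Idaho", usa: "United States", namerica: "North America" };

// within_recursive(Location, Name) holds if Location's own name is Name
// (Rule 1), or if Location is within some Via for which it holds (Rule 2).
function withinRecursive(location, name) {
  if (names[location] === name) return true;               // Rule 1
  const via = within[location];
  return via !== undefined && withinRecursive(via, name);  // Rule 2
}

console.log(withinRecursive("idaho", "United States")); // true
console.log(withinRecursive("idaho", "Europe"));        // false
```

A real Datalog engine evaluates the rules bottom-up over sets of facts rather than by recursive calls, but the fixed point it computes is the same relation this function tests.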

5. Conclusion

This chapter provides a thorough overview of different data models, tracing their evolution from hierarchical to relational and the recent rise of NoSQL databases.

Document databases focus on self-contained documents, while graph databases target scenarios with extensive potential relationships.

As developers navigate the intricacies of data modeling, the emphasis remains on aligning the chosen model with the specific needs of their applications, recognizing that each model brings its own strengths and trade-offs to the development landscape.


Dive into the next chapter on “Storage and Retrieval” for deeper insights into the evolving landscape of data modeling.

Sunny, Lee
Sunny’s Life in CMU 🔅