1. Introduction

This chapter focuses on the inevitability of change in applications and the importance of designing systems with adaptability in mind. It explores how modifications to application features often require corresponding adjustments to the data they store. The chapter discusses strategies for handling code changes, emphasizing the challenges in large applications, where server-side code is rolled out gradually and client-side updates depend on users installing them. Key concepts introduced include backward compatibility (newer code can read data written by older code) and forward compatibility (older code can read data written by newer code). The chapter then explores various data encoding formats and their role in handling schema changes, particularly in the context of web services, REST, RPC, and message-passing systems. Overall, Chapter 4 provides valuable insights into managing data evolution and compatibility during the development of data-intensive applications.

2. Formats for Encoding Data

Programs usually work with data in (at least) two different representations:

1. In-Memory Data Representation:

  • Data resides in objects, structs, lists, arrays, hash tables, trees, etc.
  • Optimized for efficient CPU access and manipulation, often utilizing pointers.

2. External Data Representation:

  • To write data to a file or transmit over the network, it must be encoded as a self-contained sequence of bytes.
  • Example: Encoding as a JSON document.
  • Due to the impracticality of pointers in external processes, the byte sequence representation differs significantly from in-memory data structures.

Thus, we need some kind of translation between the two representations. The process of translating from in-memory representation to a byte sequence is called encoding (serialization or marshalling). Conversely, converting a byte sequence back to in-memory representation is termed decoding (parsing, deserialization, or unmarshalling).
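
As a minimal illustration in Python (any encoding would do; the standard json module is used here purely to show the two directions of translation):

import json

# In-memory representation: a dict containing a list, held together by pointers
record = {"userName": "Martin", "favoriteNumber": 1337, "interests": ["daydreaming", "hacking"]}

# Encoding (serialization): translate to a self-contained byte sequence
encoded = json.dumps(record).encode("utf-8")

# Decoding (deserialization): translate the bytes back into in-memory objects
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == record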

There are myriad libraries and encoding formats to choose from. Let’s do a brief overview.

a. Language-Specific Formats

Many programming languages provide built-in support for encoding in-memory objects into byte sequences.

Examples include:

  • Java: java.io.Serializable
  • Ruby: Marshal
  • Python: pickle
  • Third-party libraries also exist, like Kryo for Java.
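
As a quick illustration in Python, the built-in pickle module round-trips an in-memory object in two calls, which is exactly what makes these formats convenient despite the problems listed below:

import pickle

person = {"userName": "Martin", "interests": ["daydreaming", "hacking"]}

data = pickle.dumps(person)      # encode to a Python-specific byte sequence
restored = pickle.loads(data)    # decode; never call this on untrusted input,
assert restored == person        # since unpickling can execute arbitrary code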

Challenges with Language-Specific Encoding:

1. Tied to Language:

  • Encoding is often specific to a programming language.
  • Reading data in another language becomes challenging, hindering interoperability.

2. Security Concerns:

  • Decoding process often requires instantiating arbitrary classes, posing security risks.
  • Attackers exploiting decoding may execute arbitrary code.

3. Limited Versioning Support:

  • Versioning data is often neglected in these libraries.
  • Challenges of forward and backward compatibility are not adequately addressed.

4. Efficiency Issues:

  • Encoding libraries may lack efficiency in terms of CPU time and encoded structure size.
  • Example: Java’s built-in serialization is criticized for poor performance and bloated encoding.

b. JSON, XML, and Binary Variants

Turning to widely used standardized encodings, JSON and XML are the prominent options. Both face criticism: XML is often seen as verbose and overly complex, while JSON, despite its popularity thanks to browser support and simplicity, is not universally liked either. CSV is another language-independent format, though it is less powerful than JSON and XML.

Binary encoding:

When your organization uses data only internally, you have more flexibility in choosing a format that suits your needs, like one that is more compact or quicker to process. While JSON is less wordy than XML, both still take up a lot of space compared to binary formats. This realization has led to the creation of various binary encodings for JSON (like MessagePack, BSON, UBJSON) and XML (like WBXML, Fast Infoset). These binary formats are used in specific areas, but they haven’t become as widely accepted as the text versions of JSON and XML. The choice of data format becomes crucial as the dataset size grows, especially into the terabytes.

Some of these formats expand the types of data they support, such as distinguishing between integers and floating-point numbers or adding binary string support. However, they maintain the basic JSON/XML data model, especially by not requiring a schema. This means they include all the object field names within the encoded data.

For instance, consider the following JSON document, which encodes a user record:

{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}

Let’s take MessagePack, a binary encoding for JSON, as an example. The encoded byte sequence begins with a marker indicating an object and its number of fields, followed by each field name and value. The binary encoding comes to 66 bytes, only slightly shorter than the 81 bytes of the textual JSON encoding (with whitespace removed). This small space saving, and a possible parsing speedup, comes at the cost of human readability. Subsequent sections explore encodings that can represent the same record in as little as 32 bytes.
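
As a rough, hedged comparison (assuming the third-party msgpack package is installed), the size difference can be observed directly:

import json
import msgpack  # third-party package: pip install msgpack

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")
msgpack_bytes = msgpack.packb(record)

print(len(json_bytes), len(msgpack_bytes))  # MessagePack comes out somewhat shorter
assert msgpack.unpackb(msgpack_bytes) == record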

c. Thrift and Protocol Buffers

Apache Thrift and Protocol Buffers (protobuf) are binary encoding libraries founded on the same principle, originally developed at Facebook and Google, respectively, and later open-sourced in 2007–08.

Both Thrift and Protocol Buffers necessitate a schema for encoding data. In Thrift, the schema is described using the Thrift Interface Definition Language (IDL), while Protocol Buffers use a similar schema definition language. Code generation tools provided by Thrift and Protocol Buffers generate classes in various programming languages based on these schemas.

Thrift:

struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}

Protocol Buffers:

message Person {
  required string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
}

Thrift and Protocol Buffers offer a code generation tool for schema-based class implementation across multiple languages. Thrift has two binary encoding formats: BinaryProtocol and CompactProtocol, each producing different encoded sizes. In the given example, encoding with Thrift’s BinaryProtocol results in 59 bytes, while CompactProtocol significantly reduces this to 34 bytes by using bit packing and variable-length integers.

Protocol Buffers, with a single binary encoding format, encodes the same data in 33 bytes, using similar principles to Thrift’s CompactProtocol.

Notably, the schemas distinguish between required and optional fields, although this distinction doesn’t affect how the fields are encoded. The primary difference lies in enabling a runtime check for required fields, aiding in bug detection.

Field tags and schema evolution:

In Thrift and Protocol Buffers, encoded records consist of concatenated fields identified by tag numbers and annotated with datatypes. Changing a field’s name is allowed, but altering its tag is not, as it would invalidate existing encoded data.

For schema evolution, both Thrift and Protocol Buffers permit adding new fields, provided each new field gets a fresh tag number; old code simply skips tags it doesn’t recognize, which preserves forward compatibility. Backward compatibility is maintained as long as every field added after the initial schema deployment is optional or has a default value, so that new code can still read records written without it.

Removing a field follows a similar logic, but only optional fields can be removed, and reusing the same tag number is not allowed to preserve compatibility with potential existing data.

Datatypes and schema evolution:

Changing the datatype of a field in Thrift or Protocol Buffers may be possible, but it comes with risks, such as potential loss of precision or truncation. For instance, transforming a 32-bit integer into a 64-bit integer can result in truncation if old code attempts to read data written by new code. The new code can handle data written by old code by filling in missing bits with zeros.

Protocol Buffers uses a repeated marker instead of a list or array datatype. Changing an optional (single-valued) field to a repeated (multi-valued) field is acceptable. New code reading old data perceives a list with zero or one element, while old code reading new data sees only the last element of the list.

Thrift, on the other hand, employs a dedicated list datatype, which, although not allowing the same evolution from single-valued to multi-valued as Protocol Buffers, supports nested lists.

d. Avro

Apache Avro is a distinct binary encoding format introduced in 2009 as a subproject of Hadoop, originating from the realization that Thrift wasn’t well-suited for Hadoop’s use cases.

Avro uses a schema to define the encoded data structure, offering two schema languages: Avro IDL for human editing and a JSON-based language for machine readability. The example schema, written in Avro IDL, looks like this:

record Person {
  string userName;
  union { null, long } favoriteNumber = null;
  array<string> interests;
}

The equivalent JSON representation of the schema is as follows:

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}

Unlike Thrift and Protocol Buffers, Avro does not use tag numbers in the schema. Encoding our example record with this schema produces a compact 32-byte Avro binary encoding, the most compact of the encodings discussed so far.

Examining the byte sequence reveals no identifiers for fields or datatypes; it consists simply of concatenated values. Parsing this binary data requires knowledge of the schema to determine the datatype of each field. Consequently, correct decoding is possible only if the reader and writer code share the exact schema. Any mismatch in the schema between the reader and the writer could lead to incorrectly decoded data.

Avro supports schema evolution, allowing the schema to change over time; the following sections explain how.

The writer’s schema and the reader’s schema:

In Avro, when encoding data, an application uses the writer’s schema, which represents the version of the schema it knows about, for tasks such as writing to a file or database or sending data over the network. Conversely, when decoding data, the application relies on the reader’s schema, the schema it expects the data to be in. The reader’s schema is the schema that the application code is built upon and is determined during the application’s build process.

Crucially, the writer’s schema and the reader’s schema in Avro do not have to be identical; they only need to be compatible. During the decoding process, the Avro library resolves differences between the two schemas by comparing them side by side and translating the data from the writer’s schema into the reader’s schema. The Avro specification outlines the specifics of this resolution mechanism.

For instance, if the fields in the writer’s schema and the reader’s schema are in a different order, that is not a problem, because schema resolution matches fields by name. If the code reading the data encounters a field that appears in the writer’s schema but not in the reader’s schema, it is ignored. Conversely, if the reader expects a field that is not present in the writer’s schema, it is filled in with the default value declared in the reader’s schema. The Avro specification spells out this resolution process in detail.
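
As an illustrative sketch (assuming the third-party fastavro library, which is not part of the chapter itself), schema resolution can be exercised directly: the record is written with one schema and read with a newer one that adds a field with a default value.

import io
from fastavro import schemaless_writer, schemaless_reader

writer_schema = {
    "type": "record", "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
}
reader_schema = {
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"userName": "Martin"})  # old writer
buf.seek(0)

# The new reader resolves the writer's schema against its own and fills in the default
record = schemaless_reader(buf, writer_schema, reader_schema)
print(record)  # {'userName': 'Martin', 'favoriteNumber': None}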

Schema evolution rules:

In Avro, forward compatibility occurs when a new schema version is the writer, and an old version is the reader. Backward compatibility is when a new version is the reader, and an old version is the writer.

For compatibility, you can only add or remove a field with a default value, like the favoriteNumber field in our Avro schema, which has a default value of null. Adding a field with a default value allows new readers to use the default when reading data written by old writers.

However, adding a field without a default value breaks backward compatibility, and removing a field without a default value breaks forward compatibility. Avro doesn’t use optional and required markers; instead, it employs union types and default values to handle nullability explicitly.

Changing the datatype of a field is possible, provided Avro can convert between the types. Changing a field name is possible by adding an alias for the old name to the reader’s schema, which makes the change backward compatible but not forward compatible. Adding a branch to a union type is likewise backward compatible but not forward compatible.

But what is the writer’s schema?

In Avro, determining the writer’s schema for a specific piece of data depends on the use case:

1. Large File with Lots of Records:

  • When storing a large file with millions of records, all encoded with the same schema (common in Hadoop contexts), the writer includes the schema once at the beginning of the file.
  • Avro specifies a file format, known as object container files, for exactly this purpose.

2. Database with Individually Written Records:

  • In a database where records are written at different times with different schemas, a version number is often included at the start of each encoded record.
  • The database maintains a list of schema versions, allowing a reader to fetch the schema associated with the version number of a specific record and decode it accordingly.
  • This approach is exemplified by Espresso.

3. Sending Records Over a Network Connection:

  • When two processes communicate over a bidirectional network connection, they can negotiate the schema version during connection setup.
  • The agreed-upon schema is then used for the duration of the connection.
  • This schema negotiation is employed in the Avro RPC protocol.

In all cases, having a database of schema versions serves as documentation and aids in checking schema compatibility. The version number can be a simple incrementing integer or a hash of the schema.
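
A minimal, purely hypothetical sketch of the second approach: each encoded record is prefixed with a version number, and the reader looks the corresponding writer’s schema up in a registry before decoding. The registry and helper names here are invented for illustration.

import struct

# Hypothetical in-process schema registry: version number -> writer's schema
SCHEMA_REGISTRY = {
    1: {"type": "record", "name": "Person",
        "fields": [{"name": "userName", "type": "string"}]},
}

def wrap(version, payload):
    # Prefix the Avro-encoded payload with a 4-byte schema version number
    return struct.pack(">I", version) + payload

def unwrap(data):
    # Recover the version, fetch the matching writer's schema, return it with the payload
    version = struct.unpack(">I", data[:4])[0]
    return SCHEMA_REGISTRY[version], data[4:]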

Dynamically generated schemas:

Avro’s schema design, without including tag numbers, provides an advantage for handling dynamically generated schemas compared to Protocol Buffers and Thrift. The absence of tag numbers in Avro schemas simplifies the process of dynamically adapting to schema changes. Here’s why this is significant:

1. Relational Database Example:

  • In scenarios like dumping a relational database to a file in a binary format, Avro allows for the straightforward generation of a schema from the relational schema.
  • Each table in the database can have its own Avro record schema, where each column corresponds to a field in that record. The mapping is based on the column names in the database.

2. Schema Changes:

  • If the database schema undergoes changes (e.g., columns added or removed), generating a new Avro schema from the updated database schema is a relatively simple task.
  • The data export process can perform schema conversion without requiring special handling for each schema change. The schema conversion happens seamlessly during each run.

3. Dynamic Schema Generation:

  • Avro’s design accommodates the goal of dynamically generating schemas. For instance, when a new Avro schema is generated from an updated database schema, the fields are identified by name.
  • This design allows for flexibility in adapting to changes without the need for manual intervention or meticulous handling of field tags.

4. Thrift and Protocol Buffers Comparison:

  • In contrast, using Thrift or Protocol Buffers for a similar purpose might involve assigning field tags manually, especially when the database schema changes.
  • This manual assignment of field tags can be error-prone and requires careful management, as the schema generator would need to avoid reusing previously used field tags.

In summary, Avro’s approach aligns well with scenarios involving dynamically generated schemas, providing a more convenient and automated way to handle schema changes compared to other binary encoding formats like Thrift and Protocol Buffers.
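
To make the idea concrete, here is a minimal sketch (the column list and type mapping are invented for illustration) of generating an Avro record schema from a relational table definition, keyed purely by column name:

# Hypothetical relational table definition: (column name, SQL type)
COLUMNS = [("id", "bigint"), ("user_name", "varchar"), ("signup_ts", "timestamp")]

# Deliberately simplified SQL-to-Avro type mapping for illustration
SQL_TO_AVRO = {"bigint": "long", "varchar": "string", "timestamp": "long"}

def table_to_avro_schema(table_name, columns):
    # Fields are identified by column name, so no tag numbers need to be managed
    return {
        "type": "record",
        "name": table_name,
        "fields": [
            {"name": col, "type": ["null", SQL_TO_AVRO[sql_type]], "default": None}
            for col, sql_type in columns
        ],
    }

print(table_to_avro_schema("users", COLUMNS))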

Code generation and dynamically typed languages:

Thrift and Protocol Buffers rely on code generation for schema implementation, which is beneficial in statically typed languages but less so in dynamically typed ones. Avro, in contrast, provides optional code generation but is designed to be used effectively without it. This flexibility is advantageous in dynamically typed languages like JavaScript, Ruby, or Python, where code generation might be less practical due to the absence of compile-time type checking.

Avro’s self-describing nature enables direct data inspection without the need for generated code. For example, when dealing with Avro files in languages like Apache Pig, you can seamlessly analyze, generate derived datasets, and write output files in Avro format without explicit schema considerations. This makes Avro a versatile choice, particularly in dynamic data processing scenarios.
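
For example (again assuming the third-party fastavro library, with a hypothetical file path), an Avro object container file can be opened and inspected without any generated classes, because the writer’s schema is embedded in the file itself:

from fastavro import reader

with open("people.avro", "rb") as fo:   # hypothetical file written earlier
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)    # schema embedded in the container file
    for record in avro_reader:          # records come back as plain dicts
        print(record["userName"])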

e. The Merits of Schemas

In summary, the use of schemas in binary encodings, such as Protocol Buffers, Thrift, and Avro, offers several advantages over textual formats like JSON, XML, and CSV:

  1. Simplicity and Implementation: Schema languages in these binary formats are simpler than XML Schema or JSON Schema, making them easier to implement and use. They support a wide range of programming languages.
  2. Historical Context: The concepts behind these encodings have roots in older technologies like ASN.1, standardized in 1984, which was complex and poorly documented. The newer binary formats have improved on these ideas.
  3. Compactness: Binary encodings with schemas can be more compact than “binary JSON” variants, particularly because they can omit field names from the encoded data.
  4. Documentation and Validation: The schema serves as valuable documentation, ensuring that it stays synchronized with the actual data structure. These schemas also support detailed validation rules, offering more guarantees about data integrity.
  5. Compatibility Checking: Maintaining a database of schemas enables proactive checking for forward and backward compatibility of schema changes before deployment.
  6. Code Generation: Users of statically typed programming languages can benefit from code generation based on the schema, enabling compile-time type checking.

In conclusion, the incorporation of schema evolution in binary encodings provides flexibility similar to schemaless or schema-on-read JSON databases, offering better data guarantees and improved tooling.

3. Modes of Dataflow

In this chapter, we’ve explored various encoding formats for sending data between processes that don’t share memory, such as over the network or when writing to a file. We discussed the importance of forward and backward compatibility in facilitating system evolution.

Now, let’s delve into different modes of data flow between processes:

  1. Dataflow Through Databases: Understanding how data moves through databases, exploring the role of encoding in storage systems, and considering the implications for database schema evolution.
  2. Dataflow Through Services (REST and RPC): Examining how data is exchanged between processes through service calls, comparing Representational State Transfer (REST) and Remote Procedure Call (RPC) mechanisms.
  3. Message-Passing Dataflow: Exploring asynchronous message passing as a mode of data flow between processes, and its relevance in distributed systems.

These modes represent common scenarios in which data travels between processes, each with its own considerations for encoding, compatibility, and overall system design.

a. Dataflow Through Databases

In database operations, the writing process encodes data, and the reading process decodes it. When a single process accesses the database, viewing the reader as a future version of the writer is akin to sending a message to one’s future self.

Backward compatibility is vital for decoding previously written data. In scenarios where multiple processes access the database concurrently, involving various applications or services, ensuring forward compatibility is necessary due to potential differences in code versions.

Adding a new field to a record schema poses challenges. If newer code writes a value for the new field and an older version of the code subsequently reads, updates, and writes the record, the new field should ideally be preserved. While encoding formats can carry unknown fields through, care is needed at the application level to prevent the unknown field from being lost during the decode, update, and re-encode cycle.
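
A minimal sketch of the pitfall (field names are purely illustrative): if old code decodes a record into only the fields it knows, updates it, and re-encodes it, a field added by newer code silently disappears; keeping the whole decoded structure preserves it.

import json

KNOWN_FIELDS = ["userName", "favoriteNumber"]   # what the *old* code knows about

stored = json.dumps({"userName": "Martin", "favoriteNumber": 1337,
                     "photoUrl": "https://example.com/martin.jpg"})  # written by newer code

# Lossy: old code keeps only the fields it knows, dropping photoUrl on rewrite
lossy = {k: v for k, v in json.loads(stored).items() if k in KNOWN_FIELDS}
lossy["favoriteNumber"] = 1338
print(json.dumps(lossy))                        # photoUrl is gone

# Safe: keep the whole decoded dict and only touch the field being updated
safe = json.loads(stored)
safe["favoriteNumber"] = 1338
print(json.dumps(safe))                         # photoUrl survives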

Different Values Over Time in Databases:

In a database, values can be updated at any time, leading to a mix of data written at various points, spanning milliseconds to years. Unlike application code, which can be swiftly replaced, the database retains old data unless explicitly rewritten — a principle captured by the phrase “data outlives code.”

Migrating data to a new schema is possible but often expensive for large datasets, prompting databases to avoid it if feasible. Many relational databases support simple schema changes, like adding a new column with a null default value, without rewriting existing data. For instance, LinkedIn’s Espresso document database leverages Avro for storage, benefiting from Avro’s schema evolution rules.

This approach enables the entire database to present itself as if encoded with a single schema, despite underlying storage containing records encoded in various historical schema versions.

Archival Storage and Data Dumping:

When taking periodic snapshots of your database for purposes like backups or data warehousing, the data dump is typically encoded using the latest schema, irrespective of the original mixture of schema versions. The immutability of these data dumps makes formats like Avro object container files well-suited. Additionally, encoding the data consistently during the copy process is a pragmatic choice.

For enhanced analytics capabilities, considering column-oriented formats like Parquet is advisable. Chapter 10 will delve deeper into leveraging data in archival storage.

b. Dataflow Through Services: REST and RPC

In the network communication model, clients interact with servers through APIs. This paradigm is widely observed on the web, where browsers, native apps, and JavaScript applications act as clients, making requests to servers via agreed-upon standards like HTTP.

Services, similar to databases, enable data submission and querying but with a more restrictive, application-specific API. The service-oriented or microservices architecture emphasizes independent deployability and evolvability. This approach anticipates the coexistence of old and new service versions, necessitating compatible data encoding across API versions — an idea explored in this chapter.

Web services:

Web services, utilizing HTTP as their underlying protocol, are prevalent in various contexts, not limited to the web. These contexts include client applications communicating over the internet, services within the same organization’s datacenter, and inter-organizational data exchange over the internet.

Two prominent approaches to web services are REST and SOAP, embodying different philosophies. REST, a design philosophy rooted in HTTP principles, prioritizes simplicity, utilizing URLs to identify resources and leveraging HTTP features for various functionalities. In contrast, SOAP is a protocol reliant on XML for network API requests. It features a comprehensive framework with standards like WS-* but tends to be complex and less favored in modern, smaller companies.

RESTful APIs, aligning with the principles of REST, emphasize simplicity and often require less code generation. OpenAPI, also known as Swagger, is a popular format for describing RESTful APIs and generating documentation. While SOAP is still utilized in larger enterprises, RESTful APIs have gained popularity due to their simplicity and ease of use.

The problems with remote procedure calls (RPCs):

RPC technologies, including EJB, RMI, DCOM, and CORBA, have encountered fundamental problems that hinder their effectiveness:

  1. Location Transparency Fallacy: Attempting to make remote network services appear as local function calls overlooks the inherent differences between network and local calls, rendering this abstraction flawed.
  2. Unpredictability of Network Requests: Network requests are prone to issues such as lost requests or responses, network problems, and slow or unavailable remote machines, necessitating anticipatory measures and retry mechanisms.
  3. Outcomes of Network Requests: Network requests may return without a result due to timeouts, introducing uncertainty about the request’s success or failure, requiring special handling.
  4. Retrying and Idempotence Challenges: Retrying failed network requests without ensuring idempotence risks executing actions multiple times, unlike local function calls.
  5. Variable Latency: Network request latency varies widely, ranging from milliseconds to seconds based on network conditions and service availability, unlike the predictable nature of local calls.
  6. Parameter Encoding Overhead: Network requests require parameter encoding into byte sequences for transmission, posing challenges, especially with larger objects, unlike the efficient passing of references in local calls.
  7. Cross-Language Data Type Translation: RPC frameworks encounter difficulties translating data types between different programming languages, owing to language-specific type differences.

Despite these challenges, REST’s transparency regarding its nature as a network protocol has gained popularity, avoiding the pitfalls of attempting to overly resemble local function calls in a different context.

Current Trends in RPC:

RPC remains a popular method for service communication, with modern frameworks integrating various encoding formats like Protocol Buffers, Avro, and JSON over HTTP.

  • Explicit Remote Request Handling: New frameworks like Finagle and Rest.li embrace futures for managing asynchronous actions and failures, recognizing the distinctions between local function calls and remote requests.
  • Advanced Feature Support: gRPC introduces stream support, enabling bidirectional communication with multiple requests and responses over time, enhancing RPC functionality.
  • Service Discovery: Some frameworks offer built-in mechanisms for service discovery, simplifying the process of locating and connecting to remote services dynamically.
  • Performance vs. Flexibility: Custom RPC protocols with binary encoding offer performance advantages over generic formats like JSON over REST. However, RESTful APIs provide ease of experimentation, broad language support, and extensive tooling.
  • REST Dominance for Public APIs: Despite RPC’s benefits, REST remains the primary choice for public APIs due to its simplicity, ease of use, and widespread adoption. RPC frameworks primarily serve communication within organizations, especially within datacenters.

In essence, while RPC maintains its role in internal service communication, REST continues to dominate public API development for its accessibility and ecosystem support.

Data Encoding and Evolution in RPC:

In RPC systems, ensuring evolvability is crucial for independent client and server updates. Typically, servers are updated before clients, necessitating backward compatibility on requests and forward compatibility on responses.

  • Thrift, gRPC, and Avro RPC: They adhere to compatibility rules from their encoding formats, enabling evolution based on Thrift, Protocol Buffers, and Avro schemas.
  • SOAP: XML schemas define SOAP requests and responses, allowing for evolution with potential complexities.
  • RESTful APIs: Often using JSON for responses and URI-encoded/form-encoded parameters for requests. Compatible changes include adding optional request parameters or new response fields, although JSON’s lack of a formal schema may require careful consideration.

Maintaining compatibility is challenging in RPC systems used across organizational boundaries. Service providers may support multiple API versions concurrently when compatibility-breaking changes are unavoidable.

API versioning approaches vary. RESTful APIs may include version numbers in URLs or use the HTTP Accept header. Services with API keys may allow clients to specify their preferred API version via an administrative interface.
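
For illustration (the URL and media type are made up), the two common versioning styles look like this with Python’s third-party requests library:

import requests  # third-party: pip install requests

# Version in the URL path
resp = requests.get("https://api.example.com/v2/users/42")

# Version negotiated via the HTTP Accept header
resp = requests.get(
    "https://api.example.com/users/42",
    headers={"Accept": "application/vnd.example.v2+json"},
)
print(resp.status_code)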

c. Message-Passing Dataflow

Message-Passing Dataflow bridges the gap between RPC and databases by enabling asynchronous communication between processes. Here are its key aspects:

  • Low Latency Delivery: Messages are swiftly delivered to processes, akin to RPC.
  • Intermediate Message Broker: Messages are temporarily stored by a message broker, offering benefits such as buffering for reliability, automatic redelivery, decoupling of sender and recipient, and support for broadcasting.
  • Asynchronous Communication: Unlike RPC, message-passing communication is typically one-way, without immediate responses, enhancing scalability and responsiveness.

Overall, message-passing dataflow provides a robust and flexible method for inter-process communication.

Message brokers:

Message brokers, once dominated by commercial solutions like TIBCO and IBM WebSphere, now see a shift towards open-source platforms such as RabbitMQ, ActiveMQ, and Kafka. These systems enable communication between processes by storing and delivering messages.

Processes send messages to named queues or topics, with the broker ensuring delivery to subscribers or consumers. Multiple producers and consumers can interact on the same topic, facilitating robust communication.

While topics support one-way data flow, consumers can also publish messages to other topics or reply queues, enabling request/response flows similar to RPC.

Message brokers are agnostic to data models, treating messages as byte sequences with metadata, allowing flexibility in encoding formats. However, caution is needed to preserve unknown fields when republishing messages to avoid compatibility issues.
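
As a rough sketch (assuming the third-party kafka-python package, a broker running locally, and an invented topic name), a producer publishes an encoded byte sequence to a topic and a consumer decodes it, with the broker treating the payload as opaque bytes:

import json
from kafka import KafkaProducer, KafkaConsumer  # third-party: kafka-python

producer = KafkaProducer(bootstrap_servers="localhost:9092")
payload = json.dumps({"userName": "Martin", "favoriteNumber": 1337}).encode("utf-8")
producer.send("person-updates", payload)   # the broker sees only bytes plus metadata
producer.flush()

consumer = KafkaConsumer("person-updates", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    record = json.loads(message.value)     # decoding happens in the consumer
    print(record)
    break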

Distributed actor frameworks:

Distributed actor frameworks extend the actor model, a concurrency programming paradigm that encapsulates logic in actors communicating through asynchronous messages. In distributed setups, actors span multiple nodes, employing the same message-passing mechanism regardless of node location.

Compared to RPC, the actor model’s inherent assumption of potential message loss aligns better with network conditions. While latency increases over the network, the actor model minimizes the disparity between local and remote communication.

These frameworks integrate a message broker and actor programming, yet upgrading actor-based applications requires consideration of forward and backward compatibility. Here’s how three popular frameworks handle message encoding:

  • Akka primarily uses Java’s serialization, lacking compatibility features. Substituting it with alternatives like Protocol Buffers enables rolling upgrades.
  • Orleans employs a custom encoding format, necessitating the setup of new clusters for version upgrades. Custom serialization plug-ins can address this limitation.
  • Erlang OTP faces challenges in schema changes despite its high availability features. Although rolling upgrades are feasible, they require careful planning. The introduction of experimental data types like maps may facilitate upgrades in the future.

4. Conclusion

In conclusion, the exploration of data encoding, evolution, and dataflow modes in this comprehensive overview sheds light on the intricate mechanisms underlying the development and scalability of data-intensive applications.

From the meticulous considerations of schema design in binary encoding formats like Thrift, Protocol Buffers, and Avro, to the pragmatic trade-offs in choosing between RESTful APIs and RPC mechanisms, each section highlights the critical role of compatibility, efficiency, and flexibility in system architecture.

Furthermore, the discussion on message-passing dataflow elucidates the transition from traditional RPC paradigms to distributed actor frameworks, emphasizing the importance of aligning communication models with network realities and system scalability requirements.

Ultimately, in the dynamic landscape of modern application development, the ability to navigate and adapt to evolving data structures, communication patterns, and deployment scenarios becomes paramount. By leveraging the insights provided in this overview, developers and architects can better navigate the complexities of designing robust, scalable, and adaptable data-intensive systems.

--

Sunny, Lee

CMU Master of Software Engineering student who loves outdoor activities