Commit 30408a9

AVRO-1704: Add single-record encoding spec. (Contributed by Niels Basjes)
1 parent d7e1231 commit 30408a9

2 files changed

+34
-4
lines changed

CHANGES.txt

Lines changed: 2 additions & 0 deletions
@@ -8,6 +8,8 @@ Trunk (not yet released)

 AVRO-1704: Java: Add support for single-message encoding. (blue)

+AVRO-1704: Spec: Add single-message encoding format. (Niels Basjes via blue)
+
 OPTIMIZATIONS

 IMPROVEMENTS

doc/src/content/xdocs/spec.xml

Lines changed: 32 additions & 4 deletions
@@ -487,18 +487,18 @@
 value, followed by that many key/value pairs. A block
 with count zero indicates the end of the map. Each item
 is encoded per the map's value schema.</p>
-
+
 <p>If a block's count is negative, its absolute value is used,
 and the count is followed immediately by a <code>long</code>
 block <em>size</em> indicating the number of bytes in the
 block. This block size permits fast skipping through data,
 e.g., when projecting a record to a subset of its fields.</p>
-
+
 <p>The blocked representation permits one to read and write
 maps larger than can be buffered in memory, since one can
 start writing items without knowing the full length of the
 map.</p>
-
+
 </section>

 <section id="union_encoding">
@@ -569,6 +569,34 @@

 </section>

+<section id="single_object_encoding">
+<title>Single-object encoding</title>
+
+<p>In some situations a single Avro serialized object is to be stored for a
+longer period of time. One very common example is storing Avro records
+for several weeks in an <a href="http://kafka.apache.org/">Apache Kafka</a> topic.</p>
+<p>In the period after a schema change this persistence system will contain records
+that have been written with different schemas. So the need arises to know which schema
+was used to write a record to support schema evolution correctly.
+In most cases the schema itself is too large to include in the message,
+so this binary wrapper format supports the use case more effectively.</p>
+
+<section id="single_object_encoding_spec">
+<title>Single object encoding specification</title>
+<p>Single Avro objects are encoded as follows:</p>
+<ol>
+<li>A two-byte marker, <code>C3 01</code>, to show that the message is Avro and uses this single-record format (version 1).</li>
+<li>The 8-byte little-endian CRC-64-AVRO <a href="#schema_fingerprints">fingerprint</a> of the object's schema</li>
+<li>The Avro object encoded using <a href="#binary_encoding">Avro's binary encoding</a></li>
+</ol>
+</section>
+
+<p>Implementations use the 2-byte marker to determine whether a payload is Avro.
+This check helps avoid expensive lookups that resolve the schema from a
+fingerprint, when the message is not an encoded Avro payload.</p>
+
+</section>
+
 </section>

 <section id="order">
@@ -1237,7 +1265,7 @@
 </ul>
 </section>

-<section>
+<section id="schema_fingerprints">
 <title>Schema Fingerprints</title>

 <p>"[A] fingerprinting algorithm is a procedure that maps an
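
The framing added to spec.xml in this commit (two-byte marker, 8-byte little-endian fingerprint, then the binary-encoded object) can be sketched in Python as follows. This is a minimal illustration, not the Java implementation referenced in CHANGES.txt. The function names are hypothetical. The CRC-64-AVRO routine is a port of the table-driven algorithm as I understand it from the spec's Schema Fingerprints section (seed `0xC15D213AA4D7A795`), and it assumes the caller passes the schema already in Parsing Canonical Form and the object body already Avro-binary-encoded.

```python
import struct

MARKER = b"\xC3\x01"        # two-byte marker: Avro single-object format, version 1
EMPTY = 0xC15D213AA4D7A795  # CRC-64-AVRO seed (assumption: taken from the fingerprint section)


def _build_table():
    """Precompute the 256-entry lookup table for CRC-64-AVRO."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # Shift right one bit; fold in the seed when the dropped bit was 1.
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table


_FP_TABLE = _build_table()


def crc64_avro(data: bytes) -> int:
    """Return the 64-bit fingerprint of data as an unsigned integer."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ _FP_TABLE[(fp ^ byte) & 0xFF]
    return fp


def encode_single_object(canonical_schema: str, body: bytes) -> bytes:
    """Wrap an Avro-binary-encoded body: marker + little-endian fingerprint + body."""
    fingerprint = crc64_avro(canonical_schema.encode("utf-8"))
    return MARKER + struct.pack("<Q", fingerprint) + body


def decode_single_object(message: bytes):
    """Return (fingerprint, body), or None when the marker check fails.

    Checking the marker first avoids a schema lookup for payloads
    that are not single-object-encoded Avro.
    """
    if len(message) < 10 or message[:2] != MARKER:
        return None
    (fingerprint,) = struct.unpack("<Q", message[2:10])
    return fingerprint, message[10:]
```

A reader would typically use the returned fingerprint as a key into whatever schema store it maintains, then decode the body with the writer schema found there. That lookup is outside the scope of this sketch.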
