The type byte describes the sign of the number as well as the number of bytes used to specify the byte length of the
mantissa. As usual, if V is the type byte, then V - `0xc7` (in the positive case) or V - `0xcf` (in the negative case)
bytes are used for the length of the mantissa, stored as a little endian unsigned integer directly after the type byte.
After this follow exactly 4 bytes (a little endian signed two's complement integer) that specify the exponent. After the
exponent, the actual mantissa bytes follow.

Packed BCD is used, so that each byte stores exactly 2 decimal digits, as in `0x34` for the decimal digits 34. Therefore,
the mantissa always has an even number of decimal digits. Note that the mantissa is stored in big endian form, to make
parsing and dumping efficient. This leads to the "unholy nibble problem": when a JSON parser sees the beginning of a
longish number, it does not know whether an even or odd number of digits follows. However, for efficiency reasons it
wants to start writing bytes to the output as it reads the input. This is where the exponent comes to the rescue, as the
following example illustrates.

12345 decimal can be encoded as:

    c8 03 00 00 00 00 01 23 45
    c8 03 ff ff ff ff 12 34 50

The former encoding puts a leading 0 in the first byte and uses exponent 0. The latter encoding starts packing two
decimal digits into each byte right away and in the end has to "erase" the trailing 0 by using exponent -1, encoded by
the 4-byte sequence `ff ff ff ff`.
Therefore, the unholy nibble problem is solved and parsing (and indeed dumping) can be efficient.
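
The two encodings of 12345 shown above can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the reference implementation: the function name `encode_bcd` and its `pad_front` parameter are hypothetical, and the input is assumed to be a plain string of ASCII digits without sign or decimal point.

```python
def encode_bcd(digits: str, negative: bool = False, pad_front: bool = False) -> bytes:
    """Encode a string of decimal digits as a BCD value (illustrative sketch)."""
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected a non-empty string of ASCII digits")

    exponent = 0
    if len(digits) % 2 == 1:
        if pad_front:
            # Requires knowing the digit count up front: add a leading 0.
            digits = "0" + digits
        else:
            # Streaming-friendly: add a trailing 0 and "erase" it with exponent -1.
            digits = digits + "0"
            exponent = -1

    # Packed BCD, big endian: two decimal digits per byte.
    mantissa = bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )

    length = len(mantissa)
    # Number of bytes needed to store the mantissa length (at least 1).
    length_width = max(1, (length.bit_length() + 7) // 8)
    # V - 0xc7 (positive) or V - 0xcf (negative) equals length_width.
    type_byte = (0xCF if negative else 0xC7) + length_width

    return (
        bytes([type_byte])
        + length.to_bytes(length_width, "little")
        + exponent.to_bytes(4, "little", signed=True)
        + mantissa
    )
```

With these assumptions, `encode_bcd("12345").hex()` yields `c803ffffffff123450` and `encode_bcd("12345", pad_front=True).hex()` yields `c80300000000012345`, matching the two byte sequences in the example.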
## Tagging
Types `0xee`-`0xef` are used for tagging of values to implement logical types.
For example, if type `0x1c` did not exist, the database driver could serialize a timestamp object (Date in JavaScript,
Instant in Java, etc.) into a Unix timestamp, a 64-bit integer. Without a schema, it would not be possible upon
deserialization to tell an integer from a timestamp and deserialize the value accordingly.
Type tagging resolves this by attaching an integer tag to a value. The tag can be read when deserializing the value,
e.g. tag=1 could mean that the value is a timestamp and that the relevant timestamp class should be used.
The tag values are specified separately and applications can also specify their own to have the database driver
deserialize their specific data types into the appropriate classes (including models).
Essentially this is object-relational mapping for parts of documents.
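
How a driver might act on a tag can be sketched as a small registry that maps tag values to decoder functions. Everything named here is hypothetical (`TAG_DECODERS`, `register_tag`, `apply_tag`), the sketch does not parse the binary layout of tagged values, and it assumes the timestamp is a Unix timestamp in seconds; only the "tag=1 is a timestamp" convention is taken from the example in the text.

```python
from datetime import datetime, timezone

TAG_TIMESTAMP = 1  # example convention from the text: tag=1 marks a timestamp

# Hypothetical registry: tag value -> function that builds the logical type.
TAG_DECODERS = {
    # Assumption: the tagged integer is a Unix timestamp in seconds.
    TAG_TIMESTAMP: lambda v: datetime.fromtimestamp(v, tz=timezone.utc),
}


def register_tag(tag: int, decoder) -> None:
    """Let an application add a decoder for its own tag value."""
    TAG_DECODERS[tag] = decoder


def apply_tag(tag: int, value):
    """Convert an already-decoded value according to its tag.

    Unknown tags leave the value untouched, so untagged consumers still work.
    """
    decoder = TAG_DECODERS.get(tag)
    return decoder(value) if decoder is not None else value
```

An application could then call, for instance, `register_tag(1000, Point.from_list)` for a model class of its own, where both the tag value 1000 and the `Point` class are made up for illustration.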
The following user-defined types exist:
- `0xf1` : 2 bytes payload, directly following the type byte
- `0xf2` : 4 bytes payload, directly following the type byte
- `0xf3` : 8 bytes payload, directly following the type byte
- `0xf4`-`0xf6` : length of the payload is described by a single further unsigned byte directly following the type byte,
the payload of that many bytes follows
- `0xf7`-`0xf9` : length of the payload is described by two bytes (little endian unsigned integer) directly following
the type byte, the payload of that many bytes follows
- `0xfa`-`0xfc` : length of the payload is described by four bytes (little endian unsigned integer) directly following
the type byte, the payload of that many bytes follows
- `0xfd`-`0xff` : length of the payload is described by eight bytes (little endian unsigned integer) directly following
the type byte, the payload of that many bytes follows
Note: In types `0xf4` to `0xff` the "payload" refers to the actual data not including the length specification.
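
As an illustration of the layouts listed above, the following Python sketch computes the total size of a user-defined value from its first bytes. The function name `custom_value_size` is hypothetical, and the sketch covers only the type bytes that appear in the list above.

```python
def custom_value_size(buf: bytes) -> int:
    """Return the total size in bytes (type byte, length field, payload)
    of the user-defined value that starts at buf[0]."""
    if not buf:
        raise ValueError("empty buffer")
    type_byte = buf[0]

    # Fixed-size payloads directly following the type byte.
    fixed_payload = {0xF1: 2, 0xF2: 4, 0xF3: 8}
    if type_byte in fixed_payload:
        return 1 + fixed_payload[type_byte]

    if 0xF4 <= type_byte <= 0xFF:
        # 0xf4-0xf6 -> 1, 0xf7-0xf9 -> 2, 0xfa-0xfc -> 4, 0xfd-0xff -> 8
        width = 1 << ((type_byte - 0xF4) // 3)
        if len(buf) < 1 + width:
            raise ValueError("buffer too short for the length field")
        # Little endian unsigned payload length; it excludes the length field itself.
        payload = int.from_bytes(buf[1:1 + width], "little")
        return 1 + width + payload

    raise ValueError(f"type byte {type_byte:#04x} is not covered by this sketch")
```

For example, a value starting with `f4 03` occupies 5 bytes in total: one type byte, one length byte, and three payload bytes.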
## Portability
Serialized booleans, integers, strings, arrays, objects etc. all have a defined endianness and length, which is
platform-independent. These types are fully portable in serialized VelocyPack.
There are still a few caveats when it comes to portability:
It is possible to build up very large values on a 64-bit system, but it may not be possible to read them back on a
32-bit system. This is because the maximum memory allocation size on a 32-bit system may be severely limited compared to
a 64-bit system, i.e. a 32-bit OS may simply not allow allocating buffers larger than 4 GB. This is not a limitation of
VelocyPack, but a limitation of 32-bit architectures. If all VelocyPack values are kept small enough so that they are
well below the 32-bit length boundaries, this should not matter though.

The VelocyPack type *External* contains just a raw pointer to memory, which should only be used during the buildup of
VelocyPack values in memory. The *External* type is not supposed to be used in VelocyPack values that are serialized and
stored persistently, and then later read back from persistence. Doing it anyway is not portable and will also pose a
security risk. Not using the *External* type for any data that is serialized will avoid this problem entirely.

The VelocyPack type *Custom* is completely user-defined, and there is no default implementation for it. So it is up to
the embedder to make these custom type bindings portable if their portability is a concern.

VelocyPack *Double* values are serialized as integer equivalents in a specific way, and deserialized back into integers
that overlay an IEEE-754 double-precision floating point value in memory. We found this to be sufficiently portable for
our needs, although at least in theory there may be portability issues with some systems.

The [following](https://en.wikipedia.org/wiki/Endianness#Floating_point) passage backs our "reasonably portable in the
real world" assumption:

> It may therefore appear strange that the widespread IEEE 754 floating-point standard does not specify endianness. Theoretically, this means that even standard IEEE floating-point data written by one machine might not be readable by another. However, on modern standard computers (i.e., implementing IEEE 754), one may in practice safely assume that the endianness is the same for floating-point numbers as for integers, making the conversion straightforward regardless of data type.
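
The integer overlay of a *Double* value described above can be sketched in Python as follows. This is an illustration, not the library's code: the function names are hypothetical, and the little endian byte order of the serialized integer is an assumption made for this sketch, since this section does not state it.

```python
import struct


def double_to_bytes(x: float) -> bytes:
    """Serialize a double via its 64-bit unsigned integer equivalent."""
    # Reinterpret the IEEE-754 bit pattern as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    # Assumption: the integer is written in little endian byte order.
    return bits.to_bytes(8, "little")


def bytes_to_double(data: bytes) -> float:
    """Read the integer back and overlay it onto a double."""
    if len(data) != 8:
        raise ValueError("expected exactly 8 bytes")
    bits = int.from_bytes(data, "little")
    (x,) = struct.unpack("<d", struct.pack("<Q", bits))
    return x
```

Because every step uses an explicit byte order, the result does not depend on the endianness of the machine running the code, which is the property the portability argument relies on.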