# ZAP File Format

## Legend

### Sections

    |========|
    |        | section
    |========|

### Fixed-size fields

    |--------|        |----|        |--|        |-|
    |        | uint64 |    | uint32 |  | uint16 | | uint8
    |--------|        |----|        |--|        |-|

### Varints

    |~~~~~~~~|
    |        | varint(up to uint64)
    |~~~~~~~~|

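Varint fields have no fixed width, so a reader has to decode them byte by byte. The sketch below assumes the unsigned base-128 encoding used by Go's `encoding/binary` (7 payload bits per byte, least-significant group first, high bit set on every byte except the last). This document does not name the encoding, and `read_uvarint` is a hypothetical helper name.

```python
def read_uvarint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one unsigned varint from buf starting at pos.

    Returns (value, next_pos). Assumed encoding: base-128,
    least-significant 7-bit group first.
    """
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: this was the last byte
            break
        shift += 7
        if shift > 63:  # more than 10 bytes cannot encode a uint64
            raise ValueError("varint does not fit in uint64")
    return value, pos
```

For example, the two bytes `0xAC 0x02` decode to 300 and advance the position by 2.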
### Arbitrary-length fields

    |--------...---|
    |              | arbitrary-length field (string, vellum, roaring bitmap)
    |--------...---|

### Chunked data

    [--------]
    [        ]
    [--------]

## Overview

The footer section describes the configuration of a particular ZAP file. The footer format is version-dependent, so check the `V` field before parsing anything else.

            |==================================================|
            | Stored Fields                                    |
            |==================================================|
    |-----> | Stored Fields Index                              |
    |       |==================================================|
    |       | Dictionaries + Postings + DocValues              |
    |       |==================================================|
    | |---> | DocValues Index                                  |
    | |     |==================================================|
    | |     | Fields                                           |
    | |     |==================================================|
    | | |-> | Fields Index                                     |
    | | |   |========|========|========|========|====|====|====|
    | | |   |     D# |     SF |      F |    FDV | CF |  V | CC |    (Footer)
    | | |   |========|====|===|====|===|====|===|====|====|====|
    | | |                 |        |        |
    |-+-+-----------------|        |        |
      | |--------------------------|        |
      |-------------------------------------|

     D#. Number of Docs.
     SF. Stored Fields Index Offset.
      F. Field Index Offset.
    FDV. Field DocValue Offset.
     CF. Chunk Factor.
      V. Version.
     CC. CRC32.

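Reading the footer can be sketched as follows. The field widths follow the diagram (four `uint64` values, then three `uint32` values, 44 bytes in total). Big-endian byte order is an assumption, since this document does not state it, and the names `Footer` and `parse_footer` are hypothetical.

```python
import struct
from typing import NamedTuple

# Assumed layout: big-endian, 4 x uint64 followed by 3 x uint32 (44 bytes).
FOOTER = struct.Struct(">QQQQIII")


class Footer(NamedTuple):
    num_docs: int             # D#
    stored_index_offset: int  # SF
    fields_index_offset: int  # F
    doc_values_offset: int    # FDV
    chunk_factor: int         # CF
    version: int              # V
    crc32: int                # CC


def parse_footer(data: bytes) -> Footer:
    """Unpack the footer from the last FOOTER.size bytes of a ZAP file image."""
    if len(data) < FOOTER.size:
        raise ValueError("file is shorter than a footer")
    return Footer(*FOOTER.unpack_from(data, len(data) - FOOTER.size))
```

A caller would inspect `parse_footer(data).version` first and only then trust the remaining offsets.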
## Stored Fields

The Stored Fields Index is `D#` consecutive 64-bit unsigned integers: the offsets at which the corresponding Stored Fields Data records are located.

    0                                [SF]                   [SF + D# * 8]
    | Stored Fields                  | Stored Fields Index              |
    |================================|==================================|
    |                                |                                  |
    |       |--------------------|   ||--------|--------|. . .|--------||
    |   |-> | Stored Fields Data |   ||      0 |      1 |     | D# - 1 ||
    |   |   |--------------------|   ||--------|----|---|. . .|--------||
    |   |                            |              |                   |
    |===|============================|==============|===================|
        |                                           |
        |-------------------------------------------|

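Looking up a document's record is a single indexed read into the Stored Fields Index. A minimal sketch, assuming big-endian offsets (the byte order is not stated here) and using the hypothetical helper name `stored_field_offset`:

```python
import struct


def stored_field_offset(data: bytes, sf: int, num_docs: int, doc_num: int) -> int:
    """Return the offset of the Stored Fields Data record for doc_num.

    sf is the footer's SF value and num_docs its D# value.
    Assumption: index entries are big-endian uint64.
    """
    if not 0 <= doc_num < num_docs:
        raise IndexError("document number out of range")
    (offset,) = struct.unpack_from(">Q", data, sf + doc_num * 8)
    return offset
```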
Stored Fields Data is an arbitrary-size record consisting of metadata and [Snappy](https://github.com/golang/snappy)-compressed data.

    Stored Fields Data
    |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|
    |    MDS |    CDS |         MD        |         CD        |
    |~~~~~~~~|~~~~~~~~|~~~~~~~~...~~~~~~~~|~~~~~~~~...~~~~~~~~|

    MDS. Metadata size.
    CDS. Compressed data size.
     MD. Metadata.
     CD. Snappy-compressed data.

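Splitting one record into its two parts can be sketched like this. The varint encoding (base-128, least-significant group first, as in Go's `encoding/binary`) is an assumption, and both function names are hypothetical. Decompressing `CD` needs a Snappy decoder, which is left out.

```python
def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Assumed varint encoding: base-128, least-significant group first."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def split_stored_record(data: bytes, offset: int) -> tuple[bytes, bytes]:
    """Return (metadata, compressed_data) for the record starting at offset."""
    mds, pos = read_uvarint(data, offset)  # MDS: metadata size
    cds, pos = read_uvarint(data, pos)     # CDS: compressed data size
    metadata = data[pos:pos + mds]
    compressed = data[pos + mds:pos + mds + cds]
    if len(metadata) != mds or len(compressed) != cds:
        raise ValueError("truncated Stored Fields Data record")
    return metadata, compressed
```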
## Fields

The Fields Index section is located between addresses `F` and `len(file) - len(footer)` and consists of `uint64` values (`F1`, `F2`, ...), which are offsets to records in the Fields section. There are `F# = (len(file) - len(footer) - F) / sizeof(uint64)` fields.

    (...)                            [F]                   [F + F# * 8]
    | Fields                         | Fields Index                   |
    |================================|================================|
    |                                |                                |
    |   |~~~~~~~~|~~~~~~~~|---...---|||--------|--------|...|--------||
    ||->|   Dict | Length | Name    |||      0 |      1 |   | F# - 1 ||
    ||  |~~~~~~~~|~~~~~~~~|---...---|||--------|----|---|...|--------||
    ||                               |              |                 |
    ||===============================|==============|=================|
     |                                              |
     |----------------------------------------------|

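The `F#` formula and the `Dict | Length | Name` record layout can be combined into a short reader. This is a sketch under several assumptions that this document does not confirm: a 44-byte footer (as drawn in the Overview), big-endian index entries, Go-style unsigned varints, and UTF-8 field names. `read_fields` and `FOOTER_SIZE` are hypothetical names.

```python
import struct

FOOTER_SIZE = 44  # assumed: 4 x uint64 + 3 x uint32, per the footer diagram


def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Assumed varint encoding: base-128, least-significant group first."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def read_fields(data: bytes, f: int) -> list[tuple[str, int]]:
    """Return (name, dictionary offset) for every field; f is the footer's F."""
    index_end = len(data) - FOOTER_SIZE
    num_fields = (index_end - f) // 8  # F#
    fields = []
    for i in range(num_fields):
        (record,) = struct.unpack_from(">Q", data, f + i * 8)
        dict_offset, pos = read_uvarint(data, record)  # Dict
        name_len, pos = read_uvarint(data, pos)        # Length
        name = data[pos:pos + name_len].decode("utf-8")
        fields.append((name, dict_offset))
    return fields
```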
## Dictionaries + Postings

Each field has its own dictionary, encoded in [Vellum](https://github.com/couchbase/vellum) format. A dictionary consists of `(term, offset)` pairs, where `offset` indicates the position of the postings (list of documents) for that term.

    |================================================================|- Dictionaries +
    |                                                                |   Postings +
    |                                                                |    DocValues
    |    Freq/Norm (chunked)                                         |
    |    [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]                      |
    | |->[ Freq | Norm (float32 under varint) ]                      |
    | |  [~~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~]                      |
    | |                                                              |
    | |------------------------------------------------------------| |
    |    Location Details (chunked)                                | |
    |    [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~]        | |
    | |->[ Size | Pos | Start | End | Arr# | ArrPos | ... ]        | |
    | |  [~~~~~~|~~~~~|~~~~~~~|~~~~~|~~~~~~|~~~~~~~~|~~~~~]        | |
    | |                                                            | |
    | |----------------------|                                     | |
    |         Postings List  |                                     | |
    |         |~~~~~~~~|~~~~~|~~|~~~~~~~~|-----------...--|        | |
    |      |->|    F/N |     LD | Length | ROARING BITMAP |        | |
    |      |  |~~~~~|~~|~~~~~~~~|~~~~~~~~|-----------...--|        | |
    |      |        |----------------------------------------------| |
    |      |--------------------------------------|                  |
    |         Dictionary                          |                  |
    |         |~~~~~~~~|--------------------------|-...-|            |
    |      |->| Length | VELLUM DATA : (TERM -> OFFSET) |            |
    |      |  |~~~~~~~~|----------------------------...-|            |
    |      |                                                         |
    |======|=========================================================|- DocValues Index
    |      |                                                         |
    |======|=========================================================|- Fields
    |      |                                                         |
    | |~~~~|~~~|~~~~~~~~|---...---|                                  |
    | |   Dict | Length | Name    |                                  |
    | |~~~~~~~~|~~~~~~~~|---...---|                                  |
    |                                                                |
    |================================================================|

## DocValues

The DocValues Index is `F#` pairs of varints, one pair per field. Each pair gives the start and end point of that field's DocValues slice.

    |================================================================|
    |     |------...--|                                              |
    |  |->| DocValues |<-|                                           |
    |  |  |------...--|  |                                           |
    |==|=================|===========================================|- DocValues Index
    ||~|~~~~~~~~~|~~~~~~~|~~|           |~~~~~~~~~~~~~~|~~~~~~~~~~~~||
    || DV1 START | DV1 STOP | . . . . . | DV(F#) START | DV(F#) END ||
    ||~~~~~~~~~~~|~~~~~~~~~~|           |~~~~~~~~~~~~~~|~~~~~~~~~~~~||
    |================================================================|

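Reading the DocValues Index amounts to decoding `2 * F#` varints in order, starting at the footer's `FDV` offset. A minimal sketch, assuming Go-style unsigned varints; `read_doc_values_index` is a hypothetical name.

```python
def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Assumed varint encoding: base-128, least-significant group first."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def read_doc_values_index(data: bytes, fdv: int, num_fields: int) -> list[tuple[int, int]]:
    """Return one (start, end) pair per field, read from offset fdv."""
    pos = fdv
    slices = []
    for _ in range(num_fields):
        start, pos = read_uvarint(data, pos)
        end, pos = read_uvarint(data, pos)
        slices.append((start, end))
    return slices
```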
DocValues are chunked, Snappy-compressed values for each document and field.

    [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]
    [ Doc# in Chunk | Doc1 | Offset1 | ... | DocN | OffsetN | SNAPPY COMPRESSED DATA ]
    [~~~~~~~~~~~~~~~|~~~~~~|~~~~~~~~~|-...-|~~~~~~|~~~~~~~~~|--------------------...-]

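A chunk header can be separated from its compressed payload as sketched below: one varint count, then that many `(Doc, Offset)` varint pairs, with the rest being Snappy data. The varint encoding is assumed to be Go-style unsigned base-128, and `split_doc_values_chunk` is a hypothetical name. What each offset is relative to is not specified here, so the sketch returns the raw values.

```python
def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Assumed varint encoding: base-128, least-significant group first."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def split_doc_values_chunk(chunk: bytes) -> tuple[list[tuple[int, int]], bytes]:
    """Return ([(doc_num, offset), ...], compressed_payload) for one chunk."""
    count, pos = read_uvarint(chunk, 0)  # Doc# in Chunk
    entries = []
    for _ in range(count):
        doc_num, pos = read_uvarint(chunk, pos)
        offset, pos = read_uvarint(chunk, pos)
        entries.append((doc_num, offset))
    return entries, chunk[pos:]
```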
The last 16 bytes of a DocValues slice describe its chunks.

    |~~~~~~~~~~~~...~|----------------|----------------|
    |   Chunk Sizes  | Chunk Size Arr |         Chunk# |
    |~~~~~~~~~~~~...~|----------------|----------------|
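
Reading this trailer can be sketched as follows. The interpretation is an assumption rather than something stated above: `Chunk Size Arr` is taken to be the byte length of the varint-encoded `Chunk Sizes` array that precedes it, `Chunk#` the number of chunks, and both are read as big-endian `uint64`. `read_chunk_sizes` is a hypothetical name.

```python
import struct


def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Assumed varint encoding: base-128, least-significant group first."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def read_chunk_sizes(dv: bytes) -> list[int]:
    """Return the chunk sizes listed at the end of one DocValues slice."""
    if len(dv) < 16:
        raise ValueError("slice is shorter than the 16-byte trailer")
    # Assumed order: Chunk Size Arr (byte length), then Chunk# (count).
    sizes_len, num_chunks = struct.unpack_from(">QQ", dv, len(dv) - 16)
    pos = len(dv) - 16 - sizes_len
    if pos < 0:
        raise ValueError("chunk size array is longer than the slice")
    sizes = []
    for _ in range(num_chunks):
        size, pos = read_uvarint(dv, pos)
        sizes.append(size)
    return sizes
```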