leveldb: Add LevelDB support #824

Merged
merged 18 commits into from
Dec 9, 2023
Changes from 6 commits
1 change: 1 addition & 0 deletions README.md
@@ -102,6 +102,7 @@ ipv6_packet,
jpeg,
json,
jsonl,
[leveldb_ldb](doc/formats.md#leveldb_ldb),
Owner

Run `make doc` to update this.

Contributor Author

Oddly, the `make doc` command changes a lot of content in the SVGs. I left them out of the commit, as it seems they shouldn't change.

Owner

Aha, yes, skip those; I will fix it. I made some changes to https://github.com/wader/ansisvg a few days ago.

[luajit](doc/formats.md#luajit),
[macho](doc/formats.md#macho),
macho_fat,
20 changes: 19 additions & 1 deletion doc/formats.md
@@ -72,6 +72,7 @@
|`jpeg` |Joint&nbsp;Photographic&nbsp;Experts&nbsp;Group&nbsp;file |<sub>`exif` `icc_profile`</sub>|
|`json` |JavaScript&nbsp;Object&nbsp;Notation |<sub></sub>|
|`jsonl` |JavaScript&nbsp;Object&nbsp;Notation&nbsp;Lines |<sub></sub>|
|[`leveldb_ldb`](#leveldb_ldb) |LevelDB&nbsp;Table |<sub></sub>|
|[`luajit`](#luajit) |LuaJIT&nbsp;2.0&nbsp;bytecode |<sub></sub>|
|[`macho`](#macho) |Mach-O&nbsp;macOS&nbsp;executable |<sub></sub>|
|`macho_fat` |Fat&nbsp;Mach-O&nbsp;macOS&nbsp;executable&nbsp;(multi-architecture) |<sub>`macho`</sub>|
@@ -131,7 +132,7 @@
|`ip_packet` |Group |<sub>`icmp` `icmpv6` `tcp_segment` `udp_datagram`</sub>|
|`link_frame` |Group |<sub>`bsd_loopback_frame` `ether8023_frame` `ipv4_packet` `ipv6_packet` `sll2_packet` `sll_packet`</sub>|
|`mp3_frame_tags` |Group |<sub>`mp3_frame_vbri` `mp3_frame_xing`</sub>|
|`probe` |Group |<sub>`adts` `aiff` `apple_bookmark` `ar` `avi` `avro_ocf` `bitcoin_blkdat` `bplist` `bzip2` `caff` `elf` `flac` `gif` `gzip` `html` `jpeg` `json` `jsonl` `luajit` `macho` `macho_fat` `matroska` `moc3` `mp3` `mp4` `mpeg_ts` `ogg` `opentimestamps` `pcap` `pcapng` `png` `tar` `tiff` `toml` `tzif` `wasm` `wav` `webp` `xml` `yaml` `zip`</sub>|
|`probe` |Group |<sub>`adts` `aiff` `apple_bookmark` `ar` `avi` `avro_ocf` `bitcoin_blkdat` `bplist` `bzip2` `caff` `elf` `flac` `gif` `gzip` `html` `jpeg` `json` `jsonl` `leveldb_ldb` `luajit` `macho` `macho_fat` `matroska` `moc3` `mp3` `mp4` `mpeg_ts` `ogg` `opentimestamps` `pcap` `pcapng` `png` `tar` `tiff` `toml` `tzif` `wasm` `wav` `webp` `xml` `yaml` `zip`</sub>|
|`tcp_stream` |Group |<sub>`dns_tcp` `rtmp` `tls`</sub>|
|`udp_payload` |Group |<sub>`dns`</sub>|

@@ -690,6 +691,23 @@ $ fq -n -d html '[inputs | {key: input_filename, value: .html.head.title?}] | fr
$ fq -r -o array=true -d html '.. | select(.[0] == "a" and .[1].href)?.[1].href' file.html
```

## leveldb_ldb

### Limitations

- Meta Blocks (like "filter") are not decoded yet.
- Zstandard decompression is not implemented yet.
Owner

Does this more or less just depend on some zstd package? I've looked at https://github.com/klauspost/compress a couple of times. It has zstd and lots of other formats, and I suspect the APIs are a bit more low-level than the Go stdlib, so it might fit fq better. Maybe it could also replace github.com/golang/snappy? Impressively, it also seems to have zero dependencies on other packages.

Contributor Author (@mikez, Dec 6, 2023)

I don't know how common zstd is in the wild. I haven't come across it in any of the LevelDB samples I've seen so far.

Owner

OK, then I think we can skip it for now. It's probably also good not to let this PR grow too much.


### Authors

- [@mikez](https://github.com/mikez), original author

### References

- https://github.com/google/leveldb/blob/main/doc/table_format.md
- https://github.com/google/leveldb/blob/main/doc/impl.md
- https://github.com/google/leveldb/blob/main/doc/index.md

## luajit

### Authors
6 changes: 6 additions & 0 deletions format/all/all.fqtest
@@ -14,6 +14,9 @@ $ fq -n _registry.groups.probe
"gif",
"gzip",
"jpeg",
"leveldb_descriptor",
"leveldb_log",
"leveldb_table",
"luajit",
"macho",
"macho_fat",
@@ -111,6 +114,9 @@ ipv6_packet Internet protocol v6 packet
jpeg Joint Photographic Experts Group file
json JavaScript Object Notation
jsonl JavaScript Object Notation Lines
leveldb_descriptor LevelDB Descriptor
leveldb_log LevelDB Log
leveldb_table LevelDB Table
luajit LuaJIT 2.0 bytecode
macho Mach-O macOS executable
macho_fat Fat Mach-O macOS executable (multi-architecture)
1 change: 1 addition & 0 deletions format/all/all.go
@@ -30,6 +30,7 @@ import (
	_ "github.com/wader/fq/format/inet"
	_ "github.com/wader/fq/format/jpeg"
	_ "github.com/wader/fq/format/json"
	_ "github.com/wader/fq/format/leveldb"
	_ "github.com/wader/fq/format/luajit"
	_ "github.com/wader/fq/format/markdown"
	_ "github.com/wader/fq/format/math"
3 changes: 3 additions & 0 deletions format/format.go
@@ -125,6 +125,9 @@ var (
	JPEG = &decode.Group{Name: "jpeg"}
	JSON = &decode.Group{Name: "json"}
	JSONL = &decode.Group{Name: "jsonl"}
	LevelDB_Descriptor = &decode.Group{Name: "leveldb_descriptor"}
	LevelDB_LDB = &decode.Group{Name: "leveldb_table"}
	LevelDB_LOG = &decode.Group{Name: "leveldb_log"}
	LuaJIT = &decode.Group{Name: "luajit"}
	MachO = &decode.Group{Name: "macho"}
	MachO_Fat = &decode.Group{Name: "macho_fat"}
121 changes: 121 additions & 0 deletions format/leveldb/leveldb_descriptor.go
@@ -0,0 +1,121 @@
package leveldb

// https://github.com/google/leveldb/blob/main/doc/impl.md#manifest
// https://github.com/google/leveldb/blob/main/db/version_edit.cc
//
// Files in LevelDB using this format include:
//  - MANIFEST-*

import (
	"embed"

	"github.com/wader/fq/format"
	"github.com/wader/fq/pkg/decode"
	"github.com/wader/fq/pkg/interp"
	"github.com/wader/fq/pkg/scalar"
)

//go:embed leveldb_descriptor.md
var leveldbDescriptorFS embed.FS

func init() {
	interp.RegisterFormat(
		format.LevelDB_Descriptor,
		&decode.Format{
			Description: "LevelDB Descriptor",
			DecodeFn:    ldbDescriptorDecode,
		})
	interp.RegisterFS(leveldbDescriptorFS)
}

const (
	tagTypeComparator     = 1
	tagTypeLogNumber      = 2
	tagTypeNextFileNumber = 3
	tagTypeLastSequence   = 4
	tagTypeCompactPointer = 5
	tagTypeDeletedFile    = 6
	tagTypeNewFile        = 7
	// 8 not used anymore
	tagTypePrevLogNumber = 9
)

var tagTypes = scalar.UintMapSymStr{
	tagTypeComparator:     "comparator",
	tagTypeLogNumber:      "log_number",
	tagTypeNextFileNumber: "next file number",
Owner

Use underscores for these? I usually try to use jq-ish constants unless the specification is very explicit about the naming.

	tagTypeLastSequence:   "last sequence",
	tagTypeCompactPointer: "compact pointer",
	tagTypeDeletedFile:    "deleted file",
	tagTypeNewFile:        "new file",
	tagTypePrevLogNumber:  "previous log number",
}

func ldbDescriptorDecode(d *decode.D) any {
	rro := recordReadOptions{readDataFn: func(size int64, recordType int, d *decode.D) {
		if recordType == recordTypeFull {
			d.FieldStruct("data", func(d *decode.D) {
				d.LimitedFn(size, readManifest)
			})
		} else {
			d.FieldRawLen("data", size)
		}
	}}
	readBlockSequence(rro, d)

	return nil
}

// List of sorted tables for each level involving key ranges and other metadata.
func readManifest(d *decode.D) {
	d.FieldArray("tags", func(d *decode.D) {
		for !d.End() {
			d.FieldStruct("tag", func(d *decode.D) {
				tag := d.FieldULEB128("key", tagTypes)
				switch tag {
				case tagTypeComparator:
					readLengthPrefixedString("value", d)
				case tagTypeLogNumber,
					tagTypePrevLogNumber,
					tagTypeNextFileNumber,
					tagTypeLastSequence:
					d.FieldULEB128("value")
				case tagTypeCompactPointer:
					d.FieldStruct("value", func(d *decode.D) {
						d.FieldULEB128("level")
						readTagInternalKey("internal_key", d)
					})
				case tagTypeDeletedFile:
					d.FieldStruct("value", func(d *decode.D) {
						d.FieldULEB128("level")
						d.FieldULEB128("file_number")
					})
				case tagTypeNewFile:
					d.FieldStruct("value", func(d *decode.D) {
						d.FieldULEB128("level")
						d.FieldULEB128("file_number")
						d.FieldULEB128("file_size")
						readTagInternalKey("smallest_internal_key", d)
						readTagInternalKey("largest_internal_key", d)
					})
				default:
					d.Fatalf("unknown tag: %d", tag)
				}
			})
		}
	})
}

func readLengthPrefixedString(name string, d *decode.D) {
	d.FieldStruct(name, func(d *decode.D) {
		length := d.FieldULEB128("length")
		d.FieldUTF8("data", int(length))
	})
}

func readTagInternalKey(name string, d *decode.D) {
	d.FieldStruct(name, func(d *decode.D) {
		length := d.FieldULEB128("length")
		readInternalKey("data", int64(length), d)
	})
}
13 changes: 13 additions & 0 deletions format/leveldb/leveldb_descriptor.md
@@ -0,0 +1,13 @@
### Limitations

- Fragmented (non-"full") records are not decoded further.


### Authors

- [@mikez](https://github.com/mikez), original author

### References

- https://github.com/google/leveldb/blob/main/doc/impl.md#manifest
- https://github.com/google/leveldb/blob/main/db/version_edit.cc
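
For readers who want to see the tag layout outside fq, here is a minimal standalone sketch of the scalar tags that `readManifest` handles: each tag is a LEB128 varint key followed by either a varint value or a length-prefixed string. The `edit` type and the `parseEdit`, `uvarint`, and `sampleEdit` helpers are hypothetical names for this illustration and are not part of the PR. Tags 5, 6, and 7 are deliberately left out, and the sample bytes are made up.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Tag numbers copied from the const block in leveldb_descriptor.go.
const (
	tagComparator     = 1
	tagLogNumber      = 2
	tagNextFileNumber = 3
	tagLastSequence   = 4
	tagPrevLogNumber  = 9
)

// edit is a hypothetical holder for the scalar fields of one manifest record.
type edit struct {
	Comparator     string
	LogNumber      uint64
	NextFileNumber uint64
	LastSequence   uint64
	PrevLogNumber  uint64
}

var errMalformed = errors.New("malformed or truncated field")

// uvarint reads one LEB128 unsigned integer and returns the remaining bytes.
func uvarint(b []byte) (uint64, []byte, error) {
	v, n := binary.Uvarint(b)
	if n <= 0 {
		return 0, nil, errMalformed
	}
	return v, b[n:], nil
}

// parseEdit decodes the scalar tags of a "full" manifest record.
// Tags 5, 6 and 7 carry nested structures and are rejected here for brevity.
func parseEdit(b []byte) (edit, error) {
	var e edit
	for len(b) > 0 {
		tag, rest, err := uvarint(b)
		if err != nil {
			return e, err
		}
		b = rest

		var dst *uint64
		switch tag {
		case tagComparator:
			// Length-prefixed string: varint length, then that many bytes.
			n, rest, err := uvarint(b)
			if err != nil || uint64(len(rest)) < n {
				return e, errMalformed
			}
			e.Comparator, b = string(rest[:n]), rest[n:]
			continue
		case tagLogNumber:
			dst = &e.LogNumber
		case tagNextFileNumber:
			dst = &e.NextFileNumber
		case tagLastSequence:
			dst = &e.LastSequence
		case tagPrevLogNumber:
			dst = &e.PrevLogNumber
		default:
			return e, fmt.Errorf("unsupported tag: %d", tag)
		}

		v, rest, err := uvarint(b)
		if err != nil {
			return e, err
		}
		*dst, b = v, rest
	}
	return e, nil
}

// sampleEdit builds made-up input bytes for the sketch.
func sampleEdit() []byte {
	name := "leveldb.BytewiseComparator"
	b := []byte{tagComparator, byte(len(name))}
	b = append(b, name...)
	// log_number = 3, next file number = 300 (two varint bytes), last sequence = 7
	return append(b, tagLogNumber, 3, tagNextFileNumber, 0xac, 0x02, tagLastSequence, 7)
}

func main() {
	e, err := parseEdit(sampleEdit())
	fmt.Printf("%+v %v\n", e, err)
}
```

Unlike the fq decoder, which records every field with its bit range, this sketch only keeps the decoded values, so it is useful for checking one's understanding of the layout rather than for inspection.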
151 changes: 151 additions & 0 deletions format/leveldb/leveldb_log.go
@@ -0,0 +1,151 @@
package leveldb

// https://github.com/google/leveldb/blob/main/doc/log_format.md
//
// Files in LevelDB using this format include:
//  - *.log
//  - MANIFEST-*

import (
	"embed"

	"github.com/wader/fq/format"
	"github.com/wader/fq/internal/mathex"
	"github.com/wader/fq/pkg/decode"
	"github.com/wader/fq/pkg/interp"
	"github.com/wader/fq/pkg/scalar"
)

//go:embed leveldb_log.md
var leveldbLogFS embed.FS

func init() {
	interp.RegisterFormat(
		format.LevelDB_LOG,
		&decode.Format{
			Description: "LevelDB Log",
			DecodeFn:    ldbLogDecode,
		})
	interp.RegisterFS(leveldbLogFS)
}

type recordReadOptions struct {
	// Both .log- and MANIFEST-files use the Log-format,
	// i.e., a sequence of records split into 32KB blocks.
	// However, the format of the data within the records differ.
	// This function specifies how to read said data.
	readDataFn func(size int64, recordType int, d *decode.D)
}

// https://github.com/google/leveldb/blob/main/db/log_format.h
const (
	// checksum (4 bytes) + length (2 bytes) + record type (1 byte)
	headerSize = (4 + 2 + 1) * 8

	blockSize = (32 * 1024) * 8 // 32KB

	recordTypeZero   = 0 // preallocated file regions
	recordTypeFull   = 1
	recordTypeFirst  = 2 // fragments
	recordTypeMiddle = 3
	recordTypeLast   = 4
)

var recordTypes = scalar.UintMapSymStr{
	recordTypeZero:   "zero",
	recordTypeFull:   "full",
	recordTypeFirst:  "first",
	recordTypeMiddle: "middle",
	recordTypeLast:   "last",
}

func ldbLogDecode(d *decode.D) any {
	rro := recordReadOptions{readDataFn: func(size int64, recordType int, d *decode.D) {
		d.FieldRawLen("data", size)
	}}
	readBlockSequence(rro, d)

	return nil
}

// Read a sequence of 32KB-blocks (the last one may be less).
// https://github.com/google/leveldb/blob/main/db/log_reader.cc#L189
func readBlockSequence(rro recordReadOptions, d *decode.D) {
	d.Endian = decode.LittleEndian

	d.FieldArray("blocks", func(d *decode.D) {
		for d.BitsLeft() >= headerSize {
			d.LimitedFn(mathex.Min(blockSize, d.BitsLeft()), func(d *decode.D) {
				d.FieldStruct("block", bind(readLogBlock, rro))
			})
		}
	})

	if d.BitsLeft() > 0 {
		// The reference implementation says:
		// "[...] if buffer_ is non-empty, we have a truncated header at the
		// end of the file, which can be caused by the writer crashing in the
		// middle of writing the header. Instead of considering this an error,
		// just report EOF."
		d.FieldRawLen("truncated_block", d.BitsLeft())
	}
}

// Read a Log-block, consisting of up to 32KB of records and an optional trailer.
//
//	block := record* trailer?
func readLogBlock(rro recordReadOptions, d *decode.D) {
	if d.BitsLeft() > blockSize {
		d.Fatalf("Bits left greater than maximum log-block size of 32KB.")
	}
	// record*
	d.FieldArray("records", func(d *decode.D) {
		for d.BitsLeft() >= headerSize {
			d.FieldStruct("record", bind(readLogRecord, rro))
		}
	})
	// trailer?
	if d.BitsLeft() > 0 {
		d.FieldRawLen("trailer", d.BitsLeft())
	}
}

// Read a Log-record.
//
//	checksum: uint32     // crc32c of type and data[] ; little-endian
//	length: uint16       // little-endian
//	type: uint8          // One of FULL, FIRST, MIDDLE, LAST
//	data: uint8[length]
//
// via https://github.com/google/leveldb/blob/main/doc/log_format.md
func readLogRecord(rro recordReadOptions, d *decode.D) {
	// header
	var checksumValue *decode.Value
	var length int64
	var recordType int
	d.LimitedFn(headerSize, func(d *decode.D) {
		d.FieldStruct("header", func(d *decode.D) {
			d.FieldU32("checksum", scalar.UintHex)
			checksumValue = d.FieldGet("checksum")
			length = int64(d.FieldU16("length"))
			recordType = int(d.FieldU8("record_type", recordTypes))
		})
	})

	// verify checksum: record type (1 byte) + data (`length` bytes)
	d.RangeFn(d.Pos()-8, (1+length)*8, func(d *decode.D) {
		bytesToCheck := d.Bits(int(d.BitsLeft()))
		actualChecksum := computeChecksum(bytesToCheck)
		_ = checksumValue.TryUintScalarFn(d.UintAssert(uint64(actualChecksum)))
	})

	// data
	dataSize := length * 8
	rro.readDataFn(dataSize, recordType, d)
}

func bind(f func(recordReadOptions, *decode.D), rro recordReadOptions) func(*decode.D) {
	return func(d *decode.D) {
		f(rro, d)
	}
}
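
The record layout and checksum check in `readLogRecord` can also be sketched with only the Go standard library. `computeChecksum` is not shown in this diff, so the masking step below is an assumption taken from the LevelDB reference implementation (CRC-32C over the type byte and data, rotated right by 15 bits, plus a constant). The `record` type and the `maskedCRC`, `encodeRecord`, and `decodeRecord` names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

const (
	headerLen = 4 + 2 + 1  // checksum + length + record type, in bytes
	maskDelta = 0xa282ead8 // masking constant used by LevelDB's crc32c helper
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// maskedCRC computes CRC-32C over the record type byte followed by the data,
// then applies the LevelDB mask. Assumption: this mirrors what the PR's
// computeChecksum does; that function is not visible in the diff.
func maskedCRC(recordType byte, data []byte) uint32 {
	crc := crc32.Checksum([]byte{recordType}, castagnoli)
	crc = crc32.Update(crc, castagnoli, data)
	return (crc>>15 | crc<<17) + maskDelta
}

// record is one physical log record (hypothetical type for this sketch).
type record struct {
	Type byte
	Data []byte
}

// encodeRecord builds header + data; it exists only to produce sample input.
func encodeRecord(t byte, data []byte) []byte {
	buf := make([]byte, headerLen, headerLen+len(data))
	binary.LittleEndian.PutUint32(buf[0:4], maskedCRC(t, data))
	binary.LittleEndian.PutUint16(buf[4:6], uint16(len(data)))
	buf[6] = t
	return append(buf, data...)
}

// decodeRecord parses one record from the start of b, verifies its checksum,
// and returns the record together with the remaining bytes.
func decodeRecord(b []byte) (record, []byte, error) {
	if len(b) < headerLen {
		return record{}, nil, errors.New("truncated header")
	}
	want := binary.LittleEndian.Uint32(b[0:4])
	length := int(binary.LittleEndian.Uint16(b[4:6]))
	t := b[6]
	if len(b) < headerLen+length {
		return record{}, nil, errors.New("truncated data")
	}
	data := b[headerLen : headerLen+length]
	if got := maskedCRC(t, data); got != want {
		return record{}, nil, fmt.Errorf("checksum mismatch: got %#x, want %#x", got, want)
	}
	return record{Type: t, Data: data}, b[headerLen+length:], nil
}

func main() {
	raw := encodeRecord(1, []byte("hello")) // type 1 = "full"
	rec, rest, err := decodeRecord(raw)
	fmt.Println(rec.Type, string(rec.Data), len(rest), err)
}
```

The sketch fails hard on a mismatch, whereas the fq decoder attaches the result to the `checksum` field as a validity assertion and keeps decoding, which is the more useful behaviour for inspecting damaged files.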
11 changes: 11 additions & 0 deletions format/leveldb/leveldb_log.md
@@ -0,0 +1,11 @@
### Limitations

- Fragmented records are not merged, and their data is not decoded further.

### Authors

- [@mikez](https://github.com/mikez), original author

### References

- https://github.com/google/leveldb/blob/main/doc/log_format.md
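
As an illustration of what merging would involve, here is a minimal sketch that joins physical records into logical ones following the FULL / FIRST / MIDDLE / LAST scheme from the log format document. It is not part of this PR. The `fragment` type and `mergeFragments` function are hypothetical, and the sketch is stricter than the reference reader, which reports corruption and carries on.

```go
package main

import (
	"errors"
	"fmt"
)

// Record types as listed in leveldb_log.go.
const (
	typeFull   = 1
	typeFirst  = 2
	typeMiddle = 3
	typeLast   = 4
)

// fragment is one physical record as read from a block (hypothetical type).
type fragment struct {
	Type byte
	Data []byte
}

// mergeFragments joins physical records into logical records.
// A "full" record stands alone; a "first" record followed by any number of
// "middle" records and one "last" record is concatenated into one payload.
func mergeFragments(frags []fragment) ([][]byte, error) {
	var out [][]byte
	var pending []byte
	inFragment := false
	for _, f := range frags {
		switch f.Type {
		case typeFull:
			if inFragment {
				return nil, errors.New("full record inside a fragmented record")
			}
			out = append(out, append([]byte(nil), f.Data...))
		case typeFirst:
			if inFragment {
				return nil, errors.New("first record inside a fragmented record")
			}
			pending = append([]byte(nil), f.Data...)
			inFragment = true
		case typeMiddle:
			if !inFragment {
				return nil, errors.New("middle record without a first record")
			}
			pending = append(pending, f.Data...)
		case typeLast:
			if !inFragment {
				return nil, errors.New("last record without a first record")
			}
			out = append(out, append(pending, f.Data...))
			pending, inFragment = nil, false
		default:
			return nil, fmt.Errorf("unexpected record type: %d", f.Type)
		}
	}
	if inFragment {
		return nil, errors.New("missing last record")
	}
	return out, nil
}

func main() {
	logical, err := mergeFragments([]fragment{
		{Type: typeFull, Data: []byte("a")},
		{Type: typeFirst, Data: []byte("b")},
		{Type: typeMiddle, Data: []byte("c")},
		{Type: typeLast, Data: []byte("d")},
	})
	for _, l := range logical {
		fmt.Printf("%q\n", l)
	}
	fmt.Println(err)
}
```

In fq the merged payload would also need its own bit ranges to stay traceable to the original blocks, which is presumably part of why merging is listed as a limitation rather than done inline.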