leveldb: Add LevelDB support #824
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -72,6 +72,7 @@ | |
|`jpeg` |Joint Photographic Experts Group file |<sub>`exif` `icc_profile`</sub>| | ||
|`json` |JavaScript Object Notation |<sub></sub>| | ||
|`jsonl` |JavaScript Object Notation Lines |<sub></sub>| | ||
|[`leveldb_ldb`](#leveldb_ldb) |LevelDB Table |<sub></sub>| | ||
|[`luajit`](#luajit) |LuaJIT 2.0 bytecode |<sub></sub>| | ||
|[`macho`](#macho) |Mach-O macOS executable |<sub></sub>| | ||
|`macho_fat` |Fat Mach-O macOS executable (multi-architecture) |<sub>`macho`</sub>| | ||
|
@@ -131,7 +132,7 @@ | |
|`ip_packet` |Group |<sub>`icmp` `icmpv6` `tcp_segment` `udp_datagram`</sub>| | ||
|`link_frame` |Group |<sub>`bsd_loopback_frame` `ether8023_frame` `ipv4_packet` `ipv6_packet` `sll2_packet` `sll_packet`</sub>| | ||
|`mp3_frame_tags` |Group |<sub>`mp3_frame_vbri` `mp3_frame_xing`</sub>| | ||
|`probe` |Group |<sub>`adts` `aiff` `apple_bookmark` `ar` `avi` `avro_ocf` `bitcoin_blkdat` `bplist` `bzip2` `caff` `elf` `flac` `gif` `gzip` `html` `jpeg` `json` `jsonl` `luajit` `macho` `macho_fat` `matroska` `moc3` `mp3` `mp4` `mpeg_ts` `ogg` `opentimestamps` `pcap` `pcapng` `png` `tar` `tiff` `toml` `tzif` `wasm` `wav` `webp` `xml` `yaml` `zip`</sub>| | ||
|`probe` |Group |<sub>`adts` `aiff` `apple_bookmark` `ar` `avi` `avro_ocf` `bitcoin_blkdat` `bplist` `bzip2` `caff` `elf` `flac` `gif` `gzip` `html` `jpeg` `json` `jsonl` `leveldb_ldb` `luajit` `macho` `macho_fat` `matroska` `moc3` `mp3` `mp4` `mpeg_ts` `ogg` `opentimestamps` `pcap` `pcapng` `png` `tar` `tiff` `toml` `tzif` `wasm` `wav` `webp` `xml` `yaml` `zip`</sub>| | ||
|`tcp_stream` |Group |<sub>`dns_tcp` `rtmp` `tls`</sub>| | ||
|`udp_payload` |Group |<sub>`dns`</sub>| | ||
|
||
|
@@ -690,6 +691,23 @@ $ fq -n -d html '[inputs | {key: input_filename, value: .html.head.title?}] | fr | |
$ fq -r -o array=true -d html '.. | select(.[0] == "a" and .[1].href)?.[1].href' file.html | ||
``` | ||
|
||
## leveldb_ldb | ||
|
||
### Limitations | ||
|
||
- no Meta Blocks (like "filter") are decoded yet. | ||
- Zstandard decompression is not implemented yet. | ||
Review comment: Does this more or less just depend on some zstd package? I've looked at https://github.com/klauspost/compress a couple of times; it has zstd and lots of other formats, and I suspect its APIs are a bit more low-level than the Go standard library's, so it might fit fq better. It could maybe replace github.com/golang/snappy too? Impressively, it also seems to have zero dependencies on other packages.
Reply: I don't know how common
Reply: OK, then I think we can skip it for now; it's probably also good not to let this PR grow too much.
||
|
||
### Authors | ||
|
||
- [@mikez](https://github.com/mikez), original author | ||
|
||
### References | ||
|
||
- https://github.com/google/leveldb/blob/main/doc/table_format.md | ||
- https://github.com/google/leveldb/blob/main/doc/impl.md | ||
- https://github.com/google/leveldb/blob/main/doc/index.md | ||
|
||
## luajit | ||
|
||
### Authors | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
package leveldb | ||
|
||
// https://github.com/google/leveldb/blob/main/doc/impl.md#manifest | ||
// https://github.com/google/leveldb/blob/main/db/version_edit.cc | ||
// | ||
// Files in LevelDB using this format include: | ||
// - MANIFEST-* | ||
|
||
import ( | ||
"embed" | ||
|
||
"github.com/wader/fq/format" | ||
"github.com/wader/fq/pkg/decode" | ||
"github.com/wader/fq/pkg/interp" | ||
"github.com/wader/fq/pkg/scalar" | ||
) | ||
|
||
//go:embed leveldb_descriptor.md
var leveldbDescriptorFS embed.FS | ||
|
||
func init() { | ||
interp.RegisterFormat( | ||
format.LevelDB_Descriptor, | ||
&decode.Format{ | ||
Description: "LevelDB Descriptor", | ||
DecodeFn: ldbDescriptorDecode, | ||
}) | ||
interp.RegisterFS(leveldbDescriptorFS) | ||
} | ||
|
||
const ( | ||
tagTypeComparator = 1 | ||
tagTypeLogNumber = 2 | ||
tagTypeNextFileNumber = 3 | ||
tagTypeLastSequence = 4 | ||
tagTypeCompactPointer = 5 | ||
tagTypeDeletedFile = 6 | ||
tagTypeNewFile = 7 | ||
// 8 not used anymore | ||
tagTypePrevLogNumber = 9 | ||
) | ||
|
||
var tagTypes = scalar.UintMapSymStr{ | ||
tagTypeComparator: "comparator", | ||
tagTypeLogNumber: "log_number", | ||
tagTypeNextFileNumber: "next file number", | ||
Review comment: Use underscores for these? I usually try to use jq-ish constants unless the specification is very explicit about the naming.
||
tagTypeLastSequence: "last sequence", | ||
tagTypeCompactPointer: "compact pointer", | ||
tagTypeDeletedFile: "deleted file", | ||
tagTypeNewFile: "new file", | ||
tagTypePrevLogNumber: "previous log number", | ||
} | ||
|
||
func ldbDescriptorDecode(d *decode.D) any { | ||
rro := recordReadOptions{readDataFn: func(size int64, recordType int, d *decode.D) { | ||
if recordType == recordTypeFull { | ||
d.FieldStruct("data", func(d *decode.D) { | ||
d.LimitedFn(size, readManifest) | ||
}) | ||
} else { | ||
d.FieldRawLen("data", size) | ||
} | ||
}} | ||
readBlockSequence(rro, d) | ||
|
||
return nil | ||
} | ||
|
||
// List of sorted tables for each level involving key ranges and other metadata. | ||
func readManifest(d *decode.D) { | ||
d.FieldArray("tags", func(d *decode.D) { | ||
for !d.End() { | ||
d.FieldStruct("tag", func(d *decode.D) { | ||
tag := d.FieldULEB128("key", tagTypes) | ||
switch tag { | ||
case tagTypeComparator: | ||
readLengthPrefixedString("value", d) | ||
case tagTypeLogNumber, | ||
tagTypePrevLogNumber, | ||
tagTypeNextFileNumber, | ||
tagTypeLastSequence: | ||
d.FieldULEB128("value") | ||
case tagTypeCompactPointer: | ||
d.FieldStruct("value", func(d *decode.D) { | ||
d.FieldULEB128("level") | ||
readTagInternalKey("internal_key", d) | ||
}) | ||
case tagTypeDeletedFile: | ||
d.FieldStruct("value", func(d *decode.D) { | ||
d.FieldULEB128("level") | ||
d.FieldULEB128("file_number") | ||
}) | ||
case tagTypeNewFile: | ||
d.FieldStruct("value", func(d *decode.D) { | ||
d.FieldULEB128("level") | ||
d.FieldULEB128("file_number") | ||
d.FieldULEB128("file_size") | ||
readTagInternalKey("smallest_internal_key", d) | ||
readTagInternalKey("largest_internal_key", d) | ||
}) | ||
default: | ||
d.Fatalf("unknown tag: %d", tag) | ||
} | ||
}) | ||
} | ||
}) | ||
} | ||
|
||
func readLengthPrefixedString(name string, d *decode.D) { | ||
d.FieldStruct(name, func(d *decode.D) { | ||
length := d.FieldULEB128("length") | ||
d.FieldUTF8("data", int(length)) | ||
}) | ||
} | ||
|
||
func readTagInternalKey(name string, d *decode.D) { | ||
d.FieldStruct(name, func(d *decode.D) { | ||
length := d.FieldULEB128("length") | ||
readInternalKey("data", int64(length), d) | ||
}) | ||
} |
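The tag loop in `readManifest` above relies on two primitives: a LEB128 unsigned integer and a length-prefixed byte string. As a rough standalone illustration (not the PR's code, and not using fq's `decode.D` API), the same reads can be sketched with the Go standard library, since `binary.Uvarint` decodes the same variable-length encoding. The helper names `readUvarint` and `readLengthPrefixed` and the sample bytes are hypothetical; only the tag numbers are taken from the diff.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Tag numbers copied from the diff above; only two are used in this sketch.
const (
	tagComparator     = 1
	tagNextFileNumber = 3
)

// readUvarint consumes one variable-length unsigned integer from buf and
// returns the value together with the remaining bytes.
func readUvarint(buf []byte) (uint64, []byte, error) {
	v, n := binary.Uvarint(buf)
	if n <= 0 {
		return 0, nil, errors.New("truncated or overflowing varint")
	}
	return v, buf[n:], nil
}

// readLengthPrefixed consumes a varint length followed by that many bytes.
func readLengthPrefixed(buf []byte) ([]byte, []byte, error) {
	n, rest, err := readUvarint(buf)
	if err != nil {
		return nil, nil, err
	}
	if uint64(len(rest)) < n {
		return nil, nil, errors.New("length prefix exceeds remaining data")
	}
	return rest[:n], rest[n:], nil
}

func main() {
	// Hypothetical payload: comparator tag with the name "abc",
	// then next-file-number tag with the value 300 (0xac 0x02).
	rec := []byte{0x01, 0x03, 'a', 'b', 'c', 0x03, 0xac, 0x02}
	for len(rec) > 0 {
		tag, rest, err := readUvarint(rec)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		rec = rest
		switch tag {
		case tagComparator:
			name, rest, err := readLengthPrefixed(rec)
			if err != nil {
				fmt.Println("error:", err)
				return
			}
			rec = rest
			fmt.Printf("comparator=%s\n", name)
		case tagNextFileNumber:
			v, rest, err := readUvarint(rec)
			if err != nil {
				fmt.Println("error:", err)
				return
			}
			rec = rest
			fmt.Printf("next_file_number=%d\n", v)
		default:
			fmt.Printf("unknown tag: %d\n", tag)
			return
		}
	}
}
```

Unlike the decoder in the diff, this sketch stops at the first unknown tag by printing a message instead of calling `d.Fatalf`, and it handles only two of the tag types.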
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
### Limitations | ||
|
||
- fragmented non-"full" records are not decoded further. | ||
|
||
|
||
### Authors | ||
|
||
- [@mikez](https://github.com/mikez), original author | ||
|
||
### References | ||
|
||
- https://github.com/google/leveldb/blob/main/doc/impl.md#manifest | ||
- https://github.com/google/leveldb/blob/main/db/version_edit.cc |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,151 @@ | ||
package leveldb | ||
|
||
// https://github.com/google/leveldb/blob/main/doc/log_format.md | ||
// | ||
// Files in LevelDB using this format include: | ||
// - *.log | ||
// - MANIFEST-* | ||
|
||
import ( | ||
"embed" | ||
|
||
"github.com/wader/fq/format" | ||
"github.com/wader/fq/internal/mathex" | ||
"github.com/wader/fq/pkg/decode" | ||
"github.com/wader/fq/pkg/interp" | ||
"github.com/wader/fq/pkg/scalar" | ||
) | ||
|
||
//go:embed leveldb_log.md | ||
var leveldbLogFS embed.FS | ||
|
||
func init() { | ||
interp.RegisterFormat( | ||
format.LevelDB_LOG, | ||
&decode.Format{ | ||
Description: "LevelDB Log", | ||
DecodeFn: ldbLogDecode, | ||
}) | ||
interp.RegisterFS(leveldbLogFS) | ||
} | ||
|
||
type recordReadOptions struct { | ||
// Both .log- and MANIFEST-files use the Log-format, | ||
// i.e., a sequence of records split into 32KB blocks. | ||
// However, the format of the data within the records differs.
// This function specifies how to read said data. | ||
readDataFn func(size int64, recordType int, d *decode.D) | ||
} | ||
|
||
// https://github.com/google/leveldb/blob/main/db/log_format.h | ||
const ( | ||
// checksum (4 bytes) + length (2 bytes) + record type (1 byte) | ||
headerSize = (4 + 2 + 1) * 8 | ||
|
||
blockSize = (32 * 1024) * 8 // 32KB | ||
|
||
recordTypeZero = 0 // preallocated file regions | ||
recordTypeFull = 1 | ||
recordTypeFirst = 2 // fragments | ||
recordTypeMiddle = 3 | ||
recordTypeLast = 4 | ||
) | ||
|
||
var recordTypes = scalar.UintMapSymStr{ | ||
recordTypeZero: "zero", | ||
recordTypeFull: "full", | ||
recordTypeFirst: "first", | ||
recordTypeMiddle: "middle", | ||
recordTypeLast: "last", | ||
} | ||
|
||
func ldbLogDecode(d *decode.D) any { | ||
rro := recordReadOptions{readDataFn: func(size int64, recordType int, d *decode.D) { | ||
d.FieldRawLen("data", size) | ||
}} | ||
readBlockSequence(rro, d) | ||
|
||
return nil | ||
} | ||
|
||
// Read a sequence of 32KB-blocks (the last one may be less). | ||
// https://github.com/google/leveldb/blob/main/db/log_reader.cc#L189 | ||
func readBlockSequence(rro recordReadOptions, d *decode.D) { | ||
d.Endian = decode.LittleEndian | ||
|
||
d.FieldArray("blocks", func(d *decode.D) { | ||
for d.BitsLeft() >= headerSize { | ||
d.LimitedFn(mathex.Min(blockSize, d.BitsLeft()), func(d *decode.D) { | ||
d.FieldStruct("block", bind(readLogBlock, rro)) | ||
}) | ||
} | ||
}) | ||
|
||
if d.BitsLeft() > 0 { | ||
// The reference implementation says: | ||
// "[...] if buffer_ is non-empty, we have a truncated header at the | ||
// end of the file, which can be caused by the writer crashing in the | ||
// middle of writing the header. Instead of considering this an error, | ||
// just report EOF." | ||
d.FieldRawLen("truncated_block", d.BitsLeft()) | ||
} | ||
} | ||
|
||
// Read a Log-block, consisting of up to 32KB of records and an optional trailer. | ||
// | ||
// block := record* trailer? | ||
func readLogBlock(rro recordReadOptions, d *decode.D) { | ||
if d.BitsLeft() > blockSize { | ||
d.Fatalf("Bits left greater than maximum log-block size of 32KB.") | ||
} | ||
// record* | ||
d.FieldArray("records", func(d *decode.D) { | ||
for d.BitsLeft() >= headerSize { | ||
d.FieldStruct("record", bind(readLogRecord, rro)) | ||
} | ||
}) | ||
// trailer? | ||
if d.BitsLeft() > 0 { | ||
d.FieldRawLen("trailer", d.BitsLeft()) | ||
} | ||
} | ||
|
||
// Read a Log-record. | ||
// | ||
// checksum: uint32 // crc32c of type and data[] ; little-endian | ||
// length: uint16 // little-endian | ||
// type: uint8 // One of FULL, FIRST, MIDDLE, LAST | ||
// data: uint8[length] | ||
// | ||
// via https://github.com/google/leveldb/blob/main/doc/log_format.md | ||
func readLogRecord(rro recordReadOptions, d *decode.D) { | ||
// header | ||
var checksumValue *decode.Value | ||
var length int64 | ||
var recordType int | ||
d.LimitedFn(headerSize, func(d *decode.D) { | ||
d.FieldStruct("header", func(d *decode.D) { | ||
d.FieldU32("checksum", scalar.UintHex) | ||
checksumValue = d.FieldGet("checksum") | ||
length = int64(d.FieldU16("length")) | ||
recordType = int(d.FieldU8("record_type", recordTypes)) | ||
}) | ||
}) | ||
|
||
// verify checksum: record type (1 byte) + data (`length` bytes) | ||
d.RangeFn(d.Pos()-8, (1+length)*8, func(d *decode.D) { | ||
bytesToCheck := d.Bits(int(d.BitsLeft())) | ||
actualChecksum := computeChecksum(bytesToCheck) | ||
_ = checksumValue.TryUintScalarFn(d.UintAssert(uint64(actualChecksum))) | ||
}) | ||
|
||
// data | ||
dataSize := length * 8 | ||
rro.readDataFn(dataSize, recordType, d) | ||
} | ||
|
||
func bind(f func(recordReadOptions, *decode.D), rro recordReadOptions) func(*decode.D) { | ||
return func(d *decode.D) { | ||
f(rro, d) | ||
} | ||
} |
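The record layout documented above `readLogRecord` (4-byte checksum, 2-byte length, 1-byte type, then the data) can also be exercised outside fq. The following is a minimal sketch using only the standard library, under these assumptions: `computeChecksum` is not visible in this diff, so the `maskCRC` helper here follows my reading of the masking in LevelDB's `util/crc32c.h` (rotate right by 15 bits, then add `0xa282ead8`), and `parseRecord` and `buildRecord` are hypothetical names introduced for the example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// checksum (4 bytes) + length (2 bytes) + record type (1 byte), in bytes.
// Note that the diff above expresses the same size in bits.
const headerSize = 4 + 2 + 1

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// maskCRC applies the masking assumed for stored checksums.
func maskCRC(crc uint32) uint32 {
	return ((crc >> 15) | (crc << 17)) + 0xa282ead8
}

type record struct {
	Type byte
	Data []byte
}

// parseRecord reads one record from the start of buf, verifies its checksum,
// and returns the record plus the remaining bytes.
func parseRecord(buf []byte) (record, []byte, error) {
	if len(buf) < headerSize {
		return record{}, nil, errors.New("truncated header")
	}
	stored := binary.LittleEndian.Uint32(buf[0:4])
	length := int(binary.LittleEndian.Uint16(buf[4:6]))
	end := headerSize + length
	if len(buf) < end {
		return record{}, nil, errors.New("truncated data")
	}
	// The checksum covers the type byte and the data,
	// not the checksum and length fields.
	actual := maskCRC(crc32.Checksum(buf[6:end], castagnoli))
	if actual != stored {
		return record{}, nil, fmt.Errorf("checksum mismatch: stored %#x, computed %#x", stored, actual)
	}
	return record{Type: buf[6], Data: buf[headerSize:end]}, buf[end:], nil
}

// buildRecord exists only so the example has valid input to parse.
func buildRecord(typ byte, data []byte) []byte {
	buf := make([]byte, headerSize+len(data))
	binary.LittleEndian.PutUint16(buf[4:6], uint16(len(data)))
	buf[6] = typ
	copy(buf[headerSize:], data)
	binary.LittleEndian.PutUint32(buf[0:4], maskCRC(crc32.Checksum(buf[6:], castagnoli)))
	return buf
}

func main() {
	raw := buildRecord(1, []byte("hello")) // type 1 is "full" in the diff above
	rec, rest, err := parseRecord(raw)
	fmt.Println(rec.Type, string(rec.Data), len(rest), err)
	// prints "1 hello 0 <nil>"
}
```

The sketch returns an error on a checksum mismatch, whereas the decoder in the diff records the result as an assertion on the `checksum` field and keeps decoding.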
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
### Limitations | ||
|
||
- individual records are not merged and their data is not decoded further.
|
||
### Authors | ||
|
||
- [@mikez](https://github.com/mikez), original author | ||
|
||
### References | ||
|
||
- https://github.com/google/leveldb/blob/main/doc/log_format.md |
Review comment: Run `make doc` to update.
Reply: Oddly, the `make doc` command changes a lot of content in the SVGs. I left them out of the commit, as it seems they shouldn't change.
Reply: Aha, yes, skip those; I will fix it. I made some changes to https://github.com/wader/ansisvg a few days ago.