
leveldb: Add LevelDB support #824

Merged
merged 18 commits on Dec 9, 2023
Conversation

@mikez (Contributor) commented Dec 4, 2023

No description provided.

Example:
go run . -d ldb d format/ldb/testdata/000005.ldb
@wader (Owner) left a comment

Nice start! Did some initial reviewing to get going.

format/format.go Outdated
@@ -125,6 +125,7 @@ var (
JPEG = &decode.Group{Name: "jpeg"}
JSON = &decode.Group{Name: "json"}
JSONL = &decode.Group{Name: "jsonl"}
LDB = &decode.Group{Name: "ldb"}
wader (Owner):

What do you think about naming it "leveldb" for clarity?

mikez (Contributor Author), Dec 5, 2023:

Yes. And maybe even go one step further and use the MPEG pattern here of doing leveldb_ldb (and eventually leveldb_log etc.)?

wader (Owner):

yeap makes sense, see my other comments about it

var compressionTypes = scalar.UintMapSymStr{
compressionTypeNone: "none",
compressionTypeSnappy: "Snappy",
compressionTypeZstandard: "Zstandard",
wader (Owner):

Keep all lowercase?

wader (Owner):

Possibly "zstd"? not sure what is most common

var indexSize int64
var metaIndexOffset int64
var metaIndexSize int64

wader (Owner):

I wonder if use of d.LimitedFn or d.RangeFn could simplify some field length calculations? mostly thinking of the use of d.Pos/d.Len, is the padding dynamic in size somehow?

mikez (Contributor Author):

Thanks for the pointer to these functions. Rewrote.

wader (Owner):

yeap they are quite nice sometimes so one doesn't have to keep track of the number of bits left etc. Note that I think d.Pos() inside one of these still returns the absolute position in the "root" buffer but d.Len() will change behaviour. Maybe there should be d.RootPos()/d.RootLen() etc?

compressedSize := size
compressed := data
bb := &bytes.Buffer{}
_ = bb
wader (Owner):

debug leftover i guess :)


// index

d.SeekAbs(indexOffset * 8)
wader (Owner):

Wonder if d.FramedLen or d.RangeLen could be used here instead of seek and passing the size?

mikez (Contributor Author):

You mean d.FramedFn and d.RangeFn? It is not clear to me how this makes the code clearer.
We find these offsets in the footer, and then jump to them to read the data.

wader (Owner):

No worries, only use if it feels it makes things clearer. d.RangeFn can sometimes be used instead of combination of seek and the limit

})
})
// TK: how do you make an empty entries-array appear _above_ the trailer?
// Right now, it's omitted if empty.
wader (Owner):

They both end up in a struct? Then they will be sorted by bit range start... and I guess it might be that the entries array ends up with a zero range atm, hmm feels like a bug 🤔

mikez (Contributor Author), Dec 5, 2023:

Typically a key-value block has (1) entries and (2) trailer. However, occasionally there's only a trailer and no entries. What to do in such scenarios?

  • Option 1: Leave out the entries array altogether. (Current solution.)
  • Option 2: Show an empty entries array. (Proposition. However, it ends up being shown below the trailer, since it's empty. We need to read the trailer first, to figure out if there's an entries array or not.)

wader (Owner):

I guess it would be nice to have an empty entries array for clarity? i will have to look closer at how to solve this. But don't think it's a blocker to merge, can be changed/fixed later.

d.FieldRawLen("raw", size*8)
}

func decodeVarInt(d *decode.D) uint64 {
wader (Owner):

Is this LEB128? thinking maybe can use d.FieldULEB128

mikez (Contributor Author):

Made a remark in tryULEB128 that it's the same as "Base 128 Varint" (a term used in Google contexts).

@@ -0,0 +1,98 @@
$ fq -d ldb dv uncompressed.ldb/000005.ldb
wader (Owner):

snappy test will be added later?

mikez (Contributor Author):

added.

@@ -0,0 +1,45 @@
# Make LevelDB data: both uncompressed and compressed.
wader (Owner):

Nice with code to generate test cases!

interp.RegisterFormat(
format.LDB,
&decode.Format{
Description: "LevelDB Table",
wader (Owner):

You can add a .md file named the same as the format and add various tips, todos and author. see maybe bson.md and bson.go how it works. It will also be used for help in the CLI and for generating formats.md with make doc

@mikez (Contributor Author) commented Dec 5, 2023

@wader Thank you for the comments. Now also read dev.md and made updates accordingly.

@mikez (Contributor Author) commented Dec 5, 2023

@wader Also a question: what's your common way to hide "compressed" chunks when they're also decompressed?

Background: I added a d.FieldRawLen("compressed",…) to be consistent with other formats. However, when examining files, this can be distracting to have that compressed property show up as well. What's your default way to allow to hide the compressed parts if we're already showing the decompressed parts?

format/format.go Outdated
@@ -125,6 +125,7 @@ var (
JPEG = &decode.Group{Name: "jpeg"}
JSON = &decode.Group{Name: "json"}
JSONL = &decode.Group{Name: "jsonl"}
LDB = &decode.Group{Name: "leveldb_ldb"}
wader (Owner):

Change to LevelDB_LDB.

Do you know if there are other leveldb related formats that might be interesting in the future? thinking if this is leveldb_table (https://github.com/google/leveldb/blob/main/doc/table_format.md?) and then maybe there will be leveldb_log etc? but it's not a big deal to rename things later on anyway i think, also the format can be probed

mikez (Contributor Author):

@wader 👌, good idea. Besides the suggested modification, I now also added leveldb_log and leveldb_descriptor; thereby keeping the names as in the LevelDB leveldbutils dump command.

footerEncodedLength = (4*10 + 8) * 8
magicNumberLength = 8 * 8
// leading 64 bits of
// echo http://code.google.com/p/leveldb/ | sha1sum
wader (Owner):

Interesting trivia :)

0x1: "value",
}

type BlockHandle struct {
wader (Owner):

Maybe blockHandle as it's internal?

wader (Owner):

Maybe same for the members also?

mikez (Contributor Author), Dec 6, 2023:

Ah, privacy/exports in Go is solved by title-casing! Interesting language design decision.

wader (Owner):

yeap, let's say you learn to be ok with it :) not sure how well it agrees with golang's principle of clarity and explicitness 🤔
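As background, the blockHandle being discussed is just a pair of base-128 varints, an offset and a size, pointing at a block within the table file. A hypothetical stdlib-only sketch (decodeBlockHandle is an illustrative helper, not fq's decode API):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// blockHandle points at a block inside the table file; both fields are
// varint-encoded in the footer and in index entries.
type blockHandle struct {
	offset, size uint64
}

// decodeBlockHandle reads the two varints and reports how many bytes of
// the input were consumed.
func decodeBlockHandle(buf []byte) (blockHandle, int) {
	off, n1 := binary.Uvarint(buf)
	sz, n2 := binary.Uvarint(buf[n1:])
	return blockHandle{offset: off, size: sz}, n1 + n2
}

func main() {
	// offset=1000 encodes as 0xe8 0x07, size=50 fits in one byte.
	h, n := decodeBlockHandle([]byte{0xe8, 0x07, 0x32})
	fmt.Println(h.offset, h.size, n)
}
```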

}
d.Copy(bb, bytes.NewReader(decompressed))
default:
d.Fatalf("Unsupported compression type: %x", compressionType)
wader (Owner):

If it's possible to continue decoding even if there is an error you can use d.Errorf and then using -f will ignore the error

mikez (Contributor Author):

👍 Replaced all d.Fatalf with d.Errorf where it made sense.

wader (Owner):

I lied btw, it's -o force=true not -f... but maybe there should be a shorthand for it :)

default:
d.Fatalf("Unsupported compression type: %x", compressionType)
}
d.FieldStructRootBitBufFn("uncompressed", bitio.NewBitReader(bb.Bytes(), -1), func(d *decode.D) {
wader (Owner):

I really should come up with a nicer API for the nested decoding stuff, feels a bit like a hack atm

result |= (b & 0x7f) << shift
if b&0x80 == 0 {
result |= (b & 0b01111111) << shift
if b&0b10000000 == 0 {
wader (Owner):

Nice clarifications here 👍

Groups: []*decode.Group{format.Probe},
DecodeFn: ldbDecode,
})
interp.RegisterFS(leveldbFS)
wader (Owner):

Some formats have a torepr implementation, which is a function to convert the decode tree into a more user-friendly representation; not sure if that is interesting for leveldb? For example bson has https://github.com/wader/fq/blob/master/format/bson/bson.jq. Also don't forget to register the format-overloaded function https://github.com/wader/fq/blob/master/format/bson/bson.go#L24 (it's part of a hack to make it possible to have kind of "polymorphic" functions)

mikez (Contributor Author):

Thanks for the pointers. It's not clear to me in which scenarios you'd use torepr.

wader (Owner):

It's probably only useful to have if the format itself is used to encode some structure, if we take bson example again:

# this shows all the nitty gritty details of bson encoding
$ fq -d bson -o line_bytes=8 d format/bson/testdata/test.bson
     │00 01 02 03 04 05 06 07│01234567│.{}: format/bson/testdata/test.bson (bson)
0x000│41 01 00 00            │A...    │  size: 321
     │                       │        │  elements[0:17]:
     │                       │        │    [0]{}: element
0x000│            01         │    .   │      type: "double" (1) (64-bit binary floating point)
0x000│               64 6f 75│     dou│      name: "dou"
0x008│00                     │.       │
0x008│   29 5c 8f c2 f5 b0 58│ )\....X│      value: 98.765
0x010│40                     │@       │
     │                       │        │    [1]{}: element
0x010│   02                  │ .      │      type: "string" (2) (UTF-8 string)
0x010│      73 74 72 00      │  str.  │      name: "str"
0x010│                  0a 00│      ..│      length: 10
0x018│00 00                  │..      │
0x018│      6d 79 20 73 74 72│  my str│      value: "my string"
0x020│69 6e 67 00            │ing.    │
...

# this shows the value the bson structure "represents"
$ fq -d bson torepr format/bson/testdata/test.bson
{
...
  "dou": 98.765,
...
  "str": "my string",
...
}

So it will only make sense if whatever is encoded can be translated into a jq value, something an "end user" would kind of expect

mikez (Contributor Author):

@wader Ah, I think I get it! This was a helpful example for me. Thank you. :)
Yes, this applies here, especially for descriptors. However, this reminds me of another question I have. I will ask below.

### Limitations

- no Meta Blocks (like "filter") are decoded yet.
- Zstandard uncompression is not implemented yet.
wader (Owner):

This more or less just depends on some zstd package? I've looked at https://github.com/klauspost/compress a couple of times; it has zstd and lots of other formats, and I also suspect the APIs are a bit more low-level than golang stdlib so might fit fq better. It can maybe replace github.com/golang/snappy also? Impressively, it seems it also has zero dependencies on other packages

mikez (Contributor Author), Dec 6, 2023:

I don't know how common zstd is in the wild. I've never come across it so far in the LevelDB samples I've seen.

wader (Owner):

Ok, then i think we can skip it for now, probably good to not make this PR grow too much also

| | | key_delta{}: 0x611-0x61a (9)
0x610| 73 | s | user_key: "s" 0x611-0x612 (1)
0x610| 01 | . | type: "value" (0x1) 0x612-0x613 (1)
0x610| ff ff ff ff ff ff ff | ....... | sequence_number: 72057594037927935 0x613-0x61a (7)
wader (Owner):

Format as hex? but maybe decimal make sense if it's a sequence

mikez (Contributor Author):

Typically sequence makes sense in decimal. They are sequential numbers. The only exception are these index and metaindex sections; especially the example you highlighted, where the maximum possible unsigned integer 0xffffffffffffff is chosen.
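For readers following the dump above: an internal key is the user key followed by an 8-byte little-endian trailer packing (sequence << 8) | type, which is why the single type byte appears before the seven sequence bytes. A small illustrative sketch (splitInternalKey is not fq's API) decoding the exact bytes from the highlighted example:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// splitInternalKey undoes LevelDB's internal key layout:
// user_key followed by 8 little-endian bytes packing (sequence << 8) | type.
func splitInternalKey(ikey []byte) (userKey []byte, seq uint64, typ byte) {
	n := len(ikey) - 8
	packed := binary.LittleEndian.Uint64(ikey[n:])
	return ikey[:n], packed >> 8, byte(packed & 0xff)
}

func main() {
	// Bytes from the dump above: user_key "s", type 0x01 ("value"),
	// then seven 0xff bytes for the maximum 56-bit sequence number.
	ikey := []byte{0x73, 0x01, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff}
	uk, seq, typ := splitInternalKey(ikey)
	fmt.Printf("%q %d %d\n", uk, seq, typ)
}
```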

wader (Owner):

Ok! Keep as decimal. Btw if 0xffffffffffffff has some special meaning you can add a description. For example the mp4 decoder does something like this for box size:

const (
	boxSizeRestOfFile   = 0
	boxSizeUse64bitSize = 1
)

d.FieldU32("size", scalar.UintMapDescription{
	boxSizeRestOfFile:   "Rest of file",
	boxSizeUse64bitSize: "Use 64 bit size",
})

In your case you could probably just have d.FieldU64("...", scalar.UintMapDescription{0xffffffffffffff: "blabla"})

mikez (Contributor Author):

@wader It's unclear to me if it has a special meaning. I only noticed so far that in the index and metaindex there are sometimes sequence numbers like that; it feels more like a hack though. The sequence numbers are typically used in the data section and there they have a well-defined meaning.

wader (Owner):

Ok! then sounds like we should leave as it is

@wader (Owner) commented Dec 5, 2023

@wader Also a question: what's your common way to hide "compressed" chunks when they're also decompressed?

Background: I added a d.FieldRawLen("compressed",…) to be consistent with other formats. However, when examining files, this can be distracting to have that compressed property show up as well. What's your default way to allow to hide the compressed parts if we're already showing the decompressed parts?

I usually lean towards not hide things as that is kind of what fq is all about. In this case if you did not add "compressed" fields would those bit ranges show as gaps fields instead?

@wader (Owner) commented Dec 5, 2023

The failing test is about TestFormats/all/all.fqtest, probably just a matter of WRITE_ACTUAL=1 go test -run TestFormats/all/all.fqtest ./format. I've thought about redoing those tests... would be nice if adding a format would affect as little as possible outside the formats own directory

@wader changed the title from "ldb: Add LevelDB support" to "leveldb: Add LevelDB support" on Dec 6, 2023
@mikez (Contributor Author) commented Dec 6, 2023

@wader Also a question: what's your common way to hide "compressed" chunks when they're also decompressed?
Background: I added a d.FieldRawLen("compressed",…) to be consistent with other formats. However, when examining files, this can be distracting to have that compressed property show up as well. What's your default way to allow to hide the compressed parts if we're already showing the decompressed parts?

I usually lean towards not hide things as that is kind of what fq is all about. In this case if you did not add "compressed" fields would those bit ranges show as gaps fields instead?

I'd like you to hear me differently.

Say there is a chunk of data that is compressed with Snappy. I uncompress this data and show the uncompressed data-structure under the key "uncompressed". However, in line with other formats, I also include a key called "compressed" which has the original raw Snappy-compressed bits. When there are a lot of compressed sections like these (which have all been uncompressed and have a corresponding "uncompressed" key) it can get quite unwieldy to skim the output. Therefore the inquiry into if it makes sense to hide these "compressed" sections (or more forcibly truncating them in the preview) if there are already corresponding "uncompressed" sections.

@wader (Owner) commented Dec 6, 2023

I'd like you to hear me differently.

Say there is a chunk of data that is compressed with Snappy. I uncompress this data and show the uncompressed data-structure under the key "uncompressed". However, in line with other formats, I also include a key called "compressed" which has the original raw Snappy-compressed bits. When there are a lot of compressed sections like these (which have all been uncompressed and have a corresponding "uncompressed" key) it can get quite unwieldy to skim the output. Therefore the inquiry into if it makes sense to hide these "compressed" sections (or more forcibly truncating them in the preview) if there are already corresponding "uncompressed" sections.

I see. As it works currently if you didn't add "compressed" fields they would instead show up as "gap" fields that fq automatically inserts, this is so that all bits will always be "reachable" somehow. Maybe what you looking for is the ability to add a "compressed" but somehow tell fq that it should be displayed in a more discreet way unless is verbose mode etc? I'm thinking if totally hiding a field might be confusing as it will look as there is a gap in data/hex column? or maybe i misunderstand what your aiming for?

@mikez (Contributor Author) commented Dec 6, 2023

@wader Question about the LevelDB log format and how to decompress it.

In essence the data structure is as follows:

  1. We have a sequence of records, each with some data. The user can iterate over these using the LevelDB API.
  2. In the .log-file itself, these records are split into many small pieces and put into blocks of 32KB. Each little piece has a marker if it's an entire record ("full") or only a fragment ("first", "middle", "last").
  3. In the LevelDB app itself, these small pieces are put back together and the data structure is parsed.

I hope that's clear so far.

Thus my question is: for the pieces which are preserved in full, it's easy to show their underlying data structure, since they haven't been split. However, for the "first", "middle", and "last" pieces, I wouldn't know how to visualize it? Is there some precedent here in the other formats?
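For context on the layout in question: each physical record in a 32 KB log block starts with a 7-byte header (4-byte checksum, 2-byte little-endian payload length, 1-byte type: 1 = full, 2 = first, 3 = middle, 4 = last), and a logical record is reassembled by concatenating payloads until a full or last fragment arrives. A simplified sketch (checksums ignored, helper names hypothetical):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Record types from LevelDB's log format.
const (
	typeFull   = 1
	typeFirst  = 2
	typeMiddle = 3
	typeLast   = 4
)

// readRecord parses one physical record: 4-byte checksum (skipped here),
// 2-byte little-endian payload length, 1-byte type, then the payload.
func readRecord(buf []byte) (typ byte, payload []byte, n int) {
	length := int(binary.LittleEndian.Uint16(buf[4:6]))
	typ = buf[6]
	return typ, buf[7 : 7+length], 7 + length
}

func main() {
	// Two fragments of one logical record (checksums zeroed for the sketch).
	block := []byte{
		0, 0, 0, 0, 3, 0, typeFirst, 'f', 'o', 'o',
		0, 0, 0, 0, 3, 0, typeLast, 'b', 'a', 'r',
	}
	var logical []byte
	for pos := 0; pos < len(block); {
		typ, payload, n := readRecord(block[pos:])
		logical = append(logical, payload...)
		pos += n
		if typ == typeFull || typ == typeLast {
			fmt.Printf("%s\n", logical) // reassembled logical record
			logical = nil
		}
	}
}
```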

format/format.go Outdated
@@ -125,6 +125,9 @@ var (
JPEG = &decode.Group{Name: "jpeg"}
JSON = &decode.Group{Name: "json"}
JSONL = &decode.Group{Name: "jsonl"}
LevelDB_Descriptor = &decode.Group{Name: "leveldb_descriptor"}
LDB = &decode.Group{Name: "leveldb_table"}
LOG = &decode.Group{Name: "leveldb_log"}
wader (Owner):

Prefix with LevelDB_

for {
if d.End() {
break
}
wader (Owner):

Could this be for !d.End() {?

format.LOG,
&decode.Format{
Description: "LevelDB Log",
Groups: []*decode.Group{format.Probe},
wader (Owner):

There are some things to be careful about when a format is in the probe group. For example make sure to "fail fast" if some magic etc is not found and to make sure that empty input does not succeed, ex if you have for !d.End() { at the "root" maybe it's good to count number of things successfully decoded and then after the loop make sure at least N was decoded.

mikez (Contributor Author):

I was not quite aware what that format.Probe was doing (I copy pasted that part); now I am. Since the log files don't have magic strings of any kind in them, I think it's better to exclude them for now + from descriptors which is based on similar logic.

wader (Owner):

👍 yeap there is no special probe code in fq, just that some decoders are tried if none is specified

wader (Owner):

extra trivia: all formats and group exist as jq functions so you can do this ex: ... | probe

@mikez (Contributor Author) commented Dec 6, 2023

@wader Something about those test failures I don't quite understand; it seems some aspect of LevelDB is leaking into the other tests involving FieldFormatBitBuf.

@wader (Owner) commented Dec 6, 2023

@wader Question about the LevelDB log format and how to decompress it.

In essence the data structure is as follows:

  1. We have a sequence of records, each with some data. The user can iterate over these using the LevelDB API.
  2. In the .log-file itself, these records are split into many small pieces and put into blocks of 32KB. Each little piece has a marker if it's an entire record ("full") or only a fragment ("first", "middle", "last").
  3. In the LevelDB app itself, these small pieces are put back together and the data structure is parsed.

I hope that's clear so far.

Thus my question is: for the pieces which are preserved in full, it's easy to show their underlying data structure, since they haven't been split. However, for the "first", "middle", and "last" pieces, I wouldn't know how to visualize it? Is there some precedent here in the other formats?

Think I would try to be "true" to how the format works, so show the parts as some array of structs with a raw data field for the partial data etc. This is how most formats in fq that have to "demux" work, ex ogg, gzip and the tcp reassembly. This also makes it quite nice when dealing with broken files that might have some partial data that is good: then you can use fq to concat parts, maybe prefix/append some missing data and decode or output the "repaired" data.

@wader (Owner) commented Dec 6, 2023

@wader Something about those test failures I don't quite understand; it seems some aspect of LevelDB is leaking into the other tests involving FieldFormatBitBuf.

I think it might be the thing I mentioned about probe, that the leveldb_descriptor format succeeds when it shouldn't. So I guess this will get fixed when you remove it from probe.

@mikez (Contributor Author) commented Dec 6, 2023

I'd like you to hear me differently.
Say there is a chunk of data that is compressed with Snappy. I uncompress this data and show the uncompressed data-structure under the key "uncompressed". However, in line with other formats, I also include a key called "compressed" which has the original raw Snappy-compressed bits. When there are a lot of compressed sections like these (which have all been uncompressed and have a corresponding "uncompressed" key) it can get quite unwieldy to skim the output. Therefore the inquiry into if it makes sense to hide these "compressed" sections (or more forcibly truncating them in the preview) if there are already corresponding "uncompressed" sections.

I see. As it works currently if you didn't add "compressed" fields they would instead show up as "gap" fields that fq automatically inserts, this is so that all bits will always be "reachable" somehow. Maybe what you looking for is the ability to add a "compressed" but somehow tell fq that it should be displayed in a more discreet way unless is verbose mode etc? I'm thinking if totally hiding a field might be confusing as it will look as there is a gap in data/hex column? or maybe i misunderstand what your aiming for?

Yes, I hear you. I think I found a solution... to use d instead of dd. That nicely truncates the compressed parts into one line. Some of the regular strings might be truncated as well like that, but I could look them up manually as needed.

handleLength := d.LimitedFn(footerEncodedLength, func(d *decode.D) {
// check for magic number and fail fast if it isn't there
d.SeekAbs(d.Len() - magicNumberLength)
d.FieldU64("magic_number", d.UintAssert(tableMagicNumber), scalar.UintHex)
wader (Owner):

If you want you can probably keep it like it was, as long as it does not decode megabytes of data before failing it should probably be ok.

mikez (Contributor Author):

I like the new way better. It seems to me the code got clearer; also with the d.BitsLeft() at the end.

@wader (Owner) commented Dec 6, 2023

Yes, I hear you. I think I found a solution... to use d instead of dd. That nicely truncates the compressed parts into one line. Some of the regular strings might be truncated as well like that, but I could look them up manually as needed.

Aha that explains things :) i wonder if we could have separate options for raw and string truncate limit? by look up manually you mean use a query to access a string field?

@mikez (Contributor Author) commented Dec 6, 2023

Yes, I hear you. I think I found a solution... to use d instead of dd. That nicely truncates the compressed parts into one line. Some of the regular strings might be truncated as well like that, but I could look them up manually as needed.

Aha that explains things :) i wonder if we could have separate options for raw and string truncate limit? by look up manually you mean use a query to access a string field?

use a query, exactly! In my imaginary world, the entire thing would be an App like Hex Fiend or some debugger, and I could just hover over it to see the full thing as a tooltip.

@wader (Owner) commented Dec 7, 2023

use a query, exactly! In my imaginary world, the entire thing would be an App like Hex Fiend or some debugger, and I could just hover over it to see the full thing as a tooltip.

Yeap that would very interesting. I can imagine some kind of "IDE" with multiple visual tree:s, data views and REPL:s.

README.md Outdated
@@ -102,6 +102,7 @@ ipv6_packet,
jpeg,
json,
jsonl,
[leveldb_ldb](doc/formats.md#leveldb_ldb),
wader (Owner):

Do a make doc to update

mikez (Contributor Author):

Oddly, the make doc command changes a lot of content in the SVGs. I left them out of the commit, as it seems they shouldn't change.

wader (Owner):

Aha yes skip those, i will fix. I did some changes to https://github.com/wader/ansisvg some days ago

@@ -0,0 +1,57 @@
$ fq -d leveldb_descriptor dv uncompressed.ldb/MANIFEST-000004
|00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f|0123456789abcdef|.{}: uncompressed.ldb/MANIFEST-000004 (leveldb_descriptor) 0x0-0x57 (87)
wader (Owner):

What is the relation between manifest and descriptor? name for same thing?

mikez (Contributor Author), Dec 7, 2023:

It appears so. The term MANIFEST only seems to be used for the filename. In the codebase, they use the terms "descriptor" and "VersionEdit"; I find their inconsistency confusing here.

@wader (Owner) commented Dec 7, 2023

d.FieldValue

I tried to search for "synthetic field" in the code, but couldn't find any examples, except for the flag in scalar.go. How would I implement this?

Hmm strange, but it was quite recently that i made synthetic something more special, before that they were a hack using zero-length bit ranges, so that might explain why "synthetic" is not used much in the code yet.

For example mp4 decoder adds track id and format here https://github.com/wader/fq/blob/master/format/mp4/mp4.go#L280-L291

@wader
Owner

wader commented Dec 7, 2023

Stress tested a bit more using:

find ~/ -name "*.ldb" -print0 | xargs -0 go run . -d leveldb_table '._error | select(.) | input_filename, .error'

Seems chrome has a bunch of leveldb files, some fail with variations of "UTF8(user_key): failed at position 127 (read size 0 seek pos 0): tryText nBytes must be >= 0 (-1)

@mikez
Contributor Author

mikez commented Dec 7, 2023

Stress tested a bit more using:

find ~/ -name "*.ldb" -print0 | xargs -0 go run . -d leveldb_table '._error | select(.) | input_filename, .error'

Seems chrome has a bunch of leveldb files, some fail with variations of "UTF8(user_key): failed at position 127 (read size 0 seek pos 0): tryText nBytes must be >= 0 (-1)

I appreciate the stress-testing and the command. I'll take a look tomorrow!

@wader
Owner

wader commented Dec 7, 2023

I appreciate the stress-testing and the command. I'll take a look tomorrow!

👍 no stress! i think the PR is in very good shape now, some more testing and fixing of decode issues and we should be ready to merge. anything else you want to add?

In the LevelDB encoding, the internal key can be cut at any byte:
including the user_key, type, or sequence_number. The resulting prefix
is shared among subsequent keys and not specified explicitly by them.

This fixes a previous mistaken belief that cuts can't happen in the
last 8 bytes of the type & sequence number.

Tests are added.
@mikez
Contributor Author

mikez commented Dec 8, 2023

@wader That one took me a while to track down and fix. :)

@mikez
Contributor Author

mikez commented Dec 8, 2023

d.FieldValue

I tried to search for "synthetic field" in the code, but couldn't find any examples, except for the flag in scalar.go. How would I implement this?

Hmm strange, but it was quite recently that i made synthetic something more special, before that they were a hack using zero-length bit ranges, so that might explain why "synthetic" is not used much in the code yet.

For example mp4 decoder adds track id and format here https://github.com/wader/fq/blob/master/format/mp4/mp4.go#L280-L291

Thank you. I searched for "synthetic"; had I searched for "FieldValue", then I would have found it faster. I wrongly assumed FieldValue was just a regular reader. :)

err := readInternalKey(nil, int(length), d)
if err != nil {
// TK(2023-12-08): how do I propagate this
// error `err` into the `d` object?
Owner

There is d.IOPanic that i think is similar? maybe there should be a similar one for generic errors, or maybe d.Errorf should support %w error arguments somehow? have to read up on how that works. For now i think it can be fine to do d.Errorf("blabla: %s", err) and format out the error as a string maybe?

Contributor Author

err.Error() seems to do the trick.

0x1d0| bd 03 | .. | value_length: 445 0x1d6-0x1d8 (2)
| | | key{}: 0x1d8-0x1e5 (13)
0x1d0| 69 70 73 75 6d | ipsum | user_key_suffix: raw bits 0x1d8-0x1dd (5)
| | | user_key: "lorem.ipsum" (inferred)
Owner

@wader wader Dec 8, 2023

Nice :) i guess this will make it quite a bit easier to write queries?

Contributor Author

@wader This decoding has given me a new insight into everything LevelDB stores; not just the key and the value, but apparently also the history of key-value pairs.

Owner

🥳 yeap i've had similar experiences writing decoders, you bump into lots of fascinating tricks and legacy, ex i remember zip has so much legacy to handle floppy disks and appending, and also the date format uses msdos-timestamps (2 second precision!)

0x1d0| 0d | . | unshared_bytes: 13 0x1d5-0x1d6 (1)
0x1d0| bd 03 | .. | value_length: 445 0x1d6-0x1d8 (2)
| | | key{}: 0x1d8-0x1e5 (13)
0x1d0| 69 70 73 75 6d | ipsum | user_key_suffix: raw bits 0x1d8-0x1dd (5)
Owner

These are not strings?

Contributor Author

@wader They are; I couldn't figure out if you had a scalar method I could plug into d.FieldRawLen... since I need the bytes as well:

br := d.FieldRawLen("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize)*8)

Owner

Aha ok, the suffix bytes are not always valid utf8? anyways i think the current way is fine! but i do wonder if there should be a Raw variant that returns bytes? (maybe there is?) the reason Raw returns a bitio.BitReader is that they can be used for very large bit ranges so that the bits will only be read if really needed, to save on IO and memory.

@wader
Owner

wader commented Dec 8, 2023

Thank you. I searched for "synthetic"; had I searched for "FieldValue", then I would have found it faster. I wrongly assumed FieldValue was just a regular reader. :)

Aha sorry should have been clearer :)

BTW now all *.ldb files in my home directory decode fine and look beautiful!

// case 2: type and sequence_number fit fully in unshared: simulate user_key value.
if unsharedSize >= typeAndSequenceNumberSize {
br := d.FieldRawLen("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize)*8)
d.FieldValueStr("user_key", string(append(sharedBytes, d.ReadAllBits(br)...)), strInferred)
Owner

Could it be something like?

suffix := d.FieldUTF8("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize))
d.FieldValueStr("user_key", string(sharedBytes)+suffix, strInferred)

An alternative to using d.FieldValueStr could be to only have one user_key field and let the symbolic value for that one string be the inferred string. But i think i like how it is now which i guess is more true to how things actually work, that it's a suffix, and then just add a bit extra to make it convenient.

btw append(sharedBytes, d.ReadAllBits(br)...) can sometimes in Go be a bit tricky, if i remember correctly the sharedBytes slice might end up being reused if it has capacity, but in this case it should be fine i think as it gets turned into a new string? (which is immutable and does not share the slice bytes)

Contributor Author

@mikez mikez Dec 9, 2023

@wader I like your solution.

[...] the sharedBytes slice might end up being reused [...]

the append() of Go I don't quite understand yet. Apparently it is an amortized-O(1) operation, yet it seems sharedBytes doesn't get mutated? (So it's not doing .append(....) like in Python or .push like in JavaScript.)

Contributor Author

@wader Although, on second thought, if the user_key is cut at some UTF-8 multi-byte character, this might fail. However, maybe it is good enough.

Contributor Author

In testing, I see it does indeed fail:

image

Contributor Author

@mikez mikez Dec 9, 2023

In LevelDB, both keys and values are bytestrings; a UTF-8 encoding would typically be used, but need not be.

Contributor Author

@mikez mikez Dec 9, 2023

@wader I made some own functions for now to handle this. See recent commit.

}
err := keyCallbackFn(keyPrefix, int(unshared), d)
if err != nil {
d.Errorf(err.Error())
Owner

Maybe these should be d.Errorf("%s", err) to be sure some random %s etc in the error string isn't interpreted as a format?

Contributor Author

Thank you for the pointer. I just saw earlier that I'd used some d.Errorf("%v", err) code I had copy-pasted from avro_ocf.go without remembering.

d.FieldStruct("data", func(d *decode.D) {
err := readInternalKey(nil, int(length), d)
if err != nil {
d.Errorf(err.Error())
Owner

See other comment about d.Errorf(err.Error())

@wader
Owner

wader commented Dec 9, 2023

Great work! just some tiny things left i think, after that just let me know if you think you're ready to merge

Hope it was a good review experience too 😄

@mikez
Contributor Author

mikez commented Dec 9, 2023

Great work! just some tiny things left i think, after that just let me know if you think you're ready to merge

Hope it was a good review experience too 😄

Thank you, @wader! I very much appreciate your reviewing style. I learned a lot about fq, Go, and LevelDB in the process. I hope it was a good review experience for you as well.

decode unfragmented .log files:

 - break leveldb_log.go into leveldb_log_blocks.go and leveldb_log.go;
   the former is used by both .MANIFEST (descriptor) and .LOG.
 - in leveldb_log, introduce readBatch that decodes further

 fix UTF8 decoding:

 - introduce fieldUTF8ReturnBytes and stringify to handle multi-byte
   UTF8-encodings correctly.
@@ -441,3 +446,21 @@ func mask(crc uint32) uint32 {
// Rotate right by 15 bits and add a constant.
return ((crc >> 15) | (crc << 17)) + kMaskDelta
}

// Concatenate byte slices and convert into a string.
func stringify(byteSlices ...[]byte) string {
Owner

If you want to save on allocations you could use a strings.Builder or possibly sum all lengths, allocate once with make() and then append. But i have a feeling it won't make much difference in this case?

Contributor Author

image

strings.Builder seems to do the same thing. I suppose the resizing would be done in an amortized-constant way (similar to hash tables)? I can still preallocate though. :)

Contributor Author

image

Ok, this explains more what's going on. It seems sometimes it's mutated (and then I assume that's amortized constant)... and sometimes it's created from scratch (which is costly). It is not clear to me which one happens when though.

Owner

Huh i assumed strings.Builder did more smart things by growing smartly but maybe the only smartness is in the String() method? but i do wonder if append might have some smartness about how to grow so it won't end up reallocating/copying all the time?

Contributor Author

@mikez mikez Dec 9, 2023

@wader I'm trying to track down how append works precisely; this is closest I've found: https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssagen/ssa.go#L3503
That code has too many concepts I don't understand; don't have more clarity now. :)

Owner

@wader wader Dec 9, 2023

Owner

https://godbolt.org/z/rEPerb3b4 maybe interesting. i would have assumed that append would just end up being a call to something but it seems there is lots of trickiness going on, maybe to speed up common cases etc?

Contributor Author

@wader Oh, checking the compiler explorer was very insightful!! runtime.growslice(SB) clearly shows the connection to slice.go now. I don't quite see the big picture yet, but now I know where to start if I wanted to dive deeper. :)

return string(result)
}

func fieldUTF8ReturnBytes(name string, nBytes int, d *decode.D) []byte {
Owner

I've experimented and thought a bit about doing a decode api v2 that would use method chaining, then things could probably be nicer, something like:

d.FieldUTF8("bla", nBytes).Bytes()
d.FieldUTF8("bla", nBytes).BitReader()
d.FieldUTF8("bla", nBytes).Map(...).Bytes()

Think that would make it a bit more flexible, would probably also cut down a bit on amount of generated code. Also! it can probably help make the api more type safe as ex d.FieldUTF8 would return some type forcing things to use string. Should think more about this :)

br := d.FieldRawLen("user_key_suffix", int64(unsharedSize-typeAndSequenceNumberSize)*8)
d.FieldValueStr("user_key", string(append(sharedBytes, d.ReadAllBits(br)...)), strInferred)
suffix := fieldUTF8ReturnBytes("user_key_suffix", unsharedSize-typeAndSequenceNumberSize, d)
d.FieldValueStr("user_key", stringify(sharedBytes, suffix), strInferred)
Owner

So if i understand correctly: shared and suffix are seen as bytes so they might not be valid utf8, so reading them as utf8 strings won't preserve the raw bytes if illegal etc? normal golang strings i think can store raw bytes but fq's d.UTF8 uses unicode.UTF8BOM which i suspect might replace illegal bytes with error or replacement runes? anyways if this seems to work keep it like this!

Contributor Author

Ah, interesting point with UTF8BOM (0xEF, 0xBB, 0xBF)!

I hadn't come across this before nor thought about it.
It seems the Python encoder natively skips it (unless specified explicitly), also some of the LevelDB dumps I've observed so far don't use the UTF8BOM.
That said, since LevelDB takes bytestrings as keys and values, that wouldn't stop anyone from using UTF8BOM if they wanted to.

So maybe the utf-8 encoder will be good enough in most cases?

Owner

Safest i think is probably to keep it as is, treat the parts as raw bytes. It's a bit weird that UTF8 with a BOM is even a thing as UTF8 does not really have a byte order :) but it seems to happen 🤷 the encoding used by d.UTF8 is chosen to be quite liberal with weird stuff :)

BTW if you want to have full control over text encoding then there is d.FieldStr("...", <encoding>) where you can plugin any encoding.Encoding.

@wader
Owner

wader commented Dec 9, 2023

Great work! just some tiny things left i think, after that just let me know if you think you're ready to merge
Hope it was a good review experience too 😄

Thank you, @wader! I very much appreciate your reviewing style. I learned a lot about fq, Go, and LevelDB in the process. I hope it was a good review experience for you as well.

Happy to hear that! i can sometimes end up thinking that i might come across as "naggy" when there is a lot of back and forth, i'm not! superhappy someone wants to help out :) also a bit tricky to review things about the format itself as i have very little experience with it, trying to mostly focus on helping it fit well into the rest of fq :)

"lorem.lorem": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
"lorem.ipsum": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
"lorem.dolor": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
"row": "Row, row, row your boat\nGently down the stream.\nMerrily, merrily, merrily, merrily,\nLife is but a dream. 🚣‍♂️"
Owner

Now i noticed the 🚣‍♂️ :)

@wader
Owner

wader commented Dec 9, 2023

Played around a bit more with chrome leveldb:s, seems to work fine! is a bit confusing to look at web storage databases, lots of weird stuff stored in them :)

Think i'm kind of ready to merge if you are

@mikez
Contributor Author

mikez commented Dec 9, 2023

@wader I'm ready! :)

@wader wader merged commit b05c7ec into wader:master Dec 9, 2023
5 checks passed
@wader
Owner

wader commented Dec 9, 2023

🥳
