Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipfs dag diff #4801

Open
Stebalien opened this issue Mar 10, 2018 · 16 comments · May be fixed by #8711
Open

ipfs dag diff #4801

Stebalien opened this issue Mar 10, 2018 · 16 comments · May be fixed by #8711
Assignees
Labels
effort/days Estimated to take multiple days, but less than a week exp/expert Having worked on the specific codebase is important kind/feature A new feature P2 Medium: Good to have, but can wait until someone steps up
Milestone

Comments

@Stebalien
Copy link
Member

Stebalien commented Mar 10, 2018

Use-case: replace ipfs object diff with a generalized version that works for generalized IPLD DAGs.

Text syntax:

  • Use + for added items.
  • Use - for removed items.
  • Use ~ for replaced items.
  • Syntax: OP PATH VALUE

JSON syntax: Use JSON patch.
Generalized syntax: Basically, JSON patch with extra CBOR types.

Blocker: Pathing through DagPB nodes in go-ipfs isn't correct per the IPLD spec (paths don't reflect the object structure). See ipld/specs#55.

@lidel
Copy link
Member

lidel commented May 11, 2021

Pinning services would really like to have this ability to compare DAGs behind two CIDs and get a list of block CIDs for additions and removals, as that would enable them to optimize the replace pin operation in their internal infra.

Sounds like something we may want to look into after ipld-prime refactor is done.
@warpfork pathing through dag-pb may be solved by representing it as ADL, right? (I remember we touched upon that during HAMT discussions)

@lidel
Copy link
Member

lidel commented Jan 21, 2022

I know dag-pb is not compatible with ADLs yet (ipld/go-ipld-prime#258), and it may not even be a proper abstraction (since we want low-level dag inspection.

Perhaps we could special-case/shim it for now, and make ipfs dag diff work for our users as a better diff than the deprecated one.

@schomatis
Copy link
Contributor

Working on some special case variant that will at least cover the basic object diff use case.

@lidel
Copy link
Member

lidel commented Jan 28, 2022

Some additional thoughts on ipfs dag diff after looking at dag-pb-specific behaviors in ipfs/go-merkledag#82

On operating at the right abstraction level

The output of ipfs object diff was focused on and limited to files and directories (unixfs)

The ipfs dag diff (#4801) will be a new command where in the future we want to add support for dag-json, dag-cbor, or any other IPLD codec that follows IPLD Data Model.

My take is that ipfs dag diff operates at a fairly low level. When diffing unixfs dags the output has to be very explicit and unambiguous – people should compare two DAGs and see key differences at DAG link level, without being shadowed by higher abstractions, like HAMT-sharding (note: seeing internal HAMT structure is a feature, not a bug here).

This means the dag diff command should effectively normalize both inputs to IPLD Data Model and the diff should be the delta between the two.

On output format

I believe --enc=json should be the default, no need for inventing text representation for such low-level command, people are very good at consuming JSON with jq.

The returned diff would be in JSON Patch format, as specified in RFC 6902 from the IETF.

Rationale: IPLD Data Model is closer to JSON than anything else, and we should reuse existing RFC standards whenever possible.

On diffing DagPB nodes using Logical Format

Blocker: Pathing through DagPB nodes in go-ipfs isn't correct per the IPLD spec (paths don't reflect the object structure).

We now have IPLD Schema that describes Logical Format of dag-pb :

type PBNode struct {
  Links [PBLink]
  Data optional Bytes
}

type PBLink struct {
  Hash Link
  Name optional String
  Tsize optional Int
}

I think a safe approach here is to special-case dag-pb and in the ipfs dag diff output try to differentiate between Data and Links using this format, effectively acting as if dag-pb node is a dag-cbor with Data and Links` fields.

This is already what we do in ipfs dag get:

A dag-pb block passed to ipfs dag get --output-codec=dag-json Qmfoobar | jq we get a valid JSON:

{
  "Data": {
    "/": {
      "bytes": "CAIYBw"
    }
  },
  "Links": [
    {
      "Hash": {
        "/": "bafyfoo"
      },
      "Name": "foo",
      "Tsize": 42
    }
  ]
}

Unixfs is also raw blocks, and those return:

{
    "/": {
      "bytes": "CAIYBw"
    }
}

With this convention, comparing JSON representations of DAG-PB Logical Format, we have JSON Patch output like this:

{ "op": "replace", "path": "Data", "value":  { "/": { "bytes": "CAIYBw" } } }

If diff is in file name, size, or cid of the first linked node:

{ "op": "replace", "path": "Links/0/Name", "value": "NewName" }
{ "op": "replace", "path": "Links/0/Tsize", "value": "42" }
{ "op": "replace", "path": "Links/0/Hash", "value": { "/": "bafyfoo" } }

etc.

By using Logical Format for dag-pb the dag diff will also catch the case where links are the same, but in different order (which also produces different root CID).

cc @Stebalien @warpfork @achingbrain @rvagg for sanity check around this approach for unixfs

@lidel lidel changed the title Ipld DAG diff ipfs dag diff Jan 28, 2022
@lidel lidel added exp/expert Having worked on the specific codebase is important P2 Medium: Good to have, but can wait until someone steps up effort/days Estimated to take multiple days, but less than a week and removed status/blocked Unable to be worked further until needs are met labels Jan 28, 2022
@lidel lidel added this to In Progress in Maintenance Priorities - Go Jan 28, 2022
@rvagg
Copy link
Member

rvagg commented Jan 31, 2022

Yep, all reasonable, and I'd like to see what we're calling the "Logical Format" become standard - already with the new DAG API changes it is because go-codec-dagpb instantiates it this way, the older format exists in the node implementations in go-merkledag but mainly exists to serve named pathing. So a dag diff should follow dag get in showing this format for dag-pb.

As for diff output format, I don't have strong opinions and json patch doesn't seem unreasonable to me, at least it's an existing standard and we can just run the patch data through dag-json to get proper bytes and cid output with it.

@warpfork I'm pretty sure you have some existing opinions here and some experimentation - and doesn't this overlap with your ipld-prime patch API work? Can we leverage that here at all?

@warpfork
Copy link
Member

I'm pretty much thumbs up across the board, but I'll highlight bits and thumbs up them individually since there was a lot in that post. ^^

the dag diff command should effectively normalize both inputs to IPLD Data Model and the diff should be the delta between the two.

👍 👍 👍

(note: seeing internal HAMT structure is a feature, not a bug here).

👍

JSON Patch

👍

I wrote up an issue about a hypothetical "IPLD Patch" recently but it mostly says "JSON Patch is pretty good and we should follow it".

[json patch by default]

👍 yeah it's honestly pretty readable.

We now have IPLD Schema that describes Logical Format of dag-pb

👍 🙌 and I totally love that we're using it to describe this.

Nit: I'm a little confused by the line after that where it's described as "special-case" for dag-pb... I think all the rest of the examples you put there are not special cases now :) This is the structure that go-codec-dagpb will give you now, as @rvagg said. No fuss :D

By using Logical Format for dag-pb the dag diff will also catch the case where links are the same, but in different order (which also produces different root CID).

🙌 🙌 🙌 🙌

Yep. That's exactly the kind of clarity I'd really want to be able to get from a gadget called 'diff'.

We might have more than one mode later, and maybe some that do higher level stuff, but ISTM that this should indeed be the base case.

@lidel
Copy link
Member

lidel commented Feb 1, 2022

👍

@schomatis any preference around JSON Patch library?

I've found https://github.com/evanphx/json-patch – actively maintained and supports both JSON Patch (RFC 6902), and also JSON Merge Patch (RFC 7396, which may be useful later for #4782)

@warpfork
Copy link
Member

warpfork commented Feb 1, 2022

I just gave some looks at https://github.com/evanphx/json-patch --

  • it looks like it can create jsonmergepatch (rfc7396)... but not jsonpatch (rfc6901). :( I think the latter is what we want.
  • that library's pretty tightly bound to JSON. I'm skimming the code and it seems to be doing JSON syntax and the actual operational logic very tightly entangled. No AST phase or anything.

I'm not opposed to anything that makes solutions happen (especially if the result is something that has a stable API that we can happily keep to even while replacing the implementation), and it seems like the "tightly bound to JSON" thing can be ignored then. But I don't know if that library will do the right kind of patch generation at all, unfortunately.

@schomatis
Copy link
Contributor

No experience with this so no preference. From the library recommendations in jsonpatch that seems the more maintained one so let's go with that. Ideally the library used to perform the diff shouldn't be too entangled with the code here so changing down the road if we want won't be a complex operation.

@schomatis
Copy link
Contributor

Per Lidel's request in ipfs/go-merkledag#82 (comment) going with diffing JSON output instead of implementing a specialized DAG-PB diff logic.

@lidel
Copy link
Member

lidel commented Feb 2, 2022

👍

A quick explainer of my thinking is we want a basic poc to see if we are happy with JSON Patch output / ergonomics, and then decide on next steps, namely:

  • start with shallow diff between JSON objects following IPLD Data Model
  • (discussion happens)
    • q0: are we happy with JSON Patch output?
    • q1: can we ship this shallow version, or do we need to add DAG traversal as --deep [<n>] (opt-in vs opt-out), so user can decide to traverse DAGs end-to-end, or only to some depth etc
      • q2: do we default to deep traversal by default, or should the default be inexpesive diff of top objects?
    • q3: is the output good enough when diffing unixfs, or do we need some additional porcelain for that specific case (ipfs files diff, which abstracts away HAMT and filename-based pathing)

@schomatis
Copy link
Contributor

It seems that library doesn't create patches/diffs, just applies them or checks for equality, trying https://github.com/mattbaird/jsonpatch as a first attempt.

@schomatis
Copy link
Contributor

start with shallow diff between JSON objects following IPLD Data Model

Agreed; that's what I'll provide. A basic comparison of JSON output from dag get, after that is up to you how to shape it.

@schomatis schomatis linked a pull request Feb 2, 2022 that will close this issue
@schomatis
Copy link
Contributor

@lidel #8711

@schomatis
Copy link
Contributor

it looks like it can create jsonmergepatch (rfc7396)... but not jsonpatch (rfc6901). :( I think the latter is what we want.

Sorry @warpfork, I see that you were already flagging this. I got confused by all the different names and rfcs.

@lidel
Copy link
Member

lidel commented Jun 20, 2022

Relevant IPLD spec: https://ipld.io/specs/patch/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
effort/days Estimated to take multiple days, but less than a week exp/expert Having worked on the specific codebase is important kind/feature A new feature P2 Medium: Good to have, but can wait until someone steps up
Projects
No open projects
Status: 🔎 In Review
Development

Successfully merging a pull request may close this issue.

7 participants