
metadata schema equality and hence table equality is not well-defined #763

Closed
petrelharp opened this issue Aug 11, 2020 · 21 comments · Fixed by #764
petrelharp commented Aug 11, 2020

Maybe "not well-defined" isn't the right phrase, but it is currently pretty hard to get two tables or table collections produced by different sources to be equal to each other: their metadata schemas must match at the byte level, so equality relies on the whitespace and key ordering of the JSON matching. Furthermore, in Python it seems impossible to diagnose the problem, since (a) equality of MetadataSchema objects is identity-only equality, and (b) their dict or string representations may match even if the underlying bytes do not.

So, for instance:

>>> m1 = tskit.MetadataSchema({'codec': 'json', 'type': 'object'})
>>> m2 = tskit.MetadataSchema({'type': 'object', 'codec': 'json'})
>>> t1 = msprime.simulate(4).tables
>>> t2 = t1.copy()
>>> t1 == t2
True
>>> t1.metadata_schema = m1
>>> t2.metadata_schema = m2
>>> t1 == t2
False

See below for other examples.

This is pretty bothersome for testing things in pyslim: the schema as set by SLiM differs from that set by pyslim in ordering and whitespace, even though the source code for the two is as identical as possible. Testing for equality of tables and table collections is important in other places, too.

To do this properly we'd have to parse the JSON in C, right? Which we don't really want to do; any other ideas? Even if we didn't check the metadata schema when testing equality, we'd run into similar problems with the top-level metadata, if we are proposing that applications edit the top-level metadata schema to add new keys: it could easily be that two operations commute except for their additions to the metadata schema.


petrelharp commented Aug 11, 2020

Here is a more puzzling example. I have two NodeTables, n1 and n2, such that:

>>> n1 == n2
False
>>> n1.num_rows == n2.num_rows
True
>>> all([a == b for a, b in zip(n1, n2)])
True
>>> str(n1.metadata_schema) == str(n2.metadata_schema)
True
>>> n1.metadata_schema = n2.metadata_schema
>>> n1 == n2
False
>>> n1.metadata_schema = tskit.MetadataSchema(n2.metadata_schema.schema)
>>> n2.metadata_schema = tskit.MetadataSchema(n2.metadata_schema.schema)
>>> n1 == n2
True

Note that if I'd done n2.metadata_schema = n1.metadata_schema they'd have been equal. So, assignment of metadata schema must not actually preserve the underlying bytes.

I'm not providing the code to reproduce this because it's produced by development SLiM. Doing e.g. kastore dump t1.trees nodes/metadata_schema on the two files tells me that the schemas differ only in a bunch of 32s, i.e., spaces.

@petrelharp petrelharp changed the title metadata schema equality depends on ordering of dicts metadata schema equality and hence table equality is not well-defined Aug 11, 2020

petrelharp commented Aug 11, 2020

Hm, another difficulty: comparing the dicts (i.e., the schema property) of two MetadataSchema objects - one as loaded from a SLiM-created tree sequence, the other created in Python, but both from identical JSON - can fail because the Python-created one has added required = [...] and 'additionalProperties': False attributes. Hm, and I'm not sure why it does that, either.

@grahamgower

If the __eq__ method isn't implemented on a class, then by default an equality comparison behaves like is, which checks if the two objects have the same memory location. This probably injects some confusion here.
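A quick standalone illustration of that default behaviour (not tskit-specific; Dummy is a made-up class):

```python
class Dummy:
    """A class with no __eq__ defined."""

    def __init__(self, x):
        self.x = x


a = Dummy(1)
b = Dummy(1)
# With no __eq__, == falls back to identity comparison, like `is`:
print(a == b)  # False, even though the contents are equal
print(a == a)  # True: same object
```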

@petrelharp

Yep - that did inject some confusion, but it's a minor bit at this point. Thanks!

@petrelharp

To show what I'd like to avoid, here's what I'm currently doing to compare table collections: first, I make sure all my schemas have required and additionalProperties keys (although these weren't being inserted into all of them for some reason?), then compare them as dicts, and clear them: https://github.com/tskit-dev/pyslim/pull/89/files#diff-498cf53d35427897613cdfc4b76fc6eaR235

@jeromekelleher

This is tricky! For comparing things in C I don't think we can do any better than just byte-by-byte comparisons, but we can surely do better than this in Python. We can change the definition of __eq__ on the tables so that we compare a canonicalised version of the metadata schemas, and then use np.array_equal on the rest of the columns. It's a bit of drudge work, but it wouldn't take all that long to do.
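For instance, a canonicalised comparison could look something like this (a sketch; schemas_equal is a hypothetical helper, not part of tskit):

```python
import json


def schemas_equal(s1: str, s2: str) -> bool:
    # Canonicalise both schema strings by parsing and re-serialising
    # with sorted keys, so whitespace and key order don't matter.
    def canon(s):
        return json.dumps(json.loads(s), sort_keys=True)

    return canon(s1) == canon(s2)


# These differ at the byte level but describe the same schema:
print(schemas_equal('{"codec": "json", "type": "object"}',
                    '{"type":"object","codec":"json"}'))  # True
```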

So, there'll be some discrepancies between the definition of equality in C and Python, but I think we can get over that?

I think we should probably catch this for 0.3.0, since it's a key interoperability issue.

@benjeffery, any thoughts?

@jeromekelleher jeromekelleher added this to the Python 0.3.0 milestone Aug 11, 2020
@benjeffery

Hmm, thinking about this, there is a way to canonicalise the strings generated by Python: use sort_keys=True on dump and object_pairs_hook=OrderedDict on load, so that byte equality in C isn't an issue as long as the strings have been generated in Python. I think this is also a way to remove the need for index in the struct codec. I hadn't realised this was possible before.
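That round trip, sketched in plain json (independent of tskit):

```python
import json
from collections import OrderedDict

schema = {'codec': 'json', 'type': 'object'}

# Dump with sorted keys: the string is canonical regardless of the
# insertion order of the dict's keys.
canonical = json.dumps(schema, sort_keys=True)

# Load preserving key order, so a later dump reproduces the same string.
loaded = json.loads(canonical, object_pairs_hook=OrderedDict)
print(json.dumps(loaded) == canonical)  # True: the round trip is stable
```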

@benjeffery

I'm making a PR with the canonicalised strings, but keeping index, as removing it would require enforcing the use of OrderedDict in the metadata schema constructor argument.

@jeromekelleher

Sounds good @benjeffery, a simple canonical form would help a lot.

@benjeffery

Take a look at #764 and see if it is enough to satisfy?

@petrelharp

Hello! Thanks for the quick reply. #764 would make some of this make more sense, but doesn't fix the bigger problem, which is table equality. So, the proposal is to:

  • remove these lines so that tsk_table_collection_equals does not check the metadata schema
  • modify the TableCollection.__eq__ method to also check for equality of metadata schemas

...? This would fix up this particular headache, and you're right, I think we can live with C equality not being quite the same. I guess I think we ought to also remove the check that the metadata is the same, since that's also a dictionary.

@benjeffery

If both the schema and metadata are in canonical representation doesn't that fix the table equality issue? I'd really like to avoid special casing equality.

@petrelharp

Yes, it does! But I'm not sure I have a good way to get the metadata schema into the canonical representation, since I'm writing it in SLiM's code (using this json library). I guess the question is whether that library can match json.dumps, since that's what's defining the canonical representation.

Note that I could (and probably should) just use json.dumps to dump a string representation of the table metadata schemas and include those in SLiM's code, rather than using a JSON parser - except that won't work for the top-level metadata and metadata schema, if we want to be able to edit those. Maybe we should abandon the goal of being able to edit those for the moment?

@jeromekelleher

@petrelharp - are you gaining much from writing out JSON using an external library in C++? If you write the JSON by hand you can control the ordering. Writing JSON is pretty simple.


petrelharp commented Aug 11, 2020

Well, it's not the writing out that's helpful, it's the parsing, which we only need for the top-level metadata. I think that the plan is for different programs to co-exist in the top-level metadata, so that if SLiM reads in a tree sequence, it should keep whatever is in the metadata already and only mess with the "SLiM" key. But maybe I should just not worry about that for the moment, since there's nothing else that actually uses the top-level metadata?

EDIT: Seems @benjeffery and I were typing at the same time. =)


benjeffery commented Aug 11, 2020

@jeromekelleher The issue here is adding extra keys to an existing tree sequence, hence the need to parse, add keys, then dump. From this issue it seems like the library @petrelharp is using might be fine, as it alphabetises by default! nlohmann/json#2179

EDIT: Seems @petrelharp and I were typing at the same time.

@jeromekelleher

But maybe I should just not worry about that for the moment, since there's nothing else that actually uses the top-level metadata?

Let's not worry about this for now. We're going to need to coordinate on some shared vocabulary here if we want the metadata to be interoperable, so let's just use it as we like for the moment and figure out what's useful later.

@petrelharp

From this issue it seems like the library @petrelharp is using might be fine as it alphabetises by default!

It might be - but it depends on whitespace handling also, so it seems fragile...


petrelharp commented Aug 11, 2020

Ok - I'm abandoning editing of existing top-level metadata in SLiM's code for now, as well as the JSON parsing of any metadata schemas at all. It looks like #764, plus carefully pasting the text output of json.dumps() into the C++ code, fixes most of the problems.

However, I'm still using the json library to write out the metadata itself, and running into the same issue: table collections don't match, because equality requires equality of the underlying string representations of the top-level metadata, as written by different programs. It doesn't seem possible to get SLiM's json library to output something matching json.dumps(x, sort_keys=True); for instance:

# python
b'{"SLiM": {"file_version": "0.5", "generation": 10, "model_type": "WF", "nucleotide_based": false, "separate_sexes": false, "spatial_dimensionality": "", "spatial_periodicity": ""}}'
# nlohmann dump(0)
b'{\n"SLiM": {\n"file_version": "0.5",\n"generation": 10,\n"model_type": "WF",\n"nucleotide_based": false,\n"separate_sexes": false,\n"spatial_dimensionality": "",\n"spatial_periodicity": ""\n}\n}'
# nlohmann dump()
b'{"SLiM":{"file_version":"0.5","generation":10,"model_type":"WF","nucleotide_based":false,"separate_sexes":false,"spatial_dimensionality":"","spatial_periodicity":""}}'

Notice that json.dumps() omits line breaks but includes spaces, while the nlohmann library emits either line breaks plus spaces, or neither.
This behavior is controllable: we could do json.dumps(x, sort_keys=True, separators=(',', ':')).
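For example, with made-up data, those separators reproduce the compact nlohmann-style form:

```python
import json

d = {"SLiM": {"generation": 10, "file_version": "0.5"}}

# sort_keys orders the keys; separators=(',', ':') drops all whitespace,
# matching nlohmann's dump() output style.
compact = json.dumps(d, sort_keys=True, separators=(',', ':'))
print(compact)  # {"SLiM":{"file_version":"0.5","generation":10}}
```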

Er, talking through this: the issue is that when we recapitate, we do something like new_ts = msprime.simulate(from_ts=ts); then, new_ts has no metadata. We want it to have the same metadata as ts, so currently I'm doing

new_tables.metadata_schema = tables.metadata_schema
new_tables.metadata = tables.metadata

The second line is doing the following:

  1. getting a dict from the underlying string using json.loads
  2. turning that dict back into a string now using json.dumps and assigning it
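Why those two steps can change the bytes, in miniature (a sketch independent of tskit):

```python
import json

raw = b'{"a": 1,  "b": 2}'  # original bytes, with nonstandard whitespace

# Step 1 (loads) then step 2 (dumps) normalises the whitespace,
# so the round-tripped bytes no longer equal the originals.
round_tripped = json.dumps(json.loads(raw)).encode()
print(round_tripped == raw)  # False
```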

To make this thing work, I just need to actually copy the underlying bytes, which I could do by

ms = tables.metadata_schema
tables.metadata_schema = tskit.MetadataSchema(None)
new_tables.metadata = tables.metadata
new_tables.metadata_schema = ms

Maybe in the future we want a way in python to get (and, set?) the "metadata bytes" directly? Same for the metadata schema?

I would like to be able to set metadata in Python and have it match metadata set by SLiM. To do this, we need to at least add sort_keys=True to JSONCodec.encode, I think. And either we also need to ask json to remove spaces (with separators=(',', ':')), or I need to abandon the existing json library in SLiM's code. I'm inclined to do the latter, since writing code to output this very simple JSON is easy. But if anyone sees a better way to deal with all this, let me know.

Edit: it looks like setting the indent parameter in json.dumps matches what the json library we're using does if you ask it for indenting. And makes it more readable. I'll bring this up over at #764 .

@petrelharp

Ok - I've got everything to work (using #764), and hopefully we remember this discussion when we next need to think about top-level metadata. =) Thanks!

@jeromekelleher

I'm going to reopen this, as I'm not convinced adding indentation is the right thing to do: #764 (review)
