
metadata schema equality and hence table equality is not well-defined #763

Closed
petrelharp opened this issue Aug 11, 2020 · 21 comments · Fixed by #764
petrelharp commented Aug 11, 2020

Maybe "not well-defined" isn't the right phrase, but it is currently pretty hard to get two tables or table collections produced by different sources to be equal to each other: their metadata schemas must match at the byte level, so equality relies on the whitespace and key ordering of the JSON matching. Furthermore, in Python it seems impossible to diagnose the problem, since (a) equality of MetadataSchema objects is identity-only equality, and (b) their dict or string representations may match even if the underlying bytes do not.

So, for instance:

>>> m1 = tskit.MetadataSchema({'codec': 'json', 'type': 'object'})
>>> m2 = tskit.MetadataSchema({'type': 'object', 'codec': 'json'})
>>> t1 = msprime.simulate(4).tables
>>> t2 = t1.copy()
>>> t1 == t2
True
>>> t1.metadata_schema = m1
>>> t2.metadata_schema = m2
>>> t1 == t2
False

See below for other examples.

This is pretty bothersome for testing things in pyslim: the schema as set by SLiM differs from that set by pyslim in ordering and whitespace, even though the source code for the two is as identical as possible. Testing for equality of tables and table collections is important in other places, too.

To do this properly we'd have to parse the JSON in C, right? Which we don't really want to do; any other ideas? Even if we didn't check the metadata schema when testing equality, we'd run into similar problems with the top-level metadata, if we are proposing that applications edit the top-level metadata schema to add new keys: it could easily be that two operations commute except for their additions to the metadata schema.


petrelharp commented Aug 11, 2020

Here is a more puzzling example. I have two NodeTables, n1 and n2, such that:

>>> n1 == n2
False
>>> n1.num_rows == n2.num_rows
True
>>> all([a == b for a, b in zip(n1, n2)])
True
>>> str(n1.metadata_schema) == str(n2.metadata_schema)
True
>>> n1.metadata_schema = n2.metadata_schema
>>> n1 == n2
False
>>> n1.metadata_schema = tskit.MetadataSchema(n2.metadata_schema.schema)
>>> n2.metadata_schema = tskit.MetadataSchema(n2.metadata_schema.schema)
>>> n1 == n2
True

Note that if I'd done n2.metadata_schema = n1.metadata_schema they'd have been equal. So, assignment of metadata schema must not actually preserve the underlying bytes.

I'm not providing the code to reproduce this because it's produced by development SLiM. Doing e.g. kastore dump t1.trees nodes/metadata_schema on the two files tells me that the schemas differ only in a bunch of 32s, i.e., spaces.

@petrelharp petrelharp changed the title metadata schema equality depends on ordering of dicts metadata schema equality and hence table equality is not well-defined Aug 11, 2020

petrelharp commented Aug 11, 2020

Hm, another difficulty: comparing the dicts (i.e., the schema property) of two MetadataSchema objects - one as loaded from a SLiM-created tree sequence, the other created in Python, but both from identical JSON - can fail because the Python-created one has added required = [...] and 'additionalProperties': False attributes. Hm, and I'm not sure why it does that, either.

@grahamgower

If the __eq__ method isn't implemented on a class, then by default an equality comparison behaves like is, which checks if the two objects have the same memory location. This probably injects some confusion here.
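A quick standalone illustration of that default behaviour (not tskit-specific; Dummy is a made-up class):

```python
class Dummy:
    """A class with no __eq__ defined."""

    def __init__(self, x):
        self.x = x


a = Dummy(1)
b = Dummy(1)
# With no __eq__, == falls back to identity comparison, like `is`:
print(a == b)  # False, even though the contents are equal
print(a == a)  # True: same object
```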

@petrelharp

Yep - that did inject some confusion, but it's a minor bit at this point. Thanks!

@petrelharp

To show what I'd like to avoid, here's what I'm currently doing to compare table collections: first, I make sure all my schemas have required and additionalProperties keys (although these weren't being inserted into all of them for some reason?), then compare them as dicts, and clear them: https://github.com/tskit-dev/pyslim/pull/89/files#diff-498cf53d35427897613cdfc4b76fc6eaR235

@jeromekelleher

This is tricky! For comparing things in C I don't think we can do any better than just byte-by-byte comparisons, but we can surely do better than this in Python. We can change the definition of __eq__ on the tables so that we compare a canonicalised version of the metadata schemas, and then use np.array_equal on the rest of the columns. It's a bit of drudge work, but it wouldn't take all that long to do.
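For instance, a canonicalised comparison could look something like this (a sketch; schemas_equal is a hypothetical helper, not part of tskit):

```python
import json


def schemas_equal(s1: str, s2: str) -> bool:
    # Canonicalise both schema strings by parsing and re-serialising
    # with sorted keys, so whitespace and key order don't matter.
    def canon(s):
        return json.dumps(json.loads(s), sort_keys=True)

    return canon(s1) == canon(s2)


# These differ at the byte level but describe the same schema:
print(schemas_equal('{"codec": "json", "type": "object"}',
                    '{"type":"object","codec":"json"}'))  # True
```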

So, there'll be some discrepancies between the definition of equality in C and Python, but I think we can get over that?

I think we should probably catch this for 0.3.0, since it's a key interoperability issue.

@benjeffery, any thoughts?

@jeromekelleher jeromekelleher added this to the Python 0.3.0 milestone Aug 11, 2020
@benjeffery

Hmm, thinking about this, there is a way to canonicalise the strings generated by Python: use sort_keys=True on dump and object_pairs_hook=OrderedDict on load, so that byte equality in C isn't an issue as long as the strings have been generated in Python. I think this is also a way to remove the need for index in the struct codec. I hadn't realised this was possible before.
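That round trip, sketched in plain json (independent of tskit):

```python
import json
from collections import OrderedDict

schema = {'codec': 'json', 'type': 'object'}

# Dump with sorted keys: the string is canonical regardless of the
# insertion order of the dict's keys.
canonical = json.dumps(schema, sort_keys=True)

# Load preserving key order, so a later dump reproduces the same string.
loaded = json.loads(canonical, object_pairs_hook=OrderedDict)
print(json.dumps(loaded) == canonical)  # True: the round trip is stable
```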

@benjeffery

I'm making a PR with the canonicalised strings, but keeping index, as removing it would require enforcing the use of OrderedDict in the metadata schema constructor argument.

@jeromekelleher

Sounds good @benjeffery, a simple canonical form would help a lot.

@benjeffery

Take a look at #764 and see if it is enough to satisfy?

@petrelharp

Hello! Thanks for the quick reply. #764 would make some of this make more sense, but doesn't fix the bigger problem, which is table equality. So, the proposal is to:

  • remove these lines so that tsk_table_collection_equals does not check the metadata schema
  • modify the TableCollection.__eq__ method to also check for equality of metadata schemas

...? This would fix up this particular headache, and you're right, I think we can live with C equality not being quite the same. I guess I think we ought to also remove the check that the metadata is the same, since that's also a dictionary.

@benjeffery

If both the schema and metadata are in canonical representation doesn't that fix the table equality issue? I'd really like to avoid special casing equality.

@petrelharp

Yes, it does! But I'm not sure I have a good way to get the metadata schema into the canonical representation, since I'm writing it in SLiM's code (using this json library). I guess the question is whether that library can match json.dumps, since that's what's defining the canonical representation.

Note that I could (and probably should) just use json.dumps to dump a string representation of the table metadata schemas and include those in SLiM's code, rather than using a JSON parser - except that won't work for the top-level metadata and metadata schema, if we want to be able to edit those. Maybe we should abandon the goal of being able to edit those for the moment?

@jeromekelleher

@petrelharp - are you gaining much from writing out JSON using an external library in C++? If you write the JSON by hand you can control the ordering. Writing JSON is pretty simple.


petrelharp commented Aug 11, 2020

Well, it's not the writing out that's helpful, it's the parsing, which we only need for the top-level metadata. I think that the plan is for different programs to co-exist in the top-level metadata, so that if SLiM reads in a tree sequence, it should keep whatever is in the metadata already and only mess with the "SLiM" key. But maybe I should just not worry about that for the moment, since there's nothing else that actually uses the top-level metadata?

EDIT: Seems @benjeffery and I were typing at the same time. =)


benjeffery commented Aug 11, 2020

@jeromekelleher The issue here is adding extra keys to an existing tree sequence, hence the need to parse, add keys, then dump. From this issue it seems like the library @petrelharp is using might be fine, as it alphabetises by default! nlohmann/json#2179

EDIT: Seems @petrelharp and I were typing at the same time.

@jeromekelleher

But maybe I should just not worry about that for the moment, since there's nothing else that actually uses the top-level metadata?

Let's not worry about this for now. We're going to need to coordinate on some shared vocabulary here if we want the metadata to be interoperable, so let's just use it as we like for the moment and figure out what's useful later.

@petrelharp

From this issue it seems like the library @petrelharp is using might be fine as it alphabetises by default!

It might be - but it depends on whitespace handling also, so it seems fragile...


petrelharp commented Aug 11, 2020

Ok - I'm abandoning editing of existing top-level metadata in SLiM's code for now, as well as the JSON parsing of any metadata schemas at all. It looks like #764, plus carefully pasting the text output of json.dumps() into the C++ code, fixes most of the problems.

However, I'm still using the json library to write out the metadata itself, and running into the same issue: table collections don't match, because equality requires equality of the underlying string representations of the top-level metadata, as written by different programs. It doesn't seem possible to get SLiM's json library to output something matching json.dumps(x, sort_keys=True); for instance:

# python
b'{"SLiM": {"file_version": "0.5", "generation": 10, "model_type": "WF", "nucleotide_based": false, "separate_sexes": false, "spatial_dimensionality": "", "spatial_periodicity": ""}}'
# nlohmann dump(0)
b'{\n"SLiM": {\n"file_version": "0.5",\n"generation": 10,\n"model_type": "WF",\n"nucleotide_based": false,\n"separate_sexes": false,\n"spatial_dimensionality": "",\n"spatial_periodicity": ""\n}\n}'
# nlohmann dump()
b'{"SLiM":{"file_version":"0.5","generation":10,"model_type":"WF","nucleotide_based":false,"separate_sexes":false,"spatial_dimensionality":"","spatial_periodicity":""}}'

Notice that json.dumps() omits line breaks but includes spaces, while the nlohmann library emits either line breaks plus spaces, or neither.
This behavior is controllable: we could do json.dumps(x, sort_keys=True, separators=(',', ':')).
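For example, with made-up data, those separators reproduce the compact nlohmann-style form:

```python
import json

d = {"SLiM": {"generation": 10, "file_version": "0.5"}}

# sort_keys orders the keys; separators=(',', ':') drops all whitespace,
# matching nlohmann's dump() output style.
compact = json.dumps(d, sort_keys=True, separators=(',', ':'))
print(compact)  # {"SLiM":{"file_version":"0.5","generation":10}}
```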

Er, talking through this: the issue is that when we recapitate, we do something like new_ts = msprime.simulate(from_ts=ts); then, new_ts has no metadata. We want it to have the same metadata as ts, so currently I'm doing

new_tables.metadata_schema = tables.metadata_schema
new_tables.metadata = tables.metadata

The second line is doing the following:

  1. getting a dict from the underlying string using json.loads
  2. turning that dict back into a string now using json.dumps and assigning it
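Why those two steps can change the bytes, in miniature (a sketch independent of tskit):

```python
import json

raw = b'{"a": 1,  "b": 2}'  # original bytes, with nonstandard whitespace

# Step 1 (loads) then step 2 (dumps) normalises the whitespace,
# so the round-tripped bytes no longer equal the originals.
round_tripped = json.dumps(json.loads(raw)).encode()
print(round_tripped == raw)  # False
```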

To make this thing work, I just need to actually copy the underlying bytes, which I could do by

ms = tables.metadata_schema
tables.metadata_schema = tskit.MetadataSchema(None)
new_tables.metadata = tables.metadata
new_tables.metadata_schema = ms

Maybe in the future we want a way in python to get (and, set?) the "metadata bytes" directly? Same for the metadata schema?

I would like to be able to set metadata in Python and have it match metadata set by SLiM. To do this, we need to at least add sort_keys=True to JSONCodec.encode, I think. And either we also need to ask json to remove spaces (with separators=(',', ':')), or I need to abandon the existing json library in SLiM's code. I'm inclined to do the latter, since writing code to output this very simple JSON is easy. But if anyone sees a better way to deal with all this, let me know.

Edit: it looks like setting the indent parameter in json.dumps matches what the json library we're using does if you ask it for indenting. And makes it more readable. I'll bring this up over at #764 .

@petrelharp

Ok - I've got everything to work (using #764), and hopefully we remember this discussion when we next need to think about top-level metadata. =) Thanks!

@jeromekelleher

I'm going to reopen this, as I'm not convinced adding indentation is the right thing to do: #764 (review)
