
Very slow equality checks for certain sentences #1

Open
blyxxyz opened this issue Jul 5, 2019 · 8 comments
Labels
enhancement New feature or request

Comments

@blyxxyz
Collaborator

blyxxyz commented Jul 5, 2019

Sentences where many nodes have multiple parents can be slow to compare to each other.

This happens when two sentences are identical or almost identical, but exist as separate objects. Each equality check naively compares the children of nodes, but that means nodes can be checked a lot of times.

For example, using examples/socialchoice.py:

$ python3 -m timeit -s 'from socialchoice import lin; s = lin(range(10)); t = lin(range(10))' 's == t'
1 loop, best of 5: 7.47 sec per loop

There's no straightforward way to solve this, because the equality checks use frozenset equality. The problem is solved in other methods with a temporary cache around a function defined inside the method, but comparing frozensets just invokes the __eq__ methods of the children again.

Two potential solutions:

  • Enable a global cache when __eq__ is called, and clear the cache when the call to __eq__ that enabled the cache is finished. This probably interacts badly with threads, but only by potentially keeping objects in memory longer than necessary or discarding the cache too early, without outright incorrect behavior. It also adds at least a little overhead to typical __eq__ calls.
  • Stop using frozenset equality, and implement all of the equality checks from scratch. This could make performance much worse for typical cases.

The problem only occurs when the objects are constructed separately. If an object is compared to itself, Python is smart enough to notice and skip the lengthier check.

Solving it might not be worth it, unless a concrete case where it occurs pops up.

@haz haz added the enhancement New feature or request label Jul 15, 2020
@haz
Collaborator

haz commented Jul 15, 2020

Would this work as a third option?

  • Every variable obviously has a unique hash (as does its negation).
  • Every node has a hash built recursively on the (sorted) children hashes and node type. E.g.,
    hash("%s (%s)" % (node_type, ','.join(sorted([str(node_hash(c)) for c in children]))))
  • Equality is checked by just comparing the root-level hashes.

The added benefit of this is that you could feed it a formula, and then it would be simplified and compressed to the DAG representation rather than the full branching tree.
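The recursive hash could be memoized by object identity so each shared node is hashed only once per call. A sketch, with hypothetical `kind`/`children` attributes standing in for the real node classes:

```python
# Sketch of the proposed recursive structural hash, memoized by object
# identity so shared nodes are hashed once. `kind` and `children` are
# stand-ins for whatever the real node classes expose.
class Node:
    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = tuple(children)

def structural_hash(node, _memo=None):
    if _memo is None:
        _memo = {}
    if id(node) not in _memo:
        # sorted child hashes make the result order-insensitive,
        # matching frozenset-style semantics for And/Or children
        child_hashes = sorted(structural_hash(c, _memo)
                              for c in node.children)
        _memo[id(node)] = hash((node.kind, tuple(child_hashes)))
    return _memo[id(node)]
```

Comparing root hashes is then linear in the DAG size, at the cost of treating any hash collision as equality.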

@blyxxyz
Collaborator Author

blyxxyz commented Jul 16, 2020

Every variable obviously has a unique hash (as does its negation).

This isn't the case, unfortunately, and you don't even have to get very pathological to find a collision: hash(Var(0)) == hash(Var('')) and hash(Var(-1)) == hash(Var(-2)) (CPython 3.8.3).
(ints hash to themselves, but -1 is an internal error code, so hash(-1) == -2 == hash(-2).)
I don't think it's a good idea to assume unique hashes, at least by default. We can't assume anything besides o == p -> hash(o) == hash(p) (but not vice versa).
Making this opt-in with a warning to only use collision-resistant labels (like positive integers < 2**60) could still work.

I first thought to reuse the existing internal node hashing, which uses frozenset hashes. Here's the implementation of those, for reference. They XOR all the member hashes, so there are edge cases like collisions canceling out, so that hash(frozenset({'', 0})) == hash(frozenset({-1, -2})). So your more manual method seems safer.
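These collisions can be reproduced with plain built-ins, no Var needed (CPython behavior):

```python
# CPython hash facts behind the collisions above: small ints hash to
# themselves except -1 (reserved as an internal error code), and the
# empty string always hashes to 0, even with hash randomization.
print(hash(0) == hash(''))    # True: both are 0
print(hash(-1) == hash(-2))   # True: both are -2

# frozenset hashing XORs a shuffled version of each member's hash, so
# two members with equal hashes cancel out the same way in both sets:
print(hash(frozenset({'', 0})) == hash(frozenset({-1, -2})))  # True
```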

I think it could work, but not as a default.

compressed to the DAG representation rather than the full branching tree

Converting to a (V, E) DAG representation would be a safe solution if we can do it without losing information. That might also help with converting to NetworkX.
You can't do it deterministically without a total ordering over all possible variables. But if you already know which sentences you're comparing then you can invent your own ordering without worrying about consistency between calls.
Let's call that option four.
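Option four might look something like this (illustrative `Node` class, not nnf's API): invent a first-seen ordering over variables, share it across the sentences being compared, and canonicalize each subtree to a vertex id.

```python
# Sketch of option four: convert a sentence to a (V, E) DAG, mapping
# structurally identical subtrees to one vertex. The variable ordering
# is invented on first sight and shared between the sentences being
# compared via the var_order dict. Node is illustrative, not nnf's.
class Node:
    def __init__(self, kind, children=(), label=None):
        self.kind = kind
        self.children = tuple(children)
        self.label = label

def to_dag(root, var_order):
    key_to_id = {}            # canonical structural key -> vertex id
    vertices, edges = [], []

    def visit(node):
        if node.children:
            child_ids = sorted(visit(c) for c in node.children)
            key = (node.kind, tuple(child_ids))
        else:
            # first-seen index stands in for a total order on labels
            key = ('var', var_order.setdefault(node.label, len(var_order)))
        if key not in key_to_id:
            key_to_id[key] = vid = len(vertices)
            vertices.append(node.kind if node.children else node.label)
            if node.children:
                edges.extend((vid, c) for c in sorted(set(child_ids)))
        return key_to_id[key]

    return vertices, edges, visit(root)
```

Two separately built sentences then compare equal exactly when their `(V, E, root)` triples match, as long as both conversions share one `var_order` dict.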

@haz
Collaborator

haz commented Jul 16, 2020

Ah, well I was assuming you always hash the string of an object. And variables wrapping an integer should then be fine (collisions notwithstanding): i.e., hash(Var(-1)) == hash(str(-1)) != hash(str(-2)) == hash(Var(-2))

Collisions still possible, but perhaps a binning scheme could be used to ensure uniqueness. At the very least, as you point out, we could use it as an optional optimization and then throw an error when we've detected a collision (reverse map of the hash to string to keep track).

@blyxxyz
Collaborator Author

blyxxyz commented Jul 16, 2020

That would work well if you have simple labels that are all of the same type, but worse if you mix types, or if your label type has a very generic string representation.

But as long as it's self-contained we don't need to bother with real hashes. We could just assign each variable a value arbitrarily, à la var_hashes = {label: num for num, label in enumerate(self.vars())}.
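For illustration (with plain labels rather than `self.vars()`): labels that collide under hash() or look alike under str() still get distinct numbers, because dict keys are distinguished by actual equality.

```python
# The enumeration idea on plain labels: 0 and '' collide under hash(),
# 0 and '0' collide under str(), but each label gets its own number
# because dict keys are compared by equality, not by hash alone.
labels = [0, '', '0', -1, -2]
var_nums = {label: num for num, label in enumerate(labels)}
print(var_nums)  # five entries, one distinct number per label
```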

@haz
Collaborator

haz commented Jul 17, 2020

Ah, ya, I see your point. Given a generic __str__ implementation on whatever's forced into Var objects, then we run into issues at the Variable level. But wouldn't what you're proposing also face issues with object uniqueness? I.e., Var(2) != Var(2), which might not be something we'd want.

Cake and eat it too? Pick and document a reasonable default, but allow for the alternative to be specified as part of a configuration. Already feels like some configuration would be helpful (do you want to try and do (collision risk) hashing? want to change the default SAT solver? etc).

@haz
Collaborator

haz commented Jul 17, 2020

Btw, how does this entire discussion relate to the deduplication you've done? https://python-nnf.readthedocs.io/en/stable/_modules/nnf.html#NNF.deduplicate

@blyxxyz blyxxyz changed the title Very slow equality checks Very slow equality checks for certain sentences Jul 31, 2020
@blyxxyz
Collaborator Author

blyxxyz commented Aug 2, 2020

Btw, how does this entire discussion relate to the deduplication you've done?

The kind of sentences where you get these slow equality checks (compact as DAGs but not as trees) is also the kind where you have to take care not to duplicate your nodes. But even if you don't duplicate nodes within the individual sentences you still get this problem with equality checks.

The .deduplicate() method doesn't really solve the duplication, because you still have high memory use and a long time to construct the sentence beforehand. It's there as a diagnostic tool to use with .object_count().

.deduplicate() is itself slowed down by this problem, because it relies on object equality. That suggests offering a scoped solution, independent of user configuration, where equality is remembered for the duration of the call to .deduplicate(). I'm working on something similar for uses of .vars(), related to #12.

@haz
Collaborator

haz commented Aug 4, 2020

Well if we have the hashing in place (configured or otherwise), then the hashing can be computed during either equality or deduplication -- since it's recursive, it can save over several operations. I guess I always assumed that equality makes sense to check on deduped theories, as otherwise you could just define a canonical string printout of the theory and match that way (issues w/ Var(2) != Var(2) aside).
