
Very slow equality checks for certain sentences #1

Open
blyxxyz opened this issue Jul 5, 2019 · 8 comments
Labels
enhancement New feature or request

Comments

@blyxxyz
Collaborator

blyxxyz commented Jul 5, 2019

Sentences where many nodes have multiple parents can be slow to compare to each other.

This happens when two sentences are identical or almost identical, but exist as separate objects. Each equality check naively compares the children of nodes, but that means nodes can be checked a lot of times.

For example, using examples/socialchoice.py:

$ python3 -m timeit -s 'from socialchoice import lin; s = lin(range(10)); t = lin(range(10))' 's == t'
1 loop, best of 5: 7.47 sec per loop

There's no straightforward way to solve this, because the equality checks use frozenset equality. The problem is solved in other methods with a temporary cache around a function defined inside the method, but comparing frozensets just invokes the __eq__ methods of the children again.

Two potential solutions:

  • Enable a global cache when __eq__ is called, and clear the cache when the call to __eq__ that enabled the cache is finished. This probably interacts badly with threads, but only by potentially keeping objects in memory longer than necessary or discarding the cache too early, without outright incorrect behavior. It also adds at least a little overhead to typical __eq__ calls.
  • Stop using frozenset equality, and implement all of the equality checks from scratch. This could make performance much worse for typical cases.

The problem only occurs when the objects are constructed separately. If an object is compared to itself, Python is smart enough to notice and skip the lengthier check.

Solving it might not be worth it, unless a concrete case where it occurs pops up.

@haz haz added the enhancement New feature or request label Jul 15, 2020
@haz
Collaborator

haz commented Jul 15, 2020

Would this work as a third option?

  • Every variable obviously has a unique hash (as does its negation).
  • Every node has a hash built recursively on the (sorted) children hashes and node type. E.g.,
    hash("%s (%s)" % (node_type, ','.join(sorted([str(node_hash(c)) for c in children]))))
  • Equality is checked by just comparing the root-level hashes.

The added benefit of this is that you could feed it a formula, and then it would be simplified and compressed to the DAG representation rather than the full branching tree.
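The recursive hash could be memoized by object identity so each shared node is hashed only once per call. A sketch, with hypothetical `kind`/`children` attributes standing in for the real node classes:

```python
# Sketch of the proposed recursive structural hash, memoized by object
# identity so shared nodes are hashed once. `kind` and `children` are
# stand-ins for whatever the real node classes expose.
class Node:
    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = tuple(children)

def structural_hash(node, _memo=None):
    if _memo is None:
        _memo = {}
    if id(node) not in _memo:
        # sorted child hashes make the result order-insensitive,
        # matching frozenset-style semantics for And/Or children
        child_hashes = sorted(structural_hash(c, _memo)
                              for c in node.children)
        _memo[id(node)] = hash((node.kind, tuple(child_hashes)))
    return _memo[id(node)]
```

Comparing root hashes is then linear in the DAG size, at the cost of treating any hash collision as equality.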

@blyxxyz
Collaborator Author

blyxxyz commented Jul 16, 2020

Every variable obviously has a unique hash (as does its negation).

This isn't the case, unfortunately, and you don't even have to get very pathological to find a collision: hash(Var(0)) == hash(Var('')) and hash(Var(-1)) == hash(Var(-2)) (CPython 3.8.3).
(ints hash to themselves, but -1 is an internal error code, so hash(-1) == -2 == hash(-2).)
I don't think it's a good idea to assume unique hashes, at least by default. We can't assume anything besides o == p -> hash(o) == hash(p) (but not vice versa).
Making this opt-in with a warning to only use collision-resistant labels (like positive integers < 2**60) could still work.

I first thought to reuse the existing internal node hashing, which uses frozenset hashes. Here's the implementation of those, for reference. They XOR all the member hashes, so there are edge cases like collisions canceling out, so that hash(frozenset({'', 0})) == hash(frozenset({-1, -2})). So your more manual method seems safer.
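These collisions can be reproduced with plain built-ins, no Var needed (CPython behavior):

```python
# CPython hash facts behind the collisions above: small ints hash to
# themselves except -1 (reserved as an internal error code), and the
# empty string always hashes to 0, even with hash randomization.
print(hash(0) == hash(''))    # True: both are 0
print(hash(-1) == hash(-2))   # True: both are -2

# frozenset hashing XORs a shuffled version of each member's hash, so
# two members with equal hashes cancel out the same way in both sets:
print(hash(frozenset({'', 0})) == hash(frozenset({-1, -2})))  # True
```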

I think it could work, but not as a default.

compressed to the DAG representation rather than the full branching tree

Converting to a (V, E) DAG representation would be a safe solution if we can do it without losing information. That might also help with converting to NetworkX.
You can't do it deterministically without a total ordering over all possible variables. But if you already know which sentences you're comparing then you can invent your own ordering without worrying about consistency between calls.
Let's call that option four.
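Option four might look something like this (illustrative `Node` class, not nnf's API): invent a first-seen ordering over variables, share it across the sentences being compared, and canonicalize each subtree to a vertex id.

```python
# Sketch of option four: convert a sentence to a (V, E) DAG, mapping
# structurally identical subtrees to one vertex. The variable ordering
# is invented on first sight and shared between the sentences being
# compared via the var_order dict. Node is illustrative, not nnf's.
class Node:
    def __init__(self, kind, children=(), label=None):
        self.kind = kind
        self.children = tuple(children)
        self.label = label

def to_dag(root, var_order):
    key_to_id = {}            # canonical structural key -> vertex id
    vertices, edges = [], []

    def visit(node):
        if node.children:
            child_ids = sorted(visit(c) for c in node.children)
            key = (node.kind, tuple(child_ids))
        else:
            # first-seen index stands in for a total order on labels
            key = ('var', var_order.setdefault(node.label, len(var_order)))
        if key not in key_to_id:
            key_to_id[key] = vid = len(vertices)
            vertices.append(node.kind if node.children else node.label)
            if node.children:
                edges.extend((vid, c) for c in sorted(set(child_ids)))
        return key_to_id[key]

    return vertices, edges, visit(root)
```

Two separately built sentences then compare equal exactly when their `(V, E, root)` triples match, as long as both conversions share one `var_order` dict.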

@haz
Collaborator

haz commented Jul 16, 2020

Ah, well I was assuming you always hash the string of an object. And variables wrapping an integer should then be fine (collisions notwithstanding): i.e., hash(Var(-1)) == hash(str(-1)) != hash(str(-2)) == hash(Var(-2))

Collisions still possible, but perhaps a binning scheme could be used to ensure uniqueness. At the very least, as you point out, we could use it as an optional optimization and then throw an error when we've detected a collision (reverse map of the hash to string to keep track).

@blyxxyz
Collaborator Author

blyxxyz commented Jul 16, 2020

That would work well if you have simple labels that are all of the same type, but worse if you mix types, or if your label type has a very generic string representation.

But as long as it's self-contained we don't need to bother with real hashes. We could just assign each variable a value arbitrarily, à la var_hashes = {label: num for num, label in enumerate(self.vars())}.
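For illustration (with plain labels rather than `self.vars()`): labels that collide under hash() or look alike under str() still get distinct numbers, because dict keys are distinguished by actual equality.

```python
# The enumeration idea on plain labels: 0 and '' collide under hash(),
# 0 and '0' collide under str(), but each label gets its own number
# because dict keys are compared by equality, not by hash alone.
labels = [0, '', '0', -1, -2]
var_nums = {label: num for num, label in enumerate(labels)}
print(var_nums)  # five entries, one distinct number per label
```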

@haz
Collaborator

haz commented Jul 17, 2020

Ah, ya, I see your point. Given a generic __str__ implementation on whatever's forced into Var objects, then we run into issues at the Variable level. But wouldn't what you're proposing also face issues with object uniqueness? I.e., Var(2) != Var(2), which might not be something we'd want.

Cake and eat it too? Pick and document a reasonable default, but allow for the alternative to be specified as part of a configuration. Already feels like some configuration would be helpful (do you want to try and do (collision risk) hashing? want to change the default SAT solver? etc).

@haz
Collaborator

haz commented Jul 17, 2020

Btw, how does this entire discussion relate to the deduplication you've done? https://python-nnf.readthedocs.io/en/stable/_modules/nnf.html#NNF.deduplicate

@blyxxyz blyxxyz changed the title Very slow equality checks Very slow equality checks for certain sentences Jul 31, 2020
@blyxxyz
Collaborator Author

blyxxyz commented Aug 2, 2020

Btw, how does this entire discussion relate to the deduplication you've done?

The kind of sentences where you get these slow equality checks (compact as DAGs but not as trees) is also the kind where you have to take care not to duplicate your nodes. But even if you don't duplicate nodes within the individual sentences you still get this problem with equality checks.

The .deduplicate() method doesn't really solve the duplication, because you still have high memory use and a long time to construct the sentence beforehand. It's there as a diagnostic tool to use with .object_count().

.deduplicate() is itself slowed down by this problem, because it relies on object equality. That suggests offering a scoped solution, independent of user configuration, where equality is remembered for the duration of the call to .deduplicate(). I'm working on something similar for uses of .vars(), related to #12.

@haz
Collaborator

haz commented Aug 4, 2020

Well if we have the hashing in place (configured or otherwise), then the hashing can be computed during either equality or deduplication -- since it's recursive, it can save over several operations. I guess I always assumed that equality makes sense to check on deduped theories, as otherwise you could just define a canonical string printout of the theory and match that way (issues w/ Var(2) != Var(2) aside).
