Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Issue with small alphabets and long texts #4

Closed
almondtools opened this issue Aug 9, 2016 · 3 comments
Closed

Performance Issue with small alphabets and long texts #4

almondtools opened this issue Aug 9, 2016 · 3 comments

Comments

@almondtools
Copy link

I tried to benchmark your library with StringBench.

Yet there seems to be a performance issue with some your implementations (BoyerMoore*, BNDM) related to a binary alphabet with long texts.

You can reproduce this by

  • checking out StringBench
  • selecting SS*Test.java
  • removing the @ignore
  • starting the test

Yet I cannot provide any hints to the problem - may be the test setup is incorrect. If so it would be kind helping me to fix it.

@johannburkard
Copy link
Owner

I'll check this out and give you feedback in the next couple of days.

@johannburkard
Copy link
Owner

I tried your tests and -- apart from the performance -- the tests worked. I think the problem here is that you chose to essentially benchmark your computer and some libraries using very pathological input. That's not a problem of StringSearch so closing this issue.

@almondtools
Copy link
Author

almondtools commented Aug 14, 2016

No need to be rude:

  • A naive search with String.indexOf finishes in a 650 milliseconds

  • An optimized Horspool finishes in 2500 milliseconds

  • Your BoyerMooreHorspool finishes with:

    org.junit.runners.model.TestTimedOutException: test timed out after 60000 milliseconds
    ...

I do not get a result after minutes. I am certain that you were mistaken and did not remove the @Ignore-Marker from the class before starting the test (as explained by me).

After some debugging I found out, that the problem is not caused by an infinite loop, but as you said, from poor performance (maybe because of a wrong api usage). And it is just not a problem of the boyer moore horspool algorithm (the one of byteseek, java.util.Matcher and StringsAndChars perform slower than naive search, but faster than 5 seconds on the same scenario).

One reason might be that your searchString(String, int, String, Object) method converts the strings to char arrays at every call. Calling searchString in a loop (which is done by the benchmark) means that the whole document is copied in memory at every iteration. So perhaps you can provide a hint how one could efficiently collect all non-overlapping matches in a text?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants