Performance Issue with small alphabets and long texts #4

almondtools · 2016-08-09T06:21:59Z

I tried to benchmark your library with StringBench.

However, there seems to be a performance issue with some of your implementations (BoyerMoore*, BNDM) when searching long texts over a binary alphabet.

You can reproduce this by

checking out StringBench
selecting SS*Test.java
removing the @Ignore
starting the test

I cannot provide any hints about the cause yet. Maybe the test setup is incorrect; if so, I would appreciate help fixing it.

johannburkard · 2016-08-13T17:30:11Z

I'll check this out and give you feedback in the next couple of days.

johannburkard · 2016-08-14T20:37:22Z

I tried your tests and -- apart from the performance -- the tests worked. I think the problem here is that you chose to essentially benchmark your computer and some libraries using very pathological input. That's not a problem of StringSearch so closing this issue.

almondtools · 2016-08-14T21:10:57Z

No need to be rude:

A naive search with String.indexOf finishes in 650 milliseconds
An optimized Horspool finishes in 2500 milliseconds
Your BoyerMooreHorspool finishes with:

org.junit.runners.model.TestTimedOutException: test timed out after 60000 milliseconds
...

I do not get a result even after several minutes. I am certain that you were mistaken and did not remove the @Ignore annotation from the class before starting the test (as described in my steps above).

After some debugging I found that the problem is not an infinite loop but, as you said, poor performance (possibly caused by incorrect API usage on my side). It is not a problem inherent to the Boyer-Moore-Horspool algorithm, though: the implementations in byteseek, java.util.regex.Matcher and StringsAndChars are slower than the naive search, but finish in under 5 seconds on the same scenario.

One reason might be that your searchString(String, int, String, Object) method converts the strings to char arrays on every call. Calling searchString in a loop (as the benchmark does) means that the whole document is copied in memory on every iteration. Could you provide a hint on how to efficiently collect all non-overlapping matches in a text?
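To illustrate the suspected cost, here is a minimal, self-contained sketch of a Horspool search that converts the text to a char array once and then collects all non-overlapping matches in a single pass. It does not use StringSearch; the class name HorspoolAllMatches and the method findAll are hypothetical and exist only for this example. It also assumes that the per-call copy really is the bottleneck, which has not been confirmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of StringSearch.
public final class HorspoolAllMatches {

    // Returns the start indices of all non-overlapping occurrences of pattern in text.
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> matches = new ArrayList<>();
        char[] t = text.toCharArray();    // the text is copied exactly once
        char[] p = pattern.toCharArray();
        int n = t.length;
        int m = p.length;
        if (m == 0 || m > n) {
            return matches;
        }

        // Horspool bad-character table: how far to shift when the text char
        // aligned with the last pattern position is c.
        int[] shift = new int[Character.MAX_VALUE + 1];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[p[i]] = m - 1 - i;
        }

        int pos = 0;
        while (pos <= n - m) {
            int j = m - 1;
            while (j >= 0 && t[pos + j] == p[j]) {
                j--;
            }
            if (j < 0) {
                matches.add(pos);
                pos += m;                       // skip the match: non-overlapping
            } else {
                pos += shift[t[pos + m - 1]];   // regular Horspool shift
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("0101010", "010")); // prints [0, 4]
    }
}
```

With a binary alphabet the shift table holds only small values, so Horspool degrades towards a character-by-character scan, but the work stays proportional to a single pass over the text instead of one copy of the text per match.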

johannburkard closed this as completed Aug 14, 2016
