Benchmark run without optimization? #12

Open · h-2 opened this issue Jan 5, 2017 · 9 comments

h-2 commented Jan 5, 2017

Hi,
this is Hannes from SeqAn. From your benchmark file it looks like you are neither setting C++ optimization levels nor deactivating debug assertions:
https://github.com/walaj/SeqLib/blob/master/benchmark/Makefile
Were the published benchmark results indeed run without -O3 -DNDEBUG?
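
For context, a minimal sketch (compiler invocation and file name are hypothetical, not taken from the SeqLib Makefile) of what those two flags change at the C++ level:

```cpp
// Illustrative compile lines (hypothetical file name):
//   unoptimized: g++ -std=c++11 bench.cpp -o bench
//   optimized:   g++ -std=c++11 -O3 -DNDEBUG bench.cpp -o bench
#include <cassert>

int main() {
    // Defining NDEBUG compiles assert() bodies away entirely, so any
    // per-record assertions in a library's hot path cost nothing in
    // the optimized build but run on every record otherwise.
    assert(2 + 2 == 4);
    return 0;
}
```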

If so, could you please repeat them? Since the other libraries are included pre-compiled, it is unclear which options they were built with.
Other third-party benchmarks have previously had similar problems, where optimized builds of htslib were tested against unoptimized builds of SeqAn (the difference is huge!):
wilzbach/bam-perf-test@5aea37e
Since all our tests show that we are faster than htslib in almost all cases, we would of course be interested in your test data if it turns out we indeed perform notably worse.

If, on the other hand, the results are closer to what we are seeing, we would kindly ask you to also update the table in the paper.

Thank you very much,
Hannes

walaj (Owner) commented Jan 5, 2017

Hi Hannes,

Thank you, this is an excellent point that I missed, and I agree that this should be fixed. I reran all of the tests with the flags you suggested, and am getting the following results:

SeqAn with better flags: 40.6s user time
SeqAn w/o better flags: 94.3s user time
SeqLib: 13.6s user time
BamTools (core only): 21.3s user time
BamTools: 57.8s user time

So indeed there is a difference. The SeqAn API is amazing, and I think its functionality is largely complementary to SeqLib's. But for this benchmark, I don't want to compare with the wrong approach.

Does this perhaps align better with what you are seeing?

Best,
Jeremiah

h-2 (Author) commented Jan 5, 2017

Hi Jeremiah,

thanks for the quick reply and for rerunning the benchmark! Is this for the jumping test, the reading test, or combined? And what does "core only" mean? Is that without decompressing the strings, and which "strings" does that refer to: the query name, the query sequence, and/or the CIGAR?

Could you provide an overview of:

  1. reading the full record
  2. reading without the strings, if that is important (SeqAn cannot do this right now)
  3. jumping with index

That would help me a lot. Thank you very much!

walaj (Owner) commented Jan 5, 2017

Hi Hannes,

This was for the reading test. The "core" concept is from BamTools, and is the option to read in an alignment without decompressing all of the string data, for instance if one wants only the position or insert size. For SeqLib this is basically the default: most of the functionality interacts with the htslib object directly and may not need to be converted to another form (e.g. CIGAR operations). But if one wanted to use the sequence data in SeqLib, it would likely have to be decompressed from the packed 4-bit encoding to a C++ string or char array, so I added this case to the benchmarking table as well, to be more in line with the BamTools options.
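
As a rough illustration (a sketch against the raw htslib record type, not SeqLib's actual wrapper code), the difference between the two access styles looks like this:

```cpp
// Sketch: lazy "core" access vs. eager string decoding on an htslib
// record (bam_get_seq/bam_seqi/seq_nt16_str are htslib's API).
#include <htslib/sam.h>
#include <string>

// Core-style access: read a fixed field straight from the packed record.
int64_t insert_size(const bam1_t* b) { return b->core.isize; }

// Eager access: decode the packed 4-bit bases into a C++ string.
std::string decode_sequence(const bam1_t* b) {
    const uint8_t* packed = bam_get_seq(b);
    std::string s(b->core.l_qseq, 'N');
    for (int i = 0; i < b->core.l_qseq; ++i)
        s[i] = seq_nt16_str[bam_seqi(packed, i)];
    return s;
}
```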

I do think that having the initial load be as lightweight as possible, and then allowing the user to expand the pieces as necessary, is an important component of SeqLib. For instance, tools for calculating coverage don't need the read name or sequence, just the position (see the sketch below). Or for some other tools I have, it was much more efficient to look at an alignment first, and then only expand the sequence at the last moment if it ended up being necessary.
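
To make that concrete, a toy coverage counter only ever touches the packed core fields and the raw CIGAR (bam_endpos is htslib's helper; the function name here is made up):

```cpp
// Toy per-base coverage: needs position and alignment span only,
// never the read name or decoded sequence.
#include <htslib/sam.h>
#include <vector>

void add_coverage(const bam1_t* b, std::vector<int>& cov) {
    int64_t beg = b->core.pos;
    int64_t end = bam_endpos(b);  // end position from the raw CIGAR
    for (int64_t p = beg; p < end && p < (int64_t)cov.size(); ++p)
        ++cov[p];
}
```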

The jumping test picks 1000 random locations and grabs a single read from each. I think the published value had SeqAn as the fastest, although the libraries didn't differ by much.
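
For reference, a bare-htslib version of that jumping test might look roughly like this (the file name is a placeholder; error handling omitted for brevity):

```cpp
// Sketch of the jumping test: load the BAM index, seek to 1000 random
// positions, and read a single record from each.
#include <htslib/sam.h>
#include <cstdlib>

int main() {
    samFile* in = sam_open("test.bam", "r");          // placeholder path
    bam_hdr_t* hdr = sam_hdr_read(in);
    hts_idx_t* idx = sam_index_load(in, "test.bam");  // expects a .bai
    bam1_t* rec = bam_init1();
    for (int i = 0; i < 1000; ++i) {
        int tid = rand() % hdr->n_targets;
        int pos = rand() % (int)hdr->target_len[tid];
        hts_itr_t* itr = sam_itr_queryi(idx, tid, pos, pos + 1);
        if (itr && sam_itr_next(in, itr, rec) >= 0) {
            // one read retrieved at this location
        }
        hts_itr_destroy(itr);
    }
    bam_destroy1(rec);
    hts_idx_destroy(idx);
    bam_hdr_destroy(hdr);
    sam_close(in);
    return 0;
}
```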

Hopefully this helps! I wonder if the string data loading is the source of the benchmarking difference.

h-2 (Author) commented Jan 5, 2017

Hi Jeremiah,

What I meant was: could you provide the benchmark times for the individual steps I mentioned above?
So for reading the full record, SeqLib required 14s and SeqAn required 40s? That's interesting.
I assume SeqLib+htslib is different from just using htslib? Did you also try benchmarking against "pure htslib"? [1]

Thanks for your time and your comments!

[1] Similar to https://github.com/wilzbach/bam-perf-test/

walaj (Owner) commented Jan 5, 2017

Latest benchmarks:

  1. SeqLib + sequence expansion: 18.0s
  2. SeqLib default (as above): 13.6s
  3. Jumping test: SeqLib 9.3s, SeqAn 13.3s
  4. htslib alone (test copied from bam-perf-test): 9.46s

This is basically in line with the published table (this machine is a little slower), except that SeqAn should be closer to ~35s, not ~80s. The wall time on the SeqAn jumping test is smaller, but it looks like it's using multiple CPUs; still not seeing much of a difference either way for the jumping test. Adding in the htslib test, there looks to be a small amount of overhead in SeqLib (shared pointers, etc.) vs. htslib, which is expected.

I'm not sure where the 17s in the bam-perf-test README came from for SeqAn; the log shows a much slower time (7m+). I guess that log file was committed before the -O3 addition. A 25x speed-up doesn't square with what I'm seeing (2x). Still an improvement over the version with non-optimal flags, but I don't see any way SeqAn could be faster than htslib.

h-2 (Author) commented Jan 6, 2017

> but it looks like it's using multiple CPUs.

Yes, SeqAn always uses multiple threads for BAM I/O. Since (de)compression is a large part of the I/O work, this is an important feature (especially since every modern computer has multiple cores). Depending on how many cores you have, you will see more or less speed-up. The benchmark I mentioned above was apparently run on eight cores, so it might have benefited more strongly than yours.
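
(For comparison: plain htslib can also use a worker pool for the BGZF (de)compression, though the caller has to opt in; a minimal sketch with a placeholder file name:)

```cpp
// htslib's opt-in thread pool for BGZF (de)compression;
// SeqAn enables its threading implicitly, as described above.
#include <htslib/sam.h>

int main() {
    samFile* in = sam_open("test.bam", "r");  // placeholder path
    hts_set_threads(in, 4);                   // 4 decompression workers
    // ... read records as usual ...
    sam_close(in);
    return 0;
}
```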

walaj (Owner) commented Jan 6, 2017

OK great, I think we're converging on what is going on! Thank you for the quick and courteous note about the benchmarking, and I'll reach out today to get that table fixed, with the right compiler flags, for the corrected proof.

I also agree that the embedded multithreading is an important feature for many developers who want fast code without dealing with pthreads, but I still think CPU time is the right benchmark. Modern computers have multiple cores, but they are often shared systems, so the extra CPU isn't free (and a lot of work now is done on VM instances that may allocate only one core). A developer may also want their own parallelization framework covering more than BAM I/O, and could be puzzled by a subcomponent using more threads on its own.
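
To make the measurement distinction explicit, a small sketch (POSIX getrusage plus a steady clock; the workload is a placeholder) showing user CPU time versus wall time:

```cpp
// User CPU time vs. wall time: a multithreaded reader can show
// wall << user, but a shared system still pays the full user time.
#include <sys/resource.h>
#include <chrono>
#include <cstdio>

int main() {
    auto wall0 = std::chrono::steady_clock::now();
    // ... benchmark workload goes here ...
    auto wall1 = std::chrono::steady_clock::now();

    rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double wall = std::chrono::duration<double>(wall1 - wall0).count();
    std::printf("user %.2fs, wall %.2fs\n", user, wall);
    return 0;
}
```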

So both approaches are useful for different needs, since I suppose it's not possible to make something that works out of the box for everything.

h-2 (Author) commented Jan 6, 2017

> Thank you for the quick and courteous note about the benchmarking, and I'll reach out today to get that table fixed, with the right compiler flags, for the corrected proof.

OK, cool! If you get the table updated, could you also note that SeqAn decodes the strings, too?

| Method | Memory (GB) | CPU (s) |
| --- | --- | --- |
| SeqLib | 3.15 | 11.77 |
| SeqLib (with string data) | 3.86 | 14.33 |
| BamTools | 6.93 | 17.52 |
| BamTools (with string data) | 13.32 | 48.74 |
| SeqAn (with string data) | 8.35 | ... |

or maybe it would be even clearer to mark those that don't read the full record:

| Method | Memory (GB) | CPU (s) |
| --- | --- | --- |
| SeqLib (without string data) | 3.15 | 11.77 |
| SeqLib | 3.86 | 14.33 |
| BamTools (without string data) | 6.93 | 17.52 |
| BamTools | 13.32 | 48.74 |
| SeqAn | 8.35 | ... |

Otherwise people will intuitively compare SeqAn to the "wrong" table entries.
It would probably also help if you mentioned the hardware it was benchmarked on, and maybe briefly noted that SeqAn's performance scales with the number of CPU cores. Otherwise it might be difficult for others to reproduce your results.

> A developer may also want their own parallelization framework covering more than BAM I/O, and could be puzzled by a subcomponent using more threads on its own.

Actually, we do some sophisticated probing, work balancing, and CPU yielding to prevent ever using more threads than there are free cores. It shouldn't affect anything else negatively 😃
But, yes, we also plan to make this configurable by the developer/user in the future!

Thanks for being so quick to respond and for discussing these things openly!

walaj (Owner) commented Jan 6, 2017

Hi Hannes,

OK, sounds good; I agree about the table annotation. I contacted them and will see what needs to be done to make the change. I'll also make a note in the README about the differences between these approaches (i.e., threading) and about the updated benchmarking, so folks can decide which API is best for their needs. My sense is that a combination of the two would actually be really powerful for most projects: there is a great deal of functionality in SeqAn (it's a much bigger project) that is not in SeqLib, but the BWA / BLAT / Fermi / htslib wrappers in SeqLib are incredibly useful as well.

Best,
Jeremiah
