
Rvalue reduce #169

Closed
wants to merge 3 commits

Conversation

BenFrantzDale

Use rvalue references so that reduction can more efficiently handle operations like vector or list concatenation as the reduction operation.

Are you interested in this?

@akukanov

Hi Ben, thanks for the patch. Yes, we are interested, but - since this piece of code still has to work with C++98/03 - it needs to be improved. If you are comfortable with continuing to work on it, we will provide some guidance; if not, thanks anyway, and we will let you know when the patch is integrated.
Meanwhile, do you have any test for this code?

@BenFrantzDale
Author

@akukanov Thanks. I can take a look at it. What's the right way to selectively move, and to overload member functions with the && ref-qualifier?

@rfschtkt

rfschtkt commented Aug 3, 2019

How relevant is it to make all those changes, including a voodoo magic rvalue ref-qualifier (the version with the lvalue ref-qualifier is not needed), just to also eliminate the copy for result(), which happens only once per invocation? Do you have measurements (total copy/move counts, performance)?
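For reference, the ref-qualifier overloading under discussion can be sketched stand-alone (hypothetical names, not TBB's actual body class): result() copies when called on an lvalue but moves when called on an rvalue.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical reduction-body holder demonstrating member-function
// ref-qualifiers: the && overload is selected only for expiring objects.
struct body {
    std::string my_value;

    // Called on lvalues: the body may still be used afterwards, so copy.
    std::string result() const & { return my_value; }

    // Called on rvalues: the body is expiring, so steal the value.
    std::string result() && { return std::move(my_value); }
};

inline std::string take_result(body b) {
    return std::move(b).result();  // selects the && overload: no extra copy
}
```

A caller writing plain b.result() still gets the safe copying overload; only an explicit std::move (or a temporary) opts in to the move.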

I made an implementation including a test that counts the number of copy/move x construct/assign operations, and curiously the minimum number of copies is 2, presumably for passing init and returning the result, but it can also be higher, and it seems premature to worry about returning the result really really efficiently as long as those extra copies aren't eliminated as well. BTW, apparently Standard C++ has decided to pass init by value, so that's another problem: do the same, or take it by const reference instead? Here's some test output for what I've got so far:

./test_parallel_reduce.exe  
CheckParallelReduceDefault() 1: cc=2 ca=0 mc=128 ma=128
[...]
CheckParallelReduceDefault() 1: cc=18 ca=0 mc=343 ma=359
CheckParallelReduceDefault() 2: cc=22 ca=0 mc=413 ma=433
CheckParallelReduce() 1: cc=52 ca=0 mc=1000000 ma=1000050
CheckParallelReduce() 2: cc=46 ca=0 mc=1000000 ma=1000044
[...]
CheckParallelReduce() 1: cc=5 ca=0 mc=4 ma=7
CheckParallelReduce() 2: cc=5 ca=0 mc=4 ma=7
done

Those high cc (copy constructor) counts apparently are all about initialising an internal accumulator value to the identity element, which might well be negligible in aggregate compared to a final copy of the result (copying an empty vector is very cheap indeed), but my hunch is still that a final copy is negligible compared to the total work of joining accumulator values in pairs, and perhaps it would be more useful to review that first, and perhaps assemble lists of accumulator values before they are joined. The contrapositive of that would tell you not to worry about one final copy.

@rfschtkt

rfschtkt commented Aug 6, 2019

Can I publish my own solution in a different branch, or how does that work with github? It's mostly interesting for the counters that I've added to the test.

@rfschtkt

rfschtkt commented Aug 11, 2019

(Comment removed: if nobody corrects me I'll do it myself...)

@alexey-katranov
Contributor

alexey-katranov commented Aug 12, 2019

@rfschtkt , thank you very much for the contribution. We are aware of your patch and find it interesting. We just need some time to look it through... Sorry for the slow response.

@rfschtkt

I should probably first add some counters to the operations and relate them to the current counters in hard ASSERTions to make sure that all move opportunities are exploited etc...

BTW, I'm not totally against the use of ref-qualifiers, but it seems a bit contrived in this context where result() is always moved from and is not exposed as public API anyway, and I really would like to know how relevant it is in actual usage to move the final result (I'm just being an armchair critic here).

I've also been thinking that it seems very strange indeed that there is an identity element at all. It is perfectly possible to parallelise a reduction on a semigroup that is not a monoid, and I would have reserved the name "reduce" for such usage, as in, e.g., Scala, which distinguishes "reduce" from "fold" and its variants. Of course it is possible to do this on top of the current inter-domain operation, but perhaps there should be an additional API to do this directly, maybe simply using the same name but with the identity parameter omitted?

@BenFrantzDale
Author

@rfschtkt, you are right, I probably didn't need to use ref qualifiers. I like your diff. The one change I'd suggest is instead of void join( lambda_reduce_body& rhs ), have it take lambda_reduce_body&&, since as written the caller could inadvertently have rhs.my_value stolen.
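The point about inadvertent stealing can be illustrated with a self-contained sketch (hypothetical names, not the actual lambda_reduce_body): with a && parameter, the caller must write std::move explicitly before rhs's value can be taken.

```cpp
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical reduce body: join() takes an rvalue reference, so passing an
// lvalue does not compile -- rhs.my_value cannot be stolen by accident.
struct reduce_body {
    std::vector<int> my_value;

    void join(reduce_body&& rhs) {
        // rhs is an expiring object by contract; move its elements over.
        my_value.insert(my_value.end(),
                        std::make_move_iterator(rhs.my_value.begin()),
                        std::make_move_iterator(rhs.my_value.end()));
        rhs.my_value.clear();
    }
};
```

With this signature, left.join(right) is a compile error; the caller has to opt in with left.join(std::move(right)), which documents the transfer of ownership at the call site.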

@rfschtkt

rfschtkt commented Aug 16, 2019

I've used RVALUE_REF_ELSE_LVALUE_REF in the test file, but it could/should obviously also be introduced in some form to the library itself, although preferably consistently and not just in this context, which seems like a project all by itself.

(Added) But it would not make any difference to the user in this case, only perhaps make the library code somewhat easier to read by itself, although rvalue references were not introduced to the language to facilitate code comprehension but to differentiate two kinds of arguments for hard performance-related reasons, and here the argument is always expendable anyway, nothing "inadvertent" about it.

@rfschtkt

I think I've reached the final limits of avoiding unnecessary copies now in #171.

If it were ever to make a difference whether joins are performed from small to large or all together at the end (?), the user should probably implement this in a Body, and even a library-provided adapter might be more confusing than helpful in this fringe case, so I'm not going to explore that.

However, I might have a look into supporting non-monoid semigroups next, primarily for the not-so-fringe benefits of the user's not having to provide an identity element and maybe also of the library's not having to copy it to construct each new internal value.

@alexey-katranov
Contributor

As for the identity element, we have experience with the implementation of std::parallel_reduce in ParallelSTL (C++17 Parallelism). Certainly it is possible, but it has some issues with vectorization via pragma simd. It is much easier for the compiler when the "zero" is known.
Another drawback is compatibility with ranges. tbb::parallel_reduce is not only a reduction; it might also be used to apply a transformation before the reduction. So, it is unclear how to apply the transformation for the initial element. Can we try to apply the user-provided lambda to the first element of the range? Is it efficient? What about vectorization?
Therefore, the identity approach has some value in some cases over the non-identity approach, and vice versa. If you feel that the non-identity approach really makes sense, we can consider it as an additional interface for tbb::parallel_reduce.

@rfschtkt

Well, you wouldn't have to use the version without identity element... but I also don't see the problem with SIMD if you implement the so-called RealBody with an identity value right there instead of one passed as an argument to tbb::parallel_reduce().

The main idea is something like the following (probably using the TBB allocator instead, if not std::optional where available):

void operator()(Range& range) {
    if (!my_value) {
        // First invocation: construct the accumulator from this subrange's result.
        my_value = new Value(my_real_body(range));
    } else {
        // Subsequent invocations: fold the subrange's result into the accumulator.
        Value value = my_real_body(range); // named lvalue necessary before C++11
        *my_value = my_reduction(tbb::internal::move(*my_value), tbb::internal::move(value));
    }
}
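With C++17, the same lazy-initialisation idea can be written without a raw pointer, using std::optional as mentioned above (a sketch with made-up names, not the proposed patch itself):

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical semigroup accumulator: no identity element is required from
// the user; the empty state of std::optional plays that role instead.
template <typename Value, typename Reduction>
struct semigroup_body {
    std::optional<Value> my_value;  // nullopt acts as the synthetic identity
    Reduction my_reduction;

    void accumulate(Value v) {
        if (!my_value)
            my_value.emplace(std::move(v));  // first sub-result: just store it
        else
            *my_value = my_reduction(std::move(*my_value), std::move(v));
    }
};
```

This makes the "augmenting the destination domain with an identity" trick explicit in the type system, and the destructor handles cleanup that a raw Value* would leak.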

The biggest problem I'm seeing is all the annoying documentation that prevents generating the many versions automatically using macro expansion and/or templates, and adding another dimension would double the number of "lambda" versions...

@alexey-katranov
Contributor

The proposed solution seems inefficient for heavy Values (e.g. arrays):

Value value = my_real_body(range); 

The above expression prevents my_real_body from reducing in place into my_value. Therefore, either my_value should be passed to my_real_body, or my_real_body should reduce the value internally. In either case, it leads to some conditional logic in my_real_body (i.e. two overloads or internal state). It complicates the interface slightly.

@rfschtkt

rfschtkt commented Aug 20, 2019

Does TBB have to support any compilers that don't provide (N)RVO?

Evidently this approach would split up each Body "range extension" into a pair of so-called RealBody and Reduction calls (I would prefer to call these Reduction and Operation, respectively...), causing more calls to the latter, but how costly is such a call anyway if it can treat its arguments like C++11 rvalue references, and what percentage of Body invocations are "range extensions" (Amdahl)? And again, you wouldn't have to use the version without identity element, but why would all those functional languages provide reduce() as well as fold() if both didn't have their use?

I might have some working code later this week. So far I've made some progress reducing the number of overloads using templates and SFINAE.

(Added) First draft committed: about 100 extra lines, horrible!

(Added) Well, first outline anyway: I still have to at least try to derive the type of Value (or else this should become the first template argument), my_value is leaked, etc.

(Added) Now with first smoke test. I cheated by using explicit "reduce1" names, but maybe that can still be avoided by type tests on some of the arguments. I'm now thinking of either moving Value to the front in the template arguments list and requiring it to be specified, or requiring at least C++11 and deriving it from one of the operations (although I was hoping to also support C++98/03 functors, even though this is not currently tested with the other "lambda" versions either).

(Added) BTW, it would also be possible to omit the so-called RealBody, and just iterate on the so-called Reduction, both with and without Identity element. Now think of the bloat that would cause! But it would be very user-friendly indeed if you couldn't vectorise the RealBody anyway.

@alexey-katranov
Contributor

I am not sure that (N)RVO can resolve the array reduction concern. If Value is a vector, then we need to fill it in my_real_body, then to go through the vector again in my_reduction. However, if we pass Value to my_real_body, you do not need to process it twice. So I like the rvalue approach because it resolves this concern within the current interface (assuming that the array can be moved). (To tell the truth, I do not understand why we pass const Value& to the lambda and cannot reduce into the provided Value but have to return a new Value.)
Also, I like the idea of having both identity and non-identity versions to give users a choice.
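The in-place concern can be sketched stand-alone (hypothetical signature; the lambda-based interface being discussed actually passes const Value&): a real-body that receives the running Value by rvalue reference can append directly into it, avoiding a second pass over the data in the reduction step.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical "rvalue" real-body: it takes the running accumulator by
// rvalue reference, appends this subrange's contribution in place, and
// returns it, so the reduction never has to traverse the vector again.
inline std::vector<int> append_range(std::vector<int>&& acc, int first, int last) {
    for (int i = first; i < last; ++i)
        acc.push_back(i);        // reduce directly into the accumulator
    return std::move(acc);       // hand the same buffer back, no copy
}
```

Chaining calls like acc = append_range(std::move(acc), ...) keeps one buffer alive across subranges, which is exactly the in-place behaviour the const Value& interface rules out.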

@rfschtkt

rfschtkt commented Aug 21, 2019

Not sure I'm following you here. What I wrote about (N)RVO was to ask if your concern was that assigning to the intermediate variable would make a copy, which would indeed be the case for compilers that don't know how to do copy elision. Otherwise it doesn't seem to matter, because passing along the value is done in constant time anyway; with C++11 you could return my_real_body's result directly into my_reduction's parameter, but the difference is only cosmetic, because there's still an intermediate object, in the former case an xvalue and in the latter case a prvalue.

There is of course the difference that non-initial my_real_body invocations, what I called "range extensions" above, are creating new intermediate Value instances instead of modifying one they've already got, but most applications can probably amortise O(log(N)) such occurrences (or how many is it?), and the ones that can't still have the other options available.

@rfschtkt

"(although I was hoping to also support C++98/03 functors, even though this is not currently tested with the other "lambda" versions either)"

Wait... what? Of course it is currently tested!

Maybe my brain just had a glitch because of wondering earlier but not coming to a conclusion about why there is any lambda naming and testing going on at all, because a lambda expression is just a way to create a "closure object", which "behaves like a function object" (according to the Standard). So aren't these __TBB_CPP11_LAMBDAS_PRESENT tests following the functor tests merely testing the compiler rather than TBB, and wouldn't it be proper maintenance to just remove this clutter? Or what am I missing here?

@alexey-katranov
Contributor

Sure, you can test the lambda-oriented interface with functors, but it does not seem convenient, and it makes little sense from the user perspective.

As for array reduction, my only regret is that we cannot reduce into my_value directly. We still need to create a value, then move it to my_reduction and process the value a second time (perhaps, in some cases, the compiler can "jam"/combine the loops if everything is inlined). Although, it might be more efficient to reduce the value inside my_real_body if we provided such a possibility.

Value value = my_real_body(range);
*my_value = my_reduction(tbb::internal::move(*my_value), tbb::internal::move(value));

However, this concern relates only to the lambda-based interface; we can achieve in-place reduction with the classic Body interface.

@rfschtkt

"Sure, you can test lambda oriented interface with functors but it does not seem convenient and it almost does not have sense from the user perspective."
Well, both are tested: function objects (for C++98/03), and lambda expressions (I guess to validate that the first test didn't accidentally rely on a hidden extra feature of those function objects?). But then I still have something else to nitpick about: Sum() wasn't tested again with a lambda argument (easily fixed)!

I'm currently stuck with the no-Identity interface. I don't think I need a separate template name with enough type tests (although I have not fully worked that out yet), but it also doesn't solve the problem that remains: how to reconcile that C++98/03 needs a type for the return value (using an explicit template argument for Value seems to be the only way to do that?), and that C++11 can just get by with something like auto parallel_reduce( const Range& range, const RealBody& real_body, const Reduction& reduction ) -> decltype(real_body(range))? I don't see a solution where you can use the same client code for both language versions that also allows the template argument(s) to be omitted from C++11 onward (and it would make no sense at all not to have the latter).
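The C++11 deduction in question can be demonstrated with a serial stand-in (a sketch of the signature idea only, not TBB's parallel implementation): Value is deduced from what real_body returns for a range, so no explicit template argument is needed.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical no-identity reduce: the result type is deduced via
// decltype(real_body(...)), which is exactly what C++98/03 cannot express
// without an explicit Value template argument.
template <typename Range, typename RealBody, typename Reduction>
auto reduce_no_identity(const Range& left, const Range& right,
                        const RealBody& real_body, const Reduction& reduction)
    -> decltype(real_body(left)) {
    // Serial stand-in for "evaluate two subranges, then join the results".
    return reduction(real_body(left), real_body(right));
}
```

From C++14 onward even the trailing decltype could be dropped in favour of plain return-type deduction, but the C++11 spelling above is the minimum the thread's dilemma requires.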

So it seems as if either C++98/03 should not get these new overloads, or both library and user have to differentiate, and most users probably don't have to support C++98/03 anymore anyway so it's not that much of a burden on the user side of the equation. With differentiation, the second question is whether the template parameters may be switched around so that only a Value has to be supplied rather than first Range and then Value.

@alexey-katranov
Contributor

I believe it is acceptable if the new interface works only with C++11 (to avoid C++98/03 support complexities).

@rfschtkt

rfschtkt commented Aug 29, 2019

I haven't done much yet, besides mulling over the design.

Behind the scenes, the idea is fairly obvious, using either tbb::cache_aligned_allocator (I guess?) or perhaps differentiated for C++17 using std::optional for the current value, starting out as nullptr resp. nullopt (augmenting the destination domain with an "identity" value).

Meanwhile, I've only added some counters to the pull request ("Yet more counters", reordered before the no-identity commits) to investigate the average reduction length (the longer I look at this, the more I think that the code should switch naming from RealBody&Reduction to Reduction&Operation, respectively), and it seems acceptable to use a lambda that creates a new value in the destination domain each time... but I'm also thinking of breaking the simple symmetry anyway and using a pointer parameter for the starting value instead, to take all concerns about destination domain value creation amortisation off the table, except that this would be weird if you just wanted to omit an argument for brevity even though you do have an identity value in the destination domain, which would probably be the main use case. Should the new parallel_reduce overloads accept both signatures to get out of this conundrum? That would be nice for the user, but the implementation would get even more complicated...

Or else the implementation should only worry about the brevity-related use case, and then semigroup users should either accept limited amortisation or bring their own destination domain augmentation with the explicit-identity overloads, e.g., boost::optional or std::optional?

(Added) Why did I write "source domain" 3 times when I meant "destination domain" (now corrected)?

@rfschtkt

rfschtkt commented Sep 3, 2019

I pushed some plausible initial support for semigroups, only with symmetric interface (see above).

Still wondering about the cost, both relative to ultra-cheap explicit identity elements and when destination domain values are expensive to generate de novo.

(Added) Using aligned_space now, should be good enough to compete with the former, although the issue with the latter is not affected.

(Added) Now with updated inspectCounters(). Next would be tests for new deterministic overloads.

(Added) Tests completed.

@rfschtkt

rfschtkt commented Sep 6, 2019

So that's about enough exploration, I would say. Did I overlook anything (other than some clean-up)? The main things are moving things around for performance (also before C++11 by using non-const rather than const lvalue references), and allowing usage without an identity element (for convenience and/or out of necessity, only from C++11). The implementation was condensed a bit before it got bloated again with the new overloads, and counters were added to validate what's going on with object life cycles. Plus some sneaky suggestions here and there of course. :-)

@rfschtkt

rfschtkt commented Sep 9, 2019

I squashed all my commits in #171 and declared it ready for review.

(Added) To make a review easier, I split it up again into thematic commits.

(Added) Removed __TBB_delete (should be done elsewhere, more exhaustively), moved some #defines (to a more appropriate location), each rebased in their relative commits.

@rfschtkt

Does anything require further clarification? Any suggestions?

@BenFrantzDale
Author

Go for it.

@rfschtkt

Maybe they're waiting for me to add some more overloads that synthesise the RangeFunction from the BinaryFunction? Hmm, the Standard uses the name "BinaryOperation" for std::accumulate(), not "BinaryFunction"...

@anton-potapov
Contributor

@BenFrantzDale can you re-target the patch to the master branch? (both tbb2020 and tbb2019 are no longer updated)

@BenFrantzDale BenFrantzDale changed the base branch from tbb_2019 to master February 1, 2022 19:16
@BenFrantzDale
Author

@anton-potapov I just force-pushed. I went ahead and made the arg of join be &&.
It still needs some tests... like one that assembles a std::list<std::unique_ptr<int>>, maybe (where std::list is joinable in O(1) and std::unique_ptr<int> isn't copyable).
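The suggested test case can be sketched stand-alone (a hypothetical helper, not the PR's actual test): std::list splices in O(1), and move-only std::unique_ptr elements guarantee at compile time that the join never copies.

```cpp
#include <cassert>
#include <list>
#include <memory>
#include <utility>

// Join two lists of move-only elements. splice() is constant time and
// steals the right-hand list's nodes, so this compiles even though
// std::unique_ptr<int> is not copyable -- any accidental copy would be
// a compile error.
inline std::list<std::unique_ptr<int>>
join_lists(std::list<std::unique_ptr<int>>&& left,
           std::list<std::unique_ptr<int>>&& right) {
    left.splice(left.end(), right);  // O(1), leaves right empty
    return std::move(left);
}
```

The && parameters mirror the join(Body&&) signature discussed above: callers must spell out std::move, making the ownership transfer explicit.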

@kboyarinov
Contributor

Since #1307 was merged to master, this PR can be closed. Thanks to everyone for the productive discussion.
Feel free to reopen the PR or open a separate PR/issue in case of any questions about the current implementation.

@kboyarinov kboyarinov closed this Feb 14, 2024