Better rvalues support for parallel_reduce algorithm #1307

kboyarinov · 2024-02-01T16:48:38Z

Description

Adds an optimization for reducing rvalues with parallel_reduce and parallel_deterministic_reduce.
Based on #171, #169 and #1299.

This patch is intended to optimize how the lambda version of the parallel reduce algorithm handles objects of the Value type. Example from #1299:

// We want to efficiently reduce a number of std::set objects into single std::set containing unique values from each set
std::vector<std::set<int>> sets = ...;

parallel_reduce(blocked_range<size_t>(0, sets.size()),
                         std::set<int>{}, // Identity - empty set
                         [&](const blocked_range<size_t>& range, const std::set<int>& value) { // Second argument is required to be const Value& by the spec
                             std::set<int> result = value; // We are forced to do copy here since value is constant
                             
                             for (size_t i = range.begin(); i < range.end(); ++i) {
                                 result.merge(sets[i]); // set::merge is efficient since it transfers required node from one set to another
                             }
                             return result;
                         },
                         [](const std::set<int>& value1, const std::set<int>& value2) { // Both arguments are required to be const Value&
                             std::set<int> result = value1; // We are forced to do copy
                             result.insert(value2.begin(), value2.end()); // unable to use set::merge here since value2 is constant
                             return result;
                         }
                         );

The spec forces the user to copy instead of "moving", even if preserving the input range is not important to the user. The "Body" version of parallel reduce avoids the unnecessary copies:

struct Body {
    std::vector<std::set<int>>& sets;
    std::set<int> result;
    
    void operator()(const blocked_range<size_t>& range) {
        for (size_t i = range.begin(); i != range.end(); ++i) {
            result.merge(sets[i]); // only moving nodes
        }
    }
    
    void join(Body& rhs) {
        result.merge(rhs.result); // also only moving nodes
    }
};

The main goal of this patch is to enable the rvalue optimization for the lambda version of reduce.
This patch is NOT intended to allow reduction of move-only objects (such as std::unique_ptr), since we still need to copy the identity value into each parallel_reduce leaf.

A specification change is also needed for this PR:

Relax ParallelReduceFunc named requirement to allow Value operator()(const Range& range, Value&& value) const
Relax ParallelReduceReduction named requirement to allow Value operator()(Value&& x, Value&& y) const.
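
A minimal sketch of the callback shapes that the relaxed requirements permit, assuming C++17 for std::set::merge. It does not call oneTBB: reduce_sketch is a hypothetical serial stand-in that splits the range into two leaves, and it passes plain begin/end indices where the real algorithm passes a blocked_range. The names IntSet, reduce_sketch and merge_all are illustrative only.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using IntSet = std::set<int>;

// Hypothetical serial stand-in for the lambda form of parallel_reduce.
// It splits [first, last) into two leaves, copies the identity into each
// leaf, and hands every intermediate value to the callbacks as an rvalue.
template <class Func, class Reduction>
IntSet reduce_sketch(std::size_t first, std::size_t last, const IntSet& identity,
                     const Func& func, const Reduction& reduction) {
    const std::size_t mid = first + (last - first) / 2;
    IntSet left = func(first, mid, IntSet(identity));   // identity is copied per leaf
    IntSet right = func(mid, last, IntSet(identity));
    return reduction(std::move(left), std::move(right));
}

IntSet merge_all(std::vector<IntSet>& sets) {
    return reduce_sketch(
        0, sets.size(), IntSet{},
        // Shape allowed by the relaxed ParallelReduceFunc: accumulator is Value&&.
        [&](std::size_t begin, std::size_t end, IntSet&& value) {
            for (std::size_t i = begin; i < end; ++i) {
                value.merge(sets[i]); // transfers nodes, no element copies
            }
            return std::move(value);
        },
        // Shape allowed by the relaxed ParallelReduceReduction: both are Value&&.
        [](IntSet&& x, IntSet&& y) {
            x.merge(std::move(y));
            return std::move(x);
        });
}
```

The identity is still copied once per leaf, which is why move-only Value types remain out of scope.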

This PR also contains a simple test that I guess should be extended to better cover all overloads and algorithms.


Notify the following users

@BenFrantzDale
@rfschtkt
Feel free to participate in review and discussions for this patch:)

Other information

Signed-off-by: Konstantin Boyarinov <[email protected]>

BenFrantzDale

Great to see someone looking at this again! A few suggestions.

BenFrantzDale · 2024-02-01T16:58:01Z

include/oneapi/tbb/parallel_reduce.h

                                   body(range);
-                                   body.join(rhs);
+                                   body.join(std::move(rhs));


This is by forwarding reference and so should be a forward not a move.

Since this part of the reduce API was not changed at all, it was a mistake to change this concept. This line was removed.
parallel_reduce_reduction and parallel_reduce_combine concepts were changed instead.

BenFrantzDale · 2024-02-01T17:01:52Z

include/oneapi/tbb/parallel_reduce.h

+    Value&& result() noexcept {
+        return std::move(my_value);
    }


Would it be desirable to && qualify this member function? As in

[[nodiscard]] Value&& result() && noexcept { return std::move(my_value); }

?
That would force you to say return std::move(body).result(); elsewhere, which may be clearer. (In general body.result() isn't obvious that it is extracting the result from body.)

My first thought was to drop the && qualifier because, in this internal structure, the result is only ever moved out of the body. But now I think having the qualifier makes the move semantics clearer. Thanks for catching that.

Yeah. An alternative is to give it a different name so it’s clearer that it moves out without the std::move, but just qualifying it like this feels reasonable.
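
A small sketch of the rvalue-qualified accessor discussed in this thread. ResultHolder and extract_sketch are hypothetical names, not the actual lambda_reduce_body; the point is only that result() can be called on an rvalue and nothing else, so the move-out has to be spelled at the call site.

```cpp
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for an internal reduction body that owns its result.
template <class Value>
class ResultHolder {
public:
    explicit ResultHolder(Value v) : my_value(std::move(v)) {}

    // Callable only on rvalues: holder.result() on an lvalue does not compile.
    // Note the order: the ref-qualifier comes before noexcept.
    Value&& result() && noexcept { return std::move(my_value); }

private:
    Value my_value;
};

std::vector<int> extract_sketch() {
    ResultHolder<std::vector<int>> body(std::vector<int>{1, 2, 3});
    return std::move(body).result(); // the move-out is visible at the call site
}

static_assert(
    std::is_same<decltype(std::declval<ResultHolder<int>>().result()), int&&>::value,
    "result() yields an rvalue reference");
```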

BenFrantzDale · 2024-02-01T17:02:05Z

include/oneapi/tbb/parallel_reduce.h

@@ -514,7 +515,7 @@ Value parallel_reduce( const Range& range, const Value& identity, const RealBody
    lambda_reduce_body<Range,Value,RealBody,Reduction> body(identity, real_body, reduction);
    start_reduce<Range,lambda_reduce_body<Range,Value,RealBody,Reduction>,const __TBB_DEFAULT_PARTITIONER>
                          ::run(range, body, __TBB_DEFAULT_PARTITIONER() );
-    return body.result();
+    return std::move(body.result());


I think this std::move does nothing since body.result() is an rvalue?

Changed to std::move(body).result(), as discussed in the previous thread.

BenFrantzDale · 2024-02-01T17:02:42Z

test/conformance/conformance_parallel_reduce.cpp

@@ -56,6 +57,37 @@ struct ReduceBody {
    }
 };

+template <class T>


//! explaining what this struct is for?

BenFrantzDale · 2024-02-01T17:06:15Z

test/conformance/conformance_parallel_reduce.cpp

+    auto join_body = [](vector_wrapper<int>&& x, vector_wrapper<int>&& y) {
+        vector_wrapper<int> new_vector = std::move(x);
+        new_vector.reserve(new_vector.size() + y.size());
+        new_vector.insert(new_vector.end(), std::make_move_iterator(y.begin()), std::make_move_iterator(y.end()));


A move iterator over a vector of ints doesn't buy much. Consider for this example making

struct MoveOnlyInt {
    int x;
    MoveOnlyInt() = default;
    explicit MoveOnlyInt(int x) : x{x} {}
    MoveOnlyInt(const MoveOnlyInt&) = delete;
    MoveOnlyInt(MoveOnlyInt&& other) : x{std::exchange(other.x, 0)} {}
    MoveOnlyInt& operator=(const MoveOnlyInt&) = delete;
    MoveOnlyInt& operator=(MoveOnlyInt&& other) { this->x = std::exchange(other.x, 0); return *this; }
};

and use vectors of MoveOnlyInt so the move iterators are required?

Could also consider a test case using std::list or std::set that moves the nodes across, like

auto join_body = [](std::list<int>&& x, std::list<int>&& y) { x.splice(x.end(), std::move(y)); return std::move(x); }

or

auto join_body = [](std::set<int> x, std::set<int>&& y) { x.merge(std::move(y)); return std::move(x); }

(These could use MoveOnlyInt too.)

It would probably be good to do a test with a type that isn’t default constructible and/or isn’t movable or copyable in a node-based container — basically all the weird corner cases of irregular types.

Unfortunately, we still cannot use move-only types with parallel_reduce, since the identity has to be copied into each reduction "leaf". Using std::vector<MoveOnlyInt> or a list would cause compilation errors even if the identity element is empty.
I have changed the value type to an integer wrapper that is copyable, but whose copy constructor should never actually be called.

Regarding new test cases, I do not think it makes sense to test both the "vector of vectors" and the "vector of lists" use cases. Splicing lists during parallel_reduce is more interesting IMHO, so I kept only the vector of lists test.
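
To illustrate why the list-splicing case is a useful check, here is a sketch that joins two lists through rvalue references and counts element copies. CopyCounter, CounterList and join_sketch are hypothetical names and are not taken from the test in this PR.

```cpp
#include <list>
#include <utility>

// Hypothetical element type that counts how often it is copied.
struct CopyCounter {
    static int copies;
    int value;
    explicit CopyCounter(int v) : value(v) {}
    CopyCounter(const CopyCounter& other) : value(other.value) { ++copies; }
    CopyCounter& operator=(const CopyCounter& other) {
        value = other.value;
        ++copies;
        return *this;
    }
};
int CopyCounter::copies = 0;

using CounterList = std::list<CopyCounter>;

// Reduction in the rvalue form: splice relinks the nodes of y into x,
// so no element is copied or even moved.
CounterList join_sketch(CounterList&& x, CounterList&& y) {
    x.splice(x.end(), std::move(y));
    return std::move(x);
}
```

With the const Value& signatures, the same join would have to copy every element of both lists.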

Signed-off-by: Konstantin Boyarinov <[email protected]>

BenFrantzDale

Looks good! I'll leave it to tbb maintainers to approve.

BenFrantzDale · 2024-02-06T21:35:32Z

test/conformance/conformance_parallel_reduce.cpp

+    NeverCopyWrapper(const NeverCopyWrapper&) {
+        REQUIRE_MESSAGE(false, "Copy constructor of NeverCopyWrapper should never be called");
+    }
+
+    NeverCopyWrapper& operator=(const NeverCopyWrapper&) {
+        REQUIRE_MESSAGE(false, "Copy assignment of NeverCopyWrapper should never be called");
+        return *this;
+    }


Wouldn't it be better to =delete; these?

We cannot delete them: if std::vector<NeverCopyWrapper> is passed as the identity to parallel_reduce, it has to be copied into each reduce leaf. Even when the copied identity is empty, copying the vector requires value_type to be copy constructible, and compilation fails otherwise.
Because of this, I decided to make it copyable so that the code compiles, but to fail the test if the copy constructor or copy assignment is actually called.

I have investigated this once more. Since we do not use std::vector directly, only through the wrapper, we can guarantee that the container never calls the actual copy constructor. So I removed NeverCopyWrapper and replaced it with MoveOnlyWrapper.
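
One way such a wrapper could work is sketched below. This is an assumption about the general idea, not the test code from this PR: MoveOnly and EmptyCopyVector are hypothetical names. The wrapper's copy constructor only accepts an empty source and default-constructs its own storage, so the copy constructor of std::vector<MoveOnly> is never instantiated.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical move-only element type.
struct MoveOnly {
    int value;
    explicit MoveOnly(int v) : value(v) {}
    MoveOnly(const MoveOnly&) = delete;
    MoveOnly& operator=(const MoveOnly&) = delete;
    MoveOnly(MoveOnly&&) = default;
    MoveOnly& operator=(MoveOnly&&) = default;
};

// Hypothetical wrapper: copyable so it can serve as an identity, but a copy
// is only ever an empty container, so no element copy is needed.
struct EmptyCopyVector {
    std::vector<MoveOnly> data;

    EmptyCopyVector() = default;
    // data is default-constructed here; the vector copy constructor is not used.
    EmptyCopyVector(const EmptyCopyVector& other) {
        assert(other.data.empty() && "only an empty identity may be copied");
    }
    EmptyCopyVector(EmptyCopyVector&&) = default;
    EmptyCopyVector& operator=(EmptyCopyVector&&) = default;
};
```

Copying the empty identity compiles and runs, while every non-empty intermediate value has to travel by move.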

BenFrantzDale · 2024-02-06T21:36:37Z

test/conformance/conformance_parallel_reduce.cpp

+    auto operator()(Args&&... args) const -> decltype(oneapi::tbb::parallel_deterministic_reduce(std::forward<Args>(args)...)) {
+        return oneapi::tbb::parallel_deterministic_reduce(std::forward<Args>(args)...);
+    }


Is the trailing return type helpful versus just leaving it to auto (or decltype(auto))?

We are forced to use trailing return type, because both auto with automatic deduction and decltype(auto) are C++14 features and we still need to support C++11.

Gotcha. That's annoying.
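
For reference, a sketch of the C++11-compatible forwarding pattern described in this thread. add_sketch and AddInvoker are hypothetical stand-ins; the real test wraps oneapi::tbb::parallel_deterministic_reduce in the same way.

```cpp
#include <utility>

// Hypothetical function standing in for the algorithm being wrapped.
inline long add_sketch(int a, int b) { return static_cast<long>(a) + b; }

struct AddInvoker {
    // C++11-compatible: the return type is spelled out with a trailing
    // decltype, because return type deduction with plain `auto` and
    // decltype(auto) both require C++14.
    template <class... Args>
    auto operator()(Args&&... args) const
        -> decltype(add_sketch(std::forward<Args>(args)...)) {
        return add_sketch(std::forward<Args>(args)...);
    }
};
```

A side effect of the trailing decltype is that the call operator drops out of overload resolution when the wrapped call is ill-formed, rather than failing inside the body.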

pavelkumbrasev · 2024-02-14T11:43:22Z

LGTM!

Better rvalues support for parallel_reduce algorithm

bb26935

Signed-off-by: Konstantin Boyarinov <[email protected]>

kboyarinov requested review from vossmjp and pavelkumbrasev February 1, 2024 16:48

github-actions bot added the enhancement label Feb 1, 2024

BenFrantzDale reviewed Feb 1, 2024

kboyarinov added 5 commits February 2, 2024 03:20

Fix review comments for the implementation part

1a003b7

Remove unnecessary reference

9584083

Signed-off-by: Konstantin Boyarinov <[email protected]>

Improve testing, fix copyright years

2425998

Signed-off-by: Konstantin Boyarinov <[email protected]>

Comment fix

3ea69c4

Fix warning on Windows

556b089

BenFrantzDale reviewed Feb 6, 2024

kboyarinov added 4 commits February 8, 2024 05:21

Remove NeverCopyWrapper

26f3c5b

Remove extra newline

f23e0c2

Add test case category tags

93b68d2

Remove integral delimiter

9b7b4d1

pavelkumbrasev approved these changes Feb 14, 2024

kboyarinov merged commit ae0696c into master Feb 14, 2024
19 checks passed

kboyarinov deleted the dev/kboyarinov/rvalue-reduce branch February 14, 2024 13:45

This was referenced Feb 14, 2024

Tentative rvalue reference support for parallel_reduce #171

Closed

Rvalue reduce #169

Closed

kboyarinov mentioned this pull request Feb 28, 2024

TBB may not handle RAII struct very well in parallel_reduce #1251

Closed

kboyarinov mentioned this pull request May 20, 2024

[Doc] Add documentation for rvalue reduction #1385

Merged

14 tasks
