Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better rvalues support for parallel_reduce algorithm #1307

Merged
merged 10 commits into from
Feb 14, 2024

Conversation

kboyarinov
Copy link
Contributor

@kboyarinov kboyarinov commented Feb 1, 2024

Description

Adding the optimization for reduction of rvalues using parallel_reduce and parallel_deterministic_reduce.
Based on #171 , #169 and #1299.

This patch is intended to optimize operating with Value class in parallel reduce algorithm (lambda version). Example from #1299

// We want to efficiently reduce a number of std::set objects into single std::set containing unique values from each set
std::vector<std::set<int>> sets = ...;

parallel_reduce(blocked_range<size_t>(0, sets.size()),
                         std::set<int>{}, // Identity - empty set
                         [&](const blocked_range<size_t>& range, const std::set<int>& value) { // Second argument is required to be const Value& by the spec
                             std::set<int> result = value; // We are forced to do copy here since value is constant
                             
                             for (size_t i = range.begin(); i < range.end(); ++i) {
                                 result.merge(sets[i]); // set::merge is efficient since it transfers required node from one set to another
                             }
                             return result;
                         },
                         [](const std::set<int>& value1, const std::set<int>& value2) { // Both arguments are required to be const Value&
                             std::set<int> result = value1; // We are forced to do copy
                             result.insert(value2.begin(), value2.end()); // unable to use set::merge here since value2 is constant  
                         }                  
                         );

The spec forces the user to use inefficient copying instead of "moving" even if the input range preservation is not important for the user. "Body" version of parallel reduce allows to drop unnecessary copying:

struct Body {
    std::vector<std::set<int>>& sets;
    std::set<int> result;
    
    void operator()(const blocked_range<size_t>& range) {
        for (size_t i = range.begin(); i != range.end(); ++i) {
            result.merge(sets[i]); // only moving nodes
        }
    }
    
    void join(Body& rhs) {
        result.merge(rhs.result); // also only moving nodes
    }
};

The main goal of this patch is to allow rvalues optimization for lambda version of reduce.
This patch is NOT intended to allow reduction of move-only objects (such as std::unique_ptrs) since we still need to copy the identity value into each parallel_reduce leaf.

Specification change is also needed for this PR:

  • Relax ParallelReduceFunc named requirement to allow Value operator()(const Range& range, Value&& value) const
  • Relax ParallelReduceReduction named requirement to allow Value operator()(Value&& x, Value&& y) const.

This PR also contains simple test that I guess should be extended to better cover all overloads and algorithms.

Fixes # - issue number(s) if exists

  • - git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details)

Type of change

Choose one or multiple, leave empty if none of the other choices apply

Add a respective label(s) to PR if you have permissions

  • bug fix - change that fixes an issue
  • new feature - change that adds functionality
  • tests - change in tests
  • infrastructure - change in infrastructure and CI
  • documentation - documentation update

Tests

  • added - required for new features and some bug fixes
  • not needed

Documentation

  • updated in # - add PR number
  • needs to be updated
  • not needed

Breaks backward compatibility

  • Yes
  • No
  • Unknown

Notify the following users

@BenFrantzDale
@rfschtkt
Feel free to participate in review and discussions for this patch:)

Other information

Copy link

@BenFrantzDale BenFrantzDale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see someone looking at this again! A few suggestions.

body(range);
body.join(rhs);
body.join(std::move(rhs));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is by forwarding reference and so should be a forward not a move.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this part of the reduce API was not changed at all, it was a mistake to change this concept. This line was removed.
parallel_reduce_reduction and parallel_reduce_combine concepts were changed instead.

Comment on lines 400 to 402
Value&& result() noexcept {
return std::move(my_value);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be desirable to && qualify this member function? As in

[[nodiscard]] Value&& result() noexcept && {
    return std::move(my_value);
}

?
That would force you to say return std::move(body).result(); elsewhere, which may be clearer. (In general body.result() isn't obvious that it is extracting the result from body.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought on this question was removing the && qualifier because as an internal structure, the body would be only move-from the result. But now I guess having this qualifier would be more clear from the move semantics perspective. Thanks for catching that.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. An alternative is to give it a different name so it’s clearer that it moves out without the std::move, but just qualifying it like this feels reasonable.

@@ -514,7 +515,7 @@ Value parallel_reduce( const Range& range, const Value& identity, const RealBody
lambda_reduce_body<Range,Value,RealBody,Reduction> body(identity, real_body, reduction);
start_reduce<Range,lambda_reduce_body<Range,Value,RealBody,Reduction>,const __TBB_DEFAULT_PARTITIONER>
::run(range, body, __TBB_DEFAULT_PARTITIONER() );
return body.result();
return std::move(body.result());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this std::move does nothing since body.result() is an rvalue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to std::move(body).result(), as discussed in the previous thread.

@@ -56,6 +57,37 @@ struct ReduceBody {
}
};

template <class T>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//! explaining what this struct is for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

auto join_body = [](vector_wrapper<int>&& x, vector_wrapper<int>&& y) {
vector_wrapper<int> new_vector = std::move(x);
new_vector.reserve(new_vector.size() + y.size());
new_vector.insert(new_vector.end(), std::make_move_iterator(y.begin()), std::make_move_iterator(y.end()));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A move iterator over a vector of ints doesn't buy much. Consider for this example making

struct MoveOnlyInt {
    int x;
    MoveOnlyInt() = default;
    explicit MoveOnlyInt(int x) : x{x} {}
    MoveOnlyInt(const MoveOnlyInt&) = delete;
    MoveOnlyInt(MoveOnlyInt&& other) : x{std::exchange(other.x, 0)} {}
    MoveOnlyInt& operator=(const MoveOnlyInt&) = delete;
    MoveOnlyInt& operator=(MoveOnlyInt&& other) {
        this->x = std::exchange(other.x, 0);
        return *this;
    }
};

and use vectors of MoveOnlyInt so the move iterators are required?

Copy link

@BenFrantzDale BenFrantzDale Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also consider a test case using std::list or std::set that moves the nodes across, like

auto join_body = [](std::list<int>&& x, std::list<int>&& y) {
    x.splice(x.end(), std::move(y));
    return std::move(x);
}

or

auto join_body = [](std::set<int> x, std::set<int>&& y) {
    x.merge(std::move(y));
    return std::move(x);
}

(These could use MoveOnlyInt too.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be good to do a test with a type that isn’t default constructible and/or isn’t movable or copyable in a node-based container — basically all the weird corner cases of irregular types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we still cannot use move-only types for parallel_reduce since we need to copy the identity into each reduction "leaf". And using std::vector<MoveOnlyInt> or list would cause compilation errors even if the identity element is empty.
I have changed the value type into the integer wrapper that is copyable, but the actual copy constructor should never be called.

Regarding adding new test cases, I think it makes no sense to test both "vector of vectors" and "vector of lists" use-cases. Splicing the lists while doing parallel_reduce is more interesting IMHO, so I left only the vector of lists test.

Copy link

@BenFrantzDale BenFrantzDale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I'll leave it to tbb maintainers to approve.

Comment on lines 71 to 78
NeverCopyWrapper(const NeverCopyWrapper&) {
REQUIRE_MESSAGE(false, "Copy constructor of NeverCopyWrapper should never be called");
}

NeverCopyWrapper& operator=(const NeverCopyWrapper&) {
REQUIRE_MESSAGE(false, "Copy assignment of NeverCopyWrapper should never be called");
return *this;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to =delete; these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot delete them because if we provide std::vector<NeverCopyWrapper> as an identity to parallel_reduce, it should be copied into each reduce leaf. And even if the empty identity would be copied, it requires value_type to be copy constructible and causes compilation issues otherwise.
Because of this, I decided to make it copiable to fix the compilation but deny to actually call the ctor and assignment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have investigated one more time, and since we don't use std::vector directly, only using the wrapper, we can control that actual copy constructor would not be called by the container. So I removed NeverCopyWrapper and changed it to be MoveOnlyWrapper

Comment on lines +304 to +306
auto operator()(Args&&... args) const -> decltype(oneapi::tbb::parallel_deterministic_reduce(std::forward<Args>(args)...)) {
return oneapi::tbb::parallel_deterministic_reduce(std::forward<Args>(args)...);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the trailing return type helpful versus just leave it to auto (or decltype(auto))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are forced to use trailing return type, because both auto with automatic deduction and decltype(auto) are C++14 features and we still need to support C++11.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. That's annoying.

@pavelkumbrasev
Copy link
Contributor

LGTM!

@kboyarinov kboyarinov merged commit ae0696c into master Feb 14, 2024
19 checks passed
@kboyarinov kboyarinov deleted the dev/kboyarinov/rvalue-reduce branch February 14, 2024 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants