Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Further optimize the optimal rotation code path #560

Open
HanatoK opened this issue Jul 25, 2023 · 6 comments
Open

RFC: Further optimize the optimal rotation code path #560

HanatoK opened this issue Jul 25, 2023 · 6 comments
Labels
non-urgent Long-term improvement optimization

Comments

@HanatoK
Copy link
Member

HanatoK commented Jul 25, 2023

I think the optimal rotation code path can still be optimized further as follows:

  1. Use rotation matrix instead of quaternion to rotate the vectors. This could make the rotation operation 1.5~3 times faster on my computer. A distilled test example can be seen in https://godbolt.org/z/qnKv8E6jT;
  2. Use fixed size array for S, S_eigvec and S_backup. cvm::matrix2d allocates the memory dynamically but we already know that the matrices are all 4x4 at compile time;

Any ideas?

Update: for 1, it should be 0.5~2 times faster.

@HanatoK HanatoK added optimization non-urgent Long-term improvement labels Jul 25, 2023
@jhenin
Copy link
Member

jhenin commented Jul 25, 2023

Thanks, those seem valid to me.

  1. I think this is promising because calling rotate() many times for the same quaternion does redundant work, especially compared with caching the rotation matrix and applying it directly to many positions.
  2. I would expect only minor gains from this, as the resizing is done once in calc_optimal_rotation_impl and then becomes trivial because these are persistent member data, so there is no reallocation on subsequent calls: it's really allocated once and for all. That said, I would rather have the resize() happen in the constructor for a cleaner code. Then the run-time should be nearly indistinguishable from fixed arrays: we don't construct very many rotation objects per second (at least I can't think of a workflow that would).

@giacomofiorin
Copy link
Member

I agree on both points.

@HanatoK
Copy link
Member Author

HanatoK commented Jul 25, 2023

Thanks @jhenin and @giacomofiorin ! I will try the optimizations if I have time.

@HanatoK
Copy link
Member Author

HanatoK commented Jul 26, 2023

@giacomofiorin
Is it also possible to parallelize the loops over atoms? I am not familiar with the SMP but it seems it only parallelizes the computation of CVCs and biases, but in this way, the threads may have imbalance workloads. Let's say we have one distance CVC and one RMSD CVC, which are computed by two threads via SMP, then the thread calculating distance CVC may finish faster than the other. Another similar case is running metadynamics with a harmonic wall restraint. The harmonic wall restraint is easier to compute than the metadynamics bias, since the latter needs to iterate over either the grid to project a new hill, or iterate all history hills to calculate the biasing force at a certain position. Is it possible to utilize SMP in a fine-grained manner, such as looping over all atoms?

@giacomofiorin
Copy link
Member

giacomofiorin commented Sep 20, 2023

Answering your last point (before the issue gets closed when #585 is merged) parallelization over atoms is desirable, but will take some refactoring.

Remember that we are dealing with a broad range of use cases between (1) a single expensive rotation (e.g. large protein) and (2) many rotations/RMSDs with respect to multiple points (e.g. path CV). Also, when going to large numbers of threads good SMP parallelization requires a clever data structure layout (ask those with more experience than either of us). This means yet more refactoring because the current layout is nowhere near clever! :-)

Continuous testing has helped tremendously, especially considering that we cannot plan around the release cycles of major packages. I feel that we're getting closer to making that kind of refactoring manageable, but it remains more than one or two PRs away.

A practical suggestion: would you be open to consider taking some of the tests that you already wrote in the namd/tests/library folder and making them agnostic from NAMD (i.e. top-level tests folder)? If most of the functionality could be run with using just the library, it would make it much easier to troubleshoot any errors. Running a full-blown MD engine is important for benchmarking, but makes debugging harder.

@HanatoK
Copy link
Member Author

HanatoK commented Sep 26, 2023

@giacomofiorin

I will try to make some of the tests MD-engine agnostic.

Regarding the threading, is it possible to use a thread pool? I think it might be possible to pull a thread from the pool to do either (1) or (2) that you mentioned, but do not know if there are other performance overheads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-urgent Long-term improvement optimization
Projects
None yet
Development

No branches or pull requests

3 participants