
improving scalability #53

Open
francispoulin opened this issue May 15, 2020 · 8 comments


@francispoulin

Hello Mikael,

I have tested the scalability of my 2D Navier-Stokes code (shared in another issue) and find that it's not doing great. It seems to get worse as the number of degrees of freedom increases. I have copied the results in a text table below.

Just to be clear, I wasn't expecting great efficiency, as I have not been very clever in the setup of the code, but I do find it strange that as the number of points increases, the efficiency decreases. Normally, I would expect it to get better.

Question 1: Does this look odd to you?

Question 2: If I wanted to monitor the efficiency of my code in parallel, what would you recommend I use?

Note: these are done on my new personal desktop with 18 cores and 64 GB of RAM. I am trying to get shenfun installed on a server, which has more cores to play with, but the people in charge have decided not to support conda, and don't want us to install conda, so that is becoming more of a chore than it should be. This may make its way into another issue sometime ;)

|        | Np=1 | Np=2 | Np=4 | Np=8 | Np=16 |
|--------|------|------|------|------|-------|
| N=1024 | 1    | 0.68 | 0.61 | 0.40 | 0.26  |
| N=2048 | 1    | 0.55 | 0.47 | 0.37 | 0.19  |
| N=4096 | 1    | 0.47 | 0.39 | 0.27 | 0.16  |

@mikaem
Member

mikaem commented May 15, 2020

Hi Francis
Could you run the tests with just slab decomposition? It is normally the best solution. Choose slab=True when creating the TensorProductSpace.
Installing without conda should not be that hard on supercomputers. They usually have all the dependencies you need: FFTW, parallel HDF5, MPI, etc.
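
For a 3D space, a minimal sketch of what I mean (the grid size and variable names are just placeholders):

```python
from mpi4py import MPI
from shenfun import Basis, TensorProductSpace

comm = MPI.COMM_WORLD
N = (128, 128, 128)  # placeholder grid size

# Three Fourier bases; the last axis is real-to-complex
V0 = Basis(N[0], 'F', dtype='D')
V1 = Basis(N[1], 'F', dtype='D')
V2 = Basis(N[2], 'F', dtype='d')

# slab=True requests a 1D (slab) decomposition instead of the default 2D pencil
T = TensorProductSpace(comm, (V0, V1, V2), slab=True)
```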

@mikaem
Member

mikaem commented May 15, 2020

Ah. Just realized this is 2D, so it's already slab.

@mikaem
Member

mikaem commented May 16, 2020

BTW, I do not find the results strange, but I cannot really understand them without better knowledge of your PC. Is the memory shared, distributed, heterogeneous? How many cores in each CPU? By the looks of it I'm guessing 2 cores per CPU, because the performance is very good on 2 cores, but then there is very little speed-up going to 4. That could be explained by your computer having a fast interconnect between 2 cores on the same CPU, but a slower one from CPU to CPU. But you could also have 4 cores on one CPU, with a fast interconnect between cores 1 and 2, but a slower one between (1, 2) and the two remaining (3, 4). If you look at our MPI paper, we discuss this at some length in section 4.

The thing with spectral codes is that they require communication between all cores, because the basis functions are global. So spectral codes are very heavy on communication and require a fast interconnect between all cores. Most computers are not built to handle that. On dedicated supercomputers with very fast interconnects the codes can scale well up to thousands of cores, and I've used this code up to 70,000 cores with near-perfect scaling. But that does not mean that the code will scale at all on a different computer with different hardware. These things are complicated :-)

@francispoulin
Author

Hello Mikael, and thanks for your responses. Greatly appreciated.

Yes, this is a 2D problem so it is always a slab.

I did a 3D version of my code and did some tests with 512^3 and found the following scalings:

| Np      | 1   | 2    | 4    | 8    | 16   |
|---------|-----|------|------|------|------|
| scaling | 1.0 | 0.93 | 0.75 | 0.54 | 0.28 |

I suspect that if I used more degrees of freedom I would get even better performance. This suggests that my architecture can have good scalability.

As for the hardware, this is a brand-new Inspiron, one processor with 18 cores, which I have been led to believe should have great scalability.

I might try doing some tests with parts of the code to see how computing the flux (RHS) many times scales. If that scales well, then the problem might be with the time stepping. I plan to try this tomorrow and will let you know how things go.

@francispoulin
Author

I have some strange results to share, but they might help to point out where the problem occurs.

Below, I will copy a simple code that times how long it takes to compute the gradient 20 times. It doesn't do any updating, so it's not physically meaningful, but it should be a good measure of how the effort scales with multiple cores. The results are as follows:

N=2048

| Np | scaling |
|----|---------|
| 1  | 1       |
| 2  | 0.5     |
| 4  | 0.45    |
| 8  | 0.35    |

N=4096

| Np | scaling |
|----|---------|
| 1  | 1       |
| 2  | 0.43    |
| 4  | 0.36    |
| 8  | 0.26    |

In both cases the scaling from 1 to 2 is awful; it actually takes longer with two cores than with one. But going from 2 to 4 actually scales very well: the efficiency from 2 to 4 is 0.96 for the first and 0.84 for the second. Again, I would think that with more points it should scale better.

I guess I have some questions.

First, do you get similar results on one of your machines that scales well?

Second, could it be that there is a problem in what's going on from 1 to 2?

Third, I don't remember if I've turned on the optimization or not. Should that make a difference in the scalability?

As an aside, I have tried to install shenfun on one of our servers and have problems both with and without conda, but I will save that for another issue. ; )

@francispoulin
Author

```python
from mpi4py import MPI
import numpy as np
from shenfun import *

import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Geometry
M = 12
N = (2**M, 2**M)
L = [20.0, 20.]

# Parameters
F2 = 1e-8

# Function spaces
V1  = Basis(N[0], 'F', dtype='D', domain=(0, L[0]))
V2  = Basis(N[1], 'F', dtype='d', domain=(0, L[1]))
T   = TensorProductSpace(comm, (V1, V2), **{'planner_effort': 'FFTW_MEASURE'})
TV  = VectorTensorProductSpace(T)

# Padded (dealiased) spaces
Tp  = T.get_dealiased((1.5, 1.5))
TVp = TV.get_dealiased((1.5, 1.5))

q_hat, q, qb   = Function(T), Array(T), Array(T)
gradq_hat      = Function(TV)
qp, gradqp, Up = Array(Tp), Array(TVp), Array(TVp)  # Padded arrays

# FJP
Ub, gradqb = Array(TVp), Array(TVp)

# Wavenumbers
K  = np.array(T.local_wavenumbers(True, True))
K2 = np.sum(K*K + F2, 0, dtype=float)
K_over_K2 = K.astype(float) / np.where(K2 == 0, 1, K2).astype(float)

# Initialization
np.random.seed(rank)
q[:]  = 1e-4*2.*(np.random.random(q.shape).astype(q.dtype)-0.5)
q_hat = T.forward(q, q_hat)

if __name__ == '__main__':

    # Solve: time 20 evaluations of the gradient, velocity and flux
    time_initial = time.time()

    rhs_hat = Function(T)

    for ii in np.arange(20):
        # Gradient of q
        gradq_hat = 1j*K*q_hat
        gradqp    = TVp.backward(gradq_hat, gradqp)

        # Velocity
        Up[0] = Tp.backward( 1j*K_over_K2[1]*q_hat, Up[0])
        Up[1] = Tp.backward(-1j*K_over_K2[0]*q_hat, Up[1])

        # Flux
        rhs     = -(Up[0]*gradqp[0] + Up[1]*gradqp[1])
        rhs_hat = Tp.forward(rhs, rhs_hat)

    time_final = time.time()

    print('Total time = ', (time_final - time_initial))
```
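
One detail about the timing above: with more than one rank, every process calls time.time() and prints its own number. A more robust pattern is to synchronize before starting the clock, use MPI.Wtime(), take the maximum over ranks, and print from rank 0 only. A minimal sketch, independent of the solver code above:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

comm.Barrier()            # make sure all ranks start the clock together
t0 = MPI.Wtime()

# ... the timed loop would go here ...

t1 = MPI.Wtime()
elapsed = comm.reduce(t1 - t0, op=MPI.MAX, root=0)  # slowest rank dominates
if rank == 0:
    print('Total time =', elapsed)
```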

@mikaem
Member

mikaem commented May 20, 2020

Hi Francis,

  1. I think you are going about this the wrong way. You need to first benchmark your computer using some common benchmark tools. When you have an idea of what to expect, it will be much easier to analyse the shenfun results. There are MPI benchmarks available for most processors, for example from Intel. Note that you should only look into the performance of alltoall, because that is all that is used by shenfun. In fact, we are only using alltoallw. (A rough mpi4py-based stand-in is sketched after this list.)

  2. You may see more speed-up on your computer using threads. Choose the number of threads when creating the TensorProductSpace (a sketch is included after this list).

  3. It is not really strange that the performance does not increase going from 1 to 2 processors. That is very often the case. One processor requires NO communication, whereas 2 does a whole lot. If communications are slow, then 2 may be slower than 1 even though the serial operations are twice as fast. For 1 processor you can also do collapse_fourier=True, which will speed up 1 processor even further, but it cannot be used for padding or 2D slab.

  4. MPI has a whole bunch of environment variables that may be tweaked to enhance performance. Each function usually has several different implementations, and you only get the default unless you ask for something else. For example, there's an MPICH_SHARED_MEM_COLL_OPT that tries to take advantage of shared memory, which I suspect your computer has?

  5. You can get either MPICH or OpenMPI from conda. You should probably try both on your computer to figure out which one performs best.

  6. Shenfun is not optimized for shared memory, only distributed memory. So even though all cores have access to all memory on a shared memory computer, shenfun will still assume memory is distributed and send arrays back and forth. Optimizing for shared memory would probably lead to major speed-ups, but I have not looked into it.

  7. Speed-up will most likely come from tweaking the MPI parameters and not shenfun's implementation. Shenfun just calls alltoallw for any array (2D, 3D, 4D, etc.), and that's the beauty of it. Other parallel FFT vendors have codes that use 10,000 lines just to get the 3D pencil implemented. We have a completely generic implementation in just a few hundred lines of code. But we have to call alltoallw; there is simply no room for change or optimization on shenfun's side. Improve MPICH's alltoallw and shenfun's performance will improve.
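
Regarding point 1: if installing a full benchmark suite is a hassle, a rough stand-in is to time MPI.Alltoall directly with mpi4py. This is only a sketch with an arbitrary message size and repetition count, and it times Alltoall rather than the Alltoallw call shenfun actually uses, so treat the numbers as indicative only:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()
rank = comm.Get_rank()

# Each rank sends `count` doubles to every other rank (placeholder size)
count = 2**18
sendbuf = np.full(nprocs * count, rank, dtype=np.float64)
recvbuf = np.empty_like(sendbuf)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(20):
    comm.Alltoall(sendbuf, recvbuf)
comm.Barrier()
elapsed = comm.reduce((MPI.Wtime() - t0) / 20, op=MPI.MAX, root=0)

if rank == 0:
    print('Average Alltoall time on %d ranks: %.4e s' % (nprocs, elapsed))
```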
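
Regarding point 2: a sketch of what passing a thread count at construction time could look like for the 2D setup above, assuming the threads keyword is forwarded to the FFTW planner by your shenfun/mpi4py-fft version (the thread count here is arbitrary):

```python
from mpi4py import MPI
from shenfun import Basis, TensorProductSpace

comm = MPI.COMM_WORLD
N = (2048, 2048)  # placeholder grid size

V1 = Basis(N[0], 'F', dtype='D', domain=(0, 20.0))
V2 = Basis(N[1], 'F', dtype='d', domain=(0, 20.0))

# 'threads' and 'planner_effort' are passed on to the underlying FFTW plans
T = TensorProductSpace(comm, (V1, V2),
                       **{'planner_effort': 'FFTW_MEASURE', 'threads': 4})
```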

@francispoulin
Author

Hello Mikael,

All excellent points. I'll start from the first step and go from there.

I will keep you posted as I make progress.
