
Large static memory allocation on hpc system #13

Closed
tomgrylls opened this issue Jul 8, 2019 · 10 comments
@tomgrylls
Contributor

tomgrylls commented Jul 8, 2019

Describe the bug
The code crashes on one of the first lines of a subroutine where a static memory declaration above some threshold size is made. It only occurs when the code is compiled with Intel, run on the Imperial HPC, and with a sufficiently large domain size (I have not managed to reproduce this bug locally). The bug arose following a change to the architecture of the HPC system.

To Reproduce
Compile the code on the HPC and run a sufficiently large simulation (e.g. 100x200x200 on the debug queue with 5 nodes).

Expected behavior
A segmentation fault at the start of the subroutine in question:

forrtl: severe (174): SIGSEGV, segmentation fault occurred

However, we have found with similar memory-related issues in the past that the bug can occur intermittently and does not always cause a runtime error on a line that is relevant to the problem itself.
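
For reference, a minimal sketch (with hypothetical subroutine and variable names, not the actual code) of the pattern that produces this kind of crash: a large automatic array sized by the domain, which Intel Fortran places on the stack by default, so the program segfaults at the declaration once the stack limit is exceeded.

```fortran
subroutine calc_stats(itot, jtot, ktot, field)
  implicit none
  integer, intent(in) :: itot, jtot, ktot
  real,    intent(in) :: field(itot, jtot, ktot)
  ! Automatic array: sized at run time from the dummy arguments and, with
  ! ifort's defaults, placed on the stack. For a 100x200x200 domain this
  ! temporary alone is roughly 16 MB in single precision (32 MB in double)
  ! and can exceed the stack limit.
  real :: tmp(itot, jtot, ktot)

  tmp = field
  ! ... statistics computed from tmp ...
end subroutine calc_stats
```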

@tomgrylls tomgrylls self-assigned this Jul 8, 2019
@tomgrylls
Contributor Author

Update on current progress: the initial bug had a temporary fix of changing the declarations of the affected variables to dynamic allocations. Initial tests showed that this was successful, and I discussed the optimal solution with @dmey.

Recent simulations crashed for seemingly the same reason at different locations in the code (the declaration of thv0 in modstartup). It is possible to implement the same fix here, but this raises the further questions of (1) why this problem is occurring now and (2) why we only experience it on the Imperial HPC.
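
A minimal sketch of the temporary fix described above (hypothetical names again, not the actual uDALES variables): the same temporary declared allocatable, so the memory is taken from the heap instead of the stack.

```fortran
subroutine calc_stats(itot, jtot, ktot, field)
  implicit none
  integer, intent(in) :: itot, jtot, ktot
  real,    intent(in) :: field(itot, jtot, ktot)
  real, allocatable :: tmp(:,:,:)   ! dynamic allocation instead of an automatic (stack) array

  allocate(tmp(itot, jtot, ktot))   ! heap allocation; not constrained by the stack limit
  tmp = field
  ! ... statistics computed from tmp ...
  deallocate(tmp)
end subroutine calc_stats
```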

To be tested: change mcmodel to large.

Outlook: continue to fix this bug on the branch tomgrylls/trees-driver, as I need to run simulations ASAP, and produce a pull request from a separate branch (alongside the generalised statistics routine) once the bug is fixed completely.

@tomgrylls tomgrylls added the bug Something isn't working label Jul 8, 2019
@dmey
Contributor

dmey commented Jul 8, 2019

@tomgrylls do you have a test case you can attach to this issue so that I can also try and have a look if/when time permits?

(2) why we only experience this on Imperial HPC.

The makefile for ARCHER had -mcmodel=large, whereas at Imperial you were using -mcmodel=medium. Locally, if you were using GNU, note that "The behaviour of Fortran’s automatic arrays depends on the type of compiler. Using Intel Fortran means they are stored on the stack (by default), whereas GNU Fortran stores them on the heap" (see Arrays and memory in Fortran, which is a good overview anyway). For now you can take the sledgehammer approach if you want and use something like the -heap-arrays flag in ifort, or indeed set mcmodel to large. Ultimately, you may want to restructure the code and pass allocatables whenever you declare a 'large' array.
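
To illustrate the last point, a sketch (hypothetical module and routine names, not the existing interfaces) of what passing the allocatable attribute through to a 'large' array could look like; putting the routine in a module provides the explicit interface that allocatable dummy arguments require.

```fortran
module stats_utils
  implicit none
contains
  subroutine slab_average(work, field)
    ! The caller passes an allocatable array; its storage always comes from
    ! the heap, so no large automatic copy is created on the stack.
    real, allocatable, intent(inout) :: work(:,:,:)
    real, intent(in)                 :: field(:,:,:)

    if (.not. allocated(work)) &
       allocate(work(size(field,1), size(field,2), size(field,3)))
    work = field
    ! ... slab-averaged statistics computed from work ...
  end subroutine slab_average
end module stats_utils
```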

BTW, there may be other issues with the code, not yet fixed, that have exacerbated the problems you are reporting. So fixing this may not solve all your issues, but it is a step in the right direction.

@tomgrylls
Contributor Author

tomgrylls commented Jul 8, 2019

My initial fix of editing the static declarations of the 3D temporary arrays in the statistics routine worked for the majority of the simulations I ran. Two larger simulations still crashed, but at different locations within the code. They also crashed following similar declarations. I changed a couple of these to allocatable, but there are many examples of this kind of declaration, particularly in subroutines that are called throughout the code (excjs, slabsum, avey_ibm, etc.). So before actually changing those, we should establish that these crashes occur for that exact reason and not for something else (these subroutines do not pass allocatables, and some have worked without issues for years).

I tried using mcmodel=large last week and it did not work. However, I was using the module netcdf/4.0.1-mcmodel-medium. I retested with a different netcdf module and mcmodel=large still does not work for my current test simulation (zip attached).

Using the flag -heap-arrays 10 has worked in initial test simulations, including the larger simulations that still crashed after the modifications to the statistics routine (suggesting that those crashes were caused by a similar bug).

I will run a set of tests using the executable compiled from the master branch. If -heap-arrays sorts this out for now, then I think the best path forward is to go for this sledgehammer approach in the short term (the code has been running with relatively few issues for the last couple of years in its current form) and then fold the memory declaration changes in the statistics routine into the enhancement I will do to generalise this module when I have time anyway.

I will also go to the ICL HPC drop-in clinic tomorrow to enquire why this issue arose and exactly what changes were made during the recent shutdown.

619.zip EDIT: this exp folder will not run on the master branch as it uses some updated namoptions from tomgrylls/trees-driver. I will attach a different, compatible test case later.

@tomgrylls
Contributor Author

So I went to the RCS walk-in clinic and asked about this. During the recent PBS upgrade they inadvertently overwrote a file that set no limit on static memory, so the system fell back to its default, which defines some limit. The system had been functioning for a long time with no limit set. They did not mean to make this change and plan to revert it during the next PBS reboot in six weeks' time (they have already reverted it on some private queues but not on any public ones).

I showed him the -heap-arrays workaround and he felt that it is a good option for running simulations on their system over the next few weeks until the limit is changed back. He did not seem overly concerned about the advantages/disadvantages of using the heap or the stack for these large arrays in Fortran. He thought there was limited performance difference, although he mentioned he was not 100% sure of this, and in general he felt that the concern was a legacy one.

As this is the case, I think the best way forward is to submit a pull request with the -heap-arrays flag. He said he would let me know when they reboot the system and it reverts to its previous set-up.

If we do still think changing the memory declarations is advantageous, then doing so in the statistics routine can be part of that enhancement (expanding the current module into one universal one). And @dmey, if you think that changing these static declarations in other parts of the code is advantageous, then this could be part of the general clean-up?

@tomgrylls
Contributor Author

To add to this, I tested running the executable built from the master branch with and without -heap-arrays 10, and it ran and crashed respectively.

@dmey
Contributor

dmey commented Jul 9, 2019

@tomgrylls 👍. I have no issue with setting -heap-arrays 10 for now so you can get the simulations working, but I would really like to have this fixed properly by giving those arrays the allocatable attribute. I guess we can do this during the cleanup, and I would leave this issue open until it is ultimately fixed.

tomgrylls added a commit that referenced this issue Jul 11, 2019
…. (#14)

Additional flag as a workaround for the threshold on stack memory declarations on the ICL HPC.
@dmey
Contributor

dmey commented Dec 6, 2019

revisit and add to docs

@dmey dmey added the documentation Improvements or additions to documentation label Dec 6, 2019
@dmey dmey added this to the Future milestone Dec 6, 2019
@dmey
Contributor

dmey commented Dec 15, 2019

@tomgrylls what was the decision on this -- are we going to fix it by declaring the allocatable attribute, or patch it by setting the -heap-arrays 10 flag? We will need to consider this in #49.

@dmey dmey modified the milestones: Future, 0.1.0 Dec 15, 2019
@tomgrylls
Contributor Author

As far as I am concerned, -heap-arrays 10 is a good enough solution to this problem. I would suggest making the memory declaration changes a long-term issue and assigning it to the Future milestone. There are other, more urgent things required in any clean-up. If I have time before I leave I will revisit the form of the statistics routine (see #42), so I will add this to the list of changes to be made in that subroutine. However, as discussed above, static memory declarations occur throughout the code in its current state.

@dmey
Contributor

dmey commented Dec 15, 2019

OK -- I will move this back to the Future milestone and we can revisit it another time.

@dmey dmey modified the milestones: 0.1.0, Future Dec 15, 2019