Large static memory allocation on hpc system #13
Update on current progress: the initial bug had a temporary fix of changing the declaration of said variables to dynamic allocation. Initial tests showed that this was successful, and I discussed the optimal solution with @dmey. Recent simulations crashed for seemingly the same reason in different locations of the code (declaration of thv0 in modstartup). It is possible to implement the same fix here, but this raises further questions of (1) why this problem is occurring now and (2) why we only experience it on the Imperial HPC. To be tested: changing mcmodel to large. Outlook: continue to fix this bug on branch tomgrylls/trees-driver, as I need to run simulations ASAP, and produce a pull request from a separate branch (alongside the generalised statistics routine) once the bug is fixed completely.
@tomgrylls do you have a test case you can attach to this issue so that I can also try and have a look if/when time permits?
The make file for archer had BTW, there may be other issues with the code that have exacerbated the problems you are reporting which we have not yet fixed. So fixing this may not solve all your issues, but it is a step in the right direction.
My initial fix of editing the static declarations of the 3D temporary arrays in the statistics routine worked for the majority of the simulations I ran. Two larger simulations still crashed, but at different locations within the code. They also crashed following similar declarations. I changed a couple of these to allocatable, but there are many examples of these kinds of declarations, particularly in subroutines that are called throughout the code.

I tried using mcmodel=large last week and it did not work. However, I was using the module netcdf/4.0.1-mcmodel-medium. I retested this with a different netcdf module, and mcmodel=large does not work for my current test simulation (zip attached). Using the flag, I will run a set of tests using the executable compiled from the master branch.

I will also go to the ICL HPC drop-in clinic tomorrow to enquire about why this issue arose and exactly what changes they made during the recent shutdown.

619.zip

EDIT: this exp folder will not run on the master branch as it uses some updated namoptions from tomgrylls/trees-driver. Will attach a different test case later that is compatible.
So I went to the RCS Walk-In clinic and asked them about this. He said that they did inadvertently overwrite a file that set no limit on static memory during the recent PBS upgrade. This means that it reverted to its default, which is to define some limit. The system had been functioning for a long time with no limit set. They did not mean to make this change and plan to revert it during the next PBS reboot in six weeks' time (they have already changed this on some private queues but no public ones).

I showed him the -heap-arrays workaround and he felt that this is a good option in terms of being able to run simulations on their system over the next few weeks until this is changed back. In general he did not seem overly concerned about the advantages/disadvantages of using the heap or the stack for these large arrays in Fortran. He thought that there was limited performance difference, although he did mention he was not 100% sure on this. In general he felt that the concern was just a legacy one.

As this is the case, I think the best way forward is to submit a pull request with the -heap-arrays flag. He said he would let me know when they reboot the system and it reverts to its previous set-up. If we do still think changing the memory declaration is advantageous, then doing this in the statistics routine can be part of that enhancement (expand the current module to one universal one). And if @dmey you think that changing these static declarations in other parts of the code is advantageous, then this could be part of the general clean-up?
To add to this, I tested running the executable of the master branch with and without
@tomgrylls 👍. I have no issue with setting
… (#14) Additional flag as a workaround for the threshold on stack memory declaration on the ICL HPC.
revisit and add to docs
@tomgrylls what was the decision on this -- are we going to have it fixed by declaring the
As far as I am concerned
OK -- I will move this back to the Future milestone and we can revisit it another time. |
Describe the bug
The code will crash on one of the first lines of a subroutine where a static memory declaration over some threshold value is made. This only occurs when the code is compiled with Intel, run on the Imperial HPC, and with a sufficiently large domain size (we have not managed to reproduce this bug locally). The bug arose following a change to the architecture of the HPC system.
To Reproduce
Compile the code on the HPC and run a sufficiently large simulation (e.g. 100x200x200 on the debug queue with 5 nodes).
Expected behavior
Segmentation fault at the start of said subroutine:
However, we have found with similar memory-related issues in the past that the bug can occur intermittently and does not always cause a runtime error on a line that is relevant to the problem itself.