
GPU illegal memory access #3267

Closed
jagoosw opened this issue Sep 15, 2023 · 8 comments
Labels
GPU 👾 Where Oceananigans gets its powers from · help wanted 🦮 plz halp (guide dog provided)

Comments

@jagoosw
Collaborator

jagoosw commented Sep 15, 2023

Hi all,

I'm stuck trying to debug an error I keep getting when running a non-hydrostatic model on GPU.

It runs for a bit and then throws this error:

```
... (loads of similar CUDA stuff that goes on for a very very long time)
    @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:170 [inlined]
 [16] context!
    @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:165 [inlined]
 [17] unsafe_free!(xs::CUDA.CuArray{ComplexF64, 3, CUDA.Mem.DeviceBuffer}, stream::CUDA.CuStream)
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:129
 [18] unsafe_finalize!(xs::CUDA.CuArray{ComplexF64, 3, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:150
 [19] top-level scope
    @ ~/.julia/packages/InteractiveErrors/JOo2y/src/InteractiveErrors.jl:329
 [20] eval
    @ ./boot.jl:370 [inlined]
 [21] eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module)
    @ REPL /rds/user/js2430/hpc-work/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:153
 [22] repl_backend_loop(backend::REPL.REPLBackend, get_module::Function)
    @ REPL /rds/user/js2430/hpc-work/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:249
 [23] start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function)
    @ REPL /rds/user/js2430/hpc-work/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:234
 [24] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::Any)
    @ REPL /rds/user/js2430/hpc-work/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:379
 [25] run_repl(repl::REPL.AbstractREPL, consumer::Any)
    @ REPL /rds/user/js2430/hpc-work/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:365
 [26] (::Base.var"#1017#1019"{Bool, Bool, Bool})(REPL::Module)
    @ Base ./client.jl:421
 [27] #invokelatest#2
    @ ./essentials.jl:816 [inlined]
 [28] invokelatest
    @ ./essentials.jl:813 [inlined]
 [29] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
    @ Base ./client.jl:405
 [30] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:322
 [31] _start()
    @ Base ./client.jl:522
LoadError: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
in expression starting at /rds/user/js2430/hpc-work/Eady/eady.jl:133
```
```
(stacktrace)
  (user)
    CUDA
      throw_api_error ~/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:27 [inlined]
    CUDA
      cuOccupancyMaxPotentialBlockSize ~/.julia/packages/CUDA/35NC6/lib/utils/call.jl:26
      #launch_configuration#875 ~/.julia/packages/CUDA/35NC6/lib/cudadrv/occupancy.jl:63 [inlined]
    CUDA
      #mapreducedim!#1119 ~/.julia/packages/CUDA/35NC6/src/mapreduce.jl:236 [inlined]
    GPUArrays
      #_mapreduce#31 ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:69 [inlined]
    Oceananigans.Solvers
      solve! ~/.julia/packages/Oceananigans/mwXt0/src/Solvers/fourier_tridiagonal_poisson_solver.jl:134 [inlined]
    Oceananigans.Models.NonhydrostaticModels
      calculate_pressure_correction! ~/.julia/packages/Oceananigans/mwXt0/src/Models/NonhydrostaticModels/pressure_correction.jl:15
    Oceananigans.TimeSteppers
      #time_step!#8 ~/.julia/packages/Oceananigans/mwXt0/src/TimeSteppers/runge_kutta_3.jl:138
    Oceananigans.Simulations
      time_step! ~/.julia/packages/Oceananigans/mwXt0/src/Simulations/run.jl:134
      #run!#7 ~/.julia/packages/Oceananigans/mwXt0/src/Simulations/run.jl:97
      run! ~/.julia/packages/Oceananigans/mwXt0/src/Simulations/run.jl:85
      [top-level]
  (system)
```

I can't get the whole error message because it's longer than the screen, but the above seems to be the relevant part when viewed with InteractiveErrors.

If I make the grid smaller, it completes more iterations before erroring, but it is nowhere near using all of the GPU's memory (an A100 with 80 GB; the model is about 2 GB at 256×256×64).

This is with the latest version of Oceananigans (v0.87.4). I'll try to make an MWE.

@jagoosw jagoosw added help wanted 🦮 plz halp (guide dog provided) GPU 👾 Where Oceananigans gets its powers from labels Sep 15, 2023
@jagoosw
Collaborator Author

jagoosw commented Sep 15, 2023

When I exit the REPL I get a very long error message ending:

```
WARNING: Error while freeing DeviceBuffer(568 bytes at 0x0000000320000400): CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)

Stacktrace:
[1] throw_api_error(res::CUDA.cudaError_enum)
@ CUDA ~/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:27
[2] check
@ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:34 [inlined]
[3] cuMemFreeAsync
@ ~/.julia/packages/CUDA/35NC6/lib/utils/call.jl:26 [inlined]
[4] #free#2
@ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/memory.jl:97 [inlined]
[5] free
@ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/memory.jl:92 [inlined]
[6] #actual_free#976
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:77 [inlined]
[7] actual_free
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:74 [inlined]
[8] #_free#998
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:492 [inlined]
[9] _free
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:479 [inlined]
[10] macro expansion
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:464 [inlined]
[11] macro expansion
@ ./timing.jl:393 [inlined]
[12] #free#997
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:463 [inlined]
[13] free
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:452 [inlined]
[14] (::CUDA.var"#1004#1005"{CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CUDA.CuStream})()
@ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:130
[15] #context!#887
@ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:170 [inlined]
[16] context!
@ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:165 [inlined]
[17] unsafe_free!(xs::CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, stream::CUDA.CuStream)
@ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:129
[18] unsafe_finalize!(xs::CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer})
@ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:150
WARNING: Error while freeing DeviceBuffer(560 bytes at 0x0000000320000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)

Stacktrace:
[1] throw_api_error(res::CUDA.cudaError_enum)
@ CUDA ~/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:27
[2] check
@ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:34 [inlined]
[3] cuMemFreeAsync
@ ~/.julia/packages/CUDA/35NC6/lib/utils/call.jl:26 [inlined]
[4] #free#2
@ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/memory.jl:97 [inlined]
[5] free
@ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/memory.jl:92 [inlined]
[6] #actual_free#976
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:77 [inlined]
[7] actual_free
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:74 [inlined]
[8] #_free#998
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:492 [inlined]
[9] _free
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:479 [inlined]
[10] macro expansion
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:464 [inlined]
[11] macro expansion
@ ./timing.jl:393 [inlined]
[12] #free#997
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:463 [inlined]
[13] free
@ ~/.julia/packages/CUDA/35NC6/src/pool.jl:452 [inlined]
[14] (::CUDA.var"#1004#1005"{CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CUDA.CuStream})()
@ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:130
[15] #context!#887
@ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:170 [inlined]
[16] context!
@ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:165 [inlined]
[17] unsafe_free!(xs::CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, stream::CUDA.CuStream)
@ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:129
[18] unsafe_finalize!(xs::CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer})
@ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:150
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
throw_api_error at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:27
check at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:34 [inlined]
cuStreamDestroy_v2 at /home/js2430/.julia/packages/CUDA/35NC6/lib/utils/call.jl:26 [inlined]
#834 at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/stream.jl:86 [inlined]
#context!#887 at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:170
unknown function (ip: 0x7f08bc0a0880)
context! at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:165 [inlined]
unsafe_destroy! at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/stream.jl:85
unknown function (ip: 0x7f08bc0a0622)
_jl_invoke at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/gf.c:2940
run_finalizer at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/gc.c:417
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/gc.c:507
run_finalizers at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/gc.c:553
ijl_atexit_hook at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/init.c:299
jl_repl_entrypoint at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/jlapi.c:718
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x401098)

```

@jagoosw jagoosw changed the title GPU out of bounds memory error GPU illegal memory access Sep 15, 2023
@jagoosw
Collaborator Author

jagoosw commented Sep 15, 2023

Trying to make an MWE, I can't reproduce the error without running all of my code, so perhaps it's not actually in the pressure solver even though that's where the error is being raised.

@jagoosw
Collaborator Author

jagoosw commented Sep 15, 2023

So in this setup I have a load of `update_tendencies!` calls, and adding `synchronize(device(architecture(model)))` at the end appears to have fixed it.

To summarise:

  • `CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)`
  • Resolved by manually synchronizing the device with `synchronize(device(architecture(model)))`
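
As a minimal sketch of the pattern (not the actual Eady setup — the kernel, field names, and `forcing` argument here are illustrative): KernelAbstractions kernel launches are asynchronous on the GPU, so a kernel queued from a callback such as `update_tendencies!` may still be running when later code (e.g. the pressure solver's reductions) reads the arrays it writes. An explicit device synchronization closes that window:

```julia
using KernelAbstractions
using Oceananigans.Architectures: architecture, device

# Hypothetical kernel that adds a forcing array to a tendency array.
@kernel function add_forcing!(G, F)
    i, j, k = @index(Global, NTuple)
    @inbounds G[i, j, k] += F[i, j, k]
end

# Hypothetical callback; `forcing` is assumed to be a CuArray matching the grid.
function update_tendencies!(model, forcing)
    dev = device(architecture(model))  # e.g. the CUDA backend on GPU

    kernel! = add_forcing!(dev, (16, 16))
    kernel!(model.timestepper.Gⁿ.u, forcing; ndrange = size(model.grid))

    # The launch above returns immediately; without this barrier the next
    # time step can race the still-running kernel on the GPU.
    KernelAbstractions.synchronize(dev)

    return nothing
end
```

On the CPU backend the launch is effectively synchronous, which would explain why the race only appears on GPU.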

@jagoosw jagoosw closed this as completed Sep 15, 2023
@glwagner
Member

Do you know why the manual synchronize is needed?

@jagoosw
Collaborator Author

jagoosw commented Sep 16, 2023

No, I'll try making an MWE.

@glwagner
Member

Are all GPU operations KernelAbstractions? Or do you have other stuff sprinkled in?

@jagoosw
Collaborator Author

jagoosw commented Sep 16, 2023

All KernelAbstractions

@Yixiao-Zhang
Contributor

I found a similar problem (see #3320), but I am not sure whether it is related.

I do not know whether `synchronize(device(architecture(model)))` will solve my problem.
