
Make Julia 1.3 the new minimum required version #636

Merged (9 commits) on Feb 21, 2020

Conversation

ali-ramadhan (Member)

Would be nice to get this into v0.23.0

Resolves #625

ali-ramadhan (Member, Author) commented Feb 20, 2020

Hmmm, yeah, so the GitLab CI GPU tests always get stuck testing the pressure solvers (which is where we start creating FFT plans) with Julia 1.3+. I was never able to reproduce this on any machine I have access to.

Example build logs:
Julia 1.3: https://gitlab.com/JuliaGPU/Oceananigans-jl/-/jobs/444576929
Julia 1.5: https://gitlab.com/JuliaGPU/Oceananigans-jl/-/jobs/444576930

@maleadt I was wondering if you have any ideas about what's going on? Maybe someone else has had similar issues. I'll isolate exactly which test/line it always gets stuck on.
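
For reference, a minimal sketch of the kind of FFT planning the pressure-solver tests are the first to exercise (illustrative only; the array size, dimensions, and flags here are assumptions, not the actual Oceananigans test code):

using FFTW

# Building a plan goes through FFTW's global planner, which is protected by a lock;
# executing an existing plan does not.
A = zeros(Complex{Float64}, 16, 16, 16)
p = plan_fft!(A, 1:2; flags=FFTW.MEASURE)  # planning takes the planner lock
p * A                                      # executing the plan is lock-free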

ali-ramadhan (Member, Author)

Worth noting that Travis tests are much faster now! 🎉

maleadt (Collaborator) commented Feb 20, 2020

> @maleadt I was wondering if you have any ideas about what's going on? Maybe someone else has had similar issues. I'll isolate exactly which test/line it always gets stuck on.

I've seen some people having issues that turned out to be FFT-threading related; not sure if that applies here. If a job is stuck, ping me on Slack and I'll have a look at dumping a backtrace using gdb.

maleadt (Collaborator) commented Feb 20, 2020

OK, so this is the same issue @navidcy experienced with FourierFlows.jl -- did you get that resolved? It looks to be happening on 1.3+.

Backtrace from GDB:

(gdb) bt
#0  0x00007f5e2a4626d6 in do_futex_wait.constprop () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f5e2a4627c8 in __new_sem_wait_slow.constprop.0 () from target:/lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f5debf04b42 in lock_planner_mutex () from target:/builds/JuliaGPU/Oceananigans-jl/.julia/artifacts/e40697527cebb56d421346210295905df6e421dc/lib/libfftw3.so
#3  0x00007f5debdf3ae7 in fftw_destroy_plan () from target:/builds/JuliaGPU/Oceananigans-jl/.julia/artifacts/e40697527cebb56d421346210295905df6e421dc/lib/libfftw3.so
#4  0x00007f5d470b48c7 in ?? ()
#5  0x00007f5d47fff8a8 in ?? ()
#6  0x00007f5e2ab310cc in _jl_invoke (world=27509, mfunc=<optimized out>, nargs=1, args=0x7f5d47fff8a8, F=0x7f5e1a1c80e0) at /buildworker/worker/package_linux64/build/src/gf.c:2144
#7  jl_apply_generic (F=<optimized out>, args=args@entry=0x7f5d47fff8a8, nargs=nargs@entry=1) at /buildworker/worker/package_linux64/build/src/gf.c:2328
#8  0x00007f5e2ab7a9bf in jl_apply (nargs=2, args=0x7f5d47fff8a0) at /buildworker/worker/package_linux64/build/src/julia.h:1695
#9  run_finalizer (o=0x7f5dc6ae1210, ff=0x7f5e1a1c80e0, ptls=0x7f5e2b4794a0) at /buildworker/worker/package_linux64/build/src/gc.c:277
#10 0x00007f5e2ab7b500 in jl_gc_run_finalizers_in_list (ptls=ptls@entry=0x7f5e2b4794a0, list=list@entry=0x7f5d47fffa10) at /buildworker/worker/package_linux64/build/src/gc.c:363
#11 0x00007f5e2ab83885 in run_finalizers (ptls=0x7f5e2b4794a0) at /buildworker/worker/package_linux64/build/src/gc.c:391
#12 jl_gc_collect (collection=JL_GC_INCREMENTAL) at /buildworker/worker/package_linux64/build/src/gc.c:3128
#13 0x00007f5e0e8477a7 in ?? ()
#14 0x00007f5e0e847730 in ?? ()
#15 0x00007f5decb6e760 in ?? ()
#16 0x00007f5e2b4794a0 in ?? ()
#17 0x00007f5d47fffdb8 in ?? ()
#18 0x00007f5e2b4794a0 in ?? ()
#19 0x00007f5ded987590 in ?? ()
#20 0x00007f5d47fffd90 in ?? ()
#21 0x00007f5e0e84a7c6 in ?? ()
#22 0x00007f5d47fffc60 in ?? ()
#23 0x00007f5e1c30b01d in iolock_end () at libuv.jl:49
#24 julia_Timer#506_3522 () at asyncevent.jl:88
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

From jlbacktrace (tricky to get a hold of given the CI set-up):

do_futex_wait.constprop.1 at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
__new_sem_wait_slow.constprop.0 at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
lock_planner_mutex at /builds/JuliaGPU/Oceananigans-jl/.julia/artifacts/e40697527cebb56d421346210295905df6e421dc/lib/libfftw3.so (unknown line)
fftw_destroy_plan at /builds/JuliaGPU/Oceananigans-jl/.julia/artifacts/e40697527cebb56d421346210295905df6e421dc/lib/libfftw3.so (unknown line)
destroy_plan at /builds/JuliaGPU/Oceananigans-jl/.julia/packages/FFTW/qqcBj/src/fft.jl:255
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2144 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2328
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1695 [inlined]
run_finalizer at /buildworker/worker/package_linux64/build/src/gc.c:277
jl_gc_run_finalizers_in_list at /buildworker/worker/package_linux64/build/src/gc.c:363
run_finalizers at /buildworker/worker/package_linux64/build/src/gc.c:391 [inlined]
jl_gc_collect at /buildworker/worker/package_linux64/build/src/gc.c:3128
gc at ./gcutils.jl:79 [inlined]
scan at /builds/JuliaGPU/Oceananigans-jl/.julia/packages/CuArrays/1njKF/src/memory/binned.jl:104
#8 at /builds/JuliaGPU/Oceananigans-jl/.julia/packages/CuArrays/1njKF/src/memory/binned.jl:267
lock at ./lock.jl:161
macro expansion at /builds/JuliaGPU/Oceananigans-jl/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:228 [inlined]
macro expansion at /builds/JuliaGPU/Oceananigans-jl/.julia/packages/CuArrays/1njKF/src/memory/binned.jl:266 [inlined]
#6 at ./task.jl:358
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2161 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2328
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1695 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:687
unknown function (ip: (nil))
unknown function (ip: (nil))

Other tasks seem to be waiting for that one to finish:

signal (3): Quit
in expression starting at /builds/JuliaGPU/Oceananigans-jl/test/test_pressure_solvers.jl:183
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:480
poptaskref at ./task.jl:702
wait at ./task.jl:709
task_done_hook at ./task.jl:444
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2144 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2328
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1695 [inlined]
jl_finish_task at /buildworker/worker/package_linux64/build/src/task.c:198
start_task at /buildworker/worker/package_linux64/build/src/task.c:697
unknown function (ip: (nil))
unknown function (ip: (nil))

So this looks FFTW-related, but it rings no bells here. Maybe @stevengj has seen this before?
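
To spell out the mechanism the backtrace points at: the plan finalizer installed by FFTW.jl calls fftw_destroy_plan, and destroying a plan needs the same global planner mutex (lock_planner_mutex in frame #2), so a GC-triggered finalizer blocks if that mutex is already held. A minimal sketch of that code path, illustrative only and single-threaded, so it does not deadlock on its own:

using FFTW

# Dropping the last reference to a plan lets the GC run its finalizer, which calls
# fftw_destroy_plan and must acquire FFTW's planner mutex (the lock the stuck job
# above is waiting on).
A = rand(Float64, 64, 64, 64)
p = plan_fft(A)   # plan creation also goes through the planner mutex
p = nothing
GC.gc()           # runs the plan's finalizer -> fftw_destroy_plan -> planner mutex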

ali-ramadhan (Member, Author)

Thanks for looking into this @maleadt!

Hmmm, if it's indeed a multithreading issue, then maybe the simple solution is to just turn off multithreading for FFTW during testing?
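
A sketch of what that could look like (assuming FFTW's own thread count is the relevant knob, which isn't established at this point in the thread):

using FFTW

# Force FFTW itself to run single-threaded for the whole test session; plans created
# after this call will not use FFTW's worker threads.
FFTW.set_num_threads(1)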

stevengj

I haven't seen that before… not sure why there would be a deadlock, but you can try setting JULIA_NUM_THREADS=1 as a workaround.
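
JULIA_NUM_THREADS has to be set before the Julia process starts, so in practice it is a CI environment setting. For illustration only, a hypothetical way to get the same effect from a parent Julia session by launching the tests in a fresh single-threaded process (the command and package call here are assumptions, not part of this PR):

# Run the test suite in a child process pinned to one Julia thread, regardless of
# the parent's environment.
env = Dict(ENV)
env["JULIA_NUM_THREADS"] = "1"
cmd = setenv(`$(Base.julia_cmd()) --project -e 'using Pkg; Pkg.test("Oceananigans")'`, env)
run(cmd)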

stevengj

Not sure if JuliaMath/FFTW.jl#138 will help?

codecov bot commented Feb 21, 2020

Codecov Report

Merging #636 into master will increase coverage by 7.29%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #636      +/-   ##
==========================================
+ Coverage    70.7%   77.99%   +7.29%     
==========================================
  Files         118      118              
  Lines        2270     2345      +75     
==========================================
+ Hits         1605     1829     +224     
+ Misses        665      516     -149
Impacted Files Coverage Δ
src/Utils/launch_config.jl 70.58% <0%> (-29.42%) ⬇️
src/Models/incompressible_model.jl 87.5% <0%> (-12.5%) ⬇️
src/Logger.jl 79.16% <0%> (-9.73%) ⬇️
src/Solvers/solve_for_pressure.jl 93.33% <0%> (-6.67%) ⬇️
...ntations/rozema_anisotropic_minimum_dissipation.jl 35.84% <0%> (-2.45%) ⬇️
src/Solvers/box_pressure_solver.jl 0% <0%> (ø) ⬆️
src/Solvers/batched_tridiagonal_solver.jl 100% <0%> (ø) ⬆️
src/Utils/time_step_wizard.jl 100% <0%> (ø) ⬆️
src/Grids/vertically_stretched_cartesian_grid.jl 100% <0%> (ø) ⬆️
src/Solvers/triply_periodic_pressure_solver.jl 0% <0%> (ø) ⬆️
... and 41 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4ed3660...2610b13.

ali-ramadhan (Member, Author)

Thanks so much for looking into this @stevengj!

I tried both suggestions:

  1. Setting JULIA_NUM_THREADS=1 worked, and the GitLab CI tests with Julia 1.3+ no longer get stuck 🎉 https://gitlab.com/JuliaGPU/Oceananigans-jl/-/jobs/445804927

  2. Using FFTW#stevengj-patch-1 (and going back to JULIA_NUM_THREADS=4) unfortunately did not help; testing still gets stuck at the same point as before: https://gitlab.com/JuliaGPU/Oceananigans-jl/-/jobs/445830533

We don't compute any large FFTs during testing, so JULIA_NUM_THREADS=1 is a perfect solution for us. Unfortunately I don't know enough about multithreading and FFTW to help debug this issue, but I'm more than happy to test patches.
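
For completeness, a small guard one could add at the top of the pressure-solver tests (hypothetical, not part of this PR) so a future multi-threaded CI run warns loudly instead of hanging silently:

# Warn if the GPU tests were launched with more than one Julia thread, since that
# configuration is known to hang in the FFTW plan finalizers.
if Threads.nthreads() > 1
    @warn "Pressure-solver tests may hang in FFTW plan finalizers when JULIA_NUM_THREADS > 1" nthreads = Threads.nthreads()
end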

Labels: package 📦 Quite meta
Linked issue: Make Julia 1.3 the new minimum required version? (#625)
4 participants