Segmentation violation on travis on Java 11 #6649

Open
cmnbroad opened this issue Jun 8, 2020 · 13 comments

Comments

@cmnbroad
Collaborator

cmnbroad commented Jun 8, 2020

I hit this same segmentation violation on 4 separate branches on travis today (I believe in each case only the Java 11 unit test job failed; the rest of the matrix succeeded). It seems to be intermittent, since so far rerunning the job has made it go away.

Finished 210000 tests
Finished 220000 tests
Finished 230000 tests
Finished 240000 tests
Finished 250000 tests
Finished 260000 tests
Finished 270000 tests
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f2bcaefd0f2, pid=10075, tid=10100
#
# JRE version: OpenJDK Runtime Environment (11.0.2+9) (build 11.0.2+9)
# Java VM: OpenJDK 64-Bit Server VM (11.0.2+9, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x8fd0f2]  jni_GetByteArrayElements+0x72
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/travis/build/broadinstitute/gatk/core.10075)
#
# An error report file with more information is saved as:
# /home/travis/build/broadinstitute/gatk/hs_err_pid10075.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#

Gradle suite > Gradle test > org.broadinstitute.hellbender.utils.pairhmm.VectorPairHMMUnitTest > testLikelihoodsFromHaplotypesForAvailableImplementations SKIPPED
Results: SUCCESS (276386 tests, 276385 successes, 0 failures, 1 skipped)

> Task :test FAILED

Entire log is attached.
java11segv.txt
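
When triaging crash headers like the one above, it can help to extract the "Problematic frame" line mechanically so that failures from different jobs can be compared. The following is a minimal sketch, not part of GATK or GKL: the class name `HsErrFrame` and its fields are hypothetical, and the regular expression assumes the header layout shown in this log.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds the crashing frame in a HotSpot fatal-error header. */
final class HsErrFrame {
    /** One parsed frame line, e.g. "# V  [libjvm.so+0x8fd0f2]  jni_GetByteArrayElements+0x72". */
    static final class Frame {
        final String type;     // frame type letter, e.g. "V" (VM code) or "C" (native code)
        final String library;  // e.g. "libjvm.so"
        final String offset;   // e.g. "0x8fd0f2"
        final String symbol;   // e.g. "jni_GetByteArrayElements+0x72"

        Frame(String type, String library, String offset, String symbol) {
            this.type = type;
            this.library = library;
            this.offset = offset;
            this.symbol = symbol;
        }
    }

    // Assumed layout: "# <type>  [<library>+<hex offset>]  <symbol>"
    private static final Pattern FRAME_LINE = Pattern.compile(
            "#\\s+(\\S)\\s+\\[(.+)\\+(0x[0-9a-fA-F]+)\\]\\s+(.+)");

    /** Returns the first header line that looks like a frame, if any. */
    static Optional<Frame> parse(String logText) {
        for (String line : logText.split("\\R")) {
            Matcher m = FRAME_LINE.matcher(line.trim());
            if (m.matches()) {
                return Optional.of(new Frame(m.group(1), m.group(2), m.group(3), m.group(4)));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String header = "# Problematic frame:\n"
                + "# V  [libjvm.so+0x8fd0f2]  jni_GetByteArrayElements+0x72\n";
        parse(header).ifPresent(f ->
                System.out.println(f.library + " " + f.offset + " " + f.symbol));
    }
}
```

Seeing the same library, offset, and symbol in two logs suggests the same crash site, but it does not by itself identify the root cause.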

@samuelklee
Contributor

Hmm, I thought I fixed this in #5026? Perhaps see #5026 (comment) for hints?

@cmnbroad
Collaborator Author

cmnbroad commented Jun 8, 2020

@samuelklee Yeah, you did see this issue there (as well as the concurrent modification exception).

@samuelklee
Contributor

Right, I thought I resolved it by removing the use of the DataProvider in that PR. And I thought I removed the SkipException as well? Are your branches rebased (sorry, can’t check easily now)?

@cmnbroad
Collaborator Author

cmnbroad commented Jun 8, 2020

Yeah, the branches are all rebased on current code, and one of the cases where I saw it this morning was right on master on travis. I'm not sure exactly which test/data provider change you think may have fixed it. Could someone have reintroduced it in the same place or elsewhere?

@samuelklee
Contributor

I'm not sure how you're getting

Gradle suite > Gradle test > org.broadinstitute.hellbender.utils.pairhmm.VectorPairHMMUnitTest > testLikelihoodsFromHaplotypesForAvailableImplementations SKIPPED

from master---I thought I changed the SkipException to a warning in #5026? I could be missing something, though.

See #5026 (comment) and #5026 (comment) for more details, if you haven't already.

@cmnbroad
Collaborator Author

cmnbroad commented Jun 9, 2020

Oh yeah, I see, that is strange.

@samuelklee
Contributor

Haven't seen this again, so I'm going to go ahead and close this. Not sure if this has to do with randomly getting older CPUs on Travis or something like that.
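
One way to test the "older CPUs" guess would be to log the CPU feature flags at the start of each CI job and compare passing and failing runs. The sketch below is only illustrative: the class name `CpuFlags` is hypothetical, it assumes a Linux worker that exposes `/proc/cpuinfo`, and the choice of `avx` and `avx2` as the flags of interest is an assumption based on the vectorized PairHMM being the code under test.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical diagnostic: reports CPU feature flags from /proc/cpuinfo-style text. */
final class CpuFlags {
    /** Returns the flags of the first "flags : ..." line, or an empty set if none is found. */
    static Set<String> parseFlags(List<String> cpuinfoLines) {
        for (String line : cpuinfoLines) {
            int colon = line.indexOf(':');
            if (line.startsWith("flags") && colon >= 0) {
                String[] flags = line.substring(colon + 1).trim().split("\\s+");
                return new HashSet<>(Arrays.asList(flags));
            }
        }
        return Collections.emptySet();
    }

    public static void main(String[] args) throws IOException {
        Path cpuinfo = Paths.get("/proc/cpuinfo"); // Linux only
        if (!Files.isReadable(cpuinfo)) {
            System.out.println("/proc/cpuinfo not available on this machine");
            return;
        }
        Set<String> flags = parseFlags(Files.readAllLines(cpuinfo));
        // avx/avx2 are assumed to be the relevant flags; adjust as needed.
        System.out.println("avx=" + flags.contains("avx") + " avx2=" + flags.contains("avx2"));
    }
}
```

If failing jobs consistently reported a different flag set than passing ones, that would support the hardware explanation; identical flags would point elsewhere.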

cmnbroad reopened this Jan 4, 2021
@cmnbroad
Collaborator Author

cmnbroad commented Jan 4, 2021

We've been seeing this again on travis recently (both @asmirnov and @yfarjoun have encountered it lately), so I'm reopening since the new failures look identical to this one.

@cmnbroad
Collaborator Author

cmnbroad commented Jan 5, 2021

Attached is a log file from a travis job (https://travis-ci.com/github/broadinstitute/gatk/builds/212021574) where this happened again, with the core dump file contents embedded in the log text.

log.txt

The Java stack indicates that the segfault originated in a call to jni_GetByteArrayElements made from GKL's com.intel.gkl.pairhmm.IntelPairHmm.computeLikelihoodsNative:

`Stack: [0x00007ff9430cc000,0x00007ff9431cd000], sp=0x00007ff9431c84d0, free space=1009k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x8fd0f2] jni_GetByteArrayElements+0x72
C [libgkl_pairhmm9647826338235308809.so+0x66c02] JavaData::getData(JNIEnv_, _jobjectArray&, _jobjectArray*&)+0x222

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.intel.gkl.pairhmm.IntelPairHmm.computeLikelihoodsNative([Ljava/lang/Object;[Ljava/lang/Object;[D)V+0
j com.intel.gkl.pairhmm.IntelPairHmm.computeLikelihoods([Lorg/broadinstitute/gatk/nativebindings/pairhmm/ReadDataHolder;[Lorg/broadinstitute/gatk/nativebindings/pairhmm/HaplotypeDataHolder;[D)V+9
j org.broadinstitute.hellbender.utils.pairhmm.VectorLoglessPairHMM.computeLog10Likelihoods(Lorg/broadinstitute/hellbender/utils/genotyper/LikelihoodMatrix;Ljava/util/List;Lorg/broadinstitute/hellbender/utils/pairhmm/PairHMMInputScoreImputator;)V+356
j org.broadinstitute.hellbender.utils.pairhmm.VectorPairHMMUnitTest.testLikelihoodsFromHaplotypesForAvailableImplementations()V+478
v ~StubRoutines::call_stub
J 4376 jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; java.base@11.0.2 (0 bytes) @ ...`

@rpomaris

rpomaris commented Jan 5, 2021

@SnehalA have you seen this error before? @Kmannth?

@Kmannth

Kmannth commented Jan 5, 2021

We have run the GATK unit tests a few times and we don't have any reports of this issue, but our testing has been on CentOS 7.x, which is not quite an apples-to-apples comparison with Ubuntu and Java 11.

What is a GATK git tag that this issue has been seen on?
How many different versions of Java is Travis testing with?
Has this only been seen with Java 11, or with others as well?

@cmnbroad
Collaborator Author

cmnbroad commented Jan 6, 2021

@Kmannth It happened on current master (commit 60e1aa2) yesterday, but I don't think it matters since it's been happening on numerous builds going back to at least August. Rerunning the job usually resolves it, though sometimes it takes 2 or 3 tries.

Travis runs with Java 8 and Java 11. AFAIK every time we've seen this it's been on Java 11.

@droazen
Collaborator

droazen commented Jan 25, 2021

When this is resolved, we'll need to re-enable the test disabled in #7044.
