Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutex is unlocked on cancellation #4145

Open
jossiwolf opened this issue May 31, 2024 · 2 comments
Open

Mutex is unlocked on cancellation #4145

jossiwolf opened this issue May 31, 2024 · 2 comments

Comments

@jossiwolf
Copy link

jossiwolf commented May 31, 2024

Possibly related to #2683.

Hi, we have had a bug report where users are seeing an exception from Mutex bubble up.

Here's the stack trace a user shared:

kotlinx.coroutines.CompletionHandlerException: Exception in resume onCancellation handler for CancellableContinuation(DispatchedContinuation[p@710c66a, Continuation at ba.u$b@e94f75b]){Completed}@42d3bf8
	at kotlinx.coroutines.CancellableContinuationImpl.cancelCompletedResult$kotlinx_coroutines_core(CancellableContinuationImpl.java:264)
	at kotlinx.coroutines.CancellableContinuationImpl.callCancelHandler(CancellableContinuationImpl.java:264)
	at kotlinx.coroutines.CancellableContinuationImpl.callSegmentOnCancellation(CancellableContinuationImpl.java:264)
	at kotlinx.coroutines.CancellableContinuationImpl.callOnCancellation(CancellableContinuationImpl.java:264)
	at kotlinx.coroutines.CompletedContinuation.invokeHandlers(CompletedContinuation.java:659)
	at kotlinx.coroutines.CancellableContinuationImpl.cancelCompletedResult$kotlinx_coroutines_core(CancellableContinuationImpl.java:180)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.java:102)
	at androidx.compose.ui.platform.AndroidUiDispatcher.performTrampolineDispatch(AndroidUiDispatcher.java:81)
	at androidx.compose.ui.platform.AndroidUiDispatcher.access$performTrampolineDispatch(AndroidUiDispatcher.java:41)
	at androidx.compose.ui.platform.AndroidUiDispatcher.access$performFrameDispatch(AndroidUiDispatcher.java:41)
	at androidx.compose.ui.platform.AndroidUiDispatcher.access$setScheduledFrameDispatch$p(AndroidUiDispatcher.java:41)
	at androidx.compose.ui.platform.AndroidUiDispatcher$dispatchCallback$1.run(AndroidUiDispatcher.java:57)
	at android.os.Handler.handleCallback(Handler.java:883)
	at android.os.Handler.dispatchMessage(Handler.java:100)
	at android.os.Looper.loop(Looper.java:241)
	at android.app.ActivityThread.main(ActivityThread.java:7604)
	at java.lang.reflect.Method.invoke(Method.java)
	at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492)
	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:941)
Caused by: java.lang.IllegalStateException: This mutex is not locked
	at kotlinx.coroutines.sync.MutexImpl.unlock(MutexImpl.java:213)
	at kotlinx.coroutines.sync.MutexImpl$CancellableContinuationWithOwner$tryResume$token$1.invoke(MutexImpl.java:261)
	at kotlinx.coroutines.sync.MutexImpl$CancellableContinuationWithOwner$tryResume$token$1.invoke(MutexImpl.java:258)
	at kotlinx.coroutines.CancellableContinuationImpl.cancelCompletedResult$kotlinx_coroutines_core(CancellableContinuationImpl.java:259)
	at kotlinx.coroutines.CancellableContinuationImpl.callCancelHandler(CancellableContinuationImpl.java:259)
	at kotlinx.coroutines.CancellableContinuationImpl.callSegmentOnCancellation(CancellableContinuationImpl.java:259)
	at kotlinx.coroutines.CancellableContinuationImpl.callOnCancellation(CancellableContinuationImpl.java:259)
	at kotlinx.coroutines.CompletedContinuation.invokeHandlers(CompletedContinuation.java:659)
	at kotlinx.coroutines.CancellableContinuationImpl.cancelCompletedResult$kotlinx_coroutines_core(CancellableContinuationImpl.java:180)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.java:102)
	at androidx.compose.ui.platform.AndroidUiDispatcher.performTrampolineDispatch(AndroidUiDispatcher.java:81)
	at androidx.compose.ui.platform.AndroidUiDispatcher.access$performTrampolineDispatch(AndroidUiDispatcher.java:41)
	at androidx.compose.ui.platform.AndroidUiDispatcher.access$performFrameDispatch(AndroidUiDispatcher.java:41)
	at androidx.compose.ui.platform.AndroidUiDispatcher.access$setScheduledFrameDispatch$p(AndroidUiDispatcher.java:41)
	at androidx.compose.ui.platform.AndroidUiDispatcher$dispatchCallback$1.run(AndroidUiDispatcher.java:57)
	at android.os.Handler.handleCallback(Handler.java:883)
	at android.os.Handler.dispatchMessage(Handler.java:100)
	at android.os.Looper.loop(Looper.java:241)
	at android.app.ActivityThread.main(ActivityThread.java:7604)
	at java.lang.reflect.Method.invoke(Method.java)
	at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:492)
	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:941)

Where Continuation at ba.u$b is: androidx.compose.foundation.gestures.PressGestureScopeImpl$reset$1 -> ba.u$b:. reset is here.

Users are seeing this on Samsung, Motorola and Xioami devices, on SDK 30 - 34.
From the stack trace, the only explanation I can think of is this sequence:

  1. Launch a coroutine that locks the mutex, and one that unlocks the mutex so they are dispatched in order
  2. Cancel the locking job

If acquiring the semaphore takes longer than dispatching the unlock, the mutex would already be unlocked when the continuation's cancellation clause is invoked. Although I don't have an explanation why the permit acquisition would take longer.

A similar scenario is also outlined by @dkhalanskyjb here. From an API PoV, I am not sure how to guard against this, but the cancellation of a lock acquisition should always be safe as we would otherwise need to track ongoing acquisitions.

Provide a Reproducer

No repro case just yet, will update if I find one. I could imagine something like this:

val scope = PressGestureScopeImpl()
launch(start = Undispatched) { scope.reset() }
scope.cancel()

Version: Users originally reported this starting in Compose Foundation 1.4.3, where we were using Coroutines 1.6.4. Users have also reported this in Compose Foundation 1.6.0, where we are using Coroutines 1.7.3.

Let me know if there is anything else I can provide. Thanks!

@jossiwolf jossiwolf added the bug label May 31, 2024
@qwwdfsad qwwdfsad self-assigned this Jun 18, 2024
@qwwdfsad
Copy link
Collaborator

Steps to reproduce:

@Test
fun repro() = runBlocking<Unit> {
    val d = Dispatchers.Default.limitedParallelism(1)
    repeat(5) { // <- will spam into stderr, adjust as you see reproducible
        println("Iter $it")
        val m = Mutex()
        m.lock()

        launch(d, start = CoroutineStart.UNDISPATCHED) {
            m.lock()
        }

        launch(d) {
            m.unlock()
        }
        m.unlock()

        for (child in coroutineContext.job.children) {
            child.cancelAndJoin()
        }
    }
}

@qwwdfsad
Copy link
Collaborator

but the cancellation of a lock acquisition should always be safe as we would otherwise need to track ongoing acquisitions.

The code in the example still has a bug. Even if lock manages to do the unlocking "successfully" (in our scenario, basically ignoring the fact that the mutex is not locked when we expect it to be locked), a different scenario is possible:
locking coroutine successfully releases the lock, and the unlocking coroutine tries to unlock non-locked mutex, leading to IllegalStateException.

The modified repro consistently has two failure modes:

@Test
fun repro() = runBlocking<Unit> {
    val d = Dispatchers.Default.limitedParallelism(1)
    repeat(1_000_000) {
        val m = Mutex()
        m.lock()

        launch(d) {
            m.lock() // Can fail here because cancellation has lost the race to unlock
        }

        launch(d) {
            m.unlock() // Can fail here as well because mutex wasn't locked by the first coroutine
        }
        m.unlock()

        for (child in coroutineContext.job.children) {
            child.cancelAndJoin()
        }
    }
}

It doesn't seem like we can do much (assuming that my reproducer reflects the actual failure mode).
Attempt to unlock mutex from a different coroutine in the face of asynchronous cancellation is a datarace by design: it might be the case that the locking coroutine was cancelled before the lock, and then the unlocking coroutine is still here trying to unlock non-locked mutex.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants