
Demonstrate ExecuteChildWorkflow bug + prepare test for a fix #1138

Closed
Groxx wants to merge 1 commit

Conversation


@Groxx Groxx commented Oct 7, 2021

A user noticed incorrect cancellation behavior on one of their workflows,
which had workflow code somewhat like this:

```go
// start a bunch of child workflows, add to `cfs`
var cfs []workflow.Future
for _, arg := range args {
  cfs = append(cfs, workflow.ExecuteChildWorkflow(ctx, "stuff", arg))
}

// wait for them to complete
for _, f := range cfs {
  f.Get(...)
}

// run a final child workflow to do the final report
workflow.ExecuteChildWorkflow(ctx, "final", ...).Get(ctx, nil)
```

When they canceled their parent workflow while "stuff" was still running, it would
wait for all the "stuff" to cancel and return (as expected)...
... and then it would start the "final" child, which would never actually finish
because the previous "stuff" was canceled, not completed.

For cancellation prior to calling `ExecuteChildWorkflow`, this can be worked around by
checking `ctx.Err() == nil`, and only executing if true.
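
A minimal sketch of that workaround, using the "final" child from the example above (args and error handling are illustrative):

```go
// Skip starting the final child if the parent context was already canceled
// before we got here. (Sketch only; args elided as in the example above.)
if ctx.Err() == nil {
  if err := workflow.ExecuteChildWorkflow(ctx, "final").Get(ctx, nil); err != nil {
    // handle the final child's failure or cancellation
  }
}
```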

For cancellation *between* `ExecuteChildWorkflow` and the child being scheduled,
there may not be a viable workaround. This time window is thankfully usually very
small, so *most* workflows should not have to worry about it.

---

The cause appears to be that this cancellation check in `ExecuteChildWorkflow` depends
on `childWorkflowExecution` being non-nil (since that sends the cancellation event):
https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L905-L917
```go
if cancellable {
	cancellationCallback.fn = func(v interface{}, more bool) bool {
		if ctx.Err() == ErrCanceled && childWorkflowExecution != nil && !mainFuture.IsReady() {
			// child workflow started, and ctx cancelled
			getWorkflowEnvironment(ctx).RequestCancelChildWorkflow(*options.domain, childWorkflowExecution.ID)
		}
		return false
	}
	_, ok, more := ctxDone.receiveAsyncImpl(cancellationCallback)
	if ok || !more {
		cancellationCallback.fn(nil, more)
	}
}
```

but that variable is only set when the child workflow's "execution" future completes
(i.e. it has been scheduled successfully):
https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L886-L897
```go
err = getWorkflowEnvironment(ctx).ExecuteChildWorkflow(params, func(r []byte, e error) {
	mainSettable.Set(r, e)
	if cancellable {
		// future is done, we don't need cancellation anymore
		ctxDone.removeReceiveCallback(cancellationCallback)
	}
}, func(r WorkflowExecution, e error) {
	if e == nil {
		childWorkflowExecution = &r
	}
	executionSettable.Set(r, e)
})
```

If cancellation occurs prior to that point, the cancellation is ignored for this child.
Unfortunately it will also not "detect" this "lost" cancellation later in any way, so
the child workflow acts as if it was run with a `workflow.NewDisconnectedContext`,
though it was not.
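
For comparison, the documented way to deliberately opt a child out of parent cancellation is an explicit disconnected context; the bug effectively produces the same behavior without the workflow author asking for it. A minimal sketch (the "cleanup" child here is hypothetical):

```go
// Explicitly disconnected child: it keeps running even after the parent's ctx
// is canceled. The "cleanup" child workflow is hypothetical, shown only to
// contrast with the accidental disconnected-like behavior described above.
dcCtx := workflow.NewDisconnectedContext(ctx)
if err := workflow.ExecuteChildWorkflow(dcCtx, "cleanup").Get(dcCtx, nil); err != nil {
  // handle the cleanup child's error
}
```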


---

... unfortunately, fixing this can cause non-deterministic replay errors for anyone
who had previously executed the child. For some users this is probably fine (just reset),
but not for everyone.

On a fundamental level, we need a way to safely make semantic changes (due to bugfixes or
just improvements) in the client, and we do not seem to have an established way to do
that yet. Fixing this safely may require us to come up with a strategy, build that, and
make use of it.
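
For workflow *code* changes there is at least an established pattern, `workflow.GetVersion`; a minimal sketch of it follows purely as an analogy (the change-id and both branches are hypothetical), since the client-library change discussed here has no equivalent mechanism yet:

```go
// Existing versioning pattern for workflow *code* changes, shown only as an
// analogy for the kind of mechanism the client itself would need; the
// change-id "child-cancel-fix" and both branches are hypothetical.
v := workflow.GetVersion(ctx, "child-cancel-fix", workflow.DefaultVersion, 1)
if v == workflow.DefaultVersion {
  // old behavior, kept for histories recorded before the change
} else {
  // corrected behavior for new executions
}
```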


coveralls commented Oct 7, 2021

Pull Request Test Coverage Report for Build 1b0d8058-d708-49f6-bfdc-30fa95821ab5

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 63.17%

Totals Coverage Status
  • Change from base Build 23625b2f-f895-4e62-be1e-e45768586dbe: 0.01%
  • Covered Lines: 12176
  • Relevant Lines: 19275

💛 - Coveralls

@Groxx Groxx requested a review from a team October 7, 2021 21:03
Groxx added a commit to Groxx/cadence-client that referenced this pull request Oct 15, 2021
A proof-of-concept that I believe resolves the bug entirely...
... but is not backwards compatible.

Merging this will break any workflows currently executing the buggy
behavior.  While that should be fairly rare, and is likely undesirable,
we should find some way to detect buggy behavior and maintain it so
these workflows are not permanently broken.
Groxx added a commit to Groxx/cadence-client that referenced this pull request Oct 28, 2021
A proof-of-concept that I believe resolves the bug entirely...
... but is not backwards compatible.

Merging this will break any workflows currently executing the buggy
behavior.  While that should be fairly rare, and is likely undesirable,
we should find some way to detect buggy behavior and maintain it so
these workflows are not permanently broken.

Groxx commented Oct 29, 2021

Gonna close this, as the real fix will be in some other PR (possibly #1144), and they'll likely be merged together.

@Groxx Groxx closed this Oct 29, 2021
Groxx added a commit to Groxx/cadence-client that referenced this pull request Nov 4, 2021
Resolves uber-go#1138 by correcting the bug.

As this is a non-backwards-compatible change, the next commit contains
a "backport" of sorts to allow selecting the buggy behavior if necessary,
to ease migration.
@Groxx Groxx deleted the childcancelbug branch November 4, 2021 22:49
Groxx added a commit that referenced this pull request Nov 8, 2021
Resolves #1138 by correcting the bug.

As this is a non-backwards-compatible change, the next commit contains
a "backport" of sorts to allow selecting the buggy behavior if necessary,
to ease migration.
@Groxx Groxx mentioned this pull request Nov 9, 2021