Skip to content

Commit

Permalink
Demonstrate ExecuteChildWorkflow bug + prepare test for a fix
Browse files Browse the repository at this point in the history
A user noticed incorrect cancellation behavior on one of their workflows,
which had workflow code somewhat like this:
```go
// start a bunch of child workflows, add to `cfs`
var cfs []workflow.Future
for _, arg := range args {
  cfs = append(cfs, workflow.ExecuteChildWorkflow(ctx, "stuff", arg)
}

// wait for them to complete
for _, f := range cfs {
  f.Get(...)
}

// run a final child workflow to do the final report
workflow.ExecuteChildWorkflow(ctx, "final", ...).Get(ctx, nil)
```

When they canceled their parent workflow while "stuff" was still running, it would
wait for all the "stuff" to cancel and return (as expected)...
... and then it would start the "final" child, which would never actually finish
because the previous "stuff" was canceled, not completed.

Cancellation prior to calling `ExecuteChildWorkflow`, this can be worked around by
checking `ctx.Err() == nil`, and only executing if true.

For cancellation *between* `ExecuteChildWorkflow` and the child being scheduled,
there may not be a viable workaround.  This time window is thankfully usually very
small, so *most* workflows should not have to worry about it.

---

The cause appears to be that this cancellation check in `ExecuteChildWorkflow` depends
on `childWorkflowExecution` being non-nil (since that sends the cancellation event):
https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L905-L917
but that variable is only set when the child workflow's "execution" future completes
(i.e. it has been scheduled successfully):
https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L886-L897

If cancellation occurs prior to that point, the cancellation is ignored for this child.
Unfortunately it will also not "detect" this "lost" cancellation later in any way, so
the child workflow acts as if it was run with a `workflow.NewDisconnectedContext`,
though it was not.

---

... unfortunately, fixing this can cause non-deterministic replay errors for anyone
who had previously executed the child.  For some users this is probably fine (just reset),
but not for everyone.

On a fundamental level, we need a way to safely make semantic changes (due to bugfixes or
just improvements) in the client, and we do not seem to have an established way to do
that yet.  Fixing this safely may require us to come up with a strategy, build that, and
make use of it.
  • Loading branch information
Groxx committed Oct 7, 2021
1 parent 338aad0 commit 345e956
Showing 1 changed file with 60 additions and 0 deletions.
60 changes: 60 additions & 0 deletions internal/internal_workflow_testsuite_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -3098,3 +3098,63 @@ func (s *WorkflowTestSuiteUnitTest) Test_AwaitWithTimeout() {
_ = env.GetWorkflowResult(&result)
s.False(result)
}

func (s *WorkflowTestSuiteUnitTest) Test_Regression_ExecuteChildWorkflowWithCanceledContext() {
// cancelTime of:
// - <0 == do not cancel
// - 0 == cancel synchronously
// - >0 == cancel after waiting that long
check := func(cancelTime time.Duration, expected string) {
env := s.NewTestWorkflowEnvironment()
env.Test(s.T())
env.RegisterWorkflowWithOptions(func(ctx Context) error {
return Sleep(ctx, time.Minute)
}, RegisterWorkflowOptions{Name: "child"})
env.RegisterWorkflowWithOptions(func(ctx Context) (string, error) {
ctx, cancel := WithCancel(ctx)
if cancelTime == 0 {
cancel()
} else if cancelTime > 0 {
Go(ctx, func(ctx Context) {
_ = Sleep(ctx, cancelTime)
cancel()
})
}

ctx = WithChildWorkflowOptions(ctx, ChildWorkflowOptions{
ExecutionStartToCloseTimeout: 2 * time.Minute,
TaskStartToCloseTimeout: 2 * time.Minute,
})
err := ExecuteChildWorkflow(ctx, "child").Get(ctx, nil)

if err == nil {
return "no err", nil
} else if _, ok := err.(*CanceledError); ok {
return "canceled", nil
}
return "unknown: " + err.Error(), nil
}, RegisterWorkflowOptions{Name: "parent"})

env.ExecuteWorkflow("parent")
s.True(env.IsWorkflowCompleted())
s.NoError(env.GetWorkflowError())

var result string
s.NoError(env.GetWorkflowResult(&result))
s.Equal(expected, result)
}
s.Run("sanity check", func() {
// workflow should run the child successfully normally...
check(-1, "no err")
})
s.Run("canceled after child starts", func() {
// ... and cancel the child when the child is canceled...
check(30*time.Second, "canceled")
})
s.Run("canceled before child starts", func() {
// ... and should not start the child (i.e. be canceled) when canceled before it is started.
check(0, "no err") // but it does not! this is a bug to fix.
// this should be:
// check(0, "canceled")
})
}

0 comments on commit 345e956

Please sign in to comment.