Demonstrate ExecuteChildWorkflow bug + prepare test for a fix #1138
Closed
Pull Request Test Coverage Report for Build 1b0d8058-d708-49f6-bfdc-30fa95821ab5
💛 - Coveralls |
A user noticed incorrect cancellation behavior on one of their workflows, which had workflow code somewhat like this:

```go
// start a bunch of child workflows, add to `cfs`
var cfs []workflow.Future
for _, arg := range args {
	cfs = append(cfs, workflow.ExecuteChildWorkflow(ctx, "stuff", arg))
}

// wait for them to complete
for _, f := range cfs {
	f.Get(...)
}

// run a final child workflow to do the final report
workflow.ExecuteChildWorkflow(ctx, "final", ...).Get(ctx, nil)
```

When they canceled their parent workflow while "stuff" was still running, it would wait for all the "stuff" to cancel and return (as expected)...

... and then it would start the "final" child, which would never actually finish because the previous "stuff" was canceled, not completed.

For cancellation prior to calling `ExecuteChildWorkflow`, this can be worked around by checking `ctx.Err() == nil`, and only executing if true.

For cancellation *between* `ExecuteChildWorkflow` and the child being scheduled, there may not be a viable workaround. This time window is thankfully usually very small, so *most* workflows should not have to worry about it.

---

The cause appears to be that this cancellation check in `ExecuteChildWorkflow` depends on `childWorkflowExecution` being non-nil (since that sends the cancellation event):
https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L905-L917

but that variable is only set when the child workflow's "execution" future completes (i.e. it has been scheduled successfully):
https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L886-L897

If cancellation occurs prior to that point, the cancellation is ignored for this child. Unfortunately, nothing "detects" this "lost" cancellation later either, so the child workflow acts as if it was run with a `workflow.NewDisconnectedContext`, though it was not.

---

... unfortunately, fixing this can cause non-deterministic replay errors for anyone who had previously executed the child. For some users this is probably fine (just reset), but not for everyone.

On a fundamental level, we need a way to safely make semantic changes (due to bugfixes or just improvements) in the client, and we do not seem to have an established way to do that yet. Fixing this safely may require us to come up with a strategy, build that, and make use of it.
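To make the "lost" cancellation described above easier to picture, here is a minimal toy model in plain Go. It is not the cadence-client code: the `child` type, its fields, and the `onParentCancel`, `onScheduled`, and `simulate` names are all hypothetical stand-ins. It only reproduces the shape of the problem, where cancellation is forwarded only if the execution info is already set, and nothing re-checks once it becomes set.

```go
package main

import "fmt"

// execution is a hypothetical stand-in for the child's execution info,
// which only becomes known once the child has been scheduled.
type execution struct{ id string }

// child is a toy model of a child workflow handle, NOT the real client type.
type child struct {
	exec            *execution // nil until the child is scheduled
	cancelRequested bool       // true once a cancel was actually forwarded
}

// onParentCancel mirrors the shape of the buggy check: the cancellation is
// only forwarded when the execution info is already non-nil.
func (c *child) onParentCancel() {
	if c.exec != nil {
		c.cancelRequested = true
	}
	// Otherwise the cancellation is silently dropped.
}

// onScheduled records the execution info. Nothing here looks at whether the
// parent was canceled in the meantime, so an earlier cancel stays lost.
func (c *child) onScheduled(id string) {
	c.exec = &execution{id: id}
}

// simulate runs the two events in the requested order and reports whether
// the cancellation ever reached the child.
func simulate(cancelBeforeScheduled bool) bool {
	c := &child{}
	if cancelBeforeScheduled {
		c.onParentCancel()
		c.onScheduled("child-1")
	} else {
		c.onScheduled("child-1")
		c.onParentCancel()
	}
	return c.cancelRequested
}

func main() {
	fmt.Println("cancel after scheduling reaches child:", simulate(false))
	fmt.Println("cancel before scheduling reaches child:", simulate(true))
}
```

In this model the second case is the problematic window: the child ends up running with no cancellation ever sent to it, which matches the "as if it was run with a disconnected context" behavior described above.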
Groxx added a commit to Groxx/cadence-client that referenced this pull request on Oct 15, 2021
A proof-of-concept that I believe resolves the bug entirely... ... but is not backwards compatible. Merging this will break any workflows currently executing the buggy behavior. While that should be fairly rare, and is likely undesirable, we should find some way to detect buggy behavior and maintain it so these workflows are not permanently broken.
Groxx added a commit to Groxx/cadence-client that referenced this pull request on Oct 28, 2021
A proof-of-concept that I believe resolves the bug entirely... ... but is not backwards compatible. Merging this will break any workflows currently executing the buggy behavior. While that should be fairly rare, and is likely undesirable, we should find some way to detect buggy behavior and maintain it so these workflows are not permanently broken.
Gonna close this, as the real fix will be in some other PR (possibly #1144), and they'll likely be merged together.
Groxx added a commit to Groxx/cadence-client that referenced this pull request on Nov 4, 2021
Resolves uber-go#1138 by correcting the bug. As this is a non-backwards-compatible change, the next commit contains a "backport" of sorts to allow selecting the buggy behavior if necessary, to ease migration.
Groxx added a commit that referenced this pull request on Nov 8, 2021
Resolves #1138 by correcting the bug. As this is a non-backwards-compatible change, the next commit contains a "backport" of sorts to allow selecting the buggy behavior if necessary, to ease migration.