Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement bugfix for child-workflow bug in #1138 #1144

Merged
merged 3 commits into from
Nov 4, 2021

Conversation

Groxx
Copy link
Contributor

@Groxx Groxx commented Oct 15, 2021

The three commits in this PR are relevant for discussion, possibly also merging:

  1. "Demonstrate ExecuteChildWorkflow bug + prepare test for a fix" is Demonstrate ExecuteChildWorkflow bug + prepare test for a fix #1138, unmodified.
  2. "[incomplete] Implement bugfix for child-workflow bug in #1138" is the "bare" bugfix, which I believe to be a complete fix... which is not backwards compatible because it changes behavior.
  3. "Backported" un-bugfix in case people hit non-determinism errors" is a controllable way to undo the bugfix, and documentation on how to use it.

We can decide whether or not the current and future complexity is worth commit 3, but I think it's... not too bad?
Other options include:

  1. Ship the fix without the backport.
    • Simpler now and in the future.
    • No remediation possible for users that break, except "reset to before bug, and/or terminate".
  2. Use GetVersion internally when the bug might be triggered, with the same logic as the field check here.
    • Simpler and invisible.
    • Not user controllable, so if they were relying on it, they'll break...... but maybe that's fine. It's not always worth fighting Hyrum's Law. If it is, we can also add commit 3's flag.
    • Requires recording that version marker for all future, non-buggy workflows, increasing size slightly.
  3. Start recording client version numbers in decision task histories, so we can access it during replay, like BinaryChecksum. Use "no version" to mean "contains bug" and act appropriately.
    • Honestly this is probably a good idea regardless, as it'd give us (and users) a lever / escape-hatch for dealing with version-specific bugs.
    • I'm not sure what this change will involve. Server + web changes maybe? If it's client-side only, I can build that instead.
    • This would behave the same as GetVersion, but (relatively) "free" from markers that a bug is being addressed, as well as for future bugs.

@coveralls
Copy link

coveralls commented Oct 15, 2021

Pull Request Test Coverage Report for Build 49c94db2-60ed-4e72-ab26-0a4e029d577a

  • 29 of 42 (69.05%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.006%) to 63.296%

Changes Missing Coverage Covered Lines Changed/Added Lines %
internal/workflow.go 24 37 64.86%
Totals Coverage Status
Change from base Build f974e15f-4f41-4413-b0bb-e32f7328b7f4: 0.006%
Covered Lines: 12256
Relevant Lines: 19363

💛 - Coveralls

@Groxx Groxx changed the title [incomplete] Implement bugfix for child-workflow bug in #1138 [discuss] Implement bugfix for child-workflow bug in #1138 Oct 28, 2021
@Groxx Groxx marked this pull request as ready for review October 28, 2021 04:46
@Groxx Groxx requested a review from a team November 3, 2021 02:25
Groxx and others added 3 commits November 4, 2021 14:59
A user noticed incorrect cancellation behavior on one of their workflows,
which had workflow code somewhat like this:
```go
// start a bunch of child workflows, add to `cfs`
var cfs []workflow.Future
for _, arg := range args {
  cfs = append(cfs, workflow.ExecuteChildWorkflow(ctx, "stuff", arg)
}

// wait for them to complete
for _, f := range cfs {
  f.Get(...)
}

// run a final child workflow to do the final report
workflow.ExecuteChildWorkflow(ctx, "final", ...).Get(ctx, nil)
```

When they canceled their parent workflow while "stuff" was still running, it would
wait for all the "stuff" to cancel and return (as expected)...
... and then it would start the "final" child, which would never actually finish
because the previous "stuff" was canceled, not completed.

For cancellation prior to calling `ExecuteChildWorkflow`, this can be worked around by
checking `ctx.Err() == nil`, and only executing if true.

For cancellation *between* `ExecuteChildWorkflow` and the child being scheduled,
there may not be a viable workaround.  This time window is thankfully usually very
small, so *most* workflows should not have to worry about it.

---

The cause appears to be that this cancellation check in `ExecuteChildWorkflow` depends
on `childWorkflowExecution` being non-nil (since that sends the cancellation event):
https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L905-L917
but that variable is only set when the child workflow's "execution" future completes
(i.e. it has been scheduled successfully):
https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L886-L897

If cancellation occurs prior to that point, the cancellation is ignored for this child.
Unfortunately it will also not "detect" this "lost" cancellation later in any way, so
the child workflow acts as if it was run with a `workflow.NewDisconnectedContext`,
though it was not.

---

... unfortunately, fixing this can cause non-deterministic replay errors for anyone
who had previously executed the child.  For some users this is probably fine (just reset),
but not for everyone.

On a fundamental level, we need a way to safely make semantic changes (due to bugfixes or
just improvements) in the client, and we do not seem to have an established way to do
that yet.  Fixing this safely may require us to come up with a strategy, build that, and
make use of it.
Resolves uber-go#1138 by correcting the bug.

As this is a non-backwards-compatible change, the next commit contains
a "backport" of sorts to allow selecting the buggy behavior if necessary,
to ease migration.
Introduces and uses a "bug backport" config, to allow selecting old behavior
that could not be fixed in a completely transparent way.

See code comments for detailed use for this new flag.

In general we should treat this config as *somewhat desirable* to keep when
the upkeep cost is low or zero, but it does represent known tech debt that
we need to clean up at some point.
When adding or removing fields on it, make sure to cover it in the version notes!
@Groxx Groxx merged commit 2629459 into uber-go:master Nov 4, 2021
@Groxx Groxx deleted the childcancelfix branch November 4, 2021 22:48
@Groxx Groxx changed the title [discuss] Implement bugfix for child-workflow bug in #1138 Implement bugfix for child-workflow bug in #1138 Nov 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants