Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement bugfix for child-workflow bug in #1138 #1144

Merged
merged 3 commits into from
Nov 4, 2021

Commits on Nov 4, 2021

  1. Demonstrate ExecuteChildWorkflow bug + prepare test for a fix

    A user noticed incorrect cancellation behavior on one of their workflows,
    which had workflow code somewhat like this:
    ```go
    // start a bunch of child workflows, add to `cfs`
    var cfs []workflow.Future
    for _, arg := range args {
      cfs = append(cfs, workflow.ExecuteChildWorkflow(ctx, "stuff", arg)
    }
    
    // wait for them to complete
    for _, f := range cfs {
      f.Get(...)
    }
    
    // run a final child workflow to do the final report
    workflow.ExecuteChildWorkflow(ctx, "final", ...).Get(ctx, nil)
    ```
    
    When they canceled their parent workflow while "stuff" was still running, it would
    wait for all the "stuff" to cancel and return (as expected)...
    ... and then it would start the "final" child, which would never actually finish
    because the previous "stuff" was canceled, not completed.
    
    For cancellation prior to calling `ExecuteChildWorkflow`, this can be worked around by
    checking `ctx.Err() == nil`, and only executing if true.
    
    For cancellation *between* `ExecuteChildWorkflow` and the child being scheduled,
    there may not be a viable workaround.  This time window is thankfully usually very
    small, so *most* workflows should not have to worry about it.
    
    ---
    
    The cause appears to be that this cancellation check in `ExecuteChildWorkflow` depends
    on `childWorkflowExecution` being non-nil (since that sends the cancellation event):
    https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L905-L917
    but that variable is only set when the child workflow's "execution" future completes
    (i.e. it has been scheduled successfully):
    https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L886-L897
    
    If cancellation occurs prior to that point, the cancellation is ignored for this child.
    Unfortunately it will also not "detect" this "lost" cancellation later in any way, so
    the child workflow acts as if it was run with a `workflow.NewDisconnectedContext`,
    though it was not.
    
    ---
    
    ... unfortunately, fixing this can cause non-deterministic replay errors for anyone
    who had previously executed the child.  For some users this is probably fine (just reset),
    but not for everyone.
    
    On a fundamental level, we need a way to safely make semantic changes (due to bugfixes or
    just improvements) in the client, and we do not seem to have an established way to do
    that yet.  Fixing this safely may require us to come up with a strategy, build that, and
    make use of it.
    Groxx committed Nov 4, 2021
    Configuration menu
    Copy the full SHA
    7d3b4c4 View commit details
    Browse the repository at this point in the history
  2. Implement bugfix for child-workflow bug in uber-go#1138

    Resolves uber-go#1138 by correcting the bug.
    
    As this is a non-backwards-compatible change, the next commit contains
    a "backport" of sorts to allow selecting the buggy behavior if necessary,
    to ease migration.
    Groxx committed Nov 4, 2021
    Configuration menu
    Copy the full SHA
    d40c5cc View commit details
    Browse the repository at this point in the history
  3. "Backported" un-bugfix in case people hit non-determinism errors

    Introduces and uses a "bug backport" config, to allow selecting old behavior
    that could not be fixed in a completely transparent way.
    
    See code comments for detailed use for this new flag.
    
    In general we should treat this config as *somewhat desirable* to keep when
    the upkeep cost is low or zero, but it does represent known tech debt that
    we need to clean up at some point.
    When adding or removing fields on it, make sure to cover it in the version notes!
    Groxx committed Nov 4, 2021
    Configuration menu
    Copy the full SHA
    2629459 View commit details
    Browse the repository at this point in the history