Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix 0.18.4 #1148

Closed
wants to merge 25 commits into from
Closed

fix 0.18.4 #1148

wants to merge 25 commits into from

Commits on Aug 18, 2021

  1. Compatibility layer fix (uber-go#1120)

    * Add integration test for domain update
    
    * Fix protoUpdateDomainRequest for compatibility layer
    vytautas-karpavicius committed Aug 18, 2021
    Configuration menu
    Copy the full SHA
    eb5bff7 View commit details
    Browse the repository at this point in the history

Commits on Aug 23, 2021

  1. Clear workflow state when not cached and not complete (uber-go#1111)

    This resolves a bug a user encountered, where a full sticky cache + a query for a non-cached workflow
    would result in a goroutine leak due to the associated event handler never being shut down.
    Since this leak retains all in-workflow data, it can eventually lead to an out-of-memory crash, though
    it does not cause any logic errors (these abandoned goroutines are forever idle once abandoned).
    There are probably also other scenarios where this is possible, hopefully all caught by this addition.
    
    ---
    
    Separately: this function seems to be far too complex, and is almost certainly duplicating checks made
    elsewhere, which should not be duplicated like this.  There have been multiple issues with
    state-clearing that have lead to adding conditions to this func, which is a clear sign of a code smell.
    
    State / cache decisions like this should be made in exactly one place ever, and built up as obviously
    as possible, to ensure gaps like this never occur.  In this case we'll likely need to invert the
    dependency flow somehow, so callers control when cache is cleared based on whether or not it is cached,
    rather than double-checking internally like this.
    
    We should also probably add something like go.uber.org/goleak to our tests, to help ensure we do not
    have goroutine leaks.  This may not have been caught by that, as the steps leading to it are a bit odd
    and rely on singleton config (sticky cache size), but it may find or prevent others.  With this PR we now
    have at least one test using it, but it'll probably take some time to roll out all over, and to add missing
    test-state cleanup funcs.
    Groxx authored and vytautas-karpavicius committed Aug 23, 2021
    Configuration menu
    Copy the full SHA
    ed86b8a View commit details
    Browse the repository at this point in the history

Commits on Oct 6, 2021

  1. Minor cleanup (uber-go#1106)

    Unnecessary indirection and reflection, these are all struct-pointers.  != nil is identical and more straightforward.
    
    Created with `gofmt -w -s -r '!isInterfaceNil(a) -> a != nil' internal/internal_worker.go`
    Groxx committed Oct 6, 2021
    Configuration menu
    Copy the full SHA
    b32ef3e View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    ae45c0f View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    167c643 View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    2017e60 View commit details
    Browse the repository at this point in the history
  5. Close the test dispatcher when completing (uber-go#1117)

    Currently, a test workflow like this will not ever run the code after the goroutine's sleep (correct),
    nor the code in the defer (incorrect, production would run it):
    ```
    func(ctx workflow.Context) error {
      workflow.Go(func(ctx workflow.Context) {
        defer func() { fmt.Println("in defer") }()
        workflow.Sleep(ctx, time.Hour) // wait longer than the workflow lives
        fmt.Println("after sleep")
      })
      workflow.Sleep(ctx, time.Minute) // just to make sure the goroutine starts
      return nil
    }
    ```
    The workflow will correctly end, but since the dispatcher was never closed, any not-yet-complete
    goroutines would never exit, and we'd leak goroutines.
    
    Semantically this should be a safe change:
    - Any post-complete decisions would not be executed or recorded, and this retains that.
    - When panicking, anything that would be *recorded* in a defer will not be recorded, so no
      replay-state-aware user code should be affected.  And any code that ignores replay state will
      now execute like it should, where before it would not.
    
    So safe / correct code should be unaffected, leaks should be reduced, and latent mistakes should
    now cause errors.  AFAICT - I'm not sure how complete our tests are here :)
    
    There's some room for in-defer code to be semantically incorrect in tests without this fix, (e.g. testing
    custom logger/metric impls in defers), though I expect those to be very rare bordering on nonexistent.
    But for the most part I expect that people will not notice this change, they'll just have fewer goroutine
    leaks during tests (so e.g. https://github.com/uber-go/goleak users will be happy).
    
    ---
    
    Prior to this fix, the added test fails with:
    ```
    === RUN   TestWorkflowUnitTest/Test_StaleGoroutinesAreShutDown
        internal_workflow_test.go:1210: 
            	Error Trace:	internal_workflow_test.go:1210
            	Error:      	deferred func should have been called within 1 second
            	Test:       	TestWorkflowUnitTest/Test_StaleGoroutinesAreShutDown
        internal_workflow_test.go:1216: code after sleep correctly not executed
    ```
    Now it passes with this, which also shows it's not slowing tests down in any meaningful way:
    ```
    === RUN   TestWorkflowUnitTest/Test_StaleGoroutinesAreShutDown
        internal_workflow_test.go:1210: deferred callback executed after 9.177µs
        internal_workflow_test.go:1217: code after sleep correctly not executed
    ```
    Groxx committed Oct 6, 2021
    Configuration menu
    Copy the full SHA
    b2db20e View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    332c2b4 View commit details
    Browse the repository at this point in the history
  7. Configuration menu
    Copy the full SHA
    982628d View commit details
    Browse the repository at this point in the history
  8. Configuration menu
    Copy the full SHA
    222e0cb View commit details
    Browse the repository at this point in the history
  9. Configuration menu
    Copy the full SHA
    8c34519 View commit details
    Browse the repository at this point in the history
  10. Configuration menu
    Copy the full SHA
    2e7438c View commit details
    Browse the repository at this point in the history

Commits on Nov 8, 2021

  1. Configuration menu
    Copy the full SHA
    75506d3 View commit details
    Browse the repository at this point in the history
  2. Add a random UUID to worker identities, to prevent collisions (uber-g…

    …o#1135)
    
    The fallback worker identity of {pid}@{host}@{tasklist} is insufficient in practice.
    E.g. within Uber, our docker setup can somewhat regularly lead to multiple instances
    of a worker ending up on the same physical host (which is not changed by docker) and
    with the same PID (as all the docker containers start up the same way -> they all
    get the same PIDs).
    Other causes are possible too, e.g. if a worker crashes and is restarted it could
    share the same host+pid, even though its caches (and anything else we actually care
    about for the identity) are lost.
    
    At the least-problematic-but-confusing level, this can lead to a misleadingly-short
    list of workers on tasklists in the web UI, as duplicates are collapsed.
    
    At the most-problematic level, this can lead to uneven load and sub-par request
    routing, as only the first/last/??? request to poll the server will receive data
    intended for a specific worker.
    
    Ideally that wouldn't be an issue, e.g. the server could keep better track of polls
    and route more precisely, but 1) it's difficult to reliably identify polls since
    the server only has the poller-provided data (which mis-identifies itself), and 2)
    even if it can be fixed on the server, improving identity uniqueness helps both fix
    this now and reduces or eliminates the impact of any future identity confusions.
    
    ---
    
    Alternatively, we could add a process-global random UUID that auto-inits, which
    would give us IDs that can be correlated across tasklists.  While I think this is
    acceptable as well, I'm loathe to add another global variable, and the benefit of
    cross-task identity correlating is pretty minor or nonexistent.
    
    If someone actually cares about that, they can pass an explicit ID that they
    generate however they like.
    Groxx committed Nov 8, 2021
    Configuration menu
    Copy the full SHA
    bfbc369 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    0fb34ee View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    7de3d62 View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    7044e1d View commit details
    Browse the repository at this point in the history
  6. Creating cadence.IsWorkflowError helper (uber-go#1145)

    We have a variety of from-workflow errors that are relevant for users calling
    `client.ExecuteWorkflow(...).Get(...)` and `client.GetWorkflow(...).Get(...)`.
    
    Since they do not share any common wrapper or "parent type" that would allow
    easy comparison with an error type check or `errors.As`, they are rather
    annoying to differentiate from general RPC errors like timeouts or connection
    failures.
    
    So this adds a top-level "check all relevant types" helper, which is
    intended to be a stable target for users to rely on, even if we add more
    types later.  It intentionally does not tell *which* kind of error it is, in
    case there are multiple types in the wrapped chain, as then behavior would
    be order-dependent and any change risks breaking someone's code.
    Groxx committed Nov 8, 2021
    Configuration menu
    Copy the full SHA
    b886df3 View commit details
    Browse the repository at this point in the history
  7. New docs section for canceled-context in workflows, other minor clean…

    …ups (uber-go#1134)
    
    Canceled-context documentation was strewn about in individual function documentation, though it's a fairly important concept that needs to be understood.
    
    In particular, users are sometimes under the impression that "all things which accept a context will error", which can lead to nasty surprises like `selector.Select(ctx)`'s blocking behavior (causing a deadlocked workflow).  Selector's docs have been improved in uber-go#1131, and this will hopefully make its behavior easier to learn before it is needed.
    
    And, since I noticed some small other things while writing this chunk of docs, I've rolled those changes into here as well.
    Groxx committed Nov 8, 2021
    Configuration menu
    Copy the full SHA
    1d7aa64 View commit details
    Browse the repository at this point in the history
  8. Demonstrate ExecuteChildWorkflow bug + prepare test for a fix

    A user noticed incorrect cancellation behavior on one of their workflows,
    which had workflow code somewhat like this:
    ```go
    // start a bunch of child workflows, add to `cfs`
    var cfs []workflow.Future
    for _, arg := range args {
      cfs = append(cfs, workflow.ExecuteChildWorkflow(ctx, "stuff", arg)
    }
    
    // wait for them to complete
    for _, f := range cfs {
      f.Get(...)
    }
    
    // run a final child workflow to do the final report
    workflow.ExecuteChildWorkflow(ctx, "final", ...).Get(ctx, nil)
    ```
    
    When they canceled their parent workflow while "stuff" was still running, it would
    wait for all the "stuff" to cancel and return (as expected)...
    ... and then it would start the "final" child, which would never actually finish
    because the previous "stuff" was canceled, not completed.
    
    For cancellation prior to calling `ExecuteChildWorkflow`, this can be worked around by
    checking `ctx.Err() == nil`, and only executing if true.
    
    For cancellation *between* `ExecuteChildWorkflow` and the child being scheduled,
    there may not be a viable workaround.  This time window is thankfully usually very
    small, so *most* workflows should not have to worry about it.
    
    ---
    
    The cause appears to be that this cancellation check in `ExecuteChildWorkflow` depends
    on `childWorkflowExecution` being non-nil (since that sends the cancellation event):
    https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L905-L917
    but that variable is only set when the child workflow's "execution" future completes
    (i.e. it has been scheduled successfully):
    https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L886-L897
    
    If cancellation occurs prior to that point, the cancellation is ignored for this child.
    Unfortunately it will also not "detect" this "lost" cancellation later in any way, so
    the child workflow acts as if it was run with a `workflow.NewDisconnectedContext`,
    though it was not.
    
    ---
    
    ... unfortunately, fixing this can cause non-deterministic replay errors for anyone
    who had previously executed the child.  For some users this is probably fine (just reset),
    but not for everyone.
    
    On a fundamental level, we need a way to safely make semantic changes (due to bugfixes or
    just improvements) in the client, and we do not seem to have an established way to do
    that yet.  Fixing this safely may require us to come up with a strategy, build that, and
    make use of it.
    Groxx committed Nov 8, 2021
    Configuration menu
    Copy the full SHA
    f90b46c View commit details
    Browse the repository at this point in the history
  9. Implement bugfix for child-workflow bug in uber-go#1138

    Resolves uber-go#1138 by correcting the bug.
    
    As this is a non-backwards-compatible change, the next commit contains
    a "backport" of sorts to allow selecting the buggy behavior if necessary,
    to ease migration.
    Groxx committed Nov 8, 2021
    Configuration menu
    Copy the full SHA
    4d64c01 View commit details
    Browse the repository at this point in the history
  10. "Backported" un-bugfix in case people hit non-determinism errors

    Introduces and uses a "bug backport" config, to allow selecting old behavior
    that could not be fixed in a completely transparent way.
    
    See code comments for detailed use for this new flag.
    
    In general we should treat this config as *somewhat desirable* to keep when
    the upkeep cost is low or zero, but it does represent known tech debt that
    we need to clean up at some point.
    When adding or removing fields on it, make sure to cover it in the version notes!
    Groxx committed Nov 8, 2021
    Configuration menu
    Copy the full SHA
    9871428 View commit details
    Browse the repository at this point in the history
  11. Bump cadence.Version to 0.18.4 (uber-go#1146)

    Prepwork for a release
    Groxx committed Nov 8, 2021
    Configuration menu
    Copy the full SHA
    a6b3235 View commit details
    Browse the repository at this point in the history

Commits on Nov 9, 2021

  1. Revert "Added 2-way proto-thrift mapper (uber-go#1130)"

    This reverts commit 0fb34ee.
    
    Panicking due to null run-id values on children-cancellation, possibly elsewhere.
    Groxx committed Nov 9, 2021
    Configuration menu
    Copy the full SHA
    1d1f091 View commit details
    Browse the repository at this point in the history
  2. fmt, removing globals from tests, scoping test logs (uber-go#1142)

    * Removing some globals from tests
    * Use test-scoped loggers everywhere
    
    Conflicted with jwt-related things, so those have been removed.
    Otherwise no issues.
    Groxx committed Nov 9, 2021
    Configuration menu
    Copy the full SHA
    708e8bb View commit details
    Browse the repository at this point in the history