fix 0.18.4 #1148

* Add integration test for domain update * Fix protoUpdateDomainRequest for compatibility layer

This resolves a bug a user encountered, where a full sticky cache + a query for a non-cached workflow would result in a goroutine leak due to the associated event handler never being shut down. Since this leak retains all in-workflow data, it can eventually lead to an out-of-memory crash, though it does not cause any logic errors (these abandoned goroutines are forever idle once abandoned). There are probably also other scenarios where this is possible, hopefully all caught by this addition. --- Separately: this function seems to be far too complex, and is almost certainly duplicating checks made elsewhere, which should not be duplicated like this. There have been multiple issues with state-clearing that have lead to adding conditions to this func, which is a clear sign of a code smell. State / cache decisions like this should be made in exactly one place ever, and built up as obviously as possible, to ensure gaps like this never occur. In this case we'll likely need to invert the dependency flow somehow, so callers control when cache is cleared based on whether or not it is cached, rather than double-checking internally like this. We should also probably add something like go.uber.org/goleak to our tests, to help ensure we do not have goroutine leaks. This may not have been caught by that, as the steps leading to it are a bit odd and rely on singleton config (sticky cache size), but it may find or prevent others. With this PR we now have at least one test using it, but it'll probably take some time to roll out all over, and to add missing test-state cleanup funcs.

Unnecessary indirection and reflection, these are all struct-pointers. != nil is identical and more straightforward. Created with `gofmt -w -s -r '!isInterfaceNil(a) -> a != nil' internal/internal_worker.go`

Currently, a test workflow like this will not ever run the code after the goroutine's sleep (correct), nor the code in the defer (incorrect, production would run it): ``` func(ctx workflow.Context) error { workflow.Go(func(ctx workflow.Context) { defer func() { fmt.Println("in defer") }() workflow.Sleep(ctx, time.Hour) // wait longer than the workflow lives fmt.Println("after sleep") }) workflow.Sleep(ctx, time.Minute) // just to make sure the goroutine starts return nil } ``` The workflow will correctly end, but since the dispatcher was never closed, any not-yet-complete goroutines would never exit, and we'd leak goroutines. Semantically this should be a safe change: - Any post-complete decisions would not be executed or recorded, and this retains that. - When panicking, anything that would be *recorded* in a defer will not be recorded, so no replay-state-aware user code should be affected. And any code that ignores replay state will now execute like it should, where before it would not. So safe / correct code should be unaffected, leaks should be reduced, and latent mistakes should now cause errors. AFAICT - I'm not sure how complete our tests are here :) There's some room for in-defer code to be semantically incorrect in tests without this fix, (e.g. testing custom logger/metric impls in defers), though I expect those to be very rare bordering on nonexistent. But for the most part I expect that people will not notice this change, they'll just have fewer goroutine leaks during tests (so e.g. https://github.com/uber-go/goleak users will be happy). --- Prior to this fix, the added test fails with: ``` === RUN TestWorkflowUnitTest/Test_StaleGoroutinesAreShutDown internal_workflow_test.go:1210: Error Trace: internal_workflow_test.go:1210 Error: deferred func should have been called within 1 second Test: TestWorkflowUnitTest/Test_StaleGoroutinesAreShutDown internal_workflow_test.go:1216: code after sleep correctly not executed ``` Now it passes with this, which also shows it's not slowing tests down in any meaningful way: ``` === RUN TestWorkflowUnitTest/Test_StaleGoroutinesAreShutDown internal_workflow_test.go:1210: deferred callback executed after 9.177µs internal_workflow_test.go:1217: code after sleep correctly not executed ```

…1131)

…sting (uber-go#1137)

)

…o#1135) The fallback worker identity of {pid}@{host}@{tasklist} is insufficient in practice. E.g. within Uber, our docker setup can somewhat regularly lead to multiple instances of a worker ending up on the same physical host (which is not changed by docker) and with the same PID (as all the docker containers start up the same way -> they all get the same PIDs). Other causes are possible too, e.g. if a worker crashes and is restarted it could share the same host+pid, even though its caches (and anything else we actually care about for the identity) are lost. At the least-problematic-but-confusing level, this can lead to a misleadingly-short list of workers on tasklists in the web UI, as duplicates are collapsed. At the most-problematic level, this can lead to uneven load and sub-par request routing, as only the first/last/??? request to poll the server will receive data intended for a specific worker. Ideally that wouldn't be an issue, e.g. the server could keep better track of polls and route more precisely, but 1) it's difficult to reliably identify polls since the server only has the poller-provided data (which mis-identifies itself), and 2) even if it can be fixed on the server, improving identity uniqueness helps both fix this now and reduces or eliminates the impact of any future identity confusions. --- Alternatively, we could add a process-global random UUID that auto-inits, which would give us IDs that can be correlated across tasklists. While I think this is acceptable as well, I'm loathe to add another global variable, and the benefit of cross-task identity correlating is pretty minor or nonexistent. If someone actually cares about that, they can pass an explicit ID that they generate however they like.

We have a variety of from-workflow errors that are relevant for users calling `client.ExecuteWorkflow(...).Get(...)` and `client.GetWorkflow(...).Get(...)`. Since they do not share any common wrapper or "parent type" that would allow easy comparison with an error type check or `errors.As`, they are rather annoying to differentiate from general RPC errors like timeouts or connection failures. So this adds a top-level "check all relevant types" helper, which is intended to be a stable target for users to rely on, even if we add more types later. It intentionally does not tell *which* kind of error it is, in case there are multiple types in the wrapped chain, as then behavior would be order-dependent and any change risks breaking someone's code.

…ups (uber-go#1134) Canceled-context documentation was strewn about in individual function documentation, though it's a fairly important concept that needs to be understood. In particular, users are sometimes under the impression that "all things which accept a context will error", which can lead to nasty surprises like `selector.Select(ctx)`'s blocking behavior (causing a deadlocked workflow). Selector's docs have been improved in uber-go#1131, and this will hopefully make its behavior easier to learn before it is needed. And, since I noticed some small other things while writing this chunk of docs, I've rolled those changes into here as well.

A user noticed incorrect cancellation behavior on one of their workflows, which had workflow code somewhat like this: ```go // start a bunch of child workflows, add to `cfs` var cfs []workflow.Future for _, arg := range args { cfs = append(cfs, workflow.ExecuteChildWorkflow(ctx, "stuff", arg) } // wait for them to complete for _, f := range cfs { f.Get(...) } // run a final child workflow to do the final report workflow.ExecuteChildWorkflow(ctx, "final", ...).Get(ctx, nil) ``` When they canceled their parent workflow while "stuff" was still running, it would wait for all the "stuff" to cancel and return (as expected)... ... and then it would start the "final" child, which would never actually finish because the previous "stuff" was canceled, not completed. For cancellation prior to calling `ExecuteChildWorkflow`, this can be worked around by checking `ctx.Err() == nil`, and only executing if true. For cancellation *between* `ExecuteChildWorkflow` and the child being scheduled, there may not be a viable workaround. This time window is thankfully usually very small, so *most* workflows should not have to worry about it. --- The cause appears to be that this cancellation check in `ExecuteChildWorkflow` depends on `childWorkflowExecution` being non-nil (since that sends the cancellation event): https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L905-L917 but that variable is only set when the child workflow's "execution" future completes (i.e. it has been scheduled successfully): https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L886-L897 If cancellation occurs prior to that point, the cancellation is ignored for this child. Unfortunately it will also not "detect" this "lost" cancellation later in any way, so the child workflow acts as if it was run with a `workflow.NewDisconnectedContext`, though it was not. --- ... unfortunately, fixing this can cause non-deterministic replay errors for anyone who had previously executed the child. For some users this is probably fine (just reset), but not for everyone. On a fundamental level, we need a way to safely make semantic changes (due to bugfixes or just improvements) in the client, and we do not seem to have an established way to do that yet. Fixing this safely may require us to come up with a strategy, build that, and make use of it.

Resolves uber-go#1138 by correcting the bug. As this is a non-backwards-compatible change, the next commit contains a "backport" of sorts to allow selecting the buggy behavior if necessary, to ease migration.

Introduces and uses a "bug backport" config, to allow selecting old behavior that could not be fixed in a completely transparent way. See code comments for detailed use for this new flag. In general we should treat this config as *somewhat desirable* to keep when the upkeep cost is low or zero, but it does represent known tech debt that we need to clean up at some point. When adding or removing fields on it, make sure to cover it in the version notes!

Prepwork for a release

This reverts commit 0fb34ee. Panicking due to null run-id values on children-cancellation, possibly elsewhere.

* Removing some globals from tests * Use test-scoped loggers everywhere Conflicted with jwt-related things, so those have been removed. Otherwise no issues.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix 0.18.4 #1148

fix 0.18.4 #1148

Commits on Aug 18, 2021

Commits on Aug 23, 2021

Commits on Oct 6, 2021

Commits on Nov 8, 2021

Commits on Nov 9, 2021