Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix 0.18.4 #1148

Closed
wants to merge 25 commits into from
Closed

fix 0.18.4 #1148

wants to merge 25 commits into from

Conversation

Groxx
Copy link
Contributor

@Groxx Groxx commented Nov 9, 2021

What changed?

Why?

How did you test it?

Potential risks

vytautas-karpavicius and others added 25 commits August 18, 2021 09:58
* Add integration test for domain update

* Fix protoUpdateDomainRequest for compatibility layer
This resolves a bug a user encountered, where a full sticky cache + a query for a non-cached workflow
would result in a goroutine leak due to the associated event handler never being shut down.
Since this leak retains all in-workflow data, it can eventually lead to an out-of-memory crash, though
it does not cause any logic errors (these abandoned goroutines are forever idle once abandoned).
There are probably also other scenarios where this is possible, hopefully all caught by this addition.

---

Separately: this function seems to be far too complex, and is almost certainly duplicating checks made
elsewhere, which should not be duplicated like this.  There have been multiple issues with
state-clearing that have lead to adding conditions to this func, which is a clear sign of a code smell.

State / cache decisions like this should be made in exactly one place ever, and built up as obviously
as possible, to ensure gaps like this never occur.  In this case we'll likely need to invert the
dependency flow somehow, so callers control when cache is cleared based on whether or not it is cached,
rather than double-checking internally like this.

We should also probably add something like go.uber.org/goleak to our tests, to help ensure we do not
have goroutine leaks.  This may not have been caught by that, as the steps leading to it are a bit odd
and rely on singleton config (sticky cache size), but it may find or prevent others.  With this PR we now
have at least one test using it, but it'll probably take some time to roll out all over, and to add missing
test-state cleanup funcs.
Unnecessary indirection and reflection, these are all struct-pointers.  != nil is identical and more straightforward.

Created with `gofmt -w -s -r '!isInterfaceNil(a) -> a != nil' internal/internal_worker.go`
Currently, a test workflow like this will not ever run the code after the goroutine's sleep (correct),
nor the code in the defer (incorrect, production would run it):
```
func(ctx workflow.Context) error {
  workflow.Go(func(ctx workflow.Context) {
    defer func() { fmt.Println("in defer") }()
    workflow.Sleep(ctx, time.Hour) // wait longer than the workflow lives
    fmt.Println("after sleep")
  })
  workflow.Sleep(ctx, time.Minute) // just to make sure the goroutine starts
  return nil
}
```
The workflow will correctly end, but since the dispatcher was never closed, any not-yet-complete
goroutines would never exit, and we'd leak goroutines.

Semantically this should be a safe change:
- Any post-complete decisions would not be executed or recorded, and this retains that.
- When panicking, anything that would be *recorded* in a defer will not be recorded, so no
  replay-state-aware user code should be affected.  And any code that ignores replay state will
  now execute like it should, where before it would not.

So safe / correct code should be unaffected, leaks should be reduced, and latent mistakes should
now cause errors.  AFAICT - I'm not sure how complete our tests are here :)

There's some room for in-defer code to be semantically incorrect in tests without this fix, (e.g. testing
custom logger/metric impls in defers), though I expect those to be very rare bordering on nonexistent.
But for the most part I expect that people will not notice this change, they'll just have fewer goroutine
leaks during tests (so e.g. https://github.com/uber-go/goleak users will be happy).

---

Prior to this fix, the added test fails with:
```
=== RUN   TestWorkflowUnitTest/Test_StaleGoroutinesAreShutDown
    internal_workflow_test.go:1210: 
        	Error Trace:	internal_workflow_test.go:1210
        	Error:      	deferred func should have been called within 1 second
        	Test:       	TestWorkflowUnitTest/Test_StaleGoroutinesAreShutDown
    internal_workflow_test.go:1216: code after sleep correctly not executed
```
Now it passes with this, which also shows it's not slowing tests down in any meaningful way:
```
=== RUN   TestWorkflowUnitTest/Test_StaleGoroutinesAreShutDown
    internal_workflow_test.go:1210: deferred callback executed after 9.177µs
    internal_workflow_test.go:1217: code after sleep correctly not executed
```
…o#1135)

The fallback worker identity of {pid}@{host}@{tasklist} is insufficient in practice.
E.g. within Uber, our docker setup can somewhat regularly lead to multiple instances
of a worker ending up on the same physical host (which is not changed by docker) and
with the same PID (as all the docker containers start up the same way -> they all
get the same PIDs).
Other causes are possible too, e.g. if a worker crashes and is restarted it could
share the same host+pid, even though its caches (and anything else we actually care
about for the identity) are lost.

At the least-problematic-but-confusing level, this can lead to a misleadingly-short
list of workers on tasklists in the web UI, as duplicates are collapsed.

At the most-problematic level, this can lead to uneven load and sub-par request
routing, as only the first/last/??? request to poll the server will receive data
intended for a specific worker.

Ideally that wouldn't be an issue, e.g. the server could keep better track of polls
and route more precisely, but 1) it's difficult to reliably identify polls since
the server only has the poller-provided data (which mis-identifies itself), and 2)
even if it can be fixed on the server, improving identity uniqueness helps both fix
this now and reduces or eliminates the impact of any future identity confusions.

---

Alternatively, we could add a process-global random UUID that auto-inits, which
would give us IDs that can be correlated across tasklists.  While I think this is
acceptable as well, I'm loathe to add another global variable, and the benefit of
cross-task identity correlating is pretty minor or nonexistent.

If someone actually cares about that, they can pass an explicit ID that they
generate however they like.
We have a variety of from-workflow errors that are relevant for users calling
`client.ExecuteWorkflow(...).Get(...)` and `client.GetWorkflow(...).Get(...)`.

Since they do not share any common wrapper or "parent type" that would allow
easy comparison with an error type check or `errors.As`, they are rather
annoying to differentiate from general RPC errors like timeouts or connection
failures.

So this adds a top-level "check all relevant types" helper, which is
intended to be a stable target for users to rely on, even if we add more
types later.  It intentionally does not tell *which* kind of error it is, in
case there are multiple types in the wrapped chain, as then behavior would
be order-dependent and any change risks breaking someone's code.
…ups (uber-go#1134)

Canceled-context documentation was strewn about in individual function documentation, though it's a fairly important concept that needs to be understood.

In particular, users are sometimes under the impression that "all things which accept a context will error", which can lead to nasty surprises like `selector.Select(ctx)`'s blocking behavior (causing a deadlocked workflow).  Selector's docs have been improved in uber-go#1131, and this will hopefully make its behavior easier to learn before it is needed.

And, since I noticed some small other things while writing this chunk of docs, I've rolled those changes into here as well.
A user noticed incorrect cancellation behavior on one of their workflows,
which had workflow code somewhat like this:
```go
// start a bunch of child workflows, add to `cfs`
var cfs []workflow.Future
for _, arg := range args {
  cfs = append(cfs, workflow.ExecuteChildWorkflow(ctx, "stuff", arg)
}

// wait for them to complete
for _, f := range cfs {
  f.Get(...)
}

// run a final child workflow to do the final report
workflow.ExecuteChildWorkflow(ctx, "final", ...).Get(ctx, nil)
```

When they canceled their parent workflow while "stuff" was still running, it would
wait for all the "stuff" to cancel and return (as expected)...
... and then it would start the "final" child, which would never actually finish
because the previous "stuff" was canceled, not completed.

For cancellation prior to calling `ExecuteChildWorkflow`, this can be worked around by
checking `ctx.Err() == nil`, and only executing if true.

For cancellation *between* `ExecuteChildWorkflow` and the child being scheduled,
there may not be a viable workaround.  This time window is thankfully usually very
small, so *most* workflows should not have to worry about it.

---

The cause appears to be that this cancellation check in `ExecuteChildWorkflow` depends
on `childWorkflowExecution` being non-nil (since that sends the cancellation event):
https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L905-L917
but that variable is only set when the child workflow's "execution" future completes
(i.e. it has been scheduled successfully):
https://github.com/uber-go/cadence-client/blob/8fff028e0c174fdf14df6520a68ce086c2b272f4/internal/workflow.go#L886-L897

If cancellation occurs prior to that point, the cancellation is ignored for this child.
Unfortunately it will also not "detect" this "lost" cancellation later in any way, so
the child workflow acts as if it was run with a `workflow.NewDisconnectedContext`,
though it was not.

---

... unfortunately, fixing this can cause non-deterministic replay errors for anyone
who had previously executed the child.  For some users this is probably fine (just reset),
but not for everyone.

On a fundamental level, we need a way to safely make semantic changes (due to bugfixes or
just improvements) in the client, and we do not seem to have an established way to do
that yet.  Fixing this safely may require us to come up with a strategy, build that, and
make use of it.
Resolves uber-go#1138 by correcting the bug.

As this is a non-backwards-compatible change, the next commit contains
a "backport" of sorts to allow selecting the buggy behavior if necessary,
to ease migration.
Introduces and uses a "bug backport" config, to allow selecting old behavior
that could not be fixed in a completely transparent way.

See code comments for detailed use for this new flag.

In general we should treat this config as *somewhat desirable* to keep when
the upkeep cost is low or zero, but it does represent known tech debt that
we need to clean up at some point.
When adding or removing fields on it, make sure to cover it in the version notes!
This reverts commit 0fb34ee.

Panicking due to null run-id values on children-cancellation, possibly elsewhere.
* Removing some globals from tests
* Use test-scoped loggers everywhere

Conflicted with jwt-related things, so those have been removed.
Otherwise no issues.
@Groxx
Copy link
Contributor Author

Groxx commented Nov 9, 2021

compared to wrong branch

@Groxx Groxx closed this Nov 9, 2021
@coveralls
Copy link

Pull Request Test Coverage Report for Build 5118d85f-4c62-46a4-a6de-d81314e82248

  • 147 of 221 (66.52%) changed or added relevant lines in 13 files are covered.
  • 814 unchanged lines in 16 files lost coverage.
  • Overall coverage increased (+8.0%) to 71.257%

Changes Missing Coverage Covered Lines Changed/Added Lines %
internal/context.go 31 32 96.88%
internal/session.go 0 1 0.0%
mocks/Client.go 0 1 0.0%
internal/internal_event_handlers.go 12 15 80.0%
internal/internal_worker.go 15 19 78.95%
client/client.go 24 34 70.59%
internal/workflow.go 24 40 60.0%
internal/compatibility/request.go 9 47 19.15%
Files with Coverage Reduction New Missed Lines %
client/client.go 2 71.74%
internal/common/convert.go 3 84.62%
internal/client.go 12 87.63%
internal/context.go 16 84.0%
internal/worker.go 16 15.79%
internal/internal_utils.go 20 85.41%
worker/worker.go 27 18.18%
workflow/workflow.go 30 47.37%
internal/compatibility/adapter.go 33 43.23%
internal/internal_task_pollers.go 39 81.53%
Totals Coverage Status
Change from base Build 4f32e7fd-8f48-4ad7-9dd0-cf6c681c5e48: 8.0%
Covered Lines: 11907
Relevant Lines: 16710

💛 - Coveralls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants