Flaky test: TestTracingGoldenData/otlp-opencensus timed out after 10m #33865

Open · songy23 opened this issue Jul 2, 2024 · 10 comments · Fixed by #34093
Labels: ci-cd, flaky test, help wanted, receiver/opencensus, testbed

@songy23 (Member) commented Jul 2, 2024

Component(s)

testbed

Describe the issue you're reporting

Seeing a lot of occurrences of this test timeout in CI:

panic: test timed out after 10m0s
running tests:
	TestTracingGoldenData (10m0s)
	TestTracingGoldenData/opencensus-otlp (9m30s)

E.g. https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9761790955/job/26943836035?pr=33856#step:7:904

Note this is different from #27295: that issue is due to the same port being bound twice, while this one seems to be due to testbed.(*inProcessCollector).Stop taking too long (a possible deadlock).

@songy23 added the ci-cd, flaky test, needs triage, and testbed labels Jul 2, 2024
github-actions bot commented Jul 2, 2024

Pinging code owners for testbed: @open-telemetry/collector-approvers. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1 (Member) commented:

Looking into this a bit. Successful runs of this test take about 3 seconds (example).

From the panic stack trace, this one in particular looks interesting:

goroutine 1004 [semacquire, 9 minutes]:
sync.runtime_Semacquire(0xc005df65a0?)
	/opt/hostedtoolcache/go/1.21.12/x64/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc0035140a8?)
	/opt/hostedtoolcache/go/1.21.12/x64/src/sync/waitgroup.go:116 +0x48
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*inProcessCollector).Stop(0xc0015300e0)
	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/in_process_collector.go:104 +0x4c
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*TestCase).StopAgent(0xc003c3b180)
	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/test_case.go:165 +0x24
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/correctnesstests/traces.testWithTracingGoldenDataset(0x651cca0?, {0x7fe574eb2048, 0xc0017a46c0}, {0x4571c38, 0xc0066d8480}, {0x0?, 0x0?, 0xc0018f1760?, 0x498f6f?}, 0xc0016e55c0)
	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/correctnesstests/traces/correctness_test.go:92 +0x712
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/correctnesstests/traces.TestTracingGoldenData.func1(0xa6f12b?)
	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/correctnesstests/traces/correctness_test.go:39 +0x4a
testing.tRunner(0xc002bde820, 0xc00380f788)
	/opt/hostedtoolcache/go/1.21.12/x64/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 7
	/opt/hostedtoolcache/go/1.21.12/x64/src/testing/testing.go:1648 +0x3ad

It looks like the actual test functionality succeeds, but shutting down the in-process collector is blocking on this Wait() call. Since this is flaky, it looks like the waitgroup usage must have some kind of bug that is only hit intermittently.
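
As an aside, a minimal sketch of the Start/Stop pattern involved may help (the types and method bodies below are illustrative, not the testbed's actual code): Stop only returns once the goroutine running the collector calls Done, so anything that keeps Run from returning keeps Wait blocked.

package sketch

import (
	"context"
	"sync"
)

// collectorRunner stands in for otelcol.Collector; this interface is
// illustrative, not the testbed's real API.
type collectorRunner interface {
	Run(ctx context.Context) error // blocks until the collector has shut down
	Shutdown()
}

type inProcessCollector struct {
	svc collectorRunner
	wg  sync.WaitGroup
}

// Start runs the collector in a goroutine; Done fires only when Run returns.
func (ipc *inProcessCollector) Start() {
	ipc.wg.Add(1)
	go func() {
		defer ipc.wg.Done()
		_ = ipc.svc.Run(context.Background())
	}()
}

// Stop requests shutdown and then waits for the Run goroutine. If Shutdown
// hangs inside the collector, Wait never returns and the test hits the 10m timeout.
func (ipc *inProcessCollector) Stop() {
	ipc.svc.Shutdown()
	ipc.wg.Wait()
}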

Looking further, we can see the collector is still running, which blocks Done from being called. The collector itself is attempting to shut down, but is stuck reporting its status while trying to acquire the reporter's mutex, as this stack trace shows:

goroutine 1008 [sync.Mutex.Lock, 9 minutes]:
sync.runtime_SemacquireMutex(0xc00279b908?, 0x85?, 0xc0013516b0?)
	/opt/hostedtoolcache/go/1.21.12/x64/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc001498810)
	/opt/hostedtoolcache/go/1.21.12/x64/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(...)
	/opt/hostedtoolcache/go/1.21.12/x64/src/sync/mutex.go:90
go.opentelemetry.io/collector/service/internal/status.(*reporter).ReportStatus(0xc001498810, 0x4571c00?, 0x66d68a0?)
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/internal/status/status.go:135 +0x4c
go.opentelemetry.io/collector/service/internal/graph.(*Graph).ShutdownAll(0xc001530700, {0x4571c00, 0x66d68a0}, {0x455bfb0, 0xc001498810})
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/internal/graph/graph.go:469 +0x356
go.opentelemetry.io/collector/service.(*Service).Shutdown(0xc004030870, {0x4571c00, 0x66d68a0})
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/service.go:267 +0xdf
go.opentelemetry.io/collector/otelcol.(*Collector).shutdown(0xc000f0bc20, {0x4571c00, 0x66d68a0})
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/collector.go:353 +0xc5
go.opentelemetry.io/collector/otelcol.(*Collector).Run(0xc000f0bc20, {0x4571c00, 0x66d68a0})
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/collector.go:339 +0x9b0
github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*inProcessCollector).Start.func2()
	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/in_process_collector.go:80 +0x65
created by github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed.(*inProcessCollector).Start in goroutine 1004
	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/testbed/testbed/in_process_collector.go:78 +0x35d

We can then find the other goroutine holding the lock: it belongs to the opencensus receiver. The receiver reports status during shutdown if the server is closed with an unexpected error.

goroutine 1027 [chan send, 9 minutes]:
go.opentelemetry.io/collector/service.(*serviceHost).notifyComponentStatusChange(...)
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/host.go:72
go.opentelemetry.io/collector/service/internal/status.(*reporter).componentFSM.func1(0x34eeec0?)
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/internal/status/status.go:164 +0x25
go.opentelemetry.io/collector/service/internal/status.(*fsm).transition(0xc003e758c0, 0xc00422b740)
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/internal/status/status.go:40 +0x8b
go.opentelemetry.io/collector/service/internal/status.(*reporter).ReportStatus(0xc001498810, 0x0?, 0xc0000b7f70?)
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/internal/status/status.go:140 +0x99
go.opentelemetry.io/collector/service/internal/graph.(*Graph).buildComponents.TelemetrySettings.ToComponentTelemetrySettings.NewReportStatusFunc.func1(0x4531980?)
	/home/runner/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/internal/status/status.go:178 +0x24
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/opencensusreceiver.(*ocReceiver).Start.func3()
	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/opencensusreceiver/opencensus.go:159 +0x119
created by github.com/open-telemetry/opentelemetry-collector-contrib/receiver/opencensusreceiver.(*ocReceiver).Start in goroutine 1008
	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/receiver/opencensusreceiver/opencensus.go:156 +0xa59

Reporting status during shutdown is currently broken, as filed in open-telemetry/opentelemetry-collector#9824; this is another occurrence of that bug.
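
To make the shape of the deadlock concrete, here is a minimal sketch (the types and names are illustrative models of the traces above, not the collector's real status package): one goroutine holds the reporter mutex while blocked on a channel send that nothing drains during shutdown, and the shutdown path needs that same mutex.

package sketch

import "sync"

// reporter mimics the rough shape seen in the traces above; it is an
// illustrative model, not the collector's actual status implementation.
type reporter struct {
	mu     sync.Mutex
	events chan string // drained by the service host only while it is running
}

// ReportStatus takes the mutex and then publishes the event. If nothing reads
// r.events anymore (the host stopped consuming during shutdown), the send
// blocks while the mutex is still held -- goroutine 1027 above.
func (r *reporter) ReportStatus(ev string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.events <- ev
}

// shutdownAll also reports status, so it parks on Lock waiting for the
// goroutine above -- goroutine 1008 -- and the deadlock is complete.
func (r *reporter) shutdownAll() {
	r.ReportStatus("component stopping")
}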

@crobert-1 added the receiver/opencensus label and removed the needs triage label Jul 15, 2024
github-actions bot commented:

Pinging code owners for receiver/opencensus: @open-telemetry/collector-approvers. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1 (Member) commented:

This is the root cause blocking call:

ocr.settings.TelemetrySettings.ReportStatus(component.NewFatalErrorEvent(err))

The next step is to investigate if we're simply missing an expected status that's returned when shutting down the receiver, or if there is actually a bug in shutting it down.

The referenced core issue is why this test is hitting timeouts, but during a successful test run the status shouldn't need to be reported, from what I can tell. So there may be an actual bug in the opencensus receiver that results in it not shutting down properly, or not properly handling error messages that should be expected.

@odubajDT (Contributor) commented Jul 16, 2024

> This is the root cause blocking call:
>
> ocr.settings.TelemetrySettings.ReportStatus(component.NewFatalErrorEvent(err))
>
> The next step is to investigate if we're simply missing an expected status that's returned when shutting down the receiver, or if there is actually a bug in shutting it down.
>
> The referenced core issue is why this test is hitting timeouts, but during a successful test run the status shouldn't need to be reported, from what I can tell. So there may be an actual bug in the opencensus receiver that results in it not shutting down properly, or not properly handling error messages that should be expected.

Agreed with this, the condition might be missing a check for ErrListenerClosed. There might be a situation where the listener was already closed in the test, so the full if statement should look like the following:

if err := ocr.multiplexer.Serve(); !errors.Is(err, cmux.ErrServerClosed) && !errors.Is(err, cmux.ErrListenerClosed) && err != nil {
	ocr.settings.TelemetrySettings.ReportStatus(component.NewFatalErrorEvent(err))
}

I will open a PR and try a few runs until we confirm it's resolved.
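
For illustration, the intent of the extra check could also be written as a small helper (the helper name is hypothetical; the actual proposal inlines the condition above):

package sketch

import (
	"errors"

	"github.com/soheilhy/cmux"
)

// isExpectedServeErr reports whether the error returned by multiplexer.Serve
// is one we expect during an orderly shutdown and therefore should not be
// reported as a fatal component status.
func isExpectedServeErr(err error) bool {
	return err == nil ||
		errors.Is(err, cmux.ErrServerClosed) ||
		errors.Is(err, cmux.ErrListenerClosed)
}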

mx-psi pushed a commit that referenced this issue Jul 19, 2024
@crobert-1 reopened this Jul 19, 2024
djaglowski pushed a commit that referenced this issue Jul 23, 2024
…thub.com/jonboulle/clockwork` (#34221)