A non-leader unit stuck in "awaiting for member to start" #560

Open
jeffreychang911 opened this issue Jul 11, 2024 · 5 comments
Labels: bug (Something isn't working)

Comments

jeffreychang911 commented Jul 11, 2024

Steps to reproduce

  1. SolQA deployed Charmed Kubernetes 1.30/beta on top of Charmed OpenStack Yoga, and then deployed mysql-k8s and postgresql-k8s in separate models on top of Charmed Kubernetes.
  2. self-signed-certificates and data-integrator were both deployed and related.

Expected behavior

The postgresql-k8s charm should settle shortly after deployment.

Actual behavior

juju status showed one unit stuck in the waiting state until the run timed out after 1 hour.
Unit                         Workload  Agent      Address          Ports  Message
data-integrator/0*           active    idle       192.168.254.204
postgresql-k8s/0             waiting   executing  192.168.252.201         awaiting for member to start
postgresql-k8s/1*            active    idle       192.168.253.201         Primary
postgresql-k8s/2             active    idle       192.168.254.203
self-signed-certificates/0*  active    idle       192.168.252.200
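
For anyone scanning similar runs, here is a minimal Python sketch that picks out units that have not settled (anything other than active/idle) from status rows like the ones above. The function name `unsettled_units` and the embedded sample are illustrative only, and the parsing assumes the Ports column is empty, as it is in this output; it is not part of the charm or of Juju.

```python
# Rows copied from the status output above (whitespace-separated columns).
STATUS_ROWS = """\
data-integrator/0* active idle 192.168.254.204
postgresql-k8s/0 waiting executing 192.168.252.201 awaiting for member to start
postgresql-k8s/1* active idle 192.168.253.201 Primary
postgresql-k8s/2 active idle 192.168.254.203
self-signed-certificates/0* active idle 192.168.252.200
"""


def unsettled_units(rows: str) -> list[tuple[str, str, str, str]]:
    """Return (unit, workload, agent, message) for units that are not active/idle."""
    result = []
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        unit, workload, agent = fields[0], fields[1], fields[2]
        # Assumption: the Ports column is empty, so everything after the
        # address (fields[3]) is the status message.
        message = " ".join(fields[4:])
        if workload != "active" or agent != "idle":
            # Strip the trailing "*" that marks the leader unit.
            result.append((unit.rstrip("*"), workload, agent, message))
    return result


if __name__ == "__main__":
    for entry in unsettled_units(STATUS_ROWS):
        print(entry)
```

On the rows above, only postgresql-k8s/0 is reported.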

The only ERROR found in juju debug-log was:
unit-postgresql-k8s-0: 2024-07-11 07:54:44 ERROR unit.postgresql-k8s/0.juju-log certificates:3: Cannot push TLS certificates: RetryError(<Future at 0x7f1ee338f0a0 state=finished raised ConnectionError>)

Versions

Operating system: Jammy

Juju CLI: 3.5.2

Juju agent: 3.5.2

Charm revision: postgresql-k8s charm rev 281

Kubernetes: Charmed Kubernetes 1.30/beta, which is expected to be promoted to 1.30/stable soon without changes.

Log output

Juju debug log:
unit-postgresql-k8s-0: 2024-07-11 07:54:44 ERROR unit.postgresql-k8s/0.juju-log certificates:3: Cannot push TLS certificates: RetryError(<Future at 0x7f1ee338f0a0 state=finished raised ConnectionError>)

Additional context

This was found in a SolQA run: https://solutions.qa.canonical.com/testruns/5dc43cf9-2211-4b4c-9a69-a39d4d61176e
Crashdump - https://oil-jenkins.canonical.com/artifacts/5dc43cf9-2211-4b4c-9a69-a39d4d61176e/generated/generated/postgresql-k8s/crashdump-2024-07-11-08.49.08.tar.gz

jeffreychang911 added the bug label Jul 11, 2024

asbalderson (Contributor) commented

From another test run that hit the same error, we have the log messages below from the unit that fails to connect, where the connection is reset by peer while trying to connect.

This connection reset happens once on the other units while they are coming up, but it repeats ad nauseam on the failed unit, which never connects.

I'm not seeing anything else in the logs that signals a service not starting or anything similar, but it seems like postgres doesn't start locally, so the health check can't succeed?

2024-07-18T08:43:30.530867708Z stdout F 2024-07-18T08:43:30.530Z [postgresql] Exception in thread Thread-1802 (process_request_thread):
2024-07-18T08:43:30.530886993Z stdout F 2024-07-18T08:43:30.530Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:30.5308897Z stdout F 2024-07-18T08:43:30.530Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:30.531229302Z stdout F 2024-07-18T08:43:30.531Z [postgresql]     self.run()
2024-07-18T08:43:30.531240374Z stdout F 2024-07-18T08:43:30.531Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:30.53139845Z stdout F 2024-07-18T08:43:30.531Z [postgresql]     self._target(*self._args, **self._kwargs)
2024-07-18T08:43:30.531405551Z stdout F 2024-07-18T08:43:30.531Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:30.531709129Z stdout F 2024-07-18T08:43:30.531Z [postgresql]     request.do_handshake()
2024-07-18T08:43:30.53171517Z stdout F 2024-07-18T08:43:30.531Z [postgresql]   File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:30.531993036Z stdout F 2024-07-18T08:43:30.531Z [postgresql]     self._sslobj.do_handshake()
2024-07-18T08:43:30.531999241Z stdout F 2024-07-18T08:43:30.531Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:33.54831443Z stdout F 2024-07-18T08:43:33.548Z [postgresql] Exception in thread Thread-1803 (process_request_thread):
2024-07-18T08:43:33.548336755Z stdout F 2024-07-18T08:43:33.548Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:33.54833941Z stdout F 2024-07-18T08:43:33.548Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:33.548341374Z stdout F 2024-07-18T08:43:33.548Z [postgresql]     self.run()
2024-07-18T08:43:33.548343111Z stdout F 2024-07-18T08:43:33.548Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:33.54834474Z stdout F 2024-07-18T08:43:33.548Z [postgresql]     self._target(*self._args, **self._kwargs)
2024-07-18T08:43:33.548346971Z stdout F 2024-07-18T08:43:33.548Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:33.548348695Z stdout F 2024-07-18T08:43:33.548Z [postgresql]     request.do_handshake()
2024-07-18T08:43:33.548350325Z stdout F 2024-07-18T08:43:33.548Z [postgresql]   File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:33.548351945Z stdout F 2024-07-18T08:43:33.548Z [postgresql]     self._sslobj.do_handshake()
2024-07-18T08:43:33.54835357Z stdout F 2024-07-18T08:43:33.548Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:36.555747743Z stdout F 2024-07-18T08:43:36.555Z [postgresql] Exception in thread Thread-1804 (process_request_thread):
2024-07-18T08:43:36.555893062Z stdout F 2024-07-18T08:43:36.555Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:36.555900221Z stdout F 2024-07-18T08:43:36.555Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:36.556189845Z stdout F 2024-07-18T08:43:36.556Z [postgresql]     self.run()
2024-07-18T08:43:36.556289238Z stdout F 2024-07-18T08:43:36.556Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:36.556511996Z stdout F 2024-07-18T08:43:36.556Z [postgresql]     self._target(*self._args, **self._kwargs)
2024-07-18T08:43:36.556518628Z stdout F 2024-07-18T08:43:36.556Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:36.556826132Z stdout F 2024-07-18T08:43:36.556Z [postgresql]     request.do_handshake()
2024-07-18T08:43:36.556887803Z stdout F 2024-07-18T08:43:36.556Z [postgresql]   File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:36.557129105Z stdout F 2024-07-18T08:43:36.557Z [postgresql]     self._sslobj.do_handshake()
2024-07-18T08:43:36.557194273Z stdout F 2024-07-18T08:43:36.557Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:38.899227117Z stderr F 2024-07-18T08:43:38.899Z [pebble] GET /v1/notices?after=2024-07-18T08%3A32%3A08.874364511Z&timeout=30s 30.000151934s 200
2024-07-18T08:43:39.571698564Z stdout F 2024-07-18T08:43:39.571Z [postgresql] Exception in thread Thread-1805 (process_request_thread):
2024-07-18T08:43:39.571720471Z stdout F 2024-07-18T08:43:39.571Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:39.571752036Z stdout F 2024-07-18T08:43:39.571Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:39.572014237Z stdout F 2024-07-18T08:43:39.571Z [postgresql]     self.run()
2024-07-18T08:43:39.572085537Z stdout F 2024-07-18T08:43:39.572Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:39.572296492Z stdout F 2024-07-18T08:43:39.572Z [postgresql]     self._target(*self._args, **self._kwargs)
2024-07-18T08:43:39.572327798Z stdout F 2024-07-18T08:43:39.572Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:39.572598412Z stdout F 2024-07-18T08:43:39.572Z [postgresql]     request.do_handshake()
2024-07-18T08:43:39.572644153Z stdout F 2024-07-18T08:43:39.572Z [postgresql]   File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:39.572947519Z stdout F 2024-07-18T08:43:39.572Z [postgresql]     self._sslobj.do_handshake()
2024-07-18T08:43:39.572955647Z stdout F 2024-07-18T08:43:39.572Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:40.179252032Z stdout F 2024-07-18T08:43:40.179Z [postgresql] Exception in thread Thread-1806 (process_request_thread):
2024-07-18T08:43:40.17927862Z stdout F 2024-07-18T08:43:40.179Z [postgresql] Traceback (most recent call last):
2024-07-18T08:43:40.179310988Z stdout F 2024-07-18T08:43:40.179Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
2024-07-18T08:43:40.179593344Z stdout F 2024-07-18T08:43:40.179Z [postgresql]     self.run()
2024-07-18T08:43:40.179601942Z stdout F 2024-07-18T08:43:40.179Z [postgresql]   File "/usr/lib/python3.10/threading.py", line 953, in run
2024-07-18T08:43:40.179805582Z stdout F 2024-07-18T08:43:40.179Z [postgresql]     self._target(*self._args, **self._kwargs)
2024-07-18T08:43:40.179841797Z stdout F 2024-07-18T08:43:40.179Z [postgresql]   File "/usr/lib/python3/dist-packages/patroni/api.py", line 1631, in process_request_thread
2024-07-18T08:43:40.18015769Z stdout F 2024-07-18T08:43:40.180Z [postgresql]     request.do_handshake()
2024-07-18T08:43:40.180202199Z stdout F 2024-07-18T08:43:40.180Z [postgresql]   File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
2024-07-18T08:43:40.180470562Z stdout F 2024-07-18T08:43:40.180Z [postgresql]     self._sslobj.do_handshake()
2024-07-18T08:43:40.180539751Z stdout F 2024-07-18T08:43:40.180Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:40.180758071Z stderr F 2024-07-18T08:43:40.180Z [pebble] Check "postgresql" failure 403 (threshold 3): Get "http://postgresql-k8s-1.postgresql-k8s-endpoints:8008/health": read tcp 192.168.252.69:35004->192.168.252.69:8008: read: connection reset by peer
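
To make the pattern in the excerpt above easier to see across a longer log, here is a minimal Python sketch that counts the `ssl.SSLError` reasons and extracts the Pebble check failures, including the URL scheme the check used. The function name `summarize`, the regular expressions, and the shortened sample are assumptions written against the lines quoted above, not a stable log format. One observation it surfaces: OpenSSL reports `HTTP_REQUEST` when a plaintext HTTP request arrives on a TLS listener, and the failing check URL uses `http://`. That may point to the health check speaking plain HTTP to a TLS-enabled Patroni API, but this reading is an inference from the log and is not confirmed in this thread.

```python
import re
from collections import Counter

# Shortened sample taken from the log excerpt above.
SAMPLE_LOG = """\
2024-07-18T08:43:39.572955647Z stdout F 2024-07-18T08:43:39.572Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:40.180539751Z stdout F 2024-07-18T08:43:40.180Z [postgresql] ssl.SSLError: [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
2024-07-18T08:43:40.180758071Z stderr F 2024-07-18T08:43:40.180Z [pebble] Check "postgresql" failure 403 (threshold 3): Get "http://postgresql-k8s-1.postgresql-k8s-endpoints:8008/health": read tcp 192.168.252.69:35004->192.168.252.69:8008: read: connection reset by peer
"""

# Patterns are assumptions derived from the lines quoted in this issue.
SSL_ERROR = re.compile(r"\[postgresql\] ssl\.SSLError: \[SSL: (\w+)\]")
CHECK_FAILURE = re.compile(
    r'\[pebble\] Check "([^"]+)" failure (\d+) \(threshold (\d+)\): Get "(\w+)://'
)


def summarize(log: str) -> dict:
    """Count SSL error reasons and collect Pebble check failures from a log."""
    ssl_errors = Counter()
    check_failures = []
    for line in log.splitlines():
        match = SSL_ERROR.search(line)
        if match:
            ssl_errors[match.group(1)] += 1
            continue
        match = CHECK_FAILURE.search(line)
        if match:
            name, failures, threshold, scheme = match.groups()
            check_failures.append(
                {
                    "check": name,
                    "failures": int(failures),
                    "threshold": int(threshold),
                    "scheme": scheme,  # "http" vs "https" used by the check
                }
            )
    return {"ssl_errors": dict(ssl_errors), "check_failures": check_failures}


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

Running the same function over the full container log from a healthy unit and from the stuck unit would show whether the handshake errors stop on the healthy one, as described above.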

marceloneppel (Member) commented

Hi, @jeffreychang911 and @asbalderson! Thanks for the report. This issue was scheduled for the next pulse.

marceloneppel (Member) commented

Hi, @jeffreychang911 and @asbalderson! Do you have an environment we could access to reproduce this issue? I tried on both a VM and a PS6 model but couldn't reproduce it.

jeffreychang911 (Author) commented

I checked our test logs: this issue only happened twice in July, with rev 281. We haven't seen it in the 90+ runs since.
