Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The postgresql peeble service went down and it wasn't restarted automatically #540

Closed
andrejvelichkovski opened this issue Jul 1, 2024 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@andrejvelichkovski
Copy link

Please see the Matrix thread here in case the full message context helps in understanding this issue (link)

Steps to reproduce

  1. Deploy/run a postgresql-k8s charm
  2. Kill the pebble postgresql service in the k8s pod.
  3. The service doesn't restart itself automatically after it is killed unless manually triggered

Expected behavior

We would expect that the service restarts itself automatically

Actual behavior

The postgresql pebble service is inactive and doesn't restart itself.

Versions

Operating system: PS5 environment

Juju CLI: 2.9.49-ubuntu-amd64

Juju agent: 14.11

Charm revision: 247

microk8s:

Log output

Juju debug log:

2024-06-29T10:01:59.660Z [pebble] GET /v1/services?names=postgresql 207.373µs 200
2024-06-29T10:07:57.673Z [pebble] GET /v1/services?names=postgresql 279.914µs 200
2024-06-29T10:12:55.959Z [pebble] GET /v1/services?names=postgresql 379.654µs 200
2024-06-29T10:17:53.183Z [pebble] GET /v1/services?names=postgresql 318.372µs 200
2024-06-29T10:23:50.628Z [pebble] GET /v1/services?names=postgresql 291.239µs 200
2024-06-29T10:29:19.645Z [pebble] GET /v1/services?names=postgresql 176.82µs 200
2024-06-29T10:34:28.689Z [pebble] GET /v1/services?names=postgresql 328.981µs 200
2024-06-29T10:38:53.221Z [pebble] GET /v1/services?names=postgresql 185.35µs 200
2024-06-29T10:44:05.568Z [pebble] GET /v1/services?names=postgresql 314.562µs 200
2024-06-29T10:48:27.660Z [pebble] GET /v1/services?names=postgresql 270.051µs 200
2024-06-29T10:53:48.630Z [pebble] GET /v1/services?names=postgresql 226.33µs 200
2024-06-29T10:58:39.514Z [pebble] GET /v1/services?names=postgresql 223.46µs 200
2024-06-29T11:03:25.105Z [pebble] Check "postgresql" failure 1 (threshold 3): received non-20x status code 503
2024-06-29T11:03:35.104Z [pebble] Check "postgresql" failure 2 (threshold 3): received non-20x status code 503
2024-06-29T11:03:45.103Z [pebble] Check "postgresql" failure 3 (threshold 3): received non-20x status code 503
2024-06-29T11:03:45.103Z [pebble] Check "postgresql" failure threshold 3 hit, triggering action
2024-06-29T11:03:51.260Z [postgresql] localhost:5432 - no response
2024-06-29T11:03:52.270Z [postgresql] localhost:5432 - no response
2024-06-29T11:03:53.277Z [postgresql] localhost:5432 - no response
2024-06-29T11:03:54.287Z [postgresql] localhost:5432 - no response
2024-06-29T11:03:55.103Z [pebble] Check "postgresql" failure 4 (threshold 3): received non-20x status code 503
2024-06-29T11:03:55.295Z [postgresql] localhost:5432 - no response
2024-06-29T11:03:56.302Z [postgresql] localhost:5432 - no response
2024-06-29T11:03:57.310Z [postgresql] localhost:5432 - no response

Additional context

Patroni logs:

2024-06-29 11:02:20 UTC [82243]: INFO: no action. I am (db-0), the leader with the lock 
2024-06-29 11:02:30 UTC [82243]: INFO: no action. I am (db-0), the leader with the lock 
2024-06-29 11:02:40 UTC [82243]: INFO: no action. I am (db-0), the leader with the lock 
2024-06-29 11:02:50 UTC [82243]: INFO: no action. I am (db-0), the leader with the lock 
2024-06-29 11:03:00 UTC [82243]: INFO: no action. I am (db-0), the leader with the lock 
2024-06-29 11:03:10 UTC [82243]: INFO: Lock owner: db-0; I am db-0 
2024-06-29 11:03:15 UTC [82243]: ERROR: Request to server https://10...:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10...', port=443): Read timed out. (read timeout=4.999667637050152)") 
2024-06-29 11:03:18 UTC [82243]: ERROR: Request to server https://10...:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10...', port=443): Read timed out. (read timeout=2.214830946177244)")
2024-06-29 11:03:20 UTC [82243]: ERROR: Error communicating with DCS 
2024-06-29 11:03:20 UTC [82243]: INFO: demoting self because DCS is not accessible and I was a leader 
2024-06-29 11:03:20 UTC [82243]: WARNING: Loop time exceeded, rescheduling immediately. 
2024-06-29 11:03:20 UTC [82243]: INFO: Demoting self (offline) 
2024-06-29 11:03:20 UTC [82243]: INFO: Lock owner: db-0; I am db-0 
2024-06-29 11:03:22 UTC [82243]: INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader 
2024-06-29 11:03:28 UTC [82243]: INFO: Lock owner: None; I am db-0 
2024-06-29 11:03:28 UTC [82243]: INFO: Demoting primary during demoting self because DCS is not accessible and I was a leader 
2024-06-29 11:03:28 UTC [82243]: INFO: Demoting self (immediate-nolock) 
2024-06-29 11:03:49 UTC [82243]: INFO: closed patroni connection to the postgresql cluster 
2024-06-29 11:03:51 UTC [82243]: INFO: postmaster pid=111648 
2024-06-29 11:03:54 UTC [82243]: INFO: lost leader lock during demoting self because DCS is not accessible and I was a leader 
2024-06-29 11:03:54 UTC [82243]: WARNING: Loop time exceeded, rescheduling immediately. 
2024-06-29 11:03:54 UTC [82243]: INFO: Lock owner: None; I am db-0 
2024-06-29 11:03:54 UTC [82243]: INFO: not healthy enough for leader race 
2024-06-29 11:03:59 UTC [82243]: INFO: demoting self because DCS is not accessible and I was a leader in progress 
2024-06-29 11:04:04 UTC [82243]: INFO: Lock owner: None; I am db-0 
2024-06-29 11:04:04 UTC [82243]: INFO: not healthy enough for leader race 
2024-06-29 11:04:04 UTC [82243]: INFO: demoting self because DCS is not accessible and I was a leader in progress 
2024-06-29 11:04:14 UTC [82243]: INFO: Lock owner: None; I am db-0 
2024-06-29 11:04:14 UTC [82243]: INFO: not healthy enough for leader race 
2024-06-29 11:04:14 UTC [82243]: INFO: demoting self because DCS is not accessible and I was a leader in progress 
2024-06-29 11:04:18 UTC [82243]: WARNING: Exception happened during processing of request from 10...:42438 
2024-06-29 11:04:18 UTC [82243]: WARNING: Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 683, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python3/dist-packages/patroni/api.py", line 105, in __init__
    super(RestApiHandler, self).__init__(request, client_address, server)
  File "/usr/lib/python3.10/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/lib/python3.10/http/server.py", line 433, in handle
    self.handle_one_request()
  File "/usr/lib/python3/dist-packages/patroni/api.py", line 1325, in handle_one_request
    BaseHTTPRequestHandler.handle_one_request(self)
  File "/usr/lib/python3.10/http/server.py", line 421, in handle_one_request
    method()
  File "/usr/lib/python3/dist-packages/patroni/api.py", line 368, in do_GET
    self._write_status_response(status_code, response)
  File "/usr/lib/python3/dist-packages/patroni/api.py", line 214, in _write_status_response
    self._write_json_response(status_code, response)
  File "/usr/lib/python3/dist-packages/patroni/api.py", line 167, in _write_json_response
    self.write_response(status_code, json.dumps(response, default=str), content_type='application/json')
  File "/usr/lib/python3/dist-packages/patroni/api.py", line 157, in write_response
    self.wfile.write(body.encode('utf-8'))
  File "/usr/lib/python3.10/socketserver.py", line 826, in write
    self._sock.sendall(b)
BrokenPipeError: [Errno 32] Broken pipe
2024-06-29 11:04:24 UTC [82243]: INFO: Lock owner: None; I am db-0 
2024-06-29 11:04:24 UTC [82243]: INFO: not healthy enough for leader race 
2024-06-29 11:04:24 UTC [82243]: INFO: demoting self because DCS is not accessible and I was a leader in progress 
2024-06-29 11:04:28 UTC [82243]: WARNING: Exception happened during processing of request from 10...:39822 
<repeats the broken pipe python error>
@andrejvelichkovski andrejvelichkovski added the bug Something isn't working label Jul 1, 2024
Copy link
Contributor

github-actions bot commented Jul 1, 2024

@andrejvelichkovski
Copy link
Author

The same issue happened again, just a week later.

@marceloneppel
Copy link
Member

Hi, @andrejvelichkovski! Thanks for the report.

We'll be working to fix this issue in the next two weeks.

@lucasgameiroborges
Copy link
Member

Hello @andrejvelichkovski !

This issue was specifically addressed in #581 and (as far as our tests have shown) the issue is solved.

If you see it again, please let us know by reopening this issue or pinging us on matrix/mattermost!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants