[DPE-2728] Handle scaling to zero units #331
Conversation
```diff
@@ -387,3 +388,43 @@ async def test_network_cut(
     ), "Connection is not possible after network restore"

     await is_cluster_updated(ops_test, primary_name)


+async def test_scaling_to_zero(ops_test: OpsTest, continuous_writes) -> None:
```
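For reference, a minimal sketch of what a scale-to-zero integration test can look like with pytest-operator. This is not the exact test added in this PR: the application name, unit counts, and timeouts are illustrative assumptions.

```python
import pytest
from pytest_operator.plugin import OpsTest

APP_NAME = "postgresql-k8s"  # assumed application name


@pytest.mark.abort_on_fail
async def test_scaling_to_zero(ops_test: OpsTest, continuous_writes) -> None:
    """Scale the application to zero units and back up, and check it recovers."""
    # Scale down to zero units and wait until none are left.
    await ops_test.model.applications[APP_NAME].scale(0)
    await ops_test.model.block_until(
        lambda: len(ops_test.model.applications[APP_NAME].units) == 0, timeout=1000
    )

    # Scale back up and wait for the cluster to settle into an active state.
    await ops_test.model.applications[APP_NAME].scale(3)
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active", timeout=1000)
```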
We should probably move this test to another suite: the self-healing tests need to be able to run against an existing cluster, and this is a potentially destructive test.
Good point. I created https://warthogs.atlassian.net/browse/DPE-3094 to handle that.
Thank you for the fix. I still see some problematic corner cases here, but let's document the proper way of restoring from zero units and test this implementation for some time before making further improvements. Tnx!
Force-pushed from cecd9ae to c064e25.
* Handle scaling to zero units
* Update units tests
* Remove unused constants
* Don't set unknown status

Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Issue
When the cluster is scaled to 0 units and later scaled back up, it enters an error state. This happens because of conflicting unit data and a missing leader key in the Patroni K8s Endpoint: the leader unit tries to get the cluster info but is unable to do so.
Solution
Remove the unit data when scaling to zero, and add the leader key back if it is missing when scaling back up again. Also, don't write Unknown to the unit status when that is the unit's original status (setting it explicitly would trigger an error).
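Roughly, the idea looks like the sketch below. This is not the charm's actual code: the `database-peers` relation name, the Patroni Endpoints name, the `leader` annotation key, and the helper names are illustrative assumptions.

```python
from lightkube import Client
from lightkube.resources.core_v1 import Endpoints
from lightkube.types import PatchType
from ops.charm import CharmBase
from ops.model import StatusBase, UnknownStatus


class ScaleToZeroSketch(CharmBase):
    """Illustrative only; mirrors the approach described above, not the real charm code."""

    def _clear_unit_data_on_scale_to_zero(self) -> None:
        # Drop this unit's peer databag entries when the app is being scaled to zero,
        # so stale data does not conflict once the cluster is scaled back up.
        if self.app.planned_units() == 0:
            peers = self.model.get_relation("database-peers")  # assumed peer relation name
            if peers is not None:
                for key in list(peers.data[self.unit]):
                    del peers.data[self.unit][key]

    def _restore_leader_key_if_missing(self) -> None:
        # Patroni keeps the current leader as an annotation on a K8s Endpoints
        # resource; re-create it if it was lost while the cluster had zero units.
        client = Client()
        name = f"patroni-{self.app.name}"  # hypothetical Endpoints name
        endpoints = client.get(Endpoints, name=name, namespace=self.model.name)
        if "leader" not in (endpoints.metadata.annotations or {}):
            client.patch(
                Endpoints,
                name=name,
                namespace=self.model.name,
                obj={"metadata": {"annotations": {"leader": self.unit.name.replace("/", "-")}}},
                patch_type=PatchType.MERGE,
            )

    def _set_unit_status(self, status: StatusBase) -> None:
        # Charms are not allowed to set the "unknown" status explicitly, so keep
        # whatever status the unit already has instead of writing it back.
        if isinstance(status, UnknownStatus):
            return
        self.unit.status = status
```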
One more detail: the logic from the VM charm was copied to `src/relations/db.py` and `src/relations/postgresql_provider.py` to avoid deleting the relation user when the PostgreSQL charm is scaled down.
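A minimal sketch of that idea, assuming the ops framework; the `database-peers` relation name, the `departing` flag, and `_delete_relation_user` are hypothetical placeholders for the logic that lives in the files above:

```python
from ops.charm import CharmBase, RelationBrokenEvent, RelationDepartedEvent


class ProviderScaleDownSketch(CharmBase):
    """Illustrative only: skip user cleanup when the departure is caused by a scale-down."""

    def _on_relation_departed(self, event: RelationDepartedEvent) -> None:
        # If this unit itself is the departing one, the application is scaling down,
        # so record a flag for the relation-broken handler.
        if event.departing_unit == self.unit:
            peers = self.model.get_relation("database-peers")  # assumed peer relation name
            if peers is not None:
                peers.data[self.unit]["departing"] = "True"

    def _on_relation_broken(self, event: RelationBrokenEvent) -> None:
        peers = self.model.get_relation("database-peers")
        if peers is not None and peers.data[self.unit].get("departing"):
            # Scale-down in progress: keep the relation user for the remaining units.
            return
        self._delete_relation_user(event.relation)  # hypothetical helper

    def _delete_relation_user(self, relation) -> None:
        # Placeholder: in the real charm this drops the PostgreSQL user created
        # for the relation when the relation is actually removed.
        ...
```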