Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DPE-1177] Change 'master' to 'primary' in Patroni leader role #532

Merged
merged 18 commits into from
Jul 22, 2024

Conversation

marceloneppel
Copy link
Member

@marceloneppel marceloneppel commented Jun 26, 2024

Issue

Patroni sets some labels in the pods to indicate which unit is the primary and which are the replicas. Unfortunately, in older versions, it's using the master value for the role label.

Solution

We upgraded to use the primary value for the role label because we wanted to use better language. It's now possible to use custom values for the labels.

Notes:

  • The version constraints were upgraded on src/dependency.json to later, in the next stable after the one that will contain these changes, we can remove the upgrade logic by updating the version constraints again.
  • src/charm.py don't patch the charm custom k8s services during an upgrade to avoid creating moments of downtime (like, for example, if there is a unit still with a role label equal to master and we already switch to the value equal to primary in the k8s service).
  • src/upgrade.py contains temporary logic to handle the label value update during the upgrade process (moving to the primary label only when the last unit - unit 0 - will be upgraded to avoid downtime).
  • tests/integration/ha_tests/test_rollback_to_master_label.py and tests/integration/ha_tests/test_upgrade_to_primary_label.py are almost a copy from tests/integration/ha_tests/test_upgrade.py, but they should be removed later, as the intent is only to validate the role label value migration at this point.
  • More unit tests will be added through another PR because this PR is already over 500 lines.

Copy link

codecov bot commented Jun 26, 2024

Codecov Report

Attention: Patch coverage is 63.04348% with 17 lines in your changes missing coverage. Please review.

Project coverage is 70.74%. Comparing base (9ea09b7) to head (289e392).

Files Patch % Lines
src/relations/async_replication.py 30.00% 7 Missing ⚠️
src/upgrade.py 30.00% 5 Missing and 2 partials ⚠️
src/charm.py 88.46% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #532      +/-   ##
==========================================
- Coverage   70.92%   70.74%   -0.18%     
==========================================
  Files          10       10              
  Lines        2817     2834      +17     
  Branches      525      531       +6     
==========================================
+ Hits         1998     2005       +7     
- Misses        720      727       +7     
- Partials       99      102       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
…first unit upgraded)

Signed-off-by: Marcelo Henrique Neppel <[email protected]>
…first unit upgraded) in the rollback test

Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
Signed-off-by: Marcelo Henrique Neppel <[email protected]>
@marceloneppel marceloneppel marked this pull request as ready for review July 11, 2024 12:28
ops_test.model.deploy(
DATABASE_APP_NAME,
num_units=3,
channel="14/stable",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use a specific revision here? New releases would use the new label by default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea. I updated it on f02e964.

logger.info("Wait for applications to become active")
async with ops_test.fast_forward():
await ops_test.model.wait_for_idle(
apps=[DATABASE_APP_NAME, APPLICATION_NAME], status="active"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional of course, but if you want to avoid https://warthogs.atlassian.net/browse/DPE-4855 you can add raise_on_error=False in this specific wait.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Thanks for letting me know about that! I updated this PR with that on f02e964.

logger.info("Wait for applications to become active")
async with ops_test.fast_forward():
await ops_test.model.wait_for_idle(
apps=[DATABASE_APP_NAME, APPLICATION_NAME], status="active"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this PR with that on f02e964.

try:
self._create_services()
except ApiError:
logger.exception("failed to create k8s services")
Copy link
Member

@lucasgameiroborges lucasgameiroborges Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a question here, sorry 😅 but will log.exception (without the error as argument) display the error info for debugging purposes (eg. message, stack trace, etc)? I imagine so, just want to confirm so this doesn´t become a mistery later on!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does. I did a quick local test to double-check it.

Copy link
Member

@lucasgameiroborges lucasgameiroborges left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🚀

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The writes_increasing checks seem to be failing consistently on juju 3.

Signed-off-by: Marcelo Henrique Neppel <[email protected]>
ops_test.model.deploy(
DATABASE_APP_NAME,
num_units=3,
channel="14/stable",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea. I updated it on f02e964.

logger.info("Wait for applications to become active")
async with ops_test.fast_forward():
await ops_test.model.wait_for_idle(
apps=[DATABASE_APP_NAME, APPLICATION_NAME], status="active"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Thanks for letting me know about that! I updated this PR with that on f02e964.

logger.info("Wait for applications to become active")
async with ops_test.fast_forward():
await ops_test.model.wait_for_idle(
apps=[DATABASE_APP_NAME, APPLICATION_NAME], status="active"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this PR with that on f02e964.

try:
self._create_services()
except ApiError:
logger.exception("failed to create k8s services")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does. I did a quick local test to double-check it.

Comment on lines +131 to +134
connection = psycopg2.connect(
f"dbname='{database if database else self.database}' user='{self.user}' host='{host}'"
f"password='{self.password}' connect_timeout=1"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted due to failures in the async replication test (timeouts).

@marceloneppel marceloneppel merged commit 988d3ca into main Jul 22, 2024
84 checks passed
@marceloneppel marceloneppel deleted the dpe-1177-primary-role-label branch July 22, 2024 19:56
@taurus-forever taurus-forever mentioned this pull request Jul 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants