
Add ec2 IPv6 IMDS Support (SC-336) #1160

Merged: 54 commits into canonical:main, Apr 14, 2022

Conversation

@holmanb (Member) commented on Dec 15, 2021

Add support for dual stack ipv6/ipv4 IMDS to Ec2

- add support for parallel http(s) requests to wait_for_url()
- implementation based loosely on RFC6555: "Happy Eyeballs" [1]
- update ec2 datasource to support dual-stack ipv6/ipv4
- only "nitro" instances have ipv6 IMDS, favor ipv4 initially
- replace httpretty with requests for ec2 tests

[1] https://datatracker.ietf.org/doc/html/rfc6555

Additional Context

In contrast to "happy eyeballs", this implementation supports parallel attempts to communicate, with a small delay favoring ipv4. This is datasource-specific, and other datasources that wish to favor ipv6 can do so trivially.

Technical Description of Changes:

  1. use a threadpool executor to wrap synchronous calls in an async scheduler (see url_helper.py:dual_stack()) that applies a 0.150s delay, cancels pending requests, etc. (basically a stateless, generic happy-eyeballs implementation for synchronous libraries, operating at the HTTP layer rather than the socket/connection layer)
  2. refactor wait_for_url() into reusable pieces to support optionally parallel http requests to endpoints
  3. add tests that verify behavior by simulating a slow/unresponsive endpoint
  4. update Ec2 datasource to support parallel GETs (PUTs are left synchronous)
  5. update Ec2 tests to test with ipv6
  6. make Ec2 favor ipv4 (for now), when majority of cloud has ipv6 support this can change to favor ipv6
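
The staggered race in step 1 can be sketched in a few lines, assuming a generic `fetch` callable. The names `dual_stack` and `stagger_delay` here are illustrative only; the real implementation in cloudinit/url_helper.py differs in its callback handling, cancellation, and logging.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def dual_stack(fetch, addresses, stagger_delay=0.150, timeout=10.0):
    """Race fetch(address) calls; earlier addresses get a head start.

    Returns the first successful result and abandons the rest; if
    every attempt fails, re-raises the last exception seen.
    """

    def staggered(address, start_delay):
        time.sleep(start_delay)  # small delay favors earlier entries
        return fetch(address)

    with ThreadPoolExecutor(max_workers=len(addresses)) as pool:
        futures = [
            pool.submit(staggered, address, i * stagger_delay)
            for i, address in enumerate(addresses)
        ]
        last_error = None
        for future in as_completed(futures, timeout=timeout):
            try:
                result = future.result()
            except Exception as error:  # one loser must not end the race
                last_error = error
                continue
            for other in futures:  # best-effort cancel of pending attempts
                other.cancel()
            return result
        raise last_error


# Favoring ipv4 then just means listing the IPv4 endpoint first, so the
# IPv6 attempt starts 150ms later, e.g. (hypothetical requests wrapper):
# dual_stack(get_url, ["http://169.254.169.254", "http://[fd00:ec2::254]"])
```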

@github-actions (bot)

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag mitechie to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Dec 30, 2021
@TheRealFalcon TheRealFalcon removed the stale-pr Pull request is stale; will be auto-closed soon label Dec 30, 2021
@holmanb holmanb force-pushed the holmanb/ec2-dualstack branch 2 times, most recently from 25dc4e1 to ddad201 Compare January 12, 2022 14:23
@holmanb holmanb changed the title Add ec2 ipv6 imds support Add ec2 IPv6 IMDS Support (SC-336) Jan 20, 2022
@holmanb holmanb marked this pull request as ready for review January 20, 2022 04:45
@holmanb (Member, Author) commented on Jan 20, 2022

Edit: It seems the distro-provided version of httpretty causes issues with some tests, switching to pip version (newer) resolved this issue.

@TheRealFalcon (Member) left a comment

Thanks @holmanb . This has come together really nicely. I have just a few small comments.

Do we need to change the call on line 250 of DataSourceEc2.py to be asynchronous as well?
I think the start_time and status_cb assignments in wait_for_url might be better placed under all of the nested functions.

I went back and forth about how abstract this is. On one hand, the code is elegant and well structured. On the other hand, it can be hard to follow the flow through the layers of callbacks, partials, and closures. But then again, unwinding it all or adding types might cause it to be even harder to follow. I think at the end of the day, this is just a more functional approach, and it's really just a few functions, so I think it's fine.

Have you tested this on a live ec2 instance yet? I think it'd be worth ensuring we can fetch metadata in both ipv4-only and ipv6-only environments. If not, we should do that before we merge this. Can you post your test steps in the PR description so they can be easily reproduced in the future?

I think it'd be worth figuring out how to instrument pycloudlib to launch an ipv6 instance so we can write an integration test, but that doesn't need to come with this PR...though it'd be nice before our next SRU.

(Two inline review threads on cloudinit/url_helper.py, resolved.)
@holmanb (Member, Author) commented on Jan 21, 2022

> This has come together really nicely.

Thank you :)

> Do we need to change the call on line 250 of DataSourceEc2.py to be asynchronous as well?

I debated this myself, and honestly I still don't know the right answer. I need a better understanding of the semantics of AWS tokens. I should have noted this in the comments when I switched to requesting review, apologies.

I do know that if we do make the PUT on line 250 async, the assertion at test_aws_token_redacted():673 will fail with len == 5, because the PUT happens multiple times. What does this actually mean?

> I think the start_time and status_cb assignments in wait_for_url might be better placed under all of the nested functions.

Agreed

> I went back and forth about how abstract this is. On one hand, the code is elegant and well structured. On the other hand, it can be hard to follow the flow through the layers of callbacks, partials, and closures. But then again, unwinding it all or adding types might cause it to be even harder to follow. I think at the end of the day, this is just a more functional approach, and it's really just a few functions, so I think it's fine.

My sentiments exactly. I should note that since this code already had lots of callbacks it didn't seem too out of place when writing it. That said, I can always restructure to something less functional if needed. One concern I have with doing that is that there would likely be more code duplication and resulting potential for subtle differences in behavior between the sync/parallel implementations.

> Have you tested this on a live ec2 instance yet? I think it'd be worth ensuring we can fetch metadata in both ipv4-only and ipv6-only environments. If not, we should do that before we merge this. Can you post your test steps in the PR description so they can be easily reproduced in the future?

I have not yet. Since I haven't heard strong objections to the approach so far, I'll get started on that next.

> I think it'd be worth figuring out how to instrument pycloudlib to launch an ipv6 instance so we can write an integration test, but that doesn't need to come with this PR...though it'd be nice before our next SRU.

I like that idea.

@blackboxsw (Collaborator) left a comment

This is great work. Please add a fairly detailed suggested commit message in the description of the PR as it merits a lot more context about the design/architecture here.
Also a pointer to the RFC 6555 link in the commit message will be helpful.

(Inline review threads on cloudinit/url_helper.py and cloudinit/sources/DataSourceEc2.py, resolved.)
return_exception,
returned_address,
)
raise return_exception
A collaborator commented on this snippet:

Would we want to specifically look at requests.ConnectionError failures, which basically tell us there is no route to host for IPv6, and ignore them? No route means we really aren't ever going to succeed there, right?

Member Author replied:

Interesting question. I don't think that "no route to host" should ever happen (because we're using link-local addresses[1]), but maybe I'm wrong. Generally the idea of "ignore connection failures in case the others succeed" is something I can get behind.

[1] To my knowledge, Linux doesn't require FIB entries for sending directly to a link-local address.

A quick experiment:

[ec2-user@ip-172-31-6-134 ~]$ ip r
default via 172.31.0.1 dev eth0 
169.254.169.254 dev eth0 
172.31.0.0/20 dev eth0 proto kernel scope link src 172.31.6.134 
[ec2-user@ip-172-31-6-134 ~]$ sudo ip r del 169.254.169.254
[ec2-user@ip-172-31-6-134 ~]$ ip route show table all | grep -e 169 -e ec2
[ec2-user@ip-172-31-6-134 ~]$ wget http://169.254.169.254
--2022-01-25 23:39:54--  http://169.254.169.254/
Connecting to 169.254.169.254:80... connected.
HTTP request sent, awaiting response... 200 OK

Even though the link-local route was removed from the routing table, it was still able to communicate using that address.

Collaborator replied:

Ahh maybe I was misinterpreting the log I saw on an IPv4 only instance:
2022-01-25 22:34:46,195 - url_helper.py[WARNING]: Calling 'None' failed [0/120s]: request error [HTTPConnectionPool(host='fd00:ec2::254', port=80): Max retries exceeded with url: /latest/api/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7facb9c98400>: Failed to establish a new connection: [Errno 101] Network is unreachable'))]

Member Author replied:

Thanks for the context. "Network is unreachable" is often related to routing issues, I see what you were getting at. Regardless of what "Network is unreachable" is caused by, I think this suggestion is a good one. Specifically this would make connectivity possible in the case that there is a slow first address and an unreachable second address.
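
The behavior agreed on here, ignoring per-address connection failures as long as another attempt may still succeed and raising only once everything has failed, can be sketched as follows. This is a hypothetical, sequential simplification: the real code applies the same idea inside the parallel race in url_helper.py and catches requests.exceptions.ConnectionError rather than the builtin ConnectionError used here.

```python
def first_reachable(attempts):
    """Try each zero-argument callable in turn.

    A ConnectionError ("network is unreachable", "no route to host")
    from one address family must not abort the whole fetch while the
    other family might still succeed. Only if every attempt fails is
    the last error re-raised.
    """
    errors = []
    for attempt in attempts:
        try:
            return attempt()
        except ConnectionError as error:
            errors.append(error)  # unreachable here; let the others try
    raise errors[-1]
```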

Member Author followed up:

> Interesting question. I don't think that "no route to host" should ever happen (because we're using link-local addresses[1]), but maybe I'm wrong.

Update: I was wrong.

My former argument here was invalid. The link-local address used on ec2 instances is actually the destination address, not the source address. HTTP(S) traffic leaves via the default route from a private (RFC 1918) address.

Remove default routes and this error could happen:

[ec2-user@ip-172-31-6-134 ~]$ wget http://[fd00:ec2::254]/latest/meta-data/
--2022-02-10 22:36:17--  http://[fd00:ec2::254]/latest/meta-data/
Connecting to [fd00:ec2::254]:80... failed: Network is unreachable.

I still think that the route I deleted in the example above (169.254.169.254 dev eth0) is unneeded, but for different reasons.

This comment doesn't imply a different course of action, but I wanted to set the record straight here.

@blackboxsw (Collaborator) left a comment

Forgot to set "request changes" for moving connect_synchronously up into _maybe_fetch_api_token.

@blackboxsw (Collaborator) commented on Jan 26, 2022

> I think it'd be worth figuring out how to instrument pycloudlib to launch an ipv6 instance so we can write an integration test, but that doesn't need to come with this PR...though it'd be nice before our next SRU.

pycloudlib does have the capacity to launch a dual-stack ipv4/ipv6 VPC (though this may not yet work against an IPv6-only IMDS) via something like:

vpc = ec2.get_or_create_vpc(name="cloud-nit-dualstack-vpc")
ec2.launch(image_id, vpc=vpc)

So I think we can subclass tests.integrationtests.cloud.Ec2Cloud._perform_launch to call get_or_create_vpc first in cases where we want dual-stack support for testing.

(Stale-PR notice posted again by @github-actions, this time tagging TheRealFalcon.)
@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Feb 25, 2022
@github-actions github-actions bot closed this Mar 4, 2022
@TheRealFalcon TheRealFalcon reopened this Mar 4, 2022
@TheRealFalcon TheRealFalcon removed the stale-pr Pull request is stale; will be auto-closed soon label Mar 7, 2022
(Stale-PR notice posted again by @github-actions.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Mar 22, 2022
@TheRealFalcon TheRealFalcon removed the stale-pr Pull request is stale; will be auto-closed soon label Mar 22, 2022
@holmanb holmanb force-pushed the holmanb/ec2-dualstack branch 3 times, most recently from d3f2b45 to 4a99f79 Compare March 22, 2022 20:23
@TheRealFalcon (Member) commented on Apr 4, 2022

The failing test is an annoying bug that was fixed a few years ago:
getsentry/responses#212

Here's a workaround:

diff --git a/tests/unittests/sources/test_ec2.py b/tests/unittests/sources/test_ec2.py
index 3848924d..9e373632 100644
--- a/tests/unittests/sources/test_ec2.py
+++ b/tests/unittests/sources/test_ec2.py
@@ -530,6 +530,14 @@ class TestEc2(test_helpers.HttprettyTestCase):
             md={"md": old_metadata},
         )
         self.assertTrue(ds.get_data())
+
+        # Workaround https://github.com/getsentry/responses/issues/212
+        if hasattr(responses.mock, "_urls"):
+            for index, url in enumerate(responses.mock._urls):
+                if url["url"].startswith(
+                    "http://169.254.169.254/2009-04-04/meta-data/"
+                ):
+                    del responses.mock._urls[index]
         # Provide new revision of metadata that contains network data
         register_mock_metaserver(
             "http://169.254.169.254/2009-04-04/meta-data/", DEFAULT_METADATA

Since it's only one test, I'm fine just inserting it there. If this becomes a bigger issue we'd probably want to find a more elegant solution.

@paride (Contributor) commented on Apr 5, 2022

@TheRealFalcon nice! Also matches my findings about responses 0.17.0 being the first version without this issue. Maybe worth adding a line to the comment explaining that the workaround can be removed once there is no need for testing with responses < 0.17.0 (that is, post-Focal EOL)?

@TheRealFalcon (Member):
Yeah, good idea, though I think the workaround is only necessary in Bionic.

@holmanb (Member, Author) commented on Apr 6, 2022

> The failing test is an annoying bug that was fixed a few years ago: getsentry/responses#212

Thanks! This works locally so I have included it.

Also, I had to rebase on the tip of main since the latest commit introduced a merge conflict in one of the tests I changed.

Previously the 403 test verified that the following output was logged:

	"PUT /latest/api/token HTTP/1.1" 403 0

This log output is from urllib3.connectionpool, and is redundant with
the other checked logs, one of which is only logged when code==403. We don't
get this log currently when run under dual-stack, so drop it from the test.
This allows a ConnectionError exception to be thrown in one of the
threads without canceling the other thread; the first successful call
is returned, and the rest are canceled. If all threads raise
exceptions, log them all and raise the last one (we need to choose one
somehow, right? I chose the last; tell me if there is a reason we
should raise the first in order of execution, or first in alphabetical
order, or whatever).

Tests to verify the above described behavior have been added.
- type hints
- make sure response gets sent to optional time callback, because that's what it used to do
- don't assign lambda functions to kwargs, who does that
@holmanb (Member, Author) commented on Apr 8, 2022

I believe this is ready for re-review.

@TheRealFalcon (Member):
I made an integration test for this. If you use canonical/pycloudlib#188 and apply this patch, it should work:

diff --git a/tests/integration_tests/clouds.py b/tests/integration_tests/clouds.py
index e5fe56d1..5ea41cbb 100644
--- a/tests/integration_tests/clouds.py
+++ b/tests/integration_tests/clouds.py
@@ -207,10 +207,14 @@ class Ec2Cloud(IntegrationCloud):
 
     def _perform_launch(self, launch_kwargs, **kwargs):
         """Use a dual-stack VPC for cloud-init integration testing."""
-        launch_kwargs["vpc"] = self.cloud_instance.get_or_create_vpc(
-            name="ec2-cloud-init-integration"
-        )
+        if "vpc" not in launch_kwargs:
+            launch_kwargs["vpc"] = self.cloud_instance.get_or_create_vpc(
+                name="ec2-cloud-init-integration"
+            )
+        if "Ipv6AddressCount" not in launch_kwargs:
+            launch_kwargs["Ipv6AddressCount"] = 1
         pycloudlib_instance = self.cloud_instance.launch(**launch_kwargs)
+        pycloudlib_instance.enable_ipv6_instance_metadata()
         return pycloudlib_instance
 
 
diff --git a/tests/integration_tests/datasources/test_ec2.py b/tests/integration_tests/datasources/test_ec2.py
new file mode 100644
index 00000000..8cde4dc9
--- /dev/null
+++ b/tests/integration_tests/datasources/test_ec2.py
@@ -0,0 +1,43 @@
+import re
+
+import pytest
+
+from tests.integration_tests.instances import IntegrationInstance
+
+
+def _test_crawl(client, ip):
+    assert client.execute("cloud-init clean --logs").ok
+    assert client.execute("cloud-init init --local").ok
+    log = client.read_from_file("/var/log/cloud-init.log")
+    assert f"Using metadata source: '{ip}'" in log
+    result = re.findall(
+        r"Crawl of metadata service took (\d+.\d+) seconds", log
+    )
+    if len(result) != 1:
+        pytest.fail(f"Expected 1 metadata crawl time, got {result}")
+    # 20 would still be a crazy long time for metadata service to crawl,
+    # but it's short enough to know we're not waiting for a response
+    assert float(result[0]) < 20
+
+
+@pytest.mark.ec2
+def test_dual_stack(client: IntegrationInstance):
+    # Drop IPv4 responses
+    assert client.execute("iptables -I INPUT -s 169.254.169.254 -j DROP").ok
+    _test_crawl(client, "http://[fd00:ec2::254]")
+
+    # Block IPv4 requests
+    assert client.execute("iptables -I OUTPUT -d 169.254.169.254 -j REJECT").ok
+    _test_crawl(client, "http://[fd00:ec2::254]")
+
+    # Re-enable IPv4
+    assert client.execute("iptables -D OUTPUT -d 169.254.169.254 -j REJECT").ok
+    assert client.execute("iptables -D INPUT -s 169.254.169.254 -j DROP").ok
+
+    # Drop IPv6 responses
+    assert client.execute("ip6tables -I INPUT -s fd00:ec2::254 -j DROP").ok
+    _test_crawl(client, "http://169.254.169.254")
+
+    # Block IPv6 requests
+    assert client.execute("ip6tables -I OUTPUT -d fd00:ec2::254 -j REJECT").ok
+    _test_crawl(client, "http://169.254.169.254")

It'll default all of our ec2 tests to being dual-stack with ipv6 metadata enabled, but I don't see any reason that would be a problem.

@TheRealFalcon (Member) left a comment

A few minor comments inline, but otherwise looks good. Once we get the dual-stack timeout fixed and the integration test passes, we should be good to go.

Additionally, some tickets we should create for follow-up:

  1. Add pytest-responses to Build-Depends of all the packaging branches. I also plan on adding pytest-mock, so we could do those at the same time.
  2. Replace httpretty with responses everywhere. No need to have 2 libraries doing the same thing.
  3. Ensure our EphemeralDHCPv4 code gets updated accordingly for an IPv6-only environment

(Inline review threads on cloudinit/url_helper.py and tests/unittests/sources/test_ec2.py, resolved.)
@TheRealFalcon (Member) left a comment

This looks good to me now. Thanks for all the work you put into this!

@TheRealFalcon TheRealFalcon merged commit 4cd1f9f into canonical:main Apr 14, 2022
holmanb pushed a commit that referenced this pull request Nov 17, 2022
- Add openstack IPv6 metadata url fe80::a9fe:a9fe
- Enable requesting multiple metadata sources in parallel

This PR is very similar to #1160, reusing the provided `url_helper` logic.

LP: #1906849