
Add ec2 IPv6 IMDS Support (SC-336) #1160

Merged: 54 commits into canonical:main, Apr 14, 2022

Conversation

@holmanb (Member) commented on Dec 15, 2021

Add support for dual stack ipv6/ipv4 IMDS to Ec2

- add support for parallel http(s) requests to wait_for_url()
- implementation based loosely on RFC6555: "Happy Eyeballs" [1]
- update ec2 datasource to support dual-stack ipv6/ipv4
- only "nitro" instances have ipv6 IMDS, favor ipv4 initially
- replace httpretty with requests for ec2 tests

[1] https://datatracker.ietf.org/doc/html/rfc6555

Additional Context

In contrast to "happy eyeballs", this implementation supports parallel attempts to communicate, with a small delay favoring ipv4. This is datasource-specific, and other datasources that wish to favor ipv6 can do so trivially.

Technical Description of Changes:

  1. use a threadpool executor to wrap synchronous calls in an async scheduler (see url_helper.py:dual_stack()) that applies a 0.150s delay, cancels pending requests, etc. (basically a stateless, generic happy-eyeballs implementation for synchronous libraries, operating at the HTTP layer rather than the socket/connection layer)
  2. refactor wait_for_url() into reusable pieces to support optionally parallel http requests to endpoints
  3. add tests that verify behavior by simulating a slow/unresponsive endpoint
  4. update Ec2 datasource to support parallel GETs (PUTs are left synchronous)
  5. update Ec2 tests to test with ipv6
  6. make Ec2 favor ipv4 (for now), when majority of cloud has ipv6 support this can change to favor ipv6
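
The staggered race in step 1 can be sketched in a few lines, assuming a generic `fetch` callable. The names `dual_stack` and `stagger_delay` here are illustrative only; the real implementation in cloudinit/url_helper.py differs in its callback handling, cancellation, and logging.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def dual_stack(fetch, addresses, stagger_delay=0.150, timeout=10.0):
    """Race fetch(address) calls; earlier addresses get a head start.

    Returns the first successful result and abandons the rest; if
    every attempt fails, re-raises the last exception seen.
    """

    def staggered(address, start_delay):
        time.sleep(start_delay)  # small delay favors earlier entries
        return fetch(address)

    with ThreadPoolExecutor(max_workers=len(addresses)) as pool:
        futures = [
            pool.submit(staggered, address, i * stagger_delay)
            for i, address in enumerate(addresses)
        ]
        last_error = None
        for future in as_completed(futures, timeout=timeout):
            try:
                result = future.result()
            except Exception as error:  # one loser must not end the race
                last_error = error
                continue
            for other in futures:  # best-effort cancel of pending attempts
                other.cancel()
            return result
        raise last_error


# Favoring ipv4 then just means listing the IPv4 endpoint first, so the
# IPv6 attempt starts 150ms later, e.g. (hypothetical requests wrapper):
# dual_stack(get_url, ["http://169.254.169.254", "http://[fd00:ec2::254]"])
```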

@github-actions (bot)

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag mitechie to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Dec 30, 2021
@TheRealFalcon TheRealFalcon removed the stale-pr Pull request is stale; will be auto-closed soon label Dec 30, 2021
@holmanb holmanb force-pushed the holmanb/ec2-dualstack branch 2 times, most recently from 25dc4e1 to ddad201 Compare January 12, 2022 14:23
@holmanb holmanb changed the title Add ec2 ipv6 imds support Add ec2 IPv6 IMDS Support (SC-336) Jan 20, 2022
@holmanb holmanb marked this pull request as ready for review January 20, 2022 04:45
@holmanb (Member, Author) commented on Jan 20, 2022

Edit: It seems the distro-provided version of httpretty causes issues with some tests, switching to pip version (newer) resolved this issue.

@TheRealFalcon (Member) left a comment

Thanks @holmanb . This has come together really nicely. I have just a few small comments.

Do we need to change the call on line 250 of DataSourceEc2.py to be asynchronous as well?
I think the start_time and status_cb assignments in wait_for_url might be better placed under all of the nested functions.

I went back and forth about how abstract this is. On one hand, the code is elegant and well structured. On the other hand, it can be hard to follow the flow through the layers of callbacks, partials, and closures. But then again, unwinding it all or adding types might cause it to be even harder to follow. I think at the end of the day, this is just a more functional approach, and it's really just a few functions, so I think it's fine.

Have you tested this on a live ec2 instance yet? I think it'd be worth ensuring we can fetch metadata in both ipv4-only and ipv6-only environments. If not, we should do that before we merge this. Can you post your test steps in the PR description so they can be easily reproduced in the future?

I think it'd be worth figuring out how to instrument pycloudlib to launch an ipv6 instance so we can write an integration test, but that doesn't need to come with this PR...though it'd be nice before our next SRU.

(Two inline review threads on cloudinit/url_helper.py, resolved.)
@holmanb (Member, Author) commented on Jan 21, 2022

> This has come together really nicely.

Thank you :)

> Do we need to change the call on line 250 of DataSourceEc2.py to be asynchronous as well?

I debated this myself, and honestly I still don't know the right answer. I need a better understanding of the semantics of AWS tokens. I should have noted this in the comments when I switched to requesting review, apologies.

I do know that if we do make the PUT on line 250 async, the assertion at test_aws_token_redacted():673 will fail with len == 5, because the PUT happens multiple times. What does this actually mean?

> I think the start_time and status_cb assignments in wait_for_url might be better placed under all of the nested functions.

Agreed

> I went back and forth about how abstract this is. On one hand, the code is elegant and well structured. On the other hand, it can be hard to follow the flow through the layers of callbacks, partials, and closures. But then again, unwinding it all or adding types might cause it to be even harder to follow. I think at the end of the day, this is just a more functional approach, and it's really just a few functions, so I think it's fine.

My sentiments exactly. I should note that since this code already had lots of callbacks it didn't seem too out of place when writing it. That said, I can always restructure to something less functional if needed. One concern I have with doing that is that there would likely be more code duplication and resulting potential for subtle differences in behavior between the sync/parallel implementations.

> Have you tested this on a live ec2 instance yet? I think it'd be worth ensuring we can fetch metadata in both ipv4-only and ipv6-only environments. If not, we should do that before we merge this. Can you post your test steps in the PR description so they can be easily reproduced in the future?

I have not yet. Since I haven't heard strong objections to the approach so far, I'll get started on that next.

> I think it'd be worth figuring out how to instrument pycloudlib to launch an ipv6 instance so we can write an integration test, but that doesn't need to come with this PR...though it'd be nice before our next SRU.

I like that idea.

@blackboxsw (Collaborator) left a comment

This is great work. Please add a fairly detailed suggested commit message in the description of the PR as it merits a lot more context about the design/architecture here.
Also a pointer to the RFC 6555 link in the commit message will be helpful.

(Inline review threads on cloudinit/url_helper.py and cloudinit/sources/DataSourceEc2.py, resolved.)
return_exception,
returned_address,
)
raise return_exception
A collaborator commented on this snippet:

Would we want to specifically look at requests.ConnectionError failures, which basically tell us there is no route to host for IPv6, and ignore them? No route means we really aren't ever going to succeed there, right?

Member Author replied:

Interesting question. I don't think that "no route to host" should ever happen (because we're using link-local addresses[1]), but maybe I'm wrong. Generally the idea of "ignore connection failures in case the others succeed" is something I can get behind.

[1] To my knowledge, Linux doesn't require FIB entries for sending directly to a link-local address.

A quick experiment:

[ec2-user@ip-172-31-6-134 ~]$ ip r
default via 172.31.0.1 dev eth0 
169.254.169.254 dev eth0 
172.31.0.0/20 dev eth0 proto kernel scope link src 172.31.6.134 
[ec2-user@ip-172-31-6-134 ~]$ sudo ip r del 169.254.169.254
[ec2-user@ip-172-31-6-134 ~]$ ip route show table all | grep -e 169 -e ec2
[ec2-user@ip-172-31-6-134 ~]$ wget http://169.254.169.254
--2022-01-25 23:39:54--  http://169.254.169.254/
Connecting to 169.254.169.254:80... connected.
HTTP request sent, awaiting response... 200 OK

Even though the link-local route was removed from the routing table, it was still able to communicate using that address.

Collaborator replied:

Ahh maybe I was misinterpreting the log I saw on an IPv4 only instance:
2022-01-25 22:34:46,195 - url_helper.py[WARNING]: Calling 'None' failed [0/120s]: request error [HTTPConnectionPool(host='fd00:ec2::254', port=80): Max retries exceeded with url: /latest/api/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7facb9c98400>: Failed to establish a new connection: [Errno 101] Network is unreachable'))]

Member Author replied:

Thanks for the context. "Network is unreachable" is often related to routing issues, I see what you were getting at. Regardless of what "Network is unreachable" is caused by, I think this suggestion is a good one. Specifically this would make connectivity possible in the case that there is a slow first address and an unreachable second address.
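
The behavior agreed on here, ignoring per-address connection failures as long as another attempt may still succeed and raising only once everything has failed, can be sketched as follows. This is a hypothetical, sequential simplification: the real code applies the same idea inside the parallel race in url_helper.py and catches requests.exceptions.ConnectionError rather than the builtin ConnectionError used here.

```python
def first_reachable(attempts):
    """Try each zero-argument callable in turn.

    A ConnectionError ("network is unreachable", "no route to host")
    from one address family must not abort the whole fetch while the
    other family might still succeed. Only if every attempt fails is
    the last error re-raised.
    """
    errors = []
    for attempt in attempts:
        try:
            return attempt()
        except ConnectionError as error:
            errors.append(error)  # unreachable here; let the others try
    raise errors[-1]
```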

Member Author followed up:

> Interesting question. I don't think that "no route to host" should ever happen (because we're using link-local addresses[1]), but maybe I'm wrong.

Update: I was wrong.

My former argument here was invalid. The link-local address used on ec2 instances is actually the destination address, not the source address. HTTP(S) traffic leaves via the default route from a private (RFC 1918) address.

Remove default routes and this error could happen:

[ec2-user@ip-172-31-6-134 ~]$ wget http://[fd00:ec2::254]/latest/meta-data/
--2022-02-10 22:36:17--  http://[fd00:ec2::254]/latest/meta-data/
Connecting to [fd00:ec2::254]:80... failed: Network is unreachable.

I still think that the route I deleted in the example above (169.254.169.254 dev eth0) is unneeded, but for different reasons.

This comment doesn't imply a different course of action, but I wanted to set the record straight here.

@blackboxsw (Collaborator) left a comment

Forgot to set "request changes" for moving connect_synchronously up into _maybe_fetch_api_token.

@blackboxsw (Collaborator) commented on Jan 26, 2022

> I think it'd be worth figuring out how to instrument pycloudlib to launch an ipv6 instance so we can write an integration test, but that doesn't need to come with this PR...though it'd be nice before our next SRU.

pycloudlib does have the capacity to launch a dual-stack ipv4/ipv6 VPC (though this may not yet work against an IPv6-only IMDS) via something like:

vpc = ec2.get_or_create_vpc(name="cloud-nit-dualstack-vpc")
ec2.launch(image_id, vpc=vpc)

So I think we can subclass tests.integrationtests.cloud.Ec2Cloud._perform_launch to call get_or_create_vpc first in cases where we want dual-stack support for testing.

(Stale-PR notice posted again by @github-actions, this time tagging TheRealFalcon.)
@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Feb 25, 2022
@github-actions github-actions bot closed this Mar 4, 2022
@TheRealFalcon TheRealFalcon reopened this Mar 4, 2022
@TheRealFalcon TheRealFalcon removed the stale-pr Pull request is stale; will be auto-closed soon label Mar 7, 2022
(Stale-PR notice posted again by @github-actions.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Mar 22, 2022
@TheRealFalcon TheRealFalcon removed the stale-pr Pull request is stale; will be auto-closed soon label Mar 22, 2022
@holmanb holmanb force-pushed the holmanb/ec2-dualstack branch 3 times, most recently from d3f2b45 to 4a99f79 Compare March 22, 2022 20:23
@TheRealFalcon (Member) commented on Apr 4, 2022

The failing test is an annoying bug that was fixed a few years ago:
getsentry/responses#212

Here's a workaround:

diff --git a/tests/unittests/sources/test_ec2.py b/tests/unittests/sources/test_ec2.py
index 3848924d..9e373632 100644
--- a/tests/unittests/sources/test_ec2.py
+++ b/tests/unittests/sources/test_ec2.py
@@ -530,6 +530,14 @@ class TestEc2(test_helpers.HttprettyTestCase):
             md={"md": old_metadata},
         )
         self.assertTrue(ds.get_data())
+
+        # Workaround https://github.com/getsentry/responses/issues/212
+        if hasattr(responses.mock, "_urls"):
+            for index, url in enumerate(responses.mock._urls):
+                if url["url"].startswith(
+                    "http://169.254.169.254/2009-04-04/meta-data/"
+                ):
+                    del responses.mock._urls[index]
         # Provide new revision of metadata that contains network data
         register_mock_metaserver(
             "http://169.254.169.254/2009-04-04/meta-data/", DEFAULT_METADATA

Since it's only one test, I'm fine just inserting it there. If this becomes a bigger issue we'd probably want to find a more elegant solution.

@paride (Contributor) commented on Apr 5, 2022

@TheRealFalcon nice! Also matches my findings about responses 0.17.0 being the first version without this issue. Maybe worth adding a line to the comment explaining that the workaround can be removed once there is no need for testing with responses < 0.17.0 (that is, post-Focal EOL)?

@TheRealFalcon (Member):
Yeah, good idea, though I think the workaround is only necessary in Bionic.

@holmanb (Member, Author) commented on Apr 6, 2022

> The failing test is an annoying bug that was fixed a few years ago: getsentry/responses#212

Thanks! This works locally so I have included it.

Also, I had to rebase on the tip of main since the latest commit introduced a merge conflict in one of the tests I changed.

Previously the 403 test verified that the following output was logged:

	"PUT /latest/api/token HTTP/1.1" 403 0

This log output is from urllib3.connectionpool, and is redundant with
the other checked logs, one of which is only logged when code==403. We don't
get this log currently when run under dual-stack, so drop it from the test.
This allows a ConnectionError exception to be thrown in one of the
threads without canceling the other thread; the first successful call
is returned, and the rest are canceled. If all threads raise
exceptions, log them all and raise the last one (we need to choose one
somehow, right? I chose the last; tell me if there is a reason we
should raise the first in order of execution, or first in alphabetical
order, or whatever).

Tests to verify the above described behavior have been added.
- type hints
- make sure response gets sent to optional time callback, because that's what it used to do
- don't assign lambda functions to kwargs, who does that
@holmanb (Member, Author) commented on Apr 8, 2022

I believe this is ready for re-review.

@TheRealFalcon (Member):
I made an integration test for this. If you use canonical/pycloudlib#188 and apply this patch, it should work:

diff --git a/tests/integration_tests/clouds.py b/tests/integration_tests/clouds.py
index e5fe56d1..5ea41cbb 100644
--- a/tests/integration_tests/clouds.py
+++ b/tests/integration_tests/clouds.py
@@ -207,10 +207,14 @@ class Ec2Cloud(IntegrationCloud):
 
     def _perform_launch(self, launch_kwargs, **kwargs):
         """Use a dual-stack VPC for cloud-init integration testing."""
-        launch_kwargs["vpc"] = self.cloud_instance.get_or_create_vpc(
-            name="ec2-cloud-init-integration"
-        )
+        if "vpc" not in launch_kwargs:
+            launch_kwargs["vpc"] = self.cloud_instance.get_or_create_vpc(
+                name="ec2-cloud-init-integration"
+            )
+        if "Ipv6AddressCount" not in launch_kwargs:
+            launch_kwargs["Ipv6AddressCount"] = 1
         pycloudlib_instance = self.cloud_instance.launch(**launch_kwargs)
+        pycloudlib_instance.enable_ipv6_instance_metadata()
         return pycloudlib_instance
 
 
diff --git a/tests/integration_tests/datasources/test_ec2.py b/tests/integration_tests/datasources/test_ec2.py
new file mode 100644
index 00000000..8cde4dc9
--- /dev/null
+++ b/tests/integration_tests/datasources/test_ec2.py
@@ -0,0 +1,43 @@
+import re
+
+import pytest
+
+from tests.integration_tests.instances import IntegrationInstance
+
+
+def _test_crawl(client, ip):
+    assert client.execute("cloud-init clean --logs").ok
+    assert client.execute("cloud-init init --local").ok
+    log = client.read_from_file("/var/log/cloud-init.log")
+    assert f"Using metadata source: '{ip}'" in log
+    result = re.findall(
+        r"Crawl of metadata service took (\d+.\d+) seconds", log
+    )
+    if len(result) != 1:
+        pytest.fail(f"Expected 1 metadata crawl time, got {result}")
+    # 20 would still be a crazy long time for metadata service to crawl,
+    # but it's short enough to know we're not waiting for a response
+    assert float(result[0]) < 20
+
+
+@pytest.mark.ec2
+def test_dual_stack(client: IntegrationInstance):
+    # Drop IPv4 responses
+    assert client.execute("iptables -I INPUT -s 169.254.169.254 -j DROP").ok
+    _test_crawl(client, "http://[fd00:ec2::254]")
+
+    # Block IPv4 requests
+    assert client.execute("iptables -I OUTPUT -d 169.254.169.254 -j REJECT").ok
+    _test_crawl(client, "http://[fd00:ec2::254]")
+
+    # Re-enable IPv4
+    assert client.execute("iptables -D OUTPUT -d 169.254.169.254 -j REJECT").ok
+    assert client.execute("iptables -D INPUT -s 169.254.169.254 -j DROP").ok
+
+    # Drop IPv6 responses
+    assert client.execute("ip6tables -I INPUT -s fd00:ec2::254 -j DROP").ok
+    _test_crawl(client, "http://169.254.169.254")
+
+    # Block IPv6 requests
+    assert client.execute("ip6tables -I OUTPUT -d fd00:ec2::254 -j REJECT").ok
+    _test_crawl(client, "http://169.254.169.254")

It'll default all of our ec2 tests to being dual-stack with ipv6 metadata enabled, but I don't see any reason that would be a problem.

@TheRealFalcon (Member) left a comment

A few minor comments inline, but otherwise looks good. Once we get the dual-stack timeout fixed and the integration test passes, we should be good to go.

Additionally, some tickets we should create for follow-up:

  1. Add pytest-responses to Build-Depends of all the packaging branches. I also plan on adding pytest-mock, so we could do those at the same time.
  2. Replace httpretty with responses everywhere. No need to have 2 libraries doing the same thing.
  3. Ensure our EphemeralDHCPv4 code gets updated accordingly for an IPv6-only environment

(Inline review threads on cloudinit/url_helper.py and tests/unittests/sources/test_ec2.py, resolved.)
@TheRealFalcon (Member) left a comment

This looks good to me now. Thanks for all the work you put into this!

@TheRealFalcon TheRealFalcon merged commit 4cd1f9f into canonical:main Apr 14, 2022
holmanb pushed a commit that referenced this pull request Nov 17, 2022
- Add openstack IPv6 metadata url fe80::a9fe:a9fe
- Enable requesting multiple metadata sources in parallel

This PR is very similar to #1160, reusing the provided `url_helper` logic.

LP: #1906849