
docs: AWS getting started re-write #9095

Merged · 1 commit · Aug 9, 2024
Conversation

@rothgar (Member) commented Aug 1, 2024

Updated with multi-AZ subnets for the control plane group and better copy/paste ability.


Change to your desired region and CIDR block and create a VPC:

> Make sure your subnet does not overlap with `10.244.0.0/16` or `10.96.0.0/12` the [default pod and services subnets in Kubernetes]({{% ref "/v1.7/introduction/troubleshooting.md#conflict-on-kubernetes-and-host-subnets" %}}).
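For context, the step being discussed presumably reduces to something like this sketch (the region and CIDR values are illustrative; `VPC` matches the variable referenced later in the thread):

```bash
# Sketch: create a VPC and capture its ID (values are illustrative).
REGION="us-east-1"
CIDR_BLOCK="10.1.0.0/18"
VPC=$(aws ec2 create-vpc \
  --region "$REGION" \
  --cidr-block "$CIDR_BLOCK" \
  --query 'Vpc.VpcId' \
  --output text)
echo "$VPC"
```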
Member

we should use relref, not ref, and without the version, to keep docs same across versions
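To make that concrete (the relative path below is illustrative, not lifted from the PR): the shortcode would change from `{{% ref "/v1.7/introduction/troubleshooting.md#conflict-on-kubernetes-and-host-subnets" %}}` to something like `{{% relref "../introduction/troubleshooting.md#conflict-on-kubernetes-and-host-subnets" %}}`, so the link resolves inside whichever docs version the page is built for.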

@@ -7,52 +7,102 @@ aliases:

## Creating a Cluster via the AWS CLI

```diff
- In this guide we will create an HA Kubernetes cluster with 3 worker nodes.
- We assume an existing VPC, and some familiarity with AWS.
+ In this guide we will create an HA Kubernetes cluster with 3 control plane nodes across 3 availability zones.
```
Member

I was always confused by spreading the etcd across AZes:

  • increased latency
  • a chance of network partitioning

As long as you don't have 3x workers spread across AZes same way, what kind of benefit does it give to the availability of the cluster?

Member Author

AZ spread is recommended for all AWS HA architecture. Single AZs have problems more often than people realize. All AZs are supposed to be in "single digit" ms latency between locations (Azure and GCP don't guarantee that). I know this is a getting started guide, but we should still recommend some HA configurations for the components that matter.

Member

My point is a bit different: imagine I have 3 CPs + 9 workers, and the CPs are spread across AZs.

Now AZ0 goes down, which means that two other CPs have quorum, that's great.

But what about my workers? Are they spread across AZs same way? If not, what's the point of having a controlplane when workers can't reach it? If they are spread, imagine 3 workers per AZ, then I'm down to 6 "working" workers?

Member

So it's not that I'm trying to say that spreading CPs across AZs is a bad idea, but I want to make sure that users understand what gets actually protected and what does not by spreading CPs across AZs.

Member Author

This version of the doc didn't have the autoscaling group. I just updated it with the correct version. Workers are created in an ASG and if an AZ goes away the ASG will re-balance nodes or create new nodes in AZs that are available.
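For context, that rebalancing behavior comes from registering the worker group in all three subnets, roughly like this (a sketch only — the group name, launch template, and `SUBNETS` array are illustrative, not taken from the guide):

```bash
# Sketch: an ASG registered in all three subnets lets AWS replace
# capacity in the surviving AZs if one zone goes down.
aws autoscaling create-auto-scaling-group \
  --region "$REGION" \
  --auto-scaling-group-name talos-workers \
  --launch-template "LaunchTemplateName=talos-worker" \
  --min-size 3 \
  --max-size 3 \
  --vpc-zone-identifier "${SUBNETS[0]},${SUBNETS[1]},${SUBNETS[2]}"
```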

```bash
--cidr 0.0.0.0/0
--group-id $SECURITY_GROUP_ID \
--ip-permissions \
IpProtocol=tcp,FromPort=50000,ToPort=50001,IpRanges="[{CidrIp=0.0.0.0/0}]" \
```
Member

I'm confused with this rule. We should expose 50000 on the controlplanes to the world, but we should never expose 50001 to the world.

Member Author

I realized 6443 should only be exposed on the VPC too. I'll update it.

Member Author

FWIW the existing guide exposes 6443, 50000, and 50001 to 0.0.0.0/0

Member

50001 should never be exposed outside of the cluster.

50000 - we recommend only control planes.
6443 - same, only control planes (unless you use an AWS LB, in which case you don't need to expose it?)
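Applied to the rule quoted above, one conservative reading of that advice looks like this (a sketch; it assumes `$CIDR_BLOCK` is the VPC CIDR from earlier and reuses the thread's `$SECURITY_GROUP_ID`):

```bash
# Sketch: keep the Talos API (50000) and Kubernetes API (6443)
# reachable only from inside the VPC; 50001 gets no ingress rule at all.
aws ec2 authorize-security-group-ingress \
  --region "$REGION" \
  --group-id "$SECURITY_GROUP_ID" \
  --ip-permissions \
    "IpProtocol=tcp,FromPort=6443,ToPort=6443,IpRanges=[{CidrIp=$CIDR_BLOCK}]" \
    "IpProtocol=tcp,FromPort=50000,ToPort=50000,IpRanges=[{CidrIp=$CIDR_BLOCK}]"
```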


### Create the Machine Configuration Files

Using the DNS name of the load balancer created earlier, generate the base configuration files for the Talos machines.
> Note that the `port` used here is the externally accessible port configured on the load balancer - 443 - not the internal port of 6443:
We will create a [machine config patch]({{% ref "/v1.7/talos-guides/configuration/patching.md#rfc6902-json-patches" %}}) to use the AWS time servers.
Member

same about relref
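For reference, an RFC 6902 patch targeting the Amazon Time Sync Service endpoint (169.254.169.123, AWS's link-local NTP address) could look like the sketch below; the file name and exact patch shape are assumptions, not the guide's literal text:

```bash
# Sketch: an RFC 6902 patch setting the machine's NTP server to the
# Amazon Time Sync Service. Apply it when generating configs, e.g.:
#   talosctl gen config talos-aws https://<LB_DNS>:443 --config-patch @time-server-patch.yaml
cat > time-server-patch.yaml <<'EOF'
- op: add
  path: /machine/time
  value:
    servers:
      - 169.254.169.123
EOF
```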

@rothgar force-pushed the aws-docs branch 4 times, most recently from 6dcbf14 to bf1a87e on August 7, 2024.
@elreydetoda left a comment

Thanks for linking me to these updated instructions @rothgar, they were awesome!
When walking through them on the AWS cloudshell I ran into a few issues, and I've commented on the PR to let you know what I ran into. Hope this helps 😁
(P.S. I only looked at the v1.7 docs)

One other thing to consider is possibly mentioning or integrating (I haven't tested it yet) the AWS Cloud Controller Manager (CCM) that the Talos blog re-posted here: https://www.siderolabs.com/blog/deploying-talos-on-aws-with-cdk/

Next, create a subnet in each availability zone.

```bash
CIDR=1
```


I'm guessing you ran the AWS CLI commands in a zsh shell, but since zsh starts its indexing at 1, this causes a bash shell (i.e. the default in AWS CloudShell) to skip the first IPV4_CIDRS array element (i.e. index 0 in bash).

So, I don't know how you'd like to handle that, but just wanted to give you a heads up about that nuance.

Member Author

Thanks for calling that out. I ran through an early version of the doc in bash and didn't test it with the final version. I'll figure something out.
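One shell-agnostic workaround (a sketch, not from the PR — it assumes the `IPV4_CIDRS`, `REGION`, and `VPC` variables from earlier, and omits the guide's per-subnet availability-zone assignment): iterate over the array's values instead of its indices, since value expansion behaves identically in bash and zsh.

```bash
# "${IPV4_CIDRS[@]}" expands to all elements in both bash (0-based)
# and zsh (1-based), so no numeric indexing is involved.
for IPV4_CIDR in "${IPV4_CIDRS[@]}"; do
  aws ec2 create-subnet \
    --region "$REGION" \
    --vpc-id "$VPC" \
    --cidr-block "$IPV4_CIDR"
done
```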

```bash
talosctl --talosconfig talosconfig config endpoint <control plane 1 PUBLIC IP>
talosctl --talosconfig talosconfig config node <control plane 1 PUBLIC IP>
talosctl config endpoints $(aws ec2 describe-instances \
--instance-ids $(echo $CP_INSTANCES[@]) \
```


The main suggestion here is ensuring that you wrap your array accesses with {} (it completely failed for me without that), but I also wanted to mention that you can use [*] as well, which should output the array as a space-delimited string.

Suggested change:

```diff
- --instance-ids $(echo $CP_INSTANCES[@]) \
+ --instance-ids ${CP_INSTANCES[*]} \
```
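Put together, the corrected command might read as below (a sketch — the `--query` expression is my assumption about how the guide extracts the public IPs):

```bash
# Sketch: expand the whole CP_INSTANCES array (the braced syntax works
# in bash and zsh) and extract each instance's public IP.
talosctl config endpoints $(aws ec2 describe-instances \
  --instance-ids "${CP_INSTANCES[@]}" \
  --query 'Reservations[].Instances[].PublicIpAddress' \
  --output text)
```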

```bash
--region $REGION \
--vpc-id $VPC \
--cidr-block ${CIDR_BLOCK}
IPV4_CIDRS=( $(ipcalc -S 22 --no-decorate $IPV4_CIDR | head -n 3) )
```


ipcalc didn't work for me at all on Linux (AWS CloudShell), but I just did a space-separated list of CIDRs (i.e. '10.1.0.0/24' '10.1.1.0/24' '10.1.2.0/24'). I didn't dig into it much though, so it might just require another package.


IDK what kind of CIDRs this generates, but you could offer something like this as a substitution:

```bash
IPV4_CIDRS=( $(printf '10.1.0.0/24\n10.1.1.0/24\n10.1.2.0/24\n') )
```


Member Author

I think this is a bug with the ipcalc packaged with brew. I'll change it to a manual step and make a note that people need to adjust it for their CIDRs.

```bash
--output text)

talosctl config nodes $(aws ec2 describe-instances \
--instance-ids $(echo $CP_INSTANCES[1]) \
```


I didn't actually execute anything past here, since I knew how to grab this content, but again with the {} for your arrays:

Suggested change:

```diff
- --instance-ids $(echo $CP_INSTANCES[1]) \
+ --instance-ids $(echo ${CP_INSTANCES[1]}) \
```

@rothgar force-pushed the aws-docs branch 2 times, most recently from 4587b4d to c05fad3 on August 9, 2024.
@rothgar (Member Author) commented Aug 9, 2024

/m

Updated with autoscaling group for workers, better copy/paste ability, and not using default VPC

Signed-off-by: Justin Garrison <[email protected]>
@rothgar (Member Author) commented Aug 9, 2024

/m

@talos-bot merged commit 0698a49 into siderolabs:main on Aug 9, 2024
50 checks passed