docs: AWS getting started re-write #9095
Conversation
Change to your desired region and CIDR block and create a VPC:

> Make sure your subnet does not overlap with `10.244.0.0/16` or `10.96.0.0/12`, the [default pod and services subnets in Kubernetes]({{% ref "/v1.7/introduction/troubleshooting.md#conflict-on-kubernetes-and-host-subnets" %}}).
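As a minimal sketch of that step with the AWS CLI (the region and CIDR here are assumed placeholder values, not the guide's final ones):

```shell
# Assumed placeholders; pick a block that does not overlap with
# 10.244.0.0/16 (pods) or 10.96.0.0/12 (services).
REGION="us-east-1"
CIDR_BLOCK="10.1.0.0/18"

# Create the VPC and capture its ID for later steps.
VPC=$(aws ec2 create-vpc \
  --region "$REGION" \
  --cidr-block "$CIDR_BLOCK" \
  --query 'Vpc.VpcId' \
  --output text)
echo "$VPC"
```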
We should use `relref`, not `ref`, and without the version, to keep docs the same across versions.
@@ -7,52 +7,102 @@ aliases:

## Creating a Cluster via the AWS CLI

In this guide we will create an HA Kubernetes cluster with 3 worker nodes.
We assume an existing VPC, and some familiarity with AWS.
In this guide we will create an HA Kubernetes cluster with 3 control plane nodes across 3 availability zones.
I was always confused by spreading etcd across AZs:

- increased latency
- a chance of network partitioning

As long as you don't have 3x workers spread across AZs the same way, what kind of benefit does it give to the availability of the cluster?
AZ spread is recommended for all AWS HA architecture. Single AZs have problems more often than people realize. All AZs are supposed to be in "single digit" ms latency between locations (Azure and GCP don't guarantee that). I know this is a getting started guide, but we should still recommend some HA configurations for the components that matter.
My point is a bit different: imagine I have 3 CPs + 9 workers, with the CPs spread across AZs.
Now AZ0 goes down, which means that the two other CPs have quorum, that's great.
But what about my workers? Are they spread across AZs the same way? If not, what's the point of having a controlplane when workers can't reach it? If they are spread, say 3 workers per AZ, then I'm down to 6 "working" workers?
So it's not that I'm trying to say that spreading CPs across AZs is a bad idea, but I want to make sure that users understand what gets actually protected and what does not by spreading CPs across AZs.
This version of the doc didn't have the autoscaling group. I just updated it with the correct version. Workers are created in an ASG and if an AZ goes away the ASG will re-balance nodes or create new nodes in AZs that are available.
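That rebalancing behavior comes from spanning the autoscaling group across the three per-AZ subnets. A hedged sketch of such an ASG, assuming a launch template and subnet IDs already exist (all names and IDs here are illustrative placeholders):

```shell
# Hypothetical launch template and subnet IDs; substitute your own.
# An ASG spanning three subnets (one per AZ) will replace or rebalance
# workers if an AZ becomes unavailable.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name talos-workers \
  --launch-template "LaunchTemplateName=talos-worker,Version=\$Latest" \
  --min-size 3 --max-size 3 --desired-capacity 3 \
  --vpc-zone-identifier "subnet-aaa,subnet-bbb,subnet-ccc"
```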
--cidr 0.0.0.0/0
--group-id $SECURITY_GROUP_ID \
--ip-permissions \
IpProtocol=tcp,FromPort=50000,ToPort=50001,IpRanges="[{CidrIp=0.0.0.0/0}]" \
I'm confused with this rule. We should expose 50000 on the controlplanes to the world, but we should never expose 50001 to the world.
I realized 6443 should only be exposed on the VPC too. I'll update it.
FWIW the existing guide exposes 6443, 50000, and 50001 to 0.0.0.0/0
50001 should never be exposed outside of the cluster.
50000 - we recommend only controlplanes
6443 - same, only controlplanes (unless you use an AWS LB, in which case you don't need to expose it?)
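Following that recommendation, the rules could be split so that trustd traffic stays inside the VPC while the Talos API is only reachable on the control planes. A hedged sketch (the security-group IDs and CIDRs are placeholders; `203.0.113.0/24` stands in for an admin network):

```shell
# Talos API (50000): control plane security group only, from an assumed
# admin CIDR rather than 0.0.0.0/0.
aws ec2 authorize-security-group-ingress \
  --group-id "$CP_SECURITY_GROUP_ID" \
  --ip-permissions 'IpProtocol=tcp,FromPort=50000,ToPort=50000,IpRanges=[{CidrIp=203.0.113.0/24}]'

# trustd (50001): never exposed outside the cluster; restrict to the VPC CIDR.
aws ec2 authorize-security-group-ingress \
  --group-id "$SECURITY_GROUP_ID" \
  --ip-permissions 'IpProtocol=tcp,FromPort=50001,ToPort=50001,IpRanges=[{CidrIp=10.1.0.0/18}]'
```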
```

### Create the Machine Configuration Files

Using the DNS name of the load balancer created earlier, generate the base configuration files for the Talos machines.

> Note that the `port` used here is the externally accessible port configured on the load balancer - 443 - not the internal port of 6443:

We will create a [machine config patch]({{% ref "/v1.7/talos-guides/configuration/patching.md#rfc6902-json-patches" %}}) to use the AWS time servers.
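As a sketch of such a patch: the Amazon Time Sync Service is reachable at the link-local address `169.254.169.123`, so an RFC 6902 patch can point `machine.time.servers` at it. The file name and the surrounding `talosctl` invocation below are illustrative assumptions:

```shell
# Write an RFC 6902 patch that points Talos at the AWS time server.
cat > time-patch.json <<'EOF'
[
  {
    "op": "add",
    "path": "/machine/time",
    "value": { "servers": ["169.254.169.123"] }
  }
]
EOF

# Then feed it to config generation, e.g. (load balancer DNS assumed in LB_DNS):
#   talosctl gen config talos-k8s-aws "https://${LB_DNS}:443" --config-patch @time-patch.json
```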
same about relref
Force-pushed from 6dcbf14 to bf1a87e (compare)
Thanks for linking me to these updated instructions @rothgar, they were awesome!
When walking through them on the AWS cloudshell I ran into a few issues, and I've commented on the PR to let you know what I ran into. Hope this helps 😁
(P.S. I only looked at the v1.7 docs)
One of the other things to consider possibly mentioning or integrating (I haven't tested it yet) is the AWS Cloud Controller Manager (CCM) that the talos blog re-posted here: https://www.siderolabs.com/blog/deploying-talos-on-aws-with-cdk/
Next create a subnet in each availability zone.

```bash
CIDR=1
I'm guessing you did the AWS CLI commands in a zsh shell, but since zsh starts its indexing at 1, this causes a bash shell (i.e. the default in AWS CloudShell) to skip the first IPV4_CIDRS array element (i.e. index 0 in bash).
So, I don't know how you'd like to handle that, but just wanted to give you a heads up about that nuance.
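The difference can be seen directly; iterating with `"${arr[@]}"` instead of hard-coded indices sidesteps it entirely (the CIDR values below are just sample data):

```shell
# Bash indexes arrays from 0; zsh (the likely authoring shell) from 1.
IPV4_CIDRS=( "10.1.0.0/24" "10.1.1.0/24" "10.1.2.0/24" )

echo "${IPV4_CIDRS[0]}"   # bash: 10.1.0.0/24 (zsh would print nothing for index 0)
echo "${IPV4_CIDRS[1]}"   # bash: 10.1.1.0/24 (zsh: 10.1.0.0/24)

# Portable across both shells: iterate over all elements.
for cidr in "${IPV4_CIDRS[@]}"; do
  echo "subnet CIDR: $cidr"
done
```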
Thanks for calling that out. I ran through an early version of the doc in bash and didn't test it with the final version. I'll figure something out.
talosctl --talosconfig talosconfig config endpoint <control plane 1 PUBLIC IP>
talosctl --talosconfig talosconfig config node <control plane 1 PUBLIC IP>
talosctl config endpoints $(aws ec2 describe-instances \
--instance-ids $(echo $CP_INSTANCES[@]) \
The main suggestion here is ensuring that you wrap your array accesses with `{}` (it caused it to completely fail for me without that), but I wanted to also mention that you can use `[*]` as well & that should output the array as a space delimited string.

--instance-ids $(echo $CP_INSTANCES[@]) \
--instance-ids ${CP_INSTANCES[*]} \
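A quick demonstration of why the braces matter (the instance IDs are hypothetical):

```shell
CP_INSTANCES=( "i-aaa" "i-bbb" "i-ccc" )   # hypothetical instance IDs

# Without braces, bash expands $CP_INSTANCES to the first element only
# and leaves the literal "[@]" behind:
echo $CP_INSTANCES[@]        # i-aaa[@]

# With braces, [@] and [*] both expand to every element:
echo "${CP_INSTANCES[@]}"    # i-aaa i-bbb i-ccc
echo "${CP_INSTANCES[*]}"    # i-aaa i-bbb i-ccc
```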
--region $REGION \
--vpc-id $VPC \
--cidr-block ${CIDR_BLOCK}
IPV4_CIDRS=( $(ipcalc -S 22 --no-decorate $IPV4_CIDR | head -n 3) )
The `ipcalc` didn't work for me at all on Linux (AWS CloudShell), but I just did a space separated list of IP addresses (i.e. `'10.1.0.0/24' '10.1.1.0/24' '10.1.2.0/24'`). I didn't dig into it much though, so it might just require another package.
IDK what kind of CIDRs this generates, but you could offer something like this as a substitution:
IPV4_CIDRS=( $(printf '10.1.0.0/24\n10.1.1.0/24\n10.1.2.0/24\n') )
I think this is a bug with ipcalc packaged with brew. I'll change it to a manual step and make a note that people need to adjust it for their CIDRs
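Pending that change, a portable fallback is to define the per-AZ blocks by hand instead of deriving them with `ipcalc`. These /22s assume a 10.1.0.0/18 VPC block and are illustrative; adjust them for your own CIDR:

```shell
# Three /22 subnets carved manually from an assumed 10.1.0.0/18 VPC block,
# one per availability zone.
IPV4_CIDRS=( "10.1.0.0/22" "10.1.4.0/22" "10.1.8.0/22" )
printf '%s\n' "${IPV4_CIDRS[@]}"
```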
--output text)

talosctl config nodes $(aws ec2 describe-instances \
--instance-ids $(echo $CP_INSTANCES[1]) \
I didn't actually execute anything past here, since I knew how to grab this content, but again with the `{}` for your arrays:

--instance-ids $(echo $CP_INSTANCES[1]) \
--instance-ids $(echo ${CP_INSTANCES[1]}) \
Force-pushed from 4587b4d to c05fad3 (compare)
Updated with autoscaling group for workers, better copy/paste ability, and not using default VPC

Signed-off-by: Justin Garrison <[email protected]>

Updated with multi-AZ subnets for control plane group and better copy/paste ability