OCP tftp install fails with 100 GbitE Mellanox adapter #226

Open
theresax opened this issue Aug 30, 2021 · 7 comments

Comments

@theresax

theresax commented Aug 30, 2021

This is not a new issue in OCP 4.9; we have seen it since OCP 4.6. It is very likely not specific to the OCP install, given that it happens while the first OCP bootstrap node is being installed.

The OCP bare metal install with a 100 GbitE Mellanox adapter is likely a usage scenario that exposes the problem. We used 3 S922 systems, each with a 100 GbitE Mellanox adapter and a 1 GbitE adapter. The OCP cluster's private network is defined on the SRIOV shared 100 GbitE network interface.

After setting up DHCP, dnsmasq, httpd, haproxy and the firewall rules, I activated the OCP node using the HMC's System Management Services (SMS) shell and could see that the node (bootstrap) can't reach the tftp servers. The install on that node ends with error "!BA017021". I have tried layouts where the bootstrap node is on the same server as the bastion (the node running dhcp, dnsmasq, httpd and haproxy) and layouts where it is on a different server; neither works. When I switched from tftp boot to virtual media on a VIOS server (for the iso image), the install worked. The node is then able to pull the other files from the httpd server over the private network (without any firewall, dns or httpd changes).
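To separate a server-side tftp problem from a firmware or adapter problem, one option is to send a plain TFTP read request to the bastion from another Linux host on the same private network. The following is a minimal sketch, not something that was run in this issue: it assumes a standard RFC 1350 tftp server on UDP port 69, and the server address and file name in the usage comment are hypothetical placeholders. It does not exercise the SMS firmware path, so a successful reply only shows that the server side answers on that network.

```python
import socket
import struct

TFTP_PORT = 69                 # well-known TFTP port (RFC 1350)
OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL."""
    return (struct.pack("!H", OP_RRQ)
            + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def describe_reply(packet: bytes) -> str:
    """Summarise the first packet the server sends back."""
    if len(packet) < 4:
        return "malformed reply"
    opcode, value = struct.unpack("!HH", packet[:4])
    if opcode == OP_DATA:
        # value is the block number; the payload follows the 4-byte header
        return f"DATA block {value}, {len(packet) - 4} bytes"
    if opcode == OP_ERROR:
        # value is the error code; the message is a NUL-terminated string
        message = packet[4:].split(b"\0", 1)[0].decode("ascii", "replace")
        return f"ERROR {value}: {message}"
    return f"unexpected opcode {opcode}"


def probe_tftp(server: str, filename: str, timeout: float = 5.0) -> str:
    """Send one read request and report what, if anything, comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (server, TFTP_PORT))
        try:
            # The reply arrives from a new ephemeral server port, not port 69.
            packet, _addr = sock.recvfrom(4 + 512)
        except socket.timeout:
            return "no reply (timeout)"
    return describe_reply(packet)


# Example call (hypothetical bastion address and boot file name):
#   print(probe_tftp("192.0.2.10", "boot/grub2/grub.cfg"))
```

Because the server answers from a new ephemeral port, a timeout here can also point at firewall rules that allow port 69 but drop the reply, which is worth ruling out before suspecting the adapter.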

@manojnkumar

manojnkumar commented Aug 30, 2021

@theresax : Is your bootstrap node in this state? Can you provide access details to the bastion?

@theresax
Author

theresax commented Aug 30, 2021

@manojnkumar, my team has to move quickly to complete our perf plan, so the machine is no longer in this state. I used virtual storage via a VIOS server during node activation to do the install, and the cluster is now set up. This proves that our dhcp, dns, firewall and httpd + haproxy configurations are all good (not the cause of the problem).

@theresax
Author

theresax commented Aug 30, 2021

I have provided a "test" LPAR under the same cluster and recreated the tftp problem:

HMC login:
ssh [email protected] pw: abc123
Use the test LPAR under the cpzz1fsp server to repro. Activate using the default profile through SMS. The expected MAC address and the private IP have already been set up on the dhcp server, and the client and server IPs are set up for the tftp boot under the "remote IPL" option. Please DO NOT change the state or config of other servers or LPARs under this HMC.

@theresax
Author

I have opened a bugzilla defect per request from Brian King:
https://bugzilla.linux.ibm.com/show_bug.cgi?id=194377

@bpradipt
Contributor

bpradipt commented Sep 2, 2021

I have provided a "test" LPAR under the same cluster and recreated the tftp problem:

HMC login:
ssh [email protected] pw: abc123
Use the test LPAR under the cpzz1fsp server to repro. Activate using the default profile through SMS. The expected MAC address and the private IP have already been set up on the dhcp server, and the client and server IPs are set up for the tftp boot under the "remote IPL" option. Please DO NOT change the state or config of other servers or LPARs under this HMC.

@theresax, this is public GitHub, and it might not be appropriate to mention internal system details, including passwords.

@theresax
Author

theresax commented Sep 2, 2021

@bpradipt, thank you so much for pointing this out. Even though the issue is marked internal, I should not have included passwords. I have updated the ticket.

@theresax
Author

theresax commented Sep 2, 2021

Defect 194377 has been rejected, and a new defect, 194410 (https://bugzilla.linux.ibm.com/show_bug.cgi?id=194410), has been opened to replace it, assigned to the same people as the previous defect.
