intermittent timeouts when creating new deployments #70

Open
elmiko opened this issue Feb 2, 2018 · 2 comments

@elmiko

elmiko commented Feb 2, 2018

occasionally i see a timeout when openshifter tries to contact an instance over ssh during the creation process, which causes the whole process to fail. if i rerun the create command, it usually proceeds as normal. i think this happens because vm instances sometimes take longer than expected to become live. here is the log output from a failed run:

INFO:Provisioner(gce):Validating master existence
INFO:Provisioner(gce):Getting node
INFO:Provisioner(gce):Master exists (35.190.201.132)
INFO:Provisioner(gce):Validating node node-0 existence
INFO:Provisioner(gce):Getting node
INFO:Provisioner(gce):Node node-0 exists (35.189.199.255)
INFO:Provisioner(gce):Validating node node-1 existence
INFO:Provisioner(gce):Getting node
INFO:Provisioner(gce):Node node-1 exists (35.195.218.93)
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_7.4)
INFO:paramiko.transport:Authentication (publickey) successful!
INFO:paramiko.transport:Connected (version 2.0, client OpenSSH_7.4)
INFO:paramiko.transport:Authentication (publickey) successful!
Traceback (most recent call last):
  File "../main.py", line 14, in <module>
    openshifter.cli.cli()
  File "/usr/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/root/openshifter/cli.py", line 47, in create
    openshifter.create()
  File "/root/openshifter/__init__.py", line 50, in create
    self.install()
  File "/root/openshifter/__init__.py", line 34, in install
    features.execute("pre_install", self.deployment, self.cluster)
  File "/root/features/__init__.py", line 39, in execute
    ssh_client = Ssh(deployment, cluster)
  File "/root/openshifter/ssh.py", line 21, in __init__
    self.connect("node", node.public_address)
  File "/root/openshifter/ssh.py", line 26, in connect
    self.clients[address].connect()
  File "/root/openshifter/ssh.py", line 83, in connect
    allow_agent=False, look_for_keys=False)
  File "/usr/lib/python3.6/site-packages/paramiko/client.py", line 357, in connect
    raise NoValidConnectionsError(errors)
paramiko.ssh_exception.NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 35.195.218.93

i am not able to consistently reproduce this, but i feel adding some sort of delay or retry for these commands might help.
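
A minimal sketch of what such a retry could look like, assuming the failing call can be wrapped in a callable. The `retry` helper and its parameters are hypothetical and not part of openshifter; the only names taken from the log above are paramiko's `connect` and `NoValidConnectionsError`.

```python
import time


def retry(operation, exceptions, attempts=5, delay=2.0, backoff=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    operation  -- zero-argument callable to run (hypothetical wrapper around the ssh connect)
    exceptions -- tuple of exception types that should trigger another attempt
    attempts   -- total number of calls before giving up
    delay      -- seconds to wait after the first failure
    backoff    -- factor applied to the delay after each failure
    sleep      -- sleep function, injectable so the helper can be tested without waiting
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except exceptions:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff


# Hypothetical usage around the paramiko call from the traceback
# (assumes `client` is a paramiko.SSHClient and `address` is the node address):
#
#   from paramiko.ssh_exception import NoValidConnectionsError
#   retry(lambda: client.connect(address, allow_agent=False, look_for_keys=False),
#         (NoValidConnectionsError,))
```

With the defaults this waits 2, 4, 8 and 16 seconds between the five attempts, so a slow instance gets roughly half a minute before the create command fails.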

for reference i am running this image:

docker.io/osevg/openshifter        latest              eaffb778a868        2 weeks ago         846.2 MB
@marekjelen
Contributor

yeah, I know about that one ... it's similar to the failure when deleting a cluster, where the network deletion fails.

Google used to be deterministic in the sense that timeouts did not happen ... recently many more operations take much longer and time out ... need to put the correct checks in the right places.
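
One form such a check could take is waiting until the instance's SSH port accepts TCP connections before handing the address to the SSH client. This is only a sketch under that assumption; `wait_for_port` and its parameters are hypothetical and not part of openshifter.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Poll host:port until a TCP connection succeeds or `timeout` seconds pass.

    Returns True once the port accepts a connection, False if the deadline
    passes first. `interval` is both the per-attempt connect timeout and the
    pause between attempts.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (refused, unreachable, timed out)
            # while the instance is still booting.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Hypothetical usage before opening the ssh session to a node:
#
#   if not wait_for_port("35.195.218.93", 22):
#       raise RuntimeError("node never opened port 22")
```

An open port does not guarantee that sshd is ready to authenticate, so a check like this would complement a retry around the connect call rather than replace it.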

@elmiko
Author

elmiko commented Feb 6, 2018

i didn't look at the code, but i figured it was something that would just require a longer wait period. thanks for the update!
