DCV Jobs failing #2

Open
sean-smith opened this issue Apr 2, 2021 · 7 comments

Comments

@sean-smith
Contributor

Not sure if there's a setup step that I'm missing here but when I run the included Windows or Linux DCV job I get:

sbatch failed (parameters: -J Linux_Desktop -D /fsx/nice/enginframe/sessions/ec2-user/tmp4716553958834750820.session.ef -C dcv2, exit value: 1)
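
One way to get more detail than "exit value: 1" is to re-run the same submission by hand and read sbatch's own error output. The sketch below only parses the message above back into an argument list; the helper name `parse_sbatch_failure` is hypothetical and not part of EnginFrame. In sbatch, `-J` is the job name, `-D` the working directory, and `-C` a node feature constraint, so one thing worth checking (an assumption, not a confirmed cause) is whether any node actually advertises the `dcv2` feature.

```python
import re
import shlex

# Error text copied from the message above.
ERROR = (
    "sbatch failed (parameters: -J Linux_Desktop "
    "-D /fsx/nice/enginframe/sessions/ec2-user/tmp4716553958834750820.session.ef "
    "-C dcv2, exit value: 1)"
)

# Hypothetical helper: splits the message into sbatch arguments and exit code.
def parse_sbatch_failure(message):
    match = re.search(
        r"sbatch failed \(parameters: (?P<params>.*), exit value: (?P<code>\d+)\)",
        message,
    )
    if match is None:
        raise ValueError("not an sbatch failure message")
    return shlex.split(match.group("params")), int(match.group("code"))

args, exit_code = parse_sbatch_failure(ERROR)

# Print a command line that can be pasted into a shell on the head node,
# followed by the job script, to see the real sbatch error.
print(shlex.join(["sbatch", *args]))
```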
@nicolaven
Contributor

You need to use the proper service to launch the desktop.
The built-in one is not appropriate. We are working on building a repository for EF services.

@sean-smith
Contributor Author

Is there any documentation on what the correct service is?

@mirneshalilovic

@sean-smith @nicolaven
I have the same problem. Could you please share an example here of how to start an interactive service?

Thanks in advance.

@nicolaven
Contributor

Hi @mirneshalilovic,
Thanks for your request. Can you try deploying a new cluster using the latest version of 1Click-HPC (a few updates were released just recently)? Then log into EF and import the following test service: https://github.com/aws-samples/1click-hpc/blob/main/enginframe/ef-services.Linux%20Desktop.2022-11-10T12-18-39.zip

Thanks

@mirneshalilovic

mirneshalilovic commented Nov 10, 2022

Hi @nicolaven,
Thanks for the information and for the updated service. It works, but only for the dcv queue. I can't run an interactive session with GPU enabled.
I did the latest deployment last night.

With the command `vdi.launch.session --queue dcv`, sessions can be launched and I can connect to them.

When I specify `vdi.launch.session --queue dcv-gpu --submitopts "-C g4dn.2xlarge"`, the session stays in the pending state and no machine is started in the background.

In the slurmctld logs I can see this:


```
2022-11-10T15:08:44.974+01:00	[2022-11-10T14:08:44.974] sched: Allocate JobId=58 NodeList=dcv-gpu-dy-g4dn-2xlarge-1 #CPUs=1 Partition=dcv-gpu
2022-11-10T15:09:03.687+01:00	[2022-11-10T14:09:03.687] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 reason set to: (Code:VcpuLimitExceeded)Failure when resuming nodes
2022-11-10T15:09:03.687+01:00	[2022-11-10T14:09:03.687] requeue job JobId=58 due to failure of node dcv-gpu-dy-g4dn-2xlarge-1
2022-11-10T15:09:03.688+01:00	[2022-11-10T14:09:03.688] Requeuing JobId=58
2022-11-10T15:09:03.688+01:00	[2022-11-10T14:09:03.688] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 state set to DOWN
2022-11-10T15:09:03.706+01:00	[2022-11-10T14:09:03.706] error: get_addr_info: getaddrinfo() failed: Name or service not known
2022-11-10T15:09:03.706+01:00	[2022-11-10T14:09:03.706] error: slurm_set_addr: Unable to resolve "dcv-gpu-dy-g4dn-2xlarge-1"
2022-11-10T15:09:03.706+01:00	[2022-11-10T14:09:03.706] error: fwd_tree_thread: can't find address for host dcv-gpu-dy-g4dn-2xlarge-1, check slurm.conf
2022-11-10T15:10:01.590+01:00	[2022-11-10T14:10:01.590] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 reason set to: Scheduler health check failed
2022-11-10T15:10:01.590+01:00	[2022-11-10T14:10:01.590] powering down node dcv-gpu-dy-g4dn-2xlarge-1
```
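
In a log like this the useful part is the `reason set to:` text on the `update_node` lines, since the later address-resolution errors are only a consequence of the node never starting. Below is a minimal sketch that pulls those reasons out of a slurmctld log. The function name `node_failure_reasons` and the regular expressions are illustrative assumptions based only on the lines quoted above, not a documented Slurm log format.

```python
import re

# Three lines copied from the log above (CloudWatch timestamp column dropped).
LOG = """\
[2022-11-10T14:08:44.974] sched: Allocate JobId=58 NodeList=dcv-gpu-dy-g4dn-2xlarge-1 #CPUs=1 Partition=dcv-gpu
[2022-11-10T14:09:03.687] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 reason set to: (Code:VcpuLimitExceeded)Failure when resuming nodes
[2022-11-10T14:10:01.590] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 reason set to: Scheduler health check failed
"""

# Patterns inferred from the quoted lines only.
REASON = re.compile(r"update_node: node (?P<node>\S+) reason set to: (?P<reason>.*)")
CODE = re.compile(r"\(Code:(?P<code>\w+)\)")

def node_failure_reasons(log_text):
    """Return (node, code or None, full reason) for each 'reason set to' line."""
    results = []
    for line in log_text.splitlines():
        match = REASON.search(line)
        if match is None:
            continue
        reason = match.group("reason")
        code = CODE.search(reason)
        results.append((match.group("node"), code.group("code") if code else None, reason))
    return results

reasons = node_failure_reasons(LOG)
for node, code, reason in reasons:
    print(node, code, reason)
```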

@nicolaven
Contributor

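You need to request a limit increase for g* instances: the `VcpuLimitExceeded` code in the log means the account's EC2 vCPU limit for that instance family was reached when the node tried to start.
Then you should be fine.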
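
EC2 On-Demand limits are counted in vCPUs per instance family group, and G instances have their own quota ("Running On-Demand G and VT instances" in Service Quotas), separate from the standard instance families. The sketch below is only arithmetic for sizing the request. The `VCPUS` table lists a few g4dn sizes, and the quota values passed in are hypothetical; read your real quota from the Service Quotas console.

```python
# vCPU counts for a few g4dn sizes (g4dn.2xlarge is the type used in this thread).
VCPUS = {
    "g4dn.xlarge": 4,
    "g4dn.2xlarge": 8,
    "g4dn.4xlarge": 16,
}

def vcpus_needed(instance_type, node_count):
    """Total vCPUs consumed by node_count instances of instance_type."""
    return VCPUS[instance_type] * node_count

def fits_quota(instance_type, node_count, quota_vcpus, in_use_vcpus=0):
    """True if the requested nodes fit within the remaining vCPU quota."""
    return in_use_vcpus + vcpus_needed(instance_type, node_count) <= quota_vcpus

# Hypothetical quota values, for illustration only.
print(fits_quota("g4dn.2xlarge", 1, quota_vcpus=0))
print(fits_quota("g4dn.2xlarge", 1, quota_vcpus=8))
print(vcpus_needed("g4dn.2xlarge", 3))
```

With a quota of 0 vCPUs for this family, even a single g4dn.2xlarge node cannot start, which matches the pending session described above.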

@mirneshalilovic

@nicolaven
Thank you very much for your quick reply and help.
I managed to solve this.

nicolaven pushed a commit that referenced this issue Jul 4, 2023