DCV Jobs failing #2

sean-smith · 2021-04-02T19:15:15Z

Not sure if there's a setup step that I'm missing here but when I run the included Windows or Linux DCV job I get:

sbatch failed (parameters: -J Linux_Desktop -D /fsx/nice/enginframe/sessions/ec2-user/tmp4716553958834750820.session.ef -C dcv2, exit value: 1)


nicolaven · 2021-04-12T08:01:31Z

You need to use the proper service to launch the desktop.
The built-in one is not appropriate. We are working on building a repository for EF services.

sean-smith · 2021-04-12T17:29:17Z

Is there any documentation on what the correct service is?

mirneshalilovic · 2022-11-09T22:57:26Z

@sean-smith @nicolaven
I have the same problem. Could you please share an example here of how to start an interactive service?

Thanks in advance.

nicolaven · 2022-11-10T12:46:44Z

hi @mirneshalilovic
Thanks for your request. Can you try deploying a new cluster using the latest version of 1Click-HPC (a few updates were released recently)? Then log into EF and import the following test service: https://github.com/aws-samples/1click-hpc/blob/main/enginframe/ef-services.Linux%20Desktop.2022-11-10T12-18-39.zip

Thanks

mirneshalilovic · 2022-11-10T14:40:06Z

hi @nicolaven
Thanks for the information and for the updated service. It works, but only for the dcv queue. I can't run interactive sessions with GPU enabled.
I did the latest deployment last night.

With the command `vdi.launch.session --queue dcv`, sessions launch and I can connect to them.

When I specify `vdi.launch.session --queue dcv-gpu --submitopts "-C g4dn.2xlarge"`, the session stays in the pending state and no instance starts in the background.

In the slurmctld logs I can see this:


`2022-11-10T15:08:44.974+01:00	[2022-11-10T14:08:44.974] sched: Allocate JobId=58 NodeList=dcv-gpu-dy-g4dn-2xlarge-1 #CPUs=1 Partition=dcv-gpu
2022-11-10T15:09:03.687+01:00	[2022-11-10T14:09:03.687] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 reason set to: (Code:VcpuLimitExceeded)Failure when resuming nodes
2022-11-10T15:09:03.687+01:00	[2022-11-10T14:09:03.687] requeue job JobId=58 due to failure of node dcv-gpu-dy-g4dn-2xlarge-1
2022-11-10T15:09:03.688+01:00	[2022-11-10T14:09:03.688] Requeuing JobId=58
2022-11-10T15:09:03.688+01:00	[2022-11-10T14:09:03.688] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 state set to DOWN
2022-11-10T15:09:03.706+01:00	[2022-11-10T14:09:03.706] error: get_addr_info: getaddrinfo() failed: Name or service not known
2022-11-10T15:09:03.706+01:00	[2022-11-10T14:09:03.706] error: slurm_set_addr: Unable to resolve "dcv-gpu-dy-g4dn-2xlarge-1"
2022-11-10T15:09:03.706+01:00	[2022-11-10T14:09:03.706] error: fwd_tree_thread: can't find address for host dcv-gpu-dy-g4dn-2xlarge-1, check slurm.conf
2022-11-10T15:10:01.590+01:00	[2022-11-10T14:10:01.590] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 reason set to: Scheduler health check failed
2022-11-10T15:10:01.590+01:00	[2022-11-10T14:10:01.590] powering down node dcv-gpu-dy-g4dn-2xlarge-1`
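
The relevant line in a log like this is the `update_node: ... reason set to:` entry, since it carries the failure code. As a minimal sketch (not part of 1Click-HPC; the function name `node_failure_reasons` and the regular expression are illustrative assumptions based only on the line format shown above), such lines can be pulled out of a slurmctld log like this:

```python
import re

# Assumed line format, taken from the log excerpt in this thread:
#   update_node: node <name> reason set to: (Code:<code>)<message>
# The "(Code:...)" prefix is optional; some reasons are plain text.
NODE_REASON = re.compile(
    r"update_node: node (?P<node>\S+) reason set to: "
    r"(?:\(Code:(?P<code>\w+)\))?(?P<message>.*)"
)


def node_failure_reasons(log_text):
    """Return (node, code, message) for every 'reason set to' line.

    code is None when the line has no '(Code:...)' prefix.
    """
    reasons = []
    for line in log_text.splitlines():
        match = NODE_REASON.search(line)
        if match:
            reasons.append(
                (match["node"], match["code"], match["message"].strip())
            )
    return reasons


# Two lines copied from the log above, used as sample input.
SAMPLE_LOG = (
    "[2022-11-10T14:09:03.687] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 "
    "reason set to: (Code:VcpuLimitExceeded)Failure when resuming nodes\n"
    "[2022-11-10T14:10:01.590] update_node: node dcv-gpu-dy-g4dn-2xlarge-1 "
    "reason set to: Scheduler health check failed\n"
)

if __name__ == "__main__":
    for node, code, message in node_failure_reasons(SAMPLE_LOG):
        print(node, code, message)
```

Here the first entry reports the code `VcpuLimitExceeded`, which points at an EC2 quota problem rather than a Slurm or EnginFrame misconfiguration.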

nicolaven · 2022-11-10T14:50:57Z

You need to request a vCPU limit (service quota) increase for g* instances; the `VcpuLimitExceeded` code in your log means the current limit is too low.
Then you should be fine.
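
To see why a single node can already exceed the limit: the EC2 quota for G-family On-Demand instances is counted in vCPUs, not in instances. A rough sketch of the arithmetic, assuming the quota value and current usage are read manually from the Service Quotas console (the helper `fits_quota` and the `VCPUS` table are hypothetical, for illustration only):

```python
# vCPU counts per instance type; only the sizes needed for this example.
VCPUS = {
    "g4dn.xlarge": 4,
    "g4dn.2xlarge": 8,
}


def fits_quota(instance_type, count, quota_vcpus, in_use_vcpus=0):
    """Return True if launching `count` instances stays within the vCPU quota.

    quota_vcpus and in_use_vcpus are assumed to be looked up by hand;
    this function does not query AWS.
    """
    needed = VCPUS[instance_type] * count
    return in_use_vcpus + needed <= quota_vcpus


if __name__ == "__main__":
    # One g4dn.2xlarge needs 8 vCPUs, so a quota below 8 blocks the launch.
    print(fits_quota("g4dn.2xlarge", 1, quota_vcpus=0))
    print(fits_quota("g4dn.2xlarge", 1, quota_vcpus=8))
```

With a quota below 8 vCPUs, even one `g4dn.2xlarge` node cannot be resumed, which matches the pending session described above.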

mirneshalilovic · 2022-11-22T09:41:48Z

@nicolaven
Thank you very much for your quick reply and help.
I managed to solve this.

fix

nicolaven pushed a commit that referenced this issue Jul 4, 2023

Merge pull request #2 from cmbrehm/fix/pc3.6

92d0de7
