
DCV authentication issue with AD users at the creation of Linux DCV sessions in a 1Click-HPC cluster #17

Open
vbosquier opened this issue Jun 9, 2022 · 0 comments


Ciao Nicola!

As you know, in the context of an HPC POC on AWS for a French company, UCit (mainly myself) slightly modified 1Click-HPC and used it to run the POC's HPC environment.
I have faced a strange authentication issue while trying to start a DCV session on a g4dn instance running CentOS 7.9.2009 + DCV 2022.0 r12760 + EnginFrame (EF) 2021.0-r1592 + Slurm 21.08.8-2 on AWS. The issue only affects users stored in the AD attached to the cluster.

I have finally found a fix for the issue, but I think it's important to discuss it with you to understand the underlying behaviour.

The symptoms are the following:

  • launching a DCV Session as system user "centos" using a standard Linux Desktop Service in EF works fine
  • launching a DCV session as a user created in the AD, using the exact same standard Linux Desktop Service in EF, fails because of an authentication issue.

The error message written to slurm-$JobID.out is the following:

[2022/06/09 14:40:15]  INFO  Starting DCV session...
[2022/06/09 14:40:15]  INFO  DCV version supports --gl-display parameter
[2022/06/09 14:40:15]  INFO  Creating DCV session "dcv create-session --type=virtual tmp7339021573904918669 "
Could not create session. Could not get the system bus: Exhausted all available authentication mechanisms (tried: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS) (available: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS)
[2022/06/09 14:40:15] ERROR  Failed to launch DCV session (exit code: 1)
[2022/06/09 14:40:15] FATAL  Error: DCV failed to create session
[2022/06/09 14:40:15] FATAL  Exiting with code 1
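
The failing step in the log above is the connection to the D-Bus system bus, so it may be possible to reproduce the error without DCV. The following is a minimal sketch, assuming the dbus-send utility is installed on the compute node. The helper name probe_system_bus is hypothetical, and whether it fails the same way as "dcv create-session" for AD users is an assumption that has not been verified here.

```shell
#!/bin/sh
# Hypothetical diagnostic helper (not part of DCV or EnginFrame).
# Return codes: 0 = bus reachable, 1 = call failed, 2 = dbus-send not installed.
probe_system_bus() {
    if ! command -v dbus-send >/dev/null 2>&1; then
        echo "dbus-send not found" >&2
        return 2
    fi
    # Ask the bus daemon for its list of names; this only succeeds after
    # the authentication handshake on the system bus has completed.
    if dbus-send --system --print-reply --reply-timeout=2000 \
        --dest=org.freedesktop.DBus /org/freedesktop/DBus \
        org.freedesktop.DBus.ListNames >/dev/null; then
        return 0
    fi
    return 1
}
```

Running probe_system_bus as the AD user on the DCV node, once before and once after "id username", would show whether the bus connection itself depends on the user lookup.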

After a lot of tests, described below, I found a solution: adding the following line at the very beginning of the file $EF_ROOT/plugins/interactive/lib/remote/linux.jobscript.functions:

id "${USER}"

Indeed, this triggers a kind of "first connection" of the user trying to start a session on the target system, so that the user is known at system level.
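
A slightly more defensive variant of that one-line fix could look like the sketch below. The function name warm_user_lookup, the retry count and the one-second delay are assumptions made for illustration; they are not part of EnginFrame, and the plain id "${USER}" call is the only form that was actually tested.

```shell
#!/bin/sh
# Hypothetical helper: resolve a user through NSS before starting the session.
# Usage: warm_user_lookup [USER] [ATTEMPTS]
warm_user_lookup() {
    user="${1:-${USER:-$(id -un)}}"
    attempts="${2:-3}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # "id NAME" looks up the passwd entry and the group list of NAME.
        if id "$user" >/dev/null 2>&1; then
            return 0
        fi
        # Wait before retrying, but not after the last attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep 1
        fi
        i=$((i + 1))
    done
    echo "WARN: could not resolve user '$user' after $attempts attempt(s)" >&2
    return 1
}
```

Called as warm_user_lookup "${USER}" at the top of the job script functions file, it does the same lookup as the original fix, but logs a warning instead of failing silently when the user cannot be resolved.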

With the help of Benjamin Depardon, I reproduced the issue by running the following command on the Head Node of the cluster:

srun -N 1 -p dcv-gpu --exclusive -C "[g4dn.xlarge*1]" dcv create-session my_session

We then tried all the following options:

  • restarting gdm only after dcvserver was restarted at the end of the installation process => NOT working

  • restarting dbus + dbus-org.+ gdm after dcvserver was restarted at the end of the installation process => NOT working

  • changing /etc/pam.d/dcv to the following contents:

#%PAM-1.0
# Default NICE DCV PAM configuration.
# This file is auto-generated, user changes will be destroyed at
# installation/update time.
# To make changes, create a file named dcv.custom in this
# directory and set the 'pam-service-name' parameter in the
# [security] section of dcv.conf to 'dcv.custom'
#auth    include password-auth
#account include password-auth
auth    include password-auth
account     required                                     pam_access.so
account     required                                     pam_unix.so
account     sufficient                                   pam_localuser.so
account     sufficient                                   pam_usertype.so issystem
account     [default=bad success=ok user_unknown=ignore] pam_sss.so
account     required                                     pam_permit.so

=> NOT working

  • running on the remote system the commands:
    $> getent passwd | grep username
    or
    $> getent passwd -s sss | grep username
    or
    $> sssctl cache-upgrade
    => NOT working

  • adding the following command at the very beginning of Slurm's prolog.sh script:
    $> id "${SLURM_JOB_USER}"
    => NOT working

  • running the following command on the DCV node before the session was created:
    $> id username
    or
    $> sssctl user-checks username
    => SUCCESSFUL

  • connecting to the DCV node with SSH as the user username (or as any other user and then switching with the command: $> su - username) before the session was created

=> SUCCESSFUL
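
One difference between the attempts above is that "getent passwd | grep username" enumerates the whole passwd database, while "id username" looks up a single name. With SSSD, domain users only appear in an enumeration when the enumerate option is enabled, and it is disabled by default. Whether this explains the different outcomes is an assumption. The sketch below checks by-name resolution only; check_user_resolution is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical diagnostic helper: test by-name lookups for one user.
# Prints one line per check and returns 1 if any check failed.
check_user_resolution() {
    user="$1"
    status=0
    if command -v getent >/dev/null 2>&1; then
        # Keyed lookup: asks NSS for this single name, no enumeration.
        if getent passwd "$user" >/dev/null 2>&1; then
            echo "getent-by-name: ok"
        else
            echo "getent-by-name: FAILED"
            status=1
        fi
    else
        echo "getent-by-name: skipped"
    fi
    # "id NAME" additionally resolves the group memberships of the user.
    if id "$user" >/dev/null 2>&1; then
        echo "id: ok"
    else
        echo "id: FAILED"
        status=1
    fi
    return "$status"
}
```

Running check_user_resolution username on the DCV node before the first session would show whether a keyed getent lookup behaves like id or like the enumerating variants.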

Our conclusion is that the user must already be known to the system (and stored in some kind of cache) for the authentication process to allow the execution of the tasks required by the Slurm job.

Our questions are:

  • is it a known issue?
  • can you explain further how the internal authentication methods of DCV work, and why in our case DCV denied the AD user the authorization to create a session?
  • is there a "better" way to solve it than hacking the EF code the way we did to allow any user in AD to launch a DCV session?

Please don't hesitate to ask for any additional information, and let us know what you think.

Best regards,
Vincent.
