
CINECA GUIDE

  1. Getting started
  2. Submit a job
  3. Additional info
  4. Conda and git
  5. Singularity
  6. SLURM

Getting started

For registration and account association follow:

https://wiki.u-gov.it/confluence/display/SCAIUS/UG2.1+Getting+started#expand-3Connectingtothecluster

Update (08/09/2023): If you already have an account on CINECA, note that the authentication procedure for logging into the cluster has recently changed:

  • follow this guide to activate 2FA (send an email to [email protected] to get the activation link)
  • follow this guide from point 3; you will install smallstep to create a new certificate, valid for 12 hours, on your PC
	eval $(ssh-agent) # activate the ssh-agent
	step ssh login '<user-email>' --provisioner cineca-hpc #  obtain the certificate
	
  • Enter your cluster credentials (username and password) and click the "Sign in" button. Keycloak will then ask for the OTP code generated by the Authenticator
  • Once authenticated, you will see a success message in your browser, meaning that the certificate has been generated and is available on your PC.

IMPORTANT: the temporary certificate is valid for 12 hours. If you reboot your PC the certificate is lost and you need to download a new one by running the "step ssh login ..." command again.
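
Once the certificate is available and the ssh-agent is running, the connection works as usual; a minimal sketch, assuming the Galileo100 login address login.g100.cineca.it (host and username are assumptions, adapt them to your account):

	ssh <username>@login.g100.cineca.it   # the 12-hour certificate is picked up automatically by ssh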

Commands and scripts to submit a job inside the cluster (CINECA)

./train.sh <num_cpu> <max_walltime> (e.g. ./train.sh 12 24:00:00)

train.sh:

	#!/bin/bash
	# >>> Pulling repos
	...
	sbatch --job-name=job_example --cpus-per-task=${1} --time=${2} --output=./slurm_output/job_example.out --error=./slurm_output/job_example.out train.sbatch

train.sbatch:

	#!/bin/bash
	#SBATCH --partition=g100_usr_prod
	#SBATCH --mem=20000M
	#SBATCH --ntasks=1
	#SBATCH --mail-type=ALL
	#SBATCH --mail-user=<your email>

	# >>> IF YOU NEED TO USE A CONTAINER FOLLOW THE CODE BELOW (otherwise use your code here)<<<
	# Load the module
	module load singularity
	# Run the container
	singularity exec --hostname ${SLURM_SUBMIT_HOST}${SLURM_JOB_ID} ./container.sif bash ./container_train.sh

container_train.sh:

	#!/bin/bash
	# >>> Activate the conda environment
	...
	# >>> Execute code
	...
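
For reference, a hypothetical filled-in version of container_train.sh could look like the sketch below; the environment name my_env and the entry point train.py are assumptions:

	#!/bin/bash
	# make conda available in a non-interactive shell (adjust the path if conda is installed elsewhere)
	source ~/miniconda3/etc/profile.d/conda.sh
	# activate the (hypothetical) environment and run the (hypothetical) training entry point
	conda activate my_env
	python train.py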

CINECA: Additional info

https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.3%3A+GALILEO100+UserGuide

CINECA allows the use of tmux as a terminal multiplexer: https://tmuxcheatsheet.com/
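
A few basic tmux commands, for reference:

	tmux new -s mysession      # start a named session
	tmux ls                    # list existing sessions
	tmux attach -t mysession   # reattach to a session (detach with Ctrl-b d)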

CINECA compute nodes have no internet access (jobs run offline inside the node). Therefore:

  • pull the repos before submitting the job (e.g. in train.sh)
  • to use a logger (e.g. wandb):
    • set wandb_mode to offline
    • to sync with the server, run inside the wandb folder: wandb sync --include-offline ./offline-*
    • script to synchronize the wandb offline runs, supposing you have a group directory containing more than one run (a usage example follows the script):
            #!/bin/bash

            # argument 1: group directory
            # make conda available in a non-interactive shell (adjust the path if conda is installed elsewhere)
            source ~/miniconda3/etc/profile.d/conda.sh
            conda activate <env_name>

            # generate a fresh wandb run id to merge all offline runs into
            RAND_ID=$(python3 -c "import wandb; print(wandb.util.generate_id());")

            echo "Syncing runs $1 to new run id $RAND_ID"

            # first, sync the last series of logs to the new id
            first_dir=$(ls -t $1 | head -1)
            wandb sync $1/$first_dir/wandb/$(ls -t $1/$first_dir/wandb/ | grep offline | head -1) --id $RAND_ID

            # then all the others + the last again, to sync the hyper-parameters
            for dir in $(ls $1)
            do
                run=$(ls $1/$dir/wandb/ | grep offline)
                echo $run
                wandb sync $1/$dir/wandb/$run --id $RAND_ID
            done
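
      A possible usage, assuming the script above is saved as sync_wandb.sh and the group directory is ./group_dir (both names are assumptions):

            bash sync_wandb.sh ./group_dir   # merges all offline runs of the group into one new wandb run id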
      

New setup for live tracking of experiments with wandb

  • the login node has to create a reverse proxy towards the compute node, while the running job has to wait until this proxy is up before using wandb

  • add this at the beginning of your job script (use any port you prefer):

     echo Waiting for the reverse proxy...
     # block until the SOCKS port opened by the login node is visible on this compute node
     while ! netstat -an | grep 34567 &> /dev/null; do sleep 1; done
     # route the job's HTTP/HTTPS traffic through the reverse SOCKS proxy
     export HTTP_PROXY=socks5://127.0.0.1:34567
     export HTTPS_PROXY=socks5://127.0.0.1:34567
     export SOCK_PROXY=socks5://127.0.0.1:34567
     echo Reverse proxy is up and running!
    
  • this other script must be kept running on the login node for the whole duration of the job; it periodically checks which jobs are running and opens a new reverse proxy for each of them (a tmux sketch for keeping it alive is shown after this list)

     #!/bin/bash

     INTERVAL=10

     while true; do
         # Get the list of nodes where the user has running jobs
         nodes=$(squeue -u $USER -h -t R -o "%N" | uniq)

         for node in $nodes; do
             # Check if a reverse proxy is already set up for this node
             # (when no ssh tunnel exists, only the grep process itself matches, so the count is 1)
             n=$(ps -f -u $USER | grep -e "ssh.*$node" | wc -l)
             if [ $n -eq 1 ]; then
                 echo Creating proxy for $node...
                 ssh -oStrictHostKeyChecking=no -N -R 34567 -f $node
             fi
         done

         sleep $INTERVAL
     done
    
  • the ssh connection is kept in the background and killed by CINECA when the job ends

  • this solution works for wandb, huggingface_hub, and any library/application that uses requests - NOT for dataset downloads via torchvision
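
A simple way to keep the proxy loop alive for the whole job duration is a tmux session on the login node; a sketch, assuming the loop above is saved as proxy.sh (file and session names are assumptions):

     tmux new -s wandb-proxy   # open a persistent session on the login node
     bash proxy.sh             # run the loop shown above; detach with Ctrl-b d, reattach with: tmux attach -t wandb-proxy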

Installation of conda and git on the cluster

  • Install conda:
        mkdir -p ~/miniconda3
        wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
        bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
        rm -rf ~/miniconda3/miniconda.sh
        ~/miniconda3/bin/conda init bash
        ~/miniconda3/bin/conda init zsh
    
  • Create a conda environment:
        conda create -n <env_name> python=3.8
        conda activate <env_name>
    
  • If singularity is not installed:
        conda install -c conda-forge singularity
    
  • Clone git repositories:
        conda install gh --channel conda-forge
        gh auth login
        <gh token>
        git clone <repo>
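
To recreate the same conda environment elsewhere (e.g. on another cluster or inside a container), a minimal sketch; the file name environment.yml is an assumption:

        conda env export -n <env_name> > environment.yml   # snapshot the environment
        conda env create -f environment.yml                 # recreate it on the target machine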
    

Singularity: Additional info

Usually a cluster (e.g. CINECA, HPC) does not allow the use of Docker for security reasons; however, it is possible to use Singularity as an alternative. Unlike Docker, Singularity creates the container as a directory inside the original host filesystem. Therefore, if you originally created the Docker container in the path /home/a/b/c, Singularity will virtually create a path /home/a/b/c inside the actual host filesystem. When you use Singularity for the first time, take note of these steps:

  • add in ~/.bashrc file:
	  export SINGULARITY_CACHEDIR=/scratch/gpfs/$USER/SINGULARITY_CACHE
	  export SINGULARITY_TMPDIR=/tmp
  • pull the docker image <docker_path> and convert it into a singularity image <new_sing_img>.sif (on your login node)
     module load singularity
     singularity pull <new_sing_img>.sif docker://<docker_path>
  • NOTE: if you are not able to pull it from the cluster, you can copy a pre-existing .sif into the cluster
  • test singularity container using
       singularity shell <container>.sif
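
Since the container only sees part of the host filesystem by default, it can be useful to bind additional host directories explicitly when running it; a sketch, where the host path and the /data mount point are assumptions:

     module load singularity
     # bind the scratch area (hypothetical path) to /data inside the container
     singularity exec --bind /scratch/gpfs/$USER:/data <container>.sif bash ./container_train.sh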
    

Useful links:

SLURM cheatsheet

  • submit a job sbatch <file_name>.sbatch
  • show all jobs squeue
  • show your jobs squeue -u <username>
  • show job info scontrol show job <job_id>
  • partitions status sinfo
  • delete a job scancel <job_id>
  • running an interactive session on a node srun --nodes=1 --ntasks-per-node=1 --pty /bin/bash
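
On CINECA an interactive session typically also needs a partition, an account and a time limit; a sketch with assumed values (adapt partition, account, time and memory to your project):

      srun --partition=g100_usr_prod --account=<account_name> --time=01:00:00 --mem=8000M --nodes=1 --ntasks-per-node=1 --pty /bin/bash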
