grafana monitoring not working for static resources #35

Open
rvencu opened this issue Jul 25, 2022 · 2 comments
rvencu commented Jul 25, 2022

When we use a non-zero minimum count for compute resources in the cluster config, those nodes come alive at cluster launch, so this job-related check will never evaluate to true:

if [[ $job_comment == *"Key=Monitoring,Value=ON"* ]]; then

Because this must run in the root context, the only chance to attach it to a job is in the prolog script, so basically the plan would be to:

  1. install the Docker container anyway in post-install, but do not start it
  2. use the prolog and epilog to start and stop the container depending on the user's choice to monitor or not (see the sketch after this list)
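
A rough sketch of that prolog/epilog side (not existing code): a small helper that starts or stops the already-installed container. The helper name and container name are placeholders, and how the scripts learn whether monitoring was requested is exactly the open question below.

    #!/bin/bash
    # monitoring-toggle.sh (hypothetical helper): called from the prolog with
    # "start" and from the epilog with "stop". Assumes post-install has already
    # created (but not started) the container; the name below is a placeholder.
    container="${MONITORING_CONTAINER:-grafana-monitoring}"
    case "$1" in
      start) docker start "$container" ;;
      stop)  docker stop  "$container" ;;
      *)     echo "usage: $0 start|stop" >&2; exit 1 ;;
    esac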

The problem is how to send a signal about the job to the prolog and epilog, since custom user environment variables are not passed through and the job comment is not available either. And per the Slurm manuals we should not run scontrol from the prolog, as that would impair job scaling in the same way as the API calls (this is related to #34).

Looking at the variables available at prolog/epilog time, I only have two ideas so far:

  1. SLURM_PRIO_PROCESS, the scheduling priority (nice value) at the time of submission, available in SrunProlog, TaskProlog, SrunEpilog and TaskEpilog. We can set #SBATCH --nice 0 (or some other sensible value) to uniquely identify the intention, then use the TaskProlog and TaskEpilog to start/stop the monitoring container.
  2. use a crafted Slurm job name such as [GM] my job name, then pick it up and interpret it from SLURM_JOB_NAME, the name of the job, available in PrologSlurmctld, SrunProlog, TaskProlog, EpilogSlurmctld, SrunEpilog and TaskEpilog. This also means using the TaskProlog and TaskEpilog to start/stop the monitoring container (see the sketch after this list).
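
For idea 2, a minimal TaskProlog sketch, assuming the [GM] marker convention and the hypothetical helper from the earlier sketch (none of this exists in the repo yet):

    #!/bin/bash
    # TaskProlog (sketch): start monitoring only when the job name carries
    # the "[GM]" marker. SLURM_JOB_NAME is available in the TaskProlog.
    if [[ "${SLURM_JOB_NAME}" == "[GM]"* ]]; then
        /opt/monitoring/monitoring-toggle.sh start
    fi

The TaskEpilog would mirror it with stop, and users would opt in with something like sbatch --job-name "[GM] my job" job.sh.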
rvencu commented Jul 25, 2022

I propose that this line:

export monitoring_home="${SHARED_FS_DIR}/${monitoring_dir_name}/${head_node_hostname}"
be changed to:

export monitoring_home="${SHARED_FS_DIR}/${monitoring_dir_name}/${stack_name}"

The reason is that the stack name is also unique, even across recycling, and we can use SLURM_CLUSTER_NAME (the name of the cluster executing the job) inside the prolog scripts, thus avoiding undesired calls to scontrol.
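
A prolog-side sketch of that idea, assuming SLURM_CLUSTER_NAME ends up matching the directory segment used at install time, and that SHARED_FS_DIR and monitoring_dir_name are known to the prolog (for example baked in by post-install):

    # Prolog sketch: rebuild the shared monitoring path from the cluster name
    # instead of querying scontrol for job or cluster details.
    monitoring_home="${SHARED_FS_DIR}/${monitoring_dir_name}/${SLURM_CLUSTER_NAME}"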

I will eventually submit a PR; right now this is at the brainstorming stage, trying to find the best approach.

(Hmm, the current value of SLURM_CLUSTER_NAME is always parallelcluster, so we might need to change that to parallelcluster-stack_name or something similar. I see this is set inside slurm.conf, so we would need to modify it during post-install on the head node, and on the compute nodes in the case of static nodes, I suppose...)
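
A post-install sketch of that rename, assuming the standard ParallelCluster location of slurm.conf and that ${stack_name} is already available in the script:

    # Post-install sketch (head node, and static compute nodes if needed):
    # make the cluster name unique per stack. The slurm.conf path is the
    # usual ParallelCluster one and is an assumption here.
    sed -i "s/^ClusterName=.*/ClusterName=parallelcluster-${stack_name}/" /opt/slurm/etc/slurm.conf
    # slurmctld needs a restart afterwards, and it may also insist that the
    # clustername file in its state save directory be cleared, since it
    # remembers the previous name.
    systemctl restart slurmctld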

rvencu commented Jul 27, 2022
