
ABP nvsmi sample data generation #1108

Merged — 11 commits merged into nv-morpheus:branch-23.11 on Aug 31, 2023

Conversation

@efajardo-nv (Contributor) commented Jul 27, 2023

Description

  • Add a script to the ABP nvsmi example for generating sample data.
  • Data generated with the script does not contain all of the columns used to train the current nvsmi model, so the model was retrained using the 18 overlapping columns.
  • Update the model, model config, training notebook/script, and feature columns file.
  • Update the README with instructions on how to run the script.
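For context, a minimal sketch of the kind of flattening such a data-generation script might perform: parsing `nvidia-smi -q -x` XML output into the dotted column names used by the ABP example (e.g. `nvidia_smi_log.gpu.fb_memory_usage.used`). The XML fragment and the `flatten` helper below are illustrative assumptions, not the actual script:

```python
# Sketch: flatten `nvidia-smi -q -x` style XML into dotted column names.
# SAMPLE_XML is a hand-written fragment for illustration, not real output.
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<nvidia_smi_log>
  <gpu>
    <fb_memory_usage>
      <used>1000 MiB</used>
      <free>15000 MiB</free>
    </fb_memory_usage>
    <utilization>
      <gpu_util>32 %</gpu_util>
    </utilization>
  </gpu>
</nvidia_smi_log>
"""

def flatten(elem, prefix=""):
    """Recursively flatten an XML tree into {dotted_path: text} pairs."""
    path = f"{prefix}.{elem.tag}" if prefix else elem.tag
    children = list(elem)
    if not children:
        return {path: (elem.text or "").strip()}
    row = {}
    for child in children:
        row.update(flatten(child, path))
    return row

row = flatten(ET.fromstring(SAMPLE_XML))
print(row["nvidia_smi_log.gpu.fb_memory_usage.used"])  # "1000 MiB"
```

Each flattened dict becomes one row of the sample dataset, with one column per dotted path.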

Closes #1097

Checklist

[x] I am familiar with the Contributing Guidelines.
[x] New or existing tests cover these changes.
[x] The documentation is up to date with these changes.

@efajardo-nv efajardo-nv added the non-breaking (Non-breaking change) and improvement (Improvement to existing functionality) labels Jul 27, 2023
@efajardo-nv efajardo-nv requested a review from a team as a code owner July 27, 2023 20:40
@efajardo-nv (Contributor, Author) commented Jul 28, 2023

The dataset generated using the script contains 18 of the 29 columns used to train the model, so the example inference pipeline fails because of the missing columns. These are the 18 columns:

['nvidia_smi_log.gpu.fb_memory_usage.used',
 'nvidia_smi_log.gpu.fb_memory_usage.free',
 'nvidia_smi_log.gpu.utilization.gpu_util',
 'nvidia_smi_log.gpu.utilization.memory_util',
 'nvidia_smi_log.gpu.temperature.gpu_temp',
 'nvidia_smi_log.gpu.temperature.gpu_temp_max_threshold',
 'nvidia_smi_log.gpu.temperature.gpu_temp_slow_threshold',
 'nvidia_smi_log.gpu.power_readings.power_draw',
 'nvidia_smi_log.gpu.clocks.graphics_clock',
 'nvidia_smi_log.gpu.clocks.sm_clock',
 'nvidia_smi_log.gpu.clocks.mem_clock',
 'nvidia_smi_log.gpu.applications_clocks.graphics_clock',
 'nvidia_smi_log.gpu.applications_clocks.mem_clock',
 'nvidia_smi_log.gpu.default_applications_clocks.graphics_clock',
 'nvidia_smi_log.gpu.default_applications_clocks.mem_clock',
 'nvidia_smi_log.gpu.max_clocks.graphics_clock',
 'nvidia_smi_log.gpu.max_clocks.sm_clock',
 'nvidia_smi_log.gpu.max_clocks.mem_clock']

Tried retraining the model with just the 18 columns using:
https://github.com/nv-morpheus/Morpheus/blob/branch-23.11/models/training-tuning-scripts/abp-models/abp-nvsmi-xgb-20210310.ipynb

Accuracy was still 100%. Deployed the new model to Triton and ran the inference pipeline against it using the dataset generated from the script. The pipeline ran all the way through with no errors.

@gbatmaz would it be possible to use a model trained on the 18 columns instead, so that datasets from both NetQ and the script (pyNVML/nvidia-smi) can be used with the pipeline?
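As a side note, the overlap check described above can be sketched in plain Python. This is a minimal illustration: `training_cols` and `generated_cols` are abbreviated example lists, and in practice the training columns come from models/data/columns_fil.txt (one column name per line):

```python
# Minimal sketch: find the training columns that the generated dataset
# also contains, preserving the training-column order.

def overlapping_columns(generated_cols, training_cols):
    """Return training columns that are also present in the generated data."""
    generated = set(generated_cols)
    return [col for col in training_cols if col in generated]

# Abbreviated example lists (not the full 29/18-column sets).
training_cols = [
    "nvidia_smi_log.gpu.fb_memory_usage.used",
    "nvidia_smi_log.gpu.fb_memory_usage.free",
    "nvidia_smi_log.gpu.utilization.gpu_util",
    "nvidia_smi_log.gpu.pci.tx_util",  # example of a column the script does not emit
]

generated_cols = [
    "nvidia_smi_log.gpu.fb_memory_usage.used",
    "nvidia_smi_log.gpu.fb_memory_usage.free",
    "nvidia_smi_log.gpu.utilization.gpu_util",
    "nvidia_smi_log.gpu.clocks.sm_clock",
]

print(overlapping_columns(generated_cols, training_cols))  # the three shared columns
```

Running this against the full column lists would yield the 18 columns listed above.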

@efajardo-nv efajardo-nv marked this pull request as draft July 28, 2023 17:06
@mdemoret-nv (Contributor) commented:

@efajardo-nv I believe this isn't specific to NetQ, but rather to the GPU you are using. Have you run the script on a different GPU (Quadro vs. GeForce vs. Tesla)? Some may not expose the necessary monitoring fields.

@efajardo-nv (Contributor, Author) commented:

> I believe this isn't specific to NetQ, but rather to the GPU you are using. Have you run the script on a different GPU (Quadro vs. GeForce vs. Tesla)?

@mdemoret-nv Yes, that's correct. The columns generated by the script differ slightly between a Tesla V100 (126 columns) and a Quadro RTX 8000 (124 columns), but both have the same 18 columns that overlap with:
https://github.com/nv-morpheus/Morpheus/blob/branch-23.11/models/data/columns_fil.txt

@gbatmaz (Contributor) commented Jul 31, 2023

> The dataset generated using the script contains 18 of the 29 columns used to train the model. […]
>
> @gbatmaz would it be possible to use a model trained on the 18 columns instead, so that datasets from both NetQ and the script (pyNVML/nvidia-smi) can be used with the pipeline?

Yes, it should be okay to modify the notebook and use whatever columns are available, since you've done the sanity check. For production, it's probably better to retrain with their own data to make sure the correct distinction can be made for their type of workloads.

@mdemoret-nv (Contributor) commented:

@efajardo-nv Is this still a draft?

@efajardo-nv (Contributor, Author) commented:

@mdemoret-nv I need to retrain the model using the 18 columns for the pipeline to work with inference data generated from the script. That should be okay according to @gbatmaz. I can do that now unless you see any issues.

@copy-pr-bot (bot) commented Aug 30, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@review-notebook-app (bot) commented:

Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@efajardo-nv (Contributor, Author) commented:

/ok to test

@efajardo-nv efajardo-nv marked this pull request as ready for review August 31, 2023 19:59
@efajardo-nv efajardo-nv requested a review from a team as a code owner August 31, 2023 19:59
@efajardo-nv (Contributor, Author) commented:

/ok to test

@efajardo-nv (Contributor, Author) commented:

/ok to test

@gbatmaz (Contributor) left a review comment:

LGTM

@efajardo-nv (Contributor, Author) commented:

/merge

@rapids-bot rapids-bot bot merged commit e73b03a into nv-morpheus:branch-23.11 Aug 31, 2023
2 checks passed
@efajardo-nv efajardo-nv deleted the abp-nvsmi-data-gen branch July 29, 2024 21:09
Labels: improvement (Improvement to existing functionality), non-breaking (Non-breaking change)

Successfully merging this pull request may close these issues.

[DOC]: Add information on how to generate new sample data for the ABP nvsmi example
3 participants