
ABP nvsmi sample data generation #1108

Merged — 11 commits merged into nv-morpheus:branch-23.11 on Aug 31, 2023

Conversation

@efajardo-nv (Contributor) commented Jul 27, 2023

Description

  • Add a script to the ABP nvsmi example for generating sample data.
  • Data generated with the script does not contain all of the columns used to train the current nvsmi model, so the model was retrained using the 18 overlapping columns.
  • Update the model, model config, training notebook/script, and feature columns file.
  • Update the README with instructions on how to run the script.
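For context, a minimal sketch of the kind of flattening such a data-generation script might perform: parsing `nvidia-smi -q -x` XML output into the dotted column names used by the ABP example (e.g. `nvidia_smi_log.gpu.fb_memory_usage.used`). The XML fragment and the `flatten` helper below are illustrative assumptions, not the actual script:

```python
# Sketch: flatten `nvidia-smi -q -x` style XML into dotted column names.
# SAMPLE_XML is a hand-written fragment for illustration, not real output.
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<nvidia_smi_log>
  <gpu>
    <fb_memory_usage>
      <used>1000 MiB</used>
      <free>15000 MiB</free>
    </fb_memory_usage>
    <utilization>
      <gpu_util>32 %</gpu_util>
    </utilization>
  </gpu>
</nvidia_smi_log>
"""

def flatten(elem, prefix=""):
    """Recursively flatten an XML tree into {dotted_path: text} pairs."""
    path = f"{prefix}.{elem.tag}" if prefix else elem.tag
    children = list(elem)
    if not children:
        return {path: (elem.text or "").strip()}
    row = {}
    for child in children:
        row.update(flatten(child, path))
    return row

row = flatten(ET.fromstring(SAMPLE_XML))
print(row["nvidia_smi_log.gpu.fb_memory_usage.used"])  # "1000 MiB"
```

Each flattened dict becomes one row of the sample dataset, with one column per dotted path.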

Closes #1097

Checklist

[x] I am familiar with the Contributing Guidelines.
[x] New or existing tests cover these changes.
[x] The documentation is up to date with these changes.

@efajardo-nv efajardo-nv added the non-breaking (Non-breaking change) and improvement (Improvement to existing functionality) labels Jul 27, 2023
@efajardo-nv efajardo-nv requested a review from a team as a code owner July 27, 2023 20:40
@efajardo-nv (Contributor, Author) commented Jul 28, 2023

The dataset generated using the script contains 18 of the 29 columns used to train the model, so the example inference pipeline fails because of the missing columns. These are the 18 columns:

['nvidia_smi_log.gpu.fb_memory_usage.used',
 'nvidia_smi_log.gpu.fb_memory_usage.free',
 'nvidia_smi_log.gpu.utilization.gpu_util',
 'nvidia_smi_log.gpu.utilization.memory_util',
 'nvidia_smi_log.gpu.temperature.gpu_temp',
 'nvidia_smi_log.gpu.temperature.gpu_temp_max_threshold',
 'nvidia_smi_log.gpu.temperature.gpu_temp_slow_threshold',
 'nvidia_smi_log.gpu.power_readings.power_draw',
 'nvidia_smi_log.gpu.clocks.graphics_clock',
 'nvidia_smi_log.gpu.clocks.sm_clock',
 'nvidia_smi_log.gpu.clocks.mem_clock',
 'nvidia_smi_log.gpu.applications_clocks.graphics_clock',
 'nvidia_smi_log.gpu.applications_clocks.mem_clock',
 'nvidia_smi_log.gpu.default_applications_clocks.graphics_clock',
 'nvidia_smi_log.gpu.default_applications_clocks.mem_clock',
 'nvidia_smi_log.gpu.max_clocks.graphics_clock',
 'nvidia_smi_log.gpu.max_clocks.sm_clock',
 'nvidia_smi_log.gpu.max_clocks.mem_clock']

Tried retraining the model with just the 18 columns using:
https://github.com/nv-morpheus/Morpheus/blob/branch-23.11/models/training-tuning-scripts/abp-models/abp-nvsmi-xgb-20210310.ipynb

Accuracy was still 100%. Deployed the new model to Triton and ran the inference pipeline against it using the dataset generated from the script. The pipeline ran all the way through with no errors.

@gbatmaz would it be possible to use a model trained on the 18 columns instead, so that datasets from both NetQ and the script (pyNVML/nvidia-smi) can be used with the pipeline?
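As a side note, the overlap check described above can be sketched in plain Python. This is a minimal illustration: `training_cols` and `generated_cols` are abbreviated example lists, and in practice the training columns come from models/data/columns_fil.txt (one column name per line):

```python
# Minimal sketch: find the training columns that the generated dataset
# also contains, preserving the training-column order.

def overlapping_columns(generated_cols, training_cols):
    """Return training columns that are also present in the generated data."""
    generated = set(generated_cols)
    return [col for col in training_cols if col in generated]

# Abbreviated example lists (not the full 29/18-column sets).
training_cols = [
    "nvidia_smi_log.gpu.fb_memory_usage.used",
    "nvidia_smi_log.gpu.fb_memory_usage.free",
    "nvidia_smi_log.gpu.utilization.gpu_util",
    "nvidia_smi_log.gpu.pci.tx_util",  # example of a column the script does not emit
]

generated_cols = [
    "nvidia_smi_log.gpu.fb_memory_usage.used",
    "nvidia_smi_log.gpu.fb_memory_usage.free",
    "nvidia_smi_log.gpu.utilization.gpu_util",
    "nvidia_smi_log.gpu.clocks.sm_clock",
]

print(overlapping_columns(generated_cols, training_cols))  # the three shared columns
```

Running this against the full column lists would yield the 18 columns listed above.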

@efajardo-nv efajardo-nv marked this pull request as draft July 28, 2023 17:06
@mdemoret-nv (Contributor) commented:

@efajardo-nv I believe this isn't specific to NetQ, but rather to the GPU you are using. Have you run the script on a different GPU (Quadro vs. GeForce vs. Tesla)? Some may not expose the necessary monitoring fields.

@efajardo-nv (Contributor, Author) commented:

> I believe this isn't specific to NetQ, but rather to the GPU you are using. Have you run the script on a different GPU (Quadro vs. GeForce vs. Tesla)?

@mdemoret-nv Yes, that's correct. The columns generated by the script differ slightly between a Tesla V100 (126 columns) and a Quadro RTX 8000 (124 columns), but both have the same 18 columns that overlap with:
https://github.com/nv-morpheus/Morpheus/blob/branch-23.11/models/data/columns_fil.txt

@gbatmaz (Contributor) commented Jul 31, 2023

> The dataset generated using the script contains 18 of the 29 columns used to train the model. […]
>
> @gbatmaz would it be possible to use a model trained on the 18 columns instead, so that datasets from both NetQ and the script (pyNVML/nvidia-smi) can be used with the pipeline?

Yes, it should be okay to modify the notebook and use whatever columns are available, since you've done the sanity check. For production, it's probably better to retrain with their own data to make sure the correct distinction can be made for their type of workloads.

@mdemoret-nv (Contributor) commented:

@efajardo-nv Is this still a draft?

@efajardo-nv (Contributor, Author) commented:

@mdemoret-nv I need to retrain the model using the 18 columns for the pipeline to work with inference data generated from the script. That should be okay according to @gbatmaz. I can do that now unless you see any issues.

@copy-pr-bot (bot) commented Aug 30, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@review-notebook-app (bot) commented:

Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@efajardo-nv (Contributor, Author) commented:

/ok to test

@efajardo-nv efajardo-nv marked this pull request as ready for review August 31, 2023 19:59
@efajardo-nv efajardo-nv requested a review from a team as a code owner August 31, 2023 19:59
@efajardo-nv (Contributor, Author) commented:

/ok to test

@efajardo-nv (Contributor, Author) commented:

/ok to test

@gbatmaz (Contributor) left a review comment:

LGTM

@efajardo-nv (Contributor, Author) commented:

/merge

@rapids-bot rapids-bot bot merged commit e73b03a into nv-morpheus:branch-23.11 Aug 31, 2023
2 checks passed
@efajardo-nv efajardo-nv deleted the abp-nvsmi-data-gen branch July 29, 2024 21:09
Labels: improvement (Improvement to existing functionality), non-breaking (Non-breaking change)

Successfully merging this pull request may close these issues.

[DOC]: Add information on how to generate new sample data for the ABP nvsmi example
3 participants