Intro to RAPIDS Course for Content Creators

Introduction

In this intro course, we cover the basic skills you need to accelerate your data analytics and ML pipeline with RAPIDS. Get to know the RAPIDS core libraries (cuDF, cuML, cuGraph, and cuXFilter) and community libraries (XGBoost, cuPy, Dask, and DaskSQL) to accelerate how you:

  • Ingest data
  • Prepare your data with ETL
  • Run modeling, inferencing, and predicting algorithms on the data in a GPU dataframe
  • Visualize your data throughout the process.

Each of the three modules should take less than 2 hours to complete. When complete, you should be able to:

  1. Take an existing workflow in a data science or ML pipeline and use RAPIDS to accelerate it with your GPU
  2. Create your own workflows from scratch

This course was written with the expectation that you know Python and Jupyter Lab. It is helpful, but not necessary, to have at least some understanding of Pandas, Scikit Learn, NetworkX, and Datashader.

You should be able to run these exercises and use these libraries on any machine that meets these prerequisites:

  • Ubuntu 22.04 or 20.04, or CentOS 7, with gcc 5.4 & 7.3
  • An NVIDIA GPU of Pascal architecture or better (basically 10xx series or newer)

RAPIDS works on a broad range of GPUs, including NVIDIA GeForce, TITAN, Quadro, Tesla, A100, and DGX systems.

NVIDIA Titan RTX

Questions?

There are a few channels to ask questions or start a discussion:

  • GoAI Slack to discuss issues and troubleshoot with the RAPIDS community
  • RAPIDS GitHub to submit feature requests and report bugs

Getting Started

There are 3 steps to installing RAPIDS:

  1. Provisioning a GPU-enabled workspace
  2. Installing RAPIDS Prerequisites
  3. Installing RAPIDS libraries

1. Provisioning a GPU-Enabled Workspace

When installing RAPIDS, first provision a RAPIDS-compatible GPU. The GPU must be NVIDIA Pascal™ or better, with compute capability 6.0+. Here is a list of compatible GPUs. This GPU can be local, as in a workstation or server, or in the cloud. GPUs can reside in:

  • Shared cloud
  • Dedicated cloud
  • Local workspace
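To check the compute capability 6.0+ requirement on a machine you already have, something like the following sketch may help. It assumes an NVIDIA driver recent enough that `nvidia-smi` supports the `compute_cap` query field; the helper names `parse_gpus` and `query_gpus` are made up for this example, and the sample text is illustrative, not real output.

```python
import subprocess

MIN_COMPUTE_CAPABILITY = 6.0  # Pascal, per the requirement above


def parse_gpus(csv_text):
    """Parse 'name, compute_cap' lines into (name, capability, supported) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma only, in case the GPU name contains one.
        name, cap = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, float(cap), float(cap) >= MIN_COMPUTE_CAPABILITY))
    return gpus


def query_gpus():
    """Ask nvidia-smi for each GPU's name and compute capability.

    Assumption: the installed driver supports the compute_cap query field.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpus(result.stdout)


# Illustrative sample text (not captured from a real machine).
sample = "Tesla T4, 7.5\nTesla K80, 3.7\n"
report = parse_gpus(sample)
for name, cap, supported in report:
    print(f"{name}: compute capability {cap}, RAPIDS-compatible: {supported}")
```

On a GPU machine, call `query_gpus()` instead of parsing the sample text.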

Using Cloud Instance(s)

There are two options for using Cloud Instances:

  1. Shared, free instances like Google Colab and SageMaker Studio Lab.
  2. Dedicated, usually paid GPU instances from providers like AWS, Azure, GCP, Paperspace, and more.

Shared Cloud via Free/Almost Free Instances

Free cloud instances have quick start capabilities or scripts to ease onboarding.

  • Google Colab: Pip-based environment that we can also install conda on. The installation will take about 3 minutes using pip and 8-15 minutes using conda. First, select a GPU instance from Runtime type. Then use the provided RAPIDS installation scripts, found here, by copying and pasting them into a code cell. Please note that RAPIDS will not run on an unsupported GPU instance like the K80, ONLY on the T4, P4, and P100 (check with !nvidia-smi). If you are given a K80, please factory reset your instance and then check again. You can upgrade to Colab Pro for $10 a month to greatly increase your chances of getting a RAPIDS-compatible GPU. Colab runs a single notebook, and data, including the installation, is not stored between instances.
  • SageMaker Studio Lab: Conda-based environment running JupyterLab. These instances will save your data and your installation, and can run multiple notebooks.
  • Paperspace: Docker-based environment. These instances can be preloaded with RAPIDS so you can start right away, but the free GPUs are only free with a subscription to other GPUs.
  • NVIDIA Launchpad: Corporate-only instances that let you kick the tires of many NVIDIA technologies.

Dedicated Cloud via Paid Instances

There are several ways to provision a dedicated cloud GPU workspace, and our instructions are found here. Your OS will need to be Ubuntu or RHEL/CentOS 7. These instances follow the same RAPIDS installation process as a local instance.

2. Installing RAPIDS Prerequisites

Downloads

You can satisfy the prerequisites for installing RAPIDS by:

  1. Installing the OS and GPU drivers
  2. Installing a packaging environment (Docker or Conda)

OS and GPU Drivers

Please ensure that your workstation has the correct OS, NVIDIA drivers, CUDA version, and Python version installed as per our prerequisites, found HERE.

Install Packaging Environment (Docker or Conda)

Depending on whether you prefer to use RAPIDS with Docker or Conda, you will also need one of these installed:

  • If Docker: Docker CE v19.03+ and nvidia-container-toolkit
    • Legacy Support - Docker CE v17-18 and nvidia-docker2
  • If Conda, please install one of:
    • Miniconda for a minimal conda installation
    • Anaconda for a full conda installation
    • Mamba inside of conda for faster conda solving (untested)

3. Installing RAPIDS Libraries

  • Use the Interactive RAPIDS release selector to install RAPIDS as you want it. The install script at the bottom of the selector updates as you change your install parameters: method, desired RAPIDS release, desired RAPIDS packages, Linux version, and CUDA version.
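Once the install script has finished, a quick sanity check is to see which RAPIDS packages Python can find. This is a minimal sketch using only the standard library. The list of module names is an assumption based on the core libraries named in this course; adjust it to the packages you selected.

```python
import importlib.util

# Assumed import names for the core libraries covered in this course.
RAPIDS_MODULES = ["cudf", "cuml", "cugraph", "cuxfilter"]


def installed_modules(names):
    """Map each module name to True if Python can locate it, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = installed_modules(RAPIDS_MODULES)
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

This only confirms that the packages are discoverable in the active environment. It does not confirm that they can reach your GPU, which you will find out when you run the first notebook.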

Great! Now that you're done getting up and running, let's move on to the Data Science!

1. The Basics of RAPIDS: cuDF and Dask

Introduction

cuDF lets you create and manipulate your dataframes on GPUs. All other RAPIDS libraries use cuDF to model, infer, regress, reduce, and predict outcomes. The cuDF API is designed to be similar to Pandas, so existing Pandas code needs only minimal changes.
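As a rough illustration of how close the two APIs are, the sketch below runs the same groupby code on cuDF when it is installed and falls back to Pandas otherwise. The column names and values are made up for the example, and `to_host` is a hypothetical helper, not part of either library.

```python
try:
    import cudf as xdf  # GPU dataframes, if RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback with the same calls

# Made-up sample data for illustration.
df = xdf.DataFrame(
    {
        "passenger_count": [1, 2, 1, 2],
        "tip": [1.0, 3.0, 2.0, 5.0],
    }
)

# Identical dataframe code for both libraries.
# sort_index() makes the row order explicit rather than relying on defaults.
mean_tip = df.groupby("passenger_count")["tip"].mean().sort_index()


def to_host(obj):
    """Return a Pandas object: copy from the GPU for cuDF, pass through for Pandas."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


print(to_host(mean_tip))
```

Only the import line differs between the CPU and GPU versions, which is the "minimal code changes" idea in practice.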

There are situations where the dataframe is larger than available GPU memory. Dask helps RAPIDS algorithms scale up through distributed computing. Whether you have a single GPU, multiple GPUs, or clusters of multiple GPUs, Dask handles the distributed computation and orchestrates the processing of GPU dataframes of any size, just as it would on a regular CPU cluster.

Let's get started with a couple of videos!
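The core idea behind scaling past a single device's memory is to split the data into partitions, reduce each partition to a small summary, and combine the summaries. The sketch below is a plain-Python stand-in for that pattern, using a thread pool instead of Dask. It is not Dask's API, and the function names and sample values are made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(partition):
    """Reduce one partition to a small (sum, count) summary."""
    return sum(partition), len(partition)


def partitioned_mean(values, npartitions, workers=2):
    """Compute a mean by combining per-partition summaries."""
    size = -(-len(values) // npartitions)  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    # Each worker only ever holds one partition's worth of data.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


tips = [1.0, 3.0, 2.0, 5.0, 4.0, 3.0]  # made-up sample values
mean_tip = partitioned_mean(tips, npartitions=3)
print(mean_tip)
```

With Dask, the partitions would be GPU dataframes and the workers would be GPUs, but the split, reduce, and combine shape of the computation stays the same.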

Videos

| Video Title | Description |
| --- | --- |
| Video - Getting Started with RAPIDS | Walks through the 01_Introduction_to_RAPIDS notebook, which shows, at a high level, what each of the packages in RAPIDS is and what it does. |
| Video - RAPIDS: Dask and cuDF NYCTaxi Screencast | Shows how you can use RAPIDS and Dask to easily ingest and model a large dataset (1 year's worth of NYCTaxi data) and then create a model around the question "when do you get the best tips". This same workload can be done on any GPU. |

Learning Notebooks

| Notebook Title | Description |
| --- | --- |
| 01_Introduction_to_RAPIDS | This notebook shows, at a high level, what each of the packages in RAPIDS is and what it does. |
| 02_Introduction_to_cuDF | This notebook shows how to work with cuDF DataFrames in RAPIDS. |
| 03_Introduction_to_Dask | This notebook shows how to work with Dask using basic Python primitives like integers and strings. |
| 04_Introduction_to_Dask_using_cuDF_DataFrames | This notebook shows how to work with cuDF DataFrames using Dask. |
| Guide to UDFs | This notebook provides an overview of User Defined Functions with cuDF. |
| 11_Introduction_to_Strings | This notebook shows how to manipulate strings with cuDF DataFrames. |
| 12_Introduction_to_Exploratory_Data_Analysis_using_cuDF | This notebook shows how to perform basic EDA with cuDF DataFrames. |
| 13_Introduction_to_Time_Series_Data_Analysis_using_cuDF | This notebook shows how to do EDA on time-series DataFrames with cuDF. |

Extra credit and Exercises

2. Accelerating those Algorithms: cuML and XGBoost

Introduction

Congrats on learning the basics of cuDF and Dask-cuDF for forming your data. Now let's take a look at cuML to run GPU-accelerated machine learning algorithms.

cuML runs many common scikit-learn algorithms and methods on cuDF dataframes to model, infer, regress, reduce, and predict outcomes on the data. Among the ever-growing suite of algorithms, you can run several GPU-accelerated algorithms for each of these methods:

  • Classification / Regression
  • Inference
  • Clustering
  • Decomposition & Dimensionality Reduction
  • Time Series
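Because cuML follows scikit-learn's estimator pattern, the same clustering code can often run on either library. The sketch below uses cuML's `KMeans` when it is installed and falls back to scikit-learn otherwise. The six points are made up so that they form two obvious clusters.

```python
import numpy as np

try:
    from cuml.cluster import KMeans  # GPU implementation, if RAPIDS is installed
except ImportError:
    from sklearn.cluster import KMeans  # CPU fallback with the same interface

# Made-up data: two well-separated groups of three points each.
points = np.array(
    [
        [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
        [10.0, 10.0], [10.1, 9.9], [9.9, 10.2],
    ],
    dtype=np.float32,
)

model = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = np.asarray(model.fit_predict(points))
print(labels)
```

The cluster ids themselves are arbitrary (the first group may be labeled 0 or 1), so check which points share a label and not the label values.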

While we look at cuML, we'll also look at how to further increase your speedup with XGBoost, scale it out with Dask XGBoost, and then see how to use cuML for Dimensionality Reduction and Clustering.

Let's look at a few video walkthroughs of XGBoost, as it may be an unfamiliar concept to some, and then experience how to use the above in your learning notebooks.

Videos

| Video Title | Description |
| --- | --- |
| Video - Introduction to XGBoost | Walks through the 07_Introduction_to_XGBoost notebook and shows how to work with GPU accelerated XGBoost in RAPIDS. |

Learning Notebooks

| Notebook Title | Description |
| --- | --- |
| 06_Introduction_to_Supervised_Learning | This notebook shows how to do GPU accelerated Supervised Learning in RAPIDS. |
| 07_Introduction_to_XGBoost | This notebook shows how to work with GPU accelerated XGBoost in RAPIDS. |
| 09_Introduction_to_Dimensionality_Reduction | This notebook shows how to do GPU accelerated Dimensionality Reduction in RAPIDS. |
| 10_Introduction_to_Clustering | This notebook shows how to do GPU accelerated Clustering in RAPIDS. |
| 14_Introduction_to_Machine_Learning_using_cuML | This notebook brings it all together and shows how to do GPU accelerated machine learning for core tasks like Classification, Regression, and Clustering in a workflow. |

Extra credit and Exercises

RAPIDS cuML Example Notebooks

Conclusion to Sections 1 and 2

Here ends the basics of cuDF, cuML, Dask, and XGBoost. These are the libraries that everyone who uses RAPIDS will reach for every day. Our next sections cover libraries that are more niche in usage, but are powerful tools for your analytics.

3. Graphs on RAPIDS: Intro to cuGraph

It is often useful to look at the relationships contained in the data, which we do through the use of graph analytics. Representing data as a graph is an extremely powerful technique that has grown in popularity. Graph analytics are used to help Netflix recommend shows, help Google rank sites in its search engine, connect bits of discrete knowledge into a comprehensive corpus, schedule NFL games, and can even help you optimize seating for your wedding (and it works too!). KDnuggets has a great in-depth guide to graphs here. Until now, running graph analytics was painfully slow, particularly as the size of the graph (number of nodes and edges) grew.

RAPIDS' cuGraph library makes graph analytics effortless, boasting some of our best speedups (up to 25,000x). To put it in perspective, what can take over 20 hours, cuGraph lets you do in less than a minute (3 seconds). In this section, we'll look at some examples of cuGraph methods for your graph analytics and look at a simple use case.
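To make "graph analytics" concrete, here is a small plain-Python PageRank, the kind of site-ranking computation mentioned above. It is a teaching sketch, not cuGraph's implementation. cuGraph provides a GPU-accelerated `cugraph.pagerank`, which is what you would use on real graphs. The three-node edge list is made up.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives a baseline share from random jumps.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node with no outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Made-up graph: a and b both link to c, and c links back to a.
edges = [("a", "c"), ("b", "c"), ("c", "a")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.3f}")
```

Node c ranks highest because two nodes link to it, and b ranks lowest because nothing links to it. Every iteration touches every edge, which is why this kind of work slows down as graphs grow and why it benefits from GPU acceleration.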

RAPIDS cuGraph Example Notebooks