
Esm2 on Sagemaker Hyperpod #387

Open · wants to merge 4 commits into main
Conversation

awsankur
Contributor

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Signed-off-by: Ankur Srivastava <[email protected]>
@KeitaW (Contributor) commented Jul 25, 2024

Do we have any SMHP-specific feature in this test case?
If not, we may organize the test cases per scheduler:

23.esm
├── kubernetes
└── slurm

see also #381


| Model | device_batch_size | num_nodes | torch.compile | Instance | Throughput |
|:------:|:-----------------:|:---------:|:-------------:| :------------: | :------------: |
| ESM2 | 8 | 2 | No | g5.12xlarge | 160 samples/s |
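As a quick sanity check on the table above, the effective global batch size and optimizer step rate can be derived from the per-device batch size. This sketch assumes a g5.12xlarge provides 4 GPUs per node (not stated in the table itself):

```python
# Hypothetical sanity check of the benchmark row above.
device_batch_size = 8
gpus_per_node = 4      # assumption: g5.12xlarge has 4 GPUs
num_nodes = 2

# Samples processed per optimizer step across the whole cluster.
global_batch_size = device_batch_size * gpus_per_node * num_nodes
print(global_batch_size)  # 64

# Reported throughput of 160 samples/s implies the step rate.
throughput = 160
steps_per_second = throughput / global_batch_size
print(steps_per_second)  # 2.5
```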

The setup instructions advise using a g5.24xlarge, but a g5.12xlarge was actually used?

## What is ESM-2?
[ESM-2](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1) is a pLM trained using unsupervised masked language modelling on 250 Million protein sequences by researchers at [Facebook AI Research (FAIR)](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1). It is available in several sizes, ranging from 8 Million to 15 Billion parameters. The smaller models are suitable for various sequence and token classification tasks. The FAIR team also adapted the 3 Billion parameter version into the ESMFold protein structure prediction algorithm. They have since used ESMFold to predict the structure of [more than 700 Million metagenomic proteins](https://esmatlas.com/about).

ESM-2 is a powerful pLM. We will demonstrate how to use QLoRA to fine-tune ESM-2 on g5.24xlarge instances. We will use ESM-2 to predict [subcellular localization](https://academic.oup.com/nar/article/50/W1/W228/6576357?login=false). Understanding where proteins appear in cells can help us understand their role in disease and find new drug targets.

Is this test case demonstrating pretraining or finetuning? I believe the latter, but the title states the former.

@@ -0,0 +1,168 @@
# How to pretrain ESM2 with SageMaker Hyperpod using Amazon G5 instances

Suggested change
# How to pretrain ESM2 with SageMaker Hyperpod using Amazon G5 instances
# How to finetune ESM2 with SageMaker Hyperpod using Amazon G5 instances

#!/bin/bash

#SBATCH --job-name=esm2-accelerate
#SBATCH -D .

Suggested change
#SBATCH -D .

This line may not be necessary.

-D, --chdir=
Set the working directory of the batch script to directory before it is executed. The path can be specified as full path or relative path to the directory where the command is executed.
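To illustrate the point: since Slurm already starts a batch script in the directory from which `sbatch` was invoked, `-D .` is a no-op, and an explicit `--chdir` is only useful when the job should run somewhere else. The path below is a hypothetical example, not from the PR:

```shell
#!/bin/bash
# "-D ." is redundant: the default working directory is the submission
# directory. An explicit path matters only when running elsewhere, e.g.:
#SBATCH --chdir=/fsx/esm2        # hypothetical shared-filesystem path
#SBATCH --job-name=esm2-accelerate
srun python train.py             # hypothetical training entry point
```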

# SPDX-License-Identifier: MIT-0

#SBATCH --nodes=2 # number of nodes to use
#SBATCH --job-name=FSDP # name of your job

Suggested change
#SBATCH --job-name=FSDP # name of your job
#SBATCH --job-name=DDP # name of your job

#SBATCH -D .
#SBATCH --output=accelerate-%x.%j.out
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=1 # number of MP tasks

Maybe you want

Suggested change
#SBATCH --ntasks-per-node=1 # number of MP tasks
#SBATCH --exclusive # job has exclusive use of the resource, no sharing

instead?
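Pulling the reviewer's suggestions together, a header sketch with the corrected job name and `--exclusive` in place of `--ntasks-per-node=1` might look as follows. The launch command is an illustrative assumption, not taken from the PR:

```shell
#!/bin/bash
# Sketch of the SBATCH header with both suggestions applied.
#SBATCH --job-name=DDP                 # name of your job
#SBATCH --output=accelerate-%x.%j.out
#SBATCH --nodes=2                      # number of nodes
#SBATCH --exclusive                    # exclusive use of the nodes, no sharing

# Hypothetical multi-node launch; flags depend on the accelerate config.
srun accelerate launch --num_machines 2 train.py
```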
