
Esm2 on Sagemaker Hyperpod #387

Open · wants to merge 4 commits into main
Conversation

awsankur
Contributor

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Signed-off-by: Ankur Srivastava <[email protected]>
@KeitaW (Contributor) commented Jul 25, 2024

Do we have any SMHP-specific feature in this test case?
If not, we may organize the test cases per scheduler:

23.esm
├── kubernetes
└── slurm

see also #381


| Model | device_batch_size | num_nodes | torch.compile | Instance | Throughput |
|:------:|:-----------------:|:---------:|:-------------:| :------------: | :------------: |
| ESM2 | 8 | 2 | No | g5.12xlarge | 160 samples/s |
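As a quick sanity check on the table above, the effective global batch size and optimizer step rate can be derived from the per-device batch size. This sketch assumes a g5.12xlarge provides 4 GPUs per node (not stated in the table itself):

```python
# Hypothetical sanity check of the benchmark row above.
device_batch_size = 8
gpus_per_node = 4      # assumption: g5.12xlarge has 4 GPUs
num_nodes = 2

# Samples processed per optimizer step across the whole cluster.
global_batch_size = device_batch_size * gpus_per_node * num_nodes
print(global_batch_size)  # 64

# Reported throughput of 160 samples/s implies the step rate.
throughput = 160
steps_per_second = throughput / global_batch_size
print(steps_per_second)  # 2.5
```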

The setup instructions advise using a g5.24xlarge, but a g5.12xlarge was actually used?

## What is ESM-2?
[ESM-2](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1) is a pLM trained using unsupervised masked language modelling on 250 Million protein sequences by researchers at [Facebook AI Research (FAIR)](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1). It is available in several sizes, ranging from 8 Million to 15 Billion parameters. The smaller models are suitable for various sequence and token classification tasks. The FAIR team also adapted the 3 Billion parameter version into the ESMFold protein structure prediction algorithm. They have since used ESMFold to predict the structure of [more than 700 Million metagenomic proteins](https://esmatlas.com/about).

ESM-2 is a powerful pLM. We will demonstrate how to use QLoRA to fine-tune ESM-2 on g5.24xlarge instances. We will use ESM-2 to predict [subcellular localization](https://academic.oup.com/nar/article/50/W1/W228/6576357?login=false). Understanding where proteins appear in cells can help us understand their role in disease and find new drug targets.

Is this test case demonstrating pretraining or finetuning? I believe the latter, but the title states the former.

@@ -0,0 +1,168 @@
# How to pretrain ESM2 with SageMaker Hyperpod using Amazon G5 instances

Suggested change
# How to pretrain ESM2 with SageMaker Hyperpod using Amazon G5 instances
# How to finetune ESM2 with SageMaker Hyperpod using Amazon G5 instances

#!/bin/bash

#SBATCH --job-name=esm2-accelerate
#SBATCH -D .

Suggested change
#SBATCH -D .

This line may not be necessary.

-D, --chdir=
Set the working directory of the batch script to directory before it is executed. The path can be specified as full path or relative path to the directory where the command is executed.
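To illustrate the point: since Slurm already starts a batch script in the directory from which `sbatch` was invoked, `-D .` is a no-op, and an explicit `--chdir` is only useful when the job should run somewhere else. The path below is a hypothetical example, not from the PR:

```shell
#!/bin/bash
# "-D ." is redundant: the default working directory is the submission
# directory. An explicit path matters only when running elsewhere, e.g.:
#SBATCH --chdir=/fsx/esm2        # hypothetical shared-filesystem path
#SBATCH --job-name=esm2-accelerate
srun python train.py             # hypothetical training entry point
```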

# SPDX-License-Identifier: MIT-0

#SBATCH --nodes=2 # number of nodes to use
#SBATCH --job-name=FSDP # name of your job

Suggested change
#SBATCH --job-name=FSDP # name of your job
#SBATCH --job-name=DDP # name of your job

#SBATCH -D .
#SBATCH --output=accelerate-%x.%j.out
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks-per-node=1 # number of MP tasks

Maybe you want

Suggested change
#SBATCH --ntasks-per-node=1 # number of MP tasks
#SBATCH --exclusive # job has exclusive use of the resource, no sharing

instead?
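Pulling the reviewer's suggestions together, a header sketch with the corrected job name and `--exclusive` in place of `--ntasks-per-node=1` might look as follows. The launch command is an illustrative assumption, not taken from the PR:

```shell
#!/bin/bash
# Sketch of the SBATCH header with both suggestions applied.
#SBATCH --job-name=DDP                 # name of your job
#SBATCH --output=accelerate-%x.%j.out
#SBATCH --nodes=2                      # number of nodes
#SBATCH --exclusive                    # exclusive use of the nodes, no sharing

# Hypothetical multi-node launch; flags depend on the accelerate config.
srun accelerate launch --num_machines 2 train.py
```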
