Training on M1 "MPS" #28

Open
okpatil4u opened this issue Jan 8, 2023 · 45 comments

Comments

@okpatil4u

Most people do not have access to 8x A100 40GB systems, but a single M1 Max laptop with 64 GB of memory could host the training. How difficult would it be to port this code to "MPS"?

@okpatil4u
Author

I take it back. Seems like these are 8 x 40 GB systems.

There is a good paper on cramming [Cramming: Training a Language Model on a Single GPU in One Day]
https://arxiv.org/abs/2212.14034

I thought some work along these lines had been done here as well.

@karpathy karpathy reopened this Jan 8, 2023
@karpathy
Owner

karpathy commented Jan 8, 2023

Actually I think this issue is great to keep open, in case anyone investigates nanoGPT in an mps context. I haven't tried it yet.

@okpatil4u
Author

What is the actual memory requirement? Will a Mac Studio with 128 GB of RAM be sufficient for training?

@jwkirchenbauer

Refining the above comment slightly: do you currently have any (rough is fine) estimates of the relative sizes of the memory footprint for just the model parameters, the parameters plus the forward activations as a function of batch size, and the backward graph as a function of batch size, on the 8x A100 40GB configuration? Where does memory usage peak across the server during training?

That might help people figure out how to lay this out on the resources they have.
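For scale, here is a back-of-the-envelope sketch (an illustration only, not measured on that node) of the static part of the footprint: parameters, gradients, and AdamW state for the ~124M-parameter GPT-2 config. Activation memory comes on top of this and grows roughly with batch_size * block_size, so it really needs profiling.

```python
# Rough static-memory estimate for GPT-2 small in fp32 (illustrative numbers only)
n_params = 124e6          # ~124M parameters
bytes_per_param = 4       # fp32
grad_copies = 1           # one gradient tensor per parameter
adam_state = 2            # AdamW keeps exp_avg and exp_avg_sq per parameter

static_gb = n_params * bytes_per_param * (1 + grad_copies + adam_state) / 1e9
print(f"params + grads + AdamW state ≈ {static_gb:.1f} GB")  # ≈ 2.0 GB
```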

@NightMachinery

Also relevant for inference.

@personsg

I haven't had a chance to do any benchmarking yet, but training starts just fine on an M1 Ultra with --device=mps.

@acheong08

acheong08 commented Jan 12, 2023

@itakafu

itakafu commented Jan 16, 2023

I tried out the "i only have a MacBook" example from the README but with --device="mps", and it seems to run faster. With CPU, one iteration takes roughly 100ms, whereas with mps it is about ~40ms. My machine is a baseline Mac Studio.

@okpatil4u
Author

That's for training a very small transformer. My machine is an M1 Max with 64 GB RAM. For a BERT-medium-like architecture, this is how it goes.

Overriding: dataset = shakespeare
Overriding: n_layer = 8
Overriding: n_head = 512
Overriding: n_embd = 512
Overriding: device = mps
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 64
Overriding: batch_size = 128

Initializing a new model from scratch
number of parameters: 50.98M
step 0: train loss 10.9816, val loss 10.9783
iter 0: loss 10.9711, time 4613.50ms
iter 1: loss 10.9673, time 5791.48ms
iter 2: loss 10.9647, time 7842.40ms
iter 3: loss 10.9646, time 10196.35ms
iter 4: loss 10.9604, time 11602.34ms
iter 5: loss 10.9495, time 9393.25ms
iter 6: loss 10.9615, time 10373.34ms
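As a rough cross-check of that parameter count (an estimate that ignores position embeddings, layer norms and biases):

```python
# Approximate GPT parameter count: ~12 * n_layer * n_embd^2 for the transformer
# blocks, plus vocab_size * n_embd for the (tied) token embedding / LM head.
n_layer, n_embd, vocab_size = 8, 512, 50257
blocks = 12 * n_layer * n_embd ** 2      # ≈ 25.2M
embedding = vocab_size * n_embd          # ≈ 25.7M
print(f"≈ {(blocks + embedding) / 1e6:.2f}M parameters")  # ≈ 50.90M, close to the 50.98M above
```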

@karpathy
Owner

@itakafu thank you for reporting, i'll add mentions of mps to the readme&code.

@SiyuanHuang95

Tested on a MacBook Air M2, without charger:

with mps: roughly 150~200ms per iteration
without mps: roughly 450~500ms per iteration

Just for reference.

> @itakafu thank you for reporting, i'll add mentions of mps to the readme&code.

@tomeck

tomeck commented Jan 20, 2023

Confirmed it works great with device='mps'. But make sure to install the nightly build of PyTorch:

$ pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

I'm getting <40ms

Thank you SO MUCH for this

@tombenj

tombenj commented Jan 29, 2023

@tomeck Weird, I'm getting 300ms on an M2 (MacBook Air, 16GB):

python3 train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64 --device='mps' --compile=False --eval_iters=1 --block_size=64 --batch_size=8
Overriding: dataset = shakespeare
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 64
Overriding: device = mps
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 64
Overriding: batch_size = 8
vocab_size not found in data/shakespeare/meta.pkl, using GPT-2 default of 50257
Initializing a new model from scratch
number of parameters: 3.42M
step 0: train loss 10.8177, val loss 10.8162
iter 0: loss 10.8288, time 438.06ms
iter 1: loss 10.8117, time 303.12ms
iter 2: loss 10.8236, time 301.04ms
iter 3: loss 10.8265, time 299.64ms
iter 4: loss 10.8128, time 299.96ms
iter 5: loss 10.8173, time 299.72ms
iter 6: loss 10.8066, time 300.76ms
iter 7: loss 10.8084, time 299.86ms
iter 8: loss 10.8244, time 299.47ms

@coltac

coltac commented Jan 29, 2023

Just out of curiosity, I'm getting 17ms with a Ryzen 7 5700X and a 3060 Ti, 64 GB RAM. What kind of iteration time does an A100 do? Are they horribly faster? I have a friend with 2x 3080s and I'm considering doing the big one...

@tombenj

tombenj commented Jan 29, 2023

Yep, the README documentation doesn't make sense in terms of the ms calculations on A100s. It states:
"Training on an 8 x A100 40GB node for ~500,000 iters (~1 day) atm gets down to ~3.1"

This would mean 500,000 iters / 86,400 s ≈ 5.787 iters per second, i.e. 1000 / 5.787 ≈ 172.8 ms per iter.
And multiplying that by 8 to get a single A100... doesn't make sense.
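Spelling that arithmetic out (approximate, and ignoring evaluation and checkpointing overhead):

```python
iters = 500_000
seconds_per_day = 86_400
ms_per_iter = seconds_per_day / iters * 1000
print(f"≈ {ms_per_iter:.1f} ms per iteration for the whole 8-GPU node")  # ≈ 172.8 ms
```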

@coltac

coltac commented Jan 29, 2023

Oh, I'm being stupid. I'm getting 17ms on Shakespeare; I bet it'd be way higher on openwebtext.

@simonw

simonw commented Feb 1, 2023

Thanks to this thread I got it working on my M2 MacBook Pro - I wrote up some detailed notes here: https://til.simonwillison.net/llms/nanogpt-shakespeare-m2

@simonw

simonw commented Feb 1, 2023

I also built a little tool you can copy and paste the log output from training into to get a chart:

https://observablehq.com/@simonw/plot-loss-from-nanogpt

Example output:

[loss chart image]

@strikeroot

I think the mps section of the README may be inaccurate: my understanding is that mps just utilizes the on-chip GPU. To use the Neural Engine you'd have to port it to CoreML, which may or may not speed up training but should do wonders for inference. See the PyTorch announcement here.

@okpatil4u
Author

For training, you have to use MPS. For inference you can use ANE.
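For anyone following along, a quick sanity check (standard PyTorch API) that the MPS backend is actually available before passing --device=mps; note that this backend drives the on-chip GPU, not the Neural Engine:

```python
import torch

# Pick the best available backend; MPS uses the GPU, CoreML would be needed for the ANE.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
print(f"using device: {device}")
```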

@1234igor

1234igor commented Feb 15, 2023

Hey @simonw , thanks for sharing tutorial on your website!

I tried it on my MacBook Air M2 and I'm getting much worse performance:

time python3 train.py \
  --dataset=shakespeare \
  --n_layer=4 \
  --n_head=4 \
  --n_embd=64 \
  --compile=False \
  --eval_iters=1 \
  --block_size=64 \
  --batch_size=8 \
  --device=mps
Overriding: dataset = shakespeare
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 64
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 64
Overriding: batch_size = 8
Overriding: device = mps
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 3.42M
using fused AdamW: False
step 0: train loss 10.8153, val loss 10.8133
iter 0: loss 10.8181, time 5264.63ms, mfu -100.00%
iter 1: loss 10.8291, time 1650.46ms, mfu -100.00%
iter 2: loss 10.8164, time 1651.38ms, mfu -100.00%
iter 3: loss 10.7927, time 1639.94ms, mfu -100.00%
iter 4: loss 10.8212, time 1644.10ms, mfu -100.00%
iter 5: loss 10.8067, time 1639.57ms, mfu 0.08%
iter 6: loss 10.8307, time 1635.84ms, mfu 0.08%
iter 7: loss 10.8345, time 1635.17ms, mfu 0.08%
iter 8: loss 10.8262, time 1637.88ms, mfu 0.08%
iter 9: loss 10.8275, time 1643.70ms, mfu 0.08%
iter 10: loss 10.8100, time 1643.38ms, mfu 0.08%
iter 11: loss 10.8100, time 1641.18ms, mfu 0.08%
iter 12: loss 10.8258, time 1647.17ms, mfu 0.08%
iter 13: loss 10.8169, time 1643.93ms, mfu 0.08%
iter 14: loss 10.8139, time 1645.54ms, mfu 0.08%
iter 15: loss 10.8107, time 1642.27ms, mfu 0.08%
iter 16: loss 10.8114, time 1642.16ms, mfu 0.08%
iter 17: loss 10.7969, time 1641.59ms, mfu 0.08%
iter 18: loss 10.8150, time 1643.31ms, mfu 0.08%

I'm currently on Python 3.11. I spent a couple of hours trying to reinstall everything, but it didn't help. Does anyone have ideas about what could be wrong here?

@iSevenDays

iSevenDays commented Feb 20, 2023

MacBook M1 Max results on train_shakespeare_char:

python train.py config/train_shakespeare_char.py
Overriding config with config/train_shakespeare_char.py:
# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-shakespeare-char'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'

dataset = 'shakespeare_char'
batch_size = 16
block_size = 256 # context of up to 256 previous characters

# baby GPT model :)
n_layer = 4
n_head = 4
n_embd = 256
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-6 # learning_rate / 10 usually
beta2 = 0.999 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
device = 'mps'  # run on cpu only
compile = False # do not torch compile the model

found vocab_size = 65 (inside data/shakespeare_char/meta.pkl)

step 0: train loss 4.2326, val loss 4.2303
iter 0: loss 4.2329, time 9686.70ms, mfu -100.00%
step 5000: train loss 0.7204, val loss 1.5878
iter 5000: loss 0.9658, time 10224.29ms, mfu 0.48%

python sample.py --out_dir=out-shakespeare-char
Overriding: out_dir = out-shakespeare-char
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
WARNING: using slow attention. Flash Attention atm needs PyTorch nightly and dropout=0.0
number of parameters: 3.16M
Loading meta from data/shakespeare_char/meta.pkl...

The like order precious soner stout the morning's strength;
The month of his son bounded bones and rough
Since the common people'd courtesy 'gainst their times,
Your brats bear betwixt them away, and nothing
Against the gracious patern of their heads,
For their father is not their silly mouths,
Even in their voices and their loves.

MENENIUS:
You are received;
For they wear them, no more good to bed,
Your people have are endured with them not:
You'll have done as good to them be to brief

@deepaktalwardt

It appears that after 086ebe1 was merged, training performance on M1/M2 became significantly slower.

@nirajvenkat

Thanks @deepaktalwardt!

I am using the command suggested by @simonw:

time python3 train.py \
  --dataset=shakespeare \
  --n_layer=4 \
  --n_head=4 \
  --n_embd=64 \
  --compile=False \
  --eval_iters=1 \
  --block_size=64 \
  --batch_size=8 \
  --device=mps

After reverting that commit, this is literally flying on my MacBook Pro M2 Max! So just make sure gradient_accumulation_steps is always equal to 1. Without reverting 086ebe1 it will be 800ms per iter.

Stopped training after 10k iters, which took 4min 18s.

iter 10139: loss 3.9768, time 25.31ms, mfu 0.13%

KeyboardInterrupt

python train.py --dataset=shakespeare --n_layer=4 --n_head=4 --n_embd=64       232.40s user 72.33s system 117% cpu 4:18.81 total
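For anyone wondering why that one setting dominates the timings: each logged "iter" runs gradient_accumulation_steps forward/backward micro-steps before a single optimizer update. A generic sketch of the idea (not the repo's exact loop; it assumes the model returns a (logits, loss) pair):

```python
def train_iter(model, optimizer, get_batch, grad_accum_steps: int):
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):         # 40 micro-steps vs. 1 => roughly 40x the per-iter time
        x, y = get_batch()
        _, loss = model(x, y)
        (loss / grad_accum_steps).backward()  # average the gradients over the micro-steps
    optimizer.step()
```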

@hanfluid

Has anyone tried 'mps' together with 'compile=True' and succeeded?

@bcipolli

+1 to reverting 086ebe1; I went from 1500ms to 70ms per iteration.

@rozek

rozek commented Mar 29, 2023

Indeed, I also made my own fork and reverted 086ebe1, resulting in a dramatic speedup on my Mac mini M1!

@rozek

rozek commented Mar 29, 2023

> Thanks to this thread I got it working on my M2 MacBook Pro - I wrote up some detailed notes here: https://til.simonwillison.net/llms/nanogpt-shakespeare-m2

Simon, thank you very much for your walk-through of installing nanoGPT on Apple silicon. By the way, I just tried to run python sample.py after changing the device to mps, and it seems to work now: the script spits out a few warnings but then generates output without any problems. Note that it has to be run under macOS 13.x Ventura.

@Pixxinger

Pixxinger commented Apr 3, 2023

> Has anyone tried 'mps' together with 'compile=True' and succeeded?

Yep, as follows:

Overriding: dataset = shakespeare
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 128
Overriding: compile = True
Overriding: eval_iters = 20
Overriding: block_size = 64
Overriding: batch_size = 12
Overriding: device = mps
Overriding: log_interval = 1
Overriding: max_iters = 2000
Overriding: lr_decay_iters = 2000
Overriding: dropout = 0.0
Overriding: gradient_accumulation_steps = 1
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 7.23M
using fused AdamW: False
compiling the model... (takes a ~minute)
step 0: train loss 10.8272, val loss 10.8203
iter 0: loss 10.8421, time 2852.64ms, mfu -100.00%
iter 1: loss 10.8099, time 522.30ms, mfu -100.00%
...
iter 2000: loss 2.6286, time 1241.70ms, mfu 0.16%
python train.py config/train_shakespeare_char.py --dataset=shakespeare 420.38s user 105.07s system 49% cpu 17:34.84 total

~/nanoGPT master ± pip list | grep torch
torch 2.1.0.dev20230401
torchaudio 2.1.0.dev20230401
torchvision 0.16.0.dev20230401

~/nanoGPT master ± python --version
Python 3.9.6

@0dB

0dB commented Apr 23, 2023

Reverting commit 086ebe1 or overriding gradient_accumulation_steps to 1 is not needed anymore. This seems to have been fixed via the file config/train_shakespeare_char.py in commit 21f9bff. I can confirm 30ms or 775ms iteration times on an M1 Pro with mps, depending on whether I use the "I have a MacBook" settings or plain python train.py config/train_shakespeare_char.py --device=mps --compile=False.

BTW, I also did not need the nightly PyTorch build for this; the version available on MacPorts worked fine. I did have to comment out code in train.py regarding init_process_group, destroy_process_group and ddp (parallel processing on multiple GPUs).
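For reference, a sketch of the kind of guard that makes those calls a no-op on a single machine (my own sketch; the current train.py may already do something equivalent keyed off the RANK environment variable):

```python
import os
from torch.distributed import init_process_group, destroy_process_group

ddp = int(os.environ.get('RANK', -1)) != -1   # RANK is only set under torchrun
if ddp:
    init_process_group(backend='nccl')        # multi-GPU CUDA clusters; not relevant on MPS
# ... training loop ...
if ddp:
    destroy_process_group()
```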

@rozek

rozek commented May 12, 2023

Unfortunately, I cannot confirm the above statement: using a fresh installation of this repo, training "Shakespeare" took approx. 2.2s per iteration on a Mac mini M1 with 16GB RAM; after reverting 086ebe1 again, every iteration took only 0.067s or even less (what a dramatic change!)

@0dB

0dB commented May 12, 2023

> Unfortunately, I cannot confirm the above statement

That's strange. When I revert, which effectively sets gradient_accumulation_steps to 1, I get no change, so to me it seems commit 21f9bff resolves things. Ideas anyone?

@rozek

rozek commented May 12, 2023

Well, if you look into commit 21f9bff and compare it with the command you used for testing (python train.py config/train_shakespeare_char.py --device=mps --compile=False), you will see that the commit adds one line to config/train_shakespeare_char.py, namely gradient_accumulation_steps = 1 - that's what reverting commit 086ebe1 did generically.

Did you also test train.py with other configurations that do not include gradient_accumulation_steps = 1?

@0dB

0dB commented May 12, 2023

> if you look into commit 21f9bff and compare it with the command you used for testing (python train.py config/train_shakespeare_char.py --device=mps --compile=False), you will see that the commit adds one line to config/train_shakespeare_char.py, namely gradient_accumulation_steps = 1 - that's what reverting commit 086ebe1 did generically.

I know; from what I can tell, this overrides the setting in train.py, but does this not work for you?

> Did you also test train.py with other configurations that do not include gradient_accumulation_steps = 1?

No, I only have one GPU, so from my understanding of this issue I want this value to be 1. The only thing I tried just now is reverting the commit you mention, which sets gradient_accumulation_steps = 1 in train.py again (instead of 40), but for me it has (as expected) the same effect as just using the current code, which now overrides this value in config/train_shakespeare_char.py.

It is not clear to me why the code current at the time of writing is dramatically slower for you than the code after reverting the commit. Are you seeing different values for tokens per iteration (output by train.py) when you revert versus don't revert the commit? For me it is just batch_size times block_size, so the override is working and gradient_accumulation_steps is being set to 1.
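(For reference, that printed figure is just the product of the config values; with more than one GPU it would also be multiplied by the DDP world size.)

```python
gradient_accumulation_steps, batch_size, block_size = 1, 16, 256
tokens_per_iter = gradient_accumulation_steps * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter}")  # 4096
```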

@rozek

rozek commented May 12, 2023

Well, I think the reason why setting gradient_accumulation_steps = 1 has such a dramatic effect is still not really clear - at least, not to me.

I tested nanoGPT with the shakespeare dataset, not with shakespeare_char, which is why I ran into the same problem as a few weeks ago.

And since setting gradient_accumulation_steps = 1 in every configuration file is too tedious, I still recommend doing so in train.py itself - i.e., reverting 086ebe1.

@Pixxinger

Maybe simply try:

python train.py config/train_gpt2.py \
  --compile=True \
  --eval_iters=20 \
  --block_size=64 \
  --device=mps \
  --max_iters=6000 \
  --lr_decay_iters=6000 \
  --gradient_accumulation_steps=1

@abhimonangi

abhimonangi commented Jul 6, 2023

M1 Pro machine running the Spotify million song dataset:

nanoGPT % python3.10 train.py config/train_meet_summ.py --device=mps --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0
Overriding config with config/train_meet_summ.py:
# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such

out_dir = 'out-meet_summ'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'meet_summ'
wandb_run_name = 'mini-gpt'

dataset = 'meet_summ'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256 # context of up to 256 previous characters

# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
# device = 'cpu'  # run on cpu only
# compile = False # do not torch compile the model

Overriding: device = mps
Overriding: compile = False
Overriding: eval_iters = 20
Overriding: log_interval = 1
Overriding: block_size = 64
Overriding: batch_size = 12
Overriding: n_layer = 4
Overriding: n_head = 4
Overriding: n_embd = 128
Overriding: max_iters = 2000
Overriding: lr_decay_iters = 2000
Overriding: dropout = 0.0
tokens per iteration will be: 768
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 7.23M
/opt/homebrew/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:120: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.
  warnings.warn("torch.cuda.amp.GradScaler is enabled, but CUDA is not available.  Disabling.")
num decayed parameter tensors: 18, with 7,233,536 parameters
num non-decayed parameter tensors: 9, with 1,152 parameters
using fused AdamW: False
step 0: train loss 10.8466, val loss 10.8504
iter 0: loss 10.8558, time 1160.57ms, mfu -100.00%
iter 1: loss 10.8502, time 65.83ms, mfu -100.00%
iter 2: loss 10.8342, time 94.48ms, mfu -100.00%
iter 3: loss 10.8267, time 61.11ms, mfu -100.00%
iter 4: loss 10.8191, time 61.72ms, mfu -100.00%
iter 5: loss 10.8195, time 61.16ms, mfu 0.18%
iter 6: loss 10.7832, time 61.36ms, mfu 0.18%
iter 7: loss 10.7710, time 60.81ms, mfu 0.18%
iter 8: loss 10.7230, time 60.78ms, mfu 0.18%
iter 9: loss 10.7206, time 60.25ms, mfu 0.18%

@mercicle

Hi all, I have an mps error, but only when doing architecture sweeps. Can someone comment on this issue?
#343

@gabigabogabu

gabigabogabu commented Oct 7, 2023

Do you folks not run into this issue with a buggy torch.multinomial on mps?
pytorch/pytorch#92752
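One workaround that gets suggested for that bug (a sketch on my part, not something this repo does): draw the sample on the CPU, then move the index back to the MPS device.

```python
import torch

def sample_next_token(probs: torch.Tensor) -> torch.Tensor:
    # probs: (batch, vocab) probabilities on any device; do the multinomial draw
    # on CPU to sidestep the reported MPS bug, then move the index back
    idx = torch.multinomial(probs.detach().cpu(), num_samples=1)
    return idx.to(probs.device)
```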

@Venkat1495

Hey everyone, I ran nanoGPT training using "python train.py config/train_shakespeare_char.py --device=mps --compile=False" on a Mac M1 Pro.

[screenshots of the training run]

@davmacario

Hi there!
I have been playing with nanoGPT for a while on my Mac (M1 Pro) and I have noticed that inference is very slow when the length in tokens of the generated output is smaller than the context length. Can anyone confirm this?
I get generation speeds as low as 1 token/s when using a context length of 256 tokens and generating tokens 200 to 255, but once the context length is exceeded, generation is much faster.
This does not happen on CUDA.

@AlbertMarashi

My M1 Pro gets slower when I use mps but faster with cpu??

@maercaestro

yeah, having the same problem, but only with inference

@davmacario

davmacario commented Apr 16, 2024

> yeah, having the same problem, but only with inference

It's been a couple of weeks since I last checked on this, but I had the same issue.
I suspect there is some slowdown when the tril triangular matrix used as a mask in the attention mechanism has to be truncated (when the number of generated tokens is lower than the context length block_size).
See this.
This is not done in training, as the training examples are created such that they fill the whole context window.

What's even stranger is that, at least for me, this slowdown only happened when generating the first sample (in sample.py).
The only explanation I could come up with was that Torch performs some sort of caching on that matrix, but I am not sure about the underlying library implementation.

This issue only happens with MPS.
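To make the suspicion concrete, this is roughly the masking step in the slow-attention path I mean (a simplified sketch, not the exact repo code):

```python
import torch

block_size = 256  # context length
# cached lower-triangular mask, shaped for broadcasting over (B, n_head, T, T)
bias = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)

def masked_scores(att: torch.Tensor) -> torch.Tensor:
    # att: (B, n_head, T, T); while T < block_size the cached mask has to be
    # sliced to (T, T) every step, which is the operation I suspect is slow on MPS
    T = att.size(-1)
    return att.masked_fill(bias[:, :, :T, :T] == 0, float('-inf'))
```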

@maercaestro

At this point, I might have to subscribe to Google Colab just to run the code... the downfall of poverty.
