[Bug]: Multiple instances of crash during SDXL (Finetune) training #362

Open
velourlawsuits opened this issue Jun 25, 2024 · 0 comments
Labels: bug Something isn't working

velourlawsuits commented Jun 25, 2024

What happened?

I've been experiencing a periodic crash while training an SDXL finetune. My training settings have been identical for the past two months, and this bug first appeared about a week ago. I have tried updating, re-downloading the requirements, and doing a fresh install in a separate directory. Each 'fix' has worked for one training session, but whenever I start a new project I run into the same error. I had previously opened a bug report for this but closed it because I thought the update had fixed it; I will reopen that as well. Again, this is a new error, and the only things I have changed are the dataset for the new training and some filename tweaks. Furthermore, the dataset I'm using has been adjusted three times since the first error, and if it were the problem, why would the training work on a fresh install/update?

What did you expect would happen?

The training should have completed without error.

Relevant log output

activating venv D:\one_ai\OneTrainer\venv
Using Python "D:\one_ai\OneTrainer\venv\Scripts\python.exe"
Clearing cache directory workspace-cache/cache_1! You can disable this if you want to continue using the same cache.
D:\one_ai\OneTrainer\venv\src\diffusers\src\diffusers\loaders\single_file.py:340: FutureWarning: `original_config_file` is deprecated and will be removed in version 1.0.0. `original_config_file` argument is deprecated and will be removed in future versions.please use the `original_config` argument instead.
  deprecate("original_config_file", "1.0.0", deprecation_message)
TensorFlow installation not found - running with reduced feature set.
Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████| 17/17 [00:00<?, ?it/s]
Loading pipeline components...:  14%|███████▍                                            | 1/7 [00:00<00:01,  4.90it/s]Some weights of the model checkpoint were not used when initializing CLIPTextModel:
 ['text_model.embeddings.position_ids']
Loading pipeline components...:  71%|█████████████████████████████████████▏              | 5/7 [00:01<00:00,  3.80it/s]Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.16.2 at http://localhost:6006/ (Press CTRL+C to quit)
Loading pipeline components...: 100%|████████████████████████████████████████████████████| 7/7 [00:04<00:00,  1.50it/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.27it/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.27it/s]D:\one_ai\OneTrainer\venv\src\diffusers\src\diffusers\models\attention_processor.py:1406: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = F.scaled_dot_product_attention(
caching: 100%|█████████████████████████████████████████████████████████████████████| 3632/3632 [09:33<00:00,  6.33it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████| 3632/3632 [01:35<00:00, 37.88it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 20/20 [00:13<00:00,  1.50it/s]
D:\one_ai\OneTrainer\venv\lib\site-packages\torch\autograd\graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ..\aten\src\ATen\native\cudnn\Conv_v8.cpp:919.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
step: 100%|██████████████████████████████████████| 3632/3632 [1:53:33<00:00,  1.88s/it, loss=0.0667, smooth loss=0.131]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 20/20 [00:13<00:00,  1.49it/s]
Saving workspace/run\save\LiveLeak_1e-05_Refined_RealVis2024-06-25_16-12-26-save-3632-1-0     | 0/3632 [00:00<?, ?it/s]
step:   0%|                                                                                   | 0/3632 [00:24<?, ?it/s]
epoch:  25%|█████████████████▊                                                     | 1/4 [2:05:10<6:15:32, 7510.67s/it]
Traceback (most recent call last):
  File "D:\one_ai\OneTrainer\modules\ui\TrainUI.py", line 538, in __training_thread_function
    trainer.train()
  File "D:\one_ai\OneTrainer\modules\trainer\GenericTrainer.py", line 572, in train
    model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
  File "D:\one_ai\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 276, in predict
    text_encoder_output, pooled_text_encoder_2_output = self.__encode_text(
  File "D:\one_ai\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 242, in __encode_text
    text_encoder_1_output = model.text_encoder_1(
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 807, in forward
    return self.text_model(
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 699, in forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 219, in forward
    inputs_embeds = self.token_embedding(input_ids)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\one_ai\OneTrainer\modules\module\AdditionalEmbeddingWrapper.py", line 41, in forward
    return F.embedding(
  File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Saving models/LiveLeak_1e-05_Refined_RealVis
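
The final traceback shows F.embedding inside OneTrainer's AdditionalEmbeddingWrapper receiving an index tensor and a weight tensor on different devices (cpu vs. cuda:0). Below is a minimal sketch of that failure mode, assuming the embedding table sits on the GPU while the token ids are still on the CPU; the tensor names, shapes, and the .to(weight.device) workaround are illustrative only and are not OneTrainer's actual code.

import torch
import torch.nn.functional as F

# Hypothetical stand-ins for a CLIP-sized token embedding and a tokenized prompt.
weight = torch.randn(49408, 768, device="cuda")   # embedding table on the GPU
input_ids = torch.randint(0, 49408, (1, 77))      # token ids left on the CPU

try:
    # Reproduces the reported error: the index and weight tensors live on different devices.
    F.embedding(input_ids, weight)
except RuntimeError as e:
    print(e)  # "Expected all tensors to be on the same device ... cpu and cuda:0"

# One possible guard: move the indices to wherever the weight lives before the lookup.
out = F.embedding(input_ids.to(weight.device), weight)
print(out.shape)  # torch.Size([1, 77, 768])

Whether the CPU-side tensor in the crash is the token ids or the additional-embedding weights themselves (for example after an offloading or caching step) cannot be determined from the log alone; the pip freeze output and the training config would help narrow that down.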

Output of pip freeze

No response

velourlawsuits added the bug (Something isn't working) label on Jun 25, 2024