What happened?
I've been experiencing a recurring crash while training an SDXL finetune. My training settings have been identical for the past two months, and this bug first appeared about a week ago. I have tried updating, redownloading the requirements, and doing a fresh install in a separate directory. Each 'fix' has worked for exactly one training session, but whenever I start a new project I run into the same error again. I had previously opened a bug report for this but closed it because I thought the update had fixed it; I will reopen that as well. To be clear, this is a new error, and the only things I ever changed were the dataset for the new training run and some filename tweaks. Furthermore, the dataset has been adjusted three times since the first error, so if it were the problem, why would training work once after a fresh install or update?
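If it helps with triage: between sessions I've started checking whether anything in the text encoder is left on the CPU before training starts. This is a small sketch (the device_report helper is mine, not part of OneTrainer; model.text_encoder_1 matches the attribute named in the traceback below):

import torch
from collections import Counter

def device_report(module: torch.nn.Module) -> Counter:
    # Count parameters per device; any mix of 'cpu' and 'cuda:0' here
    # would explain the "two devices" RuntimeError in the log below.
    return Counter(str(p.device) for p in module.parameters())

# Hypothetical usage once the model is loaded:
# print(device_report(model.text_encoder_1))
# A healthy result has a single device key, e.g. Counter({'cuda:0': 197}).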
What did you expect would happen?
The training should have completed without error.
Relevant log output
activating venv D:\one_ai\OneTrainer\venv
Using Python "D:\one_ai\OneTrainer\venv\Scripts\python.exe"
Clearing cache directory workspace-cache/cache_1! You can disable this if you want to continue using the same cache.
D:\one_ai\OneTrainer\venv\src\diffusers\src\diffusers\loaders\single_file.py:340: FutureWarning: `original_config_file` is deprecated and will be removed in version 1.0.0. `original_config_file` argument is deprecated and will be removed in future versions.please use the `original_config` argument instead.
deprecate("original_config_file", "1.0.0", deprecation_message)
TensorFlow installation not found - running with reduced feature set.
Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████| 17/17 [00:00<?, ?it/s]
Loading pipeline components...: 14%|███████▍ | 1/7 [00:00<00:01, 4.90it/s]
Some weights of the model checkpoint were not used when initializing CLIPTextModel:
['text_model.embeddings.position_ids']
Loading pipeline components...: 71%|█████████████████████████████████████▏ | 5/7 [00:01<00:00, 3.80it/s]
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.16.2 at http://localhost:6006/ (Press CTRL+C to quit)
Loading pipeline components...: 100%|████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.50it/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.27it/s]
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.27it/s]
D:\one_ai\OneTrainer\venv\src\diffusers\src\diffusers\models\attention_processor.py:1406: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
hidden_states = F.scaled_dot_product_attention(
caching: 100%|█████████████████████████████████████████████████████████████████████| 3632/3632 [09:33<00:00, 6.33it/s]
caching: 100%|█████████████████████████████████████████████████████████████████████| 3632/3632 [01:35<00:00, 37.88it/s]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 20/20 [00:13<00:00, 1.50it/s]
D:\one_ai\OneTrainer\venv\lib\site-packages\torch\autograd\graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ..\aten\src\ATen\native\cudnn\Conv_v8.cpp:919.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
step: 100%|██████████████████████████████████████| 3632/3632 [1:53:33<00:00, 1.88s/it, loss=0.0667, smooth loss=0.131]
sampling: 100%|████████████████████████████████████████████████████████████████████████| 20/20 [00:13<00:00, 1.49it/s]
Saving workspace/run\save\LiveLeak_1e-05_Refined_RealVis2024-06-25_16-12-26-save-3632-1-0 | 0/3632 [00:00<?, ?it/s]
step: 0%|| 0/3632 [00:24<?, ?it/s]
epoch: 25%|█████████████████▊ | 1/4 [2:05:10<6:15:32, 7510.67s/it]
Traceback (most recent call last):
File "D:\one_ai\OneTrainer\modules\ui\TrainUI.py", line 538, in __training_thread_function
trainer.train()
File "D:\one_ai\OneTrainer\modules\trainer\GenericTrainer.py", line 572, in train
model_output_data = self.model_setup.predict(self.model, batch, self.config, train_progress)
File "D:\one_ai\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 276, in predict
text_encoder_output, pooled_text_encoder_2_output = self.__encode_text(
File "D:\one_ai\OneTrainer\modules\modelSetup\BaseStableDiffusionXLSetup.py", line 242, in __encode_text
text_encoder_1_output = model.text_encoder_1(
File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "D:\one_ai\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 807, in forward
return self.text_model(
File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "D:\one_ai\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 699, in forward
hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "D:\one_ai\OneTrainer\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 219, in forward
inputs_embeds = self.token_embedding(input_ids)
File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "D:\one_ai\OneTrainer\modules\module\AdditionalEmbeddingWrapper.py", line 41, in forward
return F.embedding(
File "D:\one_ai\OneTrainer\venv\lib\site-packages\torch\nn\functional.py", line 2264, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Saving models/LiveLeak_1e-05_Refined_RealVis
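One observation from the log: the crash lands on the very first step after the mid-epoch save, which makes me suspect the additional embedding weights (the traceback goes through AdditionalEmbeddingWrapper) are moved to the CPU for saving and not moved back. I can't prove that from the log alone, but the error itself reproduces in isolation. A minimal sketch (illustrative names, not OneTrainer code) of both the failure and a defensive alignment:

import torch
import torch.nn.functional as F

# F.embedding requires the index tensor and the weight matrix to be on
# the same device; mixing them reproduces the RuntimeError above.
weight = torch.randn(49408, 768)  # e.g. an embedding left on the CPU after a save
input_ids = torch.randint(0, 49408, (1, 77), device="cuda:0")

# F.embedding(input_ids, weight)  # RuntimeError: Expected all tensors to be on the same device ...

# Defensive alignment: move the indices to wherever the weight lives, or
# (better, in training code) move the whole module back onto the GPU.
out = F.embedding(input_ids.to(weight.device), weight)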
Output of pip freeze
No response