About the relationship between Whisper vs pretrained UNet SDv1.4 #159

huyduong7101 · 2024-08-07T07:23:13Z

In this work, the authors adopted Whisper-tiny (d_model=384) to extract audio features, while training the UNet from scratch. I guess the reason for training from scratch instead of loading pretrained SDv1.4 is that the pretrained model has cross_attention_dim=768, whereas the feature dimension of Whisper-tiny is 384. Hence, I wonder why you did not use Whisper-small (d_model=768), which has the same dimension as pretrained SDv1.4, so that the strong pretrained SDv1.4 model could be reused.

czk32611 · 2024-08-08T06:36:46Z

We used whisper-tiny to keep the time delay small during real-time inference.
We did not use pretrained SDv1.4 because SDv1.4 is an image-to-noise model, not an image-to-image model. However, someone has tried using pretrained SDv1.4 as initialization, and it actually converged faster.
The dimension of the audio feature is not important, as one can always use a projection network to map it to a different shape.

Hope the above information helps.
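
For readers following along, the projection idea mentioned above can be sketched as follows. This is only an illustration, not the authors' code: the `LinearProjection` class, the random initialization, and the batch and sequence sizes are assumptions, and only the two dimensions (384 and 768) come from this thread. In practice the projection weights would be learned jointly with the UNet.

```python
import numpy as np

WHISPER_TINY_DIM = 384  # d_model of Whisper-tiny, as stated in the thread
CROSS_ATTN_DIM = 768    # cross_attention_dim of pretrained SDv1.4, as stated in the thread


class LinearProjection:
    """Hypothetical single-layer projection: y = x @ W + b."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init for illustration; real weights would be trained.
        self.weight = rng.normal(0.0, in_dim ** -0.5, size=(in_dim, out_dim))
        self.bias = np.zeros(out_dim)

    def __call__(self, x):
        # Maps (..., in_dim) to (..., out_dim); leading axes are untouched.
        return x @ self.weight + self.bias


# Dummy audio features: 2 clips, 50 time steps each (illustrative shapes only).
audio_feats = np.random.default_rng(1).normal(size=(2, 50, WHISPER_TINY_DIM))
proj = LinearProjection(WHISPER_TINY_DIM, CROSS_ATTN_DIM)
projected = proj(audio_feats)
print(projected.shape)  # (2, 50, 768)
```

After such a projection, the audio features have the width expected by a cross-attention layer with cross_attention_dim=768, regardless of which Whisper variant produced them.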

huyduong7101 · 2024-08-09T04:11:54Z

Thank you for your quick response. It is very helpful.
May I ask one more question, related to issue #158? How did you crop the face and feed it into the model: using only face detection, or also using "bbox shift"?

czk32611 · 2024-08-29T06:24:57Z

> Thank you for your quick response. It is very helpful. Can I ask you one more question relating to another issue #158. How did you crop face and feed into model, like using only face detection or using "bbox shift"?

Currently, we use only a face detector and do not perform bbox shift during training.
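
To make the distinction concrete, here is a rough sketch of the two cropping options. It is not taken from the project's code: the helper name `shifted_crop_box` is hypothetical, and the thread does not define what "bbox shift" does, so the sketch assumes it is a vertical pixel offset applied to the upper edge of the detected box. With `bbox_shift=0` the function returns the plain detector box, which corresponds to the training setup described above.

```python
def shifted_crop_box(bbox, bbox_shift=0, frame_height=None):
    """Return a crop box (x1, y1, x2, y2) from a face-detector box.

    ASSUMPTION: bbox_shift is a vertical offset in pixels applied to the
    upper edge only (negative moves it up, positive moves it down).
    """
    x1, y1, x2, y2 = bbox
    y1 = max(0, y1 + bbox_shift)  # clamp to the top of the frame
    if frame_height is not None:
        y2 = min(y2, frame_height)  # clamp to the bottom of the frame
    if y1 >= y2:
        raise ValueError("bbox_shift leaves an empty crop region")
    return (x1, y1, x2, y2)


detected = (100, 80, 300, 320)  # made-up detector output for illustration

# Detection only, no shift.
print(shifted_crop_box(detected))  # (100, 80, 300, 320)

# Upper edge moved 10 pixels down.
print(shifted_crop_box(detected, bbox_shift=10))  # (100, 90, 300, 320)
```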
