About the relationship between Whisper vs pretrained UNet SDv1.4 #159

Open
huyduong7101 opened this issue Aug 7, 2024 · 3 comments

@huyduong7101

In this work, the authors adopted Whisper-tiny (d_model=384) to extract audio features while training the UNet from scratch. I guess the reason for training from scratch instead of loading pretrained SDv1.4 is that the pretrained model has cross_attention_dim=768, while the feature dimension of Whisper-tiny is 384. So I wonder: why not use Whisper-small (d_model=768), which has the same dimension as pretrained SDv1.4? Then we could take advantage of the strong pretrained SDv1.4 model.

@czk32611
Contributor

czk32611 commented Aug 8, 2024

  1. We used whisper-tiny to keep the time delay small during real-time inference.
  2. We did not use pretrained SDv1.4 because SDv1.4 is an image-to-noise model, not an image-to-image model. However, someone tried using pretrained SDv1.4 as initialization, and it actually converged faster.
  3. The dimension of the audio feature is not important, as one can always use a projection network to map it to a different shape.
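
To make point 3 concrete, here is a minimal sketch of such a projection, written in plain Python so it runs without any framework. It is only an illustration and not the code used in this repository: `make_projection` and `project` are hypothetical helper names, the weights are random rather than trained, and the input is dummy data standing in for Whisper-tiny features. The two dimensions (384 and 768) are the ones quoted earlier in this thread.

```python
import random

WHISPER_TINY_DIM = 384  # d_model of Whisper-tiny, as quoted in this thread
CROSS_ATTN_DIM = 768    # cross_attention_dim of SDv1.4, as quoted in this thread


def make_projection(in_dim, out_dim, seed=0):
    """Hypothetical helper: random weights and zero bias for one linear layer."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(features, weight, bias):
    """Apply y = W x + b to every frame in a list of feature frames."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b
         for row, b in zip(weight, bias)]
        for frame in features
    ]


# Dummy input: 4 frames of 384 values standing in for Whisper-tiny output.
audio_features = [[0.01 * (i % 7) for i in range(WHISPER_TINY_DIM)]
                  for _ in range(4)]

weight, bias = make_projection(WHISPER_TINY_DIM, CROSS_ATTN_DIM)
projected = project(audio_features, weight, bias)

print(len(projected), len(projected[0]))  # 4 768
```

In a deep learning framework the same idea would usually be a single trainable linear layer, for example `torch.nn.Linear(384, 768)` in PyTorch, placed between the audio encoder and the UNet cross-attention input.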

Hope the above information helps.

@huyduong7101
Author

Thank you for your quick response. It is very helpful.
Can I ask one more question, related to another issue, #158? How did you crop the face and feed it into the model: using only face detection, or using "bbox shift"?

@czk32611
Contributor

> Thank you for your quick response. It is very helpful. Can I ask one more question, related to another issue, #158? How did you crop the face and feed it into the model: using only face detection, or using "bbox shift"?

Currently, we use only a face detector and did not perform bbox shift during training.
