Unable to train #5

Open
pokameng opened this issue Mar 15, 2023 · 3 comments

@pokameng

@baofff
Hello, I followed the training instructions in the README, but got the following error:
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '2', 'train.py', '--config=configs/celeba64_uvit_small.py']' returned non-zero exit status 1.

My config file:
```python
import ml_collections


def d(**kwargs):
    """Helper of creating a config dict."""
    return ml_collections.ConfigDict(initial_dictionary=kwargs)


def get_config():
    config = ml_collections.ConfigDict()

    config.seed = 1234
    config.pred = 'noise_pred'

    config.train = d(
        n_steps=500000,
        batch_size=128,
        mode='uncond',
        log_interval=10,
        eval_interval=5000,
        save_interval=50000,
    )

    config.optimizer = d(
        name='adamw',
        lr=0.0002,
        weight_decay=0.03,
        betas=(0.99, 0.999),
    )

    config.lr_scheduler = d(
        name='customized',
        warmup_steps=2500
    )

    config.nnet = d(
        name='uvit',
        img_size=32,
        patch_size=2,
        embed_dim=512,
        depth=12,
        num_heads=8,
        mlp_ratio=4,
        qkv_bias=False,
        mlp_time_embed=False,
        num_classes=-1,
    )

    config.dataset = d(
        name='cifar10',
        path='/home/dailongquan/110.014/image_condition_diffusion/UViT/assets/datasets/cifar10',
        random_flip=True,
    )

    config.sample = d(
        sample_steps=1000,
        n_samples=50000,
        mini_batch_size=500,
        algorithm='euler_maruyama_sde',
        path=''
    )

    return config
```
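
A note on the error itself: `subprocess.CalledProcessError` only reports that the `torchrun` child process exited with status 1. The actual Python traceback is printed to the child's stderr before that line. Below is a minimal sketch of how a failing command's stderr could be captured and trimmed to its last lines. The helper name `run_and_capture` is hypothetical and not part of the repo, and the demo uses a deliberately failing child process rather than `torchrun`.

```python
import subprocess
import sys


def run_and_capture(cmd, tail_lines=40):
    """Run cmd and return (exit_code, last lines of its stderr).

    Hypothetical helper for illustration; not part of the U-ViT repo.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    stderr_tail = "\n".join(result.stderr.splitlines()[-tail_lines:])
    return result.returncode, stderr_tail


# Demo with a child process that fails on purpose, standing in for torchrun.
demo_cmd = [sys.executable, "-c", "raise RuntimeError('boom')"]
code, tail = run_and_capture(demo_cmd)
print(f"exit status: {code}")
print(tail)
```

To apply this to the failure reported here, one would pass the command from the error message (`['torchrun', '--nproc_per_node', '2', 'train.py', '--config=configs/celeba64_uvit_small.py']`) from the repository root. Posting the resulting traceback would make the issue much easier to diagnose.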

@baofff
Owner

baofff commented Mar 22, 2023

We have updated the dependencies. Perhaps you can reinstall your environment and try again.

@F9393

F9393 commented Nov 27, 2023

I am facing the same issue. I have the same dependencies as described in the repo and decreased the batch size to 8, but it still fails. Does the code actually fit on 2 A100 GPUs?
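
On the memory question, a back-of-envelope estimate can help separate an out-of-memory failure from other causes. The sketch below counts only the attention and MLP weights of a plain transformer, using the `embed_dim`, `depth`, and `mlp_ratio` from the config posted above. It assumes fp32 training with AdamW (weights, gradients, and two moment buffers, so 16 bytes per parameter). It ignores embeddings, biases, norms, any U-ViT-specific layers, and all activation memory, so it is a rough lower bound, not a measurement. The function names are hypothetical.

```python
def approx_transformer_params(embed_dim, depth, mlp_ratio):
    """Approximate weight count of `depth` standard transformer blocks."""
    attn = 4 * embed_dim * embed_dim             # q, k, v and output projections
    mlp = 2 * mlp_ratio * embed_dim * embed_dim  # two linear layers of the MLP
    return depth * (attn + mlp)


def approx_training_bytes(n_params, bytes_per_param=16):
    """Assumes 4 bytes each for fp32 weights, gradients, and AdamW's two moments."""
    return n_params * bytes_per_param


# Values taken from config.nnet in the config posted in this issue.
n_params = approx_transformer_params(embed_dim=512, depth=12, mlp_ratio=4)
n_bytes = approx_training_bytes(n_params)
print(f"~{n_params / 1e6:.1f}M parameters")
print(f"~{n_bytes / 2**20:.0f} MiB for weights, gradients, and optimizer state")
```

Under these assumptions the parameter-related memory is well below the 40 GB or 80 GB of an A100, so parameters alone seem unlikely to be the cause. Activation memory can still dominate, and only the full traceback from the failing process can confirm what actually went wrong.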

@yangluojie

I am facing the same issue, and I have the same dependencies as described in the repo. May I ask how to solve this problem?
