Error in training the tutorial NuQE yaml config when enable GPU #5

Open
timxzz opened this issue May 24, 2020 · 1 comment

timxzz commented May 24, 2020

Hi,

I was trying to follow the tutorial in the notebook. When I change the YAML config from gpu-id: -1 to gpu-id: 0, which should enable GPU training, an error occurs after the first epoch, during the checkpoint prediction step. The log output and the error are below:

2020-05-24 13:00:57.181 [root setup:380] This is run ID: c5854f3d72844dd8b842c49c4a29f9fc
2020-05-24 13:00:57.181 [root setup:383] Inside experiment ID: 0 (None)
2020-05-24 13:00:57.182 [root setup:386] Local output directory is: runs/nuqe
2020-05-24 13:00:57.182 [root setup:389] Logging execution to MLflow at: None
2020-05-24 13:00:57.186 [root setup:395] Using GPU: 0
2020-05-24 13:00:57.186 [root setup:400] Artifacts location: None
2020-05-24 13:00:57.193 [kiwi.lib.train run:154] Training the NuQE model
2020-05-24 13:00:59.819 [kiwi.lib.train run:187] NuQE(
  (_loss): CrossEntropyLoss()
  (source_emb): Embedding(6437, 50, padding_idx=1)
  (target_emb): Embedding(7493, 50, padding_idx=1)
  (embeddings_dropout): Dropout(p=0.5, inplace=False)
  (linear_1): Linear(in_features=300, out_features=400, bias=True)
  (linear_2): Linear(in_features=400, out_features=400, bias=True)
  (linear_3): Linear(in_features=400, out_features=200, bias=True)
  (linear_4): Linear(in_features=200, out_features=200, bias=True)
  (linear_5): Linear(in_features=400, out_features=100, bias=True)
  (linear_6): Linear(in_features=100, out_features=50, bias=True)
  (linear_out): Linear(in_features=50, out_features=2, bias=True)
  (gru_1): GRU(400, 200, batch_first=True, bidirectional=True)
  (gru_2): GRU(200, 200, batch_first=True, bidirectional=True)
  (dropout_in): Dropout(p=0.0, inplace=False)
  (dropout_out): Dropout(p=0.0, inplace=False)
)
2020-05-24 13:00:59.819 [kiwi.lib.train run:188] 2347752 parameters
2020-05-24 13:00:59.819 [kiwi.trainers.trainer run:75] Epoch 1 of 3
2020-05-24 13:01:13.122 [kiwi.metrics.stats log:60] tags_F1_MULT: 0.0275, tags_F1_OK: 0.9294, tags_F1_BAD: 0.0296, tags_CORRECT: 0.8683, loss_loss: 892.0779
2020-05-24 13:01:26.385 [kiwi.metrics.stats log:60] tags_F1_MULT: 0.1496, tags_F1_OK: 0.9225, tags_F1_BAD: 0.1622, tags_CORRECT: 0.8582, loss_loss: 835.9351
Batches: 100%|██████████████████████████| 211/211 [00:27<00:00,  7.58 batches/s]
2020-05-24 13:01:27.717 [kiwi.metrics.stats log:60] tags_F1_MULT: 0.2363, tags_F1_OK: 0.8934, tags_F1_BAD: 0.2645, tags_CORRECT: 0.8139, loss_loss: 786.3296
2020-05-24 13:01:29.716 [kiwi.metrics.stats log:60] EVAL_tags_F1_MULT: 0.2828, EVAL_tags_F1_OK: 0.9003, EVAL_tags_F1_BAD: 0.3141, EVAL_tags_CORRECT: 0.8259, EVAL_loss_loss: 789.3109
2020-05-24 13:01:29.717 [root save:183] Saving training state to runs/nuqe/epoch_1
2020-05-24 13:01:29.829 [root save_latest:241] Saving training state to runs/nuqe/temp_latest_epoch
2020-05-24 13:01:29.830 [kiwi.trainers.callbacks _remove_snapshot:178] Removing previous snapshot: runs/nuqe/latest_epoch
2020-05-24 13:01:29.830 [kiwi.trainers.callbacks save_latest:252] Moving runs/nuqe/temp_latest_epoch to runs/nuqe/latest_epoch
Traceback (most recent call last):
  File "/opt/conda/bin/kiwi", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/kiwi/__main__.py", line 22, in main
    return kiwi.cli.main.cli()
  File "/opt/conda/lib/python3.7/site-packages/kiwi/cli/main.py", line 71, in cli
    train.main(extra_args)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/cli/pipelines/train.py", line 142, in main
    train.train_from_options(options)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/lib/train.py", line 123, in train_from_options
    trainer = run(ModelClass, output_dir, pipeline_options, model_options)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/lib/train.py", line 204, in run
    trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/trainers/trainer.py", line 79, in run
    self.checkpointer(self, valid_iterator, epoch=epoch)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/trainers/callbacks.py", line 115, in __call__
    predictions = trainer.predict(valid_iterator)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/trainers/trainer.py", line 167, in predict
    model_pred = self.model.predict(batch)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/models/model.py", line 137, in predict
    mask = self.get_mask(batch, input_key)
  File "/opt/conda/lib/python3.7/site-packages/kiwi/models/model.py", line 205, in get_mask
    input_tensor != pad_id, dtype=torch.uint8
RuntimeError: expected device cuda:0 but got device cpu

Thanks!
Tim

timxzz commented May 24, 2020

I had a look and found that the problem exists in openkiwi 0.1.2 and has been fixed in the latest openkiwi release, 0.1.3. The simple fix for this tutorial is to change the openkiwi version in the requirements.txt file from 0.1.2 to 0.1.3, which has been done in pull request #6.
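
For anyone who wants to apply the same change locally, here is a minimal sketch of the version bump in Python. It assumes the requirement is written as an exact pin of the form `openkiwi==0.1.2` (I have not checked how the tutorial's requirements.txt actually spells it), and `bump_pin` is a hypothetical helper name, not part of openkiwi.

```python
import re


def bump_pin(requirements_text, package="openkiwi", old="0.1.2", new="0.1.3"):
    """Return requirements text with `package==old` rewritten to `package==new`.

    Assumes an exact `==` pin on its own line (hypothetical format).
    Lines pinning any other version are left untouched.
    """
    pattern = re.compile(
        r"^([ \t]*" + re.escape(package) + r"[ \t]*==[ \t]*)"
        + re.escape(old)
        + r"([ \t]*(?:#.*)?)$",
        flags=re.IGNORECASE | re.MULTILINE,
    )
    # Keep the original spacing and any trailing comment; swap only the version.
    return pattern.sub(lambda m: m.group(1) + new + m.group(2), requirements_text)


if __name__ == "__main__":
    sample = "numpy\nopenkiwi==0.1.2\ntorch\n"
    print(bump_pin(sample))
```

After changing the pin, the environment still has to be reinstalled from requirements.txt for the new version to take effect.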
