
Runtime error in graph search policy network during training #1

Open
nitishajain opened this issue May 26, 2021 · 7 comments

Comments

@nitishajain

Hello, I am trying to replicate the steps to train and test the model. After performing the data processing and pretraining of the embeddings, I keep encountering the following runtime error when training the model, for any dataset:

Epoch 0
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 822, in <module>
    run_experiment(args)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 803, in run_experiment
    train(lf)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 267, in train
    lf.run_train(train_data, dev_data)
  File "/home/user/DacKGR/DacKGR-master/src/learn_framework.py", line 96, in run_train
    loss = self.loss(mini_batch)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/rs_pg.py", line 115, in loss
    output = self.rollout(e1, r, e2, num_steps=self.num_rollout_steps, kg_pred=kg_pred)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/rs_pg.py", line 282, in rollout
    e, obs, kg, kg_pred=kg_pred, fn_kg=self.fn_kg, use_action_space_bucketing=self.use_action_space_bucketing, use_kg_pred=self.use_state_prediction)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 138, in transit
    db_action_spaces, db_references = self.get_action_space_in_buckets(e, obs, kg, relation_att=relation_att, inference=inference)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 289, in get_action_space_in_buckets
    e_space_b, r_space_b, action_mask_b = self.get_dynamic_action_space(e_space_b, r_space_b, action_mask_b, e_b, relation_att[l_batch_refs])
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 208, in get_dynamic_action_space
    relation_idx = torch.multinomial(relation_att, additional_relation_size)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
free(): invalid pointer
./experiment-rs.sh: line 87: 560302 Aborted    

Any pointers to resolve this issue would be most helpful.
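
For reference, here is a minimal sketch of a guard that could be placed in get_dynamic_action_space just before the failing call, assuming relation_att is meant to hold non-negative sampling weights (the names follow the traceback above; this is not the repository's code):

import torch

def safe_relation_sample(relation_att, additional_relation_size):
    # Replace NaN/inf entries with zero and clamp negatives so every row is a
    # valid (unnormalized) weight vector for torch.multinomial.
    att = torch.where(torch.isfinite(relation_att), relation_att,
                      torch.zeros_like(relation_att))
    att = att.clamp(min=0)
    # If an entire row ends up all-zero, fall back to a uniform distribution.
    row_sum = att.sum(dim=-1, keepdim=True)
    att = torch.where(row_sum > 0, att, torch.ones_like(att))
    return torch.multinomial(att, additional_relation_size)

This only masks the symptom, though; the NaN/inf values suggest that something upstream (for example the pretrained embeddings or an exploding loss) is producing invalid attention scores.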

@davidlvxin
Member

To better identify the problem, could you tell me what dataset you are running on?

@nitishajain
Author

I have encountered the exact same issue with both the WD-singer and FB-15k-237 subsets, which makes me think it is not a dataset-specific issue.

@davidlvxin
Member

Could you tell me your PyTorch version? I re-downloaded and ran the code without encountering any errors. Using FB15K-237-20% as an example, make sure you run the following commands in order:

unzip data.zip
./experiment.sh configs/fb15k-237-20.sh --process_data <gpu-id>
./experiment-emb.sh configs/fb15k-237-20-conve.sh --train <gpu-id>
./experiment-rs.sh configs/fb15k-237-20-rs.sh --train <gpu-id>

@nitishajain
Author

The PyTorch version is 1.7.0.
I have tried creating a new environment and running the commands again in the correct order, but I am still getting the same error after training for 3 epochs.
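
In case it helps narrow this down, one generic way to find where the invalid values first appear (standard PyTorch tooling, nothing specific to this repository) is to enable anomaly detection and assert on the tensor right before the sampling call in pn.py:

import torch

# Flags the autograd op that first produces NaN/inf (noticeably slows training).
torch.autograd.set_detect_anomaly(True)

# Placed immediately before the torch.multinomial call in get_dynamic_action_space:
assert torch.isfinite(relation_att).all(), 'relation_att contains NaN/inf'
assert (relation_att >= 0).all(), 'relation_att has negative entries'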

@davidlvxin
Member

I am sorry, but I have run this code many times and cannot reproduce this error. What is your CUDA version?

@nitishajain
Author

The CUDA version is 11.0.
Thank you for your efforts. Could you share your versions as well? I can try to reproduce the setup in the same environment.
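
For completeness, the exact versions can be read from PyTorch itself (standard attributes), which makes the two environments easy to compare:

import torch

print(torch.__version__)               # PyTorch build, e.g. 1.7.0
print(torch.version.cuda)              # CUDA toolkit the wheel was built against
print(torch.backends.cudnn.version())  # cuDNN version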

@davidlvxin
Member

PyTorch: 1.8.1
CUDA: 11.1

It seems that our environments are very similar.
