
Runtime error in graph search policy network during training #1

Open
nitishajain opened this issue May 26, 2021 · 7 comments

Comments

@nitishajain

Hello, I am trying to replicate the steps to train and test the model. After performing the data processing and pretraining of the embeddings, I keep encountering the following runtime error when training the model, for any dataset:

Epoch 0
Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 822, in <module>
    run_experiment(args)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 803, in run_experiment
    train(lf)
  File "/home/user/DacKGR/DacKGR-master/src/experiments.py", line 267, in train
    lf.run_train(train_data, dev_data)
  File "/home/user/DacKGR/DacKGR-master/src/learn_framework.py", line 96, in run_train
    loss = self.loss(mini_batch)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/rs_pg.py", line 115, in loss
    output = self.rollout(e1, r, e2, num_steps=self.num_rollout_steps, kg_pred=kg_pred)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/rs_pg.py", line 282, in rollout
    e, obs, kg, kg_pred=kg_pred, fn_kg=self.fn_kg, use_action_space_bucketing=self.use_action_space_bucketing, use_kg_pred=self.use_state_prediction)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 138, in transit
    db_action_spaces, db_references = self.get_action_space_in_buckets(e, obs, kg, relation_att=relation_att, inference=inference)
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 289, in get_action_space_in_buckets
    e_space_b, r_space_b, action_mask_b = self.get_dynamic_action_space(e_space_b, r_space_b, action_mask_b, e_b, relation_att[l_batch_refs])
  File "/home/user/DacKGR/DacKGR-master/src/rl/graph_search/pn.py", line 208, in get_dynamic_action_space
    relation_idx = torch.multinomial(relation_att, additional_relation_size)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
free(): invalid pointer
./experiment-rs.sh: line 87: 560302 Aborted    

Any pointers to resolve this issue would be most helpful.
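
For reference, here is a minimal sketch of a guard that could be placed in get_dynamic_action_space just before the failing call, assuming relation_att is meant to hold non-negative sampling weights (the names follow the traceback above; this is not the repository's code):

import torch

def safe_relation_sample(relation_att, additional_relation_size):
    # Replace NaN/inf entries with zero and clamp negatives so every row is a
    # valid (unnormalized) weight vector for torch.multinomial.
    att = torch.where(torch.isfinite(relation_att), relation_att,
                      torch.zeros_like(relation_att))
    att = att.clamp(min=0)
    # If an entire row ends up all-zero, fall back to a uniform distribution.
    row_sum = att.sum(dim=-1, keepdim=True)
    att = torch.where(row_sum > 0, att, torch.ones_like(att))
    return torch.multinomial(att, additional_relation_size)

This only masks the symptom, though; the NaN/inf values suggest that something upstream (for example the pretrained embeddings or an exploding loss) is producing invalid attention scores.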

@davidlvxin
Member

To better identify the problem, could you tell me what dataset you are running on?

@nitishajain
Author

I have encountered the exact same issue with both the WD-singer and FB-15k-237 subsets, which makes me think it is not a dataset-specific issue.

@davidlvxin
Member

Could you tell me your PyTorch version? I re-downloaded and ran the code without encountering any errors. Using FB15K-237-20% as an example, make sure you run the following commands in order:

unzip data.zip
./experiment.sh configs/fb15k-237-20.sh --process_data <gpu-id>
./experiment-emb.sh configs/fb15k-237-20-conve.sh --train <gpu-id>
./experiment-rs.sh configs/fb15k-237-20-rs.sh --train <gpu-id>

@nitishajain
Author

The PyTorch version is 1.7.0.
I have tried creating a new environment and running the commands again in the correct order, but I am still getting the same error after training for 3 epochs.
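
In case it helps narrow this down, one generic way to find where the invalid values first appear (standard PyTorch tooling, nothing specific to this repository) is to enable anomaly detection and assert on the tensor right before the sampling call in pn.py:

import torch

# Flags the autograd op that first produces NaN/inf (noticeably slows training).
torch.autograd.set_detect_anomaly(True)

# Placed immediately before the torch.multinomial call in get_dynamic_action_space:
assert torch.isfinite(relation_att).all(), 'relation_att contains NaN/inf'
assert (relation_att >= 0).all(), 'relation_att has negative entries'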

@davidlvxin
Member

I am sorry, but I have run this code many times and cannot reproduce this error. What is your CUDA version?

@nitishajain
Author

The CUDA version is 11.0.
Thank you for your efforts. Could you share your versions as well? I can try to reproduce the setup in the same environment.
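
For completeness, the exact versions can be read from PyTorch itself (standard attributes), which makes the two environments easy to compare:

import torch

print(torch.__version__)               # PyTorch build, e.g. 1.7.0
print(torch.version.cuda)              # CUDA toolkit the wheel was built against
print(torch.backends.cudnn.version())  # cuDNN version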

@davidlvxin
Member

PyTorch: 1.8.1
CUDA: 11.1

It seems that our environments are very similar.
