
Training

Lando Löper edited this page Aug 14, 2020 · 4 revisions

To train a new model from scratch, first download and preprocess the training data, then run the training script.

Dataset

Please follow these steps to download and preprocess the py150 dataset.

  1. Download and unarchive the parsed AST paths
wget http://files.srl.inf.ethz.ch/data/py150.tar.gz
tar -xzvf py150.tar.gz
  2. Clone the code2seq repository
git clone https://github.com/Kolkir/code2seq.git
cd code2seq/Python150kExtractor
  3. Extract the data
python extract.py --data_dir=<PATH_TO_PY150_FOLDER> --output_dir=<PATH_TO_EXTRACTED_FOLDER> --seed=239
  4. Preprocess the data for training
sh preprocess.sh <PATH_TO_EXTRACTED_FOLDER>
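
The dataset steps above can also be scripted. The following is a minimal sketch, not part of either repository: `dataset_commands` and its parameters are hypothetical names introduced here, and the function only builds the command lists so they can be inspected before anything is executed. It assumes that both folders are given as absolute paths, because the last two commands run from inside `code2seq/Python150kExtractor`.

```python
from pathlib import Path

PY150_URL = "http://files.srl.inf.ethz.ch/data/py150.tar.gz"
CODE2SEQ_URL = "https://github.com/Kolkir/code2seq.git"


def dataset_commands(py150_dir, extracted_dir, seed=239):
    """Return the dataset steps as (command, working directory) pairs.

    Hypothetical helper: it mirrors the commands listed above and does
    not run them. Both directories are assumed to be absolute paths.
    """
    py150_dir = Path(py150_dir)
    extracted_dir = Path(extracted_dir)
    # extract.py and preprocess.sh live in this folder of the cloned repository.
    extractor_dir = Path("code2seq") / "Python150kExtractor"
    return [
        (["wget", PY150_URL], Path(".")),
        (["tar", "-xzvf", "py150.tar.gz"], Path(".")),
        (["git", "clone", CODE2SEQ_URL], Path(".")),
        (
            [
                "python",
                "extract.py",
                f"--data_dir={py150_dir}",
                f"--output_dir={extracted_dir}",
                f"--seed={seed}",
            ],
            extractor_dir,
        ),
        (["sh", "preprocess.sh", str(extracted_dir)], extractor_dir),
    ]


if __name__ == "__main__":
    # Dry run: print every step instead of executing it.
    for command, cwd in dataset_commands("/data/py150", "/data/py150_extracted"):
        print(f"(in {cwd}) {' '.join(command)}")
```

To execute the steps instead of printing them, each pair could be passed to `subprocess.run(command, cwd=cwd, check=True)`, which stops at the first failing step.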

Model

Once you have downloaded and preprocessed the dataset, go back to this repository.

  1. Build and run the docker image in a container
docker build -t code-embeddings .
docker run --gpus all --rm -it -v <PATH_TO_EXTRACTED_FOLDER>:/tmp/py150 -p 6006:6006 code-embeddings /bin/bash
  2. Run the training script
python ./src/train.py \
--dict <PATH_TO_EXTRACTED_DICT> \
--train <PATH_TO_EXTRACTED_TRAIN> \
--test <PATH_TO_EXTRACTED_TEST>
  3. (Optional) Run tensorboard for better analysis of the training run
tensorboard --logdir ./logs
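
A mistyped path is an easy way to lose time on the training step, so it can help to check the three inputs before launching. The following is a minimal sketch under that assumption: `train_command` is a hypothetical helper, not part of this repository, and it uses only the `--dict`, `--train` and `--test` flags shown above.

```python
from pathlib import Path


def train_command(dict_path, train_path, test_path):
    """Build the argument list for ./src/train.py.

    Hypothetical helper: raises FileNotFoundError if any input path is
    missing, so the mistake surfaces before training starts.
    """
    inputs = {
        "--dict": Path(dict_path),
        "--train": Path(train_path),
        "--test": Path(test_path),
    }
    missing = [str(path) for path in inputs.values() if not path.exists()]
    if missing:
        raise FileNotFoundError("missing training inputs: " + ", ".join(missing))

    command = ["python", "./src/train.py"]
    for flag, path in inputs.items():
        command += [flag, str(path)]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root. Inside the container the extracted folder is mounted at /tmp/py150, so the three paths would normally point below that directory.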

Next: Evaluation
