Training


Lando Löper edited this page Aug 14, 2020 · 4 revisions

To train a new model from scratch, first download and preprocess the training data, then run the training script.

Dataset

Please follow these steps to download and preprocess the py150 dataset.

Download and unarchive the parsed AST paths

wget http://files.srl.inf.ethz.ch/data/py150.tar.gz
tar -xzvf py150.tar.gz

Clone the code2seq repository

git clone https://github.com/Kolkir/code2seq.git
cd code2seq/Python150kExtractor

Extract the data

python extract.py --data_dir=<PATH_TO_PY150_FOLDER> --output_dir=<PATH_TO_EXTRACTED_FOLDER> --seed=239

Preprocess the data for training

sh preprocess.sh <PATH_TO_EXTRACTED_FOLDER>
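
The four dataset steps above can be chained into one helper. The following is only a sketch: `PY150_DIR`, `EXTRACTED_DIR`, `DRY_RUN`, `run` and `prepare_py150` are hypothetical names introduced here, and it assumes the archive unpacks its files directly into `PY150_DIR` (adjust `--data_dir` if it creates a subfolder). By default it only prints the commands so they can be reviewed first.

```shell
# Hypothetical locations; replace them with your own folders.
PY150_DIR="${PY150_DIR:-$PWD/py150}"
EXTRACTED_DIR="${EXTRACTED_DIR:-$PWD/py150_extracted}"

# Print a command; execute it only when DRY_RUN=0.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Download, unpack, extract and preprocess py150, stopping at the first failure.
prepare_py150() {
  run mkdir -p "$PY150_DIR" "$EXTRACTED_DIR" || return 1
  run wget -P "$PY150_DIR" http://files.srl.inf.ethz.ch/data/py150.tar.gz || return 1
  # Assumption: the archive has no top-level folder of its own.
  run tar -xzvf "$PY150_DIR/py150.tar.gz" -C "$PY150_DIR" || return 1
  run git clone https://github.com/Kolkir/code2seq.git || return 1
  run cd code2seq/Python150kExtractor || return 1
  run python extract.py --data_dir="$PY150_DIR" --output_dir="$EXTRACTED_DIR" --seed=239 || return 1
  run sh preprocess.sh "$EXTRACTED_DIR" || return 1
}

# Usage: review the plan with `prepare_py150`, then run it for real in a
# subshell so the `cd` does not leak into your session:
#   ( DRY_RUN=0; prepare_py150 )
```

Keeping the real run inside a subshell means your working directory is unchanged afterwards, and the `|| return 1` guards stop the sequence as soon as one step fails.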

Model

Once you have downloaded and preprocessed the dataset, go back to this repository.

Build the Docker image and run it in a container

docker build -t code-embeddings .
docker run --gpus all --rm -it -v <PATH_TO_EXTRACTED_FOLDER>:/tmp/py150 -p 6006:6006 code-embeddings /bin/bash
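
Depending on the Docker version, a relative host path passed to `-v` may be rejected or treated as a named volume instead of a bind mount, so it is safer to pass an absolute path. A minimal sketch of that, using the hypothetical helpers `abs_dir` and `docker_run_cmd` (not part of this repository), which only print the command from above with the path resolved:

```shell
# Resolve a directory to an absolute path using only POSIX tools.
abs_dir() {
  ( cd "$1" 2>/dev/null && pwd )
}

# Print the docker run command for a given extracted-data folder.
# Fails if the folder does not exist.
docker_run_cmd() {
  host_dir="$(abs_dir "$1")" || {
    echo "not a directory: $1" >&2
    return 1
  }
  echo "docker run --gpus all --rm -it -v $host_dir:/tmp/py150 -p 6006:6006 code-embeddings /bin/bash"
}

# Usage: docker_run_cmd ./py150_extracted
```

The printed command is meant to be reviewed and pasted; if the path contains spaces, quote the `-v` argument by hand. Inside the container the extracted data is then available under /tmp/py150.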

Run the training script

python ./src/train.py \
--dict <PATH_TO_EXTRACTED_DICT> \
--train <PATH_TO_EXTRACTED_TRAIN> \
--test <PATH_TO_EXTRACTED_TEST>
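
Because a wrong path may only surface after the script has started, it can help to check the three input files first. The sketch below assumes nothing about how the files are named; `check_inputs`, `DICT`, `TRAIN` and `TEST` are hypothetical names used for illustration only.

```shell
# Return non-zero if any of the given files is missing or empty.
check_inputs() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (inside the container, where the data is mounted at /tmp/py150):
#   check_inputs "$DICT" "$TRAIN" "$TEST" &&
#     python ./src/train.py --dict "$DICT" --train "$TRAIN" --test "$TEST"
```

The check reports every missing file in one pass instead of stopping at the first one.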

(Optional) Run TensorBoard to monitor the training run

tensorboard --logdir ./logs
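
TensorBoard's flag for the log directory is `--logdir`. When it runs inside the container, recent TensorBoard versions bind to localhost by default, so the published port 6006 is likely unreachable from the host unless `--bind_all` is passed. A small sketch under those assumptions; `start_tensorboard` is a hypothetical wrapper, and `./logs` is taken from the command above:

```shell
# Start TensorBoard on all interfaces after checking the log directory exists.
start_tensorboard() {
  logdir="${1:-./logs}"
  port="${2:-6006}"
  if [ ! -d "$logdir" ]; then
    echo "log directory not found: $logdir" >&2
    return 1
  fi
  tensorboard --logdir "$logdir" --port "$port" --bind_all
}

# Usage: start_tensorboard ./logs 6006
# then open http://localhost:6006 on the host.
```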

Next: Evaluation
