This README contains detailed information and download instructions for the OpenViDial datasets introduced in:
- OpenViDial 1.0: OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
- OpenViDial 2.0: OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts
***** New March 12th, 2021: New CNN/RCNN features on the test/valid datasets *****
We fixed a bug in the CNN/RCNN features on the valid/test datasets and re-ran the experiments on the new data. The evaluation metrics have also been updated.
Detailed statistics for OpenViDial 1.0:

Attribute | Value |
---|---|
Number of turns | 1.1M |
Number of images | 1.1M |
Vocab size before BPE | 70K |
Vocab size after BPE | 30K |
Average length of each episode | 14 |
Average length of each turn | 7.6 |
The main folder `origin_dir` contains the training/valid/test sets, each of which is made up of the following files:
```
├── origin_dir
│   ├── train.dialogue.jsonl  // each line is an episode of dialogue, i.e., a list of utterance IDs
│   ├── train.origin.txt      // each line is a dialogue text utterance, whose ID is its line number (starting from 0)
│   ├── train_images          // images (visual contexts) in which the utterances take place; the ID is the image filename (0, 1, 2, etc.)
│   │   ├── 0.jpg
│   │   ├── 1.jpg
│   │   └── ...
│   ├── valid.* (i.e., valid.dialogue.jsonl, valid.origin.txt, valid_images)
│   └── test.* (i.e., test.dialogue.jsonl, test.origin.txt, test_images)
```
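As a concrete illustration of the layout above, the snippet below reconstructs the text of one episode by mapping the IDs in `*.dialogue.jsonl` to line numbers in `*.origin.txt` (the file formats are as described above; paths and the function name are illustrative):

```python
import json

def load_episode(dialogue_path, origin_path, episode_idx):
    """Return the utterances of one episode.

    Each line of *.dialogue.jsonl is a JSON list of utterance IDs;
    an utterance's ID is its 0-based line number in *.origin.txt.
    """
    with open(origin_path, encoding="utf-8") as f:
        utterances = [line.rstrip("\n") for line in f]
    with open(dialogue_path, encoding="utf-8") as f:
        episodes = [json.loads(line) for line in f]
    return [utterances[i] for i in episodes[episode_idx]]
```

The corresponding image for utterance ID `i` would be `train_images/i.jpg`.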
Data download:
- Download `[train|valid|test].origin.txt` and `[train|valid|test].dialogue.jsonl` here.
- Download `test_images` (~20G) here.
- Download `valid_images` (~20G) here.
- Download `train_images`: since `train_images` is too large (~170G), we split it into 12 zip files. Download the separate files `zip_train` here, then download and run `cat.sh` here to merge all files into the same directory.
- Move all files to `origin_dir`.
To save the time of extracting CNN and Faster R-CNN features, we provide pre-computed CNN features and Faster R-CNN features. You only need to download them following the steps below and reconstruct the directory as shown here.
The compressed file of preprocessed ResNet50 features (`feature_files.tar.gz`, ~3.7G) can be downloaded from here. You can extract the preprocessed ResNet50 features (`*.features.mmap`) with the command `tar zxvf feature_files.tar.gz`.
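Once extracted, the `*.features.mmap` files can be read lazily with `numpy.memmap`. A minimal sketch follows; note that the dtype and per-image feature dimension are assumptions here and should be checked against the repository's preprocessing code:

```python
import numpy as np

# Assumed layout: one float32 feature vector per image, stored row-major.
# FEATURE_DIM is a placeholder -- verify it against the preprocessing config.
FEATURE_DIM = 1000

def load_features(mmap_path, num_images):
    """Memory-map a *.features.mmap file without loading it all into RAM."""
    return np.memmap(mmap_path, dtype=np.float32, mode="r",
                     shape=(num_images, FEATURE_DIM))
```

Memory-mapping keeps host RAM usage low even for the multi-gigabyte feature files, since pages are only loaded on access.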
The compressed file of preprocessed Faster R-CNN object features (`object_files.tar.gz`, ~50G) can be downloaded from here. You can extract the preprocessed Faster R-CNN object features (`*objects.mmap`, `*objects_mask.mmap`) with the command `tar zxvf object_files.tar.gz`.
Each file has an MD5 hash, computed with `md5sum fileName`. You can find the expected hash values here; we suggest verifying each file's hash before any further operations.
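Hash verification can be scripted. Here is a small sketch using Python's standard `hashlib` (the expected digests themselves must still be taken from the link above):

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Compare the returned digest to the published one before extracting, e.g. `md5_of("feature_files.tar.gz") == expected_digest`.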
Detailed statistics for OpenViDial 2.0:

Attribute | Value |
---|---|
Number of turns | 5.6M |
Number of images | 5.6M |
Vocab size before BPE | 278K |
Vocab size after BPE | 30K |
Average length of each episode | 48 |
Average length of each turn | 8.3 |
The main folder `origin_dir` contains the training/valid/test sets, each of which is made up of the following files:
```
├── origin_dir
│   ├── train.dialogue.jsonl  // each line is an episode of dialogue, i.e., a list of utterance IDs
│   ├── train.origin.txt      // each line is a dialogue text utterance, whose ID is its line number (starting from 0)
│   ├── train_images          // images (visual contexts) in which the utterances take place; the ID is the image filename (0, 1, 2, etc.)
│   │   ├── 0.jpg
│   │   ├── 1.jpg
│   │   └── ...
│   ├── valid.* (i.e., valid.dialogue.jsonl, valid.origin.txt, valid_images)
│   └── test.* (i.e., test.dialogue.jsonl, test.origin.txt, test_images)
```
Data download:
- Download `[train|valid|test].origin.txt` and `[train|valid|test].dialogue.jsonl` here.
- Download `test_images` (~123G) here.
- Download `valid_images` (~123G) here.
- Download `train_images`: since `train_images` is too large (~1.2T), we split it into 7 zip files. Download the separate directory `train` here, then run `cat * > train_images.zip && unzip -d ./train_images train_images.zip` to generate all images of the training set.
- Move all files to `origin_dir`.
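After the download and move steps above, a quick sanity check that `origin_dir` has the expected layout can save a failed preprocessing run later (file names are the ones from the directory tree above):

```python
import os

# Entries expected at the top level of origin_dir, per the tree above.
EXPECTED = [
    "train.dialogue.jsonl", "train.origin.txt", "train_images",
    "valid.dialogue.jsonl", "valid.origin.txt", "valid_images",
    "test.dialogue.jsonl", "test.origin.txt", "test_images",
]

def check_origin_dir(origin_dir):
    """Return the list of expected entries missing from origin_dir."""
    present = set(os.listdir(origin_dir))
    return [name for name in EXPECTED if name not in present]
```

An empty return value means all nine expected files/directories are in place.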
To save the time of extracting CNN and Faster R-CNN features, we provide pre-computed CNN features and Faster R-CNN features. You only need to download them following the steps below and reconstruct the directory as shown here.
The mmap files of preprocessed ResNet50 features for the train/valid/test sets (`*.features.mmap`, ~17G/~2G/~2G respectively) can be downloaded from here.
The compressed file of preprocessed Faster R-CNN object features (`object_files.tar.gz`, ~49G) can be downloaded from here. You can extract the preprocessed Faster R-CNN object features (`*objects.mmap`, `*objects_mask.mmap`) with the command `tar zxvf object_files.tar.gz`.
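The object features come as a pair of mmap files per split. A hedged sketch of reading them together follows; the dtype, per-image object count, and feature dimension below are placeholders that must be verified against the repository's preprocessing code:

```python
import numpy as np

# Placeholder layout assumptions -- verify against the repo's preprocessing.
MAX_OBJECTS = 20    # object slots kept per image (assumed)
OBJECT_DIM = 2048   # Faster R-CNN region feature size (assumed)

def load_objects(objects_path, mask_path, num_images):
    """Memory-map paired *objects.mmap / *objects_mask.mmap files.

    The mask marks which of the MAX_OBJECTS slots per image hold real
    detections; padding slots are False.
    """
    objects = np.memmap(objects_path, dtype=np.float32, mode="r",
                        shape=(num_images, MAX_OBJECTS, OBJECT_DIM))
    mask = np.memmap(mask_path, dtype=np.bool_, mode="r",
                     shape=(num_images, MAX_OBJECTS))
    return objects, mask
```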
Each file has an MD5 hash, computed with `md5sum fileName`. You can find the expected hash values here; we suggest verifying each file's hash before any further operations.