This README contains detailed information and download instructions for the OpenViDial datasets introduced in:
- OpenViDial 1.0: OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
- OpenViDial 2.0: OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts
***** New March 12th, 2021: New CNN/RCNN features on the test/valid datasets *****
We fixed a bug in the CNN/RCNN features on the valid/test datasets and re-ran the experiments on the new data. The evaluation metrics have also been updated.
Detailed statistics for OpenViDial 1.0:

Attribute | Value |
---|---|
Number of turns | 1.1M |
Number of images | 1.1M |
Vocab size before BPE | 70K |
Vocab size after BPE | 30K |
Average length of each episode | 14 |
Average length of each turn | 7.6 |
The main folder `origin_dir` contains the training/valid/test sets, each of which is made up of the following files:
```
├── origin_dir
│   ├── train.dialogue.jsonl  // each line is an episode of dialogue, i.e., a list of utterance IDs
│   ├── train.origin.txt      // each line is a dialogue text utterance, whose ID is its line number (starting from 0)
│   ├── train_images          // images (visual contexts) in which the utterances take place; the ID is the image filename (0, 1, 2, etc.)
│   │   ├── 0.jpg
│   │   ├── 1.jpg
│   │   └── ...
│   ├── valid.* (i.e., valid.dialogue.jsonl, valid.origin.txt, valid_images)
│   └── test.* (i.e., test.dialogue.jsonl, test.origin.txt, test_images)
```
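As a concrete illustration of the layout above, the snippet below reconstructs the text of one episode by mapping the IDs in `*.dialogue.jsonl` to line numbers in `*.origin.txt` (the file formats are as described above; paths and the function name are illustrative):

```python
import json

def load_episode(dialogue_path, origin_path, episode_idx):
    """Return the utterances of one episode.

    Each line of *.dialogue.jsonl is a JSON list of utterance IDs;
    an utterance's ID is its 0-based line number in *.origin.txt.
    """
    with open(origin_path, encoding="utf-8") as f:
        utterances = [line.rstrip("\n") for line in f]
    with open(dialogue_path, encoding="utf-8") as f:
        episodes = [json.loads(line) for line in f]
    return [utterances[i] for i in episodes[episode_idx]]
```

The corresponding image for utterance ID `i` would be `train_images/i.jpg`.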
Data download:
- Download `[train|valid|test].origin.txt` and `[train|valid|test].dialogue.jsonl` here.
- Download `test_images` (~20G) here.
- Download `valid_images` (~20G) here.
- Download `train_images`: since `train_images` is too large (~170G), we split it into 12 zip files. Download the separate files `zip_train` here, then download and run `cat.sh` here to merge all files into the same directory.
- Move all files to `origin_dir`.
To save the time of extracting CNN and Faster R-CNN features, we provide pre-computed CNN features and Faster R-CNN features. You only need to download them following the steps below and reconstruct the directory as shown here.
The compressed file of preprocessed ResNet50 features (`feature_files.tar.gz`, ~3.7G) can be downloaded from here. You can extract the preprocessed ResNet50 features (`*.features.mmap`) with the command `tar zxvf feature_files.tar.gz`.
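Once extracted, the `*.features.mmap` files can be read lazily with `numpy.memmap`. A minimal sketch follows; note that the dtype and per-image feature dimension are assumptions here and should be checked against the repository's preprocessing code:

```python
import numpy as np

# Assumed layout: one float32 feature vector per image, stored row-major.
# FEATURE_DIM is a placeholder -- verify it against the preprocessing config.
FEATURE_DIM = 1000

def load_features(mmap_path, num_images):
    """Memory-map a *.features.mmap file without loading it all into RAM."""
    return np.memmap(mmap_path, dtype=np.float32, mode="r",
                     shape=(num_images, FEATURE_DIM))
```

Memory-mapping keeps host RAM usage low even for the multi-gigabyte feature files, since pages are only loaded on access.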
The compressed file of preprocessed Faster R-CNN object features (`object_files.tar.gz`, ~50G) can be downloaded from here. You can extract the preprocessed Faster R-CNN object features (`*objects.mmap`, `*objects_mask.mmap`) with the command `tar zxvf object_files.tar.gz`.
Each file has an MD5 hash, computed with `md5sum fileName`. You can find the expected hash values here; we suggest verifying each file's hash before any further operations.
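Hash verification can be scripted. Here is a small sketch using Python's standard `hashlib` (the expected digests themselves must still be taken from the link above):

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Compare the returned digest to the published one before extracting, e.g. `md5_of("feature_files.tar.gz") == expected_digest`.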
Detailed statistics for OpenViDial 2.0:

Attribute | Value |
---|---|
Number of turns | 5.6M |
Number of images | 5.6M |
Vocab size before BPE | 278K |
Vocab size after BPE | 30K |
Average length of each episode | 48 |
Average length of each turn | 8.3 |
The main folder `origin_dir` contains the training/valid/test sets, each of which is made up of the following files:
```
├── origin_dir
│   ├── train.dialogue.jsonl  // each line is an episode of dialogue, i.e., a list of utterance IDs
│   ├── train.origin.txt      // each line is a dialogue text utterance, whose ID is its line number (starting from 0)
│   ├── train_images          // images (visual contexts) in which the utterances take place; the ID is the image filename (0, 1, 2, etc.)
│   │   ├── 0.jpg
│   │   ├── 1.jpg
│   │   └── ...
│   ├── valid.* (i.e., valid.dialogue.jsonl, valid.origin.txt, valid_images)
│   └── test.* (i.e., test.dialogue.jsonl, test.origin.txt, test_images)
```
Data download:
- Download `[train|valid|test].origin.txt` and `[train|valid|test].dialogue.jsonl` here.
- Download `test_images` (~123G) here.
- Download `valid_images` (~123G) here.
- Download `train_images`: since `train_images` is too large (~1.2T), we split it into 7 zip files. Download the separate directory `train` here, then run `cat * > train_images.zip && unzip -d ./train_images train_images.zip` to generate all images of the training set.
- Move all files to `origin_dir`.
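After the download and move steps above, a quick sanity check that `origin_dir` has the expected layout can save a failed preprocessing run later (file names are the ones from the directory tree above):

```python
import os

# Entries expected at the top level of origin_dir, per the tree above.
EXPECTED = [
    "train.dialogue.jsonl", "train.origin.txt", "train_images",
    "valid.dialogue.jsonl", "valid.origin.txt", "valid_images",
    "test.dialogue.jsonl", "test.origin.txt", "test_images",
]

def check_origin_dir(origin_dir):
    """Return the list of expected entries missing from origin_dir."""
    present = set(os.listdir(origin_dir))
    return [name for name in EXPECTED if name not in present]
```

An empty return value means all nine expected files/directories are in place.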
To save the time of extracting CNN and Faster R-CNN features, we provide pre-computed CNN features and Faster R-CNN features. You only need to download them following the steps below and reconstruct the directory as shown here.
The mmap files of preprocessed ResNet50 features for the train/valid/test sets (`*.features.mmap`, ~17G/~2G/~2G respectively) can be downloaded from here.
The compressed file of preprocessed Faster R-CNN object features (`object_files.tar.gz`, ~49G) can be downloaded from here. You can extract the preprocessed Faster R-CNN object features (`*objects.mmap`, `*objects_mask.mmap`) with the command `tar zxvf object_files.tar.gz`.
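The object features come as a pair of mmap files per split. A hedged sketch of reading them together follows; the dtype, per-image object count, and feature dimension below are placeholders that must be verified against the repository's preprocessing code:

```python
import numpy as np

# Placeholder layout assumptions -- verify against the repo's preprocessing.
MAX_OBJECTS = 20    # object slots kept per image (assumed)
OBJECT_DIM = 2048   # Faster R-CNN region feature size (assumed)

def load_objects(objects_path, mask_path, num_images):
    """Memory-map paired *objects.mmap / *objects_mask.mmap files.

    The mask marks which of the MAX_OBJECTS slots per image hold real
    detections; padding slots are False.
    """
    objects = np.memmap(objects_path, dtype=np.float32, mode="r",
                        shape=(num_images, MAX_OBJECTS, OBJECT_DIM))
    mask = np.memmap(mask_path, dtype=np.bool_, mode="r",
                     shape=(num_images, MAX_OBJECTS))
    return objects, mask
```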
Each file has an MD5 hash, computed with `md5sum fileName`. You can find the expected hash values here; we suggest verifying each file's hash before any further operations.