The released GIF Reply dataset gif-reply-dataset.csv
(download here) that contains 1,562,701 text-gif conversation turns.
This dataset has the following fields:
parent_id
: parent tweet IDchild_id
: child tweet IDchild_gif_id
: hash ID of the replied GIF (hash for a GIF can be calculated usingpython3 data/hash_gif.py
)set
: whether this reply is atrain
ortest
sample
NOTE: This downloaded dataset lacks serveral data fields for model training, following the instructions below to prepare the complete version of GIF reply dataset.
First, Get access to the Twitter Official API.
After obtained access to Twitter API, create twitter_credential.json
and filling your credentials in that file:
{
"consumer_key": "PUT YOUR KEY HERE",
"consumer_secret": "PUT YOUR SECRET HERE",
"access_token": "PUT YOUR KEY HERE",
"access_secret": "PUT YOUR SECRET HERE"
}
And then install required packages:
pip install pandas twitter tqdm ujson
Finally, run the preparation script to prepare the dataset:
python3 prepare_gif_reply_dataset.py \
/path/to/downloaded/gif-reply-dataset.csv \
/path/to/processed/gif-reply-dataset.csv \
/path/to/downloaded/gif-metadata.csv \
/path/to/twitter_credential.json \
/tmp/tweets.json
All GIFs need to be downloaded for the purpose of model training. Note that some of the GIFs hosted on Twitter might not be available at download time, as they can be removed by the original poster.
GIFs can be downloaded by using this script:
python3 download_twitter_gif.py \
/path/to/downloaded/gif-reply-dataset.csv \
/path/to/twitter_credential.json \
/path/to/cache/child-tweets.json \
/path/to/store/downloaded/gifs/ # this is the $GIF_PATH of your choice
This script will download all replied GIF files in .mp4
format from Twitter, and then store them in a 3-level directory structure under $GIF_PATH
.
For example, the GIF with hash ID 68e460404503373feee6f1c686007078dec7c0c602026667
will be saved to $GIF_PATH/6/8/e/460404503373feee6f1c686007078dec7c0c602026667.mp4
.
This dataset gif-metadata.csv
(download here) contains metadata for GIFs.
It has the following fields:
gif_id
: hash ID of the replied GIFocr_text
: captions extracted using paddleOCR on four frames sampled from each quartile of the gif’s length, seperated by"[INTER_FRAME_SEP]"
.tags
: annotated tags for GIF. This is NOT needed to reproduce thePEPE
model, and is only provided for replicability of the tag-based model.
ROI metadata is only required to train the PEPE
model.
Additional ROI metadata can be extracted using bottom-up-attention.
Preparation script to extract ROIs can be found here (forthcomming), and it will add the following two fields of data into a updated pickle file gif-metadata-with-roi.pkl
.
roi_feature
: extracted ROI features on four frames sampled from each quartile of the gif’s length.roi_labels
: extracted ROI labels on four frames sampled from each quartile of the gif’s length.
The file gif-id-to-giphy-id-mapping.csv
(download here) contains a mapping from the GIF ID (hash ID) to GIPHY ID.
giphy-id-to-gif-id-mapping.csv
(download here) contains a mapping fron the GIPHY ID to the GIF ID (hash ID).
Both files have the following fields:
gif_id
: hash ID of a GIFgiphy_id
: ID of the matched GIF on GIPHY. The GIF hosted on GIPHY can be found using linkhttps://giphy.com/gifs/[GIPHY_ID_HERE]
.
The file gif-pepe-inferred-features.csv
(download here) contains inferred GIF feature from a trained PEPE
model for 115,404 GIFs.
It has the following fields:
gif_id
: hash ID of a GIFgif_feature
: a feature embedding for the corresponding gif.
gif-pepe-inferred-features-top1k.csv
(download here) is a subset of features for 1000 most used GIFs, and is provided for demonstration purpose.
The model checkpoint PEPE-model-checkpoint.pth
can be download here, further instruction on how to load the checkpoint can be found in the demo here.