This is an implementation of NewsLXMERT described in the paper:
@inproceedings{newslxmert_acmmm22,
author = {Bartolomeu, Cláudio and Nóbrega, Rui and Semedo, David},
title = {Understanding News Text and Images Connection with Context-enriched Multimodal Transformers},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
location = {Lisbon, Portugal},
numpages = {10}
}
Download the NYTimes800k and NewsImages datasets.
Both datasets need a set of modifications before they can be used with this NewsLXMERT implementation.
All documents are extracted from MongoDB into individual files to optimize I/O operations.
Each document is stored in one of three directories, depending on its `split` field: `train`, `valid` or `test`.
Each document file is named after the article's id (e.g. `1111-2222-333.json`).
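The export step above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `export_documents` is hypothetical, and it assumes the article id is stored under `_id` (MongoDB's default key) and that each document carries the `split` field described above.

```python
import json
from pathlib import Path

SPLITS = ("train", "valid", "test")

def export_documents(documents, out_root):
    """Write each document to <out_root>/<split>/<id>.json; return counts per split."""
    out_root = Path(out_root)
    counts = {split: 0 for split in SPLITS}
    for doc in documents:
        split = doc["split"]
        if split not in counts:
            raise ValueError(f"unexpected split: {split!r}")
        split_dir = out_root / split
        split_dir.mkdir(parents=True, exist_ok=True)
        # Assumption: the article id lives under "_id".
        article_id = str(doc["_id"])
        with open(split_dir / f"{article_id}.json", "w", encoding="utf-8") as fh:
            # default=str keeps non-JSON types (ids, dates) serializable.
            json.dump(doc, fh, ensure_ascii=False, default=str)
        counts[split] += 1
    return counts
```

`documents` can be any iterable of dictionaries, for example a cursor returned by a MongoDB client, so the whole collection never needs to be held in memory.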
The image features of all articles are extracted beforehand using a Faster R-CNN (FRCNN), serialized to JSON, and stored in a separate folder.
Each image features JSON file must be named `<image_hash>.jpg` and has the following structure:
{
"roi_features": "<base64 encoded roi features>",
"boxes": "<base64 encoded boxes>",
"normalized_boxes": "<base64 encoded normalized boxes>",
"obj_ids": "<base64 encoded obj ids>",
"obj_probs": "<base64 encoded obj probs>",
"attr_ids": "<base64 encoded attr ids>",
"attr_probs": "<base64 encoded attr probs>",
"n_regions": "<number of regions>"
}
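As an illustration of the base64 fields above, the sketch below round-trips a toy record. It assumes the arrays are stored as raw float32 bytes in native byte order; the README does not state the dtype or layout, so check the repository's loader before relying on this. The helper names `encode_floats` and `decode_floats` are hypothetical.

```python
import base64
import json
from array import array

def encode_floats(values):
    """Pack floats as float32 bytes and return a base64 ASCII string."""
    return base64.b64encode(array("f", values).tobytes()).decode("ascii")

def decode_floats(encoded):
    """Inverse of encode_floats: base64 string -> list of Python floats."""
    buf = array("f")
    buf.frombytes(base64.b64decode(encoded))
    return buf.tolist()

# Toy record: 2 regions with 4-dimensional features (real features are much larger).
record = {
    "roi_features": encode_floats([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]),
    "boxes": encode_floats([0.0, 0.0, 10.0, 10.0, 5.0, 5.0, 20.0, 20.0]),
    "n_regions": 2,
}
serialized = json.dumps(record)      # what would be written to the features file
restored = json.loads(serialized)    # what a loader would read back
boxes = decode_floats(restored["boxes"])
```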
To index the files, create a `.pickle` file that stores a dictionary whose keys are array indexes (0 to number_documents-1) and whose values are tuples. Each tuple is t = (<aid>, <iid>), where aid is the article id and iid is the index of an image within that article (0 to number_of_images-1).
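A possible way to build such an index is sketched below. It assumes one entry per (article, image) pair and takes a mapping from article id to image count as input; how that count is obtained from each document is left open, and the function names are hypothetical.

```python
import pickle

def build_index(images_per_article):
    """Map a running position to (article id, image index) for every image."""
    index = {}
    for aid, n_images in images_per_article.items():
        for iid in range(n_images):
            # The next free position is the current size of the index.
            index[len(index)] = (aid, iid)
    return index

def save_index(index, path):
    """Serialize the index dictionary to a .pickle file."""
    with open(path, "wb") as fh:
        pickle.dump(index, fh)
```

For example, `build_index({"a": 2, "b": 1})` yields `{0: ("a", 0), 1: ("a", 1), 2: ("b", 0)}`.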
The news title and text must be translated to English and stored, for each article, in two new fields: `title_en` and `text_en`.
The image features of all articles are extracted beforehand using a Faster R-CNN (FRCNN), serialized to JSON, and stored in a separate folder.
Each image features JSON file must be named `<image_hash>.jpg` and has the following structure:
{
"roi_features": "<base64 encoded roi features>",
"boxes": "<base64 encoded boxes>",
"normalized_boxes": "<base64 encoded normalized boxes>",
"obj_ids": "<base64 encoded obj ids>",
"obj_probs": "<base64 encoded obj probs>",
"attr_ids": "<base64 encoded attr ids>",
"attr_probs": "<base64 encoded attr probs>",
"n_regions": "<number of regions>"
}
The faces in all articles' images are detected using MTCNN, and their features are extracted using FaceNet into the following new fields:
- `faces_embeddings`: base64-encoded string of the face embeddings extracted by FaceNet.
- `n_faces`: number of faces identified by MTCNN.
- `faces_detect_probs`: base64-encoded string of the face detection probabilities.
- `faces_size`: size of each face feature.
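The face fields above could be read back as sketched here. The sketch assumes that `faces_embeddings` holds `n_faces * faces_size` float32 values stored contiguously in native byte order, and that `faces_size` is the length of one embedding; neither is confirmed by this README, and `decode_faces` is a hypothetical helper.

```python
import base64
from array import array

def decode_faces(doc):
    """Return one embedding (list of floats) per detected face."""
    n_faces = int(doc["n_faces"])
    size = int(doc["faces_size"])
    flat = array("f")
    flat.frombytes(base64.b64decode(doc["faces_embeddings"]))
    if len(flat) != n_faces * size:
        raise ValueError("embedding length does not match n_faces * faces_size")
    # Slice the flat buffer into n_faces consecutive vectors.
    return [flat[i * size:(i + 1) * size].tolist() for i in range(n_faces)]
```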
This implementation supports CPU and single-GPU training. The latter is recommended because it is faster and simpler.
To do unsupervised pre-training of a NewsLXMERT model on NYTimes800k, run:
python -m src.train_nytimes --trainDsDir <train_split_dir> \
--trainIndex <train_split_index_file> \
--validDsDir <validation_split_dir> \
--validIndex <validation_split_index_file> \
--testDsDir <test_split_dir> \
--testIndex <test_split_index_file> \
--featsDir <image_features_dir> \
--output <model_checkpoints_dir> \
--epochs 20 \
--batchSize 256 \
--lr 1e-4 \
--warmupRatio 0.05 \
--mode 7 \
--maskedLmRatio 0.15 \
--maskedFeatsRatio 0.15 \
--maxSeqLen 100 \
--entities True \
--faces True
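To give an intuition for how `--lr` and `--warmupRatio` usually interact, here is a sketch of a linear warmup followed by linear decay. This is an assumption about a common schedule, not a description of this repository's scheduler, which the README does not specify; `warmup_lr` is a hypothetical helper.

```python
def warmup_lr(step, total_steps, base_lr=1e-4, warmup_ratio=0.05):
    """Linear warmup to base_lr, then linear decay to zero (assumed schedule)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up: reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    # Decay: falls linearly from base_lr to 0 at total_steps.
    return base_lr * max(0.0, (total_steps - step) / remaining)
```

Under this assumed schedule, a ratio of 0.05 means that the first 5% of all optimizer steps are spent ramping the learning rate up to its peak value.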
To do unsupervised fine-tuning of a NewsLXMERT model on NewsImages, run:
python -m src.train_mediaeval --trainDsDir <train_split_dir> \
--trainIndex <train_split_index_file> \
--validDsDir <validation_split_dir> \
--validIndex <validation_split_index_file> \
--testDsDir <test_split_dir> \
--testIndex <test_split_index_file> \
--featsDir <image_features_dir> \
--output <model_checkpoints_dir> \
--epochs 20 \
--batchSize 256 \
--lr 1e-4 \
--warmupRatio 0.05 \
--mode 7 \
--maskedLmRatio 0.15 \
--maskedFeatsRatio 0.15 \
--maxSeqLen 100 \
--entities True \
--faces True \
--loadLxmert <.pth_file_checkpoint>
The previous commands can be run with the `--test` flag to evaluate NewsLXMERT on the validation and test splits; all metrics are logged.
Run `python -m src.train_nytimes --help` for a description of all available training parameters.
`--mode` is a training/test flag that specifies which text elements of the news articles are used. It can take the following values:
Mode | Text Elements Used | NYTimes800k | NewsImages |
---|---|---|---|
1 | Caption | X | |
2 | Caption + Headline | X | |
3 | Caption + Snippet | X | |
4 | Caption + Headline + Snippet | X | |
5 | Headline | X | X |
6 | Snippet | X | X |
7 | Headline + Snippet | X | X |
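The table above can be read as a lookup from mode to text elements. The sketch below encodes it directly; the field names (`caption`, `headline`, `snippet`), the concatenation order, and the space separator are assumptions made for illustration and may differ from the repository's data loaders.

```python
# Mode -> text elements, transcribed from the table above.
MODE_ELEMENTS = {
    1: ("caption",),
    2: ("caption", "headline"),
    3: ("caption", "snippet"),
    4: ("caption", "headline", "snippet"),
    5: ("headline",),
    6: ("snippet",),
    7: ("headline", "snippet"),
}
# NewsImages only supports the modes without captions.
NEWSIMAGES_MODES = {5, 6, 7}

def build_text(article, mode, dataset="nytimes800k"):
    """Concatenate the text elements selected by `mode` (assumed field names)."""
    if mode not in MODE_ELEMENTS:
        raise ValueError(f"unknown mode: {mode}")
    if dataset == "newsimages" and mode not in NEWSIMAGES_MODES:
        raise ValueError(f"mode {mode} is not available for NewsImages")
    parts = (article.get(name) for name in MODE_ELEMENTS[mode])
    return " ".join(part for part in parts if part)
```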
Our NewsLXMERT models pre-trained on NYTimes800k can be downloaded as follows:
| | epochs | task | mrr@100 (i2t) | mrr@100 (t2i) | model | md5 |
---|---|---|---|---|---|---|
NewsLXMERT | 20 | News piece-Image Matching | 0.1189 | 0.1044 | download | 8ebfb5953b52fa41ef04c5c7f61e07c4 |
NewsLXMERT | 20 | Image-Caption Matching | 0.3342 | 0.3031 | download | 2c9efb49dea29578b3b117cb76540d90 |
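The `md5` column can be used to check that a downloaded checkpoint is intact. A minimal sketch using Python's standard `hashlib` module follows; the function names and the file name in the usage note are placeholders.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checkpoint(path, expected_md5):
    """Return True when the file's MD5 matches the published value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify_checkpoint("model.pth", "8ebfb5953b52fa41ef04c5c7f61e07c4")` should return `True` for an uncorrupted download of the first model in the table, where `model.pth` stands for wherever the file was saved.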
Our NewsLXMERT models fine-tuned on NewsImages can be downloaded as follows:
| | epochs | pre-train task | mrr@100 (i2t) | mrr@100 (t2i) | model | md5 |
---|---|---|---|---|---|---|
NewsLXMERT | 20 | News piece-Image Matching | 0.1230 | 0.1247 | download | cf5b73a8facce1e8538a6291b87cbb95 |
NewsLXMERT | 20 | Image-Caption Matching | 0.1373 | 0.1294 | download | 148004368cd6eccd89a5be3b10f6e00f |
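For readers unfamiliar with the metric, mrr@100 is the mean reciprocal rank with a cutoff of 100: each query contributes 1/rank of its correct item if that item appears in the top 100 results, and 0 otherwise. The sketch below shows the general definition; the exact evaluation protocol used to produce the numbers above (candidate pool, tie handling) is not described here, so treat this as illustrative.

```python
def mrr_at_k(ranked_lists, targets, k=100):
    """Mean reciprocal rank of each target within the top-k of its ranked list."""
    if not targets:
        return 0.0
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        for rank, candidate in enumerate(ranked[:k], start=1):
            if candidate == target:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(targets)
```

In the image-to-text (i2t) direction the queries are images and the ranked candidates are texts; the text-to-image (t2i) direction swaps the two roles.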