The program requires the following dependencies:

- PyTorch
- fairseq 0.9.0
- CUDA (for GPU support)
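
A minimal install might look like the following; this is only a sketch, assuming a pip-based environment, and the exact PyTorch build depends on your CUDA version:

```
# Adjust the torch build to match your CUDA toolkit.
pip install torch fairseq==0.9.0
```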

We use the COCO Caption Evaluation library, which relies on the Stanford CoreNLP 3.6.0 toolset. Download the CoreNLP models first:

```
cd external/coco-caption
./get_stanford_models.sh
```

Then, from the project root, make the library importable:

```
export PYTHONPATH=./external/coco-caption
```
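
To sanity-check the setup, you can try importing one of the scorers; the `pycocoevalcap` package name is assumed here from the library's standard layout:

```
# Assumes external/coco-caption follows the usual pycocoevalcap layout.
python -c "from pycocoevalcap.bleu.bleu import Bleu; print('coco-caption available')"
```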

Pre-process the UC Merced images and captions:

```
./preprocess_captions.sh uc-merced
./preprocess_images.sh uc-merced
```

Add or replace the files in your fairseq 0.9.0 installation with the corresponding files from this repository's `fairseq` directory, as sketched below.
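
One way to do the overlay, assuming fairseq was installed with pip and that this repository's `fairseq` directory mirrors the installed package layout:

```
# Locate the installed fairseq package and overlay this repo's files onto it.
FAIRSEQ_DIR=$(python -c "import fairseq, os; print(os.path.dirname(fairseq.__file__))")
cp -r fairseq/* "$FAIRSEQ_DIR"/
```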

Train the model. The hyperparameters below are only an example and need to be tuned for your setup:

```
python -m fairseq_cli.train \
  --save-dir .checkpoints \
  --user-dir task \
  --task captioning \
  --arch default-captioning-arch \
  --encoder-layers 3 \
  --decoder-layers 6 \
  --features obj \
  --feature-spatial-encoding \
  --optimizer adam \
  --adam-betas "(0.9,0.999)" \
  --lr 0.0003 \
  --lr-scheduler inverse_sqrt \
  --min-lr 1e-09 \
  --warmup-init-lr 1e-8 \
  --warmup-updates 8000 \
  --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 \
  --weight-decay 0.0001 \
  --dropout 0.3 \
  --max-epoch 25 \
  --max-tokens 4096 \
  --max-source-positions 100 \
  --encoder-embed-dim 512 \
  --num-workers 2
```

To generate captions for images in the test split:

```
python generate.py \
  --user-dir task \
  --features grid \
  --tokenizer moses \
  --bpe subword_nmt \
  --bpe-codes output/codes.txt \
  --beam 5 \
  --split test \
  --path .checkpoints-scst/checkpoint24.pt \
  --input output/test-ids.txt \
  --output output/test-predictions.json \
  --output_l output/test-labels-preds.csv
```
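
A quick, structure-agnostic way to inspect the generated predictions file; its exact schema is not documented here, so this just prints the first few hundred characters:

```
# Works regardless of whether the JSON root is a list or an object.
python -c "import json; print(str(json.load(open('output/test-predictions.json')))[:500])"
```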

The following example calculates metrics for the captions contained in `output/test-predictions.json`:

```
./score.sh \
  --reference-captions external/coco-caption/annotations/captions_val2014.json \
  --system-captions output/test-predictions.json
```

The following example calculates metrics for the labels contained in `output/test-labels-preds.csv`:

```
python score_label.py \
  --reference-captions output/label_preds.csv \
  --system-captions output/test-labels-preds.csv
```

The trained multi-task model for image captioning with multi-label classification can be downloaded from here.

This codebase is inspired by https://github.com/krasserm/fairseq-image-captioning.

If you find this code useful for your research, please cite our paper:

```
@article{kandala2022exploring,
  title={Exploring Transformer and multi-label classification for remote sensing image captioning},
  author={Kandala, Hitesh and Saha, Sudipan and Banerjee, Biplab and Zhu, Xiao Xiang},
  journal={IEEE Geoscience and Remote Sensing Letters},
  year={2022},
  publisher={IEEE}
}
```