PTAMSC: Pure Transformer-flavored Augmented Multi-modal Sentiment Classifier

Author: GONGGONGJOHN

This is the official repository for the final project of the DaSE undergraduate course "Contemporary Artificial Intelligence": Multimodal Sentiment Analysis. All scripts and results are intended to be executable and reproducible.

Setup

This project is implemented in Python 3.8, and all related dependencies have been exported to requirements.txt. Use the following command to install the required packages automatically:

pip install -r requirements.txt

The main dependencies include:

  • chardet 4.0.0
  • matplotlib 3.3.2
  • Pillow 9.2.0
  • scikit_learn 1.1.1
  • torch 1.9.0
  • torchvision 0.10.0
  • tqdm 4.64.0
  • transformers 4.20.1

Repository Structure

.
├── README.md
├── assets
│   └── ptamsc_structure.png
├── baseline.py  # Baseline testing entry
├── baselines
│   ├── baseline_model.py
│   ├── bert_train.py  # Bert baseline training
│   ├── resnet_train.py  # ResNet baseline training
│   ├── swin_train.py  # Swin Transformer baseline training
│   ├── xlm_hypertune.py  # Hyperparameter tuning program
│   └── xlm_train.py  # XLMRoBERTa baseline training
├── data_pipe.py
├── data_utils.py
├── main.py  # Main entry
├── model
│   ├── model.py
│   └── roberta.png
└── plot
    ├── plot_clmlf_pretrain.py
    ├── plot_clmlf_train.py
    ├── plot_fuse_accuracy.py
    ├── plot_hypertune.py
    └── plot_resnet_f1.py

Model Overview

PTAMSC is a pure transformer-based image-text multimodal classification model. The main structure is as follows:

[Figure: PTAMSC model structure (assets/ptamsc_structure.png)]
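
The default backbones are xlm-roberta-base for text and microsoft/swin-base-patch4-window7-224 for images (see the command-line options below). As a rough sketch (not the actual code in model/model.py), the two encoders can be loaded with transformers and their pooled features fused; the simple concatenation shown here corresponds roughly to the X+S (Concatenate) baseline, while PTAMSC replaces it with the transformer-based fusion from the figure:

import torch
from torch import nn
from PIL import Image
from transformers import AutoModel, AutoTokenizer, AutoFeatureExtractor

# Default backbones from the options below (--text_model_name / --img_model_name).
text_encoder = AutoModel.from_pretrained("xlm-roberta-base")
image_encoder = AutoModel.from_pretrained("microsoft/swin-base-patch4-window7-224")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "microsoft/swin-base-patch4-window7-224"
)

# Toy inputs: one sentence and one blank 224x224 image (matching --img_scale_size).
text_inputs = tokenizer("an example tweet", return_tensors="pt")
image_inputs = feature_extractor(Image.new("RGB", (224, 224)), return_tensors="pt")

with torch.no_grad():
    text_feat = text_encoder(**text_inputs).pooler_output     # shape (1, 768)
    image_feat = image_encoder(**image_inputs).pooler_output  # shape (1, 1024)

# Concatenate the pooled features and map them to the three sentiment classes.
fused = torch.cat([text_feat, image_feat], dim=-1)
classifier = nn.Linear(fused.shape[-1], 3)
logits = classifier(fused)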

Performance

Model               Accuracy (%)   Precision (%)   Recall (%)   F1 (%)
BERT                72.99          65.56           54.47        55.66
XLM-RoBERTa         73.58          67.62           59.48        61.52
ResNet              57.93          36.01           33.71        25.59
Swin Transformer    66.73          57.90           54.02        55.38
X+S (Concatenate)   70.25          63.74           54.88        57.12
X+S (Additive)      71.23          66.31           54.79        55.01
PTAMSC              76.51          70.88           63.34        65.69

Reproducing the Experiments

Dataset Preparation

To reproduce experiments on the original dataset, please use the following structure to organize your dataset directory:

.
└── [Dataset root]
    ├── train.txt  # Training label
    ├── test_without_label.txt  # Test items to be predicted
    └── source
        ├── x.txt  # Paired text-image data goes here
        └── x.jpg

As explained in the project report, the original dataset contains text instances that are not encoded in UTF-8/ANSI. Please open them with GBK encoding and translate them into English manually; otherwise, errors will occur when the data instances are read into the model.
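
A small helper like the following (illustrative only, not part of the repository; the directory name data/ is a placeholder for your dataset root) can use the chardet dependency listed above to locate the text files that need this manual treatment:

import glob

import chardet

# Flag every text file under source/ whose detected encoding is not
# UTF-8/ASCII, so it can be opened with GBK and translated by hand.
for path in glob.glob("data/source/*.txt"):
    with open(path, "rb") as f:
        guess = chardet.detect(f.read())
    encoding = (guess["encoding"] or "unknown").lower()
    if encoding not in ("utf-8", "utf-8-sig", "ascii"):
        print(f"{path}: detected {encoding} (confidence {guess['confidence']:.2f})")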

Baseline Model

python baseline.py --model [Name of the baseline model] --data_path [Dataset Root Directory]

The optional parameters and their meanings are as follows:

  • --model: The name of the baseline model to be used for training and validation, can be one of: bert, xlmroberta, resnet, swin, concat, additive
  • --img_scale_size: The target size of images to feed into the model, default: 224
  • --text_model_name: The pretrained model name of the text embedding model, default: xlm-roberta-base
  • --img_model_name: The pretrained model name of the image embedding model, default: microsoft/swin-base-patch4-window7-224
  • --batch_size: The batch size used for all dataset splits, default: 16
  • --epoch: The number of training epochs
  • --lr_finetune: The learning rate of the pretrained backbone
  • --lr_downstream: The learning rate of the downstream structure
  • --warmup_step: The number of warmup steps before the learning rate reaches its assigned value (see the sketch after this list)
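
The last three options control the optimization schedule. As a minimal sketch (an assumed setup, not necessarily the exact code in baseline.py), --lr_finetune and --lr_downstream correspond to two optimizer parameter groups, while --warmup_step drives a linear warmup schedule such as the one provided by transformers:

import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

backbone = nn.Linear(768, 768)  # stands in for the pretrained backbone
head = nn.Linear(768, 3)        # stands in for the downstream structure

# Two parameter groups with separate learning rates.
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},  # --lr_finetune
    {"params": head.parameters(), "lr": 1e-4},      # --lr_downstream
])

total_steps = 1000  # number of batches times epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=200,  # --warmup_step: ramp up linearly, then decay to zero
    num_training_steps=total_steps,
)

for step in range(total_steps):
    # ... forward and backward pass of one batch would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()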

Main Model

python main.py --do_train --do_eval --do_test --data_path [Dataset Root Directory]

The optional parameters and their meanings are as follows:

  • --img_scale_size: The target size of images to feed into the model, default: 224
  • --text_model_name: The pretrained model name of the text embedding model, default: xlm-roberta-base
  • --img_model_name: The pretrained model name of the image embedding model, default: microsoft/swin-base-patch4-window7-224
  • --batch_size: The batch size used for all dataset splits, default: 16
  • --epoch: The number of training epochs
  • --lr_finetune: The learning rate of the pretrained backbone
  • --lr_downstream: The learning rate of the downstream structure
  • --warmup_step: The number of warmup steps before the learning rate reaches its assigned value
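
For example, a full train/evaluate/predict run might look like the following, where ./data stands for your dataset root and the hyperparameter values are illustrative placeholders rather than the settings reported in the project report:

python main.py --do_train --do_eval --do_test --data_path ./data --img_scale_size 224 --batch_size 16 --epoch 10 --lr_finetune 1e-5 --lr_downstream 1e-4 --warmup_step 200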

References

We only list the major reference papers here; the full reference list can be found in the project report:

  1. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2015.
  2. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online, July 2020. Association for Computational Linguistics.
  3. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  4. Zhen Li, Bing Xu, Conghui Zhu, and Tiejun Zhao. CLMLF: A contrastive learning and multi-layer fusion method for multimodal sentiment detection. In Findings of the Association for Computational Linguistics: NAACL 2022. Association for Computational Linguistics, 2022.
  5. Yan Ling, Jianfei Yu, and Rui Xia. Vision-language pre-training for multimodal aspect-based sentiment analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2149–2159, Dublin, Ireland, May 2022. Association for Computational Linguistics.
  6. Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.
  7. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
