
Image Captioning

To explore the fascinating intersection between computer vision and natural language processing, I implemented the image captioning model from Show, Attend and Tell with some tweaks.

In this project, I learned a lot about integrating feature extraction with attention and an LSTM, the underlying math from the papers, and the PyTorch framework. Below are results of my trained model (30 of them are in output).


Background

For a bit of history, the simple encoder-decoder model proved its worth in machine translation, so researchers started trying to translate images into natural language in a similar way. However, the biggest difference between translating an image's extracted features into an English sentence and translating a French sentence into one is that the visual information needs to be heavily compressed into just a few vectors with minimal loss of spatial information. Therefore, to build an image captioning model on the encoder-decoder architecture, an attention mechanism that lets the neural network "focus" on parts of the image when outputting each word is the key to increased performance.

From Show, Attend and Tell:

Automatically generating captions of an image is a task very close to the heart of scene understanding - one of the primary goals of computer vision.

Neural image captioning is about giving machines the ability to compress salient visual information into descriptive language. The biggest challenges are building the bridge between computer vision and natural language processing models and producing captions that describe the most significant aspects of the image.

For detailed background info on feature extraction, soft/hard attention, and sequence generation with an LSTM, the resources section contains a number of useful links and papers I used. Wrapping my head around how image encoding, attention, and the LSTM integrate led me to understand this implementation (a top-down approach).
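To make that top-down picture concrete, here is a minimal sketch of one decoding step that ties the three pieces together (encoder features, an attention-weighted context vector, and an LSTM cell). The shapes, sizes, and variable names are illustrative and not taken from this repo's models.py:

```python
import torch
import torch.nn as nn
import torchvision

# 1. Encoder: a pre-trained CNN turns the image into a grid of feature vectors.
resnet = torchvision.models.resnet101(pretrained=True)
encoder = nn.Sequential(*list(resnet.children())[:-2])   # drop pooling + fc, keep spatial map

image = torch.randn(1, 3, 224, 224)                       # dummy input image
features = encoder(image)                                 # (1, 2048, 7, 7)
features = features.flatten(2).transpose(1, 2)            # (1, 49, 2048): 49 image regions

# 2. Attention produces a context vector: a weighted sum of the 49 region
#    features for the current time step (placeholder weights here).
alpha = torch.softmax(torch.randn(1, 49), dim=1)
context = (features * alpha.unsqueeze(2)).sum(dim=1)      # (1, 2048)

# 3. Decoder: an LSTM cell consumes the previous word embedding plus the context
#    vector; its new hidden state is projected to a distribution over the vocab.
embed_dim, decoder_dim, vocab_size = 512, 512, 10000
embedding = nn.Embedding(vocab_size, embed_dim)
lstm_cell = nn.LSTMCell(embed_dim + 2048, decoder_dim)
fc = nn.Linear(decoder_dim, vocab_size)

prev_word = torch.tensor([1])                             # e.g. the <start> token id
h, c = torch.zeros(1, decoder_dim), torch.zeros(1, decoder_dim)
h, c = lstm_cell(torch.cat([embedding(prev_word), context], dim=1), (h, c))
scores = fc(h)                                            # logits for the next word
```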


Key Info

Below are some of my implementation choices, in chronological order.

  • PyTorch, both for its Pythonic syntax and for its strong GPU acceleration. There is less documentation for PyTorch, so I ended up learning a lot more by reading some source code
  • Colab's T4 GPU from Google, which was strong enough for Flickr30k with small batch sizes (max 12)
  • Flickr30k dataset, because MS COCO requires enormous training time and computational power for Colab. Link to download from Andrej Karpathy. He also clarified why there is a 'restval' split in this tweet
  • No pre-trained embeddings, because training embeddings from scratch is not a heavy task and it lets my NLP model fit the dataset's context
  • Soft attention (deterministic) for its differentiability (standard backprop). Intuitively, soft attention looks at the whole image while focusing on some parts, whereas hard attention samples a single location at a time
  • Multi-layer perceptron for the attention model, as in the paper (see the first sketch after this list)
  • Doubly stochastic attention regularization to encourage the model to pay roughly equal attention to every part of the image over the course of generation. This was used to improve the score in the paper
  • Early stopping to terminate training: if the BLEU score does not improve for over 10 epochs, training stops and the best model checkpoint is saved
  • BLEU-4 score for both training (early stopping) and evaluation
  • Beam search to find the most probable sequence after the decoder does the heavy lifting (see the second sketch after this list)
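Below is a minimal sketch of the MLP soft attention and the doubly stochastic regularization term in PyTorch. The class and variable names are my own illustration and may not match models.py exactly:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """MLP soft attention: scores each spatial location of the encoder output."""
    def __init__(self, encoder_dim, decoder_dim, attention_dim):
        super().__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)  # project image features
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)  # project LSTM hidden state
        self.full_att = nn.Linear(attention_dim, 1)               # collapse to a scalar score
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, encoder_out, decoder_hidden):
        # encoder_out: (batch, num_pixels, encoder_dim), decoder_hidden: (batch, decoder_dim)
        att1 = self.encoder_att(encoder_out)
        att2 = self.decoder_att(decoder_hidden).unsqueeze(1)
        scores = self.full_att(self.relu(att1 + att2)).squeeze(2)  # (batch, num_pixels)
        alpha = self.softmax(scores)                               # weights sum to 1
        context = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)    # weighted sum of features
        return context, alpha

# Doubly stochastic regularization: encourage the attention weights at each pixel
# to sum to ~1 across all decoding steps (lam is a hyperparameter, e.g. 1.0).
# alphas: (batch, caption_length, num_pixels)
# loss = cross_entropy + lam * ((1.0 - alphas.sum(dim=1)) ** 2).mean()
```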
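And a simplified sketch of beam search at inference time. It assumes a hypothetical decode_step helper that wraps the decoder and returns log-probabilities over the vocabulary for the next word; the real implementation in visualize_result.py may differ in details:

```python
import torch

def beam_search(decode_step, start_token, end_token, beam_size=4, max_len=50):
    """Keep the beam_size highest-scoring partial captions at every step.

    decode_step(tokens) -> log-probabilities (vocab_size,) for the next word,
    given the partial caption `tokens` (a hypothetical helper around the decoder).
    """
    beams = [([start_token], 0.0)]  # (token sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = decode_step(tokens)
            top_lp, top_ix = log_probs.topk(beam_size)
            for lp, ix in zip(top_lp.tolist(), top_ix.tolist()):
                candidates.append((tokens + [ix], score + lp))
        # keep only the best beam_size partial captions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (completed if tokens[-1] == end_token else beams).append((tokens, score))
        if not beams:
            break
    best = max(completed or beams, key=lambda c: c[1])
    return best[0]
```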

Performance

With a beam size of 4, my final model reached a BLEU-4 score of 32.83.
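For reference, a BLEU-4 score like this can be computed with NLTK's corpus_bleu, which defaults to uniform 4-gram weights. This is a sketch assuming references and hypotheses are tokenized word lists; evaluation.ipynb may compute it differently:

```python
from nltk.translate.bleu_score import corpus_bleu

# references: one list of reference token lists per image (Flickr30k has 5 captions each)
# hypotheses: one generated token list per image, in the same order
references = [[['a', 'dog', 'runs', 'on', 'grass'], ['a', 'dog', 'running', 'outside']]]
hypotheses = [['a', 'dog', 'runs', 'on', 'the', 'grass']]

bleu4 = corpus_bleu(references, hypotheses)  # default weights = BLEU-4
print(f'BLEU-4: {bleu4 * 100:.2f}')
```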


Possible Improvements

  • Better hardware would enable the MS COCO dataset, larger batch sizes, and more epochs
  • Try hard attention (the stochastic approach) and compare performance
  • Fine-tune the ResNet encoder for longer so it better fits the dataset
  • Instead of teacher forcing every word, use scheduled sampling, which feeds the model its own predictions with some probability and has been shown to work better (see the sketch after this list)
  • As mentioned in the paper, a major drawback of using attention is distilling the important parts of an image, especially for images with a lot going on. This problem is addressed by DenseCap, where objects are first recognized in separate windows
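A minimal sketch of what scheduled sampling could look like inside the training loop's decoding step. The helper and the decay schedule are illustrative assumptions, not part of this repo:

```python
import random

def choose_next_input(ground_truth_word, predicted_scores, teacher_forcing_prob):
    """With probability teacher_forcing_prob feed the ground-truth word,
    otherwise feed the model's own prediction (scheduled sampling)."""
    if random.random() < teacher_forcing_prob:
        return ground_truth_word                  # teacher forcing
    return predicted_scores.argmax(dim=1)         # model's own previous prediction

# The probability is typically decayed over training, e.g. linearly per epoch:
# teacher_forcing_prob = max(0.5, 1.0 - 0.05 * epoch)
```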

Dependencies

NumPy, os, json, h5py, PyTorch, matplotlib, Pillow, scikit-image, SciPy, Jupyter Notebook/Google Colab, tqdm.


Try it

To see what my model would say about your own image, run python visualize_result.py -i /path/to/image -o /path/to/output.

Files

README.md            - self

assets               - images for README
input                - input images
output               - output alphas and sentences

train.ipynb          - Colab notebook for training model
evaluation.ipynb     - Colab notebook for evaluating model (BLEU score)

caption.py           - the class for input data
models.py            - encoder, attention, and decoder models
organize_input.py    - script for parsing raw input data into json files
visualize_result.py  - script for testing model and producing results 

Resources

Papers

Miscellaneous
