
Evaluation

Thanos Masouris edited this page Aug 28, 2022 · 1 revision

The performance of the model can be evaluated both qualitatively and quantitatively. 

Qualitative Evaluation

For the qualitative evaluation, synthetic samples are generated at each training epoch from the same fixed noise vectors. The gifmaker.py script then assembles these samples into a GIF showing their evolution over the course of training, which lets us inspect the training progress in terms of the image quality of the generated samples. The figure below shows the resulting file for the training of the DiStyleGAN model.

[Progress GIF: Illustration of the progress of training.]
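The repository's gifmaker.py handles this step; as a rough illustration of the idea (not the actual script), a minimal sketch using Pillow might look like the following, assuming one sample grid is saved per epoch with filenames that sort in epoch order:

```python
from pathlib import Path

from PIL import Image


def make_progress_gif(sample_dir, out_path, duration_ms=200):
    """Assemble per-epoch sample images into an animated GIF.

    Assumes one PNG per epoch in `sample_dir`, named so that a
    lexicographic sort matches epoch order (e.g. epoch_0001.png).
    """
    frames = [Image.open(p) for p in sorted(Path(sample_dir).glob("*.png"))]
    # The first frame carries the save options; the rest are appended.
    frames[0].save(
        out_path,
        save_all=True,
        append_images=frames[1:],
        duration=duration_ms,  # milliseconds per frame
        loop=0,                # loop forever
    )
```

The function name and file layout here are assumptions for illustration; the real script may organize its inputs differently.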

Additionally, inspired by Zhang et al. [1], the t-SNE algorithm [2] was used to check whether the trained model suffers from mode collapse, a common GAN failure in which the generator produces only a small set of outputs that manage to fool the discriminator. In particular, we generated 5,000 samples (500 per class) and used the tsne.py script to lay them out on a 2D grid: a pre-trained VGG19 model extracts 4,096-dimensional features, the PCA algorithm compresses them to 300 dimensions, and t-SNE maps them to 2D Cartesian coordinates. In the resulting grid, similar-looking images occupy neighbouring tiles, which makes mode-collapse patterns easy to spot. The figure below shows this grid for our pre-trained model; we did not observe any noticeable mode collapse.

[t-SNE visualization for DiStyleGAN's synthetic samples.]
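The dimensionality-reduction part of the pipeline described above (features → PCA → t-SNE) can be sketched with scikit-learn. This is a minimal illustration, not the project's tsne.py; it takes pre-extracted features as input, so the VGG19 feature-extraction step is assumed to have happened separately, and the function name is hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def embed_features_2d(features, pca_dims=300, seed=0):
    """Compress high-dimensional features with PCA, then map them to 2D
    via t-SNE.

    `features`: (N, D) array, e.g. 4096-dim VGG19 activations.
    PCA components are capped at min(N, D) so small runs stay valid.
    """
    n_comp = min(pca_dims, *features.shape)
    reduced = PCA(n_components=n_comp, random_state=seed).fit_transform(features)
    coords = TSNE(
        n_components=2,
        random_state=seed,
        init="pca",
        # t-SNE requires perplexity < number of samples.
        perplexity=min(30, len(features) - 1),
    ).fit_transform(reduced)
    return coords  # (N, 2) Cartesian coordinates
```

To build the image grid, the 2D coordinates are then typically snapped to the nearest free grid cell, so that nearby points end up in neighbouring tiles.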

Quantitative Evaluation

For the quantitative evaluation, we calculate the Inception Score (IS) [3] and the Fréchet Inception Distance (FID) [4]. We used the TensorFlow implementation of the two metrics by Junho Kim and Ahmed Fares on GitHub. The results presented below were calculated on 50,000 synthetic images (5,000 per class). For the calculation of the Inception Score, we used 10 splits.

| Inception Score | Fréchet Inception Distance |
| --------------- | -------------------------- |
| 6.78 (± 0.08)   | 42.30                      |
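The numbers above come from the TensorFlow implementation linked earlier. Purely as a rough illustration of what the two metrics compute, they can be sketched from their definitions in NumPy/SciPy; the function names are hypothetical, and in practice the class probabilities and features would come from a pre-trained Inception network:

```python
import numpy as np
from scipy.linalg import sqrtm


def inception_score(probs, n_splits=10, eps=1e-16):
    """Inception Score [3] from classifier softmax outputs (N, n_classes).

    Per split: exp(mean KL(p(y|x) || p(y))). Returns (mean, std) over splits.
    """
    scores = []
    for chunk in np.array_split(probs, n_splits):
        marginal = chunk.mean(axis=0, keepdims=True)  # p(y) for this split
        kl = (chunk * (np.log(chunk + eps) - np.log(marginal + eps))).sum(axis=1)
        scores.append(np.exp(kl.mean()))
    return float(np.mean(scores)), float(np.std(scores))


def frechet_distance(feat_real, feat_fake):
    """FID [4] between two sets of classifier features (N, D) each:
    ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * sqrt(C1 @ C2)).
    """
    mu1, mu2 = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    c1 = np.cov(feat_real, rowvar=False)
    c2 = np.cov(feat_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary numerical noise
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2 * covmean))
```

A sanity check on the definitions: a classifier that outputs the uniform distribution for every image yields an Inception Score of 1, and the FID between a feature set and itself is 0.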

References

[1] Zhang, Han, et al. "StackGAN++: Realistic image synthesis with stacked generative adversarial networks." IEEE Transactions on Pattern Analysis and Machine Intelligence 41.8 (2018): 1947–1962.

[2] van der Maaten, Laurens, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9.86 (2008): 2579–2605.

[3] Salimans, Tim, et al. "Improved techniques for training GANs." Advances in Neural Information Processing Systems 29 (2016).

[4] Heusel, Martin, et al. "GANs trained by a two time-scale update rule converge to a local Nash equilibrium." Advances in Neural Information Processing Systems 30 (2017).