Cross-Modal Content-Based Objective for Learning Adequate Multimodal Representations
MVRL Workshop @ ICML 2016
Presenter: Seungwhan (Shane) Moon
Adhiguna Kuncoro, Akash Bharadwaj, Volkan Cirik, Louis-Philippe Morency, Chris Dyer
Multimodal Machine Translation - Are pictures worth a thousand words?
Input: [image] + "Two young, White males are outside near many bushes."
Output: "Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche."
Multimodal Machine Translation - Motivation
Current leading MT system output: [example figure not recovered]
Multimodal Machine Translation - Reconciling Two Separate Tasks

Multimodal Machine Translation: Caption (Source Language) + Image → Caption (Target Language)

Machine Translation: Caption (Source Language) → Caption (Target Language)
Challenge: the decoder is often overly fluent but inadequate.

Image Captioning: Image → Caption (Target Language)
Challenge: the CNN representation is competent for a discriminative task, but not for a generative task.
Baselines
▪ NMT Enc-Dec model (= baseline unimodal model: "blind")
▪ NMMT (+ text attention)
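The attention in the NMMT baseline follows the Bahdanau et al. style of soft attention over encoder states. A minimal numpy sketch of computing a context vector (the dot-product scoring, shapes, and variable names below are illustrative assumptions, not the exact model — Bahdanau et al. score with an additive MLP):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states, W):
    # Score each source position against the current decoder state
    # (bilinear scoring here for brevity).
    scores = encoder_states @ (W @ decoder_state)
    weights = softmax(scores)           # attention distribution over source positions
    context = weights @ encoder_states  # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))  # 5 source positions, hidden size 8
dec = rng.normal(size=8)       # current decoder hidden state
W = rng.normal(size=(8, 8))
ctx, w = attention_context(dec, enc, W)
```

The decoder then conditions each output word on `ctx` in addition to its own state.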
Our Approach
CNN Filtering
Motivation for CNN Filtering: source-text views may not be relevant, since the English and German captions are all i.i.d. and crowdsourced independently.
English Captions:
- A man in jeans is reclining on a green metal bench along a busy sidewalk and crowded street.
- A white male with a blue sweater and gray pants laying on a sidewalk bench.
- A man in a blue shirt and gray pants is sleeping on a sidewalk bench.
- A man sleeping on a bench in a city area.

German Captions:
- viele autos am straße rand geparkt, auf diesem liegt ein mann auf einer bank (= many cars parked on the roadside; on this a man lies on a bench)
- ein seite streifen mit parkenden autos und metall säulen die bis zum flucht punkt des bildes alle auf einer linie hinter einander stehen. (= a side strip with parked cars and metal columns that stand in a row on a line to the vanishing point of the image)
- ein mann liegt neben geparkten autos auf einer bank. (= a man lies next to parked cars on a bench)
Filtering CNN Representation
Hypothesis: strong regularization of the projection from CNN features is useful as input to the generative process.

[Architecture figure: the CNN's fc7 features are projected onto bags of content words — a bag of verbs (laying, sleeping, leaning, ...), a bag of nouns (man, bench, cars, ...), and a bag of adjectives (green, white, blue, ...) — and this filtered representation, optionally together with the raw fc7 vector, is passed to the decoder for the generative process.]
CNN Filtering Training Objective: [equation not recovered]
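The objective itself did not survive extraction; in the spirit of the description above, one plausible formulation is to train the projection of fc7 to predict which content words describe the image, as a multi-label logistic loss. A minimal numpy sketch (all names, shapes, and the loss form are illustrative assumptions, not the authors' exact objective):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def filtering_loss(fc7, W, targets):
    """Multi-label logistic loss: the projection W @ fc7 is trained so each
    output unit predicts the presence of one content word (noun/verb/adjective)."""
    probs = sigmoid(W @ fc7)
    eps = 1e-9  # numerical guard for log(0)
    return -np.mean(targets * np.log(probs + eps)
                    + (1 - targets) * np.log(1 - probs + eps))

rng = np.random.default_rng(1)
fc7 = rng.normal(size=4096)              # CNN fc7 feature vector
W = rng.normal(size=(3, 4096)) * 0.01    # projection to 3 content words
targets = np.array([1.0, 0.0, 1.0])      # e.g. {man: yes, cars: no, bench: yes}
loss = filtering_loss(fc7, W, targets)
```

Minimizing this loss regularizes the projected representation toward the content words that matter for generation.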
Datasets
▪ Main dataset: Flickr30k (multi-modal, multi-lingual)
  ▪ Training set: only 29k professionally-translated English-German captions
  ▪ Dev set: 1,014 sentences; blind test set: 1,000 sentences
▪ Multi-lingual word embedding: trained to map each German and English word into the same space; trained purely on the dataset (no external resources), using the Berkeley word aligner as pre-processing
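One simple way to exploit aligner output for a shared embedding space is to collapse aligned English-German word pairs into a single shared token before training embeddings, so both languages receive the same vector. This is a hypothetical sketch, not the authors' exact recipe, and the alignment pairs below are invented for illustration:

```python
# Hypothetical: aligned EN-DE word pairs extracted from the Berkeley aligner output.
alignments = {"man": "mann", "bench": "bank", "white": "weiss"}

def to_shared_vocab(tokens, lang):
    """Map tokens of either language onto shared 'en|de' tokens where aligned."""
    shared = []
    for tok in tokens:
        if lang == "en" and tok in alignments:
            shared.append(f"{tok}|{alignments[tok]}")
        elif lang == "de" and tok in alignments.values():
            en = next(e for e, d in alignments.items() if d == tok)
            shared.append(f"{en}|{tok}")
        else:
            shared.append(tok)  # unaligned words keep their own embedding
    return shared

en = to_shared_vocab("a man on a bench".split(), "en")
de = to_shared_vocab("ein mann auf einer bank".split(), "de")
```

Any embedding model trained on the merged corpora then assigns `man|mann` one vector shared by both languages.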
Results (Image + English → German)

Model            | Visual Features  | Meteor
-----------------|------------------|-------
NMMT + Attention | fc7 + CNN Filter | 30.28
NMMT + Attention | fc7              | 29.54
NMMT             | fc7 + CNN Filter | 19.32
NMMT             | fc7              | 18.72
NMT              | N/A              | 18.8
Takeaway
▪ Strong regularization of the CNN representation helps a generative process.
▪ It is complementary to the real-valued fc7 features!
Relevant papers
▪ D. Elliott, S. Frank, and E. Hasler, "Multi-language image description with neural sequence models," arXiv preprint arXiv:1510.04709, 2015.
▪ P. Koehn et al., "Moses: Open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL, Interactive Poster and Demonstration Sessions, pp. 177–180, 2007.
▪ D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014.
▪ J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," arXiv preprint arXiv:1506.07503, 2015.
▪ K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," arXiv preprint arXiv:1502.03044, 2015.
▪ B. Zoph and K. Knight, "Multi-source neural translation," arXiv preprint arXiv:1601.00710, 2016.
▪ O. Firat, K. Cho, and Y. Bengio, "Multi-way, multilingual neural machine translation with a shared attention mechanism," arXiv preprint arXiv:1601.01073, 2016.
▪ Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked attention networks for image question answering," arXiv preprint arXiv:1511.02274, 2015.
▪ N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
▪ M.-T. Luong, H. Pham, and C. D. Manning, "Bilingual word representations with monolingual quality in mind," in 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 151–159, 2015.
▪ R. Kiros, R. Salakhutdinov, and R. Zemel, "Unifying visual-semantic embeddings with multi-modal neural language models," TACL, 2015.
▪ N. Chomsky, Syntactic Structures. Mouton, 1957.
Questions?
Presenter: Seungwhan (Shane) Moon ([email protected])