We explore the learning of highly non-linear relationships among low-level features across different modalities for emotion prediction. Using the deep Boltzmann machine (DBM), we develop a joint density model over the space of multimodal inputs, spanning visual, auditory, and textual modalities. The model is trained directly on user-generated content (UGC) without any labeling effort. While the model learns a joint representation over all modalities, training samples with missing modalities can also be leveraged. More importantly, the joint representation enables emotion-oriented cross-modal retrieval, for example, retrieving videos with the text query "crazy cat". The model does not restrict the types of input and output, and hence, in principle, emotion prediction and retrieval on any combination of media are feasible.
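The cross-modal retrieval mentioned above reduces to nearest-neighbor search once every item is embedded in the joint space. The sketch below illustrates that step only; the embeddings are random stand-ins for the model's joint representations, and the dimensions, the `retrieve` helper, and the similarity measure (cosine) are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical joint-space embeddings: in the trained model, a text query
# and each video would be mapped into the shared representation; here
# random vectors stand in for those embeddings.
video_codes = rng.standard_normal((1000, 256))  # 1,000 videos, 256-d joint code
query_code = rng.standard_normal(256)           # e.g. embedding of "crazy cat"

def retrieve(query, codes, k=5):
    """Rank items by cosine similarity to the query in the joint space."""
    codes_n = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = codes_n @ query_n          # cosine similarity per video
    return np.argsort(-scores)[:k]      # indices of the top-k videos

top5 = retrieve(query_code, video_codes)
```

Because all modalities share one embedding space, the same routine serves text-to-video, video-to-text, or any other direction.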
Fig. 1 - Multimodal DBM that models the joint distribution over visual, auditory, and textual features. All layers except the first (bottom) layer use standard binary units. Gaussian RBMs model the distributions over the visual and auditory features, and a replicated softmax topic model is applied to the textual features.
Fig. 1 shows the proposed network architecture, which is composed of three pathways for the visual, auditory, and textual modalities, respectively. Each pathway is formed by stacking multiple Restricted Boltzmann Machines (RBMs), aiming to learn several layers of increasingly complex representations of an individual modality. Similar to the multimodal framework of Srivastava and Salakhutdinov, we adopt the Deep Boltzmann Machine (DBM) of Salakhutdinov and Hinton. Unlike other deep networks for feature extraction, such as Deep Belief Networks (DBNs) and denoising autoencoders (dAs), the DBM is a fully generative model that can extract features from data with missing modalities. Additionally, besides the bottom-up information propagation of DBNs and dAs, the DBM also incorporates top-down feedback, which makes it more robust to missing or noisy inputs such as weakly labeled data on the Web. The pathways eventually meet, and the sophisticated non-linear relationships among the three modalities are learned jointly. The final joint representation can be viewed as a shared embedding space, in which features with very different statistical properties from different modalities are represented in a unified way.
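The building block of each pathway's bottom layer is a Gaussian RBM: real-valued visible units for the (standardized) visual or auditory features and binary hidden units. A minimal sketch of such an RBM trained with one step of contrastive divergence (CD-1) follows; the class name, layer sizes, learning rate, and unit-variance assumption are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GaussianRBM:
    """Minimal Gaussian-Bernoulli RBM trained with CD-1 (illustrative sketch).

    Visible units are real-valued with unit variance; hidden units are binary,
    as in the first layer of the visual and auditory pathways.
    """
    def __init__(self, n_vis, n_hid, lr=0.01):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b_vis = np.zeros(n_vis)
        self.b_hid = np.zeros(n_hid)
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_hid)

    def cd1_step(self, v0):
        # Positive phase: sample hidden units given the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        # Negative phase: reconstruct visibles (Gaussian mean),
        # then recompute the hidden probabilities.
        v1 = h0_sample @ self.W.T + self.b_vis
        h1 = self.hidden_probs(v1)
        # Contrastive-divergence parameter update (batch-averaged).
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_vis += self.lr * (v0 - v1).mean(axis=0)
        self.b_hid += self.lr * (h0 - h1).mean(axis=0)

# Toy usage: one pathway layer applied to a batch of standardized features.
rbm = GaussianRBM(n_vis=128, n_hid=64)
batch = rng.standard_normal((32, 128))
rbm.cd1_step(batch)
features = rbm.hidden_probs(batch)  # layer-1 representation, shape (32, 64)
```

Stacking such layers, with the hidden activations of one RBM serving as the visible data of the next, yields the increasingly abstract per-modality representations described above.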
The proposed architecture is more general and powerful in terms of scale and learning capacity. In the visual pathway, the low-level features amount to 20,651 dimensions, which would result in a very large number of parameters if they were connected directly to a single hidden layer. Instead, we design a separate pathway for each low-level feature, which requires fewer parameters and is hence more flexible and efficient to train. This makes our system more scalable to high-dimensional features than the 3,875-dimensional features used by Srivastava and Salakhutdinov. We further learn the separate visual pathways in parallel, which reduces the computational cost further. Moreover, we generate a compact representation that captures the common structure while preserving the unique characteristics of each visual feature, so that the visual modality does not overwhelm the others through sheer dimensionality during joint representation learning. The auditory and textual pathways do not suffer from this problem; nevertheless, the proposed structure can easily be extended to other modalities.
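The per-feature pathway idea can be pictured as follows: each visual feature type is compressed by its own (independently trainable) pathway, and the compact outputs are concatenated into one visual code. Only the idea comes from the text; the per-feature dimensions, the single-matrix "pathway" stand-ins, and the `visual_code` helper are hypothetical placeholders for the trained RBM stacks.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-feature input dimensions (the real totals differ);
# each feature gets its own pathway that can be trained in parallel.
feature_dims = {"DenseSIFT": 1000, "GIST": 512, "HOG": 1000,
                "LBP": 256, "SSIM": 1000}
hidden_dim = 128  # each pathway compresses its feature to this size

# One independent projection (a stand-in for a trained RBM stack) per feature.
pathways = {name: 0.01 * rng.standard_normal((d, hidden_dim))
            for name, d in feature_dims.items()}

def visual_code(inputs):
    """Concatenate the per-pathway outputs into one compact visual code."""
    parts = [sigmoid(inputs[name] @ W) for name, W in pathways.items()]
    return np.concatenate(parts)  # 5 * 128 = 640 dims vs. 3,768 raw dims here

raw = {name: rng.standard_normal(d) for name, d in feature_dims.items()}
code = visual_code(raw)
print(code.shape)  # (640,)
```

The compact 640-dimensional code, rather than the raw concatenation, is what meets the auditory and textual pathways, so no single modality dominates the joint layer by dimensionality alone.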
The code and dataset are provided for non-commercial research and/or educational purposes only. To obtain them, you must fully agree to the following terms and conditions with complete understanding:
- I understand that the copyright of the videos and corresponding metadata in the dataset fully belongs to their owners. In no event shall City University of Hong Kong be liable for any incidents or damages caused by the direct or indirect usage of the dataset by requesting researchers.
- The code and dataset should be only used for non-commercial research and/or educational purposes.
- City University of Hong Kong makes no representations or warranties regarding the code and dataset, including but not limited to warranties of non-infringement, merchantability or fitness for a particular purpose.
- Researcher shall defend and indemnify City University of Hong Kong, including its employees, trustees and officers, and agents, against any claims arising from Researcher's use of the code and dataset.
- Researcher may provide research associates and colleagues with access to the code and dataset provided that they have also agreed to be bound by the terms and conditions stated in this agreement.
- An electronic document, such as an email containing the signed form, from the requesting researcher is regarded as an electronic signature on the form, which has the same legal effect as a hardcopy signature.
- City University of Hong Kong reserves the right to terminate access to the code and dataset at any time.
Download
The dataset can be obtained by sending a request email to us. Specifically, researchers interested in the dataset should sign the Agreement and Disclaimer Form and email it to us. We will send you instructions via email to download the dataset at our discretion.
- E-YouTube: 156,219 videos and metadata, 1.4 TB (since the size is too large, we only provide the URL list of the videos; you can crawl them from YouTube yourself.)
- E-YouTube Visual Features:
- DenseSIFT: 22.0 GB
- GIST: 5.2 GB
- HOG: 21.0 GB
- LBP: 6.8 GB
- SSIM: 22.0 GB
- E-YouTube Textual Features: 9.0 GB
- Code: 401.0 KB
- Model: 2.0 GB
Lei Pang, Shiai Zhu, and Chong-Wah Ngo. Deep Multimodal Learning for Affective Analysis and Retrieval. IEEE Trans. on Multimedia, vol. 17, no. 11, pp. 2008-2020, 2015.
- N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep Boltzmann machines,” J. Mach. Learn. Res., vol. 15, pp. 2949–2980, 2014.
- R. Salakhutdinov and G. Hinton, “Deep Boltzmann machines,” in Proc. Int. Conf. Artif. Intell. Statist., 2009, pp. 448–455.
- G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
- P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1096–1103.