Research Review Notes

Summaries of academic research papers

JTAV: Jointly Learning Social Media Content Representation by Fusing Textual, Acoustic, and Visual Features


Idea

The authors propose to learn a representation of social media ‘content’ by fusing three distinct modalities: textual, acoustic, and visual. They claim that jointly learning features from all three modalities outperforms systems that rely on single- or bi-modal learning. Their approach first learns single-modal features and then fuses the features learned from the different modalities, as sketched below.
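As a rough illustration of this fuse-after-encoding pipeline, here is a minimal PyTorch sketch that combines pre-computed single-modal features by concatenation followed by a learned projection. The feature dimensions and the concatenate-then-project fusion scheme are assumptions for illustration, not necessarily the paper's exact fusion mechanism.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Fuse per-modality embeddings into one joint content representation.

    The dimensions and the concatenate-then-project strategy are
    illustrative assumptions, not the paper's exact fusion design.
    """
    def __init__(self, text_dim=256, audio_dim=128, visual_dim=512, joint_dim=300):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(text_dim + audio_dim + visual_dim, joint_dim),
            nn.ReLU(),
        )

    def forward(self, text_feat, audio_feat, visual_feat):
        # Each input: (batch, modality_dim); output: (batch, joint_dim)
        fused = torch.cat([text_feat, audio_feat, visual_feat], dim=-1)
        return self.project(fused)

# Example usage with random single-modal features
fusion = LateFusion()
joint = fusion(torch.randn(4, 256), torch.randn(4, 128), torch.randn(4, 512))
print(joint.shape)  # torch.Size([4, 300])
```

Concatenation keeps each modality's features intact and leaves it to the projection layer to learn cross-modal interactions; richer fusion schemes (e.g., attention across modalities) would be drop-in replacements for the projection.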

Background

Method

An attention-based bidirectional GRU network (attBiGRU) is used for textual information, a DCRNN for acoustic information, and DenseNet, a fine-tuned general-purpose image classification architecture, for visual features.
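The notes do not reproduce the paper's exact layer configurations, but a minimal PyTorch sketch of an attention-based BiGRU encoder in the attBiGRU style might look like the following; the vocabulary size, hidden dimensions, and additive-attention pooling are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class AttBiGRU(nn.Module):
    """Bidirectional GRU text encoder with attention pooling.

    A sketch of what an 'attention-based BiGRU' typically looks like;
    the paper's exact attBiGRU architecture may differ.
    """
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # One attention score per BiGRU output position
        self.attn = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> states: (batch, seq_len, 2*hidden)
        states, _ = self.bigru(self.embed(token_ids))
        weights = torch.softmax(self.attn(states), dim=1)  # (batch, seq, 1)
        # Attention-weighted sum over time -> (batch, 2*hidden)
        return (weights * states).sum(dim=1)

# Example usage: encode a batch of 4 token sequences of length 20
encoder = AttBiGRU()
feats = encoder(torch.randint(0, 10000, (4, 20)))
print(feats.shape)  # torch.Size([4, 256])
```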

Textual information is divided into two parts:

Observations