In this repository, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making data and code available for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges and to build a deeper understanding of multimodality in language processing.
View the project on GitHub to download the dataset.
Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loic Barrault, Lucia Specia, and Florian Metze. How2: A large-scale dataset for multimodal language understanding. In Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL), Montreal, Canada, December 2018. Neural Information Processing Systems (NeurIPS).
We will hold a special session on Multimodal Representation Learning for Language Generation and Understanding at ICASSP 2019 using the How2 dataset and other related work. More information coming up soon!
We continue to explore tasks for vision and language as a workshop at ICML 2019. Stay tuned for more information!
Thanks to The Johns Hopkins University for organizing the JSALT workshops, which were a great experience, and to the amazing Grounded Sequence-to-Sequence Transduction team!