Grounded Sequence-to-Sequence Transduction Team at JSALT 2018
We are a team of computational linguistics researchers from different institutions (academia and industry). We will be working on algorithms for language grounding using multiple modalities for six weeks at the Frederick Jelinek Memorial Summer Workshop at Johns Hopkins University.
Video understanding is one of the hardest challenges in Artificial Intelligence research. If a machine could look at videos and “understand” the events being shown, it could learn by itself, perhaps even without supervision, simply by “watching” broadcast TV, Facebook, YouTube, or similar sites.
As a first step, we will combine written, spoken, and seen objects and actions in how-to videos: if we see a person slicing round, red objects and putting them on a brown surface, it is more likely that she or he is explaining how to make a sandwich than how to change a car tire. And we might learn that the red objects are called “tomato”. Our team will develop methods that exploit multimodality to process and analyze videos in order to accomplish three main tasks: speech captioning, video-to-text summarization, and translation into a different language. These tasks are diverse but related. We therefore propose to model them in a multi-task sequence-to-sequence learning framework in which these (and other, auxiliary) tasks can benefit from shared representations.
The tasks we propose all generate natural language, which poses a number of well-known challenges, such as lexical, syntactic, and semantic ambiguity and reference resolution. Grounding language in other modalities, such as the visual and audio information we propose to use here, can help overcome these challenges. Information extracted from speech, audio, and video will serve as rich context for the various tasks we plan to address.
The dataset we will use in this project is a set of instructional videos called the HowTo corpus, containing about 2,000 hours of speech. We are collecting Portuguese (and possibly Turkish) translations for these videos via crowdsourcing. We will also be collecting a dedicated summarization dataset from these HowTo videos. A dataset website and more information are coming soon. In the meantime…
Here is the topic distribution visualization for this dataset. We find that 25 topics including yoga, cooking, sports, guitar, sewing, and many more are most representative of the dataset. Set the relevance metric on the right to ~0.2 and click on a particular topic cluster to see the top words in each topic. Toggle the options in the interactive visualization and have fun!
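For readers curious about what the relevance setting does, here is a minimal sketch. It assumes the slider works like the relevance weight in LDAvis-style topic visualizations, where a word's relevance to a topic is λ·log p(w|t) + (1−λ)·log(p(w|t)/p(w)); this page does not say which tool produced the visualization, so treat that as an assumption. The function names and all probabilities below are made up for illustration and are not taken from the HowTo corpus.

```python
import math


def relevance(p_w_given_t, p_w, lam):
    """LDAvis-style relevance (assumed here, not confirmed by the page).

    lam = 1 ranks words by their probability within the topic;
    lam = 0 ranks them by lift, i.e. p(w|t) / p(w).
    """
    return lam * math.log(p_w_given_t) + (1.0 - lam) * math.log(p_w_given_t / p_w)


def top_words(topic_word_probs, corpus_word_probs, lam=0.2, k=3):
    """Return the k words with the highest relevance for one topic."""
    scores = {
        word: relevance(p, corpus_word_probs[word], lam)
        for word, p in topic_word_probs.items()
        if p > 0  # log is undefined for zero-probability words
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical "cooking" topic: p(word | topic) and p(word) over the whole corpus.
cooking_topic = {"video": 0.30, "tomato": 0.10, "slice": 0.08, "pan": 0.05}
corpus = {"video": 0.30, "tomato": 0.01, "slice": 0.01, "pan": 0.02}

frequent_words = top_words(cooking_topic, corpus, lam=1.0)   # favors frequent words
distinctive_words = top_words(cooking_topic, corpus, lam=0.2)  # favors topic-specific words
print(frequent_words)
print(distinctive_words)
```

With a low setting such as 0.2, words that are common everywhere (like “video” in this toy example) drop out of the list, and words that are specific to the topic rise to the top, which is why a low value tends to make topic clusters easier to interpret.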
Here is part of the team:
Picture with all of us coming soon :)