Sight and sound


An MIT-IBM Watson AI Project

The Sight and Sound project focuses on learning and recognition with multi-modal data. We target feature representations and higher-level semantic concepts by training neural networks on multi-modal data such as video, audio, and text.
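
As a rough illustration of the kind of objective used in this line of work (not the exact loss from any of the papers below), multi-modal representation learning often pulls embeddings of paired clips from two modalities together while pushing mismatched pairs apart. The following is a minimal NumPy sketch of a symmetric InfoNCE-style contrastive loss over a batch of paired embeddings; all names, dimensions, and the temperature value are illustrative assumptions.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss between two modalities.

    a, b: (batch, dim) L2-normalized embeddings; row i of `a` is paired
    with row i of `b` (e.g. a video clip and its audio track).
    Illustrative sketch only; temperature 0.07 is an arbitrary choice.
    """
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix
    n = len(a)                      # positives lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # cross-entropy in both directions (a -> b and b -> a), averaged
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy demo: "audio" embeddings are noisy copies of "video" embeddings,
# so matched pairs are the most similar and the loss is small.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 16))
v /= np.linalg.norm(v, axis=1, keepdims=True)
audio = v + 0.1 * rng.normal(size=v.shape)
audio /= np.linalg.norm(audio, axis=1, keepdims=True)
print(float(info_nce(v, audio)))
```

Shuffling the pairing (so row i of one modality no longer matches row i of the other) raises the loss, which is exactly the signal the objective trains against.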

People

Jim Glass, PI - MIT
Hilde Kuehne, PI - IBM
Leonid Karlinsky, PI - IBM
David Harwath, Senior Advisor
Brian Kingsbury, Senior Advisor
Rogério Feris, Senior Advisor
Samuel Thomas, Senior Advisor - IBM
Andrew Rouditchenko, PhD - MIT
Nina Shvetsova, PhD - Goethe University Frankfurt
Brian Chen, PhD - Columbia University
Layne Berry, PhD - University of Virginia
Alexander Liu, PhD - MIT
Yuan Gong, Postdoc - MIT
Past Members
Kevin Duarte
Aisha Urooj
Sirnam Swetha
Aslı Çelik
Angie W Boggust
Kartik Audhkhasi
Dhiraj Joshi
Danny Gutfreund
Yang Zhang
Rameswar Panda
Antonio Torralba

Papers

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne; CVPR 2022


Routing with Self-Attention for Multimodal Capsule Networks
Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah; arXiv:2112.00775


Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang; ICCV 2021


Paper, Code

AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James Glass; Interspeech 2021

Paper, Code

Cascaded Multilingual Audio-Visual Learning from Videos
Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass; Interspeech 2021

Paper, Code

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, James Glass; IJCV 2020

Paper
