Multimodal Summarization


Summarizing a video from its textual content alone, or from its visual content alone, often leads to a one-sided summary. Such summaries miss the important parts that an instructor puts more focus on. Here we combine three modalities (text, video, and audio) to construct a vocabulary, extract video features, and pinpoint important audio segments. We then try to combine these three feature sets to create a more robust and informative summary.
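
As a rough illustration of the combination step, the sketch below assumes that each modality has already produced one importance score per video segment. It normalizes the scores, takes a weighted sum, and keeps the highest-scoring segments. The function names, the equal default weights, and the late-fusion scheme itself are assumptions made for illustration, not a description of the actual pipeline.

```python
from typing import List, Sequence, Tuple


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so the modalities are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are equal: no segment stands out in this modality.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_select(
    text_scores: Sequence[float],
    video_scores: Sequence[float],
    audio_scores: Sequence[float],
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),  # assumed equal weights
    top_k: int = 2,
) -> List[int]:
    """Return indices of the top_k segments, in chronological order.

    Hypothetical late fusion: a weighted sum of normalized per-segment scores.
    """
    if not (len(text_scores) == len(video_scores) == len(audio_scores)):
        raise ValueError("all modalities must score the same segments")

    normalized = [min_max(text_scores), min_max(video_scores), min_max(audio_scores)]
    fused = [
        sum(w * modality[i] for w, modality in zip(weights, normalized))
        for i in range(len(text_scores))
    ]

    # Rank by fused score (ties broken by earlier segment), then restore time order.
    ranked = sorted(range(len(fused)), key=lambda i: (-fused[i], i))
    return sorted(ranked[:top_k])


if __name__ == "__main__":
    # Made-up scores for four segments, purely for demonstration.
    text = [0.1, 0.9, 0.4, 0.2]
    video = [0.3, 0.8, 0.1, 0.2]
    audio = [0.0, 0.5, 1.0, 0.1]
    print(fuse_and_select(text, video, audio, top_k=2))
```

Returning the selected segments in chronological order keeps the summary readable as a shortened version of the original video. The weights could be tuned if one modality proves more reliable than the others.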