May 5, 2023

Dhruv Kunjadiya

Hello there!

I am a final-year undergraduate student pursuing Computer Engineering at VJTI, Mumbai. I have a keen interest and solid experience in Machine Learning, Artificial Intelligence, and Computer Vision. I am very excited to be part of Google Summer of Code 2023 and to learn new things from the Red Hen Lab team and my mentors.

You can learn more about me over here. As a student developer, I will be working with Red Hen Lab, and this blog documents my weekly progress on the project I proposed.

Project Details

Mentors

Raúl Sánchez, Cristóbal Pagán Cánovas, Inés Olza, Rosa Illán

Abstract

I propose to create a semantic multimodal search engine for collections of transcribed and aligned videos using state-of-the-art artificial intelligence models of different types, including NLP (Large Language Models) for text generation and for capturing the semantics of transcriptions, as well as image-description models to understand what is being shown in the video. Focusing only on the transcribed text is not enough; taking the visual content of the video into account provides valuable context. The search engine will list the closest matches to the user's query, along with metadata such as a link to the video, the video ID, the timestamp, and the text.
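The core retrieval step can be sketched as a cosine-similarity ranking over precomputed embeddings. This is a minimal illustration only, assuming the query and transcript-segment embeddings already exist; the function name, metadata fields, and toy vectors are all hypothetical stand-ins for real model output.

```python
import numpy as np

def rank_matches(query_emb, doc_embs, metadata, top_k=3):
    """Rank indexed segments by cosine similarity to the query embedding."""
    # Normalize, then cosine similarity reduces to a dot product.
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(scores)[::-1][:top_k]
    return [{**metadata[i], "score": float(scores[i])} for i in order]

# Toy 3-dimensional "embeddings" standing in for real sentence embeddings.
doc_embs = np.array([[1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 1.0, 0.0]])
metadata = [{"video_id": "v1", "timestamp": "00:01:05"},
            {"video_id": "v2", "timestamp": "00:02:10"},
            {"video_id": "v3", "timestamp": "00:00:30"}]
results = rank_matches(np.array([1.0, 0.0, 0.0]), doc_embs, metadata, top_k=2)
```

In the real system the same ranking would run over embeddings of every transcript sentence, returning the metadata (video link, ID, timestamp, text) of the best matches.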

For more details, refer to my project proposal. Reading the proposal will help you better understand the goal of the project and the steps described below. The project repository can be found over here.

Progress Update

Community Bonding Period

I received my Case Western Reserve University (CWRU) ID to gain access to the High Performance Computing (HPC) clusters, and I was able to set up the connection to the clusters through VPN access. This video is a helpful reference.

I participated in the welcome meet and learned more about Red Hen Lab and the team.

Week 1

I had a meeting with all the mentors about the initial steps for the project; we discussed the project goals and how to approach them step by step. We decided to first convert the video transcripts (.vrt files) to JSON, extracting all the useful data from the .vrt files and storing it in JSON.

This involves extracting video metadata such as the date, time, month, title, and duration, as well as all the sentences, which will later be used to compute sentence embeddings. I also stored the start and end time of each sentence, since they will be used to extract the first, middle, and last frame of that sentence. These extracted frames will be passed to “coca_ViT-L-14” to obtain text descriptions of them. After discussing with my mentors, I learned that storing the verbs of every sentence with their start and end times will also be helpful, because verbs carry much of a sentence's meaning: extracting and describing the frame at a verb's timestamp can provide useful context.
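The per-sentence data described above could be stored roughly as follows. This is only a sketch of the idea, not the project's actual schema; the field names and helper functions are my own illustrative choices.

```python
import json

def frame_times(start, end):
    # Timestamps (in seconds) of the first, middle, and last frame of a
    # sentence; these frames are later captioned by an image-description model.
    return [start, round((start + end) / 2.0, 3), end]

def sentence_record(text, start, end, verbs):
    """One JSON entry per sentence, assuming `verbs` is a list of
    (verb, start_time, end_time) tuples taken from the annotation."""
    return {
        "text": text,
        "start": start,
        "end": end,
        "frame_times": frame_times(start, end),
        "verbs": [{"verb": v, "start": s, "end": e} for v, s, e in verbs],
    }

record = sentence_record("The rocket lifts off.", 12.0, 15.5,
                         [("lifts", 12.8, 13.2)])
print(json.dumps(record, indent=2))
```

Keeping the frame and verb timestamps alongside the text means a single JSON record has everything needed for both the textual and the visual side of the search index.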

So I wrote a Python script using regular expressions to extract all the useful data. You can see the script over here.
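The regex-based extraction can be sketched on a toy snippet. The tag names, attributes, and tab-separated columns below are a simplified stand-in for the real Red Hen .vrt markup, which carries more fields; the point is only to show the general approach.

```python
import re

# Hypothetical, simplified .vrt-style input (one token per line, tab-separated
# word / POS tag / lemma, sentences wrapped in timed <s> tags).
sample = (
    '<s starttime="12.0" endtime="15.5">\n'
    'The\tDT\tthe\n'
    'rocket\tNN\trocket\n'
    'lifts\tVBZ\tlift\n'
    'off\tRP\toff\n'
    '</s>\n'
)

S_RE = re.compile(r'<s starttime="([\d.]+)" endtime="([\d.]+)">\n(.*?)</s>', re.S)

sentences = []
for start, end, body in S_RE.findall(sample):
    tokens = [line.split('\t') for line in body.strip().split('\n')]
    text = ' '.join(tok[0] for tok in tokens)
    # POS tags starting with "VB" mark verbs in the Penn Treebank tagset.
    verbs = [tok[0] for tok in tokens if tok[1].startswith('VB')]
    sentences.append({'start': float(start), 'end': float(end),
                      'text': text, 'verbs': verbs})
```

Each matched sentence yields its timing, reconstructed text, and verbs in one pass, ready to be dumped to JSON.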

This is what a .vrt file looks like:

[Screenshot of a sample .vrt file]