One of the leading global entertainment companies, based in North America, receives around 600 to 1,000 audition videos every year. The whole process of reviewing the videos, classifying them into relevant categories, and rating them in order of preference was carried out manually. It was a hugely time- and effort-intensive exercise for the company to parse through all the videos for the relevant talent and to recall archived videos for current use.
Looking at Machine Learning models to speed up this intensive manual process
The company wanted to evaluate how Machine Learning (ML) could help with this task. Persistent, with its Google Cloud-certified ML expertise and home-grown APIs and custom models, developed a prototype to ingest a batch of videos, run them through multiple sophisticated ML models, and extract granular impressions of the data in the videos.
The video intelligence prototype is built with Google Cloud Storage, Google Cloud Dataflow, the Google Video Intelligence API, TensorFlow, Datastore, and App Engine, and has the following capabilities:
- Object Detection & Tracking
- Facial Recognition
- Age & Gender Estimation
- Video Processing Pipeline
- Video Transcription
- Natural Language Processing
- Smart Search
- Recommendation Engine
- Style Transfer
- Generative Models
Multiple video processing pipelines to help in the auditioning process
The First Pipeline: detecting real-world objects
The model was trained to identify generic, labeled real-world objects in the videos. The platform identified musical instruments such as drums and accordions; stage equipment such as trampolines and chairs; bodies of water; and people by their profession, such as singers, acrobats, and trumpeters. The times of occurrence were projected onto a horizontal timeline with a corresponding confidence score. The user could search and navigate through the analyzed information on an interactive and intuitive UI.
Going a step further, the model was fine-tuned to detect custom labels specified by the company. The system can now extend the generic model to detect customer-specific labels.
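The timeline projection described above can be sketched as follows. The input format mirrors what a label-detection service such as the Google Video Intelligence API returns (a label plus time segments with confidence scores), but the field names here are illustrative assumptions, not the production schema:

```python
# Project label-detection results onto a horizontal timeline.
# Each annotation carries a label and one or more time segments,
# each segment with a start/end (seconds) and a confidence score.

def build_timeline(annotations):
    """Flatten label annotations into timeline entries sorted by start time."""
    timeline = []
    for ann in annotations:
        for seg in ann["segments"]:
            timeline.append({
                "label": ann["label"],
                "start_s": seg["start_s"],
                "end_s": seg["end_s"],
                "confidence": seg["confidence"],
            })
    # Sorting by start time yields the left-to-right timeline shown in the UI.
    return sorted(timeline, key=lambda e: e["start_s"])

annotations = [
    {"label": "drum", "segments": [{"start_s": 12.0, "end_s": 45.5, "confidence": 0.91}]},
    {"label": "trampoline", "segments": [{"start_s": 3.2, "end_s": 10.0, "confidence": 0.84}]},
]
timeline = build_timeline(annotations)
```

Each timeline entry can then be rendered as a bar on the horizontal timeline, with its confidence score shown alongside.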
The Second Pipeline: analyzing the audio content
The audio in the videos was transcribed by passing it through a speech-to-text system. The transcript was then fed into Natural Language Processing (NLP) systems to identify important entities such as the performer's name, the school they attended, their experience, and more.
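The entity-extraction step can be sketched as a mapping from NLP entity annotations to a structured performer profile. The entity types here (PERSON, ORGANIZATION) mirror those emitted by services such as the Cloud Natural Language API, but the helper and the profile fields are illustrative assumptions:

```python
# Turn NLP entity annotations over an audition transcript into a
# structured performer profile. The first PERSON is assumed to be the
# performer and the first ORGANIZATION their school; everything else
# is kept for free-text search.

def profile_from_entities(entities):
    profile = {"name": None, "school": None, "other": []}
    for ent in entities:
        if ent["type"] == "PERSON" and profile["name"] is None:
            profile["name"] = ent["text"]
        elif ent["type"] == "ORGANIZATION" and profile["school"] is None:
            profile["school"] = ent["text"]
        else:
            profile["other"].append(ent["text"])
    return profile

# Hypothetical entities extracted from one transcript.
entities = [
    {"text": "Jane Doe", "type": "PERSON"},
    {"text": "Juilliard School", "type": "ORGANIZATION"},
    {"text": "ten years", "type": "OTHER"},
]
profile = profile_from_entities(entities)
```

The resulting profile is what gets indexed alongside the video for later retrieval.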
The Third Pipeline: understanding the demographic distribution
The unique faces in each video were identified, and attributes such as age and gender were predicted for every unique face.
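Identifying unique faces typically means grouping per-frame face detections by comparing face embeddings. The greedy clustering below is a minimal sketch of that idea; the embedding vectors and the similarity threshold are illustrative assumptions, not the production model's values:

```python
# Group per-frame face embeddings into unique identities: an embedding
# joins the first existing cluster whose representative it matches
# above a cosine-similarity threshold, otherwise it starts a new cluster.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cluster_faces(embeddings, threshold=0.9):
    clusters = []  # list of (representative_embedding, member_indices)
    for i, emb in enumerate(embeddings):
        for rep, members in clusters:
            if cosine(emb, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((emb, [i]))
    return clusters

# Two nearly identical embeddings and one distinct one -> two identities.
identities = cluster_faces([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
```

Each resulting cluster stands for one unique person, for whom age and gender are then predicted once rather than per frame.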
All the processed and parsed information was mapped onto a horizontal timeline for easy future reference. Users can search through the archive of thousands of videos and find the right mix for their requirements with "Smart Search" – searching for specific artists based on talent, gender, age, or ethnicity. The platform boosts the productivity of the casting team, since they can process videos faster and identify the right talent to begin production with.
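A minimal sketch of the "Smart Search" filter over the indexed metadata might look like the following. The record schema (talents, gender, age) is an illustrative assumption based on the attributes the pipelines extract:

```python
# Filter indexed video records on any combination of criteria;
# a criterion left as None is not applied.

def smart_search(records, talent=None, gender=None, age_range=None):
    """Return records matching every supplied criterion."""
    results = []
    for rec in records:
        if talent is not None and talent not in rec["talents"]:
            continue
        if gender is not None and rec["gender"] != gender:
            continue
        if age_range is not None and not (age_range[0] <= rec["age"] <= age_range[1]):
            continue
        results.append(rec)
    return results

# Hypothetical archive entries built by the three pipelines.
archive = [
    {"video": "a001.mp4", "talents": ["acrobat"], "gender": "female", "age": 24},
    {"video": "a002.mp4", "talents": ["singer", "drummer"], "gender": "male", "age": 31},
]
hits = smart_search(archive, talent="singer", age_range=(25, 40))
```

In production this filtering would run against a search index rather than an in-memory list, but the query shape is the same.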
More possibilities of processing the videos in the future
In its second phase of innovation, the platform looks at the possibility of detecting specific actions in the videos with deep learning models. Specific actions such as cartwheels, backflips, jumps, and more can be identified and documented to make processing, shortlisting, and future referencing easier.