Why we invested in Twelve Labs, the future of video understanding 

Although roughly 80% of global data is in video format, generative AI has mainly concentrated on text and images due to the complexity of video analysis, which requires processing visual, textual, and audio data concurrently. Video analysis is not only complex because of its multimodal nature; the need to recognize objects, emotions, and context, and to effectively search, engage, and communicate with video data, poses further challenges.

Enter Twelve Labs, a startup building multimodal foundation models for video understanding. The overarching problem Twelve Labs solves is video-language alignment: the company specializes in creating machine learning systems that produce powerful video embeddings aligned with human language, which means their models can interpret and describe video content using text. This technology lets customers search for specific moments in a vast video archive by providing text descriptions, or interact with Twelve Labs’ models using text prompts to generate various types of content, such as summaries, chapterizations, and highlights. Ultimately, Twelve Labs is revolutionizing the way we search for and comprehend videos, addressing current limitations in AI. Their technology has versatile applications, including ad insertion, content moderation, media analysis, and highlight reel creation, making them a significant player in the field of video data interaction.
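To make the idea of video-language alignment concrete, here is a minimal sketch of how embedding-based video search works in general. The encoders below are random stand-ins, not Twelve Labs’ actual models; the point is only that once text and video clips share an embedding space, finding a moment reduces to a nearest-neighbor lookup.

```python
import numpy as np

# Stand-ins for a video-language model's encoders. In a real system these are
# learned networks that map text and video clips into the SAME embedding
# space; here we fake them with random vectors so the sketch runs end to end.
rng = np.random.default_rng(0)

def embed_text(query: str) -> np.ndarray:
    vec = rng.standard_normal(512)
    return vec / np.linalg.norm(vec)

def embed_clip(clip_id: str) -> np.ndarray:
    vec = rng.standard_normal(512)
    return vec / np.linalg.norm(vec)

# Offline: embed every clip in the archive once.
clip_ids = [f"clip_{i:04d}.mp4" for i in range(1000)]
clip_matrix = np.stack([embed_clip(c) for c in clip_ids])  # shape (1000, 512)

# Online: embed the text query and rank clips by cosine similarity.
def search(query: str, top_k: int = 5) -> list[tuple[str, float]]:
    q = embed_text(query)
    scores = clip_matrix @ q  # cosine similarity, since all vectors are unit-normalized
    best = np.argsort(scores)[::-1][:top_k]
    return [(clip_ids[i], float(scores[i])) for i in best]

print(search("a goalkeeper saving a penalty kick"))
```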

Twelve Labs initially caught our attention when a team of four young AI engineers won the 2021 ICCV VALUE Challenge, outperforming AI teams from tech giants such as Tencent, Baidu, and Kakao. We have been extremely impressed by the model’s rapid progress and the company’s growth since the challenge. In a short period of time, Twelve Labs has become a leader in the field, featured in the NVIDIA GTC 2023 Keynote and attracting talent like Minjoon Seo, a professor at the Korea Advanced Institute of Science & Technology (KAIST) who now serves as Chief Scientist. The talent Minjoon brings as a distinguished NLP research scientist, coupled with the computer vision expertise of CTO Aiden Lee, further validates Twelve Labs’ ability to build powerful large multimodal models for video understanding.

Twelve Labs is not only providing a cutting-edge video understanding solution but also a developer platform set to release APIs for video moment retrieval, classification, and video-to-text generation to address downstream tasks. Essentially, Twelve Labs is bringing a new video interface that makes working with video as easy as working with text, giving enterprises and developers programmatic access to all of the semantic information that resides in their video data. This developer-friendly approach has already attracted 20,000 developers to the platform during the beta phase. Further, they recently announced that their Pegasus-1 model already outperforms existing models on video summarization benchmarks, demonstrating a significant improvement in video understanding.
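As an illustration of what that kind of programmatic access could look like, here is a hedged sketch of querying a video-understanding API over HTTP. The base URL, endpoint path, payload fields, and response shape are all assumptions made for illustration; they are not Twelve Labs’ actual API.

```python
import requests

# Placeholder values, NOT Twelve Labs' real endpoint or schema.
API_BASE = "https://api.example-video-platform.com/v1"
HEADERS = {"x-api-key": "your-api-key"}

# Search an already-indexed video archive for a moment described in natural
# language. The path and payload shape here are illustrative assumptions.
resp = requests.post(
    f"{API_BASE}/search",
    headers=HEADERS,
    json={
        "index_id": "my-video-index",
        "query": "ceo announcing the new product on stage",
        "limit": 3,
    },
)
resp.raise_for_status()

for hit in resp.json()["results"]:
    # Each hit is assumed to carry the source video and a time span.
    print(hit["video_id"], hit["start_sec"], hit["end_sec"], hit["score"])
```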

Our investment in Twelve Labs underscores the company’s extraordinary potential to revolutionize video understanding. With a team of exceptional talent, a robust technology foundation, a customer-centric approach, and a well-defined vision, Twelve Labs is poised to become a pioneer and a dominant vision-language model partner for enterprises. We hope our investment can help accelerate the widespread adoption of this transformative technology, empowering millions of users and paving the way for a future of seamless video interaction.


Michael Shim is an investor at Samsung Next. Thomas Choi supported this deal. Samsung Next’s investment strategy is limited to its own views and does not reflect the vision or strategy of any other Samsung business unit, including, but not limited to, Samsung Electronics.
