Q&A: The future of video synthesis

Nov 7

Synthetic media will soon enable businesses and content creators to make high-quality, highly realistic, and highly personalized AI-generated video at scale. To learn more, we sat down with Victor Riparbelli, the founder and CEO of Synthesia, creator of the world's first text-to-video platform.

To start, tell us what Synthesia is and what it does.

Sure. Synthesia is a content generation platform that uses deep learning to create video content. Our mission is to make it as easy to create a video as it is to write an email. That's kind of our tagline.

If you look at what's happening in the world right now, we are moving into a video-first world, where most of the experiences we have online are based on video in some way or form. I'm sure everyone is aware of this. I think that's a trend that's only going to continue until we are in a more or less video-only world.

If you look at something like TikTok, for example, it's probably the first social media platform that is more or less video-only. There's no text in the interface when you're browsing through it, whereas something like Instagram still has the idea of captions, and there's lots of text in the interface.

This creates a problem for lots of creators and brands because the way in which we produce video has evolved on one vector, which is that we all have cameras in our pockets right now, but it hasn't really evolved in terms of how we can digitally produce video content. That's still something that's mainly confined to Hollywood, with big budgets, visual effects studios, and all sorts of jazz we know from the movies.

That's different from something like text, where, once we invented computers, text became a digital asset. With photos, we have something like Photoshop, which allows you to easily create any photo, more or less. You don't really need to use a camera, although that obviously doesn't mean we don't use cameras anymore.

With audio (I come from a background as a hobbyist music producer), we can synthesize on a laptop all of the instruments, effects, and amps that you would otherwise have in a 10-million-dollar recording studio in L.A.

But video has kind of still been standing still. The interesting part is that AI and deep learning and neural networks have now gotten to the point where they can imitate the real world, particularly in the domains of speech and video generation, to the point where it's almost indistinguishable from the real thing. That's what we are tapping into.

We've built the world's first text-to-video platform. Essentially, the way it works is that you select an AI presenter, either one that is built into the platform or one you create by uploading footage of yourself, which is a three-to-four-minute process. Then you simply type in text, and we'll generate a talking-head-style video of you or the avatar you've chosen performing that to the camera.

The big long-term vision is that in 10 or 15 years, you could create a Hollywood film on your laptop. That’s the big, bold vision.

What goes into the process of creating this digital representation of yourself on your platform? Walk me through the steps of how it works.

In terms of the avatars, the process is quite simple. The way that these algorithms work is that essentially they look at data of you talking and then learn to replicate that in a very believable way.

So literally what you need to do is just talk to the camera for three to four minutes. We give you a script that you can follow if you want to. It doesn't really have to be that script, but a lot of people find it difficult to freestyle for four minutes. It's harder than you think, so we have something you can just kind of follow.

And then, the input is the output. So if you send us a video where you're sitting in a little confined booth and the lighting is not very good, that's what we're going to replicate. So you probably want to get to a point where there's some decent light on your face. Ideally, you want to go into a studio, have a green screen, and things like that. But that isn't really necessary. It depends a lot on the use case.

How did you get started? What made you interested in creating this technology?

So I started my first company when I was in my late teens doing online marketing and e-commerce, which back then was not as common as it is today, and I built a kind of growth-hacking product mindset.

I worked in a few startups as part of my education. I went to Stanford to do the last semester of my degree. And when I went to Stanford, it kind of underlined that what had always been my hobbies (science fiction, gaming, the odd side of life) was really what I wanted to focus on.

To make a long story short, I started working with augmented reality and virtual reality, basically working with the U.K. government on how to build a solid ecosystem around the creation of concepts for VR and AR. This was back when VR and AR were still very hot. It was kind of the beginning of that new wave.

When I got into it, I quite quickly realized that while I'm personally a big believer in VR and AR, I think that it's still five to 10 years away from really becoming a major platform. But what I did find very interesting was a lot of the technologies that are used in AR and VR, and I stumbled across Matthias Nießner, who's my co-founder today, a professor at Stanford back then who had created Face2Face, which is the world's first deepfake-type technology.

Everyone else saw fake news and dangerous stuff. I definitely saw that as well, but what I really saw was a glimpse of the future of content creation. It's one of those things where I think it's a good idea disguised as a bad idea, and that's why we decided to focus on commercializing this technology because I think there's a massive commercial opportunity.

This technology applies to video, which is already everywhere. And for me personally, this is also a really interesting intellectual challenge in bringing a new type of media to life, both in its commercial aspects and in its ethical and cultural aspects... and in seeing all the kinds of cool stuff that are going to be built with this.

From a technological perspective, what makes Synthesia differentiated from the competition?

I think that we have a lot of really smart people in the company building these things. We have both deep learning experts and people who come from a background in visual effects from the Hollywood scene. I think it's the combination of those two things that makes this stuff actually work in a production sense, where it has value, and it's not just a fun gimmick.

As with any technology, it gets democratized and it gets easier and easier. If you go one-and-a-half years back, I think Synthesia was the only one that had a production version of this kind of technology. Now it's obviously different. I still think, from a short-term perspective, R&D is a massive competitive advantage, and we'll keep innovating and expanding on the quality and features of the product.

How are your customers and partners using your technology today?

The main areas are definitely onboarding, training, and education, and those come in different forms. So there's the kind of standard linear video: we have a lot of customers creating courses, which could be anything from internal processes to sales enablement to product demo videos. Even the simple fact that you can type in your videos rather than having to do a voiceover or record yourself is a hundredfold improvement in workflow, and one person can do it. You don't need a team of people to do it. That's the main use case.

That could be served in many different ways. It could be served as a normal video on your website. It could be served in a chatbot-style interface, which is something we're increasingly seeing. And it can also be served with additional personalization on top of the normal video.

I think this is the really interesting promise of synthetic video and it’s something that's unique to synthetic video. If you're just creating a linear training video, you could do that with a camera. It'd be a lot more expensive and it'd take a lot more time, but it's technically possible to do. But you would not be able to film 50,000 different versions of a course with a normal camera.

Why do you believe more personalized avatars are a better way to communicate these things? Why is video a better medium to use for this?

I think there are two answers here, or two ways of looking at it. One is that we're not competing with traditional video production. We are not saying that you should replace your beautifully designed branding video, or that you shouldn't create a TV ad or something like that.

We see ourselves competing with text. We want you to take the things that you currently only have as text and make them into video. That's the foundational idea. And this is all about information delivery, and the fact that, as humans, our attention to and retention of content are very linked to how we consume that information. So if you read something in text, generally speaking, you'll remember around 10 percent of that. If you watch something in video, it's about 80 percent.

Then you add the personalization aspects on top of it, and there's been loads of great research done into how that affects how you retain and learn things. The base layer of personalization, the one that most people think of immediately, is that it'll say your name or your company. And that definitely heightens your attention, but where it gets more interesting is once you start looking at the actual content.

With these types of technologies, you could create a fully personalized journey through the video, depending on who is actually watching it. Because it's so scalable, it is actually feasible to create these videos.

That's where personalization becomes really, really interesting. It's where you can kind of stitch together information depending on who the viewer is. And that's very unique to synthetic video because video today is a one-to-many medium, generally speaking, but synthetic video makes that into a kind of one-to-one style of communication.
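
As a rough illustration of the "stitch together information depending on who the viewer is" idea, the Python sketch below assembles a different script per viewer from shared text segments. It is not Synthesia's implementation or API: the `Viewer` fields, the segment texts, and the `build_script` helper are all hypothetical, and turning each script into a rendered video is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Viewer:
    # Hypothetical viewer record; real personalization data would differ.
    name: str
    company: str
    role: str


# Hypothetical role-specific segments, written once and reused per viewer.
ROLE_SEGMENTS = {
    "sales": "Today we will cover the new pricing sheet.",
    "engineering": "Today we will cover the deployment checklist.",
}
DEFAULT_SEGMENT = "Today we will cover the company handbook."


def build_script(viewer: Viewer) -> str:
    """Stitch a greeting and a role-specific segment into one script."""
    greeting = f"Hi {viewer.name}, welcome to {viewer.company}."
    body = ROLE_SEGMENTS.get(viewer.role, DEFAULT_SEGMENT)
    return f"{greeting} {body}"


viewers = [
    Viewer("Ada", "Acme", "sales"),
    Viewer("Lin", "Acme", "engineering"),
    Viewer("Sam", "Globex", "finance"),  # no matching segment, uses default
]

# One script per viewer; each would then be typed into a text-to-video tool.
scripts = {v.name: build_script(v) for v in viewers}
for name, script in scripts.items():
    print(f"{name}: {script}")
```

The same loop works unchanged for three viewers or 50,000, which is the scaling point made above: the per-viewer cost is a text substitution rather than a new shoot.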

Right now there seems to be a bifurcation between avatars that are photorealistic, and those that look more like cartoons. Why is the photorealistic use case better for what you are doing?

That's a great question, and I think different avatars have different types of use cases and different types of contexts. In terms of what we are doing, I think the main part is that what people are used to is real humans, and a lot of companies want to have real humans. They don't want to have some kind of cartoonish characters to represent their brand. That is one part of it.

The second part of it is just that as humans, depending on the context and the kind of information that's being conveyed to you, you'll trust it more and it just feels more like a professional video. So I think the photorealistic avatar has high importance in terms of being able to create video content that is somewhat as engaging as a real video.

Since you went this route, how do you avoid this concept of the uncanny valley, where the brain just inherently knows that it's not a real human, based on movement or gestures?

I think when we and our customers have done focus groups, it's on average around 80 percent of people who cannot tell that they’re watching a synthetic video. If you tell them, they can tell, but most people won’t realize it. So I think the uncanny valley is an interesting concept.

I think it's very dependent on the type of content. If you're doing an AI avatar that tells jokes, it's going to be hard to do because the avatars cannot create or convey emotion. That's obviously going to change in the future, as the technology gets better and better.

But if you're watching a very standard corporate video that tells you about a specific process, that is going to be delivered in a very neutral style. In those cases, we can actually kind of escape the uncanny valley to a certain degree. So again, I think it's very dependent on the use cases.

Now, from a technical perspective, those are things that we're working on. We want to have more emotion. We want to add more body movements. And given the advances we've seen in recent years with these technologies, I think that is not far away. I can say that being on the inside of a company that's developing exactly this type of stuff.

Looking forward, what are you most excited or optimistic about in terms of synthetic media?

As someone who's been in the synthetic media space for nearly four years now, seeing it get to a point where we have actual products that scale is incredibly exciting. If you couple that with just the tremendous growth that we are seeing, and the very real use cases, and the very real customers that are flooding in to create synthetic videos, that's great.

When I then talk to my peers who are doing synthetic speech or synthetic photos, they're saying the same thing. I think that's really exciting because the conversation around synthetic media has changed a lot over the last few years.

We're seeing this massive adoption of synthetic video. It's still early days, but just with the growth from ourselves and our peers, I think that's really exciting. And that's obviously linked to the fact that these technologies are now getting to the point where you can actually do it at scale. And those two things in combination with each other, the quality and the scale, are what make this broadly accessible.

Thanks for taking the time to chat with us, and best of luck with Synthesia.

Thank you as well.

Victor Riparbelli