Photo Credit: Ashley King
What do you do as a social media company with millions of hours of video containing human movement? ByteDance’s answer seems to be the creation of an AI video generator dubbed ‘OmniHuman-1.’
Over the last few years, generative AI has focused on improving text answers, logic, and understanding, and on generating images and video. While video generation has lagged behind the others, ByteDance’s OmniHuman is an eerie look at what’s coming. OmniHuman-1 is a new multimodal video generation framework that can be fed a single image and bring it to life, complete with audio.
OmniHuman-1 can combine video, audio, and near-perfect lip syncing to turn a single picture into a video of something that never actually happened. The model can produce some startling results, which are illustrated on the model’s GitHub page.
Examples provided include an AI-generated young Albert Einstein speaking and videos of Taylor Swift singing and dancing. Six videos generated from images of Taylor Swift depict scenes in which the singer never performed.
Under the heading ‘Singing’ on the official GitHub page, ByteDance says, “OmniHuman can support various music styles and accommodate multiple body poses and singing forms. It can handle high-pitched songs and display different motion styles for different types of music. Please remember to select the highest video quality. The generated video quality also highly depends on the quality of the reference image.”
Most of the example videos on this page are hard to identify as AI-generated. Some hallmarks of AI content do remain, such as bokeh and blurry backgrounds, strange and slightly jerky movements, and slight mouth movements that don’t quite match up with what’s being said.
But the level of AI generation here highlights how quickly generative AI video is progressing under ByteDance. The tech company has access to hundreds of millions of hours of video of TikTokers dancing, singing, lip syncing, and more to train on. Each TikTok and Douyin video created is another data point for perfecting AI-generated human movement. In short, deepfake creation has never been easier, for those who create these models.