Hands on with Video Generation

Of all the modes of generative AI, including text, audio, image, and code, video generation is still one of the least mature and most complex. This post explores video generation on a couple of the most capable current platforms, but the point isn't really to see what the technology can do right now: it's to see what's on the near horizon.

I’ve written elsewhere about the multimodality of generative AI, and why I’m convinced that increasingly multimodal platforms are the future of the technology. OpenAI’s forthcoming update will put image recognition, image generation, advanced data analysis (coding), and internet search all into one platform. I don’t think it will be long before audio (including their newest text-to-speech model) and then video and 3D assets are included in that mix.

It’s an obvious step towards multimodal content generation, including educational resources. Beyond just creating endless PowerPoint decks, it will mean generative AI has the capacity to make videos on demand. I’ll show an example of what I mean later in the article.

It also has implications for virtual, augmented, and mixed reality. Given the hardware that's just on the horizon, like the Apple Vision Pro, developers will be scrambling for ways to make immersive content with generative AI. All up, I don't think it will be more than six months from now (November 2023) before users can create reasonable 3D assets and videos and drop them directly into VR editing software.


How does video generation work?

I'm using two platforms here: Runway and Pika. Runway is definitely leading in terms of quality and the range of tools on offer. In fact, it's building towards an entire editing suite: beyond video generation, it offers editing tools similar to the AI-powered generative fill and expand in Adobe's products, as well as more standard tools like colour correction, depth of field adjustments, and subtitling.

Pika is, like Midjourney, currently found inside Discord. It includes the ability to generate images from text, or to “animate” still images which you provide.

Both platforms are built on the same premise: a diffusion-based image generation model, with motion added on top. Runway's Gen-1 relied on existing videos, editing the frames directly. Gen-2, which has recently been upgraded further, can generate videos from text and images.

Runway has both an app and a website. The website offers more functionality.
Pika lives in Discord, which makes me feel old.

Creating clips from text and images

In both platforms, the difference between a text prompt and an image prompt is that text prompts give a "classic" image generation experience, whereas image prompts add (generally quite subtle) movement to an existing image.

There’s a place for both approaches. Runway’s text-to-video is pretty impressive, although it does struggle with complex generations and lots of movement. Here are two examples from Runway’s Gen-2 text-to-video for comparison:

Underwater scene, cinematic documentary about coral and tropical fish, shallow depth of field, impressive nature video

Model: Runway Gen-2 text-to-video

Shallow depth of field blurred footage people walking down a busy street. Brightly lit, daylight, English city street.

Model: Runway Gen-2 text-to-video

The underwater scene has stuck reasonably close to the prompt, but the fish aren’t really fish, they’re more like… floating glitches. In the crowd scene, I deliberately prompted for lots of blurring because faces will always be an issue, but it has also had problems with the movement – people are walking backwards, feet are attached wrong, and it’s altogether a bit creepy.

As a result of these limitations, it’s best for now to go for simple subjects, and very explicit prompts like this:

Tropical bird on a tree, shallow depth of field, highly detailed nature documentary, closeup, rainforest background, background blur

Model: Runway Gen-2 text-to-video

Pika takes a very similar approach, and borrows a few terms from other image generation platforms like Midjourney for controlling things like aspect ratio (the width and height of the video). Here's an example of some pieces of ginger falling from the sky, which I'll use again later:

gravity defying floating pieces of cut ginger fall from the top of the screen -ar 16:9

Model: Pika

You have more control over the visual style with an image-based prompt, but less control over the movement. For example, you can generate an image in an image generator like DALL-E 3, Adobe Firefly, Bing Image Creator or Midjourney, and add some movement to the image:

Still image generated in Midjourney
Video generated in Runway Gen-2 image to video

As you can see, the visual style is the same, but it still struggles to keep the character and movement consistent throughout. Again, these are “now” problems that will likely be solved soon given the trajectory of the technology.

Making an entire video

Obviously you can’t do much with a 3-4 second clip, but it’s not difficult to generate a few scenes and stitch them together in any video editing software. Because I’m trying to learn it myself, I’m using Adobe Premiere Pro here but that’s definitely over-complicating things: I’ve also stitched videos together and added audio using TikTok on my phone.

The process for making an entire video is the same, with a little more planning. Here’s a video I made “advertising” lemon and ginger tea (because that’s what I was drinking when I decided to make a video…)

The videos were generated through a combination of text- and image-to-video prompts in Pika. I used Pika because, unlike Runway, I wasn’t running out of generation credits. Here is the process:

  1. Generate still images (the cup of tea, the packet of tea at the end) in Midjourney
  2. Generate videos with text-to-video in Pika (shots of lemons, shots of ginger)
  3. Generate videos with image-to-video in Pika
  4. Drag and drop clips into the correct order in Premiere Pro and slow some clips down to extend the length of time.
  5. Add a couple of basic transitions and text in Premiere Pro
  6. Generate an audio track in Stable Audio and add it to the video in Premiere Pro
  7. Export the finished movie

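Steps 4–7 don't strictly need Premiere Pro, either. As a rough sketch of the same assembly work done from the command line, the snippet below builds the two ffmpeg commands that would stitch the clips together and lay the Stable Audio track over the result. This assumes ffmpeg is installed, and all of the file names are placeholders standing in for the generated clips:

```python
from pathlib import Path

def build_commands(clips, music="stable_audio_track.mp3", out="final_video.mp4"):
    """Build ffmpeg commands for the assembly steps: concatenate clips, then add audio."""
    # ffmpeg's concat demuxer reads its inputs from a text file, one "file '...'" line each
    concat_list = "\n".join(f"file '{c}'" for c in clips)
    concat_cmd = "ffmpeg -f concat -safe 0 -i clips.txt -c copy stitched.mp4"
    # Mux the music over the stitched video; -shortest stops at the shorter stream
    audio_cmd = (
        f"ffmpeg -i stitched.mp4 -i {music} "
        f"-map 0:v -map 1:a -c:v copy -shortest {out}"
    )
    return concat_list, concat_cmd, audio_cmd

# Placeholder names standing in for the exported Pika/Midjourney clips.
# To slow a clip first (step 4), something like:
#   ffmpeg -i clip.mp4 -filter:v "setpts=2.0*PTS" -an clip_slow.mp4
clips = ["lemons.mp4", "ginger.mp4", "tea_cup.mp4", "tea_packet.mp4"]
concat_list, concat_cmd, audio_cmd = build_commands(clips)
Path("clips.txt").write_text(concat_list + "\n")
print(concat_cmd)
print(audio_cmd)
```

Transitions and text overlays are where a proper editor (or even TikTok) still earns its place, but for simple cut-together sequences like the tea advert, this is the whole job.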
The entire process took about 25 minutes from start to finish. I could perhaps achieve something similar (and higher quality, for now) with stock video footage, but it wouldn’t be much more straightforward.


Implications for education

There’s a whole lot of hype around the “revolutionary” potential of AI in education, but I think that this kind of workflow, when it’s moved a little further along, will have some genuine implications in education.

Online courses are more popular than ever, and platforms like TikTok already reportedly outperform Google Search for younger demographics looking for advice. Designing and delivering content through videos is easier than ever but still requires some amount of skill. With these multimodal generative AI technologies, there is the potential to make this a lot easier.

I'm going to show you a video that I made following the steps above, with the addition of a script and text-to-speech narration using ElevenLabs' voice model. But instead of the process I outlined for the lemons video, I want you to imagine it's a few months from now and we have access to a more sophisticated multimodal model.

Here is a fake, not real, hypothetical prompt for a future multimodal platform which I'll call ChatGPTvX. This platform has text and voice prompting, image generation, video and audio generation, text-to-speech capabilities, and the ability to compile, edit, and export video files. It's not such a stretch. As mentioned earlier, ChatGPT already has many of these features built in, and I've even made some headway getting ChatGPT to generate videos in Advanced Data Analysis.

Hypothetical multimodal prompt:

Create a 1 minute long explainer video exploring bias in generative AI datasets. Use an illustrated style that blends realistic images with infographic style symbolic images. Use a blue and white colour palette, and transition smoothly between scenes every 3-4 seconds.

For the audio, create a voiceover in a British female voice. Generate the script, based on the following blog post: LINK. Add music at 30% volume: ambient, underwater, synthetic, deep blue vibes.

Export as .mp4

Model: ChatGPTvX (not a real model!)

Forget chatbots and "endless questioning": the ability to instantly create short, engaging content aligned to quality educational materials, with the oversight of an experienced teacher, is where I'd like to see us focusing our attention.

If you’d like to chat about generative AI, or want to get in touch to discuss GAI policy or professional development, please contact me with the form below:
