Making meaning with multimodal GenAI

As much as Generative Artificial Intelligence has caused waves in education, the focus of research and publications on the impact of GenAI is still squarely on text-based models, and in particular ChatGPT. That’s understandable considering the impact OpenAI’s chatbot had almost immediately from its launch in November 2022. But by focusing attention on large language models like GPT, we neglect the opportunities and the challenges presented by multimodal generative artificial intelligence.

The term “multimodal” is being used liberally in the tech industry by the companies developing applications like ChatGPT, Copilot and Gemini. It’s an attractive marketing term used to describe models which can work with text, images, audio, and video, but it’s also a term worth interrogating.

But multimodality is a concept much more complex than the ability to produce an image from text. In my research, I draw on definitions of multimodality and multiliteracies from scholars such as Gunther Kress and others of the New London Group. Kress’s multimodality provides some of the language with which to discuss and interrogate these technologies, looking beyond the simple affordances of multimodal AI as a tool for interpreting and creating texts and towards the societal implications of a technology which can be used to make meaning.

I’ve attempted to define multimodality in the context of GenAI in an earlier post, so I’ll just summarise the key points here. Multimodality is more than just the ability to communicate in different modes such as text, image, sound, and gesture; it encompasses how meaning is made, changed, and transferred between modes, considering the affordances and limitations of each mode and their combinations. Transduction refers to the process of changing meaning across different modes, such as converting a text prompt into an image or audio output using generative AI. Transformation, by contrast, involves altering the form or representation of content within the same mode, such as turning a novel into a script or condensing a lengthy document into bullet points.

The X-to-Y of Multimodal GenAI

Trying to come up with a way to express the complexity of multimodal generative AI beyond just text generation, I started to wonder about the various combinations of modes and how they each impact upon meaning. At the simplest level, text-to-text models like GPT interpret text-based prompts and output text in return. Without going into too much technical detail here, as I’ve discussed elsewhere this is exactly what language models are designed for, and in my experience it’s the way most people are currently using generative artificial intelligence.

Another popular use is text-to-image, which can be achieved in ChatGPT provided you have a Plus subscription, or elsewhere via a plethora of free applications including open source models such as Stable Diffusion. Text-to-text and text-to-image aside, however, there are many more ways that generative artificial intelligence can transduce meaning from one mode to another.

Rather than listing all of the possible combinations, I’ve created this table. As an aside, I created the following table with a colour gradient by using Claude 3 Opus to write a Python script. It uses the python-docx module and a simple algorithm to create the table and shift the colour gradients in the cells. This is something I surely could have done myself, but it probably would have taken me half an hour rather than 30 seconds.

Table generated with Claude 3 Opus using Python
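For anyone curious about what that kind of script involves, here’s a minimal sketch of the approach (not the exact script Claude produced, which I haven’t kept). It assumes python-docx is installed and uses the common workaround of editing the cell XML directly to apply background shading, since python-docx has no high-level API for cell fills:

    from docx import Document
    from docx.oxml import OxmlElement
    from docx.oxml.ns import qn

    def shade_cell(cell, hex_colour):
        """Set a cell's background fill by editing the underlying XML."""
        shd = OxmlElement("w:shd")
        shd.set(qn("w:val"), "clear")
        shd.set(qn("w:fill"), hex_colour)
        cell._tc.get_or_add_tcPr().append(shd)

    modes = ["text", "image", "audio", "video", "spatial"]

    doc = Document()
    table = doc.add_table(rows=len(modes) + 1, cols=len(modes) + 1)
    table.style = "Table Grid"

    # Row and column headers share the same set of modes
    for idx, mode in enumerate(modes, start=1):
        table.cell(0, idx).text = mode
        table.cell(idx, 0).text = mode

    for i, row_mode in enumerate(modes, start=1):
        for j, col_mode in enumerate(modes, start=1):
            cell = table.cell(i, j)
            cell.text = f"{row_mode}-to-{col_mode}"
            # Shift the green and blue channels by grid position for a simple gradient
            shade_cell(cell, f"80{255 - i * 25:02X}{255 - j * 25:02X}")

    doc.save("multimodal_table.docx")

The gradient logic here is deliberately crude, but it illustrates how little code is needed to automate this kind of document formatting.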

As you can see, there are many more combinations of multimodal generative AI than might first appear, and this is a fairly simplistic two-dimensional representation. It doesn’t take into account that these technologies can, in theory, combine multiple modes at once. For example, it’s possible to turn a text transcript into both visuals and audio through the creation of talking avatars in an application like HeyGen. For now, though, this serves as an introduction to some of the combinations, most of which are available right now.

I’ve included “spatial” as a mode, acknowledging that the placement, positioning, layout, and structure of digital texts often convey as much meaning as the text itself. I’d also like to extend the spatial mode to encompass technologies such as virtual reality (VR), augmented reality (AR), and mixed reality (MR). In the future, perhaps replacing the as-yet unknown ???? category, I’d also consider Kress’s “gestural” mode. Generative Artificial Intelligence has not yet merged fully with robotics – one of the primary uses I can imagine in the gestural mode. However, if/when AI-generated avatars pull themselves out of the ‘uncanny valley’ and demonstrate more realistic (and compelling, emotive, and persuasive) facial expressions and body language, then the gestural mode will certainly become important.

Here’s a non-exhaustive list of examples of technologies which exist right now in various combinations.

text-to-text

Without wanting to state the obvious, text-to-text models are by now a well-established application of generative AI. Since ChatGPT’s release in November 2022 we have seen hundreds of applications, including a fair few “foundation models” – large models which other smaller applications are built on.

The most commonly used are OpenAI’s ChatGPT, Microsoft Copilot (which uses the same underlying model as ChatGPT), Google Gemini, and Anthropic’s Claude. IBM and Amazon each have their own models, which are less popular in the public sphere but will be widely used in industry. And there are dozens of other major players, including many available as open source. France’s Mistral and Meta’s Llama are powerful examples which can be built upon and used via platforms like Hugging Face.

HuggingChat allows users to experiment with a range of open source models in a now-familiar chatbot context

All of these models share common features, allowing a user to input text and generate novel text in response. The output could be anything in any language which makes up a large enough part of the training data, making text-to-text models useful for translation, code generation, and many writing tasks.
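As a rough illustration of how one of those open source models can be used programmatically, here’s a minimal sketch using Hugging Face’s transformers library. The tiny GPT-2 model is used only because it runs on modest hardware, and the prompt is just an example:

    from transformers import pipeline

    # Load an open source text generation model from the Hugging Face Hub.
    # GPT-2 is small enough to run locally; swap in a larger instruction-tuned
    # model (e.g. a Mistral or Llama variant) for genuinely useful output.
    generator = pipeline("text-generation", model="gpt2")

    result = generator(
        "Generative AI is multimodal because",
        max_new_tokens=40,
        num_return_sequences=1,
    )
    print(result[0]["generated_text"])

The same pattern – text in, text out – underpins everything from chatbots to code assistants; what changes is the size and training of the model behind the pipeline.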

text-to-image

Image generation is also well established by now, with both open source and proprietary models exhibiting huge leaps forward in quality in the past two years. As an example, compare the output of OpenAI’s DALL-E 1 in August 2022 to its successor, DALL-E 3, in April 2024:

DALL-E 1 August 2022. I have since lost the exact prompt, but it looks like “students in a classroom”
DALL-E 3 April 2024. ChatGPT writes its own prompts, in this case a rather convoluted “A high-resolution photographic image that captures the essence of a real-life classroom setting. The scene includes a diverse group of six students of various ethnic backgrounds, both male and female, seated in a contemporary classroom. The classroom is bright, with natural light coming through large windows, enhancing the studious atmosphere. The students are engaged in their studies, some looking at laptops, others reading books. Surrounding them are walls adorned with educational posters and a blackboard covered in mathematical equations.”

Despite some obvious issues with DALL-E 3’s image, and the terrifying complexity of its prompt (I asked for “photo of students in a classroom”), it’s a clear step up from OpenAI’s earlier attempts.

And DALL-E is far from the most competent image generation model. Other models (typically ones with even less regard than OpenAI for copyright, and therefore larger datasets and fewer guardrails) are even more capable. Midjourney, for example, can produce much higher quality images:

Four options generated in Midjourney v6. Prompt: photo of students in a classroom –ar 3:2

This doesn’t mean that text-to-image is perfect – far from it. You can still expect some issues, plenty of evidence of bias, and occasional horrific weirdness. Rolling Midjourney’s model back to v5.2, for example, produced this oddity with the exact same prompt:

Is she sad because she’s in a classroom, or because she has six fingers? We’ll never know. Maybe the disembodied spirits floating above these students have the answers…
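For anyone who wants to experiment with text-to-image outside of proprietary platforms like DALL-E and Midjourney, open source models such as Stable Diffusion can be run locally. Here’s a minimal sketch using Hugging Face’s diffusers library; it assumes a CUDA-capable GPU and uses the publicly released Stable Diffusion v1.5 checkpoint as an example.

    import torch
    from diffusers import StableDiffusionPipeline

    # Download and load an open source text-to-image model from the Hugging Face Hub.
    # Half precision (float16) keeps memory usage manageable on consumer GPUs.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")

    # The same prompt used with DALL-E and Midjourney above
    image = pipe("photo of students in a classroom").images[0]
    image.save("classroom.png")

Running the model yourself also makes the limitations discussed above very visible: bias, weirdness, and six-fingered students included.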

image-to-video

In comparison to text and image generation, video generation is in its infancy. Given the pace of change in these technologies, that doesn’t mean much. In fact, image-to-video is already available, although the resulting generations are of much lower quality than image generation.

I have covered both Runway and Pika before, and both platforms offer text-to-video and image-to-video. This provides users with the ability to upload an image (including a generated image) and add motion to the scene, for example controlling camera movement, adding motion to discrete elements of the picture, and so on. Of the two, Runway is currently the more versatile tool, as it offers more editorial control through aspects like “motion brushes” and camera controls.

In the above example, I uploaded an image (generated in Adobe Firefly) and used various controls to add a zoom, a slight vertical movement, and some motion to the clouds.

A moving picture is essentially a series of still image frames, and models like Runway generate video as a series of images. However, as you can see in the resulting video, these outputs often struggle with consistency between frames, leading to weird morphing throughout the scene.

text-to-video

OpenAI’s Sora has taken the limelight recently, and with good reason. The model offers incomparable image fidelity and realism, and while it’s not there yet (weird limbs, bendy physics…) it’s a huge step up from any of the text-to-video models we have seen thus far. It’s also, unsurprisingly, a complete black box in terms of training data. We have very scant information on where OpenAI got their video data, with speculation ranging from the unlicensed scraping of various video sites to the generation of “synthetic data” using game development platforms like Unreal Engine.

Adobe also recently announced that they will be including Sora as one of several models available alongside their own Firefly in an upcoming update to Premiere Pro. This is huge news, not just because it puts the video generation capabilities of Firefly, Sora, Runway, and Pika into an industry standard video editor, but also because it might help legitimise AI use in video production.

Adobe’s clout in the creative industries shouldn’t be underestimated, and adding content credentialing and model support makes GenAI seem a little more respectable (whether the artists being scraped would agree is another question entirely).

image-to-text

Image recognition is already a feature of many LLM-based applications, including ChatGPT Plus and Microsoft Copilot (both of which use OpenAI’s GPT-4V model), Google Gemini, and Anthropic’s Claude. Image recognition in these platforms is hit and miss – while they can come up with very accurate responses, they are also even more prone to “hallucinate”, or fabricate information, when interpreting images.

However, it’s already possible to do some impressive things with image recognition. In a post late last year, for example, I used image recognition to turn a sketch into a functioning website, having the LLM interpret the code structure from the hand-drawn note. I’ve also written elsewhere about turning back-of-the-napkin sketches into complete images:
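As a concrete illustration of how that sketch-to-website trick works under the hood, here’s a rough sketch of sending an image to a vision-capable model and asking for code, assuming the OpenAI Python SDK and an API key in the environment. The model name, file name, and prompt are illustrative, not the exact ones from that post.

    import base64
    from openai import OpenAI

    # Send a hand-drawn sketch to a vision-capable model and ask for working code.
    # Assumes the OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment;
    # the file name and prompt are placeholders.
    client = OpenAI()

    with open("napkin_sketch.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Turn this wireframe sketch into a single-page HTML and CSS layout."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
        max_tokens=1500,
    )
    print(response.choices[0].message.content)

The interesting part is that the image is simply passed alongside the text prompt; the model does the work of translating the visual mode into the written mode of code.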

text-to-audio

Text-to-audio might come in the form of voice generation, sound effects, or music. It’s already available and very powerful, including incredibly accurate voice synthesis through platforms like ElevenLabs. In a post last year, I deepfaked my own voice and demonstrated that this platform can be used to create a reasonably convincing but totally synthetic podcast.

The script was created by ChatGPT, and the voices are “mine” (via ElevenLabs) and one of the six ChatGPT voice options. The music in that podcast was also generated from a text prompt, using Stability AI’s Stable Audio platform.

Whilst Stability AI tell us their model was trained only on stock music, other developers have not been anywhere near as cautious. Two new platforms – Udio and Suno – have both been implicated in using unlicensed music in their training data. For all the claims that AI “democratizes creativity”, these companies seem hellbent on reducing creative output to a business model in which they steal intellectual property and use it to generate revenue for themselves – no creators are as yet benefitting from these platforms.

audio-to-video

Video game developers and related companies are working frantically on incorporating generative AI into just about everything. The video games industry actually kickstarted the current wave of GenAI: models like GPT wouldn’t be possible without the powerful processors built by companies like NVIDIA. And NVIDIA has recently released a platform which allows for audio-to-video in the form of Omniverse Audio2Face. The tool lets users upload audio which is then passed to a neural network, generating a 3D mesh of a talking character.

Not one to be beaten by the likes of NVIDIA and HeyGen, Microsoft has just released a technical paper and some demonstrations of its new VASA-1 avatar generation model, which also uses audio-to-video to create talking virtual characters.

text-to-spatial

I’m extending Kress’s spatial mode to include three-dimensional elements such as those which might be used in Virtual Reality, Augmented Reality, and Mixed Reality applications. While we’re not yet at the stage where GenAI can be used to create entire virtual worlds, the confluence of image generation, video generation, and our existing understanding of 3D asset creation means that we’re not far off.

3D design software such as Adobe Substance already incorporates GenAI elements, for example through the use of Adobe’s Firefly image generation model to create textures for 3D objects. Luma Labs, which currently operates inside Discord, can be used to create low-resolution 3D objects from text prompts. And platforms like SkyBox can be used to create simple wraparound 3D environments.

speech-to-speech

In the very near future, the ways we interact with our devices will change significantly. We’ve had Siri and co. for years but, frankly, they’re rubbish. It feels like we are fairly far from having fully functional digital assistants, but Large Language Models will probably change all of that. Speech-to-speech, where a user talks to a device and the device talks back, will almost certainly become the de facto way of operating many systems over the next few years.

Advances in both speech recognition and voice generation, as demonstrated earlier, plus the general advances in the ability of LLMs to interpret user requests, mean that Siri is about to get a turbo-charge. Similarly, Alexa, Google Assistant, and whatever Cortana calls itself these days (Copilotana?) will all get LLMs under the hood.
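Under the hood, a speech-to-speech assistant is essentially a chain of the modes already discussed: speech recognition, a language model, and voice synthesis. Here’s a rough sketch of that loop using OpenAI’s Python SDK; the model names and file paths are illustrative, and a real assistant would obviously stream audio rather than shuffle files around.

    from openai import OpenAI

    # A minimal speech-to-speech loop: transcribe spoken input, generate a text
    # reply with a language model, then synthesise that reply as speech.
    # Assumes the OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment.
    client = OpenAI()

    # 1. Speech-to-text (automatic speech recognition)
    with open("user_question.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )

    # 2. Text-to-text (the language model interprets the request)
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )

    # 3. Text-to-speech (voice synthesis of the reply)
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    speech.stream_to_file("assistant_reply.mp3")

Notice that even this “speech-to-speech” interaction passes through text at every stage – a point I’ll return to below.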

Of course, the sheer volume of speech-to-speech technology on the horizon comes with some serious ethical and social implications. Voice models are essentially deepfake technology, whether they’re being used to power a virtual assistant or to steal someone’s identity. ChatGPT developers OpenAI have written one of their “we’re ethical, honest” blog posts about synthetic voices, in which they discuss how they’re doing some really-serious-hard-thinking, they promise. I’ll be interested to see what government regulation and legislation emerges over the next few years to govern the use of these technologies, since I’m not really convinced it’s best left in the hands of the developers.

????

I haven’t talked about the gestural mode, facial expression, intonation, context, or any of the other complexities of multimodal communication. Some of these, in terms of generative AI, are just not there yet. While some companies can create realistic avatars, for example, they’re still in the uncanny valley, and their excessive blinking and bizarre lip-syncing often give them away. And robotics is still very much an emerging field – we’re a long way off from the time when a robot can convey as much meaning with folded arms and a closed-off stance as a human can. But that won’t be true forever.

As well as the gestural mode, there are plenty of sensory and other modes that can be used to convey meaning, and all of this data can potentially be used to create Large X Models. For example, while the gustatory and olfactory senses (taste and smell) don’t create meaning in themselves, they can evoke powerful memories, emotions, and associations that contribute to the overall meaning of an experience. The taste of a particular dish or the scent of a specific perfume can transport someone back in time or trigger strong feelings, adding depth and complexity to the way they interpret and understand the world around them. In this way, these sensory modes play a crucial role in shaping the meaning we derive from our experiences, even if they don’t directly encode meaning in the same way as language or visual symbols.

The haptic, or touch, sense can equally shape meaning, whether through the feel of certain textures under the fingertips, or of hot/cold against the skin, or any number of emotions and thoughts connected to touch. Generative AI cannot (yet, if ever) interpret, create, or understand meaning through these diverse senses. However, it’s not difficult to imagine a near future where Virtual and Augmented Reality devices include haptic or olfactory functions to create more compelling experiences.

Once we go beyond the obvious X-to-Y combinations of GenAI modes, we start to see some of the limitations of the technology. If we take into account more complex matters of meaning-making, such as temporality (sequencing in time) or context (consider the difference between a smile in a friendly conversation versus a smile in a tense negotiation), it becomes clear that current GenAI struggles to fully capture and convey the nuances of human communication.

Is GenAI even multimodal?

Bill Cope and Mary Kalantzis – literacy scholars and members of the influential New London Group – argue that although GenAI can output data across a variety of modes, it is essentially all reliant on the written mode. In discussing the limitations of GenAI for meaning-making, the authors write:

“Computers can’t mean anything other than zero or one. All they can do is calculate by textual transposition: recorded Unicode > chunked into tokens > binary notation > calculation of the probability of the next token > token > readable Unicode”

Cope, B. and Kalantzis, M. (2024) Literacy in the Time of Artificial Intelligence

For example, for image generation to work, the images in the dataset must be labelled in plain text. This is why a user is able to generate an image from a text prompt: the model applies the associations between labels and images learned by the image generation algorithm. Similarly, in automatic speech recognition (ASR) and in audio generation, the data must be transcribed.
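Cope and Kalantzis’s “recorded Unicode > chunked into tokens” pipeline is easy to see in practice. Here’s a small illustration using OpenAI’s tiktoken tokeniser, which shows how a sentence is reduced to a sequence of integers before any “calculation of the probability of the next token” can happen:

    import tiktoken

    # Load the tokeniser used by a GPT-4-class model and encode a sentence.
    # The model only ever "sees" the resulting list of integers.
    enc = tiktoken.encoding_for_model("gpt-4")

    text = "photo of students in a classroom"
    tokens = enc.encode(text)

    print(tokens)                             # a list of integer token IDs
    print([enc.decode([t]) for t in tokens])  # the text chunk behind each ID
    print(enc.decode(tokens))                 # round-trips back to the original string

Everything multimodal the model does – images, audio, video – ultimately has to be squeezed through a numerical, text-mediated representation like this one.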

This is a serious limitation of the technology for a couple of reasons:

Firstly, GenAI is reliant on the subjective interpretation of the person labelling the data. Kate Crawford also writes about this in her book Atlas of AI, where she speaks of the limitations of classification and the ways in which labelling images, for example, can encode bias, racism, and sexism into the data. In a recent paper, Crawford and Hugging Face researcher and AI ethicist Sasha Luccioni dug deeper into these issues in an analysis of ImageNet, one of the first and largest image datasets and one in which the text/image pairings are known to contribute to bias.

Secondly, this process of transcription loses much of the nuance of multimodal texts. Audio transcription disregards oral elements such as pace, accent, intonation, volume, and tone. Speech also rarely works in isolation and is frequently coupled with facial expression and gesture, or contextual cues which are absent from audio transcripts. This flattening of the audio/speech mode is problematic for both interpreting and creating meaning.

I saw this in action recently when I used Google’s Gemini 1.5 Pro in AI Studio for the first time. Gemini 1.5 Pro has an enormous context window of 1 million tokens, meaning it can process immense amounts of data. It is also a multimodal model with image recognition, audio transcription, and optical character recognition built in. I uploaded a one-hour recorded webinar in which I discussed AI and assessment and asked the Gemini model for feedback. Whilst it was able to provide feedback on the layout of the slides, the structure of the presentation, and the content based on the audio transcript, it was not able to comment on the oral qualities mentioned above, such as the pace and pitch of the presentation.

Like many things with artificial intelligence, there is a chance that we are overselling the multimodal capabilities somewhat and attributing too much value to models which, sophisticated as they are, reduce the world to a simplified two-dimensional representation.

The Practical AI Strategies online course is available now! Over 4 hours of content split into 10-20 minute lessons, covering 6 key areas of Generative AI. You’ll learn how GenAI works, how to prompt text, image, and other models, and the ethical implications of this complex technology. You will also learn how to adapt education and assessment practices to deal with GenAI. The course has been designed for K-12 and Higher Education.

More questions than answers

I want to close this article with a series of questions. I have taken a look at some of the technologies available right now, and the intersections of a handful of modes. But I’ve really only scratched the surface both in terms of what is possible and the implications of the technology. What does generative AI mean for us as a society, and as creatures which thrive on communication in all its varied forms? We have created a pseudointelligence capable of mirroring and echoing those modes of communication, and we have done so without even fully understanding communication between humans.

So, to end, a handful of questions. I might try to answer some of these myself in the coming years, but I’d encourage you to have a think about them too the next time you’re using generative AI.

  • Whose interest does AI represent?
  • Who controls the design of the algorithms?
  • How do users interact with the algorithmic layer?
  • How do we frame and interpret AI-generated outputs that blur the boundaries between modes and genres?
  • How does the digital environment itself act as a frame for meaning-making?
  • How can we develop new forms of assessment that recognise the multimodal nature of learning and the changing landscape of knowledge production?
  • How can we ensure AI systems are used ethically and responsibly?
  • What are the implications of AI for human agency and creativity?

Start with the easy ones.

If you’re interested in a conversation about Generative AI, professional learning, or consulting services, please get in touch via the form below:
