What is Multimodal Generative Artificial Intelligence?


The term multimodal generative AI is getting thrown around a lot recently – even more so now that the most popular models like GPT have added features such as image recognition and generation. But what does ‘multimodal’ actually mean?

What is “Multimodal”?

Although the term “multimodal” might seem self-explanatory, there’s more to it than you might think. The term is being bandied around right now by many AI developers, but I like to consider it from a different perspective. My definition of multimodality comes from linguist and social semiotician Gunther Kress, and is concerned with both the forms of meaning available to us, and the ways in which these forms can blend and shift.

In the typical technology world definition of multimodal, the term simply means “capable of communicating in more than one mode” such as text, image, sound, and so on. By that definition, an application like Microsoft Copilot is multimodal because it can generate both text (using the GPT model) and image (using DALL-E 3).

But I think we need to go a little further with the definition, particularly if we are to account for the kinds of GenAI technology just over the horizon. Multimodality isn’t just concerned with communication in different modes, but how meaning is made and changed between modes, and the various affordances and limitations of different modes. For example, I can convey the meaning of the word “tree” by saying the word aloud, showing an image, writing the word, or even in more abstract terms through modes such as sound and gesture. Each of these “trees,” though stemming from the same concept, has a different meaning.


We also need to consider how meaning is changed when the form or mode changes – something which is particularly relevant for GenAI. Most generative AI applications currently rely on text prompts to generate novel content. That means that the idea must first be written and then transformed (by the algorithm, underpinned by the dataset, and refined through the model’s training) into something else. The change of meaning within the same mode, e.g., text to text, is something Kress would call transformation. We do it all the time without GenAI, such as when a novel is transformed into a script, or a lengthy policy document is transformed into a series of dot-points on an information brochure.

Transduction, on the other hand, is changing meaning across modes, such as from text to image. So in image generation, or audio, or video, we are changing the meaning from one mode to another – or, rather, the algorithm is changing the meaning in response to our prompt.
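
To put those two terms in concrete, practical terms, here is a minimal sketch using OpenAI’s Python SDK: the same written idea is reshaped within the text mode (transformation) and then handed across to an image model (transduction). The model names and the prompt are placeholders rather than recommendations, and the snippet assumes an API key is set in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is set in the environment

idea = "A lone tree on a hill at sunset, symbolising endurance."

# Transformation: meaning reshaped within the same mode (text to text),
# e.g. condensing a longer description into a short caption.
caption = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": f"Rewrite this as a six-word caption: {idea}"}],
).choices[0].message.content

# Transduction: meaning carried across modes (text to image).
image = client.images.generate(
    model="dall-e-3",  # placeholder model name
    prompt=idea,
    n=1,
    size="1024x1024",
)

print(caption)            # a new piece of text
print(image.data[0].url)  # a URL pointing to the generated image
```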

Multimodality is also important because of the different affordances and limitations of modes and combinations of modes. The cliche “a picture paints a thousand words”, for example, suggests that it’s possible to condense meaning in a visual in ways which aren’t possible in written text. This is why advertising uses far more images than writing. Combinations of modes, such as audio/video, can take that even further. But there are also complex relationships between modes such as gesture, body language, and speech which are much harder to capture and convey with digital technologies.

I’ll talk more later about the limitations of GenAI on this point, but for now it’s worth repeating that the term “multimodal” is more complex than simply being able to communicate in image as well as text.

If you enjoy these articles, please join the mailing list for updates:


What is a Multimodal GenAI model capable of?

In 2023 there were several important advances in multimodal GenAI, most of which involved bringing together existing technologies in new ways using more powerful language models. ChatGPT, for example, began as a text-only model: all of the input was text, and so was all of the output (including text in computer programming languages).

The release of GPT-4, however, changed that. OpenAI incorporated image recognition with the GPT-4V (vision) model, and image generation with DALL-E 3. Both of these features are also available through Microsoft Copilot, and Google has its own versions of image recognition and generation. In applications like Stable Audio, text-to-sound is possible through the use of a large audio dataset. Runway allows for text-to-video, though that could be seen as an extension of text-to-image.
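
As a rough sketch of what the image-recognition side looks like in practice, here is how an image can be supplied alongside a written question using OpenAI’s Python SDK – a transduction back from image to text. The model name and the image URL are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is set in the environment

# Image recognition: supply an image alongside a written question
# and get a written description back.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this image convey?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/tree.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```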

Models also have the capacity to take other modes as input. ChatGPT, for example, uses OpenAI’s Whisper voice recognition model to allow a user to speak to the app rather than write. Speech recognition has been around for a long time (think of Siri and Alexa), and is seeing improvements through its combination with Large Language Models (LLMs).
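
For a feel of how spoken audio becomes written text, here is a minimal sketch using the open-source Whisper package; the audio filename is a stand-in for any local recording.

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("base")            # a small general-purpose checkpoint
result = model.transcribe("voice_note.mp3")   # placeholder: any local recording
print(result["text"])                         # the speech, rendered as written text
```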

Outside of LLMs, work is also being done in the spatial and gestural modes, such as through the training of robots through virtual reality or wearable sensors. A user wearing a body suit of sensors can “teach” a robot how to manipulate objects in space by converting their spatial movements to machine-readable code - a transduction of meaning from the spatial mode to text, if we use the earlier definition.
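
There is no single standard way of doing this, but a toy sketch gives a sense of what turning captured movement into machine-readable tokens might look like; the positions, token names, and logic below are entirely illustrative, not any production system.

```python
from typing import List, Tuple


def movements_to_tokens(positions: List[Tuple[float, float, float]]) -> List[str]:
    """Discretise a stream of 3D positions into coarse movement tokens."""
    tokens = []
    for (x0, y0, z0), (x1, y1, z1) in zip(positions, positions[1:]):
        deltas = {"x": x1 - x0, "y": y1 - y0, "z": z1 - z0}
        axis = max(deltas, key=lambda a: abs(deltas[a]))  # dominant direction of the step
        tokens.append(f"move_{axis}_{'pos' if deltas[axis] >= 0 else 'neg'}")
    return tokens


# A short reach forward and then up, as a wearable sensor might record it.
print(movements_to_tokens([(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.1, 0.0, 0.2)]))
# -> ['move_x_pos', 'move_z_pos']
```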

Multimodal GenAI for Creating Texts

All of this opens up many combinations of multimodal creation. It’s not hard to imagine that we will soon be able to use GenAI to create texts through means such as speech-to-video, gesture-to-image, or text-to-spatial. Being able to verbally describe an abstract idea and have it realised in video with sound is probably not too far over the horizon. Prominent companies such as Sony are working on technology that allows users to create in 3D space for virtual environments, and it is already possible to combine these kinds of technologies with GenAI such as image models.
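
One way such chains can already be wired up today: transcribe a spoken idea, then pass the transcript straight to an image model. This sketch again uses OpenAI’s Python SDK; the filenames and model names are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is set in the environment

# Step 1 -- speech to text: transcribe a spoken description of an idea.
with open("spoken_idea.m4a", "rb") as audio:  # placeholder: a local voice memo
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

# Step 2 -- text to image: hand the transcript to an image model.
image = client.images.generate(model="dall-e-3", prompt=transcript.text, n=1)

print(image.data[0].url)  # the spoken idea, realised as a picture
```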

Thinking of GenAI as a multimodal creative tool is, to me at least, far more exciting than most of the uses we’ve seen so far for applications like ChatGPT. It’s definitely possible to crank out reams of text using chatbots – in fact, the internet is already filling with what I call “digital plastic”: synthetic media churned out of language models and image generators. You only have to look at a few Amazon product titles to see how this is already going wrong.

I’ve come across a few others who share the opinion that GenAI should be more than just a means for producing bulk text. Justin Hodgson, Associate Professor of Digital Rhetoric in the English Department at Indiana University, would prefer to see GenAI as “playable media” which “comes alive as people make choices that shape their experience in real-time”. His article focuses on co-creation of meaning, and ways in which writers or text producers can “play” with GenAI rather than just producing content.

In a similar vein, Simon Buckingham Shum, Professor of Learning Informatics at University of Technology Sydney, presents the Writing Synth Hypothesis:

The Writing Synth hypothesis proposes that with the emergence of generative AI, authors will be able to learn writing in new ways, democratising writing just as we saw with music synthesisers.

Now we need to learn to play these new instruments.


There may be new genres of writing that, like the music revolution, were impossible to create without these new tools.

https://simon.buckinghamshum.net/2023/03/the-writing-synth-hypothesis/

Drawing an analogy between GenAI and the musical synth gives another perspective on “playing” with multimodal technologies rather than simply using them as tools. In fact, there has been plenty of writing over the years around the similarities between digital texts and the remixing and sampling of various music genres.

We’re yet to see much in terms of tools built with these kinds of approaches in mind. For every innovative use of ChatGPT by creatives and artists, there are a million more people promoting it as the “best tool for writing marketing copy” or for generating LinkedIn hashtags.

Google has built a collection of AI tools “for rappers, writers and wordsmiths” in collaboration with hip-hop artist Lupe Fiasco, which uses an LLM to make analogies, “explode” words, describe scenes, and so on. Interesting as it is as an experiment, however, it does little beyond what you could achieve with a prompt in any model like Bard, Copilot, or ChatGPT. It’s also not multimodal, limited as it is to just text.

Slouchy Media used Meta’s AI music generation tool, available via the open-source platform Hugging Face, to put together an album in 24 hours to raise money for the Oakland-based not-for-profit Beats Rhymes and Life. The documentary on YouTube shows how the producers used the text-to-music model, leaning on their own knowledge of genre and style and further manipulating the AI-generated samples using “traditional” digital editing software.
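
For anyone curious what that workflow looks like at the code level, Meta’s MusicGen checkpoints can be run through the Hugging Face transformers pipeline; the prompt and output filename below are just examples.

```python
# pip install transformers scipy torch
import scipy.io.wavfile
from transformers import pipeline

# Text-to-music: generate a short sample from a written description of the sound.
synthesiser = pipeline("text-to-audio", model="facebook/musicgen-small")
track = synthesiser("warm boom-bap hip hop loop, dusty drums, mellow keys")

# The pipeline returns the raw waveform and its sampling rate.
scipy.io.wavfile.write("loop.wav", rate=track["sampling_rate"], data=track["audio"])
```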

I’d love to see more examples of people using GenAI across the range of modes, including text, image, sound, and the physical modes of space and gesture, with the same kind of enthusiasm and technical skills as these.

Is it really Multimodal?

I recently read an article that made me question whether GenAI is truly multimodal. Bill Cope and Mary Kalantzis at the University of Illinois are prominent linguists, and draw on their hugely influential work as part of the New London Group in trying to define a “grammar” for GenAI. If you’re into that sort of thing (I am) then you should go and read the article yourself – it’s far too dense to try to convey successfully in this post.

But one important takeaway from the article is that current GenAI is really only centred on the written mode. Even image generation, despite producing visuals as an output, is reliant on writing. We write to prompt the model. Images are generated based on a model which has learned via written labels. Essentially, the written word underpins everything a GenAI model can do. The authors point out the huge differences between written and spoken language, and argue that, as a result, GenAI cannot hope to capture or reflect the full complexity of meaning that humans can convey.

It might seem like an abstraction, but it’s important in this context of creativity and multimodal GenAI – if models are only really dealing with the written mode, then they will be forever hamstrung by the limitations of that mode.

The future of Multimodal GenAI

In my upcoming book, Practical AI Strategies, I’ve written about the (near) future of GenAI. I firmly believe that increased multimodality will be one of the key features of the technology. I think there will be an increased shift towards speech-to-X as opposed to text input, and that there is an imperative for developers to put out models which have as many modal possibilities as they can. I wouldn’t be surprised if we see audio generation in upcoming versions of ChatGPT, for example, and video generation is already a hot research topic for Google and Microsoft.

All of this multimodal technology will also require masses of data. As Cope and Kalantzis noted in their paper, speech is very different from written language and closely entwined with gesture, body language, and facial expression. Developers will use wearable technologies to capture as much of this data as possible. For example, VR headsets can track facial expression at the same time as recording speech. Sensor technology can capture gesture, body language, and movement. Building multimodal datasets will be a high priority in the near future.

An important question is of course, do we want that? Digital technologies, including those used to collect data for GenAI, are already incredibly invasive. Developers are notoriously cavalier with our privacy. Could the collection of ever-more multimodal data make those problems even worse? Absolutely. And if we only ever use these technologies for creating digital plastic, then we will have given away our privacy – our voices, our thoughts, our movements through space – for very little.

Multimodal Generative AI could be a powerful creative force, but we need to find ways to use it which are playful, imaginative, and truly beneficial.

If you’d like to discuss consulting, advisory services, or professional development for GenAI, please get in touch.
