This is the second post in a series exploring the multimodal possibilities of generative AI. This series will take a detailed, hype-free look at text, image, audio, video, and code generation and explore the creative potential as well as the ethical concerns of GAI.
Although generative AI isn’t a new technology, it’s definitely been having a hype moment since the release of ChatGPT in November 2022. Unfortunately, the focus has been squarely on the text-based chatbot to the exclusion of other multimodal forms of generative AI.
In the previous post in this series, I ran through Adobe’s latest GAI image generation product, Firefly. In this post, I’m going to look at (and listen to) a few examples in another mode: audio.
Audio generation AI tools are built on a similar premise to other forms: take a pile of data, run it through an algorithm to “learn” the rules of the form, and then generate novel content. In the case of audio, this might involve music, speech, and sound effects.
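That “learn the rules, then generate” loop can be illustrated in miniature. The toy sketch below is nothing like the neural models behind real audio tools — it’s just the simplest possible version of the principle: count patterns in a pile of training melodies, then sample a novel sequence from those learned statistics.

```python
import random

# Toy illustration of the generative premise: learn statistics from
# training data, then sample novel sequences. Real audio models work
# on waveforms/spectrograms with neural networks; this simple Markov
# chain over note names just demonstrates the idea.

def learn_transitions(melodies):
    """Count which note tends to follow which in the training data."""
    table = {}
    for melody in melodies:
        for a, b in zip(melody, melody[1:]):
            table.setdefault(a, []).append(b)
    return table

def generate(table, start, length, seed=0):
    """Sample a new melody from the learned transition statistics."""
    rng = random.Random(seed)
    melody = [start]
    for _ in range(length - 1):
        options = table.get(melody[-1])
        if not options:
            break
        melody.append(rng.choice(options))
    return melody

training_data = [
    ["C", "E", "G", "E", "C"],
    ["C", "G", "E", "C", "E", "G"],
]
table = learn_transitions(training_data)
print(generate(table, "C", 8, seed=1))
```

The generated melody is “novel” in the sense that it isn’t a copy of either training melody, but it can only ever recombine patterns present in the data — which is also, at a hand-waving level, why training data matters so much for the real models discussed below.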
Like other forms of GAI, audio generation is not without ethical concerns. Privacy and security worries are the most prominent: cloned voices have the potential to breach security measures like telephone banking. Deepfake voices of celebrities are already causing trouble, and fake voices could be used in phone scams by fooling people into believing their friends and family are calling. Music generation, meanwhile, is just as contentious as art generation when it comes to the potential for displacing human creators.
In this post I’ll explore some of the currently available tools, discuss how they work and what they might be used for, and explore some of the potential issues.
Music generation: Google MusicLM and Stable Audio
Although voice synthesis is the most talked about form of generative AI, music models also have great potential to impact the commercial and creative worlds.
The first cab off the rank a few months ago was Google’s MusicLM, a model that can turn text prompts into music samples at 24kHz and can also generate music based on “whistled and hummed melodies” (according to the research paper – that feature isn’t yet available in the public demo).
To access the MusicLM demo, you have to be registered for Google’s AI Test Kitchen, its controlled environment for testing new GAI products. Earlier in the year, the AI Test Kitchen saw the first release of what would later become the Bard conversational chatbot.
Google’s MusicLM examples page demonstrates a range of features and uses that aren’t yet available through the demo, such as using a chain of prompts to create a consistent soundtrack that moves from one style to another. However, given the progress in previous demos, it’s almost certain that MusicLM will be turned into a fully featured product at some stage.
For now, you get the option to generate a short 20-second track from a single text prompt. It generates two options which you can listen to, download, and provide feedback on. Here’s a track generated using the prompt ambient instrumental music, calm, tropical beach sunrise. I’ll use the same prompt later with Stable Audio for comparison.
MusicLM is trained on the open source MusicCaps dataset, which contains over 5000 samples of music labelled by musicians. Google will almost certainly use data from apps and services it owns, like YouTube, to train more powerful models. In fact, complex legal negotiations are already under way, such as a deal between Universal Music Group and Google to determine the rights and royalties of AI-generated music that uses existing artists’ voices or songs.
Stable Audio is a new product from Stability AI, the open source developer behind the powerful Stable Diffusion image model. Stability AI has come under fire (like Midjourney) for its use of intellectual property in building its image generation model, and because the model is open source and has no guardrails it can be used to generate explicit and harmful content.
It’s interesting, therefore, to look at the approach they’ve taken with the audio model. Stability AI have made it very clear where the dataset comes from for Stable Audio: “Our first audio AI model is exclusively trained on music provided by AudioSparx, a leading music library and stock audio website.”
Because it only uses stock music, it also can’t be used to create music “in the style of…”, one of the most problematic features of their image generation. It seems that this time around, they’re trying to avoid a few lawsuits (or if I’m being less cynical, trying to build a more ethical product). Diverging from the fully open source release of Stable Diffusion, there is also a paid “pro” version of Stable Audio which allows for more generations and tracks up to 90 seconds that can be licensed for commercial use.
There are a few more controls, including the ability to change the track duration (from 1 to 45 seconds) and the model, although at the moment only one model exists.
Here’s the beach music prompt from earlier.
Stable Audio can also be used to generate sound effects. Here’s a five second clip of “dog barking” in Stable Audio. When you try this in MusicLM, you instead get a music track “influenced” by the text prompt of a dog barking.
Speech recognition, generation, and language translation
Speech recognition and generation use different processes, but represent two sides of the same coin. The kind of speech recognition that has powered assistants like Siri and Alexa for years has now advanced to much more “human-level” recognition, including of complex accents and dialects, slurred or distorted speech, and speech with lots of background noise.
OpenAI’s “Whisper”, for example, is “trained on 680,000 hours of multilingual and multitask supervised data collected from the web” and presents a highly accurate model of speech recognition. It’s built into the ChatGPT app if you want to try it out – you’ll see right away that it’s much more accurate than the iPhone’s Siri (pre-iOS 17 at least). Google has a similarly powerful model with AudioPaLM, Amazon’s Alexa has been updated with AI, and Apple’s Siri will follow soon enough.
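If you want to experiment with Whisper yourself, the open-source `whisper` package makes it a few lines of Python. The transcription call itself is shown in comments below, since it needs the package installed and a model downloaded; the small helper that follows — which turns Whisper-style timestamped segments into subtitle text — is my own illustrative addition, not part of the package.

```python
# Transcribing with OpenAI's open-source whisper package
# (pip install openai-whisper) looks roughly like this:
#
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe("interview.mp3")
#   print(result["text"])           # full transcript
#   segments = result["segments"]   # timestamped chunks

def to_srt(segments):
    """Format Whisper-style segments (dicts with 'start', 'end' and
    'text' keys) as SubRip subtitles. Illustrative helper only."""
    def stamp(seconds):
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        ms = int((seconds - int(seconds)) * 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Example with hand-written segments in Whisper's shape:
demo = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " This is a test."},
]
print(to_srt(demo))
```

Pairing a transcription model with a formatter like this is one obvious accessibility win: free, reasonably accurate captions for any audio or video you own.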
ElevenLabs has pulled ahead in voice generation, offering a free “instant” version of their voice cloning and a premium version which takes “up to four weeks” to generate using longer samples. They also provide totally synthetic voices for commercial use.
I’ll use ElevenLabs later in the post for a complete example of a “podcast episode”, but here is a side-by-side comparison of my real voice, the instant model, the paid model, and one of the presets (“Dorothy”, the British children’s book narrator).
The prompt in each case is the same: I’m reading an extract from the author’s introduction to Mary Shelley’s Frankenstein.
I saw—with shut eyes, but acute mental vision—I saw the pale student of unhallowed arts kneeling beside the thing he had put together. I saw the hideous phantasm of a man stretched out, and then, on the working of some powerful engine, show signs of life and stir with an uneasy, half-vital motion. Frightful must it be, for supremely frightful would be the effect of any human endeavor to mock the stupendous mechanism of the Creator of the world.

Mary Shelley’s Introduction to Frankenstein, 1831 ed.
Earlier this year there was a media storm when it was revealed that Spotify had provided human audiobook narrators’ voices to Apple for use in training its AI model. The big fear, of course, is that ultimately the real voice actors will be put out of work by the AI. In fact, that’s already happening in projects like this one, which is using AI to narrate out of copyright classics.
You could argue that this technology will make texts more accessible, either by making audiobooks free or by adding voiceover to any text online. There’s also a chance that real, verified human narration will start to fetch premium prices over low or no-cost AI, and that professional voice artists might benefit as a result. Again, there are plenty of potential positives and negatives to GAI audio.
GAI Audio Translation
Just as I was writing this post, another product went viral and started to appear on my feeds with the same frequency as those “historical talking heads” from D-ID a few months back.
HeyGen is an app that, like Synthesia and D-ID, started off as an AI avatar creator. Recently, though, they’ve added a translator which uses ElevenLabs’ instant voice model, translates your speech into another language, and then lip syncs it over your video. It does a good job of making it both look and sound like you’re speaking another language.
At the time of writing, the “free” version of this tool was barely available due to a queue of over 70,000 users. You can, of course, pay to access the model and skip the queue. Still, after a four-day wait, here’s a snippet of one of my existing videos about AI converted into Spanish:
The technology is certainly impressive, and there are obvious applications such as creating online content and then reaching a wider global audience. The quality of the translation is high, and presumably uses GPT (I didn’t dig around enough to find out). While I was skimming the typically obtuse terms and conditions to try to find where your voice and text data actually goes, I found a clause stating that no one can link to their site without express permission, so I won’t bother.
Expect tools like this to become part of the online furniture before long. Translation, voice synthesis, lip syncing, and so on are combinations of technologies already available to the big companies, and it probably won’t be long before we have live video translation built into products like Microsoft Teams and Google Meet.
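In outline, the chain is easy to see: recognise the speech, translate the text, synthesise a cloned voice, then lip sync the result. A tool like HeyGen wires real models into each stage; the sketch below just shows the plumbing, with each stage passed in as a function so the composition is visible. All the stage names are my own — none of this is any vendor’s actual API.

```python
def translate_video(audio, transcribe, translate, synthesise, lip_sync,
                    target_lang="es"):
    """Compose the four stages of a video translation pipeline.
    Each stage is an injected callable; in a real tool these would
    be a speech recognition model, a translation model, a cloned
    voice, and a lip-sync model respectively."""
    text = transcribe(audio)
    translated = translate(text, target_lang)
    dubbed_audio = synthesise(translated)
    return lip_sync(audio, dubbed_audio)

# Stub stages to show the data flow end to end:
result = translate_video(
    "my_video_audio.wav",
    transcribe=lambda a: "Hello, welcome to the channel.",
    translate=lambda t, lang: f"[{lang}] {t}",
    synthesise=lambda t: f"speech({t})",
    lip_sync=lambda video, audio: f"video with {audio}",
)
print(result)
```

Because each stage already exists as a mature standalone model, combining them is mostly an engineering problem — which is why live translated video in meeting software feels like a matter of when, not if.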
Creative applications: GAI podcast
Taking a few of those above tools, I’m going to pull together a quick example of an “AI podcast” that uses my original (human-written) content alongside the synthetic voice and sounds.
At this stage, the technology is too expensive and a little unwieldy to use for creative or commercial purposes, but it’s not far off. Producers are already creating “AI artists” (they’re terrible) and apps like Synthesia can be used to run off AI-avatar videos for compliance and training.
Outside of making corporate training videos that no one wants to watch in the first place, it might be a few months before we start seeing more creative applications of the technology, but it will happen.
Here is episode zero of my brand new podcast (the one and only episode…), using content from the start of this blog post. I’ll just reinforce that I never spoke any of these words out loud.
Here’s a video of the whole process. There are some edits and the speed is increased, but overall this entire process still only took a few minutes. You’ll probably work this out, but the voiceover for the process video was also generated in ElevenLabs. I edited the clips together, and then typed a script while it played. Then, I exported the audio file and dropped it into the video editor, cutting and moving the voiceover audio to match the visuals.
One huge advantage is the ease of adding natural(ish) sounding voiceover to videos like this without needing the time, space, and equipment to make the recording. I work from home and have young kids, so a quiet space for recording doesn’t really exist… I can honestly see myself using this for short process/explainer videos for the blog and social media. I’ll let you know whether it’s me or the AI version, I promise.
This process could also be useful for adding an accessibility voiceover, although right now a long blog post like mine would use too many credits. I signed up for the lowest-level paid service so that I could try the “professional” voice (as opposed to the “instant”), and that account comes with 100,000 mystery credit units. You can generate audio in blocks of 5000 credits, which was not enough to get much further than the couple of minutes used here.
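For a rough sense of how far that credit budget goes: if one credit corresponds to one character of input text (an assumption on my part, since the units aren’t explained anywhere obvious), the back-of-envelope arithmetic looks like this. The speaking rate and characters-per-word figures are typical estimates, not measured values.

```python
# Back-of-envelope estimate of audio minutes per credit budget.
# Assumes 1 credit ~ 1 character of text; ~150 spoken words per
# minute; ~6 characters per word including spaces. All estimates.

CREDITS = 100_000
WORDS_PER_MINUTE = 150
CHARS_PER_WORD = 6

chars_per_minute = WORDS_PER_MINUTE * CHARS_PER_WORD  # 900
minutes = CREDITS / chars_per_minute
print(f"{minutes:.0f} minutes")  # roughly 111 minutes

# A single 5000-credit block under the same assumptions:
print(f"{5000 / chars_per_minute:.1f} minutes per block")  # ~5.6 minutes
```

Under those assumptions the full budget is a couple of hours of audio, so narrating an entire back catalogue of long posts would burn through it quickly.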
Like everything else GAI related, the audio genie isn’t getting stuffed back into the bottle any time soon. We can expect to see a glut of apps and services built on speech recognition and generation models. Some will be great, others will be next to useless. Many will come with privacy and security concerns, and will be used in ways which are unethical or even criminal.
The technology will also be used creatively, and in ways that improve accessibility and lower the cost of accessing information in an audio format. It’s down to us to decide what role we want these tools to play in our lives.
In the next post in this series, I’ll go back to image generation and take a hands-on look at Midjourney, arguably the most powerful and visually impressive image generator. I’ll compare it to Adobe Firefly and discuss both the creative opportunities and the concerns. Join the mailing list to stay up to date:
I run professional learning and consulting services for anyone looking to develop skills in the creative and ethical use of generative AI. I’ve worked with schools, businesses, and organisations developing GAI policy to help guide staff through the appropriate, safe use of tools like these. If you’re interested in discussing GAI professional learning or consulting, get in touch directly via the form below.