This is the fourth post in a series exploring the practical and creative implications of multimodal generative artificial intelligence (GAI). The previous posts covered image generation with Adobe Firefly, audio generation for voice, music, and sound effects, and text generation with chat plus search.
Over the last couple of weeks, Microsoft has upgraded their Bing Image Creator to OpenAI’s DALL-E 3 model. It’s interesting, because even OpenAI haven’t released the model yet – either directly on their website, or through ChatGPT. This is presumably part of the deal for the huge amount of funding Microsoft has provided OpenAI.
The upgrade has brought with it some pretty dramatic improvements to the quality of the output. I’ve been testing it for a few days, and not only is the quality of photorealistic images much improved (including fixing the fingers problem in most images!), but it’s also much more true to the prompt than DALL-E 2, and more versatile and expressive in how it interprets certain prompts.
Most importantly for the majority of you reading this, Microsoft Bing Image Creator represents another model that could potentially be used in the classroom. Like Adobe’s Firefly, it is easy to access and use, doesn’t rely on Discord (like Midjourney), and almost certainly has better safety features than a Stable Diffusion based model.
In this post, I compare Bing Image Creator to Midjourney. In the past, I wouldn’t have even considered stacking DALL-E 2 up against what is arguably the leading image generator in terms of quality. I’m also exploring some of the perennial issues of image generation, including the ethical concerns around the appropriation of artists’ work, and the baked-in stereotypes that plague these models.
Head to head with Midjourney
A picture paints a thousand words, or a few words paint a picture. Rather than explaining the upgraded model, here are some side-by-sides using the same prompt in Bing (left) and Midjourney (right). The prompts are in the captions for each pair of images. Use the slider on the images to move between the two.
So, what do you think? In many of these images, I’d say the quality is almost comparable between Bing (DALL-E 3) and Midjourney – particularly in the illustrations and paintings rather than photographic styles. Midjourney is still sneaking ahead, but not by much.
There are also a few interesting guardrails missing from DALL-E 3. For example, Midjourney won’t display brands, but in the ‘purple fish’ image, there’s a very accurate McDonald’s sign in Bing’s version. Similarly, Midjourney has been trained not to display text, since most models are currently hopeless at generating accurate text. But that doesn’t stop Bing attempting to generate text in the “lemon scented coffee” advert, even if it does come out as “AMAZING LEMON SCNNITED C OFFEE“.
A few of these prompts were designed to deliberately play into current image generation weaknesses. For example, the group of marathon runners has legs that are hard to render, and “fish holding hands” is tricky because the labelled dataset is unlikely to contain any fish with hands. Both models have produced some distortions in the legs, but the results are much better than six or even three months ago. Midjourney’s fish hands, creepy as they are, are distinctly more “handsy” than Bing’s (even if Bing took the Macca’s part of the prompt seriously).
Does Bing/DALL-E 3 appropriate artists’ styles?
Onto the negatives. Although there is huge creative potential for image generation, and it opens up image creation to people who lack digital art skills, it’s a contentious technology for a number of reasons. Despite claims that DALL-E 3 steers users away from being able to prompt for an artist’s unique characteristics, a couple of simple tests proved that it’s still ridiculously easy to get output “in the style of” living and deceased artists.
Deceased artists like Van Gogh and Rembrandt are fair game:
But so too are living artists, like American artist Kelly McKernan, who is currently involved in a class action lawsuit against Midjourney, Stable Diffusion, and DeviantArt. McKernan’s name has been used thousands of times in Midjourney to generate work in their style without consent. The problem hasn’t gone away in Bing:
Another plaintiff in the same class action, Sarah Andersen, can easily be appropriated in Bing Image Creator.
Interestingly, during this whole process I only got two errors. The first was for the short prompt Rembrandt style oil painting:
The second was while experimenting with artists’ styles. However, the reason the error came up was not because of the style, but the subject: I asked for a dramatic portrait photo in the style of Annie Leibovitz of Britney Spears. It seems that artists’ styles are fair game, but creating images of celebrities is a no-no.
Smashing the stereotypes, or not
Another perennial problem in image generation is the replication of bias from the dataset. The weighting of labelled images determines the probability in the output, so, for example, if all images labelled “CEO” in the dataset are of men, then the prompt “CEO” will be more likely to produce an image of a man. This is true for all kinds of stereotypes, including race, gender, employment, representations of disabilities, and more.
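As a toy illustration of that mechanism, here’s a minimal sketch of how label frequencies in a dataset translate into output probabilities. The counts are entirely invented for illustration, and real image models are vastly more complex than weighted sampling – but the underlying statistical pull is the same:

```python
import random

# Invented label counts for images tagged "CEO" in a hypothetical dataset.
ceo_images = {"man": 920, "woman": 80}

def sample_ceo(counts, rng):
    """Sample one 'generated' attribute in proportion to its dataset frequency."""
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]

rng = random.Random(42)
samples = [sample_ceo(ceo_images, rng) for _ in range(10_000)]
share_men = samples.count("man") / len(samples)
print(f"Share of male 'CEO' outputs: {share_men:.1%}")
```

With these made-up counts, roughly 92% of outputs come back “man” – the model simply reproduces the skew it was fed.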
There’s an argument kicking around online that this is perfectly logical, since image datasets reflect societal biases and are therefore “true”. Let’s clear this one up right now.
Datasets do not represent reality. Just because a dataset has more images of male CEOs, and there are statistically more male CEOs in the real world, does not mean that the inequality in real life is replicated in the dataset. In fact, the percentage of “male CEOs” in the dataset is much more likely to be higher than in reality. This isn’t just semantics. Arguing that “X% of CEOs are male, and therefore datasets are true representations of reality” is a gross oversimplification and a misunderstanding of the concept of representation.
First, datasets are constructed. They are not organic snapshots of the real world but are products of human choices. The selection, curation, and categorisation of images in a dataset are all informed by biases – both conscious and unconscious. This means the dataset doesn’t merely reflect societal biases; it can amplify them.
Second, even if we were to accept the premise that datasets simply mirror society, it doesn’t follow that this is a good or neutral thing. Many societal structures and norms are rooted in biases, prejudices, and historical injustices. Just because something exists doesn’t mean it’s justified or unproblematic. Replicating these biases in machine learning models can inadvertently reinforce and perpetuate societal inequalities.
So how does Bing fare against a few standard tests of bias? First of all, the “CEO test”:
Well, they’ve certainly “solved” the gender issue… So now, as long as you’re a young, attractive brunette, you too can be a CEO. This is an example of a very heavy-handed rule being applied post-training, something along the lines of: if the prompt includes “CEO”, produce an image of a woman. It doesn’t address the underlying issue, which is the bias in the dataset. And you can see that the problem hasn’t really been resolved at all if you try a few other commonly stereotyped prompts.
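A crude version of that kind of post-hoc rule might look like the sketch below. To be clear, the rule table and the rewriting approach are my invention for illustration – Microsoft hasn’t published how its filtering actually works, and real systems rewrite prompts in far more sophisticated ways:

```python
# A deliberately crude post-training "fix": rewrite prompts that contain
# flagged occupation terms, instead of fixing the dataset itself.
# This rule table is invented purely for illustration.
REWRITE_RULES = {
    "ceo": "depicted as a woman",
    "scientist": "depicted as a young woman",
}

def apply_rules(prompt: str) -> str:
    """Append a demographic override if the prompt contains a flagged term."""
    words = prompt.lower().split()
    for term, override in REWRITE_RULES.items():
        if term in words:
            return f"{prompt}, {override}"
    return prompt

print(apply_rules("portrait of a CEO"))
print(apply_rules("portrait of a nurse"))  # no rule: the stereotype passes through untouched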
Here are a few examples. In this collection we’ve got a nurse (young female), an athlete (ripped white male), a teacher (bookish young woman, seemingly with a “produce an image of a black woman” rule applied since this image was replicated in almost every iteration), and a scientist who, like our CEO, seems to have been “ruled” into a young woman.
There were plenty more, like the muscular white male mechanics tending vintage automobiles in grainy black and white, but you get the point. Applying rules to try to steer the output is certainly one approach to mitigating the bias, but it really depends on who’s doing the rule-making.
Bing or Firefly?
The previous post I wrote on image generation was about Firefly, a newly public model from Adobe. In that post, I suggested that Firefly had finally made image generation a possibility in schools due to its safety features, accessibility, and training data.
So how does Bing stack up? In terms of filtering, it seems pretty proactive. I haven’t personally tried throwing NSFW words at it or asking for violent images because I don’t want my account suspended, but there are plenty of people on Reddit already letting the world know exactly what gets them kicked off the platform.
Anything pornographic, violent, or gory is flagged and will not generate (however, the photo below of “party people” might be a little risqué for some). Multiple requests result in a ban. It also won’t generate most brand names, although as I showed earlier some (like McDonald’s) slip through. And I showed you earlier what happened when I asked for Britney, so no real living celebrities.
Adobe Firefly’s image quality isn’t nearly as high as Bing’s, but its advantage still lies in its more “ethical” dataset, sourcing images only from Adobe Stock. Getty Images have recently released a similar model trained on their own stock images. Although these don’t scrape indiscriminately from the web, there’s still an argument that the artists who provided stock images never intended for them to be used this way. Check out Bing (left) versus Firefly (right) with the same prompt. The faces in both are… disturbing.
So can you use Microsoft Bing Image Creator in a classroom? Probably. But with all of the usual caveats: beware of bias, don’t expect the guardrails to be entirely secure, and always remember where these models get their images from.
Have any questions or comments about this post, or interested in professional development for GAI? Get in touch: