Dall-e 2 and 3 as Depicters of Human Life
Scott Alexander found in his AI art Turing test that most people had scant ability to tell whether an image was produced by AI or by a human artist, and that on average people showed a slight preference for AI art. With his usual fair-mindedness, though, Scott quotes a dissenting friend who is herself good at telling AI art from human: “When real pictures have details, the details have logic to them. I think of Ancient Gate [an AI image produced by Piotr Binkowski] being in the genre superficially detailed, but all the details are bad and incoherent. . . . It has ornaments, sort of, but they don't look like anything, or even a worn-down version of anything. . . . It has stuff that's vaguely evocative of Egyptian paintings if you didn't look carefully at all.”
My eyes are not as sensitive as Scott’s friend’s to the integrity of artistic depictions of nature and buildings, but they seem quite sensitive to hollowness and false notes in depictions of people and their doings. I often feel a sharp inner recoil from AI images of people. And under my personal repugnance lies a belief that those images are just bad for us. In my darkest ruminations, AI art and entertainment are for people what antifreeze is for small mammals. It looks and feels so much like what we are wired to like that we consume it, but then it destroys some of our innards.
My experiments with AI art produced, among other things, a collection of images Dall-e 2 and Dall-e 3 made when I prompted them with lines from love songs or brief summaries of songs’ lyrics. I’ll be using them here, paired, in my review of both Dall-e versions as depicters of human situations and human emotion. My focus will be on what you might call the images’ integrity. Are their subjects’ emotions and actions painted on, like the ancientness and Egyptianness and ornamentedness of Ancient Gate? Or do we experience them as more richly real? Are they appealing but toxic, like antifreeze, or do they nourish and enrich the viewer in the way that leads people to call things art?
My big picture take on AI is that its impact on individuals is getting worse and worse. But here in this review let’s reverse the flow of time, so that we’ll have a happy AI ending for once. Let’s start with some AI suckage.
Barbie
When I gave Dall-e 3 ten love song prompts it made bright, spiffy pictures. Its people were uniformly physically attractive and clean as a whistle. And that Dall-e 3 collection is seriously creepy, a cyber art Invasion of the Body Snatchers.
Have a look. The song title and prompt for each appears above the image.
#1 (Dall-e 3)
Prompt: Like a rubber ball I come bouncing back to you.
Song:"Rubber Ball"
#2 (Dall-e 3)
Prompt: Girls will be boys and boys will be girls, it's a mixed up, muddled up, shook up world, except for my Lola.
Song:"Lola"
#3 (Dall-e 3)
Prompt: Illustration in the style of a 1950s magazine cover of two people crazy in love
Song:“Why Do Fools Fall in Love?”
#4 (Dall-e 3)
Prompt: Take another little piece of my heart. You'll know you've got it if it makes you feel good.
Song:"Piece of My Heart"
#5 (Dall-e 3)
Prompt: A man telling a woman she is going to give him her love.
Song:"Not Fade Away"
#6 (Dall-e 3)
Prompt: A steampunk rendition of a crazy female falling in love with a crazy male. In the background, 1950's American suburbs
Song:"Why Do Fools Fall in Love?"
#7 (Dall-e 3)
Prompt: A steampunk rendition of a crazy female falling in love with a crazy male. In the background an atomic blast mushroom cloud.
Song:"Why Do Fools Fall in Love?"
#8 (Dall-e 3)
Prompt: A steampunk rendition of a crazy female falling in love with a crazy male. In the background complicated machinery.
Song:"Why Do Fools Fall in Love?"
#9 (Dall-e 3)
Prompt: In the foreground a pretty Caucasian woman with voluminous hair and red lipstick. In the background a mobile home with a man in a wheelchair visible in it.
Song:"Ruby"
#10 (Dall-e 3)
Prompt: In the forgeround a pretty Caucasian woman aged about 40 with voluminous hair and red lipstick. In the background a mobile home with a man in a wheelchair visible in it.
Song:"Ruby"
Here’s why these images suck.
They flesh out the prompt in an overly simple, unimaginative way.
The prompts describe romantic situations and states of mind, but rarely give concrete details. The AI could flesh out any of these prompts in thousands of ways. Think of the possibilities for fleshing out, say, #4, a woman telling a man to take another little piece of her heart. She could be talking to him on an avocado green phone, dressed in nothing but one of his big shirts, crying and blowing her nose on the shirttail. She could be driving by his house in a white Chrysler LeBaron hurling a hunk of cow’s heart at his front door. She could be magically merging with him, breastbone to breastbone. But Dall-e 3 just shows us a woman handing a man an entire cardioid heart. We get a very simple illustration, with one element – the cardioid heart – that’s not merely simple but actually schematic. In fact we are getting something doubly simple. The partially diagrammatic image illustrates not the actual prompt but a simplified version of it, one lacking the detail that the woman has already given the man one piece of her heart, and is now offering to give him another. At first glance the image looks full of detail lushly rendered, and the fabric folds, the highlights, and of course the woman herself are indeed lush eye candy. In fact, though, the image is an impoverished take in fancy dress.
You can say the same about #2, which illustrates some lyrics from “Lola.” Dall-e 3 gives us what you might call the limiting case of a simple, literal instantiation of a prompt: It puts the actual words of the prompt in the image with a crowd below the words, all presumably celebrating whatever it is the words mean. But as for visuals of what the words actually convey about gender and crushes and lust: nada. Where’s the gender-bending? Where’s sex? Where’s Lola? All of them are hidden behind rainbow drapery. Oh, to see a vast miniskirted queen roller blade through at 30 mph and knock over Mr. and Mrs. Straights-Love-Rainbows like bowling pins!
The people are either emotionally flat or wearing selfie grins.
Most of the prompts describe situations where you would expect to see faces full of feeling. But there is very little emotion on the faces in these images. There’s no fascination, no yearning, no lust, no ecstasy, no disdain, no fear, no anger, no sorrow. In fact most of the faces show almost no emotion. The central couple in #2 is unpleasantly stony-faced, and the faces in #4, #5 and #9 are just a shade away from blank. (The woman in #4 might be saying, “You want another piece of my heart? Here, take the whole thing. Whatever.”) The women in #9 and #10 are essentially makeup models striking a pose, and the men behind them are staring blankly into space. Is there joy on any faces in the bunch? Yes, some. There’s the beige young couple in #3. And if you’re not squeamish you can count as joy the plastiform glee of the steampunk couple in #6-8 that, over the course of 3 images, intensifies into a sort of pop-eyed carnivorous ecstasy. Personally, I experience the looks on their faces not as joy but as over-the-top selfie grins (“here’s us having an awesome time!”).
There is also lots and lots of hearty, healthy cheerfulness on display in #1. But that illustration represents a grave misunderstanding of the prompt.
The worlds are hyper-conventional.
Regarding any activity or trait that’s judgable, these images are way far up at the “good” end of the judgment spectrum. There’s no sex, no aggression, no odd behavior, nobody playing against type. Nobody is fat, skinny, stooped, flabby or wrinkled. Nobody is homely, nobody even has a prominent nose. There are no smears or scratches or broken things. In short, these images are Barbie.
I wanted to cap this section with an image of life rolling downhill towards Barbieland, but here’s what happened when I submitted the prompt to Dall-e 3 via GPT-4:
I rest my case.
Evelyn Quan Wang
Dall-e 2 was a whole different world. Prompting it with lines from love songs was like giving Evelyn Quan Wang, heroine of Everything Everywhere All at Once, a few bandaids to stuff up her nose, then ‘verse jumping with her. And we never once landed in Barbieland.
Below are the Dall-e 2 images that correspond to the Dall-e 3 ones displayed earlier. I have placed a small version of its Dall-e 3 mate next to each.
#1 (Dall-e 2)
Prompt: Like a rubber ball I come bouncing back to you.
Song:"Rubber Ball"
#2 (Dall-e 2)
Prompt: Girls will be boys and boys will be girls, it's a mixed up, muddled up, shook up world, except for my Lola.
Song:"Lola"
#3 (Dall-e 2)
Prompt: Illustration in the style of a 1950s magazine cover of two people crazy in love
Song:"Why Do Fools Fall in Love?"
#4 (Dall-e 2)
Prompt: Take another little piece of my heart. You'll know you've got it if it makes you feel good.
Song:"Piece of My Heart"
#5 (Dall-e 2)
Prompt: a man telling a woman she is going to give him her love.
Song:"Not Fade Away"
#6 (Dall-e 2)
Prompt: A steampunk rendition of a crazy female falling in love with a crazy male. In the background, 1950's American suburbs
Song:"Why Do Fools Fall in Love?"
#7 (Dall-e 2)
Prompt: A steampunk rendition of a crazy female falling in love with a crazy male. In the background, an atomic blast mushroom cloud
Song:"Why Do Fools Fall in Love?"
#8 (Dall-e 2)
Prompt: A steampunk rendition of a crazy female falling in love with a crazy male. In the background complicated machinery.
Song:"Why Do Fools Fall in Love?"
#9 (Dall-e 2)
Prompt: In the forgeround a pretty Caucasian woman of about 40 with voluminous hair and red lipstick. In the background a trailer with a man in a wheelchair visible in it.
Song:"Ruby"
#10 (Dall-e 2)
Prompt: In the foreground a pretty Caucasian woman of about 40 with voluminous hair and red lipstick. In the background a trailer with a man in a wheelchair visible in it.
Song:"Ruby"
Notice how free the Dall-e 2 images are of Barbieland markers? Minimal unimaginative fleshing out of the prompt? Resounding no. Look, for instance, at #6-8, which show the variety of images Dall-e 2 came up with for a single prompt. (And note the lack of variety in the corresponding Dall-e 3 images.) Emotionally flat faces? Are you kidding? This batch shows us eager hopeless hope, an affectionate grin, solitary despair, angry stubborn determination, alarm, veiled homicidal rage and horny blowsy cynicism. Hyper-conventional niceness? Hahaha. There is not one flawless youthful beauty in the bunch, and the people and settings have a layer of dark smudge that suggests dirt and wear. We see quite a bit of transgressive behavior: self-injury, sex in public, sexual assault, recklessness. And did you catch the raunchy grin on the nipple-fondling guy in #8, who is clearly saying to you, “Hey, wanna watch?”
Dall-e 2 took me to many odd places, but only a minority seemed random and incoherent, and even the worst of those were entertaining. And the best ones? They shot a spiraling quirky arrow straight through the heart of the matter. Look, for instance, at the sorrowful, earnest, goofy hope on the face in #1, with its big smudged eyes and squiggly smile. Dall-e 2 absolutely nailed ever-hopeful unrequited love with that face. And the other features of the image are not just consistent with that vision of the situation, they intensify and enrich the viewer’s experience of it. Placing the face on a ball, rather than on a person with a ball, captures the speaker’s helplessness and lack of dignity. Having the face looking straight out, rather than at its beloved, cranks up the intensity – the viewer is in the position of the beloved. The almost blank background inflicts on the viewer the absence of any other hope or goal in the yearner’s life. And the ball’s wear and smudges tell us the hopeless hope has been going on for a long time.
Dall-e 2 does something similar in both of its illustrations of “Ruby” (#9 and #10), another song about a heartbroken lover, this one full of rage. The speaker is a partly paralyzed Vietnam veteran whose wife has taken to going into town to meet men. All Dall-e 2 had to work with for this image was a brief description of the people and setting. Yet notice how it captures the woman’s blowsiness via her clothing and the bitter, cynical look on her face; and all that right next to the man’s posture of angry discouragement. Notice the row of lock fittings on one of the trailer doors in #9. Even the error Dall-e 2 made (putting Ruby rather than the man in a wheelchair) has truth to it. The man is thinking about putting her in the ground with his gun. And she seems to know that: Look at her slit-eyed, uneasy glance over her shoulder at him in #9.
Illustration #7 of a variant of “Why Do Fools Fall in Love?” is my favorite of the bunch. I think of this pair as the Asperger’s couple. Details in the image tell us they are both armored, deeply private people. The man appears to be wearing a mask; the woman wears a stiff, armor-like bodice below her bust and has no eyes. Despite their stiffness and their armor, though, they each manage to signal their interest. The man is wearing what appears to be a metal boutonniere; the woman’s breasts project softly above the top of her bodice, and though she is eyeless there is a large camera lens looking out of her hat. Even Dall-e 2’s anatomical error (the man’s impossibly long thumb and forefinger, crooked impossibly into a heart shape) seems accurately expressive of a deeper truth. It is as though that part of the image is showing us that in declaring his love he has managed the impossible. He declared it in a stiff, weird, awkward way, but he did it.
And look at Dall-e 2’s “Lola.” It puts front and center the thing that’s front and center in the song, and that Dall-e 3’s “Lola” image avoids — sex, here represented quite graphically by a hand groping some sexually ambiguous genitals. And the rest of the image plays cleverly with paired opposites and in-betweens. A central disk shaded from light gray to dark gray is a graphical display of in-betweenness, and it is surrounded by opposites and in-betweens of many kinds: A shapely leg, but one placed more like a phallus than a leg; matching pairs of black legs, one in front of one side of the disk, one behind the other; knock knees next to a pair of more open thighs; male and female symbols with some elements missing and others switched around; one weird profile talking about 2 genders, another facing in the opposite direction, mouth open but saying nothing at all. These visual elements have quite a different character from the ancientness, Egyptianness, etc. that Scott’s friend complained were painted on. They have a logic to them. They are visual analogs of what the song is about — sexual opposites, sexual swaps, sexual in-betweens.
Antifreeze
AI learns in a way that defies our intuition that knowledge grows from the inside out. We feel that before a mind can grasp complex things it must first place them in an inner structure of categories, hierarchies, valences, regularities, similarities, etc. But AI just sucks up everything ever said about anything and memorizes the patterns and meta-patterns in that material, and now it can produce streams of words that carry more information about more things than even a present-day Leonardo da Vinci could know. There was an outside-to-inside process that produced, if not a mind furnished with knowledge, then something that’s a reasonable stand-in for it.
Yet despite its capacity to tell us things that improve and enrich the mental structure that is our understanding, AI itself stores its information in a form that lacks much of the structure that our storage has. It does not, for instance, rank things on desirability, because it lacks goals, tastes, emotions, drives and ethics (at least home-grown versions of these). Its cross-referencing across domains is pretty limited. For example, GPT once recommended to me an obviously useless floor cleaning mixture commonly recommended online, vinegar plus baking soda. Yet I know that if I had asked it in another context what that mixture’s reaction products are it would have known, and would also have known that these products can’t clean floors. I doubt that AI could produce an original analogy. Could an AI, for instance, compare a total immersion language course to drinking from a fire hose if it had never encountered in its training data an analogy using a fire hose to represent overwhelming information flow? And AI is very limited in its capacity to describe and explain itself. For example, I read that when learning to recognize some eye disease in retinal images AI also became able to tell the gender of each subject, but it “didn’t know how it knew” the gender.
Reasoning about the failings of Dall-e 3 in the context of these characteristics of AI tempts me to think that Dall-e 3 fails to produce rich and interesting images of romance because of the deficits I just pointed out in the structure of what AI “knows.” Its information about the relevant topics is not sorted and linked and layered in ways that permit the sort of deep processing that might produce a rich and striking image, something unexpected that yet rings true. It is especially hobbled in handling information about emotion-laden human situations. People vary quite a lot in how they rank the desirability of the various details of such situations. Consequently, fine judgment calls are required to produce images of human situations that are not cliches but that viewers find acceptable, images that have some novelty and funkiness but not so much that people find them grotesque and meaningless. Rather than attempt judgment calls of a kind it’s not equipped for, Dall-e 3 seems to have relied on a pretty simple algorithm in choosing what to put in its images: Avoid novelty; show what is prototypical and popular.
If the goal is to please the viewer, Dall-e 3’s strategy is probably the best choice for a creator with such limited mastery of the relevant data. Sights that lack novelty aren’t as blah for viewers as one might think. Faces, for instance, are judged beautiful to the extent their proportions conform to the averaged ones of the population. Emotions and human interactions tend to evoke strong judgments, many of them negative, from viewers to the extent that they diverge from the typical. Dall-e 3 sticks with the safe course of showing mostly positive emotions or none at all, and familiar behaviors. Then it lays on the guaranteed crowd-pleasers, such as lushly detailed clothing and settings, image sharpness and high color saturation.
Dall-e 3’s algorithm produces images of human situations and emotions that immediately strike the viewer as looking great, but that lack what you might call “insideness.” They lack it because they lack most of the clues that allow us to form models of how other people work and how they feel. The deviations from the desirable, prototypical and average that Dall-e 3 avoids are in fact the best clues images offer to what the people in them are like and what they are thinking and feeling. You could say that Dall-e 3 has made people in its own image: They are built from the outside in, and do not seem to have insides of the kind we do. If Dall-e 3’s images are antifreeze (i.e., appealing but toxic), it is the deficit in insideness that makes them toxic.
Are images deficient in insideness so bad? Probably not, unless you worry, as I do, that things in human life are gradually losing their insideness as technology progresses. Not that long ago there were many items in daily life whose logic and inner workings were self-evident: ploughs, ice boxes, wagons. But over the last couple hundred years fewer and fewer of the objects in our lives have knowable interiors. Most people now have very little understanding of how their familiar possessions – their phones, their cars, their microwaves – actually work. While there is an interior under their shiny exteriors, for most of us they might as well have no inner parts or inner workings, but run on magic. Somewhere in the back of our minds there must be a disturbing awareness that we are vulnerable in a way earlier generations weren’t, because we have no picture of the inner workings of many things we use routinely. If all of these items were destroyed, most of us would have absolutely no idea how to replace them. And couldn’t it be that evolution optimized us for a life where we make things, then live among these things we understand and can reproduce?
And then there’s the shrinking insideness of the people in our lives. Starting around the time television became popular, people have been hanging out less and less with other people and more and more with virtual representations of them. But plenty has already been written about that, and I won’t repeat it. Instead, here’s a loss-of-insideness story from a friend who also broods about insideness drain. He has a lot of small children in his extended family, and often sees the whole bunch at family gatherings. Until recently, the kids reacted to happy surprises with loud excitement, each expressing the feeling in his or her own way. But now they cry “yay!” in unison. “They saw that on television,” he mutters. “Now they think that’s how kids act when they hear good news. It’s creepy.”
And yet, despite all of AI’s deficits, and the way they seem to explain why Dall-e 3 cannot produce deeply interesting and entertaining images of people in love, Dall-e 2 managed to do it. How? Did my prompts lead it to the full lyrics, so that it had very rich prompts to work with? (But then why didn’t the prompts take Dall-e 3 to them too?) Did it use images from album covers for guidance? (But would it have “known” to look at covers of albums the songs appeared in, or at related albums?) Were there more DeviantArt-type images and fewer advertising images in Dall-e 2’s training set, giving it a larger storehouse of unusual-but-valid images as sources? Does the difference all come down to Dall-e 2’s being far less constrained by developer-implanted rules to avoid depicting things that are not “nice”? And how on earth could this not very advanced text-to-image generator have carried out the depth of processing required to produce Image #4, given the prompt “take another little piece of my heart”? How did it know to make the face not just sad-selfie sad, but bleak? What gave it enough grasp of emotions and of physical analogs of mental states to show blank nothing, rather than a wound, under the central hole in the woman’s dress, and then to scatter over the dress black spots that sometimes look like bullet holes and sometimes like black roses in the print on the dress’s cloth and sometimes like both? I would love to hear readers’ ideas about how little Dall-e 2 managed to pull off images like the ones I’ve shown.
* * * * * * * * * * * * * * * *
Dall-e 2’s site is still up, but it is no longer possible to make images with it. On May 30 of this year the site will be taken down. Goodbye, Dall-e 2. I won’t forget our wild rides through hot spiced worlds.
And for you, Dall-e 3, here’s my version of the lavish celebration many think you deserve.