This year, artificial intelligence has won art competitions, dominated the Internet, passed exams, and proven itself to be the technology of the future... but it still can't accurately depict ordinary human hands.
Despite all the progress made by AI image generators, hands have become their nemesis, exposing these models' flaws.
While this shortcoming has been a prominent issue since the introduction of DALL-E 2 and all of its subsequent competitors, it came into sharp focus thanks to one Twitter user's collection of images he generated with Midjourney.
At first glance, the AI's work is impressive: the images show realistic-looking people. Look closer, though, and the problems emerge: one person has three hands, another has seven fingers and an abnormally long palm, and in one image someone is holding a phone with a strangely bent finger.
So why does such a seemingly simple subject cause the AI to fail? “2D image generators have absolutely no understanding of the 3D geometry of something like a hand,” says Professor Peter Bentley, a computer scientist and author at University College London.
“They have a general idea of a hand: it has a palm, fingers and nails. But none of these models actually understands what a hand is.”
If you're just trying to get a very general image of a hand, that won't pose much of a problem for a neural network. The trouble starts as soon as you add context. Because the AI understands neither the 3D nature of the hand nor the context of the scene, it has a hard time recreating them accurately.
For example, a hand holding an object such as a knife or a camera, or someone making hand signs, will instantly confuse a model that has no grasp of the hand in 3D or of the geometry of the object it is holding.
The “hand problem” is not unique to DALL-E 2, either. Other popular image models, such as Midjourney and Stable Diffusion, have wrestled with the same seemingly simple task of producing a normal-looking hand.
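You can see the failure for yourself with a minimal sketch like the one below, which runs Stable Diffusion through Hugging Face's diffusers library. The checkpoint name and the prompts here are illustrative choices, not anything from the original report, and the negative prompt is a common community workaround rather than a real fix:

```python
# A minimal sketch of reproducing the "hand problem" with Stable Diffusion,
# using Hugging Face's diffusers library. The checkpoint ID and prompts are
# illustrative; any Stable Diffusion checkpoint behaves similarly.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint choice
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# A context-heavy prompt: a hand interacting with an object is exactly the
# case Bentley describes, and it frequently yields extra or fused fingers.
prompt = "a close-up photo of a hand holding a kitchen knife, realistic"

# A common community workaround: steer the sampler away from known failure
# modes with a negative prompt. It helps somewhat, but does not address the
# underlying lack of 3D understanding.
negative_prompt = "deformed hands, extra fingers, fused fingers, mutated hands"

image = pipe(prompt, negative_prompt=negative_prompt).images[0]
image.save("hand_test.png")
```

Negative prompts of this kind merely push the sampler away from frequent failure modes; they do nothing to give the model the 3D understanding Bentley describes.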
“In fact, all these models are divorced from reality. They have no knowledge of context and no ability to take the context of an image into account. They just sort of mash together all the garbage we've given them,” says the scientist.
So, these models are good, even great... but they still have a long way to go before they produce flawless images. What needs to happen to solve this problem and finally create a hand that doesn't look like it was inspired by David Cronenberg?
“All this may change in the future. These networks are gradually being trained on 3D geometry so that they can understand the shapes behind the images. That will give us more coherent results, even with complex prompts,” says Bentley. “The first significant results in this field could lead to highly detailed 3D renderings and even digital worlds.”