The artificial intelligence revolution. The technology behind deepfakes.
By Alejandro Pérez Blanco
Alejandro Pérez Blanco is a VFX artist. In 2018 he began experimenting with the application of Artificial Intelligence in a professional environment. Since then, he has taken part in about 30 projects in film, TV, internet, advertising and corporate work, and he has made all kinds of deepfakes.
A brief introduction
Let’s imagine that a doctor sets out to create a dictionary of symptoms in order to diagnose any disease. He will collect thousands of clinical histories and start writing:
“If the patient has a fever, loss of appetite and red pimples -> chickenpox”
And so on with each disease.
After putting in some effort, he discovers that the job can be made easier by creating a graph of connected dots, thus saving a lot of time and space (Figure 1).
But a mathematician comes along, sees his graph and says: “This could be done by a computer. If we convert the diagnoses into numbers, the machine can learn through trial and error.” The essence of this system is trying out random connections: when a correct result is achieved, those connections are reinforced; whenever a mistake is made, they weaken and eventually disappear.
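The doctor’s scheme can be sketched in a few lines of code. This is a deliberately naive illustration, not a real learning algorithm: the symptom data and the decision threshold are invented, and the “training” is just random tweaks to the connections that are kept when they help and undone when they hurt.

```python
import random

# Toy dataset (invented, not real medicine): each patient is
# (fever, appetite_loss, red_pimples) -> 1 = chickenpox, 0 = not.
patients = [
    ((1, 1, 1), 1),
    ((1, 0, 1), 1),
    ((1, 1, 0), 0),
    ((0, 0, 1), 0),
    ((0, 1, 0), 0),
    ((1, 0, 0), 0),
]

def accuracy(weights, threshold=1.5):
    """Fraction of patients correctly diagnosed by these connection strengths."""
    hits = 0
    for symptoms, diagnosis in patients:
        score = sum(w * s for w, s in zip(weights, symptoms))
        guess = 1 if score > threshold else 0
        hits += (guess == diagnosis)
    return hits / len(patients)

random.seed(0)
weights = [0.0, 0.0, 0.0]   # a "blank brain": no connections yet
best = accuracy(weights)
for _ in range(1000):
    # Tweak one random connection; keep it if results improve ("reinforce"),
    # undo it if they worsen ("weaken and disappear").
    i = random.randrange(3)
    old = weights[i]
    weights[i] += random.uniform(-0.5, 0.5)
    new = accuracy(weights)
    if new >= best:
        best = new
    else:
        weights[i] = old

print(best)
```

After enough random tweaks, the connections for fever and red pimples end up strong while the others stay weak, which is exactly the rule the doctor wrote by hand.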
This is the basis of Machine Learning. Mathematicians began to think about all this when Santiago Ramón y Cajal saw neurons through a microscope for the first time in history. The brain was no longer seen as a mysterious thinking machine, but as a network of neurons that, through their interconnections, are somehow capable of thinking, recognizing the world around them, reaching conclusions and taking action. How this is achieved remained a mystery, but being able to see the network opened endless doors to progress: physics, chemistry, medicine… And mathematicians have been trying to translate into numbers everything we gradually learn about neurons. Machine Learning creates algorithms like the ones a mathematician or a computer scientist might work out, but instead of programming them, it designs a digital simulation of a blank brain and watches how it fills up with content. Because a brain is a computer that is not programmed, but trained (Figure 2).
In our example, when the mathematician finishes the system to train his model, it seems to render promising but incomplete results. The doctor examines them, runs some tests and notices a problem: “there is no correlation between the symptoms,” he says. “Weighing 90 kg is not the same if you are two meters tall as if you are one meter fifty, but this scheme does not understand that.” The mathematician thinks about it and finds a way to create a series of dots in the middle, letting the trial-and-error learning scheme trickle down to that middle layer. “This column of points will serve to create new combinations of symptoms,” he replies.
But he also warns that “this middle row is very difficult to decipher. It can get it right, it can get it wrong… anything could happen. It can get it right for the wrong reasons, and it can discover new things that you didn’t know about. Whenever you get a layer of neurons in between, this electronic brain turns into a black box.”
Those dark, deep layers of neurons, hard to understand with the naked eye, are what we call Deep Learning. And Deep Learning is the closest thing to the human brain that we have managed to create so far, and it has been done without really understanding how two human neurons connect with each other… But the fact that we can get results out of it suggests that perhaps we are on the right track (Figure 3).
After many years of testing and setbacks, Artificial Intelligence is now experiencing an unprecedented explosion. This is due to the gradual development of the mathematics behind these neural networks, and also to Big Data (learning to take advantage of the large data collections that institutions and companies have gathered) and to the paths technology has taken towards parallel processing (3D rendering, video games, Bitcoin mining).
Deep Learning Models
Science has behaved in a quite exemplary manner in the AI field. Some of the main stakeholders in the area, such as Google, Facebook or NVIDIA, developed proprietary technologies and released them for everyone to use free of charge. And the world responded accordingly: universities, research centers, companies and individuals began to publish new ways of connecting digital neurons to each other, and of connecting them to images and sound, to language, to meteorology, to medical diagnostics, to the resistance of bridges or to protein structure. Then, whoever finds a use for these findings compiles them and develops their own programs. The technology used to analyze the composition of the coronavirus is essentially the same as the one that detects whether you upload a nude to Instagram.
Let us look at some rather simplified examples of Artificial Intelligence models applied to the world of images (Figure 4).
The first big hit of today’s AI was the detector/classifier. If we connect the pixels of an image to a neural network, add several deep layers to it, and end up with a couple of output options, we can create an image classifier. Dog or cat? Porn or not porn? Benign or malignant tumor? And the choice does not necessarily have to be limited to two items: detectors have been created with hundreds or even thousands of outputs, capable of recognizing all kinds of different categories. Through a structure that analyzes the differences between adjacent pixels (convolutional neural networks, or CNNs), each new layer means a greater degree of abstraction: the first layers detect lines and corners, then rounded shapes, then circles, and then confirm whether these correspond, for example, to pupils, balls or wheels… A few more layers and the machine manages to create categories. And always through the same trial-and-error procedure: if a correct result is achieved, the structure is maintained; if it fails, a change in the connections is forced.
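The idea of “differences between adjacent pixels” can be shown with a single hand-written kernel. In a real CNN the kernel values are learned by trial and error; here they are fixed to a classic vertical-edge detector so the effect is visible, and the tiny image is a made-up toy:

```python
# A 5x5 grayscale toy "image" with a bright vertical stripe in the middle.
image = [
    [0, 0, 9, 0, 0],
    [0, 0, 9, 0, 0],
    [0, 0, 9, 0, 0],
    [0, 0, 9, 0, 0],
    [0, 0, 9, 0, 0],
]

# A 3x3 kernel that responds to differences between horizontally adjacent
# pixels, similar to what the first layer of a CNN often ends up learning.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def convolve(img, ker):
    """Slide the 3x3 kernel over the image and sum the products."""
    out = []
    for y in range(len(img) - 2):
        row = []
        for x in range(len(img[0]) - 2):
            total = sum(
                ker[j][i] * img[y + j][x + i]
                for j in range(3) for i in range(3)
            )
            row.append(total)
        out.append(row)
    return out

feature_map = convolve(image, kernel)
for row in feature_map:
    print(row)  # strong positive/negative responses at the stripe's edges
```

The output lights up exactly where the stripe’s left and right edges are; stacking many learned kernels like this one, layer upon layer, is what yields the growing abstraction described above.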
Conversely, if we start from a few numbers and set the model the mission of producing images with them, we can create an image generator driven by initial values that we feed into the first few neurons (Figure 5).
But how can a model like this be trained by trial and error? This is where the exciting advances of recent years come in. To determine whether a hit or a miss has been achieved, a detector like the one we have just seen is added at the end. The detector is fed with real images and with random images from the generator, turning the virtual brain into a cat-and-mouse game. If the detector spots a fake, the generator must learn. If the forger misleads the detector, it is the detector that modifies its structure. Gradually, over millions of attempts, totally new images are created, with a quality that depends on the quality achieved by both generator and detector (Figure 6).
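The cat-and-mouse game can be caricatured in one dimension, with “images” reduced to single numbers. Everything here is a drastic simplification of my own invention: the real data is just numbers clustered near 5, the detector merely tracks an estimate of what real data looks like, and the generator hill-climbs on the detector’s score instead of using gradients.

```python
import random

random.seed(1)

REAL_MEAN = 5.0  # the "real images": numbers clustered around 5

def real_sample():
    return random.gauss(REAL_MEAN, 0.1)

estimate = 0.0   # detector's current idea of what real data looks like
g = -3.0         # generator's current output, far from realistic at first

def detector_score(x):
    # Higher score = "looks more real" to the detector.
    return -abs(x - estimate)

for step in range(2000):
    # The detector trains on a real sample: its estimate drifts toward reality.
    estimate += 0.05 * (real_sample() - estimate)
    # The generator proposes a random tweak and keeps it only if it fools
    # the detector better, i.e. it learns from every failed forgery.
    candidate = g + random.uniform(-0.2, 0.2)
    if detector_score(candidate) > detector_score(g):
        g = candidate

print(round(g, 1))
```

After a few thousand rounds the generator’s output ends up indistinguishable (to this crude detector) from the real data; in a real GAN both players are deep networks and the detector also trains on the fakes, but the chase is the same.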
Autoencoder: if we combine the ideas of a detector and a generator the other way around, we get a bottleneck-shaped structure. Starting from an image, information is passed through fewer and fewer neurons until a narrow center is reached, and then we force the machine to rebuild the image. The detector in this case does not follow criteria we have set (dog class or cat class) but creates its own automatically; the name comes from this idea of creating data encodings on its own. The first part is usually called the encoder and the second the decoder. Little by little, the decoder learns to reconstruct the images it is asked for, and also to create new images if the numbers sitting at the bottleneck are changed (Figure 7).
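The bottleneck can be sketched as a stack of shrinking and then growing layers. The weights below are random and untrained, so the reconstruction is meaningless; the point is only the shape of the data flow: eight values squeezed down to two at the bottleneck, then expanded back to eight. Layer sizes are arbitrary placeholders.

```python
import random

random.seed(0)

def linear_layer(n_in, n_out):
    # Random, untrained connection strengths; training would adjust
    # these by trial and error until reconstructions match the input.
    return [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

def forward(layer, x):
    # Each output neuron is a weighted sum of all inputs.
    return [sum(w * v for w, v in zip(row, x)) for row in layer]

# Encoder squeezes 8 -> 4 -> 2 (the bottleneck); decoder expands 2 -> 4 -> 8.
encoder = [linear_layer(8, 4), linear_layer(4, 2)]
decoder = [linear_layer(2, 4), linear_layer(4, 8)]

image = [0.1 * i for i in range(8)]  # stand-in for an image's pixels

code = image
for layer in encoder:
    code = forward(layer, code)       # down to 2 numbers

reconstruction = code
for layer in decoder:
    reconstruction = forward(layer, reconstruction)  # back up to 8

print(len(code), len(reconstruction))
```

Once trained, changing those two bottleneck numbers by hand is precisely what produces new images.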
The autoencoder was designed as an attempt to compress information, but it turned out to be no more efficient than formats like JPEG. The problem was that it did not work well as a general tool, although it did yield acceptable results when working with specific image categories, such as clouds, cars or faces. Ten years ago this model was used as an example in Artificial Intelligence courses because it was very simple to understand, but in practice it was not that useful. Until someone came up with a way to alter the model so that it would learn two different faces, and deepfakes were born. In fact, that someone was an anonymous Reddit user whose handle was precisely “deepfakes”.
This is the structure that was proposed (Figure 8).
The encoder learns to read two faces, but at the bottleneck the information forks into two decoders, so that if face A is fed in, only decoder A is trained to rebuild it. The theory is that, with enough training, data such as the direction of the gaze, the turn of the head or the height of the jaw will concentrate at the bottleneck, and that data will be valid for one face and for the other; decoder A reaches different conclusions than decoder B about where to place that spot on the cheek if the face is smiling, so the two become interchangeable. That was the theory. And it worked.
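The fork can be sketched by reusing the bottleneck idea with one shared encoder and two decoders. As before, the layer sizes and weights are arbitrary, untrained placeholders; the interesting line is the last one, where face A’s code is handed to decoder B.

```python
import random

random.seed(0)

def linear_layer(n_in, n_out):
    # Untrained placeholder weights.
    return [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

def forward(layers, x):
    for layer in layers:
        x = [sum(w * v for w, v in zip(row, x)) for row in layer]
    return x

# One shared encoder squeezes any face down to pose/expression data...
encoder = [linear_layer(8, 2)]
# ...and each identity gets its own decoder back up to "pixels".
decoder_A = [linear_layer(2, 8)]
decoder_B = [linear_layer(2, 8)]

face_A = [0.1 * i for i in range(8)]  # stand-in for a frame of face A

# During training, encoder+decoder_A only ever see face A, and
# encoder+decoder_B only ever see face B. At inference time, the swap:
code = forward(encoder, face_A)       # read face A's pose and expression
deepfake = forward(decoder_B, code)   # rebuild it with face B's identity

print(len(code), len(deepfake))
```

Because both decoders were trained against the same shared code, decoder B interprets face A’s gaze and head turn as instructions for drawing face B.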
The first area where it was massively applied was pornography. A huge market quickly and unexpectedly emerged, with people willing to pay for videos matching their interests and many programmers willing to try their luck. Little by little, research published by the AI laboratories of technological universities began to be implemented. Currently, the best publicly available innovations are being gathered by Ivan Perov in DeepFaceLab, a program that uses Google technology and is available on GitHub, and some studios (ours included) have started to develop private solutions to streamline the workflow in the face of the difficulties that experience reveals.
The problems of changing a face
The goal of big tech companies is to create and sell automated, single-click services. For the time being their work focuses on services for casual users playing with their mobile phones, such as Instagram filters, face or fingerprint detectors, camera optimization, photo tagging… Facebook’s, Google’s or Apple’s methodology consists of creating very complex models that require entire buildings full of computers training for weeks, and then trying to make them work universally for all users, with a quality that fits a mobile screen.
On the other hand, the latest version of Photoshop has higher-quality experimental tools that run AI models in the cloud to modify age or gestures in photographs. But with today’s technology you cannot create a convincing broadcast-quality deepfake at the push of a button. You have to train your own model, and achieving a good result requires a great deal of computing power.
NVIDIA graphics cards measure their capacity for this work in two main elements: VRAM and CUDA cores. The memory required to work with Artificial Intelligence depends on the image resolution and on the number of neurons we want the model to have (generally, the more the better). The CUDA cores, for their part, handle the simultaneous mathematical calculations of each neuron: the more we have, the faster the training goes.
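A back-of-envelope estimate shows how these factors interact. All the numbers below are invented for illustration; real memory usage depends heavily on the framework, the optimizer and the model architecture.

```python
# Rough, illustrative VRAM estimate for training; every figure is a
# hypothetical placeholder, not a measurement of any real model.
resolution = 256          # pixels per side of the face crop
channels = 3              # RGB
batch_size = 8            # images trained simultaneously
bytes_per_value = 4       # 32-bit floats
layers = 20               # rough count of activation maps kept for backprop
params = 50_000_000       # connection weights in the model

# Activations are stored for every image in the batch, at every layer...
activations = (resolution * resolution * channels
               * batch_size * layers * bytes_per_value)
# ...plus the weights themselves, and roughly as much again for gradients.
weights = params * bytes_per_value * 2

total_gb = (activations + weights) / 1024**3
print(round(total_gb, 1))
```

Note that doubling the resolution quadruples the activation term, and enlarging the model inflates the weight term: this is why the combination of big models and high resolutions pushes even powerful cards to their limit.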
This is one of the biggest problems facing technology: the result improves with larger models and with higher resolution, but the speed needed by television and the 4K definition that cinema requires are still at the technological limit. A very powerful card will be able to work with small models very quickly or with large models very slowly.
In the “Entrevistas por la cara” (Cheeky Interviews) section of the Spanish TV program El Intermedio, typically two weeks elapse between the recording and the broadcast of a sketch, depending on the difficulty involved. This demands a degree of coordination and foresight that also includes the Costume, Photography and Makeup teams, so that the sketch links properly with the live broadcast, in addition to preparing a script that will not expire in the meantime. The Filmmaking department plans each sketch separately, combining the scenery and the camera shots so as to minimize problems that could arise, for example, that the substitute nose is flatter than the original, or that the background has to be rebuilt when the program’s presenter Gran Wyoming appears in profile; and this planning always strives to give the performers the utmost freedom in their work (Figure 9).
In the series 30 Monedas (30 Coins), post-produced in 4K, 250 deepfake shots were made to rejuvenate two protagonists during a flashback sequence. Currently there is no universal solution for projects of this size, and different techniques had to be designed for each single deepfake. It took about five months, with some of that time overlapping between training on one face and post-production on the other. Fiction brings needs that are not found in an HD comedy show, such as restoring grain and texture to the exact quality of the original shots and delivering a 10-bit logarithmic output ready for color grading. In humor, realism can sometimes be sacrificed in favor of comedy. In fiction, that is simply unacceptable (Figure 10).
In addition, the selection of faces fed to the model can significantly alter the outcome. If there are only left-facing faces, a right-facing deepfake cannot be achieved. If there are no faces available in 4K, a 4K deepfake cannot be created. A similar deficiency arose in the series. Eduard Fernández and Manolo Solo gave us a broad collection of images from their youth, but neither quite fitted. With Eduard, given the extraordinary physical change he developed for the character, his images as a young man did not work, so we built a model that sought an intermediate point between his youth and his physiognomy in the series. As for Manolo, since his old images were of poorer quality and he nearly always wore glasses, we had to make the deepfake with a stand-in, designing a model that kept Manolo’s proportions and features as much as possible while preserving the stand-in’s youthful features (Figure 11).
And here comes the other big problem with deepfakes: there is no single result. There is no single way to train. It can be done in many different ways, either by changing the learning algorithms or by altering the collection of images being fed in. And the results can be better or worse, and can even cause rejection in viewers.
The human brain tends to get used to imperfections as it deals with them. For VFX technicians this creates a very delicate balance: on the one hand they must trust their own judgment, and on the other remain wary as they dive into the myriad images in front of them. For projects of this magnitude it is necessary to design a workflow that emphasizes supervision, so that colleagues on the project who are not yet used to these images can contribute their criteria before the images are shown to the director who, holding the original vision of the work while dividing his attention between all the post-production departments, becomes the perfect judge.
Training AI is a trade in itself, almost a craft, of learning how to ask the machine for what we want from it. It is vital to spend time investigating neural networks and gaining experience, because our natural brains must also be trained, through trials, attempts and many mistakes, to develop an intuition about what to expect from artificial brains.