Malicious Life Podcast: GAN and Deep Fakes, Part 1 Transcript
“Forrest Gump” is considered one of the most successful films of the 1990s, and has won several Academy Awards. One of the most memorable scenes in the movie is one in which Forrest, the film’s hero, is invited to a reception at the White House. During the reception, Forrest finds that his favorite soft drink, Dr. Pepper, is served free – and so he drinks fifteen bottles of it. Sometime later he’s asked to shake hands with President John F. Kennedy. The President asks him how he is, and Forrest tells him he has to pee. Kennedy laughs, turns to the camera and says – “I think he said he has to pee!”
This short scene – no more than fifteen seconds long – is considered a milestone in movie history because President Kennedy was, at that point, already dead for several decades. It was one of the first times moviegoers got to see a historical figure returned to life on the big screen, thanks to computer-generated animation.
Twenty-five years later, in June 2019, a short video is posted on YouTube. It’s called ‘Terminator 2 Starring Sylvester Stallone’, and just as the name implies – it is a scene from the movie Terminator 2, except that instead of Arnold Schwarzenegger, the film’s original star – we see Stallone stepping naked into a shady bar, asking a tough-looking member of a Hells Angels motorcycle gang to give him his clothes. I won’t give you any spoilers, but let’s just say this encounter ends in a typical Schwarzenegger/Stallone manner – i.e. everybody gets the s*** kicked out of them.
This video is much more convincing than the JFK scene in Forrest Gump. Looking closely at Kennedy in the movie, it’s easy to see that his lip movements, for example, don’t match the rest of his face. Fake-Stallone’s facial expressions, in contrast, are amazingly accurate. If you didn’t know in advance that it was a fake – I’m willing to bet you wouldn’t have guessed it.
You’re probably asking yourself – well, what’s new? Technology has advanced a lot since the 1990s. Obviously, in modern Hollywood movies, CGI is much better than it used to be.
But there is one thing that has hardly changed in the domain of computer-generated animation from the days of Forrest Gump to the present – and that is the huge investment of time and talent needed to produce such cinematic scenes.
To create Forrest’s scene with Kennedy, Robert Zemeckis – the director – had to recruit the services of Industrial Light & Magic, a well-known special effects studio, and some major special effects experts, including one who had already won two Academy Awards. The actual filming of the scene was also complicated, requiring Tom Hanks to practice the handshaking sequence with Kennedy’s double on the set while stepping on markers placed on the floor. Even in more modern Hollywood movies, special effects often require complicated sets, sophisticated motion capture cameras, state-of-the-art animation software and professional, talented animators who know how to use them. I know this because my brother is an animator who worked on some of the most successful TV series and movies of recent years, from Game of Thrones to Aquaman.
An AI Revolution
But the fake Terminator 2 video, on the other hand, was created by a single person: a YouTube user calling himself “Ctrl Shift Face”, using freely available software he found on the net, in just a few days of work. Let me repeat what I just said, for emphasis: a lone user – not a studio full of pro animators – using a relatively simple piece of software – not some sophisticated CGI behemoth – and it took him a few days, compared to months and sometimes years of work. According to a digitaltrends.com article, Ctrl Shift Face – who says he is not a coder – doesn’t even really know how the software works.
And Stallone’s fake Terminator 2 video is not a one-off. Far from it. Searching YouTube for the phrase ‘Deep Fakes’ – as these new fabricated videos are known – brings back thousands and possibly tens of thousands of results. Ctrl Shift Face himself has made some fifty videos, including one in which a younger David Bowie replaces Rick Astley in the music video for ‘Never Gonna Give You Up’. Go check it out, it’s horribly hilarious.
Over the past two years, the internet has been inundated with celebrity Deep Fake videos of all kinds: Obama, Putin, and Trump deliver speeches they never gave, Gal Gadot “stars” in a porn video, and professional comedians such as Bill Hader eerily turn into the people they impersonate, like Tom Cruise and Arnold Schwarzenegger.
What all of these videos have in common is that they were mostly created by amateur developers or small startups with tight budgets – but their quality is surprisingly good, and in some cases as good as what the biggest movie studios were able to produce with huge budgets just a few years ago.
So what happened in the last five years that turned special effects from being the exclusive domain of industry experts – into something a 14-year-old can create more or less at the touch of a button? Like the tip of an iceberg, Deep Fakes are by and large only the visible product of a fascinating – and much deeper – technological revolution in the field of artificial intelligence. As we shall soon see, this revolution has the potential to put some very powerful tools in the hands of both attackers and defenders in the world of cybersecurity.
Deep Learning
The roots of the Deep Fakes revolution lie in another technological revolution that took place less than a decade ago: Deep Learning. Deep Learning is a technology that allows a computer to learn to perform tasks – not by having a programmer explicitly define what calculations and steps are required to solve the problem – but by giving the computer plenty of examples and having it learn to recognize subtle patterns in that information. Deep Learning is a broad and interesting topic by itself – but let’s review the basic principles relevant to our topic today.
Deep learning is based on a network of artificial neurons that are interconnected in a way that roughly resembles the neural networks in our brains. The information fed to the network – for example, an image of some sort – “permeates” and bounces around between the different neurons, until at the other end of the network some “decision” is made, such as whether or not that image contains a human face.
How does the neural network learn to recognize a human face in an image? This happens during what is called the “training” phase, in which the developer feeds many thousands of images into the network. If the network fails to recognize a face in an image – or detects a face in an image that has none – an automatic process called ‘backpropagation’ is applied to subtly change the pattern of connections between the neurons: some connections are strengthened, while others are weakened. After tens of thousands of cycles, the network’s connections pattern is such that each time the network receives an image it will recognize a human face with almost perfect accuracy.
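For the technically curious, here is what that predict-compare-backpropagate cycle might look like in code. This is a minimal, hypothetical sketch in PyTorch – a toy network and made-up stand-in data, not a real face detector:

```python
# A toy illustration of the "training" phase: predict, measure the error,
# backpropagate, adjust the connections. All names and data here are
# hypothetical stand-ins, just to show the shape of the loop.
import torch
import torch.nn as nn

# A tiny classifier: flattens a 64x64 grayscale image and outputs
# the probability that the image contains a human face.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
    nn.Sigmoid(),
)

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()  # penalizes wrong face / no-face answers

# Stand-in data: random "images" with random face / no-face labels.
images = torch.rand(32, 1, 64, 64)
labels = torch.randint(0, 2, (32, 1)).float()

for step in range(10_000):                 # tens of thousands of cycles
    predictions = classifier(images)
    loss = loss_fn(predictions, labels)    # how wrong was the network?
    optimizer.zero_grad()
    loss.backward()   # backpropagation: trace the error back through the net
    optimizer.step()  # strengthen some connections, weaken others
```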
The idea of ’training’ a network of artificial neurons to accomplish a given task is the basis of the AI revolution that has been going on around us for the last ten years or so. If you train the network on pictures of human faces, you get facial recognition. If you give it examples of words and phrases, you get speech recognition – and if you train it on information received from cameras and LIDAR sensors – you get autonomous vehicles.
Classification Vs. Generation
To date, most applications of Deep Learning are focused on the concept of classification, such as classifying images according to whether or not they contain a face, or answering questions such as ‘Does this image have a dog or a cat in it?’. Distinction and classification are very useful capabilities that, as I mentioned, form the basis for many interesting applications of AI – but as humans, we can do more than just classify things: we can also create new “things” and new ideas. For example, I can tell my cat, Nachos, from a random dog walking on the street – but I can also draw him, write songs about his adventures in our neighborhood and compose a symphony that praises the beauty of this amazing animal.
But, honestly, Nachos is not a particularly good looking cat, nor does he have impressive adventures because, frankly, he’s a bit lazy. What’s more, even if Nachos were an amazing cat instead of an annoying lump of fur that steals food from the countertop and pushes things off the table just to see what will happen – even then, I couldn’t write songs or compose symphonies about him. In fact, I couldn’t even draw him. I can spot a cat easily if I see one, but if I try to draw a cat, I’ll probably end up with something that looks more like an Alien. Why? Because there is a fundamental difference between classifying things and creating them. Anyone who has ever tried to draw something will understand this fundamental difference right away: to recognize a cat, you need just a glance at the tip of its tail, its pointed ears, and whiskers – but to draw a cat properly, you need to know how long the tail is, the true shape of its ears, the number of whiskers it has and a thousand other tiny details that make up a cat.
This is the problem that AI researchers encountered when they tried having their neural networks generate new information. With Deep Learning, it is relatively easy to teach your computer to recognize a human face or to distinguish between dog and cat. All the computer needs to do is learn the basic characteristics that define a human face, or those that separate dogs from cats: for example, being able to recognize a nose, eyes, and mouth, or differentiate between a cat’s ear and a dog’s ear.
But to generate a successful image of a human face – not a copied one, mind you, but a face that didn’t exist before – the network needs to be much smarter and much more accurate than a simple classification machine. It needs to be able to not only identify eyes in a picture – but to also know that the eyes are usually found under the eyebrows, and that eyebrows have certain typical shapes and that if the figure in the image is female then she probably has no mustache…and countless other tiny details that individually are almost insignificant – but in aggregate define what you and I will agree on as being a human face.
Not long ago, neural networks weren’t sensitive enough to infer all these tiny details from the examples they were given. When researchers tried using them to generate new information – the results were usually atrocious. Images of human faces, for example, were blurry and smudged at best, and distorted and nonsensical at worst. It was clear to everyone that something about how neural networks learn from the examples given is not good enough to allow them to discern all these little details, and then create new examples of that information.
All that changed in one evening, though, in less than an hour, after a debate between two friends in a pub.
Ian Goodfellow
Ian Goodfellow looks like a nerd. Really. If you gave an AI a thousand pictures of nerds and asked it to generate a new image of a nerd – it would probably draw Ian Goodfellow. He is bespectacled and thin, with a peculiar goatee and a haircut that looks as if someone turned a cooking pot over his head and cut off whatever stuck out of the sides.
As an undergraduate at Stanford University, Goodfellow initially majored in chemistry and biology, but wasn’t very good at either. Realizing that he was not destined to become a great chemist, he switched to Computer Science and AI. This turned out to be a very good decision, thanks in part to the fact that one of his hobbies was creating computer games – and the GPUs used in computer games are the same processors used in AI. Also, Goodfellow had the good fortune to study under two of the most prominent AI researchers of the past two decades – Andrew Ng and Yoshua Bengio. As part of his master’s and doctoral studies, Goodfellow was exposed to a variety of existing methods to use neural networks to create new information – so-called Generative Models – and the advantages and disadvantages of each of these models.
One evening in 2014, Goodfellow was sitting in a pub with a friend, talking about the different generative models and their problems. During the conversation, Goodfellow came up with an idea. What would happen, he wondered, if we let a neural network learn by playing a game? In particular – a game against a rival neural network. One network will generate new ‘fake’ information – for example, an image of a human face – and the other will try to detect whether the image presented to it is one of a real person or a fake image created by its rival. Goodfellow wondered if it would be possible to train a generative network to create compelling images of human faces this way, in much the same way that a chess player learns to play better by playing against challenging opponents.
Generative Adversarial Network (GAN)
This idea of learning and improving while playing a game between two opposing parties is not a new idea of course. Almost every human sport is based on this principle, from chess to basketball. Computer Science has seen several successful applications of this principle in the past. For instance, Arthur Samuel, a Computer Science pioneer in the late 1950s, created a program that learned to play checkers by playing against another copy of itself.
The thing is – training neural networks is not as easy as it sounds: it’s a task that requires a lot of knowledge and expertise, and is no less an art form than engineering. Goodfellow’s idea required training not one network – but two, simultaneously. Was it possible? Goodfellow thought so, and that very evening, after he returned home from the pub – he sat down in front of his computer and in less than an hour created what another pioneer of artificial intelligence, Yann LeCun, called “the coolest idea in machine learning in the last twenty years”. This idea, now known as Generative Adversarial Networks – or GAN, for short – took the world of AI by storm and catapulted Ian Goodfellow – then only in his early thirties – into AI stardom. Goodfellow’s paper became one of the most cited papers of recent years in computer science, he has been a keynote speaker at international conferences, and he has held prestigious research roles at both Google and Apple.
GAN Under The Hood
Let’s take a deep dive into GAN and see how it works under the hood. I will describe a system that produces images of human faces, but in principle, GAN is suited to a wide variety of domains such as videos, speech samples and more.
So a GAN system consists of two neural networks, each with a defined role: one is the generative network – namely, the network that generates new information – and the other is the discriminative network.
Let’s start with the generative network. The first step is to train the generative network by itself, separately from the discriminative network. This initial training phase is important to bring the generative network to a basic performance level – otherwise, it will produce gibberish. Think about it this way: if our goal is to create the world’s best currency counterfeiter – it would be silly to start with a counterfeiter so green that he has never actually seen a banknote with his own eyes.
Our goal is to have the generative network learn the basic features that define a human face, and crucially – the hidden connections between these features: where the eyes are in relation to the rest of the face, what is the relationship between nose width and lip thickness, what makes a face ‘feminine’ or ‘masculine’, and so on. In other words, we don’t want our neural network to memorize the faces it is given like a student might memorize the answers to all the questions in a textbook: we want it to be able to generalize from the patterns it detects so that it can subsequently create new and different faces based on what it learned.
Once we’re done with the generative network’s training phase, we train the discriminative network – the network whose job will be to spot the fakes created by the generative network – and for much the same reasons. We’re creating the equivalent of a police investigator who will be the adversary of the counterfeiter – and a weak adversary won’t help us improve our counterfeiter.
Now it’s time to connect the two networks. We take the output of the generative network and connect it to the input of the discriminative network. In addition, we connect one more input to the discriminative network: one that will feed it images of real faces. The discriminative network’s job will be to distinguish the real pictures from the fakes produced by the generative network. In our analogy – the police investigator will be handed a pair of banknotes, one real and one fake, and his job will be to point at each and say – “This is fake” or “This is real.”
Now we can start the actual game. For the generative network to generate a fake image, we must first feed it some random numbers, sometimes referred to as a ‘seed’. This seed acts as a ‘writing prompt’: it is the base upon which the generative network builds a new and unique face, drawing on the knowledge it has gained about the hidden connections between the various parts of a face.
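As a rough illustration – again with a hypothetical toy network, not any production system – here is what feeding a random seed into a generative network might look like in PyTorch:

```python
import torch
import torch.nn as nn

# A toy generator: maps a 100-number random seed (a 'latent vector')
# to a flattened 64x64 grayscale image. Purely illustrative.
generator = nn.Sequential(
    nn.Linear(100, 256),
    nn.ReLU(),
    nn.Linear(256, 64 * 64),
    nn.Tanh(),                       # pixel values squashed into [-1, 1]
)

seed = torch.randn(1, 100)           # the random 'writing prompt'
fake_image = generator(seed).view(1, 1, 64, 64)
```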
We now pass this newly created image on to the discriminative network – along with a real image in its other input. The discriminative network will examine the images and grade each one according to how likely it thinks it is to be real or fake.
If the discriminative network was right and spotted the fake image – then the generative network did not do a good enough job: maybe the nose in the picture wasn’t exactly in the right place, or the shape of the eyebrows didn’t match the position of the eyes. The GAN system will modify the pattern of connections in the generative network to improve it – and try again with a freshly generated face. If the discriminative network was wrong and mistook the fake image for a real one – then it was not good enough at detecting fakes, and so the GAN system will modify its neural connections to improve the discriminative side.
This game will continue for many thousands of rounds until, hopefully, the generative network improves so much that the discriminative network can only detect the fake images fifty percent of the time – i.e., a virtual coin toss. From a practical perspective, this means that our counterfeiter is now so good at producing fake bills that even the best police investigator has no better chance of telling them from real banknotes than a three-year-old guessing at random.
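Here is a minimal sketch of that game in PyTorch. The networks, data and hyperparameters are all hypothetical stand-ins, and real GAN training involves many practical tricks this leaves out – but the alternating ‘counterfeiter vs. investigator’ updates are the heart of it:

```python
import torch
import torch.nn as nn

# The same toy generator as before...
generator = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 64 * 64), nn.Tanh(),
)
# ...and its adversary, which outputs the probability an image is real.
discriminator = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()
real_label = torch.ones(16, 1)
fake_label = torch.zeros(16, 1)

for step in range(10_000):
    real_images = torch.rand(16, 1, 64, 64)   # stand-in for a real dataset
    seeds = torch.randn(16, 100)
    fakes = generator(seeds).view(16, 1, 64, 64)

    # The investigator's turn: learn to call real 'real' and fake 'fake'.
    d_loss = (loss_fn(discriminator(real_images), real_label)
              + loss_fn(discriminator(fakes.detach()), fake_label))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # The counterfeiter's turn: learn to make the investigator say 'real'.
    g_loss = loss_fn(discriminator(fakes), real_label)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```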
Now that the training is over, we can “take apart” our GAN system and extract the generative network. Remember that the actual goal of the system is to train the generative network to be as good as possible: our virtual investigator exists, in the game, only to serve as a worthy opponent to the counterfeiter. Once the game is over, the discriminative network is no longer needed.
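In code terms, ‘taking apart’ the system is trivial – continuing the hypothetical sketch above, we simply keep the generator and use it on its own:

```python
# Continuing the sketch above: discard the discriminator, keep the generator.
torch.save(generator.state_dict(), "generator.pt")

# Creating a brand-new face is now a single forward pass on a fresh seed.
with torch.no_grad():
    new_face = generator(torch.randn(1, 100)).view(64, 64)
```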
The Many Variants of GAN
GAN has one significant advantage over other generative methods: it is ‘unsupervised’ – in the sense that there’s no need for a human to label images as ‘real’ or ‘fake’, since the system as we described it takes care of that classification on its own. This fully automatic and unsupervised operation saves a huge amount of time and effort for the designers, who would otherwise have to manually label each image. And yet, it works. In his original 2014 paper, Ian Goodfellow demonstrated fake images of human faces created by his innovative system that were significantly better than any created by a neural network up to that point.
The obvious benefits of GAN drew the attention of many researchers, and the past five years have seen rapid development and innovation on Goodfellow’s original design. There are currently no fewer than 510 different variations of GAN, optimized and improved for various applications and use cases. For those interested, you can find a complete list of all existing GAN models on GitHub, under “The GAN Zoo”. There’s StyleGAN, CycleGAN, DiscoGAN, SAGAN and of course VEEGAN, which is probably superior to all other designs – if not performance-wise, at least morally.
GAN in Cyber-Security
Almost all the current applications of GAN focus on image, video and audio manipulation, but there’s little doubt that it will have at least some implications in cybersecurity as well – and perhaps even major ones.
In one paper, for example, a group of researchers from the Stevens Institute of Technology, the New York Institute of Technology and the Swiss Data Science Center showed the potential of GAN in password cracking. According to their research, GAN-based systems can guess passwords 18 to 24 percent better than other state-of-the-art password cracking tools – because their creativity allows them to think more like we humans do when we come up with new passwords.
Another paper, this time from the Chinese Academy of Sciences, demonstrated the potential of GAN in steganography – the practice of hiding information, such as malicious code, inside otherwise innocent-looking images. Since the use of steganography is relatively common in modern malware, there are automated tools that analyze such images to detect the messages hidden inside them. Using GAN, the Chinese researchers trained a generative network to produce images that are more resistant to such automated analysis – that is, GAN allowed the generative network to find weaknesses in an automated tool’s algorithms, in an unsupervised way.
These two papers demonstrate the potential of GAN technology in cybersecurity. It might allow attackers to create malware that is much better at impersonating humans and thus evading detection, or malware that can probe and poke a defense system until it finds a hidden weakness. On the other hand, GAN might allow for improved detection tools – perhaps by turning the GAN paradigm on its head and having the system work towards training a better discriminative network – one that can better distinguish malware from benign software.
Currently, GAN is in its very early days, so it’s hard to tell how important it will be for cybersecurity in the coming years. To try to answer that question, I turned to someone experienced in the use of Machine Learning in the cyber domain.
“My name is Yonatan Perry, I’m Head of Data Science at Cybereason.”
Leading a team of talented developers probably means having to listen to many innovative ideas – and probably having to shoot down a good portion of them.
“Being the skeptic is easy for me, it’s something that I like to do. Nothing drives the inventor better than skeptic views from friends: ‘I will prove him wrong, and this is going to be a huge success, you’ll see!’”
Yonatan says Adversarial Training, the idea at the core of GAN technology, is an integral part of day-to-day life for every developer in cybersecurity.
“The actual concept of adversarial training, or adversarial thinking, is very natural in cybersecurity. Unlike other problems of AI – usually, when you’re trying to identify images of cats, the cats don’t try to hide and mask themselves as dogs.”
But in spite of Adversarial Training being already familiar to AV developers – the skeptic in Yonatan doesn’t necessarily see GAN solving the real difficulties vendors face.
“Usually, ‘Trivial’ is a way to describe a problem given to other people [LAUGHING]. I think that the bigger challenges – and maybe that will be surprising – from my experience, is usually not the architecture itself. Talking from my experience in building Cybereason’s next-gen Antivirus. One of the huge challenges in that area is that there is no good definition of what really should be considered malicious. Not just because every vendor will draw the line in a different place, but also because there is no clear cut definition. The lack of data sources is always a problem. And understanding what is really your business objective in the problem. So placing the architecture in a business context is usually the bigger challenges that we need to face.”
Still, Yonatan can see the potential benefit of GAN to cybersecurity, if researchers can overcome the difficulties of modifying this new technology to their specific needs. And as always in cybersecurity – the bad guys will benefit from it as well.
“I think that if we could apply GAN to that process – having the generator compiling and running software that does malicious things, and be able to take a core of malicious payload and wrap it in something that will fool the discriminator, the discriminator being like the AV software here – then it will be a great tool for the attackers. But, any such tool, if used by the defenders, is also going to make AVs stronger. It’s a bit far off. Applying those tools to audio, video, and text and applying it to cybersecurity requires a whole lot of work in the adaptation. Many companies are working on it separately, hopefully, something will eventually get shared with the public domain and the industry will grow together. But there’s a huge challenge in applying those tools in the cyber domain, which is very different from the classic ones.”
The Potential of GAN
The attention GAN has received in image, video and audio manipulation reflects the tremendous potential of this new and exciting technology in a variety of domains, such as e-commerce. Let’s take online clothing stores as an example. Currently, when thinking about buying a shirt or some trousers, we can only guess how well those items will fit us, based on pictures of pretty models wearing them – but at least in my case, it turns out that a shirt that looks great on Brad Pitt doesn’t necessarily look great on me. I suspect it has something to do with the tiny differences in the color of our eyes. GAN technology will allow us, probably quite easily, to upload a simple selfie along with some basic weight and height measurements – and the computer will be able to recreate a convincing image or a video of us wearing our selected items. Samsung researchers have already demonstrated how they can take a single still image of a face and turn it into a short animated video of this kind.
We may also see a proliferation of digital media production with no flesh and blood models at all, and the basic idea could also apply to the generation of virtual environments and characters in video games. Long-dead actors will come to life on the big screen, and old actors will become young again. Computers will likely do a much better job at cleaning noisy images and videos, and coloring old black-and-white pictures.
Again, all these capabilities are already available to us today: after all, JFK returned to life in Forrest Gump more than twenty-five years ago, and in the Star Wars movie “Rogue One” we saw a virtual Carrie Fisher return to her role as the 19-year-old Princess Leia. But in all these cases, these impressive feats involved a significant investment of money, time and expertise. A few months back we saw a Reddit user calling himself ‘derpfakes’ release his deep-faked version of the same Rogue One scene. Derpfakes said about his version –
“[The] original footage from Rogue One with a strange CGI Carrie Fisher. Movie budget: $200m. [Mine] is a 20-minute fake that could have been done in essentially the same way with a visually similar actress. My budget: $0 and some Fleetwood Mac tunes.”
In other words, GAN technology will bring this novel technological capability ‘to the masses’, making it much cheaper and way more accessible than it is today.
In fact, it already is in the hands of the masses. And what has humanity done with the most revolutionary AI of the last five years? What we always do with new exciting technological innovation. Yes, you guessed it. Porn.
u/deepfakes
In November 2017, an anonymous Reddit user calling themselves u/deepfakes created a forum – a ‘subreddit’, as these are known – named r/deepfakes. It was an important milestone in AI history, since u/deepfakes was the one who coined this new term, which later took the world by storm.
The videos u/deepfakes uploaded to his newly created subreddit were mostly short porn videos in which the original actresses were replaced by much more famous ones, such as ‘Wonder Woman’s Gal Gadot and The Avengers’ Scarlett Johansson. Within a few weeks, the subreddit gained almost 100,000 new subscribers, and many of them started uploading new deep fakes they created themselves. Some of these fake porn videos were circulated under misleading headlines, as if they were leaked private sex tapes. Many of the new videos were innocent goofs, such as the Stallone video I described earlier, or whole YouTube channels dedicated to inserting Nicolas Cage into every movie imaginable, with the results somehow managing to be both hilarious and creepy at the same time.
The mainstream press got wind of the story less than two months later, when journalist Samantha Cole published an article on the Motherboard website under the somewhat provocative headline “AI-Assisted Fake Porn Is Here and We’re All Fucked.” In her article, Cole interviewed u/deepfakes, who revealed that all his videos were created using open-source AI frameworks and images and videos he found on Google.
Reddit’s response was prompt: it shut down and banned r/deepfakes – but it was already too late. A month later, in January 2018, someone released a free application called FakeApp that made generating deep fake videos even easier – and from here on out, deep-faked videos began to appear by the thousands. Most of these early videos were of poor quality: the superimposed facial features did not always match the original actor’s or actress’s face or general body outline, and lip movements did not necessarily match the speech soundtrack – but very quickly the amateur creators learned to improve their techniques, and the technology itself improved rapidly as well.
The surprising and rapid development of deep fakes, and its more problematic uses, has created a sort of “arms race” between deep fake creators – and AI researchers who are trying to find ways to detect these fakes, and hopefully tag them as such before they go viral. This arms race will be the subject of our next episode, the second and final part of this mini-series on Deep Fakes. What are the tell-tale signs, if any, that a video, image or speech sample was created by an AI? How does our brain decide what is ‘real’ and what is ‘fake’, and how can lab mice help us – maybe – save democracy and the mutual trust so critical for a functioning human society? All this and more, next time on Malicious Life.