January 16, 2020
Andrew Ginter is the VP Industrial Security at Waterfall Security Solutions, a co-host of the Industrial Security Podcast and the author of two books on OT security. At Waterfall Andrew leads a team of experts responsible for industrial cyber-security research, contributions to standards and regulations, and security architecture recommendations for industrial sites.
Born in Israel in 1975, Malicious Life podcast host Ran Levi studied Electrical Engineering at the Technion Institute of Technology, and worked as an electronics engineer and programmer for several high-tech companies in Israel.
In 2007, he created the popular Israeli podcast Making History. He is the author of three books (all in Hebrew): Perpetuum Mobile: About the history of Perpetual Motion Machines; The Little University of Science: A book about all of Science (well, the important bits, anyway) in bite-sized chunks; Battle of Minds: About the history of computer malware.
Malicious Life by Cybereason exposes the human and financial powers operating under the surface that make cybercrime what it is today. Malicious Life explores the people and the stories behind the cybersecurity industry and its evolution. Host Ran Levi interviews hackers and industry experts, discussing the hacking culture of the 1970s and 80s, the subsequent rise of viruses in the 1990s and today’s advanced cyber threats.
Malicious Life theme music: ‘Circuits’ by TKMusic, licensed under Creative Commons License. The Malicious Life podcast is sponsored and produced by Cybereason.
Have you ever seen a petrochemical plant? They are massive constructions. Like small cities. And they’re incredibly complicated–buildings and structures connected to other structures intertwined with other structures, and endless steel pipes running in every which direction. How anybody thought to build such a convoluted thing, and how anybody actually understands what it’s all for, is, frankly, beyond me.
Petro Rabigh is one of these giant, labyrinthine plants. It’s located on Saudi Arabia’s west coast, right along the Red Sea, just off the halfway point between Mecca and Medina. It’s like some kind of futuristic robot city: 3,000 acres, tall metal towers surrounded by six- or seven-story-tall structures made of steel pipes and beams that make them look like the metal skeletons of buildings half-finished. With all those pipes, it produces around 5 million tons of petrochemical products every year.
An Unexpected Shutdown
It’s difficult to imagine how a single piece of computer software could have any kind of effect on such a massive place. And had you visited Petro Rabigh on a Saturday evening in June 2017, hardly anything would have appeared wrong. June happened to be the month of Ramadan that year, so it was even quieter than a normal Saturday at Petro Rabigh would otherwise be. The only event of note occurred when, without warning or explanation, a section of the plant shut down.
The shutdown had been triggered by a safety system. No real damage was done, but it was strange nonetheless. So a team of specialists was called in to investigate the cause. The team ran some tests, then brought the safety device back to their laboratory for further inspection. The tests turned up nothing–the device seemed to be in fine working order.
The shutdown must have been some strange, one-off glitch, they figured.
They were, of course, wrong. But it’s hard to blame them. What was actually going on, just under their noses, had never existed before in the history of cyber security.
Intro To Industrial Security
But we’re going to step out of our story for a few minutes now, because the kind of security that’s practiced at a petrochemical plant is characteristically different from the security we typically talk about on our show. It requires a different skill set–really, an entirely different mindset than working in IT does.
So here’s the deal: for the next twenty minutes I want you, Listener, to forget the lessons of cyber security you’ve come to know. We’re going to put on a different hat. Only by thinking like industrial systems engineers will we begin to understand the nature of why that Rabigh system was taken offline on that day in June, and why that incident was the tip of a much, much bigger iceberg.
Are you ready? You’ve got your new hat on? Good.
[Andrew] I’m Andrew Ginter. I am Vice President of Industrial Security with Waterfall Security Solutions. I work with some of the most secure industrial sites on the planet.
Andrew Ginter is one of North America’s leading voices in industrial security. He co-hosts a podcast–called “The Industrial Security Podcast”–with the Senior Producer of our show, Nate Nelson.
[Andrew] The fundamental difference between industrial cyber-security and classic sort of enterprise cyber-security is this. In the enterprise world, we seek to protect the information. We protect the confidentiality or the integrity or the availability of the information. Usually confidentiality is the highest priority. When I work with the world’s most secure sites, they tell me their number one priority is not protecting the information. Their number one priority is safe, correct, efficient and continuous physical operations. Cyber-attacks come in as information; that is the definition of cyber. Keep the control system running. Keep the physical process running. And integrity is important too. We have to keep it running correctly or there’s no point.
We’re talking about what’s at stake here. In industry and manufacturing you’re not just protecting data, but machines–big, hulking, powerful machines delivering the materials necessary to a functioning modern society.
[Andrew] All cyber-attacks are information. Because cyber-attacks come from information, what we need to do is protect physical operations from the information. Every information flow into the control system is a potential attack vector. A comprehensive list of these information flows is a comprehensive list of attack vectors. We need to eliminate as many of these as possible and thoroughly discipline the rest of them. So not protecting information but protecting physical operations from information.
This point is crucial. If a hacker shuts down a corporate IT network, it’s a big deal. In fact, we’ve got a whole class of these kinds of attacks–denials of service–that everybody hates dealing with.
But industrial security engineers aren’t only protecting equipment from information. They’re also protecting humans from equipment. Plants process highly combustible, toxic, electric and otherwise volatile substances. If you’ve ever heard of Chernobyl, you don’t need me to tell you why it’s important to keep industrial machines happy. So computer security and physical safety are intertwined at industrial sites.
[Andrew] I mean my first language is German. In German, it’s the same word “sicherheit”. It means safety. It means security. So yeah, there’s a lot of confusion. But fundamentally, physical safety is only possible in computer-controlled processes if the control system is secure, if our enemies, if our attackers cannot tamper with the control system and impair physical safety.
To protect human lives is a heavy burden to bear. It is the burden of those who handle the computer systems, but also kinetic security personnel at the physical site. You can’t get past the gates at a place like Petro Rabigh without heavy security checks.
[Andrew] I had the privilege of visiting a refinery a number of years ago. When we drove up to the refinery – this is an impressive artifice. This is an artifact. It was seven stories of pipes from one horizon to the other. It was a massive installation. We drive up to the security booth and the usual happens. I mean we all file out of the car. We show our passports to the folks in the booth. They take pictures of them. They give us cards and then things kind of get a little strange and I couldn’t figure it out.
They made every one of us file through a turnstile, badging in, except the driver. The driver could go back to the car and drive the car through and badge into the facility when he drove the car through. We were not allowed to get back in the car and hand out badges to the driver and get badged in. We had to physically badge in ourselves. Then we could – on the other side of the security fence, then we could get back into the car.
This was strange and it continued strange. Nobody checked the trunk of the car to see if anybody was hiding inside. It just seemed odd and so I asked my contact there. I said, “I can’t put my finger on it. But it seems weird. The priority seems weird.” My contact there said, “You don’t understand. The physical security program here is part of the safety program. The reason they asked us all of those questions about where we’re going to be and who we’re going to be with and what we’re going to be doing and how we’re going to be walking around is because they want an ironclad count of where everybody is in the facility at all times.
If there is an industrial incident at the facility, they will risk the lives of rescue personnel only if they know there’s someone in the area to save. They’re not going to send personnel into a dangerous area if they don’t know that there’s anybody in there. So it’s vital that when you’re walking around the facility, every time you see a badge reader, you badge in. If somebody is hiding in the trunk, they’re on their own.
Safety Instrumented Systems (SIS)
In addition to safety procedures like badging, there are actual machines that are specially designed to save lives, should anything at a plant go wrong. Industrial engineers, plant operators, managers and their families only sleep at night if these machines work perfectly, 100 percent of the time. And no machine is more fundamental to safety at an industrial site than the “safety instrumented system.”
[Andrew] A safety-instrumented system is a little computer. It has one job. It does one thing all day long. It has got a program inside that reads all of its inputs. So this computer is typically physically connected to between 50 and 300 sensors in the physical process. The sensors measure temperature, pressure, flow, this kind of thing and the program has one job. Check all of the inputs against a calculation that determines if the facility is still running within safe tolerances, within safe limits. If the answer is yes, there’s only one output from a safety-instrumented system, yes or no. If the bit is yes, keep running the process. If ever the process, the physical process deviates from safe parameters, send out a no and trigger an emergency shutdown, an immediate emergency shutdown.
This is what safety systems are designed for: to protect human life. Not protect the equipment. Not keep the process running continuously and efficiently. Safety systems have one job and that’s to prevent casualties and to prevent catastrophes, if you like.
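Andrew’s description amounts to a very small program, and it can be sketched in a few lines of Python. The sensor names and trip limits below are invented for illustration; a real SIS runs equivalent logic in firmware against dozens or hundreds of hardwired inputs.

```python
# Hypothetical sketch of a safety instrumented system's scan loop.
# Sensor names and (low, high) trip points are invented for illustration.

SAFE_LIMITS = {
    "reactor_temp_c":    (20.0, 450.0),
    "line_pressure_bar": (1.0, 80.0),
    "coolant_flow_lpm":  (100.0, 5000.0),
}

def within_limits(readings: dict) -> bool:
    """The SIS's single job: are ALL inputs inside safe tolerances?"""
    return all(
        lo <= readings[name] <= hi
        for name, (lo, hi) in SAFE_LIMITS.items()
    )

def sis_scan(readings: dict) -> str:
    # One output bit: keep the process running, or trip an
    # immediate emergency shutdown.
    return "RUN" if within_limits(readings) else "TRIP"
```

The point of the sketch is how little the device does: one check, one output bit, repeated forever.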
Now you understand why a team of specialists had to be called into Petro Rabigh in June 2017. It wasn’t necessarily what happened that was so catastrophic, but that a safety instrumented system–a machine designed with the sole purpose of protecting human life–was the thing which caused it.
[Andrew] The vendor looked at the logs, looked at everything and determined it was a mechanical failure. They didn’t say in what. I’m guessing in one of the sensors. So they made the repair. They brought the site up. When I say brought the site up, it’s harder than that. If a large refinery trips, it may be a week before it’s back in full production. This is a very expensive event.
So it tripped. There was some diagnosis. The verdict was mechanical. There was repair and the process was – the facility was sooner or later brought back up to full production.
After the initial check, everything appeared to be normal. Life went on.
[Andrew] A month later it tripped again. People started getting suspicious. They brought the investigators. They said, “What’s going on here?”
On August 4th, 2017, at 7:43 p.m. two emergency shutdown systems automatically clicked on. All the while, plant operators had no idea anything was amiss.
Once they discovered the shutdown, they entered a state of emergency. And they were right to do so. In the IT space, two cybersecurity events in two months is nothing. In the industrial security space, it’s a huge red flag. You just don’t see it.
In need of urgent assistance, Rabigh called in a response unit from Saudi Aramco.
Aramco should be a familiar name to those of you who listened to our episode on Shamoon–the malware that took out all of Aramco’s computers in 2012. What we didn’t have time to get into in that episode was how immediately, how thoroughly, Aramco transformed their security posture in response to that incident. It turns out that losing your entire IT infrastructure, and having to buy 50,000 hard drives all at once, is a good motivator to invest in cybersecurity. The company hired a team of specialists to rebuild their systems from the ground up, and clearly, by 2017, they still hadn’t forgotten their lesson. They had world-class specialists on call to dispatch to Rabigh, a company in which Aramco owns a 37.5% stake.
These were the experts that finally figured out what was going on. Their initial clue was a pattern of strange communications between Rabigh’s IT network and some operations workstations.
That there even was a line of communication between those two networks was itself a problem. Industrial plants have multiple, distinct layers of computer systems, each with their own function. And not all of these layers are supposed to talk to all the other layers.
Just picture a layer cake: a really great layer cake, with all kinds of flavors. There’s vanilla on top, then caramel, then mocha, then chocolate on the bottom. They’re all stacked up–that’s what makes it a layer cake. If the vanilla crossed over with the mocha, and the mocha bled into the chocolate layer, and the caramel was just all over the place, you’d have a…. You’d have a…
Huh? What were we talking about? Oh yeah: with all those layers bleeding into one another you’d have a tasty, but nonetheless screwed-up, layer cake.
The deepest layer of an industrial plant–even deeper than the actual machines–is where safety instrumented systems – or SIS – lie, as well as any physical mechanisms built into the machines to prevent equipment failure. Because failure of these systems can be deadly, there is nothing more important than keeping this layer protected.
The next layer up from safety systems is where the machines, and the machines that control the machines, lie. We call this the Distributed Control System, or DCS. This is the heart of the plant’s operation–what you picture when you’re imagining what goes on at an industrial facility.
Typically, the DCS and SIS layers are not supposed to be in communication, unless the SIS’s require a specific adjustment, like a software update. There are a number of reasons to keep these layers separate. SIS’s are designed to act automatically, without need for human input, so that they can stay objective. And isolation prevents SIS’s from being easily tampered with–by a remote attacker, or even a malicious insider in the plant. Again, because safety systems prevent explosions, these precautions are critical.
Next up from the DCS layer is the demilitarized zone, or DMZ. The DMZ is the protective layer, keeping what’s behind it safe from the outermost layer: the IT network. Because IT systems connect to the internet, they are extremely vulnerable to incoming cyber attacks. The DMZ is designed to prevent those threats from burrowing their way into operational systems. A typical DMZ might include layered firewalls, or unidirectional gateways that allow for only one direction of information flow: either blocking malicious information from coming in, or preventing critical information from getting out.
This is our layer cake. Each layer is distinct. The lower you go the more critical your equipment gets. Therefore, lower layers are more secure than higher layers, and overlap between layers must be kept to the absolute minimum necessary to keep the plant running smoothly. A breach in any layer represents a vector for attacks to burrow downward.
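The layer-cake rule–traffic only between adjacent layers, and the safety layer reachable only from the DCS–can be captured in a short sketch. The layer names follow this episode’s description (roughly the Purdue reference model); the rule itself is a simplification for illustration, not any plant’s actual firewall policy.

```python
# Illustrative model of the plant's network layers, outermost to deepest.
# Layer names follow the episode's description; the connectivity rule
# below is a simplification, not a real plant's firewall policy.

LAYERS = ["IT", "DMZ", "DCS", "SIS"]

def flow_allowed(src: str, dst: str) -> bool:
    """Traffic may only cross between ADJACENT layers, and nothing
    but the DCS may ever talk to the safety (SIS) layer."""
    i, j = LAYERS.index(src), LAYERS.index(dst)
    if abs(i - j) != 1:
        return False          # no skipping layers
    if dst == "SIS" and src != "DCS":
        return False          # safety systems stay isolated
    return True
```

In these terms, what the responders found at Rabigh was a path that this rule should have forbidden: traffic hopping from the IT layer all the way down toward the safety layer.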
Connected IT/OT Layers
Though it’s best practice to keep every layer of an industrial network distinct, the reality is that most plants aren’t perfect. This is especially true when security clashes with accessibility. Andrew Ginter.
[Andrew] There’s a compelling business need to send information out of the system, to IT systems, to the internet so that we can monitor what’s going on. For example, if smoke rises out of one of the safety controllers, you kind of want to know that so you can replace it promptly. How do you know that? Well, it stops sending any information to the outside world. Oh look, we stopped getting updates from one of the controllers. We should go and investigate. There’s enormous value in monitoring industrial systems. But if monitoring comes at the cost of connectivity, and connectivity means potential attacks, we have a problem.
If now we connect the safety system to the industrial control system and industrial control systems connected to the demilitarized zone and the demilitarized zone is connected to the enterprise network and so on out to the internet, you’ve got a path of attack from the internet straight into the safety systems. This is why the advice that people are coming out with is saying, you know, instead of connecting all this stuff up like most people do, you really should start taking inspiration from really thoroughly secured industrial sites and don’t do that.
Either physically disconnect the safety networks so they cannot be reached from the internet. You have to walk over to the work station if you want to use these things or throw some unidirectional technology in there that lets you monitor the safety systems, but is physically incapable of sending any information into those systems. No information gets in and no attacks get in.
There is a clear and present cost to disabling all remote monitoring of plant operations. The cost of enabling remote monitoring is just as clear, but far less present. Petrochemical plants aren’t hacked often. So can you blame the operators at Petro Rabigh for leaving open a window of communication between layers in their network?
[Andrew] The sites I work with never connect their safety systems to any network that can be reached from the internet directly or indirectly. So this is something that the world’s most secure, most cautious sites do. Mere mortals, the average site, yeah, sometimes they connect things up. So you could argue – some people will point the finger and say this was sloppiness. I’m not sure I would go that far. This is what a lot of people do.
The pathway that led from the internet, to IT systems, to the plant’s most sensitive internal systems, was wide enough to allow a capable attacker direct access to machinery. With that access, they went after the most sensitive machines in the entire facility.
SIS ‘PROGRAM’ Mode
Petro Rabigh used the “Triconex” brand of safety instrumented system–specifically the “Tricon 3008” model–made by the French company Schneider Electric. Triconex SIS’s have four modes of operation, which can be toggled via a physical key switch. Those four modes are: remote, program, run and stop. “Remote” is the least secure mode, allowing direct changes to the machine’s code. “Program” is the mode you might use if you were a plant operator, and needed to update the machine in some way–it allows you to load a control program onto the machine, which is then debugged and downloaded by the machine’s internal software. “Run” is the mode which, in theory, should be active almost all of the time–it allows only read access, so that plant operators can view but not alter the machine while it’s in operation. “Stop,” as you might expect, stops the machine from running at all.
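One way to picture the keyswitch is as a simple gate on program downloads. The mode names come from the paragraph above; the gating logic is an illustrative guess at the behavior, not Schneider Electric’s actual firmware.

```python
# Sketch of the four-position keyswitch described above. Mode names
# follow the episode; the gating logic is illustrative, not the
# controller's actual firmware behavior.

from enum import Enum

class KeyMode(Enum):
    REMOTE = "remote"    # least secure: remote code changes allowed
    PROGRAM = "program"  # control-program downloads allowed
    RUN = "run"          # read-only: operators may view, not alter
    STOP = "stop"        # controller halted

def download_allowed(mode: KeyMode) -> bool:
    # Only REMOTE and PROGRAM accept new control programs; a controller
    # left in PROGRAM is therefore writable over the network.
    return mode in (KeyMode.REMOTE, KeyMode.PROGRAM)
```

Under this picture, a key left in PROGRAM is an open door: anyone who can reach the controller on the network can push code to it.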
We can only infer from afar why half a dozen SIS’s at Petro Rabigh were left in Program mode instead of Run mode. It’s unclear, too, why the other safety precautions built into Triconex systems didn’t work–namely, the ability to set password protection, and IP-specific restricted access over certain functions of the machine.
The hackers successfully loaded their own, custom-built software onto the six vulnerable SIS’s. That program opened a backdoor through which they could then maintain consistent access to those machines.
[Andrew] There were two pieces that were discovered. One piece tricked the safety system into what’s called “escalation of privilege” which means it – the job of this first piece of malware was to get – some people call it root. Some people call it admin. Get control of the CPU, of the system so thoroughly that the malware could do anything it wanted on the system. It had to get some power first, get some permissions.
The escalation component of the attackers’ software meant that even if the SIS were to have been switched from Program into Run mode at a later time, it wouldn’t have affected their ability to upload new, malicious data onto the machines.
[Andrew] The other thing that piece of malware did was download the rest of the malware. The rest of the malware sat there in memory with permissions to do anything it wanted and what it wanted to do was change the contents of memory. So it would take orders over a communications protocol to change this piece of memory or change that piece of memory. What can you do with that? Well, you can reprogram the safety system. You can change the limits, this – you know, in memory that the safety system compared readings to. You can do anything you want to the safety system.
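The “change this piece of memory” protocol Andrew describes can be illustrated with a toy command handler. The opcodes, addresses and memory model here are entirely invented for illustration; the real implant spoke the controller’s proprietary protocol.

```python
# Toy illustration of a "read/write arbitrary memory" command protocol.
# Opcodes and the memory model are invented; this only shows why such a
# capability equals total control of the device.

memory = bytearray(256)  # stand-in for controller RAM holding safety limits

def handle_command(opcode: str, addr: int, value: int = 0) -> int:
    if opcode == "WRITE":
        memory[addr] = value      # arbitrary write = reprogram anything
        return value
    if opcode == "READ":
        return memory[addr]
    raise ValueError(f"unknown opcode {opcode!r}")
```

With nothing more than these two commands, an attacker who holds root on the controller can rewrite the very limits the safety logic compares its sensor readings against.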
It’s worth reiterating here just how significant this point is. Even if the hackers did nothing to the Rabigh safety systems, the mere fact that they demonstrated they could have was enough to change the entire future history of cyber security. Two and a half years later, it is still the number one malware on industrial security experts’ minds.
[Andrew] I mean people have talked about this possibility for a long time, about this class of attack. This is the first example of this class of attack we discovered in the wild. It would appear that somebody was trying to sabotage the safety system. Why? We don’t know. Maybe they wanted the ability to shut down the plant at their will. Well, they succeeded in that twice. Maybe they wanted something worse. If you impair the operation of the safety system, then when an unsafe condition occurs in the physical plant, the safety system no longer shuts down the plant.
What happens when unsafe conditions occur? Well, there are explosions. There’s toxic releases. There’s releases of petrochemicals into the environment. In the worst case, it’s possible to imagine not just a disaster but a catastrophe.
The program that breached Petro Rabigh was the first ever malware designed to kill humans. We call it Triton.
Having discovered the most dangerous malware in the entire world, researchers were now left to pick up the pieces–to restore the Petro Rabigh plant to working order, and figure out what evil entity was behind Triton. But they had one thing to do before anything else.
Even after discovering the Triton malware, the Triton hackers remained connected to Petro Rabigh’s internal systems. The security team had to remove Triton from their systems, but doing so would signal to the hackers that they’d been caught. At that point anything could happen.
This is where we’ll pick up the story in our next episode, Triton Part Two.