Therac-25, buggy software that killed.

The Investigating Software Podcast

I look at the Therac-25 incidents, a devastating collection of software failures that often rank in the top 10 of civilian radiation accidents. The Therac-25 radiation therapy device killed or injured 6 people across Canada and the United States. I look into the bugs, why the manufacturer didn't fix them and what we can learn from their mistakes. Scroll down for the full transcript!

Resources used to research and compile this podcast include:

FATAL RADIATION DOSE IN THERAPY ATTRIBUTED TO COMPUTER MISTAKE https://timesmachine.nytimes.com/timesmachine/1986/06/21/870086.html?pageNumber=50
Radiation Therapy for Cancer 1940s Tumor Treated How it Works https://www.youtube.com/watch?v=CKjEz-9CbgE
FATAL DOSE - Radiation Deaths linked to AECL Computer Errors http://www.ccnr.org/fatal_dose.html
Medical Devices: The Therac-25 by Nancy Leveson http://csel.eng.ohio-state.edu/productions/pexis/readings/submod3/therac.pdf
Wikipedia: Therac-25 https://en.wikipedia.org/wiki/Therac-25
FDA document outlining the failure of microwave oven interlocks. https://www.fda.gov/media/75184/download
1.21 Gigawatts - Back to the Future https://www.youtube.com/watch?v=f-77xulkB_U
Hamilton Health Sciences: https://www.hamiltonhealthsciences.ca/about-us/our-organization/our-history/
10 Modern Radiation Accidents Involving Civilians https://listverse.com/2016/02/05/10-modern-radiation-accidents-involving-civilians/
Safety-Critical Computing: Hazards, Practices, Standards, and Regulation https://staff.washington.edu/jon/pubs/safety-critical.html
GOOD COMPUTING: A VIRTUE APPROACH TO COMPUTER ETHICS Chapter 6 http://docplayer.net/33270293-Good-computing-a-virtue-approach-to-computer-ethics.html

Show Transcript:

Pete Houghton (00:01): Hello and welcome to Investigating Software. My name is Peter Houghton. It was the 3rd of June, 1985, when Katie Yarbrough checked into the Kennestone Regional Oncology Center in Marietta, Georgia. Yarbrough was there for follow-up radiation treatment.
After surgeons had removed a tumor a few months earlier, she needed treatment on the lymph nodes near her shoulder. Patients typically have little if any sensation or sign that the treatment is taking place, and Katie had attended treatment at the center before, so she knew the drill. This time was different. Yarbrough screamed in pain and told the machine's operator that he'd burned her shoulder. Later, the hospital's medical physicist determined that she had received up to a hundred times the expected radiation dose in just that one visit to the oncology center. This is the story of the Therac-25, a state-of-the-art radiation therapy device. Katie was its first victim that we know of.

Pete Houghton (00:59): And over the next 18 months, the Therac-25 would kill or seriously injure five more people. At the time, the hospital staff didn't know what had happened, and the full horror of Katie's injuries didn't come to light until weeks later, when the radiation damage to her shoulder became visible. While suffering in constant pain, she eventually lost the use of her whole arm. Katie Yarbrough required several skin grafts to fix the soft tissue damage caused by the machine's malfunction. As you can imagine, the incident worried and kind of puzzled the staff at the cancer center. They considered these machines safe and easy to use. Shortly after the incident, Tim Still, the hospital's medical physicist, called AECL Medical, the manufacturer of the machine, for some answers. Still asked if it was possible the machine had malfunctioned and incorrectly spread the beam of radiation. A few days later, Atomic Energy of Canada Limited (AECL) Medical called him back and said it was impossible.
Now this seems a little overconfident, especially as later, within a few weeks, it was clear that Yarbrough was suffering from radiation burns. But the manufacturer seemed unable to accept that the radiation-producing device that had been used on Katie Yarbrough had been the cause. Weirdly, this reminds me of a scene in Back to the Future, when Marty McFly is back in 1955 and needs some plutonium for his time-traveling DeLorean.

Marty McFly (02:23): All we need is a little plutonium.

Doc Brown (02:26): I'm sure that in 1985 plutonium is available in every corner drug store, but in 1955, it's a little hard to come by.

Pete Houghton (02:33): I mean, how exactly did they think this 61-year-old manicurist would come by severe radiation burns in Georgia in 1985, if not through this machine? Within weeks, in late July 1985, another overdose occurred at the Hamilton regional cancer center, near Toronto in Canada. This time a 40-year-old lady was undergoing treatment following cancer of the cervix. The operator tried five times to treat the patient, each time receiving a message indicating 'no dose' of radiation had been applied. This time, despite the apparently failed treatment, the patient described a burning sensation. Three days later, when the patient returned to hospital again, there were clear signs of radiation burns. The hospital took the Therac-25 out of service and brought in technicians from the manufacturer, AECL Medical, to investigate. Now, at this point, I was sort of understanding. I figured, you know, it's a new technology. We didn't really know the dangers of what might go wrong. But that's not entirely the case. Having looked into it, I know that radiation therapy itself was established practice and had been for decades. For example, here's a nurses' training video from 1945, 40 years before the accidents.

Narrator (03:49): The nurse technician who administers radiotherapy must have a precise knowledge of the equipment which she operates.
She should also have an understanding of the physiological effects and psychological aspects of x-ray treatments upon individual patients. These qualifications are essential since the technician must cooperate very closely with the doctor, nurses and others who are administering the medical and nursing care prescribed for each patient.

Pete Houghton (04:17): So the treatments themselves were not new, but one aspect of the machine was new: it had an all-software control system. The machine no longer used electromechanical interlocks to help control the device. All parts of the device were controlled mainly via computer. (For old tech geeks: it was a PDP-11 running a bespoke operating system. I'll come back to that later.) Those electromechanical interlocks sound impressive, but they are standard safety devices that you'll even see in your home appliances. For example, when you open your microwave oven mid-nuke, you're using an interlock. An interlock physically shuts down the cooking process to stop you getting cooked as well. You don't get irradiated because you forgot to click cancel or end on the fiddly little buttons. The interlock kicks in and stops the radiation. The Therac-25, our state-of-the-art machine, did away with these old-fashioned physical interlocks.

Pete Houghton (05:11): Instead, it used software to determine if it was safe to start treatment. Earlier devices had used physical interlocks to ensure safe use, so even if the user had mistakenly tried to switch on the radiation too soon, nothing would happen. Yes, as you've probably figured out, these new software interlocks failed to protect patients as well as they could have. Another point to note about this new Therac-25 is its size. It occupies a whole room, plus some space outside the room for the operator, who actually sits at the computer terminal, types in the prescription details and keeps an eye on the patients inside the room.
There is a movable table where the patient is placed, and around them is the arm of the device itself. It kind of looks like a giant KitchenAid food mixer. It can be rotated round into the right position and aligned with the patient.

Pete Houghton (05:58): The business end of the machine, sort of like the whisk attachment, is rotated into different positions, and each position determines what the machine is doing: producing electron radiation, producing x-rays, or just shining a simple light so the operator can align the beam before treatment. After the second accident in Hamilton, Canada, AECL Medical had sent engineers to take a look and see what was wrong, but the engineers never managed to recreate the problem. Nonetheless, AECL Medical suggested a couple of minor code updates that could handle some of the head positioning problems they guessed might have occurred. They thought that the hardware microswitches that detect the position of the head (the head that determines whether you're using x-rays or electrons, or whether you're aligning the beam) might fail, and so the new code would help handle those failures more gracefully. Then, in a statement that was surely tempting fate, they claimed:

AECL Medical Statement (acted by Pete Houghton) (06:52): Analysis of the hazard rate of the new solution indicates an improvement over the old system by at least five orders of magnitude.

Pete Houghton (06:59): They were in essence saying that the machine was now a hundred thousand times as safe. That's quite a statement considering they hadn't been able to recreate the bug in the first place and therefore had no evidence that they had fixed anything of consequence. Over the next 18 months the Therac-25 would kill or seriously injure four more patients that we know of. This includes a tragic case of a man down in Texas who received a massive overdose to the frontal lobes and died within a month from his injuries.
It wasn't until after the second incident at the same site down in Texas that the FDA declared the Therac-25 defective and demanded that AECL Medical come up with a plan to fix it, though the machines would remain in use and would kill another patient in early 1987. And it wasn't until the sixth incident, in 1987, that the FDA finally demanded that AECL Medical tell the hospitals to stop using the device altogether.

Pete Houghton (07:56): So what was going wrong? Was this a case of human error, hardware failure, or problems in the software? The problems here are manifold. There wasn't just one problem or one bug or one thing that caused this 18-month-long tale of death and injury. Let's take a look at two of the bugs that were found in the system.

Pete Houghton (08:14): The race condition bug. To tell the Therac-25 what type of treatment to give the patient, the operators would use a relatively easy-to-use application on the system's computer. It would seem pretty old-fashioned to us today. The first mainstream desktop that you might recognize, with a mouse, desktop files and windows, was released the same year as the Therac-25, in 1983. Apple fans will, of course, know that computer was the Apple Lisa. The Therac-25 didn't have one of these newfangled desktops and used a more traditional setup called a VT100, which really just consists of text dumped to the screen in columns. The medical physicists and operators would navigate around the screen using the arrow keys and enter the details of the patient's prescription.

Pete Houghton (09:06): For example, the energy level, the type of treatment (so x-rays or electron therapy), the duration of the treatment and other things. To choose x-ray treatment, the user would just enter an X into the appropriate field, and if they wanted to use electron therapy, they would just enter an E into the same field. Behind the scenes, the system would rotate the appropriate parts of the machine into place so the right type of treatment would be given. Unfortunately, during this setup, it wasn't detecting changes made by the operator. So several of the incidents appear to have been caused by the following steps. One, the operator would quickly type in the prescription. Two, they would notice that they had entered an X for x-ray when the patient needed an E for electron. That's an easy mistake to make because most patients were receiving the x-ray treatment.

Pete Houghton (09:59): Three, they would then correct the mistake, using the up arrow to go back up and enter an E for electron therapy. Four, the operator would then return to the bottom of the screen to command the machine to begin the treatment. Unfortunately, the Therac-25 stopped listening to new commands for eight seconds after the operator had entered the original X for x-ray. During this eight-second time window, the device was rotating the electromagnets out of the way and the x-ray targets into their correct positions for x-ray treatment. So when our quick-fingered operator updated the system to use the electron therapy mode, the machine ignored her new request. This left the machine in a sort of inconsistent state, half configured for x-ray mode and half configured for electron mode. What's worse, in normal operation x-ray mode automatically sets the system to use maximum power. Normally the patient is protected from most of the radiation by a sort of lens that absorbs a lot of the radiation as it diffuses the beam out into a wide area on the patient's body.

Pete Houghton (11:04): When the operator started the treatment, the patient was hit by the full unshielded power of the electron beam. And it gets worse. The confusing and misleading messages about no dose having been applied meant that the operator sometimes would repeat the process multiple times.
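
The sequence described above can be sketched in a few lines of Python. This is only a toy illustration, not AECL's code (the real software was written in assembly language): the class `ToyTherac`, its fields and the `now` timestamps are all hypothetical names invented for the example, and the eight-second window is taken from the description above. It shows an edit that reaches the operator's screen but never reaches the hardware setup.

```python
from dataclasses import dataclass
from typing import Optional

SETUP_WINDOW_SECONDS = 8.0  # from the transcript: edits were ignored for ~8 s


@dataclass
class ToyTherac:
    """Hypothetical toy model of the flaw; not the real Therac-25 software."""

    screen_mode: str = ""    # what the operator sees on the terminal
    hardware_mode: str = ""  # what the hardware is actually being set up for
    setup_started_at: Optional[float] = None

    def enter_mode(self, mode: str, now: float) -> None:
        # The screen always reflects the latest keystroke.
        self.screen_mode = mode
        busy = (
            self.setup_started_at is not None
            and now - self.setup_started_at < SETUP_WINDOW_SECONDS
        )
        if busy:
            # The flaw: while the hardware is still moving, the edit is
            # silently dropped instead of being queued or rejected.
            return
        self.hardware_mode = mode
        self.setup_started_at = now

    def is_consistent(self) -> bool:
        # A safety check the toy model could run before firing the beam.
        return self.screen_mode == self.hardware_mode


machine = ToyTherac()
machine.enter_mode("X", now=0.0)  # operator types X by mistake
machine.enter_mode("E", now=3.0)  # and corrects it inside the 8 s window
print(machine.screen_mode, machine.hardware_mode, machine.is_consistent())
# prints: E X False
```

A slow operator who makes the same correction after the window has passed leaves the toy machine in a consistent state, which may be part of why a bug like this is so hard to reproduce on demand.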
So that's the race condition bug. A quick and efficient operator who noticed a mistake in the prescription would quickly fix it. The Therac-25, unable to handle the changes, ends up delivering a massive overdose of radiation, sometimes a hundred times the required dose. The manufacturer, AECL Medical of Ottawa, Canada, repeatedly refused to accept that the machine had any faults, especially ones with such lethal consequences. And after the third incident, in Yakima, Washington state, they sent the following response.

AECL Medical (acted by Pete Houghton) (11:52): After careful consideration, we are of the opinion that this damage could not have been produced by any malfunction of the Therac-25 or by any operator error.

Pete Houghton (12:03): This is after the hospital staff had pointed out to the company that the red radiation burns on the patient matched the pattern of the open trays at the business end of the Therac-25. After the fifth incident, AECL Medical were informed by medical physicist Fritz Hager that he had managed to reproduce the error that might have resulted in the patients getting an overdose. So it's interesting here: a user figured out the bug and supplied AECL Medical with the details of how the machine could state no dose on the readout while massively overdosing the patient. The user clearly had a good handle on how to use the machine. You might imagine a team of AECL experts quickly looking through the code, developing safe and reliable code patches for this problem, developing hardware fixes and ensuring that no one else was placed in danger until a thorough review had been completed.

Pete Houghton (12:58): Yes, that's exactly what didn't happen.
Instead, AECL Medical issued an advisory to customers to remove the up arrow key from their keyboards and to cover the metal contacts under the key with electrical tape, just to make sure users couldn't press the up arrow and edit the prescription data, thereby avoiding the deadly race condition bug. Even the FDA thought more was needed and stated on the 2nd of May, 1986:

FDA (acted by Pete Houghton) (13:22): It does not satisfy the requirements for notification to purchasers of a defect in an electronic product. Specifically, it does not describe the defect nor the hazards associated with it. The letter does not provide any reasons for disabling the cursor key, and the tone is not commensurate with the urgency.

Pete Houghton (13:40): Furthermore, the manufacturer didn't stop to think that if the development and testing process had allowed this first bug into the system, maybe other bugs were present and a more thorough approach might be needed.

Pete Houghton (13:52): The second bug I'm going to talk about occurred in Yakima, Washington state, in early 1987. Incidentally, it's also thought to be the cause of the earlier incident in Hamilton, near Toronto, that I mentioned earlier. When the operator is getting ready to treat a patient, they often first enter the prescription into the computer, then go into the treatment room to finish aligning the head of the Therac-25 so it points correctly to the tumor or lymph node being treated. In this situation, the machine was in what they called field light mode, and the operator could make continual adjustments until the patient and the business end of the machine were perfectly aligned. During this process, the computer keeps track of the fact that the system isn't ready to use and that the right heads are not in place for treatment. It does this by just increasing a little counter in its memory.

Pete Houghton (14:40): So the computer is in a little loop going: have I been set up right yet? No. Okay, then add one to my not-ready counter. Is the not-ready counter zero? No. Okay, I won't fire the huge beam of deadly radiation just yet. So as long as that not-ready counter is above zero, the computer doesn't allow the beam to switch on. Which is good and safe. But of course, not in the Therac-25. The problem with increasing a counter on a computer is that eventually it will reach its maximum possible value and will either raise an error or return to zero. This is often described as a rollover. It's a bit like how a clock goes back to 00:00 after 23 hours and 59 minutes at the end of every day. The Therac-25 not-ready counter had a maximum value of 255. So while the technician is aligning the machine and the patient, this counter is increasing every time the computer notices that things haven't finished being set up yet, until of course the operator decides everything is in place and wants to proceed to the next step.

Pete Houghton (15:44): The operator then clicks SET, and the computer allows them to continue to the next stage of the setup. But of course there is a slim chance, one in 256 in fact, that our not-ready counter has just returned to zero. And when the Therac-25 saw it was zero, it essentially saw a green light and applied the radiation beam at full power, even though things weren't properly set up yet. And as the machine wasn't fully set up, there was no diffuser or collimator in place to reduce or shape the beam. The patient was hit with the full power of the beam before the operator had even completed the setup. Those were the two high-profile examples that were highlighted after the incidents, but they weren't all of the problems. Medical physicist Tim Still, the guy I mentioned earlier who worked in Marietta, Georgia, compiled a list of eight other worrying bugs he'd seen in the system.

Pete Houghton (16:39): So who or what caused these problems? As I mentioned earlier, there are many causes. We could point at the code, which clearly had a lot of bugs.
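
One of those bugs, the not-ready counter rollover described above, can be made concrete with a short Python sketch. It is an illustration only, not the original code: the function names are hypothetical, and the only details taken from the description are the maximum value of 255 and the rule that zero means "go". Python integers do not overflow, so the 8-bit wraparound is simulated with a mask.

```python
COUNTER_MAX = 255  # from the transcript: the counter's maximum value


def increment_not_ready(counter: int) -> int:
    """Add one, wrapping back to 0 after 255 like an 8-bit value."""
    return (counter + 1) & COUNTER_MAX


def beam_allowed(counter: int) -> bool:
    # The flaw: zero is read as "ready", so a wrapped counter looks safe.
    return counter == 0


def mark_not_ready_safely() -> int:
    # A safer alternative: assign a fixed non-zero value instead of counting.
    return 1


counter = 0
unsafe_passes = []
for loop_pass in range(1, 513):  # the machine is NOT ready on any pass
    counter = increment_not_ready(counter)
    if beam_allowed(counter):
        unsafe_passes.append(loop_pass)

print(unsafe_passes)  # prints: [256, 512]
print(beam_allowed(mark_not_ready_safely()))  # prints: False
```

On every 256th pass through the loop the counter reads zero even though nothing is ready, which is the one-in-256 window described above. Setting a flag to a fixed non-zero value removes the window entirely, because there is nothing left to roll over.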
There was also the choice of programming language, which was assembly language rather than a language that was easier for colleagues and auditors to review. But that point is almost moot, as there was no external code review. No one outside the company had audited the system before it was deployed into 11 hospitals across the United States and Canada. The software was all developed by one person. Another issue was the overreliance on software for safety instead of tried and tested hardware interlocks. Then there was the failure to fully investigate the early problems, followed by guessing at a fix, which we saw did not address the actual issue. They didn't step back and examine the system more deeply, even given there was evidence of harm. There were obviously serious development management issues at AECL Medical.

Pete Houghton (17:34): For example, while developing the software, the developer decided not to use a standard off-the-shelf operating system like Unix or one of the many others that were available. He instead wrote his own operating system. This is like developing a new car radio and then deciding that your radio was so special that you just had to design and build a whole new car to plug your radio into. You can just imagine how reliable that car would be. But the problem, I think, at the heart of all this is a denial by the manufacturer that there was a problem. They repeatedly denied the software could be at fault. Even after people were getting radiation burns, they assumed all sorts of other causes for the incidents, accusing one patient of getting burned by her electric blanket and another of getting electrocuted by faulty hospital wiring. Hospital staff thoroughly debunked both of those suggestions. Even AECL Medical's initial fix was based on a guess that the hardware microswitches were failing and the software needed to be amended to handle this.

Pete Houghton (18:30): This is when they had no sign that that had ever happened. So they thought their high-quality software only needed a slight modification to handle the imaginary hardware problem they thought had happened. I suspect the engineers involved assumed that they could build software just like physical machines. That was their background, after all. All the previous machines had been hardware controlled, with any computers acting purely as calculation machines for the operator. A good point noted by Nancy Leveson, an academic who did much of the research into the failures, is that in a 1983 hazard analysis report, AECL Medical stated:

AECL Medical (acted by Pete Houghton) (19:05): Two. Program software does not degrade due to wear, fatigue or reproduction process. Three. Computer execution errors are caused by faulty hardware components and by soft random errors induced by alpha particles and electromagnetic noise.

Pete Houghton (19:21): ... To that last point: while technically true, certain types of radiation can cause errors.

Pete Houghton (19:26): It's accepted that here on Earth, at least, it's much more likely that flaky or unreliable code is what's going to be your problem. In my opinion, AECL Medical looked at the problem of determining their system's safety the wrong way around. They assumed that they had used all the right nuts and bolts and therefore the machine as a whole would be fine. In fact, some of the code had been used before and no one had died on those systems. The difference was, of course, that those machines had hardware safety interlocks and didn't use the software in any fundamental way to control the machine. The manufacturer didn't assume the software was broken. They were optimistic, whether they were looking at the individual software components or the entirety of the system. They treated them as cogs or nuts and bolts, the assumption being that once you plugged them together right,

Pete Houghton (20:10): they'd just work, just like real cogs or like Lego. Software isn't like that. It's more like an arcane set of rules for an old board game you find in the attic. You might get the gist of what's going on by glancing at the rules, but you won't really know until you play it. Even then you soon realize that you're just not playing it right, and the game is probably a bit rubbish, and you understand why it was placed in the attic in the first place. A better approach is to assume failure: assume that there are serious flaws in the software and that you just need to find them. It's a sort of Pascal's wager. Blaise Pascal was a 17th-century mathematician who claimed it was more rational to believe in the existence of God than not. Pascal's wager went something like this. If you were to believe in God and live accordingly, you'd go to heaven.

Pete Houghton (20:54): But if you didn't believe in God, you wouldn't go to heaven, even if God existed. And conversely, if God didn't exist, you had nothing to lose anyway from believing in God. He pointed out that the rational choice is to err on the side of believing in God. So when it comes to reviewing, investigating, or testing software, we're trying to find out if the bugs exist. When we take a shallow, unthinking look at the app and see that it's all good and there's nothing to worry about, then we might be right. The bugs might not exist. But the more rational choice is to believe that there is a bug and spend your time diligently searching for this truth, because the significance of finding one bug far outweighs the significance of many attempts that find no bugs. Finding that one bug that deletes or corrupts your customers' data is far more important than five glances that showed you how marvelous the software was.

Pete Houghton (21:40): It's those bugs that will make people decide not to use your app, your website, or your data model. The next step would be to see the results of your testing, see the bugs, and then take action, not just to fix those bugs but to fix the underlying causes. For the Therac-25, this might include ensuring that the system was fail-safe, having code reviews and using safety-orientated programming techniques that weren't so prone to dangerous failures. But like I've said before, the first step is knowing you've got a problem, and that's where software testing can help. Thanks. That's all for this episode. I might return to this series of incidents in a future podcast, as there's just so much that went wrong here. I'd also like to thank Professor Nancy Leveson, who wrote the initial report that many of the later articles on the Therac-25 failures are based on. It's an excellent insight into what went wrong 35 years ago. I'll put a link to her report in the show notes. Thank you. You've been listening to me, Peter Houghton, and this was Investigating Software.