HW5: Reflections

From Chapter 13 (Security Engineering) we know that security is closely related to the other dependability attributes of reliability, availability, safety, and resilience.  In the case of security and safety, we also run up against the problem that we can never prove a negative: we can never prove that a system is safe or secure, because these requirements often take the form of negative, "shall not" characteristics.  For example, how do we prove that a radiation treatment machine will never harm a human being, or how do we prove that a car's software will never be hacked?

The Therac-25 was one such radiation treatment machine, controlled by software written in PDP-11 assembly language, just like the Therac-6 and Therac-20 before it.  However, unlike those machines, it was designed from the start to take advantage of computer control; the Therac-6 and Therac-20 were built around machines that already had histories of clinical use without computer control.  Further reading reveals that the Therac-25 software had more responsibility for maintaining safety than in the previous machines.  The manufacturer decided not to duplicate all of the existing hardware safety mechanisms and interlocks in software, judging that it wasn't worth the expense, and placed too much faith in the software.  For a machine designed to deliver radiation treatment, it was remarkably unreliable: according to one radiation therapist, it would sometimes have 40 dose-rate malfunctions in a single day.  Some of the error messages were cryptic, consisting merely of the word "Malfunction" followed by a number from 1 to 64, and the operator's manual didn't even explain or address these malfunction codes.
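To make the error-reporting problem concrete, here is a minimal, hypothetical sketch in Python (the real Therac-25 software was PDP-11 assembly, and the message text below is invented for illustration) contrasting a bare malfunction number with a message an operator can actually act on:

```python
# Hypothetical sketch, not actual Therac-25 code: contrasting cryptic
# and descriptive error reporting. The lookup table and message text
# below are invented for illustration.
MALFUNCTION_DETAILS = {
    54: "Delivered dose disagrees with the prescribed dose. "
        "Stop treatment and verify the beam configuration.",
}

def report_cryptic(code: int) -> str:
    # The Therac-25 style: a bare number, unexplained in the manual.
    return f"MALFUNCTION {code}"

def report_descriptive(code: int) -> str:
    # A safety-conscious alternative: name the fault and the required
    # operator response, with a safe default for unknown codes.
    detail = MALFUNCTION_DETAILS.get(code, "Unknown fault. Stop treatment.")
    return f"MALFUNCTION {code}: {detail}"

print(report_cryptic(54))      # MALFUNCTION 54
print(report_descriptive(54))  # MALFUNCTION 54: Delivered dose disagrees...
```

A bare code shifts the whole burden of interpretation onto the operator, and when the manual doesn't document the codes either, frequent malfunctions simply train operators to dismiss them.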

If we are to create reliable, available, and safe software-intensive systems, security has to be maintained not as an add-on or afterthought but throughout all stages of the system development life cycle.  Chapter 14 introduces the concept of resilience engineering, which makes two important assumptions: 1) it is impossible to avoid system failures, so we should focus on limiting their costs and recovering from them; 2) more emphasis should be placed on external events such as operator error, since good reliability engineering practice already reduces technical faults in a system.  That second assumption seems to describe the downfall of the Therac-25, where routine operator actions exposed latent software faults.

A lesson learned after two patients died from massive radiation overdoses at the East Texas Cancer Center in 1986, only months apart, was that focusing on particular software bugs was not the way to make this system safe.  Because of shared memory and extensive software design flaws, the failures were triggered by routine manual operator editing of the treatment data: when an operator corrected an entry quickly, a concurrent task could miss the change and proceed with a stale, inconsistent configuration (a sketch of this kind of race appears below).  And if the hardware safety mechanisms had not been removed in the first place, these lethal radiation overdoses could have been prevented.

The four related resilience activities are the four R's: recognition, resistance, recovery, and reinstatement.  Of course, adding resilience to a system increases its cost, and the benefits of that spending are difficult to calculate, since the cost of a failure or attack is only known after the event.  But it is very clear that these costs can be enormous, as we are seeing with Boeing, whose fleet of 737 MAX aircraft remains grounded more than a year after the first crash, at a cost of billions of dollars.
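Returning to the race condition mentioned above: here is a minimal, hypothetical sketch in Python (the actual Therac-25 software was PDP-11 assembly; the names and timing here are invented) of how a quick operator edit can be lost when two concurrent tasks share state without synchronization:

```python
# Minimal, hypothetical sketch of a check-then-act race over shared
# memory. Not the actual Therac-25 code; purely illustrative.
import threading
import time

# Shared treatment state, accessed by both tasks with no locking.
shared = {"mode": "xray", "power": "high"}

def setup_and_fire():
    mode = shared["mode"]        # read the mode once (the "check")
    time.sleep(0.5)              # simulate slow hardware positioning
    # Fire using the stale copy of mode; any edit made during the
    # sleep is silently ignored (the "act").
    print(f"firing: mode={mode}, power={shared['power']}")

def operator_edit():
    time.sleep(0.1)              # the operator quickly corrects the entry
    shared["mode"] = "electron"  # this edit lands while setup is busy
    shared["power"] = "low"

setup = threading.Thread(target=setup_and_fire)
edit = threading.Thread(target=operator_edit)
setup.start()
edit.start()
setup.join()
edit.join()
# Prints "firing: mode=xray, power=low": a mixed, inconsistent
# configuration, analogous to firing the beam with the hardware in
# the wrong state for the selected mode.
```

The details of the real accidents were different, but the structural lesson is the same: synchronize access to shared state (or re-validate it immediately before any hazardous action), and keep independent hardware interlocks as a last line of defense.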
