What do you do when the excrement starts flying?
I write this in California on Monday, Mar. 18, 2019. Eight days earlier, a Boeing 737 MAX 8, operating as Ethiopian Airlines Flight 302, crashed six minutes after takeoff in clear air on what should have been a routine flight from Addis Ababa to Nairobi. The route is Africa’s equivalent of a flight from San Francisco to Los Angeles. A milk run, as pilots say. Some 157 innocent souls didn’t make it.
This follows a crash five months earlier in Indonesia, under similar circumstances with no apparent weather-related impediments, of the exact same aircraft type shortly after takeoff, with similar loss of life.
It is irresponsibly premature to draw sweeping conclusions from this most recent crash, as the investigation into both disasters continues. Much remains to be learned, and early theories of causation could be proven wrong. Eerie similarities between the two incidents have emerged from the evidence reviewed thus far, however, and we already know a few things. Those few things prompt anxious questions now.
Like: It appears the anti-stall software in the 737 MAX, called MCAS (Maneuvering Characteristics Augmentation System), which controls horizontal stabilizer deflection in response to the airplane’s angle-of-attack (AOA) sensing vanes, repeatedly overrode the pilots’ nose-up inputs from the control yoke. Each airplane pitched sharply downward in a rapid series of oscillations, which eventually overwhelmed the pilots, who lost control of the aircraft with fatal results.
Like: Boeing inexplicably did not see fit to emphasize pilot awareness training about the MCAS system, which was new to this next-generation model of 737. The new software was designed to counteract a tendency of the MAX’s more powerful, more fuel-efficient engines to pitch the nose upward, especially at slow speed. Somehow Boeing chose not to dwell on the altered flight characteristics resulting from new engines, slung in new positions beneath and forward of the wing, instead electing to rely on pilots’ ability to look up answers to any questions they may have in the aircraft operating manual, while hurtling along at 500 miles per hour. Cockpit voice recordings from Ethiopian 302 show that in their final confused moments the pilots were frantically attempting to troubleshoot the problem of pitch oscillations by consulting the manual. They did not find the solution to their problem.
Like: The pilots of both flights lacked either the skills or the training to override MCAS-induced control inputs. In plain English, they didn’t know how to flip the switch to turn it off and fly their planes manually. Experienced pilots know how to do this. Further, such an override option was at hand, and with redundancy. The 737 MAX contains three such switches: one on each pilot’s control wheel, and one on the center console between them.
Like: How is it these – on the face of it – insufficiently briefed pilots were allowed to fly these brand-new aircraft, with aerodynamic behavior not seen in older 737 models, containing new and unfamiliar software, without the necessary training in its use, including how, in certain situations, it can be disabled, in just such an emergency as they evidently encountered in their final moments aloft? What flight proficiency checklists, or the lack thereof, enabled this to happen or passed over this scenario as either improbable or insignificant and therefore not worthy of attention?
Like: What decision-making process at Boeing contributed to this bewildering, counterintuitive state of affairs? This from a company that demands Nadcap quality management system certification from its key suppliers. Hmm. This calls to mind the phrase from scripture: Physician, heal thyself.
Fair questions. Questions posed by hindsight usually are. Their answers will be revealed in the coming days and weeks. Tragically too late for 346 people.
What does all this have to do with test engineering?
Upon reflection, quite a lot. As with commercial aviation today, so with test software: We have an inordinate overreliance on automation, to a degree that is disturbing. When it works, it works handsomely. It’s also fast. Debug happens quickly. Point and click. The same mentality that promotes autonomous vehicles, machine learning, and data-driven manufacturing (Industry 4.0) naturally gravitates to the ease of use.
However, what happens when events conspire to deviate from the script? In the moment, does the practitioner know how to identify the problem and respond properly and decisively?
In technical terms, what do you do when the excrement hits the ventilating device?
It’s one thing to obtain the test failure printout. It is a wholly different one to interpret what it says and decide whether it squares with reality. (And take action from that decision.)
For example, purveyors of flying probe testers brag about their machines’ self-learning programs, emphasizing the ease with which one can generate a program in the morning and be happily testing boards in the afternoon. Here comes revenue, minimal skill level required. Simply know the proper commands and push the right buttons. No need to hire an expensive engineer when a low-level technician will suffice.
What they neglect to point out is the self-learned program is itself learned from a board that may be good or bad; the program is only as good as that board. That’s the risk: the “program” is not debugged in any traditional, objective, parametric sense. Those who know cross their fingers and hope to avoid any test escapes.
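To make the risk concrete, here is a minimal sketch, not any vendor's actual learning algorithm. It assumes a "learned" program simply sets a pass window of plus or minus 5% around whatever the reference board measured. The component names, values, tolerance and helper functions are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """A pass window for one measurement."""
    low: float
    high: float

    def passes(self, value: float) -> bool:
        return self.low <= value <= self.high


def limits_from(values: dict[str, float], tol: float = 0.05) -> dict[str, Limit]:
    """Build a +/- tol window around each value (hypothetical learning rule)."""
    return {ref: Limit(v * (1 - tol), v * (1 + tol)) for ref, v in values.items()}


def failures(board: dict[str, float], limits: dict[str, Limit]) -> list[str]:
    """Return the reference designators that fall outside their windows."""
    return sorted(ref for ref, v in board.items() if not limits[ref].passes(v))


# Hypothetical data, in ohms. The reference board carries the wrong part at R2.
nominal = {"R1": 1_000.0, "R2": 4_700.0}        # values from the design
reference_board = {"R1": 1_004.0, "R2": 47_000.0}
good_board = {"R1": 998.0, "R2": 4_690.0}
bad_board = {"R1": 1_001.0, "R2": 46_800.0}     # same wrong part as the reference

learned = limits_from(reference_board)   # program "learned" from the board
designed = limits_from(nominal)          # program built from design values

print(failures(bad_board, learned))    # bad board passes: a test escape
print(failures(good_board, learned))   # good board is flagged at R2
print(failures(bad_board, designed))   # design-based limits catch R2
```

Under these assumptions, the learned program passes a board built with the same wrong part as the reference and rejects a correct one. Limits taken from the design data catch the error.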
What do you do when you get that call about that test escape? How do you explain it? Oops? Call the manufacturer?
Further as to flying probe, what do you do when a customer asks you, upfront, what the test coverage will be on a programming project you’ve only just quoted? Do you lie and tell the customer what you think they want to hear, namely 100%, hoping secretly that whatever locations on the board lack probing access can be viewed for presence/absence with a camera? (Dirty little secret: most flying probes have cameras.) Or do you tell the truth: that you won’t know until you do the work and debug the program?
Everybody always wants free engineering. Especially the well-documented kind that the shrewd customer can take down the street, cut and paste into a specification/RFQ, and secure a lesser price. (For further proof, see below about functional testing.)
When automation fails or provides questionable data, as with piloting, do we have the fundamental skills to overcome the problem, overrule the automation or even disable it and safely “fly around” it manually? Pilots call these manual skills “stick-and-rudder” skills, or “pilotage.” Managing the situation to a satisfactory and safe conclusion. Test engineers call their version of these same skills “tribal knowledge” or, simply, “experience.” The fruit of theory + training + time + variety + more time, seasoned by skepticism as well as common sense. Throw in orneriness too for good measure. Long live old guys.
We live in the millennials’ playground, the video game era. Authority has been conferred by on-screen content. The blue screen mesmerizes. We are reduced to being passive observers. Absent is knowledge of why what appears on the screen is valid content in the first place.
So, as test engineers, it falls to us to do a lot of explaining. First, to educate our customers. Second, to protect ourselves from undue liability. Third, and perhaps most important, to inoculate customers from the effects of unwise judgements based on incomplete, imperfect, ignorant or downright erroneous information. Otherwise we get blamed, regardless of who made the final call to proceed. Protecting customers from themselves, a large portion of our workday, is the right thing to do. It’s also good business practice, as well as an excellent survival skill.
For example, a customer may approach us with what they believe should be an in-circuit testing project. They would like us to develop a 3070 program and fixture for a certain board. Our engineer analyzes the CAD for mechanical access and finds only 30% of the nets on the board are accessible with probes, whether actual or virtual (vectorless). We counsel the customer away from ICT, recommending flying probe as a faster, cheaper alternative. (Flying probe testing, for one, doesn’t require dedicated test points, while ICT does.) In those cases where the customer accepts our advice (not always the case!), they are usually happier with the lower flying probe price. It took our guidance (experience) to pull this customer away from the script (an RFQ specifying ICT), for their own good.
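The arithmetic behind that recommendation can be sketched in a few lines. This is an illustration only: the 80% cutoff, the net names and the function names are assumptions made for the example, not an industry rule or our actual quoting procedure.

```python
def access_fraction(nets: dict[str, bool]) -> float:
    """Fraction of nets that a probe (actual or vectorless) can reach."""
    if not nets:
        raise ValueError("no nets to analyze")
    return sum(nets.values()) / len(nets)


def suggest_method(nets: dict[str, bool], ict_threshold: float = 0.8) -> str:
    """Suggest ICT only when probe access meets a (hypothetical) threshold."""
    return "ICT" if access_fraction(nets) >= ict_threshold else "flying probe"


# Hypothetical board: 10 nets, only 3 reachable by fixture probes.
board_nets = {f"NET{i}": i < 3 for i in range(10)}

print(f"{access_fraction(board_nets):.0%} access -> {suggest_method(board_nets)}")
```

The real decision also weighs volume, fixture cost and turnaround, none of which this sketch models.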
Or consider functional testing, that great black hole of engineering effort. Requirements tend to come in two extremes: the first wants every conceivable test, sparing no expense, with 30 pages of detail specifying exactly what everything means; the second specifies nothing except the requirement that the boards be functionally tested. (“You do have a lab to create these tests, don’t you?”) The first scenario falls flat and goes nowhere when the test engineer takes the spec literally and quotes it accordingly. This results in cardiac arrest upon the customer’s receipt of the quote. (“My budget is for $5,000, and your quote is for $50,000.”) We listen for distant thuds minutes after such quotes are delivered. The second scenario blows up when the recipient tries to coax a process from the practitioner (that would be us), free of charge. Smart practitioners don’t bite, or they demand an upfront engineering charge, so the customer has some skin in the game. This demand usually has a deterrent effect, and results in an immediate and abrupt loss of interest and change of emphasis on the customer’s end. (“Well, we just found a junior engineer who is between projects. S/he can build this in a few weeks. Thanks for taking the time to review our design.”) From such data are epiphanies born.
Going off-script can be an uncomfortable place for some. (“I found myself in a dark wood….”) But then again, some of us live there.
The author is president of Datest Corp. (datest.com); firstname.lastname@example.org. His column runs bimonthly.