The Importance of Knowing How to Fail
The term Predictive Failure Analysis™ (PFA) originated as a proprietary IBM disk technology, introduced in 1992, for monitoring hard disk drives for likely failures. It later evolved into a range of technologies for anticipating failures in CPUs and input/output devices.
IBM’s PFA was later merged with Compaq's IntelliSafe technology to become part of the reliability-prediction capability of "Self-Monitoring, Analysis and Reporting Technology" (S.M.A.R.T.).
However, the principles governing PFA were derived from a U.S. military method dating from the 1940s, known as "Failure Modes and Effects Analysis" (FMEA). It is an approach focused mainly on analyzing failures that may take any of the following forms:
- Possible defects in the design of a product;
- Shortcomings of a service rendered; or
- Flaws in a manufacturing process.
Notice the image above, which shows a formula used in "Vibration Analysis for Electronic Equipment." Its purpose is to predict the time to failure of solder joints exposed to vibration.
The idea is to think of the proposed design or process as something imperfect and therefore bound to fail in one way or another. The aim of the analysis is to anticipate every element that could trigger a failure mode before the product reaches the end user. That way, the design can approach zero defects or maximum efficiency.
The process has proven effective because the mitigating actions formulated, or the fail-safe mechanisms installed, as a result of it greatly improve the prevention or curtailment of likely defects or disasters.
Predictive failure analysis is the exact opposite of the predictive analysis method, since PFA approaches the problem from an inverse point of view. Risk prevention strategies are applied at the stages where the causal factors are likely to arise, instead of devising plans after the failure has occurred.
How to Perform a PFA or FMEA:
In performing a failure mode analysis, the brainstorming process should involve individuals with diverse knowledge of the process or product being analyzed.
(1) Pose the question "how can we make the process go wrong?" or "how can we make the product fail?" instead of the predictive query "what could possibly go wrong?"
(2) Create a diagram that maps out the flow of the processes or one that depicts the procedures involved in the use of the product.
(3) Identify the functions at each stage, the materials needed, the human actions required and the expected outcome in its most ideal state.
(4) Analyze each stage or process step and come up with ways to deliberately prevent the system from working. Some examples include:
- A blown fuse,
- Water doused over, or seeping into, the product,
- Overloading beyond rated capacity,
- Overuse, such as running for days without let-up,
- Use of an incompatible peripheral or component,
- Malware or a virus introduced into the system,
- Other similar actions or occurrences, like power fluctuations, shutdowns or failures, hurricanes, earthquakes or even terrorist attacks.
(5) Evaluate all the controls in place to determine their adequacy as barriers to the trigger factors. Devise ways to disable their functions, e.g.:
- The firewall crumbles from ground-shaking intensity;
- The wiring of the fire alarm, including its sprinkler system, is chewed on by rodents;
- A great tsunami follows a magnitude nine earthquake;
- A hurricane blows off the roof of the building;
- The area is badly flooded;
- The inspection reports are tampered with or falsified;
- There is connivance between workers.
(6) Analyze the effects of each failure or flaw and how they will affect the user-consumer or the user-worker and the business as a whole.
(7) For each effect, establish a rating system, usually on a scale of one to ten, in which the rating depends on the significance of the effect. Determining significance takes into consideration certain factors, like:
- The number of users affected
- The damage to business reputation
- The harmful effects to public health and the environment
- The possible governmental sanctions or penalties that will be imposed
- The partial or total breakdown of the entire system
- The cost of repairs that would be incurred, whether minimal or substantial
- Other similar conditions or factors that denote severity or harshness of the resulting damages or impairment.
(8) Recognize that the time of occurrence is likewise critical, both when devising ways to instigate the causal factors and when developing preventive methods. The time of day and the conditions that come with it, such as darkness, heavy vehicular or Internet traffic, rush hours, holidays, busy or deserted streets, school hours and lunch breaks, can determine whether an adverse condition takes hold or a preventive action works.
(9) Formulate the most effective courses of action to prevent the effects previously established. The aim is a risk management plan that is proactive rather than reactive, one that seeks to quash or prevent every factor that would trigger human error or system failure.
(10) Test the entire risk prevention plan by creating simulations of the predicted failures or adverse conditions, with special attention to those that were rated as having the most critical or detrimental effects.
(11) Prioritize the mitigating courses of action and/or determine the importance of the fail-safe mechanisms to be installed. In some cases, it may also be important to strengthen the defense-in-depth safety controls or the barriers against the trigger factors.
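The worksheet that steps (3) through (11) build up can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed FMEA implementation: the `FailureMode` record, the factor names in `FACTORS`, the rule that the overall rating is the worst single factor score, and every number in the sample worksheet are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical factor names, loosely following the bullet list in step (7).
FACTORS = (
    "users_affected",
    "reputation_damage",
    "health_environment_harm",
    "sanctions",
    "system_breakdown",
    "repair_cost",
)


@dataclass
class FailureMode:
    step: str      # stage taken from the process diagram, steps (2)-(3)
    trigger: str   # deliberate way of making the stage fail, step (4)
    effect: str    # consequence for users and the business, step (6)
    factor_scores: dict = field(default_factory=dict)  # factor -> 1..10

    def severity(self) -> int:
        """Rate the effect from 1 to 10 as the worst factor score.

        Taking the maximum is an assumption made for this sketch; a team
        could just as well agree on a weighted average.
        """
        for name, score in self.factor_scores.items():
            if name not in FACTORS:
                raise ValueError(f"unknown factor: {name}")
            if not 1 <= score <= 10:
                raise ValueError(f"{name} must be between 1 and 10")
        return max(self.factor_scores.values(), default=1)


def prioritize(modes):
    """Order failure modes from most to least severe, as in step (11)."""
    return sorted(modes, key=lambda m: m.severity(), reverse=True)


# Illustrative entries only; the scores are invented for the example.
worksheet = [
    FailureMode("power supply", "blown fuse", "device shuts down",
                {"users_affected": 4, "repair_cost": 2}),
    FailureMode("server room", "flooding", "total system breakdown",
                {"system_breakdown": 10, "repair_cost": 8}),
    FailureMode("network", "malware introduced", "data exposed",
                {"reputation_damage": 9, "sanctions": 7}),
]

for mode in prioritize(worksheet):
    print(mode.severity(), mode.step, "-", mode.trigger)
```

The modes at the top of the sorted list are the ones that would get the most attention in the simulations of step (10) and the first claim on mitigation effort.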
In performing a predictive failure analysis, participants should always keep in mind that the ultimate goal is zero defects, or the total prevention of adverse incidents, achieved through proactive courses of action.
Reference Materials and Image Credit Section:
- Get S.M.A.R.T. for Reliability — https://webcache.googleusercontent.com/search?q=cache:IWp4JK9DG00J:www.seagate.com/docs/pdf/whitepaper/enhanced_smart.pdf+Intellisafe+and+predictive+failure+analysis&hl=en
- Failure Modes and Effects Analysis (FMEA) — https://asq.org/learn-about-quality/process-analysis-tools/overview/fmea.html