Become An Equipment Reliability Detective: Preserve Failure Data

Updated: January 5, 2022

It’s amazing how quickly failure data (evidence) can disappear after a ‘failure’ has occurred. An unexpected incident is often followed by a lot of confusion. Most sporadic/acute incidents occur on off-shifts or on weekends, which can add to the confusion. (Statistically, a continuous manufacturing operation has more night and weekend hours than normal single-shift operating hours.) Nobody is sure what to do or who to call to preserve failure data. All anyone knows is, “We better get this process or piece of equipment back online.” This ‘production first’ paradigm drives the scene.

Oftentimes in the chaotic activities that follow, a wealth of failure data can be destroyed, altered or discarded. We see the following all the time:

- Failed parts are marred, thrown away or taken to a shop and forgotten.
- Lubricants and other fluids are mopped up.
- Valve and instrument positions are changed in preparation for startup.
- Product samples are not taken to test for contaminants.
- Shifts change, and operation and maintenance personnel are replaced with people who were not present at the time of failure.

Along with all this data go the chances of uncovering the true root causes of the incident.

I often ask our RCA students, “Would you expect a homicide detective to be able to solve a murder without any clues?” The response is typically, “Of course not.” Then I ask them, “How can you expect to uncover the true root causes of an incident without any failure data?” The parallel between a detective and a failure analyst is an effective analogy. Consider that the investigating detective’s sole responsibility at the scene of a homicide is to “FREEZE” the scene and collect as much data as possible for later analysis. How do they “FREEZE” the scene? They:

- Take photographs of everything before anything is disturbed.
- Bag and tag any items which may yield information (murder weapon, hair, bodily fluids, clothing, fibers, etc.).
- Dust for fingerprints.
- Interview people (witnesses, neighbors, friends, colleagues, etc.).
- Map the position of the body and the evidence relative to their surroundings.
- Review available and relevant paper documents.

As failure investigators, we can learn a lot from homicide detectives. When a significant failure occurs, we should treat it as if a homicide has occurred and develop the appropriate strategies to “PRESERVE THE FAILURE DATA.” If failure investigators are to be successful, they must collect data from each of the 5 P’s. These are simply memory joggers that stand for:

- Parts
- Position
- Paper
- People
- Paradigms

Failure data from each of these categories must be collected to ensure a successful Root Cause Analysis. Let’s briefly review each classification for some common examples.

PARTS

Any failed components and/or other tangibles such as:
- bearings
- seals
- shafts
- valves
- gears
- fans
- fasteners
- nozzles
- lubricants (fluids in general)
- chemicals from spills (product samples)

POSITION

Position-related information relative to time and space:
- Where were the failed parts at the time of failure (also note the conditions at the failure scene)?
- Were valves open or closed?
- What were the instrument settings at the time? Does the physical reading match the setting (a mismatch may point to calibration-related issues)?
- What was the time of day of the failure, and which shift was on duty?
- Where were relevant employees at the time, as opposed to where they were supposed to be?
- What is the frequency of occurrence per year (are chronic failures cyclical/seasonal in nature)?
- Was any unexpected work being conducted at the time that changed the planned work (non-routine work)?

PAPER

Typically static data, but anything paper related.
- Operating conditions prior to, during, and after the incident (temperatures, pressures, levels, etc.)
- vibration monitoring results
- equipment histories
- standard operating procedures
- manufacturing procedures
- equipment specifications (OEM)
- training histories
- HR records citing employee qualifications and performance histories
- P&ID’s (‘as built’ as well as updates with modifications)
- Safety procedures (corporate, site and regulatory)

PEOPLE

Typically, these are the people you want to interview because they have information relevant to the failure (who we talk to will obviously depend on the nature of the failure):
- Direct observation witnesses
- Mechanics
- Operators
- Engineers
- Safety personnel
- Quality personnel
- Reliability personnel
- Stores personnel
- Purchasing personnel
- HR personnel
- Finance personnel
- Supervisory personnel
- Leadership personnel
- OEM/Vendors/Suppliers
- Personnel at other sites who have similar operations (to see whether they have experienced failures of a similar nature).

We interview these people to determine what they observed, sensed and concluded about the failure. Interviewing techniques and questioning strategies are critical to such an investigation.

PARADIGMS

Paradigms are derived from our interviews. All individuals will have their respective mindsets based on their own education, training and experience. When a population of people share the same mindset, then it becomes a collective paradigm which contributes to the organizational culture. Here we are seeking:
- What are the cultural norms of the organization?
  - Example: “They say safety is number one, but we all know production rules!”
- What do people accept as a way of conducting ‘normal’ operations?
  - Example: “When I am time-pressured, I am apt to take short-cuts and skip steps.”
- What repetitive remarks were made during the interview that indicate beliefs, values or deep-seated convictions?
  - Example: “The equipment failed because it was old.” Note: In reality, it failed because it was operated beyond its design capacity, not because it was old.

DATA FRAGILITY

Data needs to be collected from each of the 5 P’s as quickly as possible following the failure. Obviously, the lead investigator cannot be at the manufacturing facility 24/7; therefore, provisions have to be made to train several people on each shift to be failure data collectors. These people should function much like the fire brigade at a plant. Each member is assigned a certain task and, when called into action, should perform that task until the failure investigators can arrive and direct the effort in more detail. A pre-determined staging area should also be designated to preserve the evidence collected until the investigators can get there.

Priority 1: Position and People

Position: Getting to the failure data before it becomes corrupted is a key to effective RCA. In addition, there is a definite pecking order for collecting the 5 P’s because some data is more fragile than others. The most fragile data (data that becomes disturbed and tainted the fastest) are Position and People data. The fact that Position data is fragile makes sense to most people.

In order to “FREEZE” the failure scene, we often suggest to students that the failure response team be equipped with brightly colored boundary tape so that they can “rope off” the area, and with cameras to make a photographic account of the scene. (Note: Use electronic equipment of any kind ONLY AFTER the area is cleared of flammable conditions.) This photographic data can be invaluable as the failure investigator tries to understand the causes of the failure.

People: The fact that People data is extremely fragile often surprises would-be investigators. Because People data is not perceived as fragile, valuable failure data is lost. The problem is that as time passes following an incident, the raw sensory data that was taken in by people who were at or around the failure scene starts to become distorted. People start to evaluate what they heard, saw, smelled or felt and draw conclusions based upon this input. If something they sensed doesn’t fit their mental models of what the scene should contain, they may discount it and only inform the failure investigators of their conclusions about what happened, as opposed to providing them with the raw data.

It is imperative that the people who were at the scene be debriefed prior to their leaving the facility. At the very least, they should fill out a generic failure data collection sheet documenting what they sensed at the time of failure and anything unusual that was being done at the time of the incident. Preferably, each person should spend 15-20 minutes being debriefed by a failure investigator. This provides the failure investigator with much more meaningful data because they gather it firsthand.
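A generic failure data collection sheet can be as simple as a paper form. As a minimal sketch of the idea, assuming a site chooses to capture it electronically, the Python record below keeps raw sensory data separate from the witness’s conclusions. The class name `WitnessSheet` and all of its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class WitnessSheet:
    """Hypothetical debrief sheet; every field name here is illustrative."""
    witness: str
    role: str
    recorded_at: datetime
    # Raw sensory data keyed by sense: what was seen, heard, smelled or felt.
    observations: dict[str, list[str]] = field(default_factory=dict)
    # Anything unusual that was being done at the time of the incident.
    unusual_activity: list[str] = field(default_factory=list)
    # The witness's own interpretation, stored apart from the raw data.
    conclusions: list[str] = field(default_factory=list)

    def observe(self, sense: str, note: str) -> None:
        """Record one raw observation under the sense that produced it."""
        self.observations.setdefault(sense, []).append(note)

    def has_raw_data(self) -> bool:
        """True once at least one raw observation has been recorded."""
        return any(self.observations.values())
```

Keeping `conclusions` in its own field lets an investigator see at a glance when a sheet contains only interpretation and no raw data, which is a cue to debrief that person in more depth before they leave the facility.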

Priority 2: Parts

Parts: Following Position and People, Parts should be bagged and tagged and taken to a secured staging area for later analysis. Here we have to ensure that parts critical to the investigation do not grow legs and run away, and that they are not thrown in the trash because of production pressure to hurry and get back online. The ‘Positional’ step was to quickly note where the parts were located on a failure scene map. That step is very time-sensitive because people want to clean up the scene quickly for production. Once a part’s location is noted on the failure scene map, we still have to keep track of where the part goes.
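To illustrate the bag-and-tag idea, here is a minimal Python sketch of a tag record that stores where a part was found on the failure scene map and who has held it since. The `PartTag` class, its fields, and the sample values are assumptions made for illustration, not part of any specific RCA tool.

```python
from dataclasses import dataclass, field


@dataclass
class PartTag:
    """Hypothetical tag for one bagged part; names are illustrative."""
    tag_id: str
    description: str
    scene_location: str  # where the part was found, as marked on the scene map
    custody: list[str] = field(default_factory=list)  # holders, oldest first

    def move_to(self, holder: str) -> None:
        """Append the next holder (a person or a storage location)."""
        self.custody.append(holder)

    def current_holder(self) -> str:
        """Return the latest holder, or 'at scene' if the part has not moved."""
        return self.custody[-1] if self.custody else "at scene"


# Illustrative usage with made-up values.
tag = PartTag("T-001", "failed bearing", "grid B4 on scene map")
tag.move_to("secured staging area")
```

Because every move is appended rather than overwritten, the record shows the full path a part has taken, which makes it harder for a critical part to go missing unnoticed.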

Priority 3: Paper

Paper: Paper data, such as shift logs and sensitive DCS data, should be gathered and stored for later review. This includes electronic data, which might be manipulated, altered or lost. Such information is usually static, so we have some time to get it after the time-sensitive data described above is properly collected and preserved.

Priority 4: Paradigms

Paradigms: Finally, Paradigm data is the least fragile of the 5 P’s. Paradigms are deeply ingrained within the organization and are revealed as people are interviewed. RCA investigators should always be on the lookout for restraining paradigms that may have contributed to the failure. These restraining paradigms are considered latent root causes. We wish that paradigms would change faster than they do, but because they do not, we will have plenty of time to capture people’s paradigms (their perceptions of their world).
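The fragility ordering above can be sketched in a few lines of Python. The ranks mirror the four priorities in this article (Position and People first, then Parts, Paper and Paradigms). The `FRAGILITY_RANK` table, the `collection_order` function, and the sample task descriptions are illustrative assumptions.

```python
# Lower rank = more fragile = collect sooner (priorities as given in the article).
FRAGILITY_RANK = {
    "Position": 1,
    "People": 1,
    "Parts": 2,
    "Paper": 3,
    "Paradigms": 4,
}


def collection_order(tasks):
    """Sort (category, description) pairs so the most fragile data comes first.

    sorted() is stable, so tasks that share a rank keep their original order.
    """
    return sorted(tasks, key=lambda task: FRAGILITY_RANK[task[0]])


# Illustrative task list, deliberately out of order.
tasks = [
    ("Paper", "gather shift logs"),
    ("Parts", "bag and tag failed parts"),
    ("People", "debrief witnesses before they leave"),
    ("Paradigms", "note recurring remarks from interviews"),
    ("Position", "photograph and rope off the scene"),
]
ordered = collection_order(tasks)
categories = [category for category, _ in ordered]
# categories == ["People", "Position", "Parts", "Paper", "Paradigms"]
```

Position and People share the top rank, so a shift’s data collectors would normally work on both in parallel rather than in sequence.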

Doing a good job of “PRESERVING FAILURE DATA” is a key step in conducting an effective RCA. Unfortunately, it is also the step that is usually second in priority to getting the process or the piece of equipment back online as quickly as possible.

About the Author
Robert (Bob) J. Latino is former CEO of Reliability Center, Inc., a company that helps teams and companies do RCAs with excellence. Bob has been facilitating RCA and FMEA analyses with his clientele around the world for over 35 years and has taught over 10,000 students in the PROACT® methodology.

Bob is co-author of numerous articles, has led seminars and workshops on FMEA, Opportunity Analysis and RCA, and is co-designer of the award-winning PROACT® Investigation Management Software solution. He has authored or co-authored six (6) books related to RCA and Reliability in both manufacturing and healthcare and is a frequent speaker on the topic at domestic and international trade conferences.

Bob has applied the PROACT® methodology to a diverse set of problems and industries, including a published paper in the field of Counter Terrorism entitled, “The Application of PROACT® RCA to Terrorism/Counter Terrorism Related Events.”
