If You Fail This Mode

The Bathtub Curve and Product Failure Behavior
Part One - The Bathtub Curve, Infant Mortality and Burn-in

by Dennis J. Wilkins
Retired Hewlett-Packard Senior Reliability Specialist, currently a ReliaSoft Reliability Field Consultant
This paper is adapted with permission from work done while at Hewlett-Packard.

Reliability specialists often depict the lifetime of a population of products using a graphical representation called the bathtub curve. The bathtub curve consists of three periods: an infant mortality period with a decreasing failure rate, followed by a normal life period (also known as "useful life") with a low, relatively constant failure rate, and concluding with a wear-out period that exhibits an increasing failure rate. This article provides an overview of how infant mortality, normal life failures and wear-out modes combine to create the overall product failure distributions. It describes methods to reduce failures at each stage of product life and shows how burn-in, when appropriate, can significantly reduce operational failure rate by screening out infant mortality failures. The material will be presented in two parts. Part One (presented in this issue) introduces the bathtub curve and covers infant mortality and burn-in. Part Two (presented in next month's HotWire) will address the remaining two periods of the bathtub curve: normal life failures and end of life wear-out.

Figure 1: The Reliability Bathtub Curve

The bathtub curve, displayed in Figure 1 above, does not depict the failure rate of a single item, but describes the relative failure rate of an entire population of products over time. Some individual units will fail relatively early (infant mortality failures), others (we hope most) will last until wear-out, and some will fail during the relatively long period typically called normal life. Failures during infant mortality are highly undesirable and are always caused by defects and blunders: material defects, design blunders, errors in assembly, etc. Normal life failures are normally considered to be random cases of "stress exceeding strength." However, as we'll see, many failures often considered normal life failures are actually infant mortality failures. Wear-out is a fact of life due to fatigue or depletion of materials (such as lubrication depletion in bearings). A product's useful life is limited by its shortest-lived component. A product manufacturer must assure that all specified materials are adequate to function through the intended product life.

Note that the bathtub curve is typically used as a visual model to illustrate the three key periods of product failure and is not calibrated to depict a graph of the expected behavior for a particular product family. It is rare to have enough short-term and long-term failure information to actually model a population of products with a calibrated bathtub curve.

Also note that the actual time periods for these three characteristic failure distributions can vary greatly. Infant mortality does not mean "products that fail within 90 days" or any other defined time period. Infant mortality is the time over which the failure rate of a product is decreasing, and may last for years. Conversely, wear-out will not always happen long after the expected product life. It is a period when the failure rate is increasing, and has been observed in products after just a few months of use. This, of course, is a disaster from a warranty standpoint!

We are interested in the characteristics illustrated by the entire bathtub curve. The infant mortality period is a time when the failure rate is dropping, but is undesirable because a significant number of failures occur in a short time, causing early customer dissatisfaction and warranty expense. Theoretically, the failures during normal life occur at random but with a relatively constant rate when measured over a long period of time. Because these failures may incur warranty expense or create service support costs, we want the bottom of the bathtub to be as low as possible. And we don't want any wear-out failures to occur during the expected useful lifetime of the product.

Infant Mortality - What Causes It and What to Do About It?
From a customer satisfaction viewpoint, infant mortalities are unacceptable. They cause "dead-on-arrival" products and undermine customer confidence. They are caused by defects designed into or built into a product. Therefore, to avoid infant mortalities, the product manufacturer must determine methods to eliminate the defects. Appropriate specifications, adequate design tolerance and sufficient component derating can help, and should always be used, but even the best design intent can fail to cover all possible interactions of components in operation. In addition to the best design approaches, stress testing should be started at the earliest development phases and used to evaluate design weaknesses and uncover specific assembly and materials problems. Tests like these are called HALT (Highly Accelerated Life Test) or HAST (Highly Accelerated Stress Test) and should be applied, with increasing stress levels as needed, until failures are precipitated. The failures should be investigated and design improvements should be made to improve product robustness. Such an approach can help to eliminate design and material defects that would otherwise show up as product failures in the field.

After manufacturing of a product begins, a stress test can still be valuable. There are two distinct uses for stress testing in production. One purpose (often called HASA, Highly Accelerated Stress Audit) is to identify defects caused by assembly or material variations that can lead to failure and to take action to remove the root causes of these defects. The other purpose (often called burn-in) is to use stress tests as an ongoing 100% screen to weed out defects in a product where the root causes cannot be eliminated.

The first approach, eliminating root causes, is generally the best approach and can significantly reduce infant mortalities. It is usually most cost-effective to run 100% stress screens only for early production, then reduce the screen to an audit (or entirely eliminate it) as root causes are identified, the process/design is corrected and significant problems are removed. Unfortunately, some companies put 100% burn-in processes in place and keep using them, addressing the symptoms rather than identifying the root causes. They just keep scrapping and/or reworking the same defects over and over. For most products, this is not effective from a cost standpoint or from a reliability improvement standpoint.

There is a class of products where ongoing 100% burn-in has proven to be effective. This is with technology that is "state-of-the-art," such as leading edge semiconductor chips. There are bulk defects in silicon and minute fabrication variances that cannot be designed out with the current state of technology. These defects can cause some parts to fail very early relative to the majority of the population. Burn-in can be an effective way to screen out these weak parts. This will be addressed later in this article.

A Quantitative Look at Infant Mortality Failures Using the Weibull Distribution
The Weibull distribution is a very flexible life distribution model that can be used to characterize failure distributions in all three phases of the bathtub curve. The basic Weibull distribution has two parameters, a shape parameter, often termed beta (β), and a scale parameter, often termed eta (η). The scale parameter, eta, determines when, in time, a given portion of the population will fail (i.e., 63.2%). The shape parameter, beta, is the key feature of the Weibull distribution that enables it to be applied to any phase of the bathtub curve. A beta less than 1 models a failure rate that decreases with time, as in the infant mortality period. A beta equal to 1 models a constant failure rate, as in the normal life period. And a beta greater than 1 models an increasing failure rate, as during wear-out. There are several ways to view this distribution, including probability plots, survival plots and failure rate versus time plots. The bathtub curve is a failure rate vs. time plot.
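The role of the two parameters can be checked numerically. The sketch below uses the standard two-parameter Weibull formulas for the failure rate and the cumulative fraction failed; the function names and the scale value of 1,000 hours are assumptions made only for illustration and do not come from the article.

```python
import math

def weibull_failure_rate(t, beta, eta):
    """Failure (hazard) rate h(t) = (beta / eta) * (t / eta) ** (beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def weibull_fraction_failed(t, beta, eta):
    """Cumulative fraction failed F(t) = 1 - exp(-(t / eta) ** beta)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

def rate_trend(beta, eta, t_early=100.0, t_late=900.0):
    """Classify the failure rate between two times as decreasing, constant or increasing."""
    early = weibull_failure_rate(t_early, beta, eta)
    late = weibull_failure_rate(t_late, beta, eta)
    if math.isclose(early, late):
        return "constant"
    return "decreasing" if late < early else "increasing"

ETA_HOURS = 1000.0  # hypothetical scale parameter, chosen only for illustration

for beta in (0.5, 1.0, 3.0):
    failed_at_eta = weibull_fraction_failed(ETA_HOURS, beta, ETA_HOURS)
    print(f"beta={beta}: {rate_trend(beta, ETA_HOURS)} failure rate, "
          f"{failed_at_eta:.1%} failed by t = eta")
```

Whatever the value of beta, the fraction failed at t = eta is 1 - 1/e, which is where the 63.2% figure comes from.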

Typical infant mortality distributions for state-of-the-art semiconductor chips follow a Weibull model with a beta in the range of 0.2 to 0.6. If such a distribution is viewed in terms of failure rate versus time, it looks like the plot in Figure 2.

Figure 2: Infant Mortality Curve - Failure Rate vs. Time

This plot shows ten years (87,600 hours) of time on the x-axis with failure rate on the y-axis. It looks a lot like the infant mortality and normal life portions of the bathtub curve in Figure 1, but this curve models only infant mortality (decreasing failure rate). Dots on this plot represent failure times typical of an infant mortality distribution with Weibull beta = 0.2. As you can see, there are 27 failures before one year, and only 6 failures from one to ten years. People observing this curve, and the failure points plotted, could not be blamed for thinking it represents both infant mortality failures (in the first year or so), and normal life failures after that. But these are only infant mortality failures - all the way out to ten years!

This plot shows the distribution for a beta value typical of complex, high-density integrated circuits (VLSI or Very Large Scale Integrated circuits). Parts such as CPUs, interface controller and video processing chips often exhibit this kind of failure distribution over time. A look at this plot shows that if you could run these parts for the equivalent of three years and discard the failed parts, the reliability of the surviving parts would be much higher out to ten years. In fact, until a wear-out mode occurs, the reliability would continue to improve over time. If there are mechanisms that can produce normal life failures (theoretically a constant failure rate) mixed in with the defects that cause the infant mortalities shown above, burn-in can still provide significant improvement as long as the constant failure rate is relatively low.
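The benefit of discarding early failures can be expressed as a conditional probability: among units that survive the burn-in period, what fraction fails during the following ten years? The sketch below assumes a single Weibull population with beta = 0.2 as in the text. The scale parameter is hypothetical, since the article does not give one, so the printed percentages illustrate the direction of the effect and do not reproduce the figures.

```python
import math

HOURS_PER_YEAR = 8760  # 87,600 hours per ten years, as on the Figure 2 axis

def weibull_reliability(t, beta, eta):
    """Fraction of the population still working at time t."""
    return math.exp(-((t / eta) ** beta))

def failures_after_burn_in(burn_in, mission, beta, eta):
    """Fraction of burn-in survivors that fail during the next `mission` hours."""
    survive_burn_in = weibull_reliability(burn_in, beta, eta)
    survive_both = weibull_reliability(burn_in + mission, beta, eta)
    return 1.0 - survive_both / survive_burn_in

BETA = 0.2        # shape parameter quoted in the text
ETA_HOURS = 1e12  # hypothetical scale parameter, not taken from the article
TEN_YEARS = 10 * HOURS_PER_YEAR
THREE_YEARS = 3 * HOURS_PER_YEAR

raw = failures_after_burn_in(0, TEN_YEARS, BETA, ETA_HOURS)
screened = failures_after_burn_in(THREE_YEARS, TEN_YEARS, BETA, ETA_HOURS)
print(f"ten-year failures without burn-in: {raw:.2%}")
print(f"ten-year failures after a three-year screen: {screened:.2%}")
```

With any beta below 1 the screened value comes out lower than the raw value, because the failure rate the survivors face keeps falling.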

Burn-In for Leading Edge Technologies
To see how burn-in can improve the reliability of high tech parts, we'll use a chart that looks somewhat like the failure rate vs. time curve in Figure 2, but is more useful. This is a survival plot that directly shows how many units from a population have survived to a given time. Figure 3 is a plot for a typical VLSI process with a small "weak" sub-population (defective parts that will fail as infant mortalities) and a larger sub-population of parts that will fail randomly at a very low rate over the normal operating life. The x-axis scale is in years of use (zero to 100 years!) and the y-axis is percent of parts still operating to spec (starting at 100% and dropping to 50%).

Figure 3 shows that, of the failures that occur in the first 20 years (about 4%), most failures occur in the first year or so, just like we observed in the infant mortality example above. Because there is a low level, constant failure rate, this plot shows failures continuing for a hundred years. Of course, there could be a wear-out mode that comes into play before a hundred years has elapsed, but no wear-out distribution is considered here. Electronic components, unlike mechanical assemblies, rarely have wear-out mechanisms that are significant before many decades of operation.

Figure 3: Mixed Infant Mortality and Normal Life Survival Plot

We're not really interested in the failures much beyond ten years, so let's look at this same model for only the first ten years. In Figure 4, we have included sample failure points from the simulation model used to create the plot. These enable us to view which population (infant mortality or normal life) the failure came from.

Figure 4: Mixed Infant Mortality and Normal Life Failures

We see that the plot in Figure 4 looks like the early life and normal life portions of the bathtub curve, and in fact includes both distributions. We see that over 2% of the units fail in the first year, but it takes ten years for 3% to fail. In actuality, there are still "infant" mortalities occurring well beyond ten years in this model, but at an ever-decreasing rate. In fact, in the ten year span of this model there would be very few normal life failures. Only two failures (~5% of all failures) in this example (large blue dots) come from the normal life failure population. About 95% of the failures plotted above (small red dots) are infant mortality failures! This is what the integrated circuits (IC) industry has observed with complex solid-state devices. Even after ten years of operation the primary failure cause for ICs is still infant mortality. In other words, failures are still driven primarily by defects.
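A mixed population of this kind can be sketched as a small weak sub-population that fails according to a Weibull distribution with beta below 1, plus a larger sub-population that fails at a low constant rate. Every parameter value below is hypothetical; the article does not publish the values behind its simulation, so this only shows how such a mix can leave infant mortality dominating the ten-year failures.

```python
import math

def weak_fraction_failed(t_years, beta, eta_years):
    """Fraction of the weak sub-population failed by t (Weibull)."""
    return 1.0 - math.exp(-((t_years / eta_years) ** beta))

def normal_fraction_failed(t_years, rate_per_year):
    """Fraction of the normal sub-population failed by t (constant failure rate)."""
    return 1.0 - math.exp(-rate_per_year * t_years)

def mixed_failures(t_years, weak_share, beta, eta_years, rate_per_year):
    """Return (infant mortality, normal life) failures as fractions of all units."""
    infant = weak_share * weak_fraction_failed(t_years, beta, eta_years)
    normal = (1.0 - weak_share) * normal_fraction_failed(t_years, rate_per_year)
    return infant, normal

# All four values are hypothetical, chosen only to illustrate the idea.
WEAK_SHARE = 0.05       # fraction of units carrying a defect
BETA = 0.3              # within the 0.2 to 0.6 range quoted for such parts
ETA_YEARS = 5.0         # scale parameter of the weak sub-population
RATE_PER_YEAR = 0.0002  # constant failure rate of the normal sub-population

infant, normal = mixed_failures(10.0, WEAK_SHARE, BETA, ETA_YEARS, RATE_PER_YEAR)
total = infant + normal
print(f"ten-year cumulative failures: {total:.2%}")
print(f"share caused by infant mortality: {infant / total:.0%}")
```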

In such cases, burn-in can help. In the plot above you can see that if you could get three years of operation on this part before you shipped it, you would have screened out over 80% (2% divided by 3%) of the parts that would fail in ten years. So if we were to come up with a method to effectively "age" the parts the equivalent of three years and eliminate most of the infant mortalities, the remaining parts would be more reliable than the original population. Of course, the parts that go through the three-year "burn-in" would have to last an additional ten years in the field, for a total of thirteen years. Let's see what this looks like in Figure 5.

Figure 5: Comparison of Failures from Raw and Burned-in Parts

Above, we see fourteen years of failure distribution for the original parts (not burned-in) and eleven years of expected failure distribution for parts that received three years of burn-in. In this example, the total cumulative failures between three years and thirteen years for the original parts (or from zero to ten years for burned-in parts) is about 0.6%. Without burn-in, the first ten years would have had about 3% cumulative failures. This is about a 5 times reduction in cumulative failures by using burn-in, or in terms of a change, we would have about 2% fewer cumulative failures in ten years with burn-in if a dominant infant mortality failure mode exists. Note that in the first year or two, the relative improvement in reliability is even greater. At two years, only about 0.1% failures are expected after burn-in but about 2% without burn-in; a ratio of about 25:1!

In reality, manufacturers don't have two to three years to spend on burn-in. They need an accelerated stress test. In the IC industry there are usually two stresses that are used to accelerate the effective time of burn-in: temperature and voltage. Increased temperature (relative to normal operating temperatures) can provide an acceleration of tens of times (10x to 30x is typical). Increased voltages (relative to normal operating levels) can provide even higher acceleration factors on many types of ICs. Combined acceleration factors in the range of 1000:1, or more, are typical for many IC burn-in processes. Therefore, burn-in times of tens of hours can provide effective operating times of one to five years, significantly reducing the proportion of parts with infant mortality defects.
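The conversion between burn-in hours and effective operating time is simple arithmetic. The sketch below assumes the combined acceleration factor of 1000:1 mentioned above and treats it as a plain multiplier on elapsed time; the function names are illustrative only.

```python
HOURS_PER_YEAR = 8760  # 87,600 hours per ten years, as used earlier in the article

def effective_years(burn_in_hours, acceleration_factor):
    """Equivalent years of normal operation delivered by an accelerated burn-in."""
    return burn_in_hours * acceleration_factor / HOURS_PER_YEAR

def burn_in_hours_needed(target_years, acceleration_factor):
    """Burn-in hours required to simulate `target_years` of normal operation."""
    return target_years * HOURS_PER_YEAR / acceleration_factor

ACCELERATION = 1000  # combined temperature and voltage factor quoted in the text

for hours in (10, 24, 44):
    years = effective_years(hours, ACCELERATION)
    print(f"{hours} h of burn-in is roughly {years:.1f} years of operation")

three_year_screen = burn_in_hours_needed(3, ACCELERATION)
print(f"a three-year screen needs about {three_year_screen:.1f} h")
```

Under that assumption, ten hours of burn-in corresponds to a little over one year of operation and 44 hours to about five years, which matches the "tens of hours" for "one to five years" stated above.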

What if we try burn-in on a product with no dominant infant mortality problems? The survival plot for an assembly with a 1% per year "constant" failure rate (normal life period) is shown below in Figure 6.

Figure 6: Survival Plot for Constant Failure Rate

It's pretty easy to see that burn-in for two years would find ~2% failures, but operation for an additional two years would find another ~2%. At ten years, we would have found about 10%. Note, the line is not really a straight line because a constant failure rate (equivalent to the normal life part of the bathtub) acts on the remaining population and the remaining population is decreasing as units fail. Looking at the same burn-in conditions as in the last example, if we were to provide three years of operation on these parts and then use them for an additional ten years, what results would we have? The cumulative failures of the units that passed this screen would be very close to 9.5%. Without burn-in, the cumulative failures in ten years would be the same, about 9.5%. There is no advantage to burn-in with a constant (normal life) failure rate.
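These figures follow from the exponential survival function that a constant failure rate implies. The sketch below uses the 1% per year rate from the text; the function names are illustrative.

```python
import math

RATE_PER_YEAR = 0.01  # the 1% per year constant failure rate from the text

def cumulative_failures(years, rate=RATE_PER_YEAR):
    """Fraction failed by `years` under a constant failure rate."""
    return 1.0 - math.exp(-rate * years)

def failures_after_burn_in(burn_in_years, mission_years, rate=RATE_PER_YEAR):
    """Fraction of burn-in survivors that fail during the following mission."""
    survive_burn_in = math.exp(-rate * burn_in_years)
    survive_both = math.exp(-rate * (burn_in_years + mission_years))
    return 1.0 - survive_both / survive_burn_in

no_burn_in = cumulative_failures(10)
with_burn_in = failures_after_burn_in(3, 10)
print(f"two-year failures: {cumulative_failures(2):.2%}")
print(f"ten-year failures without burn-in: {no_burn_in:.2%}")
print(f"ten-year failures after a three-year burn-in: {with_burn_in:.2%}")
```

The two ten-year values agree because the ratio of the two exponentials depends only on the mission length, not on how long the unit has already run.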

It should be obvious that burn-in of an assembly that is failing due to a wear-out failure mode (failure rate increasing with time) will actually yield assemblies that are worse than units that did not go through burn-in. This is simply because the probability of failure is increasing for every hour the parts run. Adding operating time simply increases the chance of a failure in any future period of time!
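The same conditional calculation shows the effect for a wear-out mode. The sketch below assumes a Weibull distribution with beta greater than 1; both parameter values are hypothetical and stand in for any wear-out-dominated assembly.

```python
import math

def failures_after_burn_in(burn_in_years, mission_years, beta, eta_years):
    """Fraction of burn-in survivors that fail during the following mission."""
    survive_burn_in = math.exp(-((burn_in_years / eta_years) ** beta))
    survive_both = math.exp(-(((burn_in_years + mission_years) / eta_years) ** beta))
    return 1.0 - survive_both / survive_burn_in

BETA = 3.0        # hypothetical shape parameter; any value above 1 models wear-out
ETA_YEARS = 15.0  # hypothetical scale parameter

results = [failures_after_burn_in(b, 10.0, BETA, ETA_YEARS) for b in (0.0, 1.0, 3.0)]
for burn_in, failed in zip((0.0, 1.0, 3.0), results):
    print(f"{burn_in:.0f} years of burn-in: {failed:.1%} fail in the next ten years")
```

Each added year of burn-in raises the fraction of shipped units that fail during the ten-year mission.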

Conclusion
In this issue, Part One, we have introduced the concept of the bathtub curve and discussed issues related to the first period, infant mortality, as well as the practices, such as burn-in, that are used to address failures of this type. As this article demonstrates, although burn-in practices are not usually a practical economic method of reducing infant mortality failures, burn-in has proven to be effective for state-of-the-art semiconductors where root cause defects cannot be eliminated. For most products, stress testing, such as HALT/HAST, should be used during design and early production phases to precipitate failures, followed by analysis of the resulting failures and corrective action through redesign to eliminate the root causes. In Part Two (presented in next month's HotWire), we will examine the last two periods of the bathtub curve: normal life failures and end of life wear-out.

Source: https://www.weibull.com/hotwire/issue21/hottopics21.htm
