5. Soft Errors

Last Revision: Apr. 14, 1998

Soft errors in memory are non-permanent errors that are "fixed" simply by writing new data to the memory. Soft errors can be caused by poor power regulation, alpha particles, or cosmic rays. Approximately 98% of RAM errors are soft errors.

Soft errors can arise due to poor power regulation when a brownout occurs, since the memory in the computer may not get enough power to be properly refreshed. Brownouts are short-duration undervoltages in the AC power supplied by the wall outlet. They can be caused by a temporary increase in demand for power nearby. For example, when the motor on a washing machine starts, the lights in the house will often dim because the motor places an excess demand on the power supply. It is estimated that brownouts account for 87% of all power disturbances (Tripplite1997). According to APC (APC1997), an IBM study showed that a typical computer experiences more than 120 power problems per month. The effects of these power problems can be a dramatic hardware failure, or what appears to be a simple software bug such as a program or operating system crash.

A simple solution to the power regulation problem is not to rely on the regulation within the computer power supply, but to purchase either a line regulator or a UPS (uninterruptible power supply) for the computer. These will provide the computer with a clean, consistent source of power even in the presence of brownouts. As an added benefit, virtually all line regulators and UPS systems also provide good levels of surge suppression to prevent spikes and surges from damaging the computer.

Assuming that power problems have been minimized, there remain conflicting reports about the primary cause of soft errors in RAM. Alpha-particle-induced soft errors in RAM were first discovered by May and Woods at Intel in 1978. Intel had been experiencing serious problems with its 2107 series 16 Kbit DRAMs, and the problem was determined to be trace radioactivity in the memory packaging material. Due to increased demand for LSI ceramic packaging, Intel had built a new factory on the Green River in Colorado. Unfortunately, the factory was downstream from an old uranium mine, and the water used by the factory had enough radioactivity to contaminate the ceramic packages (Ziegler1996a).

In 1981, IBM was experiencing reliability problems with its 16 Kbit DRAM memory chips due to radioactive Kr85 contaminating the packaging. A special module tester was built and four million memory modules were tested. Approximately 2% of the modules were found to be contaminated (Ziegler1996a).

An alpha particle is a helium nucleus, which contains two protons and two neutrons. Since the alpha particle does not contain any electrons, it has a positive (+2) charge. Alpha particles are emitted from higher density atoms as a result of radioactive decay. Alpha particles can travel only a few centimeters in air before interacting with molecules in the air. In solid silicon, due to the higher density of the atoms, an alpha particle can travel only about 25 microns before interacting with a silicon atom.

Since the discovery that alpha particles cause soft errors in DRAMs, alpha particles have commonly been regarded as the primary cause of soft errors in RAM. As an example, Lantz (1996) claims that alpha particle radiation "is the most common cause of soft errors in semiconductor memory devices".

According to Lage et al. (1993), it "is generally assumed that alpha particles generated in the package or in the interconnect layers of the memory circuit are the cause of the SSER events". However, their own research indicates that most errors in dense SRAMs are due to cosmic ray events. Soft DRAM errors due to cosmic rays have been studied by a number of researchers (McKee1996, Tosaka1997, Ziegler1998). According to Tosaka et al. (1997), "we must tackle cosmic ray neutron-induced SE's." The soft error rate of CMOS latch circuits is dominated by neutrons, and the neutron-induced soft error rate in CMOS SRAMs is of the same order as the alpha-induced soft error rate (Tosaka1997).

McKee et al. (1996) have shown that the soft error rate of 4Mbit and 16Mbit DRAMs is dependent upon the neutron flux due to cosmic rays. McKee et al. further state that "cosmic ray neutrons are a dominant source of errors in these DRAM devices". It is expected that cosmic ray induced errors "will become an important reliability issue for DRAMs at the 64Mb generation and beyond" (McKee1996).

In a study by IBM, it was noted that errors in cache memory were twice as common above an altitude of 2600 feet as at sea level. The soft error rate of cache memory above 2600 feet was five times the rate at sea level, and the soft error rate in Denver (5280 feet) was ten times the rate at sea level. A review of the Denver operational logs also revealed that several multiple simultaneous memory errors had occurred (Ziegler1996a).

In 1992, IBM studied the soft error rate of a non-IBM 4 Mbit DRAM. The first test measured the soft error rate of 864 modules for 4671 hours on the second story of a two-story building. During the 4 million device hours of testing, 24 single bit fails were recorded, resulting in an SER of 5950 FIT per chip, where one FIT is equivalent to one fail per billion chip hours. In the second test, the DRAM samples were moved to a nearby vault shielded by approximately 20 meters of rock. The same 864 modules were tested for 5863 hours without a single bit failure. Since 20 meters of rock will block virtually all cosmic rays but would have no effect on alpha-induced errors caused by radioactive contaminants, it appears that cosmic rays are the primary cause of soft errors in DRAM (OGorman1996).
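
The FIT figure quoted above can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the helper name fit_rate is hypothetical and is not taken from the IBM study, and it assumes that each of the 864 units under test counts as one device when device hours are accumulated.

```python
def fit_rate(failures, devices, hours_per_device):
    """Return the failure rate in FIT (fails per billion device hours).

    Hypothetical helper; assumes every device ran for the full test.
    """
    device_hours = devices * hours_per_device
    return failures / device_hours * 1e9


# First test: 864 units for 4671 hours, 24 single bit fails.
device_hours = 864 * 4671
ser_fit = fit_rate(24, 864, 4671)

# Second test (shielded vault): 864 units for 5863 hours, no fails.
vault_device_hours = 864 * 5863
vault_fit = fit_rate(0, 864, 5863)

print(f"first test:  {device_hours} device hours, {ser_fit:.0f} FIT")
print(f"vault test:  {vault_device_hours} device hours, {vault_fit:.0f} FIT")
```

Run as written, the first test works out to about 4.04 million device hours and a rate within rounding of the 5950 FIT quoted above. The vault test accumulated even more device hours, which is why the absence of any failure there is significant.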

In space, cosmic rays consist of approximately 92% protons and 6% alpha particles, with a flux of about 0.16 particles per square meter per second. As cosmic rays travel through our atmosphere, they interact with atoms in the atmosphere and produce multiple lower energy particles. At sea level, the particle flux is approximately one particle per square centimeter per second, and the distribution is 95% neutrons. The particle flux peaks at approximately 15 km above sea level, with a flux of about 100 particles per square centimeter per second. Lower altitudes have a lower flux due to the absorption of many of the particles. Many aircraft fly at 10-25 km above sea level, and it is known that the failure rate of electronics at airplane altitudes is approximately one hundred times greater than at sea level (Ziegler1979, Ziegler1996b).

Since one cosmic ray from space can fracture into many secondary particles, each of which may result in several tertiary particles, etc., it is possible for a large number of particles to strike the earth at the same time in the same general area. These occurrences are called extensive air showers (EAS). An extensive air shower may have a radius of approximately 100 meters and may contain up to one million particles. All of the particles in an EAS arrive at the earth within nanoseconds of each other. However, since the area covered by the million particles is quite large, the probability of multibit errors caused by multiple particles in an extensive air shower is quite low.
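
A rough calculation illustrates why multiple hits from one shower are unlikely. The Python sketch below makes several assumptions that are not in the sources cited here: the particles are spread uniformly over a circle of 100 meters radius (real showers are denser near the core), the target has a hypothetical area of one square centimeter, and hits follow a Poisson distribution. It counts particles crossing the target, not soft errors, since most particles passing through a chip cause no upset.

```python
import math


def shower_density_per_cm2(particles, radius_m):
    """Average particles per square centimeter, assuming a uniform disc."""
    area_cm2 = math.pi * (radius_m * 100.0) ** 2
    return particles / area_cm2


def prob_two_or_more(mean_hits):
    """Poisson probability of at least two hits for a given mean."""
    p_zero = math.exp(-mean_hits)
    p_one = mean_hits * p_zero
    return 1.0 - p_zero - p_one


density = shower_density_per_cm2(1_000_000, 100.0)

chip_area_cm2 = 1.0  # hypothetical target area, for illustration only
mean_hits = density * chip_area_cm2
p_multi = prob_two_or_more(mean_hits)

print(f"density: {density:.6f} particles per cm^2")
print(f"P(two or more particles cross the target): {p_multi:.2e}")
```

Under these assumptions the density is only about 0.003 particles per square centimeter, so even the largest shower described above is very unlikely to put two particles through the same small device.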

In order to reduce the soft error rate due to cosmic rays, manufacturers can change the geometry of their memory cells. The physical geometry of the cell design plays a large role in the susceptibility of memory cells to cosmic rays. For example, Ziegler et al. (1998) found that trench cells with internal charge (TIC cells) had an annual soft error rate of less than 0.002 fails/32MB, whereas trench cells with external charge (TEC cells) had an annual soft error rate of approximately 3 fails/32MB, and stacked capacitor cells had an annual soft error rate between 0.2 and 1.1 fails/32MB.

The above figures provide some indication of the relative frequency of memory errors, but these memory chips were being bombarded with radiation to accelerate the testing (Ziegler1998). A better indication of the frequency of soft errors can be obtained from Micron (1997a, 1997b); the soft error rate MTBF of their DRAM modules ranges from 14 years to 142 years. Since Micron is generally considered to be a high-end memory manufacturer, it may be that other manufacturers have higher failure rates and lower MTBF values.
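
To compare an MTBF expressed in years with a rate expressed in FIT, the two can be converted into one another. The Python sketch below is a minimal illustration, assuming a constant failure rate and 8760 hours of continuous operation per year; neither assumption comes from Micron, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760  # assumes continuous, year-round operation


def mtbf_years_to_fit(mtbf_years):
    """Convert an MTBF in years to fails per billion hours (FIT)."""
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)


def mtbf_years_to_fails_per_year(mtbf_years):
    """Expected fails per year under a constant failure rate."""
    return 1.0 / mtbf_years


fit_low_mtbf = mtbf_years_to_fit(14)    # shortest MTBF quoted above
fit_high_mtbf = mtbf_years_to_fit(142)  # longest MTBF quoted above

print(f"MTBF  14 years: {fit_low_mtbf:.0f} FIT, "
      f"{mtbf_years_to_fails_per_year(14):.3f} fails per year")
print(f"MTBF 142 years: {fit_high_mtbf:.0f} FIT, "
      f"{mtbf_years_to_fails_per_year(142):.4f} fails per year")
```

Under these assumptions, an MTBF of 14 years corresponds to roughly 8000 FIT and an MTBF of 142 years to roughly 800 FIT per module.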


© Malcolm Smith, 1998.