More than once I experienced hard-to-believe scenarios.
The first engineering sample run consisted of 5 to 7 graphics cards. Only 1 card showed a single-bit error during intense pattern tests of its 16 MByte SDRAM, consisting of 8 chips. Repeatable at a specific address + bit + pattern. After changing that very chip, the error was at a different location in the same chip. Another faulty chip?
No! Looked at the SDRAM clock with an oscilloscope. Using a standard probe (8 to 12 pF load), the error vanished! Using an active probe (0.6 pF but 100 kOhm), the error was still there. The clock looked good, well within spec, also in its timing relation to the other signals. No fault visible. Nevertheless, the clock waveform could be shaped to be better: more trapezoidal, with even less over- and undershoot.
Added a 10 to 33 ohm series termination resistor at clock source. Clock looked optimal now (using active probe), no more improvement possible. Error gone, never seen again.
In the 16 MByte SDRAM of a graphics card, some bit errors appeared during intense pattern tests. Investigation revealed that some bits read back as the AND or OR of 2 or more bits adjacent in the memory array. These errors had not been detected by the chip maker's test. The chip maker improved its test pattern following my suggestion.
Years 1998 to 2000, SDRAM chips of 4 to 16 MBit. My RAM test found on average 1 chip out of 10000 with 1 to 3 bit errors, even from the best brands.
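Why such a coupling fault can slip through a simple test is easy to show with a toy model. The sketch below is only an illustration: the class `CoupledMemory`, the two test functions and the fault position are invented for this example, and the actual patterns used in the RAM test or by the chip maker are not given here. A uniform fill (all 0, then all 1) never makes the victim bit differ from its neighbour, so the AND or OR stays invisible; a walking pattern does.

```python
class CoupledMemory:
    """Toy bit array, NOT a model of a real SDRAM.

    Optionally one 'victim' bit reads back as the AND or OR of itself
    and an 'aggressor' bit (hypothetical fault for illustration).
    """

    def __init__(self, size, victim=None, aggressor=None, mode="and"):
        self.bits = [0] * size
        self.victim, self.aggressor, self.mode = victim, aggressor, mode

    def write(self, addr, value):
        self.bits[addr] = value & 1

    def read(self, addr):
        value = self.bits[addr]
        if addr == self.victim:
            other = self.bits[self.aggressor]
            value = value & other if self.mode == "and" else value | other
        return value


def uniform_test(mem, size):
    """Fill with all 0s, then all 1s; return the set of failing addresses."""
    bad = set()
    for fill in (0, 1):
        for a in range(size):
            mem.write(a, fill)
        for a in range(size):
            if mem.read(a) != fill:
                bad.add(a)
    return bad


def walking_test(mem, size):
    """Walk a single 1 through 0s, then a single 0 through 1s."""
    bad = set()
    for background in (0, 1):
        for a in range(size):
            mem.write(a, background)
        for target in range(size):
            mem.write(target, background ^ 1)
            for a in range(size):
                expected = background ^ 1 if a == target else background
                if mem.read(a) != expected:
                    bad.add(a)
            mem.write(target, background)  # restore the background
    return bad


if __name__ == "__main__":
    size = 16
    print(uniform_test(CoupledMemory(size, 5, 6, "and"), size))  # fault missed
    print(walking_test(CoupledMemory(size, 5, 6, "and"), size))  # fault found
```

The walking test is quadratic in memory size, so it is usable only on small blocks. Note also that "adjacent in the memory array" means physical adjacency, which need not match adjacent addresses; the toy model ignores that.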
I measured the active time of a monoflop consisting of the CMOS version of the NE555 timer + 2.2 uF 1206 X7R ceramic capacitor + resistor, set to a value in the range 2 to 5 s, with sub-microsecond precision, more than 1 try per minute, for more than 30 minutes. This happened as a by-product of verifying the reliable operation of a fix implemented that way.
All active times fell within the same 1 us range. Deviation between tries was less than 1E-6 of the interval time. I had expected to see fluctuations at least in the single-digit microsecond range.
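To put the numbers in relation, here is a small sketch using the ideal textbook formula for a 555 monoflop, t = ln(3) * R * C (about 1.1 * R * C). The resistor value is not stated above, so the computed R is only what the ideal formula would suggest, not the part actually used. The function names are made up for this example.

```python
import math


def monoflop_time(r_ohm, c_farad):
    """Ideal 555 monostable pulse width: t = ln(3) * R * C, about 1.1 * R * C."""
    return math.log(3) * r_ohm * c_farad


def resistor_for(t_s, c_farad):
    """Resistor for a wanted pulse width (ideal formula, tolerances ignored)."""
    return t_s / (math.log(3) * c_farad)


def relative_spread(spread_s, t_s):
    """Observed spread as a fraction of the interval."""
    return spread_s / t_s


if __name__ == "__main__":
    C = 2.2e-6  # capacitor value from the text
    for t in (2.0, 5.0):  # interval range from the text
        r = resistor_for(t, C)
        print(f"t = {t} s: R ~ {r / 1e3:.0f} kOhm, "
              f"1 us spread = {relative_spread(1e-6, t):.1e} of interval")
```

By the ideal formula the resistor would be somewhere around 0.8 to 2.1 MOhm, and a 1 us spread on a 2 to 5 s interval is 2E-7 to 5E-7 of the interval, consistent with the "less than 1E-6" stated above.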
More than once I was told: The error you reported cannot be in our code, because our code does not work that way. E.g. I had seen an error in 1 of 4 cases of different input parameters, but was told that it is always the same code, so it cannot behave differently.
No surprise to me that tracing down to the error and disassembling proved me right and the code writers wrong, at the cost of some hours or days. There was different code for each of the 4 cases, and 1 case had the error.
In school, in 1988, a fellow consulted me because his self-made computer did not start up reliably. Recently he had increased the voltage of the +5V supply a little, to be within spec, to improve reliability. I had an idea and interviewed him. He did not want to believe it, but confirmed it the next day.
In many programming algorithms, EPROMs are read back for verification at elevated voltage, to see whether the charge is sufficient to reliably read back a programmed 0. This guy had not programmed his 2764 EPROMs the right way, but for too short a time. So they worked at reduced voltage, but not within their full specified normal operating range. A longer EPROM programming time solved the issue.
The same basic principle applies to old single-level-cell Flash. This may be used to recover data from already heavily discharged Flash, which fails under normal conditions. Try reading at slightly reduced voltage and check whether the read data changes.
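The read-and-compare idea can be sketched as follows. Everything hardware-related here is an assumption: `read_at` stands for whatever routine sets the supply voltage and dumps the device, `fake_device` is a stand-in so the sketch runs without hardware, and the voltages 5.0 V and 4.6 V are placeholders, not recommendations. Stay within what the device tolerates.

```python
def margin_read(read_at, nominal_v, reduced_v):
    """Read the device at two supply voltages and compare.

    read_at: hypothetical callable, voltage -> bytes (full device dump).
    Returns the dump taken at reduced voltage and the list of byte
    offsets whose content depends on the voltage.
    """
    nominal = read_at(nominal_v)
    reduced = read_at(reduced_v)
    if len(nominal) != len(reduced):
        raise ValueError("reads returned different lengths")
    unstable = [i for i, (a, b) in enumerate(zip(nominal, reduced)) if a != b]
    return reduced, unstable


def fake_device(voltage):
    """Stand-in for real hardware, for demonstration only.

    Assumed behaviour: bit 0 of byte 1 is a weakly programmed 0 that
    reads as 1 at 5.0 V and above, and as the intended 0 below that.
    """
    good = bytes([0x00, 0xA4, 0x3C])
    if voltage >= 5.0:
        return bytes([good[0], good[1] | 0x01, good[2]])
    return good


if __name__ == "__main__":
    data, unstable = margin_read(fake_device, nominal_v=5.0, reduced_v=4.6)
    print("voltage-dependent byte offsets:", unstable)
    print("data read at reduced voltage:", data.hex())
```

An empty list of unstable offsets does not prove the data is good; it only says that these two reads agree. Offsets that do change mark the cells worth a closer look.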
A card passed Automatic Optical Inspection, but did not work in the final test. A whole 32 bit wide SGRAM bank was missing, said the test program. Manual optical inspection (naked eye) detected the root cause: a 100 pin TQFP was offset (shifted) by 1 pin.
An unhappy customer appeared in the repair shop between 1990 and 1992, presenting his Atari ST, which had suddenly stopped working. When interviewed, he believed there had been a smell of smoke for a short time, and only a little. Opening the computer showed nothing unusual, but I disassembled it further and turned the PCB upside down.
Below a ROM sitting in a DIL socket, approx. 1 square cm of the FR4 PCB was burned out; only black oxidized remains of copper traces were left, plus glass fibre fabric and a little carbon. The epoxy had vanished. There was no component at that place. It looks like some fault in the PCB ignited a fire after years of apparently flawless operation in usual living room conditions.
An Atari ST no longer produced a readable screen output. At that time, a level tester (showing H, L, and edges by LED; the common English term might be logic probe) was my first tool for such tasks. I followed HSYNC, VSYNC and BLANK, and all appeared ok at the few 74LS-series logic chips. The oscilloscope showed a difference. A 74LS02 NOR gate had changed its mind on one of its 4 outputs, and converted to an OR gate.
Do not go beyond the limits given by the product maker, unless you are absolutely sure you have the superior experience in this very field.
I would like to give some examples, written down from memory, because I did not want to invest the time to dig out the old logs. This means the numbers may not match exactly, but their relations are true. Multiple samples were tested.
A Xilinx XC3064A served as a NuBUS to PCI bridge, in the years 1995 to 1996. The Xilinx software computes a maximum clock frequency for the implemented logic, in this case for worst-case voltage (low) and temperature (85 Celsius). My guess was that there is significant headroom at 25 Celsius and mid voltage; I expected some MHz.
Computed frequency was 26 MHz. Headroom was less than 1 MHz. Malfunction started below 27 MHz. Same result when the FPGA was cooled down to below minus 20 Celsius by freezer spray.
In the years 1995 to 1997. The specification for maximum pixel clock was 100 MHz. As mentioned in the previous chapter, we expected to see significant headroom of at least 5 MHz at office temperature in the range 20 to 25 Celsius and mid voltage.
Malfunction started at 102 to 104 MHz. This shifted 1 to 2 MHz to higher frequency for a "frozen" chip. Headroom was much less than expected, nevertheless thousands of these chips operated flawlessly for years at specified max frequency.
In the years 1990 to 1993. Soldered an adapter board from a PLCC44 plug (male) consisting of single pins to a DIL40 socket with precision contacts. Tested for connectivity and isolation. Detected a single short circuit between two adjacent pins of the DIL40. Optical inspection of the PCB found no root cause, despite using a magnifying glass. The PCB had been electrically tested according to the PCB maker. Scratching between the PCB traces did not change anything.
Eventually, cut the plastic body of the DIL40 socket step by step between these two pins. Found a visible metallic bridge inside the plastic body. Cut the bridge, short circuit vanished, problem solved.
--- Author: Harun Scheutzow ------ Last change: 2011-07-24 ---