Tired PCI-Express Extension Cable Caused System Instability
I was ready for a break from working on the Luggable PC Mark II project and wanted to enjoy the results of my labor for a while. I started learning PIC programming but was frustrated by an unstable computer.
Revision A proved that the system works: all components ran together reliably for a few weeks. But revision B was a problem child. It started off with occasional freezes that recovered on their own. Then the freezes stopped recovering and I had to power cycle the computer. Degrading further, the computer started rebooting spontaneously.
The unpredictable nature of these events made diagnosis difficult. Sometimes many hours would pass before an event; sometimes two would happen within the same minute. Every time I changed one variable, the system had to be left running to see whether the change helped. Sometimes this meant running the system for hours before another reset occurred.
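To put rough numbers on why this kind of testing is slow, here is a minimal sketch. It assumes failures arrive at random at a steady average rate (a Poisson process), which is a simplification: my failures sometimes came in bursts. The function names and the 4-hour average are made up for illustration, not measurements from this system.

```python
import math


def chance_of_quiet_run(hours_run, mean_hours_between_failures):
    """Probability that a still-broken system shows no failure for
    hours_run, assuming failures arrive at random (Poisson process)."""
    return math.exp(-hours_run / mean_hours_between_failures)


def hours_needed(mean_hours_between_failures, max_luck=0.05):
    """Hours of failure-free running needed before a quiet run would
    happen by luck alone no more than max_luck of the time."""
    return -mean_hours_between_failures * math.log(max_luck)


if __name__ == "__main__":
    mean_gap = 4.0  # hypothetical: one failure every 4 hours on average
    for hours in (1, 4, 12, 24):
        luck = chance_of_quiet_run(hours, mean_gap)
        print(f"{hours:>2} h without failure: {luck:.1%} chance it was just luck")
    print(f"Hours needed for 95% confidence: {hours_needed(mean_gap):.1f}")
```

Under these assumptions, a quiet hour or two says very little about a change, while a clean run several times longer than the usual gap between failures is hard to explain by luck.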
My first suspect was overheating, because a tremendous heat wave hit Los Angeles this week. But there was little correlation between temperature and stability. One of the "reboot itself multiple times within a minute" events occurred during the cool night.
The next suspect was power, as an under-voltage could cause these symptoms and the heat wave meant a lot of air conditioners running in the neighborhood. But the reboots continued after swapping in a different power supply and putting the system on a UPS.
The key insight came from a system freeze during a work session where I had music playing in the background. The music continued but the screen was frozen, implicating the video subsystem.
The PCI-Express extension cable was an unknown. I explicitly excluded one from Luggable Mark I just to eliminate that variable. As a test, I plugged the video card directly into the motherboard. The system is not luggable at all in this state, but the test proved informative: the system stayed stable for 24 hours.
Looking at the cable I removed from the system, I can see a lot of wrinkles from all the times I experimented with the layout and changed the relative dimensions of the components. Hypothesis: metal fatigue has started cracking some of the wires in this ribbon cable, causing intermittent connections and general system chaos.
Normally a system installer would bend the ribbon cable into place once and leave it. My usage pattern, many different bends over many weeks, goes beyond normal expectations. It is like bending a paperclip back and forth until it breaks.
In short: "My bad".
I ordered another cable from the same vendor off Amazon (*), installed the replacement, and that restored system reliability. I plan to leave this second cable alone as much as possible. When I start working on revision C, I will use the old cable (now labelled "TEST") to try out different layout ideas. Bend and flex and twist as I experiment. I won't change the bends on the new cable until I settle on a layout.
(*) Disclosure: As an Amazon Associate I earn from qualifying purchases.