Reported PCI Express Error was Unrelated
I have a Dell Inspiron 7577 laptop whose Ethernet hardware is unhappy with Proxmox VE 8, dropping off the network at unpredictable times. [UPDATE: Network connectivity stabilized after installing Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve. The PCI Express AER messages described in this post also stopped.] Trying to dig deeper, I found an error message dump indicating a watchdog timer went off while waiting to transmit data over the network. Searching online, I found bug reports that match the symptoms, but matching symptoms don't necessarily mean a matching cause. A watchdog timer can be triggered by anything that gums up the works, so what resolves the network issue on one machine wouldn't necessarily work on mine. I went back to dmesg to look for other clues.
Before the watchdog timer triggered, I found several lines of this message at irregular intervals:
[36805.253317] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:3b:00.0
Sometimes they were only seconds apart, other times hours apart, and sometimes none appeared at all before the watchdog timer barked. This is some sort of error on the PCIe bus from device 3b:00.0, which is the Realtek Ethernet controller as per this lspci excerpt:
3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
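To get a feel for how irregular those intervals are, the timestamps can be pulled out of saved dmesg output with a short script. This is only a minimal sketch, assuming every line of interest looks exactly like the one quoted above. The names parse_aer_events and gaps_between and the regular expression are my own illustration, not part of any existing tool.

```python
import re

# Matches lines shaped like the one quoted in this post:
# [36805.253317] pcieport 0000:00:1c.4: AER: Corrected error received: 0000:3b:00.0
# (Assumption: other kernels may word or format this message differently.)
AER_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+pcieport\s+(?P<port>\S+):\s+AER:\s+"
    r"Corrected error received:\s+(?P<device>\S+)\s*$"
)


def parse_aer_events(dmesg_text):
    """Return (seconds_since_boot, port, device) for each corrected-error line."""
    events = []
    for line in dmesg_text.splitlines():
        match = AER_LINE.match(line.strip())
        if match:
            events.append((float(match["ts"]), match["port"], match["device"]))
    return events


def gaps_between(events):
    """Return the seconds elapsed between consecutive events."""
    times = [ts for ts, _port, _device in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Feeding the text of a saved dmesg dump to parse_aer_events and then passing the result to gaps_between gives the spacing between messages, which makes the "seconds apart versus hours apart" pattern easier to see than scrolling through the log.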
Even though the debug message said the error was corrected, maybe it triggered side effects causing my problem? Searching on this error message, I found several possibly relevant kernel flags. This Reddit thread has a good summary of them all.
- pci=noaer disables PCI Express Advanced Error Reporting (AER), which is what sent this message. This is literally shooting the messenger. It'll silence those messages but won't do anything to address underlying problems.
- pci=nomsi disables Message Signaled Interrupts (MSI), a PCI Express signaling mechanism that might cause these correctable errors, forcing all devices to fall back to a different mechanism. Some people reported losing peripherals (like USB) when they used this flag. I guess that hardware couldn't fall back to something else? I tried it, and while it didn't cause any obvious problems (I still had USB), it didn't help keep my Ethernet alive either.
- pci=nommconf disables PCI Express memory-mapped configuration. (I don't know what those words mean, I just copied them out of kernel documentation.) The good news is that adding this flag did eliminate those "Corrected error received" messages. The bad news is that it didn't help keep my Ethernet alive, either.
Up until I tried pci=nommconf, I had wondered if I'd been doing kernel flags wrong. I was editing /etc/default/grub then running update-grub. After booting, I checked that the flags showed up in the output of cat /proc/cmdline, but I didn't really know if the kernel actually changed behavior. After pci=nommconf, my confidence was boosted by the lack of "Corrected error received" messages, though that might still be a false sense of confidence because those messages don't always happen. It's an imperfect world; I work with what I have.
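The /proc/cmdline check can be scripted so it is less error-prone than eyeballing the output. The following is a rough sketch under my own assumptions: the function names are made up for illustration, and it only confirms that a flag reached the kernel command line, not that the kernel changed its behavior because of it. It accounts for the kernel accepting several pci options in one comma-separated list, such as pci=noaer,nommconf.

```python
from pathlib import Path


def pci_options(cmdline):
    """Collect every option passed via pci=... on a kernel command line.

    Options may be given as a comma-separated list (pci=noaer,nommconf).
    All pci= tokens are gathered in case the parameter shows up more than once.
    """
    options = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            options.update(opt for opt in token[len("pci="):].split(",") if opt)
    return options


def missing_pci_options(cmdline, wanted):
    """Return the wanted options that are absent from the command line."""
    return sorted(set(wanted) - pci_options(cmdline))


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    return Path(path).read_text().strip()
```

On the laptop itself, missing_pci_options(read_running_cmdline(), ["nommconf"]) returning an empty list would mean the flag made it through the /etc/default/grub edit and update-grub.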
And sadly, there is something I need but don't have today: the ability to dig deeper into the Linux kernel to find out what has frozen up, leading to the watchdog timer expiring. I'm out of ideas for now, and I still have a computer that drops off the network at irregular times. I don't want to keep pulling the laptop off the shelf to log in locally and type "reboot" several times a day. I concede I must settle for a hideously ugly hack to do that for me.
Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt "Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable"