[UPDATE: After installing the Proxmox VE kernel update from 6.2.16-15-pve to 6.2.16-18-pve, this problem no longer occurs and the machine stays connected to the network.]

After setting up a Home Assistant OS virtual machine in Proxmox VE alongside a few other virtual machines, I wondered how long it would be before I encountered my first problem with this setup. I got my answer roughly 36 hours after I installed Proxmox VE. I woke up in the morning to my ESP microcontrollers blinking their blue LEDs, signaling a problem. The Dell Inspiron 7577 laptop I'm using as a light-duty server had fallen off the network. What happened?

I pulled the machine off the shelf and opened the lid. The screen was dark because of the screen blanking I had configured earlier, but tapping a key woke it up and I saw it filled with messages. Two messages dominated. There would be several lines of this:

r8169 0000:03:00.0 enp3s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).

Followed by several lines of a similar but slightly different message:

r8169 0000:03:00.0 enp3s0: rtl_ephyar_cond == 1 (loop: 100, delay: 10).
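
To get a feel for how often each message shows up, the kernel log can be scanned for them. Here is a minimal Python sketch, assuming the messages always follow the shape of the two lines quoted above. The regular expression is inferred from those two samples and not from the driver source, and the function name is my own.

```python
import re
from collections import Counter

# Pattern inferred from the two quoted messages (an assumption, not taken
# from the r8169 driver source). Example of a matching line:
#   r8169 0000:03:00.0 enp3s0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
R8169_TIMEOUT = re.compile(
    r"r8169 (?P<pci>[0-9a-f:.]+) (?P<iface>\S+): "
    r"(?P<cond>rtl_\w+_cond) == \d+ \(loop: \d+, delay: \d+\)"
)


def count_r8169_timeouts(lines):
    """Count matching r8169 messages, grouped by condition name."""
    counts = Counter()
    for line in lines:
        match = R8169_TIMEOUT.search(line)
        if match:
            counts[match.group("cond")] += 1
    return counts
```

If the journal is persistent, the output of "journalctl -k -b -1" (kernel messages from the previous boot) could be saved to a file and its lines passed to count_r8169_timeouts().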

Since the machine was no longer on the network, I couldn't access Proxmox VE's web interface. About the only thing I could do was log in at the keyboard and type "reboot". A few minutes later, the system was back online.

While it was rebooting, I performed a search for rtl_ephyar_cond and found a hit on the Proxmox subreddit: "System hanging intermittently after upgraded to 8". It pointed the finger at Realtek's 8169 network driver, and to a Proxmox forum thread: "System hanging after upgrade…NIC driver?" It sounds like Realtek's 8169 drivers have a bug exposed by Linux kernel 6. Proxmox bug #4807 was opened to track this issue, which led me down a chain of links to Ubuntu bug #2031537.

The code change intended to resolve this issue doesn't fix anything on the Realtek side, but purportedly avoids the problem by disabling PCIe ASPM (Active State Power Management) for Realtek chip versions 42 and 43. I couldn't confirm this is directly relevant to me. I typed lspci at the command line and here's the line about my network controller:

3b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)

This matches some of the reports on Proxmox bug 4807, but I don't know how "rev 15" relates to "42 and 43" and I don't know how to get further details to confirm or deny. I guess I have to wait for the bug fix to propagate through the pipeline to my machine. I'll find out if it works then, and whether there's another problem hiding behind this one.
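
One detail that can be pinned down: lspci prints the PCI revision ID in hexadecimal, so "rev 15" is 0x15, or 21 in decimal. That does not settle how it maps to the chip versions named in the fix, but it does at least say what the number is. Below is a small Python sketch that pulls the fields out of a default-format lspci line and looks up which kernel driver is bound to the device through sysfs. The function names are mine, and the sysfs lookup assumes the device sits in PCI domain 0000.

```python
import os
import re

# Matches the default one-line lspci format quoted above, for example:
#   3b:00.0 Ethernet controller: Realtek ... Controller (rev 15)
LSPCI_LINE = re.compile(
    r"^(?P<slot>[0-9a-f:.]+) (?P<cls>[^:]+): "
    r"(?P<name>.*?)(?: \(rev (?P<rev>[0-9a-f]{2})\))?$"
)


def parse_lspci_line(line):
    """Split an lspci line into slot, class, name, and numeric revision."""
    m = LSPCI_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized lspci line: {line!r}")
    rev = m.group("rev")
    return {
        "slot": m.group("slot"),
        "class": m.group("cls"),
        "name": m.group("name"),
        # lspci shows the revision in hex, so "15" means 0x15 == 21.
        "revision": int(rev, 16) if rev else None,
    }


def bound_driver(slot, sysfs_root="/sys/bus/pci/devices"):
    """Return the kernel driver bound to a PCI slot, or None if unknown."""
    # sysfs uses domain-qualified addresses; domain 0000 is assumed here.
    addr = slot if slot.count(":") == 2 else "0000:" + slot
    link = os.path.join(sysfs_root, addr, "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.path.realpath(link))
```

Running "lspci -nnk -s 3b:00.0" at the command line should show similar information, including a "Kernel driver in use" line.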

So if the problem is exposed by the combination of a new Linux kernel and a new Realtek driver, and only comes up at unpredictable times after the machine has been running a while, what workarounds can I use in the meantime? I've seen the following options discussed:

  1. Use Realtek driver r8168.
  2. Revert to the previous Linux kernel 5.15.
  3. Disable PCIe ASPM on everything with pcie_aspm=off kernel parameter.
  4. Reboot the machine regularly.

I thought I'd try the easy thing first with regular reboots. I ran "crontab -e" and added a line to the end: "0 4 * * * reboot". This should reboot the system every day at four in the morning. It ran for 36 hours the first time around, so I thought a reboot every 24 hours would suffice. This turned out to be overly optimistic. I woke up the next morning and this computer was off the network again. After another reboot I could log in to Home Assistant, and saw that it had stopped receiving data from my ESPHome nodes just after 3AM. If the 4AM reboot happened, it didn't restore the network. And it doesn't matter anyway because the Realtek crapped out before then.
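
For reference, the five crontab fields are minute, hour, day of month, month, and day of week, so "0 4 * * *" means minute 0 of hour 4 on every day. The Python sketch below computes when such a daily entry would next fire. It only handles the simple "minute hour * * *" case, and the function name and the dates in the comment are made up for illustration.

```python
from datetime import datetime, timedelta


def next_daily_run(now, hour, minute):
    """Next firing time, strictly after `now`, of a 'minute hour * * *' entry."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so the job waits until tomorrow.
        candidate += timedelta(days=1)
    return candidate


# Hypothetical example: a failure at 03:05 still has to wait for 04:00.
# next_daily_run(datetime(2023, 11, 1, 3, 5), hour=4, minute=0)
```

A failure just after 3AM lands almost an hour before the scheduled reboot. Also, on Debian-based systems cron runs jobs with a minimal PATH, so writing the full path (for example /sbin/reboot) is a common precaution. Whether that mattered here is unconfirmed.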

Oh well! It was worth a try. I will now try disabling ASPM, which is also an opportunity to learn its impact on electric power consumption.
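
After changing the kernel parameter, it would be good to confirm that it took effect. Here is a minimal Python sketch that checks a kernel command line for pcie_aspm=off and reads the active ASPM policy. It assumes the policy file at /sys/module/pcie_aspm/parameters/policy marks the active entry with square brackets, and the function names are my own.

```python
def aspm_disabled_on_cmdline(cmdline):
    """True if the kernel command line contains pcie_aspm=off."""
    return "pcie_aspm=off" in cmdline.split()


def active_aspm_policy(policy_text):
    """Return the bracketed entry from the ASPM policy text, or None."""
    # Assumed format: "[default] performance powersave powersupersave"
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_text(path):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text("/proc/cmdline") or ""
    policy = read_text("/sys/module/pcie_aspm/parameters/policy") or ""
    print("pcie_aspm=off on command line:", aspm_disabled_on_cmdline(cmdline))
    print("active ASPM policy:", active_aspm_policy(policy))
```

On a system booted with GRUB, the parameter is typically added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by running update-grub and rebooting. Proxmox VE installations that boot through systemd-boot keep the command line elsewhere, so which of the two applies to a given machine needs to be checked first.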


Featured image created by Microsoft Bing Image Creator powered by DALL-E 3 with prompt "Cartoon drawing of a black laptop computer showing a crying face on screen and holding a network cable"