TL;DR: The adventure of reviving a faulty CPU as a hypervisor
A few months ago, I found a bargain on a seemingly faulty Ryzen 9 5900X. Strangely enough, it was said to drop internet connectivity while gaming. Long story short, 2 of its 12 physical cores are defective. I ended up isolating them with the isolcpus= kernel parameter in the boot menu.
Methodology:
Video recording of the process. It includes a temporary solution for GRUB2 (for persistence, see /etc/default/grub, or /etc/kernel/cmdline for Proxmox with systemd-boot), an example of an OS installation failing without the modification, and a test example.
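To make the persistent variant concrete, here is a minimal sketch that patches the GRUB_CMDLINE_LINUX_DEFAULT line of a /etc/default/grub file held in a string. The helper name add_isolcpus and the sample file content are made up for illustration, and the CPU list "2,3" is only an example, not necessarily the values used in the video.

```python
import re

def add_isolcpus(grub_default: str, cpu_list: str) -> str:
    """Return the text of /etc/default/grub with isolcpus=<cpu_list>
    set inside GRUB_CMDLINE_LINUX_DEFAULT (hypothetical helper)."""
    def patch(match: re.Match) -> str:
        # Drop any earlier isolcpus= entry, then append the new one.
        params = [p for p in match.group(2).split()
                  if not p.startswith("isolcpus=")]
        params.append(f"isolcpus={cpu_list}")
        joined = " ".join(params)
        return f'{match.group(1)}"{joined}"'

    return re.sub(r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"',
                  patch, grub_default, flags=re.M)

# Example input, assumed to resemble a stock /etc/default/grub.
sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet"\n'
print(add_isolcpus(sample, "2,3"))
```

After changing the real file, update-grub has to be run for the new command line to take effect. On a Proxmox install that boots via systemd-boot, the same isolcpus= parameter would instead be appended to the single line in /etc/kernel/cmdline, followed by proxmox-boot-tool refresh.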
- Isolate every thread other than the ones being tested (e.g. leave only 0 and 1 available to test the first core).
- If the installation completes successfully, there is a good enough chance that the core is alright. Repeat as many times as you like.
- Isolate the cores that are proven to be defective by a failed system boot or a failed OS installation.
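The first step above boils down to building an isolcpus= value that lists every thread except the pair under test. A small sketch of that arithmetic follows; the function names are hypothetical, and it assumes 24 threads numbered 0 to 23 with the two threads of a core adjacent, as in the "0, 1 for first core" example.

```python
def compress(ids):
    """Turn [2, 3, 4, 7] into '2-4,7', the CPU list syntax isolcpus= accepts."""
    parts = []
    start = prev = None
    for i in sorted(set(ids)):
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            parts.append(f"{start}-{prev}" if prev > start else str(start))
            start = prev = i
    if start is not None:
        parts.append(f"{start}-{prev}" if prev > start else str(start))
    return ",".join(parts)

def isolcpus_for_test(total_threads, threads_under_test):
    """Isolate every thread except the ones being tested."""
    keep = set(threads_under_test)
    isolated = [t for t in range(total_threads) if t not in keep]
    return "isolcpus=" + compress(isolated)

print(isolcpus_for_test(24, [0, 1]))  # isolcpus=2-23
print(isolcpus_for_test(24, [2, 3]))  # isolcpus=0-1,4-23
```

Stepping the tested pair through (0, 1), (2, 3), and so on up to (22, 23) covers all 12 cores, one boot or installation attempt per core.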
Symptoms:
- The first symptom was lost connectivity about 5-15 seconds after starting a download in the Steam desktop app on Windows. The connection would eventually break even if the download initially started on a functional core.
- Then page loading errors were observed, especially for pages behind the Cloudflare captcha service. A simple reload would resolve them, but only after the cache had expired and the actual reload got scheduled on a different core.
- Errors while extracting tar.gz archives, and therefore also during package installs.
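The archive errors suggest a simple way to provoke the fault on purpose: compress and decompress data repeatedly and compare checksums. The following is only a sketch of such a check, not the test shown in the video; roundtrip_errors is a hypothetical name, and the optional pinning relies on os.sched_setaffinity, which is available on Linux only.

```python
import hashlib
import os
import zlib

def roundtrip_errors(rounds=20, size=1 << 20, cpus=None):
    """Count compress/decompress round trips that corrupt the data.

    cpus: optional set of logical CPU ids to pin this process to
    (Linux only). None leaves the scheduler's choice untouched.
    """
    if cpus is not None:
        os.sched_setaffinity(0, cpus)
    errors = 0
    for _ in range(rounds):
        data = os.urandom(size)
        digest = hashlib.sha256(data).digest()
        try:
            restored = zlib.decompress(zlib.compress(data))
        except zlib.error:
            # The stream itself was damaged badly enough to fail decoding.
            errors += 1
            continue
        if hashlib.sha256(restored).digest() != digest:
            errors += 1
    return errors

if __name__ == "__main__":
    print("errors:", roundtrip_errors(rounds=5))
```

On a healthy core the count should stay at 0. Calling it with cpus={2, 3}, for example, would restrict the run to those two threads, which is one way to point the check at a single suspect core.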
What has been tried:
- Tried to overvolt the CPU and RAM to no avail. In retrospect, I guess it must have even made things worse (see last item).
- Tried disabling:
  - The second chiplet (CCX1, half the cores) to no avail.
  - 1, then 2 cores on each CCX to no avail.
  - 3 cores on each CCX. That worked at the cost of losing 12 threads. Not a win, but still the only working solution for Windows.
- Tried following this wonderful guide (by Tech YES City) on thermal throttling the CPU at a lower temperature, then undervolting the CPU, to no avail. However, this did decrease the frequency of errors.
Explanation as I understand it:
The Ryzen 9 5900X has 12c/24t spread over two chiplets. However, having to disable 3 cores on each chiplet while only the 2nd and 8th physical cores are defective can only mean that the cores are not disabled linearly along the numerical axis.
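To see where the two defective cores sit, here is a small sketch that maps a physical core number to its chiplet and thread IDs. It assumes plain linear numbering: 6 cores per chiplet and two adjacent thread IDs per core, as in the "0, 1 for first core" example. That is an assumption, not something the firmware guarantees; on Linux the actual pairing can be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

```python
CORES_PER_CHIPLET = 6   # 5900X: 12 cores over 2 chiplets
THREADS_PER_CORE = 2    # SMT enabled

def locate(core_number):
    """Map a 1-based physical core number to
    (chiplet index, 1-based position on that chiplet, thread ids),
    assuming linear numbering (hypothetical helper)."""
    index = core_number - 1
    chiplet, position = divmod(index, CORES_PER_CHIPLET)
    first = index * THREADS_PER_CORE
    threads = list(range(first, first + THREADS_PER_CORE))
    return chiplet, position + 1, threads

bad_threads = []
for core in (2, 8):  # the two defective cores
    chiplet, position, threads = locate(core)
    print(f"core {core}: chiplet {chiplet}, position {position}, threads {threads}")
    bad_threads += threads

print("isolcpus=" + ",".join(map(str, bad_threads)))
```

Under these assumptions both defective cores are the second core of their respective chiplet, and the resulting parameter is isolcpus=2,3,14,15. If cores were disabled in that same linear order, turning off 2 per chiplet should already have removed both, which matches the observation that the firmware order must be different.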
Here is an ASCII diagram:
[ASCII diagram not preserved]
Lastly, here is a related answer on SuperUser where I commented with my findings back then.