TL;DR: Adventure of reviving a CPU as a hypervisor

A few months ago, I found a bargain on a seemingly faulty Ryzen 9 5900X. Strangely enough, it was said to drop internet connectivity while gaming. Long story short, 2 of its 12 physical cores are defective. I ended up isolating them with the isolcpus= kernel parameter in the boot menu.

Methodology:

Video recording of the process. It includes a temporary fix through the GRUB2 boot menu (for persistence, see /etc/default/grub, or /etc/kernel/cmdline on Proxmox installs that boot via systemd-boot), an example of an OS installation failing without the modification, and a test example.
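To check that the parameter actually took effect after a reboot, something like the sketch below could be used. It is only an illustration: `parse_isolcpus` is a hypothetical helper name, and it assumes the simple `isolcpus=2-3,14-15` list form (plain IDs and ranges), not the stride syntax the kernel also accepts.

```python
def parse_isolcpus(cmdline):
    """Return the set of CPU IDs named by isolcpus= in a kernel command line."""
    cpus = set()
    for param in cmdline.split():
        if not param.startswith("isolcpus="):
            continue
        for part in param.split("=", 1)[1].split(","):
            low, _, high = part.partition("-")
            if not low.isdigit():
                continue  # skip flags such as "domain" or "managed_irq"
            # A single ID like "5" has no upper bound, so reuse the lower one.
            cpus.update(range(int(low), int(high or low) + 1))
    return cpus


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    with open("/proc/cmdline") as f:
        print(sorted(parse_isolcpus(f.read())))
```

An empty list means the kernel was booted without isolcpus=, i.e. the change was not persisted.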

  1. Isolate every thread except those of the core under test (e.g. keep only 0 and 1 for the first core).
  2. If the installation completes successfully, the core is most likely fine. Repeat the test a few times at your leisure.
  3. Isolate the cores proven defective by a failed system boot or a failed OS installation.
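The isolcpus= values for these steps can be generated instead of typed by hand. Below is a minimal sketch; the function names are made up for illustration, and it assumes the numbering used in this post (core 1 owns threads 0 and 1, core 2 owns 2 and 3, and so on). On other systems the sibling threads may be numbered differently, so it is worth checking /sys/devices/system/cpu/cpu0/topology/thread_siblings_list first.

```python
def threads_of_core(core, threads_per_core=2):
    """Logical CPU IDs of a 1-based physical core (assumes adjacent siblings)."""
    first = (core - 1) * threads_per_core
    return list(range(first, first + threads_per_core))


def to_cpu_list(cpus):
    """Compress CPU IDs into kernel list syntax, e.g. [2, 3, 14, 15] -> '2-3,14-15'."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current run
        else:
            ranges.append([cpu, cpu])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def isolcpus_for_test(core, total_cores=12, threads_per_core=2):
    """Step 1: isolate every thread except those of the core under test."""
    keep = set(threads_of_core(core, threads_per_core))
    total_threads = total_cores * threads_per_core
    return "isolcpus=" + to_cpu_list(c for c in range(total_threads) if c not in keep)


def isolcpus_for_faulty(faulty_cores, threads_per_core=2):
    """Step 3: isolate only the threads of the cores proven defective."""
    cpus = [t for core in faulty_cores for t in threads_of_core(core, threads_per_core)]
    return "isolcpus=" + to_cpu_list(cpus)


if __name__ == "__main__":
    print(isolcpus_for_test(2))         # parameter for testing the 2nd core
    print(isolcpus_for_faulty([2, 8]))  # final parameter for my faulty cores
```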

Symptoms:

  • The first symptom was lost connectivity about 5-15 seconds after starting a download in the Steam desktop app on Windows. The connection would eventually break even if the download initially started on a functional core.

  • Then came page loading errors, especially on sites behind the Cloudflare captcha service. A simple reload would resolve it, but only once the cache had expired and the actual reload got scheduled on a different core.

Secure Connection Failed (Firefox)

  • Errors while extracting tar.gz archives, and therefore also during package installs.

What has been tried:

  • Tried to overvolt the CPU and RAM, to no avail. In retrospect, I guess it must have made things even worse (see last item).
  • Tried disabling:
    • Second chiplet (CCX1, half the cores) to no avail.
    • 1, then 2 cores on each CCX to no avail.
    • 3 cores on each CCX. That worked, at the cost of losing 12 threads. Not a win, but still the only working solution for Windows.
  • Tried following this wonderful guide (by Tech YES City) on thermal throttling the CPU at a lower temperature, then undervolting the CPU, to no avail. However, this did decrease the frequency of errors.

Explanation as I understand it

The Ryzen 5900X has 12c/24t spread over two chiplets. However, having to disable 3 cores on each chiplet when only the 2nd and 8th physical cores are defective can only mean that cores are not disabled in plain numerical order.
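The reasoning can be checked with a few lines of code. This is only a sketch of my guess, not documented firmware behaviour: it assumes cores on each chiplet are switched off in the order 06 -> 04 -> 02 -> 05 -> 03 -> 01 (by position within the chiplet), and that the BIOS disables the same number of cores on both chiplets.

```python
# Guessed per-chiplet disable order, by 1-based position within the chiplet.
DISABLE_ORDER = [6, 4, 2, 5, 3, 1]
CORES_PER_CCX = 6
CCX_COUNT = 2
THREADS_PER_CORE = 2


def local_core(core):
    """Map a package-wide 1-based core number to its position within its chiplet."""
    return (core - 1) % CORES_PER_CCX + 1


def cores_to_disable_per_ccx(faulty_cores):
    """Smallest per-chiplet disable count that switches off every faulty core."""
    return max(DISABLE_ORDER.index(local_core(c)) + 1 for c in faulty_cores)


def threads_lost(faulty_cores):
    """Threads given up when that many cores are disabled on every chiplet."""
    return cores_to_disable_per_ccx(faulty_cores) * CCX_COUNT * THREADS_PER_CORE


if __name__ == "__main__":
    faulty = [2, 8]
    print(cores_to_disable_per_ccx(faulty), "cores per chiplet,",
          threads_lost(faulty), "threads lost")
```

With cores 2 and 8 as input, both map to position 2 on their chiplet, which is third in the assumed order. That matches what I observed: only disabling 3 cores per chiplet helped, costing 12 threads.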

Here is an ASCII diagram:

CPU Die -> Chiplet -> Core -> Threads (conceptual)

I guess the order of disabling cores on a chiplet runs
from bottom-right to top-right (i.e. 06 -> 04 -> 02) and
then from bottom-left to top-left (i.e. 05 -> 03 -> 01).

In my case, cores 2 and 8 are faulty, and both are the
third to be disabled on their respective chiplet.

+--------------------[ Ryzen 5900X ]---------------------+
|                                                        |
|  +-------[ CCX 0 ]-------+  +-------[ CCX 1 ]-------+  |
|  |                       |  |                       |  |
|  |  +--[01]--+--[02]--+  |  |  +--[07]--+--[08]--+  |  |
|  |  |  00,01 | 02,03  |  |  |  |  12,13 | 14,15  |  |  |
|  |  +--------+--------+  |  |  +--------+--------+  |  |
|  |                       |  |                       |  |
|  |  +--[03]--+--[04]--+  |  |  +--[09]--+--[10]--+  |  |
|  |  |  04,05 | 06,07  |  |  |  |  16,17 | 18,19  |  |  |
|  |  +--------+--------+  |  |  +--------+--------+  |  |
|  |                       |  |                       |  |
|  |  +--[05]--+--[06]--+  |  |  +--[11]--+--[12]--+  |  |
|  |  |  08,09 | 10,11  |  |  |  |  20,21 | 22,23  |  |  |
|  |  +--------+--------+  |  |  +--------+--------+  |  |
|  |                       |  |                       |  |
|  |  +--[XX]--+--[XX]--+  |  |  +--[XX]--+--[XX]--+  |  |
|  |  |  XX,XX | XX,XX  |  |  |  |  XX,XX | XX,XX  |  |  |
|  |  +--------+--------+  |  |  +--------+--------+  |  |
|  +-----------------------+  +-----------------------+  |
+--------------------------------------------------------+

P.S.:
Cores labeled as XX are the ones disabled at firmware. These
would have been enabled on a 5950X 16c/32t and numbering would
change accordingly.

A hint of this shows up when the `stress-ng` program is run
with the `--cpu -1` flag, which results in 32 processes.
`--cpu 0`, on the other hand, results in 24 processes stress
testing the CPU. The difference is that the former is the total
on the package, while the latter is what the system currently has.
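The same two numbers can be read without stress-ng. The sketch below assumes, as far as I understand it, that stress-ng takes the "configured" processor count for a negative value and the "online" count for zero; both counts are exposed through the standard sysconf interface on Linux. `cpu_counts` is just an illustrative name.

```python
import os


def cpu_counts():
    """Return (configured, online) logical processor counts as the OS reports them."""
    configured = os.sysconf("SC_NPROCESSORS_CONF")  # processors the system is set up for
    online = os.sysconf("SC_NPROCESSORS_ONLN")      # processors currently online
    return configured, online


if __name__ == "__main__":
    configured, online = cpu_counts()
    print(f"configured: {configured}, online: {online}")
```

A gap between the two values, like the 32 vs 24 I saw, would be the hint that more processors are known to the platform than are actually enabled.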

Lastly, here is a related answer on SuperUser where I commented my findings back then.