Commscope/Ruckus/Brocade ICX7650-48ZP will do it. Be prepared for sticker shock.
Make the 10.0.0.x addresses loopback addresses, and run point-to-point links from each box to every other box. No idea how to do this in ESXi, but it’s straightforward in Linux/BSD.
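To make the addressing concrete, here is a rough Python sketch that prints a plan for a full mesh: one /32 loopback per box out of 10.0.0.0/24, plus one /31 per pair of boxes for the direct links. The link range (10.255.0.0/24), the host names, and the `mesh_plan` helper are all made up for illustration, and /31 point-to-point links are an assumption about what your OS supports. Swap in /30s if yours doesn't.

```python
import ipaddress
from itertools import combinations


def mesh_plan(hosts, loopback_net="10.0.0.0/24", link_net="10.255.0.0/24"):
    """Return a loopback /32 per host and a /31 address per host pair.

    loopback_net matches the 10.0.0.x range from the post; link_net is an
    arbitrary range chosen for this example.
    """
    loopbacks = iter(ipaddress.ip_network(loopback_net).hosts())
    links = iter(ipaddress.ip_network(link_net).subnets(new_prefix=31))

    plan = {h: {"loopback": f"{next(loopbacks)}/32", "links": {}} for h in hosts}

    # One dedicated subnet per unordered pair of hosts (full mesh).
    for a, b in combinations(hosts, 2):
        subnet = next(links)
        plan[a]["links"][b] = f"{subnet[0]}/31"  # a's end of the a<->b link
        plan[b]["links"][a] = f"{subnet[1]}/31"  # b's end of the a<->b link
    return plan


if __name__ == "__main__":
    for host, cfg in mesh_plan(["box1", "box2", "box3"]).items():
        print(host, "loopback", cfg["loopback"])
        for peer, addr in cfg["links"].items():
            print(f"  link to {peer}: {addr}")
```

Each box still needs a route to every other box's loopback via the matching direct link, either static routes or a routing daemon. The script only hands out addresses.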
CPU1 handles almost everything about being a normal computer: booting, chipset, most of the I/O, etc. CPU2 is along for the ride and handles its own I/O lanes (PCIe) and whatever work the kernel wants to send to it. The load is not symmetrical, so if you have turbo enabled, CPU1 will consistently boost more than CPU2 because it is handling all of those extra tasks, which means a warmer CPU1. This is why “tandem” dual-CPU setups have CPU1 upstream in airflow from CPU2.
The 2667v2 and the 2697/2696v2 are really the top picks for this generation.
You could desolder it and solder a new one on, or possibly even solder one on top of the existing LED. Same as replacing any other on-board component.
You’ll definitely need something with fast PCIe lanes for NVMe: either PCIe 4.0 x4 paired with a very fast SSD, or a lot of PCIe 3.0 lanes.
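For a feel of the numbers, here is a back-of-the-envelope Python sketch of theoretical link bandwidth. It uses the published signalling rates (8 GT/s per lane for PCIe 3.0, 16 GT/s for 4.0) and 128b/130b encoding. Real drives land below this because of protocol overhead, and the function name is just something I made up.

```python
# Raw signalling rate per lane in GT/s, by PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0}

# PCIe 3.0 and 4.0 both use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def pcie_gbytes_per_s(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY / 8 * lanes


if __name__ == "__main__":
    for gen, lanes in [(3, 4), (4, 4), (3, 8), (3, 16)]:
        print(f"PCIe {gen}.0 x{lanes}: {pcie_gbytes_per_s(gen, lanes):.2f} GB/s")
```

The takeaway is that one 4.0 x4 link has the same ceiling as 3.0 x8, so on a 3.0 platform you need to spread the load over more lanes (several drives) to match a single fast 4.0 SSD.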
Is your RAM on the QVL? Ryzen’s notorious pickiness about RAM carries over to TR and EPYC, too. Memory training is one of the first things that happens, before POST and the BIOS splash screen. If it can’t get past that, something about memory needs adjustment. Have you tried downclocking it?
How close do you want to get? Budgeting about 200W per socket for “normal”-ish CPUs and 400-450W per socket for latest EPYC should get you in the right range.
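If you want it as arithmetic, here is a tiny Python sketch of that rule of thumb. The per-socket figures are the ones above; the `extra_watts` parameter (for GPUs or anything else unusual) and the function name are my own additions, so treat it as a ballpark and not a sizing tool.

```python
def power_budget_watts(sockets, watts_per_socket, extra_watts=0):
    """Rough whole-system power budget from a per-socket rule of thumb.

    extra_watts is a hypothetical knob for add-ons such as GPUs; it is not
    part of the per-socket figure.
    """
    return sockets * watts_per_socket + extra_watts


if __name__ == "__main__":
    # Dual-socket box with "normal"-ish CPUs.
    print("2S normal:", power_budget_watts(2, 200), "W")
    # Dual-socket latest EPYC, low and high end of the range.
    print("2S EPYC:", power_budget_watts(2, 400), "to", power_budget_watts(2, 450), "W")
```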
Number one cause of random hard crashes/hangs is RAM. Re-seat it, replace it, down-clock it, run a single stick, do everything you can to either rule it out as a problem, or to isolate the problem to a particular module or channel.