[{"content":"","date":"February 10 2025","externalUrl":null,"permalink":"/tags/amd/","section":"Tags","summary":"","title":"AMD","type":"tags"},{"content":"","date":"February 10 2025","externalUrl":null,"permalink":"/tags/avx-512/","section":"Tags","summary":"","title":"AVX-512","type":"tags"},{"content":"","date":"February 10 2025","externalUrl":null,"permalink":"/categories/benchmarking/","section":"Categories","summary":"","title":"Benchmarking","type":"categories"},{"content":" Some Background # The Zen 5 launch was largely considered, by Gamers™, to be a disaster. There were accusations of AMD intentionally creating entirely false graphs, accusations of green-type greed for failing to cut \u0026gt;$100 off of the MSRP of brand-new SKUs to compete on value with older SKUs, and all other sorts of nonsense because Gamers™ didn\u0026rsquo;t get another ++30% generational performance uplift like between Zen 3 and Zen 4. AMD\u0026rsquo;s marketing department should actually be hung out to dry for their multiple, high-profile, embarrassingly inaccurate and/or misleading graphs over the years, but that\u0026rsquo;s neither here nor there. Shortly afterwards, Arrow Lake launched to raucus apathy, Zen 5 prices dropped to market value, and the 9800x3D became the best gaming CPU in the world. AMD are now Certified Good Guys once again.\nThe most outlandish claim AMD made at the time was the performance improvement in HandBrake.\nVarious charts claiming various performance improvements of Zen 5 CPUs compared to Raptor Lake Intel CPUs. Quite impressive, especially when comparing against SKUs with double-or-more core counts thanks to E-cores. What contrived circumstances did they pull to get a result like that? Six cores beating 6P+8E? It seemed ridiculous. The referenced end note card had these completely useless statements.\nEndnote card detailing the benchmark configurations from the above slides. No information about the video resolution, bit depth, encoding settings, encoder, or\u0026hellip; Literally anything you might want to detail to prevent people from thinking you just made up a number. Just, \u0026ldquo;HandBrake.\u0026rdquo; Thank you AMD, very cool! Some people accused them of accidentally using hardware acceleration while performing these benchmarks. Unfortunately, in this instance, this only serves to prove the accuser\u0026rsquo;s own ignorance, as hardware accelerated encoding has significantly higher performance gains than 41-94%. Except\u0026hellip;\nBut, actually, AVX-512 # Zen 5 does have an advantage that was ignored by the \u0026ldquo;muh value\u0026rdquo; glazers. From a certain point of view, you could actually call it hardware acceleration, but it\u0026rsquo;s not in the way that the detractors were claiming at the time. Zen 5 brought full-fat, true 512-bit AVX-512 execution units and data paths. Four of them per core!1\nAVX-512 is the 512-bit extension of the Advanced Vector Extensions instruction set, which is in the SIMD - Single Instruction, Multiple Data - family. It is an instruction designed to faciliate the simultaneous performance of mathematical operations on sets of numbers. Up to 512 bits\u0026rsquo; worth. You combine smaller sets of numbers - the Vector in Advanced Vector Extensions - and the CPU can then perform an operation on each number in that set, simultaneously, in a single clock cycle. 
With the four full-width AVX-512 execution units in every Zen 5 core, every CPU can (theoretically) do 2048 bits' worth of calculations, per core, per clock cycle. This is spherical-cow-in-a-vacuum territory, especially with the painful limitation of dual-channel memory, but on the 9950X, which has 16 cores, that means you could do operations on up to 1024[2] 32-bit numbers simultaneously, on every single clock cycle.

For those who have not been keeping track of such things, AVX-512 has been a staple in Intel server SKUs since Skylake, but their consumer SKUs lacked it until Rocket Lake. Alder Lake introduced the asymmetrical P/E-core paradigm on desktop SKUs, and it launched with AVX-512 - but only on the P-cores, which caused problems with then-current schedulers. They did not know how to deal with heterogeneous instruction sets, and as a result, AVX-512 applications would crash when a thread moved to an E-core. Access to the instruction set was eventually removed via microcode/BIOS updates, and then fused off physically on newer production runs. Intel has, as of this writing, yet to reintroduce AVX-512 to the consumer market, though it continues to bring meaningful performance benefits to their server platforms.

AMD lacked AVX-512 support across all their products until Zen 4, but when they did get there, they implemented it across their entire hardware stack. For Zen 4, rather than full 512-bit hardware, they had "double-pumped" 256-bit hardware that took two clock cycles to complete an instruction rather than one. This has a number of beneficial implications in terms of power consumption, silicon area, and the ability to re-use existing 256-bit silicon for the AVX-512 execution units. It holds back the maximum possible performance, but when you're competing against Intel's utter lack of AVX-512 in consumer chips, that's an ♾️% advantage!

AVX instructions have a history of derision among consumers, particularly gamers and overclockers. AVX execution units need large areas of silicon, because they work on large amounts of data. Logically, per clock cycle, they will use more power, generate more heat, and do more work, because there are more transistors involved in executing the instructions. To make up for that, you have to run fewer clock cycles per second to avoid overheating and excessive power draw that may cause voltage droop. There's a balancing act between clock speed and the accelerated overall speed of computation that AVX enables. Intel has historically maintained a very poor balance between those factors, leading to unstable overclocks, complaints about power consumption and thermals, and sometimes objectively reduced overall performance. It doesn't help that consumers are highly uneducated about thermal management, either. 100°C is fine, but a gamer won't accept it.

As a result of Intel's inelegant handling of these requirements, AVX has garnered a bad rap among the general populace. The vast majority of applications are still compiled without AVX2, let alone AVX-512. Zen 4/5 desktop CPUs may have been bestsellers for a while now, but relative to the global population, very few people own CPUs with AVX-512 support. Some people are still out there using Core 2 Duos or other very, very low-spec Pentium/Atom chips that lack support for AVX2!
In addition, most software isn't written in a hyper-optimized fashion with inline assembly or AVX intrinsics, and compilers are not very good at auto-vectorizing code that was not written in a form intended to be vectorized. Consumers don't tend to process vast amounts of data that would benefit from the capabilities of AVX instructions anyway, right? So who really cares?!

Well, AMD does, and I'm glad that they do. There is one area where (relatively) average consumers need heavy compute, and that's video encoding. Hardware acceleration has gotten pretty good for casual use, especially for livestreaming. The energy efficiency and quality at low latency are impossible for a software encoder to beat. However, if software is given the chance to stretch its legs latency/processing-time-wise, its quality-per-bit, or compression efficiency, just can't be beat. Plus, hardware encoders, as fixed-function silicon, don't really get to keep up with new innovations in software design. AV1 may be fast if you use NVENC, but will it match the latest release of SVT-AV1-PSY at any given bitrate? Absolutely not. Besides, there are a variety of other operations in video processing that can benefit from (non-disastrously-downclocking) AVX-512 (provided the software was written correctly!) that are unrelated to the final encoding task.

Video encoders like x265 and SVT-AV1, which are what I will be using for my test here, contain large quantities of hand-written assembly optimized for various SIMD instruction sets, including AVX-512. Those routines exist regardless of the compiler flags used to build the software. Every build should be capable of using AVX-512 acceleration, and everyone should be able to reap the benefits without seeking out special builds of these pieces of software.

There is a litany of asterisks to go along with all that information, and I'm not an expert in assembly or CPU design. There's a much more detailed breakdown of Zen 5's AVX-512 over on numberworld.org. Go ahead and give it a read if you're interested! Phoronix also has a benchmark comparing the performance of a Zen 5 Turin server SKU with AVX-512 off, in double-pumped 256-bit mode, and in full 512-bit mode across a variety of applications.

Test Goals / Parameters #

While AMD, for some indeterminable reason, did not make this obvious in their presentation, the gains in HandBrake could be attributed to the AVX-512 improvements present in Zen 5, and the continued efforts from video encoder developers to provide optimized AVX-512 code. HandBrake ships x265 as its HEVC encoder, and SVT-AV1 as its AV1 encoder, so those are what I will be testing here.

I have two goals with this test:

- Quantify the performance difference between a Zen 5 CPU with AVX-512 off, and on
- Quantify the performance difference between a Zen 5 CPU with AVX-512 on, and a Raptor Lake CPU

The first goal will indicate exactly how much the presence of AVX-512 could theoretically improve performance on Intel, should they adopt it. The second goal will determine to what degree AMD may have engaged in selective benchmarking tomfoolery in the earlier slides.

Most reviewers don't have any knowledge of encoders beyond extremely surface-level use of HandBrake or export options in video editing software. They just toss a file into HandBrake, pick something - hopefully it's consistent between tested products! - and hit go.
Maybe they use the Phoronix Test Suite if they're a Linux shop, but that doesn't adequately cover the bases regarding A/B testing the impact of AVX-512 on performance on a single SKU. I waited six months and found zero reviews examining this specific topic to my satisfaction. Now, incidentally, I had reason to purchase a Zen 5 CPU, and decided to bench it for myself.

The tests are as simple as possible. There are far, far too many possible combinations of command-line options to test within any kind of reasonable amount of time, and I don't think it's useful to test that way. Most people just choose a built-in preset, maybe a -tune parameter, but that's the extent of their customization. I chose to perform a simple, like-for-like comparison of x265 and SVT-AV1. I used the three most common video resolutions - 720p, 1080p, and 4K - and swept through every stock preset in x265 and SVT-AV1, under three different hardware configurations: Raptor Lake, Zen 5 with AVX-512 off, and Zen 5 with AVX-512 on. I ran each configuration five times, and took the average of the combined wall-to-wall run times as my measurement. Then, I created graphs exhibiting the execution-time improvements compared to the presumed slower configuration. They are available below, but there is more relevant information to get to before that.

By default, x265 does not enable AVX-512, even on supported systems, and even if you build it with the relevant microarchitecture features enabled. You have to pass the parameter asm=avx512 to enable it. HandBrake does not pass this parameter by default, either; you have to do it manually in the "Advanced Options" section. SVT-AV1 does enable AVX-512 by default, and for this test, I had to limit the feature set with the asm=9 parameter, which restricts SVT-AV1 to AVX2 and older instruction sets.

In addition to the run time, I also catalogued some other details, like the average MHz and power draw as reported by turbostat, and the average die temperature as reported by sensors, but I didn't find the results very interesting (or accurate, in some cases), so they've been omitted from the analysis below. If you'd like to look at my test scripts and the raw data, you can do so at the repository below.

rawhide_k/zen-5-avx-512-encoding-benchmark

Systems Setup #

For this test, I had two systems. One was based on the Ryzen 9 9950X, and the other on the Intel i7-14700F. Both systems have 2x32GB memory kits running at 6000MHz, though they aren't identical. Don't you worry, the slight variations in the timings are utterly irrelevant. Both systems ran identical software configurations - as identical as possible, at least, considering the architecture differences. They both ran Arch Linux with CachyOS optimized repositories - x86-64-v3 for the i7-14700F, and znver4 for the 9950X. The kernel version was 6.12.10-2-cachyos-lts. Other relevant package versions were ffmpeg 2:7.1-6.1, x265 4.0-1.1, and svt-av1 2.3.0-2.

[Image: AMD Ryzen 9 9950X based test bench. The motherboard is an ASRock B650E PG Riptide Wi-Fi.]

The 9950X had the socket power limit set to the stock 200W, and the current limit to the stock 160A. This is just about exactly what an NH-D15 can dissipate when using a graphite thermal pad, as I did with this setup. Not great, not terrible, but it's what I had on hand. It didn't thermal throttle at stock settings, so that's good enough for me, for this test.
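As a concrete aside on the methodology above, here is a minimal Python sketch of how one averaged data point is shaped, assuming a Y4M test clip and the asm options discussed earlier. The real scripts live in the linked repository; the file names here are placeholders.

import subprocess
import time

SRC = "test_clip_2160p.y4m"  # placeholder source file

def avg_wall_time(cmd, runs=5):
    # Average wall-to-wall run time over several invocations.
    times = []
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(cmd, check=True, capture_output=True)
        times.append(time.monotonic() - start)
    return sum(times) / len(times)

# x265 leaves AVX-512 off unless you opt in via its asm parameter.
x265_avx512 = ["x265", "--input", SRC, "--preset", "slow",
               "--asm", "avx512", "-o", "/dev/null"]

# SVT-AV1 enables AVX-512 by default; asm level 9 caps it at AVX2.
svt_avx2 = ["SvtAv1EncApp", "-i", SRC, "--preset", "4",
            "--asm", "9", "-b", "/dev/null"]

print(avg_wall_time(x265_avx512), avg_wall_time(svt_avx2))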
The only options I changed regarding performance are memory-related: enabling XMP and dropping vSOC to 1.1v. No PBO, no undervolting, stock fmax. 2000MHz fCLK and 3000MHz uCLK, as is typical of 6000MHz memory.

[Image: Intel i7-14700F based test bench.]

The i7-14700F is a non-K SKU, so you can't overclock it. On some motherboards you can undervolt non-K SKUs, but not on the one that I have. It's some kind of stripped-down model ASUS uses for prebuilts. You can change the vdroop, which I adjusted to whichever setting gave me optimal performance, but the differences were extremely minor, and that's really all you can do with it. I have the power/current limits technically uncapped, but the board has a hard current limit somewhere around 220-280W power draw, load type dependent. With a 240mm liquid cooler, temperature is not a concern. The only hardware-based limit on the performance is the motherboard's current limit.

Now you might say, "But Mr. Blogger! None of the slides earlier in the deck had a 9950X or an i7-14700F! This comparison is not fair, and you are a hack fraudster!" To which I would say... Yes, absolutely. It's not a fair comparison, and I'm not going to prove any of the aforementioned slides right or wrong. However, if you look at this Phoronix benchmark, you can observe that there's not a huge difference between the 9900X and 9950X in SVT-AV1 and x265, nor between a selection of Raptor Lake chips. Scaling suffers greatly beyond twelve cores, even at 4K, unless you specifically invoke parallelism-enhancing options that cause the compression efficiency to suffer. Chunking up a video and running multiple encode jobs to get the absolute maximum possible performance out of a given CPU with a given encoder at the best possible compression efficiency is a whole 'nother topic. By the way, the Phoronix Test Suite does not enable AVX-512 in x265. Their numbers would be much further apart between AMD and Intel if they did. Please feel free to email me if you have a desire to send me free hardware to conduct additional testing!

Results! #

For those with a short attention span, here's the gist:

- The 9950X demolishes the i7-14700F, as you would hope, with double the "P"-core count and AVX-512 present.
- AVX-512 gains are not significant with presets faster than "slow" in x265, or faster than "4" in SVT-AV1.
- SVT-AV1 benefits less from AVX-512 than I expected overall, given that it's newer and sees more consistent development.
- 4K brings out the biggest differences, both in AVX-512 and between the processors in general, as expected.
- Faster presets and lower resolutions are more dependent on single-core performance, with even worse scaling under the default conditions that SVT-AV1 and x265 operate under.

Global Geomean #

Geomean kinda sucks with the spread of values here. The superduperfast presets really bring the averages down, especially with x265. This graph is almost entirely useless. Read on for information on your specific preset and encoder of interest.

9950X vs 9950X, x265 #

[Image: Various charts detailing the uplift from enabling AVX-512 on the 9950X with the x265 encoder, from 720p to 4K.]

We see significant performance gains here thanks to AVX-512, on the slow-placebo presets, across every tested resolution.
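A quick note on how to read these uplift figures, including the geomean above: every number is a ratio of averaged wall-clock times, and the global chart aggregates those ratios with a geometric mean. A toy example in Python, with invented numbers rather than my measured data:

from math import prod

# Invented wall-clock times (seconds), purely illustrative.
off = {"slow": 412.0, "slower": 890.0, "veryslow": 1610.0}
on = {"slow": 365.0, "slower": 700.0, "veryslow": 1245.0}

# Per-preset speedup: time with AVX-512 off divided by time with it on.
speedups = {p: off[p] / on[p] for p in off}

# The geometric mean is the right average for ratios, but fast presets
# with tiny gains drag the headline number down.
geomean = prod(speedups.values()) ** (1 / len(speedups))
print({p: round(s, 2) for p, s in speedups.items()}, round(geomean, 2))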
The x265 documentation has not commented on AVX-512 since the version 2.8 release in May 2018... where it said, "For 4K main10 high-quality encoding, we are seeing good gains; for other resolutions and presets, we don't recommend using this setting for now."[3] However, it seems that there are slight gains universally, increasing at 4K, and increasing greatly at every resolution as long as you use slow-placebo presets. I'd like to see the default behavior changed to enable AVX-512 by default, with a toggle to turn it off, in case you're running mixed workloads on older Intel servers with less well-behaved AVX downclocking.

9950X vs 9950X, SVT-AV1 #

[Image: Various charts detailing the uplift from enabling AVX-512 on the 9950X with the SVT-AV1 encoder, from 720p to 4K.]

SVT-AV1 has AVX-512 enabled by default, and the documentation makes no special note of it. These gains aren't all that great. I would have expected SVT-AV1 to have greater uplifts than x265, given that it's newer and under more active development - but that could also be exactly why the gains are smaller: functions are still being actively worked on, and have not been finalized in a way that makes anyone want to commit to writing a fully optimized assembly version of them. It could also be that SVT-AV1 is already approaching memory starvation without AVX-512, and you need more bandwidth to get additional gains. Either way, I'm not particularly fond of AV1 in general, and I'm not interested in going down any rabbit holes related to this result, unless someone feels like donating a Sapphire Rapids or Genoa (or newer) server. Or a Zen 4 Threadripper system.

i7-14700F vs 9950X, x265 #

[Image: Various charts detailing the uplift between the i7-14700F and the 9950X with the x265 encoder, from 720p to 4K.]

These are nice results. The AVX-512 gains from x265 really let the 9950X mog the poor i7-14700F (and by extension, the i9-14900K, as the extra E-cores make almost zero difference). Keep in mind the i7-14700F is using ~260W throughout these tests, while the 9950X is capped to 200W. Pure performance aside - which is obviously significant - the efficiency improvement is also a huge win. Up to twice as fast, while using less power. What's not to like?

i7-14700F vs 9950X, SVT-AV1 #

[Image: Various charts detailing the uplift between the i7-14700F and the 9950X with the SVT-AV1 encoder, from 720p to 4K.]

Given the previous lack of significant uplift from AVX-512 in SVT-AV1, it makes sense that these figures are much less impressive than x265's. It's still a nice uplift, to be sure, but not as outstanding as 2x!

Overall Conclusion? #

Zen 5 carries a clear advantage over Raptor Lake, core for core. Gains attributed to the presence of AVX-512 can be great, or insignificant, depending on the encoder, the resolution, and the preset. For those interested in finding out more about the specific settings each preset contains that might be accelerated by AVX-512, you can find what features are enabled by specific presets in the x265 documentation and on the SVT-AV1 GitLab.

Did AMD lie? Maybe. I did two additional synthetic tests, not graphed here. I can very, very closely emulate the 9700X test by just limiting the 9950X to a single chiplet, and using the i7-14700F as-is. Under the most favorable circumstances, the speedup was only 1.2x.
Then, I emulated the i5-14600K/9600X graph by limiting the i7-14700F to 6P+8E, as well as limiting the 9950X to 6 cores on one chiplet. In that scenario, the speedup was 1.24x. A far cry from 41-94%. However, there are a number of differences between my setup and AMD's, apart from the inexact hardware:

- AMD tested on Windows; I tested on Linux. Windows' scheduler is famously terrible with asymmetrical architectures. Could that have nerfed Intel enough to make up the difference?
- I only tested SVT-AV1 and x265. HandBrake also offers x264 for AVC, and libvpx for VP9. They could have tested with those, and gotten more disparate results.
- I'm using more recent versions of the encoders than HandBrake would have shipped when those graphs were made. There could be other improvements in the interim that have closed the gap.

I was not intending to do a strict DEBOONKING or affirmation of AMD's graphs in any case. They're just here as a comparison, and they prompted my interest in examining AVX-512's presence in video encoders. I do not care enough to investigate the differences in encoders that I'm not interested in, so this story ends here, for now, at least.

[1] Zen 5 mobile and Zen 5c continue to use double-pumped/otherwise hybrid AVX-512 implementations. These claims are only accurate for the Granite Ridge chiplets used in desktop SKUs and some server SKUs. Presumably upcoming HX-type mobile SKUs as well, since they're just unpackaged desktop SKUs.
[2] 512 / 32 * 4 * 16 = 1024
[3] https://x265.readthedocs.io/en/master/releasenotes.html#version-2-8
Works","type":"page"},{"content":"","date":"February 10 2025","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"February 10 2025","externalUrl":null,"permalink":"/tags/svt-av1/","section":"Tags","summary":"","title":"Svt-Av1","type":"tags"},{"content":"","date":"February 10 2025","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"","date":"February 10 2025","externalUrl":null,"permalink":"/categories/video-encoding/","section":"Categories","summary":"","title":"Video Encoding","type":"categories"},{"content":"","date":"February 10 2025","externalUrl":null,"permalink":"/tags/x265/","section":"Tags","summary":"","title":"X265","type":"tags"},{"content":"","date":"January 19 2025","externalUrl":null,"permalink":"/categories/homelab/","section":"Categories","summary":"","title":"Homelab","type":"categories"},{"content":" Overview # Following the implementation of my whole-rack watercooling system in my homelab, astute viewers may have noticed that despite my proclaimed desire to use NVLINK, I still wasn\u0026rsquo;t. I thought I had the hardware to connect the inlet/outlet ports on the GPU waterblocks with zero slots between them, but it turned out I didn\u0026rsquo;t, and I didn\u0026rsquo;t want to wait to get everything together for testing.\nFrom the time I first acquired the GPUs, regardless of the watercooling plan, I\u0026rsquo;d wanted them to be attached to my main server. It\u0026rsquo;s just more convenient to have everything in one place, and it allows me to turn the desktop off whenever it\u0026rsquo;s not needed for CPU computer, which is the vast majority of the time. Unfortunately, GPU-focused servers and PCIe expansion boxes are not cheap, and they\u0026rsquo;re very, very proprietary. My main server is in a Supermicro CSE-846 chassis, which accepts standard ATX/SSI-sized motherboards. The motherboard is a Supermicro X11DPH-T, which has 3x PCIe x16 slots, 4x PCIe x8 slots, and even two NVMe x4 slots on the board. Each is fully wired, so that\u0026rsquo;s plenty of connectivity, but when it\u0026rsquo;s all smooshed together with every slot adjacent, you\u0026rsquo;re blocking off valuable connectivity by putting multi-slot cards like GPUs in there. It physically blocks a slot, and for my use case, they don\u0026rsquo;t need access to the full bandwidth of a x16 or x8 slot either.\nMy server motherboard, the Supermicro X11DPH-T. Fortunately, modern server motherboards allow you to partition slots via PCIe bifurcation. This is commonly used for NVMe and U.2 riser cards, but you\u0026rsquo;re not technically limited to using those types of devices. As long as you have the hardware to break out the electrical signals from a single slot into multiple physical slots, and somewhere you can mount the cards, you could use up to four GPUs via a single x16 slot! The crossbars for the radiator mount gave me a great place to mount them, so all I had to do was find the hardware to hook it all up\u0026hellip; And here it is!\nUnboxing # MaxCloudON is the only provider that I could find of a pre-packaged set of parts to fit such a need. You can find their online store here, based in Bulgaria, but they ship internationally. It seems apparent to me that they use this product for their own business renting GPU servers, and have also done us the courtesy of making it available for sale for the general public. 
I got this specific set, available at the time of writing for $165 USD, which comes with the hardware to break a single x16 slot out into four x4 slots. They also offer x8 to two x4, x16 to two x8, longer cables, and some other options in the same vein.

Everything was packed tightly together with plenty of bubble wrap, giving it very little space to move around, though only the riser card was packed in an ESD bag. For your $165 USD, you get the PCIe breakout card with four SFF-8087 connectors, four purportedly special 60cm SFF-8087 cables, and four daughter boards, each with an SFF-8087 connector, a PCIe six-pin power input, and a physically x8 / electrically x4 PCIe slot with an open back. The bottom of the daughter board is coated in foam, and the PCB has pre-drilled holes, so you can securely mount it to some surface if you wish. It's fairly thick and robust foam, but you might want to take care to add another insulating surface and/or avoid tightening it down too much if you mount it on something conductive, just in case.

[Image: My current setup with three GPUs connected. There's an Arc A380 and 2x 2080 Tis plugged into the daughter boards, powered by a separate PSU and hanging off of the crossbar for the watercooling radiator mount.]

I'm no politician, so it's pretty hard to say anything else about this other than, "it works." I mean, they're very basic PCBs; there's not much to go wrong as long as the PCIe signal integrity is fine. They didn't cheap out on the quality of the plugs, and everything just works. Good job, MaxCloudON! I could only be happier if the cables were very slightly longer (for the same price plsthx 🙏) and if the daughter boards were slightly narrower. You can't fit them side-by-side with two-slot cards, but I was able to use another riser cable I had on hand to bridge the gap without issue.

Value & Comparisons #

Alright, so, this would be a pretty short and boring post if I didn't explore any alternatives. It works, and that's great, but can you do better than $165 USD for a similar setup? Not to mention that I had to pay more than $40 USD for shipping to the US. Are five small circuit boards with barely a dozen parts on each, and a few cables, really worth that much? Well, I couldn't find any other all-in-one solutions for this use case, so from a certain point of view, they could possibly charge even more. Perhaps jankier solutions aggregated from disparate providers could prove to work just as well, at a lower price? Let's take a look.

OCuLink A #

This is a link to a product page on Amazon that contains a variety of OCuLink adapters. OCuLink is a standard for PCIe signaling over cables. Using this page, let's create an equivalent kit:

- PCIe x16 to dual x8 OCuLink - $46 USD
- 50cm x8 OCuLink to dual x4 OCuLink cable - $50 USD * 2 = $100 USD
- OCuLink PCIe daughter board with x4 connector - $43 USD * 4 = $172 USD
- Total: $318 USD

Not only is this setup twice the price in total, it has shorter cables, the daughter boards have a 24-pin ATX power input instead of 6-pin PCIe, and the position of the OCuLink connector would be highly inconvenient in my setup. If you actually checked the product page, there is a good reason they use a 24-pin input - it's meant to be used as an eGPU dock for handheld PCs, so the daughter board includes hardware that will switch an external PSU on.
It also includes an OCuLink cable and an NVMe to OCuLink adapter, which I couldn't use, since they only offer riser boards with x8 connectors... Even if there were a card with quad x4 OCuLink connectors, which isn't available here, the daughter boards alone are still more expensive than MaxCloudON's kit.

OCuLink B #

Okay, but actually, quad x4 OCuLink cards do exist... at least on AliExpress. This is not an endorsement; I haven't used it. It has a PCI bracket that it passes the connectors through, so that's neat. It looks nice. It seems that no product quite like this has made it to Amazon yet, and there are no worthwhile daughter boards on Amazon either. So, what does the situation look like if we stick with AliExpress?

- PCIe x16 to quad x4 OCuLink - ~$16 USD
- x4 OCuLink to PCIe with SATA power-in - ~$20 USD * 4 = $80 USD
- 1m x4 OCuLink cable - ~$14 USD * 4 = $56 USD
- Total: ~$152 USD

Still very, very close to the MaxCloudON price point, and with drawbacks. I can only find models that use a 24-pin input or a SATA power input. The 24-pin input has an added cost, as I don't have multiple 24-pin cable splitters on hand. In addition, theoretically, you can draw up to 75W through a PCIe slot, and the 12V power delivery on a 24-pin connector is only rated for 150W; four daughter boards could exceed that rating. The same issue exists with the SATA input models. SATA connectors are only rated for 56W, and 75 is a bigger number than 56. SATA connectors are already infamous for melting under normal circumstances, driving low-power devices like the hard drives they were explicitly designed to power. This situation is workable, but not ideal. With a price point so close to MaxCloudON's, I'm not willing to risk the jank for maybe, maybe 20 bucks saved after getting the requisite splitters and non-direct, not-daisy-chained SATA connectors.

NVMe Risers? #

U.2 bifurcation riser cards are fairly cheap, but I haven't been able to find any female U.2 to PCIe slot risers. NVMe bifurcation riser cards are also fairly cheap, and you can get NVMe to PCIe riser adapters, so let's compare the price there:

- PCIe x16 to quad NVMe - anywhere from $15 USD to $60 USD, depending on the brand name
- 50cm NVMe to PCIe riser - $34 USD * 4 = $136 USD

ADTLink sells through Amazon in the US, and makes their products to order. They take a few weeks to ship, but I've ordered from them multiple times with great results. In terms of NVMe to PCIe risers, I'm not sure there are any worthwhile competitors... They're not even that much more costly compared to the competition, and being able to choose exactly which direction your cable comes off of the NVMe interface and the PCIe slot is a great feature for this type of product.

But we're back again with the same issue as above: SATA power input, and the combined cost is still at minimum $151 USD. Plus, flat cables are more annoying to route, unless they're going straight up and out of the case, which is an option for sure, but not one that I prefer. 50cm is also a little bit short. You can get longer ones, but you need to contact them directly, and obviously it's going to cost even more.
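To spell out the recurring power-budget concern in numbers, here is a sketch of the worst case in Python, using the connector ratings quoted above and assuming each daughter board may deliver the full 75W slot budget:

# Worst-case draw through the slot vs. the input connector ratings above.
PCIE_SLOT_MAX_W = 75      # maximum a card may pull through the slot
SATA_RATED_W = 56         # SATA power connector rating, as cited above
ATX24_12V_RATED_W = 150   # 12V budget of a 24-pin connector, as cited above

# One daughter board on SATA power can already exceed the connector rating:
print(PCIE_SLOT_MAX_W > SATA_RATED_W)            # True
# Four boards split off a single 24-pin can exceed its 12V rating, too:
print(4 * PCIE_SLOT_MAX_W > ATX24_12V_RATED_W)   # True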
I used this type of riser to great effect when I was running the GPUs out of the prebuilt desktop before the watercooling project, but for this use case, I don't think they're a great choice either, unless you have an open-top or caseless build where you don't really have to do cable routing. Maybe, maybe then it would be cheaper, as long as you're absolutely confident you won't burn up any SATA cables... which you likely won't, and I didn't, but it's still not in spec. SATA cables love to burn up when you least expect it, so I prefer to avoid them as much as possible. That being said, you could swap them out for anything you want, but then we're adding cost again, and MaxCloudON looks better and better...

Conclusion #

[Image: NVIDIA GPUs happily chugging away on a CUDA benchmark while the Intel GPU is also there. Arc cards do not have sensor/stat reporting on Linux before kernel 6.12...]

Alright, so there are setups that are nominally cheaper than MaxCloudON's package deals... with significant drawbacks in terms of violating power specifications, and janky, bulky connector splitters. You can call me privileged if you want, but I think if you can afford to spend a few hundred dollars each on multiple GPUs, and you have middle-aged enterprise hardware that supports PCIe bifurcation, you can probably afford the slight premium to get a bundle from MaxCloudON, even if you have to pay for international shipping. Don't burn down your house to save 20 bucks on something like this. Personally, the international shipping, 6-pin board power, and convenience of getting one cohesive product from one supplier are definitely worth the slight upcharge over the other solutions I've proffered here.
If you know of anything cheaper and less fire-hazardy, especially if it has a nice PCI bracket, please send me an email and let me know so I can update this page.

January 16 2025

Overview #

I've been watercooling my desktop since 2020, and case-modding custom cooling solutions since my first modern dGPU in 2012. I enjoy it for the aesthetic, as well as for the ability to run my hardware to the redline, despite the lower gains that modern hardware offers... It's still fun to try and see how high you can get on benchmark scoreboards. It was a big initial investment, but with most parts being reusable, the ongoing cost for component upgrades is minimal. Early in 2024, I bought some GPUs to use for dedicated ML tasks in my server rack, and immediately had watercooling on the mind. There weren't particularly strong reasons to do so, but it would lower the power usage a bit, give me some more VRAM overclocking headroom and a bit more core clock stability, as well as the ability to use a cheap two-slot NVLINK connector without suffocating a GPU.

[Image: My initial setup with 3x 2080 Tis, using M.2 NVMe to PCIe risers in an ASUS prebuilt. Two are connected by NVLINK, which I found to provide a slight performance benefit on the order of ~1-5% in multi-GPU SISR training - not worth the typical price for NVLINK bridges from this era. I was lucky to get this ugly, Quadro-oriented bridge for just $40.]

I've never had a leak on my desktop, but with wider temperature swings in my garage, and, collectively, a whole lot more expensive hardware that might get damaged compared to my desktop setup, I was hesitant.
The benefits seemed minimal, and I considered it a fun what-if scenario until I upgraded my main server and discovered that the forced-air passive chassis cooling was insufficient for my new CPUs. At that point, I had to make a decision: get better heatsinks, which would be single-purpose and cost in excess of $100 each, or go whole-rack watercooling. I chose whole-rack watercooling.

With enough reason to go ahead with the project, and the only thing holding me back being a fear of leaks, I had to figure out how to actively monitor for, and preferably prevent, such an eventuality. I happened upon a product called LEAKSHIELD, from Aqua Computer, that advertised itself as doing exactly that.

How does it work? You pull a vacuum inside the fluid loop and monitor the loss rate. Water can't get out if air is trying to get in, and if air gets in, the vacuum is reduced. It's so simple, I don't know why it took this long for such a product to enter the PC watercooling space. I could have just bought a LEAKSHIELD and called it a day, but Aqua Computer's software doesn't support Linux, and the specs of the vacuum pump seemed a bit weak for a loop of my scale. Its functionality, however, was something I was confident I could emulate with more robust off-the-shelf parts. Thus, sufficiently armed with esoteric plumbing knowledge, I took the plunge and started loading up on parts.

The Hardware #

Most of the build uses pretty standard off-the-shelf parts for PC watercooling, but there are a few bits and pieces that most builders won't have seen before, and a custom control system that offers a more relevant experience for a server rack. The control system is based on an Arduino Uno that feeds vital statistics over serial, and features pressure control and monitoring similar to the LEAKSHIELD, with fan control based on a PID algorithm keeping the water temperature at a fixed setpoint above ambient.

Off-the-shelf #

General Details #

The centerpiece of the build, which the control unit and pump mount to, is the "MOther of all RAdiators", version 3, from Watercool. This is the 360mm version, with support for up to 18x 120mm fans or 8x 180mm fans. It's constructed more in the spirit of a vehicle radiator than a traditional PC radiator, with a less restrictive fin stack and large, round tubes rather than thin rectangular ones. It provides several mounting points for accessories, which I was able to utilize to secure it to my server rack in a satisfactorily sturdy fashion. An in-depth teardown of the construction method and material quality of the MO-RA can be found on igor'sLAB.

For fans, I have a collection of old DC-control Corsair Air Series SP120s. They've all been retired from regular use because of noise-related aging issues. In fact, one of them failed to turn at all once I had everything wired up, and another had its bearing disintegrate about 8 weeks after putting the thing into service. That being said, they did survive (and continue to survive, in the remaining 16 cases) 24/7 use for anywhere from 4-10 years, at bottom-of-the-barrel pricing, so that's not too bad.
I'm not exactly pushing the limits of this radiator here, so a few fans breaking down over time isn't the end of the world.

[Image: A MO-RA V3 360 PRO PC watercooling radiator from Watercool.]

I got a secondhand Corsair XD5 pump/res combo from eBay for about sixty bucks, which is pretty good for a genuine D5-based pump/res combo. It has PWM support, which I did wire up, but the flow ended up being rather anemic even at 100%, so I just run it at full speed all the time. The flow rate is measured through an Aqua Computer flow sensor, which is simply a hall-effect tachometer translated to l/h through software. I did not attempt to verify the accuracy of the sensor in my setup. The absolute accuracy is less relevant than simply getting an overall idea of whether or not the measurement is consistent with flow behavior, which it is.

Simple, cheap aluminum bars and angles mount to the studs on the radiator and into the stud holes on the server rack, and the pump and control box mount onto brackets along with the fans.

CPUs #

My thermally problematic server upgrade was to dual Xeon Gold 6154s, which are Skylake-SP architecture. This specific SKU is pretty beefy, with 18 cores at sustained all-core speeds of 3.7GHz SSE / 3.3GHz AVX2 / 2.7GHz AVX-512, and a TDP of 200 watts. The rated Tjmax is 105°C, and with the chassis cooling, they readily met that and started throttling under all-core loads, with idle temperatures as high as 60-70°C. I previously had Xeon E5-2697 v2s, which had TDPs of 130W. They got toasty, but never throttled. I'm not sure if the chassis had any easy fan upgrades available that might have made a difference, and I certainly could have moved to 4U-compatible tower coolers rather than forced air, but since I wanted to watercool the GPUs anyway, adding the CPUs as well would be minimal cost/effort, with more future compatibility for the waterblocks compared to a specialized LGA3647 tower cooler.

[Image: Alphacool Eisblock XPX Pro coldplate. Image credit & copyright - igor'sLAB.]

The CPU waterblocks are Alphacool Eisblock XPX Pro Aurora Light models, which are significantly cheaper than the not-Light XPX Pro Aurora. They appear to be entirely identical, functionally... I'm not sure if there are any actual performance benefits offered by the not-Light version. It's a relatively obscure block family without many thorough reviews, which makes sense, given this block is designed for full coverage on Xeons/Threadrippers. The coldplate appears to be skived, which is uncommon in this price bracket for a discrete block, and the fins are incredibly short and dense. In smaller desktop loops, I've seen this block criticized for having overly restrictive flow, but when you have four blocks plus quick disconnects, "good" flow is relative. At the power limit of 200W, the maximum core temperature delta relative to the water temperature is 25°C, with a ~1-2°C average delta between the two serially-connected sockets at a flow rate of ~130L/h, and that's more than sufficient.

[Image: Interior view of the Supermicro CSE-846 chassis showcasing the installed waterblocks and other components.]

GPUs #

The GPU blocks are Phanteks 2080 Ti Founder's Edition blocks. Nothing special; they're just the cheapest matching ones I could find in 2024 that looked like they'd fit these almost-reference-but-not-quite OEM cards. They're generic OEM models that would have gone in prebuilts.
The most interesting thing about these cards is that they've been modded to have 22GB of VRAM. There's a dedicated supplier still offering them, and it's by far the best $/GB value for VRAM in modern NVIDIA GPUs.[1] Whether or not this is a better value overall than, say, a 3090 (Ti) depends on your use case. Performance improvements in ML tasks between the 2080 Ti and 3090 (Ti) range from as little as ~20% to as much as ~100%, depending on how memory-bandwidth-constrained your workload is. With secondhand 3090 (Ti)s still going for a minimum of $700 on the used market in the US, I found the alternative 2080 Ti option more alluring for my use case, which is primarily single image super-resolution. More VRAM is desirable to increase the size of tiles for inference, and to increase the batch size during training. Training speed scales almost linearly, and inference speed scales linearly, per GPU. So, for my use case, where I'm not really limited by the performance of a single GPU, the 2080 Ti mod route offers better overall value, both for VRAM and combined core performance, compared to 3090 (Ti)s. The idea of having a modded GPU in itself was also appealing, and definitely part of why I made that decision. Pulling up a hardware monitor and seeing a 2080 Ti with 22GB of VRAM feels a little bit naughty, and I like that.

[Image: The blocks installed in an ASUS prebuilt gaming tower.]

I did initially buy three of them, as pictured at the beginning of this post, but one of them failed just after the 30-day warranty period listed on the supplier's website. Despite that, they were kind enough to offer a full refund if I covered return shipping, and they were very communicative, responding in <24 hours every time I sent them any kind of message/inquiry.

The biggest benefit that watercooling theoretically brings to modern video cards is a prolonged lifespan. Not due to lower core temperatures[2] in an absolute sense, but due to the reduced stress from thermal cycles. Mismatches in the rate of thermal expansion between the die and the substrate will eventually cause their bond to break, and this happens faster the larger your die is, and the more extreme the temperature swings are. Today's GPU dies are huge, and it's hard to say how many failures are attributable to this factor alone, but it is certain to be more of a risk than it has been in the past. I'd rather buy a used mining GPU than a used gaming GPU any day, because it has likely been kept at roughly the same temperature for most of its life, as opposed to experiencing wide periodic swings.

The GPU blocks required a moderate amount of light massaging to properly fit these OEM-model cards. The power plugs are in a different position, and a singular capacitor on these models is slightly taller than on the actual Founder's Edition reference card, but they're otherwise close enough to identical.

That's not to say that there are no benefits from lowering the operating temperature. As an absolute value, within manufacturer limits, it affects boost clocks and leakage current. A cooler chip will use less power to run at the same clock speed than a hotter chip due to reduced leakage current, making chips measurably more energy efficient per clock cycle the colder they run. In my case, with the fan on max, while not thermal throttling, these GPUs would bounce off the power limit of 280W while attempting to hit a core clock of 1800MHz.
Under water, at a reported core temperature of ~30°C, the reported board power draw is only ~220W at an 1800MHz[3] core clock for the same workload. The type of fan typically found in these coolers is rated anywhere from 15-30W on its own, so a reduction of at least 30W can likely be attributed to lower leakage current.

DIY Time #

In no particular order, here is a list of the major components involved in the control system:

- Generic metal box, formerly from a PBX system
- Arduino Uno clone, unknown brand
- 60mm Corsair fan
- RS232 TTL shifter
- Aesthetic retro power switch
- 12V DC vacuum pump
- U.S. Solid 12V NC solenoid
- 12V relay modules
- HX711 ADC
- MD-PS002 absolute pressure sensors
- L298-like PWM motor driver
- Apple White iMac PSU
- Adafruit Arduino Uno Proto Shield
- DS18B20 temperature probes

[Image: Fit check for all the major components.]

I didn't take excruciatingly detailed pictures of every single step of the assembly/prototyping process. For the most part, I was just plugging pre-made components together. The most interesting production notes concern the pressure sensor and the power supply.

Putting New Life into an iMac PSU #

The power supply I used is from a first-gen Intel White iMac, which is visually very similar to the G5. It was one of the earliest things I installed Linux on, and I used it as a seedbox for a bit, but eventually took it apart and saved some of the more interesting stuff.

All the credit goes to the user ersterhernd, from this thread on the tonymacx86.com forum, for figuring out the pinout of this PSU, which is almost entirely identical to the one in my unit, apart from the power rating on mine being 200W. There are two banks of pins: half are always on, and half are toggleable. Each bank has 12V, 5V, and 3.3V. I didn't end up using 3.3V for anything other than the power switch. I have no idea what the energy efficiency of this unit is; obviously it doesn't have an 80+ certification... but I'm assuming that Apple would make it at least halfway decent. Hopefully it's more efficient than a random 12V power brick with additional converters.

[Image: My schematic for the control unit. It was the first time I've ever used KiCad, and the first time I've ever made a schematic like this at all. I hope it's relatively legible.]

As you can see in the schematic above, the always-on 3.3V pin is connected to SYS_POWERUP through a relay board. The relay input is pulled low by a single-pole switch, which turns the relay on, which connects ground to SYS_POWERUP, engaging the other rail of the power supply. This is kind of a convoluted solution to not having a double-pole switch... but I didn't have a double-pole switch, so that's what I did.

Measuring vacuum #

The leak-resisting aspect all hinges on monitoring the pressure of the loop... or potentially running a vacuum pump constantly, but that's stupid. For some reason, I had a really hard time finding a vacuum pressure sensor. There are plenty of physical, analogue vacuum gauges available, but as far as an electronic sensor goes... I just couldn't find any located in the US at a reasonable price. There were a few hobbyist-grade differential sensors, but I wanted to be able to measure down to an almost complete vacuum, and they didn't have the range. Maybe I had the wrong search terms, but I just wasn't finding anything.
Eventually, I found an unpackaged sensor with obscure, not entirely legible datasheets that claimed an acceptable pressure range for my application. The MD-PS002 is what I settled on, available on Amazon in the US in a 2-pack for $8. It's a tiny little thing, and it took two attempts to successfully create a sensor package that didn't leak.

[Image: Sensor package details, installed and all gooped up.]

I drilled a hole in a G1/4" plug, just slightly bigger than the metal ring on the sensor, coated that ring with J-B Weld, and inserted it, letting it cure before grinding away the exterior of the top of the plug and building up more J-B Weld to add some strain relief for the wires, as well as edge-to-edge sealing. The current vacuum loss rate, after running the system for a few months and allowing the loop to very thoroughly de-gas, is now less than 50mbar per day at -500 to -600mbar. I was slightly worried about the lifetime of the pump, given it's a cheap thing from Amazon, but since it only has to run for about a second every 2-3 days, I imagine that won't be an issue.

Here's a quick video showing the system not leaking!

This pressure sensor is a Wheatstone bridge, which works the same way as the load cells in digital scales. The resistance changes are very, very small, so the signal must be amplified before being fed into an ADC. You could use an op-amp and feed that signal into an analog input on the Arduino, but I felt more comfortable using the HX711, a two-channel ADC with an integrated amplifier designed to be used with Wheatstone bridge load cells. Here's a code snippet showing how I converted the raw analog measurement to mbar.

// Map the HX711's raw reading onto the calibrated -700..+1000 mbar range.
float pressure_raw_to_mbar(int32_t pressure_raw) {
  return (pressure_raw - 390000) * (1700.0 / (5600000 - 390000)) - 700;
}

I calibrated it manually, comparing it to an analogue gauge. It's calibrated to a zero point at atmospheric pressure in my locale, and from -700mbar to +1000mbar. I figured out that, when setting the HX711 to a gain of 64 with the Adafruit HX711 library, a change of 100mbar is a change in the ADC measurement of 30k, highly consistent across the entire pressure range that I tested. I can't be 100% sure how accurate the analogue gauge is, but 100% accuracy doesn't really matter for this application. All I really need to know is that an adequate vacuum is present, plus a general idea of the leak rate, and this setup meets that requirement.

Other stuff #

I got a beefy PWM motor driver with L298 logic, claiming a continuous current of 7 amps per channel, which nicely fit my requirements. 120mm PC fans are typically 0.2-0.3 amps, and mine in particular are 0.25. For 18 fans, that's approximately 4.5 amps at 100% speed. It's a bit oversized, and I'm only using one channel, but it leaves me the option to use larger, generic radiator fans with more demanding power requirements in the future. I'm already down two fans of the original eighteen; eventually enough of them will fail that I'll have to find another solution.

[Image: Required additions to the solenoid and pump motor, and the complete assembly without its cover.]

In my initial tests, I found that operating the pump and solenoid would cause the Arduino to reset, seemingly at random, or cause other undefined behavior. Since they were not electrically isolated on a second power supply, that makes sense.
Other stuff # I got a beefy PWM motor driver with L298 logic, claiming a continuous current of 7 amps per channel, which nicely fit my requirements. 120mm PC fans are typically 0.2-0.3 amps, and mine in particular are 0.25. For 18 fans, that should be approximately 4.5 amps at 100% speed. It\u0026rsquo;s a bit oversized, and I\u0026rsquo;m only using one channel, but it leaves me the option in the future to use larger, generic radiator fans that have more demanding power requirements. I\u0026rsquo;m already down two fans from eighteen; eventually enough of them are going to fail that I\u0026rsquo;ll have to find another solution.\nRequired additions to the solenoid, pump motor, and the complete assembly without cover. In my initial tests, I found that operating the pump and solenoid would cause the Arduino to reset, seemingly at random, or cause other undefined behavior. Since they were not electrically isolated on a second power supply, that makes sense. They were backfeeding energy and causing a notable amount of general interference during operation, to the point that the LEDs on the inactive relay modules would dimly illuminate while the motor was running, and very visibly illuminate whenever the motor or the solenoid deactivated. I had to add flyback diodes, and, for peace of mind, I added ceramic filtering capacitors to the pump as well. Those additions completely eliminated the issue. Below is a video demonstrating the bad behavior.\nI did a similar plug-drilling setup for the water temperature sensor with a generic Dallas temperature probe. The air probe was taped to the exterior of the box, in the path of the incoming air. All that remained was to solder up a sort of bus bar for the radiator fan connectors, get the temperature probes and pull-up resistors wired into the proto board, and hook everything up to the Arduino, then write the software to tie it all together.\nThe Software # rawhide_k/server-watercooling-controller C\u0026#43;\u0026#43; The Arduino operates independently, without a server. The fan speed and loop pressure are managed autonomously. The serial connection is only used to report vital stats, for later integration into more connected monitoring systems. Ultimately, there are only two actions it can take: change the fan speed, and turn on the vacuum pump. Neither of these requires any external knowledge. It doesn\u0026rsquo;t need any information about the connected hardware to function correctly. All it\u0026rsquo;s concerned with is keeping the water temperature at a certain delta above ambient temperature, and keeping the loop pressure within a certain range. Very simple.\nOnce per second, the temperature sensors are sampled, the PID loop for the fans runs with the new temperature data points, and the fan speed, temperature data, vacuum pressure, and pump/flow measurements are sent over serial with the help of the ArduinoJSON library. I settled on a target water delta of 4*C relative to ambient, with chosen min/max water temperatures beyond which the fans turn off entirely or pin themselves to 100%. The 4*C delta is rather arbitrary. It\u0026rsquo;s approximately the delta that exists when the systems are on, but idle, and the fans are at their minimum speed. That delta can be maintained during 100% CPU load, and during medium-heavy GPU loads, but not both combined. It still stays well under a 10*C delta in that case, though, so I can\u0026rsquo;t complain.
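For illustration, here\u0026rsquo;s a minimal sketch of what that JSON-over-serial reporting can look like with ArduinoJSON. The field names here are hypothetical stand-ins; the real payload also carries the pump and flow stats mentioned above.\n#include \u0026lt;ArduinoJson.h\u0026gt;\n\nvoid report_stats(float air_temp, float water_temp, float loop_pressure, uint8_t fan_speed) {\n //build one flat JSON object and write it out as a single serial line\n StaticJsonDocument\u0026lt;128\u0026gt; doc;\n doc[\u0026quot;air_c\u0026quot;] = air_temp;\n doc[\u0026quot;water_c\u0026quot;] = water_temp;\n doc[\u0026quot;pressure_mbar\u0026quot;] = loop_pressure;\n doc[\u0026quot;fan_pwm\u0026quot;] = fan_speed;\n serializeJson(doc, Serial);\n Serial.println();\n}\nOne object per line keeps the server side trivial: read a line, parse it, done.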
There are also hard stops to turn the fans off if the water temperature hits 5*C, and pin them to max if it hits 40*C. I\u0026rsquo;m not sure how realistic either of those figures is, but it\u0026rsquo;s better to be safe than frozen up and/or exploded.\nint fan_PID(float* air_temp, float* water_temp, uint32_t* cur_loop_timestamp) {\n static const float kp = 120.0;\n static const float ki = 0.16;\n static const float kd = 4.0;\n static float integral = 0;\n static float derivative = 0;\n static float last_error = 0;\n static float error;\n static float delta;\n static float last_time = *cur_loop_timestamp;\n static const float min_water_temp = 5.0;\n static const float max_water_temp = 40.0;\n static const uint8_t min_fan_speed = 90;\n static const uint8_t max_fan_speed = 255;\n static const uint8_t temp_target_offset = 4;\n static const uint8_t fan_offset = 10;\n static int16_t fan_speed;\n if (*water_temp \u0026gt;= max_water_temp) {\n return max_fan_speed;\n } else if (*water_temp \u0026lt;= min_water_temp) {\n return 0;\n } else {\n error = *water_temp - min(*air_temp + temp_target_offset, max_water_temp);\n delta = *cur_loop_timestamp - last_time;\n //mitigate unlimited integral windup: only accumulate while the output is unsaturated\n if (fan_speed != max_fan_speed) {\n integral += error * delta;\n }\n if (*air_temp + temp_target_offset \u0026gt; *water_temp - 1) {\n integral = 0;\n }\n derivative = (error - last_error) / delta;\n fan_speed = round(constrain(min_fan_speed + fan_offset + (kp * error + ki * integral + kd * derivative), min_fan_speed, max_fan_speed));\n last_error = error;\n last_time = *cur_loop_timestamp;\n return fan_speed;\n }\n}\nOccasionally, the temperature probes as well as the HX711 return spurious readings that cause poor behavior, such as crashing the Arduino. In particular, the temperature probes will sometimes return -127, which caused my PID algorithm to crash the Arduino for reasons I could not divine.\nFor the temperature probes, I simply ignore the one problematic result that I\u0026rsquo;ve observed.\nnew_water_temp = sensors.getTempC(water_therm);\nnew_air_temp = sensors.getTempC(air_therm);\nif (new_water_temp != -127) {\n water_temp = new_water_temp;\n}\nif (new_air_temp != -127) {\n air_temp = new_air_temp;\n}\nsensors.requestTemperatures();\nIn case of any other freezing/crashing issues, I also enabled the watchdog timer for 2 seconds. So, if, for some reason, it does freeze or crash, it should self-reset after 2 seconds. It seems to be working, although I guess time will tell in the long term. I haven\u0026rsquo;t experienced any operational issues since I added it over a month ago. The other concern is undefined behavior when the timer overflows. The Uno\u0026rsquo;s millisecond counter is only 32 bits, so it will overflow after around 50 days of uptime. This function pre-emptively resets it.\n//we will use this function to periodically self-reset to avoid timer overflows\nvoid(* resetFunc) (void) = 0;\n...\n//reset the system when approaching timer overflow\nif (cur_loop_timestamp \u0026gt;= 4000000000) {\n resetFunc();\n}
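The watchdog itself is only a couple of calls from avr/wdt.h. A rough sketch of the arrangement, with the rest of my logic elided:\n#include \u0026lt;avr/wdt.h\u0026gt;\n\nvoid setup() {\n //...sensor, relay, and serial init...\n //reset the board if the main loop ever stalls for 2 seconds\n wdt_enable(WDTO_2S);\n}\n\nvoid loop() {\n //...sampling, PID, pump control, serial reporting...\n //feed the watchdog once per pass, so a healthy loop never trips it\n wdt_reset();\n}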
In addition to the temperature probe problem, the HX711 occasionally returns wildly wrong results that need to be filtered out. To compensate for that, if the vacuum dips below the threshold, I wait one second, and if it\u0026rsquo;s still below the threshold, I then begin pumping. This check happens approximately ten times per second, as the default behavior of the HX711 board that I have is to run in 10Hz mode. I\u0026rsquo;m not sure if the issue springs from some kind of interference with the tachometer interrupts messing up the signal timing, or if I\u0026rsquo;m misunderstanding the correct way to sample the HX711 over time.\nif (cur_loop_timestamp - last_pressure_check \u0026gt;= 100) {\n loop_pressure = pressure_raw_to_mbar(hx711.readChannelRaw(CHAN_A_GAIN_64));\n if (sucking == false) {\n if (loop_pressure \u0026gt; low_pressure_threshold) {\n if (checking_low_pressure == false) {\n checking_low_pressure = true;\n low_pressure_confirmation_timestamp = cur_loop_timestamp;\n }\n if (cur_loop_timestamp - low_pressure_confirmation_timestamp \u0026gt;= 1000) {\n digitalWrite(pump_relay, HIGH);\n digitalWrite(solenoid_relay, HIGH);\n sucking = true;\n checking_low_pressure = false;\n }\n } else if (cur_loop_timestamp - low_pressure_confirmation_timestamp \u0026gt;= 1000) {\n checking_low_pressure = false;\n }\n } else {\n if (loop_pressure \u0026lt; high_pressure_threshold) {\n digitalWrite(pump_relay, LOW);\n digitalWrite(solenoid_relay, LOW);\n sucking = false;\n }\n }\n last_pressure_check = cur_loop_timestamp;\n}
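If the one-second confirmation window ever proves insufficient, the other obvious option is a median filter over the raw samples, which would reject a lone spurious reading outright. An untested sketch, reusing the same Adafruit HX711 call from above:\nint32_t median3(int32_t a, int32_t b, int32_t c) {\n //median of three without sorting\n return max(min(a, b), min(max(a, b), c));\n}\n\nint32_t read_pressure_filtered() {\n static int32_t s0 = 0, s1 = 0, s2 = 0;\n //shift in the newest raw sample and return the median of the last three\n s0 = s1;\n s1 = s2;\n s2 = hx711.readChannelRaw(CHAN_A_GAIN_64);\n return median3(s0, s1, s2);\n}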
Currently, my server-side software is incomplete. It\u0026rsquo;s just a brute-force JSON-over-serial reader written in Python that I glance at from time to time. I plan to write a Zabbix bridge, and have that manage the monitoring, alerts, and reactions to catastrophic events, once I have Zabbix properly set up for my systems\u0026hellip; But that hasn\u0026rsquo;t happened just yet. I don\u0026rsquo;t expect it to be a particularly interesting event, but if anything comes up I might write a post about it.\nOther Thoughts? # The Arduino\u0026rsquo;s software hadn\u0026rsquo;t been 100% finalized when I took the below pictures. The control box does have a lid now, and all the cable management is a lot cleaner\u0026hellip; Promise!\nEverything installed and working! When I was testing it, I had an incident where the Arduino crashed, which meant the fans stopped\u0026hellip; That\u0026rsquo;s a big drawback of the motor controller I have: it fails off instead of on. But I haven\u0026rsquo;t experienced any more issues after adding those software fixes. At that time, I was running a full GPU workload\u0026hellip; The water temperature exceeded 70*C. It happened at night, and I have no idea how long it ran like that\u0026hellip; Hours. Pretty scary stuff, but it all came out alright.\nThis project had a lot of firsts for me. It was the first time I\u0026rsquo;ve done any kind of embedded-adjacent development beyond \u0026ldquo;ooooo look at the blinky light, oooooooo it turns off when you press the button, wwaaaow\u0026rdquo;, and the first time I\u0026rsquo;d designed something with so many individual parts. I\u0026rsquo;d never worked with air pumps, solenoids, or pressure sensing before, nor had to debug issues like the lack of flyback diodes.\nThe biggest mistake I made was using that stupid battery box. It\u0026rsquo;s steel, and I don\u0026rsquo;t have the tools or experience to work with steel in the way that I intended to. I thought it would look cool, and it does, but if I did it again, I\u0026rsquo;d use a generic aluminum or plastic project box instead, because this one took two entire days, plus waiting on new drill bits that could actually cut through it.\nIf I were to ever take it apart again, I\u0026rsquo;d add a passthrough for the SPI header, and/or an external reset button. I should have gotten a physical display of some type that could show the sensors and debug info on the device itself, without needing to be connected to another device to read out the data.\nI\u0026rsquo;d like to get a second pump, for redundancy\u0026rsquo;s sake and to increase the flow rate. But it\u0026rsquo;s going to be such a pain to install that I feel like I\u0026rsquo;m never going to bother to do it, unless the current pump fails, or I add more components to be cooled and the flow is adversely affected. I was slightly concerned about the evaporation rate of the liquid via the vacuum tank, and that I\u0026rsquo;d need to add some kind of fluid level detection system, but there\u0026rsquo;s been no noticeable loss thus far. Now that I know the pump turns on so infrequently, I can\u0026rsquo;t imagine that it\u0026rsquo;s going to need to be topped up anytime soon.\nIn terms of value\u0026hellip; This was unbelievably bad. Buying tower coolers would have allowed the CPUs to run without throttling, and buying another GPU would have overpowered any benefits gained by NVLINK. I haven\u0026rsquo;t tallied up exactly how much I spent on it, but it was at least $1000, including buying new tools and excess materials that I haven\u0026rsquo;t fully used, and excluding the original cost of re-using some parts I already had. I\u0026rsquo;ve added risk, maintenance overhead, and pain whenever I swap out hardware in the future. Custom watercooling4 is an ongoing abusive relationship between your fingertips and your ego, or your fascination for slightly more optimized numbers on a screen\u0026hellip; But I\u0026rsquo;d do it again in a heartbeat, because it was fun.\nIn modern, post-Turing cards, that is. Please stop buying mesozoic-era Kepler/Maxwell Quadros and Teslas just because they have VRAM. There\u0026rsquo;s a reason they\u0026rsquo;re going for like, $20, and if you paid more for anything from that era, I\u0026rsquo;m sorry. Electrical costs are a thing, and your life is worth more than waiting for any meaningful, current-year work to happen on those decrepit e-waste cards. I feel even worse for you if you got tricked into buying one of those \u0026ldquo;24GB\u0026rdquo; or \u0026ldquo;16GB\u0026rdquo; cards that are actually 2x12GB and 2x8GB. You can make an argument for Volta, but only if you\u0026rsquo;re doing some deranged pure FP64 stuff. Consumer Turing and newer are faster at everything else! And if you\u0026rsquo;re buying them for HW-accel encode\u0026hellip; The quality is awful compared to any Intel ARC card. Buy one of those instead.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nI don\u0026rsquo;t understand why people don\u0026rsquo;t trust the manufacturer specifications when it comes to silicon temperature limits, beyond unfounded conspiracy nonsense around planned destruction/obsolescence. On Intel server SKUs, you find that the throttling temp is higher than on consumer SKUs, despite the higher reliability demanded by the enterprise market\u0026hellip; I\u0026rsquo;m assuming that this is due to reduced hotspot variance, thanks to a generally lower voltage spread from lower boost clock speeds. On enterprise SKUs that are focused on single-threaded performance, the throttling temp is typically lower than on those without the ability to boost as high. If you have evidence to the contrary, let me know.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nNVIDIA overclocking on Linux is awful. There\u0026rsquo;s no way to edit the voltage-frequency curve through the Linux drivers.
You can only set the offset. So you can\u0026rsquo;t really make use of dynamic boosting if you want to undervolt. I have the max clock speed clamped at 1800MHz with a core offset to emulate undervolting as you would do on Windows, but it\u0026rsquo;s hard to say if I\u0026rsquo;m getting the peak performance that I could be getting at whatever the core voltage is under these circumstances - because NVIDIA\u0026rsquo;s Linux drivers ALSO don\u0026rsquo;t report that. VRAM temperature? Nope. VRM? Nope. Hotspot? Nope. You better hope your card works with NVML, too, because otherwise you\u0026rsquo;re going to have to mess around with X to use nvidia-settings.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nStop calling them open loops. Open loop systems exchange fluid with the environment. Unless you\u0026rsquo;re getting your water out of the sink and flushing it right down the drain, your system is not open loop. It\u0026rsquo;s closed. I don\u0026rsquo;t understand why people call custom loops \u0026ldquo;open\u0026rdquo; and AIOs \u0026ldquo;closed\u0026rdquo; when their modes of operation are identical. It\u0026rsquo;s just plain wrong.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"January 16 2025","externalUrl":null,"permalink":"/posts/watercooling-homelab/","section":"Posts","summary":"","title":"Watercooling My Homelab","type":"posts"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"}]