
Are you self-hosting LLMs (AI models) on your headless servers? I’d like to hear about your hardware setup. What server do you have your GPUs in?

When I do a hardware refresh I’d like to make sure my next server can support one or more GPUs for local LLM inference. I figured I could put either a 4090 or two 3090s into an R730, but I’ve only barely started researching this, so maybe it isn’t practical.
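Rough back-of-envelope VRAM math for that choice (a 4090 has 24 GB; two 3090s give 48 GB total). The overhead figure below is an assumption, and real usage depends on quantization, context length, and backend:

```python
# Back-of-envelope VRAM estimate for quantized LLM weights.
# The 2 GB overhead is an assumption; KV cache and backend
# overhead grow with context length.
def est_vram_gb(params_billion: float, bits_per_weight: int, overhead_gb: float = 2.0) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # e.g. 13B at 4-bit ~= 6.5 GB of weights
    return weights_gb + overhead_gb

for params, bits in [(13, 4), (34, 4), (70, 4)]:
    print(f"{params}B @ {bits}-bit: ~{est_vram_gb(params, bits):.0f} GB VRAM")
```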

I don’t know much about hardware lineups other than Dell’s R7xx series.

I host oobabooga on an R710 as a model server API, with SillyTavern and Stable Diffusion acting as clients of it. The R710 is doing CPU-only inference, so as you can imagine it’s so slow it’s basically unusable, but I wired it up as a proof of concept.
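In case it helps anyone wiring up something similar: the clients just hit oobabooga’s HTTP API. A rough sketch of what a client call looks like, assuming the OpenAI-compatible API is enabled on the default port 5000 (the hostname here is made up):

```python
import requests

# Hypothetical host running oobabooga (text-generation-webui) with its API enabled.
API_URL = "http://r710.lan:5000/v1/completions"  # host/port are assumptions

payload = {
    "prompt": "Explain what a homelab is in one sentence.",
    "max_tokens": 128,
    "temperature": 0.7,
}

# The OpenAI-compatible endpoint returns the generated text in choices[0].text.
resp = requests.post(API_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```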

I’m curious what other people who self-host LLMs do. I’m aware of remote options like Mancer or RunPod, but I’d like to keep the option of purely local inference.

Thanks all

tigress667@alien.top:

One challenge with the 4090 specifically is that I don't believe there are any dual-slot variants out there; even my 4080 is advertised as a triple-slot card (and actually takes four slots because Zotac did something really, really annoying with the fan mounting). You could liquid-cool it and swap the bracket, but then you have the unenviable task of mounting sufficient radiators and support equipment (pump, reservoir, etc.) in a rackmount server. That assumes you're looking at something 2-3U, since you mentioned an R730; if you're willing to do a whitebox 4U build it's a lot more doable.

Of course, if money is no object, ditch plans for the GeForce cards and get the sort of hardware that's made to live in 2U/3U boxes, i.e. current-gen Tesla (or Quadro, if you want display outputs for whatever reason). If money is an object, get last-gen Teslas. I tossed an old Tesla P100 (Pascal/10-series) into my Proxmox server to replace a 2060S with half the VRAM, and for LLMs I didn't notice an obvious performance decrease (i.e. it still inferences faster than I can read). In a rack server you won't even have to mess with custom shrouds for cooling, since the server's own fans provide more than enough directed airflow.
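If you go the passthrough route, it's worth a quick sanity check that the card is actually visible to your inference stack inside the VM. A minimal sketch using PyTorch (nvidia-smi tells you the same thing):

```python
import torch

# Confirm the passed-through GPU is visible and report usable VRAM.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
else:
    print("No CUDA device visible - check passthrough and driver install")
```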
