llama.cpp for GPU only
(lemmy.ml)
Hmm. I'd actually argue it's a good solution in some cases. We run multiple services where load is intermittent, lifetimes are short, or the code is complex and hard to refactor. Just adding hardware resources can be a much cheaper solution than optimizing the code.
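For a rough sense of that tradeoff, a back-of-the-envelope break-even sketch might look like the snippet below. All figures in it are hypothetical placeholders, not numbers from this thread.

```python
# Rough break-even: pay for optimization work once vs. pay for extra hardware monthly.
# Ignores risk, maintenance, and opportunity cost; purely illustrative.

def months_to_break_even(engineer_hours: float,
                         hourly_rate: float,
                         extra_hardware_per_month: float) -> float:
    """Months of extra hardware spend needed to match the one-off optimization cost."""
    optimization_cost = engineer_hours * hourly_rate
    return optimization_cost / extra_hardware_per_month

# Example (hypothetical): two weeks of engineering at $100/h vs. one extra GPU node at $400/month.
print(months_to_break_even(engineer_hours=80,
                           hourly_rate=100.0,
                           extra_hardware_per_month=400.0))  # -> 20.0 months
```

Under assumptions like these, the extra hardware only becomes the more expensive option well over a year out, which is the point about intermittent or short-lived services.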
Do you have a degree in theoretical physics, or do you theoretically have a degree? ;)