submitted 5 months ago* (last edited 5 months ago) by mariah@feddit.rocks to c/fosai@lemmy.world

I've been playing on the KoboldAI Horde, but the queue annoys me. I want an NSFW AI for playing on TavernAI chat.

[-] tal@lemmy.today 6 points 5 months ago

koboldai horde

I mean, you can run KoboldAI locally.

I don't know whether you'd consider that sufficiently fast. But if you're already using that and happy with it, it's probably what I'd try first.

[-] j4k3@lemmy.world 2 points 5 months ago

Is the 7600 the 16 GB one? I can't speak for AMD, but a 16 GB 3080 Ti can run a whole lot. I don't use Kobold because building it was too much of a dependency headache. I don't use SillyTavern either, because I prefer more control and versatility.

I'm using an 18-core 12th-gen CPU with 64 GB of system memory and mostly use llama.cpp so that I can split the load between CPU and GPU. I wrote a little command-line function that polls nvidia-smi and parses the GPU memory to tell me exactly how much I have used and what I have left over. That runs every 5 seconds in the terminal and displays the metrics in the title bar. Knowing exactly how much GPU memory you're using and dialing in the settings for new models makes a big difference; the various models have very different requirements and settings-optimisation potential.
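
A minimal sketch of that kind of monitor in Python (my own rough equivalent, not the exact function described above; it prints to the terminal instead of the title bar and assumes an NVIDIA card with nvidia-smi on the PATH):

```python
import subprocess
import time

def vram_usage_mib():
    # Ask nvidia-smi for used/total memory as plain CSV (values in MiB)
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    first_gpu = out.strip().splitlines()[0]
    used, total = (int(x) for x in first_gpu.split(", "))
    return used, total

while True:
    used, total = vram_usage_mib()
    print(f"VRAM: {used} MiB used, {total - used} MiB free of {total} MiB")
    time.sleep(5)  # same 5-second poll interval described above
```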

I run an 8×7B model quantized at 5 bits most of the time. It takes around 50 GB to initially load, but runs like a 13B after that and is quite lightweight.

I'm somewhat limited when it comes to training LoRAs. I can only do 7–8B models in that space, but with a GGUF I can run up to a 70B. I wish I had more than 64 GB of system memory, though; at 96 or 128 I could run some of the 120B models. Command R is pretty popular and powerful, but I can't load that one.
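
For the GGUF splitting mentioned above, the main knob is how many layers get offloaded to the GPU. A rough sketch using the llama-cpp-python bindings (the llama.cpp CLI has an equivalent --n-gpu-layers / -ngl flag; the file path here is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./mixtral-8x7b-instruct.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # layers offloaded to VRAM; the rest stay in system RAM
    n_ctx=4096,        # context window; the KV cache also uses memory
)

out = llm("Write one sentence about GPUs.", max_tokens=64)
print(out["choices"][0]["text"])
```

Raising n_gpu_layers until you sit just under your VRAM limit (which is where a monitor like the one above helps) is usually the easiest way to dial in speed.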

A 16 GB card can run something like Moistral 11B in Transformers at 4-bit using bitsandbytes, too.
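
A rough sketch of that 4-bit route with Transformers and bitsandbytes (the repo ID is a placeholder for whichever 11B model you actually use; note that bitsandbytes is primarily a CUDA library, so ROCm support may take extra work):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/some-11b-model"  # placeholder repo ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers on the GPU
)

inputs = tokenizer("Hello,", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```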

[-] projectmoon@lemm.ee 1 points 4 months ago

How much speed are you actually getting on Mixtral (I assume that's the 8×7B)? I have 64 GB of RAM and an AMD RX 6800 XT with 16 GB of VRAM. I get like 4 tokens per second with the Q5_K_M quant.

[-] Fisch@discuss.tchncs.de 2 points 5 months ago

There's a fork of text-generation-webui with HIP support; you should use that.

[-] rufus@discuss.tchncs.de 2 points 5 months ago* (last edited 5 months ago)

https://github.com/YellowRoseCx/koboldcpp-rocm

That one is optimized for AMD and, as far as I know, has the same or a very similar user interface.

(The 8 GB of VRAM on your graphics card will be somewhat of a limitation, so maybe stick with smaller, quantized models.)
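
To get a feel for what fits in 8 GB, here's a back-of-the-envelope estimate in Python (rule-of-thumb bits-per-weight figures, ignoring the extra memory the context/KV cache needs on top):

```python
def approx_model_gb(params_billion: float, bits_per_weight: float) -> float:
    # model weights only: parameters * bits per weight, converted to GB
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params, bpw in [
    ("7B  @ ~Q4", 7, 4.8),
    ("13B @ ~Q4", 13, 4.8),
]:
    print(f"{name}: ~{approx_model_gb(params, bpw):.1f} GB of weights")

# A 4-bit 7B is roughly 4 GB of weights and fits in 8 GB with room for context;
# a 13B is roughly 8 GB and would need partial CPU offload on this card.
```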

And share your success stories on !ChatbotsNSFW@lemmynsfw.com

[-] Even_Adder@lemmy.dbzer0.com 1 points 5 months ago
[-] mariah@feddit.rocks 1 points 5 months ago
[-] Even_Adder@lemmy.dbzer0.com 1 points 5 months ago

What do you mean?

[-] projectmoon@lemm.ee 1 points 5 months ago

Install Ollama. It has ROCm support (on Linux, at least). Then hook it up to your favorite client. It has its own API and an OpenAI-compatible one.
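
A sketch of the OpenAI-compatible route (the model name is just whatever you've pulled with `ollama pull`; the endpoint below is Ollama's default local address):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # the client requires a key; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3",  # placeholder: any model you've pulled locally
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```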
