
I've been using airoboros-l2-70b for writing fiction, and while overall I'd describe the results as excellent and better than any llama1 model I've used, it doesn't seem to be living up to the promise of its 4k token sequence length.

Around 2500 tokens, output quality degrades rapidly: the model either starts repeating previous text verbatim or becomes incoherent (grammar, punctuation and capitalization disappear, and the output turns into a salad of vaguely related words).

Any other experiences with llama2 and long context? Does the base model work better? Are other fine-tunes behaving similarly? I'll try it myself eventually, but the 70b models are chunky downloads, and experimentation takes a while at 1 t/s.

(I'm using GGML Q4_K_M on kobold.cpp, with RoPE scaling off, as you're supposed to for llama2.)

submitted 1 year ago* (last edited 1 year ago) by actuallyacat@sh.itjust.works to c/localllama@sh.itjust.works

Highlighting something cool that you may not have seen yet - today, kobold.cpp upgraded its context scaling to the newer NTK method, and now it's actually quite useful. This is different from SuperHOT - it works with unmodified models (although maybe a model specifically tuned for it would work even better, we'll see).
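
For anyone wondering what the NTK method actually does differently: linear scaling (the SuperHOT-style interpolation kobold.cpp 1.33 used) squashes the position indices down into the trained range, while NTK-aware scaling leaves positions alone and raises the RoPE base instead, which barely touches the high-frequency dimensions. A rough numpy sketch of the two ideas - just an illustration, not kobold.cpp's actual code, with the alpha exponent taken from the original NTK-aware RoPE post:

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0):
    # Standard RoPE: one inverse frequency per pair of dimensions.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)

def linear_scaled_angles(positions, head_dim, scale):
    # Linear interpolation: squash positions so the longer context
    # fits inside the range the model was trained on.
    return rope_angles(positions / scale, head_dim)

def ntk_scaled_angles(positions, head_dim, alpha):
    # NTK-aware scaling: keep positions as-is, raise the base instead.
    # Low-frequency dims get stretched a lot, high-frequency dims barely
    # change - which is why unmodified models tolerate it better.
    new_base = 10000.0 * alpha ** (head_dim / (head_dim - 2))
    return rope_angles(positions, head_dim, base=new_base)

# e.g. llama-style head_dim=128, stretching 2048 -> 8192 (alpha = 4)
angles = ntk_scaled_angles(np.arange(8192), head_dim=128, alpha=4.0)
```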

It's not in a release yet, so to try it out you need to pull the changes with git and build from source. After that, start kobold.cpp with --contextsize 4096 or 8192, put the same number (or less, see below) in the context length field in the UI (the slider only goes to 2048, but you can type in anything), and there you go. It works!

Prompt:

USER: The secret password is "six pancakes". Remember it and repeat when requested. Here is some filler, ignore it:
[5900 tokens worth of random alphanumeric characters]
USER: What's the secret password?
ASSISTANT:

Response:

The secret password is "six pancakes".

(model: airoboros-33b-gpt4-1.4, Q5_K_M - not a model finetuned for extended context!)

For comparison, with the linear scaling in kobold.cpp 1.33, this only worked up to about 2200 tokens.

This performs dramatically better, although there still seems to be some sort of memory overflow bug, as the output will suddenly explode into random characters when crossing around 6000 tokens. So with 8k I suggest setting max tokens in the UI to 5900.

(note about doing this test in kobold.cpp: if you try it with context size set to 2048, it might still appear to work, but that's only because kobold is automatically trimming out some of the filler in the middle. This is only a valid test if you're not overfilling the context.)
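
If you want to reproduce the test, something along these lines will generate a similar prompt (a rough sketch; the chars-per-token ratio is only a guess for random alphanumerics, so check the token count kobold.cpp reports and adjust until the context is nearly full but not overfilled):

```python
import random
import string

def password_test_prompt(filler_tokens=5900, chars_per_token=2):
    # Random alphanumeric filler to pad the context; adjust chars_per_token
    # until the reported token count is close to (but under) your target.
    filler = "".join(random.choices(string.ascii_letters + string.digits + " ",
                                    k=filler_tokens * chars_per_token))
    return ('USER: The secret password is "six pancakes". '
            "Remember it and repeat when requested. Here is some filler, ignore it:\n"
            + filler + "\n"
            "USER: What's the secret password?\n"
            "ASSISTANT:")

print(password_test_prompt())
```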

According to perplexity measurements there is some degradation in overall quality, especially when the context is almost empty, but it's not noticeable to me, at least at 4k, which is what I tested. There's room for improvement in only applying as much scaling as is necessary at the current sequence length, which should eliminate that degradation. I tried to implement that, but I just get garbage - I must be missing something. In any case, at the rate things are progressing, someone else will have done it by the time I wake up tomorrow...
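
For anyone who wants to take a crack at it, the idea is roughly this: apply no scaling until the current sequence length actually exceeds the training context, then raise the base only as much as needed to cover it (just a sketch of the concept with llama-ish defaults, not working kobold.cpp code):

```python
def dynamic_ntk_base(seq_len, train_ctx=2048, head_dim=128, base=10000.0):
    # No scaling until the sequence outgrows the training context,
    # so short-context quality is unaffected; beyond that, grow the
    # base just enough to cover the current length.
    alpha = max(1.0, seq_len / train_ctx)
    return base * alpha ** (head_dim / (head_dim - 2))

for n in (1024, 2048, 4096, 8192):
    print(n, round(dynamic_ntk_base(n)))
```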


The All feed on both Hot and Active modes is exactly the same as it was most of a day ago - all the same posts in the same order, except they're all 12h+ old now. At first I thought federation might not be working due to overload, but on New, posts are coming in all the time, both local and remote.

What's up?

actuallyacat@sh.itjust.works 1 year ago

Reddit has over 2,000 employees, most of whom are doing bullshit nobody using the site actually needs or wants; it's possible to run a lot leaner than that. Like Reddit itself used to, before they started burning hundreds of millions trying to compete with every other social media site at once instead of being Reddit.
