Beating GPT-4 on HumanEval with a Fine-Tuned CodeLlama-34B (lemmy.world)

submitted 1 year ago by Blaed@lemmy.world to c/technology@lemmy.world


cross-posted from: https://lemmy.world/post/3879861


Hello everyone! This post marks an exciting moment for !fosai@lemmy.world and everyone in the open-source large language model and AI community.

We appear to have a new contender on the block: a model apparently capable of surpassing OpenAI's state-of-the-art GPT-4 in coding evals (evaluations).

This is huge. Not too long ago I made an offhand comment about us catching up to GPT-4 within a year. I did not expect that prediction to become reality in half the time. Let's hope this isn't a one-off and that we see a new wave of open-source models that begin to challenge OpenAI.

Buckle up, it's going to get interesting!

Here are some notes from the blog post, which you should visit and read in its entirety:

https://www.phind.com/blog/code-llama-beats-gpt4

Blog Post

We have fine-tuned CodeLlama-34B and CodeLlama-34B-Python on an internal Phind dataset. The resulting models achieved 67.6% and 69.5% pass@1 on HumanEval, respectively. GPT-4 achieved 67% according to OpenAI's official technical report from March. To ensure result validity, we applied OpenAI's decontamination methodology to our dataset.

The CodeLlama models released yesterday demonstrate impressive performance on HumanEval.

CodeLlama-34B achieved 48.8% pass@1 on HumanEval

CodeLlama-34B-Python achieved 53.7% pass@1 on HumanEval

We have fine-tuned both models on a proprietary dataset of ~80k high-quality programming problems and solutions. Instead of code completion examples, this dataset features instruction-answer pairs, setting it apart structurally from HumanEval. We trained the Phind models over two epochs, for a total of ~160k examples. LoRA was not used — both models underwent a native fine-tuning. We employed DeepSpeed ZeRO 3 and Flash Attention 2 to train these models in three hours using 32 A100-80GB GPUs, with a sequence length of 4096 tokens.
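As a quick sanity check on the training figures quoted above, here is a back-of-the-envelope sketch. It only uses the numbers from the blog post; the variable names are mine, and the token count is an upper bound that assumes every example fills the whole 4096-token window, which the post does not claim.

```python
# Figures quoted in the blog post (all approximate).
dataset_size = 80_000     # ~80k instruction-answer pairs
epochs = 2
gpus = 32                 # A100-80GB
wall_clock_hours = 3
max_seq_len = 4096        # tokens

examples_seen = dataset_size * epochs
gpu_hours = gpus * wall_clock_hours

# ASSUMPTION: upper bound only, as if every example used the full sequence length.
max_tokens_seen = examples_seen * max_seq_len

print(examples_seen)    # 160000
print(gpu_hours)        # 96
print(max_tokens_seen)  # 655360000
```

So the whole run comes to roughly 96 A100 GPU-hours, which is small compared with pretraining budgets.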

Furthermore, we applied OpenAI's decontamination methodology to our dataset to ensure valid results, and found no contaminated examples.

The methodology is:

For each evaluation example, we randomly sampled three substrings of 50 characters or used the entire example if it was fewer than 50 characters.

A match was identified if any sampled substring was a substring of the processed training example.
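The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not Phind's or OpenAI's actual code. All function names are hypothetical, and the `normalize` step is my assumption about what "processed" means (stripping whitespace and symbols); the post does not spell it out.

```python
import random

SUBSTRING_LEN = 50  # characters per sampled substring
NUM_SAMPLES = 3     # substrings sampled per evaluation example


def normalize(text: str) -> str:
    # ASSUMPTION: "processed" means keeping only letters and digits.
    return "".join(ch for ch in text if ch.isalnum())


def sample_substrings(example: str, rng: random.Random) -> list[str]:
    """Return three random 50-char substrings, or the whole example if shorter."""
    if not example:
        return []
    if len(example) < SUBSTRING_LEN:
        return [example]
    last_start = len(example) - SUBSTRING_LEN
    starts = [rng.randint(0, last_start) for _ in range(NUM_SAMPLES)]
    return [example[s:s + SUBSTRING_LEN] for s in starts]


def is_contaminated(eval_example: str, training_example: str,
                    rng: random.Random) -> bool:
    """A match is any sampled substring found inside the training example."""
    probes = sample_substrings(normalize(eval_example), rng)
    haystack = normalize(training_example)
    return any(probe in haystack for probe in probes)


def count_contaminated(eval_set, training_set, seed: int = 0) -> int:
    """Count training examples that match at least one evaluation example."""
    rng = random.Random(seed)
    return sum(
        any(is_contaminated(e, t, rng) for e in eval_set)
        for t in training_set
    )
```

Because the substrings are sampled at random, this is a probabilistic check: a training example that overlaps an evaluation example only partially can slip through on a given run.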

For further insights on the decontamination methodology, please refer to Appendix C of OpenAI's technical report. Presented below are the pass@1 scores we achieved with our fine-tuned models:

Phind-CodeLlama-34B-v1 achieved 67.6% pass@1 on HumanEval

Phind-CodeLlama-34B-Python-v1 achieved 69.5% pass@1 on HumanEval
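For anyone unfamiliar with the metric: pass@k is the probability that at least one of k generated samples for a problem passes its unit tests, averaged over all problems. Below is a small sketch of the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass. The blog post does not say how many samples per problem were generated or which decoding settings were used, so the data here is made up and the helper names are mine.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int = 1) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: (samples generated, samples passed) per problem.
results = [(1, 1), (1, 0), (1, 1), (1, 0)]
print(f"{benchmark_pass_at_k(results, k=1):.1%}")  # 50.0%
```

For k = 1 the estimator reduces to c / n, so with one sample per problem pass@1 is simply the fraction of problems solved on the first attempt.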

Download

We are releasing both models on Hugging Face for verifiability and to bolster the open-source community. We welcome independent verification of results.

https://huggingface.co/Phind/Phind-CodeLlama-34B-v1

https://huggingface.co/Phind/Phind-CodeLlama-34B-Python-v1

If you get a chance to try either of these models out, let us know how it goes in the comments below!

If you found anything about this post interesting, consider subscribing to !fosai@lemmy.world.

Cheers to the power of open-source! May we continue the fight for optimization, efficiency, and performance.

top 11 comments


[-] drspod@lemmy.ml 6 points 1 year ago

Is this model trained specifically for problem solving, or does it also perform as well as ChatGPT on conversational and generic text-generation tasks?

[-] L_Acacia@lemmy.one 6 points 1 year ago

Specifically problem solving. ChatGPT uses multiple models too; that is just hidden from the user.

[-] 0421008445828ceb46f496700a5fa6@kbin.social 3 points 1 year ago

We fine-tuned on a proprietary dataset of ~80k high-quality programming problems and solutions.

[-] ChrisLicht@lemm.ee 3 points 1 year ago

Dumb question: Does one install the Python model, or access it online?

[-] L_Acacia@lemmy.one 6 points 1 year ago* (last edited 1 year ago)

The best way to run a Llama model locally is with Text generation web UI. The model will most likely be quantized to 4- or 5-bit GGML/GPTQ today, which will make it possible to run on a "normal" computer.

Phind might make it accessible on their website soon, but it doesn't seem to be the case yet.

EDIT: Quantized versions are available thanks to TheBloke.
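To give a rough idea of why quantization matters here, this sketch estimates the memory needed just for the weights of a 34B-parameter model at different bit widths. It is plain arithmetic under my own assumptions: real GGML/GPTQ files carry some extra overhead (for example, quantization scales), and inference also needs memory for activations and the KV cache, so treat these as lower bounds.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 34e9  # nominal parameter count of a 34B model

for bits in (16, 5, 4):
    print(f"{bits}-bit: {weight_memory_gb(N_PARAMS, bits):.2f} GB")
# 16-bit: 68.00 GB
# 5-bit: 21.25 GB
# 4-bit: 17.00 GB
```

At 16 bits the weights alone exceed any consumer GPU, while the 4-bit figure is within reach of a machine with enough RAM or a high-end consumer card.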

[-] ChrisLicht@lemm.ee 2 points 1 year ago

You are awesome; thanks for the clue-in!

[-] babysharknanana@lemmy.world 1 points 1 year ago

Exciting, but as far as I know, we can't use LLaMA commercially. So I ask myself how to use it in a non-commercial context. Isn't it expensive to embed such a model in free/open-source software?

[-] L_Acacia@lemmy.one 3 points 1 year ago

Llama 2 now uses a license that allows for commercial use.

[-] babysharknanana@lemmy.world 1 points 1 year ago

I know, but the text only talks about Llama. So this is using Llama 2?

[-] abhibeckert@lemmy.world 2 points 1 year ago* (last edited 1 year ago)

Llama 2 and Llama are basically the same model, except the "2" version has a more permissive license and was trained on a larger source data set. Nobody should ever use the old one, and I expect the noncommercial license was part of a contract Meta signed with someone who provided source material.

This is "CodeLlama", which was built on Llama 2 and allows commercial use.

[-] backgroundcow@lemmy.world 0 points 1 year ago* (last edited 1 year ago)

I understand LLaMA and some other models come with instructions that say they cannot be used commercially. But unless the creators can show that you have formally accepted a license agreement to that effect, on what legal grounds can that be enforced?

If we look at the direction US law is moving, it seems the current legal theory is that AI-generated works fall in the public domain. That means restricting their commercial use should be impossible regardless of other circumstances: public domain means that anyone can use them for anything. (But it also means that your commercial use isn't protected from others likewise using the exact same output.)

Consider instead what legal grounds could support restrictions on the output of these models if you never agreed to a license agreement to access the model. Copyright doesn't restrict use; it restricts redistribution. The creators of LLMs cannot reasonably take the position that output created from their models is a derivative work of the model when the model itself is created from copyrighted works, many of which they have no right to redistribute. The whole basis of LLMs rests on the premise that "training data" -> "model" produces a model that isn't encumbered by the copyright of the training data. How can one take that position and simultaneously believe that "model" -> "inferred output" produces copyright-encumbered output? That would be a fundamentally inconsistent view.

(Note: the above is not legal advice, only free-form discussion.)

this post was submitted on 26 Aug 2023

81 points (92.6% liked)
