trained a prototype LLM (Grok-0) with 33 billion parameters. This early model approaches LLaMA 2 (70B) capabilities on standard LM benchmarks but uses only half of its training resources. In the last two months, we have made significant improvements in reasoning and coding capabilities leading up to Grok-1, a state-of-the-art language model that is significantly more powerful, achieving 63.2% on the HumanEval coding task and 73% on MMLU.
Hey everyone, I was wondering what you think about having a bot for nightly builds. This would allow some instances to test the latest changes and discover regressions early, so we don't get stuck with them until the next big release. I think this would be a great way to improve the user experience. A rough sketch of what the bot could run is below. What do you think?
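As a rough illustration only, here is a minimal Python sketch of what such a nightly job could do when triggered from cron: pull the latest changes, run the test suite, and flag failures as possible regressions. The checkout path and test command (`/srv/nightly/checkout`, `make test`) are placeholders, not anything we actually have set up.

```python
#!/usr/bin/env python3
"""Minimal nightly-build bot sketch: pull the latest changes, run the tests,
and report possible regressions. Paths and commands are placeholders."""

import subprocess
import sys
from datetime import datetime, timezone

REPO_DIR = "/srv/nightly/checkout"   # hypothetical location of the checkout
TEST_CMD = ["make", "test"]          # placeholder test command


def run(cmd, cwd=None):
    """Run a command and return (exit code, combined stdout/stderr)."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def main():
    stamp = datetime.now(timezone.utc).isoformat()

    # Update the working copy to the latest changes on the default branch.
    code, out = run(["git", "pull", "--ff-only"], cwd=REPO_DIR)
    if code != 0:
        print(f"[{stamp}] git pull failed:\n{out}")
        sys.exit(1)

    # Run the test suite against the freshly pulled changes.
    code, out = run(TEST_CMD, cwd=REPO_DIR)
    if code != 0:
        print(f"[{stamp}] nightly tests FAILED (possible regression):\n{out}")
        sys.exit(1)

    print(f"[{stamp}] nightly tests passed")


if __name__ == "__main__":
    main()
```

In practice this could just as easily be a scheduled CI job instead of a standalone script; the point is only that a nightly run against the latest changes would surface regressions between releases.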