So I wrote a blog post on LLM performance. It was focused on SWE-Bench and discussed why performance is topping out.
As part of the post I pulled down gigs of runs from the SWE-Bench S3 bucket and went through several of the harder test cases, focusing on improvements over the last six months, primarily on Opus.
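If you want to poke at the raw runs yourself, here's a rough sketch of how you'd pull them down with boto3. The bucket and prefix names are placeholders, not the actual SWE-Bench layout, so swap in whatever the leaderboard docs point at:

```python
# Minimal sketch: mirror evaluation runs from a public S3 bucket locally.
# BUCKET and PREFIX are placeholder names, not the real SWE-Bench paths.
import os

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "swe-bench-experiments"   # placeholder bucket name
PREFIX = "evaluation/"             # placeholder prefix for run logs
DEST = "runs"

# Public buckets can be read anonymously via unsigned requests.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        local = os.path.join(DEST, key)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        s3.download_file(BUCKET, key, local)
        print(f"downloaded {key}")
```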
Regrettably I’m probably not moving forward on that post. Why? Because after going through the data I found that the LLMs are cheating on the tests. And that’s a whole different thing.
@feliks It's neat that math is a domain where you can fact-check things automatically without needing any "out of system" knowledge. I've been looking at proof-assistant software and it seems pretty approachable. I'm surprised language models are already that good at using provers.
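To make the "automatic fact checking" part concrete, here's roughly what that looks like in Lean. This is a toy example I wrote by hand, not output from any actual model run; the kernel either accepts the proof or rejects it, no outside judgment involved:

```lean
-- A statement plus a proof term; the checker verifies it mechanically.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- The kind of small arithmetic fact a model might emit;
-- `decide` makes the kernel evaluate and verify it directly.
example : 2 ^ 10 = 1024 := by decide
```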