So I wrote a blog post on LLM performance, focused on SWE-Bench and why scores there are topping out.

As part of the post I pulled down gigabytes of runs from the SWE-Bench S3 bucket and went through several of the harder test cases, focusing on improvements over the last six months, primarily on Opus.
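
If you want to do the same pull, here's a minimal sketch using boto3 with unsigned (anonymous) requests. The bucket and prefix names are hypothetical placeholders, not confirmed SWE-Bench values; swap in whatever the SWE-Bench docs actually point at.

    from pathlib import Path

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous client: public buckets don't need credentials.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    BUCKET = "swe-bench-experiments"  # hypothetical name, check the docs
    PREFIX = "evaluation/"            # hypothetical prefix

    # Walk every object under the prefix and mirror it locally.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            dest = Path("runs") / obj["Key"]
            dest.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], str(dest))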

Regrettably I’m probably not moving forward on that post. Why? Because after going through the data I found that the LLMs are cheating on the tests. And that’s a whole different thing.

@feliks It's neat that math is a domain where you can fact-check things automatically without needing "out of system" knowledge. I've been looking at proof assistants and they seem pretty straightforward. I didn't expect language models to be this good at using provers already.
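
Roughly what I mean by "no out-of-system knowledge": a one-line Lean 4 proof (hypothetical theorem name, stock Lean, no extra libraries) that the checker either accepts or rejects mechanically.

    -- Lean type-checks the proof term itself; no human judgment involved.
    theorem my_add_comm (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

If the term were wrong, the checker would simply reject it, which is exactly what makes prover output so easy to grade automatically.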

Imagine if we set aside 0.1% of all the money being spent to make AIs that write bad software to compensate all the open-source developers who write good software

"Please sir, can you spare some thousands of chat rooms across 7 platforms?" 🥺👉👈

My Matrix server is down and so are its bridges. Now I have to access my chats app by app like some sort of peasant. 😿
