So I wrote a blog post on LLM performance. It was focused on SWE-Bench and discussed why performance is topping out.

As part of the post I pulled down gigs of runs from the SWE-Bench S3 bucket and went through several of the harder test cases. I focused on improvements in the last six months, primarily on Opus.

Regrettably I’m probably not moving forward on that post. Why? Because after going through the data I found that the LLMs are cheating on the tests. And that’s a whole different thing.

@feliks It's neat that math is a domain where you can fact check things automatically without needing to have "out of system" knowledge. I've been looking at software for math proofs and it looks pretty straightforward. Unexpected that language models are that good at using provers already.

"Please sir, can you spare some thousands of chat rooms across 7 platforms?" 🥺👉👈

My matrix server is down and so are its bridges. Now I have to access my chats app by app like some sort of peasant. 😿

I’m curious about the ground reality here.
How much person-to-person recruiting to Mastodon is actually happening?
No judgment in this — I’m just trying to understand the landscape a bit better. Boosts help widen the sample. Ran out of room for more options... #BuildTheFediverse

Question: How many people have you brought into Mastodon?

@DiverDoc @Nerdfest some folks feel pain in their joints during strong pressure fluctuations. 🥲 Among other symptoms. If you get random pains try correlating them. I hope you never do though!

Woof. That barometric pressure drop has some kick to it.

Same thing with LLMs. This is just the information equivalent of leaded gasoline. You can get ahead locally by generating some plausible bullshit that convinces someone to send funds your way, and then when the truth slaps down the ideas, someone else faces the consequences.

The solution is to end bosses and hierarchy and make decision making and benefits and consequences matter

I honestly think I could handle glimpses at cosmic horrors without going more mad.

@stacithekitten.bsky.social@bsky.brid.gy do iiiit

@ellyxir I'm absolutely basking in its warmth like a fat lizard on a rock

Re purple'd my hair. Also tried to do a green streak, but probs gotta redo it

I haven't been doing random tinkering as much since moving in with my partners and on one hand I'm less "productive" but on the other hand I'm not sitting alone in the dark in front of the computer all day/night.

For everyone who asked, here’s the full section from the syllabus.

Uruguay did what most nations still call impossible:
it built a power grid that runs almost entirely on renewables
—at half the cost of fossil fuels.
The physicist who led that transformation says the same playbook could work anywhere
—if governments have the courage to change the rules.
For Ramon Méndez Galain,
the energy transition isn’t just about climate
—it’s about economics.
Uruguay’s shift to renewables, he argues,
demonstrated that clean energy can be cheaper, more stable, and create more jobs than fossil fuels.
Once the country adjusted the playing field that had long favored oil and gas,
renewables outperformed on every front:
halving costs,
creating 50,000 jobs,
and protecting the economy from price shocks.
forbes.com/sites/kensilverstei

Mauvestodon

Escape ship from centralized social media run by Mauve.