Posts

2025.FEB.23

Marshmallow Test and Parenting

The marshmallow experiment is famous: a little kid in a room staring at a marshmallow. If they wait 15 minutes, they get two marshmallows instead of just one. Some kids would poke the marshmallow, lick it, or just gobble it up. Others found clever ways to distract themselves – singing, closing their eyes, even falling asleep. The results – children who waited supposedly went on to achieve higher scores in school and better life outcomes. The message was clear: if you can delay gratification, you’re set for life. But later studies revealed some serious holes in that conclusion.

Emphasis mine. The rest of the post covers the many ways the original conclusion was wrong. Easy read.

2025.JAN.26

Speed matters: Why working quickly is more important than it seems

James Somers:

The obvious benefit to working quickly is that you'll finish more stuff per unit time. But there's more to it than that. If you work quickly, the cost of doing something new will seem lower in your mind. So you'll be inclined to do more.

The converse is true, too. If every time you write a blog post it takes you six months, and you're sitting around your apartment on a Sunday afternoon thinking of stuff to do, you're probably not going to think of starting a blog post, because it'll feel too expensive.

What's worse, because you blog slowly, you're liable to continue blogging slowly—simply because the only way to learn to do something fast is by doing it lots of times.

This is true of any to-do list that gets worked off too slowly. A malaise creeps into it. You keep adding items that you never cross off. If that happens enough, you might one day stop putting stuff onto the list.

That hit hard. Read the whole post; it’s well worth the time.

2025.JAN.26

Humanity's Last Exam

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. The dataset consists of 3,000 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held out questions to assess model overfitting.

The sample questions are fun to go through, as a way of understanding the level of expertise these models are going to end up at, eventually. Eventually is the keyword there — even the best frontier models do very poorly on this benchmark right now.

Via: Installer newsletter by The Verge

2025.JAN.12

Book: AI Engineering by Chip Huyen

From the book's Preface:

This book provides a framework for adapting foundation models, which include both large language models (LLMs) and large multimodal models (LMMs), to specific applications.

There are many different ways to build an application. This book outlines various solutions and also raises questions you can ask to evaluate the best solution for your needs.

I picked up this book after reading its preface (through the free sample on Amazon). I’m excited to work through it over the next few weeks.

Although much of what the book covers can be learned by digging through free resources online, I find the way a book is organized super helpful. It pulls everything together, letting me explore the topics both broadly and deeply without getting lost.

(via Simon Willison)

2025.JAN.11

Things we learned about LLMs in 2024

Simon Willison:

A lot has happened in the world of Large Language Models over the course of 2024. Here’s a review of things we figured out about the field in the past twelve months, plus my attempt at identifying key themes and pivotal moments.

It’s a long read, but an excellent post that gives a good sense of the action around LLMs over the last year.

2024.NOV.14

First Post

This post has one purpose: to get me off my bum and writing something to get this blog going. Don't expect coherence. In fact, you might want to stop reading now and do something more productive with your time.

Read Now →