Posts

2025.JAN.26

Humanity's Last Exam

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. The dataset consists of 3,000 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held out questions to assess model overfitting.

The sample questions are fun to go through as a way of understanding the level of expertise these models will eventually reach. Eventually is the key word there: even the best frontier models do very poorly on this benchmark right now.

Via: Installer newsletter by The Verge

2025.JAN.12

Book: AI Engineering by Chip Huyen

From the book's Preface:

This book provides a framework for adapting foundation models, which include both large language models (LLMs) and large multimodal models (LMMs), to specific applications.

There are many different ways to build an application. This book outlines various solutions and also raises questions you can ask to evaluate the best solution for your needs.

I picked up this book after reading its preface (through the free sample on Amazon). I’m excited to work through it over the next few weeks.

Although we can learn much of what the book covers by digging through free resources online, I find the way a book is organized super helpful. It pulls everything together, letting me explore the topics both broadly and deeply without getting lost.

(via Simon Willison)

2025.JAN.11

Things we learned about LLMs in 2024

Simon Willison:

A lot has happened in the world of Large Language Models over the course of 2024. Here’s a review of things we figured out about the field in the past twelve months, plus my attempt at identifying key themes and pivotal moments.

It’s a long read, but an excellent post that gives a good sense of the action around LLMs over the last year.