Tagged: llm

Feed
2026.FEB.18

Codex CLI vs Claude Code on Autonomy

nilenso:

I spent some time studying the system prompts of coding agent harnesses like Codex CLI and Claude Code. These prompts reveal the priorities, values, and scars of their products. They're only a few pages each and worth reading in full, especially if you use them every day. This approach to understanding such products is more grounded than the vibe-based takes you often see in feeds.

While there are many similarities and differences between them, one of the most commonly perceived differences between Claude Code and Codex CLI is autonomy, and in this post I'll share what I observed. We tend to perceive autonomous behaviour as long-running, independent, or requiring less supervision and guidance. Reading the system prompts, it becomes apparent that the products make very different, and very intentional choices.

Very interesting comparison. But I don't believe the difference in behaviour is primarily, or perhaps even significantly, driven by the system prompts. The difference is far more ingrained, most likely RL'd in during post-training.

Why do I say this? I've been using both models in the Pi coding agent with its default system prompt[1], which is both really small and the same for all models. And even in Pi, this difference in behaviour comes across clearly.[2]

Footnotes

  1. Pi allows us to replace the entire system prompt by placing a markdown file at ~/.pi/agent/SYSTEM.md

  2. I feel that the models both behave better in Pi than in their respective canonical harnesses; but this is a very subjective opinion.

2025.APR.13

AI adoption is a UX problem

Nan Yu:

These tools are casually dismissed as “GPT wrappers” by some industry commentators — after all, ChatGPT (or Sonnet or Gemini or Llama or Deepseek) is doing all the “real work”, right?

People who take this perspective seem to be throwing away all the lessons we’ve learned about software distribution. It’s like they saw Instagram and waved it off as an “ImageMagick wrapper”… or Dropbox as an “rsync wrapper”.

Those products won because they made powerful, highly technical tools accessible through thoughtful design. The biggest barrier to mass AI adoption is not capability or intelligence; we have those in spades. It’s UX.

Amen.

2025.APR.06

Building Python tools with a one-shot prompt using uv run and Claude Projects

A clever use of uv run’s inline dependency management and Claude Project Custom Instructions to create Python scripts that are easy to run without any setup, even while depending on Python’s rich ecosystem of libraries.

I’ve used this workflow for a few scripts in the last couple of weeks, and it works remarkably well.

You can then go a step further — add uv to the shebang line of a Python script to make it a self-contained executable.
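As a minimal sketch of what that looks like: the shebang invokes uv via `env -S`, and the `# /// script` comment block is uv’s inline script metadata declaring the Python version and dependencies. The `requests` dependency and the URL here are illustrative choices, not from the original post.

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = ["requests"]
# ///
# `uv run` reads the metadata block above, creates an isolated
# environment with `requests` installed, and then runs the script.
import requests

resp = requests.get("https://example.com", timeout=10)
print(resp.status_code)
```

After `chmod +x`, the script runs directly (`./fetch.py`) with no manual virtualenv or pip step; uv resolves and caches the dependencies on first run.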

2025.JAN.26

Humanity's Last Exam

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. The dataset consists of 3,000 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held-out questions to assess model overfitting.

The sample questions are fun to go through, as a way of understanding the level of expertise these models are going to end up at, eventually. Eventually is the keyword there — even the best frontier models do very poorly on this benchmark right now.

Via: Installer newsletter by The Verge

2025.JAN.12

Book: AI Engineering by Chip Huyen

From the book's Preface:

This book provides a framework for adapting foundation models, which include both large language models (LLMs) and large multimodal models (LMMs), to specific applications.

There are many different ways to build an application. This book outlines various solutions and also raises questions you can ask to evaluate the best solution for your needs.

I picked up this book after reading its preface (through the free sample on Amazon). I’m excited to work through it over the next few weeks.

Although much of this material can be learned by digging through free resources online, I find the way a book is organized super helpful. It pulls everything together, letting me explore the topics both broadly and deeply without getting lost.

(via Simon Willison)

2025.JAN.11

Things we learned about LLMs in 2024

Simon Willison:

A lot has happened in the world of Large Language Models over the course of 2024. Here’s a review of things we figured out about the field in the past twelve months, plus my attempt at identifying key themes and pivotal moments.

It’s a long read, but an excellent post that gives a good sense of the action around LLMs over the last year.