My mom recently recounted an early-1970s childhood memory of seeing her own mother, crouched over the family dining room table, hand-checking the math of the newly acquired family calculator. Her mom (my grandma) used the calculator for feats like filing the family taxes and balancing the checkbook, but given the novelty of having semiconductors do the kind of work previously performed meticulously by hand, she wanted a second opinion (her own). After a moment marveling at something as quaint as “balancing the checkbook” (a phrase that evokes in me the same level of mystery as “undercollateralized lending” or “reinforcement learning with human feedback” probably evokes in my mom), I couldn’t help but relate it to the present.
Fast-forward to the 2020s, and LLMs are either hunting in-memory for the right answer (as Ilya Sutskever and others say) or generating the Garden of Eden, followed by civilization and an evolved formal theory of mathematics, in the time it takes to complete an API call answering “what’s 3+3?” (as Scott Alexander wrote about GPT-2 in 2019).
The above is just a fancy way of saying “LLMs sometimes take an inefficient path to the right answer.” Unlike a calculator bit-shifting 0s and 1s (systematic, direct), LLMs either perform an impressive feat of recall or use fuzzy estimations to arrive at approximate results (Anthropic confirms Scott Alexander’s conjectures). It costs GPT-4o ~$0.000035 to ask and answer “what’s 3+3?”[1][2] – which is probably more than the amortized cost of asking your iPhone calculator (e.g. over the lifecycle of your iPhone, what percent of the phone’s total cost went toward answering that one question, if we calculate percent as thumb-strokes per desired output?), or the energy required for our parietal lobes to answer that same question.
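For the curious, the arithmetic behind that number is simple enough to sketch. The prices and token counts below are my assumptions (list prices change; check OpenAI’s pricing page), but they land in the right ballpark:

```python
# Back-of-the-envelope cost of asking GPT-4o "what's 3+3?"
# Prices and token counts below are assumptions, not official figures.
INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens (assumed)

input_tokens = 5    # rough token count for the prompt (assumed)
output_tokens = 2   # rough token count for a terse answer (assumed)

cost = (input_tokens * INPUT_PRICE_PER_M
        + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
print(f"${cost:.6f} per question")  # ~$0.00003, the same ballpark as above
```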
It also may not always get the answer right.
o1 and o3-mini still struggle with their multiplication tables. And even when models do get the right answer on more advanced questions, the answer might not be worth what it cost to produce. In the realm of complex math, the organizers of the ARC Prize conceded that o3 is economically inefficient:
…you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile o3 requires $17-20 per task in the low-compute mode.
The efficiency falls further into question when we introduce things like AI agents and operators, who struggle to complete simple tasks without human intervention: like many people, I still would not trust an AI to book a flight for me. There is also now at least one startup dedicated to parachuting in humans when the AI agent gets stuck performing a task. There’s something hilariously ouroboros about this: it’s like hiring a maid to push around a Roomba.
Even startup founders are confessing healthy skepticism about the utility of AI benchmarks: do they actually point toward progress (the ability to automate previously ungeneralizable things like bug detection? the ability to discover novel insights?), or are they just well-marketed standardized tests whose only purpose is to show how well AIs perform in a pillow-lined LMArena?
In the 1970s, if you didn’t trust a calculator, you were mistaken, perhaps a bit superstitious, and the labor you performed to verify its work was redundant. In the 2020s, you actually do need to check an AI’s work, and the act of checking is only worthwhile if the aggregate time saved or value of the output far exceeds the opportunity cost (in time and value) of checking it. I think this is a generally fine state of things (in a recent interview, Sam Altman remarked, “If you want something deterministic, you should use a database.”), but it’s worth stating as an exercise in realism.
My experience of AI has been that it’s great for accelerating me in domains I’m already deeply opinionated in (e.g. I use AI to help me develop arguments and strawman myself, something I previously did Socratically in my own brain), and not as useful in areas where my understanding is only shallow. Even awareness of your limits is not enough: a couple of weeks ago, my boyfriend and I got stuck on an application we’re working on, which needs map data and web-scraping in order to run. To build it, we staged Cursor and GitHub Copilot in opposition to each other like warring deities: one was adamant that we should be building the app using a standard email login to the Google Maps API. The other was cajoling us to investigate OAuth2 as an alternative. We got stuck here for several hours: trying to debug why scraping data from Google Maps wasn’t working, creating workarounds for rate-limiting, going on sidequests to investigate OAuth2, and lamenting our lack of expertise.
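For context, the mundane detail at the center of that standoff: most Google Maps Platform endpoints authenticate with a plain API key passed as a query parameter, not an email login or OAuth2 (OAuth2 is mainly for APIs that act on a user’s behalf). A minimal sketch of a Places API call, with a hypothetical key standing in for one created in the Google Cloud console:

```python
# Minimal Google Maps Places API call: auth is a plain API key, no OAuth2.
# Assumes the Places API is enabled for this key in the Google Cloud console.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder, never hardcode a real key

resp = requests.get(
    "https://maps.googleapis.com/maps/api/place/textsearch/json",
    params={"query": "coffee in Brooklyn", "key": API_KEY},
    timeout=10,
)
resp.raise_for_status()
for place in resp.json().get("results", []):
    print(place["name"], place.get("formatted_address"))
```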
This was despite having prompted o1 at the beginning of this entire process with a two-page prospectus on what we wanted to accomplish, including an entire section devoted to outlining how much (little) knowledge of app development we had, and the degree of hand-holding we expected to need. We assumed that if we were forthright about our limitations, o1 would act as a spiritual medium between ourselves and the engineering underworld. We even used Ben Hylak’s instructional guide as a template. We were realistic; we didn’t expect to magically one-shot a solution. But despite getting a pretty thorough step-by-step guide from o1, we were lost within hours.
If you’re a designer, you know not just to ask for “an image” but how to implement a whole workflow to turn outputs from 4o into editable SVGs. If you’re making AI videos, your work looks better if you optimize frames per second. If you’re a developer, you know the difference between Next.js and Remix, that you need to pick a frontend framework if you’re deploying something to the internet, how to web-scrape properly, how to organize files in an IDE, and how to use git. As predicted in AI 2027: “The job market for junior software engineers is in turmoil: the AIs can do everything taught by a CS degree, but people who know how to manage and quality-control teams of AIs are making a killing.” You need not only a big-picture understanding of the output you want, but a microscopic understanding of the end-to-end tasks and inputs as well.
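To make the first of those concrete: somewhere in a “4o output to editable SVG” workflow there is usually a raster-to-vector tracing step. A minimal sketch of one way to do it (filenames hypothetical; assumes Pillow and the open-source potrace CLI are installed):

```python
# One step of a hypothetical "4o raster -> editable SVG" workflow:
# trace a bitmap into vector paths with Pillow + the potrace CLI.
import subprocess
from PIL import Image

# potrace expects a 1-bit bitmap (PBM), so threshold the PNG first.
Image.open("logo_from_4o.png").convert("1").save("logo.pbm")

# -s selects potrace's SVG backend; -o names the output file.
subprocess.run(["potrace", "logo.pbm", "-s", "-o", "logo.svg"], check=True)
```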
There are of course exceptions to this. A couple weeks ago, levelsio went viral with a vibe-coded flight simulator game, which was received with fanfare by some, and skepticism and scorn by others. Jonathan Blow, the creator of notoriously complex games like Braid and The Witness, was especially vocal, and engaged in a long back-and-forth trying to demonstrate the decades of networking and optimization wisdom that levelsio’s game was willfully ignoring. At one point he wrote “I'll put it this way: If you have never tried to make a game, I am sure it is fun to have a game-making experience.”
I thought this was a fascinating way of framing things. Do you want to understand what it means to be a game designer, or do you just want to play game designer tycoon? Or maybe it’s more like: “Do you actually like doing the thing, or do you just like looking like you did it?” Who are you actually signaling to?
To experts, LLMs represent the ability to augment your talents to the point where you’re overwhelmed with choice about what to spend time on (my friend Tina He had a great piece exploring this phenomenon). To novices, LLMs offer an achievement simulator, where you can cosplay as the person who has already earned the expertise. Experts can create custom debuggers for individual projects, or custom toolchains for their workflows. Novices know just enough about programming to prompt a project to life. Experts use LLMs as part of the process. To novices, the LLM is the process.
That was maybe phrased a bit pessimistically, but I don’t mean it that way. Going from a prompt to a jankily-working game in a couple of hours, sharing the prompts publicly, cutting corners, and creating something you don’t fully understand is the fun part. The great thing about the internet is that it actually supports audiences for both kinds of fulfillment: some portion of the population likes to make and consume higher-order art (Braid), while others revel in the “game-making experience,” performing for spectators who like cheering on the novelty of what that experience yielded. Games like Braid (or Baldur’s Gate 3, or Balatro[3], or any other great game ever made) have a massive audience. Games or projects created with LLMs have new, different kinds of audiences.
One interesting question is whether this kind of behavior gets codified into some kind of “Experience Economy 2.0.” The old notion of the experience economy (that now-antiquated, overused bit of propaganda from the mid-2010s) suggested millennials prefer buying experiences over material goods. The new notion of the “experience economy” is that people enjoy producing artifacts that emulate professional output. I chose the words in that previous sentence very carefully: I think the act of production is more important than what is produced. And to be clear: persisting (and knowing enough) to vibe-code a flight simulator is actually hard: you do need a baseline grasp of programming and networking architecture. Codification of this kind of work might come in the form of new distribution mechanisms and platforms, or end-to-end products that somehow make anything in the vibe-[x]-ing genre feel like an even more powerful simulator.
Whether or not a model is economically efficient, or even correct 100% of the time, is almost beside the point in this paradigm: in the new experience economy, correctness or cheapness isn’t the point; being in the simulator is. The assessment isn’t “is this model a more cost-effective, faster, and more accurate way of doing the thing?” The assessment is “is the experience I’m paying for worthwhile?” Yes, a calculator is the cheapest way to do math correctly. But doing math with a calculator doesn’t make you feel like a mathematician. Prompting o3 kind of does, though.
What happens if and when models dramatically improve? Is this period of time a short-lived blip, one where beginners (“why can’t I one-shot this?”), novices (“amazing, I can basically one-shot this”), and experts (“I created an end-to-end solution for the niche problem only I encounter when I use vim”) all have wildly different experiences of LLMs? Will they converge in the future, and will their outputs look the same? Is there a future where you really can just one-shot a prompt and emerge with a fully functional app that passes Jonathan Blow’s muster? And it’ll be cheap, too?
I’m not so sure. Experts are really good at moving the goalposts, especially when it comes to AI (this is not just another way of saying “AI is whatever hasn’t been done yet,” though I guess it kind of is). A couple of weeks ago, I read an article about Freestyle Chess (sometimes known as Fischer Random or Chess960), a form of chess introduced in 1996[4], in which the pieces on the back row of the board are arranged randomly. Chess champion Bobby Fischer proposed this variant as a way to overcome the fact that the game was turning into a test of rote memorization (for both humans and AIs), especially when it comes to openings. The randomness of Freestyle Chess is designed to make experts less complacent by making the game less familiar.
Freestyle Chess today is quite popular: Magnus Carlsen is currently commanding the leaderboard at the Paris Freestyle Chess Grand Slam (he is also playing 100k simultaneous opponents in a game of Chess960 on chess.com). Other well-known chess players, like Vincent Keymer, have become involved in promoting the variant. To me, this feels like a statement about what happens when a field becomes overly deterministic: the experts find a way to make it creative again.
Thank you to Sam for feedback on this post.
[1] For both input and output tokens. It’s unclear whether OpenAI is pricing this below, above, or at cost.
[2] When I calculated the number of output tokens, I was amused that 4o answered “2 + 2 = 4 😄” – which is three tokens. Without the smiley face, it would have been only two tokens. It’s expensive to have a personality.
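(If you want to check token counts yourself, OpenAI’s tiktoken library makes it a few lines; a quick sketch:)

```python
# Count tokens with OpenAI's tiktoken library (gpt-4o maps to o200k_base).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
print(len(enc.encode("2 + 2 = 4 😄")))  # with the emoji
print(len(enc.encode("2 + 2 = 4")))     # without it
```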
[3] The creator of Balatro had a fantastic blogpost on his process making the game, worth reading here.
[4] A year before Deep Blue beat Garry Kasparov, for those keeping score.