The eval that saved our butts: behind the scenes of Lenny's Newsletter
Before going out to 1.3 million readers, our AI had an existential crisis.
Just a quick heads up, next week is getting really busy:
On Monday, Hilary Gridley, Aman Khan and I are banding together to bring you a free Lightning Lesson, How to Know What AI Products to Build, with a repeatable system for validating AI ideas cheaply.
On Thursday, Aman and I are embarking on our second cohort of Build AI Product Sense. Thanks to Lenny, you get $375 off and $1,395 in free credits.
Looking forward to seeing you at one or both!
Two days ago, Aman Khan and I published our guest post in Lenny’s Newsletter that is deceptively not a post, but instead one big prompt for Cursor to teach you Cursor from inside Cursor. (If you haven’t experienced it yet, do that now, then come back.)
Before sending it out to 1.3 million subscribers, we ran usability studies with about a dozen people. We sent them the instructions and asked them to record themselves going through the process. We then watched each recording end to end (we were not gonna let AI get in between us and this gold) and made incremental improvements between each round.
The feedback leveled off and we felt good. By the time Lenny gave it a spin, we thought we’d seen everything.
Famous last PM words…
Of course Lenny discovered a new bug. Of course it was the worst one of all: the LLM found itself in a weird existential crisis, the “game over” kind with no clear way to recover, made worse by the fact that it misleads the reader along the way.
The lazy PM voice in my brain went, “Meh! One in a dozen… how often would this actually happen?” The less-lazy PM voice in my brain said, “You could simulate this a bunch of times automatically and see.”
In a normal product, Aman and I could use an observability platform to copy Lenny’s chat thread and re-run it at scale. Unfortunately, since it’s an open-source experience that runs locally on each person’s laptop, we gave up all analytics and traces.
We did, however, have Lenny’s screenshare video. I manually reconstructed Lenny’s AI chat thread by repeatedly hitting pause and copying the text on his screen.
Here’s how I got a sense of how often it was really happening:
I recreated Lenny’s conversation inside an eval platform (any of them work). The left side is the simulation: what Lenny pasted in, what the LLM said back, Lenny’s first response, the tool call, the faked tool response. I rebuilt the whole thread up to the moment right before it went haywire.
I created 10 duplicate rows to start, which would allow me to run 10 alternate universes where Lenny hits that point.
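If you wanted to script those alternate universes instead of duplicating rows in an eval platform UI, the idea looks roughly like this. Everything here is a hypothetical sketch: the thread contents, the `replay_thread` helper, and the `fake_model` stand-in (a real run would call your model API, and sampling temperature is what makes each run differ) are all my own placeholder names, not part of any actual tool.

```python
import random

def replay_thread(messages, call_model, n=10):
    """Replay a reconstructed chat thread n times and collect each
    alternate-universe completion."""
    return [call_model(messages) for _ in range(n)]

# Hypothetical reconstructed thread (roles and text are placeholders).
thread = [
    {"role": "user", "content": "Paste of the newsletter prompt"},
    {"role": "assistant", "content": "Model's first reply"},
    {"role": "user", "content": "Reader's follow-up"},
]

# Stand-in for a real LLM call, so the sketch runs on its own.
def fake_model(messages):
    return random.choice(["normal answer", "existential crisis"])

results = replay_thread(thread, fake_model, n=10)
print(len(results))  # 10 completions to eyeball
```

The point is only that the reconstructed thread becomes a fixed prefix, and each re-run is one more universe where the reader hits that exact moment.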
It was happening a lot.
Ok bug, you’ve caused this PM to get up from his chair and… edit the prompt! Minus getting up from my chair.
Now, I could just look at the results myself and see if the changes worked… or I could create an automated eval to do it for me.
I created an “LLM as a judge” or “Model-based grader” that would evaluate the result. As the subject matter expert, I had the eval assert “is it having an existential crisis” (not in those words, see video).
I kept tweaking the eval prompt and seeing if it evaluated correctly until I trusted it.
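A model-based grader like that boils down to a second prompt plus a verdict parser. This is a minimal sketch of the shape, not the actual eval I wrote: the judge prompt wording, the `grade` helper, and the `fake_judge` stub (a real run would call a grader model) are all illustrative.

```python
# Hypothetical judge prompt; the real one is more specific (see video).
JUDGE_PROMPT = """You are grading a chat assistant's reply.
Does the reply show the assistant confused about its own situation
or instructions? Answer with exactly PASS (no crisis) or FAIL (crisis).

Assistant reply to grade:
{reply}
"""

def grade(reply, call_judge):
    """LLM-as-judge: ask a grader model for PASS/FAIL and parse it."""
    verdict = call_judge(JUDGE_PROMPT.format(reply=reply)).strip().upper()
    return verdict.startswith("PASS")

# Stub judge so the sketch runs standalone; it only inspects the
# reply portion of the prompt, after the marker line.
def fake_judge(prompt):
    reply = prompt.split("Assistant reply to grade:")[-1]
    return "FAIL" if "crisis" in reply else "PASS"

print(grade("Here's your next step...", fake_judge))  # True
print(grade("I am in a crisis", fake_judge))          # False
```

The tweak-until-you-trust-it loop is exactly this: run `grade` on replies you’ve already judged by hand, and adjust the judge prompt until its verdicts match yours.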
Now that I trusted my eval, I could iterate on the post’s prompt itself. I didn’t start with any professional-grade prompting, more just clarified and removed wishy-washy language.
Each time I made a change, I re-ran the 10 alternate universes. When those were done, I ran the eval on each of the 10 results.
As the original prompt got better and better, the results started to look more and more green. After a bunch of iterations, I got to 9/10 passing, which is way better than what I had before.
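The inner loop reduces to a pass rate over the simulated runs. A tiny illustrative sketch (the `pass_rate` helper and the sample data are made up, not my actual harness):

```python
def pass_rate(runs, grader):
    """Fraction of simulated runs the judge marks as passing."""
    return sum(1 for r in runs if grader(r)) / len(runs)

# Toy data mirroring a 9/10 outcome; a real grader would be the
# LLM-as-judge call, not a keyword check.
runs = ["fine reply"] * 9 + ["existential crisis"]
print(pass_rate(runs, lambda r: "crisis" not in r))  # 0.9
```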
From here, I could get more certainty by raising the number of parallel Lenny universes.
All this to say, I’d love to hear of any weird experiences you’re having (email me at talsraviv at gmail.com). Maybe there’s another eval Aman and I need to build! If so, like Lenny, we’ll probably ask for your help reconstructing your trace.
-Tal


