A primitive for enabling environments: early work on machine-generated prompts
November 2024. Part of “Letters from the Lab”, a series of informal essays on my research written for patrons.
I want to create a new kind of spaced repetition system—a new kind of enabling environment—with a different central primitive.
The need for a new central primitive
The central primitive of existing spaced repetition systems is the flashcard. If you want to remember some fact, you transform it into flashcard form and add it to your system. The system doesn’t know about the fact you want to remember, and it certainly doesn’t know how that fact connects to other things you’re thinking about. The system knows about flashcards.
When you want to learn a simple fact, like the capital of a country or some foreign-language vocabulary, that arrangement works well enough. But spaced repetition systems are important because they can be used to support all kinds of learning, to help you internalize all kinds of ideas more deeply. The trouble is that—in my experience—the further you get from that simple fact–flashcard correspondence, the less well these systems work, and the more difficult they are to use.
Michael Nielsen and I have argued that “memory system” is a more powerful framing for this problem space than “spaced repetition system”. The latter phrase emphasizes one tactic. Cognitive psychologists have characterized a wide variety of deep facts about human memory, and we should exploit as many as possible. But another reason to prefer “memory system” is that this phrase is closer to what you actually want. You want a system which causes you to robustly remember. (And in fact, you want even more, as we’ll soon discuss.)
For robust memory of vocabulary, mere spaced repetition may be enough. But for more complex ideas, I find that flashcards alone often produce a brittle memory. Once I’ve seen a flashcard a few times, the specific wording will often act as a strong cue. I’ll remember the answer not because I’m actually processing what the words mean, but through kneejerk pattern matching. An effective memory system would help me build robust memory by presenting the idea with different cues, from different angles, through different connections, so that I encode the memory in many ways.
We can zoom the frame back again. Most of the time, I don’t just want to remember; I want to learn. I want those topics I’m practicing to be alive and functional. I want to be able to apply the material flexibly and fluently. I want my understanding to deepen over time. I want to see new implications and have new ideas. Piotr Wozniak, inventor of the modern spaced repetition system, has written that you should learn before you memorize. But as I see it, both from experience and from my understanding of cognitive architecture, this is an ongoing parallel process. As we exercise and elaborate material we’ve learned, we make new connections and understand more deeply. Memorization is intertwined with that process, not something that happens before or after. (I’m probably over-simplifying Piotr’s view here to make the point. I expect he’d agree in broad strokes with what I’m saying. He might fairly say that those new connections and deeper understandings are actually different knowledge atoms, so what’s actually happening is a sequence of related learn/memorize phase pairs. At a high level, from the perspective of system design, my point stands: learning and memorization are intertwined as part of an ongoing process; a good learning system should support that process.) And so, even the term “memory system” is inadequate. Our aspirational system needs a broader name, which is one reason I’ve often used the term “enabling environment”.
All this is awfully cognitive. Much of what I put in my spaced repetition system isn’t. I’ll capture an insightful observation from a friend over dinner, or a beautiful quote, or something someone did that surprised me. The point here isn’t really to memorize. It’s to be changed: to metabolize an experience so that I feel or act differently in the future. But that point is also true of much of my more traditional learning. When I study music theory, I don’t just want to learn; I want to feel and act differently when I play music. When I study a historical figure, I don’t just want to learn; I want to add their way of looking at the world to my own set of lenses, so that I experience the world differently. So that, in some small but real way, I become a different person. (This frame has some clear connections to the project I described in last month’s letter, “Towards scalable blip cultivation”.) A system focused on this would be an “enabling environment” in a much deeper sense.
An alternative primitive: situated ideas
With those lofty goals articulated, I can now state my central complaint about flashcards: they’re static. They’re “dead fish”. Robust memory requires varied cues and connectivity; robust learning requires rising depth and complexity; robust metabolization requires contextuality and vividness.
To bring dynamism to these systems, I think we need new central primitives. Today’s spaced repetition systems are structured around flashcards: adding them, organizing them, scheduling them. I think we need to move upstream. If we want our review sessions to vary and deepen and connect over time, we can’t just supply a static task. In fact, if the goal is to support transfer learning, we can’t write the task ourselves at all: transfer requires surprise. We need to somehow point to the idea which inspired that task, situated within the context which inspired us, so that a stream of varying and deepening tasks can emanate from it over time. (This idea has been percolating for a while; for earlier related discussions, see: Lessons from summer 2022’s mnemonic medium prototype (Oct ’22); Fluid practice for fluid understanding (May ’23); Highlight-driven practice and comprehension support (Sep ’23); How might we learn? (May ’24).)
If I’m reading a book, a natural way to point at an idea is by literally pointing at some part of the prose, perhaps with a highlighter, and perhaps with some marginal comments about what I found meaningful. If I’m reflecting on a conversation or experience, the same approach might work for my journal or notes.
More concretely, instead of question/answer fields, the primitive I have in mind—a “situated idea”?—would store:
- a pointer to relevant context, with full text (e.g. a book or a journal)
- a range (or ranges?) within that context representing the idea to be metabolized, like a highlight you made in your book
- an optional extra comment clarifying your intent or interest, like marginalia you wrote next to your highlight
And then the system would synthesize appropriate activities over time, based on that input, and on connections with other situated ideas in related contexts.
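As a rough sketch of the shape such a record might take, here is one possible Python representation. Every name here (`SituatedIdea`, `Context`, `TextRange`) is an assumption made for illustration, as is the choice of character offsets for ranges and the made-up journal text; nothing in it describes an actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Context:
    """The source an idea lives in, e.g. a book chapter or a journal entry."""
    title: str
    full_text: str


@dataclass(frozen=True)
class TextRange:
    """A half-open character span [start, end) within Context.full_text."""
    start: int
    end: int


@dataclass
class SituatedIdea:
    context: Context
    ranges: list[TextRange]          # one or more highlighted spans
    comment: Optional[str] = None    # optional marginalia clarifying intent

    def __post_init__(self) -> None:
        # Reject ranges that are empty, reversed, or outside the source text.
        n = len(self.context.full_text)
        for r in self.ranges:
            if not (0 <= r.start < r.end <= n):
                raise ValueError(f"range {r} falls outside the context text")

    def highlighted_text(self) -> list[str]:
        """Return the text of each highlighted span, in order."""
        return [self.context.full_text[r.start:r.end] for r in self.ranges]


# Hypothetical journal entry, invented for this example.
journal = Context(
    title="Journal",
    full_text="Dinner with a friend. She said practice should feel like play.",
)
start = journal.full_text.index("practice")
idea = SituatedIdea(
    context=journal,
    ranges=[TextRange(start, len(journal.full_text) - 1)],  # drop the final period
    comment="How would this change my piano sessions?",
)
print(idea.highlighted_text())
```

Note what is absent: there are no question or answer fields. Under this sketch, tasks would be derived from the record at review time rather than stored in it.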
All this roughly mimics the work that professional instructional designers do: given a set of “knowledge points” introduced in a text, they construct a series of activities (worked examples, exercises, reflections) and present them in varying and deepening ways over time. In some cases, that sequence may even respond to your performance, though few courses will reinforce your memory as effectively as a spaced repetition system.
The key difference in the system I’m proposing is that it shifts the locus of control from the instructional designer to the user. That was the most important lesson from my work with the mnemonic medium these past few years: self-motivated adult readers rarely want to passively study whatever an author tells them. People want different things from a text. They have different goals. One learner wants to learn the theorems; another wants to be able to prove them. They have different backgrounds. One learner will need a lot of reinforcement in one spot; another in a different spot. And they’re interested in different subtopics. One learner will skim a section which another will eagerly devour, and vice versa.
Those insights led me to last year’s experiments with a “magic” highlighter, and the delightful frame: what if we could make highlighters actually do what people wish they did? When students are polled about their favorite study practices, the most common responses are usually re-reading and highlighting. Meanwhile, if you make a list of the most effective study practices, those two methods are usually at the bottom. But highlighting feels great: it’s a way of indicating interest, a way of participating, of literally making your mark on the text. People imagine that highlighting will help them internalize the material. It (mostly) doesn’t. But maybe it could: you could use a special highlighter to add “situated ideas” to your library, and then the system would ensure that you’d internalize that material.
Alright, okay, fine: machine-generated retrieval practice tasks
When I started writing about this idea in 2022, I thought about it very differently. Large language models (LLMs) couldn’t generate good retrieval practice tasks for ranges of a source text. In fact, as of late 2024, LLMs still can’t generate good retrieval practice tasks for ranges of a source text. But I’m now collaborating with Ozzie Kirkby on a project to fix that.
Lots of people have made small attempts on this problem. None of the solutions I’ve tried performs well enough to be interesting. I’ve resisted the problem myself because I don’t like the dominant motivating frame: efficiency, ease, accessibility. Those aims feel too much like premature optimization to me. I want a difference in kind. It would often be nice to avoid the cost of flashcard-writing, yes. But the main thing I want is for these systems to work better—to produce more robust memory, deeper learning, richer metabolization.
The opportunity for a new primitive is a much more interesting frame. I want something like a spaced repetition system, but where review activities vary and deepen and connect over time. Such a system would necessarily require either extremely expensive content-by-content labor or machine-generated tasks. And only machine-generated tasks afford the possibility of activities tailored to idiosyncratic personal contexts.
So: machine-generated tasks it is! Ozzie and I have been working together on this for about six weeks, so it’s still quite nascent, but I’d like to share a bit about our approach and learnings so far.
First: today’s models aren’t automatically good at generating effective tasks for content beyond simple facts, at least with mere prompt engineering and N-shot examples. This makes sense. Very few people use memory systems for anything other than simple facts; there would be few examples in the training set. Problem sets and exercises, yes, but retrieval practice tasks are a very different dialect. And, even with a specific highlight and the full source context, the models are quite bad at targeting—deciding what a reader might want tasks about within that highlighted range.
That said, the models sometimes generate well-constructed, well-targeted tasks. So we’ve begun by training a classifier to score the quality of tasks. The theory is that it’s fine—at least initially—if the models rarely produce good output, so long as we can filter out the bad tasks. And then, with a way to evaluate different models and architectures over time, we can “scale” our relatively rare task-writing taste.
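The generate-then-filter idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `filter_tasks`, `acceptance_rate`, the 0.8 threshold, and the toy scores are all hypothetical, and the `score` callable stands in for whatever trained classifier is actually used.

```python
from typing import Callable, Iterable


def filter_tasks(
    candidates: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.8,  # hypothetical cutoff, to be tuned on labeled data
) -> list[tuple[str, float]]:
    """Keep only candidates whose quality score meets the threshold.

    Returns (task, score) pairs sorted from best to worst.
    """
    scored = [(task, score(task)) for task in candidates]
    kept = [(task, s) for task, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def acceptance_rate(
    candidates: list[str],
    score: Callable[[str], float],
    threshold: float = 0.8,
) -> float:
    """Fraction of candidates that survive filtering (0.0 if there are none)."""
    if not candidates:
        return 0.0
    return len(filter_tasks(candidates, score, threshold)) / len(candidates)


# Toy stand-in for a classifier: a lookup table of made-up scores.
toy_scores = {
    "What year was the book printed?": 0.10,
    "Why do varied cues make a memory more robust?": 0.92,
    "What does the author say?": 0.35,
    "How does a static flashcard encourage pattern matching?": 0.85,
}

kept = filter_tasks(toy_scores, toy_scores.get)
for task, s in kept:
    print(f"{s:.2f}  {task}")
print(acceptance_rate(list(toy_scores), toy_scores.get))
```

Tracking something like the acceptance rate for each generator is one simple way to compare models and architectures over time, provided the scorer itself can be trusted.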
As with most machine learning pipelines, data is a limiting factor. We began by hand-constructing a data set of several hundred “situated ideas”—source content, a highlight range, a retrieval practice task, and a classification score. We’ve augmented that data with manually-scored LLM-generated tasks (both good and bad), so that now we have more than two thousand samples. So far, the classifier performs well within the same source text (on held-out samples) but generalizes much less well to out-of-sample source texts.
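The gap between within-source and out-of-sample performance depends on how the held-out set is drawn. A random split lets the same source text appear on both sides; a split grouped by source does not. Here is a minimal sketch of the grouped version, assuming each sample is a dict with a `"source"` key. The function name, the field name, and the 25% fraction are hypothetical, and this is not necessarily how the evaluation described above was done.

```python
import random
from collections import defaultdict


def split_by_source(samples, held_out_fraction=0.25, seed=0):
    """Split samples so that no source text appears on both sides.

    `samples` is a list of dicts, each with at least a "source" key.
    Returns (train, test). At least one source is always held out.
    """
    by_source = defaultdict(list)
    for sample in samples:
        by_source[sample["source"]].append(sample)

    # Shuffle sources (not samples) with a fixed seed for reproducibility.
    sources = sorted(by_source)
    random.Random(seed).shuffle(sources)
    n_held_out = max(1, round(len(sources) * held_out_fraction))
    held_out = set(sources[:n_held_out])

    train = [s for s in samples if s["source"] not in held_out]
    test = [s for s in samples if s["source"] in held_out]
    return train, test


# Made-up data: four sources with three scored tasks each.
samples = [
    {"source": f"text-{i}", "task": f"task {i}.{j}", "score": j % 2}
    for i in range(4)
    for j in range(3)
]
train, test = split_by_source(samples)
print(len(train), len(test))
```

Accuracy on `test` from this kind of split estimates how the classifier behaves on texts it has never seen, which is the case that matters once readers bring their own material.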
And so, we’re collecting more data. We’ve created a nice workflow:
- as we read a PDF or web article, we highlight it with Hypothes.is
- a bot replies to the highlights with proposed tasks
- we reply to each of those replies with a machine-readable “grade”; freeform critical comments; (optionally) a “corrected” rewrite of the same task; and (optionally) tags for failure modes we’ve noticed (e.g. focusing on trivial details)
- when none of the machine-generated tasks is good, we reply to the highlight ourselves with the tasks we would have wanted
The classification scores and hand-written tasks from this workflow are then fed back into improving the classifier, which in turn improves the output. We’ll also feed the critical comments and hand-improved examples into N-shot or fine-tuning material for task generation. Our hope is that over time, we’ll accept more of the model’s outputs, which will make our manual feedback cheaper, which will in turn let us collect more data, and so on. And once the output is somewhat reliable, we can crowdsource labels to accelerate the process further.
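To make the two feedback paths concrete, here is one way the replies from this workflow might be represented and routed. The record layout, the field names, the tag string, and the 0.0 to 1.0 grade scale are all assumptions for illustration; the actual reply format is not described here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskFeedback:
    """One reviewer reply to a machine-proposed task (all fields hypothetical)."""
    highlight: str                     # the highlighted source text
    task: str                          # the task the bot proposed
    grade: float                       # assumed scale: 0.0 (bad) to 1.0 (good)
    comments: str = ""                 # freeform critique
    rewrite: Optional[str] = None      # optional corrected version of the task
    failure_tags: list[str] = field(default_factory=list)


def classifier_samples(feedback: list[TaskFeedback]) -> list[tuple[str, str, float]]:
    """Turn every graded task into a (highlight, task, grade) training sample."""
    return [(f.highlight, f.task, f.grade) for f in feedback]


def few_shot_examples(feedback: list[TaskFeedback]) -> list[dict]:
    """Collect critique-plus-rewrite pairs as candidate few-shot material."""
    return [
        {
            "highlight": f.highlight,
            "flawed_task": f.task,
            "critique": f.comments,
            "improved_task": f.rewrite,
        }
        for f in feedback
        if f.rewrite is not None
    ]


replies = [
    TaskFeedback(
        highlight="Robust memory requires varied cues and connectivity.",
        task="What punctuation ends the sentence?",
        grade=0.0,
        comments="Focuses on a trivial detail.",
        rewrite="What does robust memory require, beyond repetition?",
        failure_tags=["trivial-detail"],
    ),
    TaskFeedback(
        highlight="Robust memory requires varied cues and connectivity.",
        task="Why might a single fixed cue produce brittle memory?",
        grade=1.0,
    ),
]

print(len(classifier_samples(replies)))   # every reply yields a sample
print(len(few_shot_examples(replies)))    # only replies with a rewrite
```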
What’s next
My long-term goal here is to move us towards a more dynamic learning system as I’ve described. But in the short term, I expect this project will be quite useful to users of existing spaced repetition systems. Our plan is to work towards an interface which would allow users to conveniently highlight texts and import the resulting tasks into Anki, Mnemosyne, and other tools. That’s assuming our pipeline ever performs well enough. In any case, we’ll certainly open-source our data set and code.
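For the import step, one low-tech route is plain tab-separated text, which Anki can import as one note per line. Here is a minimal sketch using only Python's standard library. The function name and the two-field question/answer layout are assumptions, and a real integration would likely need richer fields, such as a link back to the highlight.

```python
import csv
import io


def tasks_to_tsv(tasks):
    """Render (question, answer) pairs as tab-separated text, one task per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for question, answer in tasks:
        # Collapse internal whitespace so each task stays on a single line.
        writer.writerow([" ".join(question.split()), " ".join(answer.split())])
    return buffer.getvalue()


tsv = tasks_to_tsv([
    ("Why can fixed wording make a memory brittle?",
     "The wording itself becomes the cue,\nso recall turns into pattern matching."),
    ("What three things would a situated idea store?",
     "A pointer to context, a range within it, and an optional comment."),
])
print(tsv)
```

Saving the returned string to a `.txt` file gives something that can be loaded through a text-file import dialog.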
Even within the limited frame of traditional flashcard generation, good integration with existing systems will eventually require more invasive changes. In an ideal integration, the user wouldn’t evaluate the machine-generated tasks while they’re reading. They would just read, and highlight, and then later review. The trouble here is that sometimes a given highlight could reasonably point to several distinct ideas—and you probably don’t want all of them. Users will need to give the system feedback on targeting, on their desired level of depth, and so on.
The easy way to implement that is to make the user evaluate machine-generated tasks during their reading session, but I can tell you from experience: that’s unpleasant. It would be better to provide feedback at review time. That will require more complex integration. And longer-term, if I get my way, it wouldn’t make sense to have the user evaluate and approve the machine-generated tasks, because the task will change with each review. It’s a different conceptual model.
For this kind of dynamic review, machine-generated tasks are necessary but not sufficient. Some early experiments suggest that LLMs can pretty reliably generate simple surface variations of known-good tasks, to avoid the pattern-matching problem. But we want tasks which deepen, connect, and recontextualize over time. Those will need separate investigations and pipelines.
My thanks to Ozzie Kirkby for joining me on this adventure, for insightful conversation on these topics, and for comments on a draft. Thanks, also, to David Holz for pushing me to think about classifiers.
This work was funded by my Patreon community. If you find my work interesting, you can become a member to help make more of it happen.
Finally, a special thanks to my sponsor-level patrons as of April 2026: Adam Marblestone, Adam Wiggins, Andrew Sanchez, Andrew Sutherland, Andy Schriner, Ben Springwater, Bert Muthalaly, Boris Verbitsky, Calvin French-Owen, Dan Romero, David Wilkinson, Dylan Houlihan, fnnch, Greg Vardy, Heptabase, James Hill-Khurana, James Archer, James Lindenbaum, Jesse Andrews, Kevin Lynagh, Kinnu, Lambda AI Hardware, Ludwig Petersson, Maksim Stepanenko, Matt Knox, Michael Slade, Mickey McManus, Mintter, Patrick Collison, Peter Hartree, Ross Boucher, Russel Simmons, Salem Al-Mansoori, Sana Labs, Thomas Honeyman, Todor Markov, Tooz Wu, William Clausen, William Laitinen, Yaniv Tal