Case Study · Vol. 03
Jun – Dec 2024 · Solo Engagement

Teaching a language model to teach

A pilot for a major educational publisher: turn structured skills metadata into pedagogically sound lessons through a Python pipeline, GPT-4, and an editorial surface designed for trust.

year
2024
role
Design + Build, Solo
domain
Intro Python (Pilot)
stack
Python · GPT-4 · Airtable · Cursor
Skills-based object model · Plate 01
— 01 · The Brief

A publisher trying to keep up with itself.

A major educational publisher had built a carefully structured catalog of skills, competencies, and learning objectives; the unsexy backbone of how their courses get scoped, sequenced, and assessed. The catalog was sound. The bottleneck was downstream: turning each objective into actual lesson content took weeks of instructional designer time per unit, and the queue was always full.

The question put to me was narrower than "can we use AI to write courses," even if that's how it would inevitably get summarized in a deck somewhere. The real question was whether the existing content model (i.e., already structured, already opinionated about pedagogy) could drive an LLM well enough to take a meaningful slice of the first-draft labor off the table, without losing the things that made the catalog theirs in the first place.

— 02 · The Bet

That structure could carry a model further than prompting could.

The first instinct in mid-2024 was to throw bigger prompts at GPT-4 and hope for the best. I was skeptical of that approach for the same reason I'm skeptical of most "just prompt harder" arguments: the interesting work in any content system happens upstream of the model, in how the inputs are structured. The publisher's skill graph already encoded the pedagogy. The job was to design a pipeline that let that structure do work at every level, as a hierarchy of constraints passed from parent to child to block.

The publisher's catalog had spent years getting the upstream right. The bet was that if you respected what they'd built, the model only had to be good enough at the last mile.

That was the thesis going in. The interesting parts of the project turned out to be the places where the thesis was right and the places where it wasn't quite enough.

— 03 · Approach

Three moves, in order.

Most of the build collapses into three decisions about where the design work lived. The pipeline was the visible artifact. The real architecture was upstream of it and around it.

01

Model the content

Treat the existing skill graph as the prompting hierarchy. Skills hold competencies hold objectives hold blocks. Each level inherits constraints from its parent. The model never sees a freestanding ask.
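The inheritance described above can be sketched as a small data model. This is a hypothetical illustration, not the publisher's actual schema: node names, fields, and the example constraints are all invented for the sake of the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node in the skill graph: a skill, competency, or objective."""
    name: str
    constraints: list[str] = field(default_factory=list)
    parent: "Node | None" = None

    def inherited_constraints(self) -> list[str]:
        """Constraints flow parent -> child: a node's effective rules are
        its ancestors' rules followed by its own."""
        chain = self.parent.inherited_constraints() if self.parent else []
        return chain + self.constraints

# Illustrative hierarchy: skill -> competency -> objective.
skill = Node("Python fundamentals", ["voice: plain, second person"])
competency = Node("Control flow", ["scope: if/else and loops only"], parent=skill)
objective = Node(
    "Write a for loop over a list",
    ["Bloom: apply", "one worked example per block"],
    parent=competency,
)

# The model never sees a freestanding ask: a block prompt built from
# `objective` carries the full chain of four constraints.
effective_rules = objective.inherited_constraints()
```

The point of the sketch is the accumulation: by the time a block is generated, the prompt already encodes voice (skill level), scope (competency level), and pedagogy (objective level).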

02

Build the pipeline

A Python orchestration layer over GPT-4's function-calling, with structured JSON outputs mapped to the publisher's design system. QA layered in: coherence scoring, perplexity checks, standards alignment. Validation runs alongside generation, not after.
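A minimal sketch of the "validation runs alongside generation" pattern, with the GPT-4 function-calling request stubbed out. The block schema and the stub's contents are assumptions for illustration; the real pipeline's schema, QA scoring, and API plumbing are not shown here.

```python
import json

# Required keys per generated block (illustrative, not the real schema).
BLOCK_SCHEMA = {"type", "heading", "body"}

def fake_generate_block(objective: str) -> str:
    """Stand-in for a GPT-4 function-calling response that returns
    structured JSON mapped to the publisher's design system."""
    return json.dumps({
        "type": "explanation",
        "heading": objective,
        "body": "A for loop repeats once per item in a list.",
    })

def validate_block(raw: str) -> dict:
    """Runs immediately after each generation call, not after the whole
    lesson exists: malformed JSON or missing keys fail fast, so a bad
    block can be regenerated before the pipeline moves on."""
    block = json.loads(raw)
    missing = BLOCK_SCHEMA - block.keys()
    if missing:
        raise ValueError(f"block missing keys: {sorted(missing)}")
    return block

block = validate_block(fake_generate_block("Write a for loop over a list"))
```

Coherence scoring and standards alignment would sit behind the same gate: each check either passes the block through or routes it back to generation.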

03

Wrap it in an editorial surface

Reviewers don't accept lessons; they accept blocks. A web UI that lets an instructional designer expand an objective into an outline, expand the outline into draft blocks, and regenerate or refine at any granularity. The model proposes; the editor decides.
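The "model proposes, editor decides" loop implies a small per-block state machine. The states and transition rules below are a hypothetical sketch of that idea, not the pilot's actual implementation.

```python
from enum import Enum, auto

class BlockState(Enum):
    DRAFT = auto()         # model output awaiting review
    REGENERATING = auto()  # editor rejected it; model tries again
    ACCEPTED = auto()      # editor signed off

# Legal transitions: the editor moves blocks, never the model.
ALLOWED = {
    BlockState.DRAFT: {BlockState.ACCEPTED, BlockState.REGENERATING},
    BlockState.REGENERATING: {BlockState.DRAFT},
    BlockState.ACCEPTED: set(),  # accepted blocks are final in this sketch
}

def transition(state: BlockState, to: BlockState) -> BlockState:
    if to not in ALLOWED[state]:
        raise ValueError(f"cannot move {state.name} -> {to.name}")
    return to
```

Because acceptance happens per block, a reviewer can ship eleven good blocks while the twelfth cycles through regeneration, which is what makes the surface a workspace rather than an all-or-nothing gate.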

The Airtable proof-of-concept that started the project folds into step one: useful as a forcing function for thinking about structure, not durable as infrastructure.

— 04 · The System

From skill graph to draft lesson, one surface at a time.

The walkthrough below is the production flow as it ran during the pilot. Each surface answers a single question: what's the designer doing here, and what does the model owe them next?

Airtable showing skills and learning objectives, the data objects that feed the content builder

The skill model was fleshed out in Airtable. Every parent skill, child competency, and learning objective is a node with attached prompting rules. The rules drive complexity downward: an objective inherits its parent competency's voice and scope, and adds its own about Bloom's level, activity type, and example density.
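Composing those attached rules into a single block prompt might look like the sketch below. The rule text and the levels are invented for illustration; the real prompting rules lived as fields on the Airtable nodes.

```python
def compose_prompt(rules_by_level: dict[str, str], ask: str) -> str:
    """Flatten the inherited rule chain into one prompt: each level's
    rule is labeled, and the concrete ask comes last."""
    header = "\n".join(f"[{level}] {rule}" for level, rule in rules_by_level.items())
    return f"{header}\n\nTask: {ask}"

# Hypothetical rules, ordered parent -> child as they would be inherited.
rules = {
    "skill": "Voice: plain, direct, second person.",
    "competency": "Scope: control flow only; no functions yet.",
    "objective": "Bloom: apply. Include one worked example.",
}
prompt = compose_prompt(rules, "Draft a teaching block for: loop over a list")
```

Each level adds one kind of opinion (voice, scope, pedagogy), so the final prompt is narrow by construction rather than by prompt-engineering heroics.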

— 05 · In Practice

What worked, and where the model ran out of road.

The block-level generation was solid by the second month. With the full hierarchy of constraints in place, GPT-4 could produce paragraph-scale teaching content that was on-voice, on-standard, and structurally correct often enough that the editorial surface stopped feeling like a safety net and started feeling like a workspace.

The harder problem was holding a lesson together. Cross-block coherence, the quality that makes a twelve-block lesson read like one mind wrote it rather than twelve neighbors, was where 2024 hit its ceiling. Context windows and reasoning at that scale weren't quite there yet. The pipeline could generate twelve good blocks; it couldn't always make sure block seven didn't quietly contradict block three. The editorial surface absorbed that gap, which is part of why the surface mattered so much. But it meant the system was doing more reviewer-assisted work than designer-replacing work, and the honest framing for the pilot acknowledged that.
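A toy version of the contradiction problem makes the gap concrete. The pilot's coherence checks scored model outputs; the sketch below is only a crude stand-in that flags blocks defining the same term two different ways, with the regex and example sentences invented for illustration.

```python
import re
from collections import defaultdict

def find_conflicting_definitions(blocks: list[str]) -> dict[str, set[str]]:
    """Collect 'A <term> is <definition>.' sentences across blocks and
    report any term that received more than one distinct definition."""
    defs = defaultdict(set)
    for text in blocks:
        for term, definition in re.findall(r"A (\w+) is ([^.]+)\.", text):
            defs[term.lower()].add(definition.strip().lower())
    return {term: d for term, d in defs.items() if len(d) > 1}

blocks = [
    "A list is an ordered, mutable collection.",
    "A tuple is an immutable sequence.",
    "A list is an immutable sequence.",  # a later block quietly contradicting an earlier one
]
conflicts = find_conflicting_definitions(blocks)
```

Real contradictions are rarely this literal, which is exactly why the 2024 pipeline leaned on human reviewers for cross-block coherence rather than on checks like this one.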

The blocks were solid by the second month. Stitching twelve of them into a lesson that read like one mind wrote it is where 2024 ran out of road.

The other thing the pilot clarified, and this matters more than the coherence story: the editorial UI was the product. The pipeline was infrastructure. The model was a vendor. What determined whether the system would get used by an instructional designer on a Tuesday afternoon was the texture of the review surface: how fast it was to regenerate one block, how hard it was to lose your place, how trustworthy the feedback signals were. That was design work, and it's what the pilot proved was buildable.

— 06 · Pilot Outcomes

Numbers, in their proper frame.

These are pilot results, not production claims. The system was scoped to prove the architecture against one domain (intro Python); the broader question of whether it could carry the publisher's whole catalog at full volume was a separate engagement that didn't materialize. Read accordingly.

75%

Reduction in first-draft authoring time per lesson, measured against the publisher's standard process during pilot.

1

Pilot domain shipped end-to-end (intro Python): one full course, structured and generated through the system.

~40%

Of generated blocks accepted with light edits in editorial review. The remainder needed substantial work or full regeneration.

— 07 · Reflection

What survived the pilot.

The specific implementation is a 2024 artifact; the model is deprecated, the pipeline lives in a repo I haven't opened in a year. What survived is the architectural claim underneath it: structured pedagogy is the thing that makes a generated lesson teachable, and the editorial surface is the thing that makes the system trustworthy. The model was almost incidental to both.

It also reframed how I think about AI in design work generally. The common framing, which is that the model writes a draft and a human polishes it, kept producing systems that felt slightly off at the seams. The framing this pilot pushed me toward, that the model is the cheap labor inside an opinionated structure designed by someone who knows what they're doing, has held up better in everything I've built since.

Postscript, mid-2026

I'm now rebuilding the underlying ideas on my own terms: once for a learning game, and once for a course-creator tool aimed at independent educators. Mid-2026 makes the work substantially less harrowing than mid-2024 did: cross-lesson coherence isn't the wall it was, the editorial surface is doing more of the design work it always wanted to do, and the structuring layer, the part that was always the actual project, finally has models that can keep up with it.

year
June – December 2024
role
Design + Build · solo engagement
domain
Introductory Python · pilot course
stack
Python (orchestration) · OpenAI GPT-4 API (function calling) · Airtable (skill graph, database) · Cursor (build environment)
Nathan Henry · Design Systems