ChalkboardAI

A self-hosted AI math tutor that actually knows what your instructor taught. Runs entirely offline on a Mac Studio for FERPA and FOIA compliance. Self-directed, in development.

Role

Product, design, engineering

Timeline

5-day build, 2026. Actively iterating.

Stack

Running on my personal Mac Studio as a proof of concept.

React · TypeScript · Ollama (gemma3:27b) · Fabric.js · IndexedDB · Whisper.cpp · Kokoro TTS · Built with Claude Code
ChalkboardAI interface showing the infinite canvas with a math problem and AI tutor chat panel

Shipped a working v1 in five days

Designed, architected, and built the complete student-facing experience against mock curriculum. Local inference, local RAG, voice I/O, infinite canvas, streaming chat with citations. Actively iterating.

Chose architecture against the default

Rejected cloud-dependent AI in favor of a local-first stack because FERPA and FOIA make privacy non-negotiable in education.

Named what five days cannot validate

Real-user testing, instructor tooling, and institutional sign-offs are all documented as pending. The honest limitations are part of the case study.

Why I built this

I've taught statistics and calculus to 500+ students at Washtenaw Community College over the years. AI tutors are already in my students' hands, and the ones they're using don't know what happened in our classroom. They fill gaps I didn't leave, and they miss the ones I did.

I wanted to prototype the version that starts from the lecture itself. Five days was the constraint: enough to test the architectural thesis end-to-end, not enough to validate it with real users. Building solo was the point. I wanted to understand the full stack of decisions involved when no one else is making them. Work continues.

The problem

Generic AI tutors don't know how your instructor actually taught a topic. Notation, approach, example sequence, and vocabulary all vary between instructors, and those variations matter when students are trying to reinforce what they just learned.

ChalkboardAI bridges this gap. An instructor uses it as a live lecture board. After class, it becomes each student's personal AI-assisted study tool, grounded in the specific lecture the student's instructor actually taught, with citations back to the source material.

Generic AI tutor

No instructor context. No lecture grounding. Answers from a model that has never seen your class, your notation, or your examples.

ChalkboardAI

Grounded in the lecture the student's instructor actually taught, with citations back to the source material.

Why local-first

Privacy isn't a feature in education. It's the foundation.

FERPA & FOIA compliance

Every part of the AI stack runs locally on one Mac Studio. Inference via Ollama with gemma3:27b. Transcription via Whisper. Embeddings via nomic-embed-text. Text-to-speech via Kokoro. Student interactions with AI tutors are educational records under FERPA, and transcript logging and retrievability matter under FOIA. On-premise is the right architecture for real institutional deployment.
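
In practice, "runs locally" just means HTTP calls to a loopback port. A minimal sketch of the inference call, assuming Ollama's default port and its documented /api/chat streaming endpoint with the model tag named above:

```typescript
// Minimal sketch: streaming a tutor response from a local Ollama instance
// via its documented /api/chat endpoint (default port 11434). Nothing
// leaves the machine; the "API" is a loopback HTTP call.
async function* streamLocalChat(
  messages: { role: "system" | "user" | "assistant"; content: string }[],
): AsyncGenerator<string> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "gemma3:27b", messages, stream: true }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    // Ollama streams newline-delimited JSON objects.
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (!line.trim()) continue;
      const chunk = JSON.parse(line);
      if (chunk.message?.content) yield chunk.message.content as string;
    }
  }
}
```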

$0 per-student cost

Local inference zeroes out per-student API cost, which is what makes institutional deployment actually viable at scale.

Local RAG

Vector embeddings run on the client. The AI grounds responses in the student's curriculum without leaking interaction history to any cloud service.
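
To make that concrete, here is a hedged sketch of the retrieval step, assuming embeddings come from Ollama's /api/embeddings endpoint with nomic-embed-text and that chunks are already persisted locally. The LectureChunk shape is illustrative, not the project's actual schema:

```typescript
// Illustrative chunk shape; the real schema lives in IndexedDB.
interface LectureChunk { lectureId: string; text: string; embedding: number[] }

// Embed a query locally via Ollama's documented /api/embeddings endpoint.
async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  return (await res.json()).embedding;
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored chunks against the query and keep the top k.
async function retrieve(query: string, chunks: LectureChunk[], k = 4) {
  const q = await embed(query);
  return chunks
    .map((c) => ({ ...c, score: cosine(q, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```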

What's shipped

Student-facing v1. End-to-end, running against a synthetic curriculum. Actively iterating.

Infinite spatial canvas

Pen, eraser, shapes, text, laser tool, image paste, pan and pinch-zoom, undo/redo that covers math objects, and a pinned pen library with auto light/dark color pairing. Students work spatially, not linearly.
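
As one example of the canvas plumbing, a sketch of cursor-centered wheel/trackpad zoom using Fabric.js's documented zoomToPoint (Fabric v6-style imports; the zoom bounds are illustrative):

```typescript
import { Canvas, Point } from "fabric";

// Wheel/trackpad zoom that keeps the point under the cursor fixed.
function enableWheelZoom(canvas: Canvas) {
  canvas.on("mouse:wheel", (opt) => {
    const e = opt.e as WheelEvent;
    // Exponential scaling keeps zoom steps proportional at any level.
    let zoom = canvas.getZoom() * 0.999 ** e.deltaY;
    zoom = Math.min(Math.max(zoom, 0.1), 20); // illustrative bounds
    canvas.zoomToPoint(new Point(e.offsetX, e.offsetY), zoom);
    e.preventDefault();
    e.stopPropagation();
  });
}
```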

Contextual grounding

A vision-capable local model reads the student's board each turn and answers in context. Every response carries a citation badge back to the specific lecture it's grounded in, pulled from a local vector store with active-lecture retrieval boosting. Chat sessions are scoped per course and lecture, with rolling summarization so long sessions don't lose earlier work.
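
A sketch of what active-lecture boosting could look like on top of the similarity scores; the boost factor is illustrative, not a tuned value:

```typescript
// Illustrative boost factor, not the project's tuned value.
const ACTIVE_LECTURE_BOOST = 1.25;

// Multiply scores for chunks from the currently open lecture, then re-rank,
// so the lecture the student is studying wins ties against the rest of the
// course corpus.
function rankWithBoost(
  scored: { lectureId: string; text: string; score: number }[],
  activeLectureId: string,
  k = 4,
) {
  return scored
    .map((c) => ({
      ...c,
      score:
        c.lectureId === activeLectureId ? c.score * ACTIVE_LECTURE_BOOST : c.score,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```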

Local voice I/O

Whisper.cpp for transcription from the browser, Kokoro TTS for spoken responses. Math is spoken via the same accessibility pipeline MathJax uses for screen readers (Speech Rule Engine with ClearSpeak). Accuracy disclaimer permanently visible; one-time-per-session acknowledgment modal before any AI interaction.
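
For the math-speech step, a hedged sketch using the speech-rule-engine package's documented setupEngine and toSpeech calls (option names can vary between SRE versions):

```typescript
import * as sre from "speech-rule-engine";

// Convert MathML to a plain-English reading using ClearSpeak rules,
// the same pipeline MathJax uses for screen readers.
async function speakMath(mathml: string): Promise<string> {
  await sre.setupEngine({ locale: "en", domain: "clearspeak" });
  return sre.toSpeech(mathml);
}

// speakMath("<math><mfrac><mn>1</mn><mn>2</mn></mfrac></math>") → "one half"
```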

ChalkboardAI design system light mode, built with Claude Design
ChalkboardAI design system dark mode

Tokens, type, and modes built with Claude Design

The design foundation was built using Claude Design, Anthropic's design tool: color tokens, type scale, and component variants across light and dark modes. Treating the design system as a documented artifact rather than ad-hoc styles meant every new screen started from a consistent base.

Decisions and their costs

Five calls I made during the build that say more about how I think than a resume bullet would. Each is documented in a commit or a persistent decision file from the time it was made.

01

Socratic enforcement: accept leakage, fix the UX instead

The problem

The model occasionally leaks worked answers in the chat body even with strict system prompting.

What I rejected

A two-stage critic model (added latency) and blunt regex filtering (too many false positives on legitimate recall queries).

What I landed on

Accept occasional leakage, but hide AI-generated artifact tags during streaming so the flash-and-vanish is less jarring. No "hide answers" toggle. The complexity wasn't worth the marginal benefit.
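
A sketch of the streaming filter, assuming artifacts arrive wrapped in a tag like <artifact>…</artifact>; the real tag format is internal to the project:

```typescript
// Illustrative sketch only: the real tag format is internal to the project.
// Suppress artifact tags while tokens stream so a partially received tag
// never flashes in the chat body and then vanishes.
function visibleStreamText(streamed: string): string {
  // Drop complete artifact blocks; they render as separate chat items.
  let text = streamed.replace(/<artifact>[\s\S]*?<\/artifact>/g, "");
  // Hide a trailing, still-open tag until its closing tag arrives.
  const open = text.lastIndexOf("<artifact>");
  if (open !== -1) text = text.slice(0, open);
  return text;
}
```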

02

Inline chat artifacts over a duplicated help zone

Figma wireframe showing the earlier layout with a separate AI help canvas on the left sidebar

The earlier layout: AI-generated items appear in a separate left panel alongside the chat.

The problem

An early version had a top-of-sidebar "help canvas" where AI-generated math items appeared.

The observation

It duplicated the chat. Users would have to track where a generated item lived in two places.

What I landed on

Removed the help zone. AI-generated items now render inline under the chat bubble that produced them, still draggable onto the main board. One place for everything worked better than two.

03

Rolling summary over sliding window for context management

The problem

The model started slowing down and drifting on format rules mid-session.

What I rejected

A sliding context window. That would have broken "the AI remembers what we worked on twenty minutes ago," which is a first-class requirement for the product.

What I landed on

Rolling summarization. Preserves continuity at the cost of a second model call.
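
A sketch of the shape of that fix, with illustrative thresholds; the summarize callback is the second model call mentioned above:

```typescript
interface Turn { role: "system" | "user" | "assistant"; content: string }

const KEEP_RECENT = 8; // turns kept verbatim (illustrative threshold)

// When the transcript exceeds the budget, compress older turns into one
// summary message via a second model call, keeping recent turns verbatim.
async function compactHistory(
  turns: Turn[],
  summarize: (text: string) => Promise<string>, // the second model call
): Promise<Turn[]> {
  if (turns.length <= KEEP_RECENT) return turns;
  const older = turns.slice(0, turns.length - KEEP_RECENT);
  const recent = turns.slice(-KEEP_RECENT);
  const summary = await summarize(
    older.map((t) => `${t.role}: ${t.content}`).join("\n"),
  );
  // Continuity survives as one compact system message instead of being
  // silently truncated away by a sliding window.
  return [{ role: "system", content: `Session so far: ${summary}` }, ...recent];
}
```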

What this taught me

Designing for context pressure from the start would have prevented the degradation from ever appearing. On the next build, context management is a day-one concern.

04

Accessibility-first as a tiebreaker default

The problem

I had a working bespoke regex pipeline for speaking math notation aloud, but it required case-by-case authoring for each notation type.

What I found

The Speech Rule Engine, the standards-based pipeline MathJax uses for screen readers, covered a much broader set of notation correctly out of the box.

The trade

About 300KB of bundle size for correct, broad coverage without hand-authoring.

What I landed on

Delete the bespoke pipeline. When an accessibility-aware library exists and works, it wins by default.

05

Routing math computation through a verified solver

Early LLM chat: model confidently gives the wrong answer of 7 for a limit, then recalculates to 3.
Early LLM chat: model revises again to -8; the student asks why it got it wrong, and the model apologizes.
First working session: ChalkboardAI canvas with the math solver connected.

Three different wrong answers in one session (correct answer: 2), and the first working session after connecting the solver.

The problem

The base LLM could not be trusted to do arithmetic. On the same problem (find the limit of f(x) = 12 - 5x as x approaches 2, answer: 2) it returned 7, then 3, then -8 in a single session. A math tutor that gets the math wrong is worse than no tutor at all.

What I rejected

Relying on better prompting alone. The model was just computing badly, and better prompting couldn't fix that.

What I landed on

Connected the model to a math solver for computation, so the LLM handles reasoning and explanation while verified tool calls handle the arithmetic. The model went from three wrong answers on one problem to consistent accuracy on the same class of questions.
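
In sketch form, with mathjs standing in as the solver (the project's actual solver may differ): the LLM proposes an expression via a tool call, and the client computes it deterministically.

```typescript
import { evaluate } from "mathjs";

// Illustrative tool-call shape; the real wire format depends on the
// model runtime's tool-calling protocol.
interface ToolCall { name: string; arguments: { expression: string } }

// Route computation through a verified solver: evaluate("12 - 5*2") → 2,
// every time, instead of the 7 / 3 / -8 drift from the bare model.
function runMathTool(call: ToolCall): string {
  if (call.name !== "evaluate_expression") throw new Error("unknown tool");
  return String(evaluate(call.arguments.expression));
}
```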

What building it taught me

Four things I learned that a spec couldn't have surfaced:

01

Socratic enforcement is fundamentally hard to solve through prompting alone

Specified the constraint, observed the violations, tried filtering, reverted, landed on "accept it and fix the UX." Only runtime behavior would have surfaced this path.

02

Context growth degrades format adherence, not just speed

The model didn't just slow down mid-session. It also started violating format rules. That observation informed the rolling-summary fix and aggressive trimming of the API payload.

03

Bespoke regex pipelines win early, then lose

Targeted rules for derivative notation and function application worked fine at first. When I asked whether a standard existed, I found the accessibility TTS pipeline covered a much broader set of notation correctly with no case-by-case authoring. Default to the accessibility-aware library when one exists.

04

Install-time failures don't show up in specs

Browser-bundling a speech ruleset requires locale JSON served as static assets. New Python dependencies don't install cleanly against the default macOS Python. Both surfaced only at install time, which led to writing a postinstall hook and a uv-managed Python 3.12 path.

What's not done yet

Five days ships a working v1. Five days does not ship a validated product. Here's what's still unvalidated. Work is ongoing.

  • Real-user validation is zero. No students, tutors, or faculty have exercised it. Everything about response quality, retrieval thresholds, and tutor tone is based on self-testing, which I know isn't enough.

  • The corpus is synthetic. Mock transcripts are hand-written, not real Whisper output. They're cleaner and more retrieval-friendly than real lectures will be.

  • Instructor side is entirely unbuilt. The half of the product that generates the content students study against doesn't exist yet.

  • No auth, no accounts. Single-machine, single-user local storage. Switching students means clearing browser state.

  • Institutional sign-offs pending. Spec documents FOIA coordinator approval and academic-integrity approval as prerequisites before a real launch.