Today I walked into Claude, picked up my Apple Pencil, and… couldn't draw anything. The most sophisticated AI in the world, and it only accepts text. I can type. I can talk. But I can't sketch a box, draw an arrow, and say "this is where the problem is" — the way I've communicated complex ideas my entire life.
I think by drawing. When I'm working through a hard problem, I don't write paragraphs; I draw boxes and arrows while I think. And there's no way to do that with AI today.
So I spent a session trying to fix this.
What I tried
Screenshots. Draw in Excalidraw → screenshot → paste into Claude. Works technically. Kills the flow. It's presenting, not brainstorming.
Canvas inside Claude. Built a React drawing canvas as an artifact in the chat with a "Share with Claude" button. Canvas rendered fine. The share button didn't fire.
Excalidraw + browser + Apple Pencil. Claude opened Excalidraw in Chrome. I drew on it with Apple Pencil via Sidecar. Claude screenshotted the tab and read my drawings — identified handwriting, interpreted annotations, understood diagram structure. Perception worked. But Claude couldn't draw back. Excalidraw doesn't expose its API. Three write-back attempts failed.
The missing primitive
When two people whiteboard together, you don't draw something and then explain it. You draw while talking. Your collaborator watches the strokes form alongside your words. The meaning isn't in the drawing or the speech alone — it's in the pairing, fused by time.
You draw a box while saying "exchange." An arrow while saying "this calls OnRamp." A circle while saying "this is where we're stuck." No separate explanation needed.
I started calling this temporal fusion — the simultaneous capture of drawing and voice, timestamped and interleaved:
[t=0.0s] voice: "here's how the payment flow works"
[t=1.2s] stroke: rectangle, blue
[t=2.5s] voice: "this box is the exchange"
[t=3.1s] stroke: rectangle, green
[t=3.8s] voice: "and this is OnRamp"
[t=4.5s] stroke: arrow connecting the two
[t=5.2s] voice: "this is the KYC call"
[t=6.0s] stroke: red circle around first rectangle
[t=6.5s] voice: "this is where we're stuck"
The AI knows the red circle appeared while the user said "stuck." No separate explanation needed. The meaning is in the simultaneity, exactly how it works between humans, and exactly how it has worked since people first drew on cave walls while telling stories.
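The fused timeline above can be sketched as plain data plus two small functions. This is a minimal illustration, not a spec: the event shapes, `fuseTimelines`, and `nearestUtterance` are all names I'm assuming for the sake of the example. The point is that once every stroke and every utterance carries a timestamp, merging them and binding a stroke to its concurrent speech is almost trivial:

```javascript
// Merge separately captured stroke and voice streams into one
// time-ordered timeline, so the model sees drawing and speech
// interleaved exactly as they happened.
function fuseTimelines(strokes, utterances) {
  const events = [
    ...strokes.map(s => ({ t: s.t, kind: "stroke", data: s.shape })),
    ...utterances.map(u => ({ t: u.t, kind: "voice", data: u.text })),
  ];
  return events.sort((a, b) => a.t - b.t);
}

// Bind a stroke to the utterance closest to it in time, so the
// red circle at t=6.0s pairs with "this is where we're stuck" at t=6.5s.
function nearestUtterance(stroke, utterances) {
  return utterances.reduce((best, u) =>
    Math.abs(u.t - stroke.t) < Math.abs(best.t - stroke.t) ? u : best
  );
}

const strokes = [
  { t: 1.2, shape: "rectangle, blue" },
  { t: 6.0, shape: "red circle around first rectangle" },
];
const utterances = [
  { t: 0.0, text: "here's how the payment flow works" },
  { t: 6.5, text: "this is where we're stuck" },
];

const timeline = fuseTimelines(strokes, utterances);
const paired = nearestUtterance(strokes[1], utterances);
console.log(paired.text); // → "this is where we're stuck"
```

A real system would bind strokes to overlapping speech windows rather than single nearest points, but nearest-in-time is enough to show why simultaneity carries the meaning.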
Nobody has built this. There are AI whiteboards, sketch-to-image apps, a patent, an academic paper from 2018 — and zero shipped products that fuse drawing and voice into a single input stream for AI.
Where it leads
The first version is a web app — not iPad-only, any device with a touchscreen and microphone — that captures strokes + voice with timestamps and sends the fused timeline to an LLM. Pointer Events already expose pressure-sensitive stylus input in the browser, and the Web Speech API handles live transcription. The MVP is a hosted webpage.
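The capture side is a thin layer over APIs the browser already ships. `PointerEvent.pressure` and `webkitSpeechRecognition` are real browser APIs; the `recordEvent` helper, the timeline shape, and the wiring around them are my assumptions about how an MVP might look, not a finished design:

```javascript
// Shared timeline: every stroke sample and every transcript chunk
// gets a timestamp relative to session start.
const timeline = [];
const t0 = Date.now();

function recordEvent(kind, data, now = Date.now()) {
  const ev = { t: (now - t0) / 1000, kind, data };
  timeline.push(ev);
  return ev;
}

// Browser-only wiring (guarded so the pure helpers run anywhere).
if (typeof window !== "undefined") {
  // Pointer Events report stylus pressure; pressure > 0 means pen-down.
  const canvas = document.querySelector("canvas");
  canvas.addEventListener("pointermove", (e) => {
    if (e.pressure > 0) {
      recordEvent("stroke", { x: e.offsetX, y: e.offsetY, pressure: e.pressure });
    }
  });

  // Web Speech API: continuous transcription, one event per final result.
  const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  const rec = new Recognition();
  rec.continuous = true;
  rec.interimResults = false;
  rec.onresult = (e) => {
    const last = e.results[e.results.length - 1];
    recordEvent("voice", last[0].transcript.trim());
  };
  rec.start();
}
```

From here, sending the session to the model is just serializing `timeline` — the fusion happened at capture time, in the timestamps.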
But the bigger idea comes when you put two people on the same canvas. A teacher draws a circuit diagram while explaining it. A student circles a component and asks why. The AI — watching both streams, both voices — answers without the teacher stopping. A third participant on the whiteboard who never interrupts but always follows along.
Then: the AI becomes the teacher. A student says "teach me how transistors work." The AI draws on the canvas while narrating — building the diagram stroke by stroke, like a professor at a chalkboard. The student circles something they don't understand. The AI sees, hears, adapts.
Online tutoring is a massive, growing industry — billions of dollars, double-digit growth, every major platform racing to add AI. And every single one has added it the same way: as a text chatbot in a sidebar. The whiteboard — the surface where the best teaching actually happens — has zero AI participation. Nobody's built an AI that can pick up a marker.