Why Can't I Just Draw Something and Have AI Draw Back?

40,000 years ago, a human walked into a cave, picked up ochre, and drew a bison on the wall — while making sounds in the echo chamber around them. Drawing fused with sound. That may have been the birth of symbolic thought itself.

Today I walked into Claude, picked up my Apple Pencil, and… couldn't draw anything. One of the most sophisticated AI systems in the world, and there is no surface to draw on. I can type. I can talk. I can paste in a picture. But I can't sketch a box, draw an arrow, and say "this is where the problem is," the way I've communicated complex ideas my entire life.

I think by drawing. When I'm working through complex ideas, I don't write paragraphs. I draw boxes and arrows as I think. And there's no way to do this live with AI today.

So I spent a session trying to fix this.

What I tried

Screenshots. Draw in Excalidraw → screenshot → paste into Claude. Works technically. Kills the flow. It's presenting, not brainstorming.

Canvas inside Claude. Built a React drawing canvas as an artifact in the chat with a "Share with Claude" button. Canvas rendered fine. The share button didn't fire.

Excalidraw + browser + Apple Pencil. Claude opened Excalidraw in Chrome. I drew on it with Apple Pencil via Sidecar. Claude screenshotted the tab and read my drawings: it identified handwriting, interpreted annotations, and understood diagram structure. Perception worked. But Claude couldn't draw back. Excalidraw doesn't expose its API to outside scripts. Three write-back attempts failed.

Every approach collapsed into the same pattern: a static picture plus a text explanation. That's presenting, not brainstorming.

Then we hacked it

Excalidraw is a React app. Its API exists — updateScene, getSceneElements, everything you'd need — but it's locked inside React's component state, invisible to outside scripts.

One walk through React's fiber tree extracted the full API. But that led to a new problem: updateScene is a full replace, not a merge. Every call wipes the canvas and rewrites it from the snapshot you provide. If the snapshot is even slightly stale, your collaborator's work disappears. We lost drawings, restored deleted elements, and wiped the canvas more times than I want to admit.
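A minimal sketch of that failure mode, using a small in-memory scene rather than Excalidraw itself. Every name here is hypothetical: replaceScene stands in for the full-replace behavior described above, and mergeById is one assumed way to guard against a stale snapshot, not something Excalidraw provides.

```typescript
// Hypothetical stand-ins; none of these names come from Excalidraw.
interface SceneElement {
  id: string;
  type: string;
  version: number;
}

// Full replace: whatever is in `snapshot` becomes the entire scene.
function replaceScene(_current: SceneElement[], snapshot: SceneElement[]): SceneElement[] {
  return snapshot.slice();
}

// Merge by id: keep everything in the live scene, append unseen ids,
// and overwrite an element only when the incoming version is newer.
function mergeById(current: SceneElement[], incoming: SceneElement[]): SceneElement[] {
  const merged = current.slice();
  const indexById: { [id: string]: number | undefined } = {};
  merged.forEach((el, i) => {
    indexById[el.id] = i;
  });
  for (const el of incoming) {
    const i = indexById[el.id];
    if (i === undefined) {
      indexById[el.id] = merged.length;
      merged.push(el);
    } else if (el.version > merged[i].version) {
      merged[i] = el;
    }
  }
  return merged;
}

// The AI read the scene when it held only "box-1" ...
const staleSnapshot: SceneElement[] = [{ id: "box-1", type: "rectangle", version: 1 }];
// ... then the human drew an arrow before the AI wrote back.
const liveScene: SceneElement[] = staleSnapshot.concat([
  { id: "arrow-1", type: "arrow", version: 1 },
]);
const aiAddition: SceneElement = { id: "text-1", type: "text", version: 1 };
const aiWrite = staleSnapshot.concat([aiAddition]);

const afterReplace = replaceScene(liveScene, aiWrite); // the arrow is gone
const afterMerge = mergeById(liveScene, aiWrite); // the arrow survives
```

The merge only helps if both sides agree on ids and version counters, which is part of why an additive path turned out to be less fragile.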

The real answer was simpler. Excalidraw's own clipboard paste is additive by design. Write element data to the clipboard in Excalidraw's JSON format, Cmd+V, and the element appears alongside everything already there. No scene replacement. No stale snapshots. No ghost elements.
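A rough sketch of building such a clipboard payload. The envelope shape (a "type" of "excalidraw/clipboard" plus "elements" and "files") and the element fields are assumptions about Excalidraw's clipboard format and may differ by version; the app may also need more fields than shown. buildTextPayload and the interface names are hypothetical helpers for illustration.

```typescript
// Assumed payload shape; verify against the Excalidraw version you target.
interface SketchTextElement {
  id: string;
  type: "text";
  x: number;
  y: number;
  text: string;
  fontSize: number;
  strokeColor: string;
}

interface SketchClipboardPayload {
  type: "excalidraw/clipboard";
  elements: SketchTextElement[];
  files: Record<string, unknown>;
}

let nextSketchId = 0;

// Hypothetical helper: wraps one text element in the assumed envelope.
function buildTextPayload(text: string, x: number, y: number): string {
  const element: SketchTextElement = {
    id: "ai-text-" + nextSketchId++,
    type: "text",
    x: x,
    y: y,
    text: text,
    fontSize: 20,
    strokeColor: "#1e1e1e",
  };
  const payload: SketchClipboardPayload = {
    type: "excalidraw/clipboard",
    elements: [element],
    files: {},
  };
  return JSON.stringify(payload);
}

const diaryReply = buildTextPayload("Hello, Susheel Shastry.", 120, 340);

// In a browser tab with clipboard permission, the write step would be:
//   await navigator.clipboard.writeText(diaryReply);
// followed by a paste (Cmd+V) into the focused Excalidraw canvas.
```

Because the payload carries only the new element, nothing in it can overwrite what is already on the canvas, which is the property the paste route relies on.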

We had a shared canvas. I drew with Apple Pencil. Claude read my handwriting and responded with text and shapes on the same surface. We built Tom Riddle's diary: a parchment page where I wrote in freehand and the diary answered in dark ink. I scrawled "Hey, I am Susheel Shastry" with my Apple Pencil. A moment later, text appeared on the parchment below: "Hello, Susheel Shastry. I am the memory of someone far more powerful than a chatbot." Back and forth, on the same page, like in the movie.

A human and an AI, drawing on the same whiteboard. No fork, no self-hosting, no custom infrastructure — just excalidraw.com and a clipboard hack.

What's still missing

So we have a shared canvas. But it's still turn-based — I draw, I say "look," Claude reads and responds. That's better than uploading a screenshot, but it's not how real whiteboard collaboration feels.

When two people whiteboard together, you don't draw something and then explain it. You draw while talking. Your collaborator watches the strokes form alongside your words. The meaning isn't in the drawing or the speech alone — it's in the pairing, fused by time.

You draw a circle while saying "this is the user." A box while saying "this is the app." An arrow while saying "they sign up here." A red circle while saying "this is where people drop off." No separate explanation needed.

I started calling this temporal fusion — the simultaneous capture of drawing and voice, timestamped and interleaved:

[t=0.0s] voice: "ok let me show you the user journey"
[t=1.2s] stroke: circle, blue
[t=2.5s] voice: "this is the user"
[t=3.1s] stroke: rectangle, green
[t=3.8s] voice: "this is our app"
[t=4.5s] stroke: arrow connecting the two
[t=5.2s] voice: "they sign up here"
[t=6.0s] stroke: red circle around the rectangle
[t=6.5s] voice: "this is where people drop off"

The AI knows the red circle appeared while the user said "drop off." The meaning is in the simultaneity, exactly how it works between humans. Exactly how it worked in those caves.
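One way the interleaving could be sketched, assuming voice segments and strokes each arrive with a timestamp from the same clock. The type names and the functions fuseTimeline, formatTimeline, and spokenNear are hypothetical, invented for this illustration.

```typescript
interface VoiceSegment {
  tSec: number;
  text: string;
}

interface StrokeMark {
  tSec: number;
  description: string;
}

type FusedEntry =
  | { kind: "voice"; tSec: number; text: string }
  | { kind: "stroke"; tSec: number; description: string };

// Combine both streams into one list ordered by time.
function fuseTimeline(voice: VoiceSegment[], strokes: StrokeMark[]): FusedEntry[] {
  const entries: FusedEntry[] = [];
  for (const v of voice) entries.push({ kind: "voice", tSec: v.tSec, text: v.text });
  for (const s of strokes) entries.push({ kind: "stroke", tSec: s.tSec, description: s.description });
  return entries.sort((a, b) => a.tSec - b.tSec);
}

// Render the fused list in the bracketed format shown above.
function formatTimeline(entries: FusedEntry[]): string {
  return entries
    .map((e) => {
      const stamp = "[t=" + e.tSec.toFixed(1) + "s]";
      return e.kind === "voice"
        ? stamp + ' voice: "' + e.text + '"'
        : stamp + " stroke: " + e.description;
    })
    .join("\n");
}

// Return what was said within `windowSec` of a given moment.
function spokenNear(entries: FusedEntry[], tSec: number, windowSec: number): string[] {
  const out: string[] = [];
  for (const e of entries) {
    if (e.kind === "voice" && Math.abs(e.tSec - tSec) <= windowSec) out.push(e.text);
  }
  return out;
}

const fusedTimeline = fuseTimeline(
  [
    { tSec: 5.2, text: "they sign up here" },
    { tSec: 6.5, text: "this is where people drop off" },
  ],
  [
    { tSec: 4.5, description: "arrow connecting the two" },
    { tSec: 6.0, description: "red circle around the rectangle" },
  ]
);
const renderedTimeline = formatTimeline(fusedTimeline);
const nearRedCircle = spokenNear(fusedTimeline, 6.0, 0.6);
```

Here spokenNear pairs the red circle at 6.0s with the words spoken half a second later. The window size is an arbitrary choice for the sketch; a real system would need to tune it or let the model reason over the whole timeline.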

As far as we can tell, nobody has shipped this. There are AI whiteboards, sketch-to-image apps, a patent, and an academic paper from 2018, but we found no shipped product that fuses drawing and voice into a single input stream for AI.

Where it leads

The shared canvas was Phase 0: proof that bidirectional drawing is possible. The real product is a web app, for any device with a touchscreen and microphone, that captures strokes and voice with timestamps and sends the fused timeline to an LLM. The Pointer Events API already reports stylus pressure. The Web Speech API handles transcription in browsers that support it. The MVP is a hosted webpage.
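A sketch of the stroke-capture half, assuming pointer events from a canvas element are forwarded to a small recorder. StrokeRecorder and PointerSample are hypothetical names; PointerSample lists only the standard PointerEvent fields the recorder reads (clientX, clientY, pressure, timeStamp, pointerType), so the class can be exercised without a browser.

```typescript
// The subset of PointerEvent fields the recorder reads.
interface PointerSample {
  clientX: number;
  clientY: number;
  pressure: number;
  timeStamp: number; // milliseconds on the page's high-resolution clock
  pointerType: string;
}

interface StrokePoint {
  x: number;
  y: number;
  pressure: number;
  tMs: number;
}

interface RecordedStroke {
  startMs: number;
  endMs: number;
  points: StrokePoint[];
}

class StrokeRecorder {
  readonly strokes: RecordedStroke[] = [];
  private active: RecordedStroke | null = null;

  // Call from a "pointerdown" listener: starts a new stroke.
  down(e: PointerSample): void {
    this.active = { startMs: e.timeStamp, endMs: e.timeStamp, points: [] };
    this.addPoint(e);
  }

  // Call from "pointermove": ignored unless a stroke is in progress.
  move(e: PointerSample): void {
    if (this.active) this.addPoint(e);
  }

  // Call from "pointerup": closes the stroke and stores it.
  up(e: PointerSample): void {
    if (!this.active) return;
    this.addPoint(e);
    this.strokes.push(this.active);
    this.active = null;
  }

  private addPoint(e: PointerSample): void {
    if (!this.active) return;
    this.active.points.push({ x: e.clientX, y: e.clientY, pressure: e.pressure, tMs: e.timeStamp });
    this.active.endMs = e.timeStamp;
  }
}

// Browser wiring (not run here), assuming `canvas` is an HTMLCanvasElement:
//   canvas.addEventListener("pointerdown", (e) => recorder.down(e));
//   canvas.addEventListener("pointermove", (e) => recorder.move(e));
//   canvas.addEventListener("pointerup", (e) => recorder.up(e));
```

For the fused timeline to line up, transcripts would need stamps from the same clock. One assumed approach is to record performance.now() when each speech result event arrives, accepting that this lags the moment the words were spoken.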

But the bigger idea comes when you put two people on the same canvas. Two cofounders sketching a product architecture. Two engineers debugging a system diagram. Two people drawing and talking at the same time — and the AI following both streams, understanding who drew what and who said what, contributing when it has something useful to add. A third collaborator who never interrupts but always follows along.

One profound application: education. Imagine a student says "teach me how circuits work." The AI starts drawing on the canvas while narrating — building the diagram stroke by stroke, like a professor at a chalkboard. The student circles something they don't understand. The AI sees the circle, hears the question, adapts. A tutor that draws and teaches, not just chats.

Every major tutoring platform has added AI. Most added it the same way: a text chatbot in a sidebar. Some newer tools like Brainraw can generate animated whiteboard explanations from text prompts — AI drawing while narrating. But the interaction is one-directional: you type a prompt, AI produces a video. The student can't draw back. They can't circle something mid-explanation and say "wait, I don't get this part." The whiteboard — the surface where the best teaching actually happens — still has zero two-way AI participation. Nobody's built an AI that can pick up a marker.

The question worth asking

We started with a cave. A human, some ochre, a wall, and a voice echoing off stone. Drawing and sound, fused together — the oldest form of communication we know.

Somewhere along the way, we split them apart. Text in one box. Voice in another. Images in a third. Most AI products today treat these as separate inputs, processed in isolation. Almost nobody has tried to put them back together. The most ancient form of human communication, and it's the one we forgot to build for.

We got a shared canvas working in one session. A human drawing and an AI responding on the same surface. It's a start — but it's still turn-based. The real thing is temporal fusion: drawing and talking at the same time, with AI that understands both as one.

The technology exists. The market is ready. The product isn't built yet.

Should we build it?

Written while exploring this idea live with Claude — March 2026