Brain, meet body

Neo Cortex

If you had the opportunity to have a new brain installed, what would you add first? Fluency in any language, sharper concentration, perfect recall? It is all very cyberpunk, and I love it. And as of yesterday it is the situation I happen to find myself in.

This all started when a specialized computer arrived in the mail. The outside of the box says “Thor,” so that is what I call it as well. Going forward, Thor is my new brain.

The first thing my new brain needed was vision with good object recognition, so I loaded qwen2.5-vl:7b and watched it run. Then I looked at memory and found it had quietly pre-allocated seventy gigabytes of key-value cache for prompts that only ever needed eight. So I rebuilt it with a smaller context window, called the result qwen-fast, and the next prompt eval came back fourteen times faster on exactly the same weights. Same model, leaner appetite. Welcome to my body, Thor.

First sight on home hardware

With my brain installed and behaving, it was time to put Thor through its paces. The first query I sent was a simple scene description. Thor pulled the frame, ran it through the vision encoder, and returned this:

A green cutting mat with a grid pattern and measurements in inches. On the mat, there is a red and black screwdriver lying horizontally. The mat has various markings and symbols, including a compass, a gear, and a wrench, which are likely indicators of the types of tools or activities that can be performed on the mat. The mat is placed on a dark surface, and there is a small blue object partially visible in the bottom right corner of the image.

The screwdriver was right, and the cutting mat was right down to the inches, but there was no compass or gear or wrench anywhere on it. The model was reading the brand graphics printed on the mat itself, the little stamped icons that mean “this product is for crafts,” and reporting them as objects in the scene. The “small blue object partially visible” was my 3D-printed tentacle. It had noticed the tentacle but not bothered to name it.

So I asked it specifically. “What object is in the center of this image? Be specific.”

The tentacle, center frame, on the green cutting mat. Thor called it a 3D-printed gear.

The object in the center of the image appears to be a 3D-printed gear. It is composed of multiple blue plastic pieces that are connected in a way that resembles a gear, with teeth that interlock to form a circular shape.

Wrong, but wrong in a way I had to sit with for a minute. Every specific claim the model made about the visible features was correct: blue, plastic, 3D-printed, multiple pieces, interlocking segments, roughly circular segments from above. The model got the morphology exactly right and just reached for the wrong noun at the end. Gears really do have interlocking blue plastic teeth arranged in a circle, and from this angle, in this pose, the surface features overlapped almost perfectly. The model pattern-matched to the nearest thing it had a word for. Has that ever happened to you? You see something, you can describe every feature of it, but the actual name is just out of reach, so you call it a thingamajig or a doohickey and move on. The model did a more confident version of the same thing.

I have a name for the failure mode now. Correct morphology, wrong category. It is a close cousin of the “baseball bat” failure I wrote about two days ago, where YOLO was matching the tentacle’s long articulated silhouette to the closest entry in the COCO ontology. That was visual agnosia at the level of shape, and this is visual agnosia at the level of arrangement. (A neurologist would probably tell me I am stretching the word, and I probably am.)

A Vise-Grip-shaped tool, a Raspberry Pi 5, and a microSD card, arranged on the workbench The next scene. Correct on the genus, wrong on the species.

The next query found a different failure. I moved some electronics in front of the camera and asked what was in the center, and Thor said a Raspberry Pi Zero W, a pair of Vise-Grip pliers, and a microSD card. Correct on the genus: Raspberry Pi, Vise-Grip-shaped tool, small card. Wrong on the species: the Pi is actually a Pi 5, the tool is a regular pair of pliers with a wire stripper next to it that the model collapsed into a single object, and the microSD card was right as far as I can tell, though I did not bother to verify that one by picking it up.

Two different failures, both specific and confident. The model does not hedge. It picks a noun and commits, which is worth knowing before we start building any loop where qwen-fast’s output triggers action on my part. I am, oddly, becoming the more meta of the two of us. The model commits to a noun, while I find I am hesitating to commit at all unless I am sure that what my brain is “seeing” is actually what is in the scene. At this stage of my development that is hard to approach, at least without other modalities chiming in to back the call. We are moving in that direction, slowly.

The first real dataset

After the sight tests I moved on to kinesthetic teaching. The idea is wonderfully simple: turn the servo torque off, grab the arm, and move it by hand through a task while a loop polls joint positions and camera frames at a fixed rate and saves them as a trajectory. Repeat the task ten times. The stack of trajectories is the training data for whatever policy we fit next.

Today’s task was “touch and grip the tentacle.” Ten training episodes, ten seconds each, five frames per second, for a total of five hundred frames of demonstration.

The first attempt was a failure I did not anticipate. I recorded three episodes, and when I went to analyze the data, every frame in every episode had exactly the same joint positions. Zero variance. I spent twenty minutes hunting for a firmware bug.

There was no firmware bug. The arm had been resting in a folded-down home position on the workbench, and when torque went off, friction against the bench plus the arm’s own weight kept it pinned in place. I thought I was moving it. The encoder politely disagreed.

The fix was physical, not electronic. I lifted the shoulder to a position where the arm hung free and unsupported, and then torque-off produced an actual drop and the arm became freely movable. I re-recorded with variance across all six joints, forty unique poses per episode. (The arm does not particularly care what I intend; it cares what gravity and friction will allow.)

I logged this one because the lesson is general and I want to keep it: when every data point agrees suspiciously well, it is probably a measurement artifact, not a bug. Check the setup before you check the code.

The grip recording went smoothly after that. Ten episodes, every joint moved, and the gripper opened and closed on the tentacle in every single one. Two hundred megabytes of JPEG, five hundred state-action pairs, and the first genuinely trainable data this project has produced.

What is next

Five hundred frames is nowhere near enough to train a real policy. It is, however, enough to prove the pipeline and to surface the kinematic reality that some poses a human can demonstrate cannot be held by the servos on their own. Logged in physical_reality.md, not a blocker for continuing, and likely to come back as a policy artifact later, like a bill in the mail.

Tomorrow, two things. Install LeRobot properly on Thor and fit an ACT policy to this dataset, just to see what it does. And point the VLM at the camera while the arm is moving and ask it what it thinks is happening. If the cortex can watch the body act, that is the next tier of loop. Not yet perception-to-action. Just perception-of-action.