# Cable Mind: Vision Training & Video Pipeline
## Training Smoke Test: 20k Steps
Ran PPO on both state and vision modes to verify the full pipeline works end-to-end.
### State mode (MlpPolicy)
| Metric | Value |
|---|---|
| Wall time | 15.7s |
| FPS | 1,274 steps/s |
| Reward (start) | -422 |
| Reward (end) | -275 |
| Improvement | 35% |
### Vision mode (MultiInputPolicy)
| Metric | Value |
|---|---|
| Wall time | 965s (~16 min) |
| FPS | 21 steps/s |
| Reward (start) | -439 |
| Reward (end) | -248 |
| Improvement | 43% |
| Peak RSS | 1.2 GB |
Both modes show a clear learning signal within 20k steps. Vision is ~60x slower than state (21 vs 1,274 fps) because of the three 84x84 camera renders per step. The 43% improvement in vision mode is encouraging: the CNN is extracting useful features from the wrist cameras.
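For reference, the smoke test boils down to swapping the policy class per observation mode. Here is a minimal sketch with stable-baselines3, assuming a Gymnasium-registered env id ("CableInsert-v0") and an obs_mode kwarg; both names are placeholders, not the project's actual API:

```python
# Minimal smoke-test sketch with stable-baselines3. "CableInsert-v0" and
# obs_mode are placeholder names, not the project's actual env API.
import gymnasium as gym
from stable_baselines3 import PPO

def smoke_test(obs_mode: str, steps: int = 20_000) -> PPO:
    env = gym.make("CableInsert-v0", obs_mode=obs_mode)
    # State mode returns a flat vector -> MlpPolicy.
    # Vision mode returns a Dict obs (camera images + proprioception) ->
    # MultiInputPolicy, which routes image keys through a CNN extractor.
    policy = "MlpPolicy" if obs_mode == "state" else "MultiInputPolicy"
    model = PPO(policy, env, verbose=1)
    model.learn(total_timesteps=steps)
    return model

if __name__ == "__main__":
    smoke_test("state")
    smoke_test("vision")
```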
## 1M Vision Training
Kicked off a 1M-step vision training run on the original cable env (floating gripper). Completed in 17.2 hours at 16 fps.
| Metric | 20k steps | 1M steps |
|---|---|---|
| ep_rew_mean | -439 | -188 |
| ep_len_mean | 200 | 154 |
| policy std | ~1.0 | 0.257 |
Reward improved from -439 to -188. Policy std narrowed from ~1.0 to 0.257; the policy has converged to a tighter action distribution. Episode length dropped below max (200 to 154), meaning some episodes are terminating early via the success condition (dist < 2cm). Still far from solved, but clear monotonic improvement over the full run.
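The early success termination implied by the shrinking episode length would look roughly like the check below. This is a hypothetical sketch: the 2 cm threshold and 200-step cap come from the numbers above, but the variable names and structure are illustrative.

```python
import numpy as np

SUCCESS_DIST = 0.02   # metres; the "dist < 2cm" success condition
MAX_STEPS = 200       # episode time limit

def episode_done(cable_tip_pos: np.ndarray, socket_pos: np.ndarray,
                 step_count: int) -> tuple[bool, bool]:
    """Return (terminated, truncated) in the Gymnasium sense."""
    dist = float(np.linalg.norm(cable_tip_pos - socket_pos))
    terminated = dist < SUCCESS_DIST    # success -> episode ends early
    truncated = step_count >= MAX_STEPS # otherwise hit the time limit
    return terminated, truncated
```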
## Video Rendering Pipeline
Built a 30-second demo video renderer for social media content. 1280x720 H.264 at 20fps with PiP wrist camera panels composited in the top-right corner.
Framebuffer gotcha: MuJoCo’s default offscreen framebuffer is 640x480. Requesting a 1280-pixel-wide render either fails silently or errors out. Fix:
```python
arm.visual.global_.offwidth = 1920
arm.visual.global_.offheight = 1080
```
This must be set on the MjSpec before compile(); the framebuffer can’t be resized after model compilation. This one cost 20 minutes of debugging.
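Putting the fix in context, a minimal sketch assuming MuJoCo ≥ 3.2’s MjSpec API and a placeholder XML path:

```python
import mujoco

# Sketch of the framebuffer fix. "cable_scene.xml" is a placeholder path.
spec = mujoco.MjSpec.from_file("cable_scene.xml")
spec.visual.global_.offwidth = 1920    # `global` is a Python keyword, hence global_
spec.visual.global_.offheight = 1080
model = spec.compile()                 # framebuffer size is baked in here
data = mujoco.MjData(model)

# A 1280x720 offscreen render now fits inside the framebuffer.
renderer = mujoco.Renderer(model, height=720, width=1280)
renderer.update_scene(data)
frame = renderer.render()              # (720, 1280, 3) uint8 array
```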
Video writing: imageio-ffmpeg via imageio.get_writer() with codec="libx264", quality=8, pixelformat="yuv420p". The pyav plugin wasn’t installed, so the ffmpeg subprocess backend was the path of least resistance.
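In code, the writer setup looks roughly like this (the output path and dummy frames are placeholders; fps, codec, quality, and pixelformat mirror the settings above):

```python
import imageio
import numpy as np

# Dummy frames stand in for the composited 1280x720 renders.
frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(20)]

writer = imageio.get_writer(
    "cable_demo.mp4",       # placeholder output path
    fps=20,
    codec="libx264",
    quality=8,              # 0-10 scale used by the ffmpeg plugin
    pixelformat="yuv420p",  # broadest player compatibility
)
for frame in frames:
    writer.append_data(frame)
writer.close()
```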
PiP compositing: Pure numpy array slicing. Three 200x150 camera renders pasted onto the main frame with 2px dark borders between panels.
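A rough sketch of that compositing step; panel order, exact offsets, and the border colour are illustrative:

```python
import numpy as np

PANEL_W, PANEL_H, BORDER = 200, 150, 2  # per-panel size and border width

def composite_pip(frame: np.ndarray, panels: list[np.ndarray]) -> np.ndarray:
    """Paste camera panels down the top-right corner of the main frame."""
    out = frame.copy()
    x0 = out.shape[1] - PANEL_W - BORDER            # inset from the right edge
    for i, panel in enumerate(panels):              # stack panels vertically
        y0 = BORDER + i * (PANEL_H + BORDER)
        # Dark border: fill a slightly larger rectangle, then paste the panel.
        out[y0 - BORDER:y0 + PANEL_H + BORDER,
            x0 - BORDER:x0 + PANEL_W + BORDER] = 20
        out[y0:y0 + PANEL_H, x0:x0 + PANEL_W] = panel
    return out

main = np.zeros((720, 1280, 3), dtype=np.uint8)
cams = [np.full((PANEL_H, PANEL_W, 3), 128, dtype=np.uint8) for _ in range(3)]
composited = composite_pip(main, cams)
```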
Render performance: 5.1 fps (4 camera renders per frame x 600 frames, so roughly 2 minutes of wall time for the 30-second clip).
## Four-Camera Montage

Third-person hero shot, overhead, and both wrist side cameras. The yellow cable and bright green socket read well from every angle.
## Files
| File | What |
|---|---|
| experiments/smoke_train_ur5e.py | 20k-step PPO smoke test, both modes |
| experiments/visualize_ur5e.py | 4-camera montage + optional interactive viewer |
| experiments/render_ur5e_video.py | 30s demo video with PiP wrist cameras |
## Takeaways
- Vision mode trains 60x slower than state mode (21 vs 1,274 fps) but shows stronger relative improvement (43% vs 35%). The CNN learns useful features even from 84x84 renders.
- Set the offscreen framebuffer before compile(): visual.global_.offwidth and offheight go on the MjSpec, not on the compiled model.
- 1M steps is not enough to solve vision-based cable insertion, but the learning curves are monotonic. The policy is headed in the right direction. Next step is scaling up to longer runs on the UR5e env.