# Cable Mind: Vision Training & Video Pipeline
## Training Smoke Test: 20k Steps
Ran PPO on both state and vision modes to verify the full pipeline works end-to-end.
### State mode (MlpPolicy)
| Metric | Value |
|---|---|
| Wall time | 15.7s |
| FPS | 1,274 steps/s |
| Reward (start) | -422 |
| Reward (end) | -275 |
| Improvement | 35% |
### Vision mode (MultiInputPolicy)
| Metric | Value |
|---|---|
| Wall time | 965s (~16 min) |
| FPS | 21 steps/s |
| Reward (start) | -439 |
| Reward (end) | -248 |
| Improvement | 43% |
| Peak RSS | 1.2 GB |
Both modes show a clear learning signal within 20k steps. Vision is ~60x slower than state (21 vs 1,274 fps) because of the three 84x84 camera renders per step. The 43% improvement in vision mode is encouraging: the CNN is extracting useful features from the wrist cameras.
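For reference, the smoke test boils down to swapping the policy class per observation mode. Here is a minimal sketch with stable-baselines3, assuming a Gymnasium-registered env id ("CableInsert-v0") and an obs_mode kwarg; both names are placeholders, not the project's actual API:

```python
# Minimal smoke-test sketch with stable-baselines3. "CableInsert-v0" and
# obs_mode are placeholder names, not the project's actual env API.
import gymnasium as gym
from stable_baselines3 import PPO

def smoke_test(obs_mode: str, steps: int = 20_000) -> PPO:
    env = gym.make("CableInsert-v0", obs_mode=obs_mode)
    # State mode returns a flat vector -> MlpPolicy.
    # Vision mode returns a Dict obs (camera images + proprioception) ->
    # MultiInputPolicy, which routes image keys through a CNN extractor.
    policy = "MlpPolicy" if obs_mode == "state" else "MultiInputPolicy"
    model = PPO(policy, env, verbose=1)
    model.learn(total_timesteps=steps)
    return model

if __name__ == "__main__":
    smoke_test("state")
    smoke_test("vision")
```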
## 1M Vision Training
Kicked off a 1M-step vision training run on the original cable env (floating gripper). Completed in 17.2 hours at 16 fps.
| Metric | 20k steps | 1M steps |
|---|---|---|
| ep_rew_mean | -439 | -188 |
| ep_len_mean | 200 | 154 |
| policy std | ~1.0 | 0.257 |
Reward improved from -439 to -188. Policy std narrowed from ~1.0 to 0.257; the policy has converged to a tighter action distribution. Episode length dropped below max (200 to 154), meaning some episodes are terminating early via the success condition (dist < 2cm). Still far from solved, but clear monotonic improvement over the full run.
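The early success termination implied by the shrinking episode length would look roughly like the check below. This is a hypothetical sketch: the 2 cm threshold and 200-step cap come from the numbers above, but the variable names and structure are illustrative.

```python
import numpy as np

SUCCESS_DIST = 0.02   # metres; the "dist < 2cm" success condition
MAX_STEPS = 200       # episode time limit

def episode_done(cable_tip_pos: np.ndarray, socket_pos: np.ndarray,
                 step_count: int) -> tuple[bool, bool]:
    """Return (terminated, truncated) in the Gymnasium sense."""
    dist = float(np.linalg.norm(cable_tip_pos - socket_pos))
    terminated = dist < SUCCESS_DIST    # success -> episode ends early
    truncated = step_count >= MAX_STEPS # otherwise hit the time limit
    return terminated, truncated
```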
## Video Rendering Pipeline
Built a 30-second demo video renderer for social media content. 1280x720 H.264 at 20fps with PiP wrist camera panels composited in the top-right corner.
Framebuffer gotcha: MuJoCo’s default offscreen framebuffer is 640x480. Requesting a 1280-pixel-wide render either fails silently or errors out. Fix:
```python
arm.visual.global_.offwidth = 1920
arm.visual.global_.offheight = 1080
```
This must be set on the MjSpec before compile(); the framebuffer can’t be resized after model compilation. This one cost 20 minutes of debugging.
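Putting the fix in context, a minimal sketch assuming MuJoCo ≥ 3.2’s MjSpec API and a placeholder XML path:

```python
import mujoco

# Sketch of the framebuffer fix. "cable_scene.xml" is a placeholder path.
spec = mujoco.MjSpec.from_file("cable_scene.xml")
spec.visual.global_.offwidth = 1920    # `global` is a Python keyword, hence global_
spec.visual.global_.offheight = 1080
model = spec.compile()                 # framebuffer size is baked in here
data = mujoco.MjData(model)

# A 1280x720 offscreen render now fits inside the framebuffer.
renderer = mujoco.Renderer(model, height=720, width=1280)
renderer.update_scene(data)
frame = renderer.render()              # (720, 1280, 3) uint8 array
```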
Video writing: imageio-ffmpeg via imageio.get_writer() with codec="libx264", quality=8, pixelformat="yuv420p". The pyav plugin wasn’t installed, so the ffmpeg subprocess backend was the path of least resistance.
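In code, the writer setup looks roughly like this (the output path and dummy frames are placeholders; fps, codec, quality, and pixelformat mirror the settings above):

```python
import imageio
import numpy as np

# Dummy frames stand in for the composited 1280x720 renders.
frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(20)]

writer = imageio.get_writer(
    "cable_demo.mp4",       # placeholder output path
    fps=20,
    codec="libx264",
    quality=8,              # 0-10 scale used by the ffmpeg plugin
    pixelformat="yuv420p",  # broadest player compatibility
)
for frame in frames:
    writer.append_data(frame)
writer.close()
```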
PiP compositing: Pure numpy array slicing. Three 200x150 camera renders pasted onto the main frame with 2px dark borders between panels.
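A rough sketch of that compositing step; panel order, exact offsets, and the border colour are illustrative:

```python
import numpy as np

PANEL_W, PANEL_H, BORDER = 200, 150, 2  # per-panel size and border width

def composite_pip(frame: np.ndarray, panels: list[np.ndarray]) -> np.ndarray:
    """Paste camera panels down the top-right corner of the main frame."""
    out = frame.copy()
    x0 = out.shape[1] - PANEL_W - BORDER            # inset from the right edge
    for i, panel in enumerate(panels):              # stack panels vertically
        y0 = BORDER + i * (PANEL_H + BORDER)
        # Dark border: fill a slightly larger rectangle, then paste the panel.
        out[y0 - BORDER:y0 + PANEL_H + BORDER,
            x0 - BORDER:x0 + PANEL_W + BORDER] = 20
        out[y0:y0 + PANEL_H, x0:x0 + PANEL_W] = panel
    return out

main = np.zeros((720, 1280, 3), dtype=np.uint8)
cams = [np.full((PANEL_H, PANEL_W, 3), 128, dtype=np.uint8) for _ in range(3)]
composited = composite_pip(main, cams)
```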
Render performance: 5.1 fps (4 camera renders per frame x 600 frames, so roughly 2 minutes of wall time for the 30-second clip).
## Four-Camera Montage

Third-person hero shot, overhead, and both wrist side cameras. The yellow cable and bright green socket read well from every angle.
## Files
| File | What |
|---|---|
| experiments/smoke_train_ur5e.py | 20k-step PPO smoke test, both modes |
| experiments/visualize_ur5e.py | 4-camera montage + optional interactive viewer |
| experiments/render_ur5e_video.py | 30s demo video with PiP wrist cameras |
## Takeaways
- Vision mode trains 60x slower than state mode (21 vs 1,274 fps) but shows stronger relative improvement (43% vs 35%). The CNN learns useful features even from 84x84 renders.
- Set the offscreen framebuffer before compile(): visual.global_.offwidth and offheight go on the MjSpec, not on the compiled model.
- 1M steps is not enough to solve vision-based cable insertion, but the learning curves are monotonic. The policy is headed in the right direction. Next step is scaling up to longer runs on the UR5e env.