Cable Mind: Camera Placement & Scene Design

February 11, 2026

The Problem

The policy needs three genuinely different viewpoints to get usable depth cues. Three copies of the same angle add no information beyond the first.

Camera placement took more iterations than any other part of the UR5e env.


Camera Iteration

Attempt 1: All three on the wrist

Mounted the cameras on the wrist with small X offsets. The wrist flange blocks the center camera entirely, and the gripper body sits between the camera and the cable.

Attempt 2: All three on gripper-base

Moved cameras to the gripper body, past the wrist flange. Could see the cable now. But all three had nearly the same view direction ([0, 0, 1, 0] quat, looking along +Z). Three near-identical views with slightly different X offsets. Useless for triangulation.

Attempt 3: Angled 45 degrees on gripper-base

Kept center looking straight, angled left/right 45 degrees inward with wider X offsets. The arm body dominated both side views. At 45 degrees inward, the camera looks right at the wrist links.

Attempt 4: Reduced to 30 degrees

Less arm in frame, but still too much metal and not enough workspace.

Final: Hybrid mounting

Center camera on gripper-base, looking straight down the cable axis. This is the insertion view. It moves with the arm, always showing connector-to-socket alignment.

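cam_c = grip_body.add_camera()  # attached to gripper-base, moves with the arm
cam_c.pos = [0, 0.02, 0.10]     # offset in the gripper-base frame
cam_c.quat = [0, 0, 1, 0]       # [w, x, y, z]: 180 deg about Y, view axis flips from -Z to +Z
cam_c.fovy = 90                 # vertical field of view, degrees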

Left and right cameras fixed on worldbody, aimed at the socket area from two sides. These don’t move with the arm. They always show the cable-to-socket gap in profile, giving the policy stable depth/distance cues regardless of arm pose.

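cam_l = spec.worldbody.add_camera()  # fixed in the world, does not move with the arm
cam_l.pos = [-0.40, 0.30, 0.20]      # world coordinates
cam_l.quat = [0.7138, 0.4627, -0.2860, -0.4412]  # [w, x, y, z], look-at socket
cam_l.fovy = 50                      # vertical field of view, degrees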

Quaternions were computed via a look-at function targeting [-0.1, 0.45, 0.05] (slightly above the socket) with world +Z as up. The right camera mirrors the left across the target's X = -0.1 plane, which puts it at [0.20, 0.30, 0.20].

Body-mounted cameras are great for egocentric views (looking where you’re going), but terrible for external perspective. Fixed cameras solve the triangulation problem because they always frame the workspace the same way.


MuJoCo Camera Quaternions

MuJoCo cameras look along -Z in their own frame, with +Y as up. The quat field rotates this default orientation into world coordinates. No xyaxes support in MjSpec; quaternions only.
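A quick way to sanity-check a camera quaternion is to rotate the default view axis (-Z) and up axis (+Y) by it and see where they end up. The sketch below is plain Python with no MuJoCo dependency; quat_rotate and camera_axes are hypothetical helper names, not part of any library, and the quaternion is assumed to be unit length in MuJoCo's [w, x, y, z] order.

```python
def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = [w, x, y, z]."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * (q_vec x v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + (q_vec x t)
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )


def camera_axes(quat):
    """Return (view_dir, up_dir) in the parent frame for a camera quat.

    Uses the convention stated above: the camera looks along its own -Z
    and its own +Y is up.
    """
    view_dir = quat_rotate(quat, (0.0, 0.0, -1.0))
    up_dir = quat_rotate(quat, (0.0, 1.0, 0.0))
    return view_dir, up_dir


# Center camera: 180 deg about Y, so the view axis ends up on +Z
# of the gripper-base frame and +Y stays up.
view_c, up_c = camera_axes((0.0, 0.0, 1.0, 0.0))

# Left world camera: the view axis should point from the camera
# position toward the socket target.
view_l, up_l = camera_axes((0.7138, 0.4627, -0.2860, -0.4412))
```

For the left camera, view_l comes out parallel to target minus position, i.e. [0.3, 0.15, -0.15] normalized, which is a cheap check before rendering anything.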

For a camera at position P looking at target T:

  1. forward = normalize(T - P) (this is where -Z_cam should point)
  2. right = normalize(forward x world_up) (camera X axis)
  3. up_cam = right x forward (camera Y axis)
  4. Rotation matrix R = [right | up_cam | -forward] (columns)
  5. Convert R to quaternion [w, x, y, z]
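
The five steps above can be sketched as one utility function. This is a minimal pure-Python version, not the exact function used in the env; look_at_quat and its helpers are hypothetical names. It assumes the view direction is not parallel to world_up (a camera looking straight down needs a different up vector, or just the identity quat).

```python
import math


def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n < 1e-12:
        raise ValueError("zero-length vector (is forward parallel to world_up?)")
    return tuple(c / n for c in v)


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def look_at_quat(pos, target, world_up=(0.0, 0.0, 1.0)):
    """Quaternion [w, x, y, z] for a camera at pos looking at target."""
    forward = _normalize(tuple(t - p for t, p in zip(target, pos)))  # step 1
    right = _normalize(_cross(forward, world_up))                    # step 2
    up_cam = _cross(right, forward)                                  # step 3
    # Step 4: R has columns [right | up_cam | -forward]; rXY is row X, column Y.
    r00, r10, r20 = right
    r01, r11, r21 = up_cam
    r02, r12, r22 = -forward[0], -forward[1], -forward[2]
    # Step 5: rotation matrix to quaternion, branching on the largest
    # diagonal term to stay numerically stable.
    trace = r00 + r11 + r22
    if trace > 0.0:
        s = 2.0 * math.sqrt(trace + 1.0)
        q = (0.25 * s, (r21 - r12) / s, (r02 - r20) / s, (r10 - r01) / s)
    elif r00 > r11 and r00 > r22:
        s = 2.0 * math.sqrt(1.0 + r00 - r11 - r22)
        q = ((r21 - r12) / s, 0.25 * s, (r01 + r10) / s, (r02 + r20) / s)
    elif r11 > r22:
        s = 2.0 * math.sqrt(1.0 + r11 - r00 - r22)
        q = ((r02 - r20) / s, (r01 + r10) / s, 0.25 * s, (r12 + r21) / s)
    else:
        s = 2.0 * math.sqrt(1.0 + r22 - r00 - r11)
        q = ((r10 - r01) / s, (r02 + r20) / s, (r12 + r21) / s, 0.25 * s)
    return q


# Left camera from the section above: position and socket target.
quat_l = look_at_quat((-0.40, 0.30, 0.20), (-0.1, 0.45, 0.05))
```

Rounded to four decimals, quat_l matches the left camera's [0.7138, 0.4627, -0.2860, -0.4412]. Calling it with the right camera's position [0.20, 0.30, 0.20] gives the same w and x with the signs of y and z flipped, as expected for a mirrored viewpoint.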

The overhead camera is trivial: quat = [1, 0, 0, 0] (identity) already looks along -Z. The side and third-person cameras required manual quaternion work. The third-person hero shot at [0.8, -0.5, 0.8] with quat = [0.774, 0.504, 0.209, 0.321] took a few tries to get right.


Workspace Scene

Added visual context beyond floor + socket:

  • Server tray: dark-green PCB base ([0.05, 0.28, 0.08]) at [-0.1, 0.45, 0.003]
  • Copper traces: two gold strips across the board
  • IC chips: two black boxes at different positions
  • Socket: bright green box with dark inset “hole”, sitting on the PCB

All decorative geoms have contype=0, conaffinity=0. No collision, pure visuals. The socket body position is what gets randomized during domain randomization.
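
Why contype=0, conaffinity=0 is enough: MuJoCo only considers a geom pair for collision if one geom's contype shares a bit with the other's conaffinity. A minimal sketch of that bitmask test follows; can_collide is a hypothetical helper for illustration, and it covers only this one filter, not MuJoCo's other exclusions (parent-child pairs, explicit excludes, and so on).

```python
def can_collide(contype_a, conaffinity_a, contype_b, conaffinity_b):
    """Bitmask test MuJoCo applies to a geom pair (one filter among several)."""
    return bool((contype_a & conaffinity_b) or (contype_b & conaffinity_a))


DEFAULT = (1, 1)      # MuJoCo's default contype / conaffinity
DECORATIVE = (0, 0)   # PCB, traces, chips in this scene

# A decorative geom never passes the test, whatever the other geom uses.
decor_vs_default = can_collide(*DECORATIVE, *DEFAULT)   # False
default_vs_default = can_collide(*DEFAULT, *DEFAULT)    # True
```

Zeroing both fields matters: with only contype=0, a default geom's contype=1 would still match the decorative geom's conaffinity=1.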


Domain Randomization

Three randomization axes per episode:

What              Range                           Why
----------------  ------------------------------  ---------------------------------------
Target position   +/-3 cm XY, +/-1 cm Z           Prevents memorizing one socket location
Arm pose          +/-5 deg per joint from home    Different starting configurations
Cable damping     +/-20%                          Cable dynamics variation

The socket body position (model.body_pos[socket_id]) is moved to match the randomized target, so the visual socket always corresponds to the reward target.
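
A rough sketch of per-episode sampling over those three axes. Everything here is illustrative: sample_episode, the dict keys, and the nominal values in the usage lines are hypothetical, and uniform sampling is an assumption (the ranges above don't specify a distribution). Joint angles are assumed to be in radians.

```python
import math
import random


def sample_episode(rng, nominal_target, home_qpos, nominal_damping):
    """Draw one set of randomized episode parameters.

    rng             -- random.Random instance (seedable for reproducibility)
    nominal_target  -- (x, y, z) of the un-randomized socket target, metres
    home_qpos       -- home joint angles, radians
    nominal_damping -- un-randomized cable damping value
    """
    target = (
        nominal_target[0] + rng.uniform(-0.03, 0.03),  # +/-3 cm in X
        nominal_target[1] + rng.uniform(-0.03, 0.03),  # +/-3 cm in Y
        nominal_target[2] + rng.uniform(-0.01, 0.01),  # +/-1 cm in Z
    )
    max_offset = math.radians(5.0)                     # +/-5 deg per joint
    qpos = [q + rng.uniform(-max_offset, max_offset) for q in home_qpos]
    damping = nominal_damping * rng.uniform(0.8, 1.2)  # +/-20%
    return {"target_pos": target, "qpos": qpos, "cable_damping": damping}


# Usage with placeholder nominal values (not the env's real ones).
rng = random.Random(0)
episode = sample_episode(
    rng,
    nominal_target=(-0.1, 0.45, 0.03),
    home_qpos=[0.0] * 6,   # UR5e has six joints
    nominal_damping=1.0,
)
# In the env, target_pos would then be written to model.body_pos[socket_id]
# so the visual socket and the reward target stay in sync.
```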


Takeaways

  1. Camera placement is hard. Five iterations from wrist-mounted to the final hybrid setup. Body-mounted cameras give good egocentric views but bad external perspective. Fixed world cameras solve triangulation.
  2. MuJoCo quaternion cameras are annoying. No xyaxes in MjSpec, so you have to compute look-at quaternions manually. Worth writing a utility function once and reusing it.
  3. Decorative geometry matters for vision policies. A bright green socket on a bare floor is ambiguous from certain angles. The PCB, traces, and chips give the CNN texture cues to anchor the socket in the scene.