Cable Insertion RL: Coordinate Frame Bugs & Reward Ablations

January 9, 2026

Session Context

Continuing from Session 1’s MuJoCo environment. Had a working scene with 838 successes on the fixed task, but the agent was approaching from the side instead of above, and median distance was 5.7cm. Goal: push toward <2cm median with better generalization.


Critical Bug: Coordinate Frame Mismatch

Symptom: Brute force achieved 0.6cm, but training plateaued at 4cm. Policy appeared to hover without purposeful movement.

Debugging process:

# Commanded action (intended as a world-space target)
action = [0.26, 0, 0.0]

# Actual gripper position after settling
gripper_pos = [0.36, 0, 0.2]  # way off: expected roughly [0.26, 0, 0.0]

Root cause: MuJoCo position actuators control joint offsets, not world coordinates. The gripper body starts at [0.1, 0, 0.2] in world space. Joint qpos = [0, 0, 0] means “at body origin.”

# Bug: action interpreted as joint offset
self.data.ctrl[:] = action  # ctrl=[0.26, 0, 0]
# Result: gripper goes to [0.1+0.26, 0, 0.2+0] = [0.36, 0, 0.2]

# Fix: convert world coords to joint offsets
self.gripper_origin = np.array([0.1, 0.0, 0.2])
ctrl = np.array(action) - self.gripper_origin
self.data.ctrl[:] = ctrl

Result: Brute force now achieves 0.21cm at gripper position [0.18, 0, 0.23].
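
A quick way to catch this class of bug early is to command a known world-space target and read back the body's world pose directly from MuJoCo. A minimal sketch using the official mujoco Python bindings; the XML filename, the body name "gripper", and the assumption that the first three actuators drive the gripper are all mine:

import mujoco
import numpy as np

model = mujoco.MjModel.from_xml_path("cable_insertion.xml")  # filename is an assumption
data = mujoco.MjData(model)

gripper_origin = np.array([0.1, 0.0, 0.2])   # gripper body origin in world frame
target_world = np.array([0.18, 0.0, 0.23])   # a reachable world-space target

data.ctrl[:3] = target_world - gripper_origin  # position actuators take joint offsets
for _ in range(2000):                          # let the actuators settle
    mujoco.mj_step(model, data)

print("gripper world pos:", data.body("gripper").xpos)  # should land close to target_world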


Physics Stability Fix

Symptom: WARNING: Nan, Inf or huge value in QACC at DOF 9

DOF 9 = third ball joint in the cable chain. The cable was “whipping” and exploding during fast gripper movements.

Parameter              Before    After
Timestep               0.002s    0.001s
Ball joint damping     0.02      0.2
Ball joint stiffness   0.05      0.1
Substeps per action    50        100

Kept 100ms of sim time per environment step (100 substeps × 0.001s, matching the previous 50 × 0.002s).
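
A minimal sketch of how the smaller timestep and larger substep count fit together in the environment step; the function shape and the assumption that the first three actuators drive the gripper are mine, and the XML would set timestep="0.001" in <option>:

import mujoco
import numpy as np

N_SUBSTEPS = 100  # 100 substeps × 0.001s timestep = 100ms of sim time per env step

def apply_action(model, data, action, gripper_origin):
    # Convert the world-space target to joint offsets (coordinate-frame fix above),
    # then sub-step the physics so the cable stays stable at the smaller timestep.
    data.ctrl[:3] = np.asarray(action) - gripper_origin
    for _ in range(N_SUBSTEPS):
        mujoco.mj_step(model, data)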


Experiment 1: Observation Normalization

Hypothesis: VecNormalize wrapper would help PPO by standardizing inputs.

from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

env = DummyVecEnv([lambda: CableInsertionEnv()])
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

Config              Successes   Median
No normalization    838         5.7cm
Obs + reward norm   454         19.2cm
Obs only norm       500         20.4cm

Analysis: Our observation space was already well-scaled:

  • Positions: 0.1–0.4m
  • Velocities: ±1 m/s
  • Target: fixed at [0.18, 0, 0.04]

Normalization introduced non-stationarity (running stats shift during training) without benefit. Reverted.
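
The "already well-scaled" claim is cheap to verify empirically before reaching for VecNormalize. A rough sketch, assuming CableInsertionEnv follows the Gymnasium API (reset returning (obs, info), step returning a 5-tuple):

import numpy as np

env = CableInsertionEnv()
obs, _ = env.reset()
obs_log = [obs]
for _ in range(1000):
    obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
    obs_log.append(obs)
    if terminated or truncated:
        obs, _ = env.reset()
obs_log = np.array(obs_log)
print("per-dim min:", obs_log.min(axis=0))  # if every dimension stays within ~O(1) ranges,
print("per-dim max:", obs_log.max(axis=0))  # standardization has little to offer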


Experiment 2: Domain Randomization

What we randomized each episode:

# rng: per-environment np.random.Generator (e.g. self.np_random)

# Target position: ±1cm x/y, ±0.5cm z
target_noise = rng.uniform([-0.01, -0.01, -0.005], [0.01, 0.01, 0.005])
self.target = self.target_base + target_noise

# Initial gripper position: ±2cm x/z, ±1cm y (joint offsets from the body origin)
gripper_noise = rng.uniform([-0.02, -0.01, -0.02], [0.02, 0.01, 0.02])
self.data.qpos[0:3] = gripper_noise

# Cable physics: damping ±15%
# (self.default_damping = self.model.dof_damping.copy(), cached once at init)
damping_scale = rng.uniform(0.85, 1.15)
self.model.dof_damping[:] = self.default_damping * damping_scale

Metric      Without DR   With DR
Successes   838          359
Median      5.7cm        4.2cm
<5cm        2338         2811

Analysis: DR trades peak performance for consistency. The policy can’t memorize a single trajectory; it must actually use the target observation.


Experiment 3: Graduated Reward Bonuses

Hypothesis: One-time bonuses at intermediate thresholds create “breadcrumbs.”

reward = -dist * 10  # Base: penalize distance

# Improvement shaping (continuous)
if self.prev_dist is not None:
    improvement = self.prev_dist - dist
    reward += improvement * 50

# Graduated bonuses (one-time per episode)
if dist <= 0.05 and 5 not in self.thresholds_crossed:
    reward += 10.0
    self.thresholds_crossed.add(5)
if dist <= 0.03 and 3 not in self.thresholds_crossed:
    reward += 25.0
    self.thresholds_crossed.add(3)
if dist <= 0.02 and 2 not in self.thresholds_crossed:
    reward += 100.0
    self.thresholds_crossed.add(2)
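
The bonus logic relies on two pieces of per-episode state; a minimal sketch of the matching bookkeeping in reset(), using the attribute names from the snippet above:

# In reset(), at the start of every episode:
self.prev_dist = None            # no improvement term on the first step
self.thresholds_crossed = set()  # each graduated bonus fires at most once per episode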

Metric      Original Reward   + Graduated
Successes   359               510
Median      4.2cm             10.8cm
<5cm        2811              1659

Analysis: More successes but worse median. Policy became “all or nothing,” either hitting 2cm or staying far away. The graduated bonuses may have conflicted with continuous improvement shaping.


Experiment 4: Remove Improvement Shaping

Hypothesis: Simplify to just distance penalty + threshold bonuses.

reward = -dist * 10 + threshold_bonuses  # No improvement term

Metric      With Improvement   Without
Successes   510                412
Median      10.8cm             24.6cm
<5cm        1659               1057

Analysis: Worst run. The continuous improvement signal provides crucial gradient information. Threshold bonuses alone are too sparse.
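
One way to formalize that intuition: the improvement term is, up to the ×50 scale and a discount factor near 1, potential-based reward shaping with potential Φ(s) = −dist(s); the threshold bonuses have no comparable dense counterpart. A sketch of the equivalence:

# Potential-based shaping view of the improvement term (assuming gamma ≈ 1):
#   phi(s)   = -dist(s)
#   F(s, s') = gamma * phi(s') - phi(s)
#            ≈ -dist + prev_dist
#            = prev_dist - dist    # the "improvement" computed earlier, before the ×50 scale
# Shaping of this form adds dense per-step learning signal without changing which
# policy is optimal, which fits the observation that removing it hurt every metric.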


Hyperparameters

PPO configuration (Stable Baselines3):

from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    env,
    ent_coef=0.01,         # Entropy bonus
    learning_rate=0.0001,  # Conservative
    n_steps=2048,          # Steps per rollout
    batch_size=64,         # Minibatch size
    n_epochs=10,           # PPO epochs per update
)
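
For completeness, a training call consistent with the Steps column in the final comparison below; the save filename is an assumption:

model.learn(total_timesteps=2_000_000)  # baseline and DR runs; graduated-bonus runs ran 4M
model.save("ppo_cable_insertion")       # filename is an assumption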

Failed hyperparameter changes:

  • ent_coef=0.05 + learning_rate=0.0003 + DR → training collapsed (0 successes)
  • Lesson: don’t change multiple things at once when adding a harder task (DR)

Final Comparison

Config                   Successes   Median    <5cm   Steps
Baseline (no DR)         838         5.7cm     2338   2M
+ Domain Random          359         4.2cm     2811   2M
+ Graduated + 4M         510         10.8cm    1659   4M
− Improvement Shaping    412         24.6cm    1057   4M

Key Takeaways

  1. Coordinate frames matter. World coords vs joint offsets is a classic robotics bug. Always verify actuators do what you expect.
  2. Simpler rewards often win. Distance + improvement shaping beat graduated thresholds. More reward terms = more ways to conflict.
  3. Normalization isn’t universal. Skip it when observations are already well-scaled.
  4. Domain randomization has tradeoffs. Use conservative ranges. Start without DR, verify the task trains, then add DR with the same hyperparameters.
  5. Change one variable at a time. DR + new hyperparameters simultaneously → training collapse.

Observation & Action Space Reference

Observation (12 dims):

Index   Name            Description
0:3     connector_pos   World coordinates
3:6     connector_vel   Linear velocity from cvel
6:9     gripper_pos     World coordinates
9:12    target          Randomized each episode

Action (3 dims):

Index   Name        Range (world coords, m)
0       gripper_x   [-0.2, 0.4]
1       gripper_y   [-0.3, 0.3]
2       gripper_z   [0.0, 0.6]
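
A minimal declaration of these spaces in Gymnasium form, assuming float32 dtypes; the observation bounds below are deliberately loose box limits rather than guarantees:

import numpy as np
from gymnasium import spaces

# Action: world-space gripper target (x, y, z), matching the ranges above
action_space = spaces.Box(
    low=np.array([-0.2, -0.3, 0.0], dtype=np.float32),
    high=np.array([0.4, 0.3, 0.6], dtype=np.float32),
)

# Observation: connector_pos (0:3), connector_vel (3:6), gripper_pos (6:9), target (9:12)
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(12,), dtype=np.float32)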