Cable Insertion RL: Coordinate Frame Bugs & Reward Ablations

January 9, 2026

Session Context

Continuing from Session 1’s MuJoCo environment. Had a working scene with 838 successes on the fixed task, but the agent was approaching from the side instead of above, and median distance was 5.7cm. Goal: push toward <2cm median with better generalization.


Critical Bug: Coordinate Frame Mismatch

Symptom: Brute force achieved 0.6cm, but training plateaued at 4cm. Policy appeared to hover without purposeful movement.

Debugging process:

# Commanded action (intended as a world-space target)
action = [0.26, 0, 0.0]

# Actual gripper position after settling
gripper_pos = [0.36, 0, 0.2]  # way off: expected roughly [0.26, 0, 0.0]

Root cause: MuJoCo position actuators control joint offsets, not world coordinates. The gripper body starts at [0.1, 0, 0.2] in world space. Joint qpos = [0, 0, 0] means “at body origin.”

# Bug: action interpreted as joint offset
self.data.ctrl[:] = action  # ctrl=[0.26, 0, 0]
# Result: gripper goes to [0.1+0.26, 0, 0.2+0] = [0.36, 0, 0.2]

# Fix: convert world coords to joint offsets
self.gripper_origin = np.array([0.1, 0.0, 0.2])
ctrl = np.array(action) - self.gripper_origin
self.data.ctrl[:] = ctrl

Result: Brute force now achieves 0.21cm at gripper position [0.18, 0, 0.23].
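
A quick way to catch this class of bug early is to command a known world-space target and read back the body's world pose directly from MuJoCo. A minimal sketch using the official mujoco Python bindings; the XML filename, the body name "gripper", and the assumption that the first three actuators drive the gripper are all mine:

import mujoco
import numpy as np

model = mujoco.MjModel.from_xml_path("cable_insertion.xml")  # filename is an assumption
data = mujoco.MjData(model)

gripper_origin = np.array([0.1, 0.0, 0.2])   # gripper body origin in world frame
target_world = np.array([0.18, 0.0, 0.23])   # a reachable world-space target

data.ctrl[:3] = target_world - gripper_origin  # position actuators take joint offsets
for _ in range(2000):                          # let the actuators settle
    mujoco.mj_step(model, data)

print("gripper world pos:", data.body("gripper").xpos)  # should land close to target_world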


Physics Stability Fix

Symptom: WARNING: Nan, Inf or huge value in QACC at DOF 9

DOF 9 = third ball joint in the cable chain. The cable was “whipping” and exploding during fast gripper movements.

Parameter              Before    After
Timestep               0.002s    0.001s
Ball joint damping     0.02      0.2
Ball joint stiffness   0.05      0.1
Substeps per action    50        100

Kept 100ms of sim time per environment step (100 substeps × 0.001s, matching the previous 50 × 0.002s).
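
A minimal sketch of how the smaller timestep and larger substep count fit together in the environment step; the function shape and the assumption that the first three actuators drive the gripper are mine, and the XML would set timestep="0.001" in <option>:

import mujoco
import numpy as np

N_SUBSTEPS = 100  # 100 substeps × 0.001s timestep = 100ms of sim time per env step

def apply_action(model, data, action, gripper_origin):
    # Convert the world-space target to joint offsets (coordinate-frame fix above),
    # then sub-step the physics so the cable stays stable at the smaller timestep.
    data.ctrl[:3] = np.asarray(action) - gripper_origin
    for _ in range(N_SUBSTEPS):
        mujoco.mj_step(model, data)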


Experiment 1: Observation Normalization

Hypothesis: VecNormalize wrapper would help PPO by standardizing inputs.

from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

env = DummyVecEnv([lambda: CableInsertionEnv()])
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)

Config              Successes   Median
No normalization    838         5.7cm
Obs + reward norm   454         19.2cm
Obs only norm       500         20.4cm

Analysis: Our observation space was already well-scaled:

  • Positions: 0.1–0.4m
  • Velocities: ±1 m/s
  • Target: fixed at [0.18, 0, 0.04]

Normalization introduced non-stationarity (running stats shift during training) without benefit. Reverted.
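
The "already well-scaled" claim is cheap to verify empirically before reaching for VecNormalize. A rough sketch, assuming CableInsertionEnv follows the Gymnasium API (reset returning (obs, info), step returning a 5-tuple):

import numpy as np

env = CableInsertionEnv()
obs, _ = env.reset()
obs_log = [obs]
for _ in range(1000):
    obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
    obs_log.append(obs)
    if terminated or truncated:
        obs, _ = env.reset()
obs_log = np.array(obs_log)
print("per-dim min:", obs_log.min(axis=0))  # if every dimension stays within ~O(1) ranges,
print("per-dim max:", obs_log.max(axis=0))  # standardization has little to offer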


Experiment 2: Domain Randomization

What we randomized each episode:

# rng: per-environment np.random.Generator (e.g. self.np_random)

# Target position: ±1cm x/y, ±0.5cm z
target_noise = rng.uniform([-0.01, -0.01, -0.005], [0.01, 0.01, 0.005])
self.target = self.target_base + target_noise

# Initial gripper position: ±2cm x/z, ±1cm y (joint offsets from the body origin)
gripper_noise = rng.uniform([-0.02, -0.01, -0.02], [0.02, 0.01, 0.02])
self.data.qpos[0:3] = gripper_noise

# Cable physics: damping ±15%
# (self.default_damping = self.model.dof_damping.copy(), cached once at init)
damping_scale = rng.uniform(0.85, 1.15)
self.model.dof_damping[:] = self.default_damping * damping_scale

Metric      Without DR   With DR
Successes   838          359
Median      5.7cm        4.2cm
<5cm        2338         2811

Analysis: DR trades peak performance for consistency. The policy can’t memorize a single trajectory; it must actually use the target observation.


Experiment 3: Graduated Reward Bonuses

Hypothesis: One-time bonuses at intermediate thresholds create “breadcrumbs.”

reward = -dist * 10  # Base: penalize distance

# Improvement shaping (continuous)
if self.prev_dist is not None:
    improvement = self.prev_dist - dist
    reward += improvement * 50

# Graduated bonuses (one-time per episode)
if dist <= 0.05 and 5 not in self.thresholds_crossed:
    reward += 10.0
    self.thresholds_crossed.add(5)
if dist <= 0.03 and 3 not in self.thresholds_crossed:
    reward += 25.0
    self.thresholds_crossed.add(3)
if dist <= 0.02 and 2 not in self.thresholds_crossed:
    reward += 100.0
    self.thresholds_crossed.add(2)
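
The bonus logic relies on two pieces of per-episode state; a minimal sketch of the matching bookkeeping in reset(), using the attribute names from the snippet above:

# In reset(), at the start of every episode:
self.prev_dist = None            # no improvement term on the first step
self.thresholds_crossed = set()  # each graduated bonus fires at most once per episode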

Metric      Original Reward   + Graduated
Successes   359               510
Median      4.2cm             10.8cm
<5cm        2811              1659

Analysis: More successes but worse median. Policy became “all or nothing,” either hitting 2cm or staying far away. The graduated bonuses may have conflicted with continuous improvement shaping.


Experiment 4: Remove Improvement Shaping

Hypothesis: Simplify to just distance penalty + threshold bonuses.

reward = -dist * 10 + threshold_bonuses  # No improvement term

Metric      With Improvement   Without
Successes   510                412
Median      10.8cm             24.6cm
<5cm        1659               1057

Analysis: Worst run. The continuous improvement signal provides crucial gradient information. Threshold bonuses alone are too sparse.
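
One way to formalize that intuition: the improvement term is, up to the ×50 scale and a discount factor near 1, potential-based reward shaping with potential Φ(s) = −dist(s); the threshold bonuses have no comparable dense counterpart. A sketch of the equivalence:

# Potential-based shaping view of the improvement term (assuming gamma ≈ 1):
#   phi(s)   = -dist(s)
#   F(s, s') = gamma * phi(s') - phi(s)
#            ≈ -dist + prev_dist
#            = prev_dist - dist    # the "improvement" computed earlier, before the ×50 scale
# Shaping of this form adds dense per-step learning signal without changing which
# policy is optimal, which fits the observation that removing it hurt every metric.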


Hyperparameters

PPO configuration (Stable Baselines3):

from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    env,
    ent_coef=0.01,         # Entropy bonus
    learning_rate=0.0001,  # Conservative
    n_steps=2048,          # Steps per rollout
    batch_size=64,         # Minibatch size
    n_epochs=10,           # PPO epochs per update
)
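
For completeness, a training call consistent with the Steps column in the final comparison below; the save filename is an assumption:

model.learn(total_timesteps=2_000_000)  # baseline and DR runs; graduated-bonus runs ran 4M
model.save("ppo_cable_insertion")       # filename is an assumption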

Failed hyperparameter changes:

  • ent_coef=0.05 + learning_rate=0.0003 + DR → training collapsed (0 successes)
  • Lesson: don’t change multiple things at once when adding a harder task (DR)

Final Comparison

Config                   Successes   Median    <5cm   Steps
Baseline (no DR)         838         5.7cm     2338   2M
+ Domain Random          359         4.2cm     2811   2M
+ Graduated + 4M         510         10.8cm    1659   4M
− Improvement Shaping    412         24.6cm    1057   4M

Key Takeaways

  1. Coordinate frames matter. World coords vs joint offsets is a classic robotics bug. Always verify actuators do what you expect.
  2. Simpler rewards often win. Distance + improvement shaping beat graduated thresholds. More reward terms = more ways to conflict.
  3. Normalization isn’t universal. Skip it when observations are already well-scaled.
  4. Domain randomization has tradeoffs. Use conservative ranges. Start without DR, verify the task trains, then add DR with the same hyperparameters.
  5. Change one variable at a time. DR + new hyperparameters simultaneously → training collapse.

Observation & Action Space Reference

Observation (12 dims):

Index   Name            Description
0:3     connector_pos   World coordinates
3:6     connector_vel   Linear velocity from cvel
6:9     gripper_pos     World coordinates
9:12    target          Randomized each episode

Action (3 dims):

Index   Name        Range (world coords, m)
0       gripper_x   [-0.2, 0.4]
1       gripper_y   [-0.3, 0.3]
2       gripper_z   [0.0, 0.6]
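
A minimal declaration of these spaces in Gymnasium form, assuming float32 dtypes; the observation bounds below are deliberately loose box limits rather than guarantees:

import numpy as np
from gymnasium import spaces

# Action: world-space gripper target (x, y, z), matching the ranges above
action_space = spaces.Box(
    low=np.array([-0.2, -0.3, 0.0], dtype=np.float32),
    high=np.array([0.4, 0.3, 0.6], dtype=np.float32),
)

# Observation: connector_pos (0:3), connector_vel (3:6), gripper_pos (6:9), target (9:12)
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(12,), dtype=np.float32)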