Cable Insertion RL: Coordinate Frame Bugs & Reward Ablations
Session Context
Continuing from Session 1’s MuJoCo environment. Had a working scene with 838 successes on the fixed task, but the agent was approaching from the side instead of above, and median distance was 5.7cm. Goal: push toward <2cm median with better generalization.
Critical Bug: Coordinate Frame Mismatch
Symptom: Brute force achieved 0.6cm, but training plateaued at 4cm. Policy appeared to hover without purposeful movement.
Debugging process:
# Commanded action
action = [0.26, 0, 0.0]
# Actual gripper position after settling
gripper_pos = [0.36, 0, 0.2] # Way off!
Root cause: MuJoCo position actuators control joint offsets, not world coordinates. The gripper body starts at [0.1, 0, 0.2] in world space. Joint qpos = [0, 0, 0] means “at body origin.”
# Bug: action interpreted as joint offset
self.data.ctrl[:] = action # ctrl=[0.26, 0, 0]
# Result: gripper goes to [0.1+0.26, 0, 0.2+0] = [0.36, 0, 0.2]
# Fix: convert world coords to joint offsets
self.gripper_origin = np.array([0.1, 0.0, 0.2])
ctrl = np.array(action) - self.gripper_origin
self.data.ctrl[:] = ctrl
Result: Brute force now achieves 0.21cm at gripper position [0.18, 0, 0.23].
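A quick check that would have caught this immediately: command a world-space point, let the sim settle, and compare against where the gripper body actually ends up. A minimal sketch, assuming the standard MuJoCo Python bindings and a body named "gripper" in the MJCF (the body name and function name are assumptions):
import numpy as np
import mujoco

def check_actuator_frame(model, data, world_target, settle_steps=200):
    # Convert the world-space target into joint offsets before writing ctrl
    gripper_origin = np.array([0.1, 0.0, 0.2])   # gripper body origin from the MJCF
    data.ctrl[:3] = np.array(world_target) - gripper_origin
    for _ in range(settle_steps):
        mujoco.mj_step(model, data)
    # Compare the commanded target against the body's actual world position
    gripper_pos = data.body("gripper").xpos
    err = np.linalg.norm(gripper_pos - world_target)
    print(f"commanded {world_target}, reached {gripper_pos}, error {err * 100:.2f} cm")
    return err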
Physics Stability Fix
Symptom: WARNING: Nan, Inf or huge value in QACC at DOF 9
DOF 9 = third ball joint in the cable chain. The cable was “whipping” and exploding during fast gripper movements.
| Parameter | Before | After |
|---|---|---|
| Timestep | 0.002s | 0.001s |
| Ball joint damping | 0.02 | 0.2 |
| Ball joint stiffness | 0.05 | 0.1 |
| Substeps per action | 50 | 100 |
Kept 100ms of sim time per environment step (100 × 0.001s).
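These settings live in the MJCF, but the equivalent runtime changes are compact. A sketch, assuming cable_dof_ids and cable_joint_ids are index arrays for the cable's ball joints (hypothetical names):
# Smaller timestep + heavier damping/stiffness on the cable to stop QACC blow-ups
model.opt.timestep = 0.001                    # was 0.002
model.dof_damping[cable_dof_ids] = 0.2        # ball joint damping, was 0.02
model.jnt_stiffness[cable_joint_ids] = 0.1    # ball joint stiffness, was 0.05

SUBSTEPS = 100                                # was 50; 100 * 0.001s = 100ms of sim per env step
for _ in range(SUBSTEPS):
    mujoco.mj_step(model, data)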
Experiment 1: Observation Normalization
Hypothesis: VecNormalize wrapper would help PPO by standardizing inputs.
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

env = DummyVecEnv([lambda: CableInsertionEnv()])
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.0)
| Config | Successes | Median |
|---|---|---|
| No normalization | 838 | 5.7cm |
| Obs + reward norm | 454 | 19.2cm |
| Obs only norm | 500 | 20.4cm |
Analysis: Our observation space was already well-scaled:
- Positions: 0.1–0.4m
- Velocities: ±1 m/s
- Target: fixed at [0.18, 0, 0.04]
Normalization introduced non-stationarity (running stats shift during training) without benefit. Reverted.
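For reference, the running statistics are exactly what make this non-stationary: they keep shifting during training and become part of the model. If the wrapper were kept, they would need to be saved with the policy and frozen at evaluation time; a sketch using the Stable Baselines3 API (the filename is arbitrary):
# Save the wrapper's running mean/var alongside the policy checkpoint
env.save("vecnormalize.pkl")

# At evaluation: reload the stats, stop updating them, and report raw rewards
eval_env = VecNormalize.load("vecnormalize.pkl", DummyVecEnv([lambda: CableInsertionEnv()]))
eval_env.training = False
eval_env.norm_reward = False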
Experiment 2: Domain Randomization
What we randomized each episode:
# Target position: ±1cm x/y, ±0.5cm z
target_noise = rng.uniform([-0.01, -0.01, -0.005], [0.01, 0.01, 0.005])
self.target = self.target_base + target_noise
# Initial gripper position: ±2cm x/z, ±1cm y
gripper_noise = rng.uniform([-0.02, -0.01, -0.02], [0.02, 0.01, 0.02])
self.data.qpos[0:3] = gripper_noise
# Cable physics: damping ±15%
damping_scale = rng.uniform(0.85, 1.15)
self.model.dof_damping[:] = self.default_damping * damping_scale
| Metric | Without DR | With DR |
|---|---|---|
| Successes | 838 | 359 |
| Median | 5.7cm | 4.2cm |
| <5cm | 2338 | 2811 |
Analysis: DR trades peak performance for consistency. The policy can’t memorize a single trajectory; it must actually use the target observation.
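One wiring detail worth making explicit: the damping scale has to be applied to a saved copy of the defaults, otherwise the ±15% compounds across episodes. A sketch of how reset() picks up all three randomizations, assuming a Gymnasium-style env with self.np_random:
# In __init__: capture defaults once so the per-episode scaling never compounds
self.default_damping = self.model.dof_damping.copy()
self.target_base = np.array([0.18, 0.0, 0.04])

# In reset(): re-draw every randomization, then recompute derived quantities
rng = self.np_random
self.target = self.target_base + rng.uniform([-0.01, -0.01, -0.005], [0.01, 0.01, 0.005])
self.data.qpos[0:3] = rng.uniform([-0.02, -0.01, -0.02], [0.02, 0.01, 0.02])
self.model.dof_damping[:] = self.default_damping * rng.uniform(0.85, 1.15)
mujoco.mj_forward(self.model, self.data)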
Experiment 3: Graduated Reward Bonuses
Hypothesis: One-time bonuses at intermediate thresholds create “breadcrumbs.”
reward = -dist * 10  # Base: penalize distance
# Improvement shaping (continuous)
if self.prev_dist is not None:
    improvement = self.prev_dist - dist
    reward += improvement * 50
# Graduated bonuses (one-time per episode)
if dist <= 0.05 and 5 not in self.thresholds_crossed:
    reward += 10.0
    self.thresholds_crossed.add(5)
if dist <= 0.03 and 3 not in self.thresholds_crossed:
    reward += 25.0
    self.thresholds_crossed.add(3)
if dist <= 0.02 and 2 not in self.thresholds_crossed:
    reward += 100.0
    self.thresholds_crossed.add(2)
| Metric | Original Reward | + Graduated |
|---|---|---|
| Successes | 359 | 510 |
| Median | 4.2cm | 10.8cm |
| <5cm | 2811 | 1659 |
Analysis: More successes but worse median. Policy became “all or nothing,” either hitting 2cm or staying far away. The graduated bonuses may have conflicted with continuous improvement shaping.
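A related footgun: both the improvement term and the one-time bonuses carry per-episode state, so reset() has to clear it or the shaping silently degrades. A minimal sketch:
# In reset(): clear shaping state so bonuses can fire again next episode
self.prev_dist = None
self.thresholds_crossed = set()

# At the end of step(): remember the distance for the next improvement term
self.prev_dist = dist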
Experiment 4: Remove Improvement Shaping
Hypothesis: Simplify to just distance penalty + threshold bonuses.
reward = -dist * 10 + threshold_bonuses # No improvement term
| Metric | With Improvement | Without |
|---|---|---|
| Successes | 510 | 412 |
| Median | 10.8cm | 24.6cm |
| <5cm | 1659 | 1057 |
Analysis: Worst run. The continuous improvement signal provides crucial gradient information. Threshold bonuses alone are too sparse.
Hyperparameters
PPO configuration (Stable Baselines3):
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    env,
    ent_coef=0.01,          # Entropy bonus
    learning_rate=0.0001,   # Conservative
    n_steps=2048,           # Steps per rollout
    batch_size=64,          # Minibatch size
    n_epochs=10,            # PPO epochs per update
)
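The training call itself is the usual SB3 pattern; the total timesteps match the runs in the comparison table below, and the checkpoint name is arbitrary:
model.learn(total_timesteps=2_000_000)   # 4_000_000 for the longer runs
model.save("ppo_cable_insertion")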
Failed hyperparameter changes:
- ent_coef=0.05 + learning_rate=0.0003 + DR → training collapsed (0 successes)
- Lesson: don't change multiple things at once when adding a harder task (DR)
Final Comparison
| Config | Successes | Median | <5cm | Steps |
|---|---|---|---|---|
| Baseline (no DR) | 838 | 5.7cm | 2338 | 2M |
| + Domain Random | 359 | 4.2cm | 2811 | 2M |
| + Graduated + 4M | 510 | 10.8cm | 1659 | 4M |
| − Improvement Shaping | 412 | 24.6cm | 1057 | 4M |
Key Takeaways
- Coordinate frames matter. World coords vs joint offsets is a classic robotics bug. Always verify actuators do what you expect.
- Simpler rewards often win. Distance + improvement shaping beat graduated thresholds. More reward terms = more ways to conflict.
- Normalization isn’t universal. Skip it when observations are already well-scaled.
- Domain randomization has tradeoffs. Use conservative ranges. Start without DR, verify training works, then add DR with the same hyperparameters.
- Change one variable at a time. DR + new hyperparameters simultaneously → training collapse.
Observation & Action Space Reference
Observation (12 dims):
| Index | Name | Description |
|---|---|---|
| 0:3 | connector_pos | World coordinates |
| 3:6 | connector_vel | Linear velocity from cvel |
| 6:9 | gripper_pos | World coordinates |
| 9:12 | target | Randomized each episode |
Action (3 dims):
| Index | Name | Range |
|---|---|---|
| 0 | gripper_x | [-0.2, 0.4] world coords |
| 1 | gripper_y | [-0.3, 0.3] |
| 2 | gripper_z | [0.0, 0.6] |
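For concreteness, a sketch of how these tables map onto code, assuming Gymnasium spaces and MJCF body names "connector" and "gripper" (the names are assumptions):
import numpy as np
from gymnasium import spaces

# In __init__: action bounds straight from the table above
self.action_space = spaces.Box(
    low=np.array([-0.2, -0.3, 0.0], dtype=np.float32),
    high=np.array([0.4, 0.3, 0.6], dtype=np.float32),
)
self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(12,), dtype=np.float32)

def _get_obs(self):
    connector = self.data.body("connector")
    gripper = self.data.body("gripper")
    return np.concatenate([
        connector.xpos,        # 0:3  connector world position
        connector.cvel[3:6],   # 3:6  linear part of cvel (angular first, then linear)
        gripper.xpos,          # 6:9  gripper world position
        self.target,           # 9:12 randomized target
    ]).astype(np.float32)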