Cable Insertion RL: MuJoCo Environment Design
The Task
Position a cable connector into a target socket using a floating gripper. This is a simplified version of the Intrinsic AI competition task, building intuition before their toolkit drops (Feb 11).
Scene Design (cable_scene.xml)
Gripper
Floating 3-DOF Cartesian gripper. No arm kinematics, just direct position control.
<body name="gripper" pos="0.1 0 0.2">
  <joint name="x" type="slide" axis="1 0 0" range="-0.3 0.3"/>
  <joint name="y" type="slide" axis="0 1 0" range="-0.3 0.3"/>
  <joint name="z" type="slide" axis="0 0 1" range="-0.2 0.4"/>
  <geom type="box" size="0.02 0.02 0.015" rgba="0.5 0.5 0.5 1"/>
</body>
Position actuators with kp=500:
<actuator>
  <position joint="x" kp="500"/>
  <position joint="y" kp="500"/>
  <position joint="z" kp="500"/>
</actuator>
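As a rough guide to what kp=500 means: with MuJoCo's default velocity gain of zero, a position actuator applies a force proportional to the gap between the commanded and the actual joint position. The sketch below assumes that idealized law and ignores force limits and dynamics. `position_actuator_force` is an illustrative helper, not part of the scene or of MuJoCo.

```python
def position_actuator_force(ctrl, qpos, kp=500.0):
    """Force from an idealized position servo: kp * (target - current).

    Assumption: no velocity gain and no force clamping.
    """
    return kp * (ctrl - qpos)


# A 1 cm tracking error on a slide joint gives roughly 5 N of corrective force.
force = position_actuator_force(ctrl=0.05, qpos=0.04)
```

Under these assumptions the restoring force grows linearly with the tracking error, which is why a stiffer gain tracks targets faster but also excites the cable more.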
Cable (5 Segments)
Chain of capsules connected by ball joints. Each segment is 3cm long.
<body name="seg1" pos="0 0 -0.03">
  <joint type="ball" damping="0.2" stiffness="0.1"/>
  <geom type="capsule" size="0.008" fromto="0 0 0 0 0 -0.03"/>
  <body name="seg2" pos="0 0 -0.03">
    <!-- ... nested chain continues ... -->
  </body>
</body>
Why 5 segments?
- Fewer segments → cable too stiff, doesn’t hang naturally
- More segments → more DOFs, slower simulation, harder to control
- 5 segments (4 cable + 1 connector) gives ~15cm of flexible cable, enough to demonstrate dynamics without excessive complexity
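The DOF cost of each extra segment can be counted directly. The sketch below assumes all five chain bodies (4 cable segments plus the connector) hang from a ball joint at a 3cm offset, as in the XML snippets, and that the gripper contributes three slide joints. `chain_stats` is a hypothetical helper for the arithmetic only.

```python
def chain_stats(n_ball_joints=5, segment_length=0.03, n_slide_joints=3):
    """Count degrees of freedom and nominal cable length for the scene."""
    # Each ball joint adds 3 velocity DOFs but 4 position coordinates (a quaternion).
    nv = n_slide_joints + 3 * n_ball_joints
    nq = n_slide_joints + 4 * n_ball_joints
    length = n_ball_joints * segment_length
    return {"nv": nv, "nq": nq, "length_m": length}


stats = chain_stats()
```

With the defaults this gives 18 velocity DOFs and a nominal length of about 0.15m. Every additional segment adds 3 more DOFs.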
Joint parameters (final values):
| Parameter | Value | Purpose |
|---|---|---|
| damping | 0.2 | Energy dissipation, prevents oscillation |
| stiffness | 0.1 | Restoring force toward neutral pose |
Initial values (0.02 damping, 0.05 stiffness) caused instability during fast movements.
Socket (Target)
<body name="socket" pos="0.18 0 0.04">
  <geom type="cylinder" size="0.02 0.015" contype="0" conaffinity="0"/>
</body>
contype="0" conaffinity="0" makes it a ghost: visual only, no collision. The connector can pass through. Success is measured by distance, not physical insertion.
Connector
<body name="connector" pos="0 0 -0.03">
  <joint type="ball" damping="0.2" stiffness="0.1"/>
  <geom type="cylinder" size="0.015 0.02"/>
</body>
Cylinder at the end of the cable chain. Slightly smaller than socket (radius 0.015 vs 0.02) for visual clarity.
Simulation Parameters
<option gravity="0 0 -9.81" timestep="0.001"/>
Timestep: Started at 0.002s, reduced to 0.001s for stability. Smaller timestep = more accurate integration of cable dynamics.
Substeps per action: 100 (= 100ms sim time per env step)
Environment Design (cable_env.py)
Observation Space (12 dims)
self.observation_space = spaces.Box(
    low=-np.inf, high=np.inf,
    shape=(12,), dtype=np.float32
)
| Index | Content | Source |
|---|---|---|
| 0:3 | Connector position | sensordata[0:3] |
| 3:6 | Connector velocity | data.cvel[connector_id][3:6] |
| 6:9 | Gripper position | sensordata[3:6] |
| 9:12 | Target position | self.target |
Why include velocity? The cable swings. Velocity helps the policy predict where the connector will be, not just where it is.
Why include target? Enables domain randomization. Target moves each episode, policy must read it.
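The layout in the table amounts to concatenating four 3-vectors. Here is a minimal sketch of that assembly, using plain Python lists in place of NumPy arrays so it runs without dependencies. `build_observation` and the sample numbers are illustrative, not taken from cable_env.py.

```python
def build_observation(connector_pos, connector_vel, gripper_pos, target):
    """Concatenate four 3-vectors into the 12-dim observation layout."""
    parts = (connector_pos, connector_vel, gripper_pos, target)
    if any(len(p) != 3 for p in parts):
        raise ValueError("each component must have exactly 3 elements")
    # Order matches the table: connector pos, connector vel, gripper pos, target.
    return [float(v) for p in parts for v in p]


# Made-up values for illustration only, not measurements from the simulation.
obs = build_observation(
    connector_pos=[0.18, 0.0, 0.05],
    connector_vel=[0.0, 0.0, 0.0],
    gripper_pos=[0.18, 0.0, 0.23],
    target=[0.18, 0.0, 0.04],
)
```

Keeping the target in the last three slots means the same policy input shape works whether the target is fixed or randomized per episode.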
Action Space (3 dims)
self.action_space = spaces.Box(
    low=np.array([-0.2, -0.3, 0.0], dtype=np.float32),
    high=np.array([0.4, 0.3, 0.6], dtype=np.float32),
    dtype=np.float32
)
Actions are world coordinates for the gripper target position. They are converted to joint offsets, relative to the gripper's initial position, before being applied:
ctrl = np.array(action) - self.gripper_origin # [0.1, 0, 0.2]
self.data.ctrl[:] = ctrl
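The action bounds line up with the slide joint ranges once the gripper origin is subtracted: for example, the lower x bound of -0.2 maps to a joint offset of -0.3. The sketch below shows the conversion on plain lists. The clipping step is an added safety assumption and is not in the snippet above; `action_to_ctrl` is a hypothetical name.

```python
GRIPPER_ORIGIN = (0.1, 0.0, 0.2)                         # body pos from the scene XML
JOINT_RANGES = ((-0.3, 0.3), (-0.3, 0.3), (-0.2, 0.4))   # slide joint ranges (x, y, z)


def action_to_ctrl(action):
    """Convert a world-frame target into joint offsets, clipped to the joint ranges."""
    ctrl = []
    for a, origin, (lo, hi) in zip(action, GRIPPER_ORIGIN, JOINT_RANGES):
        offset = a - origin
        ctrl.append(min(max(offset, lo), hi))  # clipping is an assumption, see lead-in
    return ctrl


# World target directly above the socket maps to offsets of about (0.08, 0, 0.03).
ctrl = action_to_ctrl([0.18, 0.0, 0.23])
```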
Reward Function
# Base: penalize distance to target
reward = -dist * 10
# Shaping: reward improvement (getting closer)
if self.prev_dist is not None:
    improvement = self.prev_dist - dist
    reward += improvement * 50
# Success bonus
if dist <= 0.02:
    reward += 100.0
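Why improvement shaping? The distance penalty alone changes little from one step to the next (a 1cm move shifts it by only 0.1), so the learning signal is weak, especially far from the target. Improvement shaping rewards progress directly, not just proximity.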
Coefficients:
- -dist * 10: At 10cm, penalty is -1.0. At 1cm, penalty is -0.1.
- improvement * 50: Moving 1cm closer gives +0.5 bonus.
- +100: Success bonus dominates when achieved.
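To check the coefficients against each other, the three terms can be pulled into a standalone function. This is a sketch that mirrors the reward snippet, with `prev_dist` passed in explicitly instead of read from `self`; the function name and the `success_radius` parameter are illustrative.

```python
def compute_reward(dist, prev_dist=None, success_radius=0.02):
    """Distance penalty + improvement shaping + success bonus (distances in metres)."""
    reward = -dist * 10
    if prev_dist is not None:
        reward += (prev_dist - dist) * 50   # positive when the connector got closer
    if dist <= success_radius:
        reward += 100.0
    return reward


first_step = compute_reward(0.10)                 # no previous distance yet
closer = compute_reward(0.09, prev_dist=0.10)     # moved 1 cm closer
success = compute_reward(0.02, prev_dist=0.03)    # reached the success radius
```

These three cases come out at roughly -1.0, -0.4, and 100.3. A single 1cm step of progress offsets about half of the distance penalty at 10cm, while the success bonus is two orders of magnitude larger than either term.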
Episode Termination
done = dist <= 0.02 # Success: within 2cm
truncated = self.current_step >= self.max_steps # Timeout: 200 steps
Max steps: 200 × 100ms = 20 seconds of sim time per episode.
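The termination flags and the timing budget can be checked together. The sketch below restates the constants from this document in one place; `episode_status` is a hypothetical helper, not the environment's actual `step` method.

```python
TIMESTEP = 0.001        # seconds, from <option timestep="0.001"/>
SUBSTEPS = 100          # physics steps per env step
MAX_STEPS = 200         # env steps before timeout
SUCCESS_RADIUS = 0.02   # metres


def episode_status(dist, current_step):
    """Return (done, truncated, elapsed_sim_seconds) after an env step."""
    done = dist <= SUCCESS_RADIUS
    truncated = current_step >= MAX_STEPS
    elapsed = current_step * SUBSTEPS * TIMESTEP
    return done, truncated, elapsed


# At step 200 with the connector still 5 cm away: timeout without success.
status = episode_status(dist=0.05, current_step=200)
```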
Sensors
<sensor>
  <framepos objtype="body" objname="connector" name="connector_pos"/>
  <framepos objtype="body" objname="gripper" name="gripper_pos"/>
</sensor>
MuJoCo sensors provide clean position data in data.sensordata. An alternative would be to read data.xpos[body_id] directly.
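Sensor outputs are packed into the flat sensordata array in declaration order, which is where the [0:3] and [3:6] slices in the observation table come from. A minimal sketch of that bookkeeping, assuming each framepos sensor contributes 3 values; `sensor_slices` and the sample readings are illustrative.

```python
def sensor_slices(sensors):
    """Map sensor names to slices of a flat sensordata array, in declaration order."""
    slices, start = {}, 0
    for name, dim in sensors:
        slices[name] = slice(start, start + dim)
        start += dim
    return slices


layout = sensor_slices([("connector_pos", 3), ("gripper_pos", 3)])

# Made-up readings for illustration only.
sensordata = [0.18, 0.0, 0.05, 0.18, 0.0, 0.23]
gripper_pos = sensordata[layout["gripper_pos"]]
```

Looking slices up by name avoids silently reading the wrong values if another sensor is later added ahead of these two.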
Brute Force Validation
Before training, verified the task is solvable:
for x in np.arange(0.10, 0.26, 0.01):
    for z in np.arange(0.05, 0.25, 0.01):
        # Move gripper, let cable settle, measure distance
Best achievable: 0.21cm at gripper position [0.18, 0, 0.23]
This confirms:
- Gripper directly above target (x=0.18)
- Gripper at z=0.23 lets cable hang down to target at z=0.04
- Cable length (~15cm) matches the geometry
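The sweep above can be sketched as a generic grid search over gripper positions. The measurement itself (move the gripper, let the cable settle, read the distance) is passed in as a callable because it depends on the simulation. `grid_search` and `toy_measure` are hypothetical names, and `toy_measure` is a stand-in whose minimum is placed at the reported best position purely for demonstration; it is not a cable model.

```python
import math


def grid_search(measure, x_range=(0.10, 0.26), z_range=(0.05, 0.25), step=0.01):
    """Return (best_distance, (x, z)) over a regular grid with exclusive upper bounds."""
    nx = round((x_range[1] - x_range[0]) / step)
    nz = round((z_range[1] - z_range[0]) / step)
    best_dist, best_pos = math.inf, None
    for i in range(nx):
        for k in range(nz):
            # Index-based coordinates avoid accumulating floating-point drift.
            x = round(x_range[0] + i * step, 4)
            z = round(z_range[0] + k * step, 4)
            dist = measure(x, z)
            if dist < best_dist:
                best_dist, best_pos = dist, (x, z)
    return best_dist, best_pos


def toy_measure(x, z):
    """Stand-in for 'settle and measure'; NOT the real simulation."""
    return math.hypot(x - 0.18, z - 0.23)


best_dist, best_pos = grid_search(toy_measure)
```

In the real check, `measure` would step the simulation until the cable settles before reading the connector-to-target distance.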
Initial Training Results
PPO hyperparameters:
PPO(
    "MlpPolicy",
    env,
    ent_coef=0.01,
    learning_rate=0.0001,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
)
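These settings fix how much optimization happens per rollout. The arithmetic below assumes a single environment (n_envs=1, not stated in these notes) and that the argument names follow the Stable-Baselines3 PPO convention, which is also an assumption. `ppo_update_budget` is a hypothetical helper.

```python
import math


def ppo_update_budget(total_timesteps, n_steps=2048, batch_size=64,
                      n_epochs=10, n_envs=1):
    """Rough count of rollouts and gradient steps for a PPO run."""
    rollout_size = n_steps * n_envs
    minibatches = rollout_size // batch_size
    return {
        # Collection continues until the step budget is met, so round up.
        "rollouts": math.ceil(total_timesteps / rollout_size),
        "minibatches_per_epoch": minibatches,
        "gradient_steps_per_rollout": minibatches * n_epochs,
    }


budget = ppo_update_budget(2_000_000)
```

Under these assumptions a 2M-step run collects 977 rollouts, each followed by 320 gradient steps (32 minibatches over 10 epochs).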
Results (2M steps, no domain randomization):
- Successes: 838 / 5000 eval steps
- Median distance: 5.7cm
- Min distance: 0.37cm
Policy learned to approach the target but often from the side rather than above. Valid solution for fixed task, but brittle.
Key Design Decisions
- Floating gripper: Simplifies the problem. Real competition will have arm kinematics, but this builds cable manipulation intuition first.
- Ghost socket: No collision means I’m training on the reaching problem, not the insertion physics problem. One thing at a time.
- Ball joints for cable: More realistic than hinge joints. Cable can twist and swing in any direction.
- 100ms action intervals: Slow enough for cable to partially settle, fast enough for reactive control. Tradeoff between stability and responsiveness.
- Velocity in observation: Critical for predicting cable dynamics. Without it, policy would be purely reactive.