Cable Insertion RL: MuJoCo Environment Design

January 8, 2026

The Task

Position a cable connector into a target socket using a floating gripper. This is a simplified version of the Intrinsic AI competition task, building intuition before their toolkit drops (Feb 11).

Scene Design (cable_scene.xml)

Gripper

Floating 3-DOF Cartesian gripper. No arm kinematics, just direct position control.

<body name="gripper" pos="0.1 0 0.2">
  <joint name="x" type="slide" axis="1 0 0" range="-0.3 0.3"/>
  <joint name="y" type="slide" axis="0 1 0" range="-0.3 0.3"/>
  <joint name="z" type="slide" axis="0 0 1" range="-0.2 0.4"/>
  <geom type="box" size="0.02 0.02 0.015" rgba="0.5 0.5 0.5 1"/>
</body>

Position actuators with kp=500:

<actuator>
  <position joint="x" kp="500"/>
  <position joint="y" kp="500"/>
  <position joint="z" kp="500"/>
</actuator>

Cable (5 Segments)

Chain of capsules connected by ball joints. Each segment is 3cm long.

<body name="seg1" pos="0 0 -0.03">
  <joint type="ball" damping="0.2" stiffness="0.1"/>
  <geom type="capsule" size="0.008" fromto="0 0 0 0 0 -0.03"/>

  <body name="seg2" pos="0 0 -0.03">
    <!-- ... nested chain continues ... -->
  </body>
</body>

Why 5 segments?

Fewer segments → cable too stiff, doesn’t hang naturally
More segments → more DOFs, slower simulation, harder to control
5 segments (4 cable + 1 connector) gives ~15cm of flexible cable, enough to demonstrate dynamics without excessive complexity

Joint parameters (final values):

Parameter	Value	Purpose
damping	0.2	Energy dissipation, prevents oscillation
stiffness	0.1	Restoring force toward neutral pose

Initial values (0.02 damping, 0.05 stiffness) caused instability during fast movements.

Socket (Target)

<body name="socket" pos="0.18 0 0.04">
  <geom type="cylinder" size="0.02 0.015" contype="0" conaffinity="0"/>
</body>

contype="0" conaffinity="0" makes it a ghost: visual only, no collision. The connector can pass through. Success is measured by distance, not physical insertion.

Connector

<body name="connector" pos="0 0 -0.03">
  <joint type="ball" damping="0.2" stiffness="0.1"/>
  <geom type="cylinder" size="0.015 0.02"/>
</body>

Cylinder at the end of the cable chain. Slightly smaller than socket (radius 0.015 vs 0.02) for visual clarity.

Simulation Parameters

<option gravity="0 0 -9.81" timestep="0.001"/>

Timestep: Started at 0.002s, reduced to 0.001s for stability. Smaller timestep = more accurate integration of cable dynamics.

Substeps per action: 100 (= 100ms sim time per env step)

Environment Design (cable_env.py)

Observation Space (12 dims)

self.observation_space = spaces.Box(
    low=-np.inf, high=np.inf,
    shape=(12,), dtype=np.float32
)

Index	Content	Source
0:3	Connector position	sensordata[0:3]
3:6	Connector velocity	data.cvel[connector_id][3:6]
6:9	Gripper position	sensordata[3:6]
9:12	Target position	self.target

Why include velocity? The cable swings. Velocity helps the policy predict where the connector will be, not just where it is.

Why include target? Enables domain randomization. Target moves each episode, policy must read it.

Action Space (3 dims)

self.action_space = spaces.Box(
    low=np.array([-0.2, -0.3, 0.0], dtype=np.float32),
    high=np.array([0.4, 0.3, 0.6], dtype=np.float32),
    dtype=np.float32
)

Actions are world coordinates for the gripper target position. Converted to joint offsets before applying:

ctrl = np.array(action) - self.gripper_origin  # [0.1, 0, 0.2]
self.data.ctrl[:] = ctrl

Reward Function

# Base: penalize distance to target
reward = -dist * 10

# Shaping: reward improvement (getting closer)
if self.prev_dist is not None:
    improvement = self.prev_dist - dist
    reward += improvement * 50

# Success bonus
if dist <= 0.02:
    reward += 100.0

Why improvement shaping? Pure distance reward has weak gradient far from target. Improvement shaping rewards progress, not just proximity.

Coefficients:

-dist * 10: At 10cm, penalty is -1.0. At 1cm, penalty is -0.1.
improvement * 50: Moving 1cm closer gives +0.5 bonus.
+100: Success bonus dominates when achieved.

Episode Termination

done = dist <= 0.02           # Success: within 2cm
truncated = self.current_step >= self.max_steps  # Timeout: 200 steps

Max steps: 200 × 100ms = 20 seconds of sim time per episode.

Sensors

<sensor>
  <framepos objtype="body" objname="connector" name="connector_pos"/>
  <framepos objtype="body" objname="gripper" name="gripper_pos"/>
</sensor>

MuJoCo sensors provide clean position data in data.sensordata. Alternative would be reading data.xpos[body_id] directly.

Brute Force Validation

Before training, verified the task is solvable:

for x in np.arange(0.10, 0.26, 0.01):
    for z in np.arange(0.05, 0.25, 0.01):
        # Move gripper, let cable settle, measure distance

Best achievable: 0.21cm at gripper position [0.18, 0, 0.23]

This confirms:

Gripper directly above target (x=0.18)
Gripper at z=0.23 lets cable hang down to target at z=0.04
Cable length (~15cm) matches the geometry

Initial Training Results

PPO hyperparameters:

PPO(
    "MlpPolicy",
    env,
    ent_coef=0.01,
    learning_rate=0.0001,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
)

Results (2M steps, no domain randomization):

Successes: 838 / 5000 eval steps
Median distance: 5.7cm
Min distance: 0.37cm

Policy learned to approach the target but often from the side rather than above. Valid solution for fixed task, but brittle.

Key Design Decisions

Floating gripper: Simplifies the problem. Real competition will have arm kinematics, but this builds cable manipulation intuition first.
Ghost socket: No collision means I’m training on the reaching problem, not the insertion physics problem. One thing at a time.
Ball joints for cable: More realistic than hinge joints. Cable can twist and swing in any direction.
100ms action intervals: Slow enough for cable to partially settle, fast enough for reactive control. Tradeoff between stability and responsiveness.
Velocity in observation: Critical for predicting cable dynamics. Without it, policy would be purely reactive.