VL-JEPA Reproduction
The Goal
Reproduce Meta’s VL-JEPA paper on edge hardware.
Part 1: Architecture (Mac M4)
Reproduced Meta’s VL-JEPA architecture from their December paper. The core insight: predict text embeddings from video in a shared latent space rather than generating tokens autoregressively - a cleaner fit for real-time robotics.
Stack: V-JEPA 2 (video encoder, frozen) → 6-layer transformer predictor (learned) → EmbeddingGemma (text encoder, frozen). Trained with a bidirectional InfoNCE loss.
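A minimal sketch of that objective, assuming "bidirectional" means the symmetric video→text plus text→video form and that the predictor output and EmbeddingGemma embedding are pooled to the same dimension (names, shapes, and temperature are illustrative, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def bidirectional_infonce(pred_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (predicted, text) embedding pairs.

    pred_emb: (B, D) predictor outputs pooled from V-JEPA 2 video features.
    text_emb: (B, D) frozen EmbeddingGemma embeddings of the paired captions.
    Matching pairs share a row index; every other row in the batch is a negative.
    """
    pred_emb = F.normalize(pred_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = pred_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```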
An overfit test on two videos converged from 0.64 → 0.03 loss. Zero-shot retrieval correctly ranks the candidates.
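Retrieval in this setup reduces to cosine similarity in the shared space. A sketch, assuming the candidates are caption embeddings from the frozen EmbeddingGemma encoder:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_candidates(pred_emb: torch.Tensor, cand_embs: torch.Tensor) -> torch.Tensor:
    """Rank candidate text embeddings against one predicted video embedding.

    pred_emb:  (D,)   predictor output for a single clip.
    cand_embs: (N, D) EmbeddingGemma embeddings of the candidate captions.
    Returns candidate indices sorted best-to-worst by cosine similarity.
    """
    sims = F.cosine_similarity(pred_emb.unsqueeze(0), cand_embs, dim=-1)
    return sims.argsort(descending=True)
```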
Part 2: Edge Deployment (Jetson Orin Nano)
Part 1 runs on a Mac with the latest torch. Getting it onto a Jetson Orin Nano Super (8GB unified memory, 67 TOPS) was a different story.
The problem: V-JEPA 2 was added to transformers in v4.53, which requires torch ≥ 2.6. Jetson’s L4T image ships torch 2.2.
The fix: the version check is just a gate - the ops themselves work fine on torch 2.2, so install transformers without pulling in its dependencies:
```
pip install --no-deps transformers==4.54.0
```
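A quick smoke test after the `--no-deps` install, to confirm the pinned transformers imports against the Jetson's torch 2.2 and that V-JEPA 2 weights load. The checkpoint ID below is an assumption; substitute whichever V-JEPA 2 checkpoint you actually use:

```python
import torch
import transformers
from transformers import AutoModel

print("torch", torch.__version__, "| transformers", transformers.__version__)

# Checkpoint ID is an assumption - swap in the V-JEPA 2 checkpoint you use.
model = AutoModel.from_pretrained("facebook/vjepa2-vitl-fpc64-256")
model.eval()
print(model.config.model_type,
      f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M params")
```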
Other gotchas:
- eva-decord has no ARM wheels → swapped to OpenCV (sketch below this list)
- HF cache on a mounted volume causes segfaults → separate cache mount
- tokenizers version mismatch → pin to >=0.21,<0.22
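The decord swap is mechanical: the encoder just needs a stack of uniformly sampled RGB frames. A sketch of the OpenCV replacement, assuming the downstream video processor accepts a (T, H, W, 3) uint8 array (frame count and seeking strategy are illustrative):

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 64) -> np.ndarray:
    """Uniformly sample RGB frames with OpenCV as a drop-in for decord.

    Returns a (num_frames, H, W, 3) uint8 array for the video processor.
    """
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for i in indices:
        # Index-based seeking; can be imprecise with some codecs, in which
        # case a sequential read-and-skip loop is the safer fallback.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # BGR -> RGB
    cap.release()
    return np.stack(frames)
```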
Results:
- 26s warm start, ~5s inference
- Correct retrieval on both test videos
- Runs in 8GB unified memory
What’s Next
- YOLO vs VL-JEPA comparison (why semantic understanding beats boxes)
- Live camera integration
- Sentry voice + vision unified