VL-JEPA Reproduction
The Goal
Reproduce Meta’s VL-JEPA paper on edge hardware.
Part 1: Architecture (Mac M4)
Reproduced Meta’s VL-JEPA architecture from their December paper. The core insight: predict text embeddings from video in a shared latent space rather than generating tokens autoregressively - a cleaner fit for real-time robotics.
Stack: V-JEPA 2 (video encoder, frozen) → 6-layer transformer predictor (learned) → EmbeddingGemma (text encoder, frozen). Trained with a bidirectional InfoNCE loss.
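A minimal sketch of that objective, assuming "bidirectional" means the symmetric video→text plus text→video form and that the predictor output and EmbeddingGemma embedding are pooled to the same dimension (names, shapes, and temperature are illustrative, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def bidirectional_infonce(pred_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (predicted, text) embedding pairs.

    pred_emb: (B, D) predictor outputs pooled from V-JEPA 2 video features.
    text_emb: (B, D) frozen EmbeddingGemma embeddings of the paired captions.
    Matching pairs share a row index; every other row in the batch is a negative.
    """
    pred_emb = F.normalize(pred_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = pred_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```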
An overfit test on two videos converged from 0.64 → 0.03 loss. Zero-shot retrieval correctly ranks the candidates.
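Retrieval in this setup reduces to cosine similarity in the shared space. A sketch, assuming the candidates are caption embeddings from the frozen EmbeddingGemma encoder:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_candidates(pred_emb: torch.Tensor, cand_embs: torch.Tensor) -> torch.Tensor:
    """Rank candidate text embeddings against one predicted video embedding.

    pred_emb:  (D,)   predictor output for a single clip.
    cand_embs: (N, D) EmbeddingGemma embeddings of the candidate captions.
    Returns candidate indices sorted best-to-worst by cosine similarity.
    """
    sims = F.cosine_similarity(pred_emb.unsqueeze(0), cand_embs, dim=-1)
    return sims.argsort(descending=True)
```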
Part 2: Edge Deployment (Jetson Orin Nano)
Part 1 runs on a Mac with the latest torch. Getting it onto a Jetson Orin Nano Super (8GB unified memory, 67 TOPS) was a different story.
The problem: V-JEPA 2 was added to transformers in v4.53, which requires torch ≥ 2.6. Jetson’s L4T image ships torch 2.2.
The fix: the version check is just a gate - the ops themselves work fine on torch 2.2, so install transformers without pulling in its dependencies:
```
pip install --no-deps transformers==4.54.0
```
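A quick smoke test after the `--no-deps` install, to confirm the pinned transformers imports against the Jetson's torch 2.2 and that V-JEPA 2 weights load. The checkpoint ID below is an assumption; substitute whichever V-JEPA 2 checkpoint you actually use:

```python
import torch
import transformers
from transformers import AutoModel

print("torch", torch.__version__, "| transformers", transformers.__version__)

# Checkpoint ID is an assumption - swap in the V-JEPA 2 checkpoint you use.
model = AutoModel.from_pretrained("facebook/vjepa2-vitl-fpc64-256")
model.eval()
print(model.config.model_type,
      f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M params")
```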
Other gotchas:
- eva-decord has no ARM wheels → swapped to OpenCV (sketch below this list)
- HF cache on a mounted volume causes segfaults → separate cache mount
- tokenizers version mismatch → pin to >=0.21,<0.22
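The decord swap is mechanical: the encoder just needs a stack of uniformly sampled RGB frames. A sketch of the OpenCV replacement, assuming the downstream video processor accepts a (T, H, W, 3) uint8 array (frame count and seeking strategy are illustrative):

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 64) -> np.ndarray:
    """Uniformly sample RGB frames with OpenCV as a drop-in for decord.

    Returns a (num_frames, H, W, 3) uint8 array for the video processor.
    """
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for i in indices:
        # Index-based seeking; can be imprecise with some codecs, in which
        # case a sequential read-and-skip loop is the safer fallback.
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # BGR -> RGB
    cap.release()
    return np.stack(frames)
```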
Results:
- 26s warm start, ~5s inference
- Correct retrieval on both test videos
- Runs in 8GB unified memory
What’s Next
- YOLO vs VL-JEPA comparison (why semantic understanding beats boxes)
- Live camera integration
- Sentry voice + vision unified