I'm starting to compile a list of papers I've read! Alongside each paper's link and abstract, an engaging conversation generated with NotebookLM is also available. I listen to these while traveling; they're crafted to spark curiosity, and I hope you find them useful too!

1. Gemini Robotics: Bringing AI into the Physical World (June 2025)
https://arxiv.org/abs/2503.20020

This report introduces Gemini Robotics, a family of AI models built upon the multimodal foundation model Gemini 2.0, designed to bring advanced AI capabilities into the physical world by enabling robots to perform general and dexterous tasks. The research highlights the development of Gemini Robotics-ER, a Vision-Language Model (VLM) with enhanced embodied reasoning, and Gemini Robotics, a Vision-Language-Action (VLA) model that integrates robot action data for high-frequency, dexterous control.

The authors present ERQA, a new open-source benchmark to evaluate embodied reasoning, and demonstrate Gemini Robotics’ superior performance in dexterous manipulation, instruction following, and generalization across diverse environments and tasks, including complex, long-horizon scenarios and rapid adaptation to new challenges. The document also addresses the responsible development and safety considerations inherent in deploying such advanced robotic systems.

2. V-JEPA: Video Joint-Embedding Predictive Architecture (February 15, 2024)
https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/

The research introduces V-JEPA, a novel approach for unsupervised learning of visual representations directly from video data. This method uniquely employs a feature prediction objective, training models to predict masked regions in videos without relying on pre-trained encoders, text, or negative examples. The authors demonstrate that V-JEPA models learn versatile representations, excelling in both motion and appearance-based tasks with frozen backbones and proving more label-efficient than pixel prediction methods. Through comprehensive ablations, the paper highlights the critical role of feature space prediction, diverse pretraining data, and an attentive pooling strategy for achieving robust performance in various downstream image and video applications.

Conversation (English):

Conversation (Hindi):
