Starting to compile a list of papers I read! Alongside each paper’s link and abstract, I will share an engaging conversation generated using NotebookLM. It is slowly taking the form of a technical podcast, which I have named Explainability. I listen to these while traveling. It’s crafted to spark curiosity; I hope you find it useful too!

1. Gemini Robotics: Bringing AI into the Physical World (June 2025)
https://arxiv.org/abs/2503.20020

This report introduces Gemini Robotics, a family of AI models built upon the multimodal foundation model Gemini 2.0, designed to bring advanced AI capabilities into the physical world by enabling robots to perform general and dexterous tasks. The research highlights the development of Gemini Robotics-ER, a Vision-Language Model (VLM) with enhanced embodied reasoning, and Gemini Robotics, a Vision-Language-Action (VLA) model that integrates robot action data for high-frequency, dexterous control.

The authors present ERQA, a new open-source benchmark to evaluate embodied reasoning, and demonstrate Gemini Robotics’ superior performance in dexterous manipulation, instruction following, and generalization across diverse environments and tasks, including complex, long-horizon scenarios and rapid adaptation to new challenges. The document also addresses the responsible development and safety considerations inherent in deploying such advanced robotic systems.

2. V-JEPA: Video Joint-Embedding Predictive Architectures (February 15, 2024)
https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/

The research introduces V-JEPA, a novel approach for unsupervised learning of visual representations directly from video data. This method uniquely employs a feature prediction objective, training models to predict masked regions in videos without relying on pre-trained encoders, text, or negative examples. The authors demonstrate that V-JEPA models learn versatile representations, excelling in both motion and appearance-based tasks with frozen backbones and proving more label-efficient than pixel prediction methods. Through comprehensive ablations, the paper highlights the critical role of feature space prediction, diverse pretraining data, and an attentive pooling strategy for achieving robust performance in various downstream image and video applications.

Conversation (Hindi):

3. An Introduction to the Kalman Filter, by Greg Welch and Gary Bishop (July 24, 2006)
https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf

This paper provides a practical introduction to the Kalman filter, a set of recursive equations that estimates the state of a process while minimizing the mean of the squared error, even when the precise nature of the modeled system is unknown. It details the discrete Kalman filter, explaining its time update (predict) and measurement update (correct) equations and the recursive structure that makes it computationally efficient. The text then introduces the extended Kalman filter (EKF), which adapts the standard filter to nonlinear systems by linearizing around the current estimate with Jacobian matrices. The paper concludes with a simple example demonstrating the filter’s operation and the effect of tuning parameters such as the process and measurement noise covariances.
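
To make the predict/correct cycle concrete, here is a minimal sketch of a scalar discrete Kalman filter estimating a constant value from noisy readings, loosely in the spirit of the paper’s closing example. The function name `kalman_constant`, the noise settings, and the true value are my own illustrative assumptions, not taken verbatim from the paper.

```python
import random


def kalman_constant(measurements, q=1e-5, r=0.01, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a constant state (A = 1, H = 1, no control input).

    q: assumed process noise variance, r: assumed measurement noise variance.
    """
    x_hat, p = x0, p0
    estimates = []
    for z in measurements:
        # Time update (predict): the state is modeled as constant,
        # so only the error covariance grows by the process noise.
        x_prior = x_hat
        p_prior = p + q

        # Measurement update (correct): blend prediction and measurement.
        k = p_prior / (p_prior + r)          # Kalman gain
        x_hat = x_prior + k * (z - x_prior)  # updated state estimate
        p = (1.0 - k) * p_prior              # updated error covariance
        estimates.append(x_hat)
    return estimates, p


# Illustrative data: a constant observed with Gaussian noise (std 0.1).
rng = random.Random(0)
true_value = -0.37727
measurements = [true_value + rng.gauss(0.0, 0.1) for _ in range(50)]

estimates, final_p = kalman_constant(measurements)
print(f"last estimate: {estimates[-1]:.4f}, error covariance: {final_p:.6f}")
```

Raising `r` makes the filter trust measurements less and respond more slowly, while raising `q` does the opposite; that trade-off is the tuning the paper’s example explores.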

Conversation (Hindi):

4. Ark: Unifying Robotics and AI for Autonomous Systems

The paper introduces Ark, an open-source, Python-based framework designed to bridge the gap between machine learning and robotics software development. It addresses common challenges in robotics, such as steep learning curves, fragmented tooling, and complex hardware integration, by offering a Python-first approach with a Gym-style environment interface. Ark enables seamless transitions between high-fidelity simulation and physical robots, supporting data collection, policy training with state-of-the-art imitation learning algorithms, and deployment of autonomous systems. The framework also includes reusable modules for control, SLAM, motion planning, and visualization, alongside native ROS interoperability, aiming to accelerate research and commercial deployment of autonomous robots.
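
For readers unfamiliar with what a “Gym-style environment interface” looks like, here is a small self-contained sketch of the reset/step pattern. This is not Ark’s actual API: `ToyReachEnv`, its fields, and the toy dynamics are hypothetical and only illustrate the general convention such frameworks follow.

```python
import random


class ToyReachEnv:
    """Hypothetical 1-D reaching task exposing a Gym-style reset/step interface."""

    def __init__(self, target=1.0, max_steps=50):
        self.target = target
        self.max_steps = max_steps
        self.position = 0.0
        self.steps = 0

    def reset(self, seed=None):
        # Start somewhere to the left of the target.
        rng = random.Random(seed)
        self.position = rng.uniform(-1.0, 0.0)
        self.steps = 0
        return self.position, {}

    def step(self, action):
        # Toy dynamics: move by the action, clipped to +/- 0.1 per step.
        self.position += max(-0.1, min(0.1, action))
        self.steps += 1
        distance = abs(self.target - self.position)
        reward = -distance
        terminated = distance < 0.05           # reached the target
        truncated = self.steps >= self.max_steps
        return self.position, reward, terminated, truncated, {}


env = ToyReachEnv()
obs, info = env.reset(seed=0)
done = False
while not done:
    action = 0.5 * (env.target - obs)  # simple proportional policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

print(f"final position: {obs:.3f}, steps: {env.steps}")
```

The appeal of this pattern is that the policy loop only touches `reset` and `step`, so in principle the same loop can drive a simulator or a hardware backend that implements the same two methods.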
