RADIO-ViPE

License Research (NVIDIA)

Platform

Language

Category vision

Online semantic SLAM system that combines open-vocabulary grounding with monocular, RGB-only input. It uses embeddings from the RADIO foundation model as dense per-pixel semantic vectors, tightly coupled with geometric bundle adjustment. This turns ordinary phone video into searchable 3D maps that can be queried in natural language. No LiDAR, depth sensors, or camera intrinsics are needed. Handles dynamic environments, including scenes with moving people.
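As a rough illustration of what "queryable in natural language" can mean for a map that stores one embedding per 3D point, the sketch below ranks map points by cosine similarity to a query vector. It is not RADIO-ViPE's code or API: `cosine`, `query_map`, the toy 3-dimensional vectors, and the `threshold` value are all hypothetical, and it assumes the query text has already been embedded into the same vector space as the map features by some text encoder not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_map(points, query_embedding, threshold=0.8):
    """Return (position, score) pairs whose embedding matches the query.

    `points` is a list of (xyz_position, embedding) tuples. The threshold
    is an illustrative value, not one taken from the project.
    """
    hits = []
    for position, embedding in points:
        score = cosine(embedding, query_embedding)
        if score >= threshold:
            hits.append((position, score))
    # Best matches first.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Toy map: three 3D points with made-up 3-dimensional embeddings.
toy_map = [
    ((0.0, 0.0, 1.0), [1.0, 0.0, 0.0]),
    ((0.5, 0.2, 1.5), [0.9, 0.1, 0.0]),
    ((2.0, 1.0, 3.0), [0.0, 1.0, 0.0]),
]

# Stand-in for the embedding of a text query such as "chair".
chair_query = [1.0, 0.0, 0.0]
matches = query_map(toy_map, chair_query)
```

Here `matches` holds the first two points, ordered by similarity, while the third point (orthogonal to the query) is filtered out. A real system would use high-dimensional features and likely a vectorized or indexed search rather than a Python loop.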
