Reflex-Based Open-Vocabulary Navigation without Prior Knowledge Using Omnidirectional Camera and Multiple Vision-Language Models

Advanced Robotics

  • Kento Kawaharazuka
  • Yoshiki Obinata
  • Naoaki Kanazawa
  • Naoto Tsukamoto
  • Kei Okada
  • Masayuki Inaba
  • JSK Robotics Laboratory, The University of Tokyo, Japan

Various robot navigation methods have been developed, but most are based on Simultaneous Localization and Mapping (SLAM), reinforcement learning, or similar techniques that require prior map construction or learning. In this study, we consider the simplest possible method, requiring neither map construction nor learning, and perform open-vocabulary robot navigation without any prior knowledge. We apply an omnidirectional camera and pre-trained vision-language models to the robot. The omnidirectional camera provides a uniform view of the surroundings, eliminating the need for complicated exploratory behaviors such as trajectory generation. By applying multiple pre-trained vision-language models to this omnidirectional image and incorporating reflex-based behaviors, we show that navigation becomes simple and requires no prior setup. Interesting properties and limitations of our method are discussed based on experiments with the mobile robot Fetch.
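
To make the flow concrete, here is a minimal sketch (in Python) of one reflex cycle under the assumptions above: the expanded omnidirectional image is scored sector by sector against the instruction, and the robot simply turns toward the best sector and advances. The scoring function and motion command are passed in as callables; they are illustrative placeholders, not the authors' released interface.

from typing import Callable, Sequence
import numpy as np

def reflex_step(panorama: np.ndarray,
                instruction: str,
                score_sectors: Callable[[np.ndarray, str], Sequence[float]],
                turn_and_move: Callable[[float], None]) -> None:
    """One reflex cycle: pick the panorama sector that best matches the
    instruction and move toward it. No map, no learned policy."""
    scores = np.asarray(score_sectors(panorama, instruction))
    best = int(np.argmax(scores))
    # Center of the best sector as a yaw angle in [-pi, pi], assuming the
    # image center faces forward and the panorama spans 360 degrees.
    yaw = ((best + 0.5) / len(scores) - 0.5) * 2.0 * np.pi
    turn_and_move(yaw)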


Reflex-Based Open-Vocabulary Navigation

The concept of this study: simple reflex-based open-vocabulary navigation is enabled by splitting the expanded omnidirectional image and applying multiple pre-trained large-scale vision-language models.
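
As a concrete illustration of the splitting step, a minimal sketch is given below, assuming the expanded image is an equirectangular panorama divided into equal-width strips along the horizontal (yaw) axis. The number of splits (8 here) is an illustrative assumption, not necessarily the value used in the paper.

import numpy as np

def split_panorama(panorama: np.ndarray, n_sectors: int = 8) -> list[np.ndarray]:
    """panorama: equirectangular image of shape (H, W, 3) covering 360 degrees.
    Returns one image strip per horizontal sector."""
    w = panorama.shape[1]
    edges = np.linspace(0, w, n_sectors + 1, dtype=int)
    return [panorama[:, edges[i]:edges[i + 1]] for i in range(n_sectors)]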


Omnidirectional Camera

Dual-fisheye stitching into the expanded 360-degree omnidirectional image. The upper figure shows the fisheye images before processing, the middle figure shows the expanded image, and the lower figure shows the input to the vision-language models, with unnecessary parts removed from the expanded image.
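
For reference, below is a minimal sketch of how a single equidistant fisheye image can be expanded into half of an equirectangular panorama with OpenCV. The 180-degree field of view, image-centered optical axis, and equidistant projection model are simplifying assumptions for illustration; actual dual-fisheye stitching additionally requires lens calibration and blending of the two halves.

import cv2
import numpy as np

def fisheye_to_equirect_half(fisheye: np.ndarray, out_h: int = 480) -> np.ndarray:
    """Expand one fisheye image (assumed equidistant, 180-deg FOV, centered)
    into a 180-deg-wide equirectangular half panorama."""
    h, w = fisheye.shape[:2]
    cx, cy = w / 2.0, h / 2.0            # assumed optical axis at the image center
    focal = min(cx, cy) / (np.pi / 2.0)  # equidistant model: r = focal * theta

    out_w = out_h                        # 180 deg wide x 180 deg tall -> square output
    lon = (np.linspace(0, out_w - 1, out_w) / out_w - 0.5) * np.pi   # [-pi/2, pi/2)
    lat = (0.5 - np.linspace(0, out_h - 1, out_h) / out_h) * np.pi   # (+pi/2 .. -pi/2]
    lon, lat = np.meshgrid(lon, lat)

    # Unit ray for each output pixel (camera looks along +Z, Y is up).
    X = np.cos(lat) * np.sin(lon)
    Y = np.sin(lat)
    Z = np.cos(lat) * np.cos(lon)

    theta = np.arccos(np.clip(Z, -1.0, 1.0))   # angle from the optical axis
    phi = np.arctan2(-Y, X)                    # azimuth in image coordinates (y down)
    r = focal * theta

    map_x = (cx + r * np.cos(phi)).astype(np.float32)
    map_y = (cy + r * np.sin(phi)).astype(np.float32)
    return cv2.remap(fisheye, map_x, map_y, interpolation=cv2.INTER_LINEAR)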


Application of Pre-Trained Vision-Language Models

Preliminary experiments using large-scale vision-language models for open-vocabulary navigation. The left figure shows the split images and the recognition results, and the right graph shows the transformed similarity, averaged over 10 repetitions of each instruction, for CLIP and Detic: kitchen - "Go to the kitchen", microwave - "Please look at the microwave oven", and bookshelf - "See the bookshelf".
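
As a concrete illustration of this step, below is a minimal sketch of scoring the split images against an instruction with CLIP (here via the Hugging Face Transformers implementation). The softmax over sectors stands in for the transformed similarity; the exact transformation used in the paper, and the corresponding Detic scoring, are not reproduced here.

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sector_similarities(sectors: list, instruction: str) -> torch.Tensor:
    """sectors: split images (PIL images or HxWx3 arrays).
    Returns one similarity per sector, normalized across sectors."""
    inputs = processor(text=[instruction], images=sectors,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image.squeeze(1)  # shape: (n_sectors,)
    return logits.softmax(dim=0)  # higher = more consistent with the instruction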


Basic Experiments

The environment of the basic experiment. The mobile robot Fetch is placed in a small area surrounded by a kitchen, a microwave oven, and a desk with chairs.

The trajectories of the mobile robot Fetch and the error between the robot's final and target positions in the basic experiment. We prepared three instructions: kitchen - "Go to the kitchen", microwave - "Please look at the microwave oven", and desk - "See the desk with chairs", and compared the proposed method ALL with CLIP and Detic.
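
For reference, below is a minimal sketch of one way per-sector scores from several models (e.g. CLIP similarities and Detic detection confidences) could be combined. Min-max normalization followed by simple averaging is an assumption for illustration; the fusion actually used for ALL may differ.

import numpy as np

def fuse_scores(per_model_scores: list) -> np.ndarray:
    """per_model_scores: one score vector per model, each of length n_sectors."""
    fused = np.zeros(len(per_model_scores[0]), dtype=float)
    for s in per_model_scores:
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        # Uniform weights if a model gives a flat score vector.
        fused += (s - s.min()) / rng if rng > 0 else np.full_like(s, 1.0 / len(s))
    return fused / len(per_model_scores)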


Advanced Experiment

The advanced navigation experiment. The instruction is continuously changed in the following order: (1) "Look at the large TV display on the wooden table", (2) "Look at the multiple small PC monitors on the white desk near the bookshelf", and (3) "Check the microwave oven".
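
For illustration, a minimal sketch of driving such an instruction sequence is shown below, assuming a fixed number of reflex steps per instruction; in the actual experiment the switch may instead be triggered manually or by a similarity criterion.

INSTRUCTIONS = [
    "Look at the large TV display on the wooden table",
    "Look at the multiple small PC monitors on the white desk near the bookshelf",
    "Check the microwave oven",
]

def run_sequence(reflex_step, steps_per_instruction: int = 20) -> None:
    """reflex_step(instruction) is assumed to execute one sense-score-move cycle."""
    for instruction in INSTRUCTIONS:
        for _ in range(steps_per_instruction):
            reflex_step(instruction)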


Bibtex

@article{kawaharazuka2024omnivlm,
  author={K. Kawaharazuka and Y. Obinata and N. Kanazawa and N. Tsukamoto and K. Okada and M. Inaba},
  title={{Reflex-Based Open-Vocabulary Navigation without Prior Knowledge Using Omnidirectional Camera and Multiple Vision-Language Models}},
  journal={Advanced Robotics},
  pages={1--12},
  year=2024,
}

Contact

If you have any questions, please feel free to contact Kento Kawaharazuka.