By Consultants Review Team
Google demonstrated the multimodal capabilities of the Gemini 1.5 AI model at I/O 2024, meaning the large language model can interpret not only text but also images, videos, and audio, and produce responses. The company's AI division is now using this capability to teach robots how to navigate their environment.
Google DeepMind recently released a research paper and several videos demonstrating how to teach a robot to understand multimodal instructions, such as spoken language and images, and carry out practical navigation.
Google stated that the development of Vision Language Models (VLMs) has shown a promising path toward accomplishing this goal. "To achieve this, we study a widely useful category of navigation tasks we call Multimodal Instruction Navigation with demonstration Tours (MINT), in which the environment prior is provided through a previously recorded demonstration video," the company said.
How Google used the Gemini 1.5 AI model to teach a robot
Google stated in a post on X (formerly Twitter) that Gemini 1.5 Pro's 1 million token context length helps the company train robots for navigation, since many AI models struggle to recall environments when their context lengths are constrained.
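To illustrate why a long context window matters here, the sketch below shows, in rough Python, how a navigation prompt might bundle an entire demonstration tour alongside a spoken instruction. The per-frame token cost, frame count, and helper names are illustrative assumptions, not Google's actual pipeline.

```python
# Illustrative sketch only: an entire demonstration tour (hundreds of video
# frames) plus the user's instruction must fit into a single multimodal prompt.
# All names and numbers below are assumptions for illustration.

TOKENS_PER_FRAME = 260        # assumed cost of one encoded video frame
CONTEXT_LIMIT = 1_000_000     # Gemini 1.5 Pro's reported context length

def build_navigation_prompt(tour_frames, instruction):
    """Bundle every tour frame with the user's instruction into one prompt."""
    prompt = [f"Tour frame {i}: <image>" for i, _ in enumerate(tour_frames)]
    prompt.append(f"Instruction: {instruction}")
    return prompt

def fits_in_context(tour_frames, instruction_tokens=50):
    """Check whether the full tour plus instruction fits in the model's window."""
    total = len(tour_frames) * TOKENS_PER_FRAME + instruction_tokens
    return total <= CONTEXT_LIMIT

# A 20-minute tour sampled at one frame per second is ~1,200 frames
# (~312,000 tokens), which fits in a 1M-token window but not a much smaller one.
frames = [f"frame_{i}.jpg" for i in range(1200)]
print(fits_in_context(frames))  # True for a 1M-token window
```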
Robots "may successfully navigate a space by using human instructions, video tours, and common sense reasoning," it stated.
While giving the robots a tour of specific locations in a real-world environment, the trainers pointed out key places for them to remember. The robots were then instructed to guide people to those places.
For instance, a robot can be trained to serve as a guide in an office environment. It can be configured to remember the locations of the play area, canteen, chairman's office, and other parts of the business. When a visitor meeting with a company official asks it to "take me to the chairman's office," the robot will lead the way. Similarly, new hires can ask the robot for directions to the canteen, and it will guide them there.
"After receiving these inputs, the architecture of the system constructs a topological graph, which is a condensed representation of a space. In order to locate a way without a map, this is built from frames inside tour recordings that capture the overall connection of their surroundings, according to Google Deepmind.
Google reported that it evaluated the system in home-like and office-like environments, achieving success rates of 86% and 90%, respectively. The company also said that, in the future, users will only need to film a tour of their surroundings with a smartphone for their personal robot assistant to understand and navigate the space.