Google's big AGI robotics move: a 54-person team worked quietly for 7 months on a model with strong generalization and strong reasoning, the first major result since the merger of DeepMind and Google Brain
Original source: Qubit
The large-model boom is reshaping Google DeepMind's robotics research.
One of its latest achievements is RT-2, a robot project that took the team 7 months to build and quickly went viral online:
Give it an instruction in plain human language, and the little robot in front of you can move its arm, think through the request, and complete its "master's task".
For example, handing a bottle of water to pop singer Taylor Swift, or recognizing the logo of a famous sports team:
In the words of netizens, don't underestimate this ability: it is the logical leap required to get from "extinct animal" to "the plastic dinosaur on the table".
What's even more "frightening" is that it easily handles multi-stage reasoning problems that require a chain of thought, such as "choose a drink for a tired person": as soon as it hears the command, its gripper heads straight for the Red Bull. It could hardly be smarter.
Some netizens sighed after seeing it:
Plugging a multimodal large model into a robotic arm
The project, called RT-2 (Robotic Transformer 2), is an "evolved version" of RT-1, which was released at the end of last year.
Compared with other robotics research, RT-2's core advantage is that it can not only understand "human language" but also reason over it, converting it into instructions the robot can understand and completing tasks in stages.
Specifically, it has three major capabilities: symbol understanding, reasoning, and human recognition.
The first capability, symbol understanding, extends knowledge from the large model's pre-training to data the robot has never seen before. For example, even though "Red Bull" does not appear in the robot's own dataset, the model knows from pre-training what Red Bull looks like, so the robot can locate the can and manipulate it.
The second capability, reasoning, is RT-2's core advantage; it requires the robot to master three key skills: mathematics, visual reasoning, and multilingual understanding.
The first skill is mathematical and logical reasoning, for example following the command "put the banana at the position of the sum of 2+1":
So, how are these three abilities realized?
To put it simply, the idea is to combine the "reasoning", "recognition", and "mathematics" capabilities of a vision-language multimodal large model (VLM) with a robot's manipulation capabilities.
For example, action data such as the arm's rotation angle and the coordinates of the placement target are converted into text along the lines of "move to position X".
In this way, robot data can be folded into the vision-language dataset for training. At inference time, the text the model outputs is converted back into robot data, enabling a full pipeline of operations such as controlling the robot.
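To make the idea concrete, here is a minimal sketch (not Google's actual RT-2 code) of how continuous robot actions might be discretized into text tokens so they can be trained on alongside ordinary vision-language data, then decoded back into motor commands at inference time. The bin count, value ranges, and action fields (`dx`, `dy`, `rotation`) are illustrative assumptions.

```python
# Minimal sketch: round-trip between continuous robot actions and text tokens.
# All numeric choices here are assumptions for illustration only.

NUM_BINS = 256  # assumed discretization resolution per action dimension


def action_to_text(dx: float, dy: float, rotation: float) -> str:
    """Encode a continuous action as a string of integer tokens."""
    def to_bin(value: float, low: float, high: float) -> int:
        value = min(max(value, low), high)
        return int((value - low) / (high - low) * (NUM_BINS - 1))

    tokens = [
        to_bin(dx, -1.0, 1.0),          # gripper displacement along x (assumed range)
        to_bin(dy, -1.0, 1.0),          # gripper displacement along y
        to_bin(rotation, -3.14, 3.14),  # wrist rotation in radians
    ]
    return " ".join(str(t) for t in tokens)


def text_to_action(text: str) -> tuple:
    """Decode the token string back into continuous motor commands."""
    def from_bin(token: int, low: float, high: float) -> float:
        return low + token / (NUM_BINS - 1) * (high - low)

    dx_t, dy_t, rot_t = (int(t) for t in text.split())
    return (
        from_bin(dx_t, -1.0, 1.0),
        from_bin(dy_t, -1.0, 1.0),
        from_bin(rot_t, -3.14, 3.14),
    )


if __name__ == "__main__":
    encoded = action_to_text(0.12, -0.3, 1.57)
    print(encoded)                 # e.g. "142 89 191"
    print(text_to_action(encoded)) # roughly recovers the original action
```

Because the actions now look like ordinary text, the same model can be trained on web-scale image-text pairs and on robot trajectories without changing its architecture.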
That's right, it really is that simple and crude (insert doge face here).
In this research, the team mainly built on a series of Google's own foundation models, including PaLI-X at 5 billion and 55 billion parameters, PaLI at 3 billion parameters, and PaLM-E at 12 billion parameters.
To improve the capabilities of the large model itself, the researchers also put in considerable effort, drawing on recently popular techniques such as chain-of-thought prompting, vector databases, and gradient-free architectures.
This series of steps gives RT-2 many new advantages over the RT-1 released last year.
Let’s take a look at the specific experimental results.
Up to three times the performance of RT-1
RT-2 is trained on the data from the previous-generation robot model RT-1 (in other words, the data is unchanged; only the method differs).
The data was collected over a period of 17 months using 13 robots in a kitchen environment set up in the office.
In the actual tests (6,000 trials in total), the authors gave RT-2 many previously unseen objects, requiring it to go beyond the fine-tuning data and use semantic understanding to complete the tasks.
It handled all of them fairly well:
The tasks ranged from simple recognition of letters, national flags, and cartoon characters, to picking out land animals among plush toys, selecting the one item with a different color, and even complex commands such as picking up a snack that is about to fall off the table.
As mentioned earlier, the two variants are trained on PaLM-E with 12 billion parameters and PaLI-X with 55 billion parameters, respectively.
In order to better understand how different settings of RT-2 affect the generalization results, the author designed two categories of evaluations:
First, model size: only the RT-2 PaLI-X variant was trained at two scales, 5 billion and 55 billion parameters;
Second, training method: training the model from scratch vs. fine-tuning vs. co-fine-tuning.
The final results show that the VLM's pre-trained weights are important, and that the model's generalization ability tends to improve as model size increases.
Finally, because the RT-2 PaLM-E variant is a vision-language-action model that can act as an LLM, a VLM, and a robot controller within a single neural network, RT-2 can also perform chain-of-thought reasoning as part of control.
In the five reasoning tasks shown in the figure below (the last one is particularly interesting: choose an item that could stand in for a hammer), the model first outputs natural-language reasoning steps after receiving the command, and then emits the specific action tokens.
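As a rough illustration of this "reason first, then act" behavior, the hypothetical sketch below shows how a response containing a natural-language plan followed by action tokens could be parsed. The "Plan:"/"Action:" prefixes, the example response, and the parsing are assumptions for illustration, not RT-2's actual output grammar.

```python
# Hypothetical sketch: split a chain-of-thought response into the plan text
# and the action tokens that follow it. Format is assumed, not RT-2's own.

def parse_model_output(output: str) -> tuple:
    """Return (plan text, list of integer action tokens)."""
    plan_part, action_part = output.split("Action:", maxsplit=1)
    plan = plan_part.replace("Plan:", "").strip()
    action_tokens = [int(t) for t in action_part.split()]
    return plan, action_tokens


if __name__ == "__main__":
    # Example response to an instruction like "I need to hammer a nail,
    # which object in the scene might be useful?"
    response = "Plan: pick up the rock. Action: 131 112 230 89 14 207 5"
    plan, tokens = parse_model_output(response)
    print(plan)    # "pick up the rock."
    print(tokens)  # [131, 112, 230, 89, 14, 207, 5]
```

The plan text is only an intermediate reasoning step; only the decoded action tokens are ultimately sent to the robot controller.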
One More Thing
Google's focus on large-model robotics research does not appear to come out of nowhere.
In just the past two days, a paper it co-authored with Columbia University on using large models to help robots acquire more manipulation skills has also been getting a lot of attention:
Together with the embodied-intelligence results from Fei-Fei Li's team not long ago, it is fair to say that driving robots with large models has become a research trend, and we are already seeing a wave of very promising progress.
What are your expectations for this research direction?