Google's latest technique: using search engines to greatly improve the accuracy of models such as ChatGPT

Original source: AIGC Open Community


Thanks to the Transformer architecture, large language models such as ChatGPT have become far more capable at natural language tasks. However, the content they generate often contains incorrect or outdated information, and there has been no factual evaluation framework for verifying its accuracy.

To comprehensively evaluate how well large language models adapt to a changing world and how factual their output is, the Google AI research team published a paper titled "Enhancing the accuracy of large language models through search engine knowledge". It proposes a method called FRESH, which improves the accuracy of large language models such as ChatGPT and Bard by retrieving real-time information from search engines.

The researchers constructed a new question-answering benchmark, FRESHQA, which contains 600 real questions of various types. Based on how often their answers change, the questions fall into four categories: "never change", "slow change", "frequent change", and "false premise".

They also designed two evaluation modes: a strict mode, which requires that every piece of information in an answer be accurate and up to date, and a relaxed mode, which judges only the correctness of the primary answer.

The experimental results show that FRESH significantly improves the accuracy of large language models on FRESHQA. For example, under strict mode, GPT-4 augmented with FRESH is 47% more accurate than the original GPT-4.

In addition, this approach of fusing search engines is more flexible than directly scaling up a model's parameters, and it gives existing models a dynamic external knowledge source. The experiments also show that FRESH markedly improves accuracy on questions that require real-time knowledge.

Paper Address:

Open-source address: FreshQA (in the pipeline, to be open-sourced soon)

According to Google's paper, the FRESH method consists of five main components.

Building the FRESHQA benchmark set

To comprehensively assess how well large language models adapt to a changing world, the researchers first constructed the FRESHQA benchmark set of 600 real open-domain questions. Based on how often the answers change, the questions fall into four categories:

  1. Never change: questions whose answers essentially never change.

  2. Slow change: questions whose answers change over the course of a few years.

  3. Frequent change: questions whose answers may change within a year or even more often.

  4. False premise: questions that contain a factually incorrect premise.

The questions cover a wide range of topics and difficulty levels. The key feature of FRESHQA is that answers may change over time, so a model needs to stay sensitive to changes in the world.
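As a rough illustration of what a FRESHQA-style item might look like, the sketch below models a question together with its change-frequency category. The class and field names (ChangeFrequency, FreshQAQuestion, answer) are assumptions made for this example, not the paper's actual data format.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeFrequency(Enum):
    """The four FRESHQA categories, by how often the answer changes."""
    NEVER = "never-changing"
    SLOW = "slow-changing"
    FAST = "fast-changing"
    FALSE_PREMISE = "false-premise"


@dataclass
class FreshQAQuestion:
    """Hypothetical record for one benchmark question; field names are illustrative."""
    question: str
    frequency: ChangeFrequency
    answer: str  # ground truth that must be re-checked as the world changes


examples = [
    FreshQAQuestion("What is the chemical symbol for gold?",
                    ChangeFrequency.NEVER, "Au"),
    FreshQAQuestion("Which team won the most recent FIFA World Cup?",
                    ChangeFrequency.SLOW, "Argentina (2022)"),
]
```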

Strict Mode vs. Relaxed Mode Evaluation

The researchers proposed two evaluation modes: strict mode, which requires that all information in an answer be accurate and up to date, and relaxed mode, which judges only the correctness of the primary answer.

This provides a more comprehensive and nuanced way to measure the factuality of language models.
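A minimal sketch of how the two modes could differ when grading a single answer, assuming the grader already knows whether the primary answer and each supporting statement are correct; the function and its arguments are placeholders for this illustration, not the paper's actual rating procedure:

```python
def grade_answer(primary_correct: bool,
                 supporting_facts_correct: list[bool],
                 strict: bool) -> bool:
    """Return True if the answer passes under the chosen evaluation mode.

    Strict mode: the primary answer AND every piece of supporting
    information must be accurate and up to date.
    Relaxed mode: only the primary answer is judged.
    """
    if strict:
        return primary_correct and all(supporting_facts_correct)
    return primary_correct


# Example: the main answer is right, but one supporting claim is outdated.
print(grade_answer(True, [True, False], strict=True))   # False
print(grade_answer(True, [True, False], strict=False))  # True
```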

Evaluating different large language models on FRESHQA

On FRESHQA, the researchers compared large language models of different sizes, including GPT-3, GPT-4, ChatGPT, and others. Evaluations were conducted in both strict mode (no errors allowed) and relaxed mode (only the primary answer is judged).

All models were found to perform poorly on questions that require real-time knowledge, especially frequently changing and false-premise questions. This shows that current large language models have limited adaptability to a changing world.

Retrieving relevant information from search engines

To improve the factuality of large language models, the core idea of FRESH is to retrieve real-time information about the question from a search engine.

Specifically, given a question, FRESH sends it as a query to Google's search engine and collects multiple types of search results, including answer boxes, web page results, and "other users also asked" questions.
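The sketch below illustrates this retrieval step under the assumption of a generic search API; `call_search_api` and the result fields are hypothetical stand-ins, since the article does not specify the actual interface used:

```python
from typing import TypedDict


class SearchEvidence(TypedDict):
    source: str   # "answer_box", "organic_result", or "people_also_ask"
    title: str
    snippet: str
    date: str     # publication/update date, useful for judging freshness


def call_search_api(query: str) -> dict:
    """Placeholder for a real search-engine API client (an assumption for
    this sketch, not any specific product's SDK)."""
    raise NotImplementedError("plug in your own search API client here")


def retrieve_evidence(question: str) -> list[SearchEvidence]:
    """Send the question verbatim as the search query and collect the
    result types mentioned in the article: answer boxes, web results,
    and "other users also asked" questions."""
    raw = call_search_api(query=question)
    evidence: list[SearchEvidence] = []
    for source, key in [("answer_box", "answer_boxes"),
                        ("organic_result", "organic_results"),
                        ("people_also_ask", "people_also_ask")]:
        for item in raw.get(key, []):
            evidence.append({
                "source": source,
                "title": item.get("title", ""),
                "snippet": item.get("snippet", ""),
                "date": item.get("date", ""),
            })
    return evidence
```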

Integrating retrieved information through few-shot prompting

FRESH uses few-shot learning to integrate the retrieved evidence into the large language model's input prompt in a unified format, together with several demonstrations of how to synthesize the evidence into a correct answer.

This teaches large language models to understand the task and to combine information from different sources into up-to-date, accurate answers.
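The demonstration format is not reproduced in the article, but a prompt-assembly step along these lines is what the description suggests; the template wording below is an assumption for illustration:

```python
def build_prompt(question: str,
                 evidence: list[dict],
                 demonstrations: list[str]) -> str:
    """Assemble a few-shot prompt: worked demonstrations first, then the
    retrieved evidence, then the question to answer. The template wording
    is illustrative only, not the paper's actual format."""
    evidence_lines = [
        f"[{i + 1}] ({e.get('date', 'n/a')}) {e.get('title', '')}: {e.get('snippet', '')}"
        for i, e in enumerate(evidence)
    ]
    return (
        "\n\n".join(demonstrations)
        + "\n\nEvidence retrieved from the search engine:\n"
        + "\n".join(evidence_lines)
        + f"\n\nQuestion: {question}\n"
        + "Answer (rely only on the most up-to-date evidence):"
    )
```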

Google said that FRESH is significant for improving the dynamic adaptability of large language models, and that this is an important direction for future research on large language models.
