Is GPT-4 a hybrid large model? Research shows that MoE + instruction tuning really does outperform dense large models


Machine Heart Report

Editor: Xiaozhou, Chen Ping

Google, UC Berkeley, and others have shown that MoE + instruction tuning delivers a 1 + 1 > 2 effect.

Since the introduction of GPT-4, people have been amazed by its powerful emergent abilities, including excellent language comprehension, generation, logical reasoning, and more. These capabilities make GPT-4 one of the most cutting-edge models in machine learning. However, OpenAI has not yet disclosed any technical details of GPT-4.

Last month, "genius hacker" George Hotz mentioned GPT-4 in an interview on the AI podcast Latent Space and said that GPT-4 is actually a mixture model. Specifically, he claimed that GPT-4 is an ensemble of eight expert models, each with 220 billion parameters (slightly more than GPT-3's 175 billion), trained on different data and task distributions.

From the Latent Space interview.

This may be mere speculation on George Hotz's part, but the idea is not without merit. Recently, a paper jointly published by researchers from Google, UC Berkeley, MIT, and other institutions confirmed that combining mixture-of-experts (MoE) models with instruction tuning can significantly improve the performance of large language models (LLMs).

Paper address: https://arxiv.org/pdf/2305.14705.pdf

A sparse mixture-of-experts model is a special neural network architecture that adds learnable parameters to large language models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow natural-language instructions. The study found that MoE models benefit more from instruction tuning than dense models do, and therefore proposes combining MoE with instruction tuning.

The study conducted empirical research under three experimental settings:

  • Direct fine-tuning on a single downstream task without instruction tuning;
  • Instruction tuning followed by few-shot or zero-shot in-context generalization on downstream tasks;
  • Instruction tuning followed by further fine-tuning on individual downstream tasks.

In the first setting, MoE models generally fall short of dense models with the same compute budget. Once instruction tuning is introduced (the second and third settings), however, FLAN-MoE_32B (Finetuned LAnguage Net, or FLAN, is an instruction-tuned model, and FLAN-MoE is its instruction-tuned MoE counterpart) outperforms FLAN-PaLM_62B on four benchmark tasks while using only one-third of the FLOPs.

So the claim that GPT-4 uses a mixture model looks somewhat plausible, and MoE really can gain more from instruction tuning:

Method Overview

The researchers used sparsely activated MoE (Mixture-of-Experts) layers in FLAN-MoE (an instruction-tuned sparse mixture-of-experts model). In addition, they replaced the feed-forward components of other Transformer layers with MoE layers.

Each MoE layer consists of a number of "experts", each of which is a feed-forward network. A gating function then uses a softmax activation to produce a probability distribution over these experts.

Although each MoE layer has many parameters, the experts are sparsely activated. This means that for a given input token, only a limited subset of experts is used, giving the model greater capacity while keeping computation bounded.

For an MoE layer with E experts, this effectively provides O(E²) different combinations of feed-forward networks, enabling greater computational flexibility.
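To make the routing mechanism concrete, below is a minimal PyTorch sketch of a sparsely activated MoE layer with softmax gating and top-k routing. The class name, expert count, hidden sizes, and top-k value are illustrative assumptions, not the exact configuration used in FLAN-MoE.

```python
# Minimal sketch of a sparsely activated MoE layer with softmax gating.
# Hyperparameters here are illustrative, not FLAN-MoE's actual settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        # Each expert is an ordinary Transformer feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The gating (router) network scores every expert for each token.
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)            # distribution over experts
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Only each token's top-k experts are evaluated: sparse activation.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_probs[mask, slot:slot + 1] * expert(x[mask])
        return out

# Usage: route a batch of 4 token representations through the layer.
layer = MoELayer()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```

Each token is scored by the gate and only its top-k experts run, which is why capacity grows with the number of experts while per-token compute stays roughly constant.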

Because FLAN-MoE is an instruction-tuned model, instruction tuning is crucial. The study fine-tunes FLAN-MoE on the FLAN collective dataset, with the input sequence length of each FLAN-MoE set to 2048 and the output length set to 512.
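As a rough illustration of the sequence-length setup described above, here is a hedged sketch of truncating instruction-tuning examples to a 2048-token input and 512-token output with the Hugging Face tokenizer API; the checkpoint name and helper function are placeholders, not the paper's actual data pipeline.

```python
# Hedged sketch of applying the 2048/512 length limits described above.
# The tokenizer checkpoint and helper are illustrative placeholders.
from transformers import AutoTokenizer

MAX_INPUT_LEN = 2048   # instruction plus any few-shot context
MAX_OUTPUT_LEN = 512   # target response

tokenizer = AutoTokenizer.from_pretrained("t5-base")  # placeholder checkpoint

def encode_example(instruction: str, target: str):
    """Truncate an instruction/target pair to the lengths described in the article."""
    inputs = tokenizer(instruction, max_length=MAX_INPUT_LEN,
                       truncation=True, return_tensors="pt")
    labels = tokenizer(target, max_length=MAX_OUTPUT_LEN,
                       truncation=True, return_tensors="pt").input_ids
    return inputs.input_ids, labels

ids, labels = encode_example("Translate to German: Hello, world!", "Hallo, Welt!")
print(ids.shape, labels.shape)
```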

Experiment and Analysis

On average, without adding any extra computation, FLAN-MoE outperforms its dense counterparts (FLAN-T5) at every model scale.

Number of experts. Figure 4 shows that as the number of experts increases, the model initially benefits from a richer set of specialized sub-networks, each capable of handling a different task or aspect of the problem space. This gives MoE strong adaptability and efficiency on complex tasks, improving overall performance. However, as the number of experts keeps growing, the performance gain diminishes and eventually saturates.

Figure 3 and Table 1 examine in detail how different routing decisions affect instruction-tuning performance: comparing the FLAN-Switch and FLAN-GS strategies shows that activating more experts improves performance across the four benchmarks. Among these, the MMLU-Direct model shows the largest improvement, rising from 38.0% to 39.9% for BASE/LARGE-sized models.

Notably, compared to dense models of equivalent capacity, instruction tuning significantly amplifies MoE models' performance on held-out MMLU, BBH, and internal QA and reasoning benchmarks. For larger MoE models, these advantages are amplified further. For example, instruction tuning improves the performance of ST_32B by 45.2%, whereas for FLAN-PaLM_62B the improvement is relatively small, at about 6.6%.

When the models are scaled up, FLAN-MoE (FLAN-ST_32B) outperforms FLAN-PaLM_62B.

In addition, the study ran several analysis experiments in which the gating function, the expert modules, or the MoE parameters of a given model were frozen. As shown in Table 2 below, the results indicate that freezing the expert modules or the MoE components hurts model performance.

In contrast, freezing the gating function slightly improves performance, though not significantly. The researchers speculate that this observation is related to underfitting of FLAN-MoE. The study also conducted ablation experiments to explore fine-tuning data efficiency, as shown in Figure 5.
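For illustration, the sketch below shows the kind of selective freezing such an ablation involves: disabling gradient updates for the router (gate) or the expert parameters before fine-tuning. The parameter-name substrings are assumptions about how such modules might be named, not taken from the paper's codebase; MoELayer refers to the sketch shown earlier.

```python
# Sketch of the freezing ablation described above: turning off gradients for
# the gating function or the expert parameters of an MoE model before
# fine-tuning. The name substrings "gate"/"experts" are naming assumptions.
import torch.nn as nn

def freeze_moe_components(model: nn.Module,
                          freeze_gate: bool = False,
                          freeze_experts: bool = False) -> nn.Module:
    """Disable gradient updates for the selected MoE components."""
    for name, param in model.named_parameters():
        if freeze_gate and "gate" in name:
            param.requires_grad = False
        if freeze_experts and "experts" in name:
            param.requires_grad = False
    return model

# Example: freeze only the gating function of the MoELayer sketched earlier,
# then count the parameters that remain trainable.
model = freeze_moe_components(MoELayer(), freeze_gate=True)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```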

Finally, to compare directly fine-tuned MoE with FLAN-MoE, the study ran experiments on single-task fine-tuned MoE, single-task fine-tuned FLAN-MoE, and dense models. The results are shown in Figure 6:

Interested readers can read the original paper to learn more about the research content.

