Two lines of code break through the conversation-length limits of large language models! Jia Jiaya's team at the Chinese University of Hong Kong collaborates with MIT to release an ultra-long text extension technique

Code and Demo Address:
https://github.com/dvlab-research/LongLoRA

Paper address:
https://arxiv.org/pdf/2309.12307.pdf

Getting lost in the middle, lazy answers, models that grow clumsier as the context gets longer... Anyone who has used a large language model product has probably felt the limits of input length: when they want to discuss a somewhat longer piece of content with the model, they have to split the input, and the key points entered earlier are soon forgotten by the model.

This is a typical flaw of large language model dialogue: like a child with an attention deficit, the model struggles to focus long enough to finish a new book. The root cause is the model's lack of long-text processing capability. That situation is now changing.

Recently, a new technique and model released by Jia Jiaya's team in collaboration with MIT quietly climbed the trending lists of major open-source sites: first on the Hugging Face trending list, first on the Papers with Code trending list, fifth among all GitHub Python projects, with GitHub stars passing one thousand within a week and related technical posts on Twitter drawing nearly 180,000 views.

The technique, called LongLoRA, is practical yet surprisingly simple: with just two lines of code and a single 8x A100 machine, the context length of a 7B model can be extended to 100k tokens and that of a 70B model to 32k tokens. At the same time, the team released LongAlpaca, the first long-text dialogue large language model with 70B parameters.

The world's first 70B long-text large language model released

The introduction of LongLoRA closes this dialogue gap in large language models for the first time: from now on, papers of dozens of pages, reports of hundreds of pages, and full-length books are no longer blind spots for large models.

Some practitioners have enthusiastically called LongLoRA a beacon of hope in the maze of large language models: it reflects the industry's renewed attention to long-text language models, effectively expands the context window, lets the model reason over and process longer text sequences, and stands as a genuine innovation for large language models.

Beyond the technique itself, another major challenge in handling long text with large language models is the lack of publicly available long-text dialogue data.

To address this, the research team collected a corpus of 9k long-text question-answer pairs, covering Q&A about classic works, academic papers, in-depth reports, and even financial statements.

Answering long questions alone is not enough. The team also mixed a 3k short Q&A corpus with the 9k long Q&A corpus for training, so that the long-text model retains short-text dialogue ability as well. The complete dataset, called LongAlpaca-12k, is now open source.
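
As a rough sketch of what such a mixing step might look like (the file names and the record format below are assumptions made for illustration, not the actual LongAlpaca-12k files):

```python
# Hypothetical sketch of mixing short and long Q&A corpora into one training set.
# File names and the {"instruction": ..., "output": ...} record format are assumptions.
import json
import random

def load_records(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

short_qa = load_records("short_qa_3k.json")   # ~3k short Q&A samples (hypothetical file)
long_qa = load_records("long_qa_9k.json")     # ~9k long-document Q&A samples (hypothetical file)

mixed = short_qa + long_qa                    # 3k short + 9k long = a 12k-sample mixture
random.shuffle(mixed)                         # interleave so batches see both lengths

with open("longalpaca_12k_style_mix.json", "w", encoding="utf-8") as f:
    json.dump(mixed, f, ensure_ascii=False, indent=2)
```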

On top of the LongAlpaca-12k dataset, the team trained and evaluated models at different parameter scales: 7B, 13B, and 70B. The open-source models include LongAlpaca-7B, LongAlpaca-13B, and LongAlpaca-70B.

Reading novels, revising papers, commenting on the economy: a true all-rounder

Without further ado, let's pick a few demos at random and see how LongAlpaca, the model built with LongLoRA and the 12k Q&A corpus, actually performs.

First, ask the system to read a new paper and suggest revisions according to the ICLR review guidelines, with the goal of improving its chances of acceptance. LongAlpaca's view: by stating the novelty more precisely, providing more rigorous comparative experimental results (including specific datasets and metrics), broader applications, and future directions, and by foregrounding the key contributions and impact, the paper's odds of acceptance would improve.

Next, have the system read two different new papers and ask LongAlpaca to summarize the stylistic differences between the ICLR and CVPR conferences. LongAlpaca concludes that CVPR papers tend toward a more structured, experiment-driven style that emphasizes practicality and technical detail, while ICLR papers are more flexible in style, emphasizing key theoretical analysis and mathematical derivation over a standard format.

As can be seen, the trained LongAlpaca model handles new, long academic papers with ease and answers academic questions quite accurately.

Next, let's look at how LongAlpaca performs in the economics domain, where the reading and comprehension barriers are high.

Given a collection of the International Monetary Fund's World Economic Outlook summaries from 2012 to 2023, the model is asked to summarize the economic situation year by year and predict future trends. LongAlpaca replied that the global economic situation in 2023 is uncertain, with growth expected to slow compared to 2022; in the coming years, global expansion will be moderate and below expectations, while major challenges such as geopolitical tensions and financial market volatility remain and need to be handled carefully through multilateral cooperation and structural reform.

LongAlpaca can also read new novels and analyze their content. The comparison below pits LongAlpaca against Llama2, both with 13B parameters; let's see how each performs.

In the classic Journey to the West, why was Sun Wukong so formidable when he wreaked havoc in Heaven, yet repeatedly suffered setbacks on the journey to obtain the scriptures? LongAlpaca gave five reasons, which can be roughly summarized as: "Sun Wukong's immaturity, strong opponents, limits on his power, deception by his enemies, and an initial lack of companions led to his failures on the journey. Over time, however, he gained wisdom, experience, and comrades, which ultimately enabled him to defeat the strongest enemies."

Llama2's answer was much simpler: although Sun Wukong did wreak havoc in Heaven, it is not necessarily accurate to say he represents invincible power; in fact, he was ultimately defeated by the powerful Tang Sanzang in Journey to the West. Llama2 did not give a correct answer and also indicated that it needed more information about the content.

This time, have the system read The Three-Body Problem and ask why Ye Wenjie contacted the aliens in the first book, and why she later regretted it. LongAlpaca answers that Ye Wenjie's initial contact with the aliens was driven by idealism and revenge, but as time passed her view of humanity changed and she realized the danger she might have brought upon all of humanity, so her remorse grew day by day. The answer is very clear.

Llama2's answer was: "Through interaction with the aliens, she hopes to gain a deeper understanding of the nature of the universe and her own place in it. Ye Wenjie gradually realizes that the aliens and their technology cannot solve her problems." The answer is vague, and the model then drifts into commenting on the novel as a whole, going off topic.

From these answers it can be seen that some models, such as Llama2 [2], may have encountered the relevant novels during pre-training, but when asked short questions based only on the book title, their answers are far from ideal.

Comparing the two models' answers, the gap is clear: whether revising academic papers, commenting on global economic trends, or reading novels, LongAlpaca beats Llama2 across the board.

Two lines of code and three key conclusions

Llama2 is arguably one of the most powerful open-source large models in the AI community and holds a leading position in the industry, yet LongAlpaca achieves a clean sweep against it. How did the LongLoRA technique behind it manage to capture so much attention?

In the original large language model, the main computational cost of long-text processing is concentrated in the self-attention mechanism, whose cost grows quadratically with text length.

To address this, the research team proposed LongLoRA, which approximates global self-attention through grouping and shifting.

Simply put, the tokens of a long text are divided into groups and self-attention is computed within each group, with the group boundaries shifted for different attention heads. This greatly reduces the computational cost while still letting information propagate across the global receptive field.

Two lines of code
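
The sketch below is an illustrative PyTorch re-implementation of this shifted, grouped attention idea as described in the article, not the official LongLoRA code; the function name `shifted_group_attention`, the tensor layout, and the omission of causal masking are simplifying assumptions. The two roll() calls play the role of the famous "two lines".

```python
# Illustrative sketch of shifted, grouped self-attention (not the official LongLoRA code).
# Assumed layout: q, k, v of shape (batch, seq_len, num_heads, head_dim), with seq_len
# divisible by group_size. Causal masking is omitted for brevity.
import torch
import torch.nn.functional as F

def shifted_group_attention(q, k, v, group_size):
    bsz, seq_len, num_heads, head_dim = q.shape
    q, k, v = (x.clone() for x in (q, k, v))

    # Shift half of the heads by half a group so neighbouring groups overlap,
    # letting information flow across group boundaries.
    for t in (q, k, v):
        t[:, :, num_heads // 2:] = t[:, :, num_heads // 2:].roll(-group_size // 2, dims=1)

    # Fold each group into the batch dimension and run ordinary attention per group.
    def to_groups(x):
        return (x.reshape(bsz, seq_len // group_size, group_size, num_heads, head_dim)
                 .permute(0, 1, 3, 2, 4)
                 .reshape(-1, num_heads, group_size, head_dim))

    out = F.scaled_dot_product_attention(to_groups(q), to_groups(k), to_groups(v))

    # Restore the original layout and undo the shift on the second half of the heads.
    out = (out.reshape(bsz, seq_len // group_size, num_heads, group_size, head_dim)
              .permute(0, 1, 3, 2, 4)
              .reshape(bsz, seq_len, num_heads, head_dim))
    out[:, :, num_heads // 2:] = out[:, :, num_heads // 2:].roll(group_size // 2, dims=1)
    return out
```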

LongLoRA also revisits low-rank training. Original low-rank methods such as LoRA [5] do not transfer well to longer text lengths. On top of low-rank training, LongLoRA additionally opens up the embedding and normalization layers for fine-tuning, which lets it approximate the effect of full fine-tuning.
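
As a rough illustration of that idea, the simplified sketch below marks which parameters would be trained: the LoRA adapter weights plus the embedding and normalization layers, with everything else frozen. It is not the official training script; the name patterns follow common Llama-style conventions and the helper name is made up.

```python
# Simplified sketch (not the official LongLoRA training script): train the LoRA adapters
# plus the embedding and normalization layers in full, and freeze the rest.
# The patterns "lora_", "embed_tokens", and "norm" are assumed Llama-style names.
def mark_longlora_style_trainables(model):
    for name, param in model.named_parameters():
        if "lora_" in name:
            param.requires_grad = True      # low-rank adapter weights
        elif "embed_tokens" in name or "norm" in name:
            param.requires_grad = True      # embedding + normalization layers, trained in full
        else:
            param.requires_grad = False     # frozen base weights
```

According to the article, opening up just these few extra layers is what closes most of the gap between plain LoRA and full fine-tuning when adapting to longer contexts.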

When extending and training on texts of different lengths, the effects of LongLoRA, LoRA, and full-parameter fine-tuning can be compared along three dimensions:

In terms of perplexity, the original LoRA method keeps degrading as length grows, while LongLoRA and full-parameter fine-tuning maintain good results across text lengths;

In terms of GPU memory, both LongLoRA and the original LoRA save substantially compared with full-parameter fine-tuning. For example, for training at 8k length, LongLoRA cuts GPU memory consumption from 46.3 GB (full fine-tuning) to 25.6 GB;

In terms of training time, for training at 64k length, LongLoRA cuts the time from roughly 90-100 hours with conventional LoRA down to 52.4 hours, while full-parameter fine-tuning takes more than 1,000 hours.

A minimalist training recipe, low compute and time costs, and excellent accuracy make LongLoRA easy to adopt widely. All of the related techniques and models are now open source, and interested users can deploy and try them for themselves.
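
For readers who want to try it, the snippet below shows a generic way to load a released checkpoint with Hugging Face Transformers; the model identifier is only a placeholder, so check the LongLoRA repository for the actual released model names.

```python
# Generic loading sketch with Hugging Face Transformers; the model identifier below is
# a placeholder, not a confirmed release name. See the LongLoRA repository for details.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path-or-hub-id/LongAlpaca-13B"   # placeholder identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

with open("paper.txt", "r", encoding="utf-8") as f:
    prompt = "Summarize the key contributions of the following paper:\n" + f.read()

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```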


References

[1] LLaMA team. LLaMA: Open and efficient foundation language models. arXiv:2302.13971, 2023a.

[2] Llama 2 team. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023b.

[3] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv:2306.15595, 2023.

[4] Szymon Tworkowski, Konrad Staniszewski, Mikolaj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Milos. Focused Transformer: Contrastive training for context scaling. arXiv:2307.03170, 2023.

[5] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

