
Knowledge Science - Alles über KI, ML und NLP
Episode 161 - AI-generated: KS Pulse - MAmmoTH2, Granite Code Models
English version - a German version also exists, but the content differs only minimally:
AI-generated news of the day. The Pulse is an experiment to see whether it is interesting to receive the latest news every day in small five-minute packages generated by an AI.
It is completely AI-generated; only the content is curated. Carsten and I select suitable news items, after which the manuscript and the audio file are created automatically.
Accordingly, we cannot always guarantee accuracy.
MAmmoTH2: Scaling Instructions from the Web - https://arxiv.org/pdf/2405.03548
Granite Code Models: A Family of Open Foundation Models for Code Intelligence - https://arxiv.org/pdf/2405.04324v1
It would be great if you could compare the German version to the English version and give us feedback.
Welcome to the Knowledge Science Pulse podcast, where we explore the latest developments in artificial intelligence. I'm your host, Sigurd, and today I'm joined by my co-host Carsten. We'll be discussing two fascinating papers that showcase the advancement of language models and their applications in code intelligence. Carsten, what's the first paper we'll be diving into?
#### The first paper is titled "MAmmoTH2: Scaling Instructions from the Web" by researchers from Carnegie Mellon University and the University of Waterloo. This work introduces a new paradigm for efficiently harvesting large-scale, high-quality instruction data from the web without relying on costly human annotation or GPT-4 distillation.
#### That sounds intriguing! Can you tell us more about their approach and how it differs from existing methods?
#### Absolutely! The authors propose a three-step pipeline to mine instruction-response pairs from the web. First, they create a diverse seed dataset by crawling quiz websites and use it to train a fastText model for recalling relevant documents from Common Crawl. Second, they extract question-answer pairs from the recalled documents using open-source LLMs. Finally, they refine the extracted pairs using Mixtral-8x7B and Qwen-72B to remove unrelated content, fix formatting, and add missing explanations.
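The three-step pipeline just described (recall, extract, refine) can be sketched in code. This is a minimal, runnable illustration only: the stand-in functions below (keyword recall, regex extraction, string cleanup) replace the paper's actual fastText classifier and LLM-based extraction and refinement, and all function names are hypothetical.

```python
import re

def recall_documents(corpus, seed_keywords):
    """Step 1 stand-in: keep documents that look instruction-like.
    (The paper instead trains a fastText classifier on seed quiz-site data.)"""
    return [doc for doc in corpus
            if any(kw in doc.lower() for kw in seed_keywords)]

def extract_qa_pairs(doc):
    """Step 2 stand-in: pull naive Q/A pairs out with a regex.
    (The paper uses open-source LLMs for extraction.)"""
    pairs = re.findall(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=Q:|$)", doc, re.S)
    return [(q.strip(), a.strip()) for q, a in pairs]

def refine_pair(question, answer):
    """Step 3 stand-in: light cleanup of an extracted pair.
    (The paper uses Mixtral-8x7B and Qwen-72B to rewrite pairs
    and add missing explanations.)"""
    return question.rstrip("?") + "?", answer.strip()

corpus = [
    "Q: What is 2+2? A: 4",
    "Unrelated advertising text with no questions.",
]
recalled = recall_documents(corpus, seed_keywords=["q:"])
dataset = [refine_pair(q, a)
           for doc in recalled
           for q, a in extract_qa_pairs(doc)]
print(dataset)  # → [('What is 2+2?', '4')]
```

The point is the shape of the pipeline, not the stand-in logic; in a real reproduction each stub would be swapped for the corresponding classifier or model.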
#### Wow, that's a smart way to leverage existing resources and models. So, what kind of results did they achieve with this approach?
#### The results are quite impressive! By fine-tuning base LLMs on the harvested WEBINSTRUCT dataset, they built MAmmoTH2 models that significantly boosted performance on reasoning benchmarks. For example, MAmmoTH2-7B's accuracy increased from 11% to 34% on MATH and from 36% to 67% on GSM8K without training on any in-domain data. When further trained on public instruction datasets, the resulting MAmmoTH2-Plus models achieved state-of-the-art performance on several reasoning and chatbot benchmarks.
#### That's remarkable! It's exciting to see how leveraging web data can lead to such significant improvements in model performance. Now, let's move on to the second paper. What have you got for us, Carsten?
#### The second paper is titled "Granite Code Models: A Family of Open Foundation Models for Code Intelligence" by researchers from IBM. This work introduces a series of highly capable code LLMs designed to support enterprise software development across a wide range of coding tasks.
#### Code intelligence is a hot topic these days. What sets the Granite Code models apart from other code LLMs?
#### The Granite Code models have two main variants – base and instruct – released in four different sizes: 3B, 8B, 20B, and 34B. The base models are trained from scratch using a two-phase strategy, first on 3-4 trillion tokens of code data from 116 programming languages, and then further trained on a mixture of high-quality code and natural language data. The instruct models are derived by fine-tuning the base models on a combination of filtered CommitPack data, natural language instruction datasets, and synthetically generated code datasets.
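The two-phase training strategy can be summarized as a simple phase schedule. This is a hypothetical sketch: the phase names and mixture ratios are illustrative assumptions, and only the roughly 3 trillion code tokens of phase one come from the discussion above; the phase-two token budget is invented for the example.

```python
# Illustrative two-phase pretraining schedule in the spirit of the
# Granite base models: phase one on code only, phase two on a mixture
# of code and natural language. Numbers are placeholders except the
# ~3T code tokens mentioned for phase one.
PHASES = [
    {"name": "phase1_code_only",
     "tokens": 3_000_000_000_000,
     "data_mix": {"code": 1.0}},
    {"name": "phase2_code_plus_nl",
     "tokens": 500_000_000_000,  # placeholder budget, not from the paper
     "data_mix": {"code": 0.8, "natural_language": 0.2}},  # assumed ratio
]

def schedule(phases):
    """Yield (phase name, cumulative tokens seen) after each phase,
    checking that every data mix sums to 1."""
    seen = 0
    for phase in phases:
        assert abs(sum(phase["data_mix"].values()) - 1.0) < 1e-9
        seen += phase["tokens"]
        yield phase["name"], seen

for name, total in schedule(PHASES):
    print(name, total)
```

A real training run would attach a data loader and optimizer to each phase; the schedule merely makes the two-phase curriculum explicit.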
#### It sounds like they put a lot of effort into curating high-quality training data. How do the Granite Code models perform compared to other open-source code LLMs?
#### The Granite Code models demonstrate very strong performance across all model sizes and benchmarks, often outperforming other open-source code models that are twice as large. For example, on the HumanEvalPack benchmark, Granite-8B-Code-Base outperformed the most competitive CodeGemma-8B model by almost 12 points on average, despite being trained on significantly fewer tokens. The instruction-tuned variants also showed strong performance, surpassing other open-source code instruction models.
#### That's impressive! It's great to see the Granite Code models pushing the boundaries of code intelligence while remaining comparatively small and flexible. Carsten, what do you think are the key takeaways from these two papers?
#### I think both papers highlight the importance of high-quality, diverse training data for building powerful and versatile language models. The MAmmoTH2 paper demonstrates how we can leverage the vast amount of instruction data already available on the web, while the Granite Code models paper showcases the benefits of domain-specific training and fine-tuning. Together, these approaches can lead to models that excel in a wide range of tasks, from reasoning and chatbots to code generation and explanation.
#### Absolutely! It's exciting to see the rapid progress in this field and how researchers are finding innovative ways to enhance model performance. Thank you, Carsten, for sharing your insights on these groundbreaking papers. That's all the time we have for today's episode of the Knowledge Science Pulse podcast. Join us next time as we continue to explore the frontiers of artificial intelligence. Until then, stay curious and keep learning!