In recent months, big players such as Google, Meta and OpenAI have presented and iterated on a variety of models and architectures, which makes it difficult to keep track of the market. In addition, a number of smaller organizations, particularly from the open-source community, have achieved remarkable results. This article therefore presents some of the best-known and most powerful LLMs and explains their differences.
What is an LLM and how does it work?
A Large Language Model is a category of machine learning models trained to understand, process and generate human language. In most cases, these architectures have billions of learnable parameters, which is what makes the model "large" and allows it to learn complex structures in the data. In addition, huge amounts of text are used during training so that the model captures the language with all its peculiarities, such as grammar or synonyms, as well as possible.
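To get a feel for why these models are called "large", the parameter count of a GPT-style transformer can be estimated with a back-of-the-envelope calculation. The configuration values below are illustrative assumptions, not the specification of any particular model:

```python
def transformer_params(n_layers, d_model, vocab_size, d_ff=None):
    """Rough parameter count for a GPT-style decoder-only transformer."""
    d_ff = d_ff or 4 * d_model            # common convention for the feed-forward width
    attn = 4 * d_model * d_model          # Q, K, V and output projections per layer
    ffn = 2 * d_model * d_ff              # two feed-forward matrices per layer
    embed = vocab_size * d_model          # token embedding table
    return n_layers * (attn + ffn) + embed

# An illustrative 7B-class configuration (assumed values)
print(f"{transformer_params(n_layers=32, d_model=4096, vocab_size=32000) / 1e9:.1f}B")  # → 6.6B
```

Even this simplified count (it ignores biases, layer norms and positional parameters) lands in the billions, which is why training and serving such models requires so much compute.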
Today, many models are "multimodal": they are able to process not only text but also audio, video, and other file formats. That is why in many cases people no longer speak of "large language models" but of so-called "foundation models", as they can process more than just language and have a broad knowledge base.
What are the most important LLMs?
Since ChatGPT was released by OpenAI in November 2022, there has been some movement in the area of large language models and many larger tech companies have released their own models. In this section, we therefore look at some of the most important LLMs and their characteristics.
1. OpenAI GPT
(current model: GPT-4o, May 2024)
GPT-4o is the latest generation of Generative Pretrained Transformers (GPT for short) published by OpenAI, the architecture behind ChatGPT. The "o" at the end stands for "omni", as the latest version combines audio, image and text processing capabilities and is even more powerful and efficient than previous versions.
The interesting thing about this architecture is that it is not a single, large model, but a collection of "smaller" models that work together in a targeted manner. This approach is known as "Mixture of Experts" (MoE). Although OpenAI keeps the exact architecture under wraps, it is assumed that GPT-4 consists of a total of 16 such expert models, each trained for a different sub-area. For each prediction, two of these experts are then activated and produce the output. Exact figures of this kind are not yet available for the new GPT-4o version, but it is assumed to work in a similar way.
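Since OpenAI does not disclose its architecture, the routing idea can only be illustrated conceptually. The sketch below shows top-2 expert routing with toy "experts" (plain functions); in a real MoE model, the gating scores come from a learned router network and the experts are full feed-forward sub-networks:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, gate_scores, k=2):
    """Route one token through the top-k experts (conceptual sketch).

    `experts` is a list of callables; `gate_scores` are router scores for this
    token (in a real model they are produced by a learned gating network).
    """
    probs = softmax(gate_scores)
    top_k = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top_k)
    # Weighted combination of the outputs of only the k active experts
    return sum(probs[i] / norm * experts[i](token) for i in top_k)

# Toy example: 16 "experts" that simply scale their input
experts = [lambda x, s=s: s * x for s in range(16)]
scores = [0.0] * 16
scores[3], scores[7] = 2.0, 1.0   # the router prefers experts 3 and 7
out = moe_forward(10.0, experts, scores)
```

The key property this illustrates: although there are 16 experts in total, only two run per prediction, so the compute cost per token is far lower than the total parameter count suggests.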
In addition, in July 2024, a smaller version of GPT-4o was presented under the name GPT-4o mini, with a smaller architecture and fewer parameters. This brings advantages for many applications that do not require the highest output quality: significantly less computing capacity is needed, which reduces costs, and the models can also run on weaker devices, such as smartphones or tablets. Such smaller models are also often used for real-time applications, where response time is weighted more heavily and performance compromises are acceptable in return. Despite its smaller architecture, GPT-4o mini still manages to outperform larger models in individual benchmarks. For example, it performs better in programming and math benchmarks than the Llama 3 model with eight billion parameters or Mistral Large.
2. Mistral/Mixtral
(Last updated: April 2024)
The company Mistral AI is a start-up based in France that offers a wide variety of large language models. It was founded by former employees of Google and Meta, among others, and has well-known investors such as Microsoft. In contrast to other providers, some Mistral models are open-source, so they can not only be used free of charge, but can also be easily modified. In doing so, they are pursuing the goal of making the development of artificial intelligence more transparent and credible. These freely available models include:
- Mistral 7B: This model has around seven billion parameters (hence "7B") and is the smallest model in the Mistral family. Although it has fewer parameters than comparable LLMs, it can still keep up with larger models. Thanks to its comparatively compact architecture, it offers fast inference at rather low computing costs. However, this also limits its range of applications: it is suitable only for English-language processing and programming tasks.
- Mixtral 8x7B: This model follows a so-called "Mixture of Experts" approach and consists of eight individual expert models. As a result, it can be operated with comparatively little computing effort and yet covers a wide range of applications. Mixtral 8x7B can not only generate program code, but is also fluent in English, French, Spanish, Italian and German. In some benchmarks, this architecture even performs better than GPT-3.5, the initial model behind ChatGPT.
- Mixtral 8x22B: This is an even larger mixture of experts, consisting of eight individual models with 22 billion parameters each. It is currently the most advanced open-source model that Mistral AI offers and can handle significantly more complex tasks thanks to its large architecture. For example, the model is particularly suitable for summarizing very long texts or generating large amounts of text. Compared to the Mistral models mentioned so far, it can process twice the amount of text, namely 64,000 tokens. In English, that corresponds to around 48,000 words, as a token amounts to roughly four characters.
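The rule of thumb above (one token ≈ four characters) can be turned into a small helper for judging whether a text fits into a model's context window. This is only a rough estimate; real tokenizers produce different counts depending on the language and vocabulary. The 64,000-token default below is the Mixtral 8x22B figure from the text:

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate using the ~4 characters/token rule of thumb."""
    return len(text) / chars_per_token

def fits_context(text, context_window=64_000):
    """Check whether a text is likely to fit into the model's context window."""
    return estimate_tokens(text) <= context_window

sample = "Large language models process text as tokens, not words."
print(round(estimate_tokens(sample)))  # → 14 (for this 56-character sentence)
```

Such an estimate is useful, for example, to decide up front whether a long document needs to be split into chunks before it is sent to the model.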
However, all of these models are pure large language models and do not support multimodality, i.e. they are purely text-based. In addition to these open-source models, Mistral also offers these commercial models:
- Mistral Large: This model is one of Mistral AI's most powerful and ranks second behind GPT-4 in various performance tests. It can be used to generate text in various languages as well as programming code.
- Mistral Small: This architecture is suitable for fast, computationally inexpensive predictions that require a short response time. This includes, for example, classification in customer support to determine whether a customer is annoyed or not. The model can also generate text for shorter answers in this context. For more complex tasks that require a certain amount of reasoning, such as data extraction or the preparation of text summaries, the larger models should be used instead.
- Mistral Embed: This model can be used to create so-called word embeddings for English text. Natural text is passed as input and the prediction consists of numeric representations of the words, which can then be processed by machines.
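What such embeddings are good for can be illustrated with a few hand-made vectors: semantically similar words get vectors pointing in similar directions, which is typically measured with the cosine similarity. The three-dimensional vectors below are invented for illustration; a real embedding model like Mistral Embed returns vectors with hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented toy embeddings, purely for illustration
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high: similar meaning
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # low: unrelated meaning
```

This is the mechanism behind typical embedding applications such as semantic search and retrieval: documents whose vectors lie close to the query vector are returned first.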
In addition to this wide variety of models, Mistral AI also offers "Le Chat", an AI chatbot with which conversations can be conducted and content can be created, similar to ChatGPT.
3. Llama model family
(Last updated: July 2024)
In February 2023, Facebook's parent company Meta also entered the world of large language models and presented its Large Language Model Meta AI, or Llama for short. This release was widely expected, as Meta had made considerable progress in the area of Natural Language Processing early on. Back in 2019, for example, it presented a tool called LASER, which could map sentences and their content in various languages into a shared vector space. Since the presentation of the large language model, the focus has been on delivering the best possible foundation model, which can be adapted for various applications using natural language. In order to boost research in this area, Meta decided to make the programming code for the model family publicly available and also published a paper in which, among other things, a series of benchmarks was meant to identify its weaknesses. The computing power used for training comes from Meta's own infrastructure. According to statements that Mark Zuckerberg published on Instagram on January 18, 2024, the company intends to buy an additional 350,000 NVIDIA H100 GPUs by the end of 2024, at a unit price of around €30,000. Meta would then have around 600,000 NVIDIA H100 GPUs available.
Since the initial release in 2023, a total of three model families have been presented:
- Llama: The original version of the model was offered in various sizes, designed so that even smaller infrastructures with lower computing power could train the model. There were four variants, with seven, 13, 33 and 65 billion parameters respectively, all of which were trained on at least one trillion tokens.
- Llama 2: The Llama 2 variant followed in July 2023 and included three different models with seven, 13 and 70 billion parameters, which were trained with a significantly larger data set of two trillion tokens. As a result, Llama 2 with 70 billion parameters was also able to perform significantly better in many benchmarks than Llama (1) with 65 billion parameters.
- Llama 3: After just under another year, in April 2024, Meta released the third and most recent version of Llama in variants with eight and 70 billion parameters. Compared to Llama 2, several improvements were made, including a new tokenizer, which converts natural language into tokens, is significantly more efficient, and has a larger vocabulary of 128,000 tokens in total. According to Meta's own statements, the 70 billion parameter model beats other models such as GPT-3.5 or Mistral Medium.
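Why a larger tokenizer vocabulary makes a model more efficient can be shown with a deliberately simplified greedy longest-match tokenizer. The vocabularies below are invented toy examples; real tokenizers such as Llama 3's are trained with byte-pair encoding on large corpora:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary (toy sketch)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible piece first; fall back to single characters
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

small_vocab = {"un", "believ", "able"}
large_vocab = {"unbeliev", "able", "unbelievable"}

print(greedy_tokenize("unbelievable", small_vocab))  # → ['un', 'believ', 'able']
print(greedy_tokenize("unbelievable", large_vocab))  # → ['unbelievable']
```

A larger vocabulary covers more whole words and word pieces, so the same text is encoded in fewer tokens, and every saved token means less compute per input and more text fitting into the context window.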
4. Google models
(Last updated: June 2024)
Google Research delivered the first large language models back in 2018, building on the Transformer approach from 2017, and made considerable progress with them. Although these first models did not achieve the popularity of ChatGPT, they were very popular among experts.
- BERT: The BERT (Bidirectional Encoder Representations from Transformers) model was presented in a scientific article in 2018. Its main goal was to better understand the relationships and contexts between words. It worked bidirectionally, i.e. it included both the words before and after a given word in the prediction. This property opened up various fields of application, such as question answering or sentiment analysis of texts.
- T5: The term T5 covers a series of large language models from Google that are characterized by transforming various natural language processing tasks into a text-to-text task. This is also reflected in the name "Text-to-Text Transfer Transformer." The difference to other models is that while T5 also takes text as input, the task itself is specified as a plain-text prefix prepended to that input. The same input text can therefore be translated by the same model by prepending "translate English to German: ..." and summarized by using the keyword "summarize: ..." instead.
- Google Gemini: Although Google had already delivered important models for the development of natural language processing with BERT and T5, these were mainly known to experts, and ChatGPT stole the show in the public eye. In 2023, the Gemini series was therefore presented, which can be used as a multimodal chatbot and competes with the GPT model family. There are currently four versions, which differ in size and required computing power and have therefore been designed for different applications. Google states that the "Flash" and "Pro" variants can maintain a context of up to one million tokens, which is a unique value among foundation models. Together with the multimodal capability, this makes the model particularly suitable for applications in the field of education, where explanations comprising texts, diagrams and images with a wide range of context can be created.
- Gemma 2: Gemma 2 is the latest generation of Google's open-source LLMs and was presented in June 2024. The model comes in two variants, with nine and 27 billion parameters, which can be used for tasks of different complexity. Compared to the first generation, Gemma 2 introduced, among other things, the ability to combine several LLMs. It also uses so-called Sliding Window Attention, which Mistral uses as well, and which ensures that the models require significantly less time and memory to compute attention.
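The sliding-window idea mentioned above can be sketched as an attention mask: instead of letting every token attend to all previous tokens, each token only sees the last `w` positions, which bounds the memory and compute needed per token. This is a minimal sketch; real implementations build such masks as batched tensors inside the attention kernels:

```python
def sliding_window_mask(seq_len, window):
    """Causal attention mask where token i may attend to positions i-window+1 .. i.

    Returns a seq_len x seq_len matrix of 0/1 entries (1 = attention allowed).
    """
    mask = []
    for i in range(seq_len):
        row = [1 if max(0, i - window + 1) <= j <= i else 0 for j in range(seq_len)]
        mask.append(row)
    return mask

for row in sliding_window_mask(seq_len=5, window=3):
    print(row)
```

With a full causal mask, the number of allowed attention pairs grows quadratically with the sequence length; with a sliding window it grows only linearly (about `seq_len * window` pairs), which is exactly the saving the text describes.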
5. Other large models
Although a wide variety of models have already been presented in this article, the market is far from saturated, and new, supposedly lesser-known companies are also presenting very powerful models. Some of these are discussed in more detail in the following sections.
- Grok AI: The language model of X, formerly Twitter, called Grok AI, is unfortunately not only known for its performance. Although the exact training data of Grok AI is not disclosed, various reports assume that content from X was used to a significant extent for training, which is not the case for other models such as ChatGPT. In terms of performance, Grok AI lags behind other current LLMs in various benchmarks and also offers a smaller range of features than its competitors. The model also caused a stir when it began to hallucinate and published a story about the American basketball player Klay Thompson, in which he was accused of throwing bricks at houses. The presumed reason for this was a weak basketball game, which had led some fans to call him a "brick thrower" in their tweets, something the model took too literally.
There is also a risk of confusion with a new chip for running LLMs that bears the similar name Groq. This processing unit is referred to as an LPU, i.e. a Language Processing Unit. It offers significantly more computing capacity and can also prevent memory bottlenecks. Individual performance tests showed that ChatGPT could deliver predictions 13 times faster with Groq's LPUs.
- Claude 3.5: The LLMs in the Claude series come from the AI research company Anthropic, which was founded in 2021 by former developers of OpenAI, the company behind ChatGPT. Anthropic's goal is to develop AI that serves people and complies with ethical principles. The developers had already worked on the GPT-2 and GPT-3 models and brought extensive know-how in this area. The latest model, Claude 3.5, is a serious competitor to GPT-4 and beats it in various benchmarks. There are a total of three versions, called Haiku, Sonnet and Opus, which differ in performance and size.
What developments are there outside Europe and the USA?
In public perception, a large part of AI development in the area of large language models takes place primarily in Europe and the USA, which already have large, established companies and sufficient data sets in their languages. Training data for "western" models is therefore often rather blind to languages and cultures from other continents. According to the senior AI director of AI Singapore, the training data of Llama 2 comprises only 0.5 percent data specific to South East Asian countries. This is particularly problematic because over 1,200 dialects and languages are spoken in this region, which an LLM should ideally be able to handle.
With the SEA-LION model, a large language model was therefore presented for the first time in 2024, which was specially trained for the ASEAN region. Although it is only a fraction of the size of GPT-4, for example, it can be more helpful in specific applications, such as customer support, as it can address the cultural differences of the individual countries more specifically.
In addition, there are various efforts in China to train its own large language models, as ChatGPT, for example, is not available there. The Baidu group has released its own series of models called ERNIE. The latest version was presented in June 2023 and, according to Baidu's own statements, beats GPT-3.5 in general comprehension tasks and GPT-4 when dealing with the Chinese language. According to the OpenCompass benchmark, however, the models presented earlier in this article still perform better even in the Chinese-language benchmarks.