What is a large language model?

About this guide

Computer programs that can write and speak like a normal person were for a long time confined to the science fiction genre, but have been accessible to many people for a few years now. One reason for this rapid development and enormous progress is so-called Large Language Models (LLMs). These models have changed the way we can interact with machines and now go beyond simply processing text and speech. This article explains what sets LLMs apart and how they work. It also shows how different models and their predictions can be compared, and answers the question of how far away artificial general intelligence may still be.

What is a Large Language Model (LLM)?

A large language model is a particular type of artificial intelligence that is able to “understand” human language and to generate it itself. It is based on deep learning, a sub-field of artificial intelligence, and uses neural networks that are inspired by structures in the human brain. These models get their name from the fact that they require a vast number of parameters to perform these tasks. They are also trained on huge amounts of data, in most cases texts from the internet, in order to learn the language.

However, the majority of LLMs used today are not only able to use text as input, but can also process other media, such as images, videos, or audio files. Such models are referred to as multimodal because they are able to understand, process, and output various forms of communication. In this sense, these models are moving away from pure large language models, as they can do more than understand and generate language. In the past, such models were known as foundation models. Today, however, large language models primarily carry the attribute “multimodal”, which describes that the model can not only use language as input and output, but can also handle other file formats.

What are Foundation models?

Foundation models are predominantly large neural networks that have already been pre-trained for a certain range of applications, such as language processing. This pre-training requires huge data sets so that the foundation model can recognize relationships in the data and transfer them to new applications. In addition, large computing resources must be available, as the underlying calculations are extensive and must be performed in enormous numbers.

After pre-training has taken place, the models can then be fine-tuned for specific applications. Such fine-tuning is used, for example, so that a foundation model can summarize texts, translate them into different languages, or even generate completely new passages.

This foundation model architecture is particularly advantageous because significantly less data is required for fine-tuning than for the original pre-training. It also makes it possible to build on the results and progress of the model from previous training runs.

How does a large language model work?

In one of our previous articles about language models in general, the basic types of language models and their rough functioning have already been discussed. This text, however, addresses the specific characteristics and challenges of LLMs. In its simplest form, a large language model receives a text as input, which then serves as the basis for further calculations. Many of today's models are multimodal and can process other forms of input, such as images or audio, in addition to text. However, we will limit ourselves to text inputs and outputs for the following explanations.

The input is divided into so-called tokens, which can represent either individual words or even syllables. As a result, the text is split into smaller units, which can then be analyzed and processed more easily. However, computers cannot handle words natively, which is why the tokens are encoded in the so-called embedding layer before further processing, i.e. converted into numeric vectors. During training, individual layers of the network learn these so-called embeddings. A good embedding is characterized by the fact that the numeric representation of a word contains as much information as possible about its content, so that vectors that are close together also represent tokens with a similar meaning, such as synonyms.
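The two steps described above, splitting text into tokens and looking up a vector for each token, can be sketched in a few lines. The vocabulary, the greedy splitting rule, and the vector values below are all invented for illustration; real tokenizers (e.g. byte-pair encoding) and real embedding tables are learned from data and are far larger.

```python
# Toy vocabulary of known sub-word units (invented for illustration).
vocab = {"un": 0, "break": 1, "able": 2, "the": 3, "player": 4, "play": 5}

def tokenize(text):
    """Greedily split each word into the longest known sub-word tokens."""
    tokens = []
    for word in text.lower().split():
        while word:
            for end in range(len(word), 0, -1):
                if word[:end] in vocab:
                    tokens.append(word[:end])
                    word = word[end:]
                    break
            else:
                word = word[1:]  # skip characters we cannot match
    return tokens

# Each token ID is mapped to a numeric vector: the "embedding".
# These 2-d vectors are made up; real models use hundreds of dimensions.
embeddings = {
    0: [0.1, -0.3], 1: [0.8, 0.5], 2: [-0.2, 0.4],
    3: [0.0, 0.1], 4: [0.7, 0.6], 5: [0.9, 0.4],
}

tokens = tokenize("unbreakable")            # ['un', 'break', 'able']
vectors = [embeddings[vocab[t]] for t in tokens]
```

A word the model has never seen whole, like “unbreakable”, is still representable because it decomposes into known sub-word units.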

A vector is a multi-dimensional data structure made up of numbers that represent a direction and a length. The easiest way to imagine a vector is as a directional arrow: it gives instructions on how to get from one point to another point in space. The vector (5, 2) can mean, for example, that you must first take five steps straight ahead and then two steps to the left in order to reach the target point.

Each of the tokens is converted into such a vector so that the computer can calculate with numbers. Another advantage of this representation is that thematically similar words, such as “play” and “player”, point in a similar direction, so the model recognizes that although they are not the same, they have a similar meaning.
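“Pointing in a similar direction” can be measured with cosine similarity: the cosine of the angle between two vectors, which is close to 1 for vectors with almost the same direction. The vector values below are invented for illustration; only the comparison between them carries the point.

```python
import math

# Toy embedding vectors (invented values): related words point in a
# similar direction, an unrelated word points somewhere else.
embedding = {
    "play":   [0.9, 0.4, 0.1],
    "player": [0.8, 0.5, 0.2],
    "banana": [-0.1, 0.9, -0.7],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sim_related = cosine_similarity(embedding["play"], embedding["player"])
sim_unrelated = cosine_similarity(embedding["play"], embedding["banana"])
# sim_related is close to 1, sim_unrelated is much lower.
```

This is exactly the property a good embedding should have: similarity in meaning becomes similarity in numbers, which the model can compute with.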

The majority of current LLMs are based on the so-called Transformer architecture, which was first presented in the paper by Vaswani et al. (2017). The key point is the so-called attention layer, which assigns each of the tokens a value stating how important that token is for completing the task. In addition, these models are characterized by the fact that they store the exact position of each token in order to include this information in the prediction.

This model is then trained with a variety of texts, which mostly come from the internet, to ensure that the model understands language and its use. Through this training, structures are identified that can then be used for new predictions. However, the term “large” is derived not only from the size of the training data, but also from the enormous computational effort required during training as well as when using the model.

To put concrete figures on these immense computing costs, various sources have tried to estimate the cost that a single training run of a large language model generates. These calculations often consider only hardware costs, i.e. how much it cost to use the GPUs (the processing units). For example, Forbes estimates that a single training run of GPT-3 costs at least five million USD for GPUs alone (https://www.forbes.com/sites/craigsmith/2023/09/08/what-large-models-cost-you--there-is-no-free-ai-lunch/). Other cost items, such as personnel costs, are not included at all.

This makes it clear why the market of LLM providers that actually train models from scratch is so small and why only a few participants are entering it: without numerous investors in the background, the training costs are simply unmanageable for small and medium-sized companies.

Current models have several billion parameters that perform calculations. As a result, conventional computers are not sufficient, and a large number of processors and graphics cards are used to cope with this quantity.

Although such large models can already produce very good results with intensive calculations, performance cannot be improved simply by increasing the number of parameters. Newer models, such as GPT-4 or Mixtral, therefore also rely on innovative approaches, such as the so-called Mixture of Experts. Here, a large model is divided into various sub-areas, the so-called experts, which are specialized for different applications. Depending on the user's input, it is then decided which specific part of the network needs to be activated so that the best possible prediction can be made. Several experts can also be addressed at once, but the approach prevents each request from running through the entire network with its billions of parameters. As a result, the models can be larger and more complex while saving computing costs.
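The routing idea behind Mixture of Experts can be illustrated with a minimal sketch: a gating step picks the top-k experts for a given input, and only those run. The “experts” and the router scores below are invented stand-ins; in a real model each expert is a full feed-forward sub-network and the router scores come from a learned gating layer.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical "experts": simple functions here; in a real model each
# expert is a feed-forward sub-network with its own parameters.
experts = [
    lambda x: [v * 2 for v in x],     # expert 0
    lambda x: [v + 1 for v in x],     # expert 1
    lambda x: [-v for v in x],        # expert 2
    lambda x: [v * 0.5 for v in x],   # expert 3
]

def moe_forward(x, router_scores, top_k=2):
    """Route the input through only the top-k experts.

    Only top_k experts run per input, so most of the network's
    parameters stay idle for each request -- that is the saving.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)[:top_k]
    weights = softmax([router_scores[i] for i in ranked])
    outputs = [experts[i](x) for i in ranked]
    return [sum(w * out[j] for w, out in zip(weights, outputs))
            for j in range(len(x))]

# Invented router scores: experts 1 and 3 win, experts 0 and 2 stay idle.
y = moe_forward([1.0, 2.0], router_scores=[0.1, 2.0, -1.0, 1.5], top_k=2)
```

With four experts and `top_k=2`, half the experts are skipped on every request; Mixtral uses the same principle with eight experts of which two are active per token.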

How can models be compared?

When dealing with large language models (LLMs) and their outputs, it is often the user's subjective perception that determines whether the model's performance is good or bad. Compared to other deep learning applications, natural language offers significantly fewer objective indicators that can express the quality of a model. It is often subjective perception and the application itself that determine whether the model performs well enough.

In order to make the performance of the models independently and objectively comparable, there are so-called benchmarks, which provide reliable figures. Standardised tests and data sets are used to evaluate the performance of an LLM and compare it with the results of other models. Depending on the application, various benchmarks have been developed over time. The most widely used include the following:

  • IFEval Dataset: The Instruction-Following Evaluation dataset includes more than 500 user prompts containing instructions. This data set is intended to measure how well the model responds to the details of an instruction and actually implements them.
  • BIG-Bench Hard: This comprises a total of 23 different tasks, which were taken from the BIG-Bench evaluation suite and rated as “hard” because previous language models were unable to beat human performance on them. One task involves, for example, understanding geometric figures.
  • Graduate-Level Google-Proof Q&A Benchmark (GPQA): This data set contains over 400 multiple-choice questions that test understanding in the areas of chemistry, physics and biology. Even highly trained test subjects were only able to answer about 65% of the questions correctly on average. In addition, the questions are “Google-proof”, which means that they cannot be answered correctly even with unlimited internet access, so the answers must be based exclusively on one's own knowledge.

From these and other benchmarks, a ranking of large language models can then be created, which shows how well individual LLMs are performing. A widely used leaderboard is offered, for example, by the Large Model Systems Organization (LMSYS), which, among other things, orders the models according to their so-called Elo score. This is a concept from chess that rates individual players depending on how many games they have won and how strong their opponents were. For LLMs, this likewise involves duels between two models, in which users decide anonymously which model delivered the better result.
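The Elo mechanics behind such a leaderboard fit in a few lines. After each duel, the winner takes rating points from the loser, and the amount depends on how surprising the result was. The starting ratings and the update factor `k=32` below are conventional illustrative choices, not the exact parameters any particular leaderboard uses.

```python
def expected_score(rating_a, rating_b):
    """Expected probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, a_won, k=32):
    """Update both ratings after one duel between two models.

    An upset (the underdog winning) moves many points; an expected
    result moves few. The total number of points is conserved.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

# A lower-rated model beating a higher-rated one gains many points:
a, b = update_elo(1000, 1200, a_won=True)
# a rises to about 1024, b falls to about 1176.
```

Because every duel is zero-sum, a model can only climb the ranking by winning anonymous comparisons against strong opponents, which is what makes the score hard to game.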

How far away are LLMs from Artificial General Intelligence?

Artificial General Intelligence describes a machine that has all the capabilities attributed to a human being. In addition to understanding language, this also includes, for example, the ability to learn things, draw conclusions, or carry out creative projects. Since many LLMs not only understand language well but can also generate creative and new texts, it is easy to get the impression that they are already very close to human intelligence. In addition, the models are evolving at a rapid pace and appear able to solve even simple calculation problems in mathematics or physics. Google's Gemini model, for example, finds errors in various math problems and can already offer the right solution (https://www.youtube.com/watch?v=K4pX1VAxaAI).

However, there are also well-known voices within the AI community who doubt that LLMs, as they work today, can achieve such general intelligence. Meta's chief AI scientist Yann LeCun, for example, sees the reason for this in the fact that LLMs draw their knowledge from the huge amounts of text with which they are trained. He argues that we humans get a large part of our intelligence not from texts and books, but from interactions with our environment. He therefore suggests that models should instead learn through their exchange with the physical world by recording reactions via sensors or cameras (https://thenextweb.com/news/meta-yann-lecun-ai-behind-human-intelligence). For LeCun, so-called “world models”, on which human intelligence is based, are also of great importance. The majority of decisions in everyday life are based on observing the environment and then assessing which next action would be best and what future state this action would cause. According to LeCun, however, these world models cannot be built using language alone; they also include physical aspects that can only be learned, for example, through physical feedback such as gripping (https://lexfridman.com/yann-lecun-3/).

Research is also already investigating whether today's LLMs are even technically set up in such a way that they can lead to the development of Artificial General Intelligence (AGI). Some papers, such as Chen et al. (2024) (“Transformer-based large language models are not general learners: A universal circuit perspective”), come to the conclusion that the Transformer architecture on which most LLMs are based is technically limited, as its capacity is not sufficient to represent the many different ways in which knowledge can be expressed. They therefore argue that future research must find new architectures to fill this gap; otherwise, such generally learning models cannot be created.
