Machine learning training data
What distinguishes artificial intelligence, or, somewhat more academically, machine learning? The distinguishing feature is that not every result comes from a completely defined system of rules; instead, the system has learned what is highly likely to be the correct result. This is particularly important when it is difficult or impossible to capture all possible cases in a set of rules, for example because the complexity of the problem is too great, such as when a photo of a face is to be attributed to a person.
Many AI models learn from examples. In the case of face recognition, these can be many pictures of a person in which that person can be recognized.
Supervised Learning
Face recognition is a classic example of so-called supervised learning. This means the training data consists of pairs: the input (the photo of a face) and the correct result (the name of the person in the picture). Other examples of supervised learning include AI models for the sentiment (mood: good or bad) of a text, where the training data consists of (text, sentiment) pairs.
This kind of learning is called supervised because the correct result, also known as the "gold label", comes from a reliable source (e.g. a human annotator). Figuratively speaking, the supervisor is a teacher with an answer booklet (problem + gold label, i.e. the "solution") who can grade the model's attempts so that the trained model makes fewer and fewer mistakes.
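To make the pair structure concrete, here is a minimal Python sketch with invented example texts; it is not real training code, just the mechanics of pairs and grading against gold labels:

```python
# A minimal, invented sketch of the supervised setup: training data are
# (input, gold label) pairs, and the "teacher with the answer booklet"
# scores the model's attempts against the gold labels.
training_pairs = [
    ("I love this product!", "good"),      # (text, sentiment gold label)
    ("Great support, thank you.", "good"),
    ("This broke after one day.", "bad"),
    ("Very disappointing quality.", "bad"),
]

def evaluate(predict, pairs):
    """Fraction of attempts that match the gold label."""
    correct = sum(predict(text) == gold for text, gold in pairs)
    return correct / len(pairs)

# A deliberately naive stand-in "model"; real training would adjust the
# model's parameters until this score goes up.
def naive_model(text):
    return "good" if any(w in text.lower() for w in ("love", "great")) else "bad"

print(evaluate(naive_model, training_pairs))  # 1.0 on these toy examples
```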
Self-supervised Learning
Self-supervised learning also requires training data and examples. Here, however, the training algorithm draws the important information solely from the given input. This learning method is often used in the currently very popular generative models (ChatGPT, Midjourney). In language models such as those behind ChatGPT, for example, a part of a complete text is "blacked out" (one says "masked"), and the model tries to guess what is hidden behind the masked part. Because it has already "seen" many other texts, it can, so to speak, build up statistics. And it can check the correct answer itself, simply by peeking behind the mask. Because GPT-like models mask the last section of a text and thus effectively predict its continuation, they are also referred to as generative or predictive models.
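A small sketch of the idea, assuming GPT-style next-token prediction: the labels are simply cut out of the input text itself, so no human annotation is required:

```python
# Self-supervised label construction: the "gold labels" come from the
# input text itself, not from a human annotator.
text = "the cat sat on the mat"
tokens = text.split()

# GPT-style (predictive): hide ("mask") the continuation and use it as
# the target the model has to guess.
examples = []
for i in range(1, len(tokens)):
    visible = tokens[:i]   # the part the model is allowed to see
    masked = tokens[i]     # hidden behind the mask, used as the label
    examples.append((visible, masked))

for visible, masked in examples:
    print(f"input: {' '.join(visible):<22} -> predict: {masked}")
```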
Quality and quantity of training data
"A lot helps a lot" has long been an appropriate mantra in machine learning. Even if some training data is not optimal (e.g. not 100% correctly formatted, or with a less-than-optimal gold label), on average that does not matter much. In addition, ever larger neural networks have been trained, and these tend to simply "memorize" the correct results if they do not receive enough input to force them to recognize the underlying features needed for correct predictions. This is known as "overfitting."
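A toy illustration of overfitting, using a simple polynomial "model" rather than a neural network (an assumption for brevity, but the effect is the same): a model with as many parameters as data points can memorize the training data perfectly and still fail on new inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy samples of a simple underlying relationship: y = 2x + noise.
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.1, size=10)

# Too many free parameters: a degree-9 polynomial has 10 coefficients and
# can pass exactly through all 10 training points ("memorizing" them).
overfit = np.polyfit(x, y, deg=9)
simple = np.polyfit(x, y, deg=1)   # forced to capture the trend instead

# On unseen input just outside the training range, the memorizer fails.
x_new = 1.05
print("overfit model:", np.polyval(overfit, x_new))  # wildly off
print("simple model: ", np.polyval(simple, x_new))   # close to 2 * 1.05
```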
However, a lot of training data also means long training times. Especially if you run the training on expensive hardware, you may well ask yourself questions about a suitable cost-benefit ratio. There are also studies that recommend a relationship between the size of a neural network (in particular, the number of trainable parameters) and the optimal amount of training data (e.g. https://arxiv.org/abs/2203.15556).
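The cited study (often called the "Chinchilla" paper) is commonly summarized as roughly 20 training tokens per model parameter for compute-optimal training; the exact ratio depends on the compute budget, so treat this sketch as a rough approximation:

```python
def chinchilla_optimal_tokens(n_parameters: int, tokens_per_param: float = 20.0) -> int:
    """Back-of-the-envelope estimate of compute-optimal training data size,
    following the ~20 tokens/parameter heuristic often attributed to
    arxiv.org/abs/2203.15556 (the true ratio varies with compute budget)."""
    return int(n_parameters * tokens_per_param)

# Example: a 7-billion-parameter model would want on the order of 140B tokens.
print(f"{chinchilla_optimal_tokens(7_000_000_000):,} tokens")
```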
But if "a lot helps a lot" no longer holds absolutely, then we should probably also strive for better training data in order to save time and money while still obtaining well-trained models. Especially when it comes to fine-tuning (more on that later), this can be essential for success.
What are typical quality features?
First of all, the training data should represent "reality" well, i.e. the queries that the model will later process. If my customers all ask questions in German, then a purely English training data set is at least not optimal. The data should also be versatile enough to cover the range of later inputs. In addition, no single type of query should be over-represented. In sentiment classification, for example, the examples should not consist of 10 positive and 1,000 negative texts.
To put it more concretely: if I want to use an image recognition model to detect cancer, my training data set should contain neither exclusively images of healthy tissue nor exclusively images of tumors.
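Checking for such imbalances is straightforward; a minimal sketch with invented labels:

```python
from collections import Counter

# Invented gold labels of a sentiment training set.
labels = ["good"] * 10 + ["bad"] * 1000

counts = Counter(labels)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} examples ({n / total:.1%})")
# bad: 1000 examples (99.0%)
# good: 10 examples (1.0%)   <- heavily under-represented, as warned above
```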
Test data vs. training data
When you read or hear about training data, the term "test data" sometimes comes up, more rarely also "dev" or "development data." This data differs from the training data set only in its function (and usually in its size). The idea behind it: if you want to verify whether an AI model works well, you should test it with different data than the data you trained it with. Remembering overfitting, we would otherwise just be checking whether the model has memorized well. The situation is similar with the development data set: AI developers use it in the phase when all the subtleties of a model (e.g. the so-called hyperparameters) have not yet been worked out and they are still looking for "good settings" or even the best model type. The idea, however, is the same as with the test data.
How you divide your data into training, test, and dev data also depends on the type of model you are training and its purpose. The training data usually makes up the largest part, the dev data the smallest. Especially with large language models like GPT-4, the proportion of training data is orders of magnitude larger.
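A common way to produce such a split, using scikit-learn; the proportions here are illustrative assumptions, not fixed rules:

```python
from sklearn.model_selection import train_test_split

data = list(range(1000))  # stand-in for 1,000 labeled examples

# First split off a test set (here 15%), then carve a dev set of the same
# size out of the remainder; the exact ratios are a judgment call.
train_dev, test = train_test_split(data, test_size=0.15, random_state=42)
train, dev = train_test_split(train_dev, test_size=0.15 / 0.85, random_state=42)

print(len(train), len(dev), len(test))  # 700 150 150
```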
Test data and training data simply explained using an example:
Imagine that you are a math teacher and know that last year's math exam on the same topic is circulating among the students and is being diligently memorized. Now, of course, you could set the same exam again and hope that memorizing works out well for everyone; the exam results would probably be good. But the teacher in you certainly wouldn't be happy, because you want to know how much the class has actually understood. So you probably won't set the exact same exam again.
It is the same when training AI models. You keep enough "tasks", i.e. data, up your sleeve so that you can use different data for learning ("training") and for the exam ("testing"). One set is the training data and the other is the test data.
In AI, the data ("task") sets are divided even further, because before you really train a model, you want to check whether you have chosen the right type of model, or whether certain fine settings can be improved (this is referred to as "hyperparameter tuning"). After all, "real" training can take a long time or be expensive. For this purpose, additional data is often used: the so-called "development data sets," which are often not as extensive as the later training and test data (see the sketch after the list below).
As good “AI model teachers,” you therefore have:
- development sets,
- training sets,
- test sets.
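To illustrate the role of the development set, here is a sketch with synthetic data; the model type and the hyperparameter being tuned are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for a classification task.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=0)

# Try several candidate settings; judge each one on the dev set only.
best_score, best_c = -1.0, None
for c in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_dev, y_dev)
    if score > best_score:
        best_score, best_c = score, c

print(f"best C = {best_c} (dev accuracy {best_score:.3f})")
# A separate, untouched test set is used only once, for the final evaluation.
```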
Fine-tuning
In times when ever larger models such as GPT-4, or even open variants such as Alpaca, are so large that small, medium-sized, and even many large companies do not have the resources to train them, the concept of so-called fine-tuning is on the rise. The aim is to take a very large model that copes very well with general tasks "out of the box" and specialize it in a particular task, achieving even better performance on that task. It is of course no surprise that fine-tuning is also carried out using examples. Because of the outstanding general skills of, for example, large language models (LLMs), much less data suffices here than was used to train the underlying model. This is what makes fine-tuning so attractive: it is often also feasible for companies, institutions, or even private individuals with smaller budgets.
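A minimal fine-tuning sketch using the Hugging Face transformers library; the model name, dataset, and hyperparameters below are placeholder assumptions, not a recipe from this article:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # a small, already pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb")  # example task: sentiment classification

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

# Note how little data is needed: 2,000 examples instead of the huge
# corpus used for the original pre-training.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1),
    train_dataset=tokenized["train"].shuffle(seed=0).select(range(2000)),
)
trainer.train()
```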
And what about chatbots?
There is no such thing as "the chatbot." At moinAI, we use quite different types of models in a chatbot, working together to achieve excellent quality. There are also various concepts behind the flagship ChatGPT. The basis is GPT-3.5 or GPT-4, which uses self-supervised learning. But it is known that so-called reinforcement learning was also used here, where people give direct feedback on whether an output was good or bad; this feedback would not usually be described as training data.
In the simplest case, a chatbot is based on so-called intent recognition. This is a model based on supervised learning, which requires pairs of chat texts and the correct intent to be recognized (in our terminology, the "topic"). How many examples are needed here also depends on the quality of the underlying models.
It is perhaps worth mentioning that this training really only teaches the model to recognize what kind of topic the customer or chat partner is currently interested in. The answer itself has hopefully been written in advance by an experienced editor; it is not part of the training data. The training examples look more like this:
1) Text: "Hello you!", topic: "Greeting"
2) Text: "Take care!", topic: "Farewell"
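A minimal sketch of intent recognition as supervised learning, with invented example texts; real chatbots use far more sophisticated models:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented (text, topic) pairs in the spirit of the examples above.
texts = ["Hello you!", "Hi there!", "Good morning",
         "Take care!", "Goodbye", "See you later"]
topics = ["greeting", "greeting", "greeting",
          "farewell", "farewell", "farewell"]

intent_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
intent_model.fit(texts, topics)

# The model only picks the topic; the actual reply is the editor's
# pre-written answer, which is not part of the training data.
answers = {"greeting": "Hi, how can I help you?",
           "farewell": "Goodbye, have a nice day!"}
topic = intent_model.predict(["Hello there"])[0]
print(topic, "->", answers[topic])
```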
But if you want to train a chatbot like ChatGPT that also answers “freely”, the training data looks more like this:
Input:
"Context": "You're a chatbot for a company that offers software as a service and tries to help customers find their way around the product."
“History so far”: “Customer: Hello; chatbot: Hi, how can I help you?”
“Request”: “How much does a subscription cost with you per month?”
Output:
"That depends on the license you choose. The entry-level license starts at 100 euros per month, and our premium licenses cost 2,000 euros per month."
Fortunately, the models that work like ChatGPT have already been "pre-trained" by others and already understand a whole lot "off the shelf." That means you don't start from scratch, and you don't need the trillions of training and test examples that already made the model smart (see the section on fine-tuning). Imagine a dog that has already been house-trained by the breeder and perhaps understands simple commands such as "sit" or "stay." You still have to teach it to bring the newspaper to your sofa or to pick up the children from school on its own.
Or you admit to yourself that a dog is simply not suited to picking up children from school. The same goes for ChatGPT and its use in, for example, customer communication: it is a versatile tool that meets many requirements, but not the customer communication requirements of companies.
There is more on this topic in our detailed comparison "moinAI vs. ChatGPT".
Conclusion
We see: even though there are more and more very good, "generally educated" large language models (LLMs), the quantity, and even more so the quality, of training data determines the success of an "AI ecosystem." A chatbot whose training data is combined with experience and industry knowledge (e.g. e-commerce, finance, or publishing) achieves a significantly higher level of automation than a bot that has only been trained with generic, uncurated data.
Test quickly and free of charge what an AI chatbot could look like for your website!