What is SimpleQA?
Everyone knows it: You ask an AI a question and the answer sounds convincing, but it's wrong. This is exactly where SimpleQA comes in: it checks whether AI models can answer reliably and correctly. The new OpenAI tool can therefore be seen as a kind of test with which various language models, such as ChatGPT or Gemini, can be checked for the correctness of their answers. In other words, SimpleQA is a reality check for AI models.
The models are tested by answering short, factual questions. These questions are designed in such a way that they have clear answers that do not become obsolete over time. SimpleQA is therefore not about poetic metaphors or creative word games. Instead, the test is intended to determine whether so-called Large Language Models (LLMs) can answer precisely and based on facts. OpenAI has released SimpleQA as open source, which means that everyone has access to this test. This allows developers large and small to check their AI models for factual accuracy.
What is a large language model?
In order to understand what SimpleQA is actually testing, it is helpful to know what is behind the technology — in this case the so-called Large Language Model (LLM). It is precisely these models that SimpleQA takes a close look at:
A large language model (LLM) is a machine learning model that is trained to understand, process, and generate human language. These models consist of billions of parameters that enable them to recognize complex structures and relationships in texts. Through the huge amounts of text seen during training, an LLM learns the many facets of language, such as grammar or synonyms.
Today, many of these models are “multimodal,” meaning that they not only process texts but also formats such as audio and video. This is why they are also called “foundation models” because they offer not only language but also a broad knowledge base in various media.

What is the goal of SimpleQA?
Now that it is clear what a large language model actually is and how it works, the question remains: What is SimpleQA needed for? OpenAI has a clear answer:
“An unsolved problem in artificial intelligence is how to train language models so that they provide factually correct answers. Current cutting-edge models sometimes produce incorrect outputs or answers that are not supported by evidence — a problem known as ‘hallucinations.’ Such hallucinations are one of the biggest obstacles to the wider use of general forms of AI, such as large language models.”
But what are hallucinations anyway?
The term hallucination is used in the world of AI when AI models deliver wrong or misleading outputs. These errors can occur for various reasons, such as incomplete training data, incorrect model assumptions, or biases in the data the model was trained on. Hallucinations can severely impair reliability and trust in AI — and are a real obstacle to the widespread use of AI in everyday life.
With SimpleQA, OpenAI therefore aims to test and evaluate the accuracy and reliability of LLMs. The tool checks the models' answers very carefully and shows how precise they really are. In this way, SimpleQA uncovers strengths and at the same time shows where there is still room for improvement. SimpleQA should therefore ensure that AI models respond reliably and precisely — exactly what users want.
What does SimpleQA measure?
SimpleQA measures how well LLMs perform when answering clear, fact-based questions. It is not only a question of whether the answers are correct, but also whether the model “knows what it knows” — i.e. whether it can assess how certain it is of an answer.
The ability of the model to realistically assess its own confidence is called calibration. The measurement for this can be done in two ways:
- The model itself assesses how certain it is of an answer.
- The same question is asked several times and a check is made to see whether the model remains consistent with its answer.
Here is an example of how the calibration can be verified:

[Image: example of a calibration check. Source: Riafy Stories]
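The second method above — asking the same question several times and checking for consistency — can be sketched in a few lines of Python. This is an illustrative sketch, not part of SimpleQA itself: `ask_model` stands in for whatever LLM API call you use.

```python
# Hypothetical sketch of a consistency-based calibration check.
# `ask_model` is a placeholder for any function that queries an LLM.
from collections import Counter

def consistency_score(ask_model, question, n_samples=10):
    """Ask the same question n_samples times and return the most
    frequent answer plus its share -- a rough proxy for confidence."""
    answers = [ask_model(question) for _ in range(n_samples)]
    most_common_answer, count = Counter(answers).most_common(1)[0]
    return most_common_answer, count / n_samples

# Demo with a stubbed "model" that answers inconsistently:
fake_answers = iter(["Paris", "Paris", "Lyon", "Paris"])
answer, score = consistency_score(
    lambda q: next(fake_answers), "Capital of France?", n_samples=4
)
# answer == "Paris", score == 0.75
```

A well-calibrated model would show high consistency on questions it answers correctly and low consistency (or an explicit refusal) where its knowledge is shaky.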
How were SimpleQA's questions created?
The total of 4,326 questions from SimpleQA were created and reviewed through a multi-stage and careful process. Every question has a clear, unequivocal answer that will not change in the future. For this purpose, the questions were selected in such a way that they remain independent of time. They are based, for example, on general knowledge or specific time frames, such as historical events or TV series.
To ensure that the answers are actually correct, each answer is backed up by a link that proves the information. The questions are deliberately designed to be so sophisticated that even advanced models such as GPT-4 are often wrong.
According to OpenAI, strict quality controls ensure consistency and reliability. Each question goes through several independent checks and additional tests to rule out potential ambiguities or errors. Through this process, SimpleQA not only ensures the quality of questions & answers, but also provides a solid basis for evaluating the accuracy of AI models.
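To make the structure concrete, here is a simplified sketch of what grading a SimpleQA-style item could look like. The field names and the string-matching grader are illustrative assumptions; the actual benchmark uses an LLM-based grader that classifies answers as correct, incorrect, or not attempted.

```python
# Simplified, hypothetical grader for a SimpleQA-style item.
# The real benchmark uses an LLM judge; this sketch only string-matches.
def grade(item, model_answer):
    """Classify an answer as 'correct', 'incorrect', or 'not_attempted'."""
    if model_answer is None or model_answer.strip() == "":
        return "not_attempted"  # the model declined or gave no answer
    if item["answer"].lower() in model_answer.lower():
        return "correct"
    return "incorrect"

# An item pairs a time-independent question with a clear answer
# and a link backing it up (field names are assumptions):
item = {
    "question": "In which year did the Apollo 11 moon landing take place?",
    "answer": "1969",
    "source": "https://example.org/apollo-11",
}

grade(item, "It was in 1969.")  # -> "correct"
grade(item, "")                 # -> "not_attempted"
grade(item, "1972")             # -> "incorrect"
```

The three-way classification matters: leaving a question unanswered is treated differently from answering it wrongly, which is exactly the behavior discussed for the Claude models below.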
How do the best AI models score in the SimpleQA test?
The previous section already hinted at it: even the best models still have difficulty providing precise and correct answers.
The OpenAI field is led by o1-preview with a hit rate of 42.7%, followed closely by GPT-4o with 38.2%. Smaller variants such as o1-mini and GPT-4o-mini, however, fall significantly behind at around 8%. There are also differences among the Anthropic models: Claude-3.5-Sonnet reaches 28.9%, while Claude-3 Opus scores 23.5%.
It is also interesting how the Claude models deal with questions they are uncertain about: they simply leave them unanswered. In many cases, this could even be the safer route. As AI becomes more and more part of our everyday lives — be it in healthcare, in education, or in the legal system — the ability to say “I'm not sure” can really be decisive.
So does size make the difference?
At the very least, larger models tend to produce better results, which suggests that size does have an effect on accuracy. Nevertheless, even the best models fail to break the 50% mark: more than half of the questions are answered incorrectly, so the error rate remains high. It becomes clear once again: when it comes to reliable information, users of AI models should continue to verify answers thoroughly.
What is the significance of SimpleQA for AI development?
SimpleQA is an important tool that helps make AI models more reliable and trustworthy. OpenAI's decision to open source SimpleQA makes the whole thing even more exciting. Because that means that researchers and developers worldwide have access to SimpleQA and can test and further improve their models.
Another valuable aspect of SimpleQA is the recognition that the focus on intelligence in AI may need to be reconsidered. Perhaps the “most intelligent” AI is not the one that has an answer to everything, but the one that knows when to double check or not answer, like Claude.
Are there limits to SimpleQA?
Quite clearly: Yes. While SimpleQA is a great tool for testing the accuracy of AI, it also has its limits. As already mentioned, SimpleQA focuses primarily on short, factual questions with clear answers. That means it isn't really suited to complex tasks such as writing long texts, holding multi-turn conversations, or processing contradictory information.
Another point: the questions were selected to be particularly challenging for models such as GPT-4. This could skew the results, as the test is not necessarily representative of all AI models. And the fact that an AI model (such as ChatGPT) evaluates other models can also be problematic. Why? Because the evaluating model itself is part of the test, which results in a “circular evaluation” and makes it difficult to obtain a truly independent and objective verdict.
SimpleQA: The reality check for LLMs
SimpleQA is an exciting new tool that provides a look behind the scenes of LLMs. Through open-source access, it promotes exchange and brings users closer to AI systems that are less likely to hallucinate. But SimpleQA also shows that there is still a long way to go to 100% factual accuracy. While generative AI models such as ChatGPT are often convincing, inaccurate information occasionally creeps in.
Precise and up-to-date information is particularly crucial — especially for companies that rely on absolute reliability in their customer communication. But ChatGPT is not only unsuitable for customer communication because of possible hallucinations: You can find out more reasons on our website, in direct comparison to moinAI.