What is behind Google Gemini?
Google Gemini is a family of multimodal large language models that are designed to understand and generate text, images, video and programming code. Two terms in this definition deserve a closer look so that you can better understand Google Gemini.
Large Language Models (LLMs) in the field of artificial intelligence are neural networks that can understand, process and generate human language in various ways. The term “large” refers to the fact that these models are trained on vast amounts of data and have several billion parameters that capture the underlying structures in the text.
Multimodal models are machine learning architectures that can process several kinds of data, the so-called modalities. Until now, most models could only handle a single type of data, such as text or images. Multimodal models, on the other hand, can take in and process several formats.
Just like GPT-4, Google Gemini is multimodal, meaning it can process various types of input, such as text, images or programming code, and also produce them as output. In contrast to GPT-4, however, Gemini was built to be multimodal from the ground up and does not use separate models for the different input types. It remains to be seen which architecture will ultimately prevail.
What is new about Google Gemini is not only the ability to process text, audio, video, images and even programming code, but also to draw its own conclusions from them. Reasoning in fields such as mathematics or physics should therefore no longer be a problem. In Google's examples, the model finds errors in a mathematical calculation and also produces and explains the corrected solution.
What can Google Gemini do?
Google Gemini was unveiled at a virtual press conference on December 6, 2023. At the same time, both the Google blog and the website of the AI company Google DeepMind published articles describing the functionality of the new AI family.
According to these reports and the accompanying YouTube videos, the following applications, among others, should be possible:
Google Gemini should be able to create programming code simply from an image of the finished application. This allows websites to be recreated, for example, from nothing more than a screenshot of the current page. This was already possible with GPT-4 and Google Bard, but the capability has been improved further. Nevertheless, no big leaps should be expected here, as much of the complexity of a website or a computer program cannot be captured in a screenshot. It can, however, be a good starting point for further programming.
Examples are also shown in which two images are combined into a new image and a matching text is written. In the example from Google, the AI is asked what the user could do with two balls of yarn. As additional input, an image of the two differently colored balls is provided. The model responds with a finished image of a woolen octopus that could be made from the two balls.
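To make this kind of multimodal prompt more concrete, here is a minimal sketch of what a text-plus-image request could look like using Google's google-generativeai Python package. The model name "gemini-pro-vision", the API key and the image files are placeholders chosen for illustration; this is an assumed setup, not Google's actual demo.

```python
# Minimal sketch of a multimodal (text + image) prompt.
# Assumes the google-generativeai package, a valid API key and two local
# example images; all names below are illustrative placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# A vision-capable Gemini variant accepts a list of text and image parts.
model = genai.GenerativeModel("gemini-pro-vision")

yarn_blue = Image.open("yarn_blue.jpg")  # assumed example files
yarn_pink = Image.open("yarn_pink.jpg")

response = model.generate_content(
    ["What could I make with these two balls of yarn?", yarn_blue, yarn_pink]
)
print(response.text)
```

The key point is that text and images are passed together as one list of parts, so the model can reason over both modalities in a single request.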
By far the most impressive application is interesting not only for pupils, students and parents, as you might expect at first glance. The video shows how Gemini is used to correct physics homework. It not only determines which tasks were solved correctly and which were not, but can also explain what mistakes were made and how to correct them. Such reasoning is a remarkable achievement for a language model.
Just a few days after the initial presentation, some users noticed important information hidden in the descriptions of the YouTube videos. Google had embellished its presentation videos, for example by working with still images and text inputs when the model was supposedly recognizing a game of rock, paper, scissors in a video. This approach drew some criticism, as the presentation on Google's blog suggested significantly more capabilities than the model was actually able to demonstrate.
Which versions of Gemini are there?
At launch, Google Gemini will be available in three variants, each optimized for different devices. Gemini Ultra is the largest and most powerful model and will be used for the majority of applications. Since it is very computationally intensive, it will only be available on powerful devices, i.e. not on mobile devices such as phones or tablets. It is currently still undergoing internal safety tests to guard against AI hacking. In terms of performance, this variant is comparable to GPT-4 and beats the OpenAI competitor in reasoning, programming and mathematics in most benchmarks. However, OpenAI's successor GPT-4 Turbo is already in the starting blocks, so it will be interesting to see how that model compares to Gemini Ultra.
Gemini Pro is the all-rounder in the AI family and is designed to be usable for a wide range of applications. However, Google leaves some questions unanswered as to what exactly will be possible with it. It is already being used in the Google chatbot Bard, although it is set to be replaced there by Gemini Ultra in 2024. In terms of performance, this variant is comparable to GPT-3.5, which is currently used for ChatGPT.
The Gemini Nano version, finally, has been optimized for applications that run directly on the device. This allows Gemini to be used on Android devices, and apps can be developed that benefit directly from Google Gemini. The advantage is that no connection to Google servers is required for inference, so sensitive data, such as messages, can also be processed. Google is genuinely presenting an innovation here, as the model works completely self-sufficiently without a connection to a server or the internet, yet is still efficient enough to run on mobile devices, which are usually less powerful than desktops or notebooks.
How can Google Gemini be used?
Google Gemini is not a standalone app or application and therefore cannot be used or tested directly. However, it will improve various Google services and thus reach users indirectly.
The Gemini Pro version is already being used in Google's own chatbot Bard. This chatbot accompanies the Google search engine and can also be used directly. From the beginning of 2024, an enhanced version of Bard will follow, in which the big brother Gemini Ultra will be used.
On Google's new Android smartphone, the Pixel 8 Pro, the smallest version, Gemini Nano, already runs locally on the phone and is used, among other things, in the voice recorder app to generate summaries of audio recordings. Gemini Nano also generates the suggested replies in the Google keyboard on the smartphone.
Google Bard vs. OpenAI GPT-4
When OpenAI launched ChatGPT and the underlying GPT-3.5 model in November 2022, the hype was huge, and the expected answer from Google was a long time coming.
It took until March 2023 for Bard, the chatbot developed by Google, to be released. However, it initially attracted attention mainly for incorrect or funny answers. The race now appears to be much closer, as Google Bard has received a real boost from Gemini.
On X, formerly Twitter, in particular, numerous posts circulated showing the sometimes funny and sometimes alarming mistakes that were very common in the previous version of Google Bard:
Bard about the monopoly lawsuit against Google: