Gemini's Data-Analyzing Abilities Aren't as Good as Claimed

Cover image: a photo illustration showing a Gemini logo and a welcome message on the Gemini website displayed across two screens.

One of the selling points of Gemini, Google's flagship generative AI model, is its ability to process and analyze large amounts of data. However, new research suggests that the model isn't as good at this task as claimed.

Two separate studies investigated how well Gemini and other models make sense of enormous amounts of data. Both studies found that Gemini struggles to answer questions about large datasets correctly. In one series of document-based tests, the model gave the right answer only 40% to 50% of the time.

"While models like Gemini can technically process long contexts, we have seen many cases indicating that the models don't actually 'understand' the content," said Marzena Karpinska, a postdoc at UMass Amherst and a co-author on one of the studies.

Gemini's Context Window Is Lacking

A model's context, or context window, refers to the input data the model considers before generating output. A simple question can serve as context, as can a movie script, a TV show, or an audio clip. And as context windows grow, so does the size of the documents that can be fit into them.

The newest versions of Gemini can take in upward of 2 million tokens as context; tokens are small chunks of raw text, so 2 million of them works out to well over a million words. However, in a briefing earlier this year, the model proved unable to perform tasks it had previously been shown handling.
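To give a rough sense of what fits in a window that size, here is a minimal sketch that counts a document's tokens and compares the total against the 2-million-token figure quoted above. It uses the open-source tiktoken tokenizer purely as a stand-in (Gemini's own tokenizer will give somewhat different counts), and the file name is a hypothetical example.

    # Rough sketch: estimate whether a document fits in a ~2M-token context window.
    # tiktoken is OpenAI's open-source tokenizer, used here only as a stand-in;
    # Gemini's own tokenizer will produce somewhat different counts.
    import tiktoken

    CONTEXT_WINDOW = 2_000_000  # the "upward of 2 million tokens" figure cited above

    def token_count(text: str) -> int:
        """Count tokens with a general-purpose BPE encoding."""
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))

    if __name__ == "__main__":
        # "novel.txt" is a placeholder, e.g. a ~260,000-word book like the one tested below.
        with open("novel.txt", encoding="utf-8") as f:
            book = f.read()
        n = token_count(book)
        print(f"{n:,} tokens -> {'fits' if n <= CONTEXT_WINDOW else 'exceeds'} "
              f"a {CONTEXT_WINDOW:,}-token window")

A full-length novel comes to a few hundred thousand tokens at most, which is why fitting one into the window is the easy part; the studies below probe whether the model actually uses what it has read.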

In one of the studies, researchers asked the model to evaluate true/false statements about fiction books written in English. The researchers chose recent works so that the model couldn't "cheat" by relying on prior knowledge, and they peppered the statements with references to specific details and plot points that would be impossible to grasp without reading the books in their entirety.

Given a statement like "By using her skills as an Apoth, Nusis is able to reverse engineer the type of portal opened by the reagents key found in Rona's wooden chest," Gemini had to say whether the statement was true or false and explain its reasoning.
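For readers curious what such a probe looks like in practice, here is a minimal sketch of the setup as described: the full novel and a single claim are packed into one long prompt, and the model's TRUE/FALSE verdict is scored against a human label. The prompt wording, the ask_model callback, and the scoring details are illustrative assumptions, not the study's actual code.

    # Minimal sketch of a long-context true/false probe like the one described above.
    # `ask_model` stands in for whatever chat API is under test, and the prompt
    # wording is an illustrative assumption rather than the study's own template.

    def build_prompt(book_text: str, claim: str) -> str:
        """Pack the entire novel plus one claim into a single prompt."""
        return (
            "Read the novel below, then decide whether the claim that follows it "
            "is TRUE or FALSE. Answer TRUE or FALSE and briefly justify your answer "
            "using events from the novel.\n\n"
            f"--- NOVEL ---\n{book_text}\n\n"
            f"--- CLAIM ---\n{claim}\n"
        )

    def accuracy(book_text: str, claims: list, labels: list, ask_model) -> float:
        """Fraction of claims labeled correctly (the kind of figure reported below)."""
        correct = 0
        for claim, label in zip(claims, labels):
            reply = ask_model(build_prompt(book_text, claim))
            predicted = reply.strip().upper().startswith("TRUE")
            correct += int(predicted == label)
        return correct / len(claims)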

On one book of around 260,000 words (roughly 520 pages), Gemini answered the true/false statements correctly 46.7% of the time. That means flipping a coin would answer questions about the book about as accurately as the model.

The second study tested Gemini's ability to "reason over" videos, that is, to search through them and answer questions about their content. Here, too, the model performed poorly, with an accuracy of around 50% when transcribing six handwritten digits from a "slideshow" of 25 images.
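To make that setup concrete, the sketch below builds a toy version of the probe: a sequence of mostly blank frames with a handful of digits scattered among them, plus the ground-truth string a model's transcription would be scored against. It uses typed (not handwritten) digits and the Pillow imaging library, so it is only a loose approximation of the study's materials.

    # Toy version of the "slideshow" probe: 25 frames, 6 of which contain a digit.
    # Pillow's default font and typed digits are simplifications; the study used
    # handwritten digits, so treat this purely as an illustration of the format.
    import random
    from PIL import Image, ImageDraw

    def make_slideshow(num_frames=25, num_digits=6, size=(256, 256)):
        """Return the list of frames and the digit string a model should recover."""
        digit_frames = sorted(random.sample(range(num_frames), num_digits))
        digits = {i: random.randint(0, 9) for i in digit_frames}
        frames = []
        for i in range(num_frames):
            img = Image.new("RGB", size, "white")
            if i in digits:
                ImageDraw.Draw(img).text((size[0] // 2, size[1] // 2),
                                         str(digits[i]), fill="black")
            frames.append(img)
        return frames, "".join(str(digits[i]) for i in digit_frames)

    frames, ground_truth = make_slideshow()
    # A model that "reasons over" the frames should transcribe `ground_truth`;
    # the study reports Gemini getting only around half of such digits right.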

The studies suggest that Gemini's data-analyzing abilities are overpromised and under-delivered: while the model can technically process large amounts of data, it struggles to make sense of that data and answer questions about it correctly.

