Two new studies show that Google’s Gemini AI models may not live up to the hype in terms of answering questions about large datasets correctly.
Google Gemini
Google Gemini is an advanced AI language model developed by Google to enhance various applications with sophisticated natural language understanding and generation capabilities. It features multimodal capabilities, enabling it to process and integrate information from text, images, and possibly audio for more comprehensive and context-aware responses. The model also boasts a deep contextual understanding, allowing it to generate relevant and accurate answers in complex conversations or tasks.
Google has highlighted Gemini’s scalability and adaptability as strong points, saying that its highly scalable architecture can handle large-scale data efficiently and can be fine-tuned for specific tasks or industries.
Also, Gemini is thought to deliver superior performance in speed and accuracy due to advancements in machine learning techniques and infrastructure.
Studies
However, the results of two studies appear to go against Google’s narrative that Gemini is particularly good at analysing large amounts of data.
For example, the “One Thousand and One Pairs: A ‘novel’ challenge for long-context language models” study, published as a preprint on arXiv (the repository hosted by Cornell University) and co-authored by Marzena Karpinska, a postdoc at UMass Amherst, tested how well long-context Large Language Models (LLMs) can retrieve, synthesise, and reason over information across book-length inputs.
The study involved using a dataset called ‘NoCha’, which consisted of 1,001 pairs of true and false claims about 67 recently published English fiction books. The claims required global reasoning over the entire book to verify, posing a significant challenge for the models.
Unfortunately, the research revealed that no open-weight model performed above random chance, and even the best-performing model, GPT-4o, achieved only 55.8 per cent accuracy. Also, the study found that the models struggled with global reasoning tasks, particularly with speculative fiction that involves extensive world-building.
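To see why a score such as 55.8 per cent has to be read against a chance baseline, the scoring of true/false claim pairs can be sketched in a few lines of Python. This is only an illustration: the `ClaimPair` structure and both function names are hypothetical, and the rule that a pair counts as correct only when both of its claims are labelled correctly is an assumption, since the scoring rule is not described here.

```python
from dataclasses import dataclass


@dataclass
class ClaimPair:
    """Hypothetical record of a model's labels for one true/false claim pair."""
    true_claim_prediction: bool   # label the model gave to the claim that is actually true
    false_claim_prediction: bool  # label the model gave to the claim that is actually false


def claim_accuracy(pairs):
    """Fraction of individual claims that were labelled correctly."""
    correct = sum(
        (p.true_claim_prediction is True) + (p.false_claim_prediction is False)
        for p in pairs
    )
    return correct / (2 * len(pairs))


def pair_accuracy(pairs):
    """Fraction of pairs where BOTH claims were labelled correctly (assumed rule)."""
    correct = sum(
        p.true_claim_prediction is True and p.false_claim_prediction is False
        for p in pairs
    )
    return correct / len(pairs)


# A model that simply answers "true" to every claim, whatever the book says.
always_true = [ClaimPair(True, True) for _ in range(4)]
print(claim_accuracy(always_true))  # 0.5
print(pair_accuracy(always_true))   # 0.0
```

Under this assumed rule, a model that always answers “true” gets half of the individual claims right but none of the pairs, which is why pairing each true claim with a closely related false one makes it harder to score well without actually reading the book.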
The models frequently failed to answer questions about large datasets correctly, with accuracy rates of between 40 and 50 per cent in document-based tests.
The research results suggest that while models can technically process long contexts, they often fail to truly understand the content. Also, the results may highlight the limitations of current long-context language models such as Google Gemini (Gemini 1.5 Pro and 1.5 Flash).
The Second Study
The second study, co-authored by researchers at UC Santa Barbara, focused on the Gemini models’ performance in video analysis and their ability to ‘reason’ over videos when asked questions about them. Here too, the results proved to be poor, with the models having difficulty transcribing content from images and recognising objects in them, which may indicate significant limitations in their data analysis capabilities.
Discrepancies Between Claims And Performance?
Both studies appear to highlight possible discrepancies between Google’s claims and the actual performance of the Gemini models, thereby raising questions about their efficacy and shedding light on the broader challenges faced by generative AI technology.
Posted On X
Marzena Karpinska also noted (on X/Twitter) other interesting points about LLMs from the research, including:
– Even when models output correct labels, their explanations are often inaccurate.
– On average, all LLMs perform much better on pairs requiring sentence-level retrieval than on pairs requiring global reasoning (59.8 per cent vs 41.6 per cent), but their accuracy on these pairs is still much lower than on the “needle-in-a-haystack” task.
– Models perform substantially worse on books with extensive world-building (fantasy and sci-fi) than on contemporary and historical novels (romance or mystery).
What Does Google Say?
Google has not directly commented on the specific studies that critique the performance of its Gemini models. However, Google has highlighted the advancements and capabilities of the Gemini models in its official communications. For example, Sundar Pichai, CEO of Google and Alphabet, has emphasised that Gemini models are designed to be highly capable and general, featuring state-of-the-art performance across multiple benchmarks. Google asserts that Gemini’s long-context understanding and multimodal capabilities significantly enhance its ability to process and reason about vast amounts of information, including text, images, audio, and video.
Google has tried to highlight its focus on the continuous improvement and rigorous testing of Gemini models, showcasing their performance on a wide variety of tasks, from natural image understanding to complex reasoning. The company has also been actively working on increasing the models’ efficiency and context window capacity, allowing them to process up to 1 million tokens (the basic units of text that the model processes). Google hopes these improvements will enable more sophisticated and context-aware AI applications.
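As a rough illustration of what a 1 million token window means in practice, the sketch below estimates whether a piece of text would fit. It assumes a rule of thumb of about four characters of English text per token, which is only an approximation (real counts depend on the model’s tokeniser), and the function names are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4         # assumption: rough rule of thumb for English text
CONTEXT_WINDOW = 1_000_000  # tokens, the context window figure quoted above


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count (not a real tokeniser)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text: str, window: int = CONTEXT_WINDOW) -> bool:
    """Return True if the estimated token count is within the context window."""
    return estimate_tokens(text) <= window


# Stand-in for a full-length novel of roughly 500,000 characters.
novel = "x" * 500_000
print(estimate_tokens(novel))  # 125000
print(fits_in_context(novel))  # True
```

On this estimate a whole novel fits comfortably within the window, which is the point the studies make: being able to take in a book-length input is not the same as being able to reason accurately over it.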
What Does This Mean For Your Business?
The findings from these studies may have significant implications for businesses relying on AI for data analysis and decision-making. The apparent underperformance of Google’s Gemini models in handling large datasets suggests that businesses might not be able to fully leverage these AI tools for complex data analysis tasks just yet. This could impact sectors like finance, healthcare, and any industry requiring detailed and accurate data interpretation, where businesses may need to reassess their dependence on such models for critical operations.
For Google, these studies may highlight a gap between its promotional claims and the actual capabilities of its AI models. This could prompt Google to accelerate its research and development efforts to address these shortcomings and enhance the practical utility of its models. It also places pressure on Google to maintain transparency about the limitations of its technologies while continuing to push the boundaries of AI performance.
Other AI companies might view these findings as both a caution and an opportunity. On one hand, the discrepancies in performance underline the inherent challenges in developing robust AI models. On the other hand, they provide a competitive edge for companies that can deliver more reliable and accurate AI solutions. This competitive landscape could drive innovation and lead to the emergence of more capable AI models that better meet the complex needs of businesses.
In summary, while the current limitations of AI models like Google Gemini pose challenges, they also highlight areas ripe for innovation and improvement. Businesses should stay informed about these developments and be prepared to adapt their strategies to harness the full potential of evolving AI technologies.