LLM contenders at the end of 2023: Gemini, Mixtral, Orca-2, Phi-2

Heka.ai
11 min read · Feb 1, 2024

The year 2023 concluded with a series of noteworthy advancements in the domain of Large Language Models (LLMs), marking a significant elevation in the standards of NLP. This analysis concentrates on four particularly innovative releases, each contributing distinctive concepts and approaches to the realm of LLMs. Three of them are LLMs proper: Mixtral, Orca-2, and Phi-2. The fourth, Gemini, is a Large Multimodal Model (LMM) that establishes new benchmarks not only within the LLM sector but also across other modalities such as audio understanding.

This comprehensive examination aims to elucidate the unique attributes and potential implications of these models in the field of language technology. We will therefore address the following parts:

I. Models’ brief presentation and methodology

II. Multimodal capabilities

  • Gemini Pro Vision vs GPT4-V
  • New use cases introduced with the multimodality in Gemini

III. Focus on textual tasks: Gemini Pro vs Mixtral 8x7B

IV. Going further into models

  • Gemini Pro
  • Mixtral

Conclusion

I. Models’ brief presentation and methodology

Gemini is a family of multimodal models released by Google in December 2023. It comes in three sizes: an Ultra model, the largest, which at release set the state of the art on more than 50 benchmarks of LMM capabilities; a Pro model, which can be seen as sitting between GPT-3.5 and GPT-4; and two Nano models designed to run on Android phones.
Gemini Pro is currently available both as a standalone LLM and as an LLM equipped with vision capabilities. Access to Gemini is only open in some parts of the world; for more details, check the “How to access Gemini models from France” section.

Mixtral is a larger model that builds upon the smaller Mistral-7B LLM, both released by Mistral AI with open weights. By leveraging the Mixture of Experts technique, Mixtral is faster and cheaper at inference than dense models of comparable total size.
Mixtral-8x7B has a permissive Apache 2.0 license allowing users to deploy it in commercial use cases. The model is also available on the HuggingFace Hub and Mistral AI’s platform.
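As an illustration, here is a minimal sketch of loading Mixtral from the HuggingFace Hub with the transformers library. The model ID is the instruct checkpoint published by Mistral AI; running it as-is assumes sufficient GPU memory (or an added quantization configuration), and the prompt is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" spreads the weights across available GPUs (requires accelerate).
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Explain the Mixture of Experts technique in one paragraph."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```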

Orca-2 is a progression from the original Orca, a 13-billion-parameter language model. It focuses on smaller language models (around 10 billion parameters or less), aiming to enhance their reasoning abilities. Orca-2 comes in two variants (7 and 13 billion parameters) and consists of Llama-2 base models fine-tuned on high-quality synthetic data, using instruction tuning as well as some other specific techniques.

Phi-2 is a small 2.7-billion-parameter language model by Microsoft Research, which focuses on high-quality, “textbook-quality” training data and an effective scaling approach from a smaller 1.3-billion-parameter model, Phi-1.5. It demonstrates remarkable capabilities for a model of its size.

Orca-2 and Phi-2 are available on the HuggingFace Hub to download and use, but they are intended for research purposes only and may not be deployed in commercial use cases. Thus, they will not be tested in this article.

The evaluation compares the capabilities of Gemini Pro Vision, GPT4-V, Gemini Pro, and Mixtral using a qualitative human assessment to rank their performance across various tasks. Criteria included understanding and contextualization, general knowledge, complex reasoning, and creative writing for general AI abilities, plus instructional interpretation, detail recognition, and text interpretation for vision capabilities. Human evaluators ranked model responses against these criteria on anonymized, randomly ordered outputs. Each output was assessed for correctness, completeness, relevance, accuracy, and creativity. The evaluations provided insights into each model’s practical performance, emphasizing human-like understanding and problem-solving, with the aim of closely mirroring real-world applicability and utility across tasks.
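To make the protocol concrete, here is a minimal sketch of how model outputs can be anonymized and shuffled before being shown to human evaluators; the model names and answers are illustrative placeholders, not the actual evaluation tooling.

```python
import random

# Illustrative placeholder outputs for one task.
outputs = {
    "Gemini Pro": "Answer from Gemini Pro ...",
    "Mixtral 8x7B": "Answer from Mixtral ...",
}

items = list(outputs.items())
random.shuffle(items)  # randomize presentation order

# Hide model identities behind neutral labels for the evaluator.
blinded = {f"Model {i + 1}": text for i, (_, text) in enumerate(items)}
key = {f"Model {i + 1}": name for i, (name, _) in enumerate(items)}  # kept aside for later de-blinding

print(blinded)
```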

II. Multimodal capabilities

For the moment, only two of these models have multimodal capabilities: GPT4-V, which is bi-modal (it supports images in addition to text), and the Gemini family of models. So we are going to compare these two models on a variety of vision capabilities.

Gemini Pro Vision vs GPT4-V

GPT4-V appears to be slightly better than Gemini Pro Vision at image understanding, and also better at extracting text from less legible parts of images. Otherwise, the two are almost on par when reasoning over correctly extracted text, and both models benefit greatly from being prompted to make their reasoning explicit.

Although Gemini Pro seems to be a little behind GPT4-V on image understanding, it supports more modalities, such as video and audio, which unlocks a whole spectrum of use cases.
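For illustration, here is a minimal sketch of querying Gemini Pro Vision with an image through the google-generativeai Python SDK, prompting it to make its reasoning explicit. The API key and image file are placeholders, and access is subject to the geographic restrictions discussed later in this article.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel("gemini-pro-vision")
image = Image.open("receipt.jpg")  # placeholder image

# Asking the model to spell out its reasoning tends to improve text extraction.
response = model.generate_content(
    ["Extract the total amount from this receipt and explain your reasoning step by step.", image]
)
print(response.text)
```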

New use cases introduced with the multimodality in Gemini

There are many use cases introduced by supporting additional modalities alongside the text capabilities of large models. The gap between image and text was not bridged in a satisfactory manner before the introduction of vision support in GPT-4 with the GPT4-V model. Gemini, with Gemini Pro Vision, extends that bridge by natively supporting video as well. It also expands the family of at-least-bimodal models that can be deployed in industry use cases.

Multimodal models can combine text, images, audio, and video to automatically generate concise summaries of long videos, making it easier for users to quickly grasp the content and highlights. This is particularly useful in news and entertainment, but also in education and training, where analyzing video lectures, transcribing speech, and presenting relevant text, images, and additional materials can provide a more comprehensive learning experience (a sketch of such a video-summarization call is shown after the list below). Likewise, the classic sentiment analysis use case can now be extended by combining audio and video with text, which allows for more accurate results in social media monitoring, market research, and customer feedback analysis, since tone of voice and facial expressions provide deeper insight into user sentiment. Here are some examples of how a multimodal model can be used:

Social media content moderation
Gemini models, by supporting all modalities, will be able to analyze every type of content published on social media, enabling them to identify and flag harmful or inappropriate content more accurately, whether it is hidden in a video’s audio track or in individual frames. This generalizes to compliance use cases that involve different modalities.

E-commerce
Multimodal models can provide more personalized and effective shopping experiences. By analyzing the visual features of products and understanding customer preferences through text, these models can make highly relevant product suggestions, leading to increased customer satisfaction and sales.

Healthcare sector
It can also benefit from multimodal models. Patient records can now be extended to contain not only medical images, ECG tests, doctor comments, etc., but also video consultations and doctor conversations. Multimodal models can analyze this data and help structure it in the case of telemedicine. Google also has a healthcare solution called MedLM, a set of AI models developed by Google for healthcare. These models are mainly based on PaLM-2, Google’s previous LLM, fine-tuned on healthcare data, but will now also include Gemini models. This should enhance medical documentation, aid clinical decision-making, accelerate research, optimize patient care, and improve healthcare system efficiency. MedLM partners with organizations to transform healthcare through safe and responsible AI integration, showcasing its potential across various healthcare aspects.

Law enforcement
These models can be employed for analyzing body camera footage, transcribing conversations, and extracting relevant text and images to assist law enforcement agencies in investigations, streamlining the process of gathering evidence.

Sports analysis
Multimodal models enable coaches and sports analysts to analyze sports footage by combining commentary, visuals, and audio, gaining deeper insights into player performance and strategy, helping to improve team performance or assisting referees with decision making.

Environmental monitoring
Multimodal models can also extend our environmental monitoring capabilities: they can analyze multimedia data from sensors and cameras to monitor ecosystems, identify anomalies, and inform decisions for conservation efforts, helping to protect our environment more effectively.
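As mentioned above, here is a hedged sketch of the video-summarization use case through Gemini Pro Vision on Vertex AI. The project ID, region, and bucket path are placeholders, video input support depends on the API version available to you, and this is an illustration of the idea rather than a production recipe.

```python
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders

model = GenerativeModel("gemini-pro-vision")
video = Part.from_uri("gs://my-bucket/lecture.mp4", mime_type="video/mp4")  # placeholder video

response = model.generate_content(
    [video, "Summarize this lecture in five bullet points for a student revising the topic."]
)
print(response.text)
```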

While the exploration of multimodal capabilities in models like Gemini and GPT4-V reveals a fascinating intersection of vision and language, it’s also important to delve into the purely linguistic domain since a lot of human interaction and communication relies on documents and text content. We will compare our biggest contenders, Gemini Pro and Mixtral 8x7B, though one should keep in mind that Gemini Pro is probably a much bigger model than Mixtral.

III. Focus on textual tasks: Gemini Pro vs Mixtral 8x7B

Gemini, known for its robust multimodal functionalities, also boasts significant textual capabilities. Mixtral, on the other hand, has been designed with a deep focus on textual understanding and generation; its capabilities are aligned with the largest open-source models under 100B parameters, and it sometimes beats even larger open-source models, while being very efficient at inference time.

We tested both models on several tasks with the following results:

Mixtral 8x7B seems to be more verbose than Gemini Pro. This could be one of the reasons why it performs better on some of the riddles: its verbosity may stem from tuning that pushes it to explain and output its reasoning. Its performance on the riddles could also be due to having seen more similar examples in its training data. Gemini Pro is definitely the go-to model for content creation, though.

IV. Going further into models

Gemini

Gemini refers to a family of natively multimodal models released by Google on the 6th of December 2023. There are three sizes. Ultra is the largest model, with the best performance, and was state of the art on various important benchmarks at the date of its release. The Pro model is smaller than Ultra but optimizes both cost and latency while keeping strong performance across multimodal capabilities; in terms of performance it sits between GPT-3.5 and GPT-4. Finally, there are two Nano models, Nano-1 with 1.8B parameters and Nano-2 with 3.25B parameters, designed to run on Android devices, i.e. under tight memory and compute constraints. They are obtained through distillation and quantization of the larger Gemini models; Nano-2’s performance is reported to be on par with Llama-7B, while having the advantage of being multimodal. The reported performance on audio is very good, though somewhat behind the specialized state of the art in audio.

The Gemini models were trained with a 32k context length and were found to use it effectively, with little to no bias regarding positions in the context. These models are also multilingual and natively multimodal, supporting text, images, video and audio, as shown in the following figure from their technical report:

Gemini supports interleaved sequences of text, image, audio, and video as inputs. It can output responses with interleaved image and text.

Google emphasizes responsibility and safety, conducting comprehensive evaluations for bias, toxicity, and content safety. Google also emphasizes its work on factuality, making the models more likely to generate factually correct responses and to refrain from answering when they cannot assess factuality. It is important to note that improvements in some of these aspects are only reported for some models and not all of them; a more objective and rigorous assessment will require interacting with the models directly.

What happened with the video demonstration?
Google edited its video demonstration of Gemini. The video showcased seamless interaction with the model, which appeared to understand and respond to drawings and voice queries and to recognize gestures. However, Google also revealed in its blog that the image frames and text prompts were carefully chosen and that the video was not a real-time interaction with the model.

Despite the concerns this editing raised, the only misleading aspects are the real-time capability and the seamless interaction depicted in the video. The text, image, video and audio capabilities of the model remain strong; however, like many other LLMs, extracting satisfactory outputs from the model in production will require careful and precise prompting, as well as integrating the model into more complete workflows that use guardrails, output verification, reranking, etc.

How to access Gemini models from France
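At the time of writing, the Gemini API exposed through Google AI Studio is not open to users in the European Union, so the practical route from France is Vertex AI on Google Cloud: create a Google Cloud project, enable the Vertex AI API, authenticate (for example with gcloud auth application-default login), and call the model through the vertexai SDK. Below is a minimal sketch under those assumptions; the project ID and region are placeholders, and the exact SDK namespace may vary with the library version.

```python
import vertexai
from vertexai.preview.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders

model = GenerativeModel("gemini-pro")
response = model.generate_content("Summarize the Mixture of Experts technique in two sentences.")
print(response.text)
```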

Mixtral

Mixtral-8x7B is the second large language model (LLM) developed by the French startup Mistral AI. The first, Mistral-7B-v0.1, is a decoder-only LLM that excels at generating coherent and contextually relevant text. It integrates Sliding Window Attention to balance computational efficiency with the capability to process long sequences, enhancing its linguistic proficiency across various languages and character sets. Its technical specifications include a vocabulary size of 32,000, a hidden size of 4,096, and 32 attention layers, making it highly capable and versatile.
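To make the Sliding Window Attention idea concrete, here is a minimal, illustrative sketch of a causal sliding-window mask in PyTorch. The window size below is a toy value (Mistral-7B’s reported window is 4,096 tokens), and this is a sketch of the masking pattern, not the actual implementation.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: each token attends only to itself and
    # to the `window - 1` tokens immediately before it.
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(seq_len=8, window=3).int())
```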

Mixtral-8x7B builds on the foundation of Mistral and introduces substantial advancements with a Mixture of Experts (MoE) approach. This architecture integrates eight “expert” feed-forward blocks per layer, which can be thought of as Mistral-style experts, allowing for dynamic computation allocation with two experts selected per token. This design keeps the efficiency of a much smaller model: only around 13B of its roughly 47B total parameters are active for any given token, so it runs at roughly the cost of a 13B dense model while drawing on far more capacity. Mixtral demonstrates proficiency in multilingual tasks and coding applications, showing its versatile capabilities. The router auxiliary loss coefficient is set at 0.001; this auxiliary loss encourages a balanced load across the experts during training. Mixtral shares Mistral’s vocabulary size of 32,000 and maintains comparable hidden sizes, since its experts reuse Mistral’s architecture. A minimal sketch of the top-2 routing is shown below.
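Below is a minimal, illustrative sketch of this top-2 expert routing in PyTorch. The dimensions default to toy values so it runs quickly (Mixtral’s published configuration uses a hidden size of 4,096, an expert feed-forward size of 14,336, 8 experts and top-2 routing), and the expert blocks are simplified plain MLPs rather than Mixtral’s gated SwiGLU feed-forward, so treat it as a sketch of the technique rather than the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse Mixture-of-Experts layer with top-k routing."""

    def __init__(self, hidden_size=64, ffn_size=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: one linear layer that scores each expert for every token.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # Experts: simplified feed-forward blocks (Mixtral's real experts are gated SwiGLU MLPs).
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.SiLU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, hidden_size)
        logits = self.router(x)                             # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # keep the 2 best experts per token
        weights = F.softmax(weights, dim=-1)                # normalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

layer = SparseMoELayer()
tokens = torch.randn(5, 64)
print(layer(tokens).shape)  # torch.Size([5, 64])
```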

Conclusion

Gemini Pro is beaten by Mixtral 8x7B on textual tasks and by GPT4-V on image understanding, but its strength lies in multimodality, such as video and audio, offering a broader spectrum of applications.

The pros and cons of every model tested are tracked in the following table

As of the publication of this article, two new contenders have emerged: the proprietary Mistral Medium model and new Gemini Pro checkpoints in Bard. These models are quickly becoming key players on the Chatbot Arena Leaderboard, rivaling GPT-4. While we could not test them for this analysis, their potential impact is undeniable. Our audience can expect further insights and evaluations from us soon as these models become more accessible and integrated into the AI landscape.

Heka.ai

We design, deploy and manage AI-powered applications