What do cloud providers offer as LLMs? And are there any open-source alternatives?
Generative AI models have come a long way, ushering in a new era of convenience, efficiency, and innovation. From simplifying email composition and providing instant answers, to offering summaries and fostering creative thinking, they have become part of our everyday lives.
With OpenAI's groundbreaking release of ChatGPT, the race to build the best text generation model has intensified. All major cloud providers, such as Google Cloud Platform (GCP) and Amazon Web Services (AWS), have made significant investments in their own Large Language Models (LLMs) to improve their performance and establish themselves as industry frontrunners.
In the study we conducted, we put 7 LLMs through their paces:
- 4 closed-source LLMs: two from OpenAI, namely gpt-3.5-turbo and gpt-4, and two from GCP, chat-bison and text-bison.
- 3 of the open-source models that topped the charts of the Open LLM Leaderboard: Falcon-40b-instruct, MPT-30b and Llama2-13b.
We assessed diverse Natural Language Processing (NLP) use cases and compared the models' performances to build a benchmark.
Exploring the LLM offerings of Cloud Providers
Google Cloud Platform (GCP)
Generative AI models in Vertex AI, known as foundation models, are categorized by the type of content they generate, including text and chat, image, code, and text embeddings.
PaLM 2 is the underlying model driving the PaLM API, which is a large language model with improved multilingual, reasoning, and coding capabilities compared to the previous PaLM model. The PaLM API for text is fine-tuned for language tasks like classification, summarization, and entity extraction. The PaLM API for chat is fine-tuned for multi-turn conversations, where the model considers previous chat messages as context for generating responses. The Text Embedding API generates vector embeddings for text, useful for tasks such as semantic search, recommendation, classification, and outlier detection.
The Codey APIs consist of three models: code generation, code completion suggestions, and code-related Q&A chat support. The naming convention for models is <use case>-<model size>, such as text-bison representing the Bison text model. The model sizes available are Bison (best value in terms of capability and cost) and Gecko (smallest and lowest cost model for simple tasks).
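For a concrete idea of how these models are consumed, here is a minimal sketch of calling text-bison and chat-bison through the Vertex AI Python SDK. The project ID, region, prompts and parameter values are placeholders, not part of the original benchmark setup.

```python
import vertexai
from vertexai.language_models import ChatModel, TextGenerationModel

# Placeholder project and region; adjust to your own GCP setup.
vertexai.init(project="my-gcp-project", location="us-central1")

# Text use case: text-bison follows the <use case>-<model size> naming.
text_model = TextGenerationModel.from_pretrained("text-bison@001")
summary = text_model.predict(
    "Summarize the following meeting notes in two sentences: ...",
    temperature=0.2,
    max_output_tokens=256,
)
print(summary.text)

# Chat use case: chat-bison keeps previous turns as context.
chat_model = ChatModel.from_pretrained("chat-bison@001")
chat = chat_model.start_chat(context="You are a helpful product assistant.")
print(chat.send_message("What is the battery capacity?").text)
```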
OpenAI
On the other hand, the OpenAI API also offers a wide range of models with different capabilities and price points to perform virtually any task that requires understanding or generating natural language and code. It can also be used for image generation and editing, or speech-to-text conversion. GPTs, or Generative Pre-trained Transformer models, are decoder-only models that have been trained on a huge amount of data to understand natural language and generate code. You can feed them any prompt you like (by prompt we mean instructions or a few examples of how to successfully complete a task) and they will provide a response to that input.
GPTs can be used across a wide variety of tasks, including content or code generation, summarization, conversation, creative writing, and more. Different versions of GPTs have been released; in this article, we focus on gpt-3.5-turbo and gpt-4.
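For comparison, here is a minimal sketch of querying gpt-3.5-turbo or gpt-4 through the openai Python package (pre-1.0 interface, as used at the time of this benchmark); the API key, prompt and parameter values are placeholders.

```python
import openai

openai.api_key = "sk-..."  # placeholder API key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # or "gpt-4"
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the following article in two sentences: ..."},
    ],
    temperature=0.2,
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```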
Focus on LLMs characteristics
GCP
Text-bison: The text-bison model is suitable for various language tasks, including classification, sentiment analysis, entity extraction, question answering, summarization, etc. It accepts input text of up to 8192 tokens and outputs up to 1024 tokens, and the user can control the output length. Its training data goes up to February 2023.
Chat-bison: The chat-bison model is fine-tuned for multi-turn conversation use cases, so it is useful for needs such as chatbots and AI assistants. Its maximum input limit is 4096 tokens, and it can also output text up to 1024 tokens. Its training data goes up to February 2023 as well.
GCP tokens: One token is equivalent to about 4 characters, and 100 tokens correspond to roughly 60 to 80 English words.
Data policy: Storage and data processing take place in the US for the moment; there is no Google generative AI offering where all data stays within Europe. Customer data is not used to train the models.
Language support: The PaLM suite is supposed to be multilingual, but the documentation mentions that the only supported language of the PaLM 2 API is English. Our guess is that safeguards have only been studied and applied for the English language.
Pricing: Generative AI support on Vertex AI charges by every 1,000 characters of input (prompt) and every 1,000 characters of output (response). Characters are counted by UTF-8 code points and white space is excluded from the count. At the end of each billing cycle, fractions of one cent ($0.01) are rounded to one cent.
- Text-bison: $0.001 per 1000 characters for both input and output.
- Chat-bison: $0.0005 per 1000 characters for both input and output.
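As an illustration of this character-based billing, here is a rough cost estimate assuming the rates quoted above; the whitespace-exclusion rule simply follows the description in the pricing paragraph.

```python
# Rates quoted above, in $ per 1,000 characters (input and output alike).
GCP_PRICE_PER_1K_CHARS = {"text-bison": 0.001, "chat-bison": 0.0005}

def billable_chars(text: str) -> int:
    """Count characters while excluding whitespace, as the billing does."""
    return sum(1 for c in text if not c.isspace())

def gcp_cost(model: str, prompt: str, response: str) -> float:
    chars = billable_chars(prompt) + billable_chars(response)
    return chars / 1000 * GCP_PRICE_PER_1K_CHARS[model]

print(f"${gcp_cost('text-bison', 'a ' * 500, 'b ' * 100):.6f}")
```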
Rate limit: There is a base quota of 60 requests per minute for both models, which can be increased upon request, subject to Google's approval.
OpenAI
GPT-3.5-turbo is an upgraded version of GPT-3, a model with 175 billion parameters, and builds upon the architecture and training data of its predecessor. It accepts input of up to 4096 tokens. One should mention that ChatGPT, the widely renowned dialogue model that has taken the world by storm, owes its finesse to the fine-tuning of GPT-3.5 with an added layer of Reinforcement Learning from Human Feedback (RLHF) to optimize it for dialogue.
GPT-4 is an entirely new large multimodal model (accepting image and text inputs, emitting text outputs) that can perform all traditional NLP tasks but stands out in reasoning tasks. It is said to be a new design that incorporates additional improvements and, allegedly, is a mixture of 8 smaller models. There are two versions of GPT-4: the first has a context length of 8192 tokens, while the newer gpt-4-32k accepts up to 32768 tokens, roughly 50 pages worth of text. The training data of gpt-3.5-turbo and gpt-4 goes up to September 2021.
OpenAI tokens: 1 token is approximately 4 characters or 0.75 words for English text. To put this in perspective, the 8k context length of GPT-4 is equivalent to about 6144 words, or roughly 400 sentences.
Data policy: When it comes to GDPR, no user data is used to train the GPT suite. Data processing and storage are held in the US for the moment. However, Azure OpenAI Service makes it possible to use the same models as OpenAI (GPT-4, GPT-3, Codex, and DALL-E) with regional availability (France Central, West Europe), which is an interesting perk compared to the GCP models.
Language support: GPTs support various languages, and GPT-4 performs particularly well on languages other than English. In 24 of the 26 languages tested, GPT-4 outperforms the English-language performance of GPT-3.5 and other LLMs (Chinchilla, PaLM), including low-resource languages such as Latvian, Welsh, and Swahili.
Pricing:
- gpt-3.5-turbo: $0.002 per 1000 tokens for both input and output.
- gpt-4-8k: $0.03 per 1000 tokens for input and $0.06 per 1000 tokens for output.
- gpt-4-32k: $0.06 per 1000 tokens for input and $0.12 per 1000 tokens for output.
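For the token-based billing of the GPT models, a similar rough estimate can be sketched with the tiktoken library to count tokens; the rates are the ones quoted above and the example strings are placeholders.

```python
import tiktoken  # pip install tiktoken

# ($ per 1,000 input tokens, $ per 1,000 output tokens), rates quoted above.
OPENAI_PRICE_PER_1K = {
    "gpt-3.5-turbo": (0.002, 0.002),
    "gpt-4": (0.03, 0.06),
    "gpt-4-32k": (0.06, 0.12),
}

def openai_cost(model: str, prompt: str, completion: str) -> float:
    enc = tiktoken.encoding_for_model(model)
    in_price, out_price = OPENAI_PRICE_PER_1K[model]
    return (len(enc.encode(prompt)) / 1000 * in_price
            + len(enc.encode(completion)) / 1000 * out_price)

print(f"${openai_cost('gpt-4', 'Summarize this article: ...', 'The article says ...'):.5f}")
```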
Rate limit: GPTs have varying rate limits to cater to different usage scenarios. For GPT-3.5, the rate limit is 3500 requests per minute, while GPT-4-32k offers 1000 requests per minute. GPT-4, currently positioned for experimentation rather than high-volume production use cases, comes with a more limited rate limit of 200 requests per minute. While requests for rate-limit increases cannot be accommodated at this stage due to the capacity constraints outlined by OpenAI, we must emphasize that for our benchmark we had no problem with the initial quota, as opposed to the GCP models, for which we needed a quota raise.
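When a rate limit is hit, the usual workaround is to retry with exponential backoff. A minimal sketch using the tenacity package and the pre-1.0 openai client follows; the retry parameters are illustrative.

```python
import openai
from tenacity import retry, stop_after_attempt, wait_random_exponential

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def chat_with_backoff(**kwargs):
    # Retried with exponential backoff if the call raises (e.g. a rate-limit error).
    return openai.ChatCompletion.create(**kwargs)

reply = chat_with_backoff(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply["choices"][0]["message"]["content"])
```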
Summary table
Selected alternatives for Open Source LLMs
Falcon-40B-Instruct
In contrast to the closed-source LLMs mentioned above, the Falcon models are the first open-source models that closely rival the capabilities of many current closed-source models, while being comparatively lightweight and less expensive to host than other LLMs. The Falcon family holds two base models: Falcon-40B and its little brother, as they call it, Falcon-7B.
Falcon-40B is a causal decoder-only model, like the GPT models, launched by the Technology Innovation Institute (TII). It has been trained on 1 trillion tokens and requires ~90GB of GPU memory. The quality of its training data is the key to its performance, as it has been trained on RefinedWeb, a meticulously filtered and deduplicated web dataset enriched with carefully curated corpora. In this article, we focus on Falcon-40B-Instruct, a version of Falcon-40B finetuned on chat data (150M tokens from Baize mixed with 5% of RefinedWeb data). Falcon-40B-Instruct can take as input a text of 2048 tokens. Beyond this limit, the inference time becomes too long to be practical, and it requires excessive memory for conventional hardware.
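As a reference point for self-hosting, here is a minimal sketch of loading Falcon-40B-Instruct with Hugging Face transformers; the prompt is a placeholder, and sharding the weights across the 4 GPUs of a g5.12xlarge is handled by device_map="auto".

```python
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
falcon = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # Falcon ships custom modeling code
    device_map="auto",        # shard the ~90GB of weights across available GPUs
)

output = falcon(
    "Write a short product description for a smartwatch.",  # placeholder prompt
    max_new_tokens=200,
)
print(output[0]["generated_text"])
```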
Language support: Falcon-40B-Instruct is mostly trained on English data. Its model card on Hugging Face mentions that it will generalize poorly to other languages.
Pricing: Falcon-40B-Instruct is self-hosted. You won't need API credentials to use it; you only need to provision a machine yourself. A g5.12xlarge instance, a large instance type equipped with 4 A10 GPUs, costs about $7 per hour. However, you can only send about 10 requests per machine per minute. While this may seem constraining, at scale several machines can be allocated, which lowers the effective cost per request and raises the achievable request rate.
An AWS feature called auto scaling can manage your machines based on request density: when there is no activity it scales down so you don't incur costs, and when there are too many requests it automatically adds copies of the machine.
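Assuming the model is served behind a SageMaker real-time endpoint, a hedged sketch of attaching such an auto-scaling policy with boto3 could look like the following; the endpoint name, capacities and target value are placeholders, and scaling all the way down to zero may require additional setup depending on your deployment.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
# Placeholder endpoint and variant names.
resource_id = "endpoint/falcon-40b-instruct-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,   # scaling to zero may need extra setup depending on the endpoint type
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="llm-request-density",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Add an instance when a single copy handles more than ~10 invocations per minute.
        "TargetValue": 10.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```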
Data Policy: This is the main advantage of using an open-source model. Following Samsung's sensitive code leak, data security has become a paramount concern for companies using LLMs. With open-source LLMs, your data is processed by a pipeline you fully control.
MPT-30B-Chat
MPT-30B-Chat is a chatbot-like model for dialogue generation built by finetuning MPT-30B on approximately 300,000 turns of high-quality conversations. After the release of MPT-7B, the startup MosaicML launched its 30-billion-parameter LLM MPT-30B on June 27th, 2023. MPT-30B is an autoregressive model like the GPT models, trained on English text and code. It has an 8k-token context window (half the context window of gpt-3.5-turbo-16k) and can run on a single GPU: either one A100-80GB in 16-bit precision or one A100-40GB in 8-bit precision.
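Here is a minimal sketch of loading MPT-30B-Chat in 8-bit precision so that it fits on a single A100-40GB, as described above (requires the bitsandbytes package); the prompt is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mosaicml/mpt-30b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # MPT uses custom modeling code
    load_in_8bit=True,       # 8-bit weights via bitsandbytes to fit one A100-40GB
    device_map="auto",
)

prompt = "Give me three taglines for a coffee shop."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```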
The MPT-30B model is licensed for commercial use. There are more subtleties regarding the commercial use of MPT-30B-Chat: it is considered a research artifact and is not meant for commercial use.
Language support: English. Our attempts at prompting it in French for different use cases have shown that the model keeps very good performance.
Pricing: Just like Falcon-40B, MPT-30B-Chat is self-hosted. A g5.12xlarge machine also costs about $7 per hour.
Data Policy: Same as Falcon. As it is an open-source LLM, your data is processed by a pipeline you fully control.
LLaMa2
LLaMa2 is the latest iteration of the highly acclaimed language model Llama developed and released by Meta. At its core, Llama 2 is built upon the robust foundation of its predecessor, Llama, but with remarkable improvements that make it stand out in the realm of open-source large language models. This model is the result of a longstanding partnership between Meta and Microsoft and will lead the way for the development of such large language models, in terms of understanding them, designing them and making them safer.
Llama2 is available for licensing for both research and commercial applications and can be accessed through the Azure AI model catalog, AWS, Hugging Face and other providers. On top of that, Meta provides resources to help those who use Llama2 develop responsible AI solutions. Moments after its announcement and release, the open-source community saw a surge of innovative products built around Llama2, as well as resources for its usage, which makes it one of the easiest open-source models to use, on top of being among the top performers of the Open LLM Leaderboard. The current top performer, FreeWilly2, is a fine-tuned Llama2 70b model.
The suite of Llama2 models comes in different sizes, 7b, 13b and 70b. The suite contains both foundational models and models fine-tuned for chat. The available Llama2 models on the Cloud providers support up to 4096 tokens in input.
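Here is a hedged sketch of prompting the 13b chat variant with the [INST] / <<SYS>> template used by Meta's reference implementation; the Hugging Face checkpoint is gated (it assumes you have accepted Meta's license), and the system and user messages are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"  # gated: requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

system = "You are a helpful assistant that answers concisely."
user = "Suggest five ideas for a team-building event."  # placeholder prompt
# [INST] / <<SYS>> chat template; the tokenizer adds the leading BOS token itself.
prompt = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```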
Language support: English. Our attempts at prompting it in French for different use cases have shown that the model keeps very good performance.
Pricing: As with the other open-source models, the Llama2 models are self-hosted. The prices per hour for the 7b, 13b and 70b models are respectively $1.50, $7 and $16, using g5-family machines of increasing size.
Data Policy: Same as the other open-source models: your data stays within a pipeline you fully control.
Methodology used to compare these LLMs
Our benchmark consisted of 6 use cases, picked to evaluate the LLM offerings on a wide range of interesting NLP tasks and to provide valuable insights for our clients and ourselves.
Summarization:
We evaluated this task using the ROUGE-1 metric. This allows us to gauge the models' capability to summarize meeting notes, documents, etc., as well as their effectiveness in some chain-prompting use cases where previous prompts need to be summarized. To conduct our evaluation, we used 100 rows from the training split of version 3.0.0 of the CNN / DailyMail dataset, an English-language dataset containing news articles written by journalists at CNN and the Daily Mail.
We used parameter values to reduce the randomness of the model as much as possible while allowing it some creativity. We would like to remind the reader of the drawbacks of using the ROUGE-1 metric as it only compares n-grams (in our case unigrams) between the produced summary and the reference, as well as how subjective our references are.
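A minimal sketch of this evaluation loop using the datasets and evaluate libraries follows; the summarize function is a placeholder standing in for whichever benchmarked model is being called.

```python
import evaluate
from datasets import load_dataset

rouge = evaluate.load("rouge")
data = load_dataset("cnn_dailymail", "3.0.0", split="train[:100]")

def summarize(article: str) -> str:
    # Placeholder: call the LLM under evaluation here.
    return article[:200]

predictions = [summarize(row["article"]) for row in data]
references = [row["highlights"] for row in data]

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rouge1"])
```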
Thematic analysis:
We evaluated this task using the accuracy metric. This offers insight into the models' ability to find and extract various themes from textual data. We used 20 rows for each of the following themes: emotion, hate, offensive, sentiment, and stance (climate) from the training split of the TweetEval dataset.
We also tried zero-shot as well as few-shot learning for prompting this task. Our few-shot examples were hand-picked and are not part of the evaluation set. We found that:
- Performance with few-shot learning degraded compared to zero-shot.
- GPT-3.5 is more robust to poorly chosen few-shot examples than the other models.
We must stress, though, that more elaborate strategies for choosing the few-shot examples may lead to better results; a sketch of both prompt formats is given below.
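Here is a hedged sketch of the zero-shot versus few-shot prompt formats, shown on the hate subset of TweetEval; the hand-picked examples and the classify call are placeholders for the actual model calls.

```python
from datasets import load_dataset

data = load_dataset("tweet_eval", "hate", split="train[:20]")
labels = ["non-hate", "hate"]

# Hand-picked examples (placeholders), deliberately kept out of the evaluation set.
FEW_SHOT = (
    "Tweet: <hand-picked example 1>\nLabel: non-hate\n"
    "Tweet: <hand-picked example 2>\nLabel: hate\n"
)

def build_prompt(tweet: str, few_shot: bool) -> str:
    header = "Classify the tweet as 'hate' or 'non-hate'.\n"
    examples = FEW_SHOT if few_shot else ""
    return f"{header}{examples}Tweet: {tweet}\nLabel:"

def classify(prompt: str) -> str:
    # Placeholder: send the prompt to the LLM under evaluation and parse its label.
    return "non-hate"

for few_shot in (False, True):
    preds = [classify(build_prompt(row["text"], few_shot)) for row in data]
    gold = [labels[row["label"]] for row in data]
    accuracy = sum(p == g for p, g in zip(preds, gold)) / len(gold)
    print(f"few_shot={few_shot}: accuracy={accuracy:.2f}")
```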
Content creation:
For this task, we came up with 10 examples for creating content on several topics. Since there is no conventional evaluation metric to evaluate this task, we decided to pursue the evaluation using two methods:
- The first one relies on manually scoring the models.
- The second uses a newer paradigm, more suitable in this context, which asks GPT-4 to rank the answers obtained from all models according to a set of criteria we fix in the prompt.
Grammar, relevance, consistency, creativity, and audience engagement are the main criteria we fixed for the evaluation. To debias GPT-4's ordering, we anonymized the answers and fed them to it in random order. We prioritized our own judgment over GPT-4's and only used it as a proxy; in practice, human annotation and GPT-4's evaluation aligned in nearly all instances, underscoring the relevance and effectiveness of the GPT-4 evaluation.
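A hedged sketch of this GPT-4 ranking step: answers are anonymized, shuffled, and ranked against the fixed criteria. The answers dictionary and the exact prompt wording are illustrative, not the ones used in our runs.

```python
import random
import openai

criteria = "grammar, relevance, consistency, creativity, audience engagement"
answers = {"gpt-3.5-turbo": "...", "chat-bison": "...", "llama2-13b": "..."}  # placeholders

# Anonymize and shuffle so GPT-4 cannot favor a known model name or a fixed order.
items = list(answers.items())
random.shuffle(items)
anonymized = {f"Answer {i + 1}": text for i, (_, text) in enumerate(items)}

prompt = (
    f"Rank the following answers from best to worst according to: {criteria}.\n\n"
    + "\n\n".join(f"{name}:\n{text}" for name, text in anonymized.items())
)

ranking = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(ranking["choices"][0]["message"]["content"])
```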
Ideation:
For this task we generated 10 prompts ourselves, asking the models to give ideas on assorted topics. The evaluation approach was the same as for the content creation task. We also relied on human judgment for the relevance of the generated ideas with regard to the prompt, as well as the quantity and quality of the ideas and how well they enable fast iteration on the topic.
Response wizard:
We fed the models a context containing technical details about a smartphone, asked each model to play the role of an assistant providing customers with clear answers, and then ran conversations with the models. The questions were kept the same across models. This approach relied purely on human evaluation.
Software development:
In this step, we came up with 10 prompts for writing code and debugging faulty code. The evaluation was based on the correctness of the code provided by the models in both cases, as well as the justifications given for the debugging. The GPT-eval framework was also used to assess the software development task; the prompt we fed to it asks it to focus on requirements such as correctness, alignment with the desired functionality, readability, and modularity of the generated code.
And what about results for these selected use cases?
Below we present the performance of each model through a radar plot, as well as a comparison table with color-coded circles transitioning from red to green to illustrate the spectrum of performance from subpar to particularly good.
GPT-4 and GPT-3.5-turbo perform well on all these tasks, and Chat-bison achieves results almost as good as GPT-3.5-turbo and GPT-4 on tasks like ideation, response wizard, content creation and software development.
On the remaining tasks, summarization and thematic analysis, chat-bison was intentionally omitted due to its inherent limitations for such specific use cases. As for text-bison, it fell short of the capabilities demonstrated by the GPT models.
For the summarization task, while the ordering of models is correct, we must emphasize that according to the ROUGE-1 metric the performance remains average for all tested LLMs. However, this evaluation metric, which compares the unigram overlap between the generated summary and the reference summary, only accounts for surface-level lexical matching and should not be fully trusted.
We notice that the open-source models are lagging a little behind the closed-source models, except for Llama2-13b, and that there is a clear distinction in performance between the MPT model and the Falcon model. The MPT model has nearly similar performance to the OpenAI models on ideation, content creation and response wizard. It is also the only model for which the few-shot technique for thematic analysis increased performance.
For text summarization, while it conveys the key ideas of the text, the summary generated by MPT-30b is not brief and straight to the point. One good solution is to specify the desired summary length in the prompt. Our guess for Falcon's poor performance is that its instruction tuning made it specifically good at following instructions, as shown by its good performance on the response wizard task, at the cost of poorer performance on the other tasks.
Of the open-source models, Llama2-13b stands out as the best at handling creative tasks like ideation, content creation and response wizard. Despite having only 13b parameters, compared to the other open-source models (30b/40b), its performance is very close to that of the cloud providers' models, which pack in hundreds of billions of parameters. This is a clear sign of the promise of the Llama2 suite of models.
Another point on the response wizard: while the open-source LLMs generally perform very well, Falcon tends to limit itself to extracting information from the context without injecting personal opinions or reasoning. On the other hand, Llama 2 and MPT-30b have shown their ability to reason and derive conclusions while answering the questions, akin to a human assistant showing off its expertise on the product and hyping you up to buy it. So, depending on your needs, you may consider different LLMs.
Why is Falcon performing poorly despite its first position on Hugging Face Open LLM Leaderboard?
We have tried various ways to prompt the Falcon model and found that it expects inputs worded in a specific way in order to interpret them correctly and produce the expected results.
For the sake of objectivity in our benchmark, we did not adapt our prompts to the Falcon model but kept the same prompts for all models. This also shows that the Falcon model is not as well aligned to the user as the other models, at least on the evaluated tasks where it performed poorly.
Here’s an example of Falcon’s sensitivity to the way the prompt is worded.
First example, without adapting the prompt, where Falcon's reply was wrong:
The same prompt adapted; this time Falcon gives a correct answer:
Falcon is unusually sensitive to small changes in its parameters (temperature, repetition penalty, etc.). The model loses its ability to follow instructions at high temperatures, so it is necessary to keep the temperature low to maximize the chances that it produces the desired output. With high temperatures, GPT-4 also displays more variability in its answers, but the difference is that it still complies with the instructions specified in the prompt.
With a low repetition penalty, the model also tends to generate additional user prompts and answers, engaging in an imaginary conversation with the user until it reaches its max output token limit. It is therefore important to keep a relatively high repetition penalty (1.03 is the default) to prevent this behavior.
Here is, for example, the Falcon response for the same prompt with temperature = 0.8:
And the Falcon response for the same prompt with temperature = 0.1:
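The settings behind these two behaviors can be reproduced through the generation parameters of a transformers pipeline; a sketch with illustrative values and a placeholder prompt follows.

```python
import torch
from transformers import pipeline

falcon = pipeline(
    "text-generation",
    model="tiiuae/falcon-40b-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

prompt = "Extract the main theme of the following tweet: ..."  # placeholder prompt

# Low temperature, default repetition penalty: Falcon stays on-instruction.
stable = falcon(prompt, max_new_tokens=256, do_sample=True,
                temperature=0.1, repetition_penalty=1.03)

# High temperature, low repetition penalty: Falcon tends to drift off the
# instructions and to invent extra user turns until it hits the token limit.
unstable = falcon(prompt, max_new_tokens=256, do_sample=True,
                  temperature=0.8, repetition_penalty=1.0)

print(stable[0]["generated_text"])
print(unstable[0]["generated_text"])
```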
Why is Llama2-13b performing poorly on technical tasks despite its great performance on creative tasks?
Its poor performance on the thematic analysis is due to censorship, which prohibits the model from answering our thematic analysis questions. These questions aim to classify offensive and hateful tweets and analyze the model's ability to distinguish their content from other types of content. This is a double-edged sword: it makes the model far more responsible, so it can be deployed for clients with less overhead than other models, but this comes at the cost of limiting its use to cases where the prompts wouldn't be interpreted by the model as hateful, racist, offensive, or anything else that triggers its safety response.
So, for all the use cases of sentiment analysis, topic extraction, named entity recognition, etc., on user data, this model is likely to fall short due to censorship issues. There are many resources on making Llama2 models uncensored, but such work is beyond the scope of our benchmark.
Conclusion
Ultimately, the best model choice depends on the use case, resources, user needs, and the goal of the project.
The choice between the GPT models and the GCP Bison models will depend first on the number of tokens in the input data and on the quality requirements for the extracted data. When it comes to choosing between closed-source and open-source models, however, the main thing to check is data safety and to what extent we are willing to preserve data privacy. It will also depend on the expertise available to deploy such models: when the use case is a specific task on which closed-source models perform poorly, using an open-source one will also require finetuning expertise for that specific use case to achieve better performance. The main point to consider, though, is the storage of data and GDPR compatibility; otherwise, GPT-3.5 seems to offer the best trade-off between performance and cost.
The evaluation of these seven large language models across various tasks revealed notable differences in their performance. GPT-4 and GPT-3.5 consistently demonstrated strong capabilities across most tasks, making them the top-performing models overall. MPT-30B-Chat showed competence in specific tasks but lacked consistency compared to the top two models. Meanwhile, Llama2-13b showed great performance on creative tasks such as ideation, content creation and response wizard (which could also be considered a reasoning and extraction task), but the model is heavily censored, which led to its poor performance on the thematic analysis task. On the other hand, Text-bison@001 had mixed results, performing reasonably well in ideation and software development but struggling with the other tasks; Chat-bison@001 nevertheless remains preferable for ideation, to account for the random factors that could explain the small differences between the two models. Falcon-40B-Instruct generally fell below average across all tasks, except for an average score on the response wizard task.
These findings provide valuable insights to better understand the strengths and weaknesses of these language models in various application domains. It is important to note that these results are based on specific data. If the models were evaluated on different tasks or with different datasets, the outcome might be different. One should also keep in mind the limitations of the metrics used for evaluation.
As a side note, Llama2-13b remains our team's favorite response wizard. It has a unique trait in answering user questions that makes it a pleasant assistant.