Cohere today launched two new open models from its Aya project to help close the language gap in foundation models.
Now available on Hugging Face, Aya Expanse 8B and 32B extend performance gains to 23 languages. Cohere said in a blog post that the 8B parameter model “makes groundbreaking discoveries more accessible to researchers around the world,” while the 32B parameter model offers cutting-edge multilingual capabilities.
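For readers who want to try the models, loading the 8B variant with the Hugging Face transformers library looks roughly like the sketch below (the repository ID and example prompt are assumptions; check the model card for the exact name and license terms):

```python
# Minimal sketch: loading an Aya Expanse checkpoint with transformers.
# The repository ID below is an assumption; verify it on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-expanse-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Prompt the model in any of its 23 supported languages via the chat template.
messages = [{"role": "user", "content": "¿Cuál es la capital de Perú?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```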
The Aya project seeks to expand access to foundation models in more global languages beyond English. Cohere for AI, the company’s research arm, launched the Aya initiative last year. In February, it released the Aya 101 large language model (LLM), a 13-billion-parameter model covering 101 languages. Cohere for AI has also released the Aya dataset to help expand access to other languages for model training.
Aya Expanse builds on many of the same recipes used to create Aya 101.
“The improvements to Aya Expanse are the result of our continued focus on rethinking the core components of our machine learning innovation and expanding how AI serves the world’s languages,” said Cohere. “Over the past few years, our research agenda has focused on bridging the language gap with several innovations that are important to the current recipe, such as data arbitrage, preference training for general performance and safety, and finally model merging.”
Strong benchmark performance
Cohere said its two Aya Expanse models consistently outperformed similarly sized AI models from Google, Mistral, and Meta.
The Aya Expanse 32B performed better than Gemma 2 27B, Mixtral 8x22B and the much larger Llama 3.1 70B in Cohere’s multilingual benchmark tests. The smaller 8B also outperformed Gemma 2 9B, Llama 3.1 8B and Ministral 8B.
Cohere developed the Aya Expanse models using a data sampling method called data arbitrage to avoid the gibberish that can result when models rely on synthetic data. Many models are trained on synthetic data generated by a “teacher” model. However, good teacher models are hard to find for other languages, especially low-resource ones.
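Cohere has not published its exact pipeline in this announcement, but the general shape of data arbitrage can be sketched as follows: sample candidate completions from a pool of teacher models and keep only the best-scoring one for each prompt, so that weak teachers for a given language do not pollute the synthetic training set. The helper functions below are hypothetical.

```python
# Illustrative sketch of the data-arbitrage idea, not Cohere's implementation.
# `generate_completion` and `quality_score` are hypothetical callables supplied
# by the caller (e.g. a reward model or classifier used for scoring).

def arbitrage_dataset(prompts, teacher_models, generate_completion, quality_score):
    curated = []
    for prompt in prompts:
        # Sample one candidate completion from every teacher in the pool.
        candidates = [
            (teacher, generate_completion(teacher, prompt))
            for teacher in teacher_models
        ]
        # Keep only the highest-scoring completion for this prompt.
        best_teacher, best_completion = max(
            candidates, key=lambda pair: quality_score(prompt, pair[1])
        )
        curated.append(
            {"prompt": prompt, "completion": best_completion, "teacher": best_teacher}
        )
    return curated
```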
Cohere also focused on guiding the models toward “global preferences” and accounting for different cultural and linguistic perspectives. The company said it found a way to improve performance and safety even while shaping the models’ preferences.
“We consider this the ‘last spark’ of AI model training,” the company said. “However, preference training and safety measures are often over-applied to harms prevalent in Western-centric datasets. The problem is that these safety protocols frequently do not extend to multilingual settings. Our work is one of the first to extend preference training to a large-scale multilingual setting, taking into account diverse cultural and linguistic perspectives.”
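Cohere has not detailed its training data here, but preference training in general consumes triples of a prompt, a preferred response and a rejected response; a multilingual version simply draws those triples from many languages and cultural contexts. The snippet below is a purely illustrative sketch of that data shape, not Cohere’s dataset or method.

```python
# Purely illustrative: the prompt/chosen/rejected format commonly consumed by
# DPO-style preference-tuning trainers, drawn here from multiple languages.
# The examples are placeholders, not Cohere's data.
multilingual_preference_pairs = [
    {
        "language": "fr",
        "prompt": "Explique la photosynthèse en une phrase.",
        "chosen": "Les plantes convertissent la lumière, l'eau et le CO2 en énergie et en oxygène.",
        "rejected": "La photosynthèse, c'est quand les plantes dorment la nuit.",
    },
    {
        "language": "sw",
        "prompt": "Eleza usanisinuru kwa sentensi moja.",
        "chosen": "Mimea hubadilisha mwanga, maji na CO2 kuwa nishati na oksijeni.",
        "rejected": "Usanisinuru ni mimea kulala usiku.",
    },
]

# A preference-tuning trainer optimizes the model to prefer "chosen" over
# "rejected" for each prompt, across every language represented in the data.
```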
Models in various languages
The Aya initiative focuses on ensuring research into LLMs that perform well in languages other than English.
Many LLMs eventually become available in other languages, especially widely spoken ones, but finding data to train models in those languages can be challenging. English, after all, tends to be the official language of government, finance, internet conversation and business, so data in English is much easier to find.
Additionally, translation quality can make it difficult to accurately benchmark a model’s performance in different languages.
Other developers have also released their own language datasets to further research on non-English LLMs. OpenAI, for example, made its Multilingual Massive Multitask Language Understanding (MMMLU) dataset available on Hugging Face last month. The dataset is meant to help better test LLM performance across 14 languages, including Arabic, German, Swahili and Bengali.
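As a sketch of how a researcher might pull that dataset for evaluation (the repository and configuration names below are assumptions; verify them against the dataset card on Hugging Face):

```python
# Minimal sketch, assuming the dataset is published as "openai/MMMLU" with
# per-language configurations such as "AR_XY" for Arabic (names are assumptions).
from datasets import load_dataset

arabic_mmmlu = load_dataset("openai/MMMLU", "AR_XY", split="test")
print(arabic_mmmlu[0])  # a question, its answer options and the correct label
```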
Cohere has been busy over the past few weeks. This week, the company added image search capabilities to Embed 3, its enterprise embedding product used in retrieval augmented generation (RAG) systems. It also rolled out fine-tuning for its Command R 08-2024 model earlier this month.