The growing interest in Large Language Models (LLMs) has accelerated research efforts to adapt these models for various languages. Despite this, pretraining LLMs from scratch for non-English languages remains underexplored. This is the case for Italian, where no truly open-source research has investigated the pretraining process. To address this gap, we introduce Minerva (https://nlp.uniroma1.it/minerva), the first family of LLMs trained entirely from scratch on native Italian texts. Our work is the first investigation into the challenges and opportunities of pretraining LLMs specifically for the Italian language, offering insights into vocabulary design, data composition, and model development. With Minerva, we demonstrate that building an LLM tailored to a specific language yields numerous practical benefits over adapting existing multilingual models, including greater control over the model’s vocabulary and the composition of its training data. We provide an overview of the design choices, pretraining methods, and evaluation metrics used to develop Minerva, which shows promising performance on Italian benchmarks and downstream tasks. Moreover, we share the lessons learned throughout Minerva’s development to support the academic and industrial communities in advancing non-English LLM research. We believe that Minerva serves as an important step towards closing the gap in high-quality, open-source LLMs for non-English languages.
Publication details
2024, Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)
Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data (04b Conference paper in volume)
Orlando Riccardo, Moroni Luca, Huguet Cabot Pere-Lluís, Barba Edoardo, Conia Simone, Orlandini Sergio, Fiameni Giuseppe, Navigli Roberto