This Science News Wire page contains a press release issued by an organization and is provided to you "as is" with little or no review from Science X staff.

Developing a neural machine translation system for all the Romance languages of the Iberian Peninsula

July 5th, 2023 Juan F. Samaniego

Recent years have seen an explosion in the number and effectiveness of machine translation technologies. Thanks to artificial intelligence, we all carry in our pockets powerful tools that can easily translate between the most widely spoken languages. But what happens with languages that have fewer speakers and resources? How can an AI "learn" them? For the Romance languages of the Iberian Peninsula, the answer may lie in transfer learning and multilingual system training.

The Neural Machine Translation for the Languages of the Iberian Peninsula (TAN-IBE) project, coordinated by the Universitat Oberta de Catalunya (UOC) and involving the universities of Oviedo, Lleida and Zaragoza, explores the most effective techniques for training machine translation systems based on neural networks (a type of AI), applied to seven of the Romance languages of the Iberian Peninsula: Spanish, Portuguese, Catalan, Galician, Asturian, Aragonese and Aranese.

An AI that transfers knowledge between languages

Neural network-based translation systems are trained on millions of sentences in one language paired with their translations into another. These datasets are known as parallel corpora: large collections of text available in two languages and aligned sentence by sentence. Once the neural network has been trained, it can translate new text between those two languages. The problem is that, while parallel corpora are easy to find for languages such as Spanish and Portuguese, languages with less material available, such as Aranese, Aragonese and Asturian, rarely have enough data to train the artificial intelligence.
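
As a rough illustration of what a parallel corpus looks like in practice, the following Python sketch pairs up aligned sentences and discards obviously unusable pairs. It is not taken from the TAN-IBE project: the function name `build_parallel_corpus`, the `max_length_ratio` threshold and the sample Spanish/Catalan sentences are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def build_parallel_corpus(
    source_lines: Iterable[str],
    target_lines: Iterable[str],
    max_length_ratio: float = 3.0,  # hypothetical threshold, chosen for illustration
) -> List[Tuple[str, str]]:
    """Pair line i of the source text with line i of the target text.

    Pairs are dropped when either side is empty or when one side has far
    more words than the other, which often signals a misalignment.
    """
    pairs = []
    # zip stops at the shorter input, so the two inputs are assumed to be
    # aligned line by line and of equal length.
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_length_ratio:
            continue
        pairs.append((src, tgt))
    return pairs


# Illustrative sample data (Spanish source, Catalan target).
spanish = ["Buenos días.", "", "¿Cómo estás?", "Gracias."]
catalan = ["Bon dia.", "Bona nit.", "Com estàs?", "Moltes gràcies per tot el que has fet avui."]

corpus = build_parallel_corpus(spanish, catalan)
for src, tgt in corpus:
    print(f"{src}\t{tgt}")
```

Real corpus preparation usually involves further steps, such as deduplication and language identification, that this sketch leaves out.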

"The good thing is that neural systems can learn things about a language from another, similar language," explained Antoni Oliver, member of the UOC's Faculty of Arts and Humanities and researcher at the Linguistic Applications Inter-University Research Group (GRIAL-UOC), which is coordinating the TAN-IBE project.

"That's why we've chosen Romance languages. The process needs to be able to learn by transfer, using a model between two languages to construct the translation system between another two. So, for example, when it's completed, the Spanish/Aranese translation tool will have done some of its learning from the Spanish/Catalan or the Spanish/Portuguese systems."

The construction of the translation model is not the only goal of this research project. It also seeks to:

  • Compile parallel and monolingual corpora for the seven featured Romance languages, with a particular focus on Asturian, Aragonese and Aranese.
  • Explore new techniques for the training of neural machine translation systems. In addition to transfer learning, the project will study multilingual machine translation, self-supervised machine translation and unsupervised machine translation.
  • Train neural machine translation systems between Spanish and the rest of the project's languages, in both directions.
  • Train multilingual systems able to translate from and into all the project's languages.
  • Create guides and scripts to help train neural machine translation systems in general and, more specifically, for the project's languages.
  • Publish the project's results with open licenses. This includes the corpora, the machine translation models and engines, and the guides and scripts.
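
One common way to train a single multilingual system, described in the machine translation literature, is to prefix each source sentence with a token naming the desired target language, so that one model can serve many language pairs. The article does not say whether TAN-IBE uses this approach, so the Python sketch below is only an assumption-laden illustration: the tag format `<2xx>`, the language codes and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def add_target_tag(source_sentence: str, target_lang: str) -> str:
    """Prefix a source sentence with a token naming the target language."""
    return f"<2{target_lang}> {source_sentence}"


def build_multilingual_examples(
    corpora: Dict[Tuple[str, str], List[Pair]],
) -> List[Pair]:
    """Merge several bilingual corpora into one tagged training set.

    Each sentence pair is used in both directions, so the same model
    sees, for example, Spanish-to-Catalan and Catalan-to-Spanish examples.
    """
    examples = []
    for (src_lang, tgt_lang), pairs in corpora.items():
        for src, tgt in pairs:
            examples.append((add_target_tag(src, tgt_lang), tgt))
            examples.append((add_target_tag(tgt, src_lang), src))
    return examples


# Illustrative data; the language codes are labels chosen for this example.
corpora = {
    ("es", "ca"): [("Buenos días.", "Bon dia.")],
}

examples = build_multilingual_examples(corpora)
for model_input, expected_output in examples:
    print(model_input, "->", expected_output)
```

Mixing the pairs of several related languages in one training set is one way a low-resource pair can benefit from data available for better-resourced pairs.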

"Broadly speaking, the project comprises, firstly, compiling all the corpora for those languages with less material (Asturian, Aragonese and Aranese) and, secondly, training the translation systems," added Oliver. "The end result of the project will be both the open publication of the resources, insofar as this is possible, and the creation of a free-to-use neural machine translation system."

Agreements and studies to promote minority languages

The first part of the project is taking place outside the lab. To obtain the data required to train the artificial intelligence models, the team needs to compile as much material as possible for Asturian, Aragonese and Aranese. "That's why this first phase focuses on securing agreements with regional governments, universities and publishers to provide the materials for creating the parallel corpora to train the neural system," said Oliver.

In this regard, an important agreement was signed with the Government of Asturias in May 2023. Under it, the government will hand over the entire corpus of texts translated from Spanish into Asturian held by its Directorate General of Language Policy. The agreement also stipulates that, if the Government of Asturias so requests, it can access the technological and linguistic developments achieved by the TAN-IBE project for use in any machine translation projects of its own.

"Ultimately, our goal with this project is to help promote the use of these languages with fewer resources and foster more publishing in them," said Oliver.

"For example, all laws could be published in two languages, quickly and efficiently, using fewer resources, although a human review would always be required. What's more, those who don't dare to use these languages because they don't feel confident enough can use these tools as support for improving their texts. Lastly, languages like Asturian, Aragonese and Aranese need to be included in digital technologies. If not, they may start disappearing and be forgotten."

Provided by Universitat Oberta de Catalunya (UOC)

Citation: Developing a neural machine translation system for all the Romance languages of the Iberian Peninsula (2023, July 5) retrieved 29 November 2024 from https://sciencex.com/wire-news/450030985/developing-a-neural-machine-translation-system-for-all-the-roman.html