Lexical simplification via single-word generation
Lexical simplification (LS) aims to simplify a sentence by replacing complex words with simpler ones without changing the sentence's meaning, which can make text easier to comprehend for non-native speakers and children. Traditional LS methods use linguistic databases or word-embedding models to extract synonyms or highly similar words for a complex word, and then rank them by their appropriateness in context.
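The traditional embedding-based step can be illustrated with a minimal sketch: retrieve the nearest neighbors of the complex word in an embedding space as raw candidates. The gensim library, the "glove-wiki-gigaword-100" vectors, and the example word are illustrative assumptions, not the data of any particular method.

```python
# A minimal sketch of the traditional embedding-based candidate extraction,
# assuming the gensim library; the GloVe vectors are an illustrative choice.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# Nearest neighbours in embedding space serve as raw substitution
# candidates; real systems then re-rank them by fit in the sentence context.
for word, similarity in vectors.most_similar("scrutinize", topn=10):
    print(f"{word}\t{similarity:.3f}")
```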
Recently, BERT-based LS methods mask the complex word in the original sentence, entirely or partially, and then feed the sentence into the pretrained language model BERT to obtain the top-probability tokens for the masked position as substitute candidates.
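The fully masked variant of this idea fits in a few lines. The sketch below assumes the HuggingFace transformers library and the bert-base-uncased checkpoint; the example sentence is invented for illustration.

```python
# A minimal sketch of BERT-based substitute generation (full masking).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

sentence = "The committee will scrutinize the proposal."
complex_word = "scrutinize"

# Replace the complex word with [MASK] and ask BERT for the most
# probable tokens at the masked position.
masked = sentence.replace(complex_word, tokenizer.mask_token)
inputs = tokenizer(masked, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits

# The top-k vocabulary tokens at the masked position become candidates.
top_ids = logits[0, mask_pos].topk(10).indices
candidates = [tokenizer.decode([int(i)]).strip() for i in top_ids]
print(candidates)
```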
By making full use of the contextual information around complex words, these methods have made remarkable progress in generating substitutes and effectively alleviate the shortcomings of traditional methods. However, the paucity of annotated LS data limits the applicability of BERT, leading to two limitations:
- As a self-supervised pretrained model trained to recover corrupted text, BERT never explicitly learns the word-substitution operation.
- Masking the complex word destroys its semantic information, so the generated substitutes may fail to preserve the sentence's meaning.
To address these limitations, the research team treats the LS task as a single-word generation task and proposes PaGeLS, an unsupervised LS method based on non-autoregressive paraphrase generation. The paper is published in the journal Frontiers of Computer Science.
After training an encoder-decoder model on a paraphrase corpus, they feed the sentence into the encoder and let the decoder predict a probability distribution over the vocabulary at the hidden representation of the complex word. The words with the highest probabilities are chosen as candidates.
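A hedged sketch of this single-word generation idea follows. It is not the authors' exact model: an off-the-shelf BART checkpoint stands in for PaGeLS's paraphrase-trained encoder-decoder, teacher forcing on the original sentence stands in for its non-autoregressive decoding, and the example sentence is invented.

```python
# A hedged stand-in for single-word generation: the complex word stays
# visible to the encoder, so its semantics are not destroyed by masking.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.eval()

sentence = "The weather today is terrible."
complex_word = " terrible"  # BART's BPE tokens carry a leading space

inputs = tokenizer(sentence, return_tensors="pt")
# Locate the complex word (assumed here to start a BPE token) and feed the
# same sentence to the decoder under teacher forcing.
word_id = tokenizer.encode(complex_word, add_special_tokens=False)[0]
pos = (inputs["input_ids"][0] == word_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs, decoder_input_ids=inputs["input_ids"]).logits

# The decoder at step t predicts token t+1, so the distribution for the
# complex word sits one step earlier; in practice the original word itself
# would be filtered out of the resulting candidate list.
top_ids = logits[0, pos - 1].topk(10).indices
candidates = [tokenizer.decode([int(i)]).strip() for i in top_ids]
print(candidates)
```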
Compared with pretrained BERT, PaGeLS incorporates three kinds of information: the semantic information of the complex word, the context of the complex word, and the semantic information of the original sentence.
In summary, the research team:
- Proposes PaGeLS, an LS method that does not rely on any annotated LS data. To the best of their knowledge, PaGeLS is the first LS method that produces substitute candidates based on the essence of the LS task, namely replacing a word without changing the sentence's meaning.
- Proposes a novel strategy for candidate ranking. They adopt BARTScore, a text-generation evaluation metric, to compute the relationship between the original sentence and the updated sentence. They found that BARTScore is very well suited to candidate ranking: when all methods are given the same substitution candidate list, it outperforms the previous state-of-the-art methods on three popular LS datasets (see the sketch after this list).
- Provides experimental results showing that PaGeLS achieves state-of-the-art results.
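The ranking idea can be sketched as scoring each updated sentence by the average log-likelihood that a BART model assigns to generating it from the original sentence. This is a minimal reconstruction assuming the HuggingFace transformers library; the facebook/bart-large-cnn checkpoint, the example sentence, and the candidates are illustrative, and the paper's exact ranking pipeline may differ.

```python
# A minimal BARTScore-style scorer: average token log-likelihood of
# generating the target sentence conditioned on the source sentence.
import torch
import torch.nn.functional as F
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()

def bart_score(src: str, tgt: str) -> float:
    """Average log-likelihood of generating tgt from src under BART."""
    src_ids = tokenizer(src, return_tensors="pt").input_ids
    tgt_ids = tokenizer(tgt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=src_ids, labels=tgt_ids).logits
    log_probs = F.log_softmax(logits, dim=-1)
    token_scores = log_probs[0].gather(1, tgt_ids[0].unsqueeze(1)).squeeze(1)
    return token_scores.mean().item()

sentence = "The committee will scrutinize the proposal."
candidates = ["examine", "review", "celebrate"]

# Rank candidates by how well the updated sentence relates to the original;
# a meaning-changing substitute like "celebrate" should score lower.
ranked = sorted(
    candidates,
    key=lambda w: bart_score(sentence, sentence.replace("scrutinize", w)),
    reverse=True,
)
print(ranked)
```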
More information:
Jipeng Qiang et al, Lexical simplification via single-word generation, Frontiers of Computer Science (2023). DOI: 10.1007/s11704-023-2744-2
Provided by Higher Education Press