ADoBo — Automatic Detection of Borrowings
What is a lexical borrowing?
Lexical borrowing is the process of importing words from one language into another language. Lexical borrowing is a phenomenon that affects all languages and is in fact a productive mechanism of word formation. During the last decades, English in particular has produced numerous lexical borrowings (often called anglicisms) in many European languages, especially in the press. It has been estimated that a reader of French newspapers encounters a new lexical borrowing every 1,000 words, English borrowings outnumbering all other borrowings combined. In Chilean newspapers, lexical borrowings account for approximately 30% of neologisms, 80% of those corresponding to English borrowings. In European Spanish, it has been estimated that anglicisms could account for 2% of the vocabulary used in Spanish newspaper El País in 1991, a number that is likely to be higher today.
As a result, the presence of English borrowings into Spanish has attracted lots of attention, both in Linguistics studies and among the general public.
Why is lexical borrowing detection interesting?
The task of automatically extracting unadapted lexical borrowings from text is relevant both for lexicographic purposes and for NLP downstream tasks. Borrowing detection has previously been used as a preprocessing step for parsing, text-to-speech synthesis and machine translation.
In the last years, several projects have approached the problem of extracting lexical borrowings in different European languages such, as German, Italian, French, Finnish or Norwegian, with a particular focus on anglicism extraction. Lately, work on anglicism detection in Spanish language has also been done for Argentinian Spanish and European Spanish.
What makes borrowing detection challenging?
The task of extracting emergent lexical borrowings is a more challenging undertaking than it might appear to be at first. To begin with, lexical borrowings can be either single or multitoken expressions (e.g., prime time, tie break or machine learning). Second, plain dictionary lookup can be an unreliable source for borrowing detection: after all, a term like social media is a borrowing, even when the two tokens that form the term happen to be Spanish words that appear in Spanish dictionaries.
Finally, linguistic adaptation is a diachronic proccess and, as a result, what constitutes an unadapted borrowing is not clear-cut. For example, words like bar or club were unadapted lexical borrowings in Spanish at some point in the past, but have been around for so long in the Spanish language that the process of phonological and morphological adaptation is now complete and they cannot be considered unadapted borrowings anymore. On the other hand, realia words, that is, culture-specific elements whose name entered via the language of origin decades ago (like jazz or whisky) cannot be considered emergent anymore, even when their orthography has not been adapted into the Spanish spelling system.