Rule-based Automatic Multi-Word Term Extraction and Lemmatization - Selection of References on Lexicon-Grammar and NLP Dictionaries
Conference paper. Year: 2016

Ranka Stanković
  • Role: Author
  • PersonId: 1092669
Cvetana Krstev
  • Role: Author
  • PersonId: 1318086
Ivan Obradović
  • Role: Author
  • PersonId: 1318087
Biljana Lazić
  • Role: Author
  • PersonId: 1318088
Aleksandra Trtovac
  • Role: Author
  • PersonId: 1318089

Abstract

In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is required for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from the mining domain containing more than 600,000 simple word forms. Extracted and lemmatized multi-word terms are filtered in order to reject falsely offered lemmas and then ranked by introducing measures that combine linguistic and statistical information (C-Value, T-Score, LLR, and Keyness). Mean average precision for retrieval of MWU forms ranges from 0.789 to 0.804, while mean average precision of lemma production ranges from 0.956 to 0.960. The evaluation showed that 94% of distinct multi-word forms were evaluated as proper multi-word units, and among them 97% were associated with correct lemmas.
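The ranking and evaluation steps named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it uses the commonly cited C-Value formula (frequency discounted by the mean frequency of longer candidates containing the term, weighted by log2 of term length) and a plain average-precision function normalized by the number of relevant items found in the ranked list. The candidate terms, their frequencies, and the relevance judgments are invented English examples, and the names `c_value`, `average_precision`, and `_contains` are hypothetical.

```python
import math


def _contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous word sequence in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(freqs):
    """Score multi-word candidates (tuples of words) with the standard C-Value formula.

    freqs maps each candidate to its corpus frequency. Candidates must have
    at least two words, so log2(length) is positive.
    """
    scores = {}
    for term, freq in freqs.items():
        # Longer candidates in which this term is nested.
        containers = [
            other for other in freqs
            if len(other) > len(term) and _contains(other, term)
        ]
        if containers:
            # Discount by the mean frequency of the containing candidates.
            nested = sum(freqs[o] for o in containers) / len(containers)
            adjusted = freq - nested
        else:
            adjusted = freq
        scores[term] = math.log2(len(term)) * adjusted
    return scores


def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k that holds a relevant item.

    Assumption: normalized by the number of relevant items present in `ranked`.
    """
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


# Invented candidate frequencies, for illustration only.
freqs = {
    ("open", "pit"): 10,
    ("open", "pit", "mine"): 6,
    ("coal", "seam"): 3,
}
scores = c_value(freqs)
ranking = sorted(scores, key=scores.get, reverse=True)

# Invented evaluator judgments: which candidates count as proper terms.
relevant = {("open", "pit", "mine"), ("coal", "seam")}
ap = average_precision(ranking, relevant)
print(ranking, round(ap, 3))
```

Mean average precision would then be the mean of such values over several ranked lists, for example one list per ranking measure or per syntactic pattern; how the lists are grouped in the paper is not stated in this record.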
Main file
L16-1081.pdf (1.3 MB)
Origin: Publication funded by an institution

Dates and versions

hal-04314215, version 1 (29-11-2023)

Identifiers

  • HAL Id: hal-04314215, version 1

Cite

Ranka Stanković, Cvetana Krstev, Ivan Obradović, Biljana Lazić, Aleksandra Trtovac. Rule-based Automatic Multi-Word Term Extraction and Lemmatization. LREC, May 2016, Portorož, Slovenia. pp. 507-514. ⟨hal-04314215⟩

Collections

LIGM_LINGU_INVITE