This demo takes a lemmatised and part-of-speech-tagged Slovene corpus as input and extracts multi-word terminological units. Please use the ToTaLe analyser to pre-process your corpus.
Alternatively, if you upload a txt, odt, pdf or doc file (UTF-8 is assumed), the file will be converted to text and sent to the ToTaLe analyser for lemmatisation and part-of-speech tagging. Please check the ToTaLe terms of use before you use this function.
Results will be displayed on a separate page in a two-column table. The first column contains the lemmatised version of a multi-word term (e.g. "bojna enota kopenskih sil" becomes "bojen enota kopenski sila"), and the second column contains the canonical term (in the nominative case), if this form was present in the uploaded corpus.
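To make the two-column output more concrete, here is a minimal Python sketch of how such a table could be assembled from tagged term occurrences. It is an illustration only, not the demo's actual implementation. The token layout (surface form, lemma, tag), the MULTEXT-East-style tag strings, the rule that the first noun is the head of the term, and the names is_nominative_noun and build_term_table are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical token layout: (surface form, lemma, tag).
# The tags imitate MULTEXT-East-style codes; the real tagset may differ.
Token = Tuple[str, str, str]


def is_nominative_noun(tag: str) -> bool:
    # Assumption: noun tags start with "N" and encode case in the
    # fifth character, with "n" meaning nominative.
    return tag.startswith("N") and len(tag) >= 5 and tag[4] == "n"


def build_term_table(occurrences: List[List[Token]]) -> Dict[str, Optional[str]]:
    """Map each lemmatised term to a canonical surface form.

    The value is None when no occurrence with a nominative head noun
    was seen, mirroring the "if this form was present" condition.
    """
    table: Dict[str, Optional[str]] = {}
    for tokens in occurrences:
        key = " ".join(lemma for _, lemma, _ in tokens)
        table.setdefault(key, None)
        # Assumption: the first noun in the term is its head.
        head = next((t for t in tokens if t[2].startswith("N")), None)
        if table[key] is None and head is not None and is_nominative_noun(head[2]):
            table[key] = " ".join(surface for surface, _, _ in tokens)
    return table


occurrences = [
    # Genitive occurrence: contributes the lemma key but no canonical form.
    [("bojne", "bojen", "Agpfsg"), ("enote", "enota", "Ncfsg"),
     ("kopenskih", "kopenski", "Agpfpg"), ("sil", "sila", "Ncfpg")],
    # Nominative occurrence: supplies the canonical form.
    [("bojna", "bojen", "Agpfsn"), ("enota", "enota", "Ncfsn"),
     ("kopenskih", "kopenski", "Agpfpg"), ("sil", "sila", "Ncfpg")],
    # A term seen only in the genitive: second column stays empty.
    [("kopenskih", "kopenski", "Agpfpg"), ("sil", "sila", "Ncfpg")],
]

table = build_term_table(occurrences)
for lemmatised, canonical in table.items():
    print(lemmatised, "\t", canonical or "")
```

In this sketch the first column is the dictionary key and the second is left empty whenever the corpus contains no nominative occurrence of the term.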
This demo was developed by Špela Vintar and Jan Jona Javoršek.
Please send us your comments, questions and bug reports: spela.vintar at ff.uni-lj.si, jona.javorsek at ijs.si.
References:
Vintar, Š. (2010) Bilingual term recognition revisited: The bag-of-equivalents term alignment approach and its evaluation. Terminology 16(2), in print.
Vintar, Š. (2010) Luščenje terminologije iz angleško-slovenskih vzporednih in primerljivih korpusov. In: Vintar, Š. (ed.) Slovenske korpusne raziskave. Ljubljana: Znanstvena založba Filozofske fakultete. 37-53.