Conference article

Creating Data in Icelandic for Text Normalization

Helga Svala Sigurðardóttir

Anna Björk Nikulásdóttir

Jón Guðnason

Published in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), May 31-June 2, 2021.

Linköping Electronic Conference Proceedings 178:45, pp. 404-412

NEALT Proceedings Series 45:45, pp. 404-412

Published: 2021-05-21

ISBN: 978-91-7929-614-8

ISSN: 1650-3686 (print), 1650-3740 (online)

Abstract

There is no natural way to acquire normalized data, so we set out to create data good enough to support more advanced methods for text normalization. We manually annotated the first normalized corpus in Icelandic, comprising 40,000 sentences, and developed Regína, a rule-based system for text normalization. On non-standard words, Regína achieves 90.83% accuracy when measured against the manually annotated corpus. Regína also shows a significant improvement in accuracy over an older normalization system for Icelandic. The normalized corpus and Regína will be released as open source.

Keywords

normalization, under-resourced languages, regular expressions
