Alexander König
Eurac Research, Italy / CLARIN ERIC, the Netherlands
Egon W. Stemle
Eurac Research, Italy
André Moreira
CLARIN ERIC, the Netherlands
Willem Elbers
CLARIN ERIC, the Netherlands
DOI: https://doi.org/10.3384/ecp2020172009
Published in: Selected Papers from the CLARIN Annual Conference 2019
Linköping Electronic Conference Proceedings 172:9, pp. 66-74
Published: 2020-07-03
ISBN: 978-91-7929-807-4
ISSN: 1650-3686 (print), 1650-3740 (online)
In recent years, the reproducibility of scientific research has increasingly come into focus, both for
external stakeholders (e.g. funders) and for the research communities themselves. Corpus linguistics,
with its methods for creating, processing and analysing corpora, is an integral part of many
other disciplines that work with language data and therefore plays a special role. Moreover, language
corpora are often living objects that are regularly improved and revised. At the same time,
tools for the automatic processing of human language are also being developed further, which
can lead to different results with the same processing steps and the same data. This article argues
that modern software technologies, such as version control and containerisation, can mitigate the
following problems: software packaging, installation and execution and, equally important, the
tracking of modifications to a corpus throughout its life-cycle. Together, these measures make
changes to raw data and software tools transparent and thereby enhance reproducibility.