Syntaktisesti koodattu oppijankielen korpus: mahdollisuuksia ja kysymyksiä [Syntactically encoded corpus of the learner language: Opportunities and challenges]

Ilmari Ivaska, Kirsti Siitonen


The encoding of the corpus of advanced Finnish began in autumn 2008 at the University of Turku. The corpus is being converted into XML (Extensible Markup Language) format, following the TEI (Text Encoding Initiative) guidelines given by the Research Institute for the Languages of Finland. The encoding is done cumulatively so that the corpus remains flexible as the project progresses. The first stage focuses on the morphological and lexical level. All word types are listed alphabetically and encoded with morphological information, and the material is lemmatised at the same time. Each word type is encoded only once, and the encoding is then extended to every token of that type. This method is referred to as compiling the corpus dictionary. The second stage is syntactic encoding, in which the morphological encoding is contextualised separately for each informant. In this phase, the corpus is organised by the date of the texts, text entities, paragraphs, sentences and clauses. At the same time, each token is encoded for its syntactic function.
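The idea of encoding each word type once and extending that encoding to every token can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the dictionary entries, the feature labels and the function name `annotate_tokens` are hypothetical and chosen only for illustration.

```python
# Hypothetical type-level dictionary: each word form (type) is annotated once.
# The forms, lemmas and feature labels are illustrative assumptions only.
TYPE_DICTIONARY = {
    "talossa": {"lemma": "talo", "pos": "noun", "case": "inessive"},
    "asuu": {"lemma": "asua", "pos": "verb", "person": "3sg"},
    "hän": {"lemma": "hän", "pos": "pronoun", "case": "nominative"},
}


def annotate_tokens(tokens, dictionary):
    """Extend the type-level annotation to every token occurrence."""
    annotated = []
    for position, form in enumerate(tokens):
        # Look the form up case-insensitively; unknown forms get None.
        entry = dictionary.get(form.lower())
        annotated.append(
            {"position": position, "form": form, "annotation": entry}
        )
    return annotated


sentence = ["Hän", "asuu", "talossa"]
annotated = annotate_tokens(sentence, TYPE_DICTIONARY)
for token in annotated:
    print(token["form"], token["annotation"])
```

In this sketch a form that is missing from the dictionary simply receives no annotation. A form with more than one possible analysis would still need to be resolved in context, which corresponds to the second, syntactic stage described in the abstract.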

Alongside these steps, the material is also annotated with comments on errors and on unidiomatic formal and lexical solutions. In the third stage, the comments are collected and sorted to generate a corpus-based error tagging system. In this way the corpus will be labelled with hierarchical error tags. However, the concept of an error is highly complex, and the error tagging should therefore be regarded only as a starting point for each particular study. A syntactically encoded learner language corpus can provide new knowledge about phenomena that have so far been too demanding to investigate with quantitative methods that are not computer-based. It enables, for example, studies of the frequencies of syntactic structures in advanced learners' Finnish.


encoding, corpus linguistics, Finnish as a second language, syntax


Published by / Kirjastaja:

ISSN 2504-6616 (print/trükis)

ISSN 2504-6624 (online/võrguväljaanne)