The majority of the English text available worldwide is generated by non-native speakers. Learner language poses a variety of challenges and is important both for the study of language acquisition and for Natural Language Processing. Despite the ubiquity of non-native English, there has been no publicly available syntactic treebank for English as a Second Language (ESL). The Treebank of Learner English is a collection of over 5,000 ESL sentences manually annotated with parts of speech (POS) and syntactic dependency trees. It represents upper-intermediate adult English learners from 10 native language backgrounds, with over 500 sentences for each native language. Full syntactic analyses are provided for both the original and the corrected version of each sentence. The treebank supports linguistic and computational research and education on language learning and the automatic processing of ungrammatical language.
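As a rough illustration of how POS tags and dependency trees of this kind might be read programmatically, the sketch below parses one sentence in the tab-separated CoNLL-U format commonly used for Universal Dependencies data. The file format is an assumption: the description above does not say how the treebank is distributed. The sample sentence, the `Token` class, and the function names are hypothetical and are not taken from the treebank.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str     # surface word form
    upos: str     # universal POS tag
    head: int     # index of the syntactic head (0 means root)
    deprel: str   # dependency relation to the head


def parse_conllu_sentence(block: str) -> List[Token]:
    """Parse one CoNLL-U sentence block into a list of tokens."""
    tokens = []
    for line in block.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        tokens.append(Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    return tokens


def dependency_edges(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (head form, relation, dependent form) triples in token order."""
    forms = {t.idx: t.form for t in tokens}
    forms[0] = "ROOT"
    return [(forms[t.head], t.deprel, t.form) for t in tokens]


# Hypothetical sample sentence, invented for illustration only.
SAMPLE_ROWS = [
    ("1", "She", "she", "PRON", "PRP", "_", "2", "nsubj", "_", "_"),
    ("2", "sleeps", "sleep", "VERB", "VBZ", "_", "0", "root", "_", "_"),
    ("3", "well", "well", "ADV", "RB", "_", "2", "advmod", "_", "_"),
]
SAMPLE = "# text = She sleeps well\n" + "\n".join("\t".join(r) for r in SAMPLE_ROWS)

if __name__ == "__main__":
    for head, rel, dep in dependency_edges(parse_conllu_sentence(SAMPLE)):
        print(f"{head} --{rel}--> {dep}")
```

Under the same assumption, the original and corrected versions of a learner sentence could each be parsed this way and their edge lists compared to see how a correction changes the syntactic analysis.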
- Berzak, Y., Kenney, J., Spadine, C., Wang, J. X., Lam, L., Mori, K. S., Garza, S. & Katz, B. (2016) Universal dependencies for learner English, Annual Meeting of the Association for Computational Linguistics (ACL).
Additional Resources:
- Berzak, Y., Kenney, J., Spadine, C., Wang, J. X., Lam, L., Mori, K. S., Garza, S. & Katz, B. (2016) Treebank of learner English annotation manual (draft).