The Learner Corpora of Spoken English: What Has Been Done and What Should Be Done?

Soyeon Yoon 1
1 Incheon National University
Corresponding Author: Associate Professor, Department of English Language and Literature, Incheon National University, 119 Academy-ro, Yeonsu-gu, Incheon 22012, Korea

ⓒ Copyright 2020 Language Education Institute, Seoul National University. This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Received: Feb 28, 2020; Revised: Mar 26, 2020; Accepted: Mar 26, 2020

Published Online: Apr 30, 2020


Spoken learner corpora are fewer in number than written learner corpora; however, demand for spoken corpora has steadily increased. This study surveys the current state of spoken English-language learner corpora, both worldwide and in Korea, with a focus on their size, speakers, and genres. Based on this survey, the study discusses factors to consider when building and publishing a spoken learner corpus, especially with respect to conversation genre, proficiency level, and annotation.

Keywords: spoken corpus; learner corpus; conversation; transcription; annotation


