Data for lexicography: The central role of the corpus

Allan F. Lauder


This paper examines the nature of data for lexicography and, in particular, the central role that electronic corpora can play in providing it. Data has traditionally come from existing dictionaries, from citations, and from the lexicographer’s own knowledge of words, accessed through introspection. Each of these sources is examined and evaluated. The electronic corpus is then considered: different kinds of corpora are described and key design criteria are explained, in particular the size of corpus needed for lexicography and the issues of representativeness and sampling. The advantages and disadvantages of corpora are weighed against those of the other types of data. While each source has its benefits, it is argued that corpora are a requirement, not an option, as data for dictionary making.


Corpus linguistics, lexicography, data, linguistic intuition, citations








This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
