Issues and critiques of corpus design
A major issue in corpus studies is determining the size of the corpus. A corpus needs to be large enough to ensure that sufficient samples of specialised vocabulary occur for analysis or for selection into word lists, for example (see Nation & Sorrell, 2016). For high frequency words, corpora need not be very large, because these words occur often. Similarly, words which are closely related to a specialised field also tend to occur quite frequently: in Plumbing, for example, a very frequently used word in written texts is pipe. Lower frequency words are more problematic, because they do not occur as often, so larger corpora are needed to capture them. Specialised corpora for ESP can help narrow down the analysis of texts and focus on specialised vocabulary in particular (see Krishnamurthy & Kosem, 2007; Ghadessy, Henry & Roseberry, 2001).
Representativeness of the corpus is an important issue in corpus-based research, and it has a number of elements. One question for representativeness is whether the corpus represents the kind of writing, reading or multi-media 'text' which ESP students would be exposed to. For example, Gardner and Davies (2016) take issue with Durrant's (2016) study of the AVL in university student writing in the BAWE corpus, which treats undergraduate student writing as representative of writing in the disciplines; they point out that many other kinds of texts also represent academic writing in the disciplines. This is not to say that investigating student writing is not a valid research activity, but that larger claims need to be based on wider and more representative samples of language. Paquot (2010) focuses on keywords in student writing, arguing that learner corpora shed a different light on academic vocabulary than analyses based on corpora of professional academic writers.
Another example of representativeness can be found in the Language in the Trades (LATTE) project (see Chapter 8). The spoken corpus in this study includes both classroom and on-site recordings in the case of Carpentry. This corpus focuses on teacher talk, mostly for practical reasons: building sites are noisy places; multiple microphones would be needed across an area as large as the building site for a house; and over 30 microphones would be needed to capture the language use of one whole class out of a possible cohort of 120 students. For Automotive Engineering, recordings include classroom sessions where the talk shifts from more engineering-oriented classroom talk about vehicles to broader, more general chat about cars (see Parkinson & MacKay, 2016 for more on talk in the trades). Decisions need to be made about the purpose of the research and how it affects corpus development (see Nation, 2016 for more on taking account of the purpose of research in corpus development).
Miller and Biber (2015) pose another question in relation to representativeness, noting that as corpora get bigger and bigger in word list studies, words occur with different frequency rankings and lists might contain different words. They posit,
Corpus-based vocabulary researchers have paid considerable attention to the validity of their lists, usually evaluated through analyses of their predictive power when applied to a new corpus (i.e. the percent coverage of words in a new corpus accounted for by the words in the list). But reliability is a prerequisite to validity, and, in general, corpus-based vocabulary studies have not included evaluations of reliability: the extent to which we would discover the same set of words, ranked in the same order of importance, based on analysis of another corpus that represents the same discourse domain.
(Miller and Biber, 2015, p. 33)
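The notion of coverage mentioned in this quotation can be illustrated with a short sketch. The word list, corpus text and tokenisation below are hypothetical and deliberately simplified; published studies use principled tokenisation and typically count word families or lemmas rather than raw word forms:

```python
import re

def coverage(word_list, corpus_text):
    """Percentage of running words (tokens) in the corpus
    that are accounted for by items in the word list."""
    tokens = re.findall(r"[a-z]+", corpus_text.lower())
    listed = set(word_list)
    covered = sum(1 for t in tokens if t in listed)
    return 100 * covered / len(tokens)

# Hypothetical example: a tiny 'list' applied to a new 'corpus'
word_list = ["the", "pipe", "water", "of"]
corpus = "The pipe carries water. The flow of water depends on the pipe diameter."
print(round(coverage(word_list, corpus), 1))  # 8 of 13 tokens are covered: 61.5
```

The same function applied to a different corpus from the same discourse domain would test validity in Miller and Biber's sense; their point is that a separate check is needed for reliability, namely whether the list itself would be replicated from a comparable corpus.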
Miller and Biber (2015) use a corpus of Psychology textbooks for university study and experiment with techniques to produce a reliable (that is, replicable) subject-specific word list. This task proved particularly tricky, as texts in corpora can vary in size, topic and number (p. 49). Miller and Biber dealt with texts of different lengths by splitting their corpus in half, resulting in two sub-corpora of about 1.75 million words each (p. 44).
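The general logic of a split-half reliability check of this kind can be sketched as follows. This is an illustration of the idea, not a reconstruction of Miller and Biber's exact procedure; the texts, the ranking method and the overlap measure here are all hypothetical simplifications:

```python
import random
from collections import Counter

def rank_words(texts, top_n=10):
    """Frequency-rank the words in a set of texts and return the top_n."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return [w for w, _ in counts.most_common(top_n)]

def split_half_overlap(texts, top_n=10, seed=0):
    """Randomly split the texts into two sub-corpora, build a
    frequency-ranked word list from each, and report the proportion
    of shared items as a rough reliability estimate."""
    shuffled = texts[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    list_a = rank_words(shuffled[:mid], top_n)
    list_b = rank_words(shuffled[mid:], top_n)
    return len(set(list_a) & set(list_b)) / top_n

# Hypothetical toy corpus of six short 'texts'
texts = ["pipe water flow joint", "pipe valve water", "flow rate pipe",
         "water pipe joint seal", "valve seal flow", "pipe water flow"]
print(split_half_overlap(texts, top_n=4))
```

An overlap close to 1.0 would suggest that the list is stable across comparable samples; a low overlap would indicate the reliability problem Miller and Biber describe, in which a differently composed corpus yields a noticeably different list.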
Different approaches to classification systems in corpus development and analysis can pose problems for comparing results across studies. Decisions need to be made on a principled basis regarding whether a particular field of enquiry fits into one area or another. Becher (1989) classifies academic disciplines in higher education into four categories: hard-pure, hard-applied, soft-pure and soft-applied. This classification was used for the BASE and BAWE corpora. Krishnamurthy and Kosem (2007) compare approaches to classifying academic disciplines across several systems and research studies in corpus linguistics, showing that levels of classification can differ markedly across studies. Compare, for example, the four disciplines of Arts, Commerce, Law and Science in Coxhead (2000); Becher's (1989) hard-pure, hard-applied, soft-pure and soft-applied categories used in the Michigan Corpus of Academic Spoken English (MICASE)/BASE/BAWE suite of corpora (see Nesi & Gardner, 2012); the ten disciplines used by libraries to classify books, drawing on the Dewey Decimal Classification System; and the 19 categories in a classification from the Higher Education Statistics Agency (HESA; see www.hesa.ac.uk/jacs.htm). The purpose, scope and size of a study can determine the classification of a corpus; for example, the AWL (Coxhead, 2000) study was based on divisions and fields of study at a university in New Zealand in 1999, which had a fairly large Law school but no Engineering or Medical school at that stage.
Finally, a particularly important point about corpus analysis is made by Bennett (2010), who states that corpora can only give evidence of what is possible, rather than evidence of what is not possible.