Corpora

A corpus is ‘a collection of pieces of language, selected and ordered according to explicit linguistic criteria in order to be used as a sample of language’ (Sinclair, 1996).

At its simplest, a corpus is a sample of language in use. It consists of all the texts that have been collected and put into it: words, phrases and sentences drawn from sources such as newspapers, magazines, books and speech transcripts. Usually, the sample will be taken from real examples of what people have said or written. Corpora can include spoken language and written language; the language of children, women, and men; language from any corner of the globe; language from a particular year or decade; and more.

Sometimes a corpus has a very specific focus on certain types of language – perhaps spoken language from casual conversations between professional women in London, or language from printed newspapers in the UK – but some corpora (the plural of corpus) consist of many different types of language.

Many corpora also take their texts from a particular period, so they can provide a historical record of how language was used and perhaps how it has changed over time.

Corpora allow us to see patterns and examples in language use. For example, now that most corpora are digitised, you can search for the earliest appearance of a word in the corpus, or count how many times a word appears in one type of writing or speech.

Corpora are very helpful for carrying out statistical studies of language, perhaps to work out how common certain expressions are in relation to other expressions, or if certain words are preferred over others in spoken or written texts.
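
To give a rough idea of what such a comparison involves, here is a minimal Python sketch. It is only an illustration: the two samples are invented, the function names are our own, and real corpus tools handle tokenising far more carefully. Because samples differ in size, the counts are expressed per 1,000 words so that they can be compared fairly.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (straight apostrophes kept)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def per_thousand(word, tokens):
    """Occurrences of `word` per 1,000 tokens, so samples of different sizes can be compared."""
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)

# Two tiny invented samples standing in for spoken and written sub-corpora.
spoken = "well I mean it was kind of nice you know and well we sort of liked it"
written = "The committee considered the proposal and concluded that the plan was sound"

spoken_tokens = tokenize(spoken)
written_tokens = tokenize(written)

for word in ("well", "the"):
    print(
        word,
        round(per_thousand(word, spoken_tokens), 1),
        round(per_thousand(word, written_tokens), 1),
    )
```

With samples this small the figures mean very little; the point is only that a relative frequency, not a raw count, is what lets us compare spoken and written texts.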

Perhaps most importantly, corpora allow linguists, teachers and students to study real language as it is actually used. Instead of making up rules about what people should say or write, we can see what real people have actually said or written and how they have used their language in its context. We can ask ‘What do people actually say?’ and then explore why.

Much of the content on this website is drawn from the ICE-GB corpus (the International Corpus of English, Great Britain component), which is housed at the Survey of English Usage at University College London. 

Corpora: Useful web tools

The following are corpus-related websites which we think are helpful for investigating language.

Wordle

Wordle is a simple-to-use site that lets you paste in your own data and then creates an attractive ‘word cloud’ based on the frequency of the words you’ve used. You can use Wordle as a very simple corpus tool for something like a poem, a song lyric, a political speech or a soliloquy from a play and get a visual representation of the language within it. (See also the lesson entitled ‘Word clouds in action’, which uses Wordle as a way in to analysing a poem.)

http://www.wordle.net/
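
The counting behind a word cloud can be sketched in a few lines of Python. This is not how Wordle itself is implemented; it is a minimal illustration, and the short stop-word list and the function name are our own assumptions. The most frequent words returned are the ones a word cloud would draw largest.

```python
import re
from collections import Counter

# A small, hypothetical stop-word list; real tools use much longer ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "that", "the", "to"}

def word_frequencies(text, top_n=5):
    """Return the top_n most frequent words (stop-words excluded) with their counts."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)

# A sample line of the kind you might paste in from a soliloquy.
sample = "Tomorrow, and tomorrow, and tomorrow, creeps in this petty pace from day to day"

for word, count in word_frequencies(sample, top_n=3):
    print(word, count)
```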

Concordle

This site bills itself as the ‘not so pretty cousin of Wordle’, which is harsh but fair. It works in a similar way to Wordle, allowing you to paste in text of your choice. However, it is more like a ‘proper’ corpus tool as it allows concordancing. In other words, you can click on a word and see it in its context. The only downside is that the clouds it creates are very basic and not as elegant as those on Wordle.

http://folk.uib.no/nfylk/concordle/
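
To show what concordancing means in practice, here is a minimal Python sketch of a ‘keyword in context’ search. It is our own illustration rather than Concordle’s method: the function name, the sample text and the `width` setting are all assumptions, and it ignores punctuation and sentence boundaries.

```python
import re

def concordance(text, keyword, width=3):
    """Return keyword-in-context lines: up to `width` words either side of each hit."""
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # The keyword is bracketed so that it stands out in each line.
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

sample = "The cat sat on the mat. The dog saw the cat and the cat ran away."

for line in concordance(sample, "cat", width=2):
    print(line)
```

Each line shows one occurrence of the search word with a little of what comes before and after it, which is the view a concordancer gives you when you click on a word.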

Google N-grams

Google N-grams are graphs showing how the frequency of particular words or phrases changes over time in the Google Books collection, which includes millions of books from hundreds of years of publishing history. This is particularly interesting for tracing the emergence of new words, the rise and fall of phrases over time, or for comparing synonyms.

https://books.google.com/ngrams

WebCorp

WebCorp is a cunning tool that allows you to use internet news sites as a corpus. The benefits of WebCorp are: it is always up to date, because it searches the latest data available on the web; it provides a concordance facility giving you a preview of key lines in context; and it has a number of selectable search criteria. Again, this is quite simple to use, but searches can be refined to make them more sophisticated.

http://www.webcorp.org.uk/index.html

British National Corpus

This is a huge (100-million-word) corpus of written and spoken language. The free version of it allows you to search for lexical items and will give you up to 50 hits as well as total frequencies. Some simple instructions can be found here: http://www.natcorp.ox.ac.uk/using/index.xml.ID=simple. Even for the free version, registration is required.

A particularly useful exercise for language students is to compare the adjectives that most frequently appear with words like men and women. This can be done by following the 5-minute guided tour of the BNC.

Before jumping to conclusions about the significance of findings like this, it’s always worth having a look behind the headline figures at the context. With the BNC (and other corpora like ICE-GB, COCA and COHA) you can click on the context of each search to find out where the citations come from. In the case of dry (number 6 in adjectives that collocate with men) the bald fact is that the top 15 citations all come from adverts for hairdressers, so are probably not that significant in terms of issues of representation.

The same is true for the appearance of the adjective Salvadorean in the top 10 for collocates of women. A click on the blue, underlined 28 reveals that all 28 entries come from the same source. This is why it’s always important to look at context and think about how and where the examples are used. Perhaps adjectives like armed, pregnant and battered are more significant: an investigation into their contexts would probably tell us more.
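
The habit of checking where hits come from can also be sketched in Python. The example below is a minimal illustration with an invented mini-corpus and our own function name; it is not how the BNC interface works. It counts the words that appear directly before a search word and records how many different sources each one comes from. Note that it does no part-of-speech tagging, so it counts any preceding word, not just adjectives, and it ignores sentence boundaries.

```python
import re
from collections import Counter, defaultdict

def left_collocates(documents, node):
    """Count words directly before `node` and record which sources each came from.

    `documents` maps a source name to its text.
    """
    counts = Counter()
    sources = defaultdict(set)
    for source, text in documents.items():
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
        # Pair each token with the one that follows it.
        for previous, current in zip(tokens, tokens[1:]):
            if current == node:
                counts[previous] += 1
                sources[previous].add(source)
    return counts, sources

# Invented mini-corpus: one source repeats the same phrase several times.
documents = {
    "advert": "Dry men welcome. Dry men cut today. Dry men styled here.",
    "novel": "The young men left early.",
    "news": "Two young men were seen nearby.",
}

counts, sources = left_collocates(documents, "men")
for word, count in counts.most_common():
    print(word, count, "hit(s) from", len(sources[word]), "source(s)")
```

In this made-up data the most frequent collocate comes entirely from one source, while a less frequent one is spread across two, which is exactly the kind of pattern that the headline figures alone would hide.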

Along with searches like this, the BNC has a number of genre search tools which allow the user to narrow down their search to a particular type of speech or writing, such as spoken classroom language, or written fiction.

http://www.natcorp.ox.ac.uk/

Corpus of Contemporary American English & Corpus of Historical American English

These large corpora are useful for exploring all sorts of language questions, including how words have changed over time, their frequency in particular periods and their frequent collocates (which words go, or went, with them). Like the BNC, these corpora are free to use, but registration is required.

http://corpus.byu.edu/coha/

http://www.americancorpus.org/

Oxford English Dictionary

This excellent resource isn’t free, and it’s only a corpus in a very loose sense. But it includes many written examples, drawn from the full history of the English language, for each word it covers. Many schools and local libraries have access to it through the internet. As with the hard copy, you can use the dictionary to search for word meanings, spellings and parts of speech, but with the online version you also get some etymological timelines and links to citations.

http://www.oed.com/
