Corpora: Useful web tools

The following are corpus-related websites which we think are helpful for investigating language.

Wordle

Wordle is a simple-to-use site that lets you paste in your own data and then creates an attractive ‘word cloud’ based on the frequency of the words you’ve used: the more often a word occurs, the larger it appears. You can use Wordle as a very simple corpus tool for something like a poem, a song lyric, a political speech or a soliloquy from a play and get a visual representation of the language within it. (See also the lesson entitled ‘Word clouds in action’, which uses Wordle as a way in to analysing a poem.)

http://www.wordle.net/
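
For anyone curious about what sits behind a word cloud, the counting step can be sketched in a few lines of Python. This is only an illustration of the general idea, not Wordle's own code: the function name `word_frequencies` and the short stop list are our own assumptions, and we assume that very common function words are left out before counting.

```python
import re
from collections import Counter

# A tiny, illustrative stop list (an assumption; real tools use much longer ones).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Count how often each word occurs, ignoring case and common function words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

sample = "To be, or not to be, that is the question"
freqs = word_frequencies(sample)

# The most frequent words are the ones a word cloud would draw largest.
print(freqs.most_common(3))
```

A word cloud then simply scales the size of each word according to its count.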

Concordle

This site bills itself as the ‘not so pretty cousin of Wordle’, which is harsh but fair. It works in a similar way to Wordle, allowing you to paste in text of your choice. However, it is more like a ‘proper’ corpus tool as it allows concordancing. In other words, you can click on a word and see it in its context. The only downside is that the clouds it creates are very basic and not as elegant as those on Wordle.

http://folk.uib.no/nfylk/concordle/
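
The idea of concordancing (showing a word in its context) can also be sketched in Python. This is a minimal illustration under our own assumptions, not how Concordle itself is built: the function name `concordance` and the `width` parameter (the number of words of context on each side) are hypothetical.

```python
import re

def concordance(text, keyword, width=3):
    """Return each occurrence of keyword with `width` words of context on either side."""
    tokens = re.findall(r"[\w']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

sample = ("Tomorrow, and tomorrow, and tomorrow, "
          "creeps in this petty pace from day to day")

# Line up the search word in a central column, as concordancers usually do.
for left, word, right in concordance(sample, "day"):
    print(f"{left:>20} [{word}] {right}")
```

Lining the search word up in a central column makes repeated patterns to its left and right easier to spot.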

Google N-grams

Google N-grams are visual representations of the frequencies over time of particular words or phrases in the Google Books collection, which includes millions of books from hundreds of years of publishing history. This is particularly interesting for tracing the emergence of new words, the rise and fall of phrases over time, or for comparing synonyms.

https://books.google.com/ngrams
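
To show what ‘frequency over time’ means in practice, here is a rough Python sketch that works out how large a share of each year's words a given word accounts for. The tiny `toy_corpus` below is entirely invented for illustration and is not Google's data; the function name `relative_frequency_by_year` is also our own.

```python
# Invented toy 'corpus': year -> texts published that year (not real data).
toy_corpus = {
    1900: ["the wireless is a marvel", "a letter came by post"],
    1950: ["the radio played all day", "the wireless crackled"],
    2000: ["the radio is on", "listen to the radio online"],
}

def relative_frequency_by_year(corpus, word):
    """For each year, return occurrences of `word` divided by all words that year."""
    result = {}
    for year, texts in sorted(corpus.items()):
        tokens = [t for text in texts for t in text.lower().split()]
        result[year] = tokens.count(word) / len(tokens) if tokens else 0.0
    return result

radio = relative_frequency_by_year(toy_corpus, "radio")
wireless = relative_frequency_by_year(toy_corpus, "wireless")

for year in radio:
    print(year, f"radio={radio[year]:.3f}", f"wireless={wireless[year]:.3f}")
```

Dividing by the total number of words for each year matters because far more text survives from some periods than others, so raw counts alone would be misleading when comparing synonyms across time.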

WebCorp

WebCorp is a cunning tool that allows you to use internet news sites as a corpus. The benefits of WebCorp are: it is constantly updated as it runs using the latest data available on the web; it provides a concordance facility giving you a preview of key lines in context; and it has a number of selectable search criteria. Again, this is quite simple to use, but can be refined to carry out more complicated searches.

http://www.webcorp.org.uk/index.html

British National Corpus

This is a huge (100-million-word) corpus of written and spoken language. The free version allows you to search for lexical items and gives you up to 50 hits as well as total frequencies. Even for the free version, registration is required. Some simple instructions can be found here: http://www.natcorp.ox.ac.uk/using/index.xml?ID=simple

A particularly useful way of exploring the BNC for language students might be to compare the most frequent adjectives to appear with words like ‘men’ and ‘women’, a comparison that can be done by following the 5-minute guided tour of the BNC.

Before jumping to conclusions about the significance of findings like this, it’s always worth having a look behind the headline figures at the context. With the BNC (and other corpora like ICE-GB, COCA and COHA) you can click through from each search result to find out where the citations come from. In the case of ‘dry’ (number 6 in adjectives that collocate with ‘men’), the bald fact is that the top 15 citations all come from adverts for hairdressers, so are probably not that significant in terms of issues of representation.

The same is true for the appearance of the adjective ‘Salvadorean’ in the top 10 for collocates of ‘women’. A click on the blue, underlined 28 reveals that all 28 entries come from the same source. This is why it’s always important to look at context and think about how and where the examples are used. Perhaps adjectives like ‘armed’, ‘pregnant’ and ‘battered’ are more significant: an investigation into their contexts would probably tell us more.
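
The point about checking sources can be illustrated with a small Python sketch. It counts the word that comes immediately before a search word and, for each such collocate, records how many different sources it comes from. Everything here is an assumption made for illustration: the sentences and source labels in `toy_sentences` are invented rather than taken from the BNC, the function name `preceding_collocates` is our own, and the sketch does not check that the preceding word is actually an adjective (a real corpus tool would use part-of-speech tags for that).

```python
from collections import Counter, defaultdict

# Invented example data: (source label, sentence). Not taken from any real corpus.
toy_sentences = [
    ("advert_1", "cut and blow dry men from ten pounds"),
    ("advert_1", "wash and blow dry men and boys"),
    ("advert_1", "restyle and blow dry men only"),
    ("novel_1", "the armed men waited outside"),
    ("news_1", "two armed men were seen"),
]

def preceding_collocates(sentences, node):
    """Count the word immediately before `node` and note which sources each count comes from."""
    counts = Counter()
    sources = defaultdict(set)
    for source, sentence in sentences:
        tokens = sentence.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            if word == node:
                counts[prev] += 1
                sources[prev].add(source)
    return counts, sources

counts, sources = preceding_collocates(toy_sentences, "men")
for collocate, n in counts.most_common():
    print(f"{collocate}: {n} hits from {len(sources[collocate])} source(s)")
```

In this invented data, ‘dry’ has the most hits but they all come from a single source, whereas ‘armed’ has fewer hits spread across different sources, which is the kind of difference that only shows up when you look behind the headline figures.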

Along with searches like this, the BNC has a number of genre search tools which allow the user to narrow down their search to a particular type of speech or writing, such as spoken classroom language, or written fiction.

http://www.natcorp.ox.ac.uk/

Corpus of Contemporary American English & Corpus of Historical American English

These large corpora are useful for exploring all sorts of language questions, including how words have changed over time, their frequency at particular times and their frequent collocates (which words go, or went, with them). Like the BNC, these corpora are free, but registration is required.

http://corpus.byu.edu/coha/

http://www.americancorpus.org/

Oxford English Dictionary

This excellent resource isn’t free, and it’s only a corpus in a very loose sense. However, for each word it includes lots of written examples drawn from the full history of the English language. Many schools and local libraries have access to it through the internet. As with the hard copy, you can use the dictionary to search for word meanings, spellings and parts of speech, but with the online version you also get some etymological timelines and links to citations.

http://www.oed.com/

Englicious (C) Survey of English Usage, UCL, 2012-17 | Supported by the AHRC and EPSRC.