The input for types 2 consists of the relevant hits and the corpus metadata, both in a tabular format. As output, the tool produces the results in several different formats, including web pages like the one in Figure 1 above. Our first example focuses on an extremely productive nominal suffix, -er, which is typically used to derive agentive or instrumental nouns from verbs. Our research question is whether there is sociolinguistic variation in the productivity of -er and the related suffix -or in present-day English.

We measure morphological productivity in two basic ways: by type frequency and by hapax frequency. Moreover, new or productive formations can also be studied. This approach is also recommended by Plag. The difficulty is that the number of types does not grow linearly with corpus size, so these frequencies cannot be compared directly across subcorpora of different sizes. Our solution to this is Monte Carlo permutation testing. Rather than trying to compare subcorpora of different sizes with each other, we compare each subcorpus with multiple randomly composed subcorpora of the same size. The idea behind permutation testing is as follows. First we divide the corpus into samples that are large enough to preserve discourse structure.

Then we choose a large number of random permutations, i.e. random orderings of the samples. For each random permutation, we can then trace a type accumulation curve, which shows how the number of types grows as the samples are added one at a time. This will typically result in a banana-shaped plot similar to the one in Figure 2. For any single random permutation, the idiosyncrasies of individual speakers may heavily influence the shape of the curve. However, by going through a large number of random permutations, we can learn what the typical shape of the type accumulation curve is, and how much variation there is from one random permutation to another.
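The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in types 2: the function names are hypothetical, each sample is represented as a pair of its running-word count and the set of types it contains, and the toy figures are invented.

```python
import random

def accumulation_curve(samples):
    """Return (corpus size, number of types) after each sample is added."""
    seen = set()
    size = 0
    curve = [(0, 0)]
    for n_words, types in samples:
        size += n_words
        seen.update(types)
        curve.append((size, len(seen)))
    return curve

def random_curves(samples, n_permutations, seed=0):
    """Trace one accumulation curve per random ordering of the samples."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_permutations):
        order = samples[:]
        rng.shuffle(order)
        curves.append(accumulation_curve(order))
    return curves

def type_range_by_step(curves):
    """Lowest and highest type count seen after k samples, for each k."""
    steps = len(curves[0])
    return [
        (min(c[k][1] for c in curves), max(c[k][1] for c in curves))
        for k in range(steps)
    ]

# Invented toy data: (running words in the sample, types found in it).
toy_samples = [
    (1000, {"dryer", "duster"}),
    (1500, {"bomber", "decoder", "dryer"}),
    (800, {"trimmer"}),
    (1200, {"carver", "bomber"}),
]
curves = random_curves(toy_samples, n_permutations=200)
ranges = type_range_by_step(curves)
```

Every curve starts at the origin and ends at the same point (the whole corpus), while the spread in between reflects how much the ordering of the samples matters; this is what gives the plot its banana shape.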

This is indicated through the shaded areas in Figure 2. In essence, the dark area now identifies the region in which a randomly chosen subcorpus (a randomly chosen subset of samples) would typically fall. Now we can directly compare a subcorpus of a certain social group with a typical random subcorpus (see the triangles in Figure 2). If the null hypothesis holds and the social factor is not connected to the type frequency, we would expect the subcorpus to reside inside the shaded area. However, if we discover that a subcorpus lies outside this region, we have good reason to reject the null hypothesis. The fraction of random permutations that result in equally high or low type frequencies also directly gives us a p-value. As we are interested in studying a large number of different social groups, we are, in essence, testing a large number of hypotheses.
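A minimal sketch of how such a p-value could be estimated is given below. It is an assumption-laden illustration rather than the method implemented in types 2: the function names are hypothetical, and because a prefix of whole samples rarely has exactly the same size as the subcorpus, the sketch rounds the prefix down when testing for a low type frequency and up when testing for a high one, which errs on the conservative side.

```python
import random

def types_at_size(order, target_size, round_up):
    """Type count of a prefix of `order` whose size is close to target_size.

    round_up=False: longest prefix with at most target_size running words.
    round_up=True:  shortest prefix with at least target_size running words.
    """
    seen = set()
    size = 0
    for n_words, types in order:
        if not round_up and size + n_words > target_size:
            break
        size += n_words
        seen.update(types)
        if round_up and size >= target_size:
            break
    return len(seen)

def permutation_p_values(samples, sub_size, sub_types, n_permutations, seed=0):
    """Fractions of random permutations with equally low / equally high type counts."""
    rng = random.Random(seed)
    low = high = 0
    for _ in range(n_permutations):
        order = samples[:]
        rng.shuffle(order)
        if types_at_size(order, sub_size, round_up=False) <= sub_types:
            low += 1
        if types_at_size(order, sub_size, round_up=True) >= sub_types:
            high += 1
    return low / n_permutations, high / n_permutations

# Invented toy corpus in which every sample contains the same single type.
uniform = [(100, {"dryer"})] * 4
p_low, p_high = permutation_p_values(uniform, sub_size=200, sub_types=1,
                                     n_permutations=100)
p_low0, p_high0 = permutation_p_values(uniform, sub_size=200, sub_types=0,
                                       n_permutations=100)
```

A small `p_low` suggests that the subcorpus has fewer types than random subcorpora of the same size, and a small `p_high` that it has more.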

If we simply tested each individual hypothesis at some fixed significance level, we would expect some of the null hypotheses to be rejected purely by chance. To ensure that only a small fraction of the discoveries are false positives, we control the false discovery rate (FDR). In practice, with FDR control in place, we need to obtain very low p-values for each individual hypothesis in order to reject the null hypothesis. Even if we are dealing with a true phenomenon, this poses the additional challenge of estimating very small probabilities accurately in Monte Carlo testing.
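The text does not say which FDR procedure is used. As one common possibility, the sketch below implements the Benjamini-Hochberg step-up procedure; the choice of procedure, the function name and the threshold `q` are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Benjamini-Hochberg: sort the p-values, find the largest rank k such that
    p_(k) <= k * q / m, and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

# Invented p-values for four hypothetical social groups.
decisions = benjamini_hochberg([0.001, 0.04, 0.03, 0.5], q=0.05)
```

With many hypotheses, the threshold for the smallest p-value is `q / m`, which illustrates why very low p-values are needed once FDR control is in place.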

We tackle this simply by choosing a sufficiently large number of permutations: For visualisation purposes e. To estimate the p-values, we will typically use c. Our material consists of the demographically sampled spoken component of the British National Corpus (BNC), which provides conversational data from the early 1990s along with social metadata on the informants. Both gender and social class have been recorded for speakers, who have uttered a total of c. Even if it is getting rather old, the corpus is still an excellent source for sociolinguistic studies of British English.

This has the advantage of providing access to some of the audio recordings on which the corpus is based, which is very useful for checking the search results. To analyse and annotate the results further, we export them into Excel. Thanks to this resource, we know which words ending in -er and -or in the BNC represent genuine instances of the suffixes, which eliminates the need to go through all of our concordance lines manually.


After annotating the data in this manner, we enter it into the types 2 software along with the corpus metadata. We then use types 2 to analyse productivity as described in Section 2. For more details, see types 2: The landing page of the web-based output of types 2 immediately reveals the most significant results (Figure 3). In addition to type and hapax frequencies, we have chosen to calculate token frequencies, i.e. the total number of instances. The results are ordered by p-value, lowest first. We can immediately see that there is a gender difference in the productivity of -er. None of the results for -or are significant after FDR control, perhaps because -or is relatively infrequent, so we do not have enough evidence to draw a conclusion one way or the other.

In addition, none of the significant results are based on hapax legomena, which is to be expected as hapax-based measures require more data than type-based ones.
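To make the three frequency measures concrete, here is a small sketch that counts them for a list of hits. The function name and the toy hits are invented for illustration, and hapax legomena are counted within the given list only.

```python
from collections import Counter

def frequency_measures(hits):
    """Return (token frequency, type frequency, hapax frequency) for a list of words."""
    counts = Counter(hits)
    tokens = sum(counts.values())   # all instances
    types = len(counts)             # distinct words
    hapaxes = sum(1 for c in counts.values() if c == 1)  # words occurring once
    return tokens, types, hapaxes

# Invented toy hits for one subcorpus.
toy_hits = ["dryer", "dryer", "duster", "bomber", "bomber", "bomber", "carver"]
tokens, types, hapaxes = frequency_measures(toy_hits)
```

Because only a fraction of the types are hapaxes, the hapax count is always the smallest of the three, which is consistent with hapax-based measures needing more data.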


Note that the type frequencies are calculated for two different measures of corpus size. Both of these measures may be of interest as they represent slightly different aspects of productivity. We can see that the female underuse as well as male overuse of inanimate -er is significant in terms of both measures, making the phenomenon quite robust.

How can we interpret these results? Let us click on the first row of text in black, showing the female underuse of inanimate -er. More details on the female subcorpus and the dataset of inanimate -er appear at the bottom of the screen. This banana-shaped plot shows confidence intervals for type frequencies at each possible subcorpus size, from zero to the size of the entire corpus.

We can see that as the size of the corpus increases (x-axis), so does the number of types (y-axis), but in a non-linear manner. The female subcorpus is hanging from the bottom edge of the banana, which means that almost all of the randomly composed subcorpora of the same size have a higher type frequency than the female subcorpus. This is also expressed in words below the plot. These plots do not tell the whole story, however. To interpret the results, we need to know how men and women are using -er, and what kind of people they are.

For the women, rather depressingly stereotypically, we get household items like dryer and duster, items of clothing like bloomer and boxer, hair-care items like conditioner and curler, and so on. What about the men? We get masculine-sounding items like bomber, blaster and booster, technical ones like decoder and equaliser, and more common tools like trimmer and carver. Clicking on bomber (Figure 6), we see that many of the speakers are in fact boys under 15 years of age, which would suggest that they may be playing a computer game.

This immediately brings to mind two hypotheses, concerning age and setting. Some of the more technical terms may have been uttered at work, which indeed seems to be the case when looking at the setting information in BNCweb. Moreover, many of the -er words used by women seem to be home-related, so setting is definitely a factor worth investigating.

Figure 6. Exploring the use of the word bomber, commonly used by male speakers.

Among the animate -er words, there are quite a few professions like carpenter and banker.

Some interesting insults also pop up. Let us take a closer look at fucker (Figure 7). Even though there are a couple of young female users of fucker (their concordance lines are marked with white), the bulk of the users seem to be young working-class men (C2 and DE roughly correspond to working class in the BNC metadata coding).

Thus, in addition to gender, age and social class may be relevant social factors in the productivity of animate -er; even though social class did not emerge as significant on its own, perhaps it is the combination of gender and social class that matters.


Figure 7. Exploring profanities in the BNC: gender, age, and social class might matter.

By default, the speakers are sorted by the number of running words they have uttered, which does not tell us much about the extent to which they use -er. Let us sort them by the number of unique animate -er words they use.

By this measure, the top user of -er is a working-class man who is over 60 years of age. We can click on his row to see the concordance lines of his -er instances (Figure 8). We can also click on the name of the sample to view the speaker information in BNCweb (Figure 9). As it turns out, this speaker is 82 and a retired precision engineer. He also seems to use inanimate -er with a considerable frequency (Figure 10); perhaps engineers, who use all sorts of technical gadgets in their work, are more likely to use -er in general?
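Sorting speakers by the number of unique words they use, rather than by running words, can be sketched as follows. The data layout (pairs of speaker identifier and word), the speaker identifiers and the function name are hypothetical and do not reflect the actual output format of types 2.

```python
from collections import defaultdict

def rank_speakers_by_types(hits):
    """Rank speakers by how many distinct words they use, highest first.

    hits: iterable of (speaker_id, word) pairs. Ties are broken by speaker id.
    """
    types_by_speaker = defaultdict(set)
    for speaker, word in hits:
        types_by_speaker[speaker].add(word)
    return sorted(
        ((speaker, len(words)) for speaker, words in types_by_speaker.items()),
        key=lambda item: (-item[1], item[0]),
    )

# Invented toy hits; repeated uses of a word by one speaker count once.
toy_hits = [
    ("S1", "carpenter"), ("S1", "banker"), ("S1", "banker"),
    ("S2", "banker"),
    ("S3", "carpenter"), ("S3", "banker"), ("S3", "teacher"),
]
ranking = rank_speakers_by_types(toy_hits)
```

Counting distinct words instead of instances means that a speaker who repeats one word many times does not rise to the top of the list.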

Figure 10. Concordances of inanimate -er; cf. Figure 8, which shows concordances of animate -er for the same speaker.

Based on our exploration, we may advance some tentative explanations for the male overuse of -er. Firstly, men seem to focus more on tools and occupations than women.



This might be linked to the factor of setting. Secondly, men seem to use more insults ending in -er than women. This may be influenced by factors like age and social class, perhaps also setting and the relationship between the interlocutors (friends at a pub are more likely to engage in banter and playful insults than an employer and employee at work). We wish to emphasise that this is only the starting point for a complete analysis.

With the help of types 2, we have generated new hypotheses regarding possibly relevant social factors, such as age and setting. The next step would be to test the influence of these, possibly in combination with others.

Our second example concerns the nominal suffixes -ness and -ity. These suffixes are typically used to derive abstract nouns from adjectives. While -ness is a native suffix, -ity was borrowed from French and Latin and can be seen as the more prestigious, formal and learned alternative. Our material comes from the 18th-century section of the Corpora of Early English Correspondence (CEEC), which consists of 4, letters written by people, or a total of c.

This period also has the advantage that it can be easily subdivided into and year periods, which are commonly used in sociolinguistic research as 20 years roughly corresponds to one generation. We search the corpus using WordSmith Tools (Scott) and prune the search results down to relevant hits only in Excel. We then use the types 2 software to analyse productivity as in the previous case study (Section 2). Looking at the types 2 landing page, a number of interesting results present themselves. Most importantly, the productivity of -ity is significantly low in the first subperiod, regardless of the measure of corpus size and periodization we use (periods are represented by the starting year only).

We can also see that the productivity of -ity is significantly high in the last period. All this suggests that the productivity of -ity increases over time.

And of course, there are tens of thousands of idioms that one could compare; this is just a tiny sampling with just one word.

More powerful lexical comparisons

In the examples above, we mostly compared exact words or phrases, with some cases of phrases where a given part of speech was used in a particular slot. But you can also do more general searches. There are clear differences in the particular verbs: COCA has poke, stick, tilt, cock, lift, bob, shake, nod, while the BNC has mind, feel, raise. Moving away from the words and phrases with head, the following are just a handful of other lexical comparisons. Comparing such lists of words provides some interesting insight into cultural differences between the two countries as well.

Adjectives used to describe men: COCA nuts, liable, scary, smarter, tougher, relentless, focused, easygoing, low-key, astounded; BNC redundant, wont, spotty, chuffed, dotty, cheeky, posh.

Phrasal verbs with up: COCA ratchet, fess, hike, crank, listen, bust, scare, cuddle, scrounge, rack; BNC nip, stump, plant, top, phone, cash, tot, pluck, cock, mug, bugger, knock.


It's interesting to compare forms of words side by side in the two dialects. We can likewise compare particular grammatical phenomena side by side. These are just a random sample of quick examples; hundreds of other phenomena could be studied in more depth. We can also search for very "narrow" phenomena. And we can tell a lot about the meaning of words from their "collocates", the nearby words with which they occur.

Consider the following differences in meaning. In American English, though, it refers to the British serviette, and this shows up with collocates referring to food and dining, like cocktail, silverware, plates, and cups. In British English, it would be a knob of butter, right? Notice newer words like memory stick as well, which won't occur in a 20-year-old corpus like the BNC. BTW, what are these called in England? This shows the value of having up-to-date texts that reflect recent changes in the language.

Imagine if we were using tiny corpora of just two million words or so for each dialect. Very few of the searches above would be possible, and we'd be reduced to looking at just highly-frequent phenomena, like modals, other auxiliary verbs, and prepositions.