Statistical markers of cognitive decline in language

For most people, reading and writing are natural, a common part of daily life that many will rely on their whole lives. But for some, reading and writing are more than mere utility: they represent a livelihood, a passion for creation that stretches far beyond the mundane. And while it is tragic when anyone suffers from a disease that robs them of the capacity to communicate, it is especially heartbreaking when such a disease affects a writer or storyteller, someone for whom writing is no different from breathing. It is a strange irony, then, that a tool for detecting these diseases could lie in analysing the very art they rob from their sufferers.

Over the last two months, I’ve taken a sidestep from the ins and outs of Folk Evidence to explore something I have found completely fascinating: a study published in recent months by M. Pattison, A. Begde, and T.D.W. Wilcockson. This study, titled Detecting Dementia Using Lexical Analysis: Terry Pratchett’s Discworld Tells a More Personal Story, explores the viability of lexical analysis as a signature of cognitive decline. By algorithmically analysing the writing of Sir Terry Pratchett, famous for the extensive Discworld series, the researchers were able to detect a measurable decline in the complexity of Pratchett’s writing as he aged.

Sir Terry Pratchett was diagnosed with PCA, or Posterior Cortical Atrophy, a form of dementia caused by Alzheimer’s disease. PCA typically affects vision first, before causing difficulties with reading that can progress to complete alexia, along with cognitive and behavioural changes. PCA is a particularly damaging form of early-onset dementia that leads to a distinct cognitive decline, and it is this cognitive decline that the researchers were looking for in their analysis. Here I want to try to explain the way that language can be used as a statistical indicator of mental performance, especially over time.


Language and Statistics

To many, language is about the furthest you can get from statistical thinking, a purely qualitative desert that cannot be quantified. And that is the challenge faced by a researcher attempting to analyse the use of language. But there are some specific techniques used by the authors of this study that can be combined to create defining statistics that essentially describe the variety of language.

In a text, statisticians can distinguish between two things: tokens, and types.

  • A token is each individual occurrence of a word in a text
  • A type is each unique word, counted once no matter how often it appears

Compare the following 10-word sentences.

Sentence 1:
I distinctly recall the earliest moments of my daughter’s life.

  • 10 tokens (I, distinctly, recall, the, earliest, moments, of, my, daughter’s, life)
  • 10 types (I, distinctly, recall, the, earliest, moments, of, my, daughter’s, life)

Sentence 2:
First the cat went in then the cat went out.

  • 10 tokens (First, the, cat, went, in, then, the, cat, went, out)
  • 7 types (First, the, cat, went, in, then, out)

For the first sentence, which is unequivocally the more complex of the two, the TTR (Type-Token Ratio) is 1.0. For the second sentence, arguably less complex than the first, the TTR is 0.7. The higher the TTR, the more ‘lexically rich’ the text.
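The token/type distinction above can be sketched in a few lines of Python. This is a minimal illustration using a naive lowercase whitespace tokeniser; the study’s actual tokenisation rules are not reproduced here.

```python
def ttr(text: str) -> float:
    """Return the Type-Token Ratio of a text: unique words / total words."""
    tokens = text.lower().split()  # tokens: every word occurrence
    types = set(tokens)            # types: each distinct word, counted once
    return len(types) / len(tokens)

sentence_1 = "I distinctly recall the earliest moments of my daughter's life"
sentence_2 = "First the cat went in then the cat went out"

print(ttr(sentence_1))  # 1.0  (10 tokens, 10 types)
print(ttr(sentence_2))  # 0.7  (10 tokens, 7 types)
```

Lowercasing matters here: without it, the two occurrences of “the” in a sentence beginning with “The” would count as two separate types.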

Of course, this is not always true; the following sentences ignore this general rule:

Sentence A:
I sing of arms and of the man

  • 8 tokens
  • 7 types
  • TTR: 0.88

Sentence B:
I like hot dogs because they are nice

  • 8 tokens
  • 8 types
  • TTR: 1.0

Statistically speaking, Sentence B is the more lexically diverse, despite being an incredibly basic sentence without much interest, while Sentence A is the opening line of Virgil’s Aeneid, one of the greatest epic poems ever written. Here it is clear that deliberate repetition in the service of more interesting writing is largely overlooked by the TTR, and there are other issues with it besides.


The major flaw of TTR as a statistical value is that it depends heavily on text length.

  • Short texts will always appear artificially diverse
  • Long texts will appear less diverse

A 50-word paragraph written in a very similar style to a 100,000-word novel will always have a much higher TTR than the novel. TTR comparisons are only fair between texts of similar length. For this reason, the researchers used only Pratchett’s longer books for adults, excluding his shorter works like Eric or The Last Hero.

This is largely because longer texts will, by the nature of our limited vocabulary, tend towards a TTR of zero: there are fewer and fewer new words available once a novel already contains 50,000 of them. It is nigh on impossible to write a 100-word paragraph without repeating words, while it is very easy to write a 10-word sentence without repeating any.
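This shrinkage is easy to see in a quick simulation. The snippet below is illustrative only: it draws texts of increasing length from a hypothetical fixed 1,000-word vocabulary (not real prose), so the “style” never changes, yet the TTR still falls with length.

```python
import random

def ttr(tokens):
    """Type-Token Ratio of a list of tokens."""
    return len(set(tokens)) / len(tokens)

random.seed(0)
vocabulary = [f"word{i}" for i in range(1000)]  # hypothetical 1,000-word lexicon

# Longer samples from the *same* vocabulary repeat more, so TTR drops.
for length in (10, 100, 1000, 10000):
    text = random.choices(vocabulary, k=length)
    print(length, round(ttr(text), 2))
```

A 10,000-token sample can contain at most 1,000 types, capping its TTR at 0.1, while a 10-token sample will almost always score 1.0.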


Introducing MATTR

This is where the study’s main statistic comes in.

To eliminate the bias of text length, the researchers used a measure called MATTR (Moving Average Type-Token Ratio).

This statistical measurement works as follows:

  1. Take the first 100 words in a “sliding window”
  2. Calculate the TTR for that window
  3. Move the window forward one word (words 2–101)
  4. Repeat across the entire text
  5. Average all the TTR values

This statistic means that texts that are at least 100 words long can be compared without any huge difference caused by overall text length. This is the method used in the Pratchett study.
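The steps above can be sketched as follows. This is a minimal illustration of a sliding-window MATTR; the study’s exact preprocessing and window handling are not reproduced here.

```python
def mattr(tokens, window=100):
    """Moving Average Type-Token Ratio: mean TTR over fixed-size sliding windows."""
    if len(tokens) < window:
        raise ValueError("text must be at least one window long")
    ratios = []
    for start in range(len(tokens) - window + 1):
        chunk = tokens[start:start + window]          # e.g. words 1-100, then 2-101, ...
        ratios.append(len(set(chunk)) / window)       # TTR of this window
    return sum(ratios) / len(ratios)                  # average across all windows

# A highly repetitive 200-token "text": every 100-token window has only 4 types.
text = ("the quick brown fox " * 50).split()
print(round(mattr(text), 2))  # 0.04
```

Because every window is the same length, a 60,000-word novel and a 120,000-word novel are scored on equal footing, which plain TTR cannot do.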


Other statistics used in the study were essentially analyses of the MATTR, including:

  • Means
  • Standard deviations
  • t-tests
  • Linear regression

These were used to get a better impression of the actual quantitative effect of Pratchett’s cognitive decline on his writing. This was important in analysing lexical diversity over time.


The main method that the study used to analyse lexical change over time was the comparison of MATTR before and after Pratchett’s diagnosis with PCA.

This involved:

  • Calculating MATTR for each Discworld book
  • Comparing pre- and post-diagnosis works
  • Using t-tests to determine statistical significance
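The comparison step can be sketched with made-up numbers. The MATTR values below are hypothetical, not the study’s data, and only the Welch’s t statistic is computed: turning it into a p-value would need a t-distribution table or a library such as scipy, which is omitted here to stay dependency-free.

```python
from statistics import mean, stdev

# Hypothetical per-book MATTR values (NOT from the study).
pre_diagnosis  = [0.82, 0.81, 0.83, 0.80, 0.82]
post_diagnosis = [0.78, 0.77, 0.79, 0.76, 0.78]

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    va = stdev(a) ** 2 / len(a)   # squared standard error of group a
    vb = stdev(b) ** 2 / len(b)   # squared standard error of group b
    return (mean(a) - mean(b)) / (va + vb) ** 0.5

print(welch_t(pre_diagnosis, post_diagnosis))
```

A large positive t statistic indicates that the pre-diagnosis mean sits well above the post-diagnosis mean relative to the spread within each group, which is what the study reports for noun and adjective diversity.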

Importantly, the study found that after diagnosis, there were statistically significant declines in the diversity of nouns and adjectives used.


The Pratchett researchers found that:

“our study provides evidence that linguistic analysis can be a valuable tool for detecting early signs of cognitive decline”

and that:

“language deficits may be observed many years before a formal diagnosis”

They concluded that Alzheimer’s disease has a “long preclinical period”, estimated at nearly ten years in Sir Terry Pratchett’s case.

However, there has been some scepticism online regarding the validity of these claims. Many have suggested that the progressive reduction in lexical diversity may reflect a refinement of writing style and a deliberate stylistic change rather than a sign of Alzheimer’s.


What’s Next

I struggle to believe that this astoundingly well-established decrease in MATTR values is purely the natural progression of a writer mastering their craft.

Therefore, I have resolved to analyse some other authors who, to my knowledge, experienced no cognitive decline, in order to estimate the likelihood that Alzheimer’s caused Pratchett’s measurable decline in lexical diversity.

Watch this space: next month I’ll post the finished analysis.