August 18, 2020
Vocabulary Richness Ratio: a JavaScript Analyzer for Fiction Texts
I’ve been reading a lot of Yukio Mishima’s fiction lately. In awe of the richness of his vocabulary, I decided to code a little program that analyzes a novel and tells us how diverse the author’s word choices are. Enter Vocabulary Richness Ratio, my newest JavaScript experiment!
Just like my Fiction Complexity Index attempt, the Vocabulary Richness Ratio is:
- available for you to try right here on this page.
- a work-in-progress – in other words, take the results with a grain of salt.
I’ve combined my expertise in literature and creative writing with my interest in coding, and this little program is the result. Again, I must emphasize that it’s only a work-in-progress. It can give you a hint of how rich the vocabulary of your book is, but it’s not an exact science.
Vocabulary Richness Ratio: the Program
Let’s start with the program right away, so you can try it for yourself. Afterwards, I’ll talk about the hows and whys of it.
The code only runs locally – that is, in your own browser as you visit this page. It doesn’t send or save your file anywhere. If you’re interested in it, you can see it on my GitHub page.
Note: The program is hosted on raw.githack. Since it’s a free service, 100% uptime cannot be guaranteed. If the program doesn’t appear below, please try later. The demo texts might also not work (for developers: it’s a CORS issue that I’m too bored to fix).
Click to run the program
The key number to keep in mind is, self-evidently, the Vocabulary Richness Ratio. As the instructions in the program indicate, typical values range between 8% and 13%. Values over 13% indicate a particularly rich vocabulary, whereas values below 8% indicate a repetitive, not-so-rich vocabulary.
How It Works
The code goes through the entire narrative and counts several elements: the total word count, the number of adjectives, the number of unique words and unique adjectives, and a few others.
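To give you an idea, here’s a rough sketch of that counting pass. It assumes RiTa’s tokenize, pos, and isStopWord functions; the function name and the exact filtering in my actual code are a little different.

```javascript
// Rough sketch of the counting pass (assumes the RiTa v2 API:
// RiTa.tokenize, RiTa.pos, RiTa.isStopWord). countElements is an
// illustrative name, not the function in the real code.
function countElements(text) {
  // keep only tokens that contain letters (drops punctuation tokens)
  const tokens = RiTa.tokenize(text).filter(t => /[A-Za-z]/.test(t));
  const tags = RiTa.pos(tokens); // Penn tags; adjective tags start with "jj"
  const words = tokens.map(t => t.toLowerCase());

  // adjectives, skipping stop words
  const adjectives = words.filter((w, i) =>
    tags[i].startsWith("jj") && !RiTa.isStopWord(w));

  return {
    wordCount: words.length,
    adjectiveCount: adjectives.length,
    uniqueWords: new Set(words).size,
    uniqueAdjectives: new Set(adjectives).size,
  };
}
```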
Afterward, it calculates the Vocabulary Richness Ratio as the average of several component ratios (a rough sketch of the calculation follows the list). Briefly, the factors that affect the result are:
- Unique words per word count.
- Unique adjectives per word count, excluding so-called stop words.
- Use of uncommon words.
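Here is a simplified sketch of that averaging step. The actual components and weighting in my code differ a bit, and the count of uncommon words is assumed to come from a separate frequency check.

```javascript
// Simplified sketch of the averaging step; the real code weights the
// component ratios differently. uncommonWordCount is assumed to come
// from a separate word-frequency check.
function vocabularyRichnessRatio(counts, uncommonWordCount) {
  const uniqueWordRatio = counts.uniqueWords / counts.wordCount;
  const uniqueAdjectiveRatio = counts.uniqueAdjectives / counts.wordCount;
  const uncommonWordRatio = uncommonWordCount / counts.wordCount;

  // plain average of the three factors, reported as a percentage
  const ratio = (uniqueWordRatio + uniqueAdjectiveRatio + uncommonWordRatio) / 3;
  return (ratio * 100).toFixed(1) + "%";
}
```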
Moreover, the program returns a list of uncommon adjectives that are used more than once. The idea is to detect tendencies to overuse such adjectives.
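The check itself is simple enough that a short sketch can show the idea. Here, commonWords stands in for whatever frequency list the real code consults to decide what counts as “uncommon”.

```javascript
// Sketch of the repeated-adjective check. commonWords is assumed to be a
// Set built from a word-frequency list; the real code may decide what
// counts as "uncommon" differently.
function repeatedUncommonAdjectives(adjectives, commonWords) {
  const counts = {};
  for (const adj of adjectives) {
    if (!commonWords.has(adj)) counts[adj] = (counts[adj] || 0) + 1;
  }
  // keep only the uncommon adjectives that appear more than once
  return Object.keys(counts).filter(adj => counts[adj] > 1);
}
```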
Like many of my other programs, this one too uses the wonderful RiTa library.
Caveats and Observations
There are some caveats about this Vocabulary Richness Ratio program that you should keep in mind.
- The Vocabulary Richness Ratio code is optimized for novels, and so it is partly influenced by the length of the work. Short stories or poems will likely return a misleadingly high value. This means that results across literary formats (e.g. a short story vs. a novel) are not directly comparable.
- Vocabulary richness is only that; it’s not a qualitative analysis of, say, whether the chosen words suit the context or genre. A sentence like “vacuous hypotheses predict tempestuous elongations” will be rated higher than “I like black dogs and white cats and I don’t like white dogs and black cats”, although the former is semantically meaningless.
- I’m introducing bias by choosing to favor adjectives over, say, nouns, verbs, or adverbs. It’s my experience that adjective choice reveals an author’s range more clearly than other parts of speech, but that’s a judgment call.
Bottom line: feel free to experiment with the program, but don’t take the results too seriously! Vocabulary richness is a fuzzy concept anyway, and programs that try to quantify it can be off the mark.
Vocabulary Richness Ratio: What’s Next
The obvious next thing to do would be to improve accuracy. In other words, I plan to continue testing it with a variety of texts, to see what tweaks I still need to implement.
As I mentioned earlier, if you’re interested in the code, feel free to take a look at the dedicated Vocabulary Richness Ratio page on GitHub.