July 17, 2023
Literary Genre Detector: a Simple AI model in Python
My knowledge in Python is scant compared to JavaScript, though some years ago I did play with it a bit. Still, lately I’ve been interested in AI models, so I decided to give Python another go. As it turns out, it’s trivial to train some simple AI models with it. In today’s post, I’ll show you how I made a very simple literary genre detector.
AI models of this kind work in a very simple manner, conceptually speaking. They simply take as input a list of data the programmer has supplied in the form of [("love","positive"),
and then return guesses for a supplied string. For example, a sentence like “love, care, and blah blah” (in this extremely simple example) would be classified as positive.("care","positive"),
]("hate","negative"), ("rage", "negative")
As you can appreciate, it all boils down to the quality of the data – garbage in, garbage out, and all that. So, with this important caveat in mind, let’s see what a literary genre detector looks like!
Literary Genre Detector: the Basic Structure
If you’re not a programmer, chances are you’re not too interested in the details, so I’ll be very brief. If you are a programmer, you can find the entirety of the code on my GitHub – but first keep reading for that important caveat I referred to. Then feel free to play with the code and improve it further. This literary genre detector isn’t meant as anything remotely serious; only as a minor programming exercise.
The basic idea was to prepare training data – such as the ones you saw in the introduction – for each of various genre categories, such as as romance, crime, fantasy, etc. I quickly realized that, in order to get quality results, I would have to really put in a lot of work getting that training data. Basically, I would need to go find dozens of representative books for each genre, then extract representative samples from each. Doing everything… by the book (no pun intended), they should also be public-domain data.
Way too much hassle for what is little more than a proof of concept.
Therefore, I instead prepared lists of keywords relevant to the genre. Fantasy got words such as ["wizard", "magic", "spell", "enchantress", "sorcerer"]
, whereas romance got ["wedding", "marriage", "bride", "groom", "honeymoon"]
and so on. You get the idea.
Then, I prepared the interface that accepts an external text file containing the text to be checked. The code sends the submitted file to the Python script that checks it against its training data and simply counts “hits”. Then, it presents the result to the user.
If you want to see the code for this simpler version, that’s the main branch of the GitHub repository. If, instead, you’d like to see a way to do some proper AI model training, see this branch. However, as I said, in order for the AI model to deliver worthy results, you’ll need to find some much more elaborate training data than the ones included there.
Limitations and Other Issues
The obvious limitation of such a program is, as I said, its training data. This program isn’t meant as anything remotely serious, so it doesn’t contain any impossibly long training dataset.
Besides these programmatic limitations, there are also literary ones. The most obvious one should be about the very essence of a genre, which is about borders and definitive lines.
In other words, texts that feature genre overlap, that are genre-defying, or that are outright experimental, won’t return accurate results.
And let’s not even try to detect literary fiction! Though I made an attempt to categorize such works based on the use of poetic and literary language, it proved to be unsatisfactory. Some of my works – such as The Other Side of Dreams – were erroneously categorized as romance, whereas others – such as Musings After a Suicide – (equally erroneously) as crime fiction.
Oh, and while we’re at it… If you liked this literary genre detector program, why don’t you take a look at some of my “more serious” efforts?