January 17, 2022
Image to Text with TensorFlow
Machine learning is all the rage in programming these days, so I thought I’d give it a shot myself. In all honesty, the (undoubtedly important) applications of machine learning are mostly outside my immediate interests, at least at present. But for fun, I decided to play with the intriguing TensorFlow.js library and see what I could come up with. In this post we’ll build an image-to-text program with TensorFlow – a silly little program that takes an input image, detects its content, and generates some semi-random text from it.
A complete description of what machine learning and TensorFlow are is beyond the scope of this post. If you’re familiar with them, you don’t need me to tell you anyway. If you’re not but would like to find out more, feel free to read about them on the TensorFlow site (linked above) or Wikipedia.
And if you’re neither familiar nor interested in finding out more, you can still continue reading to see the results we can get. No programming knowledge required for reading; only curiosity.
Image to Text with TensorFlow: the Basics
My initial idea was to use the TensorFlow.js library both to detect the image contents and to generate random text. However, I quickly realized that something like that would be slow; very, very slow. Whereas there are ready-trained models that can find out what an image contains, generating text is a significantly more time-consuming endeavor. There are some ready solutions – indeed, even in API form – but they generally cost money without necessarily being better. If you are aware of a free one (even with limitations) let me know; I’m interested!
So, what I did was the following:
- I used TensorFlow’s ready-trained models to detect what the image contained.
- Having isolated a few keywords, I ran them through the Free Dictionary API – the same one that powers my Word Journey game and narrative adventure.
- I retrieved the examples from each result.
- For more randomization, I interspersed some random text generated with the RiTa library, which uses Markov chains (see my post on the text shuffler program for more info).
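For the curious, the Markov chain idea behind the last step can be sketched in a few lines. This is a plain-JavaScript illustration of the general technique, not RiTa’s actual API and not the code I used; the function names and the tiny sample corpus are made up for the example.

```javascript
// Minimal word-level Markov chain (order 1). An illustration of the idea only;
// RiTa's own implementation and API are different.

// Build a table mapping each word to the list of words that follow it.
function buildChain(text) {
  const words = text.split(/\s+/).filter(Boolean);
  const chain = new Map();
  for (let i = 0; i < words.length - 1; i++) {
    if (!chain.has(words[i])) chain.set(words[i], []);
    chain.get(words[i]).push(words[i + 1]);
  }
  return chain;
}

// Walk the table from a start word, picking a random follower at each step.
// `random` is injectable so the walk can be made repeatable.
function generate(chain, start, maxWords, random = Math.random) {
  const out = [start];
  let current = start;
  while (out.length < maxWords && chain.has(current)) {
    const followers = chain.get(current);
    current = followers[Math.floor(random() * followers.length)];
    out.push(current);
  }
  return out.join(" ");
}

// Hypothetical source text; a real run would use a much larger corpus.
const sampleCorpus = "the boat left the dock and the chain held the boat";
const sampleChain = buildChain(sampleCorpus);
console.log(generate(sampleChain, "the", 8));
```

Every pair of adjacent words in the output also appears somewhere in the source text, which is why the result sounds locally plausible but drifts as a whole.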
Let’s now see a couple of examples. I won’t comment on the code, as I think it’s trivial to put together such a silly example, but if you’re looking for it, here it is:
Examples and Limitations
I began with the cat photo you see above. TensorFlow thought the scene included “shopping cart, broom, cat, bed”. That’s not great; one word is correct (“cat”), and I can see why TensorFlow saw a “bed” there, but it’s still rather poor. Let’s try with another photo:
TensorFlow here detected “chain, dock, boat”, which is a better result. So, let’s see what kind of text we can get by asking the Free Dictionary API for some examples:
- He slid the bolts on the front door and put the safety chain across.
- The gangplank was lowered to the dock.
- They boated through fjords.
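Pulling sentences like these out of a dictionary lookup is mostly a matter of walking the JSON. Here is a minimal sketch, assuming the response is a list of entries, each with `meanings` that carry a `partOfSpeech` and a list of `definitions` with an optional `example` string. That is my understanding of the Free Dictionary API’s shape, so treat the field names as assumptions; the sample data is hand-written, not a real response.

```javascript
// Collect example sentences from a dictionary response, optionally keeping
// only one part of speech. The response shape used here is an assumption.
function collectExamples(entries, partOfSpeech = null) {
  const examples = [];
  for (const entry of entries || []) {
    for (const meaning of entry.meanings || []) {
      if (partOfSpeech && meaning.partOfSpeech !== partOfSpeech) continue;
      for (const def of meaning.definitions || []) {
        // Many definitions have no example at all, so skip those.
        if (def.example) examples.push(def.example);
      }
    }
  }
  return examples;
}

// Hand-written sample in the assumed shape (not actual API output).
const sampleBoatEntries = [
  {
    word: "boat",
    meanings: [
      {
        partOfSpeech: "noun",
        definitions: [
          { definition: "A craft used on water." },
          { definition: "A small vessel.", example: "The boat was still afloat." },
        ],
      },
      {
        partOfSpeech: "verb",
        definitions: [
          { definition: "To travel by boat.", example: "They boated through fjords." },
        ],
      },
    ],
  },
];

console.log(collectExamples(sampleBoatEntries));
console.log(collectExamples(sampleBoatEntries, "noun"));
```

Passing a part of speech as the second argument keeps only the matching usages, at the cost of having even fewer sentences to choose from.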
Notice how, besides TensorFlow’s limitations in recognizing image elements, we have certain grammatical issues, too: “boat” is both a noun and a verb, and the dictionary example uses the verb although the image shows the noun. Putting all the elements together (including the text RiTa generated), the result is this:
The chain has grown to over 4,000 stores nationwide in the last year. This will hurt, I promise. But it is my toy, it brings me such joy! The boat is owned and operated by the San Francisco National Guard, San Francisco law enforcers, and the US Navy. If this then that – that’s what I’ve grown to understand. The sun has fallen under the world now – how long have we been here? The gangplank, with the other gangplank, took to throwing rocks down the road, and, while running, took the other gangplank at it
Not exactly coherent, but it’s fun! We could get more random results using a (paid) API, such as the one from DeepAI – note that I’ve kept the RiTa-generated text the same, to make it easier to see the changes from the excerpt above.
The chain of power in this country, which is its own way of regulating itself. This will hurt, I promise. But it is my toy, it brings me such joy! The boat was still afloat after it stopped. If this then that – that’s what I’ve grown to understand. The sun has fallen under the world now – how long have we been here? The gangplank that is the main obstacle to meeting the mission.
This isn’t any better in terms of coherence; it’s just more random. The dictionary approach is more limited in that respect: as dictionaries contain a finite number of examples, you would quickly get repetitive results from similar images (e.g. ones containing cats).
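Both excerpts follow the same pattern: a keyword sentence, then some RiTa text, then the next keyword sentence, and so on. A minimal sketch of that assembly step might look like this; the function name and its behavior when the filler runs out (it simply wraps around) are my own choices for the illustration.

```javascript
// Alternate keyword sentences with filler sentences, placing filler only
// between keyword sentences and reusing it cyclically if it runs short.
function interleave(keywordSentences, fillerSentences) {
  const parts = [];
  keywordSentences.forEach((sentence, i) => {
    parts.push(sentence);
    const isLast = i === keywordSentences.length - 1;
    if (!isLast && fillerSentences.length > 0) {
      parts.push(fillerSentences[i % fillerSentences.length]);
    }
  });
  return parts.join(" ");
}

console.log(
  interleave(
    ["The boat was still afloat after it stopped.", "The gangplank was lowered to the dock."],
    ["This will hurt, I promise."]
  )
);
```

Since the filler stays the same between runs here, only the keyword sentences change from one image to the next, which is what makes the two excerpts easy to compare.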
Image to Text with TensorFlow: Worth the Effort?
The answer is, I don’t know. It’s a fun experiment. It has obvious limitations, though to some extent they’re a result of my implementation, not the technology itself. In other words, if you had enough time (and the motivation that comes with actually having some proper use for this), you could definitely get much better results. This is only meant as a silly example, not to be taken too seriously.
Machine learning does have its uses, especially in image recognition. Text is another matter (writing and art are two concepts that are not synonymous), but your mileage may vary. At the very least, it’s a nice programming exercise, so I consider this time well spent!