AI

Image Alt Text Generation Using Image Recognition

Website developers and content creators often overlook one of the most vital aspects of improving a site’s accessibility and search engine optimization (SEO): image alt text.

Images and videos are crucial for user engagement on any website. For example, product images from different angles or a 360-degree video can result in higher conversion rates on an e-commerce website. If you have a news site or digital magazine, visitors are more likely to read articles with images or videos accompanying the content. In fact, content that includes images produces a 650% higher engagement rate than text-only posts. It is therefore of utmost importance to include images in your webpage content.

Creating descriptive text for images is a daunting task, especially if your webpage contains a large number of images. In this article, we’ll take a look at some of the ways you can automatically generate alt text for images by leveraging the power of image recognition.

What is Image Alt Text & Why Is It Important?

Alt text (alternative text), also called the “alt attribute”, is a small piece of HTML that describes the appearance and purpose of an image on a webpage.

In the example above, the highlighted text shows the alt text of the photo on the left.

When deciding on the wording of an alt text description, focus on writing useful, information-rich text that uses keywords appropriately and is relevant to the page content. Avoid keyword-stuffing alt text descriptions: not only does it harm the user experience, it may also cause Google to classify your content as spam, as described in our previous article.
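To make the markup concrete, here is a minimal Python sketch that builds an image element with an alt attribute. The helper name `img_tag` and the sample file path are assumptions made for illustration; the point is only that the alt text lives in the `alt` attribute and should be escaped before it is written into HTML.

```python
from html import escape


def img_tag(src: str, alt: str) -> str:
    """Build an <img> element whose alt attribute is safely escaped."""
    # escape(..., quote=True) also converts quotation marks, so descriptive
    # text containing quotes cannot break out of the attribute value.
    return f'<img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">'


# Hypothetical product image, used only as an illustration.
print(img_tag("/images/red-sneakers.jpg", "Red canvas sneakers with white laces"))
```

A description written this way says what the image shows and stays relevant to the page, without repeating keywords.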

Although alt text is so unobtrusive that it may not seem to have any influence on the typical visitor, it has several significant uses, including:

Convenience for Screen Readers

Suppose a webpage has plenty of pictures but none of them includes alt text. A person browsing with a screen reader would only hear the word “image”, which is not useful. They would know that there is a picture, but not what it shows.

Adding alt text allows screen readers to help visually impaired people “see” what is present and better understand the page content.

Show Description if a Visual Doesn’t Load

Images often fail to load properly because of slow or faulty internet connections. In that case, the alt text is shown on the webpage in place of the “broken” image, giving visitors a text substitute.

Contribute to SEO Performance

Image alt text improves SEO performance as well. It provides better visual context to search engine crawlers, helping them index an image correctly.

Search engines like Google cannot interpret images as reliably as they interpret text, so they depend largely on alt text to understand the images on your website. Including optimized alt text on your images helps your pages get indexed and rank better in different search engines.

Although alt text does not exactly help a website or webpage shoot up to the first position on the search engine results page (SERP), it’s an opportunity to include important keywords that may contribute to the overall SEO performance of the site.

The Best Tools to Generate Image Alt Text Automatically

Now that you know how important alt text is, hopefully, you can add appropriate alt text while developing and uploading content. However, if your archives have missing alt text, that could be a challenge.

Trying to generate detailed alt descriptions for a huge backlog of images can be overwhelming, particularly if you’re facing strict deadlines or have to juggle several tasks.

What if you could add alt text right when the picture is being uploaded? And what if you could check a webpage for missing alt text and add it automatically?

Now that’s possible – all thanks to image recognition technology!
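The checking half of that idea, finding images with missing alt text, needs no image recognition at all. Below is a minimal sketch using only Python's standard library; the class and function names are assumptions chosen for this example. It reports the `src` of every image whose alt attribute is absent or blank, which is the list you would then hand to one of the captioning services described below.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> with no alt attribute or a blank one."""

    def __init__(self) -> None:
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src") or "")


def find_images_missing_alt(html: str) -> list:
    """Return the src values of images that still need alt text."""
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing


page = '<p><img src="a.jpg" alt="A cat"><img src="b.jpg"><img src="c.jpg" alt=" "/></p>'
print(find_images_missing_alt(page))
```

One caveat: an intentionally empty `alt=""` is a valid way to mark purely decorative images, so a real audit may want to review blank values rather than fill every one of them automatically.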

Image recognition (or computer vision) isn’t a new concept. Corporations like Microsoft, Google, and IBM offer publicly accessible APIs that developers can use to classify images and to read the text they contain.

Some of the available solutions are:

Azure’s Computer Vision API

Azure’s Computer Vision API is an artificial intelligence (AI) service that analyzes content in photos and videos. It can help you improve content discoverability, perform real-time video analysis, and automate text extraction. It leverages visual data processing to create image descriptions, label content with objects and concepts, and extract text.

Its text extraction (OCR) feature enables you to extract even printed and handwritten scripts from photos and documents with mixed languages and writing styles. You can automatically classify over 10,000 objects and concepts in your visuals.

Many developers have already used this API and produced their own plugins to create alt text. For example, Sarah Drasner’s generator uses Azure’s Computer Vision API to generate alt text for any image by uploading it or entering its URL.

Another example is Jacob Peattie’s Automatic Alternative Text, a WordPress plugin that uses Azure’s Computer Vision API. It hooks into the upload workflow, so that alt text is generated automatically whenever you upload a photo.

The algorithms of Computer Vision API analyze the content found in an image and generate complete sentences of human-readable language describing what is found in the image. These plugins fetch this description and add it as the alt text for every image uploaded while the plugin is active.
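As a rough illustration of what such a plugin does, the sketch below prepares a request to the service and picks a caption from the reply. It assumes the v3.2 REST "describe" operation, the `Ocp-Apim-Subscription-Key` header, and a response containing `description.captions` entries with `text` and `confidence` fields; check these against the current Azure documentation before relying on them. The function names, the endpoint value, and the 0.5 threshold are hypothetical choices for this example.

```python
import json
import urllib.request
from typing import Optional


def build_describe_request(endpoint: str, key: str, image_url: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the assumed v3.2 'describe' operation."""
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/vision/v3.2/describe?maxCandidates=1",
        data=json.dumps({"url": image_url}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def caption_from_response(payload: dict, min_confidence: float = 0.5) -> Optional[str]:
    """Return the most confident caption, or None if nothing is reliable enough."""
    captions = payload.get("description", {}).get("captions", [])
    if not captions:
        return None
    best = max(captions, key=lambda c: c.get("confidence", 0.0))
    if best.get("confidence", 0.0) < min_confidence:
        return None
    return best["text"]


# Sample reply in the assumed response shape (not real service output).
sample = {"description": {"captions": [{"text": "a dog running on a beach", "confidence": 0.87}]}}
print(caption_from_response(sample))
```

Sending the prepared request with `urllib.request.urlopen` and decoding the JSON body would supply the `payload`. Returning `None` for low-confidence captions leaves room for a human to write the alt text instead of publishing a poor guess.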

Amazon Rekognition Custom Labels

Amazon Rekognition Custom Labels streamlines the analysis of photos and videos using deep learning, without requiring machine learning expertise on your part. You only have to provide the photos you want to generate alt text for, and the service manages the rest.

You can detect the items and scenes in visuals according to your requirements with this platform. For instance, you can identify logos in social content, detect items on aisles, or identify animated characters in videos.

With Amazon Rekognition Custom Labels, you can recognize items, individuals, text, scenes, and activities in photos and videos, along with detecting any unsuitable content. In images and videos, text appears in a different way than neatly written words on a printed page. The tool can read lopsided and distorted content to capture info such as brand names, street signs, text overlapped on images, and descriptions on product packaging.
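A minimal sketch of turning detected labels into a short alt text follows. It assumes the `boto3` SDK, an already trained and running Custom Labels model (identified by a project version ARN that you supply), and a response whose `CustomLabels` entries carry `Name` and `Confidence`. The function names and the 70% threshold are hypothetical choices for this example.

```python
def detect_custom_labels(image_bytes: bytes, project_version_arn: str,
                         min_confidence: float = 70.0) -> dict:
    """Call Rekognition Custom Labels; requires AWS credentials and boto3."""
    import boto3  # imported lazily so the helper below works without the SDK

    client = boto3.client("rekognition")
    return client.detect_custom_labels(
        ProjectVersionArn=project_version_arn,
        Image={"Bytes": image_bytes},
        MinConfidence=min_confidence,
    )


def alt_text_from_labels(response: dict, max_labels: int = 3) -> str:
    """Join the most confident label names into a short description."""
    labels = sorted(
        response.get("CustomLabels", []),
        key=lambda label: label["Confidence"],
        reverse=True,
    )
    return ", ".join(label["Name"] for label in labels[:max_labels])


# Sample response in the assumed shape (not real service output).
sample = {"CustomLabels": [
    {"Name": "company logo", "Confidence": 80.0},
    {"Name": "store shelf", "Confidence": 95.5},
]}
print(alt_text_from_labels(sample))
```

A comma-separated list of labels is cruder than a full sentence, so it works best as a draft that an editor refines.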

Amazon Rekognition Celebrity Detection

If you’re uploading images of celebrities, Amazon Rekognition can quickly identify well-known people in your photos and stored videos. It can identify thousands of personalities across an extensive range of categories, such as politics, entertainment and broadcasting, sports, and commerce.

The Amazon Rekognition celebrity recognition API can identify celebrities in different surroundings, outfits, and other situations. To identify celebrities within photos and get additional info about identified personalities, you can use the RecognizeCelebrities operation.

For instance, in the entertainment industry, where collecting info can be time-sensitive, the RecognizeCelebrities operation can help you detect up to 64 personalities in a photo and, if available, return links to personality pages.

Consider using StartCelebrityRecognition to start video analysis to identify celebrities in a stored video.
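For still images, the result of RecognizeCelebrities can be reduced to an alt text along these lines. The sketch assumes the `boto3` SDK and a response whose `CelebrityFaces` entries include `Name` and `MatchConfidence`; the function names, the 90% threshold, and the fallback wording are hypothetical choices for this example.

```python
def recognize_celebrities(image_bytes: bytes) -> dict:
    """Call RecognizeCelebrities; requires AWS credentials and boto3."""
    import boto3  # imported lazily so the helpers below work without the SDK

    client = boto3.client("rekognition")
    return client.recognize_celebrities(Image={"Bytes": image_bytes})


def celebrity_names(response: dict, min_confidence: float = 90.0) -> list:
    """Keep only the names whose match confidence clears the threshold."""
    return [
        face["Name"]
        for face in response.get("CelebrityFaces", [])
        if face.get("MatchConfidence", 0.0) >= min_confidence
    ]


def celebrity_alt_text(response: dict, fallback: str = "Photo of a person") -> str:
    """Turn the recognized names into a short, readable alt text."""
    names = celebrity_names(response)
    if not names:
        return fallback
    if len(names) == 1:
        return f"Photo of {names[0]}"
    return "Photo of " + ", ".join(names[:-1]) + " and " + names[-1]


# Sample response in the assumed shape, with placeholder names.
sample = {"CelebrityFaces": [
    {"Name": "Person A", "MatchConfidence": 99.1},
    {"Name": "Person B", "MatchConfidence": 97.4},
    {"Name": "Person C", "MatchConfidence": 52.0},
]}
print(celebrity_alt_text(sample))
```

Filtering on match confidence matters here more than elsewhere, since naming the wrong person in alt text is worse than naming no one.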

Google Chrome Extensions

You’ll find several alt text generation extensions in the Google Chrome web store. One such extension is Auto Alt Text, which generates descriptive captions for images. It uses artificial intelligence to caption photos with a simple right-click, analyzing a photo and detecting the contents of the scene within five seconds.

Auto Alt Text is based on the im2txt model developed by Oriol Vinyals and others for the 2015 MSCOCO Image Captioning Challenge. It uses an encoder-decoder neural network: a deep convolutional network coupled with a Long Short-Term Memory (LSTM) network. An LSTM is a type of recurrent neural network designed to learn order dependencies in sequence prediction problems. For example, it can learn sentence structure and how the words in a sentence depend on one another.

The deep convolutional network first encodes a photo into a vector representation using Inception v3 (a renowned image recognition model). The LSTM then generates the caption word by word, conditioned on the Inception v3 encoding.
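The decoding step can be pictured with a toy example. The sketch below is not the im2txt code: a small lookup table stands in for the trained LSTM, and a plain string stands in for the image encoding, both purely hypothetical. It only shows the loop structure of greedy decoding, where the most likely next word is chosen repeatedly until an end marker appears.

```python
START, END = "<s>", "</s>"

# Hypothetical stand-in for a trained decoder: maps (image code, previous word)
# to next-word scores. A real LSTM carries a hidden state summarizing the whole
# prefix, not just the previous word.
TOY_MODEL = {
    ("dog", START): {"a": 0.9, "the": 0.1},
    ("dog", "a"): {"dog": 0.8, "cat": 0.2},
    ("dog", "dog"): {"running": 0.6, END: 0.4},
    ("dog", "running"): {END: 1.0},
}


def next_word_scores(image_code: str, previous_word: str) -> dict:
    """Look up next-word scores; unknown contexts end the caption."""
    return TOY_MODEL.get((image_code, previous_word), {END: 1.0})


def greedy_caption(image_code: str, max_words: int = 20) -> str:
    """Pick the highest-scoring next word until the end marker is chosen."""
    words = []
    previous = START
    for _ in range(max_words):
        scores = next_word_scores(image_code, previous)
        best = max(scores, key=scores.get)
        if best == END:
            break
        words.append(best)
        previous = best
    return " ".join(words)


print(greedy_caption("dog"))  # prints "a dog running"
```

Captioning systems often replace the greedy choice with beam search, which keeps several candidate sentences alive at each step instead of only one.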

The developer turned the model into an API, paring it down to fit on a Lambda instance so that it could stay loaded in memory and respond in under five seconds.

Imagga Auto-Tagging

The Imagga Auto-Tagging API is a set of technologies that help understand and analyze images. It is accessible as a web service allowing you to automate the process of analyzing, consolidating, and searching through huge collections of unstructured images.

The vendor hosts the services at their end, which makes them flexible, cost-effective, and scalable (applicable to any size of image collection).

You can obtain several keywords describing a given image with a simple GET request to the auto-tagging endpoint (/tags). If the request succeeds, the service returns a list of tags in the response, each with a confidence level.

These levels are expressed as percentages: 100% indicates that the API is completely sure about the relevance of a tag, while a confidence below 30% means there is a higher probability that the tag does not apply.
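The 30% guideline translates directly into a filter. The sketch below assumes the v2 endpoint at `https://api.imagga.com/v2/tags` with an `image_url` query parameter, and a response of the form `{"result": {"tags": [{"confidence": ..., "tag": {"en": ...}}]}}`; verify both against Imagga's documentation. The function names are hypothetical, and authentication is left out.

```python
from urllib.parse import urlencode


def tags_request_url(image_url: str) -> str:
    """Build the GET URL for the assumed v2 auto-tagging endpoint."""
    return "https://api.imagga.com/v2/tags?" + urlencode({"image_url": image_url})


def confident_tags(response: dict, min_confidence: float = 30.0, language: str = "en") -> list:
    """Return tag names at or above the threshold, most confident first."""
    tags = response.get("result", {}).get("tags", [])
    kept = [
        (item["tag"][language], item["confidence"])
        for item in tags
        if item["confidence"] >= min_confidence and language in item["tag"]
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in kept]


# Sample response in the assumed shape (not real service output).
sample = {"result": {"tags": [
    {"confidence": 45.0, "tag": {"en": "snow"}},
    {"confidence": 61.4, "tag": {"en": "mountain"}},
    {"confidence": 25.0, "tag": {"en": "beach"}},
]}}
print(confident_tags(sample))  # prints ['mountain', 'snow']
```

Tags are keywords rather than sentences, so the filtered list is a starting point for alt text, not a finished description.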

Clarifai

Clarifai is a powerful tool that allows you to search for individuals, places, items, and topics in your photos, videos, and text. It uses artificial intelligence to generate richer shot detection annotations in videos to improve object and context findability in scenes. You can gain insights into where objects appear in scenes to improve searchability.

By providing richer scene detection, it allows you to validate logos and product placements to support marketing. With its AI-automated metadata description, you can expect to reduce tagging errors by 80% and get more meaningful descriptions.

Cloudsight

Cloudsight’s AI instantly describes images with human-like captions. It applies computer vision right out of the box to offer thorough object recognition. All you have to do is send your visuals to its API and it will generate a natural language description.

This is particularly useful if you have an e-commerce store, as it allows you to list products on your platform just by taking a photo. Instead of worrying about writing product descriptions, you can let the tool automatically recognize what you want to sell.

You can also identify objects in the environment by simply panning your phone’s camera around. This eliminates the need to take individual photos.

With Cloudsight’s on-device computer vision model, you can expect an average response time of under 250 ms. This is more than 4x faster than using their API and doesn’t need an internet connection.

Key Takeaway

These are just a few tools that can help you simplify and expedite the process of content management, editing, and maintenance. You no longer have to think of a descriptive text for every image as the machine can handle it on your behalf.

Alt texts are an important part of your overall SEO performance. Our technical SEO audit services can help you identify and fix missing alt texts and attributes. Get in touch to request a quote.

Michaela Tromans-Jones Spiteri

Dr Michaela Spiteri BEng, MSc, PhD (AI / Healthcare domain), is a well-published researcher in the field of AI and machine learning. She is the founder of AI consultancy Analitigo Ltd. and currently works as the lead researcher at Gainchanger.
