How to Map 404 URLs at Scale with Sentence Embeddings

Let us start by stating a well-known fact: 404 errors are bad. They hurt the user experience on your website, and search bots may also penalize your site for them. The best way to fix this is to map each 404 URL to a relevant live page.

When a visitor requests a webpage that does not exist, the server returns a 404 Page Not Found error. When search bots encounter a 404 error on the same URL numerous times, they will deindex that webpage from the search results.

In this article, you’ll learn what 404 URLs are and how you can redirect them to relevant pages automatically. The technique relies on the Universal Sentence Encoder, a neural network model that turns text into sentence embeddings.

What are 404 Errors?

A 404 is an HTTP status code that the server returns to tell a visitor that the webpage they are looking for can’t be found. This can happen because the visitor mistyped the URL or because the webpage never existed.

Moreover, a 404 can occur if the webpage the visitor is seeking was live in the past but has since been removed, or its page name and URL were changed.

Example of 404 error

Source: muffingroup

The biggest drawback of 404 errors is that a large volume of them in a short period can hurt your website’s overall search ranking.

Moreover, 404s are bad for your website visitors as they create a poor user experience. Redirects have a cost too, as they may drain your crawl budget. Every redirect that a crawling bot follows uses up part of that budget, and if there are too many redirects, the bot will simply stop following them.

The easiest way to discover 404 errors on your site is to use Google Search Console to monitor crawl errors. It shows you where Google ran into errors while crawling your website. This information tells you where the 404s are occurring so that you can fix them.

Example of Crawl Error

Source: Google

If you have some 404 URLs on your site, this fact alone doesn’t hurt you or weigh against you in search results. However, there may be good reasons to fix certain kinds of 404s. For instance, if some of the pages returning 404s are important to you, you should explore why Google is seeing 404s when its bots crawl them.

If you see a spelling mistake of a genuine URL (www.xyz.com/awsome in place of www.xyz.com/awesome), it is possible that somebody wanted to link to you and just made a typographical error.

Rather than returning a 404, you may prefer to forward the user to a different URL. This is known as a redirect. A permanent redirect is known as a 301 redirect, and you can use it to map a misspelled URL to the right URL and capture the intended traffic from that link. You can also ensure that, when visitors do land on a 404 page on your website, you help them find what they’re looking for instead of just saying “404 Page Not Found.”
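To make the difference between a 301 and a 404 concrete, here is a minimal sketch of the decision a server makes for each requested path. The `REDIRECTS` map, the `LIVE_PATHS` set, and the `respond` function are hypothetical names used only for illustration; a real site would configure this in its web server or CMS.

```python
from http import HTTPStatus

# Hypothetical redirect map: misspelled or retired paths -> live paths.
REDIRECTS = {
    "/awsome": "/awesome",
}

# Hypothetical set of paths that currently exist on the site.
LIVE_PATHS = {"/", "/awesome"}


def respond(path):
    """Return (status, headers) for a requested path."""
    if path in LIVE_PATHS:
        return HTTPStatus.OK, {}
    if path in REDIRECTS:
        # 301 tells browsers and crawlers that the move is permanent.
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": REDIRECTS[path]}
    # No live page and no mapping: the visitor gets a 404.
    return HTTPStatus.NOT_FOUND, {}
```

The rest of this article is about filling a map like `REDIRECTS` automatically instead of by hand.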

How to Map 404 URLs

If you want to get more SEO traffic to your site, a reliable way is to redirect important URLs that result in 404s to relevant ones. Typically, these URLs still receive traffic and may have important external links coming in.

Many people map 404 URLs by redirecting all of them to the homepage or a dynamic search result. However, this approach has several drawbacks including:

  • Redirecting all pages to your homepage can be perplexing for visitors if they’re trying to go to a particular page and keep landing on the homepage.
  • Google may treat blanket redirects of nonexistent URLs to your homepage as soft 404s rather than as genuine redirects.
  • If you have a large website, you won’t get any benefit from redirecting all 404s. The server will have to process a large number of redirects, which could slow down your site.

The right method is to map each 404 URL separately to a relevant webpage if such a webpage is present. Yet, this procedure can be very tiresome, laborious, and costly if you have to do it manually.

Often, you may have to depend on the website’s default internal search engine to find candidates, and it rarely returns good matches.

A more effective approach is to automate the process using a neural matching method called sentence embedding.

What is Sentence Embedding?

Sentence embedding involves a set of methods in natural language processing (NLP) that map sentences to vectors of real numbers. In simple words, it represents whole sentences and their semantic information as vectors.

This helps the machine capture the context, intent, and other nuances of the whole text. Deep learning libraries such as PyTorch and TensorFlow can be used to build and run the models that produce these embeddings.

A simple method to achieve sentence embedding is to average the embeddings of words in a sentence and use that average as the representation of the full sentence. However, this method has some drawbacks.

First, there’s a loss of information. Also, word order is ignored. Averaged word vectors give you 100% similarity even if you switch the order of words in a sentence (which can change the meaning of the sentence).

While it is possible to tackle these challenges manually (for example, by skipping stop words and concatenating embeddings), it can be time-consuming and inefficient. That’s why automating the process is a smarter option.
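The order problem with averaged word vectors is easy to reproduce. The sketch below uses three invented toy word vectors (real word embeddings have hundreds of dimensions) and compares two sentences that contain the same words in a different order.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
WORD_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "bites": [0.1, 0.8, 0.3],
    "man": [0.2, 0.1, 0.9],
}


def average_embedding(sentence):
    """Represent a sentence as the element-wise mean of its word vectors."""
    vectors = [WORD_VECTORS[word] for word in sentence.lower().split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


first = average_embedding("dog bites man")
second = average_embedding("man bites dog")

# The two sentences mean different things, yet the similarity is 1.0
# (up to floating-point rounding) because averaging discards word order.
similarity = cosine_similarity(first, second)
```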

Using Sentence Embedding to Map 404 URLs

Here are a few steps to help you automatically map 404s with sentence embedding:

Download URL Sets

The first step is to get all 404 URLs. There are several methods to do so. For instance, you can run a website crawl or download 404s from Google or Bing Search Consoles.

Another efficient way to get 404 URLs is the Ahrefs’ Broken Backlinks tool. You can find a complete list of all 404 pages that a site has in Ahrefs Site Explorer’s “Best by links” report by applying the “404 not found” filter.

Ahrefs' Broken backlinks tool

However, Google Search Console will likely list more 404s for mapping.

Next, you need a set of all valid site URLs, preferably canonical URLs, which you can get by downloading the URLs listed in your XML sitemap.

If you don’t have XML sitemaps, you can run a conventional SEO crawl to get the URLs.
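As a rough sketch of the sitemap route, the snippet below pulls every `<loc>` entry out of a standard XML sitemap using only the Python standard library. The `urls_from_sitemap` function name and the `SAMPLE_SITEMAP` contents are made up for illustration; in practice you would pass in the text of your own sitemap file.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the URLs listed in the <loc> elements of a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical two-page sitemap used as sample input.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/blog/seo-tips</loc></url>
</urlset>"""

valid_urls = urls_from_sitemap(SAMPLE_SITEMAP)
```

Note that this handles a plain sitemap only; a sitemap index file that points to other sitemaps would need one extra pass.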

Upload the URL Sets to Google Drive

You may have the set of URLs you wish to redirect in the form of a CSV file (or a spreadsheet that can be converted to CSV format). You can access such files from Google Colab in different ways, including uploading them directly from your hard drive. They can also be uploaded to Google Drive and accessed from Colaboratory.

If you have a shareable link from Google Drive, you can use the following code to download the files to your Google Colaboratory environment.

Code to upload URL set to google drive

To keep your files private, you may use this code.

Code to keep the uploaded URL set in Google Drive private
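One possible way to handle the private option is to mount your Google Drive inside Colab and read the CSV from there. The sketch below is an assumption about how that could look, not the code from the screenshots: `mount_drive_if_in_colab` and `read_url_column` are hypothetical helper names, and the `url` column name is a guess about how your export is laid out.

```python
import csv


def mount_drive_if_in_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; do nothing elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)  # prompts for authorization, so files stay private
    return True


def read_url_column(csv_path, column="url"):
    """Read one column of URLs from a CSV file, skipping empty cells."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [
            row[column].strip()
            for row in csv.DictReader(handle)
            if (row.get(column) or "").strip()
        ]
```

In Colab, you would call `mount_drive_if_in_colab()` once and then pass a path under `/content/drive/MyDrive/` to `read_url_column`.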

Convert URL Paths to Phrases

While you may try to match the web pages using their content, 404 pages do not have content that can be used for matching purposes.

However, it is possible to match 404 pages with appropriate web pages on the website by leveraging the meta-information present in URLs. This is a simple approach that works quite well for all websites except those whose URLs carry no descriptive text, such as URLs made up of numbers only.

Just follow these two steps:

  • Extract only the path of the URLs.
  • Change slashes, underscores, and hyphens to spaces to extract the text in the URLs.

This will help you convert URL paths into phrases.
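The two steps above can be sketched in a few lines of standard-library Python. The `url_to_phrase` function name is hypothetical, and the sample URL reuses the placeholder domain from earlier in this article.

```python
import re
from urllib.parse import unquote, urlparse


def url_to_phrase(url):
    """Turn a URL into a space-separated phrase built from its path."""
    # Step 1: keep only the path, dropping scheme, domain, and query string.
    path = unquote(urlparse(url).path)
    # Step 2: replace slashes, underscores, and hyphens with spaces.
    words = re.sub(r"[/_\-]+", " ", path)
    # Collapse repeated whitespace left over from leading or trailing slashes.
    return " ".join(words.split())


phrase = url_to_phrase("https://www.xyz.com/blog/seo_tips-for-beginners/")
```

Here `phrase` ends up as `blog seo tips for beginners`, which is the kind of text a sentence encoder can work with.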

Use Google’s Universal Sentence Encoder

Once you have a list of phrases, you need to leverage Semantic Textual Similarity (STS) to match URLs by the phrases included in them. STS refers to methods that compare two strings of text or paragraphs to measure how similar they are in meaning.

For this purpose, you can use Google’s Universal Sentence Encoder (USE), which encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.

The USE makes obtaining sentence-level embeddings as easy as it has traditionally been to look up the embeddings of individual words. The sentence embeddings can then be used to compute sentence-level semantic similarity and to get better performance on downstream classification tasks with less supervised training data.

Graphic: how Google’s Universal Sentence Encoder works

Source: Amitness

According to Google, one variant of the USE model is built on a deep averaging network (DAN) encoder. A DAN works in three simple steps:

  • Average the embedding vectors of an input sequence of tokens
  • Pass that average through one or more feedforward layers
  • Perform (linear) classification on the representation of the final layer

Graphic: the deep averaging network (DAN) encoder used by the Universal Sentence Encoder model
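To make the three DAN steps tangible, here is a toy forward pass in plain Python. It is only an illustration of the steps listed above, not Google’s implementation: the token vectors and layer weights are invented, the network has a single tiny hidden layer, and all function names are hypothetical.

```python
import math


def average(vectors):
    """Step 1: element-wise mean of the token embedding vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dense(vector, weights, biases, activation=None):
    """One fully connected layer: weights @ vector + biases."""
    out = [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]
    return [activation(v) for v in out] if activation else out


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dan_forward(token_vectors, hidden_w, hidden_b, out_w, out_b):
    pooled = average(token_vectors)                        # step 1: average
    hidden = dense(pooled, hidden_w, hidden_b, math.tanh)  # step 2: feedforward
    return softmax(dense(hidden, out_w, out_b))            # step 3: linear classification


# Invented 2-dimensional token embeddings and weights for a 2-class output.
TOKENS = [[0.2, 0.4], [0.6, 0.0], [0.1, 0.5]]
HIDDEN_W, HIDDEN_B = [[0.5, -0.3], [0.1, 0.8]], [0.0, 0.1]
OUT_W, OUT_B = [[1.0, -1.0], [-0.5, 0.7]], [0.0, 0.0]

probabilities = dan_forward(TOKENS, HIDDEN_W, HIDDEN_B, OUT_W, OUT_B)
```

In a sentence encoder, it is the output of the feedforward layers that serves as the sentence embedding.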

Google has trained and optimized the model for text longer than a single word, such as sentences, phrases, or short paragraphs. It was trained on a variety of data sources and tasks so that it can handle a wide range of NLP tasks. The input is English text of variable length, and the output is a 512-dimensional vector.

The major advantage of this approach is that it is more powerful than naive textual matching. Google’s Universal Sentence Encoder allows matching pages or phrases with similar meaning, even if they are not written identically. Its multilingual variants also allow you to match phrases that are written in different languages.

This is what the code will look like:

Code to use to match phrases when they are written in dissimilar languages
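As a hedged sketch of what loading and calling the encoder can look like, the snippet below uses the publicly listed TensorFlow Hub handle for the English USE model. It assumes the `tensorflow` and `tensorflow_hub` packages are installed (they are preinstalled in Colab); `load_use` and `embed` are hypothetical helper names, and matching across languages would need one of the multilingual models instead, which is not shown here.

```python
# TensorFlow Hub handle for the English, DAN-based USE model.
USE_URL = "https://tfhub.dev/google/universal-sentence-encoder/4"


def load_use(url=USE_URL):
    """Download and load the encoder (needs tensorflow and tensorflow_hub)."""
    import tensorflow_hub as hub  # imported lazily; only needed when loading
    return hub.load(url)


def embed(model, phrases):
    """Encode phrases into plain lists of floats, one list per phrase."""
    embeddings = model(list(phrases))  # a tensor with one row per phrase
    return [[float(value) for value in row] for row in embeddings.numpy()]
```

In a notebook you would run `model = load_use()` once and then call `embed(model, ["What is your age?", "How old are you?"])` to get one 512-dimensional vector per phrase.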

Essentially, the Universal Sentence Encoder allows you to encode complete sentences in a vector space so that similar phrases are close together and entirely different ones are far apart.

You can use the Universal Sentence Encoder to solve a range of natural language problems. For mapping 404 URLs, the most suitable task is semantic textual similarity, as you want to match sentences that mean the same thing but might be worded differently.

To see USE in action, follow these steps.

  • On the USE page, open the example Google Colab notebook.
  • Click on the File menu and save a copy in your Google Drive.
  • Change the Runtime Type to GPU.
  • Next, go to the Runtime menu and click on “Run all Cells”.
  • Scroll down through the notebook to see the following visualization.

Visualization of the universal sentence encoder in action

This image shows similarity in a heat map matrix with the most similar phrases shown in hot red. As you can see, “What is your age?” and “How old are you?” are the most similar phrases.

You can add custom code to the notebook to make use of this technique with the phrases you generate from the URLs uploaded.

URLs that are very similar will show a lot more red squares than yellow. In that case, you can generate a top-five list for every 404 URL and ask one of your team members to manually review it and choose the best match. This method still saves a lot of manual work.

Calculate Similarity Suggestions

While the heat map looks great visually, it doesn’t provide real suggestions for every 404 URL. For this, you can create a custom function using this code as mentioned on Search Engine Journal:

code involving Tensorflow to calculate similarity suggestions

The code above involves a bit of TensorFlow coding: it passes the phrases to the USE model and receives their embeddings. As the embeddings are merely vectors, you can find the closest ones (“result”) by computing the dot product.

The dot product measures how similar two vectors are: the higher the value, the closer the match. For unit-length vectors, it is equal to the cosine similarity. The code below limits the matches to the top five.

code using tensorflow to calculate similarity suggestions limited to the top five matches
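The ranking step itself does not need TensorFlow once the embeddings exist. The sketch below is not the Search Engine Journal code; it is a minimal illustration that assumes you already have one embedding vector per URL (from USE or any other sentence encoder) and uses hypothetical function names and tiny invented 2-dimensional vectors in place of real 512-dimensional ones.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_matches(query_vector, candidates, k=5):
    """Return the k (score, url) pairs with the highest dot product."""
    scored = ((dot(query_vector, vector), url) for url, vector in candidates)
    return heapq.nlargest(k, scored)


def suggest_redirects(broken, valid, k=5):
    """Map every 404 URL to its k most similar valid URLs.

    `broken` and `valid` are dicts of URL path -> embedding vector.
    """
    candidates = list(valid.items())
    return {url: top_matches(vector, candidates, k) for url, vector in broken.items()}


# Invented embeddings, standing in for real encoder output.
valid_pages = {
    "/shoes/red": [1.0, 0.0],
    "/hats/blue": [0.0, 1.0],
    "/shoes/blue": [0.8, 0.6],
}
broken_pages = {"/red-shoe": [0.9, 0.1]}

suggestions = suggest_redirects(broken_pages, valid_pages, k=2)
```

Each entry in `suggestions` is a short, ranked list that a team member can review before the final redirects are created.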

Takeaway

There are several ways to fix 404 errors. At the end of the day, what matters the most is that you fix them because they can affect your SEO ranking as well as user experience. In this article, we discussed how to leverage Google’s USE model to automatically map 404s at scale with sentence embedding.

If you’re still not sure about how to do it, we can help. Our Technical SEO Audit Service involves identifying and fixing all dead links and chains of redirections that could be impacting your SEO rating. Get in touch to request a quote for your technical SEO audit.
