SEO

Python in SEO – Understanding Sentiment Analysis

In order to provide the most suitable results following a search, a search engine must first truly understand why the user is carrying out the search. This is referred to as user intent or search intent and is any search engine’s number one goal.

Let’s start off with an example.

If a user searches for “best online casino”, the search engine must understand whether to return a list of links to online casinos, or a list of links to review websites, where one can read about users’ experiences on different sites.

For this reason, Google does not simply return matches to a query on its search engine results page (SERP). Instead, it first attempts to gauge user intent, then uses 200+ factors to rank webpages and provides the user with an ordered list of the most suitable ones. Many digital marketers and SEO experts believe that sentiment analysis, of both the search query and the page content, is one of these factors.

Throughout this article we will discuss the role of sentiment analysis in Google ranking and we shall analyse real-world data to further investigate this relationship. More specifically we shall investigate the relationship between the sentiment of the search query and that of the webpage content such as the h1 tag, title, text, and meta-description of the top 10 ranking links in response to a query.
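Before any sentiment can be compared, the page fields mentioned above have to be pulled out of each ranking page’s HTML. The following is a minimal sketch of that step using only Python’s standard library. The names PageFieldParser and extract_fields are hypothetical and are not taken from the analysis described in this article; a production crawler would likely use a more forgiving HTML parser.

```python
from html.parser import HTMLParser


class PageFieldParser(HTMLParser):
    """Collect the <title>, the first <h1> and the meta description of a page."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "h1": "", "meta_description": ""}
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.fields["meta_description"] = attrs.get("content") or ""
        elif tag in ("title", "h1") and not self.fields[tag]:
            # Only the first <title> and the first <h1> are captured.
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data


def extract_fields(html):
    """Return a dict with the title, first h1 and meta description of `html`."""
    parser = PageFieldParser()
    parser.feed(html)
    return {name: value.strip() for name, value in parser.fields.items()}
```

Each of the returned strings can then be passed to whichever sentiment tool is chosen, alongside the search query itself.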

Sentiment Analysis Tools

Sentiment analysis forms part of the growing field of Natural Language Processing (NLP) and as such there exist many new tools and techniques which one can use to perform this analysis, such as the IBM Watson Tone Analyzer and Google Cloud Natural Language.

This article involves a form of big data analysis, so it is skewed towards tools that are both financially and computationally inexpensive. Furthermore, we are interested in tools that can easily be used as part of an automated script, such as Python libraries.

Python is suitable for most machine learning projects and is capable of handling big datasets efficiently; it even has several open-source libraries allowing us to integrate sentiment analysis functionality into apps, making it a great fit for this article.

Pre-trained Models vs Non-trained Models

One of the benefits of machine learning is the ability to transfer learning from one project to another. This means that a model can be pre-trained on large datasets of text, and programmers may then apply the pre-trained model to their own use case.

In the case of sentiment analysis, a pre-trained model allows us to analyse content directly, without having to train the model beforehand. In other words, a pre-trained model is better suited to building an application.

Training a sentiment analysis model from scratch using machine learning and natural language processing can take anywhere from one hour to a month, or even a whole year, because it involves gathering large amounts of data, training the model and validating the results.

The exact duration depends on several factors, such as the amount of data that needs to be processed, the programming language used, the available processing power and the algorithm. Most of the libraries and tools considered in this article are pre-trained models.

This is because we are generally interested in models that support multi-lingual analysis (not easily achievable with models which are not pre-trained) and models which are ready to use out-of-the-box and require little to no pre-processing.

Google NLP and Microsoft Azure

The three giants leading the natural language processing field are Microsoft, IBM and Google. All three companies are actively developing systems for natural language processing which include out-of-the-box functions to perform sentiment analysis of text.

All three models are pre-trained, meaning they can analyse text without prior training and validation. For example, Microsoft offers a Text Analytics service as part of its Azure cloud solutions, IBM has a package called Watson Tone Analyzer, and Google has introduced Natural Language (read our previous article on Google Cloud NLP for SEO).

In this section, we’ll compare Microsoft Azure’s “Text Analytics” with Google Cloud’s “Natural Language” since both these companies offer their own search engines (Microsoft Bing and Google, respectively) which make use of their proprietary sentiment analysis algorithms at some point throughout the website ranking process.

Since both companies offer search engines, they are able to access millions of data points allowing them to optimise their algorithms for website analysis and website ranking.

Features

Both services are packed with features for sentiment analysis, but they offer much more than that. The main highlights of what each service has to offer are listed below.

Google Natural Language:

Contains algorithms that can analyse the structure (syntax analysis) as well as the meaning (sentiment analysis) of input text.
Can understand the context of the input text and classify text into 700+ distinct categories.
It is smart enough to identify entities present in a text block. For example, it can read documents like invoices, receipts, and contracts. It then uses labels to specify data by types like person name, date, contact information, location, organization, etc.
Can analyse text in multiple languages, although it currently supports only the following: English, Spanish, Japanese, Chinese (simplified and traditional), French, German, Italian, Korean, Portuguese and Russian.
Can easily be connected to Google’s Speech-to-text API to allow the automatic extraction of insights from audio data.
The client libraries are open-source and can easily be tweaked to suit one’s needs.
Allows the straightforward customisation of pre-trained models, using the Cloud AutoML suite.
Can extract custom entities from PDF files, taking the file’s layout and structure into account, and supports large datasets. For example, you can process up to 1 million documents, provided that no file exceeds 10 MB, and use up to 5000 classification labels.
Provides an easy to use REST API which is compatible with Python.
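As an illustration of the REST API mentioned in the last point, the sketch below builds a sentiment request using only Python’s standard library. It assumes an API key for a project with the Natural Language API enabled. The endpoint and field names reflect the public v1 REST interface as we understand it, so they should be checked against the current documentation before use; the function names are our own.

```python
import json
import urllib.request

# Public REST endpoint of the Cloud Natural Language API (v1); verify before use.
ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeSentiment"


def build_sentiment_request(text, api_key):
    """Build (but do not send) a POST request asking for the sentiment of `text`."""
    payload = {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    }
    return urllib.request.Request(
        url=f"{ENDPOINT}?key={api_key}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyse_sentiment(text, api_key):
    """Send the request and return the document-level (score, magnitude)."""
    request = build_sentiment_request(text, api_key)
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    sentiment = result["documentSentiment"]
    return sentiment["score"], sentiment["magnitude"]
```

Calling analyse_sentiment requires a valid key and network access, and every call counts towards the billed usage discussed in the pricing section.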

Microsoft Text Analytics:

Just like Google Natural Language, it can also perform entity recognition on text input.
It is specifically designed and optimised for sentiment analysis. You can use it to determine what people are saying about any specific topic. Companies can use it to examine what customers think about their products or services.
Its opinion mining functionality helps in understanding how customers perceive a product.
Can detect the sentiment of text in the following languages: Chinese-Simplified, Chinese-Traditional, Danish, English, Finnish, French, German, Greek, Hindi, Italian, Japanese, Korean, Norwegian (Bokmål), Polish, Portuguese (Portugal), Russian, Spanish, Swedish and Turkish.
You can use it on the cloud as well as install it on-site.
It works for both structured and unstructured text data.
Microsoft doesn’t use your training data to improve Text Analytics, meaning that your data remains completely yours and is stored using enterprise-grade security.
Provides an easy to use REST API which is compatible with Python.

Pricing for Sentiment Analysis

Google Natural Language Pricing:

0 – 5K units evaluated: Free per month
5K – 1M units evaluated: $1 per 1000 units
1M – 5M units evaluated: $0.50 per 1000 units
5M – 20M units evaluated: $0.25 per 1000 units

Microsoft Text Analytics Pricing:

0 – 5K transactions: Free per month
0 – 500,000 text records: $2 per 1000 text records
0.5M – 2.5M text records: $1 per 1000 text records
2.5M – 10.0M text records: $0.50 per 1000 text records
10M+ text records: $0.25 per 1000 text records

Both services offer a free trial so one can test their functionality as well as get familiar with them without spending any money. This is also useful for small tasks which can be carried out manually, such as analysing the content of the pages for one’s own website.

However, in both cases a single unit refers to 1000 characters of text and the unit pricing is very strict, such that one will be charged the full unit amount for analysing even small sentences, for example a sentence composed of 100 characters. If a sentence of text is composed of 1001 characters, then one will be charged for two units.
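The rounding rule described above can be sketched in a few lines of Python. This assumes that a unit is exactly 1000 characters and that any partial unit is rounded up, as stated in the paragraph above; the function name is hypothetical.

```python
import math

CHARS_PER_UNIT = 1000  # unit size as described above


def billable_units(text):
    """Number of units charged for one piece of text, rounding partial units up."""
    return math.ceil(len(text) / CHARS_PER_UNIT)
```

With this rule, a 100-character sentence is billed as one full unit, while a 1001-character text is billed as two.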

This pricing is perfectly adequate for small datasets, or for inclusion in an app where the customer can be charged per analysis block. However, if one were to incorporate these algorithms into more complex software which would require prior training and testing, or to carry out big data analysis using these tools, one would run into extremely high costs.

For this reason, neither tool is ideal for scaled-up projects.
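To give a rough sense of how costs grow with volume, the sketch below estimates a monthly bill from the Google Natural Language tiers quoted earlier. It assumes that each tier’s rate applies only to the units falling inside that tier, which is our interpretation rather than something stated in the price list, and the prices themselves may have changed since the time of writing.

```python
# (upper bound of the tier in units, price in USD per 1000 units), as quoted above.
GOOGLE_TIERS = [
    (5_000, 0.00),
    (1_000_000, 1.00),
    (5_000_000, 0.50),
    (20_000_000, 0.25),
]


def estimate_monthly_cost(units, tiers=GOOGLE_TIERS):
    """Estimate the monthly cost in USD, assuming marginal (per-tier) pricing."""
    if units > tiers[-1][0]:
        raise ValueError("volume exceeds the tiers quoted in this article")
    cost = 0.0
    lower = 0
    for upper, price_per_1000 in tiers:
        if units <= lower:
            break
        units_in_tier = min(units, upper) - lower
        cost += units_in_tier / 1000 * price_per_1000
        lower = upper
    return cost
```

Under these assumptions, 2,000,000 units in a month would cost 995 USD for the second tier plus 500 USD for the third, or 1,495 USD in total.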

In addition, in order to analyse a chunk of text, one must make a call to the aforementioned APIs. Although this is easy to implement, making one network request per chunk can become extremely inefficient at scale, especially when the rest of the processing runs on a local machine.

We therefore deemed these tools extremely helpful for small investigations, but inadequate for our big data analysis. As a result, we opted to use a Python library and run the analysis locally to boost performance (and keep costs low!).

Open-Source Python Libraries for Sentiment Analysis

We prefer Python for sentiment analysis because many popular open-source machine learning libraries are available for it, including libraries designed specifically for natural language processing tasks such as sentiment analysis.

Here is a list of the top 5 open-source Python sentiment analysis libraries together with their most important features.

TextBlob – Extremely easy and fast to use with little to no data pre-processing required. Can analyse sentiment polarity and subjectivity.
NLTK – Very popular toolkit with a strong online community and an abundance of online resources. Requires data pre-processing.
Stanza – The Stanford NLP Group’s Python library, which also provides an interface to the Java-based Stanford CoreNLP toolkit. This library is excellent for handling multi-lingual analysis and also offers models optimised for biomedical applications.
spaCy – Very well equipped for scaling up projects as it can easily handle large datasets at fast speed.
Gensim – This library is optimised for the efforts that precede sentiment analysis, such as vectorisation of text.

All of these libraries can be used on their own or as part of a sentiment analysis pipeline. They can also perform other tasks, including:

Part-of-speech tagging
Syntax-driven sentence segmentation
Spelling correction
Non-destructive tokenization
Noun phrase extraction
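As an illustration of how little code a pre-trained library requires, the sketch below uses TextBlob, the first library in the list above, to obtain the polarity and subjectivity of a piece of text. It assumes that TextBlob has been installed separately (for example with pip) and returns None when it has not; the wrapper function name is hypothetical.

```python
try:
    from textblob import TextBlob  # third-party package, installed separately
except ImportError:
    TextBlob = None


def polarity_and_subjectivity(text):
    """Return (polarity, subjectivity) for `text`, or None if TextBlob is missing.

    Polarity ranges from -1.0 (negative) to 1.0 (positive);
    subjectivity ranges from 0.0 (objective) to 1.0 (subjective).
    """
    if TextBlob is None:
        return None
    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity
```

The same function can be applied to a search query and to each page field, which is the kind of comparison this article sets out to investigate.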

Michaela Tromans-Jones Spiteri

Dr Michaela Spiteri BEng, MSc, PhD (AI / Healthcare domain), is a well-published researcher in the field of AI and machine learning. She is the founder of AI consultancy Analitigo Ltd and currently works as the lead researcher at Gainchanger.
