RapidMiner: Relationship between product scores and text review sentiment

This is the first of two tutorials in which we use the MeaningCloud Extension for RapidMiner to extract insights that combine structured data with unstructured text. See the second one here. To follow these tutorials you will need RapidMiner Studio and our Extension for RapidMiner installed on your machine (learn how here).

In this tutorial we will analyze a set of food reviews from Amazon. Using the MeaningCloud sentiment API, we will check whether the text that users write about a product corresponds to the score that they have assigned to it. More specifically, we will try to see:

  • How closely the review sentiment corresponds to the manually assigned score (which we already have available in our dataset).

The dataset that we will be using throughout the tutorial can be found here. The first thing we need to do is download the CSV file to our computer.

Importing the data

Before we can use the dataset in RapidMiner we need to import it. To do this, we click on the Add Data button at the top of the Repository panel in RapidMiner. This opens a new window where we choose the location of the data that we want to import. In our case, the correct choice is “My Computer”. Next, we browse to the folder that contains the CSV file we downloaded earlier, select the file and click “Next”. The following step offers some basic import options, such as whether there is a header row, the start and end rows to import, and the file encoding. The default options are fine, so leave them as they are and go to the next step by clicking the “Next” button.

Importing data

The next thing that we need to take care of is the formatting of the columns. In this case, we need to make some small modifications to the default options: change the data type of the columns HelpfulnessNumerator, HelpfulnessDenominator and Score to integer (you can also convert the Id attribute to integer, but we will not use it in the process). This is done by opening the dropdown next to the column name and choosing the appropriate data type in the “Change Type” submenu. This is what your final result should look like:

Convert column types
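For readers who want to see what this type conversion amounts to outside RapidMiner, here is a minimal Python sketch. It is only an illustration, not part of the RapidMiner process: the two sample rows are made up, and only the column names come from the tutorial.

```python
import csv
import io

# Hypothetical two-row sample; only the column names come from the tutorial.
SAMPLE_CSV = """Id,HelpfulnessNumerator,HelpfulnessDenominator,Score,Summary,Text
1,1,1,5,Good quality,I have bought several of these and they are great.
2,0,0,1,Not as advertised,The product arrived labeled incorrectly.
"""

# Columns that the import wizard step converts to integer.
INTEGER_COLUMNS = ("Id", "HelpfulnessNumerator", "HelpfulnessDenominator", "Score")


def load_reviews(handle):
    """Read CSV rows and cast the integer columns, leaving the rest as text."""
    rows = []
    for row in csv.DictReader(handle):
        for name in INTEGER_COLUMNS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


reviews = load_reviews(io.StringIO(SAMPLE_CSV))
print(reviews[0]["Score"], type(reviews[0]["Score"]).__name__)
```

With the real file you would pass an opened file handle instead of the in-memory sample.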

After clicking on “Next”, choose a name for your dataset and the location where you would like to store it (such as Local Repository), and then click on “Finish”:

Store data

Retrieving the data

After we have imported the data in RapidMiner, the next step is to retrieve the dataset by dragging it from our Repository panel into the Process modeller:

Retrieve data

Selecting attributes

We now have the full dataset loaded into RapidMiner. However, since we will not use all of the attributes for our purposes, let us select only the ones that we need for further processing. To do this, we will use the Select Attributes operator. In its Parameters section we set the attribute filter type to subset and select the following attributes in the Select Attributes panel: Score, Text and Summary. Do not forget to connect the output of the Retrieve operator to the input of the Select Attributes operator:

Select attributes
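Conceptually, Select Attributes with a subset filter just keeps the named columns and drops the rest. A rough Python equivalent, assuming each review is held as a dictionary (the sample row is invented for illustration), could look like this:

```python
# Attributes kept by the Select Attributes step in the tutorial.
KEEP = ("Score", "Text", "Summary")


def select_attributes(rows, keep=KEEP):
    """Return copies of the rows that contain only the attributes in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]


# Hypothetical example row with the dataset's column names.
rows = [
    {
        "Id": 1,
        "HelpfulnessNumerator": 1,
        "HelpfulnessDenominator": 1,
        "Score": 5,
        "Summary": "Good quality",
        "Text": "I have bought several of these and they are great.",
    }
]

selected = select_attributes(rows)
print(sorted(selected[0]))
```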

Sentiment analysis

Now that we have completed the previous steps, it is time to use MeaningCloud to do the sentiment analysis of the reviews. First things first: make sure that you have the MeaningCloud extension installed in your RapidMiner Studio and locate your license key by logging into your MeaningCloud account on our website.

To perform the sentiment analysis of the actual review text we will use the Sentiment Analysis operator that is part of the MeaningCloud extension (if you are unable to find the operator in your RapidMiner Studio, check whether the extension has been installed properly). Connect the input of this operator to the output (exa) of the Select Attributes operator that we used in the previous step. Once you have done that, in the Parameters section of the operator enter your License Key, choose Text from the Attribute dropdown (this is the attribute that contains the full text of the review) and set the Text language to en. Your process should now look like this:

Sentiment analysis
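The operator takes care of calling the MeaningCloud service for every row. To give an idea of what such a call might look like, here is a hedged Python sketch. It assumes the public Sentiment Analysis API version 2.1 endpoint, its key, txt and lang request parameters, and a score_tag field in the JSON reply. Whether the extension uses exactly this request internally is an assumption, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Sentiment Analysis endpoint, version 2.1.
API_URL = "https://api.meaningcloud.com/sentiment-2.1"


def build_payload(text, license_key, lang="en"):
    """Build the form fields for one request (assumed names: key, txt, lang)."""
    return {"key": license_key, "txt": text, "lang": lang}


def extract_polarity(response):
    """Read the global polarity label, assumed to live in 'score_tag'.

    Falls back to NONE, the label the tutorial filters out later.
    """
    return response.get("score_tag", "NONE")


def fetch_polarity(text, license_key, lang="en"):
    """Send one review to the API and return its polarity label.

    Needs network access and a valid license key, so it is not called here.
    """
    data = urllib.parse.urlencode(build_payload(text, license_key, lang)).encode("utf-8")
    with urllib.request.urlopen(API_URL, data=data) as reply:
        return extract_polarity(json.load(reply))


# Offline illustration with a hand-written reply instead of a real API call.
print(extract_polarity({"score_tag": "P+"}))
```

Calling fetch_polarity with your own review text and license key would perform the actual request.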

Mapping sentiment polarity into numerical values

After including the Sentiment Analysis operator in our process, we need a way to compare the assigned sentiment polarity with the score that the user has assigned manually. The simplest way to do this is to translate each sentiment class to a numerical score that corresponds to one of the 1-5 manually assigned scores. Before doing this, however, let us first filter out the examples that did not get a sentiment for some reason (for instance, because of invalid text or a missing Text attribute in that row of the dataset). Add a Filter Examples operator to the process and connect it to the Sentiment Analysis output port. In its Parameters panel click on “Add Filters…” and in the new window choose polarity(Text) in the leftmost dropdown field, does not equal in the middle dropdown field, and enter NONE in the rightmost text field. The process should now look like this:

Filter examples
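The filter condition itself is simple: keep a row only when its polarity(Text) value is different from NONE. As an illustrative sketch in Python (the sample rows are invented):

```python
def filter_examples(rows, attribute="polarity(Text)", excluded="NONE"):
    """Keep only the rows whose `attribute` does not equal `excluded`."""
    return [row for row in rows if row[attribute] != excluded]


# Hypothetical rows as they might look after the sentiment analysis step.
rows = [
    {"Score": 5, "polarity(Text)": "P+"},
    {"Score": 3, "polarity(Text)": "NONE"},
    {"Score": 1, "polarity(Text)": "N"},
]

filtered = filter_examples(rows)
print(len(filtered))
```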

The next step, after cleaning the dataset of any invalid sentiment results, is to convert the nominal sentiment values that we received from the MeaningCloud API to numerical values. We can do this by first adding the Map operator to the process and connecting it to the Filter Examples operator. In the Parameters section of the operator choose single as the attribute filter type and select polarity(Text) from the attribute dropdown below. Additionally, click on the Edit List button for the value mappings parameter and in the new window that opens enter the following:

old value    new value
P+           5
P            4
N            2
N+           1

This is what your process should look like after performing this step:

Map values
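The value mapping can be pictured as a plain lookup table. The sketch below uses exactly the four entries listed above and keeps the new values as text, since at this point RapidMiner still treats them as nominal. The table has no entry for any other label (for example a neutral NEU), so the sketch leaves such values unchanged. That choice is an assumption about how unmapped values behave, not something the tutorial states.

```python
# The four value mappings from the tutorial, kept as text (still nominal).
POLARITY_TO_SCORE = {"P+": "5", "P": "4", "N": "2", "N+": "1"}


def map_polarity(rows, attribute="polarity(Text)", mapping=POLARITY_TO_SCORE):
    """Replace known polarity labels; unknown labels pass through unchanged (assumption)."""
    return [
        {**row, attribute: mapping.get(row[attribute], row[attribute])}
        for row in rows
    ]


# Hypothetical rows for illustration.
rows = [
    {"Score": 5, "polarity(Text)": "P+"},
    {"Score": 2, "polarity(Text)": "N"},
    {"Score": 3, "polarity(Text)": "NEU"},
]

mapped = map_polarity(rows)
print([row["polarity(Text)"] for row in mapped])
```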

We have now transformed the sentiment polarity classes that we got as results from the sentiment analysis performed by MeaningCloud to numeric digits, but we are not done yet! Namely, the values that we have right now are still considered nominal by RapidMiner (even though we have numerals 1-5, they are still seen as classes) – to change this and finish the transformation process we need to transform the attribute from nominal to numeric. In RapidMiner when we have numerically denoted classes the easiest way to do this is to simply parse the numbers using the Parse Numbers operator – so add it to the process and connect it to the Map operator that we used in the previous substep. In the Parameters section of this operator choose single as the attribute filter type and choose the polarity(Text) attribute in the appropriate dropdown:

Parse numbers
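The difference between a nominal "5" and a numeric 5 is the same as the difference between a string and an integer in a programming language. A small Python sketch of the parsing step follows. Treating unparseable values as missing (None) is our assumption for the illustration, not a documented behaviour of the operator.

```python
def parse_numbers(rows, attribute="polarity(Text)"):
    """Convert the text values of `attribute` to integers.

    Values that are not valid integers become None (an assumption made
    for this sketch).
    """
    parsed = []
    for row in rows:
        try:
            value = int(row[attribute])
        except ValueError:
            value = None
        parsed.append({**row, attribute: value})
    return parsed


# Hypothetical rows as they might look after the Map step.
rows = [
    {"Score": 5, "polarity(Text)": "5"},
    {"Score": 2, "polarity(Text)": "2"},
    {"Score": 3, "polarity(Text)": "NEU"},
]

numeric = parse_numbers(rows)
print([row["polarity(Text)"] for row in numeric])
```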

And that was the last part of the Transformation process! Now we just need to compare the results to see how well we did.

Correlation matrix

To see how well the sentiment analysis results correlate with the user-assigned score, let us extract a correlation matrix. This can be done using the “Correlation Matrix” operator. In this step we do not need to configure anything; we simply connect the output (exa port) of the “Parse Numbers” operator to the input port of the “Correlation Matrix” operator.

Finally, do not forget to connect the example set output port (named exa) and the matrix output port (named mat) of the “Correlation Matrix” operator to the first and second result ports of the process itself, as in the following depiction:

Correlation matrix
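Each cell of the matrix is a correlation coefficient between two numeric attributes. Assuming the operator reports Pearson's coefficient, the value for Score and polarity(Text) could be computed by hand as in the sketch below. The six sample pairs are made up and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical user scores and mapped polarity values.
scores = [5, 4, 1, 2, 5, 3]
polarity = [5, 4, 2, 1, 4, 4]

r = pearson(scores, polarity)
print(round(r, 2))
```

A value close to 1 means the two attributes rise and fall together, a value close to 0 means there is no linear relationship, and a value close to -1 means one falls as the other rises.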

Import the process

Don’t want to go through all the process creation steps? Just download the process specification Amazon fine food reviews dataset and import it into RapidMiner using the Import Process option in the File menu.

Execute the process and analyze

We are now ready to execute the process and analyze the results! Simply click on the Play button in the RapidMiner toolbar. Try to be patient in this step, since the execution may take some time.

After the execution has finished, you should get the following results:

Correlation results

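As we can see, the correlation between score and polarity is 0.49, which indicates a moderate positive relationship: reviews with more positive text tend to carry higher scores, but the two measures are far from interchangeable. Note that a correlation coefficient is not a percentage of users. Squared (0.49² ≈ 0.24), it suggests that the sentiment polarity accounts for roughly a quarter of the variance in the scores.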

Congratulations! You have just finished your first process with the MeaningCloud Extension for RapidMiner and successfully predicted the sentiment for a set of food reviews. Be sure to visit the MeaningCloud website for more information about all the powerful (and fun) things that you can do with our APIs and for our next tutorial.
