One of the questions we get most often at our helpdesk is how to apply the text analytics functionality that MeaningCloud provides to specific scenarios.
Users come to MeaningCloud knowing they want to incorporate text analytics into their process, but they are not sure how to translate their business requirements into something they can integrate into their pipeline.
Add to that the fact that each provider uses different names for the products it offers to carry out specific text analytics tasks, and it becomes difficult not just to get started, but even to know exactly what you need for your scenario.
In this post, we are going to explain what our different products are used for, the NLP tasks they map to, the added value they provide, and the requirements they fulfill.
Topics Extraction is MeaningCloud’s product for information extraction, which is the task of “automatically extracting structured information from unstructured and/or semi-structured machine-readable documents”. In other words, we want to extract specific pieces of information that appear in texts, from names of people to locations or amounts of money.
There are a number of ways to refer to this task, some of them derived from its most popular subtasks, such as Named Entity Recognition. The objective is still the same: extracting structured information from a text.
Let’s take a look at the following text, taken from an article in the New York Times:
It’s Official: Simone Biles Is the World’s Best Gymnast
RIO DE JANEIRO — Simone Biles, already considered the world’s greatest female gymnast before even competing in the Olympics, emphatically confirmed her standing on Thursday by winning the women’s individual all-around gold medal at the Rio Games.
Wearing a stars-and-stripes leotard, Biles, 19, joined Mary Lou Retton, Carly Patterson, Nastia Liukin and Gabby Douglas as American all-around winners.
The American Aly Raisman, 22, won the silver, and Aliya Mustafina, 21, of Russia won bronze.
Victory in this event brings lucrative endorsements and widespread adoration, a popularity bonanza fueled by a prime-time showcase of athletic artistry. At 4 feet 9 inches, with size 5 feet, Biles is someone that young viewers can relate to. Then she performs, and her abilities are unimaginable.
Her ascent has been sudden to those who follow gymnastics only every four years. At the last Summer Games, in London in 2012, Douglas was the show-stopper. Biles arrived here from Texas and gave the Rio Games a performance for the ages. Whether you know an Amanar from an aardvark, you watch her not because the result is in doubt but rather to witness something without equal.
So, how does this text look when we extract its information using our Topics Extraction API?
At first, it may seem that it’s just a matter of finding the names that appear in the text, but there’s a little more to it. There are many ways to refer to the same person, such as nicknames and variants of their name, and you need to take all of them into account. For instance, in this text Simone Biles appears five times: twice with her full name, and three times by her surname alone.
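As a rough illustration of the mention count above, here is a minimal Python sketch that groups the variants of a name under a single entity. It is not how Topics Extraction works internally: the `count_mentions` function and the hand-written variant list are assumptions made for this example, and the sample text is a shortened paraphrase of the article.

```python
import re
from collections import Counter

def count_mentions(text, variants):
    """Count mentions of one entity, given its known name variants.

    Longer variants are tried first so that "Simone Biles" is not
    also counted as a separate "Biles" mention.
    """
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b"
    )
    return Counter(pattern.findall(text))

# Shortened paraphrase of the article, keeping every mention of the athlete.
text = (
    "It's Official: Simone Biles Is the World's Best Gymnast. "
    "Simone Biles confirmed her standing on Thursday. "
    "Biles, 19, joined Mary Lou Retton. "
    "Biles is someone that young viewers can relate to. "
    "Biles arrived here from Texas."
)

mentions = count_mentions(text, ["Simone Biles", "Biles"])
total = sum(mentions.values())
print(mentions, total)
```

A real system also has to discover the variants by itself and decide which entity an ambiguous surname refers to, which is where most of the difficulty lies.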
But names — or named entities, as they are often referred to — are not the only thing we may want to extract. In the text we can also see quantities, dates, and keywords. Depending on the scenario you are working on, you will need to extract different types of structured information.
Sometimes, all the named entities in a text are more than you need. For those cases, each entity has an associated type, so you can select only locations, persons, organizations, etc. You can check all the different types we detect in our ontology.
You can also define your own entries with their corresponding types through our customization engine. By using user dictionaries you will be able to extract the entities/concepts specific to your domain using your own ontology.
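Filtering by type can be sketched in a few lines of Python. The entity dictionaries below, their `form` and `type` keys, and the `Top>Person>...` paths are hypothetical stand-ins for a hierarchical ontology, chosen for this example; check the API documentation for the actual response format.

```python
def filter_by_type(entities, type_prefix):
    """Keep entities whose type is type_prefix or one of its subtypes."""
    return [
        e for e in entities
        if e["type"] == type_prefix or e["type"].startswith(type_prefix + ">")
    ]

# Hypothetical extraction result; types are paths in an assumed ontology.
entities = [
    {"form": "Simone Biles", "type": "Top>Person>FullName"},
    {"form": "Rio de Janeiro", "type": "Top>Location>GeoPoliticalEntity>City"},
    {"form": "Russia", "type": "Top>Location>GeoPoliticalEntity>Country"},
    {"form": "Aly Raisman", "type": "Top>Person>FullName"},
]

people = filter_by_type(entities, "Top>Person")
locations = filter_by_type(entities, "Top>Location")
print([e["form"] for e in people])
print([e["form"] for e in locations])
```

Matching on the prefix plus the separator means that asking for `Top>Person` returns every subtype of person without accidentally matching an unrelated type that merely starts with the same letters.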
These are some scenarios in which Topics Extraction can be applied:
- Automatic tag suggestions for news articles/blog posts and semantic publishing.
- Popularity analysis according to mentions.
- Key data extraction.
Text Classification is MeaningCloud’s product for document categorization or document classification, which is the task of “assigning a document to one or more classes or categories”. In this case, instead of extracting something from a text, we analyze it and decide which of the available categories it should be classified into.
This task assumes that we have a number of categories defined beforehand, and that we know the criteria that determine whether a text should be categorized into any of them. In MeaningCloud, we refer to this definition of both the categories and their criteria as a classification model.
Our Text Classification API provides several generic predefined models such as IAB (a standard from the advertising industry) or IPTC (an international standard to classify news).
Going back to the example we used before, the image on the right shows the categories our text is placed in for the two predefined models we have mentioned.
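To make the idea of “categories plus criteria” concrete, here is a toy Python sketch in which the criteria are simply keyword sets. This is an assumption made for illustration: the category names, the keywords, and the `classify` function are invented, and real classification models use richer criteria than a keyword count.

```python
import re

def classify(text, model, min_hits=1):
    """Return the categories whose keywords appear in the text, best first."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {}
    for category, keywords in model.items():
        hits = len(words & keywords)
        if hits >= min_hits:
            scores[category] = hits
    # Sort by number of hits (descending), then by name for stable output.
    return sorted(scores, key=lambda c: (-scores[c], c))

# Hypothetical model: each category is defined by a set of keywords.
model = {
    "sport/gymnastics": {"gymnast", "gymnastics", "leotard", "medal"},
    "sport/olympics": {"olympics", "games", "medal"},
    "economy/finance": {"stocks", "market", "bank"},
}

text = "Simone Biles won the all-around gold medal in gymnastics at the Rio Games."
categories = classify(text, model)
print(categories)
```

Note that a text can land in more than one category, which matches the “one or more classes” part of the definition.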
Text classification gives us an idea of what a text is about according to specific criteria. This may apply to an article, a tweet or the feedback you obtain from a customer.
In some instances, these generic criteria may not fit your needs, so for those cases you can define your own classification model using our customization engine.
These are some scenarios in which Text Classification can be applied:
- Automatic tag suggestions for news articles/blog posts.
- Complete characterization of user feedback according to different criteria.
Sentiment Analysis is MeaningCloud’s product for sentiment analysis or opinion mining, which is the task of “identifying and extracting subjective information in source materials”. One of the most basic tasks in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level.
Our Sentiment Analysis API combines the complete morphosyntactic analysis carried out by our core engine with sentiment information, which allows us to analyze sentiment at every level.
We can obtain a global polarity for the text, or we can go deeper and see the polarity expressed in each of the sentences that compose it.
On the right, we can see the global analysis we obtain for the text we’ve used as an example. It includes a polarity value with a level of confidence, an agreement/disagreement value (which indicates whether all the polarities detected per sentence/segment within the text agree), a subjectivity value, and an irony value.
MeaningCloud also lets you combine this analysis with the Topics Extraction functionality to obtain the polarity associated with the entities and the concepts in the text. This is usually referred to as aspect-level sentiment analysis.
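The relationship between sentence-level polarities, the global polarity and the agreement value can be sketched as follows. This is a simplified assumption, not the actual algorithm: the numeric scores, the `summarize_polarity` function and the label strings are all made up for this example.

```python
def summarize_polarity(sentence_scores):
    """Combine per-sentence scores in [-1, 1] into a global summary.

    A score of 0 means the sentence carries no polarity and is ignored.
    """
    polar = [s for s in sentence_scores if s != 0]
    if not polar:
        return {"polarity": "NONE", "agreement": "AGREEMENT"}
    mean = sum(polar) / len(polar)
    polarity = "P" if mean > 0 else "N" if mean < 0 else "NEU"
    # The sentences agree when every polar sentence has the same sign.
    same_sign = all(s > 0 for s in polar) or all(s < 0 for s in polar)
    agreement = "AGREEMENT" if same_sign else "DISAGREEMENT"
    return {"polarity": polarity, "agreement": agreement}

consistent = summarize_polarity([0.8, 0.0, 0.5])
mixed = summarize_polarity([0.8, -0.3])
print(consistent)
print(mixed)
```

The second call shows why the agreement value is useful: the global polarity is still positive, but the flag tells you that the text also contains negative sentences worth a closer look.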
On the right, we see some of the entities detected. In the image, the entities detected with a positive polarity are shown in green — the athletes that have won medals — while the ones with no polarity are shown in blank rows.
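A very rough way to picture aspect-level sentiment is to give each entity the polarity of the sentences that mention it. In the sketch below, sentence co-occurrence stands in for the syntactic analysis a real engine would use, and the sentences, scores and `entity_polarity` function are assumptions made for this example.

```python
from collections import defaultdict

def entity_polarity(sentences, entities):
    """Label each entity using the scores of the sentences that mention it.

    sentences: list of (text, score) pairs, with score in [-1, 1].
    entities: list of entity names to look for.
    """
    scores = defaultdict(list)
    for text, score in sentences:
        for name in entities:
            if name in text:
                scores[name].append(score)
    result = {}
    for name in entities:
        values = scores.get(name, [])
        mean = sum(values) / len(values) if values else 0.0
        result[name] = "positive" if mean > 0 else "negative" if mean < 0 else "none"
    return result

# Hypothetical per-sentence scores for a few paraphrased sentences.
sentences = [
    ("Simone Biles won the individual all-around gold medal.", 0.9),
    ("Aly Raisman won the silver.", 0.6),
    ("Biles joined Mary Lou Retton as an American all-around winner.", 0.0),
]
entities = ["Simone Biles", "Aly Raisman", "Mary Lou Retton"]

polarities = entity_polarity(sentences, entities)
print(polarities)
```

In this toy run the medal winners come out as positive and the athlete mentioned only in passing has no polarity, which mirrors the green and blank rows described above.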
Like the other products we’ve mentioned, Sentiment Analysis can be customized through our customization engine: you can define both the sentiment associated with terms and the entities and concepts to analyze at an aspect level.
These are some scenarios in which Sentiment Analysis can be applied:
- Customer satisfaction analysis
- Popularity analysis
- Voice of the customer
Text Clustering provides cluster analysis, the task of “grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)”.
In this case, the objects in question are texts, and the different analyses provided can help us discover patterns in them, either to characterize the data or to learn new information about it and use it as feedback for other types of analyses. An example of a possible use of Text Clustering is to apply it over the texts we are classifying using Text Classification in order to identify new categories to add to our model.
On the right, we can see the result we would obtain if we analyzed the text we have used as an example together with the next two paragraphs of the same article. The three texts are grouped into two different clusters, “Rio Games” and “Performs”, which fit the overall themes discussed in them quite well.
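The grouping idea can be illustrated with a small Python sketch that clusters texts by word overlap. It is only a toy under stated assumptions: the Jaccard similarity, the greedy assignment, the 0.2 threshold and the three sample texts are choices made for this example and say nothing about the algorithm Text Clustering actually uses.

```python
import re

def tokens(text):
    """Lowercased set of the words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of words two token sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(texts, threshold=0.2):
    """Greedily assign each text to the first cluster whose first member
    is similar enough; otherwise start a new cluster."""
    clusters = []  # each cluster is a list of indices into texts
    for i, text in enumerate(texts):
        for members in clusters:
            if jaccard(tokens(text), tokens(texts[members[0]])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

texts = [
    "Biles won gold at the Rio Games",
    "The Rio Games crowned Biles",
    "She performs a difficult vault",
]
groups = cluster(texts)
print(groups)
```

The first two texts share most of their words and end up together, while the third starts its own cluster. A real service also has to pick a readable label for each cluster, which this sketch does not attempt.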
Language Identification is MeaningCloud’s product for language identification or language guessing, which is the task of “determining which natural language given content is in”. It’s usually considered an auxiliary task, but it’s no less important for it.
All of the analyses we’ve mentioned so far need to know the language of the content being analyzed. If you are working in a single language, this is not an issue, but nowadays multilingual scenarios such as Twitter are more and more common, so having an API to carry out this task is extremely useful.
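A classic baseline for this task is to count how many very common words of each language appear in the text. The Python sketch below does exactly that with tiny, hand-picked word lists; the lists, the language codes and the `guess_language` function are assumptions for this example, and a production identifier relies on far richer statistics.

```python
import re

# Tiny hand-picked lists of frequent words per language (illustrative only).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "es": {"el", "la", "y", "es", "de", "en"},
    "fr": {"le", "la", "et", "est", "de", "les"},
}

def guess_language(text):
    """Return the language whose frequent words appear most, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in common for word in words)
        for lang, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

english = guess_language("Simone Biles is the best gymnast in the world")
spanish = guess_language("La gimnasta es la mejor del mundo")
print(english, spanish)
```

Short texts such as tweets give this kind of counter very little evidence to work with, which is one reason a dedicated service is worth having.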
Lemmatization, PoS and Parsing
Lemmatization, PoS and Parsing provides a complete morphosyntactic analysis of a text, which includes classic NLP tasks such as the following:
- Lemmatization: the task of “grouping together the different inflected forms of a word so they can be analyzed as a single item”.
- PoS tagging or grammatical tagging: the task of “marking up a word in a text (corpus) as corresponding to a particular part of speech”.
- Parsing or syntactic analysis: the task of “analyzing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar”.
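The first two tasks can be pictured with a toy lookup table that maps each word form to its lemma and part of speech. The `LEXICON` contents, the tag names and the `analyze` function below are assumptions made for this example; a real analyzer uses full dictionaries and resolves ambiguity from context, and parsing is left out of the sketch entirely.

```python
import re

# Hypothetical mini-lexicon: word form -> (lemma, part-of-speech tag).
LEXICON = {
    "won": ("win", "VERB"),
    "winning": ("win", "VERB"),
    "wins": ("win", "VERB"),
    "medal": ("medal", "NOUN"),
    "medals": ("medal", "NOUN"),
    "the": ("the", "DET"),
}

def analyze(sentence):
    """Return (token, lemma, tag) triples; unknown words keep their form."""
    result = []
    for token in re.findall(r"\w+", sentence.lower()):
        lemma, tag = LEXICON.get(token, (token, "UNKNOWN"))
        result.append((token, lemma, tag))
    return result

analysis = analyze("Biles won the medals")
lemmas = {lemma for _, lemma, _ in analyze("won winning wins")}
print(analysis)
print(lemmas)
```

The second call shows the point of lemmatization: three different inflected forms collapse into the single item `win`, so they can be counted and analyzed together.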
On the right, there’s the morphosyntactic tree obtained from one of the sentences in the text we’ve used as an example throughout the post.
As you can probably guess from the image, this morphosyntactic tree is also combined with the output of Topics Extraction and Sentiment Analysis.
This provides an extremely powerful API, where you can combine sentiment analysis with morphological, syntactic and semantic information. The output is quite complex, but it provides a myriad of possibilities for post-processing, including, among others, pattern extraction.
Corporate Reputation is not so much a classic NLP task as a combination of several tasks focused on a specific application. “Reputation of a social entity (a person, a social group, an organization) is an opinion about that entity, typically a result of social evaluation on a set of criteria”.
By combining Topics Extraction, Sentiment Analysis and Text Classification, we are able to analyze the sentiment associated with the organizations mentioned in a text according to the different categories defined in a reputation model.
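One way to picture that combination is as a table with one row per organization mention, carrying the category it was classified under and the polarity detected for it, which is then aggregated. The sketch below assumes that shape; the organization name, the category names, the scores and the `reputation` function are all invented for this example.

```python
from collections import defaultdict

def reputation(mentions):
    """Average the sentiment score per (organization, category) pair.

    mentions: list of (organization, category, score) triples.
    """
    grouped = defaultdict(list)
    for organization, category, score in mentions:
        grouped[(organization, category)].append(score)
    return {key: sum(values) / len(values) for key, values in grouped.items()}

# Hypothetical mentions, as if produced by the three analyses combined.
mentions = [
    ("Acme Corp", "Innovation", 1.0),
    ("Acme Corp", "Innovation", 0.5),
    ("Acme Corp", "Financial performance", -0.5),
]

summary = reputation(mentions)
print(summary)
```

Keeping the category as part of the key is what turns plain sentiment into reputation: the same organization can score well on one dimension and poorly on another.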