Unstructured content is growing. Turn it into actionable insights

Unstructured content in the form of free-format text, images, audio, and video (as opposed to structured data) is the "natural" raw material of communication between people. It is commonly accepted that 80% of business-relevant information is unstructured, mainly text, and this unstructured content is growing much faster than structured data.

In particular, despite its huge potential as a source of valuable insights, free text is rarely analyzed and hardly used in decision-making: manually reading texts and extracting insights is, in the best-case scenario, tedious and expensive and, in the worst, impossible due to the sheer volumes involved. To overcome this challenge, text analytics technologies automatically process and analyze textual content and provide valuable insights, transforming this "raw" data into structured, manageable information.

What is Text Analytics?

Text analytics, a concept approximately equivalent to text mining, refers to the automatic extraction of high-value information from text. This extraction usually involves structuring the input text, discovering patterns in the structured text and, finally, evaluating and interpreting the results. To achieve this, techniques from machine learning, statistics, computational linguistics, data mining, and information retrieval are used, which gives text mining a strongly multidisciplinary character.

These technologies and processes discover and present knowledge (facts, opinions, relationships) that would otherwise remain hidden in textual form, inaccessible to automatic processing. For this purpose, text analytics tools internally employ linguistic resources that model the language to be analyzed: grammars, ontologies, taxonomies.

Why is text analytics more important than ever?

The need to extract information from unstructured content has always been there. But in the last few years especially, the explosion of user-generated content in social media (networks, forums, communities) has greatly increased this need. On the Internet, a multitude of comments, posts, and product reviews are generated, and they have incalculable value for "taking the pulse" of the market... or of society in general. This has made these media the most powerful driver for the adoption of analytical technologies. Not to mention organizations' internal content and the records of their multichannel external interactions (via email, chat, etc.), which are increasingly abundant and valuable.

Also, the availability on the market of a range of technologies and products that are reliable, easy to use and integrate, and affordable (most of them offered as SaaS) has contributed to their adoption by organizations of all kinds.

Where can it be employed?

Text analytics adds value in a multitude of contexts, and new application areas are discovered almost every day. These are some of the most common:

  • Organizations of all kinds need to understand the external parties they interact with. In commercial enterprises, this is known as Voice of the Customer / Customer Experience Management: the massive, automatic processing of the unstructured information contained in surveys, contact center interactions, and social comments provides a 360° view of those customers. In the case of public administrations (city councils, governments) and other political organizations, this scenario is called Voice of the Citizen or Voter.
  • One application area that to some extent overlaps with the previous one is media monitoring and analysis, especially of social media but also of traditional media, given that the information analyzed may be generated both by (potential) clients and by reporters, analysts, and influencers.
  • Additionally, when we analyze an organization's internal community instead of the external one, we are talking about Voice of the Employee applications oriented toward talent management.
  • In scientific research, text analytics is employed to mine large volumes of articles and other documents, identify relationships, and facilitate information retrieval.
  • Media and publishers use it to exploit their archives, produce higher-quality content more quickly, engage the audience through personalized content and monetize their production through targeted advertising and new business models.
  • In the field of justice and the prevention of and fight against crime, in compliance and eDiscovery applications, it is used to process documents and communications automatically in order to uncover clues of potentially criminal behavior, e.g., insider trading or fraud.
  • Organizations in areas such as health and law employ it to perform the automatic coding and analysis of records for better categorization, mapping, and exploitation.

Typical tasks of text analytics

Text mining processes are often conceived as a combination of several tasks, which include the following:

  • Part-of-speech tagging (or PoS tagging) consists in identifying the structure of a text and assigning each word its grammatical category, depending on the context in which it appears.
  • Clustering discovers the relevant topics and relationships within a collection of documents by grouping them into sets that are internally homogeneous but different from one another, according to similarity criteria. It is especially useful in exploratory applications, where the aim is to discover non-predefined topics and to find similarities or duplicates among documents.
  • Classification or categorization consists in assigning a text one or more categories from a predefined taxonomy, considering the global content of the text. In general, it requires a previously configured and trained classification model built in line with the selected taxonomy. Classification is used to identify the theme (or themes) treated in the text as a whole.
  • Information extraction is employed to identify entities within the text (names of people, places, companies, brands), abstract concepts, and other specific information elements: amounts, relationships, etc. It is used to detect mentions and identify the most meaningful elements of a text.
  • Sentiment analysis detects the polarity (positive, negative, neutral, or absence of polarity) contained in a document. This polarity can be due to a subjective opinion or the expression of an objective fact of one or another sign. In addition to the global polarity at document level, it is possible to carry out a more granular analysis and identify the polarity associated with different aspects or attributes mentioned in the same document.
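As a rough illustration of three of the tasks above (classification, information extraction, and sentiment analysis), here is a toy keyword-based sketch. All the dictionaries and the example review are invented for illustration only; real systems use trained models and full linguistic resources rather than hand-written word lists.

```python
import re

# Toy resources (purely illustrative, not a real system's dictionaries).
TAXONOMY = {
    "rooms": {"room", "bed", "suite"},
    "food": {"breakfast", "restaurant", "food"},
}
POLARITY_LEXICON = {"great": 1, "clean": 1, "small": -1, "noisy": -1}
KNOWN_ENTITIES = {"London", "Hilton"}

def classify(text):
    """Assign categories from a predefined taxonomy by keyword match."""
    words = set(re.findall(r"\w+", text.lower()))
    return sorted(cat for cat, kws in TAXONOMY.items() if words & kws)

def extract_entities(text):
    """Detect mentions of known named entities."""
    tokens = set(re.findall(r"\w+", text))
    return sorted(tokens & KNOWN_ENTITIES)

def sentiment(text):
    """Aggregate word polarities into a document-level label."""
    score = sum(POLARITY_LEXICON.get(w, 0)
                for w in re.findall(r"\w+", text.lower()))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

review = "Great breakfast at the Hilton in London, but the room was small and noisy."
print(classify(review))          # ['food', 'rooms']
print(extract_entities(review))  # ['Hilton', 'London']
print(sentiment(review))         # negative
```

Note how the document-level sentiment comes out negative even though the review praises the breakfast; this is exactly why the more granular, aspect-level analysis mentioned above is often preferable.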

At MeaningCloud, we provide APIs to perform all these tasks.

What determines the quality of text analytics?

Like many artificial intelligence applications, text mining is not perfect: it does not provide correct results in 100% of cases. In fact, not even "human intelligence" is perfect when it comes to understanding texts: in experiments with human analysts, due to the ambiguity of language, the success rate is only around 90-95%. The quality of automatic analytics is essentially measured by two parameters, recall and precision, which indicate respectively the exhaustiveness (all relevant elements are identified) and the correctness (every identified element is relevant) of the results.

Recall and precision are antagonistic, in the sense that tuning a system to increase precision tends to reduce recall, and vice versa. Therefore, developing a solution based on text analytics involves achieving an optimal trade-off between the two, depending on the scenario in question.
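To make the two metrics concrete, here is a small worked example (the entity names and counts are invented for illustration): a document mentions four real entities and a hypothetical extractor returns five candidates, three of which are correct.

```python
def precision_recall(predicted, relevant):
    """Compute precision and recall for a set of extracted items."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"London", "Hilton", "Thames", "Heathrow"}            # 4 real entities
predicted = {"London", "Hilton", "Thames", "Paris", "Berlin"}    # 5 candidates, 3 correct
p, r = precision_recall(predicted, relevant)
print(p, r)  # 0.6 0.75
```

Precision is 3/5 = 0.6 (two of the identified elements are not relevant) and recall is 3/4 = 0.75 (one relevant entity, "Heathrow", was missed); a stricter extractor might raise the 0.6 at the cost of the 0.75, which is the trade-off described above.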

Obviously, the quality of a text analytics system depends on the technologies and algorithms used. But there is another vital aspect that determines the suitability of the final result of a text mining project: the adaptability of its tools to the domain of the problem, achieved by tailoring the linguistic resources (dictionaries, classification models, sentiment lexicons) to be employed.

For example, if we are analyzing user reviews about hotels in London, we must include items such as the hotels' names, the typical attributes that define their quality (rooms, services, food...), the polarity associated with a room being big or small, models to classify such conversations thematically, and so on. Customizing the resources for a specific domain makes it possible to reach an optimal trade-off between precision and recall. MeaningCloud provides powerful resource-customization functions that make it easy to adapt its functionality to each domain.
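A minimal sketch of what domain customization can mean in practice: a generic sentiment lexicon is overlaid with hotel-specific polarities, so that "big" and "small", neutral in general language, become polar when applied to rooms. The lexicons and values are invented for illustration; they are not MeaningCloud's actual resources.

```python
# Generic base lexicon: "big" and "small" carry no polarity on their own.
BASE_LEXICON = {"big": 0, "small": 0, "dirty": -1, "friendly": 1}

# Hotel-domain overrides: a big room is good, a small one is bad.
HOTEL_OVERRIDES = {"big": 1, "small": -1}

def polarity(word, domain_overrides=None):
    """Look up a word's polarity, letting domain entries override the base."""
    lexicon = {**BASE_LEXICON, **(domain_overrides or {})}
    return lexicon.get(word.lower(), 0)

print(polarity("small"))                   # 0 in the generic domain
print(polarity("small", HOTEL_OVERRIDES))  # -1 for hotel reviews
```

The same override mechanism could carry domain entity lists (hotel names) or category keywords; the point is that the analysis engine stays fixed while the resources change per domain.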

Advantages of automating text analytics

Sometimes, manual processing is a viable option for text mining. However, when the requirements of volume, velocity, or variability increase, automatic processing becomes essential, since it brings undeniable benefits:

  • Volume, scalability. Manual processing does not scale well as the volume of text to analyze grows: its unit costs increase with that volume, which is unacceptable in a world where the amount of unstructured content grows at an exponential rate. Automated tools, by contrast, can process virtually unlimited volumes at ever-decreasing unit costs.
  • Homogeneity, standardization. Human annotators are also subject to errors due to the ambiguity of language; moreover, these errors and the criteria applied depend on the individual (and even on their situation at any given time), producing inconsistencies that are difficult to prevent. By contrast, although the accuracy of automatic analytics might initially be lower, its bias is homogeneous and therefore easier to counteract. Furthermore, an automatic tool always applies consistent criteria and procedures, providing more homogeneous results.
  • Availability. Automatic tools are always available, which makes the presence of specific individuals at specific times unnecessary.
  • Low latency. Automatic procedures respond in milliseconds (even with high volumes), enabling decision-making and action in near real time.
  • Quality. With a proper adaptation to the application environment, automatic tools can achieve precision and recall parameters comparable to human processing.

What is the relationship between text analytics and cognitive computing?

Cognitive computing makes a new class of problems computable. It tackles complex situations characterized by ambiguity and uncertainty; in other words, it addresses human-type problems. Cognitive computing combines artificial intelligence and machine learning algorithms in an approach that tries to reproduce the behavior of the human brain. One of its promises is to provide a new user experience based on communication in natural language. Its learning abilities are also very interesting and promise great benefits.

Cognitive computing extends analytics to new types of data, using new technologies. The new types of data include multimedia and unstructured content; the new technologies include language processing and machine learning. These technologies make it possible to train cognitive systems through examples instead of programming them.

Text analytics is a subset and core component of cognitive computing, which broadens the scope of analytics to fields that were previously unattainable with more traditional techniques such as business intelligence or statistics.

What features should a good text analytics solution have?

Experts in this industry highlight a series of characteristics that contribute to the value and suitability of a text mining tool:

  • Completeness: it must feature a wide range of functions to implement text analytics tasks.
  • Integrability: it should be easy to integrate into existing systems, applications, and user processes; this can be achieved through open interfaces and a repertoire of SDKs and plug-ins compatible with different languages and systems.
  • Customization: it should facilitate its adaptation to the application domain to optimize the accuracy of the analysis.
  • Low risk and cost: it must include tested and reliable technologies, should not require large investments or commitments, and must be affordable.

Together, these attributes result in a short time-to-benefit: they allow users to quickly obtain the benefits promised by these technologies, without having to spend valuable time and resources on internal development.