Unstructured content is growing. Turn it into actionable insights

Unstructured content in the form of free-format text, images, audio, and video (as opposed to structured data) is the "natural" raw material of communication between people. It is commonly accepted that 80% of business-relevant information is unstructured, mainly text, and this unstructured content is growing much faster than structured data.

In particular, despite its huge potential as a source of valuable insights, free text is rarely analyzed and hardly used in decision-making: manually reading texts and extracting insights is, in the best-case scenario, tedious and expensive and, in the worst, impossible due to the sheer volumes involved. To overcome this challenge, text analytics technologies automatically process and analyze textual content and provide valuable insights, transforming this "raw" data into structured, manageable information.

What is Text Analytics?

Text analytics, a concept approximately equivalent to text mining, refers to the automatic extraction of high-value information from text. This extraction usually involves structuring the input text, discovering patterns in the structured text and, finally, evaluating and interpreting the results. To achieve this, techniques from machine learning, statistics, computational linguistics, data mining, and information retrieval are used, which gives text mining a strongly multidisciplinary character.

These technologies and processes discover and present knowledge (facts, opinions, relationships) that would otherwise remain hidden in textual form, inaccessible to automatic processing. For this purpose, text analytics tools internally employ linguistic resources that model the language to be analyzed: grammars, ontologies, taxonomies.

Why is text analytics more important than ever?

The need to extract information from unstructured content has always been there. But in the last few years especially, the explosion of user-generated content in social media (networks, forums, communities) has greatly increased this need. On the Internet, a multitude of comments, posts, and product reviews are generated, and they have incalculable value for "taking the pulse" of the market... or of society in general. This has made these media the most powerful driver for the adoption of analytical technologies. Not to mention organizations' internal content and the records of their multichannel external interactions (via email, chat, etc.), which are increasingly abundant and valuable.

Also, the availability on the market of a range of technologies and products that are reliable, easy to use and integrate, and affordable (most of them offered as SaaS) has contributed to their adoption by organizations of all kinds.

Where can it be employed?

Text analytics adds value in a multitude of contexts, and new application areas are discovered almost every day. These are some of the most common:

  • Organizations of all kinds need to understand the external parties they interact with. In commercial enterprises, this is known as Voice of the Customer / Customer Experience Management: the massive, automatic processing of the unstructured information contained in surveys, contact center interactions, and social comments provides a 360° view of those customers. In the case of public administrations (city councils, governments) and other political organizations, this scenario is called Voice of the Citizen or Voter.
  • One application area that to some extent overlaps with the previous one is media monitoring and analysis, especially of social media but also of traditional media, given that the information analyzed may be generated both by (potential) clients and by reporters, analysts, and influencers.
  • Additionally, when we analyze an organization's internal community instead of the external one, we are talking about Voice of the Employee applications oriented toward talent management.
  • In scientific research, text analytics is employed to mine large volumes of articles and other documents, identify relationships, and facilitate information retrieval.
  • Media and publishers use it to exploit their archives, produce higher-quality content more quickly, engage the audience through personalized content and monetize their production through targeted advertising and new business models.
  • In the field of justice and the prevention of and fight against crime, in compliance and eDiscovery applications, it is used to process documents and communications automatically in order to uncover clues of potentially criminal behavior, e.g., insider trading or fraud.
  • Organizations in areas such as health and law employ it to perform the automatic coding and analysis of records for better categorization, mapping, and exploitation.

Typical tasks of text analytics

Text mining processes are often conceived as a combination of several tasks, which include the following:

  • Part-of-speech tagging (or PoS tagging) consists in identifying the structure of a text and assigning each word its grammatical category, depending on the context in which it appears.
  • Clustering discovers the relevant topics and relationships within a collection of documents by grouping them into sets that are internally homogeneous but different from one another, according to similarity criteria. It is especially useful in exploratory applications, where the aim is to discover non-predefined topics and to find similarities or duplicates among documents.
  • Classification or categorization consists in assigning a text one or more categories from a predefined taxonomy, considering the global content of the text. In general, it requires a previously configured and trained classification model built in line with the selected taxonomy. Classification is used to identify the theme (or themes) treated in the text as a whole.
  • Information extraction is employed to identify entities within the text (names of people, places, companies, brands), abstract concepts, and other specific information elements: amounts, relationships, etc. It is used to detect mentions and identify the most meaningful elements of a text.
  • Sentiment analysis detects the polarity (positive, negative, neutral, or absence of polarity) contained in a document. This polarity can be due to a subjective opinion or the expression of an objective fact of one or another sign. In addition to the global polarity at document level, it is possible to carry out a more granular analysis and identify the polarity associated with different aspects or attributes mentioned in the same document.
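As a rough illustration of three of the tasks above (classification, information extraction, and sentiment analysis), here is a toy keyword-based sketch. All the dictionaries and the example review are invented for illustration only; real systems use trained models and full linguistic resources rather than hand-written word lists.

```python
import re

# Toy resources (purely illustrative, not a real system's dictionaries).
TAXONOMY = {
    "rooms": {"room", "bed", "suite"},
    "food": {"breakfast", "restaurant", "food"},
}
POLARITY_LEXICON = {"great": 1, "clean": 1, "small": -1, "noisy": -1}
KNOWN_ENTITIES = {"London", "Hilton"}

def classify(text):
    """Assign categories from a predefined taxonomy by keyword match."""
    words = set(re.findall(r"\w+", text.lower()))
    return sorted(cat for cat, kws in TAXONOMY.items() if words & kws)

def extract_entities(text):
    """Detect mentions of known named entities."""
    tokens = set(re.findall(r"\w+", text))
    return sorted(tokens & KNOWN_ENTITIES)

def sentiment(text):
    """Aggregate word polarities into a document-level label."""
    score = sum(POLARITY_LEXICON.get(w, 0)
                for w in re.findall(r"\w+", text.lower()))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

review = "Great breakfast at the Hilton in London, but the room was small and noisy."
print(classify(review))          # ['food', 'rooms']
print(extract_entities(review))  # ['Hilton', 'London']
print(sentiment(review))         # negative
```

Note how the document-level sentiment comes out negative even though the review praises the breakfast; this is exactly why the more granular, aspect-level analysis mentioned above is often preferable.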

At MeaningCloud, we provide APIs to perform all these tasks.

What determines the quality of text analytics?

Like many artificial intelligence applications, text mining is not perfect: it does not provide correct results in 100% of cases. In fact, not even "human intelligence" is perfect when it comes to understanding texts: in experiments with human analysts, due to the ambiguity of language, the success rate is only around 90-95%. The quality of automatic analytics is essentially measured by two parameters, recall and precision, which indicate respectively the exhaustiveness (all relevant elements are identified) and the correctness (every identified element is relevant) of the results.

Recall and precision are antagonistic, in the sense that tuning a system to increase precision tends to reduce recall, and vice versa. Therefore, developing a solution based on text analytics involves achieving an optimal trade-off between the two, depending on the scenario in question.
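To make the two metrics concrete, here is a small worked example (the entity names and counts are invented for illustration): a document mentions four real entities and a hypothetical extractor returns five candidates, three of which are correct.

```python
def precision_recall(predicted, relevant):
    """Compute precision and recall for a set of extracted items."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"London", "Hilton", "Thames", "Heathrow"}            # 4 real entities
predicted = {"London", "Hilton", "Thames", "Paris", "Berlin"}    # 5 candidates, 3 correct
p, r = precision_recall(predicted, relevant)
print(p, r)  # 0.6 0.75
```

Precision is 3/5 = 0.6 (two of the identified elements are not relevant) and recall is 3/4 = 0.75 (one relevant entity, "Heathrow", was missed); a stricter extractor might raise the 0.6 at the cost of the 0.75, which is the trade-off described above.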

Obviously, the quality of a text analytics system depends on the technologies and algorithms used. But there is another vital aspect that determines the suitability of the final result of a text mining project: the adaptability of its tools to the domain of the problem, achieved by tailoring the linguistic resources (dictionaries, classification models, sentiment lexicons) to be employed.

For example, if we are analyzing user reviews about hotels in London, we must include items such as the hotels' names, the typical attributes that define their quality (rooms, services, food...), the polarity associated with a room being big or small, models to classify such conversations thematically, and so on. Customizing the resources for a specific domain makes it possible to reach an optimal trade-off between precision and recall. MeaningCloud provides powerful resource-customization functions that make it easy to adapt its functionality to each domain.
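A minimal sketch of what domain customization can mean in practice: a generic sentiment lexicon is overlaid with hotel-specific polarities, so that "big" and "small", neutral in general language, become polar when applied to rooms. The lexicons and values are invented for illustration; they are not MeaningCloud's actual resources.

```python
# Generic base lexicon: "big" and "small" carry no polarity on their own.
BASE_LEXICON = {"big": 0, "small": 0, "dirty": -1, "friendly": 1}

# Hotel-domain overrides: a big room is good, a small one is bad.
HOTEL_OVERRIDES = {"big": 1, "small": -1}

def polarity(word, domain_overrides=None):
    """Look up a word's polarity, letting domain entries override the base."""
    lexicon = {**BASE_LEXICON, **(domain_overrides or {})}
    return lexicon.get(word.lower(), 0)

print(polarity("small"))                   # 0 in the generic domain
print(polarity("small", HOTEL_OVERRIDES))  # -1 for hotel reviews
```

The same override mechanism could carry domain entity lists (hotel names) or category keywords; the point is that the analysis engine stays fixed while the resources change per domain.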

Advantages of automating text analytics

Sometimes, manual processing is a viable option for text mining. However, when the requirements of volume, velocity, or variability increase, automatic processing becomes essential, since it brings undeniable benefits:

  • Volume, scalability. Manual processing does not scale well as the volume of text to analyze grows: its unit costs increase with that volume, which is unacceptable in a world where the amount of unstructured content grows at an exponential rate. Automated tools, by contrast, can process virtually unlimited volumes at ever-decreasing unit costs.
  • Homogeneity, standardization. Human annotators are also subject to errors due to the ambiguity of language; moreover, these errors and the criteria applied depend on the individual (and even on their situation at any given time), producing inconsistencies that are difficult to prevent. By contrast, although the accuracy of automatic analytics might initially be lower, its bias is homogeneous and therefore easier to counteract. Furthermore, an automatic tool always applies consistent criteria and procedures, providing more homogeneous results.
  • Availability. Automatic tools are always available, which makes the presence of specific individuals at specific times unnecessary.
  • Low latency. Automatic procedures respond in milliseconds (even with high volumes), enabling decision-making and action in near real time.
  • Quality. With a proper adaptation to the application environment, automatic tools can achieve precision and recall parameters comparable to human processing.

What is the relationship between text analytics and cognitive computing?

Cognitive computing makes a new class of problems computable. It tackles complex situations characterized by ambiguity and uncertainty; in other words, it addresses human-type problems. Cognitive computing combines artificial intelligence and machine learning algorithms in an approach that tries to reproduce the behavior of the human brain. One of its promises is to provide a new user experience based on communication in natural language. Its learning abilities are also very interesting and promise great benefits.

Cognitive computing extends analytics to new types of data, using new technologies. The new types of data include multimedia and unstructured content; the new technologies include language processing and machine learning. These technologies make it possible to train cognitive systems through examples instead of programming them.

Text analytics is a subset and core component of cognitive computing, which broadens the scope of analytics to fields that were previously unattainable with more traditional techniques such as business intelligence or statistics.

What features should a good text analytics solution have?

Experts in this industry highlight a series of characteristics that contribute to the value and suitability of a text mining tool:

  • Completeness: it must feature a wide range of functions to implement text analytics tasks.
  • Integrability: it should be easy to integrate into existing systems, applications, and user processes; this can be achieved through open interfaces and a repertoire of SDKs and plug-ins compatible with different languages and systems.
  • Customization: it should facilitate its adaptation to the application domain to optimize the accuracy of the analysis.
  • Low risk and cost: it must include tested and reliable technologies, should not require large investments or commitments, and must be affordable.

Together, these attributes result in a short time-to-benefit: they allow users to quickly obtain the benefits promised by these technologies, without having to spend valuable time and resources on internal development.