Text Clustering | MeaningCloud

Group similar texts and discover meaningful subjects

The Text Clustering API divides a set of texts into several groups, according to the similarities and differences among them, and gives each group a representative name. Use it to detect duplicate texts, recommend related content, organize the texts in a collection according to their content (rather than externally predefined categories), and discover meaningful subjects in your customers’ feedback and in all types of unstructured interactions.

MeaningCloud’s Text Clustering API

The Text Clustering API automatically detects the implicit structure of a collection of documents: it identifies the most frequent subjects within the collection and arranges the individual documents into several groups (clusters). This distribution maximizes the similarity between the elements of the same group and, at the same time, the differences between groups. This MeaningCloud API is specialized in processing unstructured content; unlike many offerings on the market, it is not a clustering functionality for structured data. It groups documents not by purely textual similarity, but according to their relevance to the subjects present in the collection, and it automatically assigns each cluster a title or name that represents its prevailing subject. Internally it also employs lemmatization, so that all the variants of a term are taken into account, and it can be configured to consider stopwords and other linguistic aspects.
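As a rough illustration of why stopwords and term variants matter before texts are grouped, the sketch below filters a small stopword list and applies a crude suffix-stripping rule as a stand-in for real lemmatization. The names `STOPWORDS`, `crude_stem` and `normalize` are hypothetical, introduced only for this example; they are not part of the MeaningCloud API, whose internal processing is not described here.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "are"}

def crude_stem(token: str) -> str:
    """Strip a couple of English suffixes; a rough stand-in for lemmatization."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(text: str) -> list[str]:
    """Lowercase, tokenize, drop stopwords, and reduce variants of a term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(normalize("The websites are loading slowly"))  # ['website', 'load', 'slowly']
print(normalize("A website loads slowly"))           # ['website', 'load', 'slowly']
```

After normalization both sentences map to the same terms, so a similarity measure computed on them would treat the two texts as equivalent even though their surface forms differ.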

Differences between text classification and clustering

The classification or categorization of texts consists of assigning to a single text one or more categories from a predefined taxonomy. Creating a classification model requires either training an engine with manually preclassified texts (what is known as supervised learning) or defining a set of rules for each category. MeaningCloud provides categorization functionality through its Text Classification API, which offers several predefined, standard classification models (e.g. IPTC for news, IAB for web content) and also lets users create custom models with the product’s customization tools.

By contrast, clustering is generally performed on a whole set of documents at once, arranging them into several groups according to their similarities. Moreover, it does not depend on a predefined taxonomy: the decision as to whether a text belongs to one group or another is made dynamically, based on the content of the set of documents. Clustering therefore requires neither the prior definition of a taxonomy nor the subsequent training or rule writing, in an approach known as unsupervised learning.

Classification and clustering are two complementary approaches. Classification is appropriate when the structure of the set of documents is known a priori and the aim is to analyze individual documents. Clustering requires analyzing a set of documents together (and the result changes if the set is altered), but it offers the potential to discover the implicit structure and the meaningful subjects that emerge from the documents’ content.

In general, clustering makes it possible to obtain more unexpected insights and to express them in “the very same terms” that appear in the texts. For example, a company may classify its customers’ feedback by product and route the opinions to the appropriate departments. With clustering techniques, however, the company might discover that in a certain period most of those opinions said that “the website is too slow”, regardless of the product: an important insight that could have gone unnoticed under the rigid classification just described.
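To make the contrast with classification concrete, here is a minimal sketch of unsupervised grouping: no taxonomy is supplied, and groups emerge only from word overlap between texts. It uses a simple greedy pass with Jaccard similarity, chosen for brevity; it is not MeaningCloud's algorithm, and the function names, sample feedback and `threshold` value are assumptions made for this example.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased set of words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words two texts have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def greedy_cluster(texts: list[str], threshold: float = 0.3) -> list[list[int]]:
    """Assign each text to the first cluster whose seed text is similar enough."""
    token_sets = [tokens(t) for t in texts]
    clusters: list[list[int]] = []
    for i, current in enumerate(token_sets):
        for cluster in clusters:
            seed = token_sets[cluster[0]]  # first member represents the cluster
            if jaccard(current, seed) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters

feedback = [
    "the website is too slow",
    "website is slow today",
    "delivery arrived late",
    "my delivery was late again",
]
print(greedy_cluster(feedback))  # [[0, 1], [2, 3]]
```

Note that adding or removing texts can change the groups, which matches the point above that the result of clustering depends on the whole set.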

Text Clustering Applications

Clustering is especially suited to applications that aim to detect relations between different texts, distribute them dynamically into natural groups, or discover the most relevant subjects within their content and express them in their own terms. More specifically, in the key fields of Voice of the Customer analysis and Customer Experience management, clustering is applied when the goal is to discover the “new voice” of those customers.

Media monitoring and analysis (social and traditional)

Detection of duplicate content, identification of plagiarism, grouping of related news.

Information retrieval and recommendation systems

Grouping of search results, navigation aids, suggestion of related information, recommendation of content and products.

Feedback analysis and opinion mining

Detection of unforeseen subjects, not defined in advance, in surveys and claims (which enables more proactive management and a more effective response); aggregation and description of verbatims using “their own words”; analysis of the voice of the customer, employee, citizen, etc.; idea management.

Document organization

Structuring of collections of documents and records according to the implicit subjects that naturally emerge from the contents themselves and not from external taxonomies.
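As one example of the applications listed above, near-duplicate detection can be sketched by comparing overlapping word sequences (shingles) between texts. This is a generic technique shown for illustration only, not a description of how the MeaningCloud API works; the function names, the shingle size and the `threshold` are assumptions.

```python
from itertools import combinations

def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """All runs of `size` consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def near_duplicates(texts: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Index pairs of texts whose shingle sets overlap at or above the threshold."""
    sets = [shingles(t) for t in texts]
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        union = sets[i] | sets[j]
        if union and len(sets[i] & sets[j]) / len(union) >= threshold:
            pairs.append((i, j))
    return pairs

news = [
    "Central bank raises interest rates by a quarter point",
    "Central bank raises interest rates by a quarter point today",
    "Local team wins the championship final",
]
print(near_duplicates(news))  # [(0, 1)]
```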

Advantages of MeaningCloud’s Text Clustering API

Our API is specialized in processing unstructured content (not structured data) and is easy to configure and integrate.

Optimized for unstructured content

It processes all types of text, from documents in formal language to social media comments, in several languages, and it employs lemmatization to take into account all the variants of a term.

Automatically generated descriptions

It uses the phrases that appear in the texts of each cluster to provide meaningful descriptions of each one.

Configurable

It lets you define stopwords and configure other linguistic aspects to adapt and refine the analysis of texts.

Easy to integrate

Its standard interface and SDKs make it easy to incorporate clustering into any application, with maximum scalability and availability.
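The idea of describing a cluster with phrases taken from its own texts, mentioned above, can be sketched as follows: pick the two-word phrase that occurs in the largest number of the cluster's texts. This is a simplified assumption about how such a label might be derived, not the API's actual method; `label_cluster` and the stopword list are hypothetical names for this example.

```python
import re
from collections import Counter

# Hypothetical stopword list; phrases containing these words are skipped.
STOPWORDS = {"the", "is", "on", "a"}

def label_cluster(texts: list[str], n: int = 2) -> str:
    """Return the n-word phrase found in the most texts of the cluster."""
    counts: Counter[str] = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        # A set, so that each phrase is counted at most once per text.
        phrases = {
            " ".join(words[i:i + n])
            for i in range(len(words) - n + 1)
            if not any(w in STOPWORDS for w in words[i:i + n])
        }
        counts.update(phrases)
    if not counts:
        return ""
    # Highest count wins; ties are broken alphabetically for determinism.
    return min(counts, key=lambda p: (-counts[p], p))

cluster = [
    "the website is too slow",
    "website too slow on mobile",
    "checkout page: website too slow",
]
print(label_cluster(cluster))  # too slow
```

Because the label is built only from words that literally appear in the texts, it reflects the customers' own vocabulary rather than a predefined category name.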

Who can benefit

Market research and CX management agencies can use this API to discover the “new voice” in the unstructured feedback provided by customers and employees. Companies and organizations in any industry can discover the implicit structure of their collections of documents and records. Providers of monitoring and analysis tools for traditional and social media can incorporate these advanced capabilities and thereby differentiate their offering.