We have just published a new release of MeaningCloud with some new features that will change the way you do text analytics. As a complement to the most common analytical techniques (which extract information or classify a text according to predefined dictionaries and categories), we have included unsupervised learning techniques that let you explore a collection of documents to discover and extract unexpected insights (subjects, relationships) from them.
In this new release of MeaningCloud we have published a Text Clustering API that lets you discover the implicit structure and the meaningful subjects embedded in the contents of your documents, social conversations, etc. This API takes a set of texts and distributes them into groups (clusters) according to the similarity between the contents of each document. The aim is for each cluster to contain documents that are very similar to each other and, at the same time, clearly different from the ones included in other clusters.
Clustering is a technology traditionally used in the analysis of structured data. What makes our API special is that its pipelines are optimized for analyzing unstructured text.
In particular, the Text Clustering API:
- Uses lemmatization technology to take into account all the morphological variants of a term (e.g. high/higher/highest).
- Lets you define “stop words” that should not be considered in the analysis process because of their low semantic relevance.
- Groups the documents not by applying a purely textual similarity, but according to their relevance with respect to the subjects present in a collection.
- Assigns to each cluster a name or title which semantically represents its contents.
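To make these steps concrete, here is a minimal toy sketch in Python. It is not MeaningCloud's implementation and does not call the Text Clustering API. The stop-word list, the suffix-stripping `normalize` function (a crude stand-in for real lemmatization), the similarity threshold, and the greedy grouping strategy are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for this sketch only.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "and", "of", "to", "in", "my", "it"}


def normalize(token):
    """Crude stand-in for lemmatization: strip a few common English suffixes."""
    for suffix in ("est", "er", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vectorize(text):
    """Turn a text into a bag of normalized terms, ignoring stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(normalize(t) for t in tokens if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering: each text joins the most similar
    existing cluster if the similarity reaches the threshold, otherwise
    it starts a new cluster."""
    clusters = []
    for index, text in enumerate(texts):
        vec = vectorize(text)
        best, best_sim = None, threshold
        for candidate in clusters:
            sim = cosine(vec, candidate["centroid"])
            if sim >= best_sim:
                best, best_sim = candidate, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [index]})
        else:
            best["centroid"].update(vec)
            best["members"].append(index)
    # Name each cluster after its most frequent term.
    for c in clusters:
        c["label"] = c["centroid"].most_common(1)[0][0] if c["centroid"] else ""
    return clusters


if __name__ == "__main__":
    docs = [
        "The battery drains fast",
        "Battery draining quickly",
        "Delivery was late",
        "Late delivery again",
    ]
    for c in cluster(docs):
        print(c["label"], [docs[i] for i in c["members"]])
```

A production service would use proper linguistic resources and a more robust clustering algorithm; the sketch only shows how normalization, stop words, similarity-based grouping and cluster naming fit together.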
The Text Clustering API complements the Topics Extraction and Text Classification APIs (which employ predefined taxonomies and dictionaries), providing more flexible and dynamic analytics and letting you discover meaningful subjects and unexpected relations between documents.
Where can you apply Text Clustering? It suits many scenarios, especially applications that aim to detect relations between different texts, distribute them dynamically into natural groups, or discover the most relevant subjects within their content and express them in their own terms. More specifically, in the key fields of Voice of the Customer analysis and User Experience management, clustering is applied when you need to discover the “new voice” of those customers.
As usual, you can find more information in the API’s documentation page; you can also try it thoroughly without programming by using its Test Console.
But there are many more new features in this MeaningCloud release. Here are the other ones:
- IAB standard classification model. We have extended our set of predefined models for the Text Classification API. In addition to IPTC, EuroVoc and Business Reputation, we now offer the IAB standard taxonomy for classifying advertising-oriented content. Using this model we can identify whether a certain website or page (or even an advertisement) is about Business, Health, Technology, etc. and thus achieve better ad targeting and stronger brand protection.
- Processing of URLs and HTML. We have greatly improved the way in which our engines process URLs and HTML code, making them more robust in the face of malformed sources and optimizing how external elements such as scripts and style sheets are dealt with.
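As an illustration of what handling scripts and style sheets can involve, the following sketch extracts only the visible text from an HTML page using Python's standard `html.parser` module. It is an assumption-laden example, not the way MeaningCloud's engines work; the `VisibleTextExtractor` class and `extract_text` function are hypothetical names introduced here.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text content while ignoring <script> and <style> blocks."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><p>Hello <b>world</b></p><p>Unclosed paragraph</body>"
    )
    print(extract_text(page))
```

The standard-library parser tolerates some malformed markup, such as the unclosed paragraph in the sample, which is why it is used here instead of a strict XML parser.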
In addition, we have used this release to introduce performance enhancements and extend our linguistic resources: changes that will positively affect several existing APIs. You can find the details in each API’s change log.
We hope you find these improvements very useful!