Topic Extraction in Dataiku

Topic Extraction

Topic extraction extracts named entities (people, organization, etc.), concepts, money expressions and quantities from a text. These elements have an ontology type associated that gives a more detailed description of what they are. The types are defined by MeaningCloud's ontology, although it can be customized using user-defined dictionaries.

These user-defined dictionaries can be defined in the dictionary console. There you can add new entries that follow our ontology, or define your own.

These are the settings for this recipe:

Input parameters:
- Text column: the column names available in the dataset used as input source will be loaded, so you can select the one with the texts to analyze.
- Language: language of the text to analyze.
Configuration:
- Topics: types of topics to extract. There are four types available:
  - Entities, with named entities, such as 'James Bond' or 'United Kingdom'.
  - Concepts, with keywords such as 'financial crisis' or 'fish and chips'.
  - Money expressions, such as '$5' or 'ten million euros'.
  - Quantities, such as '0.5%' or 'six months'.
- API configuration preset: license key and server to use in the API requests. It can be set using one of the presets defined in the plugin Settings or they can be manually defined.
Advanced: when expert mode is enabled, the following two fields can be configured:
- Relevance threshold: to filter the topics extracted by their relevance level. By default, it's set to 100, so only the most relevant elements are shown.
- User dictionary: ID of the user-defined dictionary with additional analyses to take into account in the topic extraction.

Important

Some of the languages supported are part of our language packs. You need to have access to them to analyze the sentiment successfully. If you don't have access to them, make sure to request the free trial or to subscribe to them!

The output provided by this recipe will add the following columns to the input dataset:

topic_person, with the entities and concepts with "Person" as their first level ontology type.
topic_organization, with the entities and concepts with "Organization" as their first level ontology type.
topic_location, with the entities and concepts with "Location" as their first level ontology type.
topic_product, with the entities and concepts with "Product" as their first level ontology type.
topic_id, with the entities and concepts with "ID" as their first level ontology type.
topic_event, with the entities and concepts with "Event" as their first level ontology type.
topic_other, with the entities and concepts with any first level ontology type other than the ones mentioned above.
topic_quantity, with the money expressions and quantities detected in the text.

Each cell will contain an array with the elements detected in the text for a specific ontology type.

The following example uses the dataset and the user-defined model described in this tutorial. The language is English, the relevance threshold is set to a 100, and all topic types are extracted.

Previous Next