Text tokenization & multiwords

Multiwords are combinations of words that are always grouped together if they appear in a specific order. These words are defined as terms in the categories of a model (see the term definition section) and affect the tokenization of any text that's going to be classified with it.

Here in the multiwords view you can get an overview of all the multiwords defined in the model.

Multiwords list

The main table shows information about all the multiwords defined in the category view. You can browse them by name, as entered by the user, and go directly to the category where each one is defined as a categorization rule. The table lets you sort the multiwords alphabetically by any of its columns, and it also provides a dynamic text filter.

Text tokenization

Text tokenization is the process of splitting the training texts, if there are any, and the texts to be processed into the units that will be considered for classification.

Once the tokenization is done, the system applies a basic filtering and then uses two user-defined elements: stopwords and the multiwords defined in the lists of terms.

First of all, the basic filtering will remove different types of characters in order to standardize the texts you are working on:

  1. Opening and closing characters (single and double quotation marks, question and exclamation marks...) and periods (a dot followed by a blank space) will be removed.
    • Example: It is "beautiful"! => It is beautiful
  2. Any other non-alphanumeric character except for '.' and '_' will be replaced with a blank space.
    • Example: Marks&Spencer => Marks Spencer
  3. Every character will be changed to lowercase and accent marks will be removed (except for 'ñ', which remains the same).
    • Example: Está en Español => esta en español
  4. Every suffix and prefix separated with an apostrophe will be removed.
    • Example: We're fine => we fine
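
The four filtering steps above can be sketched in Python roughly as follows. This is only an illustration, not the system's actual implementation: the function names, the exact set of opening and closing characters in WRAPPERS, the order in which the steps run, treating a dot at the end of the text as a period, and the rule of keeping the longer side of an apostrophe are all assumptions made for the example.

```python
import re
import unicodedata

# Assumed set of opening/closing characters; the list in the text is open-ended.
WRAPPERS = "\"'“”‘’«»¿?¡!()[]{}"


def strip_accents(text):
    """Remove accent marks but leave 'ñ' and 'Ñ' untouched."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in "ñÑ":
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def _keep_longer_side(match):
    # Assumption: the shorter side of the apostrophe is the prefix or suffix.
    left, right = match.group(1), match.group(2)
    return left if len(left) >= len(right) else right


def basic_filter(text):
    # Step 4 runs first here (assumption), before step 1 strips quotation marks.
    text = re.sub(r"(\w+)['’](\w+)", _keep_longer_side, text)
    # Step 1: drop wrappers and periods (a dot before whitespace or end of text).
    text = text.translate({ord(c): None for c in WRAPPERS})
    text = re.sub(r"\.(?=\s|$)", "", text)
    # Step 2: other non-alphanumeric characters, except '.' and '_', become spaces.
    text = re.sub(r"[^\w.\s]", " ", text)
    # Step 3: lowercase and remove accent marks.
    text = strip_accents(text).lower()
    # Collapse any runs of whitespace left behind by the replacements.
    return " ".join(text.split())


print(basic_filter('It is "beautiful"!'))  # it is beautiful
print(basic_filter("Marks&Spencer"))       # marks spencer
print(basic_filter("Está en Español"))     # esta en español
print(basic_filter("We're fine"))          # we fine
```

Because the sketch applies every step on each call, the first two outputs are also lowercased, unlike the per-step examples in the list.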

After this filtering, the user-defined multiwords from the lists of terms in the categories will be grouped. Because this grouping is done right after the filtering, on text that has already been standardized, there are certain limits on the characters allowed in a multiword definition for the system to work correctly.

Note that this grouping follows the usual reading direction for European languages: horizontally, from left to right.

Here you can see an example of the resulting text after grouping two multiwords:

Text            | Multiword grouped | Result | Resulting text
the blue house  | the+blue          |        | the blue house
the blue house  | blue+house        |        | the blue house
this blue house | the+blue          |        | this blue house
this blue house | blue+house        |        | this blue house
the blue house  | the+blue          |        | the blue house
the blue house  | the+blue+house    |        | the blue house

Note that when two patterns could match, the one that appears first in the text is the one chosen.

Important

When more than one grouping possibility exists, the system will choose the longest. In the text "the blue house", with "the+blue" and "the+blue+house" as the defined multiwords, the second one will be chosen.
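
The two selection rules (the leftmost match wins, and among matches starting at the same word the longest wins) can be sketched as a left-to-right scan. This is a minimal illustration under stated assumptions, not the system's implementation: the function name group_multiwords is hypothetical, multiwords are passed as tuples of words, and joining the grouped words with '_' is an assumed output convention that the text does not specify.

```python
def group_multiwords(tokens, multiwords):
    """Group multiwords in a token list, scanning from left to right.

    tokens:     list of already filtered words, e.g. ["the", "blue", "house"]
    multiwords: iterable of word sequences, e.g. [("the", "blue")]
    """
    # Try longer patterns first so the longest match at a position wins.
    patterns = sorted({tuple(mw) for mw in multiwords if mw}, key=len, reverse=True)
    grouped = []
    i = 0
    while i < len(tokens):
        for pattern in patterns:
            if tuple(tokens[i:i + len(pattern)]) == pattern:
                # Assumed convention: a grouped multiword becomes one '_' joined token.
                grouped.append("_".join(pattern))
                i += len(pattern)
                break
        else:
            # No multiword starts at this word, so keep it as it is.
            grouped.append(tokens[i])
            i += 1
    return grouped


text = ["the", "blue", "house"]
print(group_multiwords(text, [("the", "blue"), ("blue", "house")]))
# ['the_blue', 'house']
print(group_multiwords(["this", "blue", "house"], [("the", "blue"), ("blue", "house")]))
# ['this', 'blue_house']
print(group_multiwords(text, [("the", "blue"), ("the", "blue", "house")]))
# ['the_blue_house']
```

Once "the" and "blue" are consumed by the first match, "blue+house" can no longer match, which is why the pattern that appears first in the text takes priority.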

After the multiwords are grouped, all the stopwords that remain ungrouped will be removed, as well as any words that consist only of numbers, dots, commas and blank spaces (decimal numbers, dates, etc.).
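
This last clean-up step can be sketched as a simple filter over the grouped tokens. The function name, the regular expression and the sample stopword set below are assumptions for illustration only; grouped multiwords are assumed to be single tokens joined with '_', so a stopword inside a multiword is not affected.

```python
import re

# A token made only of digits, dots, commas and blank spaces.
NUMERIC_ONLY = re.compile(r"[\d.,\s]+")


def drop_stopwords_and_numbers(tokens, stopwords):
    """Remove ungrouped stopwords and number-like tokens."""
    return [
        token
        for token in tokens
        if token not in stopwords and not NUMERIC_ONLY.fullmatch(token)
    ]


tokens = ["the_blue", "house", "is", "3.5", "the", "12.05.2016", "old"]
print(drop_stopwords_and_numbers(tokens, {"the", "is"}))
# ['the_blue', 'house', 'old']
```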

The following example shows how an input text would be tokenized. Assume that the stopwords list is the one provided by default for English, and that there are two multiwords defined in the model: "ping+pong" and "Rio 2016".

Tokenization example
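
A complete run with these two multiwords might look like the sketch below. Everything in it is hypothetical: the input sentence, the tiny stopword set standing in for the default English list, the simplified normalization used in place of the full basic filtering, and the '_' joined form of grouped multiwords. Its main purpose is to show that multiword definitions need to go through the same normalization as the text before they can match, and that "2016" survives only because it was grouped.

```python
import re

STOPWORDS = {"the", "in", "at", "a", "was", "of"}  # stand-in, not the real default list
MULTIWORDS = ["ping+pong", "Rio 2016"]


def normalize(text):
    # Simplified stand-in for the basic filtering: lowercase, punctuation to spaces.
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def tokenize(text, multiwords=MULTIWORDS, stopwords=STOPWORDS):
    tokens = normalize(text)
    # Normalize the definitions too, so "Rio 2016" can match "rio 2016".
    patterns = sorted((tuple(normalize(m)) for m in multiwords), key=len, reverse=True)
    grouped, i = [], 0
    while i < len(tokens):
        for pattern in patterns:
            if pattern and tuple(tokens[i:i + len(pattern)]) == pattern:
                grouped.append("_".join(pattern))
                i += len(pattern)
                break
        else:
            grouped.append(tokens[i])
            i += 1
    # Drop ungrouped stopwords and tokens made only of digits, dots and commas.
    return [t for t in grouped if t not in stopwords and not re.fullmatch(r"[\d.,]+", t)]


print(tokenize("Ping-pong was played at Rio 2016 in 4 halls."))
# ['ping_pong', 'played', 'rio_2016', 'halls']
```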

The text resulting from this tokenization is the one upon which the statistical and rule-based classification will be performed, so it's important to ensure everything is consistent.