Stopwords

Stopwords are words that provide no useful information for deciding which category a text should be classified into. This may be either because they carry little meaning of their own (prepositions, conjunctions, etc.) or because they are too frequent in the classification context.

In the model creation step we saw that it is possible to associate the model with a language. This means, in practical terms, that when you create a model, a default list of stopwords for the chosen language is added. This list includes prepositions, conjunctions and the most common verbs.
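
As a rough illustration of what a stopword list does, the sketch below drops stopwords from a text before it would be classified. The tiny DEFAULT_STOPWORDS dictionary and the remove_stopwords function are hypothetical names made up for this example; the real default lists are much longer, and the system's own text processing is more involved than a whitespace split.

```python
# Hypothetical, heavily shortened default lists; the real ones are much longer.
DEFAULT_STOPWORDS = {
    "en": {"the", "a", "of", "and", "or", "to", "is", "be", "have"},
    "es": {"el", "la", "de", "y", "o", "a", "es", "ser", "haber"},
}


def remove_stopwords(text, language="en"):
    """Return the words of `text` that are not stopwords for `language`."""
    stopwords = DEFAULT_STOPWORDS[language]
    # Naive lowercasing and whitespace split; an assumption for this sketch only.
    return [word for word in text.lower().split() if word not in stopwords]


print(remove_stopwords("The delivery of the order is late"))
```

With these assumed lists, only "delivery", "order" and "late" would be left for the classifier to work with.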

The following image shows the list of stopwords you would obtain if the chosen language were English:

Stopwords list

The list is editable, so you can add or remove any item. Each stopword must be written on a separate line, and you must click the "Save" button to save any changes.

Adding new words

It is not unusual to find that, in some scenarios, words that would normally help classify a text need to be added as stopwords. For example, when analyzing a company's customer feedback, the company name may not be relevant for the classification.
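
One way to spot such words, offered here only as a sketch and not as a feature of the tool, is to count in how many texts of a sample each word appears: a word present in nearly every text, such as the company name, is unlikely to help tell categories apart. The function name frequent_words, the 0.9 threshold and the sample feedback for a company called "Acme" are all assumptions made for the example.

```python
from collections import Counter


def frequent_words(texts, min_share=0.9):
    """Return the words that appear in at least `min_share` of the texts."""
    document_counts = Counter()
    for text in texts:
        # Count each word once per text, however often it repeats there.
        document_counts.update(set(text.lower().split()))
    return sorted(
        word for word, count in document_counts.items()
        if count / len(texts) >= min_share
    )


# Hypothetical customer feedback for a company called "Acme".
feedback = [
    "acme support solved my problem quickly",
    "my acme invoice shows the wrong amount",
    "the new acme app keeps crashing",
]
print(frequent_words(feedback))
```

Words returned this way are only candidates; whether they should become stopwords still depends on the categories of the model.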

To add a new stopword, keep the following guidelines in mind:

  • Stopwords must contain only one word.
  • Blank spaces are used for word separation, so they’re not valid characters for stopwords. If a blank space is detected, the stopword will be deleted from the list.
  • Stopwords must contain only alphanumeric characters, dots and underscores. If any other character is detected, the stopword will be deleted from the list.
  • Accent marks are not allowed (except for the 'ñ' letter). If you add a word with an accent mark, the accent mark will be automatically removed before saving, so 'capitán' is saved as 'capitan'.
  • Stopwords are case-insensitive, so 'Shield', 'SHIELD' and 'shield' are processed as if they were the same word. If a stopword contains capital letters, the lowercase version will be saved.

This is the message you will see if any of these guidelines is not satisfied:

Stopwords error message

Stopword          Is it correct?   Why isn't it correct?          Result
agent             Yes              -                              agent
capitán (ES)      No               Accent marks are not allowed   capitan
señor (ES)        Yes              -                              señor
captain america   No               Blank spaces are not allowed   (deleted)
u.k               Yes              -                              u.k
u. k              No               Blank spaces are not allowed   (deleted)
e-mail            No               Dashes are not allowed         (deleted)
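
The guidelines and examples above can be condensed into a short sketch. It is not the tool's actual implementation: the function names strip_accents and normalize_stopword are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits plus 'ñ', and returning None stands in for the stopword being deleted from the list.

```python
import re
import unicodedata

# Assumption: "alphanumeric" means ASCII letters and digits plus 'ñ'.
# Dots and underscores are also allowed; spaces and dashes are not.
VALID_STOPWORD = re.compile(r"[a-z0-9ñ._]+")


def strip_accents(word):
    """Remove accent marks from `word`, leaving 'ñ' untouched."""
    result = []
    for char in word:
        if char == "ñ":
            result.append(char)
            continue
        # NFD splits an accented letter into a base letter plus combining marks.
        decomposed = unicodedata.normalize("NFD", char)
        result.append(
            "".join(c for c in decomposed if not unicodedata.combining(c))
        )
    return "".join(result)


def normalize_stopword(entry):
    """Return the form that would be saved, or None if the entry is deleted."""
    # Lowercase first, then compose characters so 'ñ' is a single code point.
    word = unicodedata.normalize("NFC", entry.lower())
    word = strip_accents(word)
    if VALID_STOPWORD.fullmatch(word) is None:
        return None
    return word


for entry in ["agent", "capitán", "señor", "captain america", "u.k", "u. k", "e-mail"]:
    print(repr(entry), "->", repr(normalize_stopword(entry)))
```

Removing accents before the character check mirrors the 'capitán' example, where the entry is flagged but still saved as 'capitan' instead of being deleted.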

These limitations in the stopwords list come from the filtering process the system carries out before classifying a text. You can find more information about it in the tokenization section.