Settings

In this view you can modify the settings of a model. There are two sections: the first contains the model's general settings, and the second lets you modify the classification settings.

Model settings

  • Model ID: the ID of the model is what will be used to classify any text. The ID is composed of the model name, followed by an underscore and the model language.
  • Model name: the name of the model is how it will be listed in the resources section. It is limited to 64 characters, and can contain only alphanumeric characters, dashes and underscores.
  • Language: the language configured for this model. It only affects the stopwords list, which is configured according to the selected language.
  • Description: a brief description of the model. It is limited to 1024 characters.
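
As a rough illustration of the naming rules above, the following Python sketch checks a model name and description and builds the corresponding model ID. The function names, the ASCII-only reading of "alphanumeric", and the `en` language code are assumptions made for this example; they are not part of the product's API.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits.
# Names may also contain dashes and underscores, up to 64 characters.
_MODEL_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")
MAX_DESCRIPTION_LENGTH = 1024


def is_valid_model_name(name: str) -> bool:
    """Return True if the name respects the length and character limits."""
    return _MODEL_NAME_RE.fullmatch(name) is not None


def is_valid_description(description: str) -> bool:
    """Return True if the description fits in 1024 characters."""
    return len(description) <= MAX_DESCRIPTION_LENGTH


def build_model_id(name: str, language: str) -> str:
    """Compose the model ID: model name, underscore, model language."""
    if not is_valid_model_name(name):
        raise ValueError(f"invalid model name: {name!r}")
    return f"{name}_{language}"


# Hypothetical model name and language code, for illustration only.
print(build_model_id("support-tickets", "en"))  # support-tickets_en
```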

Classification settings

  • Minimum absolute relevance: sets the minimum absolute relevance a category needs to be accepted as a valid result. In other words, it filters the results you'll see in the response by the relevance value obtained in the classifier. The default value is 0.06. The range of absolute relevance values in the results depends on the type of model you are working with, so you can adjust this parameter to optimize the results after the evaluation.

    Heads up!

    If the minimum absolute relevance is set to 0 and you are working with a statistical model, it's quite possible that you will get all the categories in the results: if the text you want to classify is long enough, it will have some similarity (albeit a very low one) to the training texts in every category.

  • Minimum relative relevance: sets the minimum relative relevance a category needs to be accepted as a valid result. In other words, it removes a category from the response if its relevance value, relative to that of the most relevant category, is below the configured threshold.

    The valid range is between 0 and 1, and the default value is 0. The table below shows an example of the filtering for different values of the parameter:

    Category     Absolute relevance   Relative relevance     Threshold = 0.3   Threshold = 0.51
    Category 1   0.8                  0.8/0.8 = 1 (100%)     Shown             Shown
    Category 2   0.4                  0.4/0.8 = 0.5 (50%)    Shown             Filtered out
    Category 3   0.2                  0.2/0.8 = 0.25 (25%)   Filtered out      Filtered out
  • Rule vs Statistical: this parameter controls the balance between the weight given to the rule-based classification and the weight given to the statistical classification. Lower values give more weight to the statistical classification (the training texts), while higher values give more weight to the rule classification (the terms). Allowed values range from 0.0 (0%) to 1.0 (100%), and typical values are around 0.7.
  • Boost multiwords relevance: this parameter lets you boost the relevance assigned by multiword terms and by operators involving several terms (NEAR and AND). By default, it's disabled.
  • Title boost: this parameter gives the terms that appear in the title field of the Text Classification more importance with respect to appearances in the text. The default value is 5; in other words, terms in the title count five times as much as terms in the text.
  • Abstract boost: this parameter gives the terms that appear in the abstract field of the Text Classification more importance with respect to appearances in the text. The default value is 3; in other words, terms in the abstract count three times as much as terms in the text.
  • Lemmatization: this parameter enables lemmatization for the text that's going to be classified, allowing you to simplify the term definition by using lemmas. It's activated by default.
  • Stopwords: list of words that do not provide any useful information to decide in which category a text should be classified. This may be either because they don't have any meaning (prepositions, conjunctions, etc.) or because they are too frequent in the classification context.

    In the model creation step we saw that it is possible to associate the model with a language. This means, in practical terms, that when you create a model, a default list of stopwords for the chosen language is added. This list includes prepositions, conjunctions and the most common verbs.

    In the image at the top of the page, we can see the list of stopwords you would obtain if the chosen language were English. The list of stopwords only affects the statistical classification.

    The list is editable, so you can add or remove any item. Each stopword must be written on a separate line, and to save any changes you have to click the "Save" button.

    It is not unusual to find that, in some scenarios, words that would normally be used to classify need to be added as stopwords. For example, when analyzing a company's customer feedback, the company name may not be relevant for the classification.

    When adding a new stopword, keep the following guidelines in mind:

    • Stopwords must contain only one word.
    • Blank spaces are used for word separation, so they’re not valid characters for stopwords.
    • Stopwords must contain only alphanumeric characters, dots and underscores.
    • Accent marks are not allowed (except for the 'ñ' letter).
    • Stopwords are case-insensitive, so 'Shield', 'SHIELD' and 'shield' are processed as if they were the same word. If a stopword contains capital letters, the lowercase version will be saved.

    Let's see some examples:

    Stopword          Is it correct?   Why isn't it correct?
    agent             Yes
    capitán (ES)      No               Accent marks are not allowed.
    señor (ES)        Yes
    captain america   No               Blank spaces are not allowed.
    u.k               Yes
    u. k              No               Blank spaces are not allowed.
    e-mail            No               Dashes are not allowed.

    These limitations in the stopwords list come from the filtering process the system carries out before classifying a text. You can get more info about it in the tokenization section.
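
The two relevance thresholds described earlier in this section can be pictured with a short Python sketch that reproduces the example table. The function name and the dictionary of scores are hypothetical, and treating a value exactly equal to a threshold as accepted is an assumption; the text only states that values below the threshold are filtered out.

```python
from typing import Dict


def filter_categories(
    relevances: Dict[str, float],
    min_absolute: float = 0.06,
    min_relative: float = 0.0,
) -> Dict[str, float]:
    """Keep only the categories that pass both relevance thresholds."""
    if not relevances:
        return {}
    top = max(relevances.values())
    kept = {}
    for category, absolute in relevances.items():
        # Relative relevance: ratio to the most relevant category.
        relative = absolute / top if top > 0 else 0.0
        # Assumption: values equal to a threshold are accepted.
        if absolute >= min_absolute and relative >= min_relative:
            kept[category] = absolute
    return kept


# Absolute relevance values taken from the example table.
scores = {"Category 1": 0.8, "Category 2": 0.4, "Category 3": 0.2}
print(sorted(filter_categories(scores, min_relative=0.3)))
# ['Category 1', 'Category 2']
print(sorted(filter_categories(scores, min_relative=0.51)))
# ['Category 1']
```

With the default minimum relative relevance of 0, only the absolute threshold has any effect, so all three categories in this example would be kept.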
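
The stopword guidelines above can be summarized in a small Python check. This is only a sketch of the stated rules, not the platform's actual validation: the function name is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and stripping surrounding whitespace from each line is also an assumption.

```python
import re
import unicodedata
from typing import Optional

# After lowercasing: ASCII letters, digits, dots, underscores and 'ñ'.
# Spaces, dashes and accented vowels fall outside this set.
_STOPWORD_RE = re.compile(r"[a-z0-9ñ._]+")


def normalize_stopword(word: str) -> Optional[str]:
    """Return the lowercase form that would be saved, or None if invalid."""
    # NFC keeps 'ñ' as a single character even if it was typed as
    # 'n' followed by a combining tilde.
    candidate = unicodedata.normalize("NFC", word).strip().lower()
    if _STOPWORD_RE.fullmatch(candidate) is None:
        return None
    return candidate


# The examples from the table above.
for entry in ["agent", "capitán", "señor", "captain america", "u.k", "u. k", "e-mail"]:
    print(repr(entry), "->", normalize_stopword(entry))
```

Entries that map to `None` are the ones the table marks as incorrect; valid entries come back in the lowercase form, so 'Shield' and 'SHIELD' both yield 'shield'.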