Resolve false positives

False positives can be resolved through rules, training texts or by modifying the list of stopwords:

Using rules

There are three possible actions we can carry out:

  • Adding excluding terms to the category, so that it is dropped from the result whenever one of them appears in the text. If you don't want to exclude the category outright, you can add irrelevant terms instead to lower its relevance.

    A very common scenario is that certain contexts add ambiguity to the classification: "A" is always relevant for a category except when it appears in the same context as "B". We can cover this case by declaring "A" as relevant and "A WITH B" as irrelevant.

  • Adding mandatory terms to the category, so that no text is classified into it unless it contains at least one of those terms.

    Keep in mind that once mandatory terms are added to a category, every possible case must be considered so that the list is as complete as possible: a text that contains none of the mandatory terms will no longer be classified into that category.

  • If you don't find adequate terms for the previous options, and you want to classify the text in another category of the model, a viable option would be to add mandatory/relevant terms to that other category instead of modifying the one that's giving the false positive. This way, the correct classification would become more relevant and the false positive would become irrelevant.

This option can be applied to hybrid models and to rule-based ones.
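
The effect of the term types described above can be illustrated with a minimal Python sketch. It is a toy model written only for this explanation, not the classification engine or its rule syntax: the Category class, the relevance function and the one-point scoring are all assumptions, and a term is represented as a set of words that must appear together (so {"jaguar", "engine"} plays the role of "jaguar WITH engine").

```python
from dataclasses import dataclass, field

# Hypothetical representation: a term is a set of words that must all
# appear in the text, e.g. Term({"jaguar", "engine"}) ~ "jaguar WITH engine".
Term = frozenset


@dataclass
class Category:
    """Toy stand-in for the rules of one category (not a real API)."""
    name: str
    relevant: list = field(default_factory=list)
    irrelevant: list = field(default_factory=list)
    mandatory: list = field(default_factory=list)
    excluding: list = field(default_factory=list)


def matches(term, tokens):
    """A term matches when every one of its words occurs in the text."""
    return term <= tokens


def relevance(category, text):
    """Return a toy relevance score, or None if the category is ruled out."""
    tokens = set(text.lower().split())
    # Excluding terms: a single match removes the category from the result.
    if any(matches(t, tokens) for t in category.excluding):
        return None
    # Mandatory terms: if any are defined, at least one must match.
    if category.mandatory and not any(matches(t, tokens) for t in category.mandatory):
        return None
    # Relevant terms add one point each; irrelevant terms subtract one.
    plus = sum(matches(t, tokens) for t in category.relevant)
    minus = sum(matches(t, tokens) for t in category.irrelevant)
    return plus - minus


# "A" is relevant, "A WITH B" is irrelevant.
wildlife = Category(
    name="Wildlife",
    relevant=[Term({"jaguar"})],
    irrelevant=[Term({"jaguar", "engine"})],
    excluding=[Term({"dealership"})],
)

print(relevance(wildlife, "the jaguar hunts at night"))        # 1
print(relevance(wildlife, "the jaguar engine was rebuilt"))    # 0
print(relevance(wildlife, "jaguar dealership opens downtown")) # None
```

In this sketch the irrelevant term cancels out the relevant one when both words occur together, while the excluding term discards the category regardless of any other match.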

Using training texts

To correct a false positive using training texts, remove from the category the training texts that are similar to the one producing the false positive. This solution is not used very often, as it is more complicated than editing the rules.

This option can be applied to statistical models and to hybrid ones.
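
As an illustration of how candidate texts for removal might be located, the following Python sketch ranks a category's training texts by word overlap with the misclassified text. The Jaccard measure, the 0.5 threshold and the function names are assumptions chosen for the example; the model itself may judge similarity in a different way, so the result should be treated as a list to review by hand.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two texts (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def similar_training_texts(false_positive, training_texts, threshold=0.5):
    """Return the training texts whose overlap with the false positive
    reaches the (hypothetical) threshold; these are removal candidates."""
    return [t for t in training_texts if jaccard(false_positive, t) >= threshold]


false_positive = "jaguar engine recall announced"
training = [
    "jaguar engine recall expanded",
    "jaguar spotted near river",
    "new engine recall announced",
]
candidates = similar_training_texts(false_positive, training)
print(candidates)
# ['jaguar engine recall expanded', 'new engine recall announced']
```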

Using stopwords

If a term turns out to be irrelevant for the whole model and adds noise to some categories, a good option is to add it to the list of stopwords so that it is not taken into account in the classification.

This solution is not used very often, but it works well in the initial phase of a model's optimization. That is when it is easiest to identify the terminology used in the domain, and therefore the terms that are common within the domain but do not help in the classification.
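
One possible way to spot such terms is to look for words that occur in the texts of every category, since a word that appears everywhere cannot help to tell categories apart. The Python sketch below assumes a small dictionary of sample texts per category; the function name, the min_share parameter and the sample data are invented for the example.

```python
from collections import Counter


def stopword_candidates(texts_by_category, min_share=1.0):
    """Return the words that occur in at least `min_share` of the categories."""
    categories_per_word = Counter()
    for texts in texts_by_category.values():
        # Count each word once per category, however often it appears.
        words = {w for text in texts for w in text.lower().split()}
        categories_per_word.update(words)
    threshold = min_share * len(texts_by_category)
    return sorted(w for w, n in categories_per_word.items() if n >= threshold)


# Invented sample data: "customer" is common to the whole domain.
samples = {
    "Billing": ["customer asks about invoice", "customer disputes charge"],
    "Shipping": ["customer reports late parcel"],
    "Returns": ["customer wants refund"],
}
print(stopword_candidates(samples))  # ['customer']
```

The output is only a list of candidates: each word should still be checked before it is added to the stopwords, as a term shared by all categories may still matter in combination with others.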

Important

It's important to remember that modifying any category may change the relevance values the model assigns to the rest of the categories.