Term definition

The term definition that applies to rule-based and hybrid models is not just a list of words. There are a number of operators that make it possible to define more complex terms.

There are two basic elements: simple terms and multiwords (or literal terms). We will be able to combine them using the different operators available.

The most basic element is the simple term. It is formed by a single word (and thus contains no blank spaces) and uses no operators. This word will be matched literally in the processed text or, if the lemmatization setting is enabled and the term is a lemma, it will match any of its lexical forms.

The fact that this match is carried out over the processed text means that the pre-processing and tokenization steps described in the text tokenization section have already been applied, so the terms will be matched against the resulting text.

In its simplest version, a list of terms can be formed just by simple terms.
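
As a rough illustration of this behaviour, here is a minimal Python sketch of simple-term matching. The tokenization and the tiny lemma dictionary are assumptions made only for the example; the actual pre-processing is the one described in the text tokenization section.

    # Minimal sketch of simple-term matching (illustrative only).
    TOY_LEMMAS = {"machines": "machine"}

    def tokens(text, lemmatization=False):
        # Toy stand-in for the real pre-processing and tokenization.
        words = [w.strip(".,!?").lower() for w in text.split()]
        return [TOY_LEMMAS.get(w, w) for w in words] if lemmatization else words

    def matches_simple_term(text, term, lemmatization=False):
        # A simple term matches if it appears as one of the processed tokens.
        return term in tokens(text, lemmatization)

    print(matches_simple_term("Machines will take over shortly", "machine"))                      # False
    print(matches_simple_term("Machines will take over shortly", "machine", lemmatization=True))  # True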

The next element is the multiword (or literal term), which is used to specify a combination of words (an n-gram) that has to appear in the text in the exact same order and form as specified. Multiwords are defined by surrounding the words that form them with double quotes. As with simple terms, when the lemmatization setting is enabled and the words that compose the multiword are lemmas, their lexical forms will also match.

ID Text Term Lemmatization Result
1 The machine is learning to think by itself machine n ✓
2 Machines will take over shortly machine n ✗
3 Machines will take over shortly machine y ✓
4 The machine is learning to think by itself "machine learning" n ✗
5 It uses a machine learning algorithm "machine learning" n ✓
6 It uses a machine learning algorithm "machine learn" n ✗
7 It uses a machine learning algorithm "machine learn" y ✓

In this table, we can see several examples of when a term is detected and when it is not, depending both on the term definition used and the value of the lemmatization setting.
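
In the same spirit, a multiword can be sketched as a contiguous, ordered sequence of processed tokens. Again, the tokenization and lemma dictionary below are simplifications made for the example.

    # Sketch of multiword (literal term) matching: the quoted words must
    # appear contiguously and in the specified order in the processed text.
    TOY_LEMMAS = {"machines": "machine", "learning": "learn"}

    def tokens(text, lemmatization=False):
        words = [w.strip(".,!?").lower() for w in text.split()]
        return [TOY_LEMMAS.get(w, w) for w in words] if lemmatization else words

    def matches_multiword(text, multiword, lemmatization=False):
        target = tokens(multiword, lemmatization)
        toks = tokens(text, lemmatization)
        return any(toks[i:i + len(target)] == target
                   for i in range(len(toks) - len(target) + 1))

    print(matches_multiword("The machine is learning to think by itself", "machine learning"))             # False: not contiguous
    print(matches_multiword("It uses a machine learning algorithm", "machine learning"))                   # True
    print(matches_multiword("It uses a machine learning algorithm", "machine learn", lemmatization=True))  # True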

Important

Different lexical forms can correspond to the same lemma; for instance, "fly" is the lemma both for the verb "fly", as in "to take flight", and for "fly", the animal. It is important to take those ambiguous contexts into account in the term definition.

The simple terms and the multiwords are the basic elements that will be combined through the use of operators to define more complex terms.

These are the operators available (a short Python sketch of their behaviour follows this list):

  • AND: the classical logical operator. It allows you to give context to the term. The two terms joined by the operator need to appear in the text for the condition to be satisfied. The order of the terms associated with the operator does not affect the result of the evaluation.
    ID Text Term Result
    1 That passenger is afraid to fly passenger AND fly ✓
    2 The fruit fly is very common passenger AND fly ✗
    3 That fly is buzzing around the passenger passenger AND fly ✓
    4 The fruit fly is buzzing around the passenger "fruit fly" AND passenger ✓
    5 The fly is buzzing around the passenger "fruit fly" AND passenger ✗
  • WITH: similar to AND, but only the first term will have any impact on the weight. It's used mainly to disambiguate.
    ID Text Term Result
    1 That passenger is afraid to fly passenger WITH fly ✓
    2 The fruit fly is very common passenger WITH fly ✗
    3 The fruit fly is buzzing around the passenger "fruit fly" WITH passenger ✓
    4 The fly is buzzing around the passenger "fruit fly" WITH passenger ✗
  • OR or |: the classical logical operator. The condition will be satisfied if any of the terms joined by the operator appear. The difference between the two versions is the priority given: | takes precedence over OR.
    ID Text Term Result
    1 I think the machine is working now machine OR think ✓
    2 I don't think that's correct think OR machine ✓
    3 The machines are taking over think OR machine ✗ (✓ if lemmatization is enabled)
  • NEAR: the proximity operator. It's similar to the AND operator, but the terms must appear within a specific distance of each other. There are two variants: -, which implies a strict order in the appearance of the terms, and ~, where the order does not matter.

    Distance is counted as jumps starting from the first word in the NEAR operator. The distance is computed taking into account every word within the NEAR operator; that is, multiwords (or literal terms) do not count as a single "jump" but as the number of words they contain.

    ID Text Term Result
    1 The fruit fly is buzzing around the passenger [fruit fly]~3 ✓
    2 The fly is buzzing around the fruit [fruit fly]~3 ✗
    3 The fruit fly is buzzing around the passenger [fruit fly]-5 ✓
    4 The fly is buzzing around the fruit [fruit fly]-5 ✗
    5 The fruit fly is buzzing around the passenger ["fruit fly" passenger]-5 ✗
    6 The fruit fly is buzzing around the passenger ["fruit fly" passenger]-6 ✓

    Important

    No other operators or parentheses are allowed within the NEAR operator, with the exception of |.

  • NOT: the negator. It indicates that a term (or terms) must not appear for the condition to be considered satisfied.
    ID Text Term Result
    1 She enjoys going to the beach NOT fly ✓
    2 He does not like to fly NOT fly ✗

    Important

    The NOT operator can only be applied to terms (that is, simple terms or multiwords) or to terms separated by the operator |.
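
To make the boolean behaviour of these operators concrete, here is a condensed sketch that evaluates them over an already tokenized text. Weights, lemmatization and multiwords inside NEAR are deliberately left out, and the helper names are invented for the example.

    # Boolean behaviour of the operators over a token list (sketch only).
    def positions(tokens, term):
        return [i for i, t in enumerate(tokens) if t == term]

    def and_(tokens, a, b):        # both terms must appear, in any order
        return bool(positions(tokens, a)) and bool(positions(tokens, b))

    def with_(tokens, a, b):       # same boolean behaviour as AND; only the
        return and_(tokens, a, b)  # weight of the first term would count

    def or_(tokens, a, b):         # at least one of the terms must appear
        return bool(positions(tokens, a)) or bool(positions(tokens, b))

    def not_(tokens, a):           # the term must not appear
        return not positions(tokens, a)

    def near(tokens, a, b, max_jumps, ordered):
        # "-" variant: ordered=True (a must come before b); "~": ordered=False.
        return any((not ordered or i < j) and abs(j - i) <= max_jumps
                   for i in positions(tokens, a) for j in positions(tokens, b))

    toks = "the fly is buzzing around the fruit".split()
    print(and_(toks, "fly", "fruit"))                    # True
    print(not_(toks, "passenger"))                       # True
    print(near(toks, "fruit", "fly", 3, ordered=False))  # False: 5 jumps apart, like [fruit fly]~3
    print(near(toks, "fruit", "fly", 5, ordered=True))   # False: wrong order, like [fruit fly]-5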

Did you notice...?

As the system is accent insensitive, the AND and WITH operators will be very useful to limit the context in those cases where a word means one thing or another depending on the accent mark.

This happens very often in languages such as Spanish or French. For instance, "ingles" ("groins") vs. "inglés" ("English") could be disambiguated using "ingles AND clases" or "ingles WITH clases".

You can use | as an alternative to OR to combine simple terms in an easy way without having to specify priorities using parentheses.

ml OR "machine learning" = ml | "machine learning"
(ml OR "machine learning") AND (think OR thinks) = ml | "machine learning" AND think | thinks

The following table shows several examples of sentences and terms defined using all the different operators.

ID Text Term Lemmatization Result
1 He invented the machine machine | system AND think y ✗
2 The systems are learning to think by themselves machine | system AND think y ✓
3 The machines are thinking about rebelling machine | system AND think n ✗
4 The systems are learning to think by themselves (machine OR system) AND think y ✓
5 The machines are learning to think by themselves "machine learning" AND think y ✗
6 I think it uses a machine learning algorithm "machine learning" AND think n ✓
7 I think machine learning is very cool [system|machine learn]-3 AND think y ✓
8 I think machine learning is very cool [machine learn]-3 AND think n ✗
9 If the machine does not do that, I will have to learn to do it [machine learn]-10 AND think y ✗
10 If the machine does not do that, I will have to learn to do it [machine learn]-10 AND NOT think y ✓
11 If the machine does not do that, I will have to learn to do it [learn machine]-10 AND NOT think y ✗
12 If the machine does not do that, I will have to learn to do it [learn machine]~10 AND NOT think y ✓
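
For instance, rows 11 and 12 above differ only in the NEAR variant used. A quick, self-contained check (with lemmatized, lowercased tokens assumed for the sketch) shows why the strict-order form fails while the order-free form matches:

    # Row 11 vs row 12: "-" requires the specified order, "~" accepts any order.
    def near(tokens, a, b, max_jumps, ordered):
        pa = [i for i, t in enumerate(tokens) if t == a]
        pb = [i for i, t in enumerate(tokens) if t == b]
        return any((not ordered or i < j) and abs(j - i) <= max_jumps
                   for i in pa for j in pb)

    # Assumed lemmatized tokens of "If the machine does not do that, I will have to learn to do it".
    toks = "if the machine do not do that i will have to learn to do it".split()
    print(near(toks, "learn", "machine", 10, ordered=True))   # False: machine appears before learn
    print(near(toks, "learn", "machine", 10, ordered=False))  # True: 9 jumps apart, order irrelevant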

Operators precedence

The precedence of the operators we have seen is the following, from highest to lowest:

|, NOT, WITH, NEAR, AND, OR

Parentheses can also be used to indicate precedence. For instance, the following would be true:

house AND (dog OR cat) = house AND dog|cat
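
As a simple illustration, the precedence can be thought of as a lookup table; the dictionary below is just a representation invented for this sketch, not an actual setting of the product.

    # Operator precedence, highest to lowest (illustrative representation).
    PRECEDENCE = {"|": 6, "NOT": 5, "WITH": 4, "NEAR": 3, "AND": 2, "OR": 1}

    def binds_tighter(op1, op2):
        return PRECEDENCE[op1] > PRECEDENCE[op2]

    # "house AND dog|cat" groups as "house AND (dog|cat)" because | binds
    # tighter than AND, which is why it equals "house AND (dog OR cat)".
    print(binds_tighter("|", "AND"))   # True
    print(binds_tighter("AND", "OR"))  # True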

Weights

The weight of a term will only be added to the relevance of the category when the condition described by the term is satisfied.

The starting value for the computations will always be the number of times a simple term or a multiword appears in the text, that is, its frequency. By default, their relevance will be the same, but it is possible to configure the settings of the model so that multiwords add more weight than simple terms. The parameter that allows this is relevance_boost and, as we will see in the next section, it is disabled by default.

The following table contains the weight impact for the different elements we can use:

relevance_boost Element Weight added
disabled Simple term Frequency of the term in the text
disabled Multiword or literal Frequency of the multiword in the text
disabled OR or | Sum of the values of the terms joined by the operator
disabled AND Minimum value of the terms joined by the operator
disabled WITH Frequency of the first term
disabled NEAR Minimum value of the terms that satisfy the distance restriction
disabled NOT Does not add any weight
enabled Simple term Frequency of the term in the text
enabled Multiword or literal Frequency of the multiword in the text times the number of words in the multiword
enabled OR or | Sum of the values of the terms joined by the operator
enabled AND Minimum value of the terms joined by the operator times the number of terms involved
enabled WITH Frequency of the first term
enabled NEAR Minimum value of the terms that satisfy the distance restriction times the number of terms involved
enabled NOT Does not add any weight
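
These rules can be summed up in a small recursive sketch. The nested-tuple representation of terms and the function name are inventions for the example, and the special handling of NOT inside AND follows example 16 in the table below.

    # Sketch of the weight rules above. Expressions are nested tuples, e.g.
    # ("AND", ("TERM", "machine"), ("TERM", "think")); freq maps each simple
    # term or multiword to its frequency in the text.
    def weight(expr, freq, boost=False):
        kind = expr[0]
        if kind == "TERM":       # frequency, times the word count when boosted
            return freq.get(expr[1], 0) * (len(expr[1].split()) if boost else 1)
        if kind == "NOT":        # does not add any weight
            return 0
        if kind == "OR":         # sum of the values of the joined terms
            return sum(weight(e, freq, boost) for e in expr[1:])
        if kind == "WITH":       # only the first term counts
            return weight(expr[1], freq, boost)
        if kind in ("AND", "NEAR"):  # minimum, times the number of terms when boosted
            vals = [weight(e, freq, boost) for e in expr[1:] if e[0] != "NOT"]
            return min(vals) * (len(vals) if boost else 1)
        raise ValueError(kind)

    freq = {"machine learning": 1, "think": 1}
    expr = ("AND", ("TERM", "machine learning"), ("TERM", "think"))
    print(weight(expr, freq))              # 1
    print(weight(expr, freq, boost=True))  # min(1*2, 1)*2 = 2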

The following table contains some examples of terms and shows how the relevance they would add (or subtract) is obtained when the condition they describe is satisfied in a text. We will represent the number of appearances of a term, that is, its frequency, with f(term).

ID Term Weight (relevance_boost disabled) Weight (relevance_boost enabled)
1 machine f(machine) f(machine)
2 "machine learning" f(machine learning) f(machine learning)*2
3 machine | system f(machine) + f(system) f(machine) + f(system)
4 machine AND think min(f(machine), f(think)) min(f(machine), f(think))*2
5 machine | system AND think min(f(machine) + f(system), f(think)) min(f(machine) + f(system), f(think))*2
6 machine | system AND think OR ponder min(f(machine) + f(system), f(think)) + f(ponder) min(f(machine) + f(system), f(think))*2 + f(ponder)
7 "machine learning" AND think min(f(machine learning), f(think)) min(f(machine learning)*2, f(think))*2
8 machine AND learning AND think min(f(machine), f(learning), f(think)) min(f(machine), f(learning), f(think))*3
9 machine WITH think f(machine) f(machine)
10 "machine learning" WITH think f(machine learning) f(machine learning)*2
11 machine WITH learn AND think min(f(machine), f(think)) min(f(machine), f(think))*2
12 machine WITH (learn AND think) f(machine) f(machine)
13 [machine learn]-3 AND think min(min(f(machine), f(learn)), f(think)) min(min(f(machine), f(learn))*2, f(think))*2
14 [machine learn think]~5 min(f(machine), f(learn), f(think)) min(f(machine), f(learn), f(think))*3
15 ["machine learn" think]~5 min(f(machine learn), f(think)) min(f(machine learn)*2, f(think))*2
16 "machine learn" AND NOT think f(machine learn) f(machine learn)*2

Let's see the actual values we would obtain for a given text (and assuming that lemmatization is enabled):

I'm trying to think of some machine learning algorithms
ID Term Weight (relevance_boost disabled) Weight (relevance_boost enabled)
1 machine 1 1
2 "machine learning" 1 1*2 = 2
3 machine | system 1 + 0 = 1 1 + 0 = 1
4 machine AND think min(1, 1) = 1 min(1, 1)*2 = 1*2 = 2
5 machine | system AND think min(1 + 0, 1) = min(1, 1) = 1 min(1 + 0, 1)*2 = min(1, 1)*2 = 2
6 machine | system AND think OR ponder min(1 + 0, 1) + 0 = 1 min(1 + 0, 1)*2 + 0 = 2 + 0 = 2
7 "machine learning" AND think min(1, 1) = 1 min(1*2, 1)*2 = 2
8 machine AND learning AND think min(1, 1, 1) = 1 min(1, 1, 1)*3 = 3
9 machine WITH think 1 1
10 "machine learning" WITH think 1 1*2
11 machine WITH learn AND think min(1, 1) = 1 min(1, 1)*2 = 2
12 machine WITH (learn AND think) 1 1
13 [machine learn]-3 AND think min(min(1, 1), 1) = min(1, 1) = 1 min(min(1, 1)*2, 1)*2 = min(2, 1)*2 = 2
14 [machine learn think]~5 min(1, 1, 1) = 1 min(1, 1, 1)*3 = 3
15 ["machine learn" think]~5 min(1, 1) = 1 min(1*2, 1)*2 = min(2, 1)*2 = 2
16 "machine learn" AND NOT think 1 1*2 = 2