Text tokenization

Text tokenization is the process by which the training texts (if any) and the texts to be processed are split into the units that the classification will work with.

The process has three steps:

  • Pre-processing, where symbols are filtered out and normalizing conversions are applied.
  • Tokenization, where the text is split into tokens.
  • Post-processing, where numbers and stopwords are filtered out. This step is only carried out when the model has a statistical component.

Pre-processing

This step takes the UTF-8 texts and carries out a number of removals and substitutions in order to normalize the texts as much as possible.

These are the operations performed:

  • Texts are transformed into lowercase and the following accents are removed: acute (´), grave (`), diaeresis (¨), tilde (~) and circumflex (^) (with the exception of ñ, which remains unchanged).
  • Prefixes and suffixes are removed (taking into account different types of apostrophes):
    • Prefixes: all', c', d', dell', dall', j', k', l', m', me', n', p', qu', s', se', t' and te'.
    • Suffixes: 'd, 'l, 'll, 'ls, 'm, 'ns, 're, 's, 't, and 've.
  • Opening and closing symbols (single and double quotes, parentheses, brackets, curly brackets, question and exclamation marks) are removed.
  • The following symbols will be transformed into blank spaces:
    • Only when they are surrounded by numbers in the same token: dot (.), comma (,) and apostrophes (', ´).
    • Always: dashes (- and –), forward slash (/) and pipe (|).
  • The following symbols will remain unchanged in the token:
    • In any position of the token: A-Za-zñÑ0-9çÇß
    • In the first position: @ and # (useful when working with tweets)
    • In the middle of the token: &@+.·_
    • In the last position: +

Any token containing a symbol that does not satisfy any of the scenarios above will be removed entirely, which makes the process more robust to the noise introduced by OCR and other processes the input may have gone through.
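
These rules can be approximated with a short script. The following is a minimal sketch in Python; the diacritic table, the apostrophe variants, the rule ordering, the trailing-punctuation trimming and the token-validity pattern are assumptions inferred from this page and the examples below, not the actual implementation.

    import re

    # Fold common diacritics; ñ is deliberately preserved.
    ACCENTS = str.maketrans("áàäâãéèëêíìïîóòöôõúùüû", "aaaaaeeeeiiiiooooouuuu")

    APOS = r"['’´]"  # assumed apostrophe variants
    PREFIXES = ["all", "dell", "dall", "qu", "me", "se", "te",
                "c", "d", "j", "k", "l", "m", "n", "p", "s", "t"]
    SUFFIXES = ["ll", "ls", "ns", "re", "ve", "d", "l", "m", "s", "t"]

    # A token is kept only if every symbol appears in an allowed position.
    VALID_TOKEN = re.compile(r"^[@#]?[a-z0-9ñçß](?:[a-z0-9ñçß&@+.·_]*[a-z0-9ñçß+])?$")

    def preprocess(text: str) -> str:
        text = text.lower().translate(ACCENTS)
        # Strip apostrophe-attached prefixes and suffixes.
        text = re.sub(r"\b(?:%s)%s" % ("|".join(PREFIXES), APOS), "", text)
        text = re.sub(r"%s(?:%s)\b" % (APOS, "|".join(SUFFIXES)), "", text)
        # Dot, comma and apostrophe become spaces only between digits.
        text = re.sub(r"(?<=\d)[.,'’´](?=\d)", " ", text)
        # Remove opening and closing symbols.
        text = re.sub(r'["\'’“”«»()\[\]{}¡!¿?]', "", text)
        # Dashes, forward slashes and pipes always become spaces.
        text = re.sub(r"[-–—/|]", " ", text)
        # Trim sentence punctuation at token ends (implied by the examples).
        text = re.sub(r"[.,;:]+(?=\s|$)", "", text)
        # Drop tokens with symbols in disallowed positions (robustness to OCR noise).
        return " ".join(t for t in text.split() if VALID_TOKEN.match(t))

On the three texts in the Examples section at the end of this page, this sketch produces the pre-processing results shown there.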

Tokenization

Tokens are obtained by splitting the text resulting from the pre-processing on blank spaces.

If the model being used has lemmatization enabled, each token's lemma is added to it as a variant. These variants are only used in the rule-based classification (as opposed to the statistical classification).
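
Continuing the sketch above, tokenization reduces to a whitespace split. The lemma_of callable below is a hypothetical stand-in for whatever lemmatizer the model is configured with:

    from dataclasses import dataclass, field

    @dataclass
    class Token:
        surface: str
        variants: list[str] = field(default_factory=list)  # lemmas; rule-based classification only

    def tokenize(text: str, lemma_of=None) -> list[Token]:
        tokens = [Token(t) for t in text.split()]
        if lemma_of is not None:  # lemmatization enabled on the model
            for tok in tokens:
                lemma = lemma_of(tok.surface)
                if lemma and lemma != tok.surface:
                    tok.variants.append(lemma)
        return tokens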

Post-processing

Once the tokenization is done, for the statistical classification the system filters out stopwords, tokens that contain only numbers, and tokens of two characters or fewer.

The list of stopwords to use can be configured by the user in the Settings section.
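
A minimal sketch of this filter follows; the stopword set here is a toy stand-in for the user-configurable list mentioned above:

    # Toy stopword list; in the product it comes from the Settings section.
    STOPWORDS = {"the", "and", "has", "also", "here", "what", "with"}

    def postprocess(tokens: list[str]) -> list[str]:
        return [t for t in tokens
                if t not in STOPWORDS   # configurable stopwords
                and not t.isdigit()     # tokens made up only of numbers
                and len(t) > 2]         # tokens of two characters or fewer

    # postprocess(["i", "happy", "to", "be", "here", "what", "a", "surprise"])
    # -> ["happy", "surprise"]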

Examples

The following examples show the results of each step on a text:

Text: I'm happy to be here (what a surprise!)
  Pre-processing: i happy to be here what a surprise
  Tokenization: [i, happy, to, be, here, what, a, surprise]
  Post-processing: [happy, surprise]

Text: @AnnaKendrick47 has also featured in the comedy-drama 50/50 (2011).
  Pre-processing: @annakendrick47 has also featured in the comedy drama 50 50 2011
  Tokenization: [@annakendrick47, has, also, featured, in, the, comedy, drama, 50, 50, 2011]
  Post-processing: [@annakendrick47, featured, comedy, drama]

Text: We've bought it in U.K., in a store called Mark&Spencer.
  Pre-processing: we bought it in u.k in a store called mark&spencer
  Tokenization: [we, bought, it, in, u.k, in, a, store, called, mark&spencer]
  Post-processing: [bought, u.k, store, called, mark&spencer]
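
Chaining the three sketches from the previous sections reproduces these results end to end. The classify_units name and the statistical flag are illustrative, not part of the product:

    def classify_units(text: str, lemma_of=None, statistical: bool = True):
        """Run the full pipeline on one text (composition of the sketches above)."""
        tokens = tokenize(preprocess(text), lemma_of)
        surfaces = [t.surface for t in tokens]
        # Post-processing only applies when the model has a statistical component.
        return postprocess(surfaces) if statistical else surfaces

    # classify_units("We've bought it in U.K., in a store called Mark&Spencer.")
    # -> ['bought', 'u.k', 'store', 'called', 'mark&spencer']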