Text tokenization is the process by which the training texts (if any) and the texts to be processed are split into the units that will be considered for classification.
The process has three steps: pre-processing, tokenization and filtering.
The first step takes the UTF-8 texts and carries out a number of removals and substitutions in order to normalize them as much as possible.
These are the operations done:
- Lowercasing of the whole text.
- Accent removal (with the exception of ñ, which remains the same).
- Deletion of periods (.), commas (,) and apostrophes (').
- Substitution by blank spaces of ampersand (&), dashes (—), forward slash (/) and pipe (|).
- The @ and # symbols are kept (useful when working with tweets).
Any token containing a symbol that does not satisfy any of the scenarios mentioned in the previous points is removed, so that the system is more robust to the noise introduced by OCR and other processes the input may have gone through.
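A minimal Python sketch of the pre-processing step could look like the following. The exact rules are inferred from the examples later in this section (in particular how apostrophes, periods and ampersands behave), so the real implementation may differ in the details:

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Remove accents, with the exception of ñ, which remains the same."""
    out = []
    for ch in text:
        if ch == "ñ":
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)


def preprocess(text: str) -> str:
    text = strip_accents(text.lower())
    # Apostrophes: drop the apostrophe and the suffix it introduces,
    # as in the examples ("i'm" -> "i", "we've" -> "we").
    text = re.sub(r"'\w*", "", text)
    # Dashes, forward slash and pipe are substituted by blank spaces.
    text = re.sub(r"[-—/|]", " ", text)
    # Periods and commas survive only between alphanumeric characters
    # ("u.k." -> "u.k"); anywhere else they are deleted.
    text = re.sub(r"(?<![0-9a-zñ])[.,]|[.,](?![0-9a-zñ])", "", text)
    # Any remaining symbol other than @, # and & (kept for tweet
    # handles, hashtags and names like mark&spencer) is deleted.
    text = re.sub(r"[^\w\s@#&.,]", "", text)
    return " ".join(text.split())
```

Run against the examples in this section, the sketch reproduces the pre-processed texts shown there.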
Tokens are then obtained by splitting the pre-processed text on blank spaces.
If the model in use has lemmatization enabled, each token's lemma is added as a variant of that token. These variants are only used in the rule-based classification (as opposed to the statistical classification).
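As a sketch of how lemma variants might be attached to tokens (the lemma table below is a toy stand-in; the actual model uses its own lemmatization resources):

```python
# Toy lemma lookup for illustration only; the real model ships its
# own lemmatization data.
LEMMAS = {"featured": "feature", "bought": "buy"}


def with_lemma_variants(tokens: list[str]) -> list[tuple[str, set[str]]]:
    # Each token keeps its surface form and, when lemmatization is
    # enabled, gains its lemma as a variant. Variants are consulted
    # only by the rule-based classifier, never by the statistical one.
    return [(t, {t, LEMMAS.get(t, t)}) for t in tokens]
```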
Once the tokenization is done, for the statistical classification the system filters out stopwords, tokens that contain only numbers, and tokens of two characters or fewer.
The list of stopwords to use can be configured by the user in the Settings section.
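The tokenization and filtering steps can be sketched as follows; the stopword list here is a tiny illustrative stand-in for the user-configurable list mentioned above:

```python
# Illustrative stopword list; in practice this comes from the
# user-configurable list in the Settings section.
STOPWORDS = {"the", "a", "has", "also", "in", "to", "be", "here", "what"}


def tokenize(text: str) -> list[str]:
    # Second step: split the pre-processed text on blank spaces.
    return text.split()


def filter_for_statistics(tokens: list[str]) -> list[str]:
    # Third step (statistical classification only): drop stopwords,
    # tokens that contain only numbers, and tokens of two characters
    # or fewer.
    return [t for t in tokens
            if t not in STOPWORDS and not t.isdigit() and len(t) > 2]
```

For example, filtering the tokens of the second example below yields `[@annakendrick47, featured, comedy, drama]`: the stopwords and the purely numeric tokens are dropped.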
The following examples show the results the different steps have on a text:

Original text:  I'm happy to be here (what a surprise!)
Pre-processing: i happy to be here what a surprise
Tokenization:   [i, happy, to, be, here, what, a, surprise]

Original text:  @AnnaKendrick47 has also featured in the comedy-drama 50/50 (2011).
Pre-processing: @annakendrick47 has also featured in the comedy drama 50 50 2011
Tokenization:   [@annakendrick47, has, also, featured, in, the, comedy, drama, 50, 50, 2011]
Filtering:      [@annakendrick47, featured, comedy, drama]

Original text:  We've bought it in U.K., in a store called Mark&Spencer.
Pre-processing: we bought it in u.k in a store called mark&spencer
Tokenization:   [we, bought, it, in, u.k, in, a, store, called, mark&spencer]
Filtering:      [bought, u.k, store, called, mark&spencer]