Text preprocessing is the process of getting raw text into a form that can be vectorized and subsequently consumed by machine learning algorithms for natural language processing (NLP) tasks such as text classification, topic modeling, named entity recognition, etc.
Raw text is extensively preprocessed by all text analytics APIs, such as Azure's text analytics APIs or the ones we have developed at Specrom Analytics, although the extent and type of preprocessing depend on the type of input text. For example, for our historical news APIs, the input consists of scraped HTML pages, so it is important for us to strip unwanted HTML tags from the text before feeding it to the NLP algorithms. However, for some news outlets we get data as JSON from their official REST APIs. In that case, there are no HTML tags at all, and it would be a waste of CPU time to run a regex-based preprocessor on such clean text. Hence, it makes sense to preprocess text differently based on the source of the data.
If you want to create word clouds like the one shown below, then it is generally recommended that you remove stop words. But for tasks such as named entity recognition (NER), this is not really required, and you can safely feed syntactically complete sentences to the NER of your choice.
There are many good blog posts covering text preprocessing steps, but let us go through them here for completeness' sake.
1. Tokenization
The process of converting text contained in paragraphs or sentences into individual words (called tokens) is known as tokenization. This is usually a very important step in text preprocessing before we can convert text into vectors full of numbers.
Intuitively, and rather naively, one way to tokenize text is to simply break the string at spaces, and Python already ships with very good string methods which can do this with ease; let's call such a tokenization method "whitespace tokenization".
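For instance, a minimal sketch of this naive approach using only the built-in `str.split`:

```python
sample_text = "Gemini Man review: Double Will Smith can't save hackneyed spy flick"

# str.split() with no arguments breaks the string on any run of whitespace
tokens = sample_text.split()
print(tokens)
# ['Gemini', 'Man', 'review:', 'Double', 'Will', 'Smith', "can't", 'save', 'hackneyed', 'spy', 'flick']
```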
However, whitespace tokenization cannot understand word contractions, such as when we combine two words 'can' and 'not' into "can't", "don't" (do + not), and "I've" (I + have). These are non-trivial issues: if we don't separate "can't" into "can" and "not", then once we strip punctuation we will be left with the single word "cant", which is not really a dictionary word.
The classical library for text processing in Python, NLTK, ships with other tokenizers such as WordPunctTokenizer and TreebankWordTokenizer, which operate on different conventions to try to solve the word contraction issue. For advanced tokenization strategies, there is also a RegexpTokenizer available, which can split strings according to a regular expression.
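For example, here is a minimal sketch of RegexpTokenizer (the pattern is purely illustrative, not a recommended one-size-fits-all regex) that keeps contractions together while dropping other punctuation:

```python
from nltk.tokenize import RegexpTokenizer

# Illustrative pattern: prefer word + apostrophe + word (keeps "can't" intact),
# otherwise fall back to plain runs of word characters
tokenizer = RegexpTokenizer(r"\w+'\w+|\w+")
print(tokenizer.tokenize("Double Will Smith can't save hackneyed spy flick"))
# ['Double', 'Will', 'Smith', "can't", 'save', 'hackneyed', 'spy', 'flick']
```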
All of these approaches are rule-based, though, and since no real "learning" is happening, you as a user will have to handle all the special cases which might crop up as a result of your tokenization strategy.
Next-generation NLP libraries such as SpaCy and Apache Spark NLP have largely fixed this issue and deal with common contractions as part of their language models' tokenization.
1.1 NLTK Tokenization Examples
```python
# Create a string input
sample_text = "Gemini Man review: Double Will Smith can't save hackneyed spy flick U.S.A"

from nltk.tokenize import WhitespaceTokenizer
tokenizer_w = WhitespaceTokenizer()

# Use the tokenize method
tokenized_list = tokenizer_w.tokenize(sample_text)
tokenized_list

# Output
['Gemini', 'Man', 'review:', 'Double', 'Will', 'Smith', "can't", 'save', 'hackneyed', 'spy', 'flick', 'U.S.A']
```
WordPunctTokenizer splits on punctuation, as shown below.
```python
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenized_list = tokenizer.tokenize(sample_text)
tokenized_list

# Output
['Gemini', 'Man', 'review', ':', 'Double', 'Will', 'Smith', 'can', "'", 't', 'save', 'hackneyed', 'spy', 'flick', 'U', '.', 'S', '.', 'A']
```
And NLTK's TreebankWordTokenizer splits word contractions into two tokens, as shown below.
```python
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenized_list = tokenizer.tokenize(sample_text)
tokenized_list

# Output
['Gemini', 'Man', 'review', ':', 'Double', 'Will', 'Smith', 'ca', "n't", 'save', 'hackneyed', 'spy', 'flick', 'U.S.A']
```
1.2 SpaCy Tokenization Example
It's pretty simple to perform tokenization in SpaCy too, and in the later section on lemmatization you will see why tokenization as part of a language model fixes the word contraction issue.
```python
# SpaCy tokenization example
sample_text = "Gemini Man review: Double Will Smith can't save hackneyed spy flick U.S.A"

from spacy.lang.en import English
nlp = English()
tokenizer = nlp.Defaults.create_tokenizer(nlp)
tokens = tokenizer(sample_text)

token_list = []
for token in tokens:
    token_list.append(token.text)
token_list

# Output
['Gemini', 'Man', 'review', ':', 'Double', 'Will', 'Smith', 'ca', "n't", 'save', 'hackneyed', 'spy', 'flick', 'U.S.A']
```
2. Stemming and Lemmatization
Stemming and lemmatization attempt to get the root word (e.g., rain) for different word inflections (raining, rained, etc.). Lemmatization algorithms give you real dictionary words, whereas stemming simply cuts off the last parts of a word, so it is faster but less accurate. Because stemming returns words which are not really dictionary words, you will not be able to find pretrained vectors for them in GloVe, Word2Vec, etc., and this is a major disadvantage depending on the application.
Nevertheless, it is pretty popular to use stemming algorithms such as the Porter stemmer and the more advanced Snowball stemmer. SpaCy does not ship with any stemming algorithms, so we will be using NLTK for stemming; we will show outputs from two stemming algorithms here. For ease of use, we will wrap the whitespace tokenizer into a function. As you can see, both stemmers reduce the verb form (raining) to rain.
2.1 NLTK’s Stemming Examples
```python
sample_text = '''Gemini Man review: Double Will Smith can't save hackneyed spy flick U.S.A raining rained ran'''

from nltk.tokenize import WhitespaceTokenizer

def w_tokenizer(text):
    tokenizer = WhitespaceTokenizer()
    # Use the tokenize method
    tokenized_list = tokenizer.tokenize(text)
    return tokenized_list

from nltk.stem.snowball import SnowballStemmer

def stemmer_snowball(text_list):
    snowball = SnowballStemmer(language='english')
    return_list = []
    for i in range(len(text_list)):
        return_list.append(snowball.stem(text_list[i]))
    return return_list

stemmer_snowball(w_tokenizer(sample_text))

# Output
['gemini', 'man', 'review:', 'doubl', 'will', 'smith', "can't", 'save', 'hackney', 'spi', 'flick', 'u.s.a', 'rain', 'rain', 'ran']
```
You get the same result with NLTK's Porter stemmer, and this one too reduces words into non-dictionary forms such as spy -> spi and double -> doubl.
```python
from nltk.stem.porter import PorterStemmer

def stemmer_porter(text_list):
    porter = PorterStemmer()
    return_list = []
    for i in range(len(text_list)):
        return_list.append(porter.stem(text_list[i]))
    return return_list

stemmer_porter(w_tokenizer(sample_text))

# Output
['gemini', 'man', 'review:', 'doubl', 'will', 'smith', "can't", 'save', 'hackney', 'spi', 'flick', 'u.s.a', 'rain', 'rain', 'ran']
```
2.2 SpaCy’s Lemmatization Example
If you use SpaCy for tokenization, then each token already stores an attribute called `.lemma_`, and you can simply call it to get the lemmatized form of each word. Notice that it's not as aggressive as a stemmer, and it converts word contractions such as "can't" to "can" and "not".
```python
# https://spacy.io/api/tokenizer
from spacy.lang.en import English

nlp = English()
tokenizer = nlp.Defaults.create_tokenizer(nlp)
tokens = tokenizer(sample_text)

lemma_list = []
for token in tokens:
    lemma_list.append(token.lemma_)
lemma_list

# Output
['Gemini', 'Man', 'review', ':', 'Double', 'Will', 'Smith', 'can', 'not', 'save', 'hackneyed', 'spy', 'flick', 'U.S.A', 'rain', 'rain', 'run']
```
3. Stop Word Removal
There are certain words above, such as "it", "is", "that", "this", etc., which don't contribute much to the meaning of the underlying sentence and are actually quite common across all English documents; these words are known as stop words. There is generally a need to remove these "common" words before vectorizing tokens with a count vectorizer, so that we can reduce the total dimensionality of our vectors and mitigate the so-called "curse of dimensionality".
You can remove stop words by essentially three methods:
- The first method is the simplest: you create a list or set of words you want to exclude from your tokens; such a list is already available as part of sklearn's CountVectorizer, NLTK, as well as SpaCy. This has been the accepted method of removing stop words for quite a long time; however, there is a growing awareness among researchers and working professionals that such a one-size-fits-all method can actually be quite harmful to learning the overall meaning of the text, and there are papers out there which caution against this approach.
```python
# Using a hard-coded stop word list
from spacy.lang.en import English
import spacy

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS  # spacy_stopwords is a hardcoded set

nlp = English()
tokenizer = nlp.Defaults.create_tokenizer(nlp)
tokens = tokenizer(sample_text)

lemma_list = []
for token in tokens:
    if token.lemma_.lower() not in spacy_stopwords:
        lemma_list.append(token.lemma_)
lemma_list

# Output
['Gemini', 'Man', 'review', ':', 'Double', 'Smith', 'save', 'hackneyed', 'spy', 'flick', 'U.S.A', 'rain', 'rain', 'run']
```
As expected, the words "will" and "can" are removed, since they are present in the hard-coded set of stop words available in SpaCy. Let us wrap this into a function called remove_stopwords so that we can use it as part of a sklearn pipeline in section 5.
```python
from spacy.lang.en.stop_words import STOP_WORDS

def remove_stopwords(text_list):
    spacy_stopwords = STOP_WORDS
    return_list = []
    for i in range(len(text_list)):
        if text_list[i] not in spacy_stopwords:
            return_list.append(text_list[i])
    return return_list
```
- The second approach is to let the language model figure out whether a given token is a stop word or not. SpaCy's tokenization already provides an attribute called `.is_stop` for this purpose. Now, there will be times when common stop words are not excluded by SpaCy's flag, but that is still better than a hard-coded list of words to exclude. Just FYI, there is a well-documented bug in some SpaCy models [1][2] which prevents the detection of stop words when the first letter is capitalized, so you may need to apply the workaround (a sketch follows the wrapper function below) in case stop words are not being detected properly.
```python
# Using the .is_stop flag
from spacy.lang.en import English

nlp = English()
tokenizer = nlp.Defaults.create_tokenizer(nlp)
tokens = tokenizer(sample_text)

lemma_list = []
for token in tokens:
    if token.is_stop is False:
        lemma_list.append(token.lemma_)
lemma_list

# Output
['Gemini', 'Man', 'review', ':', 'Double', 'Will', 'Smith', 'not', 'save', 'hackneyed', 'spy', 'flick', 'U.S.A', 'rain', 'rain', 'run']
```
This is obviously doing a better job, since it retained "Will", which is here the name of a person, and only removed "can" from the sample text. Let's wrap this in a function so that we can use it in the last section.
```python
from spacy.lang.en import English

def spacy_tokenizer_lemmatizer(text):
    nlp = English()
    tokenizer = nlp.Defaults.create_tokenizer(nlp)
    tokens = tokenizer(text)
    lemma_list = []
    for token in tokens:
        if token.is_stop is False:
            lemma_list.append(token.lemma_)
    return lemma_list
```
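If you do run into the capitalization bug mentioned above, the commonly cited workaround is to re-register each stop word (plus its capitalized and upper-case variants) directly on the vocabulary. A minimal sketch, assuming the hard-coded STOP_WORDS set from the first method:

```python
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

nlp = English()

# Re-register every stop word and its capitalized/upper-case variants,
# so that tokens like "The" or "WILL" also get is_stop == True
for word in STOP_WORDS:
    for variant in (word, word.capitalize(), word.upper()):
        nlp.vocab[variant].is_stop = True
```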
- The third approach to combating stop words is to exclude words which appear too frequently in a given corpus. sklearn's CountVectorizer and TfidfVectorizer have a parameter called `max_df` which lets you ignore tokens that have a document frequency strictly higher than the given threshold, and you can also cap the vocabulary size through the `max_features` parameter (see the sketch below). Additionally, if you apply tf-idf weighting after the count vectorizer, stop words will automatically be assigned a much lower weight than the words which contribute to the overall meaning of the sentence.
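As a quick illustration of these parameters (with a tiny made-up corpus; the thresholds here are arbitrary):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the movie was good",
          "the movie was bad",
          "the plot was predictable"]

# Ignore tokens appearing in more than 80% of documents ("the", "was")
# and keep at most 1000 features overall
vectorizer = CountVectorizer(max_df=0.8, max_features=1000)
X = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_))
# ['bad', 'good', 'movie', 'plot', 'predictable']
```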
4. Removing Punctuation
Once we have tokenized the text and converted the word contractions, it really isn't useful anymore to have punctuation and special characters in our text. This is of course not true when we are dealing with text likely to contain Twitter handles, email addresses, etc.; in those cases, we alter our text processing pipeline to only strip whitespace from tokens, or skip this step altogether. We can clean out all HTML tags using the regex `<[^>]*>`, and all the non-word characters can be removed with `[\W]+`. You should be careful, though, not to strip punctuation before word contractions have been handled by the lemmatizer. In the code block below, we will modify our SpaCy code to account for stop words and also remove any punctuation from tokens. As shown in the example below, we have successfully removed special-character tokens such as ":" which don't really contribute anything semantically in a bag-of-words vectorization.
```python
import re

def preprocessor(text):
    if isinstance(text, str):
        text = re.sub(r'<[^>]*>', '', text)        # strip HTML tags
        text = re.sub(r'[\W]+', '', text.lower())  # strip non-word characters
    return text

from spacy.lang.en import English

nlp = English()
tokenizer = nlp.Defaults.create_tokenizer(nlp)
tokens = tokenizer(sample_text)

lemma_list = []
for token in tokens:
    if token.is_stop is False:
        token_preprocessed = preprocessor(token.lemma_)
        if token_preprocessed != '':
            lemma_list.append(token_preprocessed)
lemma_list

# Output
['gemini', 'man', 'review', 'double', 'will', 'smith', 'not', 'save', 'hackneyed', 'spy', 'flick', 'usa', 'rain', 'rain', 'run']
```

A more appropriate preprocessor function, which can take both a list and a string as input, is below:

```python
def preprocessor_final(text):
    if isinstance(text, str):
        text = re.sub(r'<[^>]*>', '', text)
        text = re.sub(r'[\W]+', '', text.lower())
        return text
    if isinstance(text, list):
        return_list = []
        for i in range(len(text)):
            temp_text = re.sub(r'<[^>]*>', '', text[i])
            temp_text = re.sub(r'[\W]+', '', temp_text.lower())
            return_list.append(temp_text)
        return return_list
```
Another common text processing use case is performing document-level sentiment analysis on web data such as social media comments, tweets, etc. All of these make extensive use of emoticons, and if we simply strip out all special characters then we may miss out on some very useful tokens which contribute greatly to the semantics and sentiment of the text. If we are planning on using a bag-of-words type text vectorization, then we can simply find all those emoticons and add them towards the end of the tokenized list, as sketched after the next code block. In this case, you might have to run the preprocessor as the first step, before tokenization.
```python
# Function to find emoticons
import re

def find_emo(text):
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    return emoticons

sample_text = " I loved this movie :) but it was rather sad :( "
find_emo(sample_text)

# Output
[':)', ':(']
```
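Building on find_emo, here is a minimal sketch of the idea above (the tokenize_with_emoticons helper is hypothetical, and the whitespace tokenization is only for illustration): strip the emoticons out, tokenize the cleaned text, then append the emoticons to the end of the token list.

```python
import re

EMOTICON_PATTERN = r'(?::|;|=)(?:-)?(?:\)|\(|D|P)'

def tokenize_with_emoticons(text):
    # Grab the emoticons first, since stripping special characters would destroy them
    emoticons = re.findall(EMOTICON_PATTERN, text)
    # Remove them from the text, tokenize on whitespace, then append them at the end
    cleaned = re.sub(EMOTICON_PATTERN, '', text)
    return cleaned.split() + emoticons

print(tokenize_with_emoticons(" I loved this movie :) but it was rather sad :( "))
# ['I', 'loved', 'this', 'movie', 'but', 'it', 'was', 'rather', 'sad', ':)', ':(']
```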
5. Sklearn Pipelines
As you saw above, text preprocessing is rarely one-size-fits-all, and most real-world applications require us to use different preprocessing modules depending on the text source and the further analysis we plan on doing.
There are many ways to create such a custom pipeline, but one simple option is to use sklearn pipelines, which allow us to sequentially assemble several different steps, with the only requirement being that intermediate steps implement the fit and transform methods and the final estimator implements at least a fit method.
Now, this might be too onerous a requirement for many small functions such as the ones for preprocessing text; but thankfully, sklearn also ships with a FunctionTransformer which allows us to wrap any arbitrary function into a sklearn-compatible one. There is one catch, though: the function should not operate directly on individual objects, but on lists, pandas Series, or NumPy arrays. This is not a major deterrent: you can just create a helper function which applies your function elementwise through a list comprehension.
```python
# Adapted from https://ryan-cranfill.github.io/sentiment-pipeline-sklearn-3/
from sklearn.preprocessing import FunctionTransformer

def pipelinize(function, active=True):
    def list_comprehend_a_function(list_or_series, active=True):
        if active:
            return [function(i) for i in list_or_series]
        else:
            # if it's not active, just pass it right back
            return list_or_series
    return FunctionTransformer(list_comprehend_a_function, validate=False, kw_args={'active': active})
```
As a final step, let us compose a sklearn pipeline which uses NLTK's w_tokenizer and stemmer_snowball functions from section 2.1, the remove_stopwords function from section 3, and the preprocessor_final function from section 4.
```python
from sklearn.pipeline import Pipeline

estimators = [('tokenizer', pipelinize(w_tokenizer)),
              ('stemmer', pipelinize(stemmer_snowball)),
              ('stopwordremoval', pipelinize(remove_stopwords)),
              ('preprocessor', pipelinize(preprocessor_final))]

pipe = Pipeline(estimators)
pipe.transform([sample_text])

# Output
[['gemini', 'man', 'review', 'doubl', 'smith', 'cant', 'save', 'hackney', 'spi', 'flick', 'usa', 'rain', 'rain', 'ran']]
```
You can easily change the above pipeline to use the SpaCy functions as shown below. Note that the tokenization function (spacy_tokenizer_lemmatizer) introduced in section 3 returns lemmatized tokens without any stopwords, so those steps are not necessary in our pipeline and we can directly run the preprocessor.
```python
spacy_estimators = [('tokenizer', pipelinize(spacy_tokenizer_lemmatizer)),
                    ('preprocessor', pipelinize(preprocessor_final))]

spacy_pipe = Pipeline(spacy_estimators)
spacy_pipe.transform([sample_text])

# Output
[['gemini', 'man', 'review', 'double', 'will', 'smith', 'not', 'save', 'hackneyed', 'spy', 'flick', 'usa', 'rain', 'rain', 'run']]
```
I hope that I have illustrated the ample advantages of using sklearn pipelines with a SpaCy-based preprocessing workflow to effectively and efficiently perform preprocessing for almost all NLP tasks.