Setting up a text preprocessing pipeline using scikit-learn and spaCy: learn how to tokenize, lemmatize, and remove stop words and punctuation with sklearn pipelines

Text preprocessing is the process of getting raw text into a form that can be vectorized and subsequently consumed by machine learning algorithms for natural language processing (NLP) tasks such as text classification, topic modeling, named entity recognition, etc.

Raw text is extensively preprocessed by all text analytics APIs, such as Azure's text analytics APIs or the ones we have developed at Specrom Analytics, although the extent and the type of preprocessing depend on the type of input text. For example, for our historical news APIs, the input consists of scraped HTML pages, and hence it is important for us to strip the unwanted HTML tags from the text before feeding it to the NLP algorithms. However, for some news outlets we get data as JSON from their official REST APIs. In that case, there are no HTML tags at all and it would be a waste of CPU time to run a regex-based preprocessor on such clean text. Hence, it makes sense to preprocess text differently based on the source of the data.

If you want to create word clouds as shown below, then it is generally recommended that you remove stop words. But in cases such as named entity recognition (NER), this is not really required and you can safely feed syntactically complete sentences to the NER of your choice.

Word cloud generated from a corpus of scraped news (2016–2020). Jay M. Patel ©

There are many good blog posts on the individual text preprocessing steps, but let us go through them here for completeness' sake.

1. Tokenization

The process of converting text contained in paragraphs or sentences into individual words (called tokens) is known as tokenization. This is usually a very important step in text preprocessing before we can convert text into vectors full of numbers.

Intuitively and rather naively, one way to tokenize text is to simply break the string at spaces, and Python already ships with very good string methods that can do this with ease; let's call such a tokenization method "white space tokenization".
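Here is a minimal sketch using Python's built-in split, with a made-up sample sentence:

# white space tokenization using Python's built-in string methods
text = "I can't wait to go out; it's raining outside."
tokens = text.split()  # splits on any run of whitespace
print(tokens)
# ['I', "can't", 'wait', 'to', 'go', 'out;', "it's", 'raining', 'outside.']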

However, white space tokenization cannot understand word contractions, such as when we combine the two words "can" and "not" into "can't", "don't" (do + not), and "I've" (I + have). These are non-trivial issues, and if we don't separate "can't" into "can" and "not", then once we strip punctuation we will be left with a single word "cant", which is not really a dictionary word.

The classical library for text processing in Python called NLTK ships with other tokenizers such as WordPunctTokenizer and TreebankWordTokenizer which all operate on different conventions to try and solve the word contractions issue. For advanced tokenization strategies, there is also a RegexpTokenizer available which can split strings according to a regular expression.

All of these approaches are basically rule-based, though, and since no real "learning" is happening, you as a user will have to handle all the special cases that might crop up as a result of the tokenization strategy.

Next-generation NLP libraries such as spaCy and Apache Spark NLP have largely fixed this issue and deal with common contractions and abbreviations as part of their language models' tokenization.

1.1 NLTK Tokenization Examples

WordPunctTokenizer will split on punctuation, as shown below.

And NLTK's TreebankWordTokenizer splits word contractions into two tokens, as shown below.
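A small sketch of both tokenizers on a made-up sentence:

from nltk.tokenize import WordPunctTokenizer, TreebankWordTokenizer

text = "I can't wait for the rain to stop."

# WordPunctTokenizer splits on punctuation, so "can't" becomes three tokens
print(WordPunctTokenizer().tokenize(text))
# ['I', 'can', "'", 't', 'wait', 'for', 'the', 'rain', 'to', 'stop', '.']

# TreebankWordTokenizer splits the contraction into two tokens instead
print(TreebankWordTokenizer().tokenize(text))
# ['I', 'ca', "n't", 'wait', 'for', 'the', 'rain', 'to', 'stop', '.']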

1.2 SpaCy Tokenization Example

It's pretty simple to perform tokenization in spaCy too, and in the later section on lemmatization you will notice why tokenization as part of the language model fixes the word contraction issue.
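A minimal example, assuming the small English model (en_core_web_sm) is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I can't wait for the rain to stop.")

# the language model's tokenizer handles the contraction for us
print([token.text for token in doc])
# ['I', 'ca', "n't", 'wait', 'for', 'the', 'rain', 'to', 'stop', '.']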

2. Stemming and Lemmatization

Stemming and lemmatization attempt to get the root word (e.g. rain) for different word inflections (raining, rained, etc.). Lemmatization gives you real dictionary words, whereas stemming simply cuts off the last parts of the word, so it's faster but less accurate. Stemming returns words which are not really dictionary words, and hence you will not be able to find pretrained vectors for them in GloVe, Word2Vec, etc., which is a major disadvantage depending on the application.

Nevertheless, it is pretty popular to use stemming algorithms such as Porter and the more advanced Snowball stemmer. spaCy does not ship with any stemming algorithms, so we will be using NLTK for stemming; we will show outputs from two stemming algorithms here. For ease of use, we will wrap the whitespace tokenizer into a function. As you will see in the example below, both stemmers reduce the verb form (raining) to rain.

2.1 NLTK’s Stemming Examples

You get the same result with NLTK's Porter stemmer, and this one too reduces words to non-dictionary forms such as spy -> spi and double -> doubl.
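A sketch of both stemmers, with the whitespace tokenizer wrapped in a function (w_tokenizer) and the Snowball stemmer wrapped in another (stemmer_snowball) so we can reuse them later; the sample words are made up:

from nltk.stem import PorterStemmer, SnowballStemmer

def w_tokenizer(text):
    # whitespace tokenizer wrapped into a function so it can be reused later
    return text.split()

def stemmer_snowball(tokens):
    stemmer = SnowballStemmer("english")
    return [stemmer.stem(t) for t in tokens]

def stemmer_porter(tokens):
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

tokens = w_tokenizer("raining rained spy double running")
print(stemmer_snowball(tokens))  # roughly: ['rain', 'rain', 'spi', 'doubl', 'run']
print(stemmer_porter(tokens))    # the Porter stemmer gives very similar output here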

2.2 SpaCy’s Lemmatization Example

If you use spaCy for tokenization, then it already stores an attribute called .lemma_ with each token, and you can simply call it to get the lemmatized form of each word. Notice that it's not as aggressive as a stemmer, and it converts word contractions such as "can't" into "can" and "not".
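A small sketch (again assuming en_core_web_sm; the exact lemmas can vary slightly between model versions):

import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_lemmatizer(text):
    # every spaCy token carries a .lemma_ attribute with its dictionary form
    return [token.lemma_ for token in nlp(text)]

print(spacy_lemmatizer("It was raining and I can't go out."))
# roughly: ['it', 'be', 'rain', 'and', 'I', 'can', 'not', 'go', 'out', '.']
# note how "can't" comes back as 'can' and 'not'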

3. Stop Word Removal

There are certain words above such as “it”, “is”, “that”, “this” etc. which don’t contribute much to the meaning of the underlying sentence and are actually quite common across all English documents; these words are known as stop words. There is generally a need to remove these “common” words before vectorizing tokens by a count vectorizer so that we can reduce the total dimensions of our vectors, and mitigate the so called “curse of dimensionality”.

You can remove stop words by essentially three methods:

  • The first method is the simplest: you create a list or set of words you want to exclude from your tokens; such a list is already available as part of sklearn's CountVectorizer, NLTK, as well as spaCy. This has been the accepted way to remove stop words for quite a long time; however, there is a growing awareness among researchers and working professionals that such a one-size-fits-all method can actually be quite harmful to learning the overall meaning of the text, and there are papers out there which caution against this approach.

As expected, words such as "will" and "can" are removed since they are present in the hard-coded set of stop words that ships with spaCy. Let us wrap this into a function called remove_stop_words so that we can use it as part of the sklearn pipeline in section 5.
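A minimal sketch using spaCy's built-in stop word list (the sample tokens are made up):

from spacy.lang.en.stop_words import STOP_WORDS

def remove_stop_words(tokens):
    # drop any token found in spaCy's hard-coded stop word list
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["Will", "can", "swim", "faster", "than", "most", "people"]
print(remove_stop_words(tokens))
# roughly: ['swim', 'faster', 'people']  -- both "Will" and "can" are gone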

  • The second approach is to let the language model figure out whether a given token is a stop word or not. spaCy's tokenization already provides an attribute called .is_stop for this purpose. Now, there will be times when common stop words are not excluded by spaCy's flag, but that is still better than a hard-coded list of words to exclude. Just FYI, there is a well-documented bug in some spaCy models[1][2] which prevents detection of stop words when the first letter is capitalized, so you need to apply the workaround in case it's not detecting stop words properly.

This is obviously doing a better job, since it detected that "Will" here is the name of a person and only removed "can" from the sample text. Let's wrap this in a function so that we can use it in the last section.
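A sketch of that function, here called spacy_tokenizer_lemmatizer since it also returns lemmas (the exact output depends on the spaCy model and version, as per the capitalization caveat above):

import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer_lemmatizer(text):
    # keep the lemma of every token that the language model does not flag as a stop word
    return [token.lemma_ for token in nlp(text) if not token.is_stop]

print(spacy_tokenizer_lemmatizer("Will can swim faster than most people."))
# something like: ['Will', 'swim', 'fast', 'people', '.']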

  • The third approach to combating stop words is excluding words which appear too frequently in a given corpus; sklearn's CountVectorizer and TfidfVectorizer have a parameter called `max_df` which lets you ignore tokens that have a document frequency strictly higher than the given threshold. You can also cap the total number of tokens through the `max_features` parameter. If you are going to use tf-idf after the count vectorizer, then it will automatically assign a much lower weight to stop words compared to words which contribute to the overall meaning of the sentence, as sketched below.
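A toy illustration of the max_df and max_features parameters (the corpus is made up; a reasonably recent scikit-learn is assumed for get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the house",
]

# ignore tokens appearing in more than 70% of documents ("the" here)
# and keep at most 1000 features
vectorizer = CountVectorizer(max_df=0.7, max_features=1000)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # no 'the' in the vocabulary

# TfidfVectorizer instead down-weights very frequent tokens
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)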

4. Removing Punctuation

Once we have tokenized the text and converted the word contractions, it really isn't useful anymore to have punctuation and special characters in our text. This is of course not true when we are dealing with text likely to contain Twitter handles, email addresses, etc. In those cases, we alter our text processing pipeline to only strip whitespace from tokens, or skip this step altogether. We can clean out all HTML tags by using the regex '<[^>]*>'; all the non-word characters can be removed by '[\W]+'. You should be careful, though, about not stripping punctuation before word contractions are handled by the lemmatizer. In the code block below, we modify our spaCy code to account for stop words and also remove any punctuation tokens. As shown in the example below, we have successfully removed special-character tokens such as ":" which don't really contribute anything semantically in a bag-of-words vectorization.
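A sketch of a regex preprocessor plus a spaCy step that drops stop words and punctuation (the sample text is made up; exact lemmas may vary by model version):

import re
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocessor(text):
    # strip HTML tags first, then collapse any non-word characters into spaces
    text = re.sub(r"<[^>]*>", "", text)
    text = re.sub(r"[\W]+", " ", text)
    return text.lower().strip()

def spacy_tokens_clean(text):
    # lemmatize, then drop stop words and punctuation tokens
    return [t.lemma_ for t in nlp(text) if not t.is_stop and not t.is_punct]

print(spacy_tokens_clean("Breaking news: it was raining heavily today."))
# roughly: ['break', 'news', 'rain', 'heavily', 'today']  -- the ':' token is gone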

Another common text processing use case is when we are trying to perform document-level sentiment analysis on web data such as social media comments, tweets, etc. All of these make extensive use of emoticons, and if we simply strip out all special characters then we may miss out on some very useful tokens which contribute greatly to the semantics and sentiment of the text. If we are planning on using a bag-of-words type of text vectorization, then we can simply find all those emoticons and append them to the end of the tokenized list. In this case, you might have to run the preprocessor as the first step, before tokenization.

5. Sklearn Pipelines

As you saw above, text preprocessing is rarely a one size fits all, and most real world applications require us to use different preprocessing modules as per the text source and the further analysis we plan on doing.

There are many ways to create such a custom pipeline, but one simple option is to use sklearn pipelines, which allow us to sequentially assemble several different steps, with the only requirement being that intermediate steps implement the fit and transform methods and the final estimator has at least a fit method.

Now, this might be too onerous a requirement for many small functions such as the ones for preprocessing text; but thankfully, sklearn also ships with a FunctionTransformer which allows us to wrap any arbitrary function into a sklearn-compatible transformer. There is one catch though: the function should not operate on individual documents directly but on lists, pandas Series, or NumPy arrays of them. This is not a major deterrent though; you can just create a helper function which applies your function over the input with a list comprehension.

As a final step, let us compose a sklearn pipeline which uses NLTK's w_tokenizer function and stemmer_snowball from section 2.1 together with the preprocessor function from section 4.
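A sketch of such a pipeline, reusing the functions defined in the snippets above and a small helper (map_over_docs) to apply each function over a list of documents:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def map_over_docs(func):
    # FunctionTransformer expects list-like input, so map the function over each document
    return FunctionTransformer(lambda docs: [func(doc) for doc in docs])

nltk_pipeline = Pipeline([
    ("preprocess", map_over_docs(preprocessor)),   # regex clean-up from section 4
    ("tokenize", map_over_docs(w_tokenizer)),      # whitespace tokenizer from section 2.1
    ("stem", map_over_docs(stemmer_snowball)),     # Snowball stemmer from section 2.1
])

print(nltk_pipeline.fit_transform(["<p>It was raining heavily today!</p>"]))
# roughly: [['it', 'was', 'rain', 'heavili', 'today']]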

You can easily change the above pipeline to use the SpaCy functions as shown below. Note that the tokenization function (spacy_tokenizer_lemmatizer) introduced in section 3 returns lemmatized tokens without any stopwords, so those steps are not necessary in our pipeline and we can directly run the preprocessor.
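A spaCy-based variant under the same assumptions (reusing Pipeline, map_over_docs, preprocessor, and spacy_tokenizer_lemmatizer from the snippets above):

spacy_pipeline = Pipeline([
    ("preprocess", map_over_docs(preprocessor)),              # regex clean-up from section 4
    ("tokenize", map_over_docs(spacy_tokenizer_lemmatizer)),  # lemmas without stop words, from section 3
])

print(spacy_pipeline.fit_transform(["<p>It was raining heavily today!</p>"]))
# roughly: [['rain', 'heavily', 'today']]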

I hope that I have illustrated the ample advantages of using sklearn pipelines with a spaCy-based preprocessing workflow to effectively and efficiently perform preprocessing for almost all NLP tasks.

WTH are R-squared and Adjusted R-squared? Understanding the math and intuition behind R-squared.

Today I am going to explain the concept of R-squared and adjusted R-squared from the Machine Learning perspective. I’ll also show you how to find the R-squared value of your ML model. Let’s begin…

R-squared

It acts as an evaluation metric for regression models. To understand it better, let me introduce a regression problem. Suppose I'm building a model to predict how many articles I will write in a particular month given the amount of free time I have in that month. So, here the target variable is the number of articles and free time is the independent variable (aka the feature). Here's the dummy data that I created.

In this case, a simple linear regression model should be enough. The equation of the model is…
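With free time as the single feature (call it x_1), the simple linear regression model takes the standard form:

$$\hat{y} = w_1 x_1 + b$$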

The parameters of the model (w_1 and b) can be found by minimizing the squared error over all the data points. This is also known as the least squares loss function.
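In standard notation, the quantity being minimized is:

$$L(w_1, b) = \sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2 = \sum_{i=1}^{n} \big(y_i - (w_1 x_{1,i} + b)\big)^2$$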

After this optimization step, we found the red line as our model(best-fit line).

Now we want to know how good our model is. This can be done in many ways, but R-squared uses a statistical measure called variance. Variance denotes how much the values are spread out around their mean. Mathematically,
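$$\mathrm{Var}(y) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \bar{y}\big)^2$$

where \bar{y} is the mean of the target values.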

n is the number of data points

R-squared finds how much of the variance of the target variable is explained by the model (a function of the independent variables). But to find this we need to know two things: 1) the variance of the target variable around the mean (average variance), and 2) the variance of the target variable around the best-fit line (model variance).

The average variance can also be seen as the variance of a model that outputs the mean of the target variable for every input. We can visualize this model as a horizontal line that cuts the y-axis at the mean of all the y coordinates of our data points. See the green line in the plot.

Ignoring the factor 1/n, we can write…
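(writing the sum of squares around the mean as $SS_{\text{avg}}$)

$$SS_{\text{avg}} = \sum_{i=1}^{n} \big(y_i - \bar{y}\big)^2$$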

The formula for model variance is…
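(writing the sum of squares around the best-fit line as $SS_{\text{model}}$)

$$SS_{\text{model}} = \sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2$$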

Now we are in a position to understand the formula of R-squared.
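In terms of the two quantities above, it is:

$$R^2 = 1 - \frac{SS_{\text{model}}}{SS_{\text{avg}}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$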

How to interpret?

As I mentioned earlier R-squared value denotes the proportion of variance of the target variable that can be explained by your model. The more proportion of variance explained the better your model is. So an R-squared value close to 1 corresponds to a good model and a value close to 0 corresponds to a bad model.

Let’s say the R² value of our model is 0.78. This statement means that our model explains 78% of the variance of the data corresponding to the number of articles. It is close to 1 so we can say this is a good model.

Possible values of R²

R² = 0 when our model is the same as the average model. R²>0 means our model is better than the average model. The maximum possible value of R² is equal to 1. Although it has a square in its name it may take a negative value. R²<0 means our model is worse than the average model. This case does not occur in general as the optimization step will produce a better model than the average one.

Problem with R-squared

At first, it seems all fine but as we add more features, R² shows a huge problem. R-squared can never decrease as new features are added to the model.

This is a problem because even if we add useless or random features to our model, the R-squared value will still increase, suggesting that the new model is better than the previous one. This is false because the new features have nothing to do with the output variable and only contribute to overfitting.

Why can R-squared never decrease?

To understand this, let's introduce a new feature to our model which has no relation to the number of articles written by me (the output variable). I am taking the average temperature of a month as our new feature. Let's call it x_2. So our model becomes…
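$$\hat{y} = w_1 x_1 + w_2 x_2 + b$$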

After the optimization two cases can occur for w_2:

  1. We get w_2 as 0. This means no correlation between x_2 and the output variable has been found and we are stuck with the previous minimum of the loss function. Thus our model stays the same as the previous model. So, in this case, the R² value remains the same.
  2. We get a non-zero value of w_2. This means some correlation between x_2 and the output variable has been found and we achieve a better minimum of the loss function. So, the R² value increases.

Almost always the second case occurs as it is very easy to find a small correlation in randomness. But this small correlation overfits the model. To tackle this problem we use adjusted R-squared.

Adjusted R-squared

The idea behind adjusted R-squared is to penalize the score as we add more features to our model. Let’s look at the formula of adjusted R-squared.
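In standard form it is:

$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - m - 1}$$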

n is the number of data points; m is the number of independent features

The denominator (n-m-1) decreases as we increase the number of features, which inflates the penalty term. So if we don't find a significant increase in R², the value of the whole expression does not increase (it may even decrease). That's how adjusted R² is somewhat resistant to the problem that we were facing with the ordinary R².

How to find R² (using StatsModels)?

The statsmodels library offers a simple way to perform many statistical tasks. First, I've created a fake dataset and built a linear regression model. After that, I've printed the R² and adjusted R² values by calling the summary() function. Here's the code…

from pandas import DataFrame
import statsmodels.api as sm

# making the fake dataset
data = {
    'month': [12, 11, 10, 9, 8, 7, 6, 5, 4],
    'free_time': [120, 110, 100, 90, 80, 85, 60, 50, 40],
    'num_articles': [8, 8, 7, 6, 6, 7, 6, 4, 5]
}
df = DataFrame(data, columns=['month', 'free_time', 'num_articles'])

# features
X = df[['free_time']]
# target variable
Y = df['num_articles']

# adding a constant
X = sm.add_constant(X)

# applying method of least squares
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)

Now focus on the selected portion of the output

I’ve been in multiple data jobs for the past five years and I felt it’s a good idea to share some of my main learnings, as I have seen other writers do as well. I’ve worked mainly in companies where “data” is a support function to the business, so my experience may well be different from 100% data-driven companies. Also, this is only a personal experience: feel free to disagree!

Lesson 1 — The Data vs The Data Scientist

The first lesson I learned was one that people had more or less prepared me for. Data is more important than everything: make sure you understand the data before starting. However, when hearing this, I would never have expected the degree of incorrectness that much real-world data has.

Real world data is often completely wrong

And it is not only a data problem, it is also a company problem. Sometimes people seem to pay money to collect data that is just nonsensical. For example, I've worked on a project whose goal was to predict a target variable: the percentage of defects on a production line.

Four months into the project, I found out that the trackers that measured our target value were completely wrong: the operators of the production line used the detectors differently than they should have, resulting in a huge bias in the data. Result: the project got canceled and I could throw away four months of work.

To know how the data is really generated, you need to ask the right people: your users or your data generating field workers

Conclusion: data is often wrong. There is huge value in looking at individual observations and deep-diving into how the data collection process works. Not in theory, but in practice! Go and visit the place where the data is generated if possible. Ask the people who work with the data every day to give you an estimate of its usability before starting. You really want to find this out at the beginning rather than at the end.

Lesson 2 — Machine Learning KPIs vs Business Logic

My second lesson is about the statistical and machine learning models that we data scientists have in our toolkit.

Understanding your model is much more important than you think.

I’ve learned that understanding your model is much more important than you think. In Data Science, we are always talking about accuracy scores. They are very important. Also, improving the accuracy of models is what I get paid for, so we can’t ignore it.

The famous Black Swan effect is very well known but ignored anyway. Photo by Ahsan S. on Unsplash

Some events are super rare and maybe not present in your training data: do you know what your model will do?

But we all know when we built a model that the accuracy is like the lotto. Sometimes it’s good, sometimes it’s bad. The worst-case: some events are super rare and maybe not present in your training data. This is also known as the Black Swan effect.

My lesson: the day your production model goes bad and you find out that there is a rare value that occurs sometimes and your model starts making very bad decisions, you will wish you had inspected the actual fit of your model much, much better.

Lesson 3 — The Data Scientist vs His Client

The worst lesson I've had is this: many clients, stakeholders, and managers are not at all interested in lessons 1 and 2. Sometimes the hardest part of Data Science is justifying what can or cannot be done and how much time certain projects can take.

Sometimes the hardest part of Data Science is justifying what you can or cannot do.

A famous internet meme.

I've heard of cases where a model that outperformed on selected KPIs would be sent to prod, whether the data scientist agreed or not. You could say, okay, the KPI is wrong, but sometimes a client will be very focused on the KPIs rather than on having a reliable solution.

Work with your clients, not against them. Photo by Sebastian Herrmann on Unsplash

In Data Science, it is often difficult to know in advance whether a project can work. Justification, communication, and feedback loops are very important in this domain, and it is essential to adapt to different types of clients.

An incremental way of working improves collaboration with stakeholders and can make the final result much better.

I've learned that an incremental way of working improves collaboration with stakeholders: start by building a minimal solution — whether a data analysis or a working model — and ask your stakeholders for feedback. Then, if they want to continue in your proposed direction, add an increment to your product, until your stakeholders find it good enough.

Going step-by-step from nothing to a great product using incremental product delivery. Photo by Isaac Smith on Unsplash

This increment-and-feedback-cycle practice is based on Agile methodologies and avoids a part of the risk of Data Science projects that is to not have a good solution at the end. You can even base your contracts on this approach by stating a price per product increment or per time spent without fixing the total amount of money beforehand. This can seem risky at the beginning, but it can make your final result much better.

Mastering String Methods in Pandas: What You Need to Know to Get Started

Pandas is a popular Python library that provides easy-to-use data structures and data analysis tools. Pandas can be used for reading in data, generating statistics, aggregating, feature engineering for machine learning, and much more. The Pandas library also provides a suite of tools for string/text manipulation.

In this post, we will walk through some of the most important string manipulation methods provided by pandas.

Let’s get started!

First, let’s import the Pandas library

Now, let’s define an example pandas series containing strings:

Let’s print this series:
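Something like the following (the exact strings are made up; later on the article refers to this series as s1):

import pandas as pd

# a made-up series of opinions about programming languages
s1 = pd.Series(['python is awesome', 'java is just ok', 'c++ is overrated', 'fortran is old'])
print(s1)  # the last line of the output reads: dtype: object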

We notice that the series has ‘dtype: object’, which is the default type automatically inferred. In general, it is better to have a dedicated type. Since the release of Pandas 1.0, we are now able to specify dedicated types. Make sure Pandas is updated by executing the following command in a terminal:

We can specify ‘dtype: string’ as follows:

Let’s print the series:
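For example (after upgrading pandas if needed, e.g. with pip install --upgrade pandas):

s1 = pd.Series(['python is awesome', 'java is just ok', 'c++ is overrated', 'fortran is old'],
               dtype='string')
print(s1)  # the last line of the output now reads: dtype: string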

We can see that the series type is now specified. It is best to specify the type and not use the default 'dtype: object', because the latter allows accidental mixtures of types, which is not advisable. For example, with 'dtype: object' you can have a series with integers, strings, and floats in it. For this reason, the contents of a 'dtype: object' series can be vague.

Next, let’s look at some specific string methods. Let’s consider the ‘count()’ method. Let’s modify our series a bit for this example:

Let’s print the new series:

Let's count the number of times the word 'python' appears in each string:
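For instance, with a slightly modified (made-up) series:

s1 = pd.Series(['python is awesome, I use python daily', 'java is just ok', 'python beats c++'])
print(s1.str.count('python'))
# 0    2
# 1    0
# 2    1
# dtype: int64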

We see this returns a series of ‘dtype: int64’.

Another method we can look at is the ‘isdigit()’ method which returns a boolean series based on whether or not a string is a digit. Let’s define a new series to demonstrate the use of this method. Let’s say we have a series defined by a list of string digits, where missing string digits have the value ‘unknown’:

If we use the ‘isdigit()’ method, we get:
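A sketch with made-up values:

s2 = pd.Series(['100', 'unknown', '20', '240', 'unknown', '100'])
print(s2.str.isdigit())
# 0     True
# 1    False
# 2     True
# 3     True
# 4    False
# 5     True
# dtype: bool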

We can also use the ‘match()’ method to check for the presence of specific strings. Let’s check for the presence of the string ‘100’:

We can even check for the presence of ‘un’:
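Continuing with the same series:

# str.match checks whether each string starts with a match for the given pattern
print(s2.str.match('100'))  # True only for the elements equal to '100'
print(s2.str.match('un'))   # True only for the 'unknown' elements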

All of which is in concert with what we'd expect. We can also use methods to change the casing of the string text in our series. Let's go back to our series containing opinions about different programming languages, 's1':

We can use the ‘upper()’ method to capitalize the text in the strings in our series:

We also use the ‘lower()’ method:

We can also get the length of each string using ‘len()’:
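Putting those three together:

print(s1.str.upper())  # every string fully capitalized
print(s1.str.lower())  # every string lower-cased
print(s1.str.len())    # the length of each string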

Let’s consider a few more interesting methods. We can use the ‘strip()’ method to remove whitespace. For this, let’s define and print a new example series containing strings with unwanted whitespace:

As you can see, there is whitespace to the left of ‘python’ and to the right of ‘ruby’ and ‘fortran’. We can remove this with the ‘strip()’ method:

We can also remove whitespace on the left with ‘lstrip’:

and on the right with ‘rstrip’:
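A sketch with a made-up series matching that description:

s3 = pd.Series(['   python', 'ruby   ', 'c++', 'fortran   '])
print(s3.str.strip())   # removes whitespace on both sides
print(s3.str.lstrip())  # removes whitespace on the left only
print(s3.str.rstrip())  # removes whitespace on the right only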

In the previous two examples I was working with 'dtype=object' but, again, try your best to remember to specify dtype='string' if you are working with strings.

You can also use the strip methods to remove unwanted characters from your text. Oftentimes, real text data contains the '\n' character, which indicates a new line. Let's modify our series and demonstrate the use of strip in this case:

And we can remove the '\n' character with 'strip()':
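For example:

s3 = pd.Series(['\npython', 'ruby\n', 'c++', 'fortran\n'])
print(s3.str.strip())  # strip() removes the newline characters as well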

In this specific example, I'd like to point out a difference in behavior between 'dtype=object' and 'dtype=string'. If we specify 'dtype=string' and print the series:

We see that ‘\n’ has been interpreted. Nonetheless using ‘strip()’ on the newly specified series still works:

The last method we will look at is the ‘replace()’ method. Suppose we have a new series with poorly formatted dollar amounts:

We can use the ‘replace()’ method to get rid of the unwanted ‘#’ in the first element:

We can also replace the text ‘dollar’ with an actual ‘$’ sign:

Finally, we can remove the ‘,’ from the 2nd element:
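Putting the three replacements together, with made-up values standing in for the original series:

# a made-up series of poorly formatted dollar amounts
s4 = pd.Series(['#120 dollar', '5,000 dollar', '300 dollar'])

s4 = s4.str.replace('#', '', regex=False)        # drop the unwanted '#' in the first element
s4 = s4.str.replace('dollar', '$', regex=False)  # swap the word 'dollar' for a '$' sign
s4 = s4.str.replace(',', '', regex=False)        # remove the ',' from the second element
print(s4)
# roughly:
# 0     120 $
# 1    5000 $
# 2     300 $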

I will stop here but feel free to play around with the methods a bit more. You can try applying some of the Pandas methods to freely available data sets like Yelp or Amazon reviews which can be found on Kaggle or to your own work if it involves processing text data.

To summarize, we discussed some basic Pandas methods for string manipulation. We went over generating boolean series based on the presence of specific strings, checking for the presence of digits in strings, removing unwanted whitespace or characters, and replacing unwanted characters with a character of choice.

There are many more Pandas string methods I did not go over in this post. These include methods for concatenation, indexing, extracting substrings, pattern matching and much more. I will save these methods for a future article. I hope you found this post interesting and/or useful. The code in this post is available on GitHub. Thank you for reading!

Netflix is a streaming media company headquartered in Los Gatos, California, that has permeated culture as the largest content media company to come out of tech. Founded in 1997, Netflix started out as a DVD rental service and then expanded into the streaming business. Today Netflix has over 150 million paid subscriptions worldwide, including its 60 million US users. With streaming supported on over a thousand devices and around 3 billion hours watched every month, data is collected on over 100 billion events per day.

Data science is in the DNA of Netflix and Netflix leverages data science in improving every aspect of the user experience. Netflix has over the years been leveraging data science for its content recommendation engine, to decide which movies and tv shows to produce and to improve user experience.

The Data Science Role at Netflix

The role of a data scientist at Netflix is heavily determined by the team. However, general data scientist roles at Netflix cut across business analytics, statistical modeling, machine learning, and deep learning implementation. Netflix is a large company that has data scientists working on over 30 different teams, including personalization and algorithms, the marketing analytics team, and the product research and tooling team, with skill sets ranging from basic analytics to heavy machine learning algorithms.

Netflix hires only qualified data scientists with at least five years of relevant experience. Their requirements are very specific, and recruiters are keen to hire specifically for each job role. It helps to have industry experience specific to the role on the team.

Other relevant qualifications include:

  • Advanced degree (MS or PhD) in Statistics, Econometrics, Computer Science, Physics, or a related quantitative field.
  • 5+ years of relevant experience with a proven track record of leveraging massive amounts of data to drive product innovation.
  • Experience with distributed analytic processing technologies (Spark, SQL, Pig, Presto, or Hive) and strong programming skills in Python, R, Java, or Scala.
  • Experience in building real-world machine learning models with demonstrated impact.
  • Deep statistical skills utilized in A/B testing, analyzing observational data, and modeling.
  • Experience in creating data products and dashboards in Tableau, R Shiny, or D3.

The term data science at Netflix encompasses a wide scope of fields and titles related to data science. The title data scientist comprises roles and functions that span from product analytics-focused data scientists to data engineering and machine learning functions.

  • Personalization Algorithms: Collaborate with product and engineering teams to evaluate the performance and optimize personalization algorithms used to suggest movies, TV shows, artwork, and trailers to Netflix members.
  • Member UI Data Science and Engineering: Leveraging custom machine learning models to optimize the user experience of the product for all subscribers.
  • Product Research and Tooling: Developing and implementing methods to advance experimentation at Netflix at scale. This involves developing data visualization frameworks, tools, and analytics applications that provide other teams with insights into member behavior and product performance.
  • Growth Data Science and Engineering: Focus on growing the subscriber base by building and designing highly scalable data pipelines and clean datasets around key business metrics.
  • Marketing Data Science Engineering: Creating reliable, distributed data pipelines and building intuitive data products that provide stakeholders with means of leveraging data across domains in a self-service manner for all non-technical teams.

The Interview Process

Figure from Netflix Data Science Blog

The data science interview process at Netflix is similar to other big tech companies. The interview process starts with an initial phone screen with a recruiter and then a short hiring manager screen before proceeding to a technical interview. After passing the technical screen, an onsite interview will be scheduled. This interview comprises two parts with 6 or 7 people.

The initial screen at Netflix is a 30-minute phone call with a recruiter. The recruiters at Netflix are highly specialized and very technical. Their job is to understand your resume and see if your past experience, projects, and skillset match up to the role. The second point of this part of the interview is to test your general communication skills and to explain the role and its background to you.

Next is the hiring manager interview. This one will focus more on past experience and dive into more of the technical portion of what you’ve done within data science and machine learning. While the recruiter gets a sense of your projects at a high level to fit with the team, the hiring manager will ask you more in-depth questions like why you used certain algorithms for a project or how you built different machine learning or analytics systems.

The hiring manager will also get to tell you more about the roles and responsibilities of the team. Note that Netflix is big on the culture and values, and you may be asked to pick a value and explain how best it suits you.

After passing the initial screening, the technical screen is the next step in the interview. This interview is usually 45 minutes long, and it involves technical questions that span SQL, experimentation and A/B testing, and machine learning.

Example Questions:

  • What do you know about A/B testing in the context of streaming?
  • What are the differences between L1 and L2 regularization, why don’t people use L0.5 regularization for instance?
  • What is the difference between online and batch gradient descent?
  • What is the best way to communicate ML results to stakeholders?

If you’re interested in more interview questions from Netflix, check out the Interview Query data science interview prep newsletter and course.

Onsite Interview

The onsite interview is the last stage in the interview process, and it comprises two rounds of interviews with a lunch break in between. If you're from out of state, Netflix will fly you out to Los Gatos or Los Angeles for the onsite, and you'll first meet with the recruiter to go over the interview.

It involves one-on-one interviews with 6 or 7 people, including data scientist team members, team managers, and a product manager. The Netflix onsite interview is a combination of product, machine learning, and various analytical concepts. This interview will comprise questions around product sense, statistics including A/B testing (hypothesis testing), SQL and Python coding, experimental and metric design, and culture fit. If the role is more focused on engineering, expect more machine learning and possibly deep learning interview questions.

  • Remember, the goal of the interview is to assess how you can apply analytical concepts and machine learning algorithms and models to predict value in users and content. Brush up on knowledge of statistics and probability, A/B testing and experimental design, and regression and classification modeling concepts.
  • Please, please, please remember to read the Netflix culture deck. Culture is everything at Netflix and they have created a unique and famous work culture that they have transcribed into a 100+ page slide deck online.
  • At its core, Netflix’s culture is about building a team of high performers and setting them up in an environment that enables them to excel. This is represented by a healthy amount of freedom & responsibility, strong context provided by managers with limited top-down control, and a compensation and promotion system that rewards A-players.
  • In offer negotiation, note that the compensation packages at Netflix are extremely high. Their average salaries for technical hires exceed $300,000 and are almost always paid in cash, with an option to convert some into RSUs. This is why their interviews are difficult and the bar to hire is super high.

Netflix Data Science Interview Questions

  • Write the equation for building a classifier using Logistic Regression.
  • Given a month’s worth of login data from Netflix such as account_id, device_id, and metadata concerning payments, how would you detect payment fraud?
  • How would you design an experiment for a new content recommendation model we’re thinking of rolling out? What metrics would matter?
  • Write SQL queries to find a time difference between two events.
  • How would you build and test a metric to compare two users' ranked lists of movie/TV show preferences?
  • How would you select a representative sample of search queries from five million?
  • Why is Rectified Linear Unit a good activation function?
  • If Netflix is looking to expand its presence in Asia, what are some factors that you can use to evaluate the size of the Asia market, and what can Netflix do to capture this market?
  • How would we approach to attribution modeling to measure marketing effectiveness?
  • How would you determine if the price of a Netflix subscription is truly the deciding factor for a consumer?

Stop making data scientists manage Kubernetes clusters Building models is hard enough

Disclaimer: The following is based on my observations of machine learning teams—not an academic survey of the industry. For context, I’m a contributor to Cortex, an open source platform for deploying models in production.

Production machine learning has an organizational problem, one that is a byproduct of its relative youth. While more mature fields—web development, for example—have developed best practices over decades, production machine learning hasn’t yet.

To illustrate, imagine you were tasked with growing a product engineering org for your startup, which develops a web app. Even if you had no experience building a team, you could find thousands of articles and books on how your engineering org should be structured and grown.

Now imagine you are at a startup that has dabbled with machine learning. You’ve hired a data scientist to lead the initial efforts, and the results have been good. As machine learning becomes more deeply embedded into your product, it becomes obvious that the machine learning team needs to grow, as the responsibilities of the data scientist have ballooned.

In this situation, there are not thousands of articles and books on how a production machine learning team should be structured.

This is not an uncommon scenario, and what frequently happens is that the new responsibilities of the machine learning org—infrastructure, in particular—get passed onto the data scientist(s).

This is a mistake.

The difference between machine learning and machine learning infrastructure

The difference between a platform and product engineer is pretty well understood at this point. Similarly, data analysts and data engineers are clearly differentiated roles.

Machine learning, at many companies, is still missing that specialization.

To see why the delineation between machine learning and machine learning infrastructure is important, it’s helpful to look at the work and tooling required for each.

To design and train new models, a data scientist is going to:

In other words, their responsibilities, skills, and tools are going to revolve around manipulating data to develop models, and their ultimate output will be models that deliver the most accurate predictions possible.

The infrastructure side is fundamentally different.

A common way to put a model into production is to deploy it to the cloud as a microservice. To deploy a model as a production API, an engineer is going to:

An easy way to visualize the difference in working on machine learning versus machine learning infrastructure is like this:

Machine learning vs. Machine learning infrastructure

Intuitively, it makes sense that a data scientist should handle the circle on the left, but not so much the circle on the right.

What’s wrong with having non-specialists manage infrastructure?

Let’s run this as a hypothetical. Say you had to assign someone to manage your machine learning infrastructure, but you didn’t want to dedicate someone full-time to it. Your only two options would be:

Both of these options have issues.

First, data scientists should spend as much time as possible doing what they’re best at—data science. While learning infrastructure certainly isn’t beyond them, both infrastructure and data science are full-time jobs, and splitting a data scientist’s time between them will reduce the quality of output in both roles.

Second, your organization needs someone dedicated specifically to machine learning infrastructure. Serving models in production is different than hosting a web app. You need someone specialized for the role, who can advocate for machine learning infrastructure within your org.

This advocacy piece turns out to be crucial. I get to see inside a lot of machine learning orgs, and you’d be surprised how often their bottlenecks stem not from technical challenges, but from organizational ones.

For instance, I’ve seen machine learning teams who need GPUs for inferencing—big models like GPT-2 basically require them for reasonable latency—but who can’t get them because their infrastructure is managed by the broader devops team, who don’t want to put the cost on their account.

Having someone dedicated to your machine learning infrastructure means you not only have a team member who is constantly improving your infrastructure, it means you have an advocate who can get your team what it needs.

Who should manage the infrastructure then?

Machine learning infrastructure engineers.

Now, before you disagree about the official title, let’s just acknowledge that it’s still early days for production machine learning and that it’s the wild west when it comes to titles. Different companies might call it:

We can already see mature machine learning organizations hiring for this role, including Spotify:

Source: Spotify

As well as Netflix:

Source: Netflix

As ML-powered features like Gmail’s Smart Compose, Uber’s ETA prediction, and Netflix’s content recommendation become ubiquitous in software, machine learning infrastructure is becoming more and more important.

If we want a future in which ML-powered software is truly commonplace, removing the infrastructure bottleneck is essential—and to do that, we need to treat it as a real specialization, and let data scientists focus on data science

How to Prepare for a Data Science Related Interview

In 2012, Harvard Business Review announced that data scientist would be the sexiest job of the 21st century. Since then, the hype around data science has only grown. Recent reports have shown that demand for data scientists far exceeds the supply.

However, entry-level data science can be really competitive because of these supply/demand dynamics. Data scientists can come from all kinds of fields, ranging from physics, maths, and statistics backgrounds to computer science, and some may see this as an opportunity to rebrand themselves, adding to the competition freshmen face when looking to land their first role.

This list is ordered by how time-consuming each task is (tasks that are easy to tick off are listed as high priority, and tasks that may take some time, like learning a new methodology in Python, are listed as low priority).

Photo by João Silas on Unsplash

I would highly recommend looking over the description of the position and trying your best to find out what you would be doing. The type of position will heavily influence what kind of questions you would be getting in your interview.

Will you be…

  • Designing and interpreting experiments to test variants of the product? Expect some questions regarding A/B testing, questions regarding which metrics would be best to optimize, and questions about how to best evaluate your experimental results.
  • Doing deep dives to understand more about how users use your product? Expect questions that test your ability to carry a data project from end-to-end, and to effectively and faithfully communicate your findings. Expect to discuss projects from previous experiences or your education and communicate what you were able to find and what you did.
  • Doing applied research on inference, prediction, or optimization problems? These positions are a lot more custom and may require a PhD. I recommend reading through the job description to see what they might be looking for, and studying up on academic techniques to solve some problems that the team you’re interviewing for may be facing.
  • Developing algorithms for a data product? For example Uber’s Surge Pricing feature or LinkedIn’s People You May Know feature. Depending on your specific role, you may be getting a traditional software engineering interview with a focus on processing large amounts of data, or be asked about your previous experience with solving large-scale, difficult, and custom data problems.

Of course, there are many more roles of a data scientist — so do your research on both the product and the role before you set foot in the interview room.

Photo by Quino Al on Unsplash

Ultimately the key question you should be asking yourself is — within my role at the company, what is the best way to best understand and improve the product and the business using data?

Tinker around with the product

If you can use the product, use it as much as possible before the interview. One type of data scientist is heavily involved in the process of making decisions to help improve the product and the features — and to understand the product quantitatively as much as possible.

Let your curiosity run free, and answer questions like —

how can this product be improved?

What kind of metrics would you define to measure its success?

How could this product monetize?

How could this product make more money?

How would you define engagement on this product?

What could be some friction points?

What are the key funnels or actions that you want your users to go through?

Show that you understand the system that as a data scientist you’re going to be working to improve.

Photo by Esteban Lopez on Unsplash

Familiarize yourself with the product, as it’s very easy to reveal your lack of preparation if you don’t have basic knowledge of the product you may be working on. Also, interviewers will likely ask you data-related questions about the projects they are working on.


After playing around with the product as much as you can — ask yourself the following questions:

  1. What are the aspects of the product that you really enjoy? What are your favorite features? Why do you think those features exist?
  2. What are aspects of the product that you don’t enjoy? Why don’t you enjoy them? Why might the product even have such a feature if there are people that don’t enjoy it?
  3. If you could suggest some new features for the product, what would you recommend? Is this something that is aimed at increasing growth, engagement, revenue, or brand value? Do you think that your recommended features would be high-ROI?
  4. What are the ways in which the company could use data to help improve the product, that it doesn’t seem to be doing already?

These questions will get you in the shoes of thinking about the product and various tradeoffs that are done in making product decisions. This gets you in the right mindset to answer some of the questions that you might get about the product and what the data scientists are working on in helping make it the best it can be.


Photo by Meghan Holmes on Unsplash

After playing around with the product, think about this: what are some of the key metrics that the product might want to optimize? Part of a data scientist’s role in certain companies involves working closely with the product teams to help define, measure, and report on these metrics. This is an exercise you can go through by yourself at home, and can really help during your interview process.

There’s a useful intro to many commonly-used metrics at 16 Ways to Measure Network Effects


If you’re interviewing with a consumer internet company, chances are that they do some sort of A/B testing to decide on feature launches. This is usually one thing that many candidates are unprepared for when they start looking for data science positions, mostly because many universities don’t offer too many statistics classes. Understanding experimental design, what A/B testing is, and how to interpret results statistically are extremely important if you’re interviewing with a company that does A/B testing.

Ronny Kohavi (head of experimentation at Microsoft) has useful answers on applied A/B testing.


Data science positions that include some basic software engineering in the role will feature a scaled-down version of a typical software engineering interview. Normal prep for software engineering interviews will help here, as often you’ll be expected to implement code that accomplishes a certain task on the whiteboard.

Data science positions that feature a heavy “analytics” component in the role may evaluate you on SQL. I think SQL is one of the most straightforward topics to prepare for, given its more limited scope and availability of preparation resources. I list some of my favorite resources at William Chen’s answer to What is the best way to learn SQL for data science?.

Docker Best Practices for Data Scientists

As a data scientist, I grapple with Docker on a daily basis. Creating images and spinning up containers have become as common as writing Python scripts for me. And this journey has had its achievements as well as its "I wish I knew that before" moments.

This article discusses some of the best practices for using Docker for your data science projects. By no means is this an exhaustive checklist, but it covers most things I've come across as a data scientist.

This article assumes basic-to-moderate knowledge of Docker. For example, you should know what Docker is used for and should be able to comfortably write a Dockerfile and understand Docker commands like RUN, CMD, etc. If not, have a read through this article from the official Docker site. You can also explore the collection of articles found there.

Why Docker?

Since Docker was released it has taken the world by storm. Before the era of Docker, virtual machines used to fill that void. But Docker offers so much more than virtual machines.

Advantages of docker

  • Isolation — isolated environment regardless of the changes in the underlying OS/infrastructure, installed software, updates

Primer on Docker

Docker has three important concepts.

Images — This is a set of runnable libraries and binaries that represents a development/production/testing environment. You can download/create an image in the following ways.

  • Pulling from an image registry: e.g. docker pull alpine. What happens here is that Docker will look locally on your computer for an image named alpine; if it's not found, it looks in Docker Hub.
  • Building it yourself from a Dockerfile: e.g. docker build <build-dir> -t <image>:<tag>, which is what we'll come back to when creating our own images.

Containers — This is a running instance of an image. You can stand up a container using the syntax docker container run <arguments> <image> <command>; for example, to create a container from the alpine image, use the docker container run -it alpine /bin/bash command.

Volumes — Volumes are used to permanently/temporarily store data (e.g. logs, downloaded data) for containers to use. Additionally, volumes can be shared among multiple containers. You can use volumes in a couple of ways.

  • Creating a volume: You can create a volume using the docker volume create <volume_name> command. Note that information/changes stored here will be lost if that volume is deleted.
  • Bind-mounting a directory from the host: e.g. with the -v <host-dir>:<container-dir> option of docker run, which we will use later during development.

1. Creating images

1. Keep the image small, avoid caching

Two common things you'd have to do when building images are:

  • Install Linux packages
  • Install Python libraries

When installing these packages and libraries, the package managers will cache data so that local data can be used if you want to install them again. But this increases the image size unnecessarily, and Docker images are supposed to be as lightweight as possible.

When installing Linux packages, remember to remove any cached data by adding the clean-up as the last line of your apt-get install command.

RUN apt-get update && apt-get install tini && \
 rm -rf /var/lib/apt/lists/*

When installing Python packages, to avoid caching, do the following.

RUN pip3 install <library-1> <library-2> --no-cache-dir

2. Separate out Python libraries to a requirements.txt

The last command you saw brings us to the next point. It is better to separate out the Python libraries into a requirements.txt file and install them from that file using the following syntax.

RUN pip3 install -r requirements.txt --no-cache-dir

This gives a nice separation of Dockerfile doing “Docker stuff” and not (explicitly) worrying about “Python stuff”. Additionally, if you have multiple Dockerfiles (e.g. for production / development / testing) and they all want the same libraries installed, you can reuse this command easily. The requirements.txt file is just a bunch of library names.

numpy==1.18.0
scikit-learn==0.20.2
pandas==0.25.0

3. Fixing library versions

Note how in the requirements.txt I am freezing the version I want to install. This is very important. Because otherwise, every time you build your Docker image, you might be installing different versions of different things. “Dependency Hell” is real.

2. Running containers

1. Embrace the non-root user

When you run containers, if you don't specify a user to run as, Docker is going to assume the root user. I'm not going to lie, my naive self used to love having the ability to use sudo or being root to get things my way (especially to get around permissions). But if I've learnt one thing, it's that having more privileges than needed is a catalyst for trouble, leading to even more problems.

To run a container as a non-root user, simply do

  • docker run -it -u <user-id>:<group-id> <image-name> <command>

Or, if you want to jump into an existing container do,

  • docker exec -it -u <user-id>:<group-id> <container-id> <command>

For example, you can match the user id and group id of the host by assigning <user-id> as $(id -u) and <group-id> as $(id -g) .

Beware of how different operating systems assign user IDs and group IDs. For example your user ID/group ID on a MacOS might be a pre-assigned/reserved user ID / group ID inside an Ubuntu container.

2. Creating a non-privileged user

It is great that we can log in as a non-root user to our host-away-from-host. But if you log in like this, you're a user without a username, because, obviously, the container has no clue where that user ID came from. And you need to remember and type this user ID and group ID every time you want to spin up a container or exec into one. So, you can include the user/group creation as a part of the Dockerfile.

  • First add ARG UID=1000 and ARG GID=1000 to the Dockerfile. UID and GID here are build arguments with a default value of 1000; you'll pass the actual values at the docker build stage.

ARG UID=1000
ARG GID=1000

Then during image build, you can pass values for these arguments like,

  • docker build <build_dir> -t <image>:<image_tag> --build-arg UID=<uid-value> --build-arg GID=<gid-value>

For example,

  • docker build . -t docker-tut:latest --build-arg UID=$(id -u) --build-arg GID=$(id -g)

Having a non-privileged user helps you to run processes that should not have root permissions. For example, why run your Python script as root when all it does is read from a directory (e.g. data) and write to one (e.g. models)? And as an added benefit, if you match the user ID and group ID of the host within the container, all the files you create will have your host user's ownership. So if you bind-mount these files (or create new files) they will still look like you created them on the host.

3. Creating volumes

1. Separate artifacts using volumes

As a data scientist, obviously you’ll be working with various artifacts (e.g. data, models and code). You can have the code in one volume (e.g. /app ) and data in another (e.g. /data ). This will provide a nice structure for your Docker image as well as get rid of any host-level artifact dependencies.

What did I mean by artifact dependencies? Say you have the code at /home/<user>/code/src and the data at /home/<user>/code/data. If you copy/mount /home/<user>/code/src to the volume /app and /home/<user>/code/data to the volume /data, it doesn't matter if the location of the code and data changes on the host. They will always be available at the same location inside the Docker container as long as you mount those artifacts. So you can fix those paths nicely in your Python script as follows.

data_dir = "/data"
model_dir = "/models"
src_dir = "/app"

You can COPY the necessary code and data into the image using

COPY test-data /data
COPY test-code /app

Note that test-data and test-code are directories on the host.

2. Bind-mount directories during development

The great thing about bind-mounting is that whatever you do in the container is reflected on the host itself. This is great when you're doing development and you want to debug your project. Let's see this through an example.

Say you created your docker image by running:

docker build <build-dir> -t <image-name>:<image-version>

Now you can stand up a container from this image using:

docker run -it -v /home/<user>/my_code:/code <image-name>:<image-version>

Now you can run the code within the container and debug at the same time, and the changes to the code will be reflected on the host. This loops back to the benefit of using the same host user ID and group ID in your container: all the changes you make look like they came from the user on the host.

3. NEVER bind-mount critical directories of the host

Funny story! I once mounted the home directory of my machine to a Docker container and managed to change the permissions of the home directory. Needless to say, I was unable to log into the system afterwards and spent a good couple of hours fixing it. Therefore, mount only what is needed.

For example, say you have three directories that you want to mount during development:

  • /home/<user>/my_data
  • /home/<user>/my_code
  • /home/<user>/my_models

You might be very tempted to mount /home/<user> with a single line. But it is definitely worth writing three lines to mount these individual subdirectories separately (as shown below), as it will save you several painstaking hours (if not days) of your life.
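A sketch of what that looks like (the container paths follow the /data, /app and /models convention from earlier; adjust the directory names to your own project):

docker run -it \
    -v /home/<user>/my_data:/data \
    -v /home/<user>/my_code:/app \
    -v /home/<user>/my_models:/models \
    <image-name>:<image-version>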

Additional tips

1. Know the difference between ADD and COPY

You probably know that there are two Docker commands called ADD and COPY . What’s the difference?

  • ADD can be used to download files from URLs when used like ADD <url> <dest> , and it can also automatically extract local tar archives into the destination.
  • COPY only copies local files and directories into the image, which makes its behaviour more predictable. Prefer COPY unless you specifically need ADD's extra features.

2. Difference between ENTRYPOINT and CMD

A great analogy that comes to my mind is: think of ENTRYPOINT as a vehicle and CMD as the controls in that vehicle (e.g. accelerator, brakes, steering wheel). ENTRYPOINT by itself does nothing; it is just a vessel for what you want to do within that container, standing by for any incoming commands you push to the container.

CMD is what actually gets executed within the container (when an ENTRYPOINT is set, CMD supplies its default arguments). For example, bash as a command would create a shell in your container so you could work within the container like you work on a normal terminal on Ubuntu.
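As a small sketch of how the two interact (train.py and the --epochs flag are placeholders of mine, not from this article):

ENTRYPOINT ["python", "train.py"]
CMD ["--epochs", "10"]

With these two lines, docker run <image> executes python train.py --epochs 10, while docker run <image> --epochs 50 keeps the same ENTRYPOINT and only overrides the CMD part.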

3. Copying files to existing containers

Not again! I’ve created this container and forgot to add this file to the image. It takes so long to build the image. Is there any way I could cheat and add this to the existing container?

Yes, there is: you can use the docker cp command. Simply do,

docker cp <src> <container>:<dest>

Next time you jump into the container you will see the copied file at <dest> . But remember to actually change the Dockerfile to copy the necessary files at build time.

3. Conclusion

Great! That’s all folks. We discussed,

  • What Docker images / containers / volumes are?
  • How to run containers as a non-root user (and create that user in the Dockerfile)
  • How to organize your data, code and models with volumes and bind mounts
  • A few extra tips: ADD vs COPY, ENTRYPOINT vs CMD, and docker cp

Now you should have grown your confidence to look Docker in the eyes and say, "You can't scare me". Jokes aside, it always pays off to know what you're doing with Docker, because if you're not careful you can bring down a whole server and disrupt the work of everyone else who's working on that same machine.

In the past, I've gone beyond praise for the amazing multi-paradigm statistical language, Julia. This is for good reason, because Julia is an incredibly high-level language that stretches the limits of what we thought could come from such high-level syntax. Additionally, in terms of being ideal for machine learning, from how much I've used the big four, Julia is perfectly able to out-perform Python, R, and even Scala. I even briefly joked about getting a Julia tattoo on my head, going as far as to create a semi-realistic rendering of that exact thing:

But with all of this flaunting of one of my absolute favorite programming languages of all time, I’ve much neglected one of the features that makes Julia even better:

Pkg is Julia's package manager, and it is not a typical package manager like the ones you would find in Python or R. In fact, for the most part, I firmly believe that Julia's package manager outmatches every Linux package manager I've ever used, including Dnf, Apt, and likely the most robust: Pacman. There are a few key benefits that make managing packages in Julia a breeze with the Pkg package manager, and it's likely you'll agree once you've tried it.


REPL and Package

A massive benefit of Pkg is that it has its own read-evaluate-print loop, or REPL. This is beneficial for a few reasons, the first of which is that it can add packages quicker than the library equivalent, because it makes it possible to skip recompiling a stale cache file entirely, simply by pressing a key:

]

Additionally, this allows for non-syntactic package adding, making it super quick and easy to add a package with simply a command and a space:

add "Lathe"

In addition to having a REPL, Julia's Pkg also comes in the form of a Julia package itself. This is useful because it means that you can add packages not only from within the REPL, but also from code itself. This was particularly useful when I created TopLoader, a Python package that allows you to create a virtual Pkg environment and then use the packages added to that environment.
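For instance, a minimal sketch of adding a package programmatically through the Pkg API (Lathe is just the package from the earlier example):

using Pkg
Pkg.add("Lathe")    # same effect as add "Lathe" in the Pkg REPL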

Github Integration


Julia's package manager is built entirely on top of Github, where packages are registered by pushing them to the Registrator.jl repository, which you can take a gander at here: https://github.com/JuliaRegistries/Registrator.jl


Not only does this make it incredibly easy to publish your Julia packages through Github, it also allows you to add unregistered packages with Git URLs. This means that unpublished software, which is usually stored in a Git-compatible repository, is always usable. Julia's packages use Github as a foundation for development, and that is certainly not a bad idea, as even non-developers are well aware that Open-Source development essentially revolves around Github.
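A quick sketch of what that looks like in the Pkg REPL (the repository URL is a made-up placeholder):

pkg> add https://github.com/SomeUser/SomePackage.jl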

Virtual Environments

There are a lot of questions in life that have yet to be answered, but the age-old question of which package manager to use has certainly already been answered. Compared to Pip, Package get, Quick Lisp, and the thousands of other package managers that are also called "Pkg," Julia's Pkg once again outshines the competition. From the get-go, out of the box, you can activate a virtual environment with two simple commands from the REPL, or a method from the package.

using Pkg
Pkg.activate("venv")

julia> ]
pkg> activate "venv"

This makes managing and utilizing these virtual environments shockingly easy. It's hard to justify this methodology of activating virtual environments without comparing it to a similar option from one of Julia's competitors, Python. The package manager for Python is Pip, and though certainly not quite as robust as Pkg, I think Pip stands the test of time and gets the job done in Python. But to activate a virtual environment alongside Pip, you need an entirely separate tool, such as virtualenv or the built-in venv module (there are a few different options).

sudo pip3 install virtualenv
python3 -m virtualenv venvname

This is certainly not terrible, but on the other hand, it is certainly not nearly as robust and simple as the Julia equivalent. Coming back to my Python package for loading Julia modules in a virtual environment: development of that package was incredibly easy because I was able to use the activate() method.

Conclusion

I love Julia's Pkg, and I tip my hat to the developers who put in the hard work to develop it, only to make the Julia programming experience that much better. I think many other package managers have a lot to learn from Pkg, because to me it has always felt like the perfect package manager. I encourage you to try it out, as it feels like a breath of fresh air in a world filled with CLI package managers. Overall, I think that Pkg.jl contributes a lot to Julia, and makes the language, which is already hard to beat, even more enjoyable to use.

NLP learning journey

As the title says, I will share with you the amazing learning journey that started one year ago, when I did my graduation project in the NLP field. Before starting that project, my knowledge of natural language processing (NLP) wasn't good enough.

“Never regret your past.
Rather, embrace it as the teacher that it is” — Robin Sharma.


Introduction


Often when performing analysis, much of the data is numerical, such as sales numbers, physical measurements, or quantifiable categories. Computers are very good at handling direct numerical information. However, what do we do about text data?

As humans, we can tell there is a plethora of information inside of text documents. But a computer needs specialized processing techniques to understand raw text data. As we know, text data is highly unstructured and can be in multiple languages!

That’s why NLP attempts to use a variety of techniques to create structure out of text data.

In this article, I want to give an overview of some of the topics I had to learn along the way. I know that many other posts cover the same material, but writing about my own learning journey has helped me structure the things that I know.

Table of contents

  1. NLP in industry
  2. Text Preprocessing
  3. Computational Linguistics and Word Embedding
  4. Fundamentals of Deep Learning
  5. Deep Learning for NLP

NLP in industry

Text data is everywhere, which is why there is such a wide array of applications that natural language processing is responsible for.
In this section, I list some of these applications:

  • NLP enables some useful functions like auto-correct, grammar, and spell check, as well as auto-complete.
  • Extract and summarize information: NLP can extract and synthesize information from a variety of text sources.
  • Sentiment analysis (movie, book & product reviews).
  • Chatbots: algorithms that use natural language processing to understand your query and respond to your questions adequately, automatically, and in real time.
  • Automated translation is a huge application for NLP that allows us to overcome barriers to communicating with individuals from around the world as well as understand tech manuals written in a foreign language.

You can check my previous articles here where I have done projects related to NLP.

Text Preprocessing

This section is about getting familiar and comfortable with the basic text preprocessing techniques.

Load Text data from multiple sources:

In this part, I will show you how to open text files from different sources, like CSV files, PDF files, etc.

Regular Expressions:

Regular Expressions (sometimes called regex) allow a user to search for strings using almost any sort of rule: for example, finding all capital letters in a string, or finding a phone number in a document. Regular expressions can match virtually any string pattern you can imagine. They are handled using Python's built-in re library. See the docs for more information.

Much of the industry still uses regular expressions to solve problems, so we can't ignore their importance; they are normally the default way of cleaning data. Knowing their applicability, it makes sense to learn them and use them appropriately.

Let's see a code example of working with re.
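Here is a minimal sketch of my own (standing in for the original embedded gist): finding capital letters and a phone-number-like pattern.

import re

text = "Call Amal at 123-456-7890 or email NLP@Example.com"

# All capital letters in the string
capitals = re.findall(r"[A-Z]", text)
print(capitals)         # ['C', 'A', 'N', 'L', 'P', 'E']

# A simple US-style phone number pattern
phone = re.search(r"\d{3}-\d{3}-\d{4}", text)
print(phone.group())    # 123-456-7890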

Spacy vs NLTK:

I will briefly introduce nltk and spacy, both state-of-the-art libraries in NLP, and the difference between them.

Spacy: an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.). It is designed to handle NLP tasks with the most efficient implementation of common algorithms.

NLTK (Natural Language Toolkit): a very popular open-source library. Initially released in 2001, it is much older than Spacy (released 2015). It also provides many functionalities, but includes less efficient implementations.

→ For many common NLP tasks, Spacy is much faster and more efficient, at the cost of the user not being able to choose algorithmic implementations. However, Spacy does not include pre-created models for some applications, such as sentiment analysis, which is typically easier to perform with NLTK.

Spacy installation and setup: installation is a two-step process. First, install SpaCy using either conda or pip. Next, download the specific model you want, based on language. For more detail, visit this link.

From the command line or terminal:

conda install -c conda-forge spacy
or
pip install -U spacy

Next, python -m spacy download en_core_web_sm

Tokenization:

The first step in processing text is to split up all the parts (words & punctuation) into "tokens". These tokens are annotated inside the Doc object to contain descriptive information. As with sentence segmentation, punctuation marks can be challenging. As an example, "U.K." should be considered one token, while "we're" should be split into two tokens: "we" and "'re". Luckily for us, SpaCy will isolate punctuation that does not form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own tokens. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token. Here is an example; you can run it and see the result.
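A small sketch of SpaCy tokenization (the sample sentence is mine):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"We're moving to the U.K.! Email me at hello@example.com")

# "We're" gets split into two tokens, while "U.K." and the email address stay whole
print([token.text for token in doc])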

Named Entities:

Going a step beyond tokens, named entities add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the ents attribute of a Doc.

Stemming:

Stemming is a crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached. This works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. SpaCy doesn't include a stemmer, opting instead to rely entirely on lemmatization; we discuss the virtues of lemmatization in the next part. For stemming, we'll use another NLP library called nltk.
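A quick sketch using NLTK's Porter stemmer (the word list is mine):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['run', 'runner', 'running', 'ran', 'easily', 'fairly']

# The stemmer simply chops off endings; note that 'easili' and 'fairli' are not dictionary words
for word in words:
    print(word, '-->', stemmer.stem(word))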

Lemmatization:

In contrast to stemming, lemmatization looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words. The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’. Further, the lemma of ‘meeting’ might be ‘meet’ or ‘meeting’ depending on its use in a sentence.

Lemmatization is more informative than simple stemming, which is why Spacy has opted to offer lemmatization instead of stemming. We should point out that although lemmatization looks at the surrounding text to determine a given word's part of speech, it does not categorize phrases.

import spacy

nlp = spacy.load('en_core_web_sm')
text = nlp(u"I am a runner always love running because I like to run since I ran everyday back to my childhood")

for token in text:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

#### ----> In the above sentence, running, run and ran all point to the same lemma run. ####

Stop Words:

Stop words are words like "a" and "the" that appear so frequently that they don't require tagging as thoroughly as nouns, verbs, and modifiers. SpaCy holds a built-in list of some 305 English stop words.

After that, we can bring all these techniques together, and add more if we need to (correcting spelling, removing extra spaces or punctuation, and so on), to build a text normalizer and pre-process our text data. A sketch of such a normalizer is shown below.
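This is a minimal sketch of my own arrangement with SpaCy; you may want different filters depending on the task:

import spacy

nlp = spacy.load('en_core_web_sm')

def normalize(text):
    # Lowercased lemmas with stop words, punctuation and extra spaces removed
    doc = nlp(text)
    tokens = [token.lemma_.lower() for token in doc
              if not token.is_stop and not token.is_punct and not token.is_space]
    return ' '.join(tokens)

print(normalize(u"The striped bats are hanging on their feet, for best!"))
# roughly: 'striped bat hang foot good'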

“Data cleaning and preparation is the crucial step of Data science process”


Computational Linguistics and Word Embedding

In this section, I will cover several topics, such as:

  1. Extract linguistic features
  2. Text representation in Vector space
  3. Topic Modeling

Extract linguistic features

  • Part of Speech Tagging using Spacy:

Some words that look completely different mean almost the same thing, and the same words in a different order can mean something completely different. That is why we need to look at the part of speech a word is being used as, not only at the word itself. And that's exactly what Spacy is designed to do: you put in raw text and get back a Doc object that comes with a variety of annotations.

POS tags include coarse tags such as noun, verb and adjective, and fine-grained tags like plural noun, past-tense verb and superlative adjective.

Recall that you can obtain a particular token by its index position.

  • To view the coarse POS tag use token.pos_
  • To view the fine-grained tag use token.tag_
  • To view the description of either type of tag use spacy.explain(tag)

Note that `token.pos` and `token.tag` return integer hash values; by adding the underscores we get the text equivalent, as in the sketch below. For more details and information, you can check here.
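A small sketch of viewing coarse and fine-grained tags (the sentence is mine):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The quick brown fox jumped over the lazy dog")

token = doc[3]                              # 'fox', obtained by its index position
print(token.text, token.pos_, token.tag_)   # e.g. fox NOUN NN
print(spacy.explain('NN'))                  # noun, singular or mass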


Visualizing POS: SpaCy offers an outstanding visualizer called displaCy .

# Import the displaCy library
from spacy import displacy

# Render the dependency parse immediately inside Jupyter:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

POS doc visualization
  • NER using Spacy:

Named Entity Recognition (NER) seeks to locate and classify named-entity mentions in unstructured text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. Spacy has a "ner" pipeline component that identifies token spans fitting a predetermined set of named entities.
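Here is a small sketch that would produce the result shown below (the example sentence is my reconstruction, and the exact labels can vary between model versions):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"I am flying to Spain next April to visit the Alhambra Palace")

for ent in doc.ents:
    print(ent.text, '-', ent.label_, '-', spacy.explain(ent.label_))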

Result:

Spain - GPE - Countries, cities, states
next April - DATE - Absolute or relative dates or periods
the Alhambra Palace - FAC - Buildings, airports, highways, bridges, etc.

For more on Named Entity Recognition visit this link.

Text representation in Vector space

  • Discrete Representation of Words:

Suppose we have a language with a vocabulary of 10 words.

V = {cat, car, ship, city, man, laptop, word, woman, door, computer}

I will represent the words “laptop” & “car” as the following:

As we can see, all the entries are 0, except at the index of the word "laptop", where the value is 1; the same holds for the word "car". We call those vectors the "one-hot" representations of the 10 words (see the sketch below).
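A minimal sketch of building those one-hot vectors for this vocabulary:

vocab = ['cat', 'car', 'ship', 'city', 'man', 'laptop', 'word', 'woman', 'door', 'computer']

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot('laptop'))   # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(one_hot('car'))      # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]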

The problem with discrete representations is that they may not capture domain-specific semantics. In other words, we can't capture semantic similarity and associations in the representations. For example, "Home" and "Residence" are similar and so should be close to each other, but in an English dictionary organized in alphabetical order they are far away.

  • Neural Word Embedding as distributed representations:

A word is represented by means of its neighbors: we build a dense vector for each word, for example hello = [0.398, -0.678, 0.112, 0.825, -0.566]. We call these word embeddings or word representations.

Word2vec Model is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors. The purpose and usefulness of Word2vec are to group the vectors of similar words in vector space. Given enough data, usage, and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy”, “woman” is to “girl”). It can predict surrounding words given a center word and vice-versa.

It does so in either one of two ways, using context to predict a target word (a method known as the continuous bag of words, or CBOW) or using a word to predict a target context, which is called skip-gram.
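A quick sketch of training such a model with gensim (assuming gensim >= 4.0; the toy corpus is mine):

from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens
sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'dog', 'sat', 'on', 'the', 'rug'],
             ['a', 'man', 'drives', 'a', 'car']]

# sg=0 -> CBOW (context predicts the word); sg=1 -> skip-gram (word predicts its context)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv['cat'])                       # the dense vector for 'cat'
print(model.wv.most_similar('cat', topn=3))  # nearest words in the embedding space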

Topic Modeling:

Topic Modeling allows us to analyze large volumes of text by clustering documents into topics. We will begin by examining how Latent Dirichlet Allocation can attempt to discover topics for a corpus of documents.

  • Latent Dirichlet Allocation (LDA):

LDA was introduced back in 2003 to tackle the problem of modeling text corpora and collections of discrete data. I will explain LDA with an example. Suppose you have the following set of sentences:
– "Sugar is bad to consume. My sister likes to have sugar, but not my father."
– "Health experts say that sugar is not good for your lifestyle."
– "My father spends a lot of time driving my sister around to dance practice."
– "My father always drives my sister to school."
– "Doctors suggest that driving may cause increased stress and blood pressure."
LDA is a way of automatically discovering the topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like:
Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 20% Topic A, 80% Topic B
Topic A: we can interpret it to be about sugar and health.
Topic B: we can interpret it to be about driving.
A sketch of fitting LDA on these sentences is shown below.
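This is a minimal sketch of mine, using scikit-learn, of fitting a 2-topic LDA model on those five sentences:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["Sugar is bad to consume. My sister likes to have sugar, but not my father.",
        "Health experts say that Sugar is not good for your lifestyle.",
        "My father spends a lot of time driving my sister around to dance practice.",
        "My father always drives my sister to school.",
        "Doctors suggest that driving may cause increased stress and blood pressure."]

# Bag-of-words counts, then a 2-topic LDA model
counts = CountVectorizer(stop_words='english').fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Per-document topic mixture (each row sums to 1)
print(lda.transform(counts))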

Fundamentals of Deep Learning

Deep learning is at the heart of recent developments in NLP. From Google’s BERT to OpenAI’s GPT-2, every NLP enthusiast should at least have a basic understanding of how deep learning works to power these state-of-the-art NLP frameworks.

Before we launch straight into neural networks, we need to understand the individual components first, such as a single “neuron”.
Artificial Neural Networks (ANNs) have a basis in biology! Let's see how we can attempt to mimic biological neurons with an artificial neuron, known as a perceptron!

(image source: https://www.datacamp.com/community/tutorials/deep-learning-python)

The artificial neuron also has inputs and outputs! This simple model is known as a perceptron.
The inputs are the values of the features; each one is multiplied by a weight, and the weights initially start out randomly generated.
These results are then passed to an activation function (there are many activation functions to choose from; we'll cover this in more detail later). For now, our activation function will be very simple: if the sum of the inputs is positive, return 1; if the sum is negative, output 0.
There is a possible issue, though: what if the original inputs were all zero? Then any weight multiplied by the input would still result in zero! We can fix this by adding a bias term, in this case 1.
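A minimal sketch of that perceptron in plain Python (the inputs and weights are made up):

import random

def perceptron(inputs, weights, bias=1.0):
    # Weighted sum of the inputs plus the bias, passed through a simple step activation
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

inputs = [0.5, -1.2, 3.0]
weights = [random.uniform(-1, 1) for _ in inputs]   # weights start out random
print(perceptron(inputs, weights))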

Introduction to neural network:

We’ve seen how a single perceptron behaves, now let’s expand this concept to the idea of a neural network!!
Let’s see how to connect many perceptrons.


Multiple Perceptron Network:
– The input layer (real values from data).
– 2 hidden layers (layers in between input and output; 3 or more layers makes a "deep network")
– 1 output layer (final estimate of the output)
About the previous activation function: it is a pretty drastic function, since small changes in the input aren't reflected in the output. Therefore, it would be nice if we could have a more dynamic function.
Let’s discuss a few more activation functions:

(image source: https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092)

Different Activation Functions and their Graphs

ReLU tends to have the best performance in many situations thanks to its simplicity and cheap computation in both the forward and backward passes. But in certain cases other activation functions give us better results; for example, sigmoid is used at the final layer when we want our outputs squashed between [0, 1].
DL libraries have these functions built-in for us, so we don’t need to worry about having to implement them manually!!
Now that we understand the basics of neural network theory, we can move to more advanced topics.
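Before moving on, here is a tiny sketch of two of these activation functions in NumPy (my own, just to make them concrete):

import numpy as np

def relu(x):
    # max(0, x), element-wise
    return np.maximum(0, x)

def sigmoid(x):
    # squashes values into (0, 1)
    return 1 / (1 + np.exp(-x))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))      # [0. 0. 3.]
print(sigmoid(z))   # roughly [0.119 0.5 0.953]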

Deep Learning for NLP

After that brief overview of deep learning basics, it's time to take things up a notch and dive into advanced deep learning concepts like Recurrent Neural Networks (RNNs), Long Short Term Memory (LSTM), etc. These will help us gain mastery of industry-grade NLP use cases.
Recurrent Neural Networks are specifically designed to work with sequence data.
Let's imagine the sequence [1,2,3,4,5,6]. The question you could ask is: would you be able to predict a similar sequence shifted one time step into the future, like [1,2,3,4,5,6,7]?

With an RNN, we can do exactly that.

(image source: https://www.oreilly.com/ideas/build-a-recurrent-neural-network-using-apache-mxnet)

FFN vs RNN

RNN intuition:

  • Sequence to sequence: training models to convert sequences from one domain (e.g. sentences in English) to sequences in another domain (e.g. the same sentences translated to French). We can use it for machine translation or question-answering tasks.
  • Sequence to vector: for example sentiment scoring; you can feed in a sequence of words (say, a paragraph of a movie review) and get back a vector indicating whether the sentiment is positive, i.e. whether they liked the movie or not.
  • Vector to sequence: could be just providing a single seed word and then getting out an entire high-probability sequence of phrases (e.g. seed word = "Hello", generated sequence = "How are you").

RNN Architecture

Now that we understand basic RNNs, we'll move on to a particular cell structure known as the LSTM (Long Short Term Memory) unit. It is going to be necessary for generating text that makes sense, because we want the network to be aware of not just the most recent text but the entire history of text it has seen.

Long Short Term Memory (LSTM):
An issue RNNs face is that after a while the network begins to "forget" the first inputs, as information is lost at each step going through the RNN. Therefore, we need some sort of "long-term memory" for our networks.
The LSTM cell helps address these RNN issues.
Let's go through how an LSTM cell works! Keep in mind, there will be a lot of math here! Check out the resource link for a full breakdown!

LSTM cell

The entire LSTM cell can seem complex when presented in this format; however, it's not so bad when you break it down into parts.

In this diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

GRU: another variation of the LSTM, introduced more recently (around 2014). It simplifies things a bit by combining the forget and input gates into a single "update gate". It also merges the cell state and hidden state, and makes a few other changes.
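To make this concrete, here is a minimal sketch (assuming TensorFlow/Keras; the vocabulary size and layer sizes are arbitrary choices of mine) of an LSTM-based model that predicts the next word from a sequence of word indices:

import tensorflow as tf

vocab_size = 5000      # size of the word index (arbitrary here)
seq_length = 25        # number of tokens fed in per example

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),                  # dense word vectors
    tf.keras.layers.LSTM(128),                                  # the cell state carries the long-term memory
    tf.keras.layers.Dense(vocab_size, activation='softmax'),    # probability distribution over the next word
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.build(input_shape=(None, seq_length))
model.summary()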

What’s Next?

In this article, I wanted to start from zero, covering the basics and some more advanced techniques in NLP; I couldn't jump to advanced topics without passing through the basic concepts first. Next time, we will discover modern NLP techniques, like transfer learning, the reason for the rapid progress in NLP. So stay tuned!!!

Conclusion

“Let everything you do be done as if it makes a difference.” — William James

This was one of my longer articles; if you are reading this, I want to give you a big clap for staying with me until the end. This article covered my learning journey in the NLP field with some examples, which should give you a good idea of how to start working with a corpus of text documents.
As I said, I will try to cover transfer learning techniques in a future article. Meanwhile, feel free to use the comments section below to let me know your thoughts or ask any questions you might have about this article.