# Hands-on intro to Natural Language Processing (NLP)

This article discusses three techniques that practitioners can use to start working with natural language processing (NLP). It should also help readers who simply want a sense of what NLP is about; if you are an expert, please feel free to connect, comment, or suggest. At erreVol, we leverage similar tools to extract useful insights from transcripts of earnings reports of public corporations, and the interested reader can go test the platform.

Note that we include lines of code for readers interested in replicating or reusing what is presented below. Otherwise, please feel free to skip those technical lines; the article should still read seamlessly.

We will reference a classification problem in which we have to train a model to recognize the sentiment of a text. Our training samples will have labels “0”, “1” and “2”, representing respectively negative, neutral, and positive sentiment. However, we are not interested in evaluating a model here; we focus on the processing of the data, that is, on how to let the model interpret our text, arguably the most important phase. Here are the three processing solutions we will deal with:

1. TFIDF, the sklearn frequency-based transformation: while based on frequencies of words, this statistical count is not a trivial one; it weighs how often a term appears in a given text against how common that term is throughout the entire corpus (all the sample texts), so that terms frequent in one text but rare elsewhere score highest. The interested reader can refer to the Wikipedia article on TF-IDF.

2. w2v, a Gensim word2vec model to be trained “in-house” with our own text: this technique translates a word into an N-dimensional array (we can choose N). Each dimension will not necessarily represent something we can easily understand; it is more of a mathematical representation than an intuitive one. The technique is based on the principle that, after the transformation, similar or related terms will occupy similar regions of the N-space; moreover, the vector space would allow for operations in the form of vector[“person”] + vector[“male”/“female”] + vector[“crown”] -> vector[“king”/“queen”].

3. w2v, a Gensim word2vec pre-trained model: this is similar to the one right above, but we will see it has critical differences. We leverage here an N-dimensional space made available through Gensim (open-source library) and obtained by training a model with text from Wikipedia (we chose the 300-dimensional version for coherence with the previous point). If readers deem news from papers or even text from Twitter better candidates for representing their domains, Gensim also offers models based on those sources: see Gensim’s GitHub page.
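The vector arithmetic mentioned for the w2v techniques can be illustrated with a toy sketch. The 3-dimensional vectors below are hypothetical and hand-picked so that the sum lands exactly on the target word; real trained embeddings only satisfy such relations approximately. The helper names `cosine` and `nearest_word` are ours, not part of any library.

```python
import numpy as np

# hypothetical 3-dimensional embeddings, hand-picked so the arithmetic works exactly
toy_vectors = {
    "person": np.array([1.0, 0.0, 0.0]),
    "female": np.array([0.0, 1.0, 0.0]),
    "male": np.array([0.0, -1.0, 0.0]),
    "crown": np.array([0.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "king": np.array([1.0, -1.0, 1.0]),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_word(target, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to target."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

inputs = ("person", "female", "crown")
combined = sum(toy_vectors[w] for w in inputs)  # dimension-by-dimension sum
closest = nearest_word(combined, toy_vectors, exclude=inputs)
```

With actual Gensim vectors, a comparable lookup is available through the `most_similar(positive=[...])` method of `KeyedVectors`.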

The first two methods listed above do not handle never-seen data: if, after training, we present the model with text containing words never seen during the training phase, those words will not be mapped. As bad as this may sound, a properly trained model (with a proper training set) will be able to characterize new text fairly well, since occasional never-seen words are unlikely to statistically characterize a text. In our experience, the dynamics of text within a specific domain are often determined by surprisingly simple and common words. The relationships to extract among those simple terms may not be that simple to capture, though, and that is arguably where the difference is made. That is also why training on text from the specific domain we will be involved with could be more important than the “popularity” and the size of the corpus or dataset.

Although we will build the case as if all three processing techniques above fed a final Keras neural network, we suggest that practitioners start from more straightforward models, such as logistic regression. Simplicity allows for a better understanding of the effectiveness of the processing, probably the most important activity toward the overall result.

With “corpus” we will refer to the entire dataset (e.g. the whole available training set of text-label pairs). With “text” we will refer to each sample within the corpus, that is, each text-label sample. In our specific case, for the sake of simplicity, each text consists of a single sentence.

In general, the overall process would involve the following three phases:

PHASE 1. The first phase is about importing and tokenizing raw text, because we cannot work with raw strings. To do that we leverage the spaCy module (open-source library), which provides pre-trained models able to analyze, characterize, and tokenize text (we show in the picture below what that means). Here are a few possible lines of code for the interested reader:

    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split

    nlp = spacy.load("en_core_web_sm")  # assumption: small English pipeline; any installed spaCy model works

    def spacy_tokenizer(sentence):  # any text applied to this function gets tokenized
        mytokens = nlp(sentence)
        # here we can filter and further transform our text, if we wanted to
        mytokens = [token.text for token in mytokens]
        return mytokens

    # df_imported_text: DataFrame with the imported corpus
    X = df_imported_text['text']  # 'text' column with the text to classify
    Y = df_imported_text['label']  # 'label' column with the label to train the model
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)  # train/test split
    tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)  # the vectorizer will call the tokenizer

Starting from text like the following:

We would obtain tokenized sentences in the following form (only two texts because of the 70/30% training/testing split):

Although it may seem we are just placing words inside an array, the underlying spaCy model has also analyzed the relationships among them and is now able to identify pronouns, verbs, etc. All of this is useful information for further processing that is not the object of this article, such as eliminating stop-words; it is common practice and easily found on the web if interested.

PHASE 2. Now that we have tokenized our text, the second phase is to apply one of the three processing techniques listed above, which constitute the focus of this article. We will detail this immediately below; let us first cover phase 3 in a few lines.

PHASE 3. The third and last phase is to feed the result of the previous step to a Keras neural network. This may be a common Dense Keras neural network or another type, like an LSTM network (Long Short-Term Memory), able to process information about words’ positioning, or a Convolutional NN, usually adopted in image processing. Here are examples of the first two:

    # DENSE instance:
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    def twoLayerFeedForward():
        clf = Sequential()
        # the Dense/Dropout layers and the compile step go here
        return clf

    # LSTM instance:
    model = Sequential()
    # the LSTM layers and the compile step go here, before fitting
    model.fit(tf_transformed_2keras_norm, y_train, epochs=10, batch_size=50, verbose=1)

Either one of the above can be packaged within a pipeline to be called through the following:

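    # KerasClassifier lives in keras.wrappers.scikit_learn in older Keras releases and in the scikeras package otherwise
    from sklearn.pipeline import Pipeline

    classifier_keras = KerasClassifier(twoLayerFeedForward, epochs=10, batch_size=50, verbose=1)
    pipe = Pipeline([('classifier', classifier_keras)])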

It is now time to deal with the core topic of this article, PHASE 2, the processing of the tokenized text. Let us go through the three possible techniques previously listed.

(PHASE 2) Method 1 — TFIDF

We have already shown in phase 1 above that the TFIDF operator applied to the tokenized text can be called through the following:

    tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)

It can then be applied to the training set through this fit and transform method:

    tf_transformed = tfidf_vector.fit_transform(X_train)

The TFIDF transformation applied to our corpus would result in the following (the reader interested in reconciling the numbers below with the frequencies of those few words should refer to the exact sklearn documentation since the computation varies slightly from the standard TFIDF computation):

Since we chose three sample texts that share no words, the text selected for testing (the third one) is mapped into all zeroes: the model has not seen any of its words during the fitting phase (done through the first two texts). We highlighted this dynamic on purpose; as stated at the beginning, it is unlikely to happen in real applications, when proper training is executed and more complete text is applied.
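To reconcile such numbers by hand, the sketch below reimplements in plain Python what we understand to be the default weighting of scikit-learn's `TfidfVectorizer`: a smoothed idf, ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count and followed by L2 normalization of each row. The three toy texts and the helper names `fit_idf` and `transform` are hypothetical, and whitespace splitting stands in for the spaCy tokenizer.

```python
import math
from collections import Counter

def fit_idf(tokenized_texts):
    """Learn the vocabulary and the smoothed idf weights from the training texts."""
    n_docs = len(tokenized_texts)
    doc_freq = Counter()
    for tokens in tokenized_texts:
        doc_freq.update(set(tokens))  # count each term once per text
    vocab = sorted(doc_freq)
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0 for term in vocab}
    return vocab, idf

def transform(tokens, vocab, idf):
    """Map one tokenized text to an L2-normalized TF-IDF row over the learned vocabulary."""
    counts = Counter(t for t in tokens if t in idf)  # never-seen words are dropped
    row = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else row

# hypothetical toy corpus: the third text shares no word with the first two
train_texts = ["profits grew strongly".split(), "costs increased again".split()]
test_text = "outlook remains uncertain".split()

vocab, idf = fit_idf(train_texts)
train_rows = [transform(t, vocab, idf) for t in train_texts]
test_row = transform(test_text, vocab, idf)  # every entry is zero
```

Because the test text contains only words absent from the fitted vocabulary, its row is all zeroes, which is the behavior described above.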

(PHASE 2) Method 2 — w2v Gensim “in-house trained” model

Compared to the previous case, here we need to substitute the following line:

    tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)

We need something related to our “in-house” N-dimensional transformation of words. We choose a common 300-dimensional space, meaning each word is translated into a 300-dimensional vector. Here is a critical point: we can transform each word into a vector, but how do we then obtain a representation of an entire text (an entire sentence in our case, since each text of the corpus consists of one sentence)? In this article, we sum, dimension by dimension, the vectors of the words constituting a single text; we therefore still obtain a 300-dimensional representation of a single text, as if it were a single word. It is important to remember the linearity of the space described above, thanks to which summing up vectors has indeed logical meaning: vector[“person”] + vector[“female”] + vector[“crown”] = vector[“queen”]. True, once the relationships among words within a text are considered, the linearity would probably not be maintained; still, taking the sum is not just a way to obtain a single number for each dimension. In any case, we are not arguing this is the best solution. Finally, we can now substitute the TFIDF of the previous technique with the following:

    import numpy as np
    from gensim.models import Word2Vec

    # X_tokenized: the corpus as a list of token lists (one list of string tokens per text)
    model = Word2Vec(sentences=X_tokenized, vector_size=300, window=5, min_count=1, workers=4)
    inhouse_wv = model.wv
    wordsInVocab = len(inhouse_wv)

    def sent_vectorizer(sent, model):
        sent_vec = np.zeros(300)
        for w in sent:
            try:
                vc = model[w]
                vc = vc[0:300]
            except KeyError:
                vc = 0.0  # never-seen words contribute only zeroes
            sent_vec = sent_vec + vc  # sum dimension by dimension
        return sent_vec

    # here is the actual call to the transformation of our corpus and collection of texts into vectors
    def vectorize_myDoc(sentences):
        w2v_vec = []
        count = 0
        for sentence in sentences:
            w2v_vec.append(sent_vectorizer(sentence, inhouse_wv))
            count += 1
        return w2v_vec

    # calling the transformation initialized through the functions above
    w2v_transformed = vectorize_myDoc(X_tokenized)

In this second case, we should obtain a description of the type shown right below. Even though the pictures only show 6 columns, those matrices have 300 columns (i.e. features of the representation), while they still have one row per text.

Note, it is good practice to normalize those numbers before feeding them to the following neural network.
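The article does not prescribe a specific normalization, so the following is only one possible choice: scaling each sentence vector to unit length (L2 normalization), which removes the effect of longer texts producing larger sums. The function name `l2_normalize_rows` and the 4-dimensional toy vectors are hypothetical stand-ins for the 300-dimensional ones.

```python
import numpy as np

def l2_normalize_rows(matrix):
    """Scale each row to unit length; all-zero rows (texts with no known word) stay zero."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    safe_norms = np.where(norms == 0.0, 1.0, norms)  # avoid dividing by zero
    return matrix / safe_norms

# hypothetical 4-dimensional sentence vectors standing in for the 300-dimensional ones
w2v_toy = [
    np.array([3.0, 4.0, 0.0, 0.0]),
    np.array([0.0, 0.0, 0.0, 0.0]),
    np.array([1.0, 1.0, 1.0, 1.0]),
]

# stack the list of vectors into a (texts x features) matrix, then normalize
w2v_toy_norm = l2_normalize_rows(np.vstack(w2v_toy))
```

Stacking the list into a single 2-D array first is also the shape a Keras `fit` call generally expects for dense input.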

(PHASE 2) Method 3 — w2v Gensim pre-trained model (Wikipedia text)

Finally, the implementation through the Gensim pre-trained vectorizer is very similar to the previous case; the result, however, should be very different. Operationally, we just need to substitute the following lines:

    from gensim.models import Word2Vec
    model = Word2Vec(sentences=X_tokenized, vector_size=300, window=5, min_count=1, workers=4)
    inhouse_wv = model.wv

with the following:

    from gensim.models import KeyedVectors

The lines above work in case the Gensim “kv” file is already stored locally. Otherwise, it can be downloaded through the following API call:
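A minimal sketch of both loading paths, assuming Gensim 4.x. The function name `load_pretrained_vectors`, the local file path, and the model name `glove-wiki-gigaword-300` (a 300-dimensional set of vectors built from Wikipedia and Gigaword text that is distributed through Gensim's downloader) are assumptions for illustration, not necessarily what was used for this article.

```python
def load_pretrained_vectors(local_kv_path=None, model_name="glove-wiki-gigaword-300"):
    """Return Gensim KeyedVectors, from a local file if given, else via the downloader."""
    if local_kv_path is not None:
        from gensim.models import KeyedVectors
        return KeyedVectors.load(local_kv_path)  # file previously written with .save()
    import gensim.downloader as api
    return api.load(model_name)  # downloads and caches the vectors on first use

# usage (commented out because the first call downloads several hundred megabytes):
# pretrained_wv = load_pretrained_vectors()
# pretrained_wv = load_pretrained_vectors(local_kv_path="my_vectors.kv")  # hypothetical path
```

The returned object can then take the place of `inhouse_wv` in the sentence-vectorizing functions of the previous method.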

The result would appear to be very similar to the previous case (“in-house trained” w2v), but we will see immediately below that that is not the case (again, those matrices have 300 columns even though the pictures only show 6 of them):

Because the underlying model was trained on a completely different corpus (text from Wikipedia vs “in-house” text), two words very close in the 300-dimensional space previously obtained could be far apart in this one. Moreover, the testing text is now mapped into a vector different from a series of zeroes. This is because the model, having been trained on a much bigger corpus (from Wikipedia), happens to have already encountered the words contained in that third text (possibly not all of them: since we compute a sum over the entire text, some words may still be mapped to 300 zeroes without this being visible in the result).
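Since the sum hides which words were actually found, a small check like the one below can make that explicit. It is only a sketch: `vocabulary_coverage` is a hypothetical helper, and the toy dictionary stands in for a Gensim `KeyedVectors` object, which supports the same `in` membership test.

```python
def vocabulary_coverage(tokens, known_words):
    """Split tokens into those the embedding knows and those that would map to zeroes."""
    found = [w for w in tokens if w in known_words]
    missing = [w for w in tokens if w not in known_words]
    share = len(found) / len(tokens) if tokens else 0.0
    return found, missing, share

# hypothetical stand-in for the vocabulary of a pre-trained model
toy_vocab = {"outlook": None, "remains": None}

found, missing, share = vocabulary_coverage(["outlook", "remains", "uncertain"], toy_vocab)
```

A low share on the test set would suggest that the chosen pre-trained corpus does not cover the domain well.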

Conclusion

We would like to close with some recommendations from our own experience.

We should keep it as simple as possible. A big vectorized space and a big Keras neural net are probably not the right starting point. A TFIDF transformation and a logistic regression are probably better candidates. That solution should already give a good indication of the quality of the corpus and text we are using. Results could then be improved with the more advanced w2v method and some Keras tools.
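As a minimal sketch of that simpler starting point, assuming scikit-learn is installed: a TFIDF transformation followed by a logistic regression, packaged in a pipeline. The six toy texts are hypothetical, the labels follow the convention of this article (0 negative, 1 neutral, 2 positive), and the default sklearn tokenizer is used here instead of the spaCy one to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# hypothetical toy corpus; labels: 0 negative, 1 neutral, 2 positive
texts = [
    "revenue declined and costs increased",
    "losses widened during the quarter",
    "the meeting is scheduled for tuesday",
    "the report covers the third quarter",
    "profits grew strongly this year",
    "margins improved and demand was strong",
]
labels = [0, 0, 1, 1, 2, 2]

baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
baseline.fit(texts, labels)

new_text = ["profits grew strongly this quarter"]
predictions = baseline.predict(new_text)          # one label per input text
probabilities = baseline.predict_proba(new_text)  # one column per class
```

With so few samples the predictions themselves carry no meaning; the point is only the shape of the pipeline, into which a larger corpus can be dropped.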

Experimenting with different corpora could be useful too. Comparing results obtained from texts specific to our domain with those from more general ones (like the Wikipedia one) could provide good guidance. The same applies to texts that differ in writing style, for example comparing results obtained from Twitter text with those from Wikipedia text.

Finally, it should be stressed that the solutions reported here are not the only ones possible. They are meant to be good references for practitioners who need a starting point and for interested people who want a framework to understand natural language processing. However, what this article covers can still be considered the base of more advanced solutions, which would probably involve only additional processing steps.