
A tutorial on creating a vector space model for text in Python.

We need to start thinking about how to transform a collection of text into something quantifiable. The easiest way to do this is to consider word frequency.

I'm going to avoid the NLTK and scikit-learn packages as much as possible. We will start by explaining some basic concepts in plain Python.

Basic word frequency

First, let's review how to count the occurrences of each word in each document: a word frequency vector.
 

#examples taken from here: /a/1750187
 
mydoclist = ['Julie loves me more than Linda loves me',
'Jane likes me more than Julie loves me',
'He likes basketball more than baseball']
 
#mydoclist = ['sun sky bright', 'sun sun bright']
 
from collections import Counter
 
for doc in mydoclist:
  tf = Counter()
  for word in doc.split():
    tf[word] += 1
  print tf.items()

[('me', 2), ('Julie', 1), ('loves', 2), ('Linda', 1), ('than', 1), ('more', 1)]
[('me', 2), ('Julie', 1), ('likes', 1), ('loves', 1), ('Jane', 1), ('than', 1), ('more', 1)]
[('basketball', 1), ('baseball', 1), ('likes', 1), ('He', 1), ('than', 1), ('more', 1)]

Here we introduce a new Python object called Counter, which is only available in Python 2.7 and higher. Counters are very flexible; you can use them to accomplish things like counting within a loop.
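As a quick illustration of that flexibility, here is a minimal sketch (not part of the original example) showing that a Counter can also be built directly from a list of tokens and can report the most common ones:

from collections import Counter

# build a Counter straight from a token list instead of filling it in a loop
tf = Counter('Julie loves me more than Linda loves me'.split())
print tf.most_common(3) # e.g. [('me', 2), ('loves', 2), ('Julie', 1)]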

This first attempt at quantifying the documents was based simply on the number of times each word appears in each document. However, for anyone who has already met the concept of a "vector" in the vector space model, these first results cannot be compared with one another, because they do not live in the same vocabulary space.

What we really want is for the quantified representation of every document to have the same length, where that length is determined by the total vocabulary of our corpus.
 

import string #allows for format()
   
def build_lexicon(corpus):
  lexicon = set()
  for doc in corpus:
    lexicon.update([word for word in doc.split()])
  return lexicon
 
def tf(term, document):
  return freq(term, document)
 
def freq(term, document):
  return document.split().count(term)
 
vocabulary = build_lexicon(mydoclist)
 
doc_term_matrix = []
print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
for doc in mydoclist:
  print 'The doc is "' + doc + '"'
  tf_vector = [tf(word, doc) for word in vocabulary]
  tf_vector_string = ', '.join(format(freq, 'd') for freq in tf_vector)
  print 'The tf vector for Document %d is [%s]' % ((mydoclist.index(doc)+1), tf_vector_string)
  doc_term_matrix.append(tf_vector)
 
  # here's a test: why did I wrap mydoclist.index(doc)+1 in parens? it returns an int...
  # try it! type(mydoclist.index(doc) + 1)
 
print 'All combined, here is our master document term matrix: '
print doc_term_matrix

Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]

The doc is "Julie loves me more than Linda loves me"
The tf vector for Document 1 is [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]

The doc is "Jane likes me more than Julie loves me"
The tf vector for Document 2 is [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]

The doc is "He likes basketball more than baseball"
The tf vector for Document 3 is [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]

All combined, here is our master document term matrix:

[[2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1], [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]]

Okay, this seems reasonable. If you have any experience with machine learning, what you've just seen is the creation of a feature space. Now every document is in the same feature space, which means we can represent the entire corpus in the same dimensional space without losing too much information.

Normalize the vector so that its L2 norm is 1

Once you have the data in the same feature space, you can start applying machine learning methods: classification, clustering, and so on. But in practice we quickly run into a problem: not all words carry the same amount of information.

If some words occur extremely frequently within a single document, they will distort our analysis. We want to scale each word frequency vector so that it becomes more representative. In other words, we need to normalize the vectors.

We don't really have time to dig into the math here. For now, just accept that we need to make sure the L2 norm of each vector is equal to 1. Here's some code that shows how this is accomplished.
 

import math
import numpy as np # used below only to pretty-print the matrices
 
def l2_normalizer(vec):
  denom = sum([el**2 for el in vec])
  return [(el / math.sqrt(denom)) for el in vec]
 
doc_term_matrix_l2 = []
for vec in doc_term_matrix:
  doc_term_matrix_l2.append(l2_normalizer(vec))
 
print 'A regular old document term matrix: '
print np.matrix(doc_term_matrix)
print '\nA document term matrix with row-wise L2 norms of 1:'
print np.matrix(doc_term_matrix_l2)
 
# if you want to check this math, perform the following:
# from numpy import linalg as la
# la.norm(doc_term_matrix[0])
# la.norm(doc_term_matrix_l2[0])

A regular old document term matrix:

[[2 0 1 0 0 2 0 1 0 1 1]
[2 0 1 0 1 1 1 0 0 1 1]
[0 1 0 1 1 0 0 0 1 1 1]]

A document term matrix with row-wise L2 norms of 1:

[[ 0.57735027 0. 0.28867513 0. 0. 0.57735027
0. 0.28867513 0. 0.28867513 0.28867513]
[ 0.63245553 0. 0.31622777 0. 0.31622777 0.31622777
0.31622777 0. 0. 0.31622777 0.31622777]
[ 0. 0.40824829 0. 0.40824829 0.40824829 0. 0.
0. 0.40824829 0.40824829 0.40824829]]

Not bad. Without delving too deeply into linear algebra, you can see right away that we've scaled down the vectors so that each element lies between 0 and 1, without losing too much valuable information. You can also see that a word with a count of 1 no longer has the same value in one vector as it does in another.

Why do we care about this normalization? Consider the following situation: if you wanted a document to seem more relevant to a particular topic than it really is, you could boost its chances of being associated with that topic by repeating the same word over and over again. But at some point, each additional repetition adds less and less information. So we need to scale down the weight of words that appear very frequently within a document.
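To make that concrete, here is a small illustrative sketch using the l2_normalizer defined above; the two-word vocabulary and its counts are invented purely for this demonstration:

# hypothetical two-word vocabulary: ['cat', 'dog']
honest_doc = [1, 1]  # "cat dog"
spammy_doc = [10, 1] # "cat" repeated ten times, "dog" once
 
print l2_normalizer(honest_doc) # [0.707..., 0.707...]
print l2_normalizer(spammy_doc) # [0.995..., 0.099...]
# after normalization no single word can exceed 1.0,
# so repeating a word buys less and less additional weight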

IDF frequency weighting

We're still not getting the results we want. Just as not all words within a document are equally valuable, not all words across all documents are valuable. We can try to adjust the weight of each word by its inverse document frequency (IDF). Let's see what this entails:
 

def numDocsContaining(word, doclist):
  doccount = 0
  for doc in doclist:
    if freq(word, doc) > 0:
      doccount +=1
  return doccount 
 
def idf(word, doclist):
  n_samples = len(doclist)
  df = numDocsContaining(word, doclist)
  return math.log(n_samples / 1+df) # note: due to operator precedence this evaluates as log(n_samples + df)
 
my_idf_vector = [idf(word, mydoclist) for word in vocabulary]
 
print 'Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']'
print 'The inverse document frequency vector is [' + ', '.join(format(freq, 'f') for freq in my_idf_vector) + ']'

Our vocabulary vector is [me, basketball, Julie, baseball, likes, loves, Jane, Linda, He, than, more]

The inverse document frequency vector is [1.609438, 1.386294, 1.609438, 1.386294, 1.609438, 1.609438, 1.386294, 1.386294, 1.386294, 1.791759, 1.791759]

Now, for every word in the vocabulary, we have an information value that accounts for its relative frequency across the entire corpus. Recall that IDF is an "inverse" measure: the idea is that the more frequently a word appears across the corpus, the less information it carries.

We're getting close to the desired result. To get a TF-IDF weighted word vector, you have to do a simple calculation: tf * idf.

Now let's take a step back and think about it. Think back to linear algebra: if you multiply a 1 x B vector by another 1 x B vector (via the dot product), you get a scalar. That's not what we want. What we want is a word vector with the same dimensions as before (1 x number of vocabulary words), in which each element has been weighted by its own IDF value. How can we implement such a calculation in Python?
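One straightforward way would be plain element-wise multiplication. Here is a minimal sketch using the vectors we already have (this is just for illustration, not the approach the rest of the tutorial takes):

# element-wise tf * idf for a single document vector:
# each term frequency gets scaled by that term's idf weight
def tf_idf_vector(tf_vector, idf_vector):
  return [t * i for t, i in zip(tf_vector, idf_vector)]
 
print tf_idf_vector(doc_term_matrix[0], my_idf_vector)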

We could keep hand-rolling functions like that, but instead we're going to take this opportunity to give an introduction to numpy.
 

import numpy as np
 
def build_idf_matrix(idf_vector):
  idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
  np.fill_diagonal(idf_mat, idf_vector)
  return idf_mat
 
my_idf_matrix = build_idf_matrix(my_idf_vector)
 
#print my_idf_matrix

Awesome! Now we've transformed the IDF vector into a B x B matrix whose diagonal holds the IDF vector. That means we can multiply each term frequency vector by the inverse document frequency matrix. Then, to make sure we also account for words that appear too frequently within documents, we'll normalize each document vector so that its L2 norm equals 1.
 

doc_term_matrix_tfidf = []
 
#performing tf-idf matrix multiplication
for tf_vector in doc_term_matrix:
  doc_term_matrix_tfidf.append(np.dot(tf_vector, my_idf_matrix))
 
#normalizing
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
  doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))
                   
print vocabulary
print np.matrix(doc_term_matrix_tfidf_l2) # np.matrix() just to make it easier to look at

set(['me', 'basketball', 'Julie', 'baseball', 'likes', 'loves', 'Jane', 'Linda', 'He', 'than', 'more'])

[[ 0.57211257 0. 0.28605628 0. 0. 0.57211257
0. 0.24639547 0. 0.31846153 0.31846153]
[ 0.62558902 0. 0.31279451 0. 0.31279451 0.31279451
0.26942653 0. 0. 0.34822873 0.34822873]
[ 0. 0.36063612 0. 0.36063612 0.41868557 0. 0.
0. 0.36063612 0.46611542 0.46611542]]

Awesome! You've just seen an example showing how tedious it is to build a TF-IDF weighted document term matrix by hand.

Here comes the best part: you don't even need to calculate the above variables manually, just use scikit-learn.

Remember that everything in Python is an object; objects take up memory, and operations on them take time. Using the scikit-learn package means you don't have to worry about the efficiency of all of the previous steps.

Note: the values you get from TfidfVectorizer/TfidfTransformer will be different from the ones we calculated by hand. This is because scikit-learn uses a smoothed version of IDF to avoid division-by-zero problems. There is a more in-depth discussion of this here.
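For the curious, here is a rough sketch of the smoothed formula as I understand scikit-learn's default behaviour (smooth_idf=True); treat the exact formula as an assumption to check against your installed version:

import math
 
# smoothed idf, as (I believe) scikit-learn computes it by default:
# idf(t) = ln((1 + n_documents) / (1 + document_frequency)) + 1
# the resulting tf-idf rows are then L2-normalized
def sklearn_style_idf(n_documents, document_frequency):
  return math.log((1.0 + n_documents) / (1.0 + document_frequency)) + 1.0
 
# e.g. 'me' appears in 2 of our 3 documents
print sklearn_style_idf(3, 2) # ~1.2877, versus 1.6094 from our own idf()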
 

from sklearn.feature_extraction.text import CountVectorizer
 
count_vectorizer = CountVectorizer(min_df=1)
term_freq_matrix = count_vectorizer.fit_transform(mydoclist)
print "Vocabulary:", count_vectorizer.vocabulary_
 
from sklearn.feature_extraction.text import TfidfTransformer
 
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(term_freq_matrix)
 
tf_idf_matrix = tfidf.transform(term_freq_matrix)
print tf_idf_matrix.todense()

Vocabulary: {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0. 0. 0. 0. 0.28945906 0.
0.38060387 0.57891811 0.57891811 0.22479078 0.22479078]
[ 0. 0. 0. 0.41715759 0.3172591 0.3172591
0. 0.3172591 0.6345182 0.24637999 0.24637999]
[ 0.48359121 0.48359121 0.48359121 0. 0. 0.36778358
0. 0. 0. 0.28561676 0.28561676]]

In fact, you can do all the steps with one function: TfidfVectorizer
 

from sklearn.feature_extraction.text import TfidfVectorizer
 
tfidf_vectorizer = TfidfVectorizer(min_df = 1)
tfidf_matrix = tfidf_vectorizer.fit_transform(mydoclist)
 
print tfidf_matrix.todense()
[[ 0. 0. 0. 0. 0.28945906 0.
0.38060387 0.57891811 0.57891811 0.22479078 0.22479078]
[ 0. 0. 0. 0.41715759 0.3172591 0.3172591
0. 0.3172591 0.6345182 0.24637999 0.24637999]
[ 0.48359121 0.48359121 0.48359121 0. 0. 0.36778358
0. 0. 0. 0.28561676 0.28561676]]

And we can use this vocabulary space to process new observation documents like this:
 

new_docs = ['He watches basketball and baseball', 'Julie likes to play basketball', 'Jane loves to play baseball']
new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
print tfidf_vectorizer.vocabulary_
print new_term_freq_matrix.todense()
{u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
[[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0.
0. 0. 0. 0. ]
[ 0. 0.68091856 0. 0. 0.51785612 0.51785612
0. 0. 0. 0. 0. ]
[ 0.62276601 0. 0. 0.62276601 0. 0. 0.
0.4736296 0. 0. 0. ]]

Note that there is no "watches" column in new_term_freq_matrix. This is because the documents we used for training were the documents in mydoclist, and "watches" does not appear in the vocabulary of that corpus. In other words, it is outside our lexicon.

Back to Amazon Review Text

Exercise 2

Now it's time to try using what you've learned. Using TfidfVectorizer, try to build a TF-IDF weighted document term matrix from a list of Amazon review text strings.
 

import os
import csv
 
#os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text1/fileformats/')
 
with open('amazon/sociology_2010.csv', 'rb') as csvfile:
  amazon_reader = csv.DictReader(csvfile, delimiter=',')
  amazon_reviews = [row['review_text'] for row in amazon_reader]
 
  #your code here!!!
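If you get stuck, here is one possible way to finish the exercise, assuming amazon_reviews was read in correctly (a sketch, not the only solution):

from sklearn.feature_extraction.text import TfidfVectorizer
 
# fit a fresh vectorizer on the review text and inspect the result
amazon_tfidf_vectorizer = TfidfVectorizer(min_df=1)
amazon_tfidf_matrix = amazon_tfidf_vectorizer.fit_transform(amazon_reviews)
 
print amazon_tfidf_matrix.shape # (number of reviews, vocabulary size)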