How does kmeans know how to cluster documents when we only feed it tfidf vectors of individual words?

I am using scikit learn’s Kmeans algorithm to cluster comments.

sentence_list=['hello how are you', "I am doing great", "my name is abc"]

vectorizer=TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')

km=KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)

when i print the output of vectorized, it gives me the index of the words and the tf-idf scores of the index.

So im wondering, given we only get the tfidf scores of words, how is it that we manage to cluster documents based on individual words and not the score of an entire document? Or maybe it does this..Can someone explain to me the concept behind this?

Best answer

You should take a look at how the Kmeans algorithm works. First the stop words never make it to vectorized, therefore are totally ignored by Kmeans and don’t have any influence in how the documents are clustered. Now suppose you have:

sentence_list=["word1", "word2", "word2 word3"]

Lets say you want 2 clusters. In this case you expect the second and the third document to be in the same cluster because they share a common word. Lets see how this happens.

The numeric representation of the docs vectorized looks like:

word1     word3     word2
    1  0.000000  0.000000     # doc 1
    0  1.000000  0.000000     # doc 2
    0  0.605349  0.795961     # doc 3

In the first step of Kmeans, some centroids are randomly chosen from the data, for example, the document 1 and the document 3 will be the initial centroids:

Centroid 1:     [1, 0.000000, 0.000000]

Centroid 2:     [0, 0.605349, 0.795961]

Now if you calculate the distances from every point(document) to each one of the two centroids, you will see that:

  • document 1 has distance 0 to centroid 1 so it belongs to centroid 1
  • document 3 has distance 0 to centroid 2 so it belongs to centroid 2

Finally we calculate the distance between the remaining document 2 and each centroid to find out which one it belongs to:

>>> from scipy.spatial.distance import euclidean

>>> euclidean([0, 1, 0], [1, 0, 0])               # dist(doc2, centroid1)

>>> euclidean([0, 1, 0], [0, 0.605349, 0.795961]) # dist(doc2, centroid2)

So the 2nd document and the second centroid are closer, this means that the second document is assigned to the 2nd centroid.