A user asks whether, with a doc2vec model, we can cluster on the document vectors themselves: should we cluster each resulting model.docvecs[1] vector, and how should the clustering model be implemented? Below is the code implementation.
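The training corpus res and the cores variable are not shown in the question, so here is a minimal sketch of how they could be prepared; raw_texts and the 'doc_<n>' tagging scheme are illustrative assumptions.
import multiprocessing
import gensim
from gensim.models.doc2vec import TaggedDocument

cores = multiprocessing.cpu_count()

# hypothetical raw corpus; in practice these would be the user's documents
raw_texts = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "k-means groups similar vectors together",
]

# each document becomes a TaggedDocument carrying a unique tag such as 'doc_0'
res = [TaggedDocument(words=gensim.utils.simple_preprocess(text), tags=['doc_%d' % i])
       for i, text in enumerate(raw_texts)]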
# older gensim parameter names; recent gensim versions use vector_size and epochs instead of size and iter
model = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, window=4, iter=50, workers=cores)
model.build_vocab(res)
model.train(res, total_examples=model.corpus_count, epochs=model.iter)
# each of length 100
len(model.docvecs[1])
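Each trained vector can also be looked up by its tag rather than its integer index; here 'doc_0' assumes the tagging scheme sketched above.
# total number of trained document vectors
len(model.docvecs.doctags)
# the same kind of 100-dimensional vector, looked up by tag instead of index
model.docvecs['doc_0']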
We can take the document vectors from the model and fit an unsupervised algorithm such as k-means clustering, then use the resulting centroids to label the documents.
import numpy as np
from scipy.cluster.vq import kmeans, vq

NUMBER_OF_CLUSTERS = 15
# collect the trained document vectors into a single (num_docs x 100) array
doc_tags = list(model.docvecs.doctags.keys())
doc_vectors = np.array([model.docvecs[tag] for tag in doc_tags])
# computes the cluster centroids
centroids, _ = kmeans(doc_vectors, NUMBER_OF_CLUSTERS)
# computes the cluster id for each document vector
doc_ids, _ = vq(doc_vectors, centroids)
# zips the cluster ids back to the document labels
doc_labels = list(zip(doc_tags, doc_ids))
# outputs document label and the corresponding cluster label
[('doc_216', 0),
('doc_217', 12),
('doc_214', 13),
('doc_215', 11),
('doc_212', 13),
('doc_213', 11),
('doc_210', 5),
('doc_211', 13),
('doc_165', 0),
... ]
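As an aside, the same clustering can be done with scikit-learn's KMeans instead of scipy.cluster.vq; this is a minimal sketch assuming scikit-learn is installed, not part of the original code.
from sklearn.cluster import KMeans

km = KMeans(n_clusters=NUMBER_OF_CLUSTERS, random_state=0).fit(doc_vectors)
# km.labels_ plays the role of doc_ids and km.cluster_centers_ the role of centroids
doc_labels_sklearn = list(zip(doc_tags, km.labels_))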
Using gensim, the centroids can also be used for retrieval. If matching every document to a cluster is not needed, for example if we only want the nearest 10 documents to centroid (cluster) 1, we can use the following code.
model.docvecs.most_similar(positive=[centroids[1]], topn=10)
# outputs document label and a similarity score
[('doc_243', 0.9186744689941406),
('doc_74', 0.9134798049926758),
('doc_261', 0.8858329057693481),
('doc_88', 0.8851054906845093),
('doc_276', 0.8691701292991638),
('doc_249', 0.8666893243789673),
('doc_233', 0.8334537148475647),
('doc_292', 0.8269758224487305),
('doc_98', 0.8193566799163818),
('doc_82', 0.808419942855835)]
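The same call can be repeated for every centroid to build a small index of representative documents per cluster; this is a sketch using the objects defined above, and the nearest_per_cluster name is an assumption.
# nearest 10 documents for every cluster, keyed by cluster id
nearest_per_cluster = {
    cluster_id: model.docvecs.most_similar(positive=[centroid], topn=10)
    for cluster_id, centroid in enumerate(centroids)
}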