A user has to classify very large amounts of text in over 10,000 categories. How to do that?

Asked by LaunaKirchner in Data Science on Dec 23, 2019

Answered by Launa Kirchner

Let us assume the following data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

X = ["When I wake up in the morning I always eat apples",
     "What do you eat in the morning",
     "Usually I only drink coffee",
     "How awful, I really cannot stand coffee"]

# Build the TF-IDF matrix that the SVD step below reduces
vectorized_X = TfidfVectorizer().fit_transform(X)

After applying the TF-IDF transformation, we get a matrix of shape (4, 21): one row per sentence and one column per distinct token.

So it has 21 columns, and we want to reduce that number.
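
To see where the 21 columns come from, here is a minimal sketch that uses only the standard library. It assumes the tokenization rule that scikit-learn documents as the TfidfVectorizer default (lowercase the text, keep tokens of two or more word characters), so the single-letter word "I" is dropped. The names TOKEN and vocabulary are made up for this illustration.

```python
import re

X = [
    "When I wake up in the morning I always eat apples",
    "What do you eat in the morning",
    "Usually I only drink coffee",
    "How awful, I really cannot stand coffee",
]

# Assumed tokenization: runs of two or more word characters, after lowercasing.
TOKEN = re.compile(r"\b\w\w+\b")

# One column per distinct token across all sentences.
vocabulary = sorted({tok for doc in X for tok in TOKEN.findall(doc.lower())})

n_rows, n_cols = len(X), len(vocabulary)
print((n_rows, n_cols))  # (4, 21)
```

With a real corpus the vocabulary easily reaches hundreds of thousands of tokens, which is why reducing the number of columns matters.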

We can use a dimensionality reduction technique. Let's perform truncated SVD:

svd = TruncatedSVD(n_components=2)
reduced_X = svd.fit_transform(vectorized_X)
reduced_X.shape  # (4, 2)

The advantage of this technique is that you choose the number of output columns yourself (n_components), and TruncatedSVD performs a linear dimensionality reduction directly on the sparse TF-IDF matrix.
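
The reduced matrix can then be fed to a classifier, which is what the question is about. The following is only a sketch of one possible setup, not a tuned solution: the labels in y are invented for this illustration, LogisticRegression is an arbitrary choice of classifier, and n_components=2 only makes sense for this four-sentence toy corpus. For real data with thousands of categories you would want far more components and far more training text.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = [
    "When I wake up in the morning I always eat apples",
    "What do you eat in the morning",
    "Usually I only drink coffee",
    "How awful, I really cannot stand coffee",
]
# Hypothetical labels, invented for this illustration only.
y = ["food", "food", "coffee", "coffee"]

# TF-IDF -> truncated SVD -> linear classifier, chained in one estimator.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(),
)
pipeline.fit(X, y)

# The first two steps alone give the reduced representation.
reduced = pipeline[:-1].transform(X)
print(reduced.shape)  # (4, 2)

# New text goes through the same vectorizer and SVD before classification.
prediction = pipeline.predict(["I never drink coffee"])
print(prediction)
```

Wrapping the steps in a pipeline means the vocabulary and the SVD projection learned from the training text are reused unchanged when new text is classified.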
