A user has to classify very large amounts of text into over 10,000 categories. How can that be done?

Asked by LaunaKirchner in Data Science, asked on Dec 23, 2019
Answered by Launa Kirchner

Let us assume the following data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

X = ["When I wake up in the morning I always eat apples",
     "What do you eat in the morning",
     "Usually I only drink coffee",
     "How awful, I really cannot stand coffee"]

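Applying the TF-IDF transformation gives a matrix of shape (4, 21):

vectorizer = TfidfVectorizer()
vectorized_X = vectorizer.fit_transform(X)
vectorized_X.shape  # (4, 21)

It has 21 columns, one per distinct term in the vocabulary, and we want to reduce that number.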
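To see where the 21 columns come from, the vocabulary can be counted by hand. The following is only a rough sketch using the standard library: it assumes the vectorizer's default behaviour (lowercasing, and keeping only tokens of two or more word characters), and the names `TOKEN_PATTERN` and `build_vocabulary` are made up for this illustration.

```python
import re

# The same four example sentences used in the answer.
X = [
    "When I wake up in the morning I always eat apples",
    "What do you eat in the morning",
    "Usually I only drink coffee",
    "How awful, I really cannot stand coffee",
]

# Assumption: mimic TfidfVectorizer's default tokenization, which keeps
# tokens of two or more word characters, so the one-letter "I" is dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def build_vocabulary(docs):
    """Return the sorted list of distinct lowercase tokens across all docs."""
    vocab = set()
    for doc in docs:
        vocab.update(TOKEN_PATTERN.findall(doc.lower()))
    return sorted(vocab)


vocabulary = build_vocabulary(X)
print(len(vocabulary))  # 21, one column per distinct term
```

With a real corpus the vocabulary grows quickly, which is why the column count is worth reducing.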

We can use a dimensionality reduction technique; here we apply truncated SVD:

svd = TruncatedSVD(n_components=2)
reduced_X = svd.fit_transform(vectorized_X)
reduced_X.shape  # (4, 2)

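The advantage of this technique is that we choose the number of output columns ourselves. TruncatedSVD then performs a linear dimensionality reduction on the TF-IDF matrix, and it works directly on the sparse matrix the vectorizer produces. The reduced matrix can then be used as the input features for a classifier.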
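One possible way to package the two steps is a scikit-learn pipeline, so that unseen text goes through exactly the same TF-IDF and SVD transformation as the training data. This is a minimal sketch, not the only way to do it: `random_state=0` and the sample sentence in `new_text` are assumptions added here for repeatability and illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

X = [
    "When I wake up in the morning I always eat apples",
    "What do you eat in the morning",
    "Usually I only drink coffee",
    "How awful, I really cannot stand coffee",
]

# Chain vectorization and reduction; random_state is set only so the
# randomized SVD solver gives repeatable results.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
)

# Fit both steps on the corpus and reduce it in one call.
reduced_X = pipeline.fit_transform(X)
print(reduced_X.shape)  # (4, 2)

# Unseen text is projected into the same 2-column space.
new_text = ["I drink coffee in the morning"]  # hypothetical example input
new_vec = pipeline.transform(new_text)
print(new_vec.shape)  # (1, 2)

# Share of the variance kept by the two components.
svd = pipeline.named_steps["truncatedsvd"]
print(svd.explained_variance_ratio_.sum())
```

On a real corpus, `n_components` would normally be far larger than 2; the explained variance printed at the end is one way to judge how much information a given setting keeps.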


