A user has to classify very large amounts of text into more than 10,000 categories. How can that be done?
Let us assume the following data:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
X = ["When I wake up in the morning I always eat apples",
"What do you eat in the morning",
"Usually I only drink coffee",
"How awful, I really cannot stand coffee"]
After applying the TF-IDF transformation, we get a matrix of shape (4,21):
vectorizer = TfidfVectorizer()
vectorized_X = vectorizer.fit_transform(X)
vectorized_X.shape
>>> (4,21)
So it consists of 21 columns (one per unique token), and we want to reduce that number.
We can use a dimensionality reduction technique; let's perform a truncated SVD:
svd = TruncatedSVD(n_components=2)
reduced_X = svd.fit_transform(vectorized_X)
reduced_X.shape
>>> (4,2)
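To judge how much information a given number of components keeps, we can inspect the fitted TruncatedSVD's explained_variance_ratio_ attribute; a minimal check on the example above:
# Fraction of the variance captured by each of the two retained components
print(svd.explained_variance_ratio_)
# Total fraction of variance kept after the reduction
print(svd.explained_variance_ratio_.sum())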
The advantage of this technique is that we can choose the number of output columns ourselves, and it performs a linear dimensionality reduction directly on the sparse TF-IDF matrix, without having to densify it first (unlike PCA).
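To come back to the original question, dimensionality reduction is only the middle step: the reduced matrix still has to be fed to a classifier that can handle 10,000+ categories. Below is a minimal sketch of one way to wire this together; the toy labels y, the number of components, and the choice of LinearSVC (which handles multiclass via one-vs-rest) are illustrative assumptions, not the only option:
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy labels, one per document; in practice y would hold
# the real category ids (potentially 10,000+ distinct values)
y = ["food", "food", "drink", "drink"]

clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),  # with real data, use far more components
    LinearSVC(),                   # fits one binary model per class (one-vs-rest)
)
clf.fit(X, y)
clf.predict(["I eat apples every morning"])  # returns the predicted category
For a very large number of classes, a linear model such as LinearSVC or SGDClassifier is a common choice because cost stays cheap per class; whether the TruncatedSVD step helps or hurts accuracy is something to validate on the actual data.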