Using counts and tfidf as features with scikit-learn

I’m trying to use both counts and tfidf as features for a multinomial NB model. Here’s my code:

text = ["this is spam", "this isn't spam"]
labels = [0,1]
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)

tf_transformer = TfidfTransformer(use_idf=True)
combined_features = FeatureUnion([("counts", self.count_vectorizer), ("tfidf", tf_transformer)]).fit(self.text)

classifier = MultinomialNB()
classifier.fit(combined_features, labels)

But I’m getting an error with FeatureUnion and tfidf:

TypeError: no supported conversion for types: (dtype('S18413'),)

Any idea why this could be happening? Is it not possible to have both counts and tfidf as features?

Best answer

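The error doesn’t come from FeatureUnion; it comes from TfidfTransformer.

Use TfidfVectorizer instead of TfidfTransformer. The transformer expects a matrix of term counts (such as the output of CountVectorizer) as input, not plain text, hence the TypeError. Note too that .fit() returns the fitted FeatureUnion itself rather than a feature matrix, so call .fit_transform() to get the features you pass to the classifier.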
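To make the difference concrete, here is a minimal sketch comparing the two routes. The four-document corpus and the variable names are made up for illustration; with default settings, TfidfVectorizer should give the same matrix as CountVectorizer followed by TfidfTransformer.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, only for illustration.
docs = [
    "free prize inside claim now",
    "meeting moved to friday",
    "claim your free prize",
    "lunch on friday",
]

# Two-step route: raw text -> term counts -> TF-IDF weights.
# TfidfTransformer is given the count matrix, never the raw strings.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step route: raw text -> TF-IDF weights.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

# Both routes use default settings here, so the results should match.
same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print(same)
```

Inside a FeatureUnion every branch receives the same input (here, the raw text), which is why the branch has to be a vectorizer rather than a transformer that expects counts.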

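Also, your test data is too small for trying out TF-IDF: with only two documents, min_df=3 can never be satisfied, because no term can appear in three documents. Try a bigger corpus. Here’s an example: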

from nltk.corpus import brown  # needs a one-time nltk.download("brown")

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB

# Let's get more text from NLTK: the first 100 sentences of the Brown corpus.
text = [" ".join(sent) for sent in brown.sents()[:100]]
# I'm just going to assign arbitrary labels.
labels = ["yes"] * 50 + ["no"] * 50

count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
# TfidfVectorizer takes raw text, unlike TfidfTransformer.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)

# fit_transform learns both vocabularies and returns the stacked feature matrix.
combined_features = FeatureUnion(
    [("counts", count_vectorizer), ("tfidf", tfidf_vectorizer)]
).fit_transform(text)

classifier = MultinomialNB()
classifier.fit(combined_features, labels)
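
If you also want to classify new text later, one option is to wrap the FeatureUnion and the classifier in a Pipeline so the same fitted vocabularies are reused at prediction time. This is a sketch under my own assumptions, not part of the original code: the tiny spam/ham corpus, the labels, and the step names are invented for illustration, and min_df is left at its default because the corpus is so small.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Hypothetical training data, only for illustration.
train_text = [
    "win a free prize now",
    "claim your free prize today",
    "free cash prize waiting",
    "meeting moved to friday morning",
    "lunch with the team on friday",
    "notes from the friday meeting",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Both branches of the union receive the raw text; their outputs are
# stacked side by side before reaching the classifier.
model = Pipeline([
    ("features", FeatureUnion([
        ("counts", CountVectorizer(stop_words="english")),
        ("tfidf", TfidfVectorizer(stop_words="english")),
    ])),
    ("classifier", MultinomialNB()),
])
model.fit(train_text, train_labels)

# Number of columns in the combined matrix (count columns + TF-IDF columns).
n_features = model.named_steps["features"].transform(train_text).shape[1]

# New text goes through the same fitted vectorizers automatically.
predictions = model.predict(["free prize", "friday meeting"])
print(n_features, list(predictions))
```

Both feature blocks are non-negative, which MultinomialNB requires, so stacking counts and TF-IDF weights this way is valid input for it.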