Scikit-Learn – Calculating TF-IDF from a corpus of arrays of features instead of from a corpus of raw documents

Scikit-Learn’s TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features. Instead of raw documents, I would like to convert a matrix of feature names to TF-IDF features.

The corpus you feed fit_transform() is supposed to be an array of raw documents, but instead I’d like to be able to feed it (or a comparable function) an array of arrays of features per document. For example:

corpus = [
    ['orange', 'red', 'blue'],
    ['orange', 'yellow', 'red'],
    ['orange', 'green', 'purple (if you believe in purple)'],
    ['orange', 'reddish orange', 'black and blue']
]

… as opposed to a one-dimensional array of strings.

I know that I can define my own vocabulary for the TfidfVectorizer to use, so I could easily make a dict of unique features in my corpus and their indices in the feature vectors. But the function still expects raw documents, and since my features are of varying lengths and occasionally overlap (for example, ‘orange’ and ‘reddish orange’), I can’t just concatenate my features into single strings and use ngrams.

Is there a different Scikit-Learn function I can use for this that I’m not finding? Is there a way to use the TfidfVectorizer that I’m not seeing? Or will I have to homebrew my own TF-IDF function to do this?

Best answer

You can write custom functions to override the built-in preprocessor and tokenizer.

From the docs:

Preprocessor – A callable that takes an entire document as input (as a single string), and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc.

Tokenizer – A callable that takes the output from the preprocessor and splits it into tokens, then returns a list of these.
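To make these two hooks concrete, here is a minimal sketch of what the default preprocessor and tokenizer do to a raw string. The sample string raw_document is made up for illustration; build_preprocessor() and build_tokenizer() are the vectorizer's own methods for retrieving the callables it would use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A vectorizer with default settings (lowercase=True, default token_pattern).
vectorizer = TfidfVectorizer()
preprocess = vectorizer.build_preprocessor()
tokenize = vectorizer.build_tokenizer()

# Hypothetical raw document, used only to illustrate the two steps.
raw_document = "Reddish Orange, black and blue"

preprocessed = preprocess(raw_document)  # lowercases the whole string
tokens = tokenize(preprocessed)          # splits into words of 2+ characters

print(preprocessed)
print(tokens)
```

With the defaults, a multi-word feature such as 'reddish orange' is split into separate word tokens, which is exactly what the question wants to avoid.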

In this case, there is no preprocessing to perform (because there are no raw documents). The tokenizing is also unnecessary, because we already have arrays of features. Therefore, we can do the following:

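from sklearn.feature_extraction.text import TfidfVectorizer
# Identity functions: each document is already a list of features
tfidf = TfidfVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x)
tfidf_matrix = tfidf.fit_transform(corpus)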

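We skip both the preprocessor and the tokenizer steps by passing each document through unchanged with lambda x: x. Because each document is already an array of features, the built-in analyzer treats every feature as one token, builds the vocabulary itself, and performs TF-IDF on the “tokenized” corpus as normal. Note that overriding the preprocessor also disables the default lowercasing, so features are matched exactly as written.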
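For a self-contained check, the sketch below runs the same idea on the corpus from the question and inspects the learned vocabulary. The helper name identity is my own; it does the same job as lambda x: x. Passing token_pattern=None is an assumption about your scikit-learn version: in recent releases it avoids a warning that the token pattern is unused when a custom tokenizer is supplied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    ["orange", "red", "blue"],
    ["orange", "yellow", "red"],
    ["orange", "green", "purple (if you believe in purple)"],
    ["orange", "reddish orange", "black and blue"],
]


def identity(doc):
    """Return the document unchanged; it is already a list of features."""
    return doc


tfidf = TfidfVectorizer(
    preprocessor=identity,  # no string preprocessing (also skips lowercasing)
    tokenizer=identity,     # each list element is treated as one token
    token_pattern=None,     # assumption: avoids the unused-parameter warning
)
tfidf_matrix = tfidf.fit_transform(corpus)

# Maps each feature string to its column index in tfidf_matrix.
vocabulary = tfidf.vocabulary_
print(tfidf_matrix.shape)
print(sorted(vocabulary))
```

Multi-word features such as 'reddish orange' and 'black and blue' show up as single vocabulary entries, separate from 'orange' and 'blue'. A named function is used instead of a lambda so that the fitted vectorizer can be pickled.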