Python: tf-idf-cosine: to find document similarity

Question

Aug 25, 2012, 04:41 AM

python information-retrieval machine-learning tf-idf nltk


I was following a tutorial that is available at Part 1 & Part 2. Unfortunately, the author didn't have time for the final section, which involved using cosine similarity to find the distance between two documents. I followed the examples in the article with the help of the following link from stackoverflow. The code mentioned in that link is included below to make life easier:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stopWords)
# print(vectorizer)
transformer = TfidfTransformer()
# print(transformer)

# Term counts: learn the vocabulary from the documents, then reuse it for the query
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)

# tf-idf weights for the documents
transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

# tf-idf weights for the query (note: this refits the transformer on the query alone)
transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

As a result of the above code I get the following matrices:

Fit Vectorizer to train set [[1 0 1 0]
 [0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]

I am not sure how to use this output to calculate cosine similarity. I know how to implement cosine similarity for two vectors of the same length, but here I am not sure how to identify the two vectors.
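To make the question concrete, here is a minimal sketch of one possible reading: each row of the train tf-idf matrix is treated as a document vector and the single row of the test tf-idf matrix as the query vector, since both share the same four vocabulary columns. The helper name `cosine` is hypothetical, and the values are hard-coded copies of the output above rather than being recomputed with scikit-learn.

```python
import math

# Values copied from the tf-idf output above (assumption: the columns
# line up because both matrices come from the same fitted vocabulary).
doc_vectors = [
    [0.70710678, 0.0, 0.70710678, 0.0],   # "The sky is blue."
    [0.0, 0.70710678, 0.0, 0.70710678],   # "The sun is bright."
]
query_vector = [0.0, 0.57735027, 0.57735027, 0.57735027]  # the query


def cosine(a, b):
    """Hypothetical helper: cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Compare the query against every document row in turn.
similarities = [cosine(query_vector, doc) for doc in doc_vectors]
for i, sim in enumerate(similarities):
    print("document", i, "similarity", round(sim, 4))
# document 0 similarity 0.4082
# document 1 similarity 0.8165
```

Under this reading, the second document ("The sun is bright.") would score as closer to the query than the first. The same dot product and norms could presumably be computed with `np.dot` and `LA.norm` from the imports already in the script.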