# Latent Semantic Analysis in Python discrepancy

I’m trying to follow the Wikipedia Article on latent semantic indexing in Python using the following code:

``````documentTermMatrix = array([[ 0.,  1.,  0.,  1.,  1.,  0.,  1.],
[ 0.,  1.,  1.,  0.,  0.,  0.,  0.],
[ 0.,  0.,  0.,  0.,  0.,  1.,  1.],
[ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
[ 0.,  1.,  1.,  0.,  0.,  0.,  0.],
[ 1.,  0.,  0.,  1.,  0.,  0.,  0.],
[ 0.,  0.,  0.,  0.,  1.,  1.,  0.],
[ 0.,  0.,  1.,  1.,  0.,  0.,  0.],
[ 1.,  0.,  0.,  1.,  0.,  0.,  0.]])
u,s,vt = linalg.svd(documentTermMatrix, full_matrices=False)

sigma = diag(s)
## remove extra dimensions...
numberOfDimensions = 4
for i in range(4, len(sigma) -1):
sigma[i][i] = 0
queryVector = array([[ 0.], # same as first column in documentTermMatrix
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 1.],
[ 0.],
[ 0.],
[ 1.]])
``````

How the math says it should work:

``````dtMatrixToQueryAgainst = dot(u, dot(s,vt))
queryVector = dot(inv(s), dot(transpose(u), queryVector))
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainst[:,0]
# gives 'matrices are not aligned' error.  should be 1 because they're the same
``````

What does work, with math that looks incorrect: ( from here)

``````dtMatrixToQueryAgainst = dot(s, vt)
queryVector  = dot(transpose(u), queryVector)
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainsst[:,0])
# gives 1, which is correct
``````

Why does route work, and the first not, when everything I can find about the math of LSA shows the first as correct? I feel like I’m missing something obvious…

There are several inconsistencies in your code that cause errors before your point of confusion. This makes it difficult to understand exactly what you tried and why you are confused (clearly you did not run the code as it is pasted, or it would have thrown an exception earlier).

That said, if I follow your intent correctly, your first approach is nearly correct. Consider the following code:

``````documentTermMatrix = array([[ 0.,  1.,  0.,  1.,  1.,  0.,  1.],
[ 0.,  1.,  1.,  0.,  0.,  0.,  0.],
[ 0.,  0.,  0.,  0.,  0.,  1.,  1.],
[ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
[ 0.,  1.,  1.,  0.,  0.,  0.,  0.],
[ 1.,  0.,  0.,  1.,  0.,  0.,  0.],
[ 0.,  0.,  0.,  0.,  1.,  1.,  0.],
[ 0.,  0.,  1.,  1.,  0.,  0.,  0.],
[ 1.,  0.,  0.,  1.,  0.,  0.,  0.]])
numDimensions = 4
u, s, vt = linalg.svd(documentTermMatrix, full_matrices=False)
u = u[:, :numDimensions]
sigma = diag(s)[:numDimensions, :numDimensions]
vt = vt[:numDimensions, :]
lowRankDocumentTermMatrix = dot(u, dot(sigma, vt))
queryVector = documentTermMatrix[:, 0]
lowDimensionalQuery = dot(inv(sigma), dot(u.T, queryVector))
lowDimensionalQuery
vt[:,0]
``````

You should see that `lowDimensionalQuery` and `vt[:,0]` are equal. Think of `vt` as a representation of the documents in a low-dimensional subspace. First we map our query into that subspace to get `lowDimensionalQuery`, and then we compare it with the corresponding column of `vt`. Your mistake was trying to compare the transformed query to the document vector from `lowRankDocumentTermMatrix`, which lives in the original space. Since the transformed query has fewer elements than the “reconstructed” document, Python complained.