I'm using scikit to perform text classification and I'm trying to understand where the points lie with respect to my hyperplane to decide how to proceed. But I can't seem to plot the data that comes from the CountVectorizer() function. I used the following function: pl.scatter(X[:, 0], X[:, 1]) and it gives me the error: ValueError: setting an array element with a sequence.
Any idea how to fix this?
If X is a sparse matrix, you probably need X = X.todense() in order to get access to the data in the correct format. You probably want to check X.shape before doing this though, as if X is very large (but very sparse) it may consume a lot of memory when "densified".
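For example, a minimal sketch (the corpus here is made up; toarray() gives a plain ndarray, while todense() gives a np.matrix):

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]   # hypothetical documents
X = CountVectorizer().fit_transform(corpus)              # scipy sparse matrix

print(X.shape)        # check the size before densifying
Xd = X.toarray()      # dense ndarray; only do this if it fits in memory

plt.scatter(Xd[:, 0], Xd[:, 1])   # plain indexing now works
plt.show()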
I have two np.ndarrays, data with shape (8000, 500) and sample with shape (1, 500).
What I am trying to achieve is measure various types of metrics between every row in data to sample.
When using sklearn.metrics.pairwise.cosine_distances I was able to take advantage of numpy's broadcasting by executing the following line:
x = cosine_distances(data, sample)
But when I tried to use the same procedure with scipy.spatial.distance.cosine I got the error
ValueError: Input vector should be 1-D.
I guess this is a broadcasting issue and I'm trying to find a way to get around it.
My ultimate goal is to iterate over all of the distances available in scipy.spatial.distance that can accept two vectors and apply them to the data and the sample.
How can I replicate the broadcasting that happens automatically in sklearn in my scipy version of the code?
OK, looking at the docs, http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html
With (8000, 500) and (1, 500) inputs ((samples, features)), you should get back an (8000, 1) result ((samples1, samples2)).
I wouldn't describe that as broadcasting. It's more like a dot product, performing some sort of calculation (a norm) over the features (the 500 dimension) and reducing them down to one value. It's more like np.dot(data, sample.T) in its handling of dimensions.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html "Computes the Cosine distance between 1-D arrays", so using it is more like:
for row in data:
    for s in sample:
        d = cosine(row, s)
or since sample has only one row
distances = np.array([cosine(row, sample[0]) for row in data])
In other words, the sklearn version does the pairwise iteration (probably in compiled code), while the scipy.spatial version just evaluates the distance for one pair.
pairwise.cosine_similarity does
# K(X, Y) = <X, Y> / (||X||*||Y||)
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
That's the dot-like behavior I mentioned earlier, but with the normalization added.
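As for the ultimate goal of iterating over the scipy.spatial.distance metrics: scipy.spatial.distance.cdist (not mentioned above, but in the same module) does the pairwise evaluation for any of those metric names, so a sketch along these lines should work:

import numpy as np
from scipy.spatial.distance import cdist

data = np.random.rand(8000, 500)     # stand-ins for the arrays in the question
sample = np.random.rand(1, 500)

# cdist evaluates the metric for every (row of data, row of sample) pair,
# returning an (8000, 1) array, the same layout as cosine_distances(data, sample)
for metric in ['cosine', 'euclidean', 'cityblock', 'correlation']:
    x = cdist(data, sample, metric=metric)
    print(metric, x.shape)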
Is there a way to save the confusion matrix that is generated by sklearn.metrics?
I would like to save multiple results of different classification algorithms in an array or maybe a pandas data frame so I can show which algorithm works best.
print('Neural net: \n',confusion_matrix(Y_test, Y_pred), sep=' ')
How could I save the generated confusion matrix within a loop? (I am training over a set of 200 different target variables)
array[i] = confusion_matrix(Y_test,Y_pred)
I run into a definition problem here (array is not defined, whereas the version without [i] runs smoothly).
Additionally, I am normalizing the confusion matrix. How could I print out the average result of the confusion matrix after the whole loop? (average of the 200 different confusion matrices)
I am not that fluent with python yet.
First, the "array is not defined" problem.
In Python a list is declared as:
array = []
Since the size of the list is not given at declaration, no space is allocated, so you can't assign to a position that doesn't exist yet: array[i] = some_value fails because no slot i has been allocated. If you know the required size in advance, fill the list with zeros at declaration and index into it, or call array.append() inside the loop, as in the sketch below.
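For instance, a minimal sketch of the append approach (Y_test and Y_pred stand in for whatever each iteration of your loop produces):

from sklearn.metrics import confusion_matrix

results = []                     # plain list; grows as needed
for i in range(200):
    # ... train the i-th model here, producing Y_pred for Y_test ...
    results.append(confusion_matrix(Y_test, Y_pred))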
Now for saving the confusion matrices:
Since confusion_matrix returns a 2-D array and you need to save multiple such arrays, use a 3-D array to hold the values.
import numpy as np
n_classes = len(np.unique(Y_test))   # confusion matrix is n_classes x n_classes, not len(Y_pred)
matrix_result = np.zeros((200, n_classes, n_classes))
for i in range(200):
    matrix_result[i] = confusion_matrix(Y_test, Y_pred)
For averaging:
matrix_result_average=matrix_result.mean(axis=0)
I'm not sure what you mean by training over a set of target variables (please elaborate), but here is a start at averaging over confusion matrices, using numpy.
First an empty result matrix is created, which is three-dimensional and the size of 200 stacked confusion matrices. These are then filled in one-by-one in the for-loop. Finally the resulting matrix is averaged along the dimension of the targets, resulting in the average confusion matrix.
import numpy as np
N = len(np.unique(Y_test))           # number of classes, not number of samples
result = np.zeros((len(targets), N, N))
for i, target in enumerate(targets):
    result[i] = confusion_matrix(Y_test, Y_pred)  # do something with target?
print(result.mean(axis=0))
The code below yields the following value error.
ValueError: operands could not be broadcast together with shapes (8,8) (64,)
It first arose when I expanded the "training" data set from 10 images to 100. The interpreter seems to be telling me that I can't perform any coordinate-wise operations on these data points because one of the coordinate pairs is missing a value. I can't argue with that. Unfortunately, my workarounds haven't exactly worked out. I attempted to insert an if condition followed by a continue statement (i.e., if this specific coordinate comes up, it should continue from the top of the loop). The interpreter didn't like this idea and muttered something about the truth of that statement not being as cut and dried as I thought. It suggested I try a.any() or a.all(). I checked out examples of both, and tried placing the problematic coordinate pair in the parentheses and in place of the "a". Both approaches got me nowhere. I'm unaware of any Python functions similar to the ones I would use in C to exclude inputs that don't meet specific criteria. Other answers pertaining to similar problems recommend changing the math, but I was told that this is how I am to proceed, so I'm treating it as an error-handling problem.
Does anyone have any insight concerning how one might handle this issue? Any thoughts would be greatly appreciated!
Here's the code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits()
#print the 0th image in the image database as an integer matrix
print(digits.images[0])
#plot the 0th image in the database assigning each pixel an intensity of black
plt.figure()
plt.imshow(digits.images[0], cmap = plt.cm.gray_r, interpolation = 'nearest')
plt.show()
#create training subsets of images and targets(labels)
X_train = digits.images[0:1000]
Y_train = digits.target[0:1000]
#pick a test point from images (345)
X_test = digits.images[345]
#view test data point
plt.figure()
plt.imshow(digits.images[345], cmap = plt.cm.gray_r, interpolation = 'nearest')
plt.show()
#distance
def dist(x, y):
    return np.sqrt(np.sum((x - y)**2))
#expand set of test data
num = len(X_train)
no_errors = 0
distance = np.zeros(num)
for j in range(1697, 1797):
    X_test = digits.data[j]
    for i in range(num):
        distance[i] = dist(X_train[i], X_test)
    min_index = np.argmin(distance)
    if Y_train[min_index] != digits.target[j]:
        no_errors += 1
print(no_errors)
You need to show us where the error occurs, and some of the error stack.
Then you need to identify which arrays are causing the problem, and examine their shape. Actually the error tells us that: one operand is an 8x8 2d array; the other has the same number of elements but a 1d shape. You may have to trace some variables back through your own code.
Just to illustrate the problem:
In [381]: x = np.ones((8,8),int)
In [384]: y = np.arange(64)
In [385]: x*y
...
ValueError: operands could not be broadcast together with shapes (8,8) (64,)
In [386]: x[:] = y
...
ValueError: could not broadcast input array from shape (64) into shape (8,8)
Since the 2 arrays have the same number of elements, a fix likely involves reshaping one or the other:
In [387]: x.ravel() + y
Out[387]:
array([ 1, 2, 3, 4, 5, ... 64])
or x-y.reshape(8,8).
My basic point is, you need to understand what array shapes mean, and how arrays of different shape can be used together. You don't 'get around' the error, you fix the inputs so they are 'broadcasting' compatible.
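Tracing it back in the posted code: X_train comes from digits.images (one 8x8 array per sample), while inside the loop X_test = digits.data[j] is the flattened 64-element version of the same image, so dist subtracts an (8,8) array from a (64,) array. One minimal fix (just a sketch; reshaping X_test to 8x8 would work equally well) is to use the flattened representation on both sides:

# digits.data is the (n_samples, 64) flattened form of digits.images,
# so both operands of (x - y) in dist() now have shape (64,)
X_train = digits.data[0:1000]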
I don't think the problem is with the value of a specific element.
The truth value error occurs when you try to test an array in an if context. if expects a simple True or False, not an array of True/False values.
In [389]: if x>0:print('yes')
....
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
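If you did want to branch on an array comparison, you'd reduce it to a single boolean first, e.g.:

if (x > 0).any():    # True if at least one element is positive
    print('yes')
if (x > 0).all():    # True only if every element is positive
    print('all positive')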
I am solving a detection problem using a ConvNet. However, in my case the labels are matrices of dimension [3 x 5] for each image. I use Caffe for this work. I read the images using the Datalayer, while I read the labels using the HDF5Layer.
The HDF5Layer reads the [3x5] label matrix as [1x15] dimensional vector.
So I used a Reshape layer to reshape the vector into a matrix before computing the L2 loss. However, I realized that the Reshape layer formats the data as H x W while my label matrix is [W x H], i.e. [w=3, h=5],
hence the reshape is incorrect. I wonder: is there a way to reshape the [1x15] label vector in the right order, i.e. [3x5] and not [5x3]?
Another way I thought I could work around this is by flattening the output from the convolutional layer into [1 x 15] and then computing the loss against my [1 x 15] label.
I am showing the problem using Figures for better understanding because of my poor English.
[Figure: example of my input matrix label (images enlarged for illustration)]
[Figure: result of the Caffe Reshape layer]
Any suggestions on whether I am doing this right?
Either way of computing the loss is just fine. In fact, computing in the 1x15 shape will save you the time of converting. The loss computation is still pixel by pixel; the logical organization doesn't matter.
Using the same idea, it doesn't really matter whether you compute 3x5 or 5x3; all that matters is that your convolutional output and your label properly match each other.
If you want the display (graph, picture, etc.) to match, perhaps you can just switch the x and y designations before you plot the output.
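A quick numpy illustration of the ordering point (not Caffe, but the same row-major reshape semantics): reshaping to (3, 5) versus reshaping to (5, 3) and transposing visit the 15 elements in different orders, which is why all that matters is that both sides agree.

import numpy as np

v = np.arange(15)        # stand-in for the flat [1 x 15] label vector

a = v.reshape(3, 5)      # row-major fill: three rows of five
b = v.reshape(5, 3).T    # fill as 5x3, then transpose -> also 3x5

print(np.array_equal(a, b))   # False: same shape, different element order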
I'm using a function in Python's OpenCV library to get the optical flow of my hand as I move it around. Specifically http://docs.opencv.org/modules/video/doc/motion_analysis_and_object_tracking.html#calcopticalflowfarneback
This function outputs a numpy array
flow = cv2.calcOpticalFlowFarneback(prevgray, gray, 0.5, 3, 15, 3, 5, 1.2, 0)
print flow.shape # prints (480,320,2)
So flow is a matrix where each entry is a 2-vector. I want a way to quantify this matrix, so I thought of using the L1 matrix norm (numpy.linalg.norm(flow, 1)), which throws an "improper dimensions to norm" error.
I'm thinking about getting around this by calculating the euclidean norm of every vector and then finding the L1 norm of a matrix with the distances of the vectors.
I'm having trouble iterating through the flow matrix efficiently. I have done it using two for loops by going first through columns and then rows, but it's way too slow.
r,c,d = flow.shape
flowprime = numpy.zeros((r,c),flow.dtype)
for i in range(0, r):
    for j in range(0, c):
        flowprime[i, j] = numpy.linalg.norm(flow[i, j], 2)
print(numpy.linalg.norm(flowprime, 1))
I had also tried using numpy.nditer but
for x in numpy.nditer(flow, op_flags=['readwrite']):
    print x
just prints a single value rather than a vector.
What would be the fastest way to iterate through a numpy matrix with vectors as entries, norm them and then take the L1 norm?
As of numpy version 1.9, norm takes an axis argument.
Aside from that, say what you want ideally, and almost surely you can get numpy to do it. E.g., assuming no complex entries or missing values, the simplest case is np.sqrt((flow**2).sum()), or the case I think you describe, np.linalg.norm(np.sqrt((flow**2).sum(axis=-1)), 1).
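Spelled out as a drop-in replacement for the double loop (a sketch assuming flow from the question; axis=-1 reduces over the length-2 vector at each pixel):

import numpy as np

# Euclidean norm of each 2-vector, vectorized over the whole (480, 320, 2) array
flowprime = np.sqrt((flow ** 2).sum(axis=-1))
# equivalently, with numpy >= 1.9:
# flowprime = np.linalg.norm(flow, axis=-1)

print(np.linalg.norm(flowprime, 1))   # L1 matrix norm (max absolute column sum)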