Python kernel crash in Jupyter Notebook when calling TSNE.fit_transform()

I have an output of sklearn's tf-idf which I want to visualize with T-SNE. However, when calling fit_transform on sklearn's T-SNE object, I get the error message:
"Canceled future for execute_request message before replies were done
The Kernel crashed while executing code in the the current cell or a
previous cell. Please review the code in the cell(s) to identify a
possible cause of the failure. Click here for more info. View Jupyter
log for further details."
Why is this happening? Code below.
dense = np.array(
[[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 1. , 0. ],
[0. , 0. , 0. , 1. , 0. ],
[0.70710678, 0.70710678, 0. , 0. , 0. ],
[0. , 0. , 0.70710678, 0. , 0.70710678],
[0.70710678, 0.70710678, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0.70710678, 0. , 0.70710678]])
from sklearn.manifold import TSNE
tsne = TSNE(n_components = 2, verbose = 1, perplexity = 50, n_iter = 1000)
results = tsne.fit_transform(dense)

I wasn't able to reproduce the error in a Google Colab. It works fine on my end, with the following output:
[t-SNE] Computing 10 nearest neighbors...
[t-SNE] Indexed 11 samples in 0.000s...
[t-SNE] Computed neighbors for 11 samples in 0.009s...
[t-SNE] Computed conditional probabilities for sample 11 / 11
[t-SNE] Mean sigma: 1125899906842624.000000
[t-SNE] KL divergence after 250 iterations with early exaggeration: 39.474655
[t-SNE] KL divergence after 1000 iterations: 0.268328
I've found an old thread on GitHub that may address the problem. It is a Mac-related issue, but I don't know which OS your machine runs.
There is a chance that the error was fixed in newer versions of sklearn, so my first suggestion is to upgrade, if you haven't already.
If the issue persists, the cause may be a dependency that sklearn uses (which could affect you even if you are not on a Mac), so I would recommend trying a different library. I know of python-bhtsne, which can be used in much the same way as sklearn's implementation.
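One more thing worth checking, independent of the crash: the snippet asks for perplexity = 50 on only 11 samples. As far as I know, recent scikit-learn releases reject a perplexity that is not smaller than the number of samples, and the n_iter argument has been renamed in newer versions. The sketch below therefore uses a small perplexity and leaves the iteration count at its default. The values perplexity=5 and random_state=0 are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Same tf-idf matrix as in the question (11 samples, 5 features).
dense = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.70710678, 0.70710678, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.70710678, 0.0, 0.70710678],
    [0.70710678, 0.70710678, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.70710678, 0.0, 0.70710678],
])

# Assumption: perplexity=5 is an arbitrary value below n_samples (11).
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
results = tsne.fit_transform(dense)

# One 2-D point per input row.
print(results.shape)
```

If this runs but the original call does not, the perplexity setting is a more likely culprit than the installation.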

Related

How do I implement separator in a dataset loaded directly from sklearn library?

I know how to use a separator (sep="") when importing a dataset with pd.read_csv,
but I don't know how to apply a separator to a dataset loaded from sklearn itself, like the digits dataset I used below, where I want to apply the \n separator.
code:
from sklearn.datasets import load_digits
import pandas as pd
df = load_digits()
print(df)
If you look carefully, you'll see that load_digits() returns a dictionary-like object (a Bunch). You can list its keys with
df.keys()
which returns
dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
So, if you want to get the data, just call the data key
df['data']
returns
[[ 0. 0. 5. ... 0. 0. 0.]
[ 0. 0. 0. ... 10. 0. 0.]
[ 0. 0. 0. ... 16. 9. 0.]
...
[ 0. 0. 1. ... 6. 0. 0.]
[ 0. 0. 2. ... 12. 0. 0.]
[ 0. 0. 10. ... 12. 1. 0.]]
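If the goal is a table that can be written out with a chosen separator, one option is to wrap the array in a pandas DataFrame first. This is a minimal sketch, not the only way to do it; the tab separator and the variable name digits_df are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()

# Build a DataFrame from the 'data' and 'feature_names' keys listed above.
digits_df = pd.DataFrame(digits["data"], columns=digits["feature_names"])
digits_df["target"] = digits["target"]

# Assumption: a tab is the separator you want; any string accepted by
# DataFrame.to_csv(sep=...) can be used instead.
text = digits_df.head(3).to_csv(sep="\t", index=False)
print(text)
```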

model.predict classes don't match dataset classes

I'm writing a CNN classifier, using Keras, that is supposed to classify a set of 40k+ pictures of road signs into one of 43 classes. Everything is fine until I try to find out what mistakes the model has made while classifying unseen data. It appears that the classes in the output file are mismatched with the classes from the dataset, and I don't know how to determine which class is which. The problem is better explained at the end of the question.
The batch size is 64. The output file is very large, but it has a structure as follows:
[[3.81430182e-05 3.55855487e-02 3.77756208e-02 ... 3.93179851e-03 4.57952236e-04 1.19631949e-07]
[2.46175125e-09 8.71188703e-08 9.04489157e-12 ... 7.63094476e-08 2.24849509e-06 9.93708588e-13]
...
[1.31991830e-13 1.99924495e-12 7.65954244e-10 ... 1.51650678e-13 1.77550303e-14 9.25261628e-16]]
-
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
...
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
This is the output for one batch, there are 198 such batches in total. First there are 64 rows with 43 values each representing the output of the neural network. Then there are 64 rows with 43 values each, that represent which class is the correct classification.
In the test set, the classes are denoted by a folder structure as follows:
Test_New/0
00245.png
00252.png
00403.png
...
Test_New/1
00001.png
00024.png
00076.png
...
...
Test_New/42
00315.png
00507.png
00755.png
...
The problem is, that the classes from the file don't match up with the classes from the output file! In other words, I would expect that this in the output file:
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Would mean that the correct classification for this particular image was the third class, because the 1 is in the 3rd spot. But this is not the case. How do I know? Because I know that there are exactly 750 files in the "Test_New/2" folder which represents the third class, but when I use the find function in notepad++ to find all instances of the
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
line, it returns a number of 660. That means that there are 660 instances of this line in the file, and that means that it cannot represent the third class. In fact, it represents the 11th class because it's the only one with this many files in it. This wouldn't be a problem if all the folders had a different number of files, but unfortunately some of them share the same number of files.
My question is why are the output classes shuffled in the output file, and how do I fix this? How do I know which class is which? If you don't know, do you know if there is a different way to know which images were wrongly classified? Please help, I've been pulling my hair out for the last 3 hours or so. I'm sorry that there is so much code, I just don't know where the error is. Thank you!
In your test and validation generators, set shuffle=False.
In model.fit, do not specify steps_per_epoch or validation_steps; let model.fit determine these values internally. One thing you must remember is that functions like flow_from_directory process the file names in alphanumeric order. For example, if a directory holds files named 1.jpg, 2.jpg, ..., 9.jpg, 10.jpg, ..., the files are processed in the order 1.jpg, 10.jpg, 11.jpg, ..., 19.jpg, 2.jpg, ... So if you expect the order to be strictly numerical, it is not. Below is the code for a function that detects misclassified test files and prints the file name, true class, predicted class and prediction probability for each of them.
import numpy as np

def print_info( test_dir, test_gen, preds, print_code ):
    # test_dir is the full path to the directory containing the test images
    # test_gen is the name of your test generator
    # preds are the prediction from preds=model.predict
    # print code is an integer specifying the maximum number of error files you want to print out
    class_dict=test_gen.class_indices
    labels= test_gen.labels
    file_names= test_gen.filenames
    error_list=[]
    true_class=[]
    pred_class=[]
    prob_list=[]
    new_dict={}
    error_indices=[]
    y_pred=[]
    for key,value in class_dict.items():
        new_dict[value]=key # dictionary {integer of class number: string of class name}
    classes=list(new_dict.values()) # list of string of class names
    errors=0
    for i, p in enumerate(preds):
        pred_index=np.argmax(p)
        true_index=labels[i] # labels are integer values
        if pred_index != true_index: # a misclassification has occurred
            error_list.append(file_names[i])
            true_class.append(new_dict[true_index])
            pred_class.append(new_dict[pred_index])
            prob_list.append(p[pred_index])
            error_indices.append(true_index)
            errors=errors + 1
        y_pred.append(pred_index)
    if print_code !=0:
        if errors>0:
            if print_code>errors:
                r=errors
            else:
                r=print_code
            msg='{0:^28s}{1:^28s}{2:^28s}{3:^16s}'.format('Filename', 'Predicted Class' , 'True Class', 'Probability')
            print(msg)
            for i in range(r):
                msg='{0:^28s}{1:^28s}{2:^28s}{3:4s}{4:^6.4f}'.format(error_list[i], pred_class[i],true_class[i], ' ', prob_list[i])
                print(msg)
        else:
            msg='With accuracy of 100 % there are no errors to print'
            print(msg)
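The alphanumeric ordering also explains the mismatch in the question: the class folders 0 to 42 are sorted as strings, so the one-hot position is not the folder number. The sketch below imitates that ordering in plain Python, without Keras. It assumes the generator assigns indices by sorting the subdirectory names as strings; on a real generator, test_gen.class_indices is the mapping to trust. The names class_names and index_to_class are made up for this illustration.

```python
# Folder names as in the question: Test_New/0 ... Test_New/42.
class_names = [str(i) for i in range(43)]

# Assumption: indices are assigned by sorting the folder names as strings,
# which mirrors the alphanumeric order described above.
class_indices = {name: idx for idx, name in enumerate(sorted(class_names))}

# Invert the mapping: one-hot position -> folder name.
index_to_class = {idx: name for name, idx in class_indices.items()}

print(index_to_class[2])   # folder whose one-hot vector has the 1 in the 3rd spot -> 10
print(class_indices["2"])  # one-hot position of folder "2" -> 12
```

This matches the observation in the question that the vector with a 1 in the third position belongs to the 11th class (folder 10), not to folder 2.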

Prohibit automatic linebreaks in Pycharm Output when using large Matrices

I'm working in PyCharm on Windows. The project I'm currently working on has "large" matrices, but when I print them PyCharm automatically adds line breaks, so that one row occupies two lines instead of just one:
[[ 3. -1.73205081 0. 0. 0. 0. 0.
0. 0. 0. ]
[-1.73205081 1. -1. -2. 0. 0. 0.
0. 0. 0. ]
[ 0. -1. 1. 0. -1.41421356 0. 0.
0. 0. 0. ]
[ 0. -2. 0. 1. -1.41421356 0.
-1.73205081 0. 0. 0. ]
[ 0. 0. -1.41421356 -1.41421356 0. -1.41421356
0. -1.41421356 0. 0. ]
[ 0. 0. 0. 0. -1.41421356 0. 0.
0. -1. 0. ]
[ 0. 0. 0. -1.73205081 0. 0. 3.
-1.73205081 0. 0. ]
[ 0. 0. 0. 0. -1.41421356 0.
-1.73205081 1. -2. 0. ]
[ 0. 0. 0. 0. 0. -1. 0.
-2. 0. -1.73205081]
[ 0. 0. 0. 0. 0. 0. 0.
0. -1.73205081 0. ]]
This makes my results very hard to read and compare. The window is big enough to display everything, but it still breaks the rows. Is there any setting to prevent this?
Thanks in advance!
PyCharm default console width is set to 80 characters.
Lines are printed without wrapping unless you set soft wrap in options:
File -> Settings -> Editor -> General -> Console -> Use soft wraps in console.
However both options make reading big matrices hard.
You can fix this in a few ways.
With this test code:
import random
m = [[random.random() for a in range(10)] for b in range(10)]
print(m)
You can try one of these:
Pretty print
Use pprint module, and override line width:
import pprint
pprint.pprint(m, width=300)
Numpy
For numpy version 1.13 and lower:
If you use numpy module, configure arrayprint option:
import numpy
numpy.core.arrayprint._line_width = 300
print(numpy.matrix(m))
For numpy version 1.14 and above (thanks to @Alex Johnson):
import numpy
numpy.set_printoptions(linewidth=300)
print(numpy.matrix(m))
Pandas
If you use pandas module, configure display.width option:
import pandas
pandas.set_option('display.width', 300)
print(pandas.DataFrame(m))
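If you would rather not change a global print setting, numpy can also format a single array with a wider line limit. This is a small sketch using numpy.array2string; the width of 300 and the sample matrix are assumptions for illustration.

```python
import numpy as np

# A 10x10 matrix with long decimal entries, similar in width to the question's.
m = np.sqrt(np.arange(100, dtype=float)).reshape(10, 10)

# max_line_width applies to this call only; global print options are untouched.
wide = np.array2string(m, max_line_width=300)
print(wide)

# For comparison: with a 75-character limit each row spills onto extra lines.
narrow = np.array2string(m, max_line_width=75)
print(len(wide.splitlines()), len(narrow.splitlines()))
```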

How do I change column type in Python from int to object for sklearn?

I am really new to Python and scikit-learn (sklearn), and I am trying to load a dataset that consists of 7 attribute columns and 1 column with the classification (the class/target). One attribute holds the values [1,2,3,4,5], which mark the stage of something, so it is nominal rather than numeric. Python, of course, reads it as numerical data (int64), when I want it to be treated as nominal data (object). How do I change the column type to nominal?
I have done the following.
print(data.dtypes)
data["col_name"]=data["col_name"].astype(object)
print(data.dtypes)
The first print still shows data["col_name"] as int64, and after the astype line it has changed to object. But that makes no difference to the data: when I create a histogram with matplotlib, it still treats both X and Y as numbers instead of objects.
Also I have read about the One Hot Encoding and Label Encoding on the documentation, but I figured they are not what I need in my case. I wonder if I have misunderstood something or maybe there's another solution.
Thanks
Read through the sklearn documentation; the package is thoroughly documented. In particular, see the Preprocessing section on encoding categorical features.
In regards to keeping categorical features represented in an array of integers, ie [1,2,3,4,5], we have this:
Such integer representation can not be used directly with scikit-learn
estimators, as these expect continuous input, and would interpret the
categories as being ordered, which is often not desired (i.e. the set
of browsers was ordered arbitrarily). One possibility to convert
categorical features to features that can be used with scikit-learn
estimators is to use a one-of-K or one-hot encoding, which is
implemented in OneHotEncoder. This estimator transforms each
categorical feature with m possible values into m binary features,
with only one active.
So what you can do is convert your column into 5 new columns (5 in this case, since you have 5 possible values) using one-hot encoding.
Here is some working code. The input is a column of categorical values [1,2,3,4,5]; the output is a matrix with 5 columns, one for each of the 5 possible choices:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit([[1],[2],[3],[4],[5]])
print(enc.transform([[1],[2],[3],[4],[5]]).toarray())
Output:
[[ 1. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]]
Say your categorical parameters were in this order: [1,3,2,5,4,3,2,1,3,4,2]. You would get this output:
[[ 1. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 1.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 1. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 1. 0. 0. 0.]]
So this 1 column will convert into 5 columns.
print(data.dtypes)
data["col_name"]=data["col_name"].astype(str)
print(data.dtypes)
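If you prefer to stay in pandas, a similar result can be sketched with the category dtype and pandas.get_dummies. This is an alternative to the OneHotEncoder route, not a replacement for it; the column name col_name comes from the question, while the sample values and the stage prefix are assumptions for illustration.

```python
import pandas as pd

# Sample stage values, reusing the order from the example above.
data = pd.DataFrame({"col_name": [1, 3, 2, 5, 4, 3, 2, 1, 3, 4, 2]})

# Mark the column as categorical so pandas stops treating it as a number.
data["col_name"] = data["col_name"].astype("category")
print(data.dtypes)

# One indicator column per stage; "stage" is an arbitrary prefix.
one_hot = pd.get_dummies(data["col_name"], prefix="stage")
print(one_hot.astype(int))
```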

Sklearn digits dataset

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
digits = datasets.load_digits()
print(digits.data)
classifier = svm.SVC(gamma=0.4, C=100)
x, y = digits.data[:-1], digits.target[:-1]
x = x.reshape(1,-1)
y = y.reshape(-1,1)
print((x))
classifier.fit(x, y)
###
print('Prediction:', classifier.predict(digits.data[-3]))
###
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
I have reshaped the x and y as well. Still I'm getting an error saying :
Found input variables with inconsistent numbers of samples: [1, 1796]
Y has 1-d array with 1796 elements whereas x has many. How does it show 1 for x?
Actually scrap what I suggested below:
This link describes the general dataset API. The attribute data is a 2d array of each image, already flattened:
import sklearn.datasets
digits = sklearn.datasets.load_digits()
digits.data.shape
#: (1797, 64)
This is all you need to provide, no reshaping required. Similarly, the attribute target is a 1d array of each label:
digits.target.shape
#: (1797,)
No reshaping necessary. Just split into training and testing and run with it.
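A minimal sketch of that split, assuming scikit-learn's train_test_split with a 25% test share. The value gamma=0.001 is borrowed from scikit-learn's own digits example and is an assumption here; it is not the gamma=0.4 used in the question.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# data is already (n_samples, 64) and target is (n_samples,): no reshape needed.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Assumption: gamma=0.001 instead of the question's 0.4.
classifier = svm.SVC(gamma=0.001, C=100)
classifier.fit(X_train, y_train)

# Both arrays passed to fit share the same first dimension.
print(X_train.shape, y_train.shape)

accuracy = classifier.score(X_test, y_test)
print(accuracy)
```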
Try printing x.shape and y.shape. I suspect you will find that x has a single row, something like (1, ...), while y has 1796 rows. When you call fit on a scikit-learn classifier, x and y must have the same number of samples, that is, the same first dimension.
The clue, why are the arguments when reshaping different ways around:
x = x.reshape(1, -1)
y = y.reshape(-1, 1)
Maybe try:
x = x.reshape(-1, 1)
Completely unrelated to your question, but you're predicting on digits.data[-3] when the only element left out of the training set is digits.data[-1]. Not sure if that was intentional.
Regardless, it could be good to check your classifier over more results using the scikit metrics package. This page has an example of using it over the digits dataset.
The reshaping will transform each 8x8 matrix into a 1-dimensional vector, which can be used as a feature vector. You need to reshape the entire X array, not only the training data, since the samples you use for prediction need to have the same format.
The following code shows how:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn import svm
digits = datasets.load_digits()
classifier = svm.SVC(gamma=0.4, C=100)
x, y = digits.images, digits.target
# only reshape x, since each image is an 8x8 matrix and needs to be flattened
n_samples = len(digits.images)
x = x.reshape((n_samples, -1))
print("before reshape:" + str(digits.images[0]))
print("After reshape" + str(x[0]))
classifier.fit(x[:-2], y[:-2])
###
print('Prediction:', classifier.predict(x[-2].reshape(1, -1)))  # predict expects a 2D array
###
plt.imshow(digits.images[-2], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
###
print('Prediction:', classifier.predict(x[-1].reshape(1, -1)))  # predict expects a 2D array
###
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
It will output:
before reshape:[[ 0. 0. 5. 13. 9. 1. 0. 0.]
[ 0. 0. 13. 15. 10. 15. 5. 0.]
[ 0. 3. 15. 2. 0. 11. 8. 0.]
[ 0. 4. 12. 0. 0. 8. 8. 0.]
[ 0. 5. 8. 0. 0. 9. 8. 0.]
[ 0. 4. 11. 0. 1. 12. 7. 0.]
[ 0. 2. 14. 5. 10. 12. 0. 0.]
[ 0. 0. 6. 13. 10. 0. 0. 0.]]
After reshape[ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5.
0. 0. 3. 15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8.
8. 0. 0. 5. 8. 0. 0. 9. 8. 0. 0. 4. 11. 0. 1.
12. 7. 0. 0. 2. 14. 5. 10. 12. 0. 0. 0. 0. 6. 13.
10. 0. 0. 0.]
And a correct prediction for the last 2 images, which weren't used for training - you can decide however to make a bigger split between testing and training set.
