Error using Sklearn in a for loop - python

I am running Python 3, and when I attempt to run this code:
from sklearn.preprocessing import LabelEncoder
cv = train.dtypes.loc[train.dtypes == 'object'].index
print(cv)
le = LabelEncoder()
for i in cv:
    train[i] = le.fit_transform(train[i])
    test[i] = le.fit_transform(test[i])
However, I get this error:
le=LabelEncoder()
for i in cv:
    train[i]=le.fit_transform(train[i])
    test[i]=le.fit_transform(test[i])
Traceback (most recent call last):
File "<ipython-input-5-8739984f61b2>", line 3, in <module>
train[i]=le.fit_transform(train[i])
File "C:\Users\myname\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 127, in fit_transform
self.classes_, y = np.unique(y, return_inverse=True)
File "C:\Users\myname\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py", line 195, in unique
perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
TypeError: unorderable types: str() > float()
Oddly enough, if I call the encoder on a specified column in my data, the output is successful. For instance:
le.fit_transform(test['Race'])
Results in:
le.fit_transform(test['Race'])
Out[7]: array([2, 4, 4, ..., 4, 1, 4], dtype=int64)
I've tried:
float(le.fit_transform(train[i]))
str(le.fit_transform(train[i]))
Neither worked.
Could someone please help me out?
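A likely cause, assuming the object columns contain missing values: pandas stores a missing entry as NaN, which is a float, so a column ends up mixing str and float values. LabelEncoder sorts the values (the np.unique call in the traceback), and Python 3 refuses to order a str against a float. The sketch below reproduces that comparison failure with plain Python; the column values are hypothetical and not taken from the question's data.

```python
# Hypothetical column values: two labels plus a missing entry stored as float NaN.
column = ["White", float("nan"), "Black"]

try:
    # Sorting needs str-vs-float comparisons, which Python 3 does not allow.
    sorted(column)
    failed = False
except TypeError:
    failed = True

# One possible workaround: turn every value into a string before encoding.
as_text = [str(value) for value in column]
ordered = sorted(as_text)

print(failed)
print(ordered)
```

With pandas, the equivalent would be something like train[i].astype(str) or train[i].fillna('missing') before calling fit_transform. That would also explain why test['Race'] works on its own if that particular column has no missing values.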

Related

Relative difference in numpy.testing.assert_allclose

I cannot work out how numpy.testing.assert_allclose calculates the relative difference between two arrays. Is it expressed as a percentage or as a plain ratio? For example, if I have two arrays
import numpy as np
gfg1 = [1, 2, 3]
gfg2 = np.array([4, 8, 9])
np.testing.assert_allclose(gfg1, gfg2)
the following error occurs:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/anaconda3/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 1515, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/home/anaconda3/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 841, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0
Mismatch: 100%
Max absolute difference: 6
Max relative difference: 0.75
I understand the max absolute difference, but how is the relative difference computed?
If you look at the source code of assert_allclose, you will see that it calls assert_array_compare. Inside assert_array_compare, the maximum relative difference is calculated as max(error[nonzero] / abs(y[nonzero])), where error is abs(x - y). It is a plain ratio relative to the second array, not a percentage.
So, in your case, for x = np.array([1, 2, 3]) and y = np.array([4, 8, 9]), you get
max_rel_error == max(|1-4|/|4|, |2-8|/|8|, |3-9|/|9|) == 0.75
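The same numbers can be reproduced by hand. This is a minimal sketch of the formula quoted above, not a copy of NumPy's internal code; the variable names are chosen here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3], dtype=float)   # "actual" values
y = np.array([4, 8, 9], dtype=float)   # "desired" values

error = np.abs(x - y)                  # element-wise absolute difference
nonzero = y != 0                       # skip entries where dividing by y is impossible

max_abs = error.max()
max_rel = (error[nonzero] / np.abs(y[nonzero])).max()

print(max_abs)
print(max_rel)

# assert_allclose passes when |x - y| <= atol + rtol * |y| holds element-wise,
# so a relative tolerance above the maximum ratio lets the check succeed.
np.testing.assert_allclose(x, y, rtol=0.8)
```

The ratio is taken against the second argument, so swapping the two arrays changes the reported relative difference.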

Autograd breaks np.empty_like

I'm trying to take the gradient of a function in which I assign numpy array elements individually (assigning local forces to a global force vector in an FEA), but this appears to break Autograd -- if I use np.zeros for the global array I get ValueError: setting an array element with a sequence, while if I use np.empty I get NotImplementedError: VJP of empty_like wrt argnums (0,) not defined.
Example:
import autograd.numpy as np
from autograd import jacobian, grad
def test(input):
    a = np.empty_like(input)
    a[:] = input[:]
grad(test)(np.array([0.]))
Gives the error:
C:\Miniconda3\python.exe C:/Users/JoshuaF/Desktop/gripper/softDrone/bug_test.py
Traceback (most recent call last):
File "C:\Miniconda3\lib\site-packages\autograd\core.py", line 31, in __init__
vjpmaker = primitive_vjps[fun]
KeyError: <function primitive.<locals>.f_wrapped at 0x000001AB1D0AA8C8>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/JoshuaF/Desktop/gripper/softDrone/bug_test.py", line 8, in <module>
grad(test)(np.array([0.]))
File "C:\Miniconda3\lib\site-packages\autograd\wrap_util.py", line 20, in nary_f
return unary_operator(unary_f, x, *nary_op_args, **nary_op_kwargs)
File "C:\Miniconda3\lib\site-packages\autograd\differential_operators.py", line 25, in grad
vjp, ans = _make_vjp(fun, x)
File "C:\Miniconda3\lib\site-packages\autograd\core.py", line 10, in make_vjp
end_value, end_node = trace(start_node, fun, x)
File "C:\Miniconda3\lib\site-packages\autograd\tracer.py", line 10, in trace
end_box = fun(start_box)
File "C:\Miniconda3\lib\site-packages\autograd\wrap_util.py", line 15, in unary_f
return fun(*subargs, **kwargs)
File "C:/Users/JoshuaF/Desktop/gripper/softDrone/bug_test.py", line 5, in test
a = np.empty_like(input)
File "C:\Miniconda3\lib\site-packages\autograd\tracer.py", line 45, in f_wrapped
node = node_constructor(ans, f_wrapped, argvals, kwargs, argnums, parents)
File "C:\Miniconda3\lib\site-packages\autograd\core.py", line 35, in __init__
.format(fun_name, parent_argnums))
NotImplementedError: VJP of empty_like wrt argnums (0,) not defined
Is there any way to use Autograd on a numpy array which is assembled element-wise?
Based on the tutorial https://github.com/HIPS/autograd/blob/master/docs/tutorial.md, it looks like array assignment is unfortunately not supported in autograd functions.

Numpy giving error when not imported.

So I'm trying out machine learning, and following a tutorial I found online.
For some reason, when I run my code, numpy gives me an error even though I am not importing that library. (I've been having problems with numpy.)
Code:
#!/usr/bin/env python
from sklearn import tree
#1 = smooth 0 = bumpy
features = [[140, 1], [130, 1], [150, 0], [170, 0]] #input
labels = ["apple", "apple", "orange", "orange"] #desired output
#0 = apple 1 = orange
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
print clf.predict([[160, 0]])
Error:
C:\Windows\system32\cmd.exe /c (python ^<C:\Users\me\AppData\Local\Temp\22\VIi532A.tmp)
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "E:\Python27\lib\site-packages\sklearn\__init__.py", line 134, in <module>
from .base import clone
File "E:\Python27\lib\site-packages\sklearn\base.py", line 9, in <module>
import numpy as np
File "E:\Python27\lib\site-packages\numpy\__init__.py", line 142, in <module>
from . import add_newdocs
File "E:\Python27\lib\site-packages\numpy\add_newdocs.py", line 13, in <module>
from numpy.lib import add_newdoc
File "E:\Python27\lib\site-packages\numpy\lib\__init__.py", line 8, in <module>
from .type_check import *
File "E:\Python27\lib\site-packages\numpy\lib\type_check.py", line 11, in <module>
import numpy.core.numeric as _nx
File "E:\Python27\lib\site-packages\numpy\core\__init__.py", line 21, in <module>
from . import function_base
File "E:\Python27\lib\site-packages\numpy\core\function_base.py", line 7, in <module>
from .numeric import (result_type, NaN, shares_memory, MAY_SHARE_BOUNDS,
ImportError: cannot import name shares_memory
shell returned 1
Hit any key to close this window...
Thanks
P.S.
I'm also looking for a couple of tutorial suggestions; one covering machine learning and NLP would be great.
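NumPy is a scikit-learn dependency: sklearn is built on top of NumPy and imports it internally, which is why a NumPy error appears even though your script never imports it directly.
Creating a fresh virtualenv and reinstalling both packages there is a good way to find out whether the problem lies in your installation.
The same code worked for me, and I can tell you the prediction is "orange". :P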
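Before rebuilding the environment, it can help to check which NumPy the interpreter actually picks up. The sketch below is a small diagnostic under that assumption; describe_numpy is a hypothetical helper name, not part of NumPy or scikit-learn. If the installation is broken in the way the traceback suggests, the import numpy line alone should reproduce the ImportError, which confirms the problem is not in the sklearn code.

```python
import sys

import numpy


def describe_numpy():
    """Collect basic facts about the NumPy installation in use (hypothetical helper)."""
    return {
        "python": sys.executable,        # interpreter that is running
        "version": numpy.__version__,    # installed NumPy version
        "location": numpy.__file__,      # where NumPy was imported from
    }


info = describe_numpy()
for key, value in info.items():
    print(key, value)
```

If the reported location is not the site-packages folder you expect, two installations may be shadowing each other.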

python-Cannot call a function in script but can in the interactive mode

It's a simple task about kNN, and I'm new to Python.
# coding=utf-8
from numpy import *
import operator
def createDataSet():
    group = array([[112, 110], [128, 162], [83, 206], [142, 267], [188, 184], [218, 268], [234, 108], [256, 146], [
        333, 177], [350, 86], [364, 237], [378, 117], [409, 147], [485, 130], [326, 344], [387, 326], [378, 435], [434, 375]])
    labels = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3]
    return group, labels
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    tempSet = array(tile(inX, (dataSetSize, 1)))
    diffMat = tempSet - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndices = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndices[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
# TRY1
# def with_intput():
#     sample = array(raw_input('Enter you data:'))
#     group, labels = createDataSet()
#     sampleClass = classify0(sample, group, labels, 3)
#     print sampleClass
# with_intput()
# TRY1
# TRY2
# sample = array(raw_input('Enter your sample data:'))
# group, labels = createDataSet()
# sampleClass = classify0(sample, group, labels, 3)
# print sampleClass
# TRY2
Something really strange is happening. I created a function named classify0(), but if I call it from inside the script (uncommenting #TRY1) or use it in an assignment (uncommenting #TRY2), I get an error when I run the file.
It looks like this:
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('S11') dtype('S11') dtype('S11')
Here is the traceback of TRY1:
Traceback (most recent call last):
File "C:\Users\zhongzheng\Desktop\ML_Code\temp2.py", line 39, in <module>
with_intput()
File "C:\Users\zhongzheng\Desktop\ML_Code\temp2.py", line 36, in with_intput
sampleClass = classify0(sample, group, labels, 3)
File "C:\Users\zhongzheng\Desktop\ML_Code\temp2.py", line 17, in classify0
diffMat = tempSet - dataSet
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('S11') dtype('S11') dtype('S11')
And the traceback of TRY2:
Traceback (most recent call last):
File "C:\Users\zhongzheng\Desktop\ML_Code\temp2.py", line 46, in <module>
sampleClass = classify0(sample, group, labels, 3)
File "C:\Users\zhongzheng\Desktop\ML_Code\temp2.py", line 17, in classify0
diffMat = tempSet - dataSet
TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('S11') dtype('S11') dtype('S11')
But if I leave both TRY1 and TRY2 commented out, save and run the file with only the two functions in it, and then enter these commands line by line in interactive mode in cmd or ipython:
>>>group,labels = createDataSet()
>>>sampleClass = classify0(array([111,111]), group, labels, 3)
>>>print sampleClass
It works just fine.
I cannot figure out why.
One more question: why does my Sublime Text 3 (with SublimeLinter and the pep8 linter installed) keep warning that from numpy import *, import numpy, or import numpy as np is wrong?
Thanks for your patience.
raw_input does not return what classify0 expects as input.
sample = array(raw_input('Enter you data:'))
This wraps the raw string in an array, giving something like ["111 111"], which is a string array rather than numbers (hence the string dtype in the error).
sample = [int(x) for x in raw_input().split()]
This gives [111, 111].
You could also change the delimiter passed to split, e.g. split(',') if the input is comma separated.
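The difference between the two forms can be checked without typing anything at a prompt. The sketch below is written for Python 3 (where raw_input is called input) and uses a hypothetical helper, parse_sample, that is not part of the original script; it only illustrates the parsing step.

```python
import numpy as np


def parse_sample(text, sep=None):
    """Turn a line such as '111 111' into a numeric array (hypothetical helper).

    sep=None splits on any whitespace; pass ',' for comma-separated input.
    """
    return np.array([int(part) for part in text.split(sep)])


raw = "111 111"

as_string = np.array(raw)        # what array(raw_input(...)) effectively builds
as_numbers = parse_sample(raw)   # what classify0 needs

print(as_string.dtype.kind)      # 'U': a string array, so subtraction is undefined
print(as_numbers)                # numeric values that can be subtracted
print(parse_sample("111,111", ","))
```

In the script, the result of parse_sample(raw_input('Enter your sample data:')) could then be passed to classify0 in place of the string array.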

Using a custom metric with sklearn.neighbors.BallTree gives wrong input?

I'm trying to use a custom metric with sklearn.neighbors.BallTree, but when it calls my metric the inputs do not look correct. If I use scipy.spatial.distance.pdist with the same custom metric, it works as expected. However, if I try to instantiate a BallTree, an exception is raised when I try to reshape the input. If I look at the actual inputs, the shape and values do not look correct.
import numpy as np
import scipy.spatial.distance as spdist
import sklearn.neighbors.ball_tree as ball_tree
# custom metric
def minimum_average_direct_flip(x, y):
    x = np.reshape(x, (-1, 3))
    y = np.reshape(y, (-1, 3))
    direct = np.mean(np.sqrt(np.sum(np.square(x - y), axis=1)))
    flipped = np.mean(np.sqrt(np.sum(np.square(np.flipud(x) - y), axis=1)))
    return min(direct, flipped)
# create an X to test
X = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9], [11, 12, 13, 14, 15, 16, 17, 18, 19], [21, 22, 23, 24, 25, 26, 27, 28, 29]])
# works as expected
distances = spdist.pdist(X, metric=minimum_average_direct_flip)
# outputs: [ 17.32050808 34.64101615 17.32050808]
print distances
# raises exception, inputs to minimum_average_direct_flip look wrong
# Traceback (most recent call last):
# File ".../test_script.py", line 23, in <module>
# ball_tree.BallTree(X, metric=minimum_average_direct_flip)
# File "sklearn/neighbors/binary_tree.pxi", line 1059, in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn\neighbors\ball_tree.c:8381)
# File "sklearn/neighbors/dist_metrics.pyx", line 262, in sklearn.neighbors.dist_metrics.DistanceMetric.get_metric (sklearn\neighbors\dist_metrics.c:4032)
# File "sklearn/neighbors/dist_metrics.pyx", line 1091, in sklearn.neighbors.dist_metrics.PyFuncDistance.__init__ (sklearn\neighbors\dist_metrics.c:10586)
# File "C:/Users/danrs/Documents/neuro_atlas/test_script.py", line 8, in minimum_average_direct_flip
# x = np.reshape(x, (-1, 3))
# File "C:\Anaconda2\lib\site-packages\numpy\core\fromnumeric.py", line 225, in reshape
# return reshape(newshape, order=order)
# ValueError: total size of new array must be unchanged
ball_tree.BallTree(X, metric=minimum_average_direct_flip)
In the first call to minimum_average_direct_flip from the BallTree code, the inputs are:
x = [ 0.4238394 0.55205233 0.04699435 0.19542642 0.20331665 0.44594837 0.35634537 0.8200018 0.28598294 0.34236847]
y = [ 0.4238394 0.55205233 0.04699435 0.19542642 0.20331665 0.44594837 0.35634537 0.8200018 0.28598294 0.34236847]
These look completely incorrect. Am I doing something wrong in the way I am calling this or is this a bug in sklearn?
It seems that this is a known issue:
https://github.com/scikit-learn/scikit-learn/issues/6287
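When the metric is set up (PyFuncDistance.__init__ in the traceback above), sklearn calls it once on its own test vectors as a validation step, and that call is what breaks here. As a workaround I guess I can add a check on the input size, but as the issue notes this is undesirable because it means I can't do real validation of the inputs myself.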
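A minimal sketch of that size-check workaround, assuming the validation call always passes vectors whose length is not a multiple of 3 (as with the 10-element inputs shown above). Falling back to the plain Euclidean distance in that case is a choice made here for illustration, not something the issue prescribes.

```python
import numpy as np


def minimum_average_direct_flip(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: inputs that cannot be reshaped to (-1, 3) come from the
    # library's validation call, so return some valid distance instead of failing.
    if x.size % 3 != 0 or y.size % 3 != 0:
        return float(np.linalg.norm(x - y))
    x = x.reshape(-1, 3)
    y = y.reshape(-1, 3)
    direct = np.mean(np.sqrt(np.sum(np.square(x - y), axis=1)))
    flipped = np.mean(np.sqrt(np.sum(np.square(np.flipud(x) - y), axis=1)))
    return min(direct, flipped)


X = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9],
              [11, 12, 13, 14, 15, 16, 17, 18, 19],
              [21, 22, 23, 24, 25, 26, 27, 28, 29]])

print(minimum_average_direct_flip(X[0], X[1]))            # real 9-element rows
print(minimum_average_direct_flip(np.ones(10), np.ones(10)))  # validation-sized input
```

The drawback mentioned above still applies: a genuinely malformed input of the wrong length is silently accepted instead of raising an error.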
