How to use sklearn FeatureHasher? - python

I have a dataframe like this:
import pandas as pd
test = pd.DataFrame({'type': ['a', 'b', 'a', 'c', 'b'], 'model': ['bab', 'ba', 'ba', 'ce', 'bw']})
How do I use the sklearn FeatureHasher on it?
I tried:
from sklearn.feature_extraction import FeatureHasher
FH = FeatureHasher()
train = FH.transform(test.type)
but it doesn't like it; it seems to want a string or a list, so I tried:
FH.transform(test.to_dict(orient='list'))
but that doesn't work either. I get:
AttributeError: 'str' object has no attribute 'items'
Thanks.

You need to specify the input type when initializing your instance of FeatureHasher:
In [1]:
from sklearn.feature_extraction import FeatureHasher
h = FeatureHasher(n_features=5, input_type='string')
f = h.transform(test.type)
f.toarray()
Out[1]:
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0., -1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0., -1.,  0.,  0.],
       [ 0., -1.,  0.,  0.,  0.]])
Note that this assumes the value of each of these features is 1, according to the documentation:
input_type : string, optional, default "dict"
Either "dict" (the default) to accept dictionaries over (feature_name, value); "pair" to accept pairs of (feature_name, value); or "string" to accept single strings. feature_name should be a string, while value should be a number. In the case of "string", a value of 1 is implied. The feature_name is hashed to find the appropriate column for the feature. The value's sign might be flipped in the output (but see non_negative, below).
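If you want to hash both columns of the frame at once, one possibility (a sketch, not the only way) is to encode each row as a list of "column=value" strings and keep input_type='string':

# A sketch assuming the `test` DataFrame from the question; each row becomes
# a list of "column=value" strings, so both columns are hashed together.
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=8, input_type='string')
f = h.transform(['%s=%s' % (col, val) for col, val in row.items()]
                for _, row in test.iterrows())
f.toarray()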

unable to get length of items in .npy file

I have a .npy file here.
It's just a file with an object that is a dictionary of images and their labels. For example:
{
'2007_002760': array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,0., 0., 0.], dtype=float32),
'2008_004036': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.,0., 0., 0.], dtype=float32)
}
I want to open the file, get its length, and then possibly add to it or modify it.
I am able to open the file, but I can't get the length of the items in it.
Here's how I open it:
import numpy as np
file = np.load('cls_labels.npy', allow_pickle = True)
print(file.size)
What am I missing here?
Your file contains a dictionary wrapped inside a 0-dimensional numpy array. The magic to extract the actual information is:
my_dictionary = file[()]
This is a standard dictionary whose keys are strings like '2008_004036' and whose values are numpy arrays.
Edit: And as mentioned above, you shouldn't be saving dictionaries with numpy.save(); you should have been using pickle. Otherwise you end up with horrors like file[()].
Here is the correct and easiest way to do it:
cls_labels = np.load('cls_labels.npy', allow_pickle = True).item()
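From there, length, modification, and re-saving all work on the plain dict (a sketch; assumes the cls_labels.npy file saved above, and the added key is hypothetical):

import numpy as np

cls_labels = np.load('cls_labels.npy', allow_pickle=True).item()
print(len(cls_labels))                                      # number of entries

cls_labels['2009_000001'] = np.zeros(20, dtype=np.float32)  # add an entry (made-up key)
np.save('cls_labels.npy', cls_labels)                       # write the dict back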

How to avoid two variables referring to the same data? #PyTorch

During initialization, I tried to reduce the repetition in my code, so instead of:
output = (torch.zeros(2, 3),
          torch.zeros(2, 3))
I wrote:
z = torch.zeros(2, 3)
output = (z, z)
However, I found that the second method is wrong: if I unpack the tuple into variables h, c, any change to h is also applied to c:
h, c = output
print(h, c)
h += torch.ones(2, 3)
print('-----------------')
print(h, c)
Results of the test above:
tensor([[0., 0., 0.],
        [0., 0., 0.]]) tensor([[0., 0., 0.],
        [0., 0., 0.]])
-----------------
tensor([[1., 1., 1.],
        [1., 1., 1.]]) tensor([[1., 1., 1.],
        [1., 1., 1.]])
Is there a more elegant way to initialize two independent variables?
I agree that your initial line needs no modification, but if you do want an alternative, consider:
z = torch.zeros(2, 3)
output = (z, z.clone())
The reason the other one (output = (z, z)) doesn't work, as you've correctly discovered, is that no copy is made: both entries of the tuple hold a reference to the same tensor z.
Alternatively, assign them in a single statement but create a separate tensor for each, as below:
h, c = torch.zeros(2, 3), torch.zeros(2, 3)
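A quick sanity check (a sketch) that clone() really breaks the aliasing:

import torch

z = torch.zeros(2, 3)
h, c = z, z.clone()
h += torch.ones(2, 3)                 # in-place add on h (which is z)
print(c)                              # still all zeros; c is independent
print(h.data_ptr() == c.data_ptr())   # False: the tensors have separate storage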

scikit learn LDA giving unexpected results

I am attempting to classify some data with the scikit-learn LDA classifier. I'm not entirely sure what to "expect" from it, but what I am getting is weird. It seems like a good opportunity to learn about either a shortcoming of the technique, or a way in which I am applying it wrong. I understand that no line could completely separate this data, but it seems that there are much "better" lines than the one it is finding. I'm just using the default options.
Any thoughts on how to do this better? I'm using LDA because it is linear in the size of my dataset, although I think a linear SVM has similar complexity; perhaps it would be better suited for such data? I will update when I have tested other possibilities.
The picture: (light blue is what my LDA classifier predicts will be dark blue)
The code:
import numpy as np
from numpy import array
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import itertools
X = array([[ 0.23125754, 0.79170351],
[ 0.78021491, -0.24999486],
[ 0.00856446, 0.41452734],
[ 0.66381753, -0.09872504],
[-0.03178685, 0.04876317],
[ 0.65574645, -0.68214948],
[ 0.14290684, 0.38256002],
[ 0.05156987, 0.11094875],
[ 0.06843403, 0.19110019],
[ 0.24070898, -0.07403764],
[ 0.03184353, 0.4411446 ],
[ 0.58708124, -0.38838008],
[-0.00700369, 0.07540799],
[-0.01907816, 0.07641038],
[ 0.30778608, 0.30317186],
[ 0.55774143, -0.38017325],
[-0.00957214, -0.03303287],
[ 0.8410637 , 0.158594 ],
[-0.00294113, -0.00380608],
[ 0.26577841, 0.07833684],
[-0.32249375, 0.49290502],
[ 0.11313078, 0.35697211],
[ 0.41153679, -0.4471876 ],
[-0.00313315, 0.30065913],
[ 0.14344143, -0.19127107],
[ 0.04857767, 0.01339191],
[ 0.5865007 , 0.71209886],
[ 0.08157439, 0.40909955],
[ 0.72495202, 0.29583866],
[-0.09391461, 0.17976605],
[ 0.06149141, 0.79323099],
[ 0.52208024, -0.2877661 ],
[ 0.01992141, -0.00435266],
[ 0.68492617, -0.46981335],
[-0.00641231, 0.29699622],
[ 0.2369677 , 0.140319 ],
[ 0.6602586 , 0.11200433],
[ 0.25311836, -0.03085372],
[-0.0895014 , 0.45147252],
[-0.18485667, 0.43744524],
[ 0.94636701, 0.16534406],
[ 0.01887734, -0.07702135],
[ 0.91586801, 0.17693792],
[-0.18834833, 0.31944796],
[ 0.20468328, 0.07099982],
[-0.15506378, 0.94527383],
[-0.14560083, 0.72027034],
[-0.31037647, 0.81962815],
[ 0.01719756, -0.01802322],
[-0.08495304, 0.28148978],
[ 0.01487427, 0.07632112],
[ 0.65414479, 0.17391618],
[ 0.00626276, 0.01200355],
[ 0.43328095, -0.34016614],
[ 0.05728525, -0.05233956],
[ 0.61218382, 0.20922571],
[-0.69803697, 2.16018536],
[ 1.38616732, -1.86041621],
[-1.21724616, 2.72682759],
[-1.26584365, 1.80585403],
[ 1.67900048, -2.36561699],
[ 1.35537903, -1.60023078],
[-0.77289615, 2.67040114],
[ 1.62928969, -1.20851808],
[-0.95174264, 2.51515935],
[-1.61953649, 2.34420531],
[ 1.38580104, -1.9908369 ],
[ 1.53224512, -1.96537012]])
y = array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1.])
classifier = LDA()
classifier.fit(X,y)
xx = np.array(list(itertools.product(np.linspace(-4,4,300), np.linspace(-4,4,300))))
yy = classifier.predict(xx)
b_colors = ['salmon' if yyy==0 else 'deepskyblue' for yyy in yy]
p_colors = ['r' if yyy==0 else 'b' for yyy in y]
plt.scatter(xx[:,0],xx[:,1],s=1,marker='o',edgecolor=b_colors,c=b_colors)
plt.scatter(X[:,0], X[:,1], marker='o', s=5, c=p_colors, edgecolor=p_colors)
plt.show()
UPDATE: Changing from using sklearn.discriminant_analysis.LinearDiscriminantAnalysis to sklearn.svm.LinearSVC also using the default options gives the following picture:
I think using the zero-one loss instead of the hinge loss would help, but sklearn.svm.LinearSVC doesn't seem to allow custom loss functions.
UPDATE: The loss function of sklearn.svm.LinearSVC approaches the zero-one loss as the parameter C goes to infinity. Setting C = 1000 gives me what I was originally hoping for. Not posting this as an answer, because the original question was about LDA.
(picture: LinearSVC decision regions with C = 1000)
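For reference, the C = 1000 update amounts to something like this (a sketch; same X and y as in the code above):

from sklearn.svm import LinearSVC

classifier = LinearSVC(C=1000)   # large C approximates a hard margin
classifier.fit(X, y)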
LDA models each class as a Gaussian, so the model for each class is determined by the class's estimated mean vector and covariance matrix.
Judging by eye alone, your blue and red classes have approximately the same mean and the same covariance, which means the two Gaussians will 'sit' on top of each other and the discrimination will be poor. It also means that the separator (the blue-pink border) will be noisy; that is, it will change a lot between random samples of your data.
By the way, your data is clearly not linearly separable, so any linear model will have a hard time discriminating it.
If you must use a linear model, try using LDA with 3 classes, such that the top-left blue blob is classified as '0', the bottom-right blue blob as '1', and the red as '2'. This way you will get a much better linear model. You can do it by preprocessing the blue class with a clustering algorithm with K=2 clusters, as in the sketch below.
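A minimal sketch of that preprocessing, assuming the X, y, and grid xx from the question (class 0 is the blue class to be split):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

blue = X[y == 0]
km = KMeans(n_clusters=2, random_state=0).fit(blue)   # split blue into 2 blobs

y3 = np.empty_like(y)
y3[y == 0] = km.labels_        # blue blobs become classes 0 and 1
y3[y == 1] = 2                 # red becomes class 2

classifier = LDA()
classifier.fit(X, y3)
# Collapse back to the original two classes at prediction time:
pred_is_blue = classifier.predict(xx) < 2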

Scikit-learn cross val score: too many indices for array

I have the following code
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_validation import cross_val_score
#split the dataset for train and test
combnum['is_train'] = np.random.uniform(0, 1, len(combnum)) <= .75
train, test = combnum[combnum['is_train']==True], combnum[combnum['is_train']==False]
et = ExtraTreesClassifier(n_estimators=200, max_depth=None,
                          min_samples_split=10, random_state=0)
labels = train[list(label_columns)].values
tlabels = test[list(label_columns)].values
features = train[list(columns)].values
tfeatures = test[list(columns)].values
et_score = cross_val_score(et, features, labels, n_jobs=-1)
print("{0} -> ET: {1})".format(label_columns, et_score))
Checking the shape of the arrays:
features.shape
Out[19]:(43069, 34)
And
labels.shape
Out[20]:(43069, 1)
and I'm getting:
IndexError: too many indices for array
and this relevant part of the traceback:
---> 22 et_score = cross_val_score(et, features, labels, n_jobs=-1)
I'm creating the data from Pandas dataframes. I searched here and saw some references to possible errors with this method, but I can't figure out how to correct it.
What the data arrays look like:
features
Out[21]:
array([[ 0.,  1.,  1., ...,  0.,  0.,  1.],
       [ 0.,  1.,  1., ...,  0.,  0.,  1.],
       [ 1.,  1.,  1., ...,  0.,  0.,  1.],
       ...,
       [ 0.,  0.,  1., ...,  0.,  0.,  1.],
       [ 0.,  0.,  1., ...,  0.,  0.,  1.],
       [ 0.,  0.,  1., ...,  0.,  0.,  1.]])
labels
Out[22]:
array([[1],
       [1],
       [1],
       ...,
       [1],
       [1],
       [1]])
When we do cross-validation in scikit-learn, the process requires labels of shape (R,) instead of (R, 1). Although they are the same thing to some extent, their indexing mechanisms are different. So in your case, just add:
c, r = labels.shape
labels = labels.reshape(c,)
before passing it to the cross-validation function.
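Equivalently (a minor note), numpy's ravel() or a column slice does the same flattening:

labels = labels.ravel()          # flattens (R, 1) to (R,)
# or, equivalently: labels = labels[:, 0]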
It seems to be fixable if you specify the target labels as a single data column from Pandas. If the target has multiple columns, I get a similar error. For example try:
labels = train['Y']
Adding .ravel() to the Y/Labels variable passed into the formula helped solve this problem within KNN as well.
Try making the target a Series:
y = df['Survived']
Instead, I had used
y = df[['Survived']]
which made the target y a DataFrame; it seems a Series is what is expected.
You might need to play with the dimensions of labels a bit, e.g.
et_score = cross_val_score(et, features, labels[:, n], n_jobs=-1)
n being the column index (0 here, since labels has shape (R, 1)).

How to use feature hasher to convert non-numerical discrete data so that it can be passed to SVM?

I am trying to use the CRX dataset from the UCI Machine Learning repository. This particular dataset contains some features which are not continuous variables. Therefore I need to convert them into numerical values before they can be passed to an SVM.
I initially looked into using the one-hot encoder, which takes integer values and converts them into matrices (e.g. if a feature has three possible values, 'red', 'blue' and 'green', it would be converted into three binary features: (1, 0, 0) for 'red', (0, 1, 0) for 'blue' and (0, 0, 1) for 'green'). This would be ideal for my needs, except that it can only deal with integer features.
import csv
import numpy as np
from sklearn.feature_extraction import FeatureHasher

def get_crx_data(debug=False):
    with open("/Volumes/LocalDataHD/jt306/crx.data", "rU") as infile:
        features_array = []
        reader = csv.reader(infile, dialect=csv.excel_tab)
        for row in reader:
            features_array.append(str(row).translate(None, "[]'").split(","))

    features_array = np.array(features_array)
    print features_array.shape
    print features_array[0]

    labels_array = features_array[:, 15]
    features_array = features_array[:, :15]

    print features_array.shape
    print labels_array.shape
    print("FeatureHasher on frequency dicts")

    hasher = FeatureHasher(n_features=44)
    X = hasher.fit_transform(line for line in features_array)
    print X.shape

get_crx_data()
This returns
Reading CRX data from disk
Traceback (most recent call last):
  File "/Volumes/LocalDataHD/PycharmProjects/FeatureSelectionPython278/Crx2.py", line 38, in <module>
    get_crx_data()
  File "/Volumes/LocalDataHD/PycharmProjects/FeatureSelectionPython278/Crx2.py", line 32, in get_crx_data
    X = hasher.fit_transform(line for line in features_array)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/base.py", line 426, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 129, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)
  File "_hashing.pyx", line 44, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:1649)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 125, in <genexpr>
    raw_X = (_iteritems(d) for d in raw_X)
  File "/Volumes/LocalDataHD/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/hashing.py", line 15, in _iteritems
    return d.iteritems() if hasattr(d, "iteritems") else d.items()
AttributeError: 'numpy.ndarray' object has no attribute 'items'
(690, 16)
['0' ' 30.83' ' 0' ' u' ' g' ' w' ' v' ' 1.25' ' 1' ' 1' ' 1' ' 0' ' g'
' 202' ' 0' ' +']
(690, 15)
(690,)
FeatureHasher on frequency dicts
Process finished with exit code 1
How can I use feature hashing (or an alternative method) to convert this data from classes (some of which are strings, others discrete numerical values) into data which can be handled by an SVM? I have also looked into using one-hot encoding, but that only takes integers as input.
The issue is that the FeatureHasher object expects each row of input to have a particular structure -- or really, one of three different possible structures. The first possibility is a dictionary of feature_name:value pairs. The second is a list of (feature_name, value) tuples. And the third is a flat list of feature_names. In the first two cases, the feature names are mapped to columns in the matrix, and given values are stored at those columns for each row. In the last, the presence or absence of a feature in the list is implicitly understood as a True or False value. Here are some simple, concrete examples:
>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
... non_negative=True,
... input_type='dict')
>>> X_new = hasher.fit_transform([{'a':1, 'b':2}, {'a':0, 'c':5}])
>>> X_new.toarray()
array([[ 1.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.]])
This illustrates the default mode -- what the FeatureHasher will expect if you don't pass input_type, as in your original code. As you can see, the expected input is a list of dictionaries, one for each input sample or row of data. Each dictionary contains an arbitrary number of feature names, mapped to values for that row.
The output, X_new, contains a sparse representation of the array; calling toarray() returns a new copy of the data as a vanilla numpy array.
If you want to pass (feature_name, value) tuples instead, pass input_type='pair'. Then you can do this:
>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
... non_negative=True,
... input_type='pair')
>>> X_new = hasher.fit_transform([[('a', 1), ('b', 2)], [('a', 0), ('c', 5)]])
>>> X_new.toarray()
array([[ 1.,  2.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  5.,  0.,  0.]])
And finally, if you just have boolean values, you don't have to pass values explicitly at all -- the FeatureHasher will simply assume that if a feature name is present, then its value is True (represented here as the floating point value 1.0).
>>> hasher = sklearn.feature_extraction.FeatureHasher(n_features=10,
... non_negative=True,
... input_type='string')
>>> X_new = hasher.fit_transform([['a', 'b'], ['a', 'c']])
>>> X_new.toarray()
array([[ 1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.]])
Unfortunately, your data doesn't seem to be consistently in any one of these formats. However, it shouldn't be too hard to modify what you have to fit the 'dict' or 'pair' format; see the sketch below. Let me know if you need help with that; in that case, please say more about the format of the data you're trying to convert.
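For instance, here is one hedged sketch of a conversion to the 'pair' format, treating every field as a categorical string. The CRX columns are unnamed, so the 'colN' naming scheme is a hypothetical choice; adapt it (and any numeric handling) to the real schema.

# Hypothetical helper: turn each raw CRX row into (feature_name, value) pairs
# by encoding "colN=value" names with a value of 1, as with categorical data.
from sklearn.feature_extraction import FeatureHasher

def row_to_pairs(row):
    return [('col%d=%s' % (i, v.strip()), 1) for i, v in enumerate(row)]

hasher = FeatureHasher(n_features=44, input_type='pair')
X = hasher.transform(row_to_pairs(row) for row in features_array)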
