I am using Sklearn Decision Tree for some classification and I have two types of data: categorical and continuous. I used pd.get_dummies for my categorical values and ended up with over 90 features. Which is, of course, quite a lot.
The thing is that I then iterate over max_features parameter to get the best score for my model, and having more than 20 features is too time-consuming. So I thought that Sklearn could use sparse matricies for my categorical features, instead of 70 columns with 0 and 1.
The question is: can Sklearn use a mix of sparse matricies and regular arrays or no? If yes - how do I do that? Currently I get error: setting an array element with a sequence
Here is some code to get the idea. df_with_dummies is what I currently use, but I hope there is a way to use df_with_sparse
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
a = np.random.randn(10,3)
b = np.random.random((10,1))
df = pd.DataFrame(a, columns = "A B C".split())
df['temp'] = b
df['dum1'] = np.where(df.temp < 0.5, 1, 0)
df['dum2'] = np.where(df.temp >= 0.5, 1, 0)
del df['temp']
df_with_dummies = df.copy()
a = df[['dum1', 'dum2']]
dums = csr_matrix(a)
df['dums'] = dums
df_with_sparse = df.copy()
When you do:
df['dums'] = dums
dums being a sparse matrix is not correctly handled by the pandas DataFrame and it will be broadcasted to each row. pandas does not complain about it, because it thinks of sparse matrix as an non-array object.
That means that each element in the df['dums'] column will point to the whole sparse matrix dums. So essentially, each array element is being set with an array hence the error setting an array element with a sequence when it is being processed in scikit-learn estimators.
For that you can do:
from scipy.sparse import hstack
df_with_sparse = hstack([df[['A', 'B', 'C']].values, dums])
Now you can pass this further.
Related
I want to apply MinmaxScaler on a number of pandas DataFrame 'together'. Meaning that I want the scaler to perform on all data in those columns, not separately on each column.
My DataFrame has 20 columns. I want to apply the scaler on 12 of the columns at the same time. I have already read this. But it does not solve my problem since it acts on each column separately.
IIUC, you want the sklearn scaler to fit and transform multiple columns with the same criteria (in this case min and max definitions). Here is one way you can do this -
You can save the initial shape of the columns and then transform the numpy array of those columns into a 1D array from a 2D array.
Next you can fit your scaler and transform this 1D array
Finally you can use the old shape to reshape the array back into the n columns you need and save them
The advantage of this approach is that this works with any of the sklearn scalers you need to use, MinMaxScaler, StandardScaler etc.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
'B':[103.02,107.26,110.35,114.23,114.68],
'C':['big','small','big','small','small']})
cols = ['A','B']
old_shape = dfTest[cols].shape #(5,2)
dfTest[cols] = scaler.fit_transform(dfTest[cols].to_numpy().reshape(-1,1)).reshape(old_shape)
print(dfTest)
A B C
0 0.000000 0.884188 big
1 0.756853 0.926301 small
2 0.764303 0.956992 big
3 0.817143 0.995530 small
4 0.766885 1.000000 small
you can extract the "min" and "max" statistics from those columns and perform the scaling yourself:
# columns of interest
cols = [...]
# get the minimum and maximum values in that region
vals = df[cols].to_numpy()
min_val = vals.min()
max_val = vals.max()
# scale the region using them
df[cols] = df[cols].sub(min_val).div(max_val - min_val)
(sub is method way of doing "-" and div is for "/".)
Above, df is your training dataframe; to scale the testing dataframe, you replace df with that in the last line, e.g.,
test_df[cols] = test_df[cols].sub(min_val).div(max_val - min_val)
instead of extracting min/max of it separately which would leak information from the test set.
I have a simple piece of code given below which normalize array in terms of row.
import numpy as np
from sklearn import preprocessing
X = np.asarray([[-1,2,1],
[4,1,2]], dtype=np.float)
X_normalized = preprocessing.normalize(X, norm='l2')
Can you please help me to convert X-normalized to X again?
You cannot recover X from nothing more than the normalized version. Consider the trivial case of several data sets, each with 2 different elements:
[3, 4]
[-18, 20]
[0, 0.0001]
Each of these normalizes to the same data set:
[-1, 1]
The mapping is not a bijection: it's a many-to-one. Thus, it's not uniquely invertable.
However, you can recover the original set with a couple of simple techniques:
Keep the original data set intact (yes, that easy).
Store the normalization parameters: mean and standard deviation (or its square, the variance). This gives you the linear equation that transforms each original element into a normalized element; it's trivial to invert that equation.
All the scalers in https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing have inverse_transform method designed just for that.
For example, to scale and un-scale your DataFrame with MinMaxScaler you could do:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
unscaled = scaler.inverse_transform(scaled)
Just bear in mind that the transform function (and fit_transform as well) return a numpy.array, and not a pandas.Dataframe.
[Refrence][1]
[1]: https://stackoverflow.com/questions/43382716/how-can-i-cleanly-normalize-data-and-then-unnormalize-it-later/43383700
Considering data like:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
dt = 'object, i4, i4'
d = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=dt)
I want to exclude the text column using the OHE functionality.
Why does the following not work?
ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool))
ohe.fit(d)
ValueError: could not convert string to float: 'bbb'
It says in the documentation:
categorical_features: “all” or array of indices or mask :
Specify what features are treated as categorical.
‘all’ (default): All features are treated as categorical.
array of indices: Array of categorical feature indices.
mask: Array of length n_features and with dtype=bool.
I'm using a mask, yet it still tries to convert to float.
Even using
ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool),
dtype=dt)
ohe.fit(d)
Same error.
And also in the case of "array of indices":
ohe = OneHotEncoder(categorical_features=np.array([1, 2]), dtype=dt)
ohe.fit(d)
You should understand that all estimators in Scikit-Learn were designed only for numerical inputs. Thus from this point of view there is no sense to leave text column in this form. You have to transform that text column in something numerical, or get rid of it.
If you obtained your dataset from Pandas DataFrame - you can take a look at this small wrapper: https://github.com/paulgb/sklearn-pandas. It will help you to transform all needed columns simultaneously (or leave some of rows in numerical form)
import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({'text':['aaa', 'bbb'], 'number_1':[1, 1], 'number_2':[2, 2]})
# number_1 number_2 text
# 0 1 2 aaa
# 1 1 2 bbb
# SomeEncoder here must be any encoder which will help you to get
# numerical representation from text column
mapper = DataFrameMapper([
('text', SomeEncoder),
(['number_1', 'number_2'], OneHotEncoder())
])
mapper.fit_transform(data)
I think there's some confusion here. You still need to enter the numerical values, but within the encoder you can specify which values are categorical which are not.
The input to this transformer should be a matrix of integers, denoting
the values taken on by categorical (discrete) features.
So in the example below I change aaa to 5 and bbb to 6. This way it will distinguish from the 1 and 2 numerical values:
d = np.array([[5, 1, 1], [6, 2, 2]])
ohe = OneHotEncoder(categorical_features=np.array([True,False,False], dtype=bool))
ohe.fit(d)
Now you can check your feature categories:
ohe.active_features_
Out[22]: array([5, 6], dtype=int64)
I encountered the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).
This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
I agree that the documentation of sklearn.preprocessing.OneHotEncoder is rather misleading in that regard.
How do I create a sparse matrix in CSR/COO format for a huge feature vector (50000 x 100000) from categorical data stored in Pandas DataFrame? I am creating the feature vector using Pandas get_dummies() function, but it returns a MemoryError. How do I avoid that and rather generate it in a sparse matrix CSR format?
Possibly useful links:
Populate a Pandas SparseDataFrame from a SciPy Sparse Matrix
http://pandas.pydata.org/pandas-docs/stable/sparse.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.SparseSeries.to_coo.html#pandas.SparseSeries.to_coo
Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory
Use:
scipy.sparse.coo_matrix(df_dummies)
but do not forget to create df_dummies sparse in the first place...
df_dummies = pandas.get_dummies(df, sparse=True)
This answer will keep the data as sparse as possible and avoids memory issues when using Pandas get_dummies.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from scipy import sparse
df = pd.DataFrame({'rowid':[1,2,3,4,5], 'category':['c1', 'c2', 'c1', 'c3', 'c1']})
print 'Input data frame\n{0}'.format(df)
print 'Encode column category as numerical variables'
print LabelEncoder().fit_transform(df.category)
print 'Encode column category as dummy matrix'
print OneHotEncoder().fit_transform(LabelEncoder().fit_transform(df.category).reshape(-1,1)).todense()
print 'Concat with the original data frame as a matrix'
dummy_matrix = OneHotEncoder().fit_transform(LabelEncoder().fit_transform(df.category).reshape(-1,1))
df_as_sparse = sparse.csr_matrix(df.drop(labels=['category'], axis=1).as_matrix())
sparse_combined = sparse.hstack((df_as_sparse, dummy_matrix), format='csr')
print sparse_combined.todense()
I want to calculate the standard deviation for values below and above the average of a matrix of n_par parameters and n_sample samples. The fastest way I found so far is:
stdleft = numpy.zeros_like(mean)
for jpar in xrange(mean.shape[1]):
stdleft[jpar] = p[p[:,jpar] < \
mean[jpar],jpar].std()
where p is a matrix like (n_samples,n_par). Is there a smarter way to do it without the for loop? I have roughly n_par = 200 and n_samples = 1e8 and therefore these three lines take ages to be performed.
Any idea would be really helpfull!
Thank you
As I understand it, you want to calculate the standard deviation of each column where the values are below the mean for that column.
In numpy, it's easiest to use masked arrays for this.
As an example:
import numpy as np
# 10 samples, 3 columns
p = np.random.random((10, 3))
# Calculate the mean of each column
colmeans = p.mean(axis=0)
# Make a boolean array where our condition is True
mask = p < colmeans
# Find the standard deviation of values in each column below the column's mean.
# For masked arrays, the True values will be masked, so we'll invert the array.
stdleft = np.ma.masked_where(~mask, p).std(axis=0)
You can also use pandas for this as #SudeepJuvekar mentioned. The performance should be broadly similar, but pandas should be a bit faster for this particular operation (untested).
Pandas is your friend. Convert your matrix in pandas Dataframe and index the Dataframe logically. Something like this
mat = pandas.DataFrame(p)
This creates a DataFrame from original numpy matrix p. Then we compute the column means for the DataFrame.
m = mat.mean()
Creates n_par sized array of all column means of mat. Finally, index the mat matrix using < logical operation and apply std to that.
stdleft = mat[mat < m].std()
Similarly for stdright. Take a couple of minutes to compute on my machine.
Here's the doc page for pandas: http://pandas.pydata.org/
Edit: Edited using the comment below. You can do almost similar indexing using the original p.
m = p.mean(axis=0)
logical = p < m
logical contains a boolean matrix of same size as p. This is where pandas comes handy. You can directly index a pandas matrix using logical of same size. Doing so in numpy is slightly hard. I guess looping is the best way to achieve it?
for i in range(len(p)):
stdleft[i] = p[logical[:, i], i].std()