Normalization sklearn - python

Let's say I have a pandas DataFrame and I want to normalize only some of its columns, not the whole DataFrame, with the help of this function:
preprocessing.normalize
I also want to write the normalized columns back into my DataFrame, but I can't, because the result has a different format (a NumPy array).
I have already seen how to do the normalization in other ways; for example, I did it like this:
s0 = X.iloc[:,13:15]
X.iloc[:,13:15] = (s0 - s0.mean()) / (s0.max() - s0.min())
X.head()
But I really need to do it using sklearn.
Thanks, Stack!

What you are doing is close to min-max scaling. "normalize" in scikit-learn has a different meaning than what you want: it rescales each sample (row) to unit norm.
Try MinMaxScaler.
Most sklearn transformers return NumPy arrays only. With a DataFrame, you can simply assign the result back to the selected columns, as in the example below:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['A', 'B', 'C'])
Now let's say you only want to min-max scale columns A and C:
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
df[['A', 'C']] = minmax.fit_transform(df[['A', 'C']])
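To make the difference between normalize and MinMaxScaler concrete, here is a minimal sketch on a small made-up frame (the values and the names row_normed, col_scaled and out are only illustrative). It assumes the default behaviour of normalize, which works row by row, while MinMaxScaler works column by column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, normalize

# Small illustrative frame (made-up values)
df = pd.DataFrame({"A": [1.0, 4.0, 7.0], "B": [2.0, 5.0, 8.0], "C": [3.0, 6.0, 9.0]})
cols = ["A", "C"]

# normalize() rescales every ROW of the selection to unit L2 norm
row_normed = normalize(df[cols])

# MinMaxScaler rescales every COLUMN of the selection to the range [0, 1]
col_scaled = MinMaxScaler().fit_transform(df[cols])

# Write the column-wise result back; column B is left untouched
out = df.copy()
out[cols] = col_scaled
print(out)
```

If the goal is per-column scaling of selected attributes, the second variant is usually the one that matches the intent.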

(s0 - s0.mean()) / (s0.max() - s0.min()) is called Mean normalization and as far as I am aware, there is no transformer in Scikit-learn to carry out this transformation.
The MinMaxScaler transforms following this formula: (s0 - s0.min()) / (s0.max() - s0.min())
You can do this transformation on selected variables with scikit-learn as follows:
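Quick-and-dirty way:
scaler = MinMaxScaler() # or any other scaler from sklearn
X_transf = X.copy() # work on a copy so the original data frame stays intact
scaler.fit(X[[var1, var2, var20]]) # var1, var2, var20 hold the column names
X_transf[[var1, var2, var20]] = scaler.transform(X[[var1, var2, var20]])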
A better way, using the ColumnTransformer:
features_numerical = [var1, var2, var20]
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[('numerical', numeric_transformer, features_numerical)],
    remainder='passthrough') # to keep all other features in the data set
preprocessor.fit_transform(X)
The returned variable is a numpy array, so needs re-casting into pandas dataframe and addition of variable names.
More information on how to use column transformer from sklearn here.
You need to import the ColumnTransformer and the Pipeline from sklearn, as well as the scaler of choice.
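The re-casting step mentioned above can be sketched as follows. This is only an illustration on a made-up frame; it relies on the fact that ColumnTransformer puts the transformed columns first and the passthrough columns after them, so the names are rebuilt in that order and then restored to the original order. The names X_out and passthrough_cols are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up data: scale "a" and "c", keep "b" as it is
X = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10, 20, 30], "c": [5.0, 7.0, 9.0]})
features_numerical = ["a", "c"]

preprocessor = ColumnTransformer(
    transformers=[("numerical", StandardScaler(), features_numerical)],
    remainder="passthrough",
)
arr = preprocessor.fit_transform(X)  # NumPy array: scaled columns first, then the rest

# Rebuild the column names in the order the transformer produced them
passthrough_cols = [c for c in X.columns if c not in features_numerical]
X_out = pd.DataFrame(arr, columns=features_numerical + passthrough_cols, index=X.index)

# Restore the original column order
X_out = X_out[X.columns]
print(X_out)
```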

Related

Feature selection and categorical variables

I work on a dataset which contains mainly binary variables. However, two of them are categorical with multiple values (strings). I want to apply feature selection using lasso, but I get an error: Keyerror: could not convert string to float:
Should I use LabelEncoder and then do the feature selection? Any ideas how to deal with this?
Here is my code
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
selector = SelectFromModel(estimator=LassoCV(cv=5)).fit(X_scaled, y)
selector.get_support()
It is problematic to use one-hot encoding here, because each category is coded as its own binary column, and feeding those into lasso doesn't allow selection of the categorical variable as a whole, which I guess is what you are after. You can also check out this post.
You can use the group lasso implementation in Python. Below I use an example dataset:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from group_lasso import GroupLasso
from group_lasso.utils import extract_ohe_groups
import scipy.sparse
data = pd.DataFrame({'cat1':np.random.choice(['A','B','C'],100),
'cat2':np.random.choice(['D','E','F'],100),
'bin1':np.random.choice([0,1],100),
'bin2':np.random.choice([0,1],100)})
data['y'] = 1.5*data['bin1'] + -3*data['bin2'] + 2*(data['cat1'] == 'A').astype('int') + np.random.normal(0,1,100)
Define the categorical and numeric (binary) columns. You don't need the min max scaler since your values are binary. Next we onehot encode the categorical columns and extract the groups out:
cat_columns = ['cat1','cat2']
num_columns = ['bin1','bin2']
ohe = OneHotEncoder()
onehot_data = ohe.fit_transform(data[cat_columns])
groups = extract_ohe_groups(ohe)
Put the numeric and one-hot data together. You can also convert them to dense, but that can be problematic if the data is huge:
X = scipy.sparse.hstack([onehot_data,scipy.sparse.csr_matrix(data[num_columns])])
y = data['y']
Likewise, construct the groups:
groups = np.hstack([groups,len(cat_columns) + np.arange(len(num_columns))+1])
groups
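To show what such a group vector looks like, here is a minimal sketch that builds one by hand with NumPy and scikit-learn only. It does not call group_lasso, and the exact numbering returned by extract_ohe_groups may differ; the point is only that every one-hot column of a categorical variable shares one group id, while each binary column gets its own. The tiny frame is made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame with two categorical and two binary columns
data = pd.DataFrame({
    "cat1": ["A", "B", "C", "A"],
    "cat2": ["D", "E", "F", "D"],
    "bin1": [0, 1, 0, 1],
    "bin2": [1, 1, 0, 0],
})
cat_columns = ["cat1", "cat2"]
num_columns = ["bin1", "bin2"]

ohe = OneHotEncoder().fit(data[cat_columns])

# One group id per categorical variable, repeated once per category level
cat_groups = np.concatenate(
    [np.full(len(levels), g) for g, levels in enumerate(ohe.categories_)]
)
# Each numeric (binary) column forms a group of its own
num_groups = len(cat_columns) + np.arange(len(num_columns))

groups = np.concatenate([cat_groups, num_groups])
print(groups)  # [0 0 0 1 1 1 2 3]
```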
Run the group lasso:
grpLasso = GroupLasso(groups=groups,supress_warning=True,n_iter=1000)
grpLasso.fit(X, y) # the attributes below are read from the fitted model
grpLasso.sparsity_mask_
array([ True, True, True, False, False, False, True, True])
grpLasso.chosen_groups_
{0, 3, 4}
Check out also the help page for using it in a pipeline.

How to transform some columns only with SimpleImputer or equivalent

I am taking my first steps with scikit library and found myself in need of backfilling only some columns in my data frame.
I have read carefully the documentation but I still cannot figure out how to achieve this.
To make this more specific, let's say I have:
A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]
And that I would like to fill in the second column with the mean but not the third. How can I do this with SimpleImputer (or another helper class)?
An evolution from this, and the natural follow up questions is: how can I fill the second column with the mean and the last column with a constant (only for cells that had no values to begin with, obviously)?
There is no need to use the SimpleImputer.
DataFrame.fillna() can do the work as well
For the second column, use
column.fillna(column.mean(), inplace=True)
For the third column, use
column.fillna(constant, inplace=True)
Of course, you will need to replace column with your DataFrame's column you want to change and constant with your desired constant.
Edit
Since the use of inplace is discouraged and will be deprecated, the syntax should be
column = column.fillna(column.mean())
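Applied to the example from the question, this could look like the sketch below. The column names a, b, c and the constant 29 are assumptions made for illustration; fillna accepts a dict that maps column names to fill values, so both columns can be handled in one call without inplace.

```python
import numpy as np
import pandas as pd

A = [[7, 2, 3], [4, np.nan, 6], [10, 5, np.nan]]
df = pd.DataFrame(A, columns=["a", "b", "c"])  # column names are illustrative

# Mean for column "b", a constant for column "c"; other columns are untouched
filled = df.fillna({"b": df["b"].mean(), "c": 29})
print(filled)
```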
Following Dan's advice, an example of using ColumnTransformer and SimpleImputer to backfill the columns is:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
A = [[7,2,3],[4,np.nan,6],[10,5,np.nan]]
column_trans = ColumnTransformer(
[('imp_col1', SimpleImputer(strategy='mean'), [1]),
('imp_col2', SimpleImputer(strategy='constant', fill_value=29), [2])],
remainder='passthrough')
print(column_trans.fit_transform(A)[:, [2,0,1]])
# [[7 2.0 3]
# [4 3.5 6]
# [10 5.0 29]]
This approach helps with constructing pipelines which are more suitable for larger applications.
This is the method I use; you can replace low_cardinality_cols with the columns you want to encode. It also works for every column if you set the threshold above the largest number of unique values in any column (df.nunique().max()).
# check the cardinality of the columns to encode
low_cardinality_cols = [cname for cname in df.columns if df[cname].nunique() < 16 and
                        df[cname].dtype == "object"]
Why these columns? It is recommended to encode only columns with a cardinality near 10.
# Replace NaN first, otherwise you will get stuck
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent') # feel free to use another strategy
df[low_cardinality_cols] = imp.fit_transform(df[low_cardinality_cols])
# Apply label encoder
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in low_cardinality_cols:
    df[col] = label_encoder.fit_transform(df[col])
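As a possible variation on the imputation and encoding steps above: LabelEncoder is intended for target labels, while OrdinalEncoder handles several feature columns in one call. The sketch below chains it after a SimpleImputer on a made-up frame; the column names and values are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical columns with missing values
df = pd.DataFrame(
    {"color": ["red", "blue", np.nan, "red"], "size": ["S", "M", "M", np.nan]},
    dtype=object,
)
cols = ["color", "size"]

# Fill NaN with the most frequent value, then map each category to an integer code
encoder = make_pipeline(SimpleImputer(strategy="most_frequent"), OrdinalEncoder())
encoded = pd.DataFrame(encoder.fit_transform(df[cols]), columns=cols, index=df.index)
print(encoded)
```

The codes follow the sorted order of the categories in each column, so they carry no meaning beyond identity.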
I am assuming you have your data as a pandas DataFrame.
In this case, all you need to do to use the SimpleImputer from scikit-learn is to pick the specific column whose NaNs you want to impute (say, using the 'most_frequent' value), convert it to a NumPy array and reshape it into a column vector.
An example of this is:
# Impute the missing values using the 'most_frequent' strategy
# We are using the California housing dataset in this example
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
housing = pd.read_csv('housing.csv')
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# SimpleImputer expects 2-D input, so convert the pandas Series to a column vector
housing['total_bedrooms'] = imp.fit_transform(housing['total_bedrooms'].to_numpy().reshape(-1,1)).ravel()
Similarly, you can pick any column in your dataset convert into a NumPy array, reshape it and use the SimpleImputer
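A slightly shorter variant of the same idea, sketched on a small made-up frame instead of housing.csv: selecting the column with double brackets already gives a 2-D, one-column DataFrame, so no manual reshape is needed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-in for the housing data
housing = pd.DataFrame({"total_bedrooms": [2.0, np.nan, 3.0, 2.0], "rooms": [5, 6, 7, 8]})

imp = SimpleImputer(strategy="most_frequent")
# Double brackets keep the selection 2-D, which is what SimpleImputer expects
housing[["total_bedrooms"]] = imp.fit_transform(housing[["total_bedrooms"]])
print(housing)
```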

How can I get the feature names from sklearn TruncatedSVD object?

I have the following code
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
dates = pd.date_range('2020-01-01', periods=1000) # assumed; dates is not defined in the original snippet
df = pd.DataFrame(np.random.randn(1000, 25), index=dates, columns=list('ABCDEFGHIJKLMOPQRSTUVWXYZ'))
def reduce(dim):
    svd = TruncatedSVD(n_components=dim, n_iter=7, random_state=42)
    return svd.fit(df)
fitted = reduce(5)
How do I get the column names from fitted?
Continuing from Mikhail's post.
Assume that you already have feature_names from vectorizer.get_feature_names() (get_feature_names_out() in recent scikit-learn versions) and that you have then called svd.fit(X).
Now you can also extract the best feature names, sorted, using the following code:
best_features = [feature_names[i] for i in svd.components_[0].argsort()[::-1]]
The code above takes the indices that sort svd.components_[0] in descending order, looks up the corresponding entries in feature_names (all of the features), and builds the best_features list.
Then you can see for example the 10 best features:
In[21]: best_features[:10]
Out[21]:
['manag',
'develop',
'busi',
'solut',
'initi',
'enterprise',
'project',
'program',
'process',
'plan']
The column names of fitted would be the SVD dimensions.
Each dimension is a linear combination of the input features. To understand what a particular dimension means, take a look at the svd.components_ array: it contains the matrix of coefficients that the input features are multiplied by.
Your original example, slightly changed:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
feature_names = list('ABCDEF')
df = pd.DataFrame(
    np.random.randn(1000, len(feature_names)),
    columns=feature_names
)
def reduce(dim):
    svd = TruncatedSVD(n_components=dim, n_iter=7, random_state=42)
    return svd.fit(df)
svd = reduce(3)
Then you can do something like this to get a more readable SVD dimension name. Let's compute it for the 0th dimension:
" ".join([
    "%+0.3f*%s" % (coef, feat)
    for coef, feat in zip(svd.components_[0], feature_names)
])
It shows +0.170*A -0.564*B -0.118*C +0.367*D +0.528*E +0.475*F - this is a "feature name" you can use for a 0th SVD dimension in this case (of course, coefficients depend on data, so feature name also depends on data).
If you have many input dimensions you may trade some "precision" with inspectability, e.g. sort coefficients and use only a few top of them. A more elaborate example can be found in https://github.com/TeamHG-Memex/eli5/pull/208 (disclaimer: I'm one of eli5 maintainers; pull request is not by me).
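The "only a few top coefficients" idea could be sketched as below. This is an illustration on random data, not the eli5 implementation: the helper top_features is hypothetical, and it ranks by absolute coefficient so that strongly negative weights also count.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
feature_names = list("ABCDEF")
df = pd.DataFrame(rng.standard_normal((200, len(feature_names))), columns=feature_names)

svd = TruncatedSVD(n_components=3, random_state=42).fit(df)

def top_features(component, names, k=2):
    """Return the k feature names with the largest absolute coefficient."""
    order = np.argsort(np.abs(component))[::-1][:k]
    return [names[i] for i in order]

# One short, readable label per SVD dimension
labels = {f"svd_{i}": top_features(comp, feature_names)
          for i, comp in enumerate(svd.components_)}
print(labels)
```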

kNN Classifier - importance of DataFrame column order - is this a scikit bug, pandas bug or by design?

We have this scikit-learn code:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
df = pd.DataFrame({'Category':['X','X','X','X','X','X','Y','Y','Y','Y','Y']
,'Age':[10,20,30,35,32,33,27,70,40,50,60]
,'Weight':[15,16,21,33,7,8,9,11,31,38,25]
,'Exercise':[2,0,0,1,7,6,9,11,2,0,5]})
classifier_3NN = KNeighborsClassifier(n_neighbors=3, metric='minkowski')
train_df = df[['Age','Weight','Exercise']]
target_ss = df['Category']
classifier_3NN.fit(train_df, target_ss)
test_df = pd.DataFrame({'Age':[11,27,39]
,'Weight':[21,9,36]
,'Exercise':[7,6,0]})
Intuitively, we would expect to be able to feed the test data into the classifier with its DataFrame columns in any order, with the algorithm taking account of the column headers, but we are getting the following:
In [21]: classifier_3NN.predict(test_df[['Age','Weight','Exercise']])
Out[21]: array(['X', 'X', 'Y'], dtype=object)
When I swap the ordering:
In [22]: classifier_3NN.predict(test_df[['Exercise','Weight', 'Age']])
Out[22]: array(['X', 'X', 'X'], dtype=object)
Is this by design or a bug? If it's a bug then where is the bug happening - which package? If it is by design then where is it documented?
I don't think there is a bug, but I agree it could be better documented. You have to provide the DataFrame columns in the same order as during fitting.
As scikit-learn was built with NumPy in mind, the DataFrame is converted to a 2-D NumPy array (this also happens during fit), and the order of the headers is not used to align columns.
The array is checked and converted before proceeding with the algorithm, through check_array, which, if there are no problems with the dtypes, basically returns numpy.array(thedataframe).
This happens in the utils.validation module.
It would work fine if you called .fit() again with the column order changed in the same way.
sklearn classifiers did not keep track of headers at all when this was written; since version 1.0, estimators fitted on a DataFrame record the names in feature_names_in_, but they still do not reorder columns for you.
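A simple way to guard against this is to put the test columns into the training order yourself before predicting. The sketch below reuses the data from the question; the names shuffled and aligned are only illustrative.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.DataFrame({
    "Category": ["X", "X", "X", "X", "X", "X", "Y", "Y", "Y", "Y", "Y"],
    "Age": [10, 20, 30, 35, 32, 33, 27, 70, 40, 50, 60],
    "Weight": [15, 16, 21, 33, 7, 8, 9, 11, 31, 38, 25],
    "Exercise": [2, 0, 0, 1, 7, 6, 9, 11, 2, 0, 5],
})
train_df = df[["Age", "Weight", "Exercise"]]
clf = KNeighborsClassifier(n_neighbors=3).fit(train_df, df["Category"])

test_df = pd.DataFrame({"Age": [11, 27, 39], "Weight": [21, 9, 36], "Exercise": [7, 6, 0]})
shuffled = test_df[["Exercise", "Weight", "Age"]]  # columns in a different order

# Select the columns in the order that was used for fitting
aligned = shuffled[train_df.columns]
pred = clf.predict(aligned)
print(pred)  # ['X' 'X' 'Y']
```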

How to find the features names of the coefficients using scikit linear regression?

I use scikit-learn linear regression, and if I change the order of the features, the coefficients are still printed in the same order; hence I would like to know the mapping between features and coefficients.
#training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coefficients
print(model_1.coef_)
print(model_2.coef_)
print(model_3.coef_)
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_,model_1_features):
    coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())
Robin posted a great answer, but I had to make one tweak to get it to work the way I wanted: refer to the dimension of the coef_ array that I wanted, namely model_1.coef_[0,:], as below (this applies when coef_ is 2-D, which happens when the target was passed as a 2-D array):
coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:],model_1_features):
    coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
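Whether the extra [0,:] is needed depends on the shape of the target passed to fit. A minimal sketch on made-up data (the names m1, m2 and the exact relationship y = 2*x0 - 3*x1 + 1 are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data with an exact linear relationship
X = pd.DataFrame({"x0": [0.0, 1.0, 2.0, 3.0, 4.0], "x1": [1.0, 0.0, 2.0, 1.0, 3.0]})
y = 2 * X["x0"] - 3 * X["x1"] + 1

m1 = LinearRegression().fit(X, y)             # 1-D target -> coef_ has shape (2,)
m2 = LinearRegression().fit(X, y.to_frame())  # 2-D target -> coef_ has shape (1, 2)
print(m1.coef_.shape, m2.coef_.shape)

# np.ravel flattens both cases for a single-target model, so the zip works either way
coef_dict = dict(zip(X.columns, np.ravel(m2.coef_)))
print(coef_dict)
```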
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip assumes its inputs are of equal length, making it especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, the longer input will have the values in its extra index positions cut off. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
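One way to avoid the silent truncation is to check the lengths before zipping. The helper below is a hypothetical sketch, and the coefficient values are made up for illustration:

```python
def coef_mapping(feature_names, coefs):
    """Map feature names to coefficients, refusing inputs of different length."""
    feature_names = list(feature_names)
    coefs = list(coefs)
    if len(feature_names) != len(coefs):
        raise ValueError(
            f"{len(feature_names)} feature names but {len(coefs)} coefficients"
        )
    return dict(zip(feature_names, coefs))

# Made-up values, for illustration only
mapping = coef_mapping(["sqft_living", "bathrooms"], [310.5, -1200.0])
print(mapping)
```

On Python 3.10 and later, zip(feature_names, coefs, strict=True) raises a ValueError in the same situation.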
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great, but what personally worked for me was this, as the feature names I needed were the columns of my train_data DataFrame:
pd.DataFrame(data=model_1.coef_,columns=train_data.columns)
Right after training the model, the coefficient values are stored in the variable model.coef_[0]. We can iterate over the column names and store the column name and their coefficient value in a dictionary.
model.fit(X_train,y)
# assuming all the columns except the last one are used in training
columns = data.iloc[:,:-1].columns
coef_dict = {}
for i in range(0,len(columns)):
    coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
    data=model_1.coef_,
    index=model_1.feature_names_in_
)
A minimally reproducible example
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
    data=np.random.random((10, 3)),
    columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coefficients and their respective feature names
coef_series = pd.Series(
    data=lr.coef_,
    index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64
